
Share Nextflow pipelines with Github


The GitHub code repository and collaboration platform is widely used among researchers to publish their work and to collaborate on project source code.

Even more interestingly, a few months ago GitHub announced improved support for researchers, making it possible to get a Digital Object Identifier (DOI) for any GitHub repository archive.

With a DOI for your GitHub repository archive, your code becomes formally citable in scientific publications.

Why use GitHub with Nextflow?

The latest Nextflow release (0.9.0) seamlessly integrates with GitHub. This feature allows you to manage your code in a more consistent manner, or to use other people's Nextflow pipelines, published through GitHub, in a quick and transparent way.

How it works

The idea is very simple: when you launch a script execution with Nextflow, it looks for a file with the pipeline name you've specified. If that file does not exist, it looks for a public repository with the same name on GitHub. If one is found, the repository is automatically downloaded to your computer and the code is executed. The repository is stored in the Nextflow home directory ($HOME/.nextflow by default), so it will be reused for any subsequent execution.

You can try this feature out, provided you have Nextflow (version 0.9.0 or higher) installed on your computer, by simply entering the following command in your shell terminal:

nextflow run nextflow-io/hello 

The first time you execute this command, Nextflow will download the pipeline from the GitHub repository https://github.com/nextflow-io/hello, since you don't already have it on your computer. It will then execute it, producing the expected output.

In order for a GitHub repository to be used as a Nextflow project, it must contain at least one file named main.nf that defines your Nextflow pipeline script.
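As a purely illustrative sketch (this is not the actual content of the nextflow-io/hello repository), a main.nf script can be as small as a single process:

// main.nf -- a minimal, illustrative pipeline script
process sayHello {
    // print the task's standard output to the terminal
    echo true

    """
    echo 'Hello world!'
    """
}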

Run a specific revision

Any Git branch, tag or commit ID in the GitHub repository can be used to specify the revision you want to execute when running your pipeline, by adding the -r option to the run command line. So for example you could enter:

nextflow run nextflow-io/hello -r mybranch   

or

nextflow run nextflow-io/hello -r v1.1

This can be very useful when comparing different versions of your project. It also guarantees consistent results in your pipeline as your source code evolves.

Commands to manage pipelines

The following commands allow you to perform some basic operations to manage your pipelines. However, Nextflow is not meant to replace the functionality provided by the Git tool; you may still need it to create new repositories, commit changes, etc.

List available pipelines

The ls command allows you to list all the pipelines you have downloaded to your computer. For example:

nextflow ls

This prints a list similar to the following one:

cbcrg/piper-nf
nextflow-io/hello

Show pipeline information

By using the info command you can show information about a downloaded pipeline. For example:

$ nextflow info hello

This command prints:

 repo name  : nextflow-io/hello
 home page  : http://github.com/nextflow-io/hello
 local path : $HOME/.nextflow/assets/nextflow-io/hello
 main script: main.nf
 revisions  : 
 * master (default)
   mybranch
   v1.1 [t]
   v1.2 [t]   

Starting from the top it shows: 1) the repository name; 2) the project home page; 3) the local folder where the pipeline has been downloaded; 4) the script that is executed when launched; 5) the list of available revisions i.e. branches + tags. Tags are marked with a [t] on the right, the current checked-out revision is marked with a * on the left.

Pull or update a pipeline

The pull command allows you to download a pipeline from a GitHub repository or to update it if that repository has already been downloaded. For example:

nextflow pull nextflow-io/examples

Downloaded pipelines are stored in the folder $HOME/.nextflow/assets on your computer.

Clone a pipeline into a folder

The clone command allows you to copy a Nextflow pipeline project to a directory of your choice. For example:

nextflow clone nextflow-io/hello target-dir 

If the destination directory is omitted the specified pipeline is cloned to a directory with the same name as the pipeline base name (e.g. hello) in the current folder.

The clone command can be used to inspect or modify the source code of a pipeline. You can then commit and push back your changes by using the usual Git/GitHub workflow.
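For example, assuming you have write access to the repository, a typical cycle after cloning might look like this (the commit message is only a placeholder):

cd target-dir
# edit main.nf or other project files, then:
git add .
git commit -m "Describe your change here"
git push origin master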

Drop an installed pipeline

Downloaded pipelines can be deleted by using the drop command, as shown below:

nextflow drop nextflow-io/hello

Limitations and known problems

  • GitHub private repositories were initially not supported; support for private GitHub repositories was introduced in version 0.10.0.
  • Symlinks committed in a Git repository were not resolved correctly when downloaded/cloned by Nextflow; symlinks are resolved correctly when using Nextflow version 0.11.0 (or higher).

Reproducibility in Science - Nextflow meets Docker


The scientific world nowadays operates on the basis of published articles. These are used to report novel discoveries to the rest of the scientific community.

But have you ever wondered what a scientific article is? It is a:

  1. defeasible argument for claims, supported by
  2. exhibited, reproducible data and methods, and
  3. explicit references to other work in that domain;
  4. described using domain-agreed technical terminology,
  5. which exists within a complex ecosystem of technologies, people and activities.

Hence the very essence of Science relies on the ability of scientists to reproduce and build upon each other’s published results.

So how much can we rely on published data? In a recent report in Nature, researchers at the Amgen corporation found that only 11% of the academic research in the literature was reproducible by their groups [1].

While many factors are likely at play here, perhaps the most basic requirement for reproducibility holds that the materials reported in a study can be uniquely identified and obtained, such that experiments can be reproduced as faithfully as possible. This information is meant to be documented in the "materials and methods" of journal articles, but as many can attest, the information provided there is often not adequate for this task.

Promoting Computational Research Reproducibility

Encouragingly, scientific reproducibility has been at the forefront of many news stories and there exist numerous initiatives to help address this problem. In particular, when it comes to producing reproducible computational analyses, some publications are starting to publish the code and data used for analysing and generating figures.

For example, many articles in Nature and in the new Elife journal (and others) provide a "source data" download link next to figures. Sometimes Elife might even have an option to download the source code for figures.

As pointed out by Melissa Gymrek in a recent post this is a great start, but there are still lots of problems. She wrote that, for example, if one wants to re-execute a data analysis from these papers, one has to download the scripts and the data, only to realize that not all the required libraries are installed, that the code only runs on, say, an Ubuntu version one doesn't have, or that some paths are hard-coded to match the authors' machine.

If it's not easy to run and doesn't work out of the box, the chances that a researcher will ever actually run most of these scripts are close to zero, especially if they lack the time or expertise to install the required third-party libraries and tools, or to implement state-of-the-art data processing algorithms from scratch.

Here comes Docker

Docker container technology is a solution to many of these computational research reproducibility problems. Basically, it is a kind of lightweight virtual machine in which you can set up a computing environment, including all the libraries, code and data that you need, within a single image.

This image can be distributed publicly and can seamlessly run on any major Linux operating system. No need for the user to mess with installation, paths, etc.

They just run the Docker image you provided, and everything is set up to work out of the box. Researchers have already started discussing this (e.g. here, and here).

Docker and Nextflow: a perfect match

One big advantage Docker has over traditional machine virtualisation technology is that it doesn't need a complete copy of the operating system, thus it has a minimal startup time. This makes it possible to virtualise single applications or launch the execution of multiple containers that can run in parallel, in order to speed up a large computation.

Nextflow is a data-driven toolkit for computational pipelines, which aims to simplify the deployment of distributed and highly parallelised pipelines for scientific applications.

The latest version integrates support for Docker containers, which enables the deployment of self-contained and truly reproducible pipelines.

How they work together

A Nextflow pipeline is made up by putting together several processes. Each process can be written in any scripting language that can be executed by the Linux platform (BASH, Perl, Ruby, Python, etc.). Parallelisation is automatically managed by the framework and it is implicitly defined by the processes' input and output declarations.
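As a minimal illustration (the file pattern and process below are made up), the following process is executed once for each file emitted by its input channel, and Nextflow runs those executions in parallel whenever possible:

// every FASTA file matching the pattern becomes one parallel task
sequences = Channel.fromPath('data/*.fa')

process countEntries {
    input:
    file 'seq.fa' from sequences

    output:
    stdout into counts

    """
    grep -c '>' seq.fa
    """
}

// print the number of entries counted by each task
counts.subscribe { println it }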

By integrating Docker with Nextflow, every pipeline process can be executed independently in its own container; this guarantees that each of them runs in a predictable manner without worrying about the configuration of the target execution platform. Moreover, the minimal overhead added by Docker allows us to spawn multiple container executions in parallel with a negligible performance loss when compared to a platform-native execution.
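In practice this only requires declaring a container image in the nextflow.config file; a minimal sketch (the image name is just an example) looks like the following, after which Docker execution is switched on with the -with-docker command line option used later in this post:

// nextflow.config -- the image name below is only an example
process {
    // image in which every pipeline process is executed
    container = 'ubuntu:14.04'
}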

An example

As a proof of concept of the Docker integration with Nextflow you can try out the pipeline example at this link.

It splits a protein sequences multi FASTA file into chunks of n entries, executes a BLAST query for each of them, then extracts the top 10 matching sequences and finally aligns the results with the T-Coffee multiple sequence aligner.

In a common scenario you would generally need to install and configure the tools required by this script: BLAST and T-Coffee. Moreover, you would need to provide a formatted protein database in order to execute the BLAST search.

By using Docker with Nextflow you only need to have the Docker engine and a Java VM installed on your computer. In order to try this example out, follow these steps:

Install the latest version of Nextflow by entering the following command in your shell terminal:

 curl -fsSL get.nextflow.io | bash

Then download the required Docker image with this command:

 docker pull nextflow/examples

You can check the content of the image looking at the Dockerfile used to create it.

Now you are ready to run the demo by launching the pipeline execution as shown below:

nextflow run examples/blast-parallel.nf -with-docker

This will run the pipeline, printing the final alignment out on the terminal screen. You can also provide your own protein sequences multi-FASTA file by adding the option --query <file> to the above command line, and change the splitting chunk size with the --chunk n option.
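For example, assuming you have a multi-FASTA file named my-proteins.fa in the current directory (both the file name and the chunk size below are arbitrary):

nextflow run examples/blast-parallel.nf -with-docker --query my-proteins.fa --chunk 200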

Note: the result doesn't have a real biological meaning since it uses a very small protein database.

Conclusion

The mix of Docker, GitHub and Nextflow technologies makes it possible to deploy self-contained and truly replicable pipelines. It requires zero configuration and enables the reproducibility of data analysis pipelines on any system in which a Java VM and the Docker engine are available.

Learn how to do it!

Follow our documentation for a quick start using Docker with Nextflow at the following link http://www.nextflow.io/docs/latest/docker.html

Using Docker for scientific data analysis in an HPC cluster


Scientific data analysis pipelines are rarely composed of a single piece of software. In a real world scenario, computational pipelines are made up of multiple stages, each of which can execute many different scripts, system commands and external tools deployed in a hosting computing environment, usually an HPC cluster.

As I work as a research engineer in a bioinformatics lab, I experience on a daily basis the difficulties of keeping such a collection of software consistent.

Computing environments can change frequently in order to test new pieces of software, or because system libraries need to be updated. For this reason replicating the results of a data analysis over time can be a challenging task.

Docker has emerged recently as a new type of virtualisation technology that allows one to create a self-contained runtime environment. There are plenty of examples showing the benefits of using it to run application services, like web servers or databases.

However it seems that few people have considered using Docker for the deployment of scientific data analysis pipelines on a distributed cluster of computers, in order to simplify the development, the deployment and the replicability of this kind of application.

For this reason I wanted to test the capabilities of Docker to solve these problems in the cluster available in our institute.

Method

The Docker engine has been installed on each node of our cluster, which runs a Univa Grid Engine resource manager. A Docker private registry instance has also been installed in our internal network, so that images can be pulled from the local repository much faster than from the public Docker registry.

Moreover the Univa grid engine has been configured with a custom complex resource type. This allows us to request a specific Docker image as a resource type while submitting a job execution to the cluster.

The Docker image is requested as a soft resource; by doing so the UGE scheduler tries to run a job on a node where that image has already been pulled. Otherwise a lower priority is given to the job and it is eventually executed by a node where the specified Docker image is not available, forcing that node to pull the required image from the local registry at the time of the job execution.
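To give an idea of the mechanism, a job submitted directly to UGE with a soft Docker image request might look roughly like the following; the complex name docker_images and the script name are site-specific and purely illustrative:

qsub -soft -l docker_images=nextflow/examples my_task.sh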

This environment has been tested with Piper-NF, a genomic pipeline for the detection and mapping of long non-coding RNAs.

The pipeline runs on top of Nextflow, which takes care of the tasks parallelisation and submits the jobs for execution to the Univa grid engine.

The Piper-NF code didn't need to be modified in order to run it using Docker; Nextflow handles this automatically. The Docker containers are run in such a way that the tasks' result files are created in the hosting file system; in other words, it behaves in a completely transparent manner without requiring extra steps or affecting the flow of the pipeline execution.

It was only necessary to specify the Docker image (or images) to be used in the Nextflow configuration file for the pipeline. You can read more about this at this link.
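As a sketch of what that configuration looks like (the image tag and the per-process selector below are illustrative, not the actual Piper-NF settings):

// nextflow.config -- illustrative only, not the actual Piper-NF configuration
process {
    // default image used by all processes
    container = 'cbcrg/piper-nf'

    // a hypothetical per-process override
    $blastSearch.container = 'my-org/blast-tools'
}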

Results

To benchmark the impact of Docker on the pipeline performance a comparison was made running it with and without Docker.

For this experiment 10 cluster nodes were used. The pipeline execution launches around 100 jobs, and it was run 5 times by using the same dataset with and without Docker.

The average execution time without Docker was 28.6 minutes, while the average pipeline execution time, running each job in a Docker container, was 32.2 minutes. Thus, by using Docker the overall execution time increased by something around 12.5%.

It is important to note that this time includes both the Docker bootstrap time, and the time overhead that is added to the task execution by the virtualisation layer.

For this reason the actual task run time was measured as well, i.e. without including the Docker bootstrap time overhead. In this case, the aggregate average task execution time was 57.3 minutes without Docker and 59.5 minutes when running the same tasks using Docker. Thus, the time overhead added by the Docker virtualisation layer to the effective task run time can be estimated at around 4% in our test.

Keeping the complete toolset required by the pipeline execution within a Docker image dramatically reduced configuration and deployment problems. Also, storing these images in private and public registries with a unique tag allowed us to replicate the results without the usual burden required to set up an identical computing environment.

Conclusion

The fast start-up time of Docker containers allows one to virtualise a single process or the execution of a set of applications, instead of a complete operating system. This opens up new possibilities, for example the possibility to "virtualise" distributed job executions in an HPC cluster of computers.

The minimal performance loss introduced by the Docker engine is offset by the advantages of running your analysis in a self-contained and dead easy to reproduce runtime environment, which guarantees the consistency of the results over time and across different computing platforms.

Credits

Thanks to Arnau Bria and all the scientific systems admin team for managing the Docker installation in the CRG computing cluster.

Introducing Nextflow REPL Console


The latest version of Nextflow introduces a new console graphical interface.

The Nextflow console is a REPL (read-eval-print loop) environment that allows one to quickly test part of a script or pieces of Nextflow code in an interactive manner.

It is a handy tool that allows one to evaluate fragments of Nextflow/Groovy code or fast prototype a complete pipeline script.

Getting started

The console application is included in the latest version of Nextflow (0.13.1 or higher).

You can try this feature out, provided you have Nextflow installed on your computer, by entering the following command in your shell terminal: nextflow console.

When you execute it for the first time, Nextflow will spend a few seconds downloading the required runtime dependencies. When complete the console window will appear as shown in the picture below.

Nextflow console

It contains a text editor (the top white box) that allows you to enter and modify code snippets. The results area (the bottom yellow box) will show the executed code's output.

At the top you will find the menu bar (not shown in this picture) and the actions toolbar that allows you to open, save, execute (etc.) the code being tested.

As a practical execution example, simply copy and paste the following piece of code in the console editor box:

echo true 

process sayHello {

 """
 echo Hello world
 """ 

}

Then, in order to evaluate it, open the Script menu in the top menu bar and select the Run command. Alternatively you can use the CTRL+R keyboard shortcut to run it (⌘+R on the Mac). In the result box an output similar to the following will appear:

[warm up] executor > local
[00/d78a0f] Submitted process > sayHello (1)
Hello world

Now you can try to modify the entered process script, execute it again and check that the printed result has changed.

If the output doesn't appear, open the View menu and make sure that the entry Capture Standard Output is selected (it must have a tick on the left).

It is worth noting that the global script context is maintained across script executions. This means that variables declared in the global script scope are not lost when the script run is complete, and they can be accessed in further executions of the same or another piece of code.
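For example (a trivial illustration), you can evaluate the two snippets below as two separate runs; the second still sees the variable defined by the first:

// first run
x = [1, 2, 3]

// a later, separate run -- x is still defined in the global context
println x.sum()     // prints 6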

In order to reset the global context you can use the command Clear Script Context available in the Script menu.

Conclusion

The Nextflow console is a REPL environment which allows you to experiment and get used to the Nextflow programming environment. By using it you can prototype or test your code without the need to create/edit script files.

Note: the Nextflow console is implemented by sub-classing the Groovy console tool. For this reason you may find some labels that refer to the Groovy programming environment in this program.

Innovation In Science - The story behind Nextflow


Innovation can be viewed as the application of solutions that meet new requirements or existing market needs. Academia has traditionally been the driving force of innovation. Scientific ideas have shaped the world, but only a few of them were brought to market by the inventing scientists themselves, resulting in both time and financial losses.

Lately there have been several attempts to boost scientific innovation and translation, the most notable in Europe being the Horizon 2020 funding program. The problem with these types of funding is that they are not designed for PhDs and Postdocs, but rather aim to promote collaboration between senior scientists in different institutions. This neglects two very important facts: first and foremost, most Nobel prizes were awarded for discoveries made when scientists were in their 20's / 30's (not in their 50's / 60's). Secondly, innovation really happens when a few individuals (not institutions) face a problem in their everyday life/work, and one day they just decide to do something about it (end-user innovation). Without realizing it, these people address a need that many others have. They don't do it for the money or the glory; they do it because it bothers them! Examples of companies that started exactly this way include Apple, Google, and Virgin Airlines.

The story of Nextflow

Similarly, Nextflow started as an attempt to solve the every-day computational problems we were facing with “big biomedical data” analyses. We wished that our huge and almost cryptic BASH-based pipelines could handle parallelization automatically. In our effort to make that happen we stumbled upon the Dataflow programming model and Nextflow was created. We were getting furious every time our two-week long pipelines were crashing and we had to re-execute them from the beginning. We, therefore, developed a caching system, which allows Nextflow to resume any pipeline from the last executed step. While we were really enjoying developing a new DSL and creating our own operators, at the same time we were not willing to give up our favorite PERL/Python scripts and one-liners, and thus Nextflow became a polyglot.

Another problem we were facing was that our pipelines were invoking a lot of third-party software, making distribution and execution on different platforms a nightmare. Once again, while searching for a solution to this problem, we were able to identify a breakthrough technology, Docker, which is now revolutionising cloud computation. Nextflow has been one of the first frameworks to fully support Docker containers and allow pipeline execution in an isolated and easy-to-distribute manner. Of course, sharing our pipelines with our friends rapidly became a necessity and so we had to make Nextflow smart enough to support GitHub and Bitbucket integration.

I don’t know if Nextflow will make as much difference in the world as the Dataflow programming model and Docker container technology are making, but it has already made a big difference in our lives and that is all we ever wanted…

Conclusion

Summarising, it is a pity that PhDs and Postdocs are the neglected engine of innovation. They are not empowered to innovate by identifying and addressing their needs, and to potentially set up commercial solutions to their problems. This fact becomes even sadder when you consider that only 3% of Postdocs have a chance to become PIs in the UK. Instead more and more money is being invested in senior scientists who only require their PhD students and Postdocs to put another step into a well-defined ladder. In today's world it seems that ideas such as Nextflow will only get funded for their scientific value, not as innovative concepts trying to address a need.

The impact of Docker containers on the performance of genomic pipelines


In a recent publication we assessed the impact of Docker container technology on the performance of bioinformatic tools and data analysis workflows.

We benchmarked three different data analyses: an RNA sequence pipeline for gene expression, a consensus assembly and variant calling pipeline, and finally a pipeline for the detection and mapping of long non-coding RNAs.

We found that Docker containers have only a minor impact on the performance of common genomic data analysis, which is negligible when the executed tasks are demanding in terms of computational time.

This publication is available as a PeerJ preprint at this link.

MPI-like distributed execution with Nextflow


The main goal of Nextflow is to make workflows portable across different computing platforms taking advantage of the parallelisation features provided by the underlying system without having to reimplement your application code.

From the beginning Nextflow has included executors designed to target the most popular resource managers and batch schedulers commonly used in HPC data centers, such as Univa Grid Engine, Platform LSF, SLURM, PBS and Torque.

When using one of these executors Nextflow submits the computational workflow tasks as independent job requests to the underlying platform scheduler, specifying for each of them the computing resources needed to carry out its job.

This approach works well for workflows that are composed of long running tasks, which is the case of most common genomic pipelines.

However this approach does not scale well for workloads made up of a large number of short-lived tasks (e.g. a few seconds or sub-seconds). In this scenario the resource manager scheduling time is much longer than the actual task execution time, so the overall pipeline execution time is dominated by scheduling overhead. In some cases this represents an unacceptable waste of computing resources.

Moreover, supercomputers such as MareNostrum at the Barcelona Supercomputing Center (BSC) are optimized for memory-distributed applications. In this context one needs to allocate a certain amount of computing resources in advance to run the application in a distributed manner, commonly using the MPI standard.

In this scenario, the Nextflow execution model was far from optimal, if not unfeasible.

Distributed execution

For this reason, since release 0.16.0, Nextflow has implemented a new distributed execution model that greatly improves the computation capability of the framework. It uses Apache Ignite, a lightweight clustering engine and in-memory data grid, which has recently been open sourced under the Apache Software Foundation umbrella.

When using this feature a Nextflow application is launched as if it were an MPI application. It uses a job wrapper that submits a single request specifying all the needed computing resources. The Nextflow command line is executed by using the mpirun utility, as shown in the example below:

#!/bin/bash
#$ -l virtual_free=120G
#$ -q <queue name>
#$ -N <job name>
#$ -pe ompi <nodes>
mpirun --pernode nextflow run <your-project-name> -with-mpi [pipeline parameters]

This tool spawns a Nextflow instance in each of the computing nodes allocated by the cluster manager.

Each Nextflow instance automatically connects with the other peers creating a private internal cluster, thanks to the Apache Ignite clustering feature that is embedded within Nextflow itself.

The first node becomes the application driver that manages the execution of the workflow application, submitting the tasks to the remaining nodes that act as workers.

When the application is complete, the Nextflow driver automatically shuts down the Nextflow/Ignite cluster and terminates the job execution.

Nextflow distributed execution

Conclusion

In this way it is possible to deploy a Nextflow workload on a supercomputer using an execution strategy that resembles the MPI distributed execution model. This doesn't require you to implement your application using the MPI API/library, and it allows you to keep your code portable across different execution platforms.

Although we do not currently have a performance comparison between a Nextflow distributed execution and an equivalent MPI application, we assume that the latter provides better performance due to its low-level optimisation.

Nextflow, however, focuses on the fast prototyping of scientific applications in a portable manner while maintaining the ability to scale and distribute the application workload in an efficient manner in an HPC cluster.

This allows researchers to quickly validate an experiment, reusing existing tools and software components. This eventually makes it possible to implement an optimised version using a low-level programming language in a second stage of the project.

Developing a bioinformatics pipeline across multiple environments


As a new bioinformatics student with little formal computer science training, there are few things that scare me more than PhD committee meetings and having to run my code in a completely different operating environment.

Recently my work landed me in the middle of the phylogenetic tree jungle and the computational requirements of my project far outgrew the resources that were available on our institute’s Univa Grid Engine based cluster. Luckily for me, an opportunity arose to participate in a joint program at the MareNostrum HPC at the Barcelona Supercomputing Centre (BSC).

As one of the top 100 supercomputers in the world, the MareNostrum III dwarfs our cluster and consists of nearly 50'000 processors. However it soon became apparent that with great power comes great responsibility and in the case of the BSC, great restrictions. These include no internet access, restrictive wall times for jobs, longer queues, fewer pre-installed binaries and an older version of bash. Faced with the possibility of having to rewrite my 16 bodged scripts for another queuing system I turned to Nextflow.

Straight off the bat I was able to reduce all my previous scripts to a single Nextflow script. Admittedly, the original code was not great, but the data processing model made me feel confident in what I was doing and I was able to reduce the volume of code to 25% of its initial amount whilst making huge improvements in the readability. The real benefits however came from the portability.

I was able to write the project on my laptop (MacBook Air), continuously test it on my local desktop machine (Linux) and then perform more realistic heavy-lifting runs on the cluster, all managed from a single GitHub repository. The BSC uses the Load Sharing Facility (LSF) platform with longer queue times, but a large number of CPUs. My project on the other hand had datasets that require over 100'000 tasks, but the task processes themselves run for a matter of seconds or minutes. We were able to marry these two competing interests by deploying Nextflow in a distributed execution manner that resembles that of an MPI application.

In this configuration, the queuing system allocates the Nextflow requested resources and using the embedded Apache Ignite clustering engine, Nextflow handles the submission of processes to the individual nodes.

Here are some examples of how to run the same Nextflow project on multiple platforms.

Local

If I wish to launch a job locally, I can run it with the command:

nextflow run myproject.nf

Univa Grid Engine (UGE)

For the UGE I simply needed to specify the following in the nextflow.config file:

process {
        executor='uge'
        queue='my_queue'
}  

And then launch the pipeline execution as we did before:

nextflow run myproject.nf     

Load Sharing Facility (LSF)

For running the same pipeline in the MareNostrum HPC environment, taking advantage of the MPI standard to deploy my workload, I first created a wrapper script (for example bsc-wrapper.sh) declaring the resources that I want to reserve for the pipeline execution:

#!/bin/bash
#BSUB -oo logs/output_%J.out
#BSUB -eo logs/output_%J.err
#BSUB -J myProject
#BSUB -q bsc_ls
#BSUB -W 2:00
#BSUB -x
#BSUB -n 512
#BSUB -R "span[ptile=16]"
export NXF_CLUSTER_SEED=$(shuf -i 0-16777216 -n 1)
mpirun --pernode bin/nextflow run concMSA.nf -with-mpi

And then execute it using bsub as shown below:

bsub < bsc-wrapper.sh

By running Nextflow in this way and given the wrapper above, a single bsub job will run on 512 cores across 32 computing nodes (512 / 16 = 32) with a maximum wall time of 2 hours. Thousands of Nextflow processes can be spawned during this time, and the execution can be monitored in the standard manner from a single set of Nextflow output and error files. If any errors occur, the execution can of course be continued with the -resume command line option.

Conclusion

Nextflow provides a simplified way to develop across multiple platforms and removes much of the overhead associated with running niche, user developed pipelines in an HPC environment.


Error recovery and automatic resource management with Nextflow


Recently a new feature has been added to Nextflow that allows failing jobs to be rescheduled, automatically increasing the amount of computational resources requested.

The problem

Nextflow provides a mechanism that allows tasks to be automatically re-executed when a command terminates with an error exit status. This is useful for handling errors caused by temporary or even permanent failures (e.g. network hiccups, broken disks, etc.) that may happen in a cloud based environment.

However in an HPC cluster these events are very rare. In this scenario error conditions are more likely to be caused by a job exceeding its originally requested resources at a peak of its computation. This leads to the batch scheduler killing the job, which in turn stops the overall pipeline execution.

In this context automatically re-executing the failed task is useless because it would simply replicate the same error condition. A common solution consists of increasing the resource request to the needs of the most demanding job, even though this results in a suboptimal allocation for the majority of jobs that are less resource hungry.

Moreover it is also difficult to predict such an upper limit. In most cases the only way to determine it is through a painful fail-and-retry approach.

Consider, for example, the following Nextflow process:

process align {
    executor 'sge' 
    memory 1.GB 
    errorStrategy 'retry' 

    input: 
    file 'seq.fa' from sequences 

    script:
    '''
    t_coffee -in seq.fa 
    '''
}

The above definition will execute as many jobs as there are FASTA files emitted by the sequences channel. Since the retry error strategy is specified, if the task returns a non-zero exit status, Nextflow will reschedule the job execution requesting the same amount of memory and disk storage. If the error is caused by t_coffee needing more than one GB of memory for a specific alignment, the task will continue to fail, stopping the pipeline execution as a consequence.

Increase job resources automatically

A better solution can be implemented with Nextflow which allows resources to be defined in a dynamic manner. By doing this it is possible to increase the memory request when rescheduling a failing task execution. For example:

process align { 
    executor 'sge'
    memory { 1.GB * task.attempt }
    errorStrategy 'retry' 

    input: 
    file 'seq.fa' from sequences 

    script:
    '''
    t_coffee -in seq.fa 
    '''
}

In the above example the memory requirement is defined by using a dynamic rule. The task.attempt attribute represents the current task attempt (1 the first time the task is executed, 2 the second and so on).

The task will then request one GB of memory. In case of an error it will be rescheduled requesting 2 GB and so on, until it is executed successfully or the limit of times a task can be retried is reached, forcing the termination of the pipeline.

It is also possible to define the errorStrategy directive in a dynamic manner. This is useful to re-execute failed jobs only if a certain condition is verified.

For example the Univa Grid Engine batch scheduler returns the exit status 140 when a job is terminated because it's using more resources than the ones requested.

By checking this exit status we can reschedule only the jobs that fail by exceeding the resources allocation. This can be done with the following directive declaration:

errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }

In this way a failed task is rescheduled only when it returns the 140 exit status. In all other cases the pipeline execution is terminated.
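Putting the two mechanisms together, a process can combine a dynamic memory directive, a dynamic error strategy and a retry limit. The sketch below is illustrative; the maxRetries value is arbitrary:

process align {
    executor 'sge'
    // request more memory at each new attempt
    memory { 1.GB * task.attempt }
    // retry only when the job was killed for exceeding its resources
    errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }
    // give up after a few attempts
    maxRetries 3

    input:
    file 'seq.fa' from sequences

    script:
    '''
    t_coffee -in seq.fa
    '''
}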

Conclusion

Nextflow provides a very flexible mechanism for defining the job resource request and handling error events. It makes it possible to automatically reschedule failing tasks under certain conditions and to define job resource requests in a dynamic manner so that they can be adapted to the actual job's needs and to optimize the overall resource utilisation.

Workflows & publishing: best practice for reproducibility


Publication time acts as a snapshot for scientific work. Whether a project is ongoing or not, work which was performed months ago must be described, new software documented, data collated and figures generated.

The monumental increase in data and pipeline complexity has led to this task being performed to many differing standards, or lack thereof. We all agree it is not good enough to simply note down the software version number. But what practical measures can be taken?

The recent publication describing Kallisto (Bray et al. 2016) provides an excellent high-profile example of the growing efforts to ensure reproducible science in computational biology. The authors provide a GitHub repository that "contains all the analysis to reproduce the results in the kallisto paper".

They should be applauded and indeed - in the Twittersphere - they were. The corresponding author Lior Pachter stated that the publication could be reproduced starting from raw reads in the NCBI Sequence Read Archive through to the results, which marks a fantastic accomplishment.

They achieve this utilising the workflow framework Snakemake. Increasingly, we are seeing scientists applying workflow frameworks to their pipelines, which is great to see. There is a learning curve, but I have personally found the payoffs in productivity to be immense.

As both users and developers of Nextflow, we have long discussed best practice to ensure reproducibility of our work. As a community, we are at the beginning of that conversation - there are still many ideas to be aired and details ironed out - nevertheless we wished to provide a state-of-play as we see it and to describe what is possible with Nextflow in this regard.

Guaranteed Reproducibility

This is our goal. It is one thing for a pipeline to be reproducible in your own hands, on your machine, yet it is another for this to be guaranteed so that anyone anywhere can reproduce it. What I mean by guaranteed is that when a given pipeline is executed, there is only one result which can be output. Envisage what I term the reproducibility triangle: consisting of data, code and compute environment.

Reproducibility Triangle

Figure 1: The Reproducibility Triangle. Data: raw data such as sequencing reads, genomes and annotations but also metadata such as experimental design. Code: scripts, binaries and libraries/dependencies. Environment: operating system.

If there is any change to one of these then the reproducibility is no longer guaranteed. For years there have been solutions to each of these individual components. But they have lived a somewhat discrete existence: data in databases such as the SRA and Ensembl, code on GitHub and compute environments in the form of virtual machines. We think that in the future science must embrace solutions that integrate each of these components natively and holistically.

Implementation

Nextflow provides a solution to reproducibility through version control and sandboxing.

Code

Version control is provided via native integration with GitHub and other popular code management platforms such as Bitbucket and GitLab. Pipelines can be pulled, executed, developed, collaborated on and shared. For example, the command below will pull a specific version of a simple Kallisto + Sleuth pipeline from GitHub and execute it. The -r parameter can be used to specify a specific tag, branch or revision that was previously defined in the Git repository.

nextflow run cbcrg/kallisto-nf -r v0.9

Environment

Sandboxing during both development and execution is another key concept; version control alone does not ensure that all dependencies nor the compute environment are the same.

A simplified implementation of this places all binaries, dependencies and libraries within the project repository. In Nextflow, any binaries within the bin directory of a repository are added to the path. Also, within the Nextflow config file, environmental variables such as PERL5LIB can be defined so that they are automatically added during the task executions.
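For example, a module directory shipped with the repository could be added to the Perl search path with an env scope definition in nextflow.config; the path below is illustrative:

// nextflow.config
env {
    PERL5LIB = "$baseDir/lib/perl5"   // made available to every task execution
}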

This can be taken a step further with containerisation such as Docker. We have recently published work about this: briefly, a Dockerfile containing the instructions on how to build the Docker image resides inside a repository. This provides a specification for the operating system, software, libraries and dependencies to be run.

The images themselves also have content-addressable identifiers in the form of digests, which ensure that not a single byte of information, from the operating system through to the libraries pulled from public repos, has been changed. This container digest can be specified in the pipeline config file.

process {
    container = "cbcrg/kallisto-nf@sha256:9f84012739..."
}

When doing so Nextflow automatically pulls the specified image from the Docker Hub and manages the execution of the pipeline tasks from within the container in a transparent manner, i.e. without having to adapt or modify your code.

Data

Data is currently one of the more challenging aspects to address. Small data can be easily version controlled within git-like repositories. For larger files Git Large File Storage, for which Nextflow provides built-in support, may be one solution. Ultimately though, the real home of scientific data is in publicly available, programmatically accessible databases.

Providing out-of-box solutions is difficult given the hugely varying nature of the data and meta-data within these databases. We are currently looking to incorporate the most highly used ones, such as the SRA and Ensembl. In the long term we have an eye on initiatives, such as NCBI BioProject, with the idea there is a single identifier for both the data and metadata that can be referenced in a workflow.

Adhering to the practices above, one could imagine one line of code which would appear within a publication.

nextflow run [user/repo] -r [version] --data[DB_reference:data_reference] -with-docker

The result would be guaranteed to be reproduced by whoever wished.

Conclusion

With this approach the reproducibility triangle is complete. But it must be noted that this does not guard against conceptual or implementation errors. It does not replace proper documentation. What it does is provide transparency to a result.

The assumption that the deterministic nature of computation makes results insusceptible to irreproducibility is clearly false. We consider Nextflow, with its other features such as its polyglot nature, out-of-the-box portability and native support across HPC and Cloud environments, to be an ideal solution in our everyday work. We hope to see more scientists adopt this approach in their workflows.

The recent efforts by the Kallisto authors highlight the appetite for increasing these standards and we encourage the community at large to move towards ensuring this becomes the normal state of affairs for publishing in science.

References

Bray, Nicolas L., Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal Probabilistic RNA-Seq Quantification.” Nature Biotechnology, April. Nature Publishing Group. doi:10.1038/nbt.3519.

Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) "The impact of Docker containers on the performance of genomic pipelines." PeerJ 3:e1273. doi:10.7717/peerj.1273

Garijo D, Kinnings S, Xie L, Xie L, Zhang Y, Bourne PE, et al. (2013) "Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome." PLoS ONE 8(11): e80278. doi:10.1371/journal.pone.0080278

Docker for dunces & Nextflow for nunces


Below is a step-by-step guide for creating Docker images for use with Nextflow pipelines. This post was inspired by recent experiences and written with the hope that it may encourage others to join in the virtualization revolution.

Modern science is built on collaboration. Recently I became involved with one such venture between several groups across Europe. The aim was to annotate long non-coding RNA (lncRNA) in farm animals and I agreed to help with the annotation based on RNA-Seq data. The basic procedure relies on mapping short read data from many different tissues to a genome, generating transcripts and then determining if they are likely to be lncRNA or protein coding genes.

During several successful 'hackathon' meetings the best approach was decided and implemented in a joint effort. I undertook the task of wrapping the procedure up into a Nextflow pipeline with a view to replicating the results across our different institutions and to allow the easy execution of the pipeline by researchers anywhere.

Creating the Nextflow pipeline (here) in itself was not a difficult task. My collaborators had documented their work well and were on hand if anything was not clear. However, installing and keeping aligned all the pipeline dependencies across the different data centers was still a challenging task.

The pipeline is typical of many in bioinformatics, consisting of binary executions, BASH scripting, R, Perl, BioPerl and some custom Perl modules. We found the BioPerl modules in particular were very sensitive to the various versions in the long dependency tree. The solution was to turn to Docker containers.

I have taken this opportunity to document the process of developing the Docker side of a Nextflow + Docker pipeline in a step-by-step manner.

Docker Installation

By far the most challenging issue is the installation of Docker. For local installations, the process is relatively straightforward. However, difficulties arise as computing moves to a cluster. Owing to security concerns, many HPC administrators have been reluctant to install Docker system-wide. This is changing, and Docker developers have been responding to many of these concerns with updates addressing these issues.

That being the case, local installations are usually perfectly fine for development. One of the golden rules in Nextflow development is to have a small test dataset that can run the full pipeline in minutes with few computational resources, i.e. one that can run on a laptop.

If you have Docker and Nextflow installed and you wish to view the working pipeline, you can perform the following commands to obtain everything you need and run the full lncrna annotation pipeline on a test dataset.

docker pull cbcrg/lncrna_annotation
nextflow run cbcrg/lncrna-annotation-nf -profile test

[If the above commands do not work, there could be a problem with your Docker installation.]

The first command will download the required Docker image to your computer, while the second will launch Nextflow, which automatically downloads the pipeline repository and runs it using the test data included with it.

The Dockerfile

The Dockerfile contains all the instructions required by Docker to build the Docker image. It provides a transparent and consistent way to specify the base operating system and installation of all software, libraries and modules.

We begin by creating a file Dockerfile in the Nextflow project directory. The Dockerfile begins with:

# Set the base image to debian jessie
FROM debian:jessie

# File Author / Maintainer
MAINTAINER Evan Floden <evanfloden@gmail.com>

This sets the base distribution for our Docker image to be Debian v8.4, a lightweight Linux distribution that is ideally suited for the task. We must also specify the maintainer of the Docker image.

Next we update the repository sources and install some essential tools such as wget and perl.

RUN apt-get update && apt-get install --yes --no-install-recommends \
    wget \
    locales \
    vim-tiny \
    git \
    cmake \
    build-essential \
    gcc-multilib \
    perl \
    python ...

Notice that we use the command RUN before each line. The RUN instruction executes commands as if they are performed from the Linux shell.

It is also good practice to group as many commands as possible in the same RUN statement. This reduces the size of the final Docker image. See here for these details and here for more best practices.

Next we can specify the installation of the required Perl modules using cpanminus:

# Install perl modules
RUN cpanm --force CPAN::Meta \
    YAML \
    Digest::SHA \
    Module::Build \
    Data::Stag \
    Config::Simple \
    Statistics::Lite ...

We can give the instructions to download and install software from GitHub using:

# Install Star Mapper
RUN wget -qO- https://github.com/alexdobin/STAR/archive/2.5.2a.tar.gz | tar -xz \ 
    && cd STAR-2.5.2a \
    && make STAR

We can add custom Perl modules and specify environmental variables such as PERL5LIB as below:

# Install FEELnc
RUN wget -q https://github.com/tderrien/FEELnc/archive/a6146996e06f8a206a0ae6fd59f8ca635c7d9467.zip \
    && unzip a6146996e06f8a206a0ae6fd59f8ca635c7d9467.zip \ 
    && mv FEELnc-a6146996e06f8a206a0ae6fd59f8ca635c7d9467 /FEELnc \
    && rm a6146996e06f8a206a0ae6fd59f8ca635c7d9467.zip

ENV FEELNCPATH /FEELnc
ENV PERL5LIB $PERL5LIB:${FEELNCPATH}/lib/

R and R libraries can be installed as follows:

# Install R
RUN echo "deb http://cran.rstudio.com/bin/linux/debian jessie-cran3/" >>  /etc/apt/sources.list &&\
apt-key adv --keyserver keys.gnupg.net --recv-key 381BA480 &&\
apt-get update --fix-missing && \
apt-get -y install r-base

# Install R libraries
RUN R -e 'install.packages("ROCR", repos="http://cloud.r-project.org/"); install.packages("randomForest",repos="http://cloud.r-project.org/")'

For the complete working Dockerfile of this project see here

Building the Docker Image

Once we start working on the Dockerfile, we can build it anytime using:

docker build -t skptic/lncRNA_annotation .

This builds the image from the Dockerfile and assigns a tag (i.e. a name) to the image. If there are no errors, the Docker image is now in your local Docker repository ready for use.

Testing the Docker Image

We find it very helpful to test our images as we develop the Docker file. Once built, it is possible to launch the Docker image and test if the desired software was correctly installed. For example, we can test if FEELnc and its dependencies were successfully installed by running the following:

docker run -ti lncrna_annotation

cd FEELnc/test

FEELnc_filter.pl -i transcript_chr38.gtf -a annotation_chr38.gtf \
> -b transcript_biotype=protein_coding > candidate_lncRNA.gtf

exit # remember to exit the Docker image

Tagging the Docker Image

Once you are confident your image is built correctly, you can tag it, allowing you to push it to Docker Hub. Docker Hub is an online repository for Docker images which allows anyone to pull public images and run them.

You can view the images in your local repository with the docker images command and tag using docker tag with the image ID and the name.

docker images

REPOSITORY                               TAG                 IMAGE ID            CREATED             SIZE
lncrna_annotation                        latest              d8ec49cbe3ed        2 minutes ago       821.5 MB

docker tag d8ec49cbe3ed cbcrg/lncrna_annotation:latest

Now when we check our local images we can see the updated tag.

docker images

REPOSITORY                               TAG                 IMAGE ID            CREATED             SIZE
cbcrg/lncrna_annotation                 latest              d8ec49cbe3ed        2 minutes ago       821.5 MB

Pushing the Docker Image to Dockerhub

If you have not previously, sign up for a Dockerhub account here. From the command line, login to Dockerhub and push your image.

docker login --username=cbcrg
docker push cbcrg/lncrna_annotation

You can test if your image has been correctly pushed and is publicly available by removing your local version using the IMAGE ID of the image and then pulling the remote one:

docker rmi -f d8ec49cbe3ed

# Ensure the local version is not listed.
docker images

docker pull cbcrg/lncrna_annotation

We are now almost ready to run our pipeline. The last step is to set up the Nextflow config.

Nextflow Configuration

Within the nextflow.config file in the main project directory we can add configuration that links the Docker image to the Nextflow execution. The image can be:

  • General (same docker image for all processes):

    process {    
        container = 'cbcrg/lncrna_annotation'
    }
    
  • Specific to a profile (specified by -profile crg for example):

    profile {
        crg {
            container = 'cbcrg/lncrna_annotation'
        }
    }
    
  • Specific to a given process within a pipeline:

    $processName.container = 'cbcrg/lncrna_annotation'
    

In most cases it is easiest to use the same Docker image for all processes. One further thing to consider is the inclusion of the sha256 hash of the image in the container reference. I have previously written about this, but briefly, including a hash ensures that not a single byte of the operating system or software is different.

    process {    
        container = 'cbcrg/lncrna_annotation@sha256:9dfe233b...'
    }

All that is left now is to run the pipeline.

nextflow run lncRNA-Annotation-nf -profile test

Whilst I have explained this step-by-step process in a linear, sequential manner, in reality the development process is often more circular, with changes in the Docker images reflecting changes in the pipeline.

CircleCI and Nextflow

Now that you have a pipeline that successfully runs on a test dataset with Docker, a very useful step is to add a continuous integration component to the pipeline. With this, whenever you push a modification of the pipeline to the GitHub repo, the test dataset is run on the CircleCI servers (using Docker).

To include CircleCI in the Nextflow pipeline, create a file named circle.yml in the project directory. We add the following instructions to the file:

machine:
    java:
        version: oraclejdk8
    services:
        - docker

dependencies:
    override:

test:
    override:
        - docker pull cbcrg/lncrna_annotation
        - curl -fsSL get.nextflow.io | bash
        - ./nextflow run . -profile test

Next you can sign up to CircleCI, linking your GitHub account.

Within the GitHub README.md you can add a badge with the following:

![CircleCI status](https://circleci.com/gh/cbcrg/lncRNA-Annotation-nf.png?style=shield)

Tips and Tricks

File permissions: When a process is executed by a Docker container, the UNIX user running the process is not you. Therefore any files that are used as an input should have the appropriate file permissions. For example, I had to change the permissions of all the input data in the test data set with:

find -type f -exec chmod 644 {} \;
find -type d -exec chmod 755 {} \;

Summary

This was my first time building a Docker image and, after a bit of trial and error, the process was surprisingly straightforward. There is a wealth of information available for Docker and the almost seamless integration with Nextflow is fantastic. Our collaboration team is now looking forward to applying the pipeline to different datasets and publishing the work, knowing our results will be completely reproducible across any platform.

Deploy your computational pipelines in the cloud at the snap-of-a-finger


Learn how to deploy and run a computational pipeline in the Amazon AWS cloud with ease thanks to Nextflow and Docker containers

Nextflow is a framework that simplifies the writing of parallel and distributed computational pipelines in a portable and reproducible manner across different computing platforms, from a laptop to a cluster of computers.

Indeed, the original idea, when this project started three years ago, was to implement a tool that would allow researchers in our lab to smoothly migrate their data analysis applications in the cloud when needed - without having to change or adapt their code.

However, to date Nextflow has been used mostly to deploy computational workflows within on-premise computing clusters or HPC data centres, because these infrastructures are easier to use and provide, on average, lower costs and better performance when compared to a cloud environment.

A major obstacle to efficient deployment of scientific workflows in the cloud is the lack of a performant POSIX-compatible shared file system. These kinds of applications are usually made up of a collection of tools, scripts and system commands that need a reliable file system in which to share input and output files as they are produced, above all in a distributed cluster of computers.

The recent availability of the Amazon Elastic File System (EFS), a fully featured NFS-based file system hosted on the AWS infrastructure, represents a major step in this context, unlocking the deployment of scientific computing in the cloud and taking it to the next level.

Nextflow support for the cloud

Nextflow could already be deployed in the cloud, either using tools such as ElastiCluster or CfnCluster, or by using custom deployment scripts. However the procedure was still cumbersome and, above all, it was not optimised to fully take advantage of cloud elasticity, i.e. the ability to (re)shape the computing cluster dynamically as the computing needs change over time.

For these reasons, we decided it was time to provide Nextflow with first-class support for the cloud, integrating the Amazon EFS and implementing an optimised native cloud scheduler, based on Apache Ignite, with full support for cluster auto-scaling and spot/preemptible instances.

In practice this means that Nextflow can now spin up and configure a fully featured computing cluster in the cloud with a single command; after that you only need to log in to the master node and launch the pipeline execution as you would do in your on-premise cluster.

Demo!

Since a demo is worth a thousand words, I've recorded a short screencast showing how Nextflow can set up a cluster in the cloud and mount the Amazon EFS shared file system.

Note: the EC2 instance startup delay has been cut from this screencast. It took around 5 minutes to launch the instances and set up the cluster.

Let's recap the steps shown in the demo:

  • The user provides the cloud parameters (such as the VM image ID and the instance type) in the nextflow.config file (see the configuration sketch after this list).

  • To configure the EFS file system you need to provide your EFS storage ID and the mount path by using the sharedStorageId and sharedStorageMount properties.

  • To use EC2 Spot instances, just specify the price you want to bid by using the spotPrice property.

  • The AWS access and secret keys are provided by using the usual environment variables.

  • The nextflow cloud create command launches the requested number of instances, configures the user and access key, mounts the EFS storage and sets up the Nextflow cluster automatically. Any Linux AMI can be used; it is only required that the cloud-init package, a Java 7+ runtime and the Docker engine are present.

  • When the cluster is ready, you can SSH into the master node and launch the pipeline execution as usual with the nextflow run <pipeline name> command.

  • For the sake of this demo we are using paraMSA, a pipeline for generating multiple sequence alignments and bootstrap replicates developed in our lab.

  • Nextflow automatically pulls the pipeline code from its GitHub repository when the execution is launched. This repository also includes a dataset which is used by default.
    The many bioinformatic tools used by the pipeline are packaged in a Docker image, which is downloaded automatically on each computing node.

  • The pipeline results are uploaded automatically to the S3 bucket specified by the --output s3://cbcrg-eu/para-msa-results command line option.

  • When the computation is completed, the cluster can be safely shut down and the EC2 instances terminated with the nextflow cloud shutdown command.
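
For reference, here is a minimal sketch of the cloud scope in nextflow.config covering the properties mentioned above. The image ID is the publicly available AMI referenced later in this post; the instance type, spot price, EFS storage ID and mount path are illustrative values only:

cloud {
  imageId = 'ami-43f49030'           // VM image ID (example value)
  instanceType = 'm4.xlarge'         // EC2 instance type (example value)
  spotPrice = 0.05                   // bid price; enables EC2 Spot instances (example value)

  sharedStorageId = 'fs-12345678'    // placeholder: your EFS storage ID
  sharedStorageMount = '/mnt/efs'    // placeholder: where to mount the EFS file system
}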

Try it yourself

We are releasing the Nextflow integrated cloud support in the upcoming version 0.22.0. You can test it now by defining the following environment variable and running the Nextflow installer script as shown below:

    export NXF_VER=0.22.0-RC1
    curl get.nextflow.io | bash

Bear in mind that Nextflow requires a Unix-like operating system and a Java runtime version 7+ (Windows 10 users who have installed the Ubuntu subsystem should be able to run it, at their own risk).

Once you have installed it, you can follow the steps in the above demo. For your convenience we have made the EC2 image ami-43f49030 (EU Ireland region) used to record this screencast publicly available.

Also make sure you have the following variables defined in your environment:

AWS_ACCESS_KEY_ID="<your aws access key>"
AWS_SECRET_ACCESS_KEY="<your aws secret key>"
AWS_DEFAULT_REGION="<your aws region>"

Conclusion

Nextflow provides state-of-the-art support for cloud and container technologies, making it possible to create computing clusters in the cloud and deploy computational workflows with just two commands on your terminal.

In an upcoming post I will describe the autoscaling capabilities implemented by the Nextflow scheduler which, along with the use of spot/preemptible instances, provide a cost-effective solution for the execution of your pipeline in the cloud.

Credits

Thanks to Evan Floden for reviewing this post and for writing the paraMSA pipeline.

Enabling elastic computing with Nextflow


Learn how to deploy an elastic computing cluster in the AWS cloud with Nextflow

In the previous post I introduced the new cloud native support for AWS provided by Nextflow.

It allows the creation of a computing cluster in the cloud in a no-brainer way, enabling the deployment of complex computational pipelines in a few commands.

This solution is characterised by a lean application stack which does not require any third-party component installed on the EC2 instances other than a Java VM and the Docker engine (the latter is only required to deploy pipeline binary dependencies).

Nextflow cloud deployment

Each EC2 instance runs a script, at bootstrap time, that mounts the EFS storage and downloads and launches the Nextflow cluster daemon. This daemon is self-configuring: it automatically discovers the other running instances and joins them to form the computing cluster.

The simplicity of this stack makes it possible to set up the cluster in the cloud in just a few minutes, little more than the time required to spin up the EC2 VMs. This time does not depend on the number of instances launched, as they configure themselves independently.

This also makes it possible to add or remove instances as needed, realising the long promised elastic scalability of cloud computing.

This ability is even more important for bioinformatics workflows, which frequently crunch non-homogeneous datasets and are composed of tasks with very different computing requirements (e.g. a few very long-running tasks and many short-lived tasks in the same workload).

Going elastic

The Nextflow support for the cloud features an elastic cluster which is capable of resizing itself to adapt to the actual computing needs at runtime, thus spinning up new EC2 instances when jobs wait for too long in the execution queue, or terminating instances that are not used for a certain amount of time.

In order to enable the cluster autoscaling you will need to specify the autoscale properties in the nextflow.config file. For example:

cloud {
  imageId = 'ami-43f49030'
  instanceType = 'm4.xlarge'

  autoscale {
     enabled = true
     minInstances = 5
     maxInstances = 10
  }
}

The above configuration enables the autoscaling features so that the cluster will include at least 5 nodes. If at any point one or more tasks spend more than 5 minutes waiting to be processed, the number of instances needed to fulfil the pending tasks, up to the limit specified by the maxInstances attribute, is launched. On the other hand, if these instances are idle, they are terminated before reaching the 60-minute instance usage boundary.

The autoscaler launches instances by using the same AMI ID and type specified in the cloud configuration. However it is possible to define different attributes as shown below:

cloud {
  imageId = 'ami-43f49030'
  instanceType = 'm4.large'

  autoscale {
     enabled = true
     maxInstances = 10
     instanceType = 'm4.2xlarge'
     spotPrice = 0.05
  }
}

The cluster is first created by using instance(s) of type m4.large. Then, when new computing nodes are required the autoscaler launches instances of type m4.2xlarge. Also, since the spotPrice attribute is specified, EC2 spot instances are launched, instead of regular on-demand ones, bidding for the price specified.

Conclusion

Nextflow implements an easy yet effective cloud scheduler that is able to scale dynamically to meet the computing needs of deployed workloads, taking advantage of the elastic nature of the cloud platform.

This ability, along with the support for spot/preemptible instances, provides a cost-effective solution for the execution of your pipelines in the cloud.

More fun with containers in HPC


Nextflow was one of the first workflow frameworks to provide built-in support for Docker containers. A couple of years ago we also started to experiment with the deployment of containerised bioinformatic pipelines at CRG, using Docker technology (see here and here).

We found that isolating and packaging the complete computational workflow environment in Docker images radically simplifies the burden of maintaining the complex dependency graphs of real workload data analysis pipelines.

Even more importantly, the use of containers enables replicable results with minimal effort for the system configuration. The entire computational environment can be archived in a self-contained executable format, allowing the replication of the associated analysis at any point in time.

This ability is the main reason that drove the rapid adoption of Docker in the bioinformatic community and its support in many projects, like for example Galaxy, CWL, Bioboxes, Dockstore and many others.

However, while the popularity of Docker has spread among developers, its adoption in research computing infrastructures remains very low and it is very unlikely that this trend will change in the future.

The reason for this resides in the Docker architecture, which requires a daemon running with root permissions on each node of a computing cluster. Such a requirement raises many security concerns, thus good practices would prevent its use in shared HPC cluster or supercomputer environments.

Introducing Singularity

Fortunately, alternative implementations, such as Singularity, have been promoted by those interested in container technology.

Singularity is a container engine developed at the Berkeley Lab and designed for the needs of scientific workloads. The main differences from Docker are: containers are file based, no root escalation is allowed nor is root permission needed to run a container (although a privileged user is needed to create a container image), and there is no separate running daemon.

These, along with other features such as support for autofs mounts, make Singularity a container engine better suited to the requirements of HPC clusters and supercomputers.

Moreover, although Singularity uses a container image format different to that of Docker, they provide a conversion tool that allows Docker images to be converted to the Singularity format.

Singularity in the wild

We integrated Singularity support into the Nextflow framework and tested it in the CRG computing cluster and the BSC MareNostrum supercomputer.

The absence of a separate running daemon or image gateway made the installation straightforward when compared to Docker or other solutions.

To evaluate the performance of Singularity we carried out the same benchmarks we performed for Docker and compared the results of the two engines.

The benchmarks consisted of the execution of three Nextflow based genomic pipelines:

  1. Rna-toy: a simple pipeline for RNA-Seq data analysis.
  2. Nmdp-Flow: an assembly-based variant calling pipeline.
  3. Piper-NF: a pipeline for the detection and mapping of long non-coding RNAs.

In order to repeat the analyses, we converted the container images we used to perform the Docker benchmarks to Singularity image files by using the docker2singularity tool.

The only change needed to run these pipelines with Singularity was to replace the Docker specific settings with the following ones in the configuration file:

singularity.enabled = true
process.container = '<the image file path>'

Each pipeline was executed 10 times, alternately by using Docker and Singularity as container engine. The results are shown in the following table (time in minutes):

Pipeline     | Tasks | Mean task time (min)   | Mean execution time (min) | Execution time std dev | Ratio
             |       | Singularity | Docker   | Singularity | Docker      | Singularity | Docker   |
RNA-Seq      | 97    | 3.77        | 3.66     | 63.66       | 62.3        | 2.0         | 3.1      | 0.998
Variant call | 48    | 22.1        | 22.4     | 1061.2      | 1074.4      | 43.1        | 38.5     | 1.012
Piper-NF     | 98    | 1.2         | 1.3      | 120.0       | 124.5       | 6.9         | 2.8      | 1.038

The benchmark results show that there isn't any significant difference in the execution times of containerised workflows between Docker and Singularity. In two cases Singularity was slightly faster, and in the third it was almost identical, although a little slower than Docker.

Conclusion

In our evaluation Singularity proved to be an easy to install, stable and performant container engine.

The only minor drawback we found, when compared to Docker, was the need to define the host path mount points statically when the Singularity images were created. In fact, even though Singularity supports user mount points defined dynamically when the container is launched, this feature requires the overlay file system, which was not supported by the kernel available in our system.

Docker surely will remain the de facto standard engine and image format for containers due to its popularity and impressive growth.

However, in our opinion, Singularity is the tool of choice for the execution of containerised workloads in the context of HPC, thanks to its focus on system security and its simpler architectural design.

The transparent support provided by Nextflow for both Docker and Singularity technology guarantees the ability to deploy your workflows in a range of different platforms (cloud, cluster, supercomputer, etc). Nextflow transparently manages the deployment of the containerised workload according to the runtime available in the target system.
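
As an illustration, one possible way to switch between the two engines is through configuration profiles, in the same spirit as the settings shown earlier in this post; the profile names below are purely illustrative:

profiles {
  // run with Docker, e.g. on a workstation or cloud instance
  standard {
    docker.enabled = true
    process.container = 'nextflow/rnaseq-nf'
  }
  // run with Singularity, e.g. on an HPC cluster
  cluster {
    singularity.enabled = true
    process.container = '<the image file path>'
  }
}

Launching the pipeline with -profile cluster would then pick up the Singularity settings without touching the pipeline code.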

Credits

Thanks to Gabriel Gonzalez (CRG), Luis Exposito (CRG) and Carlos Tripiana Montes (BSC) for the support installing Singularity.

Nextflow published in Nature Biotechnology


We are excited to announce the publication of our work Nextflow enables reproducible computational workflows in Nature Biotechnology.

The article provides a description of the fundamental components and principles of Nextflow. We illustrate how the unique combination of containers, pipeline sharing and portable deployment provides tangible advantages to researchers wishing to generate reproducible computational workflows.

Reproducibility is a major challenge in today's scientific environment. We show how three bioinformatics data analyses produce different results when executed on different execution platforms and how Nextflow, along with software containers, can be used to control numerical stability, enabling consistent and replicable results across different computing platforms. As complex omics analyses enter the clinical setting, ensuring that results remain stable takes on extra importance.

Since its first release three years ago, the Nextflow user base has grown in an organic fashion. From the beginning it has been our own demands in a workflow tool and those of our users that have driven the development of Nextflow forward. The publication forms an important milestone in the project and we would like to extend a warm thank you to all those who have been early users and contributors.

We kindly ask if you use Nextflow in your own work to cite the following article:

Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. doi:10.1038/nbt.3820

Nextflow workshop is coming!


We are excited to announce the first Nextflow workshop that will take place at the Barcelona Biomedical Research Park building (PRBB) on 14-15th September 2017.

This event is open to everybody who is interested in the problem of computational workflow reproducibility. Leading experts and users will discuss the current state of the Nextflow technology and how it can be applied to manage -omics analyses in a reproducible manner. Best practices will be introduced on how to deploy real-world large-scale genomic applications for precision medicine.

During the hackathon, organized for the second day, participants will have the opportunity to learn how to write self-contained, replicable data analysis pipelines along with Nextflow expert developers.

More details at this link. The registration form is available here (deadline 15th Jun).

See you in Barcelona!

[Image: Nextflow workshop]

Nextflow and the Common Workflow Language


The Common Workflow Language (CWL) is a specification for defining workflows in a declarative manner. It has been implemented to varying degrees by different software packages. Nextflow and CWL share a common goal of enabling portable reproducible workflows.

We are currently investigating the automatic conversion of CWL workflows into Nextflow scripts to increase the portability of workflows. This work is being developed as the cwl2nxf project, currently in early prototype stage.

Our first phase of the project was to determine mappings of CWL to Nextflow and familiarize ourselves with how the current implementation of the converter supports a number of CWL specific features.

Mapping CWL to Nextflow

Inputs in the CWL workflow file are initially parsed as channels or other Nextflow input types. Each step specified in the workflow is then parsed independently. At the time of writing subworkflows are not supported; each step must be a CWL CommandLineTool file.

The image below shows an example of the major components in the CWL files and the resulting Nextflow code after conversion.

[Image: major components of the CWL files and the corresponding Nextflow script after conversion]

CWL and Nextflow share a similar structure of defining inputs and outputs as shown above.

A notable difference between the two is how tasks are defined. CWL requires either a separate file for each task or a sub-workflow. CWL also requires the explicit mapping of each command line option for an executed tool. This is done using YAML meta-annotation to indicate the position, prefix, etc. for each command line option.

In Nextflow a task command is defined as a separate component of the process definition: it is ultimately a multi-line string which is executed as a command script by the underlying system. Input parameters can be used in the command string with a simple variable interpolation mechanism. This is beneficial as it simplifies porting existing BASH scripts to Nextflow with minimal refactoring.
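
As a rough illustration (not taken from the cwl2nxf examples), a Nextflow process wrapping a hypothetical command line tool could look like the following sketch; the tool name and parameters are made up:

params.query = 'sample.fa'            // hypothetical input parameter

process alignQuery {
    input:
    file query from Channel.fromPath(params.query)

    output:
    file 'result.txt' into results_ch

    // the task command is a plain multi-line string;
    // ${query} is interpolated by Nextflow before execution
    """
    my_aligner --in ${query} > result.txt
    """
}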

These examples highlight some of the differences between the two approaches, and the difficulties converting complex use cases such as scatter, CWL expressions, and conditional command line inclusion.

Current status

cwl2nxf is a Groovy-based tool with limited conversion ability. It parses the YAML documents and maps the various CWL objects to Nextflow. Conversion examples are provided as part of the repository along with documentation for each example specifying the mapping.

This project was initially focused on developing an understanding of how to translate CWL to Nextflow. A number of CWL specific features such as scatter, secondary files and simple JavaScript expressions were analyzed and implemented.

The GitHub repository includes instructions on how to build cwl2nxf and an example usage. The tool can be executed as either just a parser printing the converted CWL to stdout, or by specifying an output file which will generate the Nextflow script file and if necessary a config file.

The tool takes in a CWL workflow file and the YAML inputs file. It does not currently work with a standalone CommandLineTool. The following example shows how to run it:

java -jar build/libs/cwl2nxf-*.jar rnatoy.cwl samp.yaml


See the GitHub repository for further details.

Conclusion

We are continuing to investigate ways to improve the interoperability of Nextflow with CWL. Although still an early prototype, the cwl2nxf tool provides some level of conversion of CWL to Nextflow.

We are also planning to explore CWL Avro, which may provide a more efficient way to parse and handle CWL objects for conversion to Nextflow.

Additionally, a number of workflows in the GitHub repository have been implemented in both CWL and Nextflow which can be used as a comparison of the two languages.

The Nextflow team will be presenting a short talk and participating in the Codefest at BOSC 2017. We are interested in hearing from the community regarding CWL to Nextflow conversion, and would like to encourage anyone interested to contribute to the cwl2nxf project.

Nextflow Hackathon 2017


Last week saw the inaugural Nextflow meeting organised at the Centre for Genomic Regulation (CRG) in Barcelona. The event combined talks, demos, a tutorial/workshop for beginners as well as two hackathon sessions for more advanced users.

Nearly 50 participants attended over the two days which included an entertaining tapas course during the first evening!

One of the main objectives of the event was to bring together Nextflow users to work together on common interest projects. There were several proposals for the hackathon sessions and in the end five diverse ideas were chosen for communal development ranging from new pipelines through to the addition of new features in Nextflow.

The proposals and outcomes of each of the projects, which can be found in the issues section of this GitHub repository, have been summarised below.

Nextflow HTML tracing reports

The HTML tracing project aims to generate a rendered version of the Nextflow trace file to enable fast sorting and visualisation of task/process execution statistics.

Currently the data in the trace includes information such as CPU duration, memory usage and completion status of each task, however wading through the file is often not convenient when a large number of tasks have been executed.

Phil Ewels proposed the idea and led the coordination effort with the outcome being a very impressive working prototype which can be found in the Nextflow branch html-trace.

An image of the example report is shown below with the interactive HTML available here. It is expected to be merged into the main branch of Nextflow with documentation in a near-future release.

[Image: Nextflow HTML execution report]

Nextflow pipeline for 16S microbial data

The H3Africa Bioinformatics Network have been developing several pipelines which are used across the participating centers. The diverse computing resources available across the nodes have led to members wanting workflow solutions with a particular focus on portability.

With this in mind, Scott Hazelhurst proposed a project for a 16S microbial data analysis pipeline which had previously been developed using CWL.

The participants made a new branch of the original pipeline and ported it into Nextflow.

The pipeline will continue to be developed with the goal of acting as a comparison between CWL and Nextflow. It is thought this can then be extended to other pipelines by both those who are already familiar with Nextflow as well as used as a tool for training newer users.

Nextflow modules prototyping

Toolboxing allows users to incorporate software into their pipelines in an efficient and reproducible manner. Various software repositories are becoming increasingly popular, highlighted by the over 5,000 tools available in the Galaxy Toolshed.

Projects such as Biocontainers aim to wrap up the execution environment using containers. Johan Viklund and I wished to piggyback off existing repositories and settled on Dockstore, which is an open platform compliant with the GA4GH initiative.

The majority of tools in Dockstore are written in CWL and therefore we required a parser between the CWL CommandLineTool class and Nextflow processes. Johan was able to develop a parser which generates Nextflow processes for several Dockstore tools.

As resources such as Dockstore become mature and standardised, it will be possible to automatically generate a Nextflow Store and enable efficient incorporation of tools into workflows.

Example showing a Nextflow process generated from the Dockstore CWL repository for the tool BAMStats.

Nextflow pipeline for de novo assembly of nanopore reads

Nanopore sequencing is an exciting and emerging technology which promises to change the landscape of nucleotide sequencing.

With keen interest in Nanopore-specific pipelines, Hadrien Gourlé led the hackathon project for Nanoflow.

Nanoflow is a de novo assembler of bacterial genomes from nanopore reads, using Nextflow.

During the two days the participants developed the pipeline for adapter trimming as well as assembly and consensus sequence generation using either Canu or Miniasm.

The future plans are to finalise the pipeline to include a polishing step and a genome annotation step.

Nextflow AWS Batch integration

Nextflow already has experimental support for AWS Batch and the goal of this project, proposed by Francesco Strozzi, was to improve this support, add features and test the implementation on real-world pipelines.

Earlier work from Paolo Di Tommaso in the Nextflow repository, highlighted several challenges to using AWS Batch with Nextflow.

The major obstacle described by Tim Dudgeon was the requirement for each Docker container to have a version of the Amazon Web Services Command Line tools (aws-cli) installed.

A solution was to install the AWS CLI tools on a custom AWS image that is used by the Docker host machine, and then mount the directory that contains the necessary items into each of the Docker containers as a volume. Early testing suggests this approach works with the hope of providing a more elegant solution in future iterations.

The code and documentation for AWS Batch has been prepared and will be tested further before being rolled into an official Nextflow release in the near future.

Conclusion

The event was seen as an overwhelming success and special thanks must go to all the participants. As the Nextflow community continues to grow, it would be fantastic to make these types of meetings more regular occasions.

In the meantime we have put together a short video containing some of the highlights of the two days.

We hope to see you all again in Barcelona soon or at new events around the world!

Scaling with AWS Batch


The latest Nextflow release (0.26.0) includes built-in support for AWS Batch, a managed computing service that allows the execution of containerised workloads over the Amazon EC2 Container Service (ECS).

This feature allows the seamless deployment of Nextflow pipelines in the cloud by offloading the process executions as managed Batch jobs. The service takes care to spin up the required computing instances on-demand, scaling up and down the number and composition of the instances to best accommodate the actual workload resource needs at any point in time.

AWS Batch shares with Nextflow the same vision regarding workflow containerisation i.e. each compute task is executed in its own Docker container. This dramatically simplifies the workflow deployment through the download of a few container images. This common design background made the support for AWS Batch a natural extension for Nextflow.

Batch in a nutshell

Batch is organised in Compute Environments, Job queues, Job definitions and Jobs.

The Compute Environment allows you to define the computing resources required for a specific workload (type). You can specify the minimum and maximum number of CPUs that can be allocated, the EC2 provisioning model (On-demand or Spot), the AMI to be used and the allowed instance types.

The Job queue definition allows you to bind a specific task to one or more Compute Environments.

Then, the Job definition is a template for one or more jobs in your workload. This is required to specify the Docker image to be used in running a particular task along with other requirements such as the container mount points, the number of CPUs, the amount of memory and the number of retries in case of job failure.

Finally the Job binds a Job definition to a specific Job queue and allows you to specify the actual task command to be executed in the container.

The job input and output data management is delegated to the user. This means that if you only use Batch API/tools you will need to take care to stage the input data from a S3 bucket (or a different source) and upload the results to a persistent storage location.

This could turn out to be cumbersome in complex workflows with a large number of tasks and above all it makes it difficult to deploy the same applications across different infrastructure.

How to use Batch with Nextflow

Nextflow streamlines the use of AWS Batch by smoothly integrating it in its workflow processing model and enabling transparent interoperability with other systems.

To run Nextflow you will need to set up, in your AWS Batch account, a Compute Environment defining the required computing resources, and associate it to a Job Queue.

Nextflow takes care to create the required Job Definitions and Job requests as needed. This spares you some Batch configuration steps.

In the nextflow.config file, specify the awsbatch executor, the Batch queue and the container to be used in the usual manner. You may also need to specify the AWS region and access credentials if they are not provided by other means. For example:

process.executor = 'awsbatch'
process.queue = 'my-batch-queue'
process.container = 'your-org/your-docker:image'
aws.region = 'eu-west-1'
aws.accessKey = 'xxx'
aws.secretKey = 'yyy'

Each process can optionally use a different queue and Docker image (see the Nextflow documentation for details). The container image(s) must be published in a Docker registry that is accessible from the instances run by AWS Batch, e.g. Docker Hub, Quay or the EC2 Container Registry (ECR).
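
As a sketch only, using the same per-process $name selector syntax that appears in the benchmark profile later in this post, such overrides might look like the following; the process name, queue, image and resource values are hypothetical:

// hypothetical per-process overrides, for illustration only
process.$quant.queue = 'large-memory-queue'
process.$quant.container = 'your-org/salmon-quant:latest'
process.$quant.cpus = 8
process.$quant.memory = '16 GB'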

The Nextflow process can be launched either on a local computer or on an EC2 instance. The latter is suggested for heavy or long-running workloads.

Note that input data should be stored in S3 storage. In the same manner, the pipeline execution must specify an S3 bucket as the working directory by using the -w command line option.

A final caveat about custom containers and computing AMI. Nextflow automatically stages input data and shares tasks intermediate results by using the S3 bucket specified as a work directory. For this reason it needs to use the aws command line tool which must be installed either in your process container or be present in a custom AMI that can be mounted and accessed by the Docker containers.
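
For example, assuming the AWS CLI tools were installed with Miniconda in the custom AMI (as in the benchmark profile shown below), the configuration could point Nextflow to that location:

// path of the aws tool inside the custom AMI (example location)
executor.awscli = '/home/ec2-user/miniconda/bin/aws'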

You may also need to create a custom AMI because the default image used by AWS Batch only provides 22 GB of storage which may not be enough for real world analysis pipelines.

See the documentation to learn how to create a custom AMI with larger storage and how to setup the AWS CLI tools.

An example

In order to validate Nextflow integration with AWS Batch, we used a simple RNA-Seq pipeline.

This pipeline takes as input a metadata file from the Encode project corresponding to a search returning all human RNA-seq paired-end datasets (the metadata file has been additionally filtered to retain only data having a SRA ID).

The pipeline automatically downloads the FASTQ files for each sample from the EBI ENA database, it assesses the overall quality of sequencing data using FastQC and then runs Salmon to perform the quantification over the human transcript sequences. Finally all the QC and quantification outputs are summarised using the MultiQC tool.

For the sake of this benchmark we used the first 38 samples out of the full 375 samples dataset.

The pipeline was executed both on AWS Batch and on the CRG internal Univa cluster, using Singularity as the container runtime in the latter.

It's worth noting that, with the exception of the two configuration changes detailed below, we used exactly the same pipeline implementation, available at this GitHub repository.

The AWS deploy used the following configuration profile:

aws.region = 'eu-west-1'
aws.client.storageEncryption = 'AES256'
process.queue = 'large'
executor.name = 'awsbatch'
executor.awscli = '/home/ec2-user/miniconda/bin/aws'

While for the cluster deployment the following configuration was used:

process.executor = 'crg'
singularity.enabled = true
process.container = "docker://nextflow/rnaseq-nf"
process.queue = 'cn-el7'
process.time = '90 min'
process.$quant.time = '4.5 h'

Results

The AWS Batch Compute Environment was configured to use a maximum of 132 CPUs, the same number of CPUs available in the queue used for the local cluster deployment.

The two executions ran in roughly the same time: 2 hours and 24 minutes when running in the CRG cluster and 2 hours and 37 minutes when using AWS Batch.

It must be noted that 14 jobs failed in the Batch deployment, presumably because one or more spot instances were retired. However Nextflow was able to re-schedule the failed jobs automatically and the overall pipeline execution completed successfully, also showing the benefits of a truly fault tolerant environment.

The overall cost for running the pipeline with AWS Batch was $5.47 ($3.28 for EC2 instances, $1.88 for EBS volume and $0.31 for S3 storage). This means that with ~$55 we could have performed the same analysis on the full Encode dataset.

It is more difficult to estimate the cost when using the internal cluster, because we don't have access to such detailed cost accounting. However, as a user, we can estimate it roughly comes out at $0.01 per CPU-hour. The pipeline needed around 147 CPU-hours to carry out the analysis, hence an estimated cost of $1.47 just for the computation.

The execution report for the Batch execution is available at this link and the one for cluster is available here.

Conclusion

This post shows how Nextflow integrates smoothly with AWS Batch and how it can be used to deploy and execute real-world genomics pipelines in the cloud with ease.

The auto-scaling ability provided by AWS Batch, along with the use of spot instances, makes the use of the cloud even more cost effective. Running on a local cluster may still be cheaper, even if it is non-trivial to account for all the real costs of an HPC infrastructure. However the cloud allows a flexibility and scalability not possible with common on-premises clusters.

We also demonstrate how the same Nextflow pipeline can be transparently deployed in two very different computing infrastructures, using different containerisation technologies, by simply providing a separate configuration profile.

This approach enables the interoperability across different deployment sites, reduces operational and maintenance costs and guarantees consistent results over time.

Credits

This post is co-authored with Francesco Strozzi, who also helped to write the pipeline used for the benchmark in this post and contributed to and tested the AWS Batch integration. Thanks to Emilio Palumbo, who helped to set up and configure the AWS Batch environment, and to Evan Floden for the comments.

Running CAW with Singularity and Nextflow


This is a guest post authored by Maxime Garcia from the Science for Life Laboratory in Sweden. Max describes how they deploy complex cancer data analysis pipelines by using Nextflow and Singularity. We are very happy to share their experience across the Nextflow community.

The CAW pipeline

[Image: Cancer Analysis Workflow logo]

Cancer Analysis Workflow (CAW for short) is a Nextflow-based analysis pipeline developed for the analysis of tumour/normal pairs. It is developed in collaboration with two infrastructures within Science for Life Laboratory: the National Genomics Infrastructure (NGI), at the Stockholm Genomics Applications Development Facility to be precise, and the National Bioinformatics Infrastructure Sweden (NBIS).

CAW is based on GATK Best Practices for the preprocessing of FastQ files, then uses various variant calling tools to look for somatic SNVs and small indels (MuTect1, MuTect2, Strelka, Freebayes, GATK HaplotypeCaller), structural variants (Manta) and CNVs (ASCAT). Annotation tools (snpEff, VEP) are also used, and finally MultiQC for handling reports.

We are currently working on a manuscript, but you're welcome to look at (or even contribute to) our github repository or talk with us on our gitter channel.

Singularity and UPPMAX

Singularity is a tool to package software dependencies into a contained environment, much like Docker. It's designed to run on HPC environments where Docker is often a problem due to its requirement for administrative privileges.

We're based in Sweden, and the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) provides computational infrastructure for all Swedish researchers. Since we're analyzing sensitive data, we are using secure clusters (with two-factor authentication) set up by UPPMAX: SNIC-SENS.

In my case, since we're still developing the pipeline, I am mainly using the research cluster Bianca. So I can only transfer files and data in one specific repository using SFTP.

UPPMAX provides computing resources for Swedish researchers for all scientific domains, so getting software updates can occasionally take some time. Typically, Environment Modules are used which allow several versions of different tools - this is good for reproducibility and is quite easy to use. However, the approach is not portable across different clusters outside of UPPMAX.

Why use containers?

The idea of using containers, for improved portability and reproducibility, and more up-to-date tools, came naturally to us, as it is easily managed within Nextflow. We cannot use Docker on our secure cluster, so we wanted to run CAW with Singularity images instead.

How was the switch made?

We were already using Docker containers for our continuous integration testing with Travis, and since we use many tools, I took the approach of making (almost) a container for each process. Because this process is quite slow and repetitive, and I'm lazy and like to automate everything, I made a simple NF script to build and push all the Docker containers. Basically it's just build and pull for all containers, with some configuration possibilities.

docker build -t ${repository}/${container}:${tag} ${baseDir}/containers/${container}/.

docker push ${repository}/${container}:${tag}

Since Singularity can directly pull images from DockerHub, I made the build script pull all containers from DockerHub to produce local Singularity image files.

singularity pull --name ${container}-${tag}.img docker://${repository}/${container}:${tag}

After this, it's just a matter of moving all containers to the secure cluster we're using, and using the right configuration file in the profile. I'll spare you the details of the SFTP transfer. This is what the configuration file for such Singularity images looks like: singularity-path.config

/*
vim: syntax=groovy
-*- mode: groovy;-*-
 * -------------------------------------------------
 * Nextflow config file for CAW project
 * -------------------------------------------------
 * Paths to Singularity images for every process
 * No image will be pulled automatically
 * Need to transfer and set up images before
 * -------------------------------------------------
 */

singularity {
  enabled = true
  runOptions = "--bind /scratch"
}

params {
  containerPath='containers'
  tag='1.2.3'
}

process {
  $ConcatVCF.container      = "${params.containerPath}/caw-${params.tag}.img"
  $RunMultiQC.container     = "${params.containerPath}/multiqc-${params.tag}.img"
  $IndelRealigner.container = "${params.containerPath}/gatk-${params.tag}.img"
  // I'm not putting the whole file here
  // you probably already got the point
}

This approach ran (almost) perfectly on the first try, except for one process failing due to a typo in a container name...

Conclusion

This switch was completed a couple of months ago and has been a great success. We are now using Singularity containers in almost all of our Nextflow pipelines developed at NGI. Even if we do enjoy the improved control, we must not forget that:

With great power comes great responsibility!

Credits

Thanks to Rickard Hammarén and Phil Ewels for comments and suggestions for improving the post.
