Channel: Nextflow Blog

Learn Nextflow in 2023


In 2023, the world of Nextflow is more exciting than ever! With new resources constantly being released, there is no better time to dive into this powerful tool. From a new Software Carpentries’ course to an in-depth write-up by 23andMe to new tutorials on Wave and Fusion, the options for learning Nextflow are endless.

We've compiled a list of the best resources in 2023 to make your journey to Nextflow mastery as seamless as possible. And remember, Nextflow is a community-driven project. If you have suggestions or want to contribute to this list, head to the GitHub page and make a pull request.

Before you start

Before learning Nextflow, you should be comfortable with the Linux command line and be familiar with some basic scripting languages such as Perl or Python. The beauty of Nextflow is that task logic can be written in your language of choice. You will just need to learn Nextflow’s domain-specific language (DSL) to control overall flow.
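
For a first taste of what that division of labor looks like, here is a minimal, hypothetical sketch (the parameter and process names are ours, not from any tutorial): Nextflow's DSL controls the flow, while the task itself is a one-line Python command.

```nextflow
#!/usr/bin/env nextflow

// Hypothetical example: the DSL orchestrates, the task logic is plain Python.
params.greeting = 'Hello from Nextflow'

process SAY_HELLO {
    input:
    val message

    output:
    stdout

    script:
    """
    python3 -c "print('${message}'.upper())"
    """
}

workflow {
    SAY_HELLO(Channel.of(params.greeting)) | view
}
```

The `script` block could just as easily call Perl, R, or a shell tool — Nextflow only cares about the inputs and outputs.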

Nextflow is widely used in bioinformatics, so many tutorials focus on life sciences. However, Nextflow can be used for almost any data-intensive workflow, including image analysis, ML model training, astronomy, and geoscience applications.

So, let's get started! These resources will guide you from beginner to expert and make you unstoppable in the field of scientific workflows.


Why Learn Nextflow

There are hundreds of workflow managers to choose from. In fact, Meir Wahnon and several of his colleagues have gone to the trouble of compiling an awesome-workflow-engines list. The workflows community initiative is another excellent source of information about workflow engines.

  • Using Nextflow in your analysis workflows helps you implement reproducible pipelines. Nextflow pipelines follow FAIR guidelines (findability, accessibility, interoperability, and reuse). Nextflow also supports version control and containers to manage all software dependencies.
  • Nextflow is portable; the same pipeline written on a laptop can quickly scale to run on an HPC cluster, Amazon, Azure, Google Cloud services, or Kubernetes. With features like configuration profiles, code can be written so that it is 100% portable across different on-prem and cloud infrastructures, enabling collaboration and avoiding lock-in.
  • It is massively scalable, allowing the parallelization of tasks using the dataflow paradigm without hard-coding pipelines to specific platforms, workload managers, or batch services.
  • Nextflow is flexible, supporting scientific workflow requirements like caching processes to avoid redundant computation and workflow reporting to help understand and diagnose workflow execution patterns.
  • It is growing fast, and support is available from Seqera Labs. The project has been active since 2013 with a vibrant developer community, and the Nextflow ecosystem continues to expand rapidly.
  • Finally, Nextflow is open source and licensed under Apache 2.0. You are free to use it, modify it and distribute it.
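
As a rough illustration of the portability point above, a single `nextflow.config` can declare profiles for different infrastructures. The executor names below are real Nextflow executors; the queue and bucket names are placeholders:

```nextflow
// nextflow.config — hypothetical profiles; queue and bucket names are placeholders
profiles {
    standard {
        process.executor = 'local'
    }
    cluster {
        process.executor = 'slurm'
        process.queue    = 'long'
    }
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'
        workDir          = 's3://my-bucket/work'
    }
}
```

The same pipeline then runs anywhere with `nextflow run main.nf -profile cluster` — no changes to the pipeline code itself.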

Meet the Tutorials!

Some of the best publicly available tutorials are listed below:

1. Introduction to Nextflow by 23andMe

This informative post by Anuved Verma and Anja Bog begins with the basic concepts of Nextflow and builds towards how Nextflow is used at 23andMe. It includes a detailed explanation of how 23andMe runs their genotype imputation pipelines in the cloud, processing over 1 million individual samples daily with 10,000+ CPUs in a single compute environment.

Nextflow at 23andMe

2. A simple RNA-Seq hands-on tutorial

This hands-on tutorial from Seqera Labs will guide you through implementing a proof-of-concept RNA-seq pipeline. The goal is to become familiar with basic concepts, including defining parameters, using channels to pass data, and writing processes to perform tasks. It includes all scripts, input data, and resources and is perfect for getting a taste of Nextflow. One thing to consider is that some of the examples pre-date the availability of DSL2 – a set of syntax extensions that allows module libraries to be defined and simplifies the writing of complex data analysis pipelines. The tutorial is still a great place to start, but you’ll want to learn about DSL2 as well.

Tutorial link on GitHub
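
To give a flavor of the DSL2 syntax mentioned above, here is a hypothetical module and the `include` statement that pulls it into a workflow (the process, file names, container image, and `params.reads` are all illustrative, not taken from the tutorial):

```nextflow
// modules/fastqc.nf — a hypothetical, reusable DSL2 module
process FASTQC {
    container 'biocontainers/fastqc:v0.11.9_cv8'  // assumed image

    input:
    path reads

    output:
    path 'fastqc_out'

    script:
    """
    mkdir fastqc_out
    fastqc ${reads} --outdir fastqc_out
    """
}

// main.nf — import the module and call it like a function
include { FASTQC } from './modules/fastqc.nf'

workflow {
    FASTQC(Channel.fromPath(params.reads))
}
```

Because modules are plain files, they can be shared and reused across pipelines — the main idea behind the nf-core shared modules covered later in this list.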

3. Nextflow workshop from Seqera Labs

Here you’ll dive deeper into Nextflow’s most prominent features and learn how to apply them. The full workshop includes an excellent section on containers, how to build them, and how to use them with Nextflow. The written materials come with examples and hands-on exercises. Optionally, you can also follow with a series of videos from a live training workshop. Better yet, these examples are updated to reflect the new DSL 2 syntax and concepts such as modules, so you can learn to take advantage of the latest powerful Nextflow features.

The workshop includes topics on:

  • Environment Setup
  • Basic NF Script and Concepts
  • Nextflow Processes
  • Nextflow Channels
  • Nextflow Operators
  • Basic RNA-Seq pipeline
  • Containers & Conda
  • Nextflow Configuration
  • On-premise & Cloud Deployment
  • DSL 2 & Modules
  • GATK hands-on exercise

Workshop & YouTube playlist.
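
Channels and operators, which the workshop covers in depth, compose like a pipeline of transformations. A toy sketch (not workshop material):

```nextflow
// Toy example of the dataflow model: values stream through
// a chain of operators, each transforming the channel.
workflow {
    Channel
        .of(1, 2, 3, 4)
        .map { it * it }     // square each value: 1, 4, 9, 16
        .filter { it > 4 }   // keep 9 and 16
        .sum()               // reduce to a single value: 25
        .view()
}
```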

4. Software Carpentry workshop

The Nextflow Software Carpentry workshop (still being developed) explains the use of Nextflow and nf-core as development tools for building and sharing reproducible data science workflows. The intended audience is those with little programming experience. The course provides a foundation to write and run Nextflow and nf-core workflows comfortably. Adapted from the Seqera training material above, the workshop has been updated by Software Carpentries instructors within the nf-core community to fit The Carpentries training style. The Carpentries emphasize feedback to improve teaching materials, so we would like to hear back from you about what you thought was well-explained and what needs improvement. Pull requests to the course material are very welcome. The workshop can be opened on Gitpod where you can try the exercises in an online computing environment at your own pace, while referencing the course material in another window alongside the tutorials.


You can find the course in The Carpentries incubator.

5. Managing Pipelines in the Cloud - GenomeWeb Webinar

This on-demand webinar features Phil Ewels from SciLifeLab, nf-core (and now also Seqera Labs), Brendan Boufler from Amazon Web Services, and Evan Floden from Seqera Labs. The wide-ranging discussion covers the significance of scientific workflows, examples of Nextflow in production settings, and how Nextflow can be integrated with other processes.

Watch the webinar

6. Nextflow implementation patterns

This advanced documentation discusses recurring patterns in Nextflow and solutions to many common implementation requirements. Code examples are available with notes to follow along and a GitHub repository.

Nextflow Patterns & GitHub repository.
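
One of the recurring themes you’ll find there is scatter/gather: fan a channel of files out to parallel tasks, then collect the results into a single merging task. A rough sketch under assumed names (`COUNT_LINES`, `MERGE`, and `params.inputs` are ours):

```nextflow
// Scatter: one COUNT_LINES task per input file, run in parallel
process COUNT_LINES {
    input:
    path infile

    output:
    path "${infile}.count"

    script:
    """
    wc -l < ${infile} > ${infile}.count
    """
}

// Gather: a single task that sums all the per-file counts
process MERGE {
    input:
    path counts

    output:
    stdout

    script:
    """
    cat ${counts} | paste -sd+ - | bc
    """
}

workflow {
    files = Channel.fromPath(params.inputs)   // e.g. --inputs 'data/*.txt'
    MERGE(COUNT_LINES(files).collect()) | view
}
```

The `collect()` operator is what turns the many parallel outputs back into a single input for the merge step.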

7. nf-core tutorials

A set of tutorials developed by the team at nf-core, covering the basics of using and creating nf-core pipelines. These tutorials provide an overview of the nf-core framework, including:

  • How to run nf-core pipelines
  • What are the most commonly used nf-core tools
  • How to make new pipelines using the nf-core template
  • What are nf-core shared modules
  • How to add nf-core shared modules to a pipeline
  • How to make new nf-core modules using the nf-core module template
  • How nf-core pipelines are reviewed and ultimately released

nf-core usage tutorials and nf-core developer tutorials.

8. Awesome Nextflow

A collection of awesome Nextflow pipelines compiled by various contributors to the open-source Nextflow project.

Awesome Nextflow and GitHub

9. Wave showcase: Wave and Fusion tutorials

Wave and the Fusion file system are new Nextflow capabilities introduced in November 2022. Wave is a container provisioning and augmentation service fully integrated with the Nextflow ecosystem. Instead of viewing containers as separate artifacts that need to be integrated into a pipeline, Wave allows developers to manage containers as part of the pipeline itself.

Tightly coupled with Wave is the new Fusion 2.0 file system. Fusion implements a virtual distributed file system and presents a thin client allowing data hosted in AWS S3 buckets (and other object stores in the future) to be accessed via the standard POSIX filesystem interface expected by most applications.

Wave can help simplify development, improve reliability, and make pipelines easier to maintain. It can even improve pipeline performance. The optional Fusion 2.0 file system offers further advantages delivering performance on par with FSx for Lustre while enabling organizations to reduce their cloud computing bill and improve pipeline efficiency throughput. See the blog article released in February 2023 explaining the Fusion file system and providing benchmarks comparing Fusion to other data handling approaches in the cloud.

Wave showcase on GitHub
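
Enabling Wave and Fusion is largely a configuration exercise. A sketch of the relevant `nextflow.config` settings — the bucket name is a placeholder, and a Tower access token is needed because Wave is provided as a service:

```nextflow
// nextflow.config — hypothetical Wave + Fusion setup
wave {
    enabled = true
}

fusion {
    enabled = true
}

// With Fusion, tasks read and write object storage directly,
// so the work directory can live in a bucket (placeholder name)
workDir = 's3://my-bucket/work'

tower {
    accessToken = System.getenv('TOWER_ACCESS_TOKEN')
}
```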

10. Building Containers for Scientific Workflows

While not strictly a guide about Nextflow, this article provides an overview of scientific containers, along with a tutorial on creating your own container and integrating it into a Nextflow pipeline. It also provides some useful tips on troubleshooting containers and publishing them to registries.

Building Containers for Scientific Workflows

11. Best Practices for Deploying Pipelines with Nextflow Tower

When building Nextflow pipelines, a best practice is to supply a nextflow_schema.json file describing pipeline input parameters. The benefit of adding this file to your code repo is that, when the pipeline is launched through Nextflow Tower, the schema enables an easy-to-use web interface that guides users through the process of parameter selection. While it is possible to craft this file by hand, the nf-core community provides a handy schema build tool. This step-by-step guide explains how to adapt your pipeline for use with Nextflow Tower by using the schema build tool to automatically generate the nextflow_schema.json file.

Best Practices for Deploying Pipelines with Nextflow Tower
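
For orientation, a heavily trimmed, hypothetical nextflow_schema.json fragment might look like the following — in practice you would generate and maintain it with the nf-core schema build tool rather than by hand:

```json
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "title": "Example pipeline parameters",
    "type": "object",
    "properties": {
        "reads": {
            "type": "string",
            "format": "file-path",
            "description": "Path to the input FASTQ files"
        },
        "outdir": {
            "type": "string",
            "default": "./results",
            "description": "Directory for pipeline results"
        }
    },
    "required": ["reads"]
}
```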

Cloud integration tutorials

In addition to the learning resources above, several step-by-step integration guides explain how to run Nextflow pipelines on your cloud platform of choice. Some of these tutorials extend to the use of Nextflow Tower. Organizations can use the Tower Cloud Free edition to launch pipelines quickly in the cloud. Organizations can optionally use Tower Cloud Professional or run self-hosted or on-premises Tower Enterprise environments as requirements grow. This year, we added Google Cloud Batch to the cloud services supported by Nextflow.

1. Nextflow and AWS Batch — Inside the Integration

This three-part series of articles provides a step-by-step guide explaining how to use Nextflow with AWS Batch. The first article covers AWS Batch concepts and the Nextflow execution model, and explains how the integration works under the covers. The second article provides a step-by-step guide to setting up the AWS Batch environment and running and troubleshooting open-source Nextflow pipelines. The third article builds on what you've learned, explaining how to integrate workflows with Nextflow Tower and share the AWS Batch environment with other users by "publishing" your workflows to the cloud.

Nextflow and AWS Batch — Inside the Integration (part 1 of 3, part 2 of 3, part 3 of 3)
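
The end state of the series boils down to configuration roughly like this — the queue, bucket, region, and CLI path below are placeholders for values from your own AWS setup:

```nextflow
// nextflow.config — hypothetical AWS Batch setup
process {
    executor = 'awsbatch'
    queue    = 'my-batch-queue'          // your AWS Batch job queue
}

workDir = 's3://my-bucket/work'          // S3 bucket for intermediate data

aws {
    region = 'eu-west-1'
    batch {
        cliPath = '/home/ec2-user/miniconda/bin/aws'  // AWS CLI on the custom AMI
    }
}
```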

2. Nextflow and Azure Batch — Inside the Integration

Similar to the tutorial above, this set of articles does a deep dive into the Nextflow Azure Batch integration. Part 1 covers Azure Batch and essential concepts, provides an overview of the integration and explains how to set up Azure Batch and Storage accounts. It also covers deploying a machine instance in the Azure cloud and configuring it to run Nextflow pipelines against the Azure Batch service.

Part 2 builds on what you learned in part 1 and shows how to use Azure Batch from within Nextflow Tower Cloud. It provides a walkthrough of how to make the environment set up in part 1 accessible to users through Tower's intuitive web interface.

Nextflow and Azure Batch — Inside the Integration (part 1 of 2, part 2 of 2)
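
The Azure equivalent is configuration along these lines (account names, keys, and the storage container are placeholders):

```nextflow
// nextflow.config — hypothetical Azure Batch setup
process.executor = 'azurebatch'
workDir = 'az://my-container/work'       // Azure Blob container for intermediates

azure {
    storage {
        accountName = 'mystorageaccount'
        accountKey  = System.getenv('AZURE_STORAGE_KEY')
    }
    batch {
        location     = 'westeurope'
        accountName  = 'mybatchaccount'
        accountKey   = System.getenv('AZURE_BATCH_KEY')
        autoPoolMode = true              // let Nextflow create compute pools on demand
    }
}
```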

3. Get started with Nextflow on Google Cloud Batch

This excellent article by Marcel Ribeiro-Dantas provides a step-by-step tutorial on using Nextflow with Google’s new Google Cloud Batch service. Google Cloud Batch is expected to replace the Google Life Sciences integration over time. The article explains how to deploy the Google Cloud Batch and Storage environments in GCP using the gcloud CLI. It then goes on to explain how to configure Nextflow to launch pipelines into the newly created Google Cloud Batch environment.

Get started with Nextflow on Google Cloud Batch
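
The Nextflow side of that setup is compact; a sketch with placeholder project, location, and bucket values:

```nextflow
// nextflow.config — hypothetical Google Cloud Batch setup
process.executor = 'google-batch'
workDir = 'gs://my-bucket/work'          // Cloud Storage bucket for intermediates

google {
    project  = 'my-gcp-project'
    location = 'us-central1'
}
```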

4. Nextflow and K8s Rebooted: Running Nextflow on Amazon EKS

While not commonly used for HPC workloads, Kubernetes has clear momentum. In this educational article, Ben Sherman provides an overview of how the Nextflow / Kubernetes integration has been simplified by avoiding the requirement for Persistent Volumes (PVs) and Persistent Volume Claims (PVCs). This detailed guide provides step-by-step instructions for using Amazon EKS as a compute environment, complete with how to configure IAM Roles for Kubernetes Service Accounts (IRSA), now an Amazon EKS best practice.

Nextflow and K8s Rebooted: Running Nextflow on Amazon EKS

Additional resources

The following resources will help you dig deeper into Nextflow and related projects like the nf-core community, which maintains curated pipelines and a very active Slack channel. There are plenty of Nextflow tutorials and videos online, and the following list is by no means exhaustive. Please let us know if we are missing anything.

1. Nextflow docs

The reference for the Nextflow language and runtime. These docs should be your first point of reference while developing Nextflow pipelines. The newest features are documented in the edge documentation pages, with an edge release every month and a stable release every three months.

Latest stable & edge documentation.

2. Seqera Labs docs

An index of documentation, deployment guides, training materials and resources for all things Nextflow and Tower.

Seqera Labs docs

3. nf-core

nf-core is a growing community of Nextflow users and developers. You can find curated sets of biomedical analysis pipelines written in Nextflow and built by domain experts. Each pipeline is stringently reviewed and has been implemented according to best practice guidelines. Be sure to sign up to the Slack channel.

nf-core website and nf-core Slack

4. Nextflow Tower

Nextflow Tower is a platform to easily monitor, launch and scale Nextflow pipelines on cloud providers and on-premise infrastructure. The documentation provides details on setting up compute environments, monitoring pipelines and launching using either the web graphic interface, CLI or API.

Nextflow Tower and user documentation.

5. Nextflow on AWS

As part of Genomics Workflows on AWS, Amazon provides a quickstart for deploying a genomics analysis environment on the Amazon Web Services (AWS) cloud, using Nextflow to create and orchestrate analysis workflows and AWS Batch to run the workflow processes. While this article is packed with good information, the procedure outlined in the more recent Nextflow and AWS Batch – Inside the integration series may be an easier place to start. Some of the steps that previously needed to be performed manually have been updated in the latest integration.

Nextflow on AWS Batch

6. Nextflow Data Pipelines on Azure Batch

Nextflow on Azure requires, at minimum, two Azure services: Azure Batch and Azure Storage. Follow the guide below, developed by the team at Microsoft, to set up both services on Azure and to get your storage and batch account names and keys.

Azure Blog and GitHub repository.

7. Running Nextflow with Google Life Sciences

A step-by-step guide to launching Nextflow Pipelines in Google Cloud. Note that this integration process is specific to Google Life Sciences – an offering that pre-dates Google Cloud Batch. If you want to use the newer integration approach, you can also check out the Nextflow blog article Get started with Nextflow on Google Cloud Batch.

Nextflow on Google Cloud (https://cloud.google.com/life-sciences/docs/tutorials/nextflow)

8. Bonus: Nextflow Tutorial - Variant Calling Edition

This Nextflow Tutorial - Variant Calling Edition has been adapted from the Nextflow Software Carpentry training material and Data Carpentry: Wrangling Genomics Lesson. Learners will have the chance to learn Nextflow and nf-core basics, to convert a variant-calling bash-script into a Nextflow workflow and to modularize the pipeline using DSL2 modules and sub-workflows.

The workshop can be opened on Gitpod where you can try the exercises in an online computing environment at your own pace, with the course material in another window alongside.

You can find the course in Nextflow Tutorial - Variant Calling Edition.

Community and support

