Metagenomics

Overview

The metagenomics workflow(s) consist of a basic gene cluster-based workflow followed by add-ons for binning and replication rate inference.

The basic workflow is protein/gene-centric. It quantifies taxon abundances as well as abundances of protein-coding orthologous gene clusters (groups of genes with putatively similar function across organisms). It can help answer the following questions:

Are there taxa whose abundances are related to a phenotype?
Are there proteins whose genes are more prevalent/abundant in a specific phenotype?
What functional potential is present in particular metagenomes?

Supported data:

paired-end or single-end metagenomic shotgun sequencing
any depth (there is no separate shallow pipeline anymore)

Warning

The basic workflow currently does not work with Nanopore data because of the salmon step. We will add compatibility soon. All steps before the salmon step are fully compatible and can still be run.

Steps:

Adapter and quality trimming with fastp (1)
Read annotation with Kraken2 with some custom HPC optimization
Taxon counting using Bracken
Assembly with MEGAHIT
De novo gene prediction with Prodigal
Clustering of all genes on protein identity using MMseqs2 linclust
Pufferfish mapping index creation (needed for next step)
Gene quantification (mapping + counting) with salmon
Protein annotation using eggNOG-mapper

(1) Quality reports in HTML and JSON are provided for each file.
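To give a feel for what the steps above involve, the following dry-run sketch prints one plausible command line per step for a paired-end sample. It does not execute anything and it is not taken from `main.nf`: the sample name, directory layout, output file names, and flag values (for example the `0.99` identity and the species level for Bracken) are assumptions for illustration. The `kraken2_default` and `eggnog` folder names follow the defaults listed under the workflow options.

```shell
# Dry run: print one plausible command per workflow step for a paired-end
# sample. Nothing is executed. Paths, file names, and flag values are
# illustrative assumptions, not the settings used by main.nf.
# Note: clustering, index creation, and annotation run once across all
# samples in the real workflow; they are listed here only for order.
plan_sample() {
    s="$1"; refs="$2"; t="${3:-4}"
    printf '%s\n' \
        "fastp -i raw/${s}_R1.fastq.gz -I raw/${s}_R2.fastq.gz -o trim/${s}_R1.fastq.gz -O trim/${s}_R2.fastq.gz --json ${s}.json --html ${s}.html --thread ${t}" \
        "kraken2 --db ${refs}/kraken2_default --paired --threads ${t} --report ${s}.k2report --output ${s}.k2 trim/${s}_R1.fastq.gz trim/${s}_R2.fastq.gz" \
        "bracken -d ${refs}/kraken2_default -i ${s}.k2report -o ${s}.bracken -r 150 -l S" \
        "megahit -1 trim/${s}_R1.fastq.gz -2 trim/${s}_R2.fastq.gz -o assembled/${s} -t ${t}" \
        "prodigal -p meta -i assembled/${s}/final.contigs.fa -d genes/${s}.ffn -a genes/${s}.faa -f gff -o genes/${s}.gff" \
        "mmseqs easy-linclust genes/all.ffn clusters tmp --min-seq-id 0.99" \
        "salmon index -t clusters_rep_seq.fasta -i salmon_index" \
        "salmon quant --meta -p ${t} -l A -i salmon_index -1 trim/${s}_R1.fastq.gz -2 trim/${s}_R2.fastq.gz -o quant/${s}" \
        "emapper.py -i proteins.faa -o annotated --data_dir ${refs}/eggnog --cpu ${t}"
}

# Hypothetical sample name and reference folder.
plan_sample S1 /my/references
```

Printing the commands rather than running them keeps the sketch safe to try on any machine; the workflow itself handles scheduling, batching, and the per-sample versus all-sample distinction.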

Setup

For the basic workflow, set up the metagenomics conda environment:

conda env create -f pipelines/metagenomics/conda.yml
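Once the environment exists, it can help to confirm that the tools used by the workflow steps are on the PATH. The following is a minimal sketch: the environment name `metagenomics` and the list of executable names are assumptions derived from the step list, so `conda.yml` remains the authoritative source.

```shell
# Activate the environment first (name assumed to be "metagenomics"):
#   conda activate metagenomics

# Print every tool from the argument list that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names assumed from the workflow steps.
missing=$(missing_tools fastp kraken2 bracken megahit prodigal mmseqs salmon emapper.py)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All tools found."
fi
```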

Workflow options

~~~ Diener Lab Metagenomics Workflow ~~~

Usage:
A run using all default parameters can be started with:
> nextflow run main.nf -resume

An example run could look like:
> nextflow run main.nf -with-conda /my/envs/metagenomics -resume \
                        --data_dir=./data --single_end=false --refs=/my/references \
                        --read_length=150

General options:
  --data_dir [str]              The main data directory for the analysis (must contain `raw`).
  --read_length [str]           The length of the reads.
  --single_end [bool]           Specifies that the input is single-end reads.
  --threads [int]               The maximum number of threads a single process can use.
                                This is not the same as the maximum number of total threads used.
Reference DBs:
  --refs [str]                  Folder in which to find reference DBs.
  --eggnogg_refs [str]          Where to find EGGNOG references. Defaults to <refs>/eggnog.
  --kraken2_db [str]            Where to find the Kraken2 reference. Defaults to <refs>/kraken2_default.
  --kraken2_mem [str]           The maximum amount of memory for Kraken2. If not set will choose this automatically
                                based on the database size. Thus, only use to limit Kraken2 to less memory.
Quality filter:
  --trim_front [str]            How many bases to trim from the 5' end of each read.
  --min_length [str]            Minimum accepted length for a read.
  --quality_threshold [str]     Smallest acceptable average quality.
  --threshold [str]             Smallest abundance threshold used by Kraken.

Assembly:
  --contig_length [int]         Minimum length of a contig.
  --identity [double]           Minimum average nucleotide identity.
  --overlap [double]            Minimum required overlap between contigs.

Taxonomic classification:
  --batchsize [int]             The batch size for Kraken2 jobs. See documentation
                                for more info. Should be 1 on single machine setups
                                and much larger than one on HPC setups.
  --kraken2_mem [int]           Maximum memory in GB to use for Kraken2. If not set
                                this will be determined automatically from the database.
                                So, only set this if you want to overwrite the automatic
                                detection.
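Since `--data_dir` must contain a `raw` folder, a quick check before launching can save a failed run. The sketch below is an illustration only: the function name `check_data_dir` is hypothetical, and the `*.fastq.gz` naming is an assumption, as the options above only state that `raw` must exist.

```shell
# Check that a data directory has the layout the workflow expects and
# print the number of FASTQ files found directly inside raw/.
# The *.fastq.gz pattern is an assumption for illustration.
check_data_dir() {
    data_dir="$1"
    if [ ! -d "$data_dir/raw" ]; then
        echo "error: $data_dir/raw not found" >&2
        return 1
    fi
    find "$data_dir/raw" -maxdepth 1 -name '*.fastq.gz' | wc -l | tr -d ' '
}

# Demonstration on a throwaway directory with one paired-end sample.
demo=$(mktemp -d)
mkdir "$demo/raw"
touch "$demo/raw/S1_R1.fastq.gz" "$demo/raw/S1_R2.fastq.gz"
check_data_dir "$demo"
rm -r "$demo"
```

For a real project the call would be `check_data_dir ./data`, matching the `--data_dir=./data` value in the example run.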

Additional workflows can be run after the basic workflow has finished.

Tip

On HPC systems we recommend setting the --batchsize option to something close to the number of samples, unless you have either a very large number of samples (>1000) or problematic samples.

See concepts for more info on batches.
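To make the recommendation concrete, here is a small sketch of how the batch size relates to the number of Kraken2 jobs. It assumes that samples are split into consecutive batches of at most `--batchsize` samples and that each batch is handled by one job. That reading is an assumption, and the function name `n_batches` is hypothetical, so the concepts page remains the reference for the actual behavior.

```shell
# Number of batches needed for a given number of samples and batch size
# (ceiling division). Assumes one Kraken2 job per batch.
n_batches() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

n_batches 96 1     # batch size 1 (single machine setup): prints 96
n_batches 96 96    # batch size equal to the sample count: prints 1
n_batches 96 32    # intermediate batch size: prints 3
```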