Skip to content

16S amplicon sequencing workflow

Overview

Feasible data:

  • single-end or paired-end 16S amplicon sequencing data
  • decent depth (>10K reads per sample)

Steps:

  1. Automatic detection of Illumina read files and management of multiple runs
  2. Trimming, filtering and quality metrics (base qualities, entropy, lengths) with DADA2 and mbtools
  3. Denoising using DADA2
  4. Taxonomy assignment using DADA2 and GTDB 220
  5. Alignment of ASVs with DECIPHER and building of a phylogenetic tree from core alignment (FastTree).
  6. Creation of tabular and phyloseq output

Setup

Option 1: Everything in a conda env

mamba env create -f conda.yml
conda activate 16S
Rscript -e "remotes::install_github('dienerlab/miso')"

This can be activated with conda activate 16S.

Options 2: Local R, everything else in conda

This requires an installation of R on all nodes running the pipeline.

Tip

Only use this option if you are already using R for lots of other stuff. This might also use an older DADA2 version depending on your local R version.

mamba env create -f conda-no-r.yml
Rscript -e "install.packages('BiocManager')"
Rscript -e "remotes::install_github('dienerlab/miso')"

This can be activated with conda activate 16S-no-r.

Download taxonomy databases

cd into your work directory, then

mkdir -p refs
wget https://zenodo.org/records/10403693/files/GTDB_bac120_arc53_ssu_r220_genus.fa.gz?download=1 -O refs/GTDB_bac120_arc53_ssu_r220_genus.fa.gz
wget https://zenodo.org/records/10403693/files/GTDB_bac120_arc53_ssu_r220_species.fa.gz?download=1 -O refs/GTDB_bac120_arc53_ssu_r220_species.fa.gz

Workflow parameters

~~~ Diener Lab 16S Workflow ~~~

Usage:
A run using all,default parameters can be started with:
> nextflow run main.nf -resume

General options:
    --data_dir [str]              The main data directory for the analysis (must contain `raw`).
    --read_length [int]           The length of the reads.
    --forward-only [bool]         Run analysis only on forward reads.
    --threads [int]               The maximum number of threads a single process can use.
                                This is not the same as the maximum number of total threads used.
    --manifest [str]              A manifest file listing the files to be processed. Should be a CSV file with
                                columns "id", "forward", "reverse" (optional), and "run" (optional). Listing the
                                sample IDs and read files. If samples were sequenced in different runs indicate
                                this with the run column.
    --pattern [str]               The file pattern for the FASTQ files. Options are illumina, sra, and simple.
                                Only used if no manuscript was provided.

Reference DBs:
    --taxa_db [str]               Path to the default taxonomy database.
    --species_db [str]            Path to species database to perform exact matching to ASVs.

Quality filter:
    --trim_left [int]             How many bases to trim from the 5' end of each read.
    --trunc_forward [int]         Where to truncate forward reads. Default length - 5
    --trunc_reverse [int]         Where to truncate reverse reads. Default length - 20.
    --maxEE [int]                 Maximum number of expected errors per read.

Denoising:
    --min_overlap [int]           Minimum overlap between reverse and forward ASVs to merge them.
    --merge [bool]                Whether to merge several runs into a single output.

Most likely run using:

DLP=/path/to/pipelines
nextflow run $DLP/16S -resume -with-conda [PATH/TO/CONDA]/envs/16S[-no-r]

If you have only one run put all raw FASTQ files under raw.