
Simple meta-analysis

This is a strategy for a simple meta-analysis Nextflow workflow. Here, "simple" means that all of your individual processing steps are already expected to run on all the data in a single study. This is the case for Qiime2 analyses, for instance, because each Qiime2 step always operates on all data from a study.

Here we first have a main table that lists the studies along with the required parameters/settings for each study.

main table
id,layout,forward_primer,reverse_primer,trunc_f,trunc_r,location
study 1,paired,ACG,TGC,220,200,study_1
study 2,single,ACT,GCT,220,0,study_2
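With `splitCsv(header: true)`, each row of this table becomes a map keyed by the column names. As a sketch (the exact values follow from the table above), the initial channel emits entries like:

```groovy
// One map per CSV row; keys come from the header line.
// Note that all values arrive as strings, so numeric fields
// like trunc_f need to be cast if used in arithmetic.
[id: 'study 1', layout: 'paired', forward_primer: 'ACG', reverse_primer: 'TGC',
 trunc_f: '220', trunc_r: '200', location: 'study_1']
[id: 'study 2', layout: 'single', forward_primer: 'ACT', reverse_primer: 'GCT',
 trunc_f: '220', trunc_r: '0', location: 'study_2']
```

Inside a process you can then access individual fields as `study.id`, `study.layout`, and so on.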

The Nextflow pipeline then reads the CSV and passes the study information as a value through all of the steps. The initial channel has one entry for each study.

pipeline example
#!/usr/bin/env nextflow

params.data = "${launchDir}"
params.studies = "${params.data}/studies.csv"
params.visualize = true

workflow {
    // Read the studies table
    studies = Channel.fromPath(params.studies)
        .splitCsv(header: true, sep: ",")

    // Do some work on the studies

    studies | import_data | step1

    // Visualize if desired
    if (params.visualize) {
        visualize(import_data.out)
    }

}

process import_data {
    publishDir "${params.data}/imports", mode: 'copy', overwrite: true

    cpus 4
    memory "8GB"
    time "2h"

    input:
    val(study)

    output:
    tuple val(study), path("*.txt")

    script:
    if (study.layout == "paired") {
        """
        echo "importing from ${params.data}/${study.location}/manifest.tsv in paired-end layout." > '${study.id}.txt'
        """
    } else if (study.layout == "single") {
        """
        echo "importing from ${params.data}/${study.location}/manifest.tsv in single-end layout." > '${study.id}.txt'
        """
    } else {
        error "Invalid library layout specified. Must be 'paired' or 'single' :("
    }
}

process step1 {
    publishDir "${params.data}/step1", mode: 'copy', overwrite: true

    cpus 1
    memory "2GB"
    time "1h"

    input:
    tuple val(study), path(imported)

    output:
    tuple val(study), path("*.result")

    script:
    """
    echo "processed ${study.id} data with truncations of ${study.trunc_f},${study.trunc_r}" > '${study.id}.result'
    """
}

process visualize {
    publishDir "${params.data}/viz", mode: 'copy', overwrite: true

    cpus 1
    memory "2GB"
    time "1h"

    input:
    tuple val(study), path(imported)

    output:
    tuple val(study), path("${study.id}.viz")

    script:
    """
    echo "visualized ${study.id} import" > '${study.id}.viz'
    """
}

Note

The individual processing scripts don't make much sense here. They just serve as an example of how to inject the study parameters.

Running this will then distribute the rows across the processes, one task per study for each step.

$ nextflow run main.nf

 N E X T F L O W   ~  version 25.03.1-edge

Launching `main.nf` [stoic_murdock] DSL2 - revision: 0e2f97c09b

executor >  local (6)
[ec/8110b0] process > import_data (1) [100%] 2 of 2
[12/497c3c] process > step1 (1)       [100%] 2 of 2
[1e/c9f3fe] process > visualize (2)   [100%] 2 of 2

This also shows an example of how to globally disable a part of the pipeline (the visualization) while still retaining the cache.

$ nextflow run main.nf -resume --visualize false

 N E X T F L O W   ~  version 25.03.1-edge

Launching `main.nf` [intergalactic_goldwasser] DSL2 - revision: 0e2f97c09b

[f9/53dc3a] process > import_data (2) [100%] 2 of 2, cached: 2
[04/179b27] process > step1 (1)       [100%] 2 of 2, cached: 2

You can see the injection of the library layout and the truncation parameters from the main table.

$ cat imports/*.txt
importing from /home/cdiener/code/pipelines/docs/tips/meta-per-study/study_1/manifest.tsv in paired-end layout.
importing from /home/cdiener/code/pipelines/docs/tips/meta-per-study/study_2/manifest.tsv in single-end layout.
$ cat step1/*.result
processed study 1 data with truncations of 220,200
processed study 2 data with truncations of 220,0