Preprocessing of raw data

To filter and trim the raw reads we usually use the DADA2 functions, wrapped in a reproducible workflow step.

library(miso)
## Also loading:
##   - dada2=1.28.0
##   - data.table=1.15.2
##   - ggplot2=3.5.0
##   - magrittr=2.0.3
##   - phyloseq=1.44.0
##   - ShortRead=1.58.0
##   - yaml=2.3.8
## Found tools:
##   - minimap2=2.27-r1193
##   - samtools=1.19.2
## 
## Attaching package: 'miso'
## The following object is masked _by_ 'package:BiocGenerics':
## 
##     normalize
## The following object is masked from 'package:graphics':
## 
##     layout

Finding your files

We will again use our helper function to get a list of sequencing files.

path <- system.file("extdata/16S", package = "miso")
files <- find_read_files(path)
print(files)
## Key: <id>
##                                                                                      forward
##                                                                                       <char>
## 1: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/F3D0_S188_L001_R1_001.fastq.gz
## 2: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/F3D1_S189_L001_R1_001.fastq.gz
## 3: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/F3D2_S190_L001_R1_001.fastq.gz
## 4: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/F3D3_S191_L001_R1_001.fastq.gz
## 5: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/Mock_S280_L001_R1_001.fastq.gz
##                                                                                      reverse
##                                                                                       <char>
## 1: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/F3D0_S188_L001_R2_001.fastq.gz
## 2: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/F3D1_S189_L001_R2_001.fastq.gz
## 3: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/F3D2_S190_L001_R2_001.fastq.gz
## 4: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/F3D3_S191_L001_R2_001.fastq.gz
## 5: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/16S/Mock_S280_L001_R2_001.fastq.gz
##        id injection_order  lane
##    <char>           <num> <num>
## 1:   F3D0             188     1
## 2:   F3D1             189     1
## 3:   F3D2             190     1
## 4:   F3D3             191     1
## 5:   Mock             280     1
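
The id, injection_order and lane columns above mirror the Illumina-style file names (for example F3D0_S188_L001_R1_001.fastq.gz). As a rough illustration of how such names can be split into annotations, here is a minimal base R sketch. The function parse_read_names and its regular expression are hypothetical and written for this example only; they are not how find_read_files is implemented.

```r
# Illustrative only: split Illumina-style file names into annotation columns.
# parse_read_names() and its pattern are assumptions, not part of miso.
parse_read_names <- function(paths) {
  pattern <- "^(.+)_S([0-9]+)_L([0-9]+)_R([12])_001\\.fastq\\.gz$"
  names_only <- basename(paths)
  parts <- regmatches(names_only, regexec(pattern, names_only))
  # A full match yields 5 elements: the whole match plus 4 capture groups.
  matched <- lengths(parts) == 5
  stopifnot(any(matched))
  parts <- do.call(rbind, parts[matched])
  data.frame(
    file = paths[matched],
    id = parts[, 2],
    injection_order = as.numeric(parts[, 3]),
    lane = as.numeric(parts[, 4]),
    direction = ifelse(parts[, 5] == "1", "forward", "reverse"),
    stringsAsFactors = FALSE
  )
}

example_files <- c(
  "F3D0_S188_L001_R1_001.fastq.gz",
  "F3D0_S188_L001_R2_001.fastq.gz",
  "Mock_S280_L001_R1_001.fastq.gz"
)
parsed <- parse_read_names(example_files)
print(parsed)
```

Files whose names do not follow this pattern are silently dropped by the sketch, so it is only suitable for checking what a naming scheme would yield, not as a replacement for the helper.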

Configuration

Every miso workflow step comes with a corresponding config_* function that returns an example/default configuration. You can change values afterwards or pass the parameters directly to that function. Here we use a temporary directory to store the preprocessed data and truncate the forward reads to 240 bp and the reverse reads to 200 bp (based on our previous quality assessment).

config <- config_preprocess(out_dir = tempdir(), truncLen = c(240, 200))
config
## $threads
## [1] 1
## 
## $out_dir
## [1] "/tmp/RtmpqYem75"
## 
## $trimLeft
## [1] 10
## 
## $truncLen
## [1] 240 200
## 
## $maxEE
## [1] 2
## 
## $truncQ
## [1] 2
## 
## $maxN
## [1] 0
## 
## attr(,"class")
## [1] "config"

The printed configuration shows further parameters that we could set, such as threads, trimLeft, maxEE, truncQ and maxN.
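
Changing a value after the configuration has been created might look like the sketch below. To keep the example runnable without miso it builds a stand-in object with the values from the printout above; the name config_demo is hypothetical, and the sketch assumes that the real "config" object can be modified like a plain named list, which the printed structure suggests but does not guarantee.

```r
# Stand-in for the object returned by config_preprocess(); values are copied
# from the printout above. Assumption: a "config" behaves like a plain list.
config_demo <- structure(
  list(
    threads = 1,
    out_dir = tempdir(),
    trimLeft = 10,
    truncLen = c(240, 200),
    maxEE = 2,
    truncQ = 2,
    maxN = 0
  ),
  class = "config"
)

# Adjust individual entries after creation.
config_demo$threads <- 4
config_demo$maxEE <- 3

# Show the modified entries without relying on a custom print method.
str(unclass(config_demo))
```

Passing the values directly, as done for out_dir and truncLen above, keeps all settings in one call; editing the object afterwards is convenient when only one value differs between runs.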

Running the preprocessing step

We can now run our preprocessing step.

filtered <- preprocess(files, config)
## INFO [2024-04-29 09:38:39] Preprocessing reads for 5 paired-end samples...
## INFO [2024-04-29 09:38:46] 4.03e+04/4.48e+04 (89.75%) reads passed preprocessing.

The log reports the overall percentage of reads that passed preprocessing. The per-sample counts are available via:

filtered$passed
##      raw preprocessed     id
##    <num>        <num> <char>
## 1:  7793         6992   F3D0
## 2:  5869         5210   F3D1
## 3: 19620        17706   F3D2
## 4:  6758         6114   F3D3
## 5:  4779         4280   Mock
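
To turn these counts into per-sample pass rates, a small base R sketch follows. It copies the numbers from the table above into a plain data frame so that it runs without miso; with the real result you would work on filtered$passed instead. The variable names are chosen for this example only.

```r
# Counts copied from the filtered$passed printout above.
passed <- data.frame(
  raw = c(7793, 5869, 19620, 6758, 4779),
  preprocessed = c(6992, 5210, 17706, 6114, 4280),
  id = c("F3D0", "F3D1", "F3D2", "F3D3", "Mock")
)

# Percentage of reads that survived preprocessing in each sample.
passed$percent <- 100 * passed$preprocessed / passed$raw

# Two ways to summarise across samples:
# pooled    = all passed reads divided by all raw reads
# mean_rate = average of the per-sample percentages
pooled <- 100 * sum(passed$preprocessed) / sum(passed$raw)
mean_rate <- mean(passed$percent)

print(passed)
cat(sprintf("pooled: %.2f%%, mean of samples: %.2f%%\n", pooled, mean_rate))
```

The totals are 40302 of 44819 reads, which matches the 4.03e+04/4.48e+04 in the log. The pooled rate comes out at about 89.92%, whereas the mean of the per-sample rates is about 89.75%. The latter equals the logged figure, which suggests (but does not confirm) that the log reports the per-sample mean.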