4 - Submitting data to SRA • miso

At some point in your project you will want to make your data available to the public. This is often required to happen in the form of a submission to the Sequence Read Archive (SRA), which can be challenging. miso includes helpers that prepare a submission for you.

In order to create a submission you will need the following:

1. A file manifest for all of your sequencing data.
2. A metadata table describing the sequenced samples.
3. Some basic information about the study and sequencing.

Number 1 is handled as in all other miso steps: it is a file list like the one used for quality control. For instance, for our example data:

library(miso)

## Also loading:

##   - dada2=1.28.0
##   - data.table=1.15.2
##   - ggplot2=3.5.0
##   - magrittr=2.0.3
##   - phyloseq=1.44.0
##   - ShortRead=1.58.0
##   - yaml=2.3.8

## Found tools:

##   - minimap2=2.27-r1193
##   - samtools=1.19.2

## 
## Attaching package: 'miso'

## The following object is masked _by_ 'package:BiocGenerics':
## 
##     normalize

## The following object is masked from 'package:graphics':
## 
##     layout

fi <- system.file("extdata/shotgun", package = "miso") %>% find_read_files()

Which gives you:

fi

## Key: <id>
##                                                                                         forward
##                                                                                          <char>
## 1: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/shotgun/even1_S1_L001_R1_001.fasta.gz
## 2: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/shotgun/even2_S2_L001_R1_001.fasta.gz
## 3: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/shotgun/even3_S3_L001_R1_001.fasta.gz
##                                                                                         reverse
##                                                                                          <char>
## 1: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/shotgun/even1_S1_L001_R2_001.fasta.gz
## 2: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/shotgun/even2_S2_L001_R2_001.fasta.gz
## 3: /tmp/RtmpFnpEcH/temp_libpath39b6c5ce98287/miso/extdata/shotgun/even3_S3_L001_R2_001.fasta.gz
##        id injection_order  lane
##    <char>           <num> <num>
## 1:  even1               1     1
## 2:  even2               2     1
## 3:  even3               3     1

The metadata is a data frame describing additional attributes for each of those samples. For instance, you could use the sample_data() slot of a phyloseq object. However, this table must always have a column specifying the sample ID as in the file manifest and a date column denoting the date of sample extraction. Let’s create one for the three samples in our example data:

metadata <- data.table(
    id = fi$id,
    date = rep("2019-01-01", 3),
    diet = c("vegan", "vegetarian", "flexitarian"),
    age = c(23, 38, 64)
)
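Because the submission relies on the ID and date columns, it can be worth checking the table against the manifest before going further. The following is a minimal sketch in base R: `check_metadata` is a hypothetical helper and not part of miso, the `manifest` data frame stands in for `fi`, and the ISO date format (YYYY-MM-DD) is an assumption taken from the example above.

```r
# Stand-ins for the file manifest (`fi`) and the metadata table from the text.
manifest <- data.frame(id = c("even1", "even2", "even3"))
metadata <- data.frame(
    id = c("even1", "even2", "even3"),
    date = rep("2019-01-01", 3),
    diet = c("vegan", "vegetarian", "flexitarian"),
    age = c(23, 38, 64)
)

# Hypothetical helper (not a miso function): stops with an error if the
# metadata lacks the required columns, misses a sample, or has a date that
# is not written as YYYY-MM-DD (an assumption based on the example).
check_metadata <- function(manifest, metadata,
                           id_col = "id", date_col = "date") {
    stopifnot(
        id_col %in% names(metadata),
        date_col %in% names(metadata)
    )
    missing_ids <- setdiff(manifest[[id_col]], metadata[[id_col]])
    if (length(missing_ids) > 0) {
        stop("No metadata for samples: ", paste(missing_ids, collapse = ", "))
    }
    parsed <- as.Date(as.character(metadata[[date_col]]), format = "%Y-%m-%d")
    if (anyNA(parsed)) {
        stop("Dates should be written as YYYY-MM-DD")
    }
    invisible(TRUE)
}

check_metadata(manifest, metadata)
```

The same call should work with the real objects, for example `check_metadata(fi, metadata)`, since both tables are indexed by column name only.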

For the additional study data, miso uses presets that fill in much of the information for you, but you still need to specify some aspects of your study yourself. This is again managed by configurations, and you can see the set of configuration variables by creating one with config_sra:

config_sra()

## $metadata
## NULL
## 
## $id_col
## [1] "id"
## 
## $date_col
## [1] "date"
## 
## $country
## [1] "USA"
## 
## $preset
## [1] "human gut 16S"
## 
## $out_dir
## [1] "sra"
## 
## $title
## NULL
## 
## $platform
## [1] "ILLUMINA"
## 
## $instrument_model
## [1] "Illumina MiSeq"
## 
## $bioproject
## NULL
## 
## $latitude
## [1] 42.36
## 
## $longitude
## [1] -71.0941
## 
## $make_package
## [1] TRUE
## 
## attr(,"class")
## [1] "config"

Except for bioproject (fill it in only if you already have a BioProject ID), every variable needs to be specified; otherwise its default value is used. You can check the list of available presets with:

names(sra_presets)

##  [1] "human gut 16S"         "mouse gut 16S"         "human gut metagenome" 
##  [4] "human skin 16S"        "human skin metagenome" "mouse gut metagenome" 
##  [7] "human gut RNA-Seq"     "mouse gut RNA-Seq"     "aquatic biofilm 16S"  
## [10] "in vitro metagenome"

Let’s fill in the data for the example study:

config <- config_sra(
    title = "Sequencing of the gut metagenome: the effect of age on Bacteroides",
    preset = "human gut metagenome",
    platform = "ILLUMINA",
    instrument_model = "Illumina HiSeq 2000",
    metadata = metadata
)
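Since unset variables silently fall back to their defaults, a quick look at the settings before building can catch leftovers such as the default coordinates. Below is a rough sketch: `check_sra_settings` is a hypothetical helper, not a miso function, and it runs here on a plain list that mirrors some of the printed defaults above. It assumes the config object can be indexed like an ordinary list.

```r
# Hypothetical helper (not part of miso): collect settings that are still
# unset or outside a valid range.
check_sra_settings <- function(config) {
    problems <- character(0)
    # `title` and `metadata` default to NULL, so they have to be supplied.
    for (field in c("title", "metadata")) {
        if (is.null(config[[field]])) {
            problems <- c(problems, paste0(field, " is not set"))
        }
    }
    lat <- config[["latitude"]]
    lon <- config[["longitude"]]
    if (!is.numeric(lat) || length(lat) != 1 || is.na(lat) || abs(lat) > 90) {
        problems <- c(problems, "latitude must be a number in [-90, 90]")
    }
    if (!is.numeric(lon) || length(lon) != 1 || is.na(lon) || abs(lon) > 180) {
        problems <- c(problems, "longitude must be a number in [-180, 180]")
    }
    problems
}

# Plain list standing in for the output of config_sra() with no arguments.
defaults <- list(
    metadata = NULL, id_col = "id", date_col = "date", country = "USA",
    preset = "human gut 16S", title = NULL,
    latitude = 42.36, longitude = -71.0941
)
print(check_sra_settings(defaults))
```

With a real configuration the call would be `check_sra_settings(config)`; an empty result means nothing was flagged. The check cannot tell whether the default country and coordinates actually match your sampling site, so those are still worth a manual look.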

You can now proceed to build the submission data:

sra <- sra_submission(fi, config)

## INFO [2024-04-29 09:44:06] Packing submission files to sra/sra_files.tar.gz.
## INFO [2024-04-29 09:44:06] Writing biosample attributes to sra/05_biosample_attributes.tsv.
## INFO [2024-04-29 09:44:06] Writing file metadata to sra/06_sra_metadata.tsv.
## INFO [2024-04-29 09:44:06] You are now ready for submission. Go to https://submit.ncbi.nlm.nih.gov/subs/sra/, log in and click on `New submission`. Fill in the general data for your project in steps 1 through 3. In step 4 choose `Environmental/Metagenome Genomic Sequences MIMS` with the appropriate environment. In step 5 and 6 upload the respective files in sra. In step 7 you can directly upload the `*.tar.gz` submission package. Just click on `continue` another time to have the archive unpacked as indicated.

The log tells us exactly how to submit the data. We urge you to first check all created files for correctness. This can be done by inspecting the return value. For instance:

sra$sra_metadata

When looking in the specified folder (./sra by default) you will see the tables for steps 5 and 6 along with the submission package:

list.files("sra")

## [1] "05_biosample_attributes.tsv" "06_sra_metadata.tsv"        
## [3] "sra_files.tar.gz"
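The written files can also be checked directly from disk before uploading. The sketch below uses only base R: `read_submission_table` is a hypothetical helper, and it assumes the tables are plain tab-separated files with a single header row. The file names are the ones listed above, and every step is skipped if the file does not exist yet.

```r
# Hypothetical helper (not part of miso): read one of the written tables,
# keeping the column names exactly as they appear in the file.
read_submission_table <- function(path) {
    read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)
}

out_dir <- "sra"  # default `out_dir` shown by config_sra()

for (f in c("05_biosample_attributes.tsv", "06_sra_metadata.tsv")) {
    path <- file.path(out_dir, f)
    if (file.exists(path)) {
        tab <- read_submission_table(path)
        message(f, ": ", nrow(tab), " rows, ", ncol(tab), " columns")
    } else {
        message(f, " not found; run sra_submission() first")
    }
}

# List the contents of the submission package without unpacking it.
archive <- file.path(out_dir, "sra_files.tar.gz")
if (file.exists(archive)) {
    print(untar(archive, list = TRUE))
}
```

Comparing the number of rows and the listed archive members with the file manifest is a quick way to spot a missing sample.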

For larger submissions we recommend installing the Aspera client in your browser, as this will speed up the upload significantly and allow for larger packages.