Under the hood

Snakemake, environments and containers

Snakemake is the centerpiece of this pipeline. [Snakemake](https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html) is a Python-based workflow manager that enables processing of a large set of amplicon-based metagenomics sequencing reads into actionable output. It relies on a system of rules, each corresponding to one required processing step. Each rule specifies input files, a Conda environment (or, alternatively, a Singularity container) that includes all required software, a command line or a script to be executed, and the expected output files.

Conda environments

[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) is a language-independent package and environment management tool. A Conda environment is a collection of installed Conda packages. For example, a research project might require VSEARCH 2.20.0 and its dependencies, whereas another environment associated with a completed project might require VSEARCH 2.15. Changing one environment has no effect on the others. Switching between environments is simple because they can be easily activated or deactivated.

Singularity containers

The concept of reproducible analysis in bioinformatics extends beyond good documentation and code sharing. Analyses typically depend on an entire environment with numerous tools, libraries, and settings. Storing, reusing, and sharing environments via container software such as Docker and Singularity can improve reproducibility and productivity. With Singularity, users can create a single executable file that contains all aspects of their environment and can safely run it on a variety of resources without requiring privileged access.

Input and output definition

Input

A Python script is run at each execution of [zAMP](https://github.com/metagenlab/zAMP) to match sequencing read files to each sample.

As described in the **sample selection** section of the **pipeline execution** page, this script accepts either local or [SRA-hosted read files](https://www.ncbi.nlm.nih.gov/sra/docs/).

Local reads

The script treats reads as local when the “local_samples:” argument is present in the config file. This parameter must point to a sample sheet, a tab-separated values (tsv) file listing all the files in the analysis. By default, the script matches the values found in the leftmost “Sample” column of the sample sheet with the filenames of the “.fastq.gz” files located in the directory defined by the “link_directory” parameter of the config file. This default behavior is altered in two cases encoded in the Python script:

If a column named “OldSampleName” is present, the match with the filenames of the sequencing read files is done with this column instead of the leftmost “Sample” column.
If an “R1” column is present, the absolute path to the forward reads is used instead of a name-based matching of the sequencing read files. In this case, and for paired-end reads, an “R2” column must point to the reverse read files.

In all cases (i.e. Sample column match, OldSampleName column match or absolute paths indicated by R1 and R2 columns), Snakemake rules will temporarily copy the read files into a raw_reads directory.
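The three matching cases above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual script: the function name `match_reads` is hypothetical, and the rule that a filename matches when it starts with the sample name followed by an underscore is an assumption made for the example.

```python
import csv
import io


def match_reads(sheet_text, fastq_files):
    """Map each sample to its read files (hypothetical helper, not zAMP code)."""
    rows = csv.DictReader(io.StringIO(sheet_text), delimiter="\t")
    matches = {}
    for row in rows:
        sample = row["Sample"]
        if row.get("R1"):
            # "R1" column present: use the given paths, no name matching.
            reads = [row["R1"]]
            if row.get("R2"):
                reads.append(row["R2"])  # reverse reads for paired-end data
        else:
            # "OldSampleName" column replaces "Sample" for matching if present.
            key = row["OldSampleName"] if "OldSampleName" in row else sample
            # Assumption: files are named "<key>_<anything>.fastq.gz".
            reads = sorted(
                name for name in fastq_files
                if name.endswith(".fastq.gz") and name.startswith(key + "_")
            )
        matches[sample] = reads
    return matches
```

Calling `match_reads` with the text of a sample sheet and the listing of the “link_directory” folder returns one list of read files per sample, whichever of the three cases applies.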

SRA reads

The Python script treats reads as located on the Sequence Read Archive (SRA) when the “sra_samples:” argument is present in the config file.

In this case, Snakemake rules will use the SRA Toolkit to download the reads and convert them to “.fastq.gz” format in the raw_reads directory.

Output

Upon each execution of the pipeline, Python scripts parse the config file and the sample sheet to generate lists of outputs. These lists are then fed to the pipeline Snakefile, which tells Snakemake which outputs to generate.
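As a rough illustration of how such output lists can be built, the sketch below expands a list of samples and a list of denoising methods into target paths. Everything in it is an assumption made for the example: the function name `expected_outputs`, the `results` base directory, and the file names do not come from the pipeline.

```python
def expected_outputs(samples, denoisers, base_dir="results"):
    """Build a flat list of target files (hypothetical layout, not zAMP's)."""
    targets = []
    for denoiser in denoisers:
        # One combined table per denoising method (illustrative file name).
        targets.append(f"{base_dir}/{denoiser}/count_table.tsv")
        for sample in samples:
            # One per-sample summary (illustrative file name).
            targets.append(f"{base_dir}/{denoiser}/{sample}/reads_summary.tsv")
    return targets
```

A list of this kind is what a Snakefile typically passes as input to its first rule, so that Snakemake works backwards from the requested files to the rules needed to produce them.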

Logging and traceability

Snakemake logs

Upon each execution, Snakemake automatically creates a log file where all the standard output is recorded. These log files can be found, relative to the working directory, in:

.snakemake/log/

RSP4ABM logs

In addition to Snakemake’s default logs, RSP4ABM creates a log directory upon each execution in:

logs/<year>/<months>/<day>/<time>/

This directory contains:

a copy of the executed Snakemake command (cmd.txt)
the git commit hash, which indicates the version of RSP4ABM (git.txt)
the ID of the user who ran the pipeline (user.txt)
a copy of the sample sheet (local_samples.tsv or sra_samples.tsv)
a copy of the config file (config.yaml)

In addition, almost all rules of RSP4ABM generate a log file upon execution, which records the output of the executed tools or scripts. These log files are organized in subdirectories of the log directory, mirroring the structure of the main pipeline.
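The per-run log directory described above could be produced along the following lines. This is a sketch under stated assumptions rather than the pipeline's own code: the function name `create_run_log_dir`, the zero-padded date components, and the `HH_MM_SS` time format are all hypothetical choices. Copying the sample sheet and config file is omitted for brevity.

```python
import getpass
import subprocess
import sys
from datetime import datetime
from pathlib import Path


def create_run_log_dir(base="logs", now=None, command=None):
    """Create logs/<year>/<month>/<day>/<time>/ and record run metadata."""
    now = now or datetime.now()
    # Assumption: zero-padded components and an HH_MM_SS time folder.
    log_dir = Path(base) / now.strftime("%Y") / now.strftime("%m") \
        / now.strftime("%d") / now.strftime("%H_%M_%S")
    log_dir.mkdir(parents=True, exist_ok=True)

    # cmd.txt: the command line that started the run.
    (log_dir / "cmd.txt").write_text(" ".join(command or sys.argv) + "\n")

    # user.txt: who ran the pipeline; fall back if the lookup fails.
    try:
        user = getpass.getuser()
    except Exception:
        user = "unknown"
    (log_dir / "user.txt").write_text(user + "\n")

    # git.txt: commit hash of the current checkout, if git is available.
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"
    (log_dir / "git.txt").write_text(commit + "\n")
    return log_dir
```

Recording the commit hash next to the command and the user is what makes a past run traceable: the same inputs can later be re-run against the exact same version of the code.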

Sequencing reads QC

QC rules assess the sequencing quality of each sample with FastQC [1]. Then, MultiQC [2] generates a report for each sequencing run (based on the values of the sample sheet column indicated by the “run_column” parameter of the config file). A global MultiQC report is generated as well, but without interactive features, to cope with the high number of samples.
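Grouping samples by sequencing run, as needed for the per-run reports, amounts to reading one column of the sample sheet. The sketch below shows the idea; the function name `samples_by_run` is hypothetical, and the column name passed as `run_column` stands in for whatever the config file specifies.

```python
import csv
import io
from collections import defaultdict


def samples_by_run(sheet_text, run_column):
    """Group sample names by the value found in the run column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(sheet_text), delimiter="\t"):
        # Each distinct value of the run column defines one report.
        groups[row[run_column]].append(row["Sample"])
    return dict(groups)
```

Each key of the returned dictionary would then correspond to one per-run report, with the listed samples as its inputs.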

Denoising

Vsearch (OTU clustering)

PANDAseq

Vsearch

DADA2 (ASV denoising)

cutadapt

DADA2

Taxonomic assignment

Reference database

Classifiers

Post-processing

Taxonomic filtering

Rarefaction

Phylogenetic tree generation

Taxonomic collapsing

Normalization and abundance-based filtering

Exports

Formatting

Wide to long melting

transpose_and_meta_count_table

Qiime2 formats

Picrust2

References

1: Andrews S, Krueger F, Segonds-Pichon A, Biggins F, Wingett S. FastQC. A quality control tool for high throughput sequence data. Babraham Bioinformatics. Babraham Institute. 2015.
2: Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;