Under the hood¶
Input output definition¶
Input¶
A Python script is run at each execution of the RSP4ABM to match sequencing read files to each sample.
As described in the sample selection section of the Pipeline execution page, this script accepts either local or SRA-hosted read files.
Local reads¶
Reads are treated as local when the config file contains the “local_samples:” parameter. This parameter must point to a sample sheet, which is a spreadsheet in tab-separated values (tsv) format listing all the files in the analysis. By default, the script matches the values found in the leftmost “Sample” column of the sample sheet with the names of the “.fastq.gz” files located in the directory defined by the “link_directory” parameter of the config file. Two conditions encoded in the Python script alter this default behavior:
In presence of a column named “OldSampleName”, the match with the filenames of the sequencing read files is done with this column, instead of the leftmost “Sample” column.
In presence of an “R1” column, the absolute path to the forward reads is used instead of a name-based matching of the sequencing read files. In this case, and for paired-end reads, an “R2” column must point to the reverse read files.
In all cases (i.e. Sample column match, OldSampleName column match or absolute paths indicated by R1 and R2 columns), Snakemake rules will temporarily copy the read files into a raw_reads directory.
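The three matching modes above can be sketched as follows. This is a minimal illustration, not the pipeline's actual script: the function name `match_local_reads` is hypothetical, and the rule that a file belongs to a sample when its name starts with the sample name followed by an underscore is an assumption made for the example. The sketch also falls back to the “Sample” value when an “OldSampleName” cell is empty, which the text above does not specify.

```python
import csv
from pathlib import Path


def match_local_reads(sample_sheet, link_directory):
    """Return {sample: [read files]} following the three matching modes."""
    fastqs = sorted(Path(link_directory).glob("*.fastq.gz"))
    matched = {}
    with open(sample_sheet, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        sample_column = reader.fieldnames[0]  # leftmost column, expected to be "Sample"
        for row in reader:
            sample = row[sample_column]
            if row.get("R1"):
                # Explicit paths take precedence over name-based matching.
                reads = [row["R1"]]
                if row.get("R2"):
                    reads.append(row["R2"])
            else:
                # Match on OldSampleName when it is filled in, else on Sample.
                key = row.get("OldSampleName") or sample
                # Assumed convention: "<key>_..." identifies the files of a sample.
                reads = [str(p) for p in fastqs if p.name.startswith(key + "_")]
            matched[sample] = reads
    return matched
```

Requiring the underscore after the key keeps a sample named S2 from also capturing the files of a sample named S20; the real naming convention of the read files may call for a different rule.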
SRA reads¶
Reads are treated by the Python scripts as hosted on the Sequence Read Archive (SRA) when the config file contains the “sra_samples:” parameter.
In this case, Snakemake rules use the SRA Toolkit to download the reads and convert them to “.fastq.gz” format in the raw_reads directory.
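The choice between the two read sources comes down to which parameter the config file defines. A minimal sketch, assuming the config has already been parsed into a dictionary; the function name `read_sources` is hypothetical, and since this page does not state whether both parameters may be combined, the sketch simply reports whichever are present.

```python
def read_sources(config):
    """Return the sample sheets declared in a parsed config, keyed by read source."""
    sources = {}
    if "local_samples" in config:
        sources["local"] = config["local_samples"]
    if "sra_samples" in config:
        sources["sra"] = config["sra_samples"]
    if not sources:
        raise ValueError("config must define 'local_samples' and/or 'sra_samples'")
    return sources


print(read_sources({"sra_samples": "sra_samples.tsv"}))
# {'sra': 'sra_samples.tsv'}
```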
Output¶
Upon each execution of the pipeline, Python scripts parse the config file and the sample sheet to generate lists of outputs. These lists are then passed to the pipeline Snakefile, which tells Snakemake which outputs to generate.
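The idea of building output lists from the sample names can be illustrated with a few lines of plain Python. This is only a sketch: the function name `list_targets` and the path pattern are hypothetical and do not reflect the actual output layout of the pipeline.

```python
def list_targets(samples, patterns):
    """Fill each path pattern once per sample, in the spirit of Snakemake's expand()."""
    return [pattern.format(sample=sample) for pattern in patterns for sample in samples]


# Hypothetical pattern, for illustration only.
targets = list_targets(["S1", "S2"], ["QC/{sample}_fastqc.html"])
print(targets)
# ['QC/S1_fastqc.html', 'QC/S2_fastqc.html']
```

In a Snakefile, a list of this kind is typically given as the input of a top-level target rule, so that Snakemake works backwards from the requested files to the rules that produce them.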
Logging and traceability¶
Snakemake logs¶
Upon each execution, Snakemake automatically creates a log file in which all the standard output is recorded. These log files are located, relative to the working directory, in:
.snakemake/log/
RSP4ABM logs¶
In addition to Snakemake’s default logs, RSP4ABM creates a log directory upon each execution in:
logs/<year>/<months>/<day>/<time>/
This directory contains:
a copy of the executed Snakemake command (cmd.txt)
the git commit hash, which indicates the version of RSP4ABM (git.txt)
the ID of the user who ran the pipeline (user.txt)
a copy of the sample sheet (local_samples.tsv or sra_samples.tsv)
a copy of the config file (config.yaml)
In addition, almost all rules of RSP4ABM generate a log file upon execution, which records the output of the executed tools or scripts. These log files are organized in subdirectories of the log directory, mirroring the structure of the main pipeline.
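The per-execution log directory described above could be produced along the following lines. This is a sketch under stated assumptions rather than the pipeline's own code: the function name `write_run_log` is hypothetical, the exact format of the date and time components is assumed, and the commit hash is read from the current directory, whereas the pipeline presumably reads it from its own repository.

```python
import getpass
import shutil
import subprocess
import sys
from datetime import datetime
from pathlib import Path


def write_run_log(config_file, sample_sheet, log_root="logs", now=None):
    """Create logs/<year>/<month>/<day>/<time>/ and fill it with the run metadata."""
    now = now or datetime.now()
    # Assumed formatting of the directory components.
    run_dir = Path(log_root) / now.strftime("%Y/%m/%d/%H_%M_%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    # Command line as seen by the current process.
    (run_dir / "cmd.txt").write_text(" ".join(sys.argv) + "\n")

    # Commit hash of the repository in the current directory, if any.
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"
    (run_dir / "git.txt").write_text(commit + "\n")

    (run_dir / "user.txt").write_text(getpass.getuser() + "\n")

    # Keep copies of the inputs that define the run.
    shutil.copy(sample_sheet, run_dir / Path(sample_sheet).name)
    shutil.copy(config_file, run_dir / "config.yaml")
    return run_dir
```

Storing the command, commit hash, user, sample sheet and config side by side means a past run can be traced back without relying on the current state of the working directory.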
Sequencing reads QC¶
QC rules assess the sequencing quality of each sample with FastQC 1. Then, MultiQC 2 generates a report for each sequencing run (based on the values of the sample sheet column indicated by the “run_column” parameter of the config file). A global MultiQC report is generated as well, but without interactive features, in order to cope with the high number of samples.
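Grouping samples by sequencing run, as needed for the per-run MultiQC reports, can be sketched as follows. The function name `samples_by_run` and the column name used in the example call are hypothetical; the sketch assumes the sample name sits in the leftmost column of the sample sheet, as described in the Local reads section.

```python
import csv
from collections import defaultdict


def samples_by_run(sample_sheet, run_column):
    """Return {run: [samples]} using the column named by the run_column parameter."""
    groups = defaultdict(list)
    with open(sample_sheet, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        sample_column = reader.fieldnames[0]  # leftmost column holds the sample name
        for row in reader:
            groups[row[run_column]].append(row[sample_column])
    return dict(groups)


# Example call, with "Run" as a hypothetical value of run_column:
# samples_by_run("local_samples.tsv", "Run")
```

Each key of the returned dictionary would correspond to one per-run report, while the union of all values corresponds to the global report.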
Post-processing¶
Taxonomic filtering¶
Rarefaction¶
Phylogenetic tree generation¶
Taxonomic collapsing¶
Normalization and abundance-based filtering¶
Exports¶
Picrust2¶
References¶
- 1
Andrews S, Krueger F, Segonds-Pichon A, Biggins F, Wingett S. FastQC. A quality control tool for high throughput sequence data. Babraham Bioinformatics. Babraham Institute. 2015.
- 2
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;