Pipeline execution

Note

Provided command-line examples are given as examples and are valid for a standard unix bash terminal.

Note

The instructions hereunder assume that the pipeline setup was completed (see System requirements and setup), including the preprocessing of the taxonomy reference database (see Taxonomic reference database preprocessing).

1. Select a method to define sequencing read files as input

Before executing RSP4ABM, you need to define how you provide the sequencing read files (in .fastq.gz format) to the pipeline. The pipeline accepts 4 different methods to define the input read files, with different requirements. Please first read hereunder to define the method of input definition that bests fits to your needs.

A. Sample column

The easy, but not completely foolproof, way to run the pipeline. The one to favor when you have .fastq reads all stored in a folder, with simples identifers matching your sample names.

A regex-based script automatically matches the sample names with the names of te sequencing read files found in a directory.

The leftmost column of the sample sheet is named “Sample” and contains the sample identifiers.

All samples have unique, simple names starting by a letter (and not a number).

e.g. “1” will fail as sample name

All samples names are contained without ambiguity within the filename of the .fastq.gz read files.

e.g. the pipeline will be confused by “Sample2” and “Sample2_bis”.

All reads are located in one single directory.

reads can be actually stored in the directory or be represented by symbolic links

The path of the directory containing the reads is designated by the “link_directory” parameter in the config file (”links/” by default).

Reads are considered to be paired-end (the pipeline looks for “R1” and “R2” .fastq.gz files) by default. For single-end or mixed reads, the sample sheet must contain a “LibraryLayout” column indicating “single” (or “paired”).

The path to the sample sheet describing your samples is indicated by the “local_samples” parameter in the config file.

B. OldSampleName column

The sample names used for sequencing and contained in the .fastq.gz read filenames can be impractical or incompatible to be used as sample identifier (e.g. starts with a number). In this case, match can be done with an “OldSampleName” column instead of the default “Sample” column.

Same as A. Sample column matching the presence of a column named “OldSampleName” in the sample sheet forces the pipeline to run the regex-based matching on this column instead of the leftmost “Sample” column.

This method implies that sample are given a “new” identifier from the Sample leftmost column of the sample sheet, instead of their “old” identifier (the one in the fastq.gz filename and in the OldSampleName column).

A column named “OldSampleName” where all samples have a unique name without ambiguity within their .fastq.gz filename.

Similar to the Sample column matching method for the rest.

C. Absolute path

As the foolproof approach, sequencing read files for each sample will be pointed by their absolute path.

In presence of a “R1” column in the sample sheet (+/- a “R2” column for paired-end reads), the pipeline considers the path indicated in this/these column(s) as the input file(s) for each sample.

A column named “R1” (and a “R2” for paired-end reads) with the path to the .fastq.gz read files for each sample.

The presence of a value in a “R2” column indicates to the pipeline that the sample was sequenced in paired-end.

The path to the sample sheet describing your samples is indicated by the “local_samples” parameter in the config file.

D. Sample Read Archive (SRA)

RSP4ABM is capable to automatically download and process reads stored on the Sequence Read Archive (SRA).

Reads stored on the SRA are designated by their Run identifier in the sample sheet. The sequencing data will be automatically downloaded and then processed by the pipeline.

In this case, the leftmost column of the sample sheet is named “Run” and matches “Run” identifiers from SRA.

A “LibraryLayout” column in the sample sheet indicates the sequencing layout (”single” or “paired”).

The path to the sample sheet describing your samples is indicated by the “sra_samples” parameter in the config file.

2. Create a working directory

Before the execution of the pipeline, prepare a new dedicated directory somewhere on the system where you have sufficient space (NOT within the pipeline directory).

for instance:

#make a new directory named "new_analysis"
$ mkdir new_analysis

# Move to the new directory
$ cd new_analysis

The config file, the sample sheet and eventually the links/ directory (see below) must be created within this working directory

3. Create a links directory (for Sample and OldSampleName column as input)

When using the method A. Sample column or the method B. OldSampleName column for the definition of the sequencing read files used as input, all the .fastq.gz files must be located within one single directory. By default, this directory is named “links/” and is found in the working directory, but another path can be defined by the “link_directory” parameter of the config file. It is recommended to create symbolic links of the original reads into this links/ directory.

for instance:

#make a new directory named "links"
$ mkdir links

# Create symbolic links for .fastq.gz files from distant directory into "links/"
$ ln -s path/to/a/distant/directory/containing/some/reads/*.fastq.gz links/

4. Write a sample sheet (aka metadata table):

The pipeline requires a spreadsheet in tabulation-separated values (tsv) format listing all the samples in the analysis and where:

The leftmost column (”Sample” or “Run”) is the sample identifier.

This identifier must be unique, start with a letter and not a number and cannot contain spaces or “-”. “_” are OK.
A “run_column” describes the sequencing run of each sample.

Each sample must have a different value under this column for each sequencing run included in the analysis. If all samples were sequenced together, then the same value must be repeated for all samples. Prefer an alphanumeric factor, e.g. “run_20200101”
A “grouping_column” regroups the samples for visualization purpose.

Some of the visualization generated by the pipeline will be generated individually for each value contained in the “grouping column”
A “sample_label” describes each sample.

This column must provide a unique, explicit description of each sample. It can be a replication of the *Sample column but also provides the opportunity to have more concise or explicit description of each sample*.
Optional (but recommended) columns describes technical metadata.

In this recommended to provide technical metadata (e.g. library preparation DNA yield) to support technical QC of the data
Optional (but recommended) columns describes experimental or clinical metadata.

In this recommended to provide clinical or experimental description of each sample, which will support later interpretation of the data.

sample sheet example:

Sample  sample_label    sample_group    seq_run
S1111   sample_1    patient1   run1
MIX9B   sample_2	patient1   run1
MIX7B   sample_3	patient2   run1
MIX10B  sample_4    patient2   run1
MIX7A   QC_1    QC  run1

5. Write a config file

Parameters must be provided to adapt the pipeline to the specificities of the analysis. This is done through a config file in the yaml <https://en.wikipedia.org/wiki/YAML> format. Create this file in the working directory, copy the content of the example below and adapt it to the analysis. The three first parameters (“link_directory”, *”sra_samples” and “local_samples”) are associated to the definition of input files. For the meaning of the other parameters, please refer to the short comment next by each parameter and the detailed description of the pipeline on the under_the_hood page.

for instance:

# Open a graphic text editor and create a config file. Once opened, copy the example below and adapt it.
$ gedit config.yaml

# Or use a command-line text editor, e.g.
# $ nano config.yaml

Config file example:

## Dataset ############################################################################################################
### Set input samples list. (Mandatory, one of them or both)
link_directory: links # links by default.
sra_samples: example_sra_samples.tsv ## To access to Sequences Read Archive
local_samples: example_local_samples.tsv ## Local analysis

### Sample processing metadata definition
run_column: seq_run # Column identifying the sequencing run (sequencing error are learned for each run with DADA2)

### Choose the denoising approach, either the "vsearch" to generate classical 97% identity OTUs or the "DADA2" for errors corrected 100% ASVs
denoiser: ["DADA2", "vsearch"] # "vsearch", "DADA2" or both. For ITS (see below) only DADA2 has been tested.

### PCR primers trimming sequences (with PANDAseq for vsearch approach and Cutadapt for DADA2)
Trim_primers: True # False to skip primers trimming (usefull if unkown primers or already trimmed in the input reads, for instance for SRA deposited reads)
ITS_or_16S: 16S # ITS or 16S, affects the primers trimming. With ITS, reverse occurence of the primer is allowed in the opposed read.
min_overlap: 20 # Min overlap of the reads for paired-end reads merging
forward_primer: CCTACGGGNGGCWGCAG # CCTACGGGNGGCWGCAG for Illumina V3V4
reverse_primer: GACTACHVGGGTATCTAATCC # GACTACHVGGGTATCTAATCC for Illumina V3V4

### Merged sequences length-based filtering
merged_min_length: 390 # from 390 to 400 for V3V4
merged_max_length: 480 # from 450 to 500 for V3V4

### DADA2 specific settings ###########################################################################################
#### Trim reads based on length. Reads shorter than this length are trimmed.
DADA2_F_reads_length_trim: 280 # 280 recommended to remove low quality ends of R1. Must be < than read_length - trimmed_primer_length
DADA2_R_reads_length_trim: 255 # 255 recommended to remove low quality ends of R2. Must be < than read_length - trimmed_primer_length

### Filter reads based on number of expected errors after trimming
F_extected_error: 6 # 8 recommended, increase if too much reads are filtered out. Apply to DADA2 approach only
R_extected_error: 6 # 8 recommended, increase if too much reads are filtered out. Apply to DADA2 approach only

### Taxonomic assignment ##############################################################################################
### Reference database
tax_DB_path:  "/data/databases/amplicon_based_metagenomics/16S/" # The path to the reference databases
tax_DB_name: ["ezbiocloud201805.201909", "ezbiocloud201805.202005"] # The names of the databases. Must be the "tax_DB_name" provided when preprocessing the reference database.

### Taxonimc classifier.
classifier: ["RDP", "qiimerdp", "dada2rdp","decipher"] # one or more from :"qiimerdp", "dada2rdp","decipher"


## Reads post-processing ##############################################################################################
rarefaction_value: [20000] # value for rarefaction. 20000 is generally enough reads but must be assessed by rarefaction curve.
min_prevalence: [0,10] # proporition (in %) of samples in which the feature has to be found to be kept
min_abundance: [0,100] # minimal count to be kept
normalization: ["CLR", "CSS","TMM", "NONE"] # value for counts normalization. Using script form https://github.com/biobakery/Maaslin2
viz_replace_empty_tax: TRUE # "TRUE" is recommended to replace empty taxonomic levels by "_sp" for species, "_g" for the genus etc.... Otherwise they are left empty
filter_tax_rank: ["Kingdom"] # At which taxonomic rank we filter in. (Usually "Kingdom")
filter_lineage: ["Bacteria"] # What value of filter_tax_rank we keep. (Usually "Bacteria")
filter_out_tax_rank: ["Phylum"] # At which taxonomic rank we filter out. (Usually "Phylum")
filter_out_lineage: ["Bacteria_phy"] # Which value of filter_out_tax_rank we filter out. (Usually "Bacteria_phy" to remove suspicious sequences assigned as Bacteria but which are probably not.

## Visualization ######################################################################################################
grouping_column: ["sample_group"] # Column for which a plot will be generated for each value of. Rarefaction curve will be only generated with colors from the first value in the list
sample_label: sample_label # Column in "local_samples" used to label samples in output. Cannot be the first column and must be unique. (KRONA, heatmaps, barplots)


## Special outputs ##############################################################################################################
### Phyloseq output for external plitting and analysis
melted_phyloseq: False # Generate melted phyloseq table (voluminous)
phyloseq_tax_ranks: ["OTU"] # Which collapse level for phyloseq output
transposed_tables: False # True to have transposed count table

### Qiime2 Visualization information
Qiime2_basic_output_visualization: True # or False

6. Make sure Snakemake is available

The recommended way to install Snakemake is to create a dedicated Conda environment. (see System requirements and setup). Thus, make sure to activate this environment.

for instance:

# Activate your conda environment
$ conda activate snakemake

# Test that Snakemake is available by printing its version
$ snakemake --version

7. Run the pipeline

Hint

At this stage, the content of your working directory depends upon what was selected in 1. Select a method to define sequencing read files as input.

Example of directory content for the basic Sample column strategy:

# List the content of your working directory
$ ls
config.yaml  links  local_sample.tsv

Then, the execution of the pipeline follows the principles of any *Snakemake pipeline execution*.

But to make it short, here are the requirement arguments.