Taxonomic reference database preprocessing¶

Note

Provided command-line examples are given as examples and are valid for a standard unix bash terminal.

Rational:¶

After processing of sequencing reads by a metagenomic pipeline, we expect amplicon sequences (OTUs or ASVs) to be assigned to the lowest possible taxonomic level (species). However, it is expected for some species to have more or less exactly the same sequence on the gene used as marker. Thus, all species cannot be differentiated without ambiguities based on marker genes, and that even more on the short fragment amplified and sequenced in amplicon-based metagenomics.

The classifiers integrated in RSP4ABM (original RDP 1, RDP integrated in QIIME 2 and Decipher IDTAXA 3) all have specific formatting requirements and the two last require an initial training.

Thus, to improve the taxonomic classification and to render to the end-user an information regarding the risk of confusion of certain taxa, the a dedicated workflow of the pipeline will fuse taxa represented by identical sequences. This workflow also adapts the format of the database to make it compatible with multiple classifiers and conduct the training required by some of them.

Working principle:¶

The user indicates to the pipeline:

the sequences of the used PCR primers and the path to the input reference database fasta file and taxonomy annotation file to be formatted.

Based on this information and using tools from Cutadapt 4 and VSEARCH 5, as well as home-made R 6 scripts, the pipeline first extracts the amplicon matching the used primes. Then, it unifies the taxonomy: in cases where the exact same amplicon is predicted for multiple taxa, it collapses together their identifiers at the genus/species (up to a user-defined number of occurrences). An error is raised in cases where the same sequence is observed across different families or ranks above.

In addition, the pipeline formats the database and executes the pre-training required for the original RDP classier as well as Decipher IDTAXA.

Finally, the pipeline will make a copy of the original database as well as computes hashes of all files for traceability purposes.

Execution:¶

A dedicated workflow is embedded in RSP4ABM for database preprocessing. This workflow is to be run only one time for each set of PCR primer and reference database.

First, the user must retrieve a database in the right format and a config file must be defined. Then, provided that the pipeline was properly setup (see Setup and system requirements), the dedicated workflow can be executed.

Taxonomy database¶

RSP4ABM requires a reference taxonomic database for classification. The database provided to RSP4ABM must be organized into two files, following the original QIIME format:

a Reference sequences fasta file.
a Reference taxonomy text file describing with the taxonomic classification of these sequences.

Reference sequences¶

The first file must be a fasta file with reference genomic sequences. The description of each sequence must be an unique sequence identifier.

For instance:

>1
CTGNCGGCGTGCCTAACACATNCAAGTCGAGCGGTGCTACGGAGGTCTTCGGACTGAAGTAGCATAGCGGCGGACGGGTGAGTAATACACAGGAACGTGCCCCTTGGAGGCGGATAGCTGTGGGAAACTGCAGGTAATCCGCCGTAAGCTCGGGAGAGGAAAGCCGGAAGGCGCCGAGGGAGCGGCCTGTGGCCCATCAGGTAGTTGGTAGGGTAAGAGCCTACCAAGCCGACGACGGGTAGCCGGTCTGAGAGGATGGACGGCCACAAGGGCACTGAGACACGGGCCCTACTCCTACGGGAGGCAGCAGTGGGGGATATTGGACAATGGGCGAAAGCCTGATCCAGCGACGCCGCGTGAGGGACGAAGTCCTTCGGGACGTAAACCTCTGTTGTAGGGGAAGAAGACAGTGACGGTACCCTACGAGGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGNCGAGCGTTACCCGGAATCACTGGGCGTAAAGGGTGCGTA
>2
AACGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAACGAAGTCTTCGGACTTAGTGGCGCACGGGTGAGTAACACGTGGGAATATACCTCTTGGTGGGGAATAACGTCGGGAAACTGACGCTAATACCGCATACGCCCTTCGGGGGAAAGATTTATCGCCGAGAGATTAGCCCGCGTCCGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTCACCCACGACGATGATGACGGTAGTGGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTGGTCATAGTCAGAAGTGAAAGCCCTGGGCTCAACCCGGGAATTGCTTTTGATACTGGACCGCTAGAATCACGGAGAGGGTAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGG

Reference taxonomy¶

The second file must be a text file where the first column is the sequence identifier and the second represents its 7 levels of taxonomy separated by “;” (Kingdom;Phylum;Class;Order;Family;Genus;Genus Species). Both columns must be separated by a tabulation.

For instance:

1   Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Magnetospirillum;Magnetospirillum magnetotacticum
2   Bacteria;Fusobacteria;Fusobacteria_c;Fusobacteriales;Fusobacteriaceae;Fusobacterium;Fusobacterium nucleatu

Find your own reference database¶

We do not provide a taxonomic reference database. However, here is a short, non-exhaustive, list of databases from which we could successfully prepare a database with our pipeline.

EzBioCloud (16S rRNA - Bacteria)

Website

Publication

Silvia (16/18S rRNA, 23/28S rRNA - Bacteria and Eukarya )

Website

Publication

UNITE (ITS - Eukarya)

Website

Working directory¶

To execute the pipeline place yourself in any directory, but preferably not in the directory pipeline. It does not have to be where the input reference database files are, nor where you desire to save the output (these locations will be defined in the config file .)

for instance:

# create a directory to run the database preprocessing workflow
$ mkdir DB_processing
# Change into this new directory
$ cd DB_processing

Config file¶

As for the main pipeline, parameters must be provided in an config file in the .yaml format. Please adapt the following template to your situation.

for instance:

# Open a graphic text editor and create a config file. Once opened, copy the example below and adapt it.
$ gedit config_DB.yaml

# Or use a command-line text editor, e.g.
# $ nano config_DB.yaml

## Path to the input database files
DBpath_seq: "/home/master_students/bin/microbiome16S_pipeline/data/ezbiocloud201805/DB_amp.fasta" # reference genomic sequence in .fasta format
DBpath_tax: "/home/master_students/bin/microbiome16S_pipeline/data/ezbiocloud201805/DB_amp_taxonomy.txt" # reference taxonomy of these sequences. 

## Output
tax_DB_path: "/data/databases/amplicon_based_metagenomics/16S/" # path where to write the output of the database processing pipeline. 
tax_DB_name: "EzBioCloud_version" # desired name of processed database (name of the directory which will contain the output)

## Amplicons extraction
forward_primer: CCTACGGGNGGCWGCAG # the forward PCR primer used to extract the amplicon
reverse_primer: GACTACHVGGGTATCTAATCC # the reverse PCR primer used to extract the amplicon
amplicon_min_length: 300 # the min length of the amplicon
amplicon_max_length: 500 # the max length of the amplicon
excepted_errors: 4 # the accepted number of mistmach per primer
amplicon_min_coverage: 0.9 # the proportion of the primers that must be found in the reference sequences. 

## Taxa collapsing
numbers_species: 4 # the max number of species names to be pasted together. Over this number, taxonomic names will be replaced by a space holder. 
numbers_genus: 2 # the max number of genus names to be pasted together. Over this number, taxonomic names will be replaced by a space holder.

Pipeline execution¶

Once the reference database in the right format downloaded and the config file prepared, the database preprocessing pipeline can be executed.

snakemake \
    --snakefile <path/to/pipeline>/microbiome16S_pipeline>/DBprocess.Snakefile \
    --use-singularity --singularity-prefix <path/to/singularity/containers/storage/> \
    #or --use-conda --conda-prefix <path/to/conda/environements/storage/> \
    --cores <number_of_CPU_allocated_to_the_analysis> \
    --configfile config_DB.yaml

Working with without pipeline preprocessing?:¶

TO BE EXPLAINED!

References:¶

1: Wang Q, Garrity GM, Tiedje JM, Cole JR. Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007.
2: Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nature Methods. 2010.
3: Murali A, Bhargava A, Wright ES. IDTAXA: A novel approach for accurate taxonomic classification of microbiome sequences. Microbiome. 2018.
4: Compeau PEC, Pevzner PA, Tesler G, Papoutsoglou G, Roscito JG, Dahl A, et al. Cutadapt removes adapter sequences from high-throughput sequencing reads kenkyuhi hojokin gan rinsho kenkyu jigyo. EMBnet.journal. 2013.
5: Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ. 2016
6: R Core Development Team. R: A language and environment for statistical computing. Vienna, Austria. 2019.