In silico validation tool

Aim

This dedicated tool aims at predicting in silico whether specific taxa are:

amplified by a set of PCR primers used for amplicon-based metagenomics

accurately taxonomically classified based on the generated amplicon

Working principle

Based on a user-defined list, genome assemblies are downloaded from the NCBI database with Assembly Finder. Then, the PCR primer sequences provided by the user are used to run an in silico PCR with Simulate_PCR. To assess their discriminative power, the generated in silico amplicons are then processed by the pipeline as if they were sequencing reads (primer trimming, taxonomic classification).
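The in silico PCR step described above can be illustrated with a minimal Python sketch. It is not Simulate_PCR and does not reproduce its search method: it only scans the forward strand of a template for a degenerate forward primer and for the reverse complement of the reverse primer, and returns the sequence in between. The function names and the simple mismatch counting are assumptions made for illustration, and the 3' end matching rule of the pipeline is not modelled.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_sites(primer, template, max_mismatch=0):
    """Start positions where primer matches template with at most max_mismatch mismatches."""
    sites = []
    for start in range(len(template) - len(primer) + 1):
        window = template[start:start + len(primer)]
        mismatches = sum(base not in IUPAC[code] for code, base in zip(primer, window))
        if mismatches <= max_mismatch:
            sites.append(start)
    return sites


def in_silico_pcr(template, forward, reverse, max_mismatch=0):
    """Return predicted amplicons (primers included) on the forward strand of template."""
    fwd_sites = primer_sites(forward, template, max_mismatch)
    # The reverse primer anneals to the opposite strand, so its reverse
    # complement is what appears on the forward strand.
    rev_rc = reverse_complement(reverse)
    rev_sites = primer_sites(rev_rc, template, max_mismatch)
    amplicons = []
    for f in fwd_sites:
        # Keep the closest reverse site located downstream of the forward primer.
        ends = [r + len(rev_rc) for r in rev_sites if r >= f + len(forward)]
        if ends:
            amplicons.append(template[f:min(ends)])
    return amplicons
```

For instance, with the toy template "GGACGTCCCCGCAAGG", forward primer "ACGN" and reverse primer "TTGC", `in_silico_pcr` returns a single amplicon, "ACGTCCCCGCAA". A genome carrying several copies of the target gene would yield one amplicon per copy.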

Finally, this tool provides a table with, for each downloaded assembly, a description of the amplicons predicted to be amplified by the PCR primers (number of sequence variants, number of copies, expected and obtained taxonomic classification).
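To make the content of this table more concrete, the sketch below shows one way the per-assembly figures could be derived from a list of predicted amplicons. It is an illustration only: the function names, the returned keys and the semicolon-separated lineage format are assumptions, not the actual column names or logic of the pipeline.

```python
from collections import Counter


def summarize_amplicons(amplicons):
    """Count amplicon copies and distinct sequence variants for one assembly."""
    counts = Counter(amplicons)
    return {
        "n_copies": sum(counts.values()),     # total amplicons predicted in the genome
        "n_variants": len(counts),            # number of distinct amplicon sequences
        "copies_per_variant": dict(counts),   # copies observed for each variant
    }


def shared_ranks(expected, obtained, sep=";"):
    """Number of leading taxonomic ranks on which two lineages agree."""
    shared = 0
    for exp, obt in zip(expected.split(sep), obtained.split(sep)):
        if exp != obt:
            break
        shared += 1
    return shared
```

An assembly with three amplicon copies of which two are identical would be reported with three copies and two sequence variants. Comparing "Bacteria;Firmicutes;Bacilli" with "Bacteria;Firmicutes;Clostridia" gives two shared ranks, meaning that the classification is correct down to the second rank only.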

Requirements

This tool shares its requirements with the RSP4ABM main pipeline. It therefore requires:

a local copy of RSP4ABM, cloned with git using the --recursive flag so that Assembly Finder is obtained at the same time
Snakemake
Singularity (required here, not optional)
A taxonomic database preprocessed with our dedicated pipeline
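Before running anything, it can be convenient to check that the command-line tools listed above are reachable. The following is a small optional sketch, not part of the pipeline: the function name is hypothetical, and it only tests whether executables named git, snakemake and singularity are found on the PATH, not their versions.

```python
import shutil


def missing_tools(tools=("git", "snakemake", "singularity")):
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list means that all three executables were found. Any name returned must be installed, or its environment activated, before the pipeline is executed.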

Inputs

To execute the pipeline, one needs:

1. Input table

Note

It is recommended to create a new folder (outside of the pipeline directory) in which these files are created and the pipeline is executed.

This table contains the list of taxa to be tested. The first column must contain taxon names matching those of the NCBI taxonomy database. Alternatively, instead of their names, taxa can be identified by their NCBI taxID. The second column must give the number of assemblies to test for each taxon. The two columns must be separated by a tab.

Hint

The number of assemblies requested for a given taxon should not exceed the number of assemblies available at NCBI.

Input file example:

UserInputNames	nb_genomes
Bacillus cereus ATCC 10987	1
Bifidobacterium adolescentis ATCC 15703	1
Clostridium beijerinckii NCIMB 8052	1
Deinococcus radiodurans R1	1
Enterococcus faecalis OG1RF	1
Escherichia coli str. K-12 substr. MG1655	1
Lactobacillus gasseri ATCC 33323	1
Rhodobacter sphaeroides 2.4.1	1
Staphylococcus epidermidis ATCC 12228	1
Streptococcus mutans UA159	1
Acinetobacter baumannii ATCC 17978	1
Schaalia odontolytica ATCC 17982	1
Bacteroides vulgatus ATCC 8482	1
Helicobacter pylori ATCC 700392	1
Neisseria meningitidis MC58	1
Porphyromonas gingivalis ATCC 33277	1
Cutibacterium acnes subsp. defendens ATCC 11828	1
Pseudomonas aeruginosa PA7	1
Staphylococcus aureus subsp. aureus NCTC 8325	1
208435	1
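A table in this format can be checked before launching the pipeline. The sketch below is a hedged example, not the parser used by the pipeline: the function name is hypothetical, and it only verifies the layout described above (a header line, two tab-separated columns, a non-empty taxon name or taxID, and a positive integer number of assemblies). It does not query NCBI.

```python
import csv


def read_input_table(handle):
    """Parse a two-column, tab-separated input table into (taxon, nb_genomes) pairs."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) != 2:
        raise ValueError(f"expected 2 tab-separated header fields, got {len(header)}")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated fields, got {len(row)}")
        taxon, nb_genomes = row[0].strip(), row[1].strip()
        if not taxon:
            raise ValueError(f"line {line_no}: empty taxon name or taxID")
        if not nb_genomes.isdigit() or int(nb_genomes) < 1:
            raise ValueError(f"line {line_no}: number of assemblies must be a positive integer")
        rows.append((taxon, int(nb_genomes)))
    return rows
```

It can be called on an open file, for example `read_input_table(open("16S_input_table_insilico.tsv", newline=""))`. A common failure it catches is a table whose columns are separated by spaces instead of a tab.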

2. Config file

The config file specifies the parameters of the pipeline as well as the parameters for Assembly Finder.

Config file example:

################ Validation config ################

### Input table, listing the taxa and the number of assemblies to test
input_table_path: '16S_input_table_insilico.tsv'

### In silico PCR parameters, used by Simulate_PCR
forward_primer: CCTACGGGNGGCWGCAG # CCTACGGGNGGCWGCAG for Illumina V3V4
reverse_primer: GACTACHVGGGTATCTAATCC # GACTACHVGGGTATCTAATCC for Illumina V3V4
mismatch: 3 # Number of mismatches allowed, from 0 to 3
threeprime: 2 # Number of matches required at the 3' end for a hit to be considered

### Amplicon trimming parameters, used by cutadapt
excepted_errors: 0.1 ## Proportion of mismatches tolerated in the primers to be trimmed
amplicon_min_coverage: 0.8 ## Proportion of the primer that must be covered for it to be trimmed
merged_min_length: 390 # from 390 to 400 for V3V4
merged_max_length: 500 # from 450 to 500 for V3V4

### Database used for taxonomic assignment
tax_DB_path: "/data/databases/amplicon_based_metagenomics/16S/"
tax_DB_name: [ezbiocloud201805.202005] # must be the name of the folder containing files named "DB_amp.fasta" and "DB_amp_taxonomy.txt" in /data/. Can be multiple.
classifier: ["qiimerdp"]

### Processing for comparison of taxonomic assignment
viz_replace_empty_tax: TRUE

########### Assembly finder config ################
NCBI_key: '6dce38824889f62e188a25ae35c52a083c08'
NCBI_email: 'valentin.scherz@chuv.ch'

##Parameters for search_assemblies function
#This set of parameters is to search all possible assemblies
complete_assemblies: True ## Keep only complete assemblies
reference_assemblies: False ## Keep only reference assemblies
representative_assemblies: False ## Keep only representative assemblies
exclude_from_metagenomes: True ## Exclude assemblies from metagenomes
Genbank_assemblies: True ## Take from Genbank
Refseq_assemblies: True ## Take from Refseq


##Parameters for the filtering function
Rank_to_filter_by: 'None'
#None: Assemblies are ranked by their assembly status (complete or not)
#and Refseq category (reference, representative ...)
#If you want to filter by species, set this parameter to 'species'. The filtering function will list all unique species
#and rank their assemblies according to assembly status and Refseq category.

A template of the input table is available at: https://github.com/metagenlab/microbiome16S_pipeline/blob/master/ressources/template_files/16S_input_table_insilico.tsv
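The validation parameters of the config file can be sanity-checked before a run. The sketch below is only an illustration working on a plain dictionary holding the keys of the example config: the function name is hypothetical, the 0 to 3 range for mismatch comes from the comment in the example, and the 0 to 1 range for the two proportions is an assumption based on their descriptions. The pipeline itself may accept or reject other values.

```python
IUPAC_CODES = set("ACGTRYSWKMBDHVN")


def check_validation_config(config):
    """Return a list of human-readable problems found in the validation parameters."""
    problems = []
    for key in ("forward_primer", "reverse_primer"):
        primer = str(config.get(key, "")).upper()
        if not primer or set(primer) - IUPAC_CODES:
            problems.append(f"{key} must be a non-empty sequence of IUPAC nucleotide codes")
    mismatch = config.get("mismatch")
    if not isinstance(mismatch, int) or not 0 <= mismatch <= 3:
        problems.append("mismatch must be an integer between 0 and 3")
    for key in ("excepted_errors", "amplicon_min_coverage"):
        value = config.get(key)
        if not isinstance(value, (int, float)) or not 0 <= value <= 1:
            problems.append(f"{key} must be a proportion between 0 and 1")
    shortest = config.get("merged_min_length")
    longest = config.get("merged_max_length")
    if not (isinstance(shortest, int) and isinstance(longest, int) and 0 < shortest <= longest):
        problems.append("merged_min_length and merged_max_length must be positive integers with min <= max")
    return problems
```

With the values of the example config, the function returns an empty list. A primer containing a character outside the IUPAC alphabet, or a minimum length above the maximum length, is reported as one problem each.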

Execution

Once all the requirements are installed and the input files are ready, one can execute the pipeline. In an environment where Snakemake is available, it can be run as follows:

snakemake --snakefile {path_to_pipeline}/Insilico_taxa_assign.Snakefile  --use-singularity --singularity-prefix {path_to_singularity_images} --cores {number of cores} --configfile {path_to_config} --resources ncbi_requests={number of request to NCBI} -k
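When the command above is launched from a script, its placeholders can be filled in programmatically. The following sketch assembles the same command as an argument list. The function name, its parameters and the paths used in the usage note are hypothetical, while the flags are those of the command shown above.

```python
import shlex


def build_snakemake_command(pipeline_dir, singularity_prefix, config_path, cores, ncbi_requests):
    """Assemble the in silico validation command as a list of arguments."""
    return [
        "snakemake",
        "--snakefile", f"{pipeline_dir}/Insilico_taxa_assign.Snakefile",
        "--use-singularity",
        "--singularity-prefix", singularity_prefix,
        "--cores", str(cores),
        "--configfile", config_path,
        "--resources", f"ncbi_requests={ncbi_requests}",
        "-k",
    ]


def as_shell_string(command):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(command)
```

For example, `build_snakemake_command("/path/to/RSP4ABM", "/path/to/singularity_images", "config.yaml", 4, 2)` yields a list that can be passed to `subprocess.run(command, check=True)` from the folder containing the input table and the config file, or printed with `as_shell_string` to review it first.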