This documentation is under construction. Please send us your feedback at hub@gencovery.com.
Introduction
Whole Genome Shotgun (WGS) metagenomic sequencing makes it possible to sample all the genes of all the organisms present in a complex sample. This method lets microbiologists assess bacterial diversity and estimate the abundance of microbes in various environments. WGS metagenomics also provides a means of studying unculturable microorganisms that are otherwise difficult or impossible to analyse.
This type of sequencing data can be analysed either by (1) mapping the reads against a database of reference genes or (2) assembling the metagenome, then annotating the assembled sequences and identifying the taxa present.
The pipeline used here is MetaPhlAn (see its GitHub page). It follows the mapping approach and performs no assembly, which shortens the computation time compared with assembly-based pipelines.
Data upload and preparation
Input fastq folder
Upload a single folder containing all the sequencing data using the Databox, and select the following format: Fastq folder.
Fastq folder: a folder containing all the sequencing data in FASTQ format, regardless of the sequencing strategy (paired-end or single-end).
MetaPhlAn does not use the pairing information of paired-end samples, information that can otherwise be useful for detecting genomic rearrangements and repetitive sequence elements.
Protocol
Mapping reads
The task gws_metag - Short Reads Mapping with MetaPhlAn performs taxonomic assignment directly on the raw reads. The database used for the mapping is ChocoPhlAn.
ChocoPhlAn:
This database is built from the UniProt and NCBI resources. Core genomes (gene families present in all strains of a species) are extracted first, and unique marker genes are then selected from these core genomes. The markers are clade-specific: each one is shared by the genomes of a single clade, at any taxonomic level. A marker is a coding sequence (CDS) that is conserved within its clade and not similar to sequences outside it.
In its latest version, ChocoPhlAn 3, the database covers around 99,000 annotated microbial genomes and 16,800 species, available through the UniProt Proteomes portal (The UniProt Consortium, 2019).
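The marker definition above can be illustrated with a toy example. This is only a sketch under strong simplifying assumptions: the clade names and gene-family identifiers are hypothetical, and exact identifier matching stands in for the sequence-similarity comparison that the real database construction relies on.

```python
# Toy data: each genome is a set of gene-family identifiers (all hypothetical).
genomes_by_clade = {
    "clade_A": [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}],
    "clade_B": [{"g2", "g5", "g6"}, {"g2", "g5"}],
}


def core_families(genomes):
    """Return the gene families present in every genome of one clade."""
    return set.intersection(*genomes)


def clade_markers(clades):
    """Return, per clade, the core families found in no genome outside it."""
    markers = {}
    for clade, genomes in clades.items():
        # Collect every gene family seen in the genomes of the other clades.
        outside = set()
        for other, other_genomes in clades.items():
            if other != clade:
                outside |= set.union(*other_genomes)
        markers[clade] = core_families(genomes) - outside
    return markers


markers = clade_markers(genomes_by_clade)
print(markers)
```

Here g2 belongs to the core genome of both clades, so it is not a marker for either of them, whereas g1 and g5 are each conserved in one clade and absent from the other.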
The first parameter to set is the number of threads used to run the analysis. Using as many threads as are available reduces the running time of the pipeline.
Another option is unknown_count. If selected, MetaPhlAn will estimate the proportion of unknown microorganisms in the samples.
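For readers who want to relate these two options to MetaPhlAn itself, the sketch below builds a command line for a single sample. How the task calls MetaPhlAn internally is not documented here, so the mapping is an assumption: threads is taken to correspond to --nproc and unknown_count to --unknown_estimation, which are MetaPhlAn 3 flags (names may differ in other releases). The function name and the output file names are hypothetical.

```python
import os


def build_metaphlan_command(fastq_path, sample_name, threads=None, unknown_count=False):
    """Build (without running) a MetaPhlAn 3 command for one FASTQ file.

    Assumption: this only illustrates the two options described above;
    it is not the task's actual implementation.
    """
    # Fall back to every available CPU when no thread count is given.
    n_threads = threads or os.cpu_count() or 1
    cmd = [
        "metaphlan", fastq_path,
        "--input_type", "fastq",
        "--nproc", str(n_threads),
        # Intermediate Bowtie2 mapping, reusable to re-run the profiling step.
        "--bowtie2out", f"{sample_name}.bowtie2.bz2",
        "-o", f"{sample_name}_profile.txt",
    ]
    if unknown_count:
        # Ask MetaPhlAn to estimate the unknown fraction of the sample.
        cmd.append("--unknown_estimation")
    return cmd


command = build_metaphlan_command("sample_1.fastq", "sample_1", threads=4, unknown_count=True)
print(" ".join(command))
```

The command is returned as a list so that it could be passed to subprocess.run without shell quoting issues.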
Files:
Input: Fastq folder
Outputs:
- Set of taxonomic abundance tables (relative abundance)
- bowtie_output (intermediate Bowtie2 output for re-running MetaPhlAn quickly)
In the Set of taxonomic abundance tables, the computed abundance is relative (total = 100%).
The table named abundance_table is the output of MetaPhlAn, with the samples in columns and the taxonomic information in rows.
The other tables are derived from it: each one contains a single taxonomic level, with rows and columns swapped.
These tables are easier to use for making graphics such as stacked barplots.
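The derivation of the per-level tables can be sketched as follows. The table content and sample names are hypothetical, and the code assumes the usual MetaPhlAn lineage notation, in which ranks are separated by "|" and each taxon carries a one-letter rank prefix (for example p__ for phylum). It is not the code used by the task.

```python
RANK_PREFIXES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species", "t": "strain",
}

# Hypothetical merged table: taxa in rows, samples in columns (values in %).
samples = ["sample_1", "sample_2"]
abundance_table = {
    "k__Bacteria": [100.0, 100.0],
    "k__Bacteria|p__Firmicutes": [60.0, 25.0],
    "k__Bacteria|p__Bacteroidetes": [40.0, 75.0],
}


def split_by_rank(table):
    """Group rows by the rank of the last taxon in each lineage."""
    per_rank = {}
    for lineage, values in table.items():
        taxon = lineage.split("|")[-1]
        rank = RANK_PREFIXES.get(taxon.split("__")[0])
        if rank is None:
            continue  # rows without a recognised rank prefix are skipped
        per_rank.setdefault(rank, {})[taxon] = values
    return per_rank


def transpose(rank_table, sample_names):
    """Swap rows and columns: one entry per sample, one value per taxon."""
    return {
        sample: {taxon: values[i] for taxon, values in rank_table.items()}
        for i, sample in enumerate(sample_names)
    }


phylum_by_sample = transpose(split_by_rank(abundance_table)["phylum"], samples)
for sample, taxa in phylum_by_sample.items():
    print(sample, taxa, "total:", sum(taxa.values()))
```

With samples as the outer keys, each entry maps directly onto one bar of a stacked barplot, and in this toy table the values within a level add up to 100 for each sample.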