Introduction Version

Metabarcoding analysis

This documentation is under construction. Please give us your feedbacks by contacting us at hub@gencovery.com.


One of the most cost-effective and rapid genomic approaches to determining the species composition of a microbiome is to sequence one or more genes (called barcodes or tags) common within the living kingdom. For example, sequencing the gene coding for 16S rRNA will reveal the representation and abundance of different bacterial species present in samples. The reads obtained after sequencing are aligned with bacterial genomes referenced in databases, allowing the identification and taxonomic classification of the sequenced species.

Data upload and preparation

Input fastq folder

One must upload one folder with all the sequencing data using the Databox. You must select the following format: Fastq folder.
Fastq folder : A folder containing all the sequencing data in fastq format, regardless of the sequencing strategy (paired or not).

Generation of the metadata file

The uBiome Make metadata file task automatically generates a ready-to-use metadata file using ( task: uBiome -Qiime2 metadata table maker) when given a fastq folder as input. Once the metadata file is generated, you can add specific metadata columns in the expected file format (see below). You need to download, modify and re-upload the metadata file.

Metadata file : metadata can include technical details, such as descriptions of samples, treatment, subject, time, body sites... The accepted metadata format is a tab-separated file with well-defined headings and mandatory columns (as below)
Exemple :

#author: Paulson, Robert		
#data: 1996/08/17		
#project: Chaos		
#types_allowed:categorical or numeric	
#column-type	categorical	categorical	categorical
sample-id	forward-absolute-filepath	reverse-absolute-filepath	Treatment
Sample_1	Sample_1.forward.fq.gz	Sample_1.reverse.fq.gz	ctrl
Sample_2a	Sample_2a.forward.fq.gz	Sample_2a.reverse.fq.gz	T1
Sample_2b	Sample_2b.forward.fq.gz	Sample_2b.reverse.fq.gz	T1
Sample_3a	Sample_3a.forward.fq.gz	Sample_3a.reverse.fq.gz	T2
Sample_3b	Sample_3b.forward.fq.gz	Sample_3b.reverse.fq.gz	T2

The first six rows and the sample-id column are mandatory. The type of the columns must be entered according to the data they contain (categorical or numerical). The treatment column has been added according to the required format (Treatment + column-type: categorical)


STEP 1 - Checking the reads quality

This step (task: uBiome - Qiime2 quality check) allows to investigate sequencing quality from a sequencing dataset project.

⚠️ In case of adding additional information to the metadata file, do not forget to : ⚠️

  1. Download the automatically generated metadata file (see initial steps).
  2. Modify it with a spreadsheet app (e.g. Excel...) by adding the informative columns (for example, the treatment column). We advise not to put spaces in the names of samples as they may produce errors with some tools !
  3. Keep the tabulation as a column separator.
  4. Re-upload the file to the databox.
  5. Use this modified metadata file at this stage of quality control of the sequencing.

Files :

Inputs : Fastq folder and a metadata file

Outputs : Quality folder and boxplot quality figure(s)

Informations :

This first step generates a figure with a boxplot for each base position (i.e. a fastqc type figure). Based on this figure(s), we need to cut out low quality read positions (e.g. less than 25 PHRED), but we also need to consider that the length of the reads will be long enough to overlap in the next step. To do this, you need to consider read lengths and expected overlap (e.g. based on the rRNA region sequenced) according to your study design. In case of single-ended files, poor quality positions should also be cut out.

STEP 2 - Denoising, clustering sequences

This step method (task: uBiome -Qiime2 featureInference) requires two (paired-end, one for single-end) parameters to trim off the last bases of each sequence, to remove low quality regions of the sequences (see above). Reads can be hard clipped in 5' with an optional option.

Files :

Input : Quality folder
Outputs : Feature frequencies folder and feature count table

Informations :

In this step, we first denoise sequences to remove and/or correct noisy reads. Then, we dereplicate our sequences to reduce repetition and file size/memory requirements in downstream steps (count of each replicate is kept). Finally, we cluster sequences to collapse similar sequences (e.g., those that are ≥ 97% similar to each other) into single replicate sequences. These methods filter out noisy sequences, correct errors in marginal sequences, remove chimeric sequences, remove singletons, join denoised paired-end reads and then dereplicate those sequences.
The features produced by denoising methods go by many names of “amplicon sequence variant” (ASV). ASVs are a more recent development and provide better resolution in features than traditional OTU-based methods. ASVs can separate features based on differences of a single nucleotide in sequences of 400 bp or more, a resolution not possibly even with 99% identity OTU clustering. QIIME 2 currently offers denoising via DADA2 (Nearing et al, 2018).
DADA2 is a pipeline for detecting and correcting (where possible) Illumina amplicon sequence data. As implemented in the q2-dada2 plugin, this quality control process will additionally filter any phiX reads (commonly present in marker gene Illumina sequence data) that are identified in the sequencing data and will filter chimeric sequences.

STEP 3 - Assessing α rarefaction

For this step, the max depth parameter value that you provide to the task (task: uBiome -Qiime2 rarefaction) should be determined by reviewing the “Frequency per sample” information presented in the previous Feature table file that was created above. In general, choosing a value that is somewhere around the median frequency seems to work well, but you may want to increase that value if the lines in the resulting rarefaction plot don’t appear to be leveling out, or decrease that value if it seems that many of your samples are lost due to low total frequencies closer to the minimum sampling depth than the maximum sampling depth.

Files :

Input : Feature frequencies folder
Outputs : Rarefaction folder and rarefaction tables (NO RAREFACTION ARE DONE ON FEATURES COUNTS)

Informations :

We now have a feature table (observation matrix) of ASVs in each sample and a phylogenetic tree representing those ASVs, so we’re almost ready to perform various analyses of microbial diversity. However, first we must normalize our data to account for uneven sequencing depth between samples.
Although sequencing depth in a microbiome sample does not directly relate to the original biomass in a community, the relative sequencing depth has a large impact on observed communities (Weiss et al, 2017). Therefore, for most diversity metrics, a normalization approach is needed.
Rarefaction is a method for normalization via sub-sampling without replacement and is commonly used as a workaround for the issue of uneven sequencing depth. Rarefaction occurs in two steps: first, samples which are below the rarefaction depth are filtered out of the feature table. Then, all remaining samples are subsampled without replacement to get to the specified sequencing depth. It’s both important and sometimes challenging to select a rarefaction depth for diversity analyses. Several strategies exist to figure out an appropriate rarefaction depth - we will primarily consider alpha rarefaction in this tutorial, because it is a data-driven way to approach the problem.
We’ll use alpha rarefaction (QIIME2) to subsample the ASV table at different depths (between minimum (>=20) reads and maximum depth) and calculate the alpha diversity using one or more metrics (--p-metrics). We want to set a maximum depth close to the maximum number of sequences. By default, 10 rarefied tables are calculated at each sampling depth to provide an error estimate.

STEP 4 - Taxonomy, diversity assessments

An important parameter that needs to be provided to this step task (task: uBiome - Qiime2 Taxonomy Diversity) is sampling depth, which is the even sampling (i.e., rarefaction) depth. Because most diversity metrics are sensitive to different sampling depths across different samples, this script will randomly subsample the counts from each sample to the value provided for this parameter. For example, if you provide --p-sampling-depth 500, this step will subsample the counts in each sample without replacement so that each sample in the resulting table has a total count of 500. If the total count for any sample(s) are smaller than this value, those samples will be dropped from the diversity analysis.

We recommend making your choice by reviewing the previous rarefication views. Choose a value that is as high as possible (so you retain more sequences per sample) while excluding as few samples as possible. To do so, the visualization file from the previous step will display two lineplots. The plots will display the alpha diversity (observed features or shannon) as a function of the sampling depth. This is used to determine whether the richness or evenness has saturated based on the sampling depth. The rarefaction curve (select the lineplot 2D view) should “level out” as you approach the maximum sampling depth. This plateau value must be evaluated visually.

Files :

Input : Feature frequencies folder or rarefaction curves analysis (no rarefaction are done on feature counts)

Outputs : Diversity folder, diversities index table, distance matrix table (PCoA compatible) and taxonomic diversity table (stacked barplot-views are available)

Informations :
1 - Diversity indexes

Diversity analyses are available through the q2-diversity plugin, which supports computing alpha and beta diversity metrics, applying related statistical tests, and generating interactive visualizations. We’ll first apply the core-metrics-phylogenetic method, which rarefies the previous feature table to a user-specified depth, computes several alpha and beta diversity metrics, and generates principal coordinates analysis (PCoA) plots using Emperor for each of the beta diversity metrics. The metrics computed by default are:

Alpha diversity

- Shannon’s diversity index (a quantitative measure of community richness)

- Observed Features (a qualitative measure of community richness)

- Faith’s Phylogenetic Diversity (a qualitiative measure of community richness that incorporates phylogenetic relationships between the features)

- Evenness (or Pielou’s Evenness; a measure of community evenness)

Beta diversity

- Jaccard distance (a qualitative measure of community dissimilarity)

- Bray-Curtis distance (a quantitative measure of community dissimilarity)

- Unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)

- Weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features)

2 - Taxonomic classification

For the classification, we are using a pre-trained Machine-learning-based scikit-learn classifiers, that learn which features best distinguish each taxonomic group, adding an additional step to the classification process. In our case, we are using a Naive Bayes classifier pre-trained on the database Greengenes 13_8 99% OTUs full-length sequences and NCBI-16s full-lenght sequences (from 2022/07/12).

STEP 5 - Samples differential analysis

To use this task (Task: uBiome - Qiime2 ANCOM differential analysis), you need to specify the taxonomic level to perform the ANCOM test. This will allow you to assess which taxa have significantly different abundances among your samples.

Files :

Input : Diversity folder
Outputs : ANCOM test tables (ANCOM test result table, volcano plot table and percent abundance table)

Informations :

ANCOM can be applied to identify features that are differentially abundant (i.e. present in different abundances) across sample groups. As with any bioinformatics method, you should be aware of the assumptions and limitations of ANCOM before using it. We recommend reviewing the ANCOM paper before using this method (paper here).

STEP 6 - Functional analysis prediction

this task permit to predict functional analysis of 16s rRNA data . It uses PICRUSt2 : Phylogenetic Investigation of Communities by Reconstruction of Unobserved States.It wraps a number of tools to generate functional predictions based on 16S rRNA gene sequencing data.

Files :

Input 1: ASV-sequences.fasta generated using Q2FeatureInference task)
Input 2 : CSV table of the abundance of each ASV across each sample which is asv_table.csv (generated using Taxonomy_Diversity task )
Outputs :
EC_metagenome_out : Folder containing unstratified EC number metagenome predictions. (pred_metagenome_unstrat.tsv.gz), sequence table normalized by predicted 16S copy number abundances. (seqtab_norm.tsv.gz), and the per-sample NSTI values weighted by the abundance of each ASV (weighted_nsti.tsv.gz).
KO_metagenome_out : As EC_metagenome_out above, but for Kegg orthology metagenomes.
pathways_out : Folder containing predicted pathway abundances and coverages per-sample, based on predicted EC number abundances.

Informations :

PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) is a software for predicting functional abundances based only on marker gene sequences. Check out the paper here .
"Function" usually refers to gene families such as KEGG orthologs