Introduction Version

WGS with SqueezeMeta : with assembly

This documentation is under construction. Please give us your feedbacks by contacting us at hub@gencovery.com.


Whole Genome Shotgun (WGS) metagenomic sequencing allows to sample all the genes of all the organisms present in a given complex sample. This method allows microbiologists to assess bacterial diversity and detect the abundance of microbes in various environments. WGS metagenomics also provides a means of studying non-cultivatable microorganisms that are otherwise difficult or impossible to analyse. 
This type of sequencing data can be used in a so-called (1) mapping analysis against a database of reference genes or (2) metagenomic assemblies followed by annotation of the assembled sequences and identification of the taxa present.
The pipeline used here is called SqueezeMeta, with an assembly of the reads into contigs. Performing an assembly makes the computational time longer but on the other hand, the sequences to annotate will be larger and therefore, taxonomic, and functional assignment of genes will be more accurate

Data upload and preparation

Input fastq folder

One must upload one folder with all the sequencing data using the Databox. You must select the following format: Fastq folder.
Fastq folder : A folder containing all the sequencing data in fastq format, regardless of the sequencing strategy (paired or not).

Sample file

One must upload one file with all the information on the samples using the Databox. You must select the following format: File.

sample_file_list, expected format :
                          Sample1        readfileA_1.fastq       pair1
                          Sample1        readfileA_2.fastq      pair2
                          Sample1        readfileB_1.fastq       pair1
                          Sample1        readfileB_2.fastq      pair2
                          Sample3       readfileD_1.fastq       pair1       noassembly
                          Sample3       readfileD_2.fastq      pair2      noassembly

The sample file has to be tab separated.


Assembly to annotation

This task: gws_metag- SqueezeMeta Pipeline : assembly to annotation proceeds to all the steps one after another but the different steps can be explicited.

Cleaning of the reads

The first thing the pipeline does is cleaning the sequences: adapter removal, trimming and filtering based on quality scores with the Trimmomatic algorithm. The cleaning option lets you decide if wanting to proceed to a cleaning in the case one was already done.


The first thing to choose is how one wants to perform the assembly with the assembly type option. The co-assembly mode pools all the samples together and performs a single assembly. Another mode is called merged : assembly is performed for each sample and then, contigs are merged using CD-HIT tool.
If the dataset is composed of big samples, co-assembly could crash because of memory requirements.That’s when using the merge mode can be useful since it can perform a co-assembly for a large number of samples. More often than not, it’s better to use co-assembly because the risk of creating chimeric contigs is higher in merged mode.
The following step is the assembly of reads into contigs. Several assemblers can be used such as Megahit and SPAdes. It can be selected with the assembler option.
It is advised to use Megahit as research in literature showed that SPAdes has big memory requirements compared to Megahit.
Megahit has the best compromise regarding contig size, the diversity in the samples and the required memory. One of the drawbacks is a bias towards low coverage genomes.
To produce the best assembly, Megahit uses a list of k-mer length ([21,29,39,59,79,99,119,141]) and builds de Bruijn graphs for each of them, which make the assembly step significatively long. The list of kmers can be shortened with the speed option.
One can choose the minimum length of contigs to keep with the contig_lengthoption.

Gene and rRNA prediction

The Prinseq algorithm is used to filter the contigs by length and discard the short ones.
The contigs will then be annotated with a gene prediction software, Prodigal, which also retrieves the corresponding amino acid sequences. 16S rRNA sequences are also looked for via Barrnap and then classified with RDP classifier .
An optional step is to have an extra-sensitive detection of ORFs by doing a second pass by performing a BLASTX. To do so the doublepass option have to be selected.

Homology searching

The following step is homology searching with the Diamond software. The found genes are searched in taxonomic databases such as the Genbank nr database, eggNOG database for COG (Clusters of Orthologous Groups) annotation (possibility to select it or not with the clustering_orthologous_groups option) . The genes are also classified using a functional database, Pfam with the HMMER3 tool.

Taxonomic and functional assignment of genes

The results of the previous step are used to proceed to taxonomic and functional assignments. The taxonomic assignment is made with an LCA (Lowest Common Ancestor) algorithm.
LCA algorithm :
A hit must be above a threshold identity level for being assigned to a taxonomic rank, i.e. 85, 60, 55, 50, 46, 42, and 40% for species, genus, family, order, class, phylum, and superkingdom ranks, respectively. No classification will be made if the identity level is below 40%.
The functional assignment is made with the fun3 algorithm.

Fun3 algorithm :
It uses the results of the homology search (with Diamond), keeping only the hits below the set evalue or below the set identity. Fun3 gives two types of classification: one is the best hit (functional id of the highest scoring hit). The other is the best average hit that evaluates if the found functional id is significantly better than the others. To do so, it takes the first n hits (by default n=5) and calculates the average bit-score.
The gene will be assigned to the functional id with the highest average bitscore that is higher than a given percentage ( 10% by default) compared to the score of the one after. 
It has been shown that this method provides less assignments but on the other hand is more precise by having less wrong assignments inside closely related proteins. 
In the gene table, only the best hit is shown but there will be in addition the * sign if the assignment is also supported by the best average method.

Taxonomic assignment of contigs

Following the gene taxonomic assignment, a consensus assignment will then be made for the contigs. A contig will be annotated to the taxon assigned to most of the genes it contains. A disparity score is computed to assess the purity of the contig that can lead to potential exclusions of contaminated contigs.


To evaluate the abundance of each gene and contig in each sample, SqueezeMeta relies on the mapping of original reads to the assembly (containing contigs). Several algorithms can be used like BWA(Burrows-Wheeler Algorithm), Bowtie2 or Minimap2-sr. One can choose its algorithm with the mapper software option.
Bedtools will then be applied to retrieve the numbers of reads mapped to each gene and contig.
Average coverage and RPKM values are computed for gene and contig abundance information.


The next step in the analysis is the binning, that will associate a sequence with an organism using the coverages values. Different binning algorithms are used such as Maxbin, Metabat2. The results are merged with DASTool.
Average coverage and RPKM values are also computed.

Taxonomic assignment of bins

In the same way contigs have been assigned to a taxon using the gene assignments, consensus taxonomic assignments are made for the bins. A disparity score is also computed to assess in each bin how many contigs are not agreeing with the bin consensus.
The CheckM tool is then used to evaluate the bins’s completeness, contamination, and heterogeneity.


The first thing to choose is selecting the number of threads to run the analysis. One can put as much threads as available for the pipeline to take less time.
Then you can choose a speed in the speed mode option. One can choose between fast, average, slow and slow+.

  • Fast : assembly type : co-assembly, assembler: Megahit (with 2 kmers), contig_length: 1000, doublepass : no, no Pfam annotation
  • Average : assembly type : co-assembly, assembler: Megahit (with 4 kmers), contig_length: 500, doublepass : no
  • Slow : assembler: Megahit (with 6 kmers)
  • Slow+ : all choices left to the user

Files :

Input : FastQfolder, Sample file
Outputs : -Set of taxonomic, functional assignment and abundance tables
-Set of measures and statistic tables
-squeeze_meta_pipeline_folder (all files created by the pipeline)
In the Set of taxonomic, functional assignment and abundance tables, the computed abundance is a absolute one.
The pipeline produces one file per taxonomic level (e.g. Class Abundance) with the samples in columns and the taxonomic information in rows.
Other tables were made from the previous ones in order to separate the different superkingdoms in different files and permuting rows and columns.
Those tables are easier if one wants to mqke graphics such as stacked barplots.