Introduction

WGS with SqueezeMeta: with assembly


Whole Genome Shotgun (WGS) metagenomic sequencing samples all the genes of all the organisms present in a given complex sample. This method allows microbiologists to assess bacterial diversity and measure the abundance of microbes in various environments. WGS metagenomics also provides a means of studying non-cultivable microorganisms that are otherwise difficult or impossible to analyse.

This type of sequencing data can be analysed in two ways: (1) a mapping analysis against a database of reference genes, or (2) a metagenomic assembly followed by annotation of the assembled sequences and identification of the taxa present.

Data upload and preparation

Input fastq folder

One must upload one folder containing all the sequencing data. You must select the following format: Fastq folder.

Sample file

One must upload one file with all the information on the samples. You must select the following format: File.

sample_file_list, expected format:
                          Sample1       readfileA_1.fastq       pair1
                          Sample1       readfileA_2.fastq       pair2
                          Sample1       readfileB_1.fastq       pair1
                          Sample1       readfileB_2.fastq       pair2
                          Sample3       readfileD_1.fastq       pair1       noassembly
                          Sample3       readfileD_2.fastq       pair2       noassembly
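The expected layout can be checked before launching the pipeline. Below is a minimal sketch (not part of SqueezeMeta; the function name is ours) that parses lines with the columns shown above: sample name, read file, pair1/pair2, and an optional "noassembly" flag marking samples excluded from the assembly.

```python
# Illustrative parser for the sample file layout shown above.
def parse_sample_file(lines):
    """Return {sample: [(readfile, pair, noassembly), ...]}."""
    samples = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) not in (3, 4):
            raise ValueError(f"expected 3 or 4 columns, got: {line!r}")
        sample, readfile, pair = fields[:3]
        if pair not in ("pair1", "pair2"):
            raise ValueError(f"third column must be pair1 or pair2: {line!r}")
        noassembly = len(fields) == 4 and fields[3] == "noassembly"
        samples.setdefault(sample, []).append((readfile, pair, noassembly))
    return samples

rows = [
    "Sample1\treadfileA_1.fastq\tpair1",
    "Sample1\treadfileA_2.fastq\tpair2",
    "Sample3\treadfileD_1.fastq\tpair1\tnoassembly",
]
parsed = parse_sample_file(rows)
print(sorted(parsed))           # ['Sample1', 'Sample3']
print(parsed["Sample3"][0][2])  # True: excluded from the assembly
```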


Assembly to annotation

The task gws_metag - SqueezeMeta Pipeline: assembly to annotation runs all the steps one after another, but the individual steps can be described explicitly, as follows.

The first thing the pipeline does is clean the sequences: adapter removal, trimming and filtering based on quality scores with the Trimmomatic algorithm. The cleaning option lets you skip this step if the reads have already been cleaned.

One must then choose how to perform the assembly with the assembly type option. The co-assembly mode pools all the samples together and performs a single assembly. Another mode, called merged, performs an assembly for each sample and then merges the resulting contigs using the CD-HIT tool.

The following step is the assembly of reads into contigs. Several assemblers are available, such as Megahit and SPAdes; the assembler can be selected with the assembler option.

One can choose the minimum length of contigs to keep with the contig_length option.

The Prinseq algorithm is used to filter the contigs by length and discard the short ones.
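The length filter itself is simple. The sketch below illustrates the idea (it is not Prinseq's implementation; the function name and data layout are ours): contigs shorter than the chosen contig_length are discarded.

```python
# Illustrative only: discard contigs shorter than a minimum length,
# as the pipeline does when filtering the assembly.
def filter_contigs(contigs, min_length=200):
    """contigs: dict {contig_id: sequence}; keep sequences >= min_length."""
    return {cid: seq for cid, seq in contigs.items() if len(seq) >= min_length}

contigs = {"c1": "A" * 500, "c2": "A" * 150, "c3": "A" * 200}
kept = filter_contigs(contigs, min_length=200)
print(sorted(kept))  # ['c1', 'c3']
```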

The contigs will then be annotated with a gene prediction software, Prodigal, which also retrieves the corresponding amino acid sequences. 16S rRNA sequences are also detected with Barrnap and then classified with the RDP classifier.

An optional step provides extra-sensitive detection of ORFs by performing a second pass with BLASTX. To enable it, the doublepass option has to be selected.

The following step is homology searching with the Diamond software. The predicted genes are searched against taxonomic databases such as the GenBank nr database and the eggNOG database for COG (Clusters of Orthologous Groups) annotation (which can be enabled or not with the clustering_orthologous_groups option). The genes are also classified using a functional database, Pfam, with the HMMER3 tool.

The results of the previous step are used to proceed to taxonomic and functional assignments. The taxonomic assignment is made with an LCA (Lowest Common Ancestor) algorithm.
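The LCA idea can be sketched as follows (this is the general principle, not SqueezeMeta's exact implementation; function and variable names are ours): given the taxonomic lineages of a gene's database hits, walk the ranks from the root downwards and keep the deepest rank on which all hits still agree.

```python
# Hedged sketch of a Lowest Common Ancestor (LCA) assignment.
def lca(lineages):
    """lineages: lists ordered from root to leaf (e.g. domain..genus)."""
    consensus = []
    for ranks in zip(*lineages):          # compare hits rank by rank
        if len(set(ranks)) == 1:          # all hits agree at this rank
            consensus.append(ranks[0])
        else:
            break                         # first disagreement ends the walk
    return consensus

hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rhizobium"],
]
print(lca(hits))  # ['Bacteria', 'Proteobacteria']
```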

The functional assignment is made with the fun3 algorithm.

Following the gene taxonomic assignment, a consensus assignment will then be made for the contigs. A contig will be annotated with the taxon assigned to most of the genes it contains. A disparity score is computed to assess the purity of the contig, which can lead to the exclusion of contaminated contigs.
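The consensus step above can be sketched like this (an illustration under our own simplified definition of disparity as the fraction of genes disagreeing with the consensus; it is not SqueezeMeta's exact formula):

```python
# Sketch: assign a contig to the taxon carried by most of its genes,
# and report how many genes disagree with that consensus.
from collections import Counter

def contig_consensus(gene_taxa):
    """gene_taxa: one taxon label per gene on the contig."""
    counts = Counter(gene_taxa)
    taxon, n = counts.most_common(1)[0]
    disparity = 1 - n / len(gene_taxa)   # fraction of disagreeing genes
    return taxon, disparity

taxon, disparity = contig_consensus(
    ["Escherichia", "Escherichia", "Escherichia", "Salmonella"])
print(taxon)      # Escherichia
print(disparity)  # 0.25 — one gene in four disagrees
```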

To evaluate the abundance of each gene and contig in each sample, SqueezeMeta relies on mapping the original reads to the assembly (the contigs). Several algorithms can be used, such as BWA (Burrows-Wheeler Aligner), Bowtie2 or Minimap2-sr. The mapper can be chosen with the mapper software option.

Bedtools will then be applied to retrieve the number of reads mapped to each gene and contig.

Average coverage and RPKM values are computed for gene and contig abundance information.
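The RPKM metric (Reads Per Kilobase per Million mapped reads) normalises the raw read counts by feature length and sequencing depth. The formula is standard; the function name and example numbers below are ours:

```python
# RPKM: reads mapped to a feature, normalised by feature length (kb)
# and by the total mapped reads in the sample (millions).
def rpkm(mapped_reads, feature_length_bp, total_mapped_reads):
    return mapped_reads / (feature_length_bp / 1_000) / (total_mapped_reads / 1_000_000)

# A 2 kb gene with 200 mapped reads, in a sample of 10 million mapped reads:
print(rpkm(200, 2_000, 10_000_000))  # 10.0
```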

The next step in the analysis is binning, which associates sequences with organisms using the coverage values. Different binning algorithms are used, such as MaxBin and MetaBAT2, and their results are merged with DAS Tool.

Average coverage and RPKM values are also computed for the bins.

In the same way that contigs were assigned to a taxon using the gene assignments, consensus taxonomic assignments are made for the bins. A disparity score is also computed to assess, in each bin, how many contigs disagree with the bin consensus.

The CheckM tool is then used to evaluate the bins' completeness, contamination, and heterogeneity.


The first parameter to set is the number of threads used to run the analysis. One can use as many threads as are available to shorten the run time.

Then one can choose a speed with the speed mode option: fast, average, slow or slow+.

  • Fast: assembly type: co-assembly, assembler: Megahit (with 2 kmers), contig_length: 1000, doublepass: no, no Pfam annotation
  • Average: assembly type: co-assembly, assembler: Megahit (with 4 kmers), contig_length: 500, doublepass: no
  • Slow: assembler: Megahit (with 6 kmers)
  • Slow+: all choices left to the user
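The presets above can be summarised as a lookup table (a convenience sketch; the keys mirror the option names used in this document, and an empty entry means all choices are left to the user):

```python
# Speed-mode presets, as described in the bullet list above.
SPEED_PRESETS = {
    "fast":    {"assembly_type": "co-assembly", "assembler": "megahit",
                "kmers": 2, "contig_length": 1000, "doublepass": False,
                "pfam_annotation": False},
    "average": {"assembly_type": "co-assembly", "assembler": "megahit",
                "kmers": 4, "contig_length": 500, "doublepass": False},
    "slow":    {"assembler": "megahit", "kmers": 6},
    "slow+":   {},  # all choices left to the user
}
print(SPEED_PRESETS["fast"]["contig_length"])  # 1000
```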

Files: