This documentation is under construction. Please send us your feedback at hub@gencovery.com.
Introduction
RNA sequencing (RNA-Seq) is used to study the transcriptome of living organisms. This high-throughput sequencing technique measures the expression of an organism's genes, which makes it possible to compare the transcriptomes of individuals under different conditions, or gene expression between different organs, at different times, etc.
Another major advantage of this technology is its ability to discover new isoforms, alleles and mutations (SNPs, InDels) within the sequenced genes. RNA-Seq is also an excellent tool for refining the annotation of assembled genomes.
This documentation presents the main steps of an RNA-Seq mapping analysis.
Data upload and preparation
Input folder
Upload a single folder containing all the sequencing data to the Databox, and select the following format: Fastq folder.
Fastq folder: a folder containing all the sequencing data in fastq format, regardless of the sequencing strategy (paired-end or single-end).
Making the ready-to-use metadata file
The uBiome Make metadata file task automatically generates a ready-to-use metadata file when given a fastq folder as input. Once the metadata file is generated, you can add specific metadata columns in the expected file format (see below).
Metadata file: metadata can include technical details, such as descriptions of samples, treatments, subjects, time points, body sites, etc. The Gencovery metadata file must be a tab-separated file with specific header rows and mandatory columns.
Example:
#author: Paulson, Robert
#data: 1996/08/17
#project: Chaos
#types_allowed:categorical or numeric
#column-type categorical categorical categorical
sample-id forward-absolute-filepath reverse-absolute-filepath Treatment
Sample_1 Sample_1.forward.fq.gz Sample_1.reverse.fq.gz ctrl
Sample_2a Sample_2a.forward.fq.gz Sample_2a.reverse.fq.gz T1
Sample_2b Sample_2b.forward.fq.gz Sample_2b.reverse.fq.gz T1
Sample_3a Sample_3a.forward.fq.gz Sample_3a.reverse.fq.gz T2
Sample_3b Sample_3b.forward.fq.gz Sample_3b.reverse.fq.gz T2
The first six rows and the sample-id column are mandatory. The type of each column must be set according to the data it contains (categorical or numeric). In this example, the Treatment column has been added in the required format (a Treatment column header plus a matching categorical entry in the #column-type row).
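The layout described above can be checked with a few lines of Python before uploading the file. This is only a minimal sketch, not the parser used by the platform: the function name parse_metadata is hypothetical, and the rule that the #column-type row lists one type per column other than sample-id is an assumption inferred from the example.

```python
ALLOWED_TYPES = {"categorical", "numeric"}


def parse_metadata(text):
    """Parse a tab-separated metadata file laid out like the example above.

    Returns (header comments, {column name: column type}, list of sample rows).
    Hypothetical helper written for illustration only.
    """
    comments, column_types, columns, samples = {}, [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        cells = [cell.strip() for cell in line.split("\t")]
        if cells[0] == "#column-type":
            column_types = [cell for cell in cells[1:] if cell]
        elif cells[0].startswith("#"):
            # Header comment such as "#project: Chaos"
            key, _, value = cells[0][1:].partition(":")
            comments[key.strip()] = value.strip()
        elif columns is None:
            columns = cells  # first non-comment row holds the column names
        else:
            samples.append(dict(zip(columns, cells)))
    if columns is None or columns[0] != "sample-id":
        raise ValueError("the first column must be 'sample-id'")
    # Assumption: one declared type per column other than sample-id.
    if len(column_types) != len(columns) - 1:
        raise ValueError("one column type is expected per non-id column")
    unknown = set(column_types) - ALLOWED_TYPES
    if unknown:
        raise ValueError(f"unsupported column types: {sorted(unknown)}")
    return comments, dict(zip(columns[1:], column_types)), samples


EXAMPLE = "\n".join([
    "#author: Paulson, Robert",
    "#data: 1996/08/17",
    "#project: Chaos",
    "#types_allowed:categorical or numeric",
    "#column-type\tcategorical\tcategorical\tcategorical",
    "sample-id\tforward-absolute-filepath\treverse-absolute-filepath\tTreatment",
    "Sample_1\tSample_1.forward.fq.gz\tSample_1.reverse.fq.gz\tctrl",
    "Sample_2a\tSample_2a.forward.fq.gz\tSample_2a.reverse.fq.gz\tT1",
])

comments, types, samples = parse_metadata(EXAMPLE)
```

Running the sketch on a file before upload is a cheap way to catch a missing sample-id column or a misspelled column type.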
Protocol
STEP 0 - Reads quality check
This step is optional. The task (OmiX - Fastqc quality check) lets you visually inspect the sequencing quality of a sequencing dataset.
Files:
Input: Fastq folder
Output: Boxplot quality figure(s)
Information:
This first step generates a figure with one boxplot per base position (i.e. a FastQC-type figure). Based on this figure, low-quality read positions (e.g. Phred score below 20) should be trimmed.
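To make the Phred threshold concrete: a Phred score Q corresponds to a base-calling error probability of 10^(-Q/10), so a score of 20 means a 1% chance that the base is wrong. The short sketch below illustrates this conversion. The function names are hypothetical, and the quality-string decoding assumes the common Phred+33 encoding, which may not match older datasets.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score into a base-calling error probability."""
    return 10 ** (-q / 10)


def decode_quality_string(quality, offset=33):
    """Decode a fastq quality string into Phred scores.

    Assumption: Phred+33 encoding (ASCII code minus 33).
    """
    return [ord(char) - offset for char in quality]


scores = decode_quality_string("I5!")
error_at_q20 = phred_to_error_probability(20)
```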
STEP 1 - Reads quality check and trimming
This step (task: OmiX - Trimgalore quality trimming) assesses the sequencing quality of a dataset and removes low-quality regions of reads as well as short reads (parameters: quality = 0 to 40 [default = 20]; maximum number of unknown bases [default = no filter]; minimum size [default = 20 bp]). For paired-end projects, singletons (i.e., reads whose mate was discarded) can be kept (parameter: singleton = Yes|No).
Files:
Inputs: Fastq folder and a Metadata file
Output: Trimmed fastq folder
Information:
When raw sequencing data are produced, the first step is to assess the sequencing quality and to trim low-quality regions.
Trim Galore is a wrapper that uses the publicly available tool Cutadapt for adapter and quality trimming, and FastQC for optional quality control once trimming is complete.
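The effect of the quality and minimum size parameters can be illustrated with a simplified sketch of 3' quality trimming. It loosely follows the running-sum approach described in the Cutadapt documentation: the cutoff is subtracted from each quality score, the differences are summed from the 3' end, and the read is cut where that sum is lowest. This is an illustration under those assumptions, not the code run by the task, and the function names are hypothetical.

```python
def quality_trim_index(qualities, cutoff=20):
    """Return the number of bases to keep after 3' quality trimming.

    Scans from the 3' end, accumulating (quality - cutoff). The read is cut
    at the position where the running sum is lowest; scanning stops as soon
    as the running sum becomes positive.
    """
    running_sum = 0
    lowest_sum = 0
    cut = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:
            break
        if running_sum < lowest_sum:
            lowest_sum = running_sum
            cut = i
    return cut


def trim_read(sequence, qualities, cutoff=20, min_length=20):
    """Trim a read and return None when it ends up shorter than min_length."""
    cut = quality_trim_index(qualities, cutoff)
    if cut < min_length:
        return None  # read discarded as too short
    return sequence[:cut], qualities[:cut]


example_qualities = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
kept = trim_read("ACGTACGTAC", example_qualities, cutoff=10, min_length=4)
discarded = trim_read("ACGTACGTAC", example_qualities, cutoff=10, min_length=5)
```

With a cutoff of 10, the example read keeps its first four bases. Note that the isolated score of 11 near the 3' end does not rescue the low-quality tail, which is the point of summing rather than stopping at the first base above the cutoff.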
STEP 2.a - Genome indexing
To perform genomic mapping, you first need to index your genome data (task: OmiX - STAR genome index). There are two options: use a genome sequence already indexed by Gencovery, or provide a genome sequence file as input (fasta format). For the latter option, we advise you to use genomes from reference databases (Ensembl, NCBI, EBI...), which also provide the annotation file for these genomes (i.e., the positions of the genes in the genome). For more information on the annotation file format, go to the Ensembl website. To download the files directly, go to:
Files:
Inputs: Genome fasta file and annotation gtf file
Output: Genome index folder
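To show what the annotation file contributes, the sketch below reads gene positions from GTF text. It relies on the standard GTF layout: nine tab-separated fields (sequence name, source, feature, start, end, score, strand, frame, attributes), with 1-based inclusive coordinates. The function name and the gene identifiers in the example are made up for illustration.

```python
def gene_positions(gtf_text):
    """Return {gene_id: (sequence name, start, end, strand)} for 'gene' rows."""
    genes = {}
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # keep only well-formed gene features
        seqname, start, end, strand, attributes = (
            fields[0], fields[3], fields[4], fields[6], fields[8]
        )
        gene_id = None
        # Attributes look like: gene_id "X"; gene_name "Y";
        for attribute in attributes.split(";"):
            key, _, value = attribute.strip().partition(" ")
            if key == "gene_id":
                gene_id = value.strip('"')
        if gene_id is not None:
            genes[gene_id] = (seqname, int(start), int(end), strand)
    return genes


# Hypothetical annotation lines, for illustration only.
EXAMPLE_GTF = "\n".join([
    "#!example header",
    '1\texample\tgene\t1000\t5000\t.\t+\t.\tgene_id "GENE_A"; gene_name "geneA";',
    '1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "GENE_A"; exon_number "1";',
    '2\texample\tgene\t300\t900\t.\t-\t.\tgene_id "GENE_B";',
])

positions = gene_positions(EXAMPLE_GTF)
```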
STEP 2.b - Genome mapping
The sequencing datasets contained in the uploaded fastq folder are mapped (task: OmiX - STAR genome mapping) onto the previously indexed genome. A metadata file must be provided to perform this step.
Files:
Inputs: Trimmed fastq folder, Genome index folder and metadata file (see the Making the ready-to-use metadata file section)
Output: Mapping files folder
Information:
Once high-quality data have been obtained from the previous pre-processing steps, the next step is genomic read mapping. Genome-wide read mapping does not require prior knowledge of the transcribed regions or of how the exons are spliced together. This unbiased approach makes it possible to discover new, unannotated transcripts, or even new genes when the reference genome sequence is slightly distant from the species sequenced by RNA-Seq.
STEP 3 - Gene transcription quantification
To evaluate gene expression after read mapping (with STAR, see Step 2.b) using the Salmon suite (task: OmiX - Salmon quantification STAR), you only need to provide the output folder of the previous step (the Mapping files folder, which contains the mapping files and the previously used genome and annotation files).
Files:
Input: Mapping files folder
Output: Gene expression files (raw count and TPM-normalised files)
Information:
This step uses the Salmon quant method to quantify gene expression and reports it in two formats: raw counts and TPM (library-size normalised) values. The raw count file must be used for the differential expression analysis in the following step.
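The difference between the two output formats can be seen in a small worked sketch of the TPM (transcripts per million) calculation: each count is divided by the gene length, and the resulting rates are rescaled so that they sum to one million. This is a simplified illustration. Salmon itself works with estimated effective lengths, and the function name and the numbers below are hypothetical.

```python
def counts_to_tpm(counts, lengths):
    """Convert raw counts to TPM using per-gene lengths (in bases).

    Simplified sketch: uses the plain lengths given, not effective lengths.
    """
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


# Hypothetical values: g2 has three times the reads but is three times longer.
raw_counts = {"g1": 100, "g2": 300}
gene_lengths = {"g1": 1000, "g2": 3000}
tpm = counts_to_tpm(raw_counts, gene_lengths)
```

In this example both genes receive the same TPM value even though their raw counts differ, because the longer gene is expected to produce proportionally more reads.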
STEP 4 - Gene expression differential analysis
Once gene expression has been quantified, the raw count file is processed by the R package DESeq2 (task: OmiX - Deseq2 differential analysis). For this step, you need to specify the metadata column used for the comparison. Pairwise comparison files (with or without p-value filters) will be available in the output resource set.
Files:
Input: Raw count gene expression file
Output: Lists of differentially expressed genes (pairwise comparisons)
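As an illustration of what pairwise comparison means for a chosen metadata column, the sketch below lists every pair of distinct values found in that column. It uses the Treatment values from the metadata example. The function name is hypothetical, and the order in which the task actually reports the comparisons is not specified here.

```python
from itertools import combinations


def pairwise_comparisons(samples, column):
    """List every pair of distinct levels found in a metadata column."""
    levels = []
    for sample in samples:
        level = sample[column]
        if level not in levels:
            levels.append(level)  # keep first-seen order
    return list(combinations(levels, 2))


metadata_rows = [
    {"sample-id": "Sample_1", "Treatment": "ctrl"},
    {"sample-id": "Sample_2a", "Treatment": "T1"},
    {"sample-id": "Sample_2b", "Treatment": "T1"},
    {"sample-id": "Sample_3a", "Treatment": "T2"},
    {"sample-id": "Sample_3b", "Treatment": "T2"},
]

pairs = pairwise_comparisons(metadata_rows, "Treatment")
```

With three treatment levels, three comparisons are expected: ctrl against T1, ctrl against T2, and T1 against T2.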
Information:
The DESeq2 package is designed for normalization and differential analysis of high-dimensional count data. It makes use of empirical Bayes techniques to estimate priors for log fold change and dispersion, and to calculate posterior estimates for these quantities.
For more information click here.
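To give an idea of the normalization that DESeq2 applies before testing, the sketch below computes median-of-ratios size factors in plain Python. For each gene with non-zero counts in all samples, the count in a sample is compared with the geometric mean of that gene across samples, and the size factor of a sample is the median of these ratios. This is a simplified illustration of the idea, not the package's implementation. The function name and the count matrix are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (one per gene), each a list of raw counts
    (one per sample). Genes with a zero count in any sample are skipped.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(count <= 0 for count in row):
            continue
        # Log of the geometric mean of this gene across samples
        log_geomean = sum(math.log(count) for count in row) / n_samples
        for j, count in enumerate(row):
            log_ratios[j].append(math.log(count) - log_geomean)
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Hypothetical matrix: the second sample was sequenced twice as deeply.
count_matrix = [
    [10, 20],
    [100, 200],
    [30, 60],
    [0, 5],  # skipped: zero count in the first sample
]

factors = size_factors(count_matrix)
normalized = [
    [count / factor for count, factor in zip(row, factors)]
    for row in count_matrix
]
```

Dividing each raw count by the size factor of its sample puts the two samples on a common scale, which is why the raw count file, not the TPM file, is the expected input of this step.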