Exploring the building blocks of life: An introduction to Genomics, Transcriptomics, Proteomics, Metabolomics, Epigenomics and Metagenomics

Description:

This course introduces the fundamentals of omics technologies, which allow the comprehensive analysis of biological systems at different levels of organisation, from the molecular to the systems level. Students will learn about the six broad categories of omics data, including genomics, transcriptomics, proteomics, metabolomics, epigenomics, and metagenomics, and how they can be analysed using specialised tools and techniques.

Content:

Omics data and its importance in the study of biological systems

Overview of the six broad categories of omics data and their applications

Sequencing and analysis tools for omics data

Computational tools for omics data analysis, including bioinformatics software

Identifying and quantifying genes, transcripts, proteins, and metabolites using omics data

Understanding the relationships between different components of biological systems using omics data

Identifying patterns in omics data that may be relevant to disease development and progression

Advances in the fields of genetics, biology, and medicine through the use of omics technologies

Outcome:

Upon completion of this course, students will have a strong foundation in the principles of omics technologies and their applications in the study of biological systems. They will be familiar with the sequencing and analysis tools used in omics research, and be able to use computational tools for data analysis. Students will also have an understanding of how omics data can be used to identify patterns that may be relevant to disease development and progression, and how omics technologies have advanced our understanding of genetics, biology, and medicine.

1/ Genomic sequencing and analysis tools

Genomics is the study of the complete set of genetic information (the genome) of an organism. It includes the analysis of the structure, function, evolution and mapping of all the genes in an organism. The goal of genomics is to understand how an organism's genes contribute to its overall phenotype, including its physical and behavioural characteristics.

Some of the most widely used genomic sequencing technologies today include:

Illumina: Illumina sequencing is a popular short-read technology that generates millions of short reads from fragmented DNA. These reads are then assembled de novo or aligned to a reference genome to reconstruct the sequence. Illumina is widely used because of its high throughput and relatively low cost.

PacBio: PacBio sequencing uses single-molecule real-time (SMRT) sequencing to generate long DNA reads. It is known for its accuracy and its ability to produce long contiguous sequences, which makes it useful for applications such as de novo genome assembly.

Oxford Nanopore: Oxford Nanopore sequencing reads a DNA molecule as it passes through a nanopore, generating long reads in real time. It is known for very long read lengths and for the portability of its devices, which makes it useful for sequencing in the field.

1-1/ Genomic data analysis tools

The most widely used bioinformatics tools in genomics today include:

BLAST (Basic Local Alignment Search Tool): BLAST is a widely used tool for sequence alignment, which is the process of finding similarities between sequences. BLAST is used to identify sequences that are similar to a query sequence, such as a gene of interest, in a sequence database, such as a genome.

BWA (Burrows-Wheeler Aligner): BWA is a tool for aligning short-read sequencing data to a reference genome. BWA is used to match short reads generated by sequencing technologies, such as Illumina, to the genome, allowing researchers to identify the origin of the reads.

SAMtools (Sequence Alignment/Map Tools): SAMtools is a suite of tools for processing sequence alignment data in SAM (Sequence Alignment/Map) format. SAMtools is used for a variety of tasks, including sorting, merging and indexing alignments, and generating statistics on alignment data.

GATK (Genome Analysis Toolkit): GATK is a toolkit for the analysis of high-throughput sequencing data, including data generated by genome sequencing technologies. GATK is used for a variety of tasks, including variant calling (identifying differences between a genome and a reference genome), quality control and data pre-processing.

BEDtools: BEDtools is a suite of utilities for comparing and manipulating sets of genomic intervals, such as genes, transcripts and regulatory regions, defined on the same genome assembly. It is typically used with files in BED (Browser Extensible Data) format, a plain-text format for representing genomic intervals or regions. Typical tasks include finding overlaps between two sets of genomic regions, merging adjacent intervals and computing coverage, which makes it a common building block in functional genomics analyses.

SPAdes: SPAdes (St. Petersburg genome assembler) is a tool for de novo assembly of genomes from high-throughput sequencing data. It works primarily from short reads and can combine them with long reads in hybrid assemblies. SPAdes is known for producing accurate and complete assemblies of small genomes and is widely used for bacterial, fungal and viral genomes; it is less commonly applied to large plant and animal genomes.
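The interval comparison described for BEDtools above can be illustrated with a minimal sketch. This is not BEDtools code: the function name `intersect` and the example coordinates are hypothetical, and the only assumption taken from the BED format is that intervals are 0-based and half-open (the end coordinate is excluded).

```python
from typing import List, Tuple

# (chromosome, start, end), 0-based and half-open as in the BED format
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping portion of every pair of intervals from a and b."""
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue  # intervals on different chromosomes never overlap
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # touching intervals (start == end) share no base
                overlaps.append((chrom_a, start, end))
    return overlaps


# Hypothetical example data: gene bodies and regions of interest
genes = [("chr1", 100, 200), ("chr1", 500, 700), ("chr2", 50, 150)]
regions = [("chr1", 150, 550), ("chr2", 150, 300)]
overlaps = intersect(genes, regions)
print(overlaps)
```

The nested loop compares every pair and is only suitable for small inputs; production tools typically sort or index the intervals so that large files can be compared efficiently.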

2/ Transcriptomic sequencing and analysis tools

Transcriptomics is the study of the complete set of RNA transcripts (the transcriptome) produced by the genes of an organism. Transcriptomics provides a better understanding of gene expression, the process by which DNA is used to produce RNA and proteins. By examining the transcriptome, researchers can understand which genes are turned on or off in response to different stimuli, such as changes in the environment or the development of a disease.

2-1/ Transcriptome sequencing tools

Transcriptome sequencing tools include:

Traditional RNA-seq: this is the most common form of transcriptome sequencing. RNA is reverse transcribed into complementary DNA (cDNA), which is then sequenced at high throughput.

Whole transcriptome RNA-seq: this approach captures all RNA species in a sample, including messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA) and non-coding RNA (ncRNA).

Poly(A) RNA-seq: this method enriches for mRNA by selecting only RNA molecules with poly(A) tails. This removes ribosomal RNA and most other non-polyadenylated RNA species, giving a more focused view of protein-coding gene expression at the cost of missing transcripts that lack a poly(A) tail.

Small RNA-seq: this approach is used to study small RNA molecules, such as microRNAs (miRNAs), Piwi-interacting RNAs (piRNAs) and small interfering RNAs (siRNAs).

Stranded RNA-seq: this approach preserves information about the DNA strand from which each transcript originated, so that the direction of transcription can be determined. This is important for separating overlapping genes that are transcribed from opposite strands and for detecting antisense transcription.

Single Cell RNA-seq: this method allows the transcriptome of individual cells to be analysed, which can be useful for understanding cell diversity and identifying cell types.

2-2/ Transcriptomic data analysis tools

Tools for the analysis of transcriptomic data include:

STAR: STAR (Spliced Transcripts Alignment to a Reference) is a popular tool for aligning RNA-seq reads to a reference genome. It is designed to map spliced reads quickly and can handle a range of read lengths and sequencing depths. An optional two-pass mode, in which splice junctions discovered in the first pass guide the second, increases sensitivity for spliced reads. STAR can also detect novel splice junctions and chimeric transcripts.

Salmon: Salmon is a tool for quantifying transcript abundance from RNA-seq data. Rather than performing a full alignment to the genome, it builds a lightweight index of the reference transcriptome and maps reads to it, which makes quantification fast. Transcript abundances are then estimated with a variant of the expectation-maximisation (EM) algorithm. Salmon handles single-end and paired-end reads from stranded or unstranded libraries and scales well to large RNA-seq datasets.

RSEM (RNA-Seq by Expectation Maximization): RSEM is a tool for quantifying gene and transcript expression levels from RNA-seq data. RSEM uses a statistical model to estimate expression levels based on aligned reads and a reference genome or transcriptome.

DESeq2: DESeq2 is an R/Bioconductor package for differential expression analysis of RNA-seq count data. It fits a statistical model based on the negative binomial distribution to identify genes that are differentially expressed between conditions or groups of samples.
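The expectation-maximisation idea mentioned for Salmon and RSEM above can be illustrated with a minimal sketch. It assumes a deliberately simplified model: each read is compatible with one or more transcripts, and the chance that a read comes from a transcript is proportional to that transcript's share of reads divided by its length. The function name `em_abundance` and the toy inputs are hypothetical; the real tools use richer models and this is not their implementation.

```python
from typing import List


def em_abundance(read_compat: List[List[int]], lengths: List[float],
                 n_iter: int = 100) -> List[float]:
    """Estimate the fraction of reads originating from each transcript.

    read_compat[i] lists the indices of the transcripts read i could come from.
    """
    n = len(lengths)
    theta = [1.0 / n] * n  # start from equal read fractions
    for _ in range(n_iter):
        expected = [0.0] * n
        # E-step: split each read among its compatible transcripts
        for compat in read_compat:
            weights = [theta[t] / lengths[t] for t in compat]
            total = sum(weights)
            for t, w in zip(compat, weights):
                expected[t] += w / total
        # M-step: new fractions are the normalised expected read counts
        total_reads = sum(expected)
        theta = [c / total_reads for c in expected]
    return theta


# Toy data: two transcripts of equal length, 3 reads unique to transcript 0,
# 1 read unique to transcript 1 and 4 reads compatible with both.
reads = [[0]] * 3 + [[1]] * 1 + [[0, 1]] * 4
theta = em_abundance(reads, lengths=[1000.0, 1000.0])
print(theta)
```

In this toy case the ambiguous reads end up shared in the same 3:1 proportion as the unique reads, which is the behaviour the EM step is meant to capture.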

3/ Proteomic profiling and analysis tools

Proteomics is the study of all the proteins (the proteome) produced by an organism. Proteomics provides a better understanding of the function of proteins and how they interact with each other to carry out cellular processes. Proteomics is used to identify and quantify proteins and to understand the role they play in biological processes such as development and disease progression.

3-1/ Profiling tools

Proteome profiling refers to the identification and characterisation of the proteins present in a sample. Several techniques can be used, most of them based on mass spectrometry, including:

Tandem mass spectrometry (MS/MS): This is the most widely used method for protein identification. It involves breaking down proteins into smaller peptides in a process called digestion, and then analysing the peptides by mass spectrometry. Peptides are first measured on the basis of their mass-to-charge ratio and then fragmented; their sequence is determined by comparing the fragments generated during MS/MS analysis with those predicted from a database of known protein sequences.

Label-free quantification: This method quantifies the proteins in a sample without the need to label them. Each sample is run through the mass spectrometer separately, and the intensities of the peptide peaks in its spectra are compared with those obtained from a reference sample.

Isotopic and isobaric labelling: This method involves labelling the sample before mass spectrometry analysis, either metabolically with stable isotopes, as in stable isotope labelling by amino acids in cell culture (SILAC), or chemically with isobaric tags, as in isobaric tags for relative and absolute quantitation (iTRAQ). The labels allow the labelled sample to be distinguished from the reference sample within the same run, so that the proteins present in each can be quantified relative to one another.

Multiple reaction monitoring (MRM): This is a targeted mass spectrometry-based method that allows the simultaneous quantification of several peptides or proteins in a sample. It involves selecting specific peptides or proteins of interest and then monitoring the transitions between the precursor ions and the product ions generated during MS/MS analysis.

Nuclear magnetic resonance (NMR): NMR is a spectroscopic technique, not a mass spectrometry method, used to study the structure and dynamics of molecules. It works by measuring the magnetic properties of atomic nuclei in a strong magnetic field and can provide information about molecular structure, including the number, type and arrangement of atoms and their bonds. NMR is widely used in chemistry, biochemistry, pharmacology and medical research. In biochemistry, it is used to study proteins, lipids and other biological molecules. In medicine, the same physical principle underlies magnetic resonance imaging (MRI), which produces detailed images of the human body, including the brain, bones and organs. NMR has also been applied to the analysis of metabolites in metabolomics.
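The digestion and mass-to-charge steps described for tandem mass spectrometry above can be illustrated with a minimal sketch. It assumes trypsin as the digestion enzyme (the text does not name one), using the common rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function names and the example sequence are hypothetical; the residue masses are standard monoisotopic values rounded to five decimals.

```python
from typing import List

# Monoisotopic residue masses in daltons
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once per peptide for the free termini
PROTON = 1.007276   # mass of the charge carrier


def tryptic_digest(protein: str) -> List[str]:
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass of the neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mass_to_charge(mass: float, charge: int) -> float:
    """m/z of a peptide carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


peptides = tryptic_digest("MKRPAKGG")  # hypothetical sequence
for pep in peptides:
    m = peptide_mass(pep)
    print(pep, round(m, 4), round(mass_to_charge(m, 2), 4))
```

A search engine would compare measured values against such predicted masses for every peptide in the database; this sketch stops at the prediction step and ignores missed cleavages and modifications.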

3-2/ Analysis tools

Several data analysis tools can be used to process and interpret the data generated by mass spectrometry-based proteomics. Some of the most commonly used tools are:

MaxQuant: This freely available software is very popular for the analysis of mass spectrometry-based proteomics data. It supports the identification and quantification of peptides and proteins, including label-free workflows and label-based workflows such as SILAC.

MS-GF+: This database search engine identifies peptides by scoring tandem mass spectra against the peptides derived from a protein sequence database. It can be applied to mass spectrometry data obtained with a variety of instruments and experimental protocols; quantification is usually performed by downstream tools.

Perseus: This freely available platform, a companion to MaxQuant, is used for the downstream analysis of quantitative proteomics data, including normalisation, statistical testing and visualisation of results.

Proteome Discoverer: This commercial software from Thermo Fisher Scientific offers comprehensive analysis of mass spectrometry data, including identification, quantification and validation of proteins, together with graphical reports that facilitate the interpretation of results.

4/ Metabolomic profiling and analysis tools

Metabolomics is the study of all the small molecules (metabolites) produced by an organism. Metabolomics is used to understand the metabolic processes of an organism and how they change in response to different stimuli, such as changes in the environment or the development of a disease. Metabolomics is used to identify biomarkers, i.e. measurable changes in metabolites that can be used to diagnose and monitor disease.

4-1/ Profiling tools

Metabolome profiling tools are used to analyse the small-molecule metabolites present in biological samples. The most commonly used technology is mass spectrometry (see Proteomics), usually coupled to a chromatographic separation. Metabolites are extracted from the sample, subjected to various sample preparation techniques and then analysed in the mass spectrometer, which records mass spectra from which the metabolites can be identified and quantified.

There are two main types of mass spectrometry-based metabolomics: targeted and untargeted. Targeted metabolomics measures a predefined set of metabolites, while untargeted metabolomics aims to measure as many of the metabolites present in the sample as can be detected. Both types of analysis have their own strengths and limitations, and the choice of method depends on the research question and the type of sample being analysed.
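The identification step described above, in which a measured mass is matched to a candidate metabolite, can be illustrated with a minimal sketch. It assumes positive-mode ions of the form [M+H]+ and a matching tolerance of 10 ppm, neither of which is specified in the text. The function names and the small candidate list are hypothetical; the atomic masses are standard monoisotopic values.

```python
import re
from typing import Dict, List

# Monoisotopic atomic masses in daltons
ATOM_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401,
             "O": 15.99491462, "P": 30.97376200, "S": 31.97207117}
PROTON = 1.007276

# Hypothetical candidate list: name -> molecular formula
CANDIDATES = {"lactic acid": "C3H6O3", "glucose": "C6H12O6",
              "citric acid": "C6H8O7"}


def monoisotopic_mass(formula: str) -> float:
    """Sum the atomic masses for a simple formula such as 'C6H12O6'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += ATOM_MASS[element] * (int(count) if count else 1)
    return mass


def ppm_error(observed: float, theoretical: float) -> float:
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def annotate(observed_mz: float, candidates: Dict[str, str],
             tol_ppm: float = 10.0) -> List[str]:
    """Names of candidates whose [M+H]+ m/z lies within the tolerance."""
    hits = []
    for name, formula in candidates.items():
        theoretical_mz = monoisotopic_mass(formula) + PROTON
        if abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm:
            hits.append(name)
    return hits


hits = annotate(181.0707, CANDIDATES)
print(hits)
```

Mass alone cannot separate isomers: fructose has the same formula as glucose and would match the same peak, which is one reason chromatographic retention time and fragment spectra are also used for identification.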

4-2/ Analysis tools

Metabolome data analysis tools are used to process and interpret the large amounts of data generated by metabolomics experiments. There are many different tools, each with its own strengths and weaknesses. The choice of the right tool therefore depends on the research question, the type of data to be analysed and the user's preferences. Here are some commonly used tools:

OpenMS: An open-source framework for processing and analysing mass spectrometry data, including metabolomics data. It has a wide range of features for data processing and analysis, including peak detection, alignment and normalisation.

XCMS: A widely used R package for processing metabolomics data generated by mass spectrometry, also available through a web interface (XCMS Online). It offers functions for peak detection, retention time alignment and the matching of peaks across samples.

MetaboAnalyst: A comprehensive web-based platform for metabolomics data analysis. It provides a suite of tools for data normalisation, statistical analysis, visualisation and pathway analysis.

Chenomx NMR Suite: Software specifically designed for NMR data analysis in metabolomics, including quantification, spectral analysis and biomarker discovery.

5/ Epigenomic sequencing and analysis tools

Epigenomics is the study of modifications of DNA and associated proteins (the epigenome) that control gene expression without changing the underlying DNA sequence. These changes can be influenced by environmental factors, such as diet and exposure to toxins, and play a key role in the development and progression of diseases, including cancer.

5-1/ Sequencing tools

Epigenome sequencing tools are used to study chemical modifications of DNA and of the histones around which DNA is wrapped. These modifications, called epigenetic marks, help control gene expression and other cellular processes; they can be inherited and play a key role in the development of disease. Some common methods for epigenome sequencing are:

ChIP-seq: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is a widely used technique for studying protein-DNA interactions, such as the binding of DNA-binding proteins and the distribution of modified histones. ChIP-seq can be used to identify the genomic locations that are bound by a particular protein or that carry a specific histone mark.

Bisulfite-seq: Bisulfite sequencing (bisulfite-seq) is a powerful tool for studying DNA methylation, which is an epigenetic mark that can influence gene expression. Bisulfite-seq works by chemically converting unmethylated cytosines to uracils, while leaving the methylated cytosines unchanged. The resulting DNA is then sequenced, allowing the methylation status of each cytosine residue to be determined.

MNase-seq: Micrococcal nuclease sequencing (MNase-seq) is a method for studying the organisation of chromatin in the genome. MNase is a nuclease that preferentially digests the linker DNA between nucleosomes, so sequencing the protected fragments that remain maps the positions of nucleosomes in the genome.

Hi-C: Hi-C sequencing is a technique used to investigate the three-dimensional structure of chromatin genome-wide. It works by cross-linking chromatin regions that are in close spatial proximity, digesting the DNA, and then ligating the cross-linked fragments together. This produces chimeric fragments that can be sequenced to reveal interactions between different regions of the genome. By analysing the frequency and pattern of these interactions, researchers can gain insights into the organisation of chromosomes and the spatial relationships between genomic elements such as promoters, enhancers and gene bodies. Hi-C has become an important tool for understanding 3D genome architecture and its role in gene regulation and genome function.
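The bisulfite conversion logic described above can be illustrated with a minimal sketch. It assumes a single read that comes from the same strand as the reference and aligns to it without gaps or sequencing errors, and it represents converted cytosines as T, which is how uracil is read after amplification. The function names and the example sequence are hypothetical.

```python
from typing import Dict, Set


def bisulfite_convert(seq: str, methylated: Set[int]) -> str:
    """Simulate conversion: unmethylated C becomes T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference: str, read: str) -> Dict[int, bool]:
    """For each C in the reference, report True if the read still shows C."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True    # protected from conversion: methylated
        elif read_base == "T":
            calls[i] = False   # converted: unmethylated
    return calls


reference = "ACGTCCGA"                                    # cytosines at 1, 4, 5
read = bisulfite_convert(reference, methylated={1, 5})
calls = call_methylation(reference, read)
print(read, calls)
```

Real data contain many reads per position and reads from both strands, so methylation is usually reported as the fraction of reads showing C at each cytosine.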

5-2/ Analysis tools

Epigenomic data analysis tools are used to process and interpret the data generated by epigenome sequencing experiments. There are many tools available, each with its own strengths and weaknesses. The choice of tool therefore depends on the research question, the type of data to be analysed and the user's preferences.

Peak calling is an important step in the analysis of ChIP-seq data, a technique used to study protein-DNA interactions and their role in regulating gene expression. The objective of peak calling is to identify regions of the genome where a specific protein or histone modification is enriched relative to a control, such as input DNA. This information can be used to infer the binding sites of the protein of interest and understand its role in gene regulation.
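The idea of enrichment relative to a control can be illustrated with a minimal sketch. It assumes that reads have already been counted in fixed-size genomic bins, that the control is scaled to the same total depth as the ChIP sample, and that read counts in a bin without enrichment follow a Poisson distribution. These are simplifying assumptions made for illustration and not the algorithm of any particular peak caller; the function names, cutoff and counts are hypothetical.

```python
import math
from typing import List, Tuple


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(chip: List[int], control: List[int],
               p_cutoff: float = 1e-5) -> List[Tuple[int, int, float]]:
    """Return (bin index, ChIP count, p-value) for enriched bins."""
    scale = sum(chip) / sum(control)  # match sequencing depth
    peaks = []
    for i, (c, ctrl) in enumerate(zip(chip, control)):
        lam = max(ctrl * scale, 1.0)  # expected count; floor avoids lam = 0
        p = poisson_sf(c, lam)
        if p < p_cutoff:
            peaks.append((i, c, p))
    return peaks


# Hypothetical read counts per bin
chip_counts = [5, 6, 4, 40, 5, 6, 4, 5]
control_counts = [5, 5, 5, 5, 5, 5, 5, 5]
peaks = call_peaks(chip_counts, control_counts)
print([index for index, _, _ in peaks])
```

In practice, many bins are tested at once, so the p-values would also need correction for multiple testing before regions are reported.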

Some commonly used tools include:

MACS: MACS (Model-based Analysis of ChIP-Seq) is a popular tool for the analysis of ChIP-seq data. It can be used to detect genomic regions that are bound by a protein or marked by a specific histone modification.

Bismark: Bismark is a tool for analysis of bisulfite-seq data. It can be used to quantify genome-wide DNA methylation levels and to determine the precise position of methylated sites in the genome.

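SICER: SICER (Spatial Clustering for Identification of ChIP-Enriched Regions) is a tool for the analysis of ChIP-seq data. It can be used to detect regions enriched in specific proteins or histone marks, and is particularly suited to broad, diffuse enrichment such as that produced by many histone marks.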

HiCExplorer: HiCExplorer is a set of tools for analysing and visualising Hi-C sequencing data, including modules for processing aligned reads, quality control, data normalisation and visualisation.
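Hi-C analysis typically summarises interaction frequencies as a binned contact matrix, which can be illustrated with a minimal sketch. It assumes read pairs that have already been mapped to positions on a single chromosome; the function name `contact_matrix`, the bin size and the coordinates are hypothetical, and this is not HiCExplorer code.

```python
from typing import List, Tuple


def contact_matrix(pairs: List[Tuple[int, int]], chrom_length: int,
                   bin_size: int) -> List[List[int]]:
    """Count read pairs linking each pair of genomic bins (symmetric matrix)."""
    n_bins = (chrom_length + bin_size - 1) // bin_size  # round up
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Hypothetical mapped positions (0-based) of the two ends of each read pair
pairs = [(100, 250), (120, 2900), (2100, 2950), (150, 180)]
matrix = contact_matrix(pairs, chrom_length=3000, bin_size=1000)
for row in matrix:
    print(row)
```

Entries far from the diagonal, such as the contact between the first and last bins here, correspond to regions that are distant along the sequence but close in three-dimensional space.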

6/ Metagenomic sequencing and analysis tools

Metagenomics is the study of the genetic material of a community of microorganisms, such as bacteria, viruses and fungi. Metagenomics provides a better understanding of the diversity and function of microbes in different environments, such as the human gut, soil and oceans. This field is used to study the interactions between microorganisms and their environment and to understand the role they play in important processes, such as the decomposition of organic matter and nutrient cycling.

6-1/ Microbiome sequencing tools

Microbiome sequencing is the process of identifying and characterising the genetic material of microorganisms in a specific environment, such as the gut, skin or soil. There are several different sequencing methods that can be used to study the microbiome, including:

16S rRNA sequencing: This is the most widely used method for studying the bacterial component of the microbiome. It focuses on sequencing the 16S ribosomal RNA gene, which is present in all bacteria and is used to identify different groups of bacteria. The gene contains nine hypervariable regions (V1-V9) separated by conserved regions. Primers are placed in the conserved regions, and sequencing generally targets one or more hypervariable regions, commonly V3-V4, whose sequence differences allow bacteria to be distinguished.

Metagenomic sequencing: This is a more comprehensive method that sequences all of the genetic material present in a sample, rather than just the 16S rRNA gene. It provides a broader picture of the microbiome and can reveal information about the functional capabilities of the microorganisms present.

Shotgun sequencing: This is the usual way of performing metagenomic sequencing. The sample DNA is randomly fragmented and the resulting pieces are sequenced. It provides more complete information about the microbiome, including the functional capabilities of microorganisms and their metabolic pathways.

Transcriptomic sequencing: This method, called metatranscriptomics when applied to a whole community, focuses on sequencing the messenger RNA (mRNA) present in a sample, rather than the DNA. The mRNA provides information about the genes that are actively expressed in the microorganisms in the sample.

Whole genome sequencing: This method sequences the entire genome of individual bacteria, usually after they have been isolated from the sample. It provides the most detailed information about each isolate, including the specific genes and mutations it carries.

6-2/ Microbiome data analysis tools

There are many data analysis tools available for microbiome research that can be used to process and interpret data generated by sequencing methods. Some of the most commonly used tools are:

QIIME (Quantitative Insights into Microbial Ecology): This open-source tool is very popular for analysing microbiome data obtained from 16S rRNA sequencing. It provides a complete workflow for processing raw sequencing data and generating taxonomic profiles, as well as many other features for visualising and exploring microbiome data.

MEGAN (MEtaGenome ANalyzer): This tool is designed for the analysis of metagenomic data. It provides functional and taxonomic analyses of metagenomic data, and can be used to generate heat maps, network diagrams and other visualisations to help interpret the data.

phyloseq: This R/Bioconductor package provides a flexible framework for the processing and analysis of microbiome data. It can be used for both 16S rRNA sequencing and metagenomic data, and offers a wide range of features for data visualisation and exploration.

MicrobeCensus: This tool estimates the average genome size of a microbial community from shotgun metagenomic data. The estimate can be used to normalise gene abundances so that they are comparable between samples.

HUMAnN2: HUMAnN2 (the HMP Unified Metabolic Analysis Network) is designed for the functional profiling of microbial communities from metagenomic or metatranscriptomic sequencing data. It provides a pipeline for quantifying the gene families and metabolic pathways present in a community.
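The taxonomic profiles mentioned above are, at their simplest, tables of relative abundance, which can be illustrated with a minimal sketch. It assumes that each read has already been assigned to a taxon by an upstream tool; the function name `taxonomic_profile` and the example assignments are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def taxonomic_profile(assignments: List[str]) -> Dict[str, float]:
    """Convert per-read taxon assignments into relative abundances."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}  # no assigned reads, so no profile
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical assignments for ten reads
assignments = (["Bacteroides"] * 5 + ["Lactobacillus"] * 3
               + ["Escherichia"] * 2)
profile = taxonomic_profile(assignments)
for taxon, fraction in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(f"{taxon}\t{fraction:.2f}")
```

Relative abundances sum to one within a sample, so an increase in one taxon necessarily lowers the fractions of the others; this should be kept in mind when comparing profiles between samples.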