
From Genes to Proteins: How Bioinformatics Databases Help Decipher Biological Pathways

Feb 12, 2024

Co-authors: Wassim A, Djomangan Adama O

Description:


This course introduces students to bioinformatics databases and their applications in biological research. It covers the principles and fundamentals of these databases, including their design, organization and access. Students will learn about the major types of bioinformatics databases and how to use them to retrieve, analyze and compare large amounts of biological data. The course also focuses on the practical use of popular bioinformatics databases, including NCBI's GenBank, UniProt, PDB, KEGG, Ensembl and PubMed.



Course Content:


Overview of bioinformatics databases and their importance in biological research


Types of bioinformatics databases and their organization


Retrieval and analysis of biological data from bioinformatics databases


Tools and interfaces for searching and interrogating bioinformatics databases


NCBI's GenBank: a public, annotated database of DNA sequences


UniProt: a database of protein sequences and functional information


PDB: a database of 3D structures of proteins and nucleic acids


KEGG: a database of genomic and metabolic pathways


Ensembl: a database of genomic and functional information for various organisms


PubMed: a database of biomedical literature


And others...


Course Objectives:


Upon completion of this course, students will be able to:


Understand the principles and fundamentals of bioinformatics databases


Identify and utilize major types of bioinformatics databases


Retrieve and analyze biological data from bioinformatics databases using various tools and interfaces


Evaluate the advantages and limitations of different bioinformatics databases


Apply bioinformatics databases to solve biological problems


Prerequisites:


This course is designed for people who have a basic understanding of biology, genetics, and computer science. No prior experience with bioinformatics databases is required.





1/ NCBI's GenBank


NCBI's GenBank is a cornerstone of modern biology and a vital resource for genetic research. It is an extensive, publicly available database of annotated DNA sequences from a diverse array of organisms, including bacteria, viruses, fungi, plants and animals. Managed by the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH), GenBank is one of the largest and most widely used sequence databases in the world.


Researchers can access GenBank through the NCBI website and use NCBI's tools for sequence alignment, annotation and analysis. In addition to the DNA sequences themselves, each record provides contextual information, including the source organism and tissue as well as bibliographic data. Researchers from all over the world submit new sequences, so the database grows and is updated on a daily basis.


Because it supports basic research as well as the development of new diagnostic tests, drugs and vaccines, GenBank is an essential tool for geneticists and biotechnologists alike.
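
Besides the website, GenBank records can be retrieved programmatically through NCBI's E-utilities web service. The following is a minimal sketch, not an official client: the helper names build_efetch_url and fetch_record are hypothetical, and the accession U49845 is used only as an example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Return an efetch URL for one nucleotide record.

    rettype="gb" asks for the annotated GenBank flat file,
    rettype="fasta" for the bare sequence.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_record(accession: str, rettype: str = "gb") -> str:
    """Download the record as text (needs network access)."""
    with urlopen(build_efetch_url(accession, rettype), timeout=30) as response:
        return response.read().decode("utf-8")


# Building the URL works offline; call fetch_record() to actually download.
print(build_efetch_url("U49845"))
```

Keeping the URL construction separate from the download makes the request easy to inspect before any network traffic is sent. For heavier use, NCBI asks clients to limit their request rate, so a real script would add pauses between calls.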



2/ UniProt


UniProt, short for Universal Protein Resource, is a database of protein sequences and functional information that serves as a centralized resource for protein data. UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).


The database contains information on a wide range of proteins from various organisms, including eukaryotic and prokaryotic organisms. UniProt provides a unique identifier (accession number) for each protein in the database, and also provides a variety of other information such as protein sequence, function, structure, interactions and disease associations.


Users can access UniProt via the UniProt website and search for specific proteins using keywords, accession numbers or other identifying information. The database also provides tools for sequence alignment, annotation and analysis. Entries additionally include information on the gene encoding each protein, its orthologs, its domains and its post-translational modifications.
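
To illustrate how accession numbers are used in practice, here is a minimal sketch that builds the REST URL of an entry's FASTA file and parses a UniProtKB-style FASTA header. The helper names fasta_url and parse_header are hypothetical, and the regular expression only covers the common "sp|accession|entry name" layout, so treat it as an assumption rather than a full parser.

```python
import re
from typing import NamedTuple

UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"

# >db|ACCESSION|ENTRY_NAME description OS=... OX=... GN=...
HEADER_RE = re.compile(r"^>(sp|tr)\|([A-Z0-9]+(?:-\d+)?)\|(\S+)\s+(.*)$")


class UniProtHeader(NamedTuple):
    reviewed: bool      # True for "sp" (Swiss-Prot), False for "tr" (TrEMBL)
    accession: str
    entry_name: str
    description: str


def fasta_url(accession: str) -> str:
    """URL of the FASTA file for one UniProtKB entry."""
    return f"{UNIPROT_REST}/{accession}.fasta"


def parse_header(line: str) -> UniProtHeader:
    """Split a UniProtKB FASTA header line into its main parts."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {line!r}")
    database, accession, entry_name, rest = match.groups()
    # The free-text description ends where the first KEY=value tag begins.
    description = re.split(r"\s+[A-Z]{2}=", rest, maxsplit=1)[0]
    return UniProtHeader(database == "sp", accession, entry_name, description)


example = (">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha "
           "OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2")
header = parse_header(example)
print(header.accession, header.entry_name, fasta_url(header.accession))
```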


UniProt comprises several resources. Its central database, the UniProt Knowledgebase (UniProtKB), has a manually annotated and reviewed section (Swiss-Prot) and an automatically annotated, unreviewed section (TrEMBL). The UniProt Archive (UniParc) is a non-redundant archive of publicly available protein sequences, and it retains sequences that have since been removed from their source databases.


UniProt is widely used by the scientific community for protein research, and is also used in the development of new diagnostic tests, drugs and vaccines. UniProt data is also used for functional annotation of genomes and identification of new proteins for specific research areas. It is regularly updated with new information and new sequences from different organisms.



3/ PDB, Protein Data Bank


The Protein Data Bank (PDB) is a database containing the three-dimensional structures of biomolecules, mainly proteins and nucleic acids. The PDB is maintained by the Worldwide Protein Data Bank (wwPDB), a collaboration whose members include the Research Collaboratory for Structural Bioinformatics (RCSB PDB), the Protein Data Bank in Europe (PDBe, hosted at the European Bioinformatics Institute) and the Protein Data Bank Japan (PDBj).


PDB contains a wide range of structures, including the structures of proteins and nucleic acids alone, as well as the structures of complexes of these biomolecules with other molecules, such as drugs and metal ions. The database contains information on the atomic coordinates of the structures, as well as information on the experimental methods used to determine the structures.


Users can access the PDB database via the PDB website and search for specific structures using keywords, PDB codes or other identifying information. The database also provides tools for visualisation and analysis of structures, and allows users to download structural data.
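
As a sketch of what the downloadable structural data looks like, the example below builds a download URL from a PDB code and reads the atomic coordinates from one ATOM record of the legacy fixed-column PDB format. The helper names are hypothetical, the sample line is included only for illustration, and newer or very large entries may be distributed only in mmCIF format, which this sketch does not handle.

```python
from typing import NamedTuple

RCSB_DOWNLOAD = "https://files.rcsb.org/download"


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def pdb_file_url(pdb_id: str) -> str:
    """URL of the legacy-format coordinate file for a four-character PDB code."""
    return f"{RCSB_DOWNLOAD}/{pdb_id.upper()}.pdb"


def parse_atom_line(line: str) -> Atom:
    """Read one ATOM/HETATM record using the format's fixed column positions."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return Atom(
        name=line[12:16].strip(),          # columns 13-16
        residue=line[17:20].strip(),       # columns 18-20
        chain=line[21],                    # column 22
        residue_number=int(line[22:26]),   # columns 23-26
        x=float(line[30:38]),              # columns 31-38
        y=float(line[38:46]),              # columns 39-46
        z=float(line[46:54]),              # columns 47-54
    )


SAMPLE_LINE = "ATOM      1  N   THR A   1      17.047  14.099   3.625  1.00 13.79           N"
print(parse_atom_line(SAMPLE_LINE))
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in real PDB files can run together without a separating space.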


PDB is a valuable resource for structural biology research, as it allows scientists to study the three-dimensional structures of biomolecules and their interactions with other molecules. This information is crucial for understanding the function and mechanism of biomolecules, and also has important applications in the development of new drugs and therapies. The PDB archive is updated weekly with new structures submitted by researchers from around the world.




4/ KEGG, Kyoto Encyclopedia of Genes and Genomes


KEGG, short for Kyoto Encyclopedia of Genes and Genomes, is a database that provides information on the genomes and metabolic pathways of various organisms. KEGG was developed at the Bioinformatics Center of the Institute for Chemical Research at Kyoto University in Japan and is now maintained by Kanehisa Laboratories.


KEGG provides a comprehensive view of an organism's genetic make-up and metabolic processes, integrating information on the genome, metabolic pathways and their interactions with the environment. The database contains a wide range of information on metabolic pathways, such as the enzymes and molecules involved in each pathway, as well as the genes that code for them. It also provides information on the genetic regulation of metabolism and the interactions between the different pathways.


Users can access KEGG through the KEGG website and search for specific pathways or genes using keywords, KEGG codes or other identifying information. The database also provides tools for visualising and analysing pathways, and allows users to download data.
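
KEGG also offers a simple REST interface in which the operation and its arguments form the URL path. The sketch below builds such URLs and parses the tab-separated text returned by a "list" request. The helper names kegg_url and parse_kegg_list are hypothetical, and the sample text is a hand-written excerpt in the shape of a pathway listing, not a live response.

```python
KEGG_REST = "https://rest.kegg.jp"


def kegg_url(operation: str, *arguments: str) -> str:
    """Build a KEGG REST URL such as https://rest.kegg.jp/get/hsa00010."""
    return "/".join([KEGG_REST, operation, *arguments])


def parse_kegg_list(text: str) -> dict:
    """Turn 'identifier<TAB>description' lines into a dictionary."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        identifier, _, description = line.partition("\t")
        # Some responses prefix pathway identifiers with "path:".
        if identifier.startswith("path:"):
            identifier = identifier[len("path:"):]
        entries[identifier] = description.strip()
    return entries


sample_response = (
    "hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "hsa00020\tCitrate cycle (TCA cycle) - Homo sapiens (human)\n"
)
pathways = parse_kegg_list(sample_response)
print(kegg_url("list", "pathway", "hsa"))
print(pathways["hsa00010"])
```

Note that KEGG's terms restrict the REST service to academic use, so bulk or commercial access needs a separate arrangement.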


KEGG is an important resource for systems biology research, as it allows scientists to study the interactions between different metabolic pathways and their regulation, and to understand the relationship between the genome and the metabolism of an organism. This information is crucial for understanding physiology, biotechnology and environmental science. KEGG is regularly updated with new information and new pathways from different organisms.


KEGG Pathway is a collection of diagrams that represent the molecular interaction and reaction networks in various biological systems, including metabolic pathways, signal transduction pathways, genetic information processing pathways and environmental information processing pathways.



5/ Ensembl


Ensembl is a database of genomic and functional information for a wide range of organisms. The main site focuses on vertebrates such as mammals, birds and fish, while the sister Ensembl Genomes sites cover invertebrates, plants and other taxa. The project began as a collaboration between the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute (WTSI).


Ensembl provides a comprehensive view of an organism's genetic make-up by integrating information on the genome, genes and their regulation. The database contains a wide range of information about the genome, such as gene location and sequence, exons and introns, regulatory elements, repeats and variations. It also provides information on functional annotation of the genome, such as prediction of protein-coding genes, functional domains, orthologs, paralogs and comparative genomics.


Users can access Ensembl via the Ensembl website and search for specific genes, regions or variations using keywords, gene identifiers or other identifying information. The database also provides tools for genome visualisation and analysis, and allows users to download data. The Ensembl website also provides a genome browser that allows users to view genomic data in a graphical format.
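
Gene identifiers can also be resolved programmatically through the Ensembl REST service. The sketch below builds a lookup URL for a stable ID and formats the location fields of a lookup record as a region string. The helper names are hypothetical, and the example record uses a placeholder identifier and made-up coordinates purely to show the shape of the data.

```python
import json
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(stable_id: str) -> str:
    """URL of the 'lookup by stable ID' endpoint, asking for JSON."""
    return f"{ENSEMBL_REST}/lookup/id/{stable_id}?content-type=application/json"


def lookup(stable_id: str) -> dict:
    """Fetch the record for one stable ID (needs network access)."""
    with urlopen(lookup_url(stable_id), timeout=30) as response:
        return json.load(response)


def region_string(record: dict) -> str:
    """Format a record's location as 'chromosome:start-end'."""
    return f"{record['seq_region_name']}:{record['start']}-{record['end']}"


# Hypothetical record with made-up values, shaped like a lookup response.
example_record = {
    "id": "ENSG00000000000",
    "display_name": "EXAMPLE1",
    "seq_region_name": "7",
    "start": 1000,
    "end": 2000,
    "strand": -1,
}
print(region_string(example_record))  # 7:1000-2000
```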


Ensembl is an important resource for genome research, as it allows scientists to study the genomic organisation and functional annotation of a wide range of organisms, and to understand the evolution and diversity of life. This information is crucial for understanding the genetics, phylogenetics and evolution of living organisms. Ensembl is regularly updated with new information and new genomes from different organisms.



6/ PubMed


PubMed is a database of biomedical literature, managed by the National Library of Medicine (NLM) of the National Institutes of Health (NIH). PubMed contains bibliographic information on a wide range of scientific articles, including journal articles, book chapters and conference proceedings, in the field of biomedicine and health.


PubMed allows users to search for articles using keywords, author names or other identifying information. The database also provides tools to filter and sort results, and allows users to save their searches and create alerts for new articles on specific topics. Articles indexed in PubMed come from a wide range of journals, including many leading biomedical journals.
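
The same kind of keyword and author search can be expressed as a query string with PubMed field tags and sent to NCBI's E-utilities esearch endpoint. This is a minimal sketch under stated assumptions: build_term and build_esearch_url are hypothetical helper names, and only three field tags are covered.

```python
from typing import Optional
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(keywords: str,
               author: Optional[str] = None,
               year: Optional[int] = None) -> str:
    """Combine search criteria into one PubMed query using field tags."""
    parts = [f"{keywords}[Title/Abstract]"]
    if author:
        parts.append(f"{author}[Author]")
    if year:
        parts.append(f"{year}[Date - Publication]")
    return " AND ".join(parts)


def build_esearch_url(term: str, max_results: int = 20) -> str:
    """URL that returns matching PubMed identifiers (PMIDs) as JSON."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": max_results,
    }
    return f"{ESEARCH}?{urlencode(params)}"


term = build_term("protein structure", author="Smith J", year=2023)
print(term)
print(build_esearch_url(term, max_results=5))
```

The esearch call returns identifiers only; the citation details for each PMID would be fetched in a second request.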


PubMed itself stores citations and abstracts rather than the full text of articles. Where full text is available, it links to it, for example on publisher websites or in PubMed Central, and it also links to related resources such as sequences in GenBank and 3D structures in the PDB. The primary component of PubMed is MEDLINE, NLM's bibliographic database of articles in medicine and the related health sciences.


PubMed is an important resource for the biomedical community, allowing scientists, researchers and health professionals to keep abreast of the latest developments in their field and to discover new research that may be relevant to their work. PubMed is updated daily with new articles and information from various sources.


PubMed Central (PMC) is a free digital archive of full-text biomedical and life sciences journal literature at the US National Institutes of Health (NIH), managed by the National Library of Medicine (NLM). It provides access to articles from thousands of journals, including many open access journals.




7/ Databases for model organisms


A model organism can be defined as a species that serves as a representative for a larger group of organisms and is studied to understand fundamental biological processes and mechanisms. Model organisms are chosen for their experimental tractability, the availability of genetic tools and their relevance to fields such as health, agronomy and bio-production. They provide a simple and well-understood system for studying complex biological processes, including development, genetics, physiology and disease.


Examples of model organisms are:


  • Mus musculus (mouse): A mammal commonly used to study genetics, development and disease, including human disease.
  • Caenorhabditis elegans: A nematode worm commonly used to study developmental biology, genetics and neurobiology.
  • Drosophila melanogaster (fruit fly): A widely studied insect used to understand genetics, development and disease.
  • Saccharomyces cerevisiae (yeast): A single-celled fungus used to study cell biology, genetics and biotechnology of eukaryotes.
  • Arabidopsis thaliana (thale cress): A small flowering plant commonly used to study plant genetics and molecular biology.
  • Escherichia coli (bacterium): A gram-negative bacterium commonly used as a model organism to study microbial genetics and physiology.
  • Danio rerio (zebrafish): A fish commonly used to study vertebrate development, genetics and disease.
  • Chlamydomonas reinhardtii: A single-celled green alga commonly used to study photosynthesis, genetics and cell biology.


Among the most popular and widely used databases of model organisms:


  • The Mouse Genome Database (MGD): A database of genetic, genomic and functional information for the laboratory mouse.
  • The Zebrafish Model Organism Database (ZFIN): A database of genetic, genomic and functional information for the zebrafish.
  • WormBase: A database of genetic, genomic and functional information on the nematode worm.
  • FlyBase: A database of genetic, genomic and functional information on the fruit fly.
  • The Saccharomyces Genome Database (SGD): A database of genetic, genomic and functional information on yeast.
  • The Arabidopsis Information Resource (TAIR): A database of genetic, genomic and functional information on the plant Arabidopsis thaliana.
  • The Database of Genotypes and Phenotypes (dbGaP): An NCBI archive of studies relating genotype to phenotype in humans.

These databases provide a wealth of information on the genomes, genetics and biology of these model organisms and are widely used by researchers in the fields of genetics, genomics and biology. They provide tools for visualisation, analysis and downloading of data, and are regularly updated with new information and data.



8/ BioModels


BioModels is a database that provides published and curated quantitative models of biological systems. These models can represent a wide range of biological processes, including metabolism, gene regulation, signalling pathways and cellular processes. The models are derived from peer-reviewed publications and are available in a variety of formats, including SBML (Systems Biology Markup Language), CellML and SBGN (Systems Biology Graphical Notation).
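
Because SBML is an XML format, the contents of a model can be inspected with a standard XML parser. The sketch below lists the species and reactions of a toy document. The document is hand-written and deliberately incomplete (it is not a valid SBML model and does not come from BioModels), the helper name summarize_sbml is hypothetical, and the namespace assumes SBML Level 3 Version 1; models at other levels use a different namespace. For real work, a dedicated library such as libSBML is the safer choice.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 3 Version 1 core (an assumption about the input).
SBML_NS = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}

TOY_MODEL = """
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy_model">
    <listOfSpecies>
      <species id="S1" name="substrate" compartment="cell"/>
      <species id="P1" name="product" compartment="cell"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="R1" reversible="false"/>
    </listOfReactions>
  </model>
</sbml>
"""


def summarize_sbml(sbml_text: str) -> dict:
    """Return the model id plus the ids of its species and reactions."""
    root = ET.fromstring(sbml_text.strip())
    model = root.find("sbml:model", SBML_NS)
    if model is None:
        raise ValueError("no <model> element in the expected namespace")
    species = model.findall("sbml:listOfSpecies/sbml:species", SBML_NS)
    reactions = model.findall("sbml:listOfReactions/sbml:reaction", SBML_NS)
    return {
        "model": model.get("id"),
        "species": [element.get("id") for element in species],
        "reactions": [element.get("id") for element in reactions],
    }


print(summarize_sbml(TOY_MODEL))
```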


The aim of BioModels is to provide researchers with a comprehensive resource of well-documented, peer-reviewed models that can be used for a variety of purposes, including model analysis, hypothesis testing and simulations. The database is regularly updated with new models, and it also provides tools for model visualisation and simulation, making it a valuable resource for computational and experimental biologists.


Overall, BioModels is a valuable resource for researchers who wish to use computational models to better understand biological systems. By providing access to a large collection of peer-reviewed models, it allows researchers to compare and contrast different models and to use these models as a starting point for their own research.