How to download a full genome from NCBI?
The National Center for Biotechnology Information (NCBI) offers a wide range of tools, resources and databases to assist scientists in studying genes, genomes, proteins, and more. For example, hundreds of thousands of genome pieces and their sequences are freely accessible, and the database continues to grow as more genomes are sequenced and deposited.
In a digital lab, you can import files from an external source, such as an open database repository. In this use case, we demonstrate how to import a complete genome sequence directly from NCBI.
- Access to Constellab and to a digital lab
- The brick "gws_core" (version > 5.9)
Steps to follow
We will create this pipeline using two tasks from the gws_core brick. The first task, "Download resource from external source", retrieves a file from an external source. In this example, we use it to download a FASTA file from NCBI. The second task is a generic operation that converts a resource of type 'File' into a resource of type 'Text'. A 'Text' resource is useful when the next task in the pipeline requires text as input.
To create this pipeline, start by creating a new experiment, then add the process "Download resource from external source". On its configuration page, paste the link to the file you want to download. In this example, we import one of the complete genomes of Bacillus thuringiensis from NCBI. Among the available versions, we chose the assembly GCA_000008505.1_ASM850v1 and identified the file "GCA_000008505.1_ASM850v1_genomic.fna.gz", which contains the genomic sequence. We provide the link to this file below:
Next, we connect the output of the task 'Download resource from external source' to two destinations. The first is a standard output, which saves the file into the Databox. The second is another task called 'Text importer', which converts a resource of type 'File' into a resource of type 'Text' (this text resource is also saved into the Databox).
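To make the two stages of this pipeline more concrete, here is a minimal sketch in plain Python of what "download a file" followed by "convert the file to text" amounts to. It uses only the standard library and is not the gws_core API: the function names `download_bytes`, `file_bytes_to_text` and `count_fasta_records` are hypothetical, and whether the Constellab tasks decompress a `.gz` file for you is not stated here, so the decompression step is an assumption.

```python
import gzip
import urllib.request


def download_bytes(url: str, timeout: float = 60.0) -> bytes:
    """Fetch the raw bytes behind a URL (roughly the 'File' stage)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def file_bytes_to_text(data: bytes) -> str:
    """Turn file bytes into text (roughly the 'Text' stage).

    Assumption: if the content is gzip-compressed, as a .fna.gz file is,
    it is decompressed before decoding.
    """
    # gzip streams start with the two magic bytes 0x1f 0x8b
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return data.decode("utf-8")


def count_fasta_records(text: str) -> int:
    """Count FASTA records; each record starts with a '>' header line."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))
```

With a valid link to a FASTA file, `file_bytes_to_text(download_bytes(url))` would return the sequence as one string, which a downstream text-based step such as `count_fasta_records` can then consume.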
From NCBI you can access a wide range of data files for each complete genome. Below is a screenshot from this page, showcasing information files (e.g., assembly report.txt) for the complete genome of Bacillus thuringiensis, along with various other data types that can be conveniently downloaded into the Databox. Additionally, if needed, you can import as many files as necessary into a single experiment by duplicating the 'Download resource from external source' task.
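When several files from the same assembly are needed, it can help to generate the links rather than copy them one by one. The sketch below assumes the directory layout commonly seen on NCBI's public genomes FTP site (accession prefix, then the nine accession digits split into three groups, then the assembly directory). The base URL, the layout and the helper name `assembly_file_url` are assumptions made for illustration, so check each generated link in a browser before pasting it into a task.

```python
import re

# Assumed base URL of NCBI's public genomes directory.
NCBI_GENOMES_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_file_url(assembly_dir: str, suffix: str) -> str:
    """Build a candidate URL for one file of an assembly.

    assembly_dir: e.g. "GCA_000008505.1_ASM850v1"
    suffix: the part of the file name after the assembly name,
            e.g. "genomic.fna.gz"
    """
    # Expect "GCA_" or "GCF_", nine digits, a version, then the assembly name.
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+_.+", assembly_dir)
    if match is None:
        raise ValueError(f"unexpected assembly directory name: {assembly_dir!r}")
    prefix, digits = match.groups()
    # Split the nine digits into three groups of three: 000/008/505
    groups = [digits[i:i + 3] for i in range(0, 9, 3)]
    file_name = f"{assembly_dir}_{suffix}"
    return "/".join([NCBI_GENOMES_ROOT, prefix, *groups, assembly_dir, file_name])


if __name__ == "__main__":
    # One link per duplicated download task.
    for suffix in ("genomic.fna.gz", "assembly_report.txt"):
        print(assembly_file_url("GCA_000008505.1_ASM850v1", suffix))
```

Each printed link would then go into the configuration of its own 'Download resource from external source' task.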