scVI: scRNA-seq data integration

Nour Larifi
Mar 11, 2024


After filtering the data with the "Data filtration" task, we proceed to data normalization and then to data integration using scVI (single-cell Variational Inference). scVI is a method based on a conditional variational autoencoder [Lopez et al., 2018] and is available in the scvi-tools package [Gayoso et al., 2022]. This step removes batch effects and then determines clusters with the Leiden algorithm [Traag et al. (2018)], an improved version of the Louvain algorithm [Blondel et al. (2008)]. As input, this task takes the combined_filtered.h5ad file generated by the "Data filtration" task.

Main functions of data integration using scVI:

  • Normalization: Scanpy provides functions to normalize the gene expression data, ensuring that cells are comparable.
  • Gene selection: Identifying highly variable genes that contribute significantly to cell heterogeneity.
  • scVI Integration: Utilizing scVI, an unsupervised deep generative model, to integrate multiple scRNA-seq datasets from different experiments or batches into a shared latent space.
  • Leiden Clustering: Utilizing the Leiden algorithm, an extension of the Louvain algorithm, to group cells into distinct clusters based on shared similarities in the integrated space.
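
The four functions above can be sketched with Scanpy and scvi-tools as follows. This is a minimal illustration and not the task's actual implementation: the function name, the batch column name `sample`, the number of highly variable genes, the default resolution, and the `counts` and `X_scVI` keys are all assumptions chosen for the example.

```python
def run_scvi_integration(adata, batch_key="sample", n_top_genes=2000, resolution=0.5):
    """Normalize, select genes, integrate with scVI, and cluster with Leiden.

    All default values are illustrative assumptions, not the task's settings.
    """
    # Heavy dependencies are imported here so the sketch can be loaded
    # even where they are not installed.
    import scanpy as sc
    import scvi

    # scVI models raw counts, so keep a copy before normalizing.
    adata.layers["counts"] = adata.X.copy()

    # Normalization: scale each cell to the same total count, then log-transform.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Gene selection: keep only highly variable genes, computed per batch.
    sc.pp.highly_variable_genes(
        adata, n_top_genes=n_top_genes, batch_key=batch_key, subset=True
    )

    # scVI integration: train on raw counts with the batch as a covariate,
    # then store the shared latent space.
    scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key=batch_key)
    model = scvi.model.SCVI(adata)
    model.train()
    adata.obsm["X_scVI"] = model.get_latent_representation()

    # Leiden clustering and UMAP, both built on the scVI latent space.
    sc.pp.neighbors(adata, use_rep="X_scVI")
    sc.tl.leiden(adata, resolution=resolution)
    sc.tl.umap(adata)
    return adata
```

Under these assumptions, the function would be called on the object returned by `scanpy.read_h5ad("combined_filtered.h5ad")`, provided that file stores raw counts in `X` and a batch column with the assumed name.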

Steps to follow

  1. Ensure that version 0.1.2 of the gws_scomix brick is loaded.
  2. Then, create a new experiment.
  3. Import the "Data integration" task.
  4. Add the combined_filtered.h5ad file generated by "Data filtration" as the input resource file.
  5. Specify parameters such as the cluster resolution value, which controls the granularity of the clustering algorithm and therefore the number and size of the resulting clusters. A higher resolution value leads to more fine-grained clustering, where cells with subtle differences in gene expression are separated into smaller clusters; a lower value yields fewer, larger clusters.
  6. Run your experiment.
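
To see why the resolution value changes cluster granularity, the sketch below scores two candidate partitions of a toy graph with a modularity-style quality function that includes a resolution term. It assumes the kind of quality function that Leiden implementations commonly optimize; the graph, the partitions, and the function name are invented for illustration and are not taken from the task.

```python
from collections import defaultdict


def modularity_with_resolution(edges, membership, resolution):
    """Modularity of a partition, with a resolution factor on the null-model term."""
    m = len(edges)
    degree = defaultdict(int)
    internal_edges = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if membership[u] == membership[v]:
            internal_edges[membership[u]] += 1

    # Sum of node degrees inside each cluster.
    cluster_degree = defaultdict(int)
    for node, k in degree.items():
        cluster_degree[membership[node]] += k

    return sum(
        internal_edges[c] / m - resolution * (cluster_degree[c] / (2 * m)) ** 2
        for c in cluster_degree
    )


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by a single edge (2-3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
coarse = {n: 0 for n in range(6)}                 # one large cluster
fine = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}       # one cluster per triangle

for resolution in (0.1, 1.0):
    q_coarse = modularity_with_resolution(edges, coarse, resolution)
    q_fine = modularity_with_resolution(edges, fine, resolution)
    winner = "fine" if q_fine > q_coarse else "coarse"
    print(f"resolution={resolution}: preferred partition is {winner}")
# resolution=0.1 prefers the coarse partition; resolution=1.0 prefers the fine one.
```

On this toy graph, the single large cluster scores best at the low resolution, while the split into two smaller clusters scores best at the higher one.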

Description of output files

This task will generate:

  • integrated_data.h5ad: This file contains the integrated single-cell RNA sequencing (scRNA-seq) data after the data integration process. It stores the integrated gene expression profiles of individual cells, along with associated metadata, clustering results, and visualization coordinates (UMAP).

  • clustering.png: The image displays two subplots side by side, showing the UMAP visualization of the integrated cells colored by cluster assignment (leiden) and by sample identifier, respectively. This visualization allows for a visual inspection of the integrated data's clustering patterns, facilitating the identification of distinct cell populations and potential biological insights.

  • integrated_data_statistical_informations: This table includes the total number of cells and the total number of genes present in the integrated dataset.
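
The integrated file can also be inspected outside the platform. The sketch below is a hedged example rather than part of the task: it assumes the clustering result is stored in an `obs` column named `leiden`, and the helper names are hypothetical. It recomputes the cell and gene totals reported in the statistics table and adds a per-cluster cell count.

```python
from collections import Counter


def summarize_integrated(adata, cluster_key="leiden"):
    """Return cell count, gene count, and cells per cluster for an AnnData-like object."""
    return {
        "n_cells": adata.n_obs,
        "n_genes": adata.n_vars,
        "cells_per_cluster": dict(Counter(adata.obs[cluster_key])),
    }


def load_and_summarize(path="integrated_data.h5ad", cluster_key="leiden"):
    """Read the integrated .h5ad file and summarize it (column name is an assumption)."""
    # anndata is imported here so the helper above works without it installed.
    import anndata as ad

    return summarize_integrated(ad.read_h5ad(path), cluster_key=cluster_key)
```

Calling `load_and_summarize()` in the directory that holds integrated_data.h5ad would return a dictionary with the three summary entries, assuming the column name matches.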