How to extract trends from data using Principal Component Analysis in Constellab?

Wassim Abou-Jaoudé
Jul 27, 2023, 7:35 PM

Co-authors:
Thibault ETIENNE

Principal Component Analysis (PCA) is a widely used tool that helps scientists discover important relationships in datasets. More formally, PCA identifies the linear combinations of features (a.k.a. principal components) that explain the main trends, i.e. most of the variance, in your data.

We show here how to use Constellab to apply PCA to data analysis, using the well-known Iris dataset. The Iris dataset consists of 50 samples from each of three species of Iris flower (Iris setosa, Iris virginica and Iris versicolor), in which four features are measured from each sample: the length and the width of the sepals and petals, in centimeters.

Let's go!

Data preparation and preview

We start by loading the Iris dataset into the Databox of the digital lab. You can preview the raw dataset by selecting the imported file.

Hint on data import!

After uploading the Iris dataset as a .csv file, you first need to import this file as a Table resource. A Table resource can easily be visualized and used for any purpose in Constellab.

To learn more about data import, please see the story How to import data in Constellab?

Here, the first 4 columns correspond to the 4 features of the dataset (sepal length, sepal width, petal length and petal width, in centimeters), while each row corresponds to a sample of the dataset. Each sample is tagged with the corresponding species (setosa, virginica or versicolor).

Building of the workflow for the PCA analysis

We then build the workflow that will be used for our analysis, by selecting and connecting together:

  • the resource containing our processed Iris data as input data,
  • the Table column scaler task to standardise the dataset (mean = 0, standard deviation = 1),
  • the PCA task, with the number of principal components set to 2 (meaning that we keep the first two components of the PCA), and
  • the output process which will collect the result of our analysis.
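
For readers who want to reproduce the same chain of steps outside the digital lab, here is a minimal sketch in Python with scikit-learn. It is not the Constellab task API: the variable names (`workflow`, `scores`) are ours, and the data come from the copy of the Iris dataset bundled with scikit-learn rather than from an imported Table resource.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Iris data shipped with scikit-learn: 150 samples x 4 features (cm).
iris = load_iris()
X, species = iris.data, iris.target

# Standardise each column (mean 0, standard deviation 1), then keep
# the first two principal components, mirroring the workflow above.
workflow = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = workflow.fit_transform(X)

print(scores.shape)  # one row per sample, one column per component
```

The `scores` array plays the role of the result table collected by the output process: each row holds the coordinates of one sample on the two principal components.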

Hint on data scaling!
It is important to note that PCA requires your data to be at least centered. Centering is generally performed by default in PCA implementations. Constellab mainly relies on the Scikit-Learn implementation of PCA.
In this story, we decided to use an explicit block to standardize the data (by centering and reducing).
Centering ensures that all the features have a mean equal to zero. Reducing the data ensures that all the features have a standard deviation equal to one. Reduction is not mandatory for PCA, but it is recommended so that all the features have similar weights in the analysis regardless of their original scale.
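
To make centering and reduction concrete, here is a small sketch in plain Python. The `standardize` helper and the sample values are hypothetical (they are not taken from the Iris dataset or from Constellab); we assume the population standard deviation, which is also what Scikit-Learn's `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Center a column on zero, then scale it to unit standard deviation."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    if std == 0:
        raise ValueError("a constant column cannot be reduced")
    return [(value - mean) / std for value in column]


# Hypothetical measurements (cm), for illustration only.
petal_length = [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]
scaled = standardize(petal_length)

# After standardization the mean is ~0 and the standard deviation is ~1.
print(fmean(scaled), pstdev(scaled))
```

Applied to every column, this puts features measured on different scales on an equal footing before the principal components are computed.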

Running the PCA workflow analysis and viewing the results

After running the experiment, we can access the results in a table, as well as in a 2D scatter plot showing the first 2 principal components of the Iris dataset. Each dot of the scatter plot corresponds to a sample of the Iris dataset, while each color denotes the species of the sample. Here, we see a clear separation between the setosa species and the other two, whereas the versicolor and virginica species are less well separated.
We can also display the share of variance explained by each component in a bar plot view. Here, we see that a large part of the variance of the Iris dataset is explained by the first principal component (more than 70%).
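
As a rough cross-check of these percentages, the explained variance can be recomputed by hand. The sketch below assumes NumPy and the Iris copy bundled with scikit-learn; it relies on the fact that, for standardized data, the variance carried by each principal component is an eigenvalue of the correlation matrix of the features. The variable names are ours.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data  # 150 samples x 4 features

# PCA on standardised data amounts to an eigen-decomposition of the
# correlation matrix of the original features.
corr = np.corrcoef(X, rowvar=False)

# Each eigenvalue is the variance carried by one principal component.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted, largest first
explained_ratio = eigenvalues / eigenvalues.sum()

for i, ratio in enumerate(explained_ratio, start=1):
    print(f"PC{i}: {ratio:.1%}")
```

The first ratio is the height of the first bar in the bar plot view, and the first two ratios together indicate how much of the dataset's variance the 2D scatter plot retains.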
Your experiment and your report can now be shared among your nice collaborators 😀.