Get started with the ML Copilot
Introduction
The ML Copilot simplifies machine learning workflows from data ingestion to model deployment. This guide will walk beginners through the step-by-step process of implementing machine learning projects.
1. Data preparation and ingestion
Start with collecting and preparing data from various sources and formats.
- Collect data: Raw data can come from databases, spreadsheets, APIs, JSON, CSV, or other formats.
- Load data: Use Pandas, NumPy, or specialized AutoML libraries like Auto-Sklearn, AutoGluon, or H2O.ai.
- Data cleaning: Handle missing values, remove duplicates, and correct inconsistent data.
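The loading and cleaning steps above can be sketched with pandas. The CSV content and its column names here are hypothetical, purely for illustration:

```python
# Minimal sketch: load a CSV and apply basic cleaning with pandas.
# The data and column names ("product", "units_sold", "price") are made up.
import io
import pandas as pd

csv_text = """product,units_sold,price
widget,10,2.5
widget,10,2.5
gadget,,3.0
"""

# In practice you would call pd.read_csv("your_file.csv") directly.
df = pd.read_csv(io.StringIO(csv_text))
df = df.drop_duplicates()                      # remove exact duplicate rows
df["units_sold"] = df["units_sold"].fillna(0)  # fill missing values
print(len(df))  # 2 rows remain: one duplicate dropped, one NaN filled
```

The same pattern extends to other sources: `pd.read_json`, `pd.read_sql`, and `pd.read_excel` cover the formats listed above.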
2. Column type detection
Correctly identify data types to process them appropriately.
- Boolean: True/False or Yes/No values.
- Discrete numerical: Integer-based values (e.g., number of products sold).
- Continuous Numerical: Floating-point numbers (e.g., temperature, price).
- Text: Unstructured text data.
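A simple type-detection pass can be written as a heuristic over a column's values. This is an illustrative sketch, not a library API; the thresholds and category names are assumptions:

```python
# Heuristic sketch: classify a column into one of the four types listed above.
def detect_column_type(values):
    vals = [v for v in values if v is not None]
    if all(isinstance(v, bool) for v in vals):
        return "boolean"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in vals):
        return "discrete numerical"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in vals):
        return "continuous numerical"
    # Text columns holding only Yes/No or True/False are booleans in disguise.
    lowered = {str(v).strip().lower() for v in vals}
    if lowered <= {"true", "false", "yes", "no"}:
        return "boolean"
    return "text"

print(detect_column_type([1, 2, 3]))          # discrete numerical
print(detect_column_type([0.5, 2.0]))         # continuous numerical
print(detect_column_type(["Yes", "No"]))      # boolean
print(detect_column_type(["free text", "x"])) # text
```

In practice, pandas' inferred `dtypes` give a first approximation, but string columns like `"Yes"/"No"` still need a second pass like the one above.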
3. Target column detection
Identify the purpose of each column:
- Target/Label: The variable to be predicted.
4. Data normalization and cleaning
A critical step is to inspect the structure of your data and clean it:
- Feature transformation: Apply log transformations where necessary.
- Feature scaling: Normalization and standardization.
- PCA: Unsupervised analysis that reveals the global structure of the data and can help detect outliers.
- Outlier detection: Use univariate methods to detect outliers. Basic methods are the z-score and the IQR method.
- Remove outliers: If required, remove the detected outliers.
5. Task detection
Determine the type of learning problem:
Supervised analysis
- Binary classification: The variable to predict has two classes (e.g., healthy or sick).
- Multi-class classification: The variable to predict has more than two possible categories (e.g., healthy, diabetes, or sepsis).
- Regression: Predicting continuous numerical values.
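The supervised cases above can be distinguished automatically by inspecting the target column. The `max_classes` cutoff below is an illustrative assumption, not a standard value:

```python
# Illustrative heuristic: map a target column to a supervised task type.
def detect_task(target_values, max_classes=20):
    distinct = set(target_values)
    # Non-integer floats strongly suggest a continuous target.
    if any(isinstance(v, float) and not v.is_integer() for v in distinct):
        return "regression"
    if len(distinct) == 2:
        return "binary classification"
    if len(distinct) <= max_classes:
        return "multi-class classification"
    # Many distinct numeric values: treat as continuous.
    return "regression"

print(detect_task(["healthy", "sick"]))                # binary classification
print(detect_task(["healthy", "diabetes", "sepsis"]))  # multi-class classification
print(detect_task([1.2, 3.7, 2.05, 4.4]))              # regression
```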
Unsupervised analysis
- Clustering: Grouping similar data points.
- PCA: Unsupervised linear analysis that reduces dimensionality so the data can be viewed in fewer dimensions (can help identify similarities between data points).
- t-SNE: Unsupervised non-linear analysis that reduces dimensionality so the data can be viewed in fewer dimensions (can help identify similarities between data points).
- UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (Arxiv, JOSS, GitHub). Analogous to t-SNE, but also suited to more general dimension reduction.
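PCA itself is short enough to sketch directly with NumPy: center the data, take its SVD, and project onto the top components. This is a pure-NumPy sketch of the technique, not the API of any PCA library:

```python
# Minimal PCA via SVD: center the data, project onto top principal components.
import numpy as np

def pca(X, n_components=2):
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # synthetic data: 100 points, 5 features
Z = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
```

The 2-D projection `Z` is what you would scatter-plot to eyeball cluster structure or outliers; t-SNE and UMAP serve the same visualization role but capture non-linear structure.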
6. Feature engineering
Feature engineering improves model performance by creating new features:
- Feature creation: Deriving new variables from existing data.
- Feature scaling: Normalization and standardization.
- Encoding categorical variables: One-hot encoding or label encoding.
- Feature transformation: Applying log transformations, PCA, or embeddings.
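Two of the steps above, one-hot encoding and standardization, can be sketched with pandas. The DataFrame and its columns are hypothetical:

```python
# Sketch: one-hot encode a categorical column and standardize a numerical one.
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],  # categorical feature (made up)
    "price": [10.0, 20.0, 30.0],      # numerical feature (made up)
})

# One-hot encoding: each category becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["color"])

# Standardization: rescale to zero mean and unit variance.
encoded["price"] = (encoded["price"] - encoded["price"].mean()) / encoded["price"].std()
print(sorted(encoded.columns))  # ['color_blue', 'color_red', 'price']
```

Label encoding (mapping categories to integers) is the alternative mentioned above; it is compact but imposes an artificial ordering, so one-hot encoding is usually safer for nominal categories.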
7. Feature selection
Select the most important features to reduce dimensionality and improve efficiency:
- Filter methods: Select features based on statistical tests.
- Wrapper methods: Recursive feature elimination, forward/backward selection.
- Embedded methods: Use models like Lasso and Decision Trees to rank feature importance.
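A minimal filter method ranks features by their absolute correlation with the target. The feature names and values below are invented for illustration:

```python
# Filter-method sketch: rank features by absolute correlation with the target.
import pandas as pd

df = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5],       # perfectly correlated with the target
    "f2": [5, 3, 4, 1, 2],       # moderately (negatively) correlated
    "noise": [0, 1, 0, 1, 0],    # uncorrelated
    "target": [2, 4, 6, 8, 10],
})

correlations = df.drop(columns="target").corrwith(df["target"]).abs()
top_features = correlations.sort_values(ascending=False).index.tolist()
print(top_features)  # ['f1', 'f2', 'noise']
```

Correlation only captures linear relationships; statistical tests like chi-squared or mutual information generalize this filter idea to categorical and non-linear cases.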
8. Analysis of obtained results
Interpreting model outcomes:
- Model explainability: SHAP values, feature importance.
- Bias and fairness analysis: Checking for model biases.
- Model drift detection: Tracking performance over time.
9. Creating user interfaces and visualizations
Developing interactive tools for users:
- Dashboards: Displaying real-time insights.
- Visualizations: Feature importance, confusion matrix, decision boundaries.
- APIs & Deployment: Creating user-friendly interfaces for ML models.
Conclusion
AutoML simplifies the entire machine learning pipeline by automating data processing, model selection, feature engineering, and hyperparameter tuning. This guide provides an end-to-end overview for beginners to build their own AutoML solutions effectively.