V1 - UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Contributor(s)

Thibault E

Publication date

Feb 29, 2024

Confidentiality

Public

Detail of the version

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation, similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

The data is uniformly distributed on a Riemannian manifold;

The Riemannian metric is locally constant (or can be approximated as such);

The manifold is locally connected.

From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.
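To make the "fuzzy topological structure" a little more concrete, here is a minimal sketch of how membership strengths between a point and its nearest neighbours can be computed. It follows the formulation in the UMAP paper rather than the umap-learn source code, so the function names (smooth_knn_weights, fuzzy_union) and the bisection details are illustrative assumptions, not the library's API.

```python
import math

def smooth_knn_weights(distances, n_iter=64):
    """Membership strengths of one point's k nearest neighbours.

    `distances` are the sorted distances to the k nearest neighbours.
    rho is the distance to the closest neighbour (local connectivity), and
    sigma is chosen by bisection so that the strengths sum to log2(k).
    """
    k = len(distances)
    target = math.log2(k)
    rho = distances[0]
    lo, hi = 0.0, float("inf")
    sigma = 1.0
    for _ in range(n_iter):
        total = sum(math.exp(-max(0.0, d - rho) / sigma) for d in distances)
        if abs(total - target) < 1e-9:
            break
        if total > target:
            # Strengths too large: shrink sigma.
            hi = sigma
            sigma = (lo + hi) / 2
        else:
            # Strengths too small: grow sigma (double until an upper bound exists).
            lo = sigma
            sigma = sigma * 2 if hi == float("inf") else (lo + hi) / 2
    return [math.exp(-max(0.0, d - rho) / sigma) for d in distances]

def fuzzy_union(a, b):
    """Combine the two directed strengths a->b and b->a into one edge weight."""
    return a + b - a * b

# Toy input: distances from one point to its 4 nearest neighbours.
weights = smooth_knn_weights([1.0, 2.0, 3.0, 4.0])
```

The closest neighbour always receives strength 1, which reflects the local connectivity assumption, and the strengths decay with distance at a rate adapted to each point's neighbourhood.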

The details of the underlying mathematics can be found in the UMAP paper on arXiv:

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

UMAP has several hyperparameters that can have a significant impact on the resulting embedding. The four major ones are covered here:

n_neighbors

This parameter controls how UMAP balances local versus global structure in the data. It does this by constraining the size of the local neighborhood UMAP looks at when learning the manifold structure of the data. Low values of n_neighbors force UMAP to concentrate on very local structure (potentially to the detriment of the big picture), while large values push UMAP to look at larger neighborhoods of each point when estimating the manifold structure, losing fine detail for the sake of capturing the broader structure of the data.
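The effect of the neighborhood size can be illustrated without UMAP itself. The sketch below is a brute-force nearest-neighbour search on toy data; k_nearest is a hypothetical helper, not part of umap-learn, and the points are made up for illustration.

```python
import math

def k_nearest(points, index, k):
    """Indices of the k points closest to points[index] (brute force, Euclidean)."""
    dists = [
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    ]
    return [j for _, j in sorted(dists)[:k]]

# Toy data: two well-separated groups of three points on a line.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]

small = k_nearest(points, 0, k=2)  # neighbourhood stays inside the first group
large = k_nearest(points, 0, k=4)  # neighbourhood reaches into the second group
```

With k=2 the first point only "sees" its own group, while with k=4 its neighbourhood spans both groups. This is the sense in which a larger n_neighbors trades local detail for broader structure.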

min_dist

The min_dist parameter controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low dimensional representation. This means that low values of min_dist will result in clumpier embeddings, which can be useful if you are interested in clustering or in finer topological structure. Larger values of min_dist will prevent UMAP from packing points together and will focus on preserving the broad topological structure instead.
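One way to picture min_dist is through the target similarity curve used in the low dimensional space: any pair of points closer than min_dist is already treated as maximally similar, so there is no incentive to pull them closer still. The sketch below shows that piecewise curve. As far as we understand, umap-learn fits a smooth approximation to a curve of this shape rather than using it directly, and spread is its separate scale parameter; target_similarity itself is a hypothetical name.

```python
import math

def target_similarity(d, min_dist=0.1, spread=1.0):
    """Target similarity for two embedded points at distance d.

    Flat at 1.0 up to min_dist, then exponential decay scaled by spread.
    """
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)

# The same small distance under a tight and a loose min_dist setting.
tight = target_similarity(0.05, min_dist=0.0)  # below 1.0: still pulled together
loose = target_similarity(0.05, min_dist=0.5)  # exactly 1.0: already close enough
```

With min_dist=0.0 even very close points keep being attracted, which produces the clumpier embeddings described above.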

n_components

As is standard for many scikit-learn dimension reduction algorithms, UMAP provides an n_components parameter that lets the user set the dimensionality of the reduced space the data is embedded into. Unlike some other visualisation algorithms such as t-SNE, UMAP scales well in the embedding dimension, so you can use it for more than just visualisation in 2 or 3 dimensions.
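As a rough illustration of how n_components might be chosen for different purposes, the sketch below collects the hyperparameters into keyword arguments. umap_settings is a hypothetical helper with its own sanity checks; the default values mirror those documented for umap-learn, but the helper itself is not part of the library.

```python
def umap_settings(n_neighbors=15, min_dist=0.1, n_components=2, metric="euclidean"):
    """Hypothetical helper: sanity-check and collect UMAP hyperparameters."""
    if n_neighbors < 2:
        raise ValueError("n_neighbors should be at least 2")
    if min_dist < 0.0:
        raise ValueError("min_dist cannot be negative")
    if n_components < 1:
        raise ValueError("n_components should be at least 1")
    return {
        "n_neighbors": n_neighbors,
        "min_dist": min_dist,
        "n_components": n_components,
        "metric": metric,
    }

# A 2-D embedding for plotting, and a 10-D embedding as input to clustering.
for_plotting = umap_settings(n_components=2)
for_downstream = umap_settings(n_components=10, min_dist=0.0)
```

With umap-learn installed, these dictionaries could be passed on as umap.UMAP(**for_downstream).fit_transform(X), where X is an array of shape (n_samples, n_features); the result would then have 10 columns.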

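metric

This parameter controls how distance is computed in the ambient space of the input data. The supported metrics include: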

Minkowski style metrics

euclidean

manhattan

chebyshev

minkowski

Miscellaneous spatial metrics

canberra

braycurtis

haversine

Normalized spatial metrics

mahalanobis

wminkowski

seuclidean

Angular and correlation metrics

cosine

correlation

Metrics for binary data

hamming

jaccard

dice

russellrao

kulsinski

rogerstanimoto

sokalmichener

sokalsneath

yule
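To give a feel for how these choices differ, here are plain-Python versions of a few of the listed metrics, using their standard textbook definitions. They are illustrative sketches only, not the optimised implementations a library would use, and the function names are our own.

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance; p=1 gives manhattan, p=2 gives euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Largest absolute difference along any single coordinate."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y (both non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def jaccard_distance(x, y):
    """For binary vectors: share of 'on' positions that are not shared."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either

# Toy vectors chosen so the results are easy to check by hand.
x, y = (0.0, 3.0), (4.0, 0.0)
manhattan_xy = minkowski(x, y, p=1)
euclidean_xy = minkowski(x, y, p=2)
chebyshev_xy = chebyshev(x, y)
cosine_xy = cosine_distance(x, y)
jaccard_ab = jaccard_distance((1, 1, 0, 0), (1, 0, 1, 0))
```

The same pair of points is 7 apart under manhattan, 5 under euclidean and 4 under chebyshev, so the choice of metric changes which points count as neighbours before UMAP does any embedding.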

For more information, see https://umap-learn.readthedocs.io/en/latest/parameters.html

Input(s)

File or folder

Fs node

Output(s)

File or folder

Fs node
