## UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Pip Python

#### Detail of the version

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation, similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

1. The data is uniformly distributed on a Riemannian manifold;
1. The Riemannian metric is locally constant (or can be approximated as such);
1. The manifold is locally connected.

From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.
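
As a rough sketch of what "closest possible equivalent fuzzy topological structure" means, the construction can be summarised with the formulas below. The symbols are not defined in this description and are introduced here for illustration only: $d$ is the chosen input metric, $\rho_i$ is the distance from $x_i$ to its nearest neighbour, $\sigma_i$ is a per-point scale, $y_i$ is the low dimensional image of $x_i$, and $a$, $b$ are curve parameters derived from `min_dist`. The paper remains the authoritative source.

```latex
% Directed membership strength of the edge from x_i to one of its nearest neighbours x_j
v_{j \mid i} = \exp\!\left( -\frac{\max\bigl(0,\; d(x_i, x_j) - \rho_i\bigr)}{\sigma_i} \right)

% Symmetrisation by fuzzy set union
v_{ij} = v_{j \mid i} + v_{i \mid j} - v_{j \mid i}\, v_{i \mid j}

% Membership strength in the low dimensional embedding
w_{ij} = \left( 1 + a \,\lVert y_i - y_j \rVert_2^{2b} \right)^{-1}

% Fuzzy set cross entropy minimised over the embedding coordinates y
C = \sum_{i \neq j} \left[ v_{ij} \log \frac{v_{ij}}{w_{ij}}
    + (1 - v_{ij}) \log \frac{1 - v_{ij}}{1 - w_{ij}} \right]
```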

The details of the underlying mathematics can be found in the UMAP paper on arXiv:

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

UMAP has several hyperparameters that can have a significant impact on the resulting embedding. In this notebook we will be covering the four major ones:

• `n_neighbors`: This parameter controls how UMAP balances local versus global structure in the data. It does this by constraining the size of the local neighborhood UMAP will look at when attempting to learn the manifold structure of the data. Low values of `n_neighbors` force UMAP to concentrate on very local structure (potentially to the detriment of the big picture), while large values push UMAP to look at larger neighborhoods of each point when estimating the manifold structure, losing fine detail for the sake of capturing the broader structure of the data.
• `min_dist`: This parameter controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low dimensional representation. Low values of `min_dist` result in clumpier embeddings, which can be useful if you are interested in clustering or in finer topological structure. Larger values of `min_dist` prevent UMAP from packing points together and focus on preserving the broad topological structure instead.
• `n_components`: As is standard for many `scikit-learn` dimension reduction algorithms, UMAP provides an `n_components` parameter that lets the user set the dimensionality of the reduced space the data is embedded into. Unlike some other visualisation algorithms such as t-SNE, UMAP scales well in the embedding dimension, so you can use it for more than just visualisation in 2 or 3 dimensions.
• `metric`: This parameter selects the metric used to measure distance in the input space. The available metrics are grouped below.

Minkowski style metrics

• euclidean
• manhattan
• chebyshev
• minkowski
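
To make the relationship between the four Minkowski style metrics concrete, here is a minimal pure-Python sketch using the standard textbook definitions. The function names are illustrative and are not UMAP's own implementations, which may differ in detail (for example in how the order `p` is supplied).

```python
def minkowski(u, v, p=2.0):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a true metric")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def euclidean(u, v):
    """Special case p = 2: straight-line distance."""
    return minkowski(u, v, p=2.0)


def manhattan(u, v):
    """Special case p = 1: sum of absolute coordinate differences."""
    return minkowski(u, v, p=1.0)


def chebyshev(u, v):
    """Limit p -> infinity: largest absolute coordinate difference."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    return max(abs(a - b) for a, b in zip(u, v))


u = [0.0, 0.0]
v = [3.0, 4.0]
print(euclidean(u, v), manhattan(u, v), chebyshev(u, v))
```

The Euclidean and Manhattan metrics are Minkowski distances with `p = 2` and `p = 1`, and the Chebyshev metric is the limiting case as `p` grows.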

Miscellaneous spatial metrics

• canberra
• braycurtis
• haversine

Normalized spatial metrics

• mahalanobis
• wminkowski
• seuclidean

Angular and correlation metrics

• cosine
• correlation
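
The two angular metrics are closely related, which a short sketch makes visible. The code below uses the standard definitions (one minus the cosine similarity, and the same quantity after centring each vector on its mean). The function names and the choice to raise on zero-length vectors are assumptions of this sketch, and UMAP's own implementations may treat such edge cases differently.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Sketch-only choice: the angle is undefined for a zero vector.
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def correlation_distance(u, v):
    """Cosine distance between the mean-centred versions of u and v."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centred_u = [a - mean_u for a in u]
    centred_v = [b - mean_v for b in v]
    return cosine_distance(centred_u, centred_v)
```

Both metrics ignore vector magnitude. The correlation metric additionally ignores a constant offset, so `[1, 2, 3]` and `[11, 12, 13]` are at distance zero under correlation but not under cosine.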

Metrics for binary data

• hamming
• jaccard
• dice
• russellrao
• kulsinski
• rogerstanimoto
• sokalmichener
• sokalsneath
• yule
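
The four hyperparameters described above map directly onto keyword arguments of the `umap.UMAP` estimator from the `umap-learn` package. The sketch below is a minimal illustration, not part of this tool: the helper names `build_umap_params` and `embed` are hypothetical, the defaults shown are the library defaults (15, 0.1, 2 and `"euclidean"`), and the range checks are simple sanity checks written for this example.

```python
def build_umap_params(n_neighbors=15, min_dist=0.1, n_components=2,
                      metric="euclidean"):
    """Collect the four major hyperparameters as keyword arguments.

    Hypothetical helper; the checks below are illustrative only.
    """
    if n_neighbors < 2:
        raise ValueError("n_neighbors must be at least 2")
    if min_dist < 0.0:
        raise ValueError("min_dist cannot be negative")
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return {
        "n_neighbors": n_neighbors,    # local versus global structure
        "min_dist": min_dist,          # how tightly points may be packed
        "n_components": n_components,  # dimension of the embedding
        "metric": metric,              # distance in the input space
    }


def embed(data, **hyperparameters):
    """Fit UMAP on data of shape (n_samples, n_features).

    Returns the embedding with shape (n_samples, n_components).
    """
    import umap  # from the umap-learn package, imported only when needed

    params = build_umap_params(**hyperparameters)
    return umap.UMAP(**params).fit_transform(data)
```

For example, `embed(data, n_neighbors=5, min_dist=0.0)` would favour tight, local clusters, while `embed(data, n_neighbors=100, min_dist=0.5)` would favour the broad layout of the data.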

##### Output
File or folder
Optional
File or folder
##### Authors
Thibault ETIENNE