Uniform Manifold Approximation and Projection (UMAP) is a new dimensionality-reduction technique from the family of “neighbour graph” methods (as opposed to the “matrix factorization” family, such as PCA).

The paper is quite math-heavy but here’s the TLDR:

• Let’s assume that our dataset is distributed uniformly on a manifold and that this manifold is locally connected.
• Let’s create a fuzzy metric that forces this “uniform” behaviour on the manifold (a Riemannian metric).
• $$w_i(X_i, X_j) = \exp(-(d(X_i, X_j) - \rho_i)/ \sigma_i)$$ where $$d$$ is a distance function, $$\rho_i$$ is the distance from $$X_i$$ to its nearest neighbour, and $$\sigma_i$$ is a normalization set by the diameter of the neighbourhood.
• Since this weight is not symmetric, the actual metric combines both directions via the fuzzy union $$f(\alpha, \beta) = \alpha + \beta - \alpha \beta$$, where $$\alpha = w_i(X_i, X_j)$$ and $$\beta = w_j(X_j, X_i)$$.
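The two steps above can be sketched in a few lines of NumPy. This is a simplification, not the real algorithm: here `sigma_i` is crudely taken as the distance to the k-th neighbour, whereas UMAP solves for it by binary search so that the weights sum to $$\log_2(k)$$; the function names are mine.

```python
import numpy as np

def fuzzy_weights(X, k=3):
    """Directed fuzzy weights w_i(X_i, X_j) = exp(-(d - rho_i) / sigma_i).

    Sketch only: sigma_i is approximated by the distance to the k-th
    neighbour instead of UMAP's binary-search calibration.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    sorted_d = np.sort(d, axis=1)
    rho = sorted_d[:, 1]         # distance to nearest neighbour (index 0 is self)
    sigma = sorted_d[:, k]       # crude stand-in for the smoothed bandwidth
    w = np.exp(-np.maximum(d - rho[:, None], 0.0) / sigma[:, None])
    np.fill_diagonal(w, 0.0)     # no self-edges
    return w

def symmetrize(w):
    """Fuzzy union: f(alpha, beta) = alpha + beta - alpha * beta."""
    return w + w.T - w * w.T
```

Because each directed weight lies in [0, 1], the fuzzy union also lies in [0, 1] and is symmetric by construction.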

We then create a second fuzzy graph $$w'$$ over the embedded points in the lower-dimensional space (initialized randomly). Finally, we optimize the embedding by minimizing the cross-entropy between $$w$$ and $$w'$$ using SGD.

$$C(w, w') = \sum_{i\sim j} w(i,j)\log\frac{w(i,j)}{w'(i,j)} + (1 - w(i,j)) \log\frac{1 - w(i,j)}{1 - w'(i,j)}$$
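This cost is a sum of KL divergences between Bernoulli distributions, one per edge, so it is non-negative and zero exactly when $$w' = w$$. A direct translation (with clipping for numerical safety; the function name is mine):

```python
import numpy as np

def fuzzy_cross_entropy(w, w_prime, eps=1e-9):
    """C(w, w') = sum_{i~j} w log(w / w') + (1 - w) log((1 - w) / (1 - w'))."""
    w = np.clip(w, eps, 1.0 - eps)        # avoid log(0) and division by zero
    wp = np.clip(w_prime, eps, 1.0 - eps)
    return np.sum(w * np.log(w / wp) + (1.0 - w) * np.log((1.0 - w) / (1.0 - wp)))
```

In practice UMAP does not evaluate this sum exactly: it samples edges proportionally to their weight and uses negative sampling for the repulsive $$1 - w$$ term.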

Why is this fast?

• The k-NN graph needs to be computed only once.
• The cost is compared only over pairs of points that are neighbours, not over all pairs.

Features:

• Fast (it uses Numba for LLVM-based JIT compilation).
• New data points can be transformed into an existing embedding.
• Class labels can be provided (supervised mode) to get a better embedding while preserving the relations between classes.

Code

Blog

Awesome Talk