Getting Started with UMAP Dimension Reduction

Posted Oct 27, 2024


UMAP dimension reduction is a powerful technique for visualizing high-dimensional data. It's a type of dimensionality reduction that helps us understand complex relationships between variables.

To get started with UMAP, you'll need to install the UMAP library, which is available in Python. UMAP is built on top of the scikit-learn library, making it easy to integrate with other data science tools.

The UMAP algorithm is based on a simple yet effective idea: it uses a combination of local and global structure to map high-dimensional data to a lower-dimensional space. This allows us to preserve the relationships between data points while reducing the number of dimensions.

One of the key benefits of UMAP is that it's highly customizable, allowing you to experiment with different parameters to achieve the best results for your specific dataset.

UMAP Basics

UMAP is a nonlinear dimension-reduction algorithm that overcomes some of the limitations of t-SNE. It works similarly to t-SNE in that it tries to find a low-dimensional representation that preserves relationships between neighbors in high-dimensional space.

The original UMAP paper by Leland McInnes, John Healy, and James Melville introduced the algorithm. UMAP is a non-parametric method that consists of two steps: computing a fuzzy topological representation of a dataset, then optimizing a low-dimensional representation to match that fuzzy topological representation as closely as possible.

There are two main hyperparameters in UMAP that control the balance between local and global structure in the final projection: the number of nearest neighbors and the minimum distance between points in low-dimensional space.

Understanding UMAP

Like t-SNE, UMAP tries to find a low-dimensional representation that preserves relationships between neighbors in high-dimensional space, but it runs faster and better preserves the data's global structure.

Fine-tuning hyperparameters like the number of neighbors and the minimum distance can be challenging, and this is where UMAP's speed is a big advantage: by running it multiple times with a variety of values, you can get a sense of how the projection is affected.

UMAP presents several advantages over t-SNE: comparable visualization quality, better preservation of the global data structure, speed and scalability, and applicability beyond visualization, since its embeddings can feed downstream machine-learning tasks.

The two main hyperparameters in UMAP are:

• The number of nearest neighbors (n_neighbors), which controls how UMAP balances local versus global structure

• The minimum distance between points in low-dimensional space (min_dist), which controls how tightly UMAP clumps data points together

Here are some key differences between UMAP and t-SNE:

  • UMAP preserves more of the global data structure, whereas the distance between clusters in t-SNE does not have significant meaning.
  • UMAP is fast and can scale to Big Data, whereas t-SNE can be slow and limited in its scalability.
  • UMAP is not restricted to visualization: its embeddings can serve as input to downstream machine-learning tasks, whereas t-SNE is primarily used for visualization.

Metric

The metric parameter in UMAP controls how distance is computed in the ambient space of the input data. The default is euclidean, but UMAP supports a wide variety of metrics.

One of the most common choices is a Minkowski-style metric, such as euclidean, manhattan, or chebyshev. UMAP also supports miscellaneous spatial metrics, such as mahalanobis, wminkowski, and seuclidean.

Angular and correlation metrics are also available, including Cosine and Correlation. For binary data, UMAP offers metrics like Hamming, Jaccard, and Dice.

If your input data is a distance/similarity matrix, you can use the 'precomputed' metric by setting metric='precomputed'. This can be useful when you've already calculated the pairwise distances between points.

UMAP also supports custom user-defined metrics, as long as they can be compiled in nopython mode by numba. This allows for a high degree of flexibility in defining distance functions.

Here are some of the metrics supported by UMAP:

  • Minkowski style metrics: euclidean, manhattan, chebyshev, minkowski
  • Miscellaneous spatial metrics: mahalanobis, wminkowski, seuclidean
  • Angular and correlation metrics: cosine, correlation
  • Metrics for binary data: hamming, jaccard, dice, russellrao, kulsinski, rogerstanimoto, sokalmichener, sokalsneath, yule

Minimum Distance (min_dist)

The minimum distance, or min_dist, parameter in UMAP controls how tightly points are packed together in the low-dimensional representation: lower values produce clumpier embeddings, which can be useful for clustering or for preserving finer topological structure.

A lower min_dist value of 0.0 allows UMAP to find small connected components, clumps, and strings in the data. This is particularly useful for preserving the local characteristics of the high-dimensional data.

As min_dist is increased, these structures are pushed apart into softer, more general features: compare an embedding built with min_dist=0.0 to one built with a larger value.

The default value for min_dist is 0.1, but you can experiment with a range of values from 0.0 to 0.99 to see what works best for your data.

The actual values of min_dist are somewhat arbitrary, though keeping them in the range of 0.0 to 0.99 is a good place to start.

UMAP Parameters

UMAP has three key parameters that affect the result it produces. It's good to remember that UMAP is trying to work within the constraints you provide while giving the “best” possible result.

One of these parameters is the minimum distance (min_dist), which sets how close together points may be represented in the low-dimensional space. Small values of min_dist let UMAP pack points into a tight embedding, preserving local characteristics of the high-dimensional data.

For the n_components parameter, UMAP provides an option to determine the dimensionality of the reduced dimension space. Unlike some other visualisation algorithms, UMAP scales well in the embedding dimension, so you can use it for more than just visualisation in 2- or 3-dimensions.

The number of neighbors (n_neighbors) parameter determines how UMAP balances the preservation of global versus local features of the high-dimensional data. High values of n_neighbors lead UMAP to consider the space more as a whole.

Min Dist

The min_dist parameter controls how tightly UMAP can pack points together, literally setting the minimum distance apart that points are allowed to be in the low-dimensional representation.

Lower values of min_dist, such as 0.0, result in clumpier embeddings that emphasize finer topological structure, which can be useful for clustering or identifying small connected components in the data.

As min_dist increases, the embedding becomes softer and more general, providing a better overall view of the data at the cost of detailed topological structure. This is what happens when min_dist is set to 0.99.

UMAP can pack points into a tight embedding with small values of min_dist, such as 0.0, which preserves local characteristics of the high-dimensional data.

N Components

The n_components parameter is a crucial one when working with UMAP, and it determines the dimensionality of the reduced dimension space we'll be embedding our data into. You can set it to any number you like, but for visualization purposes, it's best to stick with 1, 2, or 3 dimensions.

Forcing UMAP to embed the data in a line by setting n_components to 1 can be a useful exercise, especially when you want to see the effects of the parameter. This can be done by randomly distributing the data on the y-axis to provide some separation between points.

With more dimensions in which to work, UMAP has an easier time separating out the colors in a way that respects the topological structure of the data. Setting n_components to 3, for example, allows UMAP to work in three dimensions, making it easier to visualize the data.

There's really no requirement to stop at n_components=3, and you can pick a larger embedding dimension if you're interested in density-based clustering or other machine learning techniques. Picking a dimension closer to the underlying manifold on which your data lies can be beneficial in these cases.

Parameters

Configuring UMAP with the right parameters allows you to balance the global or local features of the original space. This is a crucial aspect of dimension reduction, as it helps you understand the relationships between different points in your data.

UMAP's n_components parameter can be set to any value, from 1 to the number of features in your original data. It's worth noting that using a larger embedding dimension can be beneficial for certain machine learning techniques, such as density-based clustering.

The n_neighbors parameter determines how UMAP balances the preservation of global versus local features of the high-dimensional data. It controls how many points around any given point UMAP considers when it transforms the space.

High values of n_neighbors lead UMAP to consider the space more as a whole, whereas low values focus on the relationships among only a few points. "Low" and "high" here are relative to the number of points in your dataset.

Frequently Asked Questions

How does UMAP reduce dimensions?

UMAP reduces dimensions by constructing a high-dimensional graph representation of the data and optimizing a low-dimensional graph to preserve its structural similarity. This process allows UMAP to effectively map high-dimensional data to a lower-dimensional space while maintaining its underlying relationships.

How many dimensions does UMAP have?

UMAP typically displays data in two or three dimensions, reducing high-dimensional data to a more manageable and visualizable format.

What does UMAP tell you?

UMAP reveals both local patterns within clusters and global connections between them, providing a comprehensive view of your data's structure and relationships.

What is uniform manifold approximation and projection UMAP clustering analysis?

UMAP (Uniform Manifold Approximation and Projection) is a dimension-reduction technique that uses graph layout algorithms to arrange data in low-dimensional space while preserving its structural similarity. It's a powerful tool for visualizing complex data, and its embeddings are often used as input to clustering algorithms.

Is UMAP better than t-SNE?

UMAP tends to produce more compact clusters than t-SNE, but both methods can lead to different results even with the same data. The choice between UMAP and t-SNE depends on the specific characteristics of your data and the insights you're trying to gain.

Landon Fanetti

Writer
