Dimension reduction is a crucial step in data science that helps us make sense of complex data. It's a technique that reduces the number of features or dimensions in a dataset while preserving the most important information.
By reducing the number of features, we can mitigate the curse of dimensionality, which arises when a dataset has so many features that the data points become sparse and difficult to analyze. This is especially true for machine learning models, which can become computationally expensive and prone to overfitting in high-dimensional spaces.
In practice, dimension reduction can be achieved through various techniques, such as Principal Component Analysis (PCA) and t-SNE. These methods can be used to reduce the number of features in a dataset, making it easier to visualize and analyze.
What is Dimension Reduction?
Dimensionality reduction (DR) is a technique used to simplify high-dimensional data, making it easier to analyze and understand.
High-dimensional data is common in modern biological datasets, with hundreds or even millions of measurements collected for a single sample. This can be a challenge for statistical methods, which often lack power when applied to such data.
The curse of dimensionality makes it difficult to explore high-dimensional data exhaustively, even with a large number of data points, because those points end up sparsely scattered across a voluminous high-dimensional space.
DR helps alleviate this problem by reducing the dimensionality of the data while retaining the signal of interest.
Common Techniques
There are two broad families of techniques for reducing the number of variables in a dataset while retaining the most important information.
One is feature selection, which keeps the most relevant variables from the original dataset. The other is feature extraction, which constructs a smaller set of new variables that capture most of the information contained in the original ones.
In practice this can be done in many ways, from simple filters such as the missing value ratio, low variance filter, and high correlation filter to model-based methods such as linear discriminant analysis. Used well, these techniques shrink the dataset and often improve the performance of machine learning models.
Here are some common dimensionality reduction techniques (a short code sketch of the first three filters follows this list):
- Missing Value Ratio: Drops variables with a large share of missing values
- Low Variance Filter: Drops variables whose values barely change across observations
- High Correlation Filter: Drops one variable from each pair of highly correlated features
- Random Forest: Scores the importance of each feature and keeps only the top-ranked ones
- Factor Analysis: Groups variables by their correlations and represents each group with a latent factor
- Principal Component Analysis: Transforms the data into a set of components that explain as much variance as possible
- Independent Component Analysis: Transforms the data into statistically independent components, describing it with fewer components
- ISOMAP: Works well for strongly non-linear data
- t-SNE: Works well for strongly non-linear data and is excellent for visualizations
- UMAP: Works well for high-dimensional data and has a shorter runtime than t-SNE
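As a rough, minimal sketch (assuming a pandas DataFrame `df` containing only numeric columns; the threshold values are arbitrary and should be tuned to your data), the first three filters from the list could be applied like this:

```python
import pandas as pd

def simple_filters(df: pd.DataFrame,
                   max_missing_ratio: float = 0.4,
                   min_variance: float = 1e-3,
                   max_correlation: float = 0.95) -> pd.DataFrame:
    """Apply the missing-value-ratio, low-variance, and high-correlation filters."""
    # Missing Value Ratio: drop columns with too many missing values.
    df = df.loc[:, df.isna().mean() <= max_missing_ratio]

    # Low Variance Filter: drop columns whose values barely change.
    df = df.loc[:, df.var(skipna=True) > min_variance]

    # High Correlation Filter: drop one column from each highly correlated pair.
    corr = df.corr().abs()
    to_drop = set()
    for i, col_i in enumerate(corr.columns):
        for col_j in corr.columns[i + 1:]:
            if corr.loc[col_i, col_j] > max_correlation:
                to_drop.add(col_j)
    return df.drop(columns=list(to_drop))
```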
Why These Techniques Are Required
Dimensionality reduction techniques are essential because they help reduce the space required to store data, making it more manageable and efficient. This is crucial when working with large datasets that can be overwhelming to process.
Reducing dimensions also leads to less computation and training time, making algorithms perform better and faster. Some algorithms simply don't work well with high-dimensional data, so dimensionality reduction is necessary to make them useful.
Multicollinearity is another issue that dimensionality reduction techniques address. For example, if you have two variables like "time spent on treadmill in minutes" and "calories burnt", they're highly correlated, and there's no need to store both.
Here are some benefits of dimensionality reduction techniques:
- Reduced storage space
- Faster computation and training time
- Better algorithm performance
- Reduced multicollinearity
- Easier data visualization
These benefits make dimensionality reduction techniques a valuable tool in data analysis and machine learning. By applying these techniques, you can make your data more manageable, efficient, and easier to work with.
Variables' Projection
Variables' projection is a technique used in dimensionality reduction to visualize the relationships between variables in a dataset. It's a powerful tool for understanding the underlying structure of complex data.
By projecting the original variables onto a new set of axes, we can identify patterns and correlations that may not be immediately apparent in the original data. This is done by applying techniques such as Principal Component Analysis (PCA), which transforms the data into a new set of variables called principal components.
These principal components are linear combinations of the original variables and capture the maximum amount of variance in the data. The first principal component accounts for the most variance, followed by the second, third, and so on.
For instance, when PCA is used for variables' projection, each original variable receives a coefficient (loading) on each principal component. Analyzing these coefficients reveals how the variables relate to one another and which variables drive each component.
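A minimal sketch of this idea with scikit-learn, using its built-in wine dataset purely as an example (the dataset choice and the two-component cut-off are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a small example dataset and standardize it (PCA is scale-sensitive).
data = load_wine()
X = StandardScaler().fit_transform(data.data)

# Project the observations onto the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# The loadings are the coefficients of the original variables in each
# principal component (the rows of pca.components_).
loadings = pd.DataFrame(pca.components_.T,
                        index=data.feature_names,
                        columns=["PC1", "PC2"])
print(loadings.round(2))
print("Explained variance ratio:", pca.explained_variance_ratio_.round(2))
```

Variables with large absolute loadings on a component are the ones that drive it, and variables with similar loading patterns tend to be correlated.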
Variables' projection is a useful technique for data visualization and exploration, allowing us to gain insights into the underlying structure of complex data.
Feature Selection and Extraction
Feature selection is an essential step in dimension reduction, and there are several techniques to achieve this. Backward feature elimination is one such technique where we train a model with all the variables, calculate its performance, and then eliminate one variable at a time, observing the impact on the model's performance.
In the case of linear regression or logistic regression models, backward feature elimination can be used to remove variables that do not significantly impact the model's performance. This method is computationally expensive and is typically used on datasets with a small number of input variables.
Another technique is forward feature selection, which involves training a model with each feature separately and selecting the feature that produces the best performance. This process is repeated, adding one feature at a time, until no significant improvement is seen in the model's performance.
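As a rough sketch of how both approaches can be run in practice, here is a version built on scikit-learn's SequentialFeatureSelector; the estimator, the number of features to keep, and the dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward feature selection: start with no features and add one at a time.
forward = SequentialFeatureSelector(model, n_features_to_select=5,
                                    direction="forward", cv=5)
forward.fit(X, y)

# Backward feature elimination: start with all features and drop one at a time.
backward = SequentialFeatureSelector(model, n_features_to_select=5,
                                     direction="backward", cv=5)
backward.fit(X, y)

print("Forward keeps features:", forward.get_support(indices=True))
print("Backward keeps features:", backward.get_support(indices=True))
```

Note that backward elimination refits the model many times, which is why it is usually reserved for datasets with a modest number of input variables.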
Feature extraction is another approach to dimension reduction, where we transform the original variables into new features that capture the most important information. Techniques like PCA, ICA, and SVD can be used for feature extraction.
PCA is a widely used technique that extracts principal components, which are linear combinations of the original variables. The first principal component explains the maximum variance in the dataset, while the subsequent components explain the remaining variance.
ICA, on the other hand, extracts independent components, which are statistically independent of one another rather than merely uncorrelated. The technique is rooted in information theory and is most useful when the underlying signals are non-Gaussian.
SVD is another technique; it decomposes the data matrix into three constituent matrices and can be used to strip redundant features from the dataset.
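To make the comparison concrete, the sketch below runs all three on the same placeholder matrix with scikit-learn (TruncatedSVD stands in for a plain SVD-based reduction; the data and component counts are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.laplace(size=(200, 10))  # placeholder data matrix (non-Gaussian)

# PCA: orthogonal components ordered by the variance they explain.
pca = PCA(n_components=3).fit(X)

# ICA: components chosen to be statistically independent of one another.
ica = FastICA(n_components=3, random_state=0).fit(X)

# Truncated SVD: a low-rank factorization of the data matrix itself.
svd = TruncatedSVD(n_components=3).fit(X)

print("PCA explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("ICA mixing matrix shape:", ica.mixing_.shape)
print("SVD singular values:", svd.singular_values_.round(3))
```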
In summary, feature selection and extraction are essential steps in dimension reduction, and there are several techniques to achieve this, including backward feature elimination, forward feature selection, PCA, ICA, and SVD.
Choosing a Method
The abundance of available dimensionality reduction (DR) methods can seem intimidating, but you don't need to commit to just one tool. The choice of a DR method depends on the nature of your input data, such as whether it's continuous, categorical, count, or distance data.
You should also consider your intuition and domain knowledge about the collected measurements. Linear methods like principal component analysis (PCA) are more adept at preserving global structure, while nonlinear methods like kernel PCA are better at representing local interactions.
Consider the nature and resolution of your data, as DR methods can focus on recovering either global or local structure. Nonlinear methods like t-distributed stochastic neighbor embedding (t-SNE) are better at representing local interactions but do not preserve long-range interactions between data points.
Choose a Method
Choosing a method for dimensionality reduction can be overwhelming, especially with so many techniques available. The choice of a DR method depends on the nature of your input data, including whether it's continuous, categorical, count, or distance data.
Also bring in your intuition and domain knowledge about the collected measurements, for example whether they can capture only small-scale relationships between nearby data points or also long-range interactions between distant observations. Different methods suit different types of data, so it's essential to choose one that matches yours.
Linear methods like Principal Component Analysis (PCA) are more adept at preserving global structure, while nonlinear methods like kernel PCA are better at representing local interactions. If your data has assigned class labels, you might consider using supervised DR techniques like Partial Least Squares (PLS) or Linear Discriminant Analysis (LDA).
In brief, linear methods (PCA, LDA, PLS) preserve global structure and yield components that are interpretable linear combinations of the original variables, whereas nonlinear methods (kernel PCA, Isomap, t-SNE, UMAP) are better at capturing local structure in data that lie on a curved manifold.
Remember that the number of dimensions can be at most the minimum of the number of observations and the number of variables in your dataset. You can use the distribution of eigenvalues to guide your choice of dimensions, or rely on "scree plots" and "the elbow rule" to make decisions.
Disadvantages of PCA
PCA assumes linear relationships between variables, which limits its effectiveness when the relationships in your data are non-linear.
This can lead to inaccurate results, which is frustrating when you're trying to make data-driven decisions. PCA also transforms the original features into components of a new subspace, which reduces the interpretability of those components.
This means you may struggle to understand the underlying meaning of the transformed variables, which can be a major drawback. PCA can be sensitive to outliers since they can strongly influence the calculations.
Outliers can potentially lead to a skewed representation of the data, which can be misleading.
Preprocessing and Visualization
Before applying dimension reduction (DR), suitable data preprocessing is often necessary. For example, data centering—subtracting variable means from each observation—is a required step for PCA on continuous variables and is applied by default in most standard implementations.
Data transformations may be required depending on the application, type of input data, and DR method used. If changes in your data are multiplicative, e.g., your variables measure percent increase/decrease, consider using a log-transform before applying PCA.
In some cases, data needs to be normalized by dividing each measurement by a corresponding sample size factor, estimated using specialized methods like DESeq2 or edgeR. This is especially important when working with genomic sequencing data.
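A minimal sketch of this kind of preprocessing (count-like data in a NumPy array; the crude row-sum scaling below is only a stand-in for proper size factors from a tool such as DESeq2 or edgeR):

```python
import numpy as np
from sklearn.decomposition import PCA

# X: samples x features matrix of non-negative, count-like measurements.
X = np.random.default_rng(1).poisson(lam=20.0, size=(100, 50)).astype(float)

# Crude library-size normalization (placeholder for DESeq2/edgeR size factors).
size_factors = X.sum(axis=1, keepdims=True) / X.sum(axis=1).mean()
X_norm = X / size_factors

# Log-transform multiplicative data, then center. PCA implementations usually
# center by default, but doing it explicitly keeps the preprocessing visible.
X_log = np.log1p(X_norm)
X_centered = X_log - X_log.mean(axis=0)

scores = PCA(n_components=2).fit_transform(X_centered)
```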
For PCA plots, it's essential to adjust the aspect ratio to match the variances in the PC coordinates. This ensures that the plot accurately represents the data.
UMAP
UMAP is a dimension reduction technique that can handle large datasets and high-dimensional data without too much difficulty. It combines the power of visualization with the ability to reduce the dimensions of the data, making it a powerful tool for data analysis.
Some of the key advantages of UMAP include its ability to preserve both local and global data structure, as well as its fast computation time compared to t-SNE. It uses the concept of k-nearest neighbor and optimizes the results using stochastic gradient descent.
The reduction is controlled by two key parameters: n_neighbors and min_dist. n_neighbors sets how many neighboring points are used to build the local structure, while min_dist controls how tightly points are packed together in the embedding.
Here's a brief overview of the UMAP algorithm:
- Calculates the distance between points in high-dimensional space
- Projects points onto low-dimensional space
- Calculates the distance between points in low-dimensional space
- Uses Stochastic Gradient Descent to minimize the difference between distances
UMAP often separates clusters at least as well as t-SNE while preserving more of the data's global structure, which makes it a useful tool for data mining as well as visualization.
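A minimal sketch with the umap-learn package (the dataset and parameter values are illustrative, not recommendations):

```python
import umap  # pip install umap-learn
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

# n_neighbors balances local versus global structure; min_dist controls how
# tightly points are allowed to pack together in the embedding.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)
print(embedding.shape)  # (n_samples, 2)
```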
Scaling
Scaling is an essential step in preprocessing data, especially when dealing with heterogeneous features with highly variable ranges or distinct units. This is because scaling ensures equal contribution from each variable, which is crucial for accurate results.
Data centering, or subtracting variable means from each observation, is a required step for PCA on continuous variables. Most standard implementations apply this step by default.
Scaling involves multiplying each measurement of a variable by a scalar factor so that the resulting feature has a variance of one. This helps to balance the influence of each variable on the results.
However, normalizing feature variances is not advised when the units of all variables are the same. This is because it can result in shrinkage of features containing strong signals and inflation of features with no signal.
In some cases, scaling may not be necessary, such as when working with high-throughput assays where the units of all variables are the same.
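A brief sketch of that decision in code (placeholder data; whether to scale is your call based on the units):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(2).normal(size=(100, 4))  # placeholder measurements

# Heterogeneous units (e.g., minutes, kilograms, dollars): center AND scale so
# every feature has unit variance and contributes equally.
X_scaled = StandardScaler(with_mean=True, with_std=True).fit_transform(X)

# Same units across all features (e.g., a single high-throughput assay):
# center only, and leave the variances untouched.
X_centered = StandardScaler(with_mean=True, with_std=False).fit_transform(X)
```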
Embedding Similarity and Dissimilarity Data
You can use embedding methods when your input is similarity or dissimilarity data, even if the original variable measurements are not available. This approach is especially effective when the original data are binary and the Euclidean distance is not appropriate.
One option is classical multidimensional scaling (cMDS), also called principal coordinate analysis (PCoA), or non-metric multidimensional scaling (NMDS). Both use pairwise dissimilarities between data points to find an embedding in Euclidean space that best approximates the supplied distances.
You can choose a dissimilarity metric that provides the best summary of your data, such as the Manhattan distance for binary data or the Jaccard distance for sparse features.
Optimization-based multidimensional scaling (MDS) is another way to embed dissimilarity data. It is referred to as "local" MDS when it preserves only local interactions, by restricting the minimization problem to the distances between data points and their neighbors.
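As a rough illustration (binary presence/absence data summarized with the Jaccard distance, then embedded with metric and non-metric MDS from scikit-learn):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Binary presence/absence data, where the Euclidean distance is not appropriate.
X = np.random.default_rng(3).integers(0, 2, size=(60, 30)).astype(bool)
D = squareform(pdist(X, metric="jaccard"))  # pairwise Jaccard dissimilarities

# Metric MDS (close in spirit to cMDS/PCoA) on the precomputed distances.
metric_mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = metric_mds.fit_transform(D)

# Non-metric MDS (NMDS) preserves only the rank order of the dissimilarities.
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           random_state=0)
coords_nmds = nmds.fit_transform(D)
```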
Scree Plot
A scree plot is a visual tool that helps you decide on the number of dimensions to keep after dimensionality reduction (DR). It's a plot of the eigenvalues of the covariance matrix of your data.
For spectral methods, the eigenvalues can be used to decide how many dimensions are sufficient. The number of dimensions to keep can be selected based on an "elbow rule", where you look for a point where the rate of decrease in eigenvalues slows down.
For example, if the eigenvalues start to decrease much more slowly after the fifth component, you would keep the first five principal components.
You can also use the scree plot to evaluate whether incorporating more components achieves a significantly lower value of the loss function that the method minimizes.
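A minimal sketch of a scree plot built from PCA's explained variance (placeholder data; where the elbow falls depends entirely on your dataset):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(4).normal(size=(200, 20))  # placeholder data
pca = PCA().fit(X)

# Eigenvalues of the covariance matrix, largest first.
eigenvalues = pca.explained_variance_

plt.plot(np.arange(1, len(eigenvalues) + 1), eigenvalues, "o-")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue (explained variance)")
plt.title("Scree plot: look for the elbow")
plt.show()
```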
Apply Correct Aspect Ratio for Visualizations
Applying the correct aspect ratio for your visualizations is crucial for accurately reflecting the output of the data reduction methods you use.
The aspect ratio of a 2D plot can strongly influence your perception of the data, and it's essential to ensure that the height-to-width ratio is consistent with the ratio between the corresponding eigenvalues.
In the case of PCA or PCoA, each output dimension has a corresponding eigenvalue proportional to the amount of variance it explains.
Two-dimensional PCA plots with equal height and width are misleading but frequently encountered because popular software programs often produce square graphics by default.
Adding + coord_fixed(1) to a plot built with the ggplot2 R package will ensure a correct aspect ratio.
The aspect ratio issue is illustrated in a simulated example, where a rectangular plot shows an incorrect apparent grouping of the data points into a top and a bottom cluster.
For t-SNE, the convention is to make the projection plots square or cubical, as the dimensions are unordered and equally important.
Aspect Ratio for PCA
The aspect ratio of a PCA plot is crucial for accurately reflecting the data. The height-to-width ratio of a PCA plot should be consistent with the ratio between the corresponding eigenvalues.
In PCA, each output dimension has a corresponding eigenvalue proportional to the amount of variance it explains. If the relationship between the height and the width of a plot is arbitrary, an adequate picture of the data cannot be attained.
Popular software programs for analyzing biological data often produce square (2D) or cubical (3D) graphics by default, which can be misleading. To ensure a correct aspect ratio, add + coord_fixed(1) when using the ggplot2 R package.
The issue is easy to see in a simulated example: a plot drawn with an arbitrary aspect ratio suggests an incorrect grouping of the data points, while the correctly proportioned plot shows clusters consistent with the true class assignment.
By adjusting the aspect ratio of a PCA plot, you can get a more accurate picture of the data, which is essential for making informed decisions.
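The coord_fixed(1) call handles this in ggplot2; if you plot PCA scores with matplotlib instead, a comparable fix (a sketch under the assumption that `scores` holds the first two PC coordinates) is to force one data unit to span the same length on both axes:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(5).normal(size=(300, 10))  # placeholder data
scores = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=10)
# One unit along PC1 spans the same number of pixels as one unit along PC2,
# analogous to adding coord_fixed(1) to a ggplot.
ax.set_aspect("equal")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()
```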
Advanced Topics
Dimension reduction is a crucial step in data analysis, and there are several advanced techniques to explore. Neighborhood Component Analysis (NCA) is a non-parametric method for selecting features that maximizes prediction accuracy of regression and classification algorithms.
To further reduce dimensionality, we can use techniques like Principal Component Analysis (PCA) and Factor Analysis. PCA reduces the dimensionality of data by replacing several correlated variables with a new set of variables that are linear combinations of the original variables. Factor Analysis, on the other hand, fits a model to multivariate data to estimate interdependence of measured variables on a smaller number of unobserved (latent) factors.
For example, PCA can be used to analyze the quality of life in U.S. cities, while Factor Analysis can be used to investigate whether companies within the same sector experience similar week-to-week changes in stock prices.
Some popular dimension reduction techniques include:
- Neighborhood Component Analysis (NCA)
- Principal Component Analysis (PCA)
- Factor Analysis
These techniques can be used to reduce the complexity of high-dimensional data and improve the accuracy of predictive models.
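A minimal sketch using scikit-learn's NeighborhoodComponentsAnalysis, which learns a supervised linear transformation closely related to the NCA feature selection described above (the dataset, component count, and classifier are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# NCA learns a low-dimensional projection that makes nearest-neighbor
# classification as accurate as possible on the training labels.
model = make_pipeline(
    StandardScaler(),
    NeighborhoodComponentsAnalysis(n_components=2, random_state=0),
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```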
Not a Method: Vectorizing and Mixture Models
Vectorizing data can be a game-changer for dimensionality reduction. By replacing precise product names with categories, we can preserve the most important features of our data set, even if we lose some information in the process.
Numerical data can also be condensed by replacing many observations with summary statistics, such as means or quantiles. How well this works depends on which underlying information you need to retain.
The choice of statistic is crucial, as an incorrect choice can result in missing necessary knowledge. For instance, using a simple mean for outlier detection is a bad idea.
Higher-order moments like variance or kurtosis are often more suitable for outlier detection tasks. This is especially true when dealing with complex data sets that have thousands of products, each with its own sales history.
By analyzing the nonparametric statistics of our data, we can gain valuable insights into the underlying patterns and relationships. This can help us build more accurate machine learning models that achieve our goals.
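As a hedged sketch of this idea (a hypothetical `sales` table with one row per product per day), a handful of summary statistics can collapse a long sales history into one row per product:

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales: one row per (product, day).
rng = np.random.default_rng(6)
sales = pd.DataFrame({
    "product": np.repeat([f"item_{i}" for i in range(5)], 365),
    "units": rng.poisson(lam=10, size=5 * 365),
})

# Replace each product's full history with a few chosen statistics. Quantiles
# and higher moments are often more useful than the mean alone, for example
# when the goal is to spot products with unusual demand spikes.
summary = sales.groupby("product")["units"].agg(
    median="median",
    q90=lambda s: s.quantile(0.9),
    variance="var",
    kurtosis=pd.Series.kurt,
)
print(summary)
```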
Locally Linear Embeddings
Locally Linear Embeddings (LLE) is a powerful tool for uncovering the underlying structure of high-dimensional data by mapping it to a lower-dimensional space while preserving local geometrical relationships.
LLE assumes that each data point can be expressed as a linear combination of its neighbors, which is a key concept in the algorithm. This assumption is based on the idea that the data lies on a smooth, manifold-like structure embedded in a higher-dimensional space.
One of the main advantages of LLE is its ability to capture intricate nonlinear relationships in data, making it suitable for complex data sets. This is particularly useful when traditional linear techniques fall short.
LLE can be computationally expensive, particularly for large datasets, with roughly O(N^2) cost for the neighbor search; in addition, the algorithm solves a small system of linear equations for each data point, which can be time-consuming.
Proper parameter tuning is crucial for LLE, including the number of neighbors and the dimensionality of the embedding space. Improper parameter selection can lead to suboptimal results.
LLE is at its best when the data lie on a smooth nonlinear manifold, the classic example being a Swiss roll that needs to be "unrolled" into two dimensions, as in the sketch below.
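A minimal sketch with scikit-learn (the neighbor count and output dimensionality are the parameters you would normally tune):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A 3-D Swiss roll is a standard example of data lying on a nonlinear manifold.
X, color = make_swiss_roll(n_samples=1500, random_state=0)

# n_neighbors and n_components are the key parameters to tune.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_unrolled = lle.fit_transform(X)
print(X_unrolled.shape)  # (1500, 2)
```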
Latent Structure
Latent structure is a fascinating topic in data analysis, and it's essential to understand how to identify and interpret it.
Latent structure can manifest as clusters of data points that group together, indicating a relationship between the variables being measured.
In PCA plots, these clusters may appear as distinct groups; for example, a PCA of measured wine properties can separate wines into their categories.
Continuous gradients, on the other hand, can be identified when data points exhibit a gradual shift from one extreme to another, often appearing as smooth curves in DR visualizations.
A horseshoe or arch-shaped configuration can occur in PCA and cMDS plots when the associated eigenvectors take on a specific form, a common signature of a latent linear gradient.
This can be particularly useful when analyzing data involving a linear gradient, such as cell development or differentiation.
To identify latent gradients, focus on the observations at the endpoints (extremes) of the gradient and inspect how their values differ for any available external covariates.
If external covariates are available, include them in DR visualizations, for example by coloring the points, to gain a better understanding of the data.
By recognizing and interpreting latent structure, you can gain valuable insights into the underlying relationships and patterns in your data.
Multidomain Opportunities
Multidomain data offers a wealth of opportunities for discovery and analysis.
Integrating multiple datasets allows you to obtain a more accurate representation of higher-order interactions and evaluate the associated variability.
Sometimes, more than one set of measurements is collected for the same set of samples, such as high-throughput genomic studies involving data from multiple domains.
For example, microarray gene expression, miRNA expression, proteomics, and DNA methylation data might be gathered for the same biological sample.
This type of data can be subject to varying levels of uncertainty, with different parts of the data experiencing different rates of change or fluctuation.
To deal with multidomain data, you can perform DR for each dataset separately and then align them together using a Procrustes transformation.
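A minimal sketch of that two-step idea with SciPy and scikit-learn (two hypothetical domains measured on the same samples, each reduced with PCA and then aligned with a Procrustes transformation):

```python
import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n_samples = 100
domain_a = rng.normal(size=(n_samples, 50))  # e.g., gene expression (placeholder)
domain_b = rng.normal(size=(n_samples, 30))  # e.g., methylation (placeholder)

# Step 1: reduce each domain separately.
coords_a = PCA(n_components=2).fit_transform(domain_a)
coords_b = PCA(n_components=2).fit_transform(domain_b)

# Step 2: align the two embeddings; the disparity measures how well they agree.
aligned_a, aligned_b, disparity = procrustes(coords_a, coords_b)
print("Procrustes disparity:", round(disparity, 3))
```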
A number of more advanced methods have been developed, such as STATIS and DiSTATIS, which are generalizations of PCA and classical MDS, respectively.
These methods are used to analyze several sets of data tables collected on the same set of observations and combine datasets into a common consensus structure called the "compromise".
The datasets can be projected onto this consensus space, allowing you to observe different patterns in observations characterized by data from different domains.
DiSTATIS, for example, can be used to analyze multiple distance tables defined for the same observations, such as gene expression, methylation, clinical data, or data resampled from a known data-generating distribution.
Verify and Quantify Uncertainties
Verifying and quantifying uncertainties is crucial in data analysis, especially when dealing with dimensionality reduction techniques like Principal Component Analysis (PCA). Ill-defined PCs can lead to unstable results, where even a slight change in one observation can result in a completely different set of eigenvectors.
In such cases, it's essential to keep dimensions corresponding to similar eigenvalues together and not interpret them individually. This is because these PCs are not informative on their own, making it challenging to interpret their loadings.
Data point uncertainties can also affect the stability of DR output coordinates. For instance, projecting bootstrap samples alongside the full dataset can reveal how much the coordinates of each data point move from one bootstrap trial to the next.
By analyzing the stability of the DR output coordinates across resampled versions of the data, you can better understand the robustness of your results and quantify their uncertainty.
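A hedged sketch of the bootstrap idea (resample the rows, refit PCA, and project the original points each time to see how much their coordinates move):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 20))  # placeholder data
X = X - X.mean(axis=0)

full_pca = PCA(n_components=2).fit(X)

bootstrap_coords = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))  # resample rows with replacement
    boot_pca = PCA(n_components=2).fit(X[idx])
    coords = boot_pca.transform(X)              # project the ORIGINAL points
    # Fix arbitrary sign flips so bootstrap axes line up with the full-data axes.
    signs = np.sign(np.sum(boot_pca.components_ * full_pca.components_, axis=1))
    bootstrap_coords.append(coords * signs)

# The per-point spread across bootstrap trials quantifies coordinate uncertainty.
spread = np.stack(bootstrap_coords).std(axis=0)
print("Mean coordinate standard deviation per PC:", spread.mean(axis=0).round(3))
```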
Frequently Asked Questions
What is the difference between PCA and LDA and SVD?
PCA is ideal for unsupervised dimensionality reduction, while LDA excels in supervised settings where class separation matters. SVD is a versatile matrix factorization technique used in many applications, including collaborative filtering and low-rank approximation.
What is dimension reduction in bioinformatics?
Dimensionality reduction in bioinformatics is a technique that transforms high-dimensional biological data into a more compact, lower-dimensional representation, preserving essential information. This method helps simplify complex data for easier analysis and interpretation.
Sources
- https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/
- https://www.mathworks.com/help/stats/dimensionality-reduction.html
- https://nexocode.com/blog/posts/dimensionality-reduction-techniques-guide/
- https://pmc.ncbi.nlm.nih.gov/articles/PMC6586259/
- https://hex.tech/blog/dimensionality-reduction-techniques/