Dimensionality Reduction

Abstract:

There are times when a dataset has a kind of symmetry that can be used to remove some of its redundant variables. Conceptually, it is like describing the surface of a sphere in 3 dimensions: in Cartesian \(x\)-\(y\)-\(z\) coordinates we need to work with 3 variables, but in spherical coordinates we only need the \(\theta\) and \(\phi\) variables, because the symmetry (the fixed radius) has removed one of them. The same concept exists in data science. Just as with coordinate transformations in physics, we can define new variables (features) as linear or non-linear combinations of the original variables such that only a smaller set of the new variables remains important.
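
For concreteness, a point on a sphere of fixed radius \(R\) can be written as

\[
x = R\sin\theta\cos\phi, \qquad y = R\sin\theta\sin\phi, \qquad z = R\cos\theta ,
\]

so once the constraint \(x^2 + y^2 + z^2 = R^2\) is imposed, only the two angles \(\theta\) and \(\phi\) remain as independent variables.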


See this video, especially the second half, to find out more about the concept and how it is related to physics: https://youtu.be/M_4AZCgT8To

Dimensionality reduction is a critical step in machine learning for simplifying datasets by reducing the number of features. This improves computational efficiency, reduces overfitting, and enhances model performance—especially when working with high-dimensional data.

Large feature sets introduce the curse of dimensionality, where data becomes sparse and learning becomes less effective. Dimensionality reduction addresses this by projecting data into a lower-dimensional space that preserves its key structure.

This is particularly important in the era of big data, where models must handle large, complex datasets efficiently.

Key Dimensionality Reduction Techniques You Should Know:

  • Principal Component Analysis (PCA) – projects data onto directions of maximum variance
  • Linear Discriminant Analysis (LDA) – maximizes class separation in labeled data
  • t-SNE and UMAP – non-linear methods that preserve local structure for high-quality visualization
  • Autoencoders – neural networks that learn compressed, non-linear feature representations
  • Feature Selection Methods:
    • Low Variance Filter
    • High Correlation Filter
    • Forward Feature Construction
    • Backward Feature Elimination

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that simplifies a dataset by projecting it into a lower-dimensional space while preserving as much variance (information) as possible. It reduces complexity, enhances visualization, and often improves the performance of machine learning models.

How PCA Works (Step-by-Step)

  1. Standardization
    Scale all features to have zero mean and unit variance to ensure equal contribution.

  2. Covariance Matrix Computation
    Understand how features vary together and identify potential correlations.

  3. Compute Eigenvectors and Eigenvalues
    Eigenvectors define the new axes (principal components); eigenvalues indicate the importance (variance captured) of each component.

  4. Sort and Select Components
    Rank components by eigenvalue and retain the top ones that explain the most variance.

  5. Form the Feature Vector
    Create a matrix from the selected eigenvectors to define the new feature space.

  6. Project the Data
    Transform the original data into the lower-dimensional space defined by the principal components.
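
The same steps can be carried out directly with NumPy. The following is a minimal illustration of the math above; in practice, scikit-learn's PCA (used in the example below) handles all of this for you.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Step 1: standardize the data
X = StandardScaler().fit_transform(load_wine().data)

# Step 2: covariance matrix of the standardized features
cov = np.cov(X, rowvar=False)

# Step 3: eigenvectors (new axes) and eigenvalues (variance captured)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort components by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: feature vector from the top 2 eigenvectors
W = eigvecs[:, :2]

# Step 6: project the data onto the principal components
X_proj = X @ W

print("Explained variance ratio of the first two components:", eigvals[:2] / eigvals.sum())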

Example:

Run the following code and observe how the PCA-transformed features are better separated than the original features.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (reduce to 2 components)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot original data using first two original features
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for label in np.unique(y):
    plt.scatter(X[y == label, 0], X[y == label, 1], label=target_names[label], alpha=0.6)
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.title("Original Features (first two)")
plt.legend()

# Plot PCA-transformed data
plt.subplot(1, 2, 2)
for label in np.unique(y):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1], label=target_names[label], alpha=0.6)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("PCA-transformed Features")
plt.legend()

plt.tight_layout()
plt.show()

Benefits of PCA

  • Reduces overfitting by eliminating redundant features
  • Speeds up training by lowering computational costs
  • Reveals hidden patterns in data
  • Enables 2D or 3D visualizations of high-dimensional datasets

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised machine learning technique used for classification, dimensionality reduction, and feature extraction. Its primary goal is to maximize class separability by projecting data onto a lower-dimensional space where classes are most distinguishable.

Unlike PCA, which focuses on capturing directions of maximum variance without considering class labels, LDA leverages label information to find projections that best separate the classes.

For example, in a dataset with two clearly defined groups, LDA identifies the linear boundary that best distinguishes these groups, enhancing classification performance.

How LDA Works (Step-by-Step)

  1. Compute class-wise mean vectors
  2. Calculate within-class and between-class scatter matrices
  3. Solve the generalized eigenvalue problem to find the linear discriminants (directions that maximize between-class variance relative to within-class variance)
  4. Select top discriminant components to form the new lower-dimensional feature space
  5. Project the data onto this new space for classification or visualization
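
In matrix form, the within-class and between-class scatter matrices are

\[
S_W = \sum_{c}\sum_{x_i \in c}(x_i - \mu_c)(x_i - \mu_c)^{T}, \qquad
S_B = \sum_{c} N_c\,(\mu_c - \mu)(\mu_c - \mu)^{T},
\]

where \(\mu_c\) and \(N_c\) are the mean and size of class \(c\) and \(\mu\) is the overall mean. The linear discriminants are the leading eigenvectors of \(S_W^{-1} S_B\).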

Scikit-learn provides a simple and efficient implementation of LDA through LinearDiscriminantAnalysis.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Load dataset
data = load_wine()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (unsupervised)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Apply LDA (supervised)
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

# Plotting
plt.figure(figsize=(12, 5))

# PCA Plot
plt.subplot(1, 2, 1)
for label in np.unique(y):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1], label=target_names[label], alpha=0.6)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("PCA (unsupervised)")
plt.legend()

# LDA Plot
plt.subplot(1, 2, 2)
for label in np.unique(y):
    plt.scatter(X_lda[y == label, 0], X_lda[y == label, 1], label=target_names[label], alpha=0.6)
plt.xlabel("LD 1")
plt.ylabel("LD 2")
plt.title("LDA (supervised)")
plt.legend()

plt.tight_layout()
plt.show()

t-SNE and UMAP

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear dimensionality reduction technique primarily used for visualizing high-dimensional data in 2D or 3D. It works by:

  • Converting pairwise similarities between data points into joint probabilities
  • Minimizing the divergence between these probabilities in the high-dimensional and low-dimensional spaces

t-SNE is especially effective at uncovering clusters and local structure, making it ideal for exploring patterns in data embeddings (e.g., word vectors, image features). However, it can be computationally intensive and does not preserve global structure well.
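
As a quick sketch, scikit-learn's TSNE can embed the Wine data used earlier into two dimensions; the perplexity value below is illustrative and usually needs tuning.

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load and standardize the Wine data
data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)

# Embed into 2D; perplexity roughly sets the size of the neighborhood t-SNE preserves
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=data.target, cmap="viridis", alpha=0.7)
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title("t-SNE embedding of the Wine dataset")
plt.show()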

Uniform Manifold Approximation and Projection (UMAP)

UMAP is a modern, non-linear technique that improves upon t-SNE in speed, scalability, and the ability to preserve both local and some global structure of the data. It is based on manifold learning and topological data analysis.

UMAP is highly efficient for large datasets and is now widely used for visualizing complex structures, such as in genomics, NLP, and image data.
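
The following is a minimal sketch, assuming the third-party umap-learn package is installed (pip install umap-learn); the parameter values are illustrative defaults.

import matplotlib.pyplot as plt
import umap  # provided by the umap-learn package
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load and standardize the Wine data
data = load_wine()
X_scaled = StandardScaler().fit_transform(data.data)

# n_neighbors balances local vs. global structure; min_dist controls how tightly points cluster
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

plt.scatter(X_umap[:, 0], X_umap[:, 1], c=data.target, cmap="viridis", alpha=0.7)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title("UMAP embedding of the Wine dataset")
plt.show()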

Autoencoders

Autoencoders are a class of neural networks used primarily for dimensionality reduction, feature extraction, and representation learning. They work by compressing input data into a lower-dimensional latent space and then reconstructing it as accurately as possible, enabling the model to learn essential patterns and discard noise or irrelevant details.

How Autoencoders Work

Autoencoders follow an encoder-decoder architecture:

  • Encoder: Compresses the input data into a lower-dimensional latent representation. It progressively reduces the feature space using multiple neural network layers.

  • Decoder: Reconstructs the original input from the compressed representation by gradually increasing dimensionality back to the original space.

This compression forces the network to focus on informative features, enabling effective dimensionality reduction.
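
As a minimal sketch (assuming TensorFlow/Keras is available), a small autoencoder can compress the 13-dimensional Wine data into a 2-dimensional latent space:

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Load and standardize the Wine data (13 features)
X = StandardScaler().fit_transform(load_wine().data)

# Encoder: 13 -> 8 -> 2 (latent); Decoder: 2 -> 8 -> 13
inputs = Input(shape=(X.shape[1],))
encoded = Dense(8, activation="relu")(inputs)
latent = Dense(2, activation="linear", name="latent")(encoded)
decoded = Dense(8, activation="relu")(latent)
outputs = Dense(X.shape[1], activation="linear")(decoded)

autoencoder = Model(inputs, outputs)   # full encoder-decoder network
encoder = Model(inputs, latent)        # encoder only, for extracting compressed features

# Train the network to reconstruct its own input
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=100, batch_size=16, verbose=0)

# 2-dimensional compressed representation of the data
X_latent = encoder.predict(X)
print(X_latent.shape)  # (178, 2)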

Variants of Autoencoders

  • Sparse Autoencoders: Introduce constraints to produce sparse latent representations, improving feature selection.
  • Denoising Autoencoders: Trained to remove noise from input data, widely used in image and signal processing.
  • Variational Autoencoders (VAEs): Model the latent space as a probability distribution, enabling generative modeling and the creation of new, similar data samples.

Applications of Autoencoders

  • Dimensionality Reduction: A non-linear alternative to PCA, capable of capturing complex relationships in the data
  • Image Denoising: Learns to reconstruct clean images from noisy inputs
  • Generative Modeling: VAEs can generate synthetic but realistic data for augmentation or simulation

Case Study: Dimensionality Reduction in Smart Cities (Automotus & Encord)

Automotus, a company focused on AI-driven smart traffic monitoring, processes large volumes of video data. To manage the scale and complexity:

  • They applied dimensionality reduction techniques (such as PCA and autoencoders) to extract key features (traffic flow, vehicle types, congestion zones) without analyzing every pixel.
  • Partnering with Encord, they used tools like Encord Annotate and Encord Active for intelligent data management and labeling.

Results:

  • 20% increase in model accuracy
  • 35% reduction in dataset size
  • >33% savings in labeling costs
  • Improved scalability and reduced infrastructure strain

Feature Selection Methods

Low Variance Filter

The Low Variance Filter is a simple yet effective dimensionality reduction technique used to remove features that show little to no variability across samples. Such features are unlikely to contribute meaningful signals to a machine learning model.

How it Works:

  1. Compute Variance: Measure the variance of each feature across the dataset. It's common to normalize the data first (e.g., to a [0, 1] range) so that variances are comparable across features; note that standardizing to unit variance would make every feature's variance equal to 1 and defeat the filter.
  2. Set a Threshold: Define a cutoff, typically a small value relative to the overall variance. Features below this threshold are considered uninformative.
  3. Drop Low-Variance Features: Remove features with variance lower than the threshold to reduce noise and dimensionality.

Real-World Applications:

  • Sensor Data: Filters out static or rarely-changing sensor readings to focus on meaningful fluctuations.
  • Image Processing: Removes pixel features dominated by background noise or uniform areas.
  • Text Data: Eliminates tokens (e.g., common stop words) that appear with similar frequency across all documents, helping improve model focus and accuracy.

Example:

import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Load dataset
data = load_wine()
X = data.data
y = data.target
feature_names = np.array(data.feature_names)

# Normalize features to [0, 1] so that variances are comparable across features
# (standardizing to unit variance would make every variance equal to 1 and
# defeat the purpose of a variance filter)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Apply low variance filter (the threshold is dataset-dependent; adjust as needed)
threshold = 0.03
selector = VarianceThreshold(threshold=threshold)
X_reduced = selector.fit_transform(X_scaled)

# Get selected and removed features
mask = selector.get_support()
kept_features = feature_names[mask]
removed_features = feature_names[~mask]

print("Features kept (variance >= {:.2f}):".format(threshold))
print(kept_features)

print("\nFeatures removed (variance < {:.2f}):".format(threshold))
print(removed_features)

High Correlation Filter

The High Correlation Filter is a feature selection technique used to reduce redundancy by removing features that are highly correlated with one another. Retaining only one representative from a group of correlated features helps streamline the dataset and improve model efficiency.

How it Works:

  1. Calculate Correlation Matrix: Use a correlation metric like Pearson (for continuous data) or Spearman (for ordinal data) to assess pairwise relationships between features.
  2. Set Correlation Threshold: Common thresholds range from 0.8 to 0.9, depending on the tolerance for multicollinearity and the model’s sensitivity to feature overlap.
  3. Remove Redundant Features: Identify highly correlated pairs and keep the most relevant feature from each pair, based on domain knowledge, predictive strength, or data completeness.

While the Low Variance Filter removes uninformative features, the High Correlation Filter targets redundancy. Together, they help refine datasets by keeping only what truly adds value to the model.

Example:

import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load and standardize the dataset
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Compute correlation matrix
corr_matrix = X_scaled.corr().abs()

# Upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Set correlation threshold
threshold = 0.9

# Find features with correlation greater than the threshold
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

# Drop them from the dataset
X_reduced = X_scaled.drop(columns=to_drop)

print("❌ Features removed due to high correlation (>{}):".format(threshold))
print(to_drop)

print("\n✅ Remaining features:")
print(X_reduced.columns.tolist())

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", square=True)
plt.title("Feature Correlation Matrix (Wine Dataset)")
plt.show()

Forward Feature Construction

Forward Feature Construction (FFC) is an iterative feature selection technique that builds a model step-by-step, starting from an empty set of predictors. At each stage, the most informative feature is added, aiming to maximize model performance while maintaining interpretability.

How it Works:

  1. Start with a Baseline Model: Begin with no features, just a simple model, to establish a performance baseline.
  2. Evaluate Candidates: Test each remaining feature individually by temporarily adding it to the model and measuring the performance gain.
  3. Select and Add Best Feature: Choose the feature that results in the highest improvement and permanently add it.
  4. Repeat Until Optimal: Continue the process until adding more features no longer improves performance meaningfully.

Python’s mlxtend library and R’s stepAIC function support efficient implementation of FFC.

Real-World Applications:

  • Healthcare Analytics: Helps identify the most relevant patient variables (e.g., lab results, demographics) for predicting disease progression or treatment response.
  • Customer Churn Modeling: Builds models that pinpoint the key behaviors or attributes most predictive of churn, improving retention strategies.
  • Manufacturing Quality Control: Selects process measurements that most accurately predict product quality, optimizing inspections and resource allocation.

Example:

import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Load dataset
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Standardize features
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Initialize
remaining_features = list(X_scaled.columns)
selected_features = []
best_score = 0
max_features = len(X_scaled.columns)  # Or set a custom cap

while remaining_features and len(selected_features) < max_features:
    scores = []
    for feature in remaining_features:
        current_features = selected_features + [feature]
        X_subset = X_scaled[current_features]
        clf = LogisticRegression(max_iter=1000, solver='liblinear')
        score = cross_val_score(clf, X_subset, y, cv=5).mean()
        scores.append((feature, score))

    # Pick the best new feature
    best_feature, best_feature_score = max(scores, key=lambda x: x[1])

    # If it improves performance, add it
    if best_feature_score > best_score:
        selected_features.append(best_feature)
        remaining_features.remove(best_feature)
        best_score = best_feature_score
        print(f"✅ Added: {best_feature} (CV Accuracy = {best_feature_score:.4f})")
    else:
        print(f"⛔ No improvement by adding more features.")
        break

print("\n🏁 Final selected features:")
print(selected_features)

Backward Feature Elimination

Backward Feature Elimination (BFE) is a top-down approach to feature selection that begins with all available features and iteratively removes the least useful ones. It is particularly effective in linear and logistic regression models where simplifying the feature set can improve interpretability and reduce overfitting.

How it Works:

  1. Start with All Features: Build a full model using every available feature to create a strong performance baseline.
  2. Identify the Least Important Feature: Use metrics such as p-values, feature importance scores, or contribution to error reduction to find the least impactful feature.
  3. Remove and Reevaluate: Eliminate the identified feature and re-train the model. Evaluate performance using cross-validation or a holdout set to ensure stability.
  4. Repeat Until Optimal: Continue removing features one at a time until further removal causes a drop in performance.

Real-World Applications:

  • Credit Scoring Models: Helps identify the most influential financial attributes (e.g., credit utilization, payment history) for assessing loan risk.
  • Energy Usage Forecasting: Refines predictive models by removing redundant or low-impact sensor readings.
  • Marketing Campaign Optimization: Identifies core engagement drivers (e.g., email open rate, past purchase history) to build lean, effective models.
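
Example:

A minimal sketch using scikit-learn's SequentialFeatureSelector with direction='backward' (available in scikit-learn 0.24 and later); here features are judged by cross-validated accuracy, and the target of 5 retained features is an illustrative choice.

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load and standardize the Wine dataset
data = load_wine()
X = pd.DataFrame(StandardScaler().fit_transform(data.data), columns=data.feature_names)
y = data.target

# Backward elimination: start from all 13 features and repeatedly drop the one
# whose removal hurts cross-validated accuracy the least, until 5 remain
clf = LogisticRegression(max_iter=1000, solver='liblinear')
selector = SequentialFeatureSelector(clf, n_features_to_select=5,
                                     direction='backward', cv=5)
selector.fit(X, y)

print("Selected features:")
print(list(X.columns[selector.get_support()]))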
