t-SNE: A Comprehensive Guide

Introduction

In the ever-expanding domain of machine learning and data analysis, uncovering the underlying structure of high-dimensional datasets is a daunting task. t-distributed Stochastic Neighbor Embedding (t-SNE) is a powerful dimensionality reduction technique that excels at visualizing complex data in lower-dimensional spaces while preserving local structure. In this guide, we will delve into the workings of t-SNE, exploring its applications, advantages, limitations, and computational complexity, and comparing it with other methods such as PCA. We will then implement t-SNE from scratch in Python, gaining insight into its inner workings along the way.

Historical Background of t-SNE

The roots of t-SNE trace back to the pioneering work of Geoffrey Hinton and Sam Roweis on Stochastic Neighbor Embedding (SNE) in 2002. SNE aimed to embed high-dimensional data into a lower-dimensional space while preserving pairwise similarities. However, SNE suffered from the crowding problem, in which moderately distant points in high-dimensional space are crushed together in the lower-dimensional embedding.

In 2008, Laurens van der Maaten and Geoffrey Hinton introduced t-SNE, an enhanced version of SNE that addressed the crowding problem. t-SNE employed a Student’s t-distribution to model similarities between data points in the lower-dimensional space, resulting in more faithful representations of local structure.

Mathematical Aspects of t-SNE

At its core, t-SNE operates by mapping high-dimensional data points to a lower-dimensional space while preserving local structure as much as possible. Unlike linear techniques such as PCA, which focus on preserving global variance, t-SNE prioritizes the preservation of local similarities between data points. The mathematical workings of t-SNE can be broken down into several key steps:

  1. Computing Pairwise Affinities:

    t-SNE begins by computing pairwise affinities between data points in the high-dimensional space. These affinities, modeled using a Gaussian kernel, capture the local similarities between points. The affinity of point j with respect to point i is defined as the conditional probability that i would pick j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at i (the corresponding formulas are written out after this list).

  2. Constructing the Similarity Matrix:

    The computed conditional affinities are then symmetrized and normalized to obtain a joint probability distribution over pairs of high-dimensional data points. This joint distribution represents the similarity structure of the data in the original space.

  3. Defining Similarities in the Lower-dimensional Space:

    Next, t-SNE defines a joint probability distribution over pairs of points in the lower-dimensional embedding. This distribution is computed using a Student's t-distribution with one degree of freedom, whose heavier tails (compared to a Gaussian) allow moderately distant points to be placed farther apart in the map, which alleviates the crowding problem.

  4. Optimizing the Embedding:

    t-SNE employs gradient descent to optimize the embedding in the lower-dimensional space. The optimization process aims to minimize the mismatch between the joint probabilities by adjusting the positions of points in the embedding space. This is achieved by iteratively updating the positions of points based on the gradient of the Kullback-Leibler divergence between the two distributions.

  5. Early Exaggeration:

    In the initial stages of optimization, t-SNE employs early exaggeration to magnify the differences between the clusters in the lower-dimensional space. This helps in better separating the clusters and speeding up the convergence of the optimization algorithm.

By iteratively optimizing the embedding based on the mismatch between the joint probabilities in the high-dimensional and lower-dimensional spaces, t-SNE effectively preserves the local structure of the data in the visualization.
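
To make these steps concrete, the key quantities from the original paper (van der Maaten and Hinton, 2008) can be written out explicitly; in LaTeX notation, with x_i denoting high-dimensional points and y_i their low-dimensional counterparts:

p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad \frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}

Here each bandwidth \sigma_i is chosen (by binary search) so that the conditional distribution p_{\cdot|i} has a user-specified perplexity, which acts as a smooth measure of the effective number of neighbors.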

Applications of t-SNE

The versatility of t-SNE makes it useful across many domains, including:

  • Data visualization: t-SNE is widely used to visualize high-dimensional data in two or three dimensions, enabling researchers and analysts to gain intuitive insights into the underlying structure of the data.

  • Clustering analysis: By revealing inherent clusters in the data, t-SNE facilitates clustering analysis, aiding in tasks such as customer segmentation and anomaly detection.

  • Feature engineering: t-SNE can be employed for feature engineering by reducing the dimensionality of the input data while preserving essential features, thereby enhancing the performance of machine learning models.

  • Natural Language Processing: In NLP tasks such as word embedding visualization, t-SNE is instrumental in representing high-dimensional word vectors in a visually interpretable manner.

Advantages of t-SNE

  • Preservation of Local Structure: t-SNE excels at preserving local similarities between data points, making it highly effective for visualizing complex datasets with intricate patterns.

  • Non-linearity: Unlike linear techniques like PCA, t-SNE preserves non-linear relationships in the data, providing more faithful representations in lower-dimensional space.

  • Intuitive visualization: The visualizations produced by t-SNE are intuitive and visually appealing, allowing for easy interpretation and analysis of high-dimensional data.

Limitations and Considerations

  • Computational Complexity: t-SNE can be computationally expensive, especially for large datasets, due to its quadratic time complexity. Approximate methods such as Barnes-Hut t-SNE can mitigate this issue.

  • Sensitivity to Parameters: t-SNE results can vary based on parameters such as perplexity, learning rate, and the number of iterations. Careful parameter tuning is essential to obtain meaningful visualizations.

  • Crowding Problem: In some cases, t-SNE may suffer from the crowding problem, where distant points in high-dimensional space are compressed into nearby points in the lower-dimensional embedding.
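
To illustrate the parameter sensitivity and the Barnes-Hut approximation mentioned above, here is a small sketch using scikit-learn's off-the-shelf TSNE (not the from-scratch version built later in this guide); the digits dataset and the particular perplexity values are arbitrary choices for demonstration:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# The same data can look quite different under different perplexities.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perplexity in zip(axes, [5, 30, 100]):
    # method='barnes_hut' (the default) runs in roughly O(N log N) instead of O(N^2)
    emb = TSNE(n_components=2, perplexity=perplexity,
               method='barnes_hut', random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='tab10', s=5)
    ax.set_title(f'perplexity = {perplexity}')
plt.show()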

Comparing t-SNE with PCA

While both t-SNE and PCA are popular dimensionality reduction techniques, they differ in their approach and applications:

  • Linear vs. Non-Linear: PCA is a linear technique that focuses on preserving global variance, whereas t-SNE is non-linear and emphasizes the preservation of local structure.

  • Visualization Quality: t-SNE typically produces more visually appealing visualizations with better separation between clusters compared to PCA, making it preferable for visualization tasks.

  • Computational Complexity: PCA is computationally more efficient than t-SNE, especially for large datasets, due to its linear nature and lower time complexity.
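
As a quick illustration of these differences, the sketch below projects the same dataset with both methods, again leaning on scikit-learn purely for brevity; the dataset choice is arbitrary:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Project to 2D with a linear method (PCA) and a non-linear one (t-SNE)
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=5)
ax1.set_title('PCA (global variance preserved)')
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=5)
ax2.set_title('t-SNE (local neighborhoods preserved)')
plt.show()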

Implementing t-SNE from Scratch in Python

Now, let's embark on the exciting journey of implementing t-SNE from scratch in Python. We'll break down the process into four key steps:

  1. Compute Pairwise Similarities

  2. Compute Perplexity and Entropy

  3. Gradient Descent Optimization

  4. Visualize the Results

We'll dive into the Python code for each step in the subsequent sections.

Step 1: Compute Pairwise Similarities

To start, we need to compute the pairwise similarities between data points in the high-dimensional space. We can use the Gaussian kernel to compute these similarities.

import numpy as np

def compute_pairwise_distances(X):
    '''Compute the matrix of squared Euclidean distances between all rows of X.'''
    sum_X = np.sum(np.square(X), axis=1)
    D = np.add(np.add(-2 * np.dot(X, X.T), sum_X).T, sum_X)
    return D

def compute_pairwise_affinities(X, perplexity=30, tol=1e-5):
    '''Compute conditional affinities p_{j|i}, tuning each Gaussian's precision
    (beta = 1 / (2 * sigma^2)) by binary search so every row matches the desired perplexity.'''
    n = X.shape[0]
    D = compute_pairwise_distances(X)
    P = np.zeros((n, n))
    beta = np.ones(n)
    log_perp = np.log(perplexity)

    for i in range(n):
        idx = np.concatenate((np.r_[0:i], np.r_[i+1:n]))
        Di = D[i, idx]
        betamin, betamax = -np.inf, np.inf
        H, thisP = Hbeta(Di, beta[i])

        # Binary search for the precision whose entropy matches log(perplexity)
        tries = 0
        while np.abs(H - log_perp) > tol and tries < 50:
            if H > log_perp:
                betamin = beta[i]
                beta[i] = beta[i] * 2. if betamax == np.inf else (beta[i] + betamax) / 2.
            else:
                betamax = beta[i]
                beta[i] = beta[i] / 2. if betamin == -np.inf else (beta[i] + betamin) / 2.
            H, thisP = Hbeta(Di, beta[i])
            tries += 1

        P[i, idx] = thisP

    return P

Step 2: Compute Perplexity and Entropy

The perplexity of a probability distribution is the exponential of its Shannon entropy and can be read as the effective number of neighbors each point considers. In compute_pairwise_affinities above, a binary search over the Gaussian precision beta finds, for each point, the bandwidth whose conditional distribution matches the desired perplexity. The helper function Hbeta below computes the Shannon entropy H and the corresponding row of conditional probabilities P for a given precision.

def Hbeta(D, beta):
    '''Compute the Shannon entropy H and conditional probabilities P of one row,
    given squared distances D and a Gaussian precision beta = 1 / (2 * sigma^2).'''
    P = np.exp(-D * beta)
    sumP = np.sum(P)
    H = np.log(sumP) + beta * np.sum(D * P) / sumP
    P = P / sumP

    return H, P
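
As a quick sanity check of the two functions above, the row-wise perplexity of the resulting affinity matrix should land near the target value; X_demo here is synthetic data invented purely for the example:

X_demo = np.random.randn(100, 10)
P = compute_pairwise_affinities(X_demo, perplexity=30)

# Perplexity is exp(entropy) when the entropy is measured in nats
H_rows = -np.sum(np.where(P > 0, P * np.log(P), 0), axis=1)
print(np.exp(H_rows[:5]))   # each value should be close to 30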

Step 3: Gradient Descent Optimization

We use gradient descent to adjust the lower-dimensional embedding so that its pairwise affinities match those of the high-dimensional data as closely as possible, minimizing the KL divergence between the two distributions.

def t_sne(X, num_dims=2, perplexity=30, learning_rate=200, num_iters=1000, momentum=0.8):
    '''t-SNE algorithm.'''
    n = X.shape[0]

    # Initialize the low-dimensional embedding randomly
    Y = np.random.randn(n, num_dims)
    dY = np.zeros_like(Y)
    iY = np.zeros_like(Y)
    gains = np.ones_like(Y)

    # Compute symmetrized, normalized pairwise affinities
    P = compute_pairwise_affinities(X, perplexity)
    P = P + np.transpose(P)
    P = P / np.sum(P)
    P = P * 4.  # early exaggeration
    P = np.maximum(P, 1e-12)

    for iteration in range(num_iters):
        # Low-dimensional affinities under the Student's t-distribution (one degree of freedom)
        sum_Y = np.sum(np.square(Y), axis=1)
        num = -2. * np.dot(Y, Y.T)
        num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
        num[range(n), range(n)] = 0.
        Q = num / np.sum(num)
        Q = np.maximum(Q, 1e-12)

        # Gradient of the KL divergence with respect to each embedded point
        PQ_diff = P - Q
        for i in range(n):
            dY[i, :] = np.sum(np.tile(PQ_diff[:, i] * num[:, i], (num_dims, 1)).T * (Y[i, :] - Y), axis=0)

        # Adaptive gains: grow a gain when the gradient and the previous update have opposite signs, decay it otherwise
        gains = (gains + 0.2) * ((dY > 0.) != (iY > 0.)) + (gains * 0.8) * ((dY > 0.) == (iY > 0.))
        gains[gains < 0.01] = 0.01

        # Gradient descent update with momentum, then re-center the embedding
        iY = momentum * iY - learning_rate * (gains * dY)
        Y = Y + iY
        Y = Y - np.mean(Y, axis=0)

        # Switch off early exaggeration after the first 100 iterations
        if iteration == 100:
            P = P / 4.

        # Report the cost (KL divergence) every 100 iterations
        if (iteration + 1) % 100 == 0:
            C = np.sum(P * np.log(P / Q))
            print('Iteration %d: error is %f' % (iteration + 1, C))

    return Y
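
One practical note: the reference implementation accompanying the original paper first reduces very high-dimensional inputs to around 50 dimensions with PCA before the main loop, which speeds up the distance computations and suppresses noise. Below is a minimal sketch of that optional preprocessing step (pca_preprocess is a hypothetical helper name, not part of the algorithm above):

def pca_preprocess(X, n_components=50):
    '''Optionally reduce X to n_components dimensions with PCA before running t-SNE.'''
    X = X - np.mean(X, axis=0)
    # Eigen-decomposition of the covariance matrix; keep the top components
    eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
    order = np.argsort(eigvals)[::-1][:min(n_components, X.shape[1])]
    return np.dot(X, eigvecs[:, order])

For a small dataset like iris this step is unnecessary, but for data such as images or word vectors it could be applied as X = pca_preprocess(X) before calling t_sne(X).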

Step 4: Visualize the Results

Finally, we can visualize the lower-dimensional embeddings using matplotlib.

import matplotlib.pyplot as plt

def plot_tsne(X, y):
    '''Plot t-SNE visualization'''
    X_tsne = t_sne(X)
    plt.figure(figsize=(10, 8))
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
    plt.title('t-SNE Visualization')
    plt.xlabel('t-SNE Component 1')
    plt.ylabel('t-SNE Component 2')
    plt.colorbar()
    plt.show()

# Example usage
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
plot_tsne(X, y)

Why Choose t-SNE Over PCA?

To summarize the comparison above, t-SNE offers several advantages over PCA for visualization-focused work:

  • Preservation of Local Structure: t-SNE excels at preserving local similarities between data points, making it ideal for visualizing complex datasets with intricate patterns.

  • Non-linearity: Unlike PCA, which is linear, t-SNE preserves non-linear relationships in the data, providing more faithful representations in lower-dimensional space.

  • Visualization Quality: t-SNE typically produces more visually appealing visualizations with better separation between clusters compared to PCA, making it preferable for visualization tasks.

Other Applications and Indirect Uses of t-SNE

Beyond its primary applications in dimensionality reduction and data visualization, t-SNE finds applications in various domains and indirect uses:

  • Gene Expression Analysis: t-SNE is employed in single-cell RNA sequencing analysis to visualize and cluster cells based on gene expression profiles.

  • Image Processing: In image processing tasks such as image retrieval and image segmentation, t-SNE can be used to reduce the dimensionality of feature vectors extracted from images.

  • Reinforcement Learning: In reinforcement learning, t-SNE can be utilized to visualize high-dimensional state or action spaces, aiding in the analysis and debugging of reinforcement learning algorithms.

Conclusions

t-SNE stands as a testament to the power of non-linear dimensionality reduction techniques in unraveling the mysteries hidden within high-dimensional datasets. From its robustness in preserving local structure to its versatility across various domains, t-SNE continues to be a cornerstone in the arsenal of machine learning and data analysis practitioners. By implementing t-SNE from scratch in Python, we've embarked on a journey of discovery, gaining invaluable insights into its inner workings and unleashing its potential for unlocking hidden patterns in data. As we navigate through the ever-evolving landscape of data science, let us embrace the elegance and efficacy of t-SNE in illuminating the path to deeper understanding and discovery.

