t-SNE: A Comprehensive Guide
Introduction
In the ever-expanding domain of machine learning and data analysis, uncovering the underlying structure of high-dimensional datasets is a daunting task. t-distributed Stochastic Neighbor Embedding (t-SNE) is a powerful dimensionality reduction technique that excels at visualizing complex data in lower-dimensional spaces while preserving local structure. In this guide, we will delve into the workings of t-SNE, exploring its applications, advantages, limitations, and computational complexity, and comparing it with other methods such as PCA. We will then implement t-SNE from scratch in Python, gaining insight into its inner workings along the way.
Historical Background of t-SNE
The roots of t-SNE trace back to the pioneering work of Geoffrey Hinton and Sam Roweis on Stochastic Neighbor Embedding (SNE) in 2002. SNE aimed to embed high-dimensional data into a lower-dimensional space while preserving pairwise similarities. However, SNE suffered from the crowding problem, where distant points in high-dimensional space were compressed together in the lower-dimensional embedding.
In 2008, Laurens van der Maaten and Geoffrey Hinton introduced t-SNE, an enhanced version of SNE that addressed the crowding problem. t-SNE employed a Student’s t-distribution to model similarities between data points in the lower-dimensional space, resulting in more faithful representations of local structure.
Mathematical Aspects of t-SNE
At its core, t-SNE operates by mapping high-dimensional data points to a lower-dimensional space while preserving local structure as much as possible. Unlike linear techniques such as PCA, which focus on preserving global variance, t-SNE prioritizes the preservation of local similarities between data points. The mathematical workings of t-SNE can be broken down into several key steps:
Computing Pairwise Affinities:
t-SNE begins by computing pairwise affinities between data points in the high-dimensional space. These affinities, modeled with a Gaussian kernel, capture the local similarities between points. The affinity of point x_j to point x_i is defined as the conditional probability that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i.
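In the notation of the original paper, with sigma_i denoting the bandwidth of the Gaussian centered at x_i (chosen per point so that every point has a comparable effective number of neighbors, controlled by the perplexity parameter discussed later), this conditional probability is

$$p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{i \mid i} = 0.$$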
Constructing the Similarity Matrix:
The conditional affinities are then symmetrized and normalized to obtain a joint probability distribution over pairs of high-dimensional data points. This joint distribution represents the similarity structure of the data in the high-dimensional space.
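For a dataset of n points, the joint distribution is obtained from the conditional probabilities as

$$p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n},$$

which ensures that every data point makes a non-negligible contribution to the cost function.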
Defining Similarities in the Lower-dimensional Space:
Next, t-SNE defines a similar joint probability distribution over pairs of points in the lower-dimensional embedding. This distribution is computed using Student’s t-distribution, which has heavier tails compared to the Gaussian distribution, making it more robust to outliers and preserving more of the local structure.
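Using a Student's t-distribution with one degree of freedom (a Cauchy distribution), the similarity between embedded points y_i and y_j is

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}.$$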
Optimizing the Embedding:
t-SNE employs gradient descent to optimize the embedding in the lower-dimensional space. The optimization process aims to minimize the mismatch between the joint probabilities by adjusting the positions of points in the embedding space. This is achieved by iteratively updating the positions of points based on the gradient of the Kullback-Leibler divergence between the two distributions.
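The cost being minimized is the Kullback-Leibler divergence between the two joint distributions, and its gradient with respect to each embedded point has a simple closed form:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad \frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}.$$

This gradient is what the from-scratch implementation below computes, with the constant factor absorbed into the learning rate.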
Early Exaggeration:
In the initial stages of optimization, t-SNE employs early exaggeration to magnify the differences between the clusters in the lower-dimensional space. This helps in better separating the clusters and speeding up the convergence of the optimization algorithm.
By iteratively optimizing the embedding based on the mismatch between the joint probabilities in the high-dimensional and lower-dimensional spaces, t-SNE effectively preserves the local structure of the data in the visualization.
Applications of t-SNE
The versatility of t-SNE extends across various domains, including:
Data visualization: t-SNE is widely used to visualize high-dimensional data in two or three dimensions, enabling researchers and analysts to gain intuitive insights into the underlying structure of the data.
Clustering analysis: By revealing inherent clusters in the data, t-SNE facilitates clustering analysis, aiding in tasks such as customer segmentation and anomaly detection.
Feature engineering: t-SNE can be employed for feature engineering by reducing the dimensionality of the input data while preserving essential features, thereby enhancing the performance of machine learning models.
Natural Language Processing: In NLP tasks such as word embedding visualization, t-SNE is instrumental in representing high-dimensional word vectors in a visually interpretable manner.
Advantages of t-SNE
Preservation of Local Structure: t-SNE excels at preserving local similarities between data points, making it highly effective for visualizing complex datasets with intricate patterns.
Non-linearity: Unlike linear techniques like PCA, t-SNE preserves non-linear relationships in the data, providing more faithful representations in lower-dimensional space.
Intuitive visualization: The visualizations produced by t-SNE are intuitive and visually appealing, allowing for easy interpretation and analysis of high-dimensional data.
Limitations and Considerations
Computational Complexity: t-SNE can be computationally expensive, especially for large datasets, due to its quadratic time complexity. Approximate methods such as Barnes-Hut t-SNE can mitigate this issue.
Sensitive to Parameters: t-SNE results can vary based on parameters such as perplexity, learning rate, and number of iterations. Careful parameter tuning is essential to obtain meaningful visualizations; the short scikit-learn sketch after this list shows the effect of varying perplexity.
Crowding Problem: In some cases, t-SNE may suffer from the crowding problem, where distant points in high-dimensional space are compressed into nearby points in the lower-dimensional embedding.
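Because the embedding can change noticeably with these settings, it helps to sweep a few perplexity values before trusting any single plot. Below is a minimal sketch using scikit-learn's TSNE (this assumes scikit-learn and matplotlib are installed; the dataset and parameter values are purely illustrative):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)

# Compare embeddings at a few perplexity values; the right choice depends on dataset size and structure
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perp in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='tab10', s=5)
    ax.set_title('perplexity = %d' % perp)
plt.show()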
Comparing t-SNE with PCA
While both t-SNE and PCA are popular dimensionality reduction techniques, they differ in their approach and applications, as the short code sketch after this list illustrates:
Linear vs. Non-Linear: PCA is a linear technique that focuses on preserving global variance, whereas t-SNE is non-linear and emphasizes the preservation of local structure.
Visualization Quality: t-SNE typically produces more visually appealing visualizations with better separation between clusters compared to PCA, making it preferable for visualization tasks.
Computational Complexity: PCA is computationally more efficient than t-SNE, especially for large datasets, due to its linear nature and lower time complexity.
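To make the comparison concrete, the two projections can be computed side by side with scikit-learn (a short sketch under the same assumptions as above, using sklearn.decomposition.PCA and sklearn.manifold.TSNE):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Linear projection onto the two directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)
# Non-linear embedding that preserves local neighborhoods
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

Plotting both typically shows heavily overlapping classes under PCA and well-separated clusters under t-SNE, at a noticeably higher computational cost for t-SNE.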
Implementing t-SNE from Scratch in Python
Now, let's embark on the exciting journey of implementing t-SNE from scratch in Python. We'll break down the process into four key steps:
Compute Pairwise Similarities
Compute Perplexity and Entropy
Gradient Descent Optimization
Visualize the Results
We'll dive into the Python code for each step in the subsequent sections.
Step 1: Compute Pairwise Similarities
To start, we need to compute the pairwise similarities between data points in the high-dimensional space. We use a Gaussian kernel whose precision (beta = 1/(2*sigma^2)) is tuned separately for each point by binary search, so that the entropy of each point's neighbor distribution matches the desired perplexity. The helper Hbeta used below is defined in Step 2.
import numpy as np
def compute_pairwise_distances(X):
    '''Compute the matrix of pairwise squared Euclidean distances.'''
    sum_X = np.sum(np.square(X), axis=1)
    D = np.add(np.add(-2 * np.dot(X, X.T), sum_X).T, sum_X)
    return np.maximum(D, 0.)  # clip tiny negative values caused by floating-point round-off
def compute_pairwise_affinities(X, perplexity=30, tol=1e-5):
    '''Compute pairwise affinities, tuning each Gaussian's precision to match the target perplexity.'''
    n = X.shape[0]
    D = compute_pairwise_distances(X)
    P = np.zeros((n, n))
    logU = np.log(perplexity)  # target entropy
    for i in range(n):
        idx = np.concatenate((np.r_[0:i], np.r_[i + 1:n]))  # all points except i
        beta, betamin, betamax = 1.0, -np.inf, np.inf
        H, thisP = Hbeta(D[i, idx], beta)
        for _ in range(50):  # binary search on beta until the entropy matches log(perplexity)
            if np.abs(H - logU) < tol:
                break
            if H > logU:
                betamin = beta
                beta = beta * 2. if betamax == np.inf else (beta + betamax) / 2.
            else:
                betamax = beta
                beta = beta / 2. if betamin == -np.inf else (beta + betamin) / 2.
            H, thisP = Hbeta(D[i, idx], beta)
        P[i, idx] = thisP
    return P
Step 2: Compute Perplexity and Entropy
The perplexity of a probability distribution measures how well it predicts a sample; intuitively, it is a smooth measure of the effective number of neighbors. The binary search in Step 1 adjusts the precision beta of each point's Gaussian until the Shannon entropy of the resulting neighbor distribution equals the logarithm of the desired perplexity. The helper below computes that entropy H and the corresponding probabilities P for a given beta.
def Hbeta(D, beta):
    '''Compute the Shannon entropy H and probability row P for a Gaussian with precision beta.'''
    P = np.exp(-D * beta)
    sumP = np.sum(P)
    H = np.log(sumP) + beta * np.sum(D * P) / sumP
    P = P / sumP
    return H, P
Step 3: Gradient Descent Optimization
We use gradient descent to optimize the lower-dimensional embedding, minimizing the Kullback-Leibler divergence between the pairwise affinities in the high-dimensional space and those in the embedding, with momentum and adaptive gains to speed up convergence.
def t_sne(X, num_dims=2, perplexity=30, learning_rate=200, num_iters=1000, momentum=0.8):
    '''t-SNE algorithm.'''
    n = X.shape[0]
    # Initialize the embedding Y randomly
    Y = np.random.randn(n, num_dims)
    dY = np.zeros_like(Y)      # gradient
    iY = np.zeros_like(Y)      # previous update (momentum term)
    gains = np.ones_like(Y)    # adaptive per-parameter gains
    # Compute pairwise affinities in the high-dimensional space
    P = compute_pairwise_affinities(X, perplexity)
    P = P + np.transpose(P)    # symmetrize
    P = P / np.sum(P)
    P = P * 4.                 # early exaggeration
    P = np.maximum(P, 1e-12)
    for iter in range(num_iters):
        # Stop exaggerating after the first 100 iterations
        if iter == 100:
            P = P / 4.
        # Compute pairwise affinities Q with the current embedding (Student's t kernel)
        sum_Y = np.sum(np.square(Y), axis=1)
        num = -2. * np.dot(Y, Y.T)
        num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
        num[range(n), range(n)] = 0.
        Q = num / np.sum(num)
        Q = np.maximum(Q, 1e-12)
        # Compute the gradient of the KL divergence
        PQ_diff = P - Q
        for i in range(n):
            dY[i, :] = np.sum(np.tile(PQ_diff[:, i] * num[:, i], (num_dims, 1)).T * (Y[i, :] - Y), axis=0)
        # Update gains: increase where the gradient changed sign, decrease otherwise
        gains = (gains + 0.2) * ((dY > 0.) != (iY > 0.)) + (gains * 0.8) * ((dY > 0.) == (iY > 0.))
        gains[gains < 0.01] = 0.01
        # Gradient descent update with momentum
        iY = momentum * iY - learning_rate * (gains * dY)
        Y = Y + iY
        Y = Y - np.mean(Y, axis=0)  # re-center the embedding
        # Report the cost (KL divergence) every 100 iterations
        if (iter + 1) % 100 == 0:
            C = np.sum(P * np.log(P / Q))
            print('Iteration %d: error is %f' % (iter + 1, C))
    return Y
Step 4: Visualize the Results
Finally, we can visualize the lower-dimensional embeddings using matplotlib.
import matplotlib.pyplot as plt
def plot_tsne(X, y):
    '''Plot the t-SNE visualization, colored by class label.'''
    X_tsne = t_sne(X)
    plt.figure(figsize=(10, 8))
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
    plt.title('t-SNE Visualization')
    plt.xlabel('t-SNE Component 1')
    plt.ylabel('t-SNE Component 2')
    plt.colorbar()
    plt.show()
# Example usage
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
plot_tsne(X, y)
Other Applications and Indirect Uses of t-SNE
Beyond its primary applications in dimensionality reduction and data visualization, t-SNE finds applications in various domains and indirect uses:
Gene Expression Analysis: t-SNE is employed in single-cell RNA sequencing analysis to visualize and cluster cells based on gene expression profiles.
Image Processing: In image processing tasks such as image retrieval and image segmentation, t-SNE can be used to reduce the dimensionality of feature vectors extracted from images.
Reinforcement Learning: In reinforcement learning, t-SNE can be utilized to visualize high-dimensional state or action spaces, aiding in the analysis and debugging of reinforcement learning algorithms.
Conclusions
t-SNE stands as a testament to the power of non-linear dimensionality reduction in revealing the structure hidden within high-dimensional datasets. From its strength in preserving local structure to its versatility across domains, t-SNE remains a cornerstone in the toolkit of machine learning and data analysis practitioners. By implementing t-SNE from scratch in Python, we have gained a concrete understanding of its inner workings, from perplexity-calibrated affinities to gradient descent on the KL divergence. Keep its parameter sensitivity and computational cost in mind, and it will serve you well in exploring and visualizing complex data.