From "King - Man + Woman = Queen" to AI Power: The Dot Product Explained

Shashank Rajak
9 min read

We all know that \(2 + 3 - 1 = 4\), but what if I told you that \(King - man + woman = Queen\)? You might wonder how we can perform algebra on words, yet this is possible in the world of machine learning, and by the end of this article you will be able to crack this sum.

In the exciting world of machine learning and artificial intelligence, we deal with a number of sophisticated, complex algorithms. However, at the heart of many of these powerful techniques lies a surprisingly simple yet incredibly versatile mathematical operation: the dot product. Many of us who studied maths or physics in high school are familiar with the idea of the dot product, but we might not have realised how useful this simple tool is in the development of AI. In fact, linear algebra, which covers vectors and matrices, plays a crucial role in AI, and I have already talked about this in one of my previous articles, From Vectors to Predictions: Linear Algebra in AI & ML. In this article we will delve into the dot product's mathematical elegance, its geometric intuition, and its crucial role in shaping modern AI, with the help of some real-life examples.


Basics of the Dot Product: Mathematics and Geometry

At its core, the dot product is an operation that takes two equal-length vectors and returns a single number. For two vectors, \(\vec{a}=[a_1,a_2,...,a_n​]\) and \(\vec{b}=[b_1,b_2,...,b_n​]\), their dot product is calculated as:

$$\mathbf{a} \cdot \mathbf{b} = a_1b_1 + a_2b_2 + ... + a_nb_n = \sum_{i=1}^{n} a_i b_i$$

This simple summation of the products of corresponding elements of the two vectors \(\vec{a}\) and \(\vec{b}\) might seem abstract, but it carries profound geometric meaning offering insights into the relationship between the vectors in space.
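To make this concrete, here is a minimal sketch in Python (I am using NumPy purely for illustration; the library choice is my own, not part of the maths) showing that the element-wise sum and a library dot product agree:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -5.0, 6.0])

# Sum of products of corresponding elements, exactly as in the formula above
manual = sum(a_i * b_i for a_i, b_i in zip(a, b))

# The same thing via NumPy's built-in dot product
via_numpy = np.dot(a, b)

print(manual, via_numpy)  # both give 12.0  (4 - 10 + 18)
```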

Geometrically, the dot product is intimately related to the angle between the two vectors and their magnitudes (lengths). The relationship is given by:

$$\mathbf{a} \cdot \mathbf{b} = ||\mathbf{a}|| \ ||\mathbf{b}|| \cos(\theta)$$

Here:

  • \(||\mathbf{a}||\) and \(||\mathbf{b}||\) represent the magnitudes (Euclidean norms) of vectors \(\vec{a}\) and \(\vec{b}\) respectively.

  • \(\theta\) is the angle between the two vectors, ranging from \(0^\circ\) to \(180^\circ\).

This equation reveals some key insights about these vectors (refer to the image above); a quick numeric check of the directional cases follows the list below:

  • Projection: The term \(||\mathbf{b}|| \cos(\theta)\) represents the length of the projection of vector \(\vec{b}\) onto the direction of vector \(\vec{a}\). Thus, the dot product can be seen as the magnitude of \(\vec{a}\) multiplied by the length of the projection of \(\vec{b}\) onto \(\vec{a}\) (or vice versa). This highlights how the dot product captures the extent to which one vector "goes in the direction" of the other.

  • Orthogonality: If two vectors are orthogonal (perpendicular), the angle between them is \(90^\circ\), and \(\cos(90^\circ) = 0\). Therefore, the dot product of two orthogonal non-zero vectors is always zero, regardless of their magnitudes. This property is fundamental in various machine learning techniques for identifying uncorrelated features or basis vectors.

  • Directional Relationship: The sign and magnitude of the dot product provide information about the relative direction of the two vectors:

    • If \(\mathbf{a} \cdot \mathbf{b} > 0\), then \(\cos(\theta) > 0\), implying that the angle \(\theta\) is acute \((0^\circ \le \theta < 90^\circ)\). The vectors generally point in the same direction.

    • If \(\mathbf{a} \cdot \mathbf{b} < 0\), then \(\cos(\theta) < 0\), implying that the angle \(\theta\) is obtuse \((90^\circ < \theta \le 180^\circ)\). The vectors generally point in opposite directions.

    • If \(\mathbf{a} \cdot \mathbf{b} = 0\), then \(\cos(\theta) = 0\), implying that the angle \(\theta\) is \(90^\circ\) or one of the vectors is a zero vector. The vectors are orthogonal.
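Here is a quick numeric check of these three cases (again a Python/NumPy sketch of my own; the vectors are made up for illustration):

```python
import numpy as np

def angle_deg(a, b):
    """Recover the angle between two non-zero vectors from the dot product."""
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # clip guards against tiny floating-point overshoot outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

a = np.array([1.0, 0.0])

print(np.dot(a, [1.0, 1.0]),  angle_deg(a, [1.0, 1.0]))   #  1.0,  45 degrees -> acute
print(np.dot(a, [-1.0, 1.0]), angle_deg(a, [-1.0, 1.0]))  # -1.0, 135 degrees -> obtuse
print(np.dot(a, [0.0, 2.0]),  angle_deg(a, [0.0, 2.0]))   #  0.0,  90 degrees -> orthogonal
```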

Cosine Similarity: Quantifying Similarity

Building upon the geometric interpretation of the dot product, we can derive a powerful metric for measuring the similarity between two vectors: cosine similarity. As we saw earlier, the dot product is influenced by both the angle between the vectors and their magnitudes. However, in many machine learning scenarios, we are primarily interested in the direction or orientation of the vectors, rather than their lengths.

Cosine similarity elegantly isolates this directional relationship by normalizing the dot product by the magnitudes (or norms) of the two vectors:

$$\text{Cosine Similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \ ||\mathbf{b}||} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}$$

Cosine similarity essentially measures the cosine of the angle, \(\cos(\theta)\), between two non-zero vectors. It provides a value between -1 and 1, where:

  • 1 indicates that the vectors point in the exact same direction (see vectors a and b in the figure below).

  • -1 indicates that the vectors point in exactly opposite directions (see vectors a and c in the figure below).

  • 0 indicates that the vectors are orthogonal (see vectors a and d in the figure below).

In essence, cosine similarity provides a measure of the angle between two vectors, abstracting away the effect of their scale. This makes it a robust metric for comparing items or entities represented by vectors in high-dimensional spaces, which is common in many machine learning applications.
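As a small sketch (NumPy again, and the helper name cosine_similarity is simply my own), the metric is nothing more than the dot product divided by the two norms:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 2], [2, 4]))    #  1.0 -> same direction (scale ignored)
print(cosine_similarity([1, 2], [-1, -2]))  # -1.0 -> opposite directions
print(cosine_similarity([1, 0], [0, 3]))    #  0.0 -> orthogonal
```

Note how [1, 2] and [2, 4] score a perfect 1.0 even though their lengths differ; cosine similarity cares only about direction.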

The Dot Product in the Machine Learning and AI Toolkit

Now, let's explore how this seemingly simple operation powers various aspects of machine learning and artificial intelligence:

1. Neural Networks: The Building Blocks of Deep Learning

At the heart of artificial neural networks lie neurons, which perform a weighted sum of their inputs. This weighted sum is precisely a dot product!

Consider a neuron with input vector \(\vec{x}=[x_1,x_2,...,x_n​]\) and a weight vector \(\vec{w}=[w_1,w_2,...,w_n​]\). The raw input to the neuron's activation function is calculated as:

$$z = \mathbf{w} \cdot \mathbf{x} = w_1x_1 + w_2x_2 + ... + w_nx_n$$

This dot product combines the input features with their learned importance (weights). The result, z, is then passed through a non-linear activation function to introduce non-linearity into the network, enabling it to learn complex patterns.

In essence, the dot product allows the neuron to assess the alignment between the input pattern and the learned weight pattern. A high dot product indicates a strong activation.
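A minimal sketch of a single neuron's forward pass, assuming a sigmoid activation and a bias term (both the activation choice and the numbers are my own illustrative assumptions):

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of inputs (a dot product) followed by a sigmoid activation."""
    z = np.dot(w, x) + b             # z = w . x + b
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid squashes z into (0, 1)

x = np.array([0.5, -1.2, 3.0])  # input features
w = np.array([0.8, 0.1, 0.4])   # learned weights
b = -0.5                        # learned bias

print(neuron(x, w, b))  # ~0.73: the input aligns fairly well with the weights
```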

2. Linear Regression: Finding the Best Fit

In linear regression, the goal is to find a linear relationship between input features and a continuous output variable. The predicted output \(\hat{y}\) for an input vector \(\vec{x}=[x_1,x_2,...,x_n​]\) is given by:

$$\hat{y} = \mathbf{w} \cdot \mathbf{x} + b = w_1x_1 + w_2x_2 + ... + w_nx_n + b$$

Here, w is the vector of regression coefficients (weights), and b is the bias term. Again, the core calculation involves a dot product between the feature vector and the learned weights. The dot product effectively combines the contributions of each feature to the final prediction.
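Once w and b have been learned, prediction is a single dot product plus the bias. The feature set and numbers below are hypothetical, chosen only to show the mechanics:

```python
import numpy as np

# Hypothetical learned parameters for predicting a house price
# from features [area in 100s of sq ft, bedrooms, age in years]
w = np.array([4.0, 10.0, -0.5])  # regression coefficients
b = 20.0                         # bias / intercept

x = np.array([12.0, 3.0, 5.0])   # a new house: 1200 sq ft, 3 bedrooms, 5 years old

y_hat = np.dot(w, x) + b         # y_hat = w . x + b
print(y_hat)                     # 95.5
```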

Real-World Examples: Dot Product in Action

Let's look at some concrete examples of how the dot product and cosine similarity are used in practical AI applications:

1. Netflix Recommendation Systems: Finding Your Next Binge

Imagine Netflix aiming to recommend movies or shows you might enjoy. One common approach leverages collaborative filtering, which analyzes how users have interacted with items (in this case, movies and shows).

User-based recommendations focus on finding users with similar viewing and rating patterns to you. Let's say we represent each user as a vector where each element corresponds to a specific movie in Netflix's catalog. The value in each element represents the user's rating for that movie (e.g., on a scale of 1 to 5, or 0 if not rated).

To find users with similar tastes to you, Netflix can calculate the cosine similarity between your user vector and the user vectors of other users. Users who have rated movies similarly to you (resulting in a high cosine similarity score) are considered to have similar preferences.

For example, consider three users: Alice, Bob, and Carol, and let's look at their ratings for a few movies (on a scale of 1 to 5, where "-" indicates not rated):

Movie            Alice   Bob   Carol
Action Movie       4      5      4
Comedy Movie       5      4      5
Romance Movie      2      1      -
Sci-Fi Movie       1      2      1

We can represent their preferences as vectors:

  • Alice: \(\vec{a} = [4, 5, 2, 1]\)

  • Bob: \(\vec{b} = [5, 4, 1, 2]\)

  • Carol: \(\vec{c} = [4, 5, 0, 1]\) (we can treat unrated items as 0 for simplicity in this example, though more sophisticated methods exist)

Now, Netflix can calculate the cosine similarity between Alice's vector and Bob's vector, and between Alice's vector and Carol's vector. A higher cosine similarity between Alice and another user (say, Bob) would suggest that Bob has similar movie tastes to Alice. Based on this similarity, Netflix might recommend movies that Bob has enjoyed (and rated highly) but Alice hasn't seen yet.
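Using the toy vectors above (and the cosine_similarity helper sketched earlier), the comparison would look like this:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

alice = [4, 5, 2, 1]
bob   = [5, 4, 1, 2]
carol = [4, 5, 0, 1]  # unrated Romance Movie treated as 0

print(cosine_similarity(alice, bob))    # ~0.957
print(cosine_similarity(alice, carol))  # ~0.956
```

On these toy numbers the two scores come out almost identical, which is exactly why a real system leans on many more ratings and richer features before drawing conclusions.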

This is just a basic example to show how the dot product can be used to recommend movies; in reality, Netflix uses far more elaborate features and models to make its recommendations.

2. Word Embeddings in Natural Language Processing (NLP): Understanding Meaning

In Natural Language Processing (NLP), words are represented as dense word embeddings that capture their semantic meaning. In simple words, a word embedding is the vector form of a word; e.g. the word Computer could be written as a vector [0.9, 0.2, 0.1], where each number represents some quality of a computer. Words with similar meanings are located close to each other in the embedding space. Cosine similarity is used to quantify this semantic similarity by measuring the angle between word embedding vectors: a high cosine similarity indicates semantic closeness.

The analogy \(King - man + woman = Queen\) illustrates this concept. While direct algebraic operations on words aren't possible, vector operations on their embeddings can approximate semantic relationships. For instance, the vector resulting from subtracting the "man" embedding from the "king" embedding and adding the "woman" embedding will be close (in terms of cosine similarity) to the "queen" embedding.

Consider a hypothetical simplified 3D embedding space for these words:

  • king: [ 0.8, 0.7, 0.9 ]

  • man: [ 0.9, 0.5, 0.1 ]

  • woman: [ -0.9, 0.6, 0.2 ]

  • queen: [ -0.7, 0.8, 0.8 ]

The operation king - man + woman yields [ -1.0, 0.8, 1.0 ], which is close to the "queen" vector.
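With the hypothetical embeddings above, the arithmetic and the similarity check can be sketched as follows (cosine_similarity as before):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([ 0.8, 0.7, 0.9])
man   = np.array([ 0.9, 0.5, 0.1])
woman = np.array([-0.9, 0.6, 0.2])
queen = np.array([-0.7, 0.8, 0.8])

result = king - man + woman
print(result)                            # [-1.   0.8  1. ]
print(cosine_similarity(result, queen))  # ~0.99: points almost the same way as "queen"
```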

This ability to measure semantic similarity using cosine similarity enables NLP models to perform tasks like text classification, machine translation, question answering, and information retrieval effectively. The dot product is crucial in training these embeddings, ensuring that words with frequent co-occurrence have high dot products and thus high semantic similarity.

Conclusion: The Power of Simplicity

The dot product, with its elegant mathematical definition and intuitive geometric interpretation, serves as a fundamental building block in numerous machine learning algorithms and AI applications. From the neurons in deep neural networks to the similarity measures in recommendation systems and the semantic understanding in NLP, this seemingly simple operation empowers machines to learn, reason, and make intelligent decisions. Understanding the dot product is not just about grasping a mathematical concept; it's about unlocking a deeper appreciation for the core principles that drive the exciting advancements in artificial intelligence.

Written by

Shashank Rajak

Passionate software engineer and tech enthusiast, on a never-ending quest to explore and share the wonders of technology. With a keen interest in cutting-edge advancements and a knack for problem-solving, I strive to create innovative solutions that shape the digital landscape. Apart from tech, I love to cook and read poetry.