Understanding Embeddings and Cosine Similarity: A Beginner's Guide


Here is a stepping stone on your journey into AI!
Imagine you're trying to teach a computer to understand words, sentences, or even images. How would you represent these things in a way that a computer can understand and work with? This is where embeddings come in.
What Are Embeddings?
An embedding is like a magical translator that converts words, sentences, images, or any other type of data into a list of numbers. But not just any random numbers – these numbers are carefully chosen to capture the meaning and relationships between different items.
Let's use a simple example. Consider these words: "king", "queen", "man", "woman".
In an embedding space, these words might be represented as lists of numbers (vectors) like this:
king: [0.9, 0.1, 0.7]
queen: [0.9, 0.9, 0.7]
man: [0.5, 0.1, 0.7]
woman: [0.5, 0.9, 0.7]
These numbers aren't random – they're designed so that words with similar meanings have similar numbers. Notice how "king" and "queen" share some numbers, just like "man" and "woman" do.
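To make this concrete, here is a minimal sketch in Python using NumPy. The vectors are the made-up three-number values from the example above, not output from a real embedding model:

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made-up values from the example above)
embeddings = {
    "king":  np.array([0.9, 0.1, 0.7]),
    "queen": np.array([0.9, 0.9, 0.7]),
    "man":   np.array([0.5, 0.1, 0.7]),
    "woman": np.array([0.5, 0.9, 0.7]),
}

# Related words differ in similar ways: king - man matches queen - woman
print(embeddings["king"] - embeddings["man"])     # [0.4 0.  0. ]
print(embeddings["queen"] - embeddings["woman"])  # [0.4 0.  0. ]
```

This is the classic "king - man ≈ queen - woman" pattern: the difference between the vectors captures something like the idea of royalty.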
Why Do We Use Embeddings?
They help computers understand relationships between things
They make it possible to measure how similar different items are
They can capture complex patterns and meanings
They work well with machine learning algorithms
What is Cosine Similarity?
Now that we have our words represented as numbers, how do we measure how similar they are? This is where cosine similarity comes in.
Cosine similarity measures the angle between two vectors (our lists of numbers). The smaller the angle, the more similar the items are.
Think of it this way: Imagine you have two flashlights pointing from the same spot. If they're pointing in almost the same direction, they're similar (high cosine similarity, close to 1). If they're pointing in opposite directions, they're different (low cosine similarity, close to -1).
How Does Cosine Similarity Work?
The formula might look scary, but the concept is simple. Cosine similarity is the dot product of the two vectors divided by the product of their lengths: cos(θ) = (A · B) / (||A|| ||B||). The result always falls between -1 and 1:
If two vectors point in the same direction: cosine similarity = 1
If they're perpendicular: cosine similarity = 0
If they point in opposite directions: cosine similarity = -1
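Here is a minimal sketch of the calculation in Python with NumPy, applied to the toy word vectors from earlier (the printed values are approximate):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king  = np.array([0.9, 0.1, 0.7])
queen = np.array([0.9, 0.9, 0.7])
woman = np.array([0.5, 0.9, 0.7])

print(cosine_similarity(king, queen))  # roughly 0.84
print(cosine_similarity(king, woman))  # roughly 0.72, a bit lower
```

Both scores are positive and fairly high because these tiny toy vectors all point in broadly similar directions; with real embeddings, which have hundreds of dimensions, the gap between related and unrelated items is usually much clearer.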
Real-World Applications
Here are some everyday examples where embeddings and cosine similarity are used:
Search Engines: When you search for "cute puppies," the search engine converts your query into an embedding and finds images or text with similar embeddings.
Recommendation Systems: Netflix suggests movies by comparing the embeddings of movies you've watched with the embeddings of other movies in its catalog.
Language Translation: Translation systems use embeddings to understand that "dog" in English and "perro" in Spanish refer to the same concept.
A Simple Example
Let's say you're building a recipe recommendation system. Each recipe gets converted into an embedding based on its ingredients and cooking methods. When someone likes a spicy Mexican dish, the system can find other recipes with similar embeddings – perhaps suggesting a spicy Thai dish because the embeddings capture that both cuisines share similar spice profiles.
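Here is a rough sketch of that idea in Python. The recipe names and the three-number "flavor" vectors are purely hypothetical; a real system would learn embeddings from the recipes' ingredients and cooking methods:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical recipe embeddings: [spiciness, sweetness, richness] (made-up values)
recipes = {
    "spicy mexican tacos": np.array([0.9, 0.2, 0.6]),
    "thai green curry":    np.array([0.8, 0.3, 0.7]),
    "vanilla ice cream":   np.array([0.0, 0.9, 0.8]),
}

liked = recipes["spicy mexican tacos"]

# Rank the other recipes by how similar they are to the dish the user liked
scores = {
    name: cosine_similarity(liked, vector)
    for name, vector in recipes.items()
    if name != "spicy mexican tacos"
}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")  # the Thai curry scores highest (about 0.99)
```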
Key Takeaways
Embeddings convert things (words, images, etc.) into lists of numbers that capture their meaning
Similar items have similar embeddings
Cosine similarity helps us measure how similar these embeddings are
This technology powers many of the smart features we use daily
Remember: while the math behind embeddings and cosine similarity can be complex, the basic idea is simple. We're teaching computers to recognize similarities the way humans do, using numbers instead of intuition.
Bonus
If you want to create embeddings, you can use the text embedding models provided by OpenAI, and Hugging Face hosts many open models as well. The resulting embeddings are typically stored in a special kind of database called a vector database, so that similar items can be found quickly.
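As a rough sketch, here is one way to generate embeddings locally with the sentence-transformers library, which wraps Hugging Face models; the model name "all-MiniLM-L6-v2" is just one common choice, assumed here for illustration. The OpenAI API exposes a similar embeddings endpoint.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is a small, popular embedding model hosted on Hugging Face
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A spicy Mexican dish", "A spicy Thai curry", "A bowl of vanilla ice cream"]
embeddings = model.encode(sentences)  # one vector per sentence

print(embeddings.shape)  # (3, 384) -- each sentence becomes a 384-number vector
```

From there, the vectors can be pushed into a vector database (tools like Pinecone, Weaviate, or Qdrant) or a local index such as FAISS, so they can be searched by cosine similarity.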