Understanding Embeddings and Cosine Similarity: A Beginner's Guide

Zainul Khan

Here is a stepping stone on your journey into AI!

Imagine you're trying to teach a computer to understand words, sentences, or even images. How would you represent these things in a way that a computer can understand and work with? This is where embeddings come in.

What Are Embeddings?

An embedding is like a magical translator that converts words, sentences, images, or any other type of data into a list of numbers. But not just any random numbers – these numbers are carefully chosen to capture the meaning and relationships between different items.

Let's use a simple example: Consider these words: "king", "queen", "man", "woman"

In an embedding space, these words might be represented as lists of numbers (vectors) like this:

  • king: [0.9, 0.1, 0.7]

  • queen: [0.9, 0.9, 0.7]

  • man: [0.5, 0.1, 0.7]

  • woman: [0.5, 0.9, 0.7]

These numbers aren't random – they're designed so that words with similar meanings have similar numbers. Notice how "king" and "queen" share some numbers, just like "man" and "woman" do.
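If you want to play with these toy vectors yourself, here is a minimal Python sketch using NumPy. It reuses the illustrative numbers above; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy word embeddings from the example above (illustrative values only).
embeddings = {
    "king":  np.array([0.9, 0.1, 0.7]),
    "queen": np.array([0.9, 0.9, 0.7]),
    "man":   np.array([0.5, 0.1, 0.7]),
    "woman": np.array([0.5, 0.9, 0.7]),
}

# In this toy example, the second number acts like a rough "gender" signal:
# it is the only value separating "king"/"man" from "queen"/"woman".
print(embeddings["king"] - embeddings["man"])     # roughly [0.4, 0.0, 0.0]
print(embeddings["queen"] - embeddings["woman"])  # roughly [0.4, 0.0, 0.0]
```

Notice that the difference between "king" and "man" is the same as the difference between "queen" and "woman" in these toy numbers, which is exactly the kind of relationship real embeddings are trained to capture.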

Why Do We Use Embeddings?

  1. They help computers understand relationships between things

  2. They make it possible to measure how similar different items are

  3. They can capture complex patterns and meanings

  4. They work well with machine learning algorithms

What is Cosine Similarity?

Now that we have our words represented as numbers, how do we measure how similar they are? This is where cosine similarity comes in.

Cosine similarity measures the cosine of the angle between two vectors (our lists of numbers). The smaller the angle, the higher the similarity.

Think of it this way: Imagine you have two flashlights pointing from the same spot. If they're pointing in almost the same direction, they're similar (high cosine similarity, close to 1). If they're pointing in opposite directions, they're different (low cosine similarity, close to -1).

How Does Cosine Similarity Work?

The formula might look scary, but the concept is simple:

  • If two vectors point in the same direction: cosine similarity = 1

  • If they're perpendicular: cosine similarity = 0

  • If they point in opposite directions: cosine similarity = -1
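Here is a small Python sketch of that formula, applied to the toy vectors from earlier. The exact scores are just a consequence of the made-up numbers, but the 1 / 0 / -1 behaviour from the list above is what the function reproduces.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([0.9, 0.1, 0.7])
queen = np.array([0.9, 0.9, 0.7])
man   = np.array([0.5, 0.1, 0.7])
woman = np.array([0.5, 0.9, 0.7])

print(cosine_similarity(king, man))    # ~0.96: nearly the same direction
print(cosine_similarity(king, woman))  # ~0.72: they disagree on the "gender" value
print(cosine_similarity(king, king))   #  1.0: identical direction
print(cosine_similarity(king, -king))  # -1.0: exactly opposite directions
```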

Real-World Applications

Here are some everyday examples where embeddings and cosine similarity are used:

  1. Search Engines: When you search for "cute puppies," the search engine converts your query into an embedding and finds images or text with similar embeddings.

  2. Recommendation Systems: Netflix suggests movies by comparing the embeddings of movies you've watched with other movies in their database.

  3. Language Translation: Translation systems use embeddings to understand that "dog" in English and "perro" in Spanish refer to the same concept.

A Simple Example

Let's say you're building a recipe recommendation system. Each recipe gets converted into an embedding based on its ingredients and cooking methods. When someone likes a spicy Mexican dish, the system can find other recipes with similar embeddings – perhaps suggesting a spicy Thai dish because the embeddings capture that both cuisines share similar spice profiles.
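As a rough sketch of how such a system might rank suggestions, assume each recipe has already been converted into an embedding. The recipe names and vectors below are made up purely for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical recipe embeddings; in practice these would come from an
# embedding model fed with ingredients and cooking methods.
recipes = {
    "spicy mexican tacos": np.array([0.9, 0.8, 0.1]),
    "spicy thai curry":    np.array([0.8, 0.9, 0.2]),
    "plain oatmeal":       np.array([0.1, 0.1, 0.9]),
}

liked = recipes["spicy mexican tacos"]

# Rank every other recipe by how similar its embedding is to the liked one.
ranked = sorted(
    ((name, cosine_similarity(liked, vec))
     for name, vec in recipes.items() if name != "spicy mexican tacos"),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
# The spicy thai curry ranks first because its (made-up) vector points in
# nearly the same direction as the tacos' vector.
```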

Key Takeaways

  1. Embeddings convert things (words, images, etc.) into lists of numbers that capture their meaning

  2. Similar items have similar embeddings

  3. Cosine similarity helps us measure how similar these embeddings are

  4. This technology powers many of the smart features we use daily

Remember: while the math behind embeddings and cosine similarity can get complex, the basic idea is simple: we're teaching computers to recognize similarity the way humans do, using numbers instead of intuition.

Bonus

If you want to create embeddings, you can use the text embedding models provided by OpenAI, and Hugging Face offers many models as well. These embeddings are typically stored in special databases called vector databases.
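As a rough sketch of what that looks like in practice, here are two common options: OpenAI's embeddings endpoint and a Hugging Face model loaded through the sentence-transformers library. The model names are just popular choices, not recommendations; the OpenAI call needs an API key, and both require their packages to be installed.

```python
# Option 1: OpenAI's embeddings endpoint (requires the `openai` package
# and an OPENAI_API_KEY environment variable).
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",   # one of OpenAI's text embedding models
    input="cute puppies",
)
vector = response.data[0].embedding   # a plain list of floats
print(len(vector))

# Option 2: a Hugging Face model via the `sentence-transformers` package.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small, widely used model
vector2 = model.encode("cute puppies")           # a NumPy array
print(vector2.shape)
```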
