Vector Embeddings


Vector embeddings are numeric representations of data that capture certain features of that data. They are mathematical representations of objects in a continuous vector space, built to capture the semantic meaning or properties of those objects in a way that can be searched efficiently.
In other words, vector embeddings represent data in a way that lets a machine relate different words (plain strings) to one another.
Each string is first converted to integers so it can be worked with numerically (tokenization).
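As a toy illustration of tokenization (not any real tokenizer — production systems use schemes like byte-pair encoding, and the vocabulary here is invented for the example), here is a sketch of mapping words to integer IDs:

```python
# Toy word-level tokenizer: assign each new word the next integer ID.
# Real tokenizers (e.g. BPE) split text into subword units instead.
def build_vocab(corpus):
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    # Look up each word's integer ID.
    return [vocab[word] for word in text.split()]

vocab = build_vocab("the cat sat on the mat")
print(tokenize("the cat sat", vocab))  # [0, 1, 2]
```

Once text is integers, a model can learn a vector for each ID; those learned vectors are the embeddings.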
Prerequisite:
Real-world examples (non-tech):
You will (most of the time) find the index of a book within its first few pages.
The nutritional value of packaged food is usually at the top of the back side.
The signature line is almost always at the bottom of a form.
Warning labels are usually in red or bold.
Our brain doesn’t know these things from the start; it learns them by seeing them often and builds relations that it then applies to all similar things.
Similarly, text is converted to numbers, and these numbers are positioned so that their placement encodes meaning and relationships.
- For this, we use dimensions (1D, 2D, …, ND). An embedding can have any number of dimensions; how many depends on the model.
Working:
For simple understanding we will use 2D.
The same approach used in 2D extends to N dimensions.
To find a word with a similar relationship, we measure the displacement between one pair of vectors and then follow the same displacement from a new starting point.
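A minimal 2D sketch of this idea, using made-up coordinates (the classic "king − man + woman ≈ queen" analogy; none of these numbers come from a real model):

```python
# Made-up 2D coordinates purely for illustration.
words = {
    "man":   (1.0, 1.0),
    "woman": (1.0, 3.0),
    "king":  (5.0, 1.0),
    "queen": (5.0, 3.0),
}

# Displacement from "man" to "woman"...
dx = words["woman"][0] - words["man"][0]
dy = words["woman"][1] - words["man"][1]

# ...applied from "king" lands on the coordinates of "queen".
target = (words["king"][0] + dx, words["king"][1] + dy)
print(target)  # (5.0, 3.0)
```

In a well-trained embedding space, the vector nearest to that target point is the word that completes the analogy.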
What actual coordinates look like in N dimensions:
cat = [1.5, -0.4, 7.2, 19.6, 3.1, ..., 20.2]
kitty = [1.5, -0.4, 7.2, 19.5, 3.2, ..., 20.8]
Vectorizable Data:
Effective vector embeddings can be generated from any kind of data object.
Text data is the most common, followed by images and audio, but time-series data, 3D models, video, molecules, etc. can also be embedded.
Embeddings are generated such that two objects with similar semantics will have vectors that are "close" to each other, i.e. that have a "small" distance between them in vector space.
That distance can be calculated in multiple ways; one of the simplest is the sum of the absolute differences between the elements at each position i in the two vectors (the Manhattan, or L1, distance).
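A small sketch of this distance calculation (often called the Manhattan or L1 distance), using the first five dimensions of the cat/kitty vectors above plus an invented "dog" vector for contrast:

```python
def manhattan_distance(a, b):
    # Sum of absolute differences between elements at each position i.
    return sum(abs(x - y) for x, y in zip(a, b))

# First five dimensions of the example vectors; "dog" is invented.
cat   = [1.5, -0.4, 7.2, 19.6, 3.1]
kitty = [1.5, -0.4, 7.2, 19.5, 3.2]
dog   = [4.0, -1.2, 2.0, 10.0, 6.5]

print(manhattan_distance(cat, kitty))  # small -> semantically close
print(manhattan_distance(cat, dog))    # larger -> less similar
```

Other common choices are Euclidean (L2) distance and cosine similarity; which one works best depends on how the embeddings were trained.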
Summary:
To summarize, vector embeddings are the numerical representation of unstructured data of different data types, such as text data, image data, or audio data.
Depending on the data type, vector embeddings are created using machine learning models that are able to translate the meaning of an object into a numerical representation in a high dimensional space.
Dimensions typically range from 100 to 4,000.
Thus, a variety of machine learning models can create different types of vector embeddings, such as word embeddings, sentence embeddings, text embeddings, or image embeddings.
Vector embeddings capture the semantic relationship between data objects in numerical values and thus, you can find similar data points by determining their nearest neighbors in the high dimensional vector space.
This concept is also called similarity search and can be applied in different applications, such as text search, image search, or recommendation systems.
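A brute-force nearest-neighbor sketch over a toy set of embeddings (all names and vectors here are invented; real similarity-search systems use approximate indexes such as HNSW to scale beyond a linear scan):

```python
def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy embedding store: word -> invented vector.
store = {
    "cat":   [1.5, -0.4, 7.2],
    "kitty": [1.5, -0.4, 7.1],
    "car":   [9.0,  3.3, 0.2],
}

def nearest(query, store, k=2):
    # Rank every stored vector by its distance to the query vector.
    ranked = sorted(store, key=lambda w: manhattan_distance(store[w], query))
    return ranked[:k]

print(nearest([1.4, -0.5, 7.0], store))  # ['kitty', 'cat']
```

The same pattern underlies text search, image search, and recommendations: embed the query, then return the stored items whose vectors are closest.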
Written by Naveen Kumar
