Recommender System with GCP Vector Search

HU, Pili
4 min read

Recommender systems have never been this easy and versatile, given the latest developments in AI. The new paradigm is just two-fold:

  • Turn everything into a vector, a.k.a. an embedding. For example, one can use GPT to embed text, or CLIP to embed text and images into the same space... The latest generation of large models is mostly multi-modal, meaning they can handle multiple media types in the same embedding space.
  • Leverage a vector database to store the embeddings of the product catalogue. A product recommendation is then simply a vector search of the user query's embedding among the product embeddings (a minimal sketch follows this list).
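
To make the paradigm concrete, here is a minimal NumPy sketch with made-up 3-dimensional embeddings (real embeddings come from a model, as in the next section; the product data here is hypothetical):

import numpy as np

# Hypothetical product embeddings; in practice an embedding model produces them.
product_ids = ['p1', 'p2', 'p3']
product_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])

# Hypothetical embedding of the user query.
query_emb = np.array([0.85, 0.15, 0.05])

# Recommendation = vector search: rank products by inner-product similarity.
scores = product_embs @ query_emb
for idx in np.argsort(-scores):
    print(product_ids[idx], scores[idx])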

Embedding on GCP

We use the gecko model (textembedding-gecko@001) to get text embeddings of five short phrases:

import time

import numpy as np
import tqdm  # to show a progress bar
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

# Get embeddings for a list of texts, a few at a time.
BATCH_SIZE = 5

def get_embeddings_wrapper(texts):
    embs = []
    for i in tqdm.tqdm(range(0, len(texts), BATCH_SIZE)):
        time.sleep(1)  # throttle to avoid the quota error
        result = model.get_embeddings(texts[i : i + BATCH_SIZE])
        embs.extend(e.values for e in result)
    return embs

embs = get_embeddings_wrapper([
    'hello world',
    'hello earth',
    'hello moon',
    'bloody hell',
    'heaven',
])

# Pairwise inner-product similarity of the five embeddings.
embs = np.array(embs)
similarities = np.dot(embs, embs.T)

print(similarities)

The following is the resulting inner-product (IP) similarity matrix:

[[0.99999993 0.90285621 0.82134352 0.65170191 0.66193165]
 [0.90285621 0.99999995 0.83382602 0.66465545 0.68855056]
 [0.82134352 0.83382602 0.99999981 0.64097287 0.62142062]
 [0.65170191 0.66465545 0.64097287 0.99999988 0.65023749]
 [0.66193165 0.68855056 0.62142062 0.65023749 1.00000001]]

We can see that:

  • Every vector is maximally similar to itself, which is expected.
  • 'hello world' is very similar to 'hello earth' and 'hello moon' (first row, with IP scores of 0.90 and 0.82, respectively), but it is not very similar to 'bloody hell' or 'heaven', both below 0.7.

This is exactly the key: think of "hello world" as the user query, and "hello earth" and "hello moon" as the recommended products.
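
Incidentally, the diagonal entries being ≈1 suggest that the gecko embeddings come out (approximately) unit-normalized, so the inner product here behaves like cosine similarity. A quick check on the embeddings from above:

# All norms should be ~1.0 if the embeddings are unit-normalized.
print(np.linalg.norm(embs, axis=1))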

Then the question is: how do we scale up?

If there are only 5 products, as in the example above, an exhaustive inner-product scan quickly gives the answer. However, the product catalogue can easily grow to thousands of items, and the query volume to millions.

This is where vector databases come into play.

Vector Index on GCP

Here are the overall steps:

  • Store the items as a list of <id, embedding> pairs in JSONL format.
  • Upload the JSONL file to a GCS bucket.
  • Create an index from the JSONL (think of the "index" as the data file of a database).
  • Create an endpoint (think of the "endpoint" as the database server software that reads the data file and serves user queries).
  • Deploy the index to the endpoint.

The last three steps could take ~20min for the first deployment.

The following is the reference code.

Store embedding in JSONL format:

import pandas as pd

df = pd.DataFrame([
    [1, 'hello world'],
    [2, 'hello earth'],
    [3, 'hello moon'],
    [4, 'bloody hell'],
    [5, 'heaven'],
], columns=['id', 'text'])

# Embed all texts in one batched call instead of one API call per row.
# df['embedding'] = df['text'].apply(get_embeddings_wrapper)
df = df.assign(embedding=get_embeddings_wrapper(list(df['text'])))

# One JSON record per line: {"id": ..., "embedding": [...]}
jsonl_string = df[["id", "embedding"]].to_json(orient="records", lines=True)
with open("questions.json", "w") as f:
    f.write(jsonl_string)
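
Each line of questions.json is one self-contained JSON record. It should look roughly like this (values illustrative, embedding truncated; a real gecko embedding has 768 dimensions):

{"id":1,"embedding":[0.0123,-0.0456,...]}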

Create the index and the index endpoint, and deploy the index to the endpoint:

PROJECT_ID = "your-project-id"  # Change to your own
LOCATION = "us-central1"
UID = "test_vs_1"

BUCKET_URI = f"gs://{PROJECT_ID}-test-vs-{UID}"

import os

# Create the bucket and upload the JSONL file.
os.system(f'gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}')
os.system(f'gsutil cp questions.json {BUCKET_URI}')

# init the aiplatform package
from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=LOCATION)

# create index
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=f"test-vs-index-{UID}",
    contents_delta_uri=BUCKET_URI,
    dimensions=768,
    approximate_neighbors_count=20,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)

# create IndexEndpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"test-vs-index-endpoint-{UID}",
    public_endpoint_enabled=True,
)

DEPLOYED_INDEX_ID = f"test_vs_deployed_index_{UID}"

my_index_endpoint.deploy_index(index=my_index, deployed_index_id=DEPLOYED_INDEX_ID)
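
It may save a step later to print the endpoint's resource name right after creation; the trailing numeric segment is the ID needed in the next section:

# e.g. projects/PROJECT_NUMBER/locations/us-central1/indexEndpoints/1234567890
print(my_index_endpoint.resource_name)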

Vector Query on GCP

DEPLOYED_INDEX_ID = f"test_vs_deployed_index_{UID}"

# List all index endpoints to find the numeric ID behind our display name.
for r in list(aiplatform.MatchingEngineIndexEndpoint.list()):
    print(r.name, r.display_name)

my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name='A-NUMERIC-ID-OF-THE-RESOURCE'
)

# Show which indexes are deployed on this endpoint.
print(my_index_endpoint.deployed_indexes)

test_customer_input = "Hi, Earth!"
test_question_embeddings = get_embeddings_wrapper([test_customer_input])

# Test query
response = my_index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=test_question_embeddings,
    num_neighbors=20,
)

# show the result
for neighbor in response[0]:
    print(int(neighbor.id), neighbor.distance)

Note that A-NUMERIC-ID-OF-THE-RESOURCE is a bit obscure. You can use MatchingEngineIndexEndpoint.list to find the mapping between the numeric ID and the display name.

Another thing to note is that the deployed index has to be passed as a parameter when querying, identified by its deployed index ID. If you can reconstruct the ID from the earlier code, the process is much easier; otherwise, you can use my_index_endpoint.deployed_indexes to find out which indexes are deployed, as sketched below.
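
For example, a quick sketch to print the ID of each deployed index on the endpoint (deployed_indexes returns DeployedIndex objects):

for d in my_index_endpoint.deployed_indexes:
    print(d.id, d.index)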

The following is the query result:

2 0.9123721122741699
1 0.8374309539794922
3 0.7786675691604614
4 0.6758614182472229
5 0.6652098894119263

We can see that the top matches are "hello earth", "hello world", and "hello moon" (IDs 2, 1, and 3).
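
Since the index only stores <id, embedding> pairs, mapping the returned IDs back to product texts is up to us. For example, reusing the df built earlier:

# Map the returned IDs back to the original texts.
id_to_text = dict(zip(df['id'], df['text']))
for neighbor in response[0]:
    print(neighbor.id, id_to_text[int(neighbor.id)], neighbor.distance)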

Reflection

  • The endpoint is the actual server that handles vector search; the index is the data structure that makes vector search possible. The endpoint is also the actual unit GCP charges for.
  • One can specify machine_type in deploy_index to change the machine type; if kept at the default, the machine can be rather expensive (see the sketch after this list).
  • Provisioning a deployed index takes as long as 20-30 minutes. This is less immediate than some open-source solutions when the problem size is small to medium (e.g., hundreds or thousands of items).
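
For reference, a sketch of requesting a smaller machine at deploy time (e2-standard-2 is an illustrative choice; check the currently supported types and pricing):

my_index_endpoint.deploy_index(
    index=my_index,
    deployed_index_id=DEPLOYED_INDEX_ID,
    machine_type="e2-standard-2",  # illustrative; verify supported types
    min_replica_count=1,
    max_replica_count=1,
)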

Future

  • Test hybrid embedding of images and text using real product catalogues.
  • Deep dive into the tree-AH algorithm.
  • Test the accuracy (recall@K) and throughput.