Recommender System with GCP Vector Search
Recommender systems have never been so easy and versatile, given the latest developments in AI. The new paradigm is just two-fold:
- Turn everything into a vector, aka an embedding. For example, one can use GPT to embed text; one can use CLIP to embed text and images into the same space. The latest generation of large models is mostly multi-modal, meaning they can handle multiple media types in the same space.
- Leverage a vector database to store the embeddings of the product catalogue. A product recommendation is then simply a vector search of the user query's embedding among the product embeddings.
Embedding on GCP
We use the gecko model (textembedding-gecko@001) to get text embeddings of 5 text phrases:
import time
import numpy as np
import tqdm  # to show a progress bar
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

# get embeddings for a list of texts, batched to respect the API limit
BATCH_SIZE = 5

def get_embeddings_wrapper(texts):
    embs = []
    for i in tqdm.tqdm(range(0, len(texts), BATCH_SIZE)):
        time.sleep(1)  # to avoid the quota error
        result = model.get_embeddings(texts[i : i + BATCH_SIZE])
        embs = embs + [e.values for e in result]
    return embs

embs = get_embeddings_wrapper([
    'hello world',
    'hello earth',
    'hello mooon',
    'bloody hell',
    'heaven',
])
embs = np.array(embs)
similarities = np.dot(embs, embs.T)
print(similarities)
The following is the resulting inner-product (IP) similarity matrix:
[[0.99999993 0.90285621 0.82134352 0.65170191 0.66193165]
[0.90285621 0.99999995 0.83382602 0.66465545 0.68855056]
[0.82134352 0.83382602 0.99999981 0.64097287 0.62142062]
[0.65170191 0.66465545 0.64097287 0.99999988 0.65023749]
[0.66193165 0.68855056 0.62142062 0.65023749 1.00000001]]
We can see that:
- Every vector is maximally similar to itself, which is expected. The diagonal values are all approximately 1.0, which also tells us the embeddings are roughly unit-normalized, so the inner product here behaves like cosine similarity.
- 'hello world' is very similar to 'hello earth' and 'hello moon', with IP scores of 0.90 and 0.82 respectively (first row of the matrix); but it is not very similar to 'bloody hell' or 'heaven', where the IP similarity is below 0.7.
This is exactly the key: think of 'hello world' as the user query, and 'hello earth' and 'hello moon' as the recommended products.
Then the question is: how do we scale up?
If there are only 5 products, as in the example above, an exhaustive inner product quickly gives the answer. However, the product catalogue can easily grow to thousands of items, and the query volume to millions. This is where vector databases come into play.
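To make the cost concrete: exhaustive search takes O(N·d) multiply-adds per query, so it grows linearly with both catalogue size and query volume. Here is a back-of-envelope sketch with illustrative sizes (the numbers are assumptions, not a benchmark):
import time
import numpy as np

rng = np.random.default_rng(0)
N, d, Q = 100_000, 768, 100  # catalogue size, embedding dimension, query count
catalogue = rng.normal(size=(N, d)).astype(np.float32)
queries = rng.normal(size=(Q, d)).astype(np.float32)

t0 = time.time()
scores = queries @ catalogue.T  # exhaustive: Q * N * d multiply-adds
top1 = scores.argmax(axis=1)
print(f"{Q} exhaustive queries over {N} items took {time.time() - t0:.2f}s")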
Vector Index on GCP
Here are the overall steps:
- Store the items as a list of <id, embedding> pairs in JSONL format (see the example below).
- Upload the JSONL file to a GCS bucket.
- Create an index from the JSONL (think of the "index" as the data file of a database).
- Create an endpoint (think of the "endpoint" as the database server software that reads the data file and serves user queries).
- Deploy the index to the endpoint.
The last three steps could take ~20min for the first deployment.
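For reference, each line of the JSONL file is one record with an id field and an embedding field. The values below are illustrative; real gecko embeddings have 768 dimensions:
{"id": 1, "embedding": [0.0123, -0.0456, 0.0789]}
{"id": 2, "embedding": [0.0231, 0.0567, -0.0891]}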
The following are the reference code snippets.
Store the embeddings in JSONL format:
import pandas as pd

df = pd.DataFrame([
    [1, 'hello world'],
    [2, 'hello earth'],
    [3, 'hello mooon'],
    [4, 'bloody hell'],
    [5, 'heaven'],
], columns=['id', 'text'])

df = df.assign(embedding=get_embeddings_wrapper(list(df['text'])))

# keep only the fields Vector Search expects: id and embedding
jsonl_string = df[["id", "embedding"]].to_json(orient="records", lines=True)
with open("questions.json", "w") as f:
    f.write(jsonl_string)
Create the index and the index endpoint, then deploy the index to the endpoint:
PROJECT_ID = "your-project-id"  # change to your own project
LOCATION = "us-central1"
UID = 'test_vs_1'
BUCKET_URI = f"gs://{PROJECT_ID}-test-vs-{UID}"

import os
os.system(f'gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}')
os.system(f'gsutil cp questions.json {BUCKET_URI}')

# init the aiplatform package
from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=LOCATION)

# create the index
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=f"test-vs-index-{UID}",
    contents_delta_uri=BUCKET_URI,
    dimensions=768,
    approximate_neighbors_count=20,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)

# create the IndexEndpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"test-vs-index-endpoint-{UID}",
    public_endpoint_enabled=True,
)

# deploy the index to the endpoint
DEPLOYED_INDEX_ID = f"test_vs_deployed_index_{UID}"
my_index_endpoint.deploy_index(index=my_index, deployed_index_id=DEPLOYED_INDEX_ID)
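After deploy_index returns, the endpoint's numeric resource ID, which we will need in the next section, can be read directly off the endpoint object:
print(my_index_endpoint.resource_name)  # ends with the numeric ID: projects/.../indexEndpoints/NUMBER
print(my_index_endpoint.deployed_indexes)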
Vector Query on GCP
DEPLOYED_INDEX_ID = f"test_vs_deployed_index_{UID}"
for r in list(aiplatform.MatchingEngineIndexEndpoint.list()):
print(r.name, r.display_name)
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(
index_endpoint_name='A-NUMERIC-ID-OF-THE-RESOURCE'
)
print(my_index_endpoint.deployed_indexes)
test_customer_input = "Hi, Earth!"
test_question_embeddings = get_embeddings_wrapper([test_customer_input])
# Test query
response = my_index_endpoint.find_neighbors(
deployed_index_id=DEPLOYED_INDEX_ID,
queries=test_question_embeddings,
num_neighbors=20,
)
# show the result
import numpy as np
for idx, neighbor in enumerate(response[0]):
id = np.int64(neighbor.id)
print(id, neighbor.distance)
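The returned IDs map back to rows of the dataframe from the JSONL step; here is a small sketch to recover the matched texts (assuming df is still in scope):
id_to_text = dict(zip(df['id'], df['text']))
for neighbor in response[0]:
    print(neighbor.distance, id_to_text[int(neighbor.id)])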
Note that A-NUMERIC-ID-OF-THE-RESOURCE is a bit obscure. You can use MatchingEngineIndexEndpoint.list to find the mapping between the numeric ID and the display name.
Another thing to note is that the deployed index has to be passed as a parameter at query time, identified by its deployed index ID. If you can spell out the ID from the previous code, the process is much easier; otherwise, my_index_endpoint.deployed_indexes shows which deployed indexes exist.
The following is the query result:
2 0.9123721122741699
1 0.8374309539794922
3 0.7786675691604614
4 0.6758614182472229
5 0.6652098894119263
We can see that the top matches for "Hi, Earth!" are 'hello earth' (ID 2), 'hello world' (ID 1), and 'hello moon' (ID 3).
Reflection
- The endpoint is the actual server that handles vector search; the index is the data structure that makes vector search possible. The endpoint is also the unit GCP charges for.
- One can specify machine_type in deploy_index to change the machine type (see the sketch after this list). If left at the default, the machine type can be rather expensive.
- Provisioning a deployed index takes as long as 20-30 minutes. This is less immediate than some open-source solutions when the problem size is small to medium (e.g. hundreds or thousands of items).
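For reference, a sketch of deploy_index with an explicit machine type; e2-standard-2 here is an assumption, so verify it against the currently supported machine types and pricing:
my_index_endpoint.deploy_index(
    index=my_index,
    deployed_index_id=f"{DEPLOYED_INDEX_ID}_small",  # a fresh ID, since each ID deploys once per endpoint
    machine_type="e2-standard-2",  # assumed machine type; check the docs
    min_replica_count=1,
    max_replica_count=1,
)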
Future
- Test hybrid embeddings of images and text using real product catalogues.
- Deep dive into the tree-AH algorithm.
- Test the accuracy (recall@K) and throughput (a starter sketch for recall@K follows).
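For the recall@K item, a minimal starting sketch, assuming exhaustive inner-product search gives the ground-truth ranking and the ANN ranking comes from find_neighbors:
def recall_at_k(true_ids, ann_ids, k):
    # fraction of the exhaustive top-k that the ANN index also returned
    return len(set(true_ids[:k]) & set(ann_ids[:k])) / k

# illustrative usage with made-up rankings
print(recall_at_k([2, 1, 3, 4, 5], [2, 3, 1, 5, 4], k=3))  # 1.0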