Leveraging one of 23ai's new features - the VECTOR datatype.

Phil Grinham
10 min read

There are many fantastic blogs about using Oracle 23ai as a powerful tool to take advantage of the growing interest in artificial intelligence. For example, the excellent Oracle ACE Jonathan Dixon runs a website that collects Oracle-related blogs such as this one, and he has integrated AI to help classify each post and determine its relevance to Oracle.

This is the link to his article explaining how he did it, APEX Developer Blogs Website.

My objective was a little different: I wanted to use my Always Free Oracle Autonomous Database to load some documents as BLOBs and be able to search them using the new VECTOR datatype.

💡
Oracle Text has long been an option for this very requirement but I wanted to be able to use the new functions available in 23ai that centre around the VECTOR datatype.

So, my use case was to download the official 24.2 Oracle APEX documentation so I could search through it after converting it into a series of vectors. This was purely for experimentation and I am not trying to replace the existing online documentation provided by Oracle, which is much easier to search and, as it turns out (spoiler alert), more accurate too.

I wanted to load the documentation into the database so I could put an APEX screen on top of it to replicate searching the text to find the most relevant help articles.

💡
The download from Oracle produces a zip of all the files that comprise the website, so I simply uploaded the compressed file and used the APEX_ZIP package to split the contents into individual BLOBs so I could work with each section that contains meaningful data.

Once I had all the documents extracted and stored as individual BLOBs, the next step was to embed them as converted VECTOR data types. If you're not familiar with embedding data or what a vector is, think of it as converting text into numerical values that represent its meaning. These values can then be compared to other text (like search terms) for semantic similarity or relevance.
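As a toy illustration of the idea (my own example, not part of the load process), 23ai lets you build tiny vectors by hand with TO_VECTOR and compare them with VECTOR_DISTANCE; vectors pointing in similar directions have a smaller cosine distance:

```sql
-- Two hand-made 3-dimensional vectors; real embeddings have hundreds of dimensions
SELECT VECTOR_DISTANCE(TO_VECTOR('[1, 2, 3]'),
                       TO_VECTOR('[1, 2, 4]'),
                       COSINE) AS NEAR_DISTANCE,
       VECTOR_DISTANCE(TO_VECTOR('[1, 2, 3]'),
                       TO_VECTOR('[-3, 0, 1]'),
                       COSINE) AS FAR_DISTANCE
FROM   DUAL;
```

The first pair points in nearly the same direction, so its cosine distance is close to zero; the second pair is orthogonal, so its distance is much larger.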

💡
A fantastic visual representation of what vectors “look like” can be seen in this LinkedIn post by my friend and former colleague Paul Brookes.

Creating the embedding is where the AI Large Language Model (LLM) comes in, and we have many options to choose from. However, I wanted a solution that would stay entirely within the database, avoiding the need to set up ACLs, proxies or firewall exceptions, or transfer large amounts of data to and from an external model. Fortunately, Oracle has long offered this capability with its Machine Learning features, which are designed to have the model co-located with the data inside the database.

To do this, I followed this Oracle blog to download Hugging Face’s all-MiniLM-L6-v2 model as an ONNX file that can be used directly within the database, and this Oracle blog to load it into my OCI environment.

Step 1 was to create an Object Storage bucket.

Once created, I used the “Objects” section to upload the ONNX file downloaded from Oracle.

A Pre-Authenticated Request then needed to be created to allow the database to connect to the bucket. Here you can optionally specify an expiration date so the request does not remain open indefinitely. We only need it for as long as it takes to run a few scripts in the database, so I let it expire after a few hours.

You will receive a Pre-Authenticated Request URL. Remember, this will only be shown once, so be sure to copy it. As an elevated user, grant the following permissions to the schema where you want to use the model:

GRANT EXECUTE ON DBMS_CLOUD TO <Your Schema Name>;
GRANT CREATE MINING MODEL TO <Your Schema Name>;

Using the following block, connect to the schema where you want the model loaded, create a cloud credential for that schema, fetch the ONNX file from Object Storage, and then load it into the database.

DECLARE
   V_CONTENTS BLOB;
BEGIN
   -- Create a credential so DBMS_CLOUD can authenticate against Object Storage
   DBMS_CLOUD.CREATE_CREDENTIAL(CREDENTIAL_NAME => 'OBJ_STORE_CRED',
                                USERNAME        => 'your_schema',
                                PASSWORD        => 'your_password');

   -- Fetch the ONNX model file via the Pre-Authenticated Request URL
   V_CONTENTS := DBMS_CLOUD.GET_OBJECT(CREDENTIAL_NAME => 'OBJ_STORE_CRED',
                                       OBJECT_URI      => 'path of URI copied from pre auth request' ||
                                                          'all-MiniLM-L6-v2.onnx');

   -- Register the model in the database so it can be used by VECTOR_EMBEDDING
   DBMS_VECTOR.LOAD_ONNX_MODEL(MODEL_NAME => 'ALL_MINILM_L6', MODEL_DATA => V_CONTENTS);
END;
/
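Once the block completes, a quick sanity check (my own addition, not from the Oracle blog) is to confirm the model is registered and can produce an embedding:

```sql
-- The loaded ONNX model should appear in the mining model views
SELECT MODEL_NAME, MINING_FUNCTION, ALGORITHM
FROM   USER_MINING_MODELS
WHERE  MODEL_NAME = 'ALL_MINILM_L6';

-- Generate a test embedding directly in SQL
SELECT VECTOR_EMBEDDING(ALL_MINILM_L6 USING 'hello world' AS DATA) AS EMBEDDING
FROM   DUAL;
```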

The model is now loaded and ready to be used to create VECTOR datatype embeddings. I used the DBMS_VECTOR_CHAIN package, which handles all the "heavy lifting" for you, as long as you are aware of its limitations.

There is a 4,000-character limit on the embedding data, even though the parameters used in the package calls are defined as CLOBs. The official documentation mentions this, but it would be clearer if the parameter were declared as VARCHAR2 to signal the 4K limit. Maybe this is in preparation for a future version that will accept larger payloads?

To work around this, the data needs to be “chunked” into pieces of no more than 4,000 characters and, to Oracle’s credit, the same package provides exactly this function. So the idea is:

  1. I will loop through each of the files extracted from the zip stored in my AI_DOCUMENT table.

  2. Each file will be converted to text (from html/PDF).

  3. Each file will have extraneous characters such as excessive white-space and HTML markup removed using a custom function I created.

  4. This sanitised data will then be split into chunks that fit within the 4,000-character limit (I chunked by words, 100 at a time).

  5. Each chunk will be converted to the VECTOR datatype and deposited in a child table called AI_DOCUMENT_CHUNK representing a portion of the parent file it originated from.
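Before embedding anything, it can help to eyeball what UTL_TO_CHUNKS actually produces. This standalone sketch (the in-line document text is just an example) returns each chunk as a small JSON document containing an id, offset, length and the chunk text:

```sql
SELECT C.COLUMN_VALUE AS CHUNK_JSON
FROM   TABLE(DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS(
               TO_CLOB('A first sentence about APEX. A second sentence about vectors.'),
               JSON('{"by":"words","max":"5","overlap":"0","split":"sentence","normalize":"all"}'))) C;
```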

INSERT INTO AI_DOCUMENT_CHUNK
       (AI_DOCUMENT_PK,
        EMBED_ID,
        EMBED_DATA,
        EMBED_VECTOR)
SELECT DT.AI_DOCUMENT_PK,
       ET.EMBED_ID,
       ET.EMBED_DATA,
       TO_VECTOR(ET.EMBED_VECTOR)
FROM   AI_DOCUMENT DT
CROSS  APPLY DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDINGS(
               DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS(
                 REMOVE_PATTERNS_FROM_CLOB(DBMS_VECTOR_CHAIN.UTL_TO_TEXT(DT.BLOB_CONTENT)),
                 JSON('{"by":"words"
                       ,"max":"100"
                       ,"overlap":"5"
                       ,"split":"sentence"
                       ,"normalize":"all"
                       }')),
               JSON('{"provider":"database", "model":"ALL_MINILM_L6"}')
             ) T
CROSS  APPLY JSON_TABLE(T.COLUMN_VALUE, '$[*]'
               COLUMNS(EMBED_ID     NUMBER         PATH '$.embed_id',
                       EMBED_DATA   VARCHAR2(4000) PATH '$.embed_data',
                       EMBED_VECTOR CLOB           PATH '$.embed_vector')) ET
WHERE  DT.AI_DOCUMENT_PK = P_AI_DOCUMENT_PK;

The above snippet of code shows how steps 1 to 5 are being performed in a single SQL statement:

  1. The loop has not been included but you can see the WHERE clause is limiting to a single document at a time.

  2. DBMS_VECTOR_CHAIN.UTL_TO_TEXT is converting the BLOB into a CLOB.

  3. My custom function REMOVE_PATTERNS_FROM_CLOB is a series of REGEXP_REPLACE calls, tailored to my needs, that strip the “noise” from the meaningful text.

  4. DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS splits the data into chunks that fit within the 4K limit. It accepts a set of JSON parameters instructing it how to split the text.

  5. DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDINGS then creates the VECTOR datatype, taking JSON parameters of its own and, in this implementation, leveraging the ONNX file we loaded earlier to convert the text.

💡
Point of interest: In the official Oracle AI Vector Deep Dive course, in one of the labs they make a point to mention “Currently, the vector embedding SQL function is significantly slower than creating vectors outside of the database with local embedding models.” [copied from the course transcript]

Once the data is loaded, we can add an APEX screen on top of it to perform simple vector-based similarity searches. Oracle 23ai offers three types of similarity search: exact, approximate (using a vector index) and multi-vector.

I will be using the “Multi-Vector Similarity Search” approach, also known as the multi-document approach.
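For contrast, a plain single-vector similarity search is just an ORDER BY on VECTOR_DISTANCE with a simple row limit, so the top chunks can all come from the same document. A minimal sketch against my tables:

```sql
-- Basic search: top 5 chunks overall, with no per-document limit
SELECT C.EMBED_DATA
FROM   AI_DOCUMENT_CHUNK C
ORDER  BY VECTOR_DISTANCE(C.EMBED_VECTOR,
                          VECTOR_EMBEDDING(ALL_MINILM_L6 USING :P1_SEARCH AS DATA),
                          COSINE)
FETCH  FIRST 5 ROWS ONLY;
```

The multi-vector approach adds the PARTITIONS BY clause to the FETCH so the results are spread across documents rather than dominated by one.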

💡
APEX 24.1.7 is the current version available to me in the always free OCI environment so I cannot currently leverage the new Search Configuration available in 24.2 that natively handles the new VECTOR datatype.

Here is a screenshot of the app where I asked for it to find the built-in substitution strings APEX provides developers.

The SQL underpinning the displayed classic report (using the Cards theme) is a little long but I will explain why later. Here it is in its entirety:

WITH DOC_DATA AS
 (SELECT D.FILENAME_ORIG AS CARD_TITLE,
         'EMBED ID: ' || C.EMBED_ID AS CARD_SUBTEXT,
         C.EMBED_DATA AS CARD_TEXT,
         C.EMBED_VECTOR,
         C.AI_DOCUMENT_PK,
         C.EMBED_ID,
         VECTOR_EMBEDDING(ALL_MINILM_L6 USING :P1_SEARCH AS DATA) AS QUESTION
  FROM   AI_DOCUMENT_CHUNK C
  JOIN   AI_DOCUMENT D
  ON     D.AI_DOCUMENT_PK = C.AI_DOCUMENT_PK)
 (SELECT DD.CARD_TITLE,
         DD.CARD_SUBTEXT,
         DD.CARD_TEXT
  FROM   DOC_DATA DD
  WHERE  :P1_METRIC = 'COSINE'
  ORDER  BY VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, COSINE)
  FETCH  FIRST 10 PARTITIONS BY DD.AI_DOCUMENT_PK, 3 ROW ONLY) 
UNION ALL
 (SELECT DD.CARD_TITLE,
         DD.CARD_SUBTEXT,
         DD.CARD_TEXT
  FROM   DOC_DATA DD
  WHERE  :P1_METRIC = 'DOT'
  ORDER  BY VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, DOT)
  FETCH  FIRST 10 PARTITIONS BY DD.AI_DOCUMENT_PK, 3 ROW ONLY) 
UNION ALL
 (SELECT DD.CARD_TITLE,
         DD.CARD_SUBTEXT,
         DD.CARD_TEXT
  FROM   DOC_DATA DD
  WHERE  :P1_METRIC = 'EUCLIDEAN'
  ORDER  BY VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, EUCLIDEAN)
  FETCH  FIRST 10 PARTITIONS BY DD.AI_DOCUMENT_PK, 3 ROW ONLY) 
UNION ALL
 (SELECT DD.CARD_TITLE,
         DD.CARD_SUBTEXT,
         DD.CARD_TEXT
  FROM   DOC_DATA DD
  WHERE  :P1_METRIC = 'EUCLIDEAN_SQUARED'
  ORDER  BY VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, EUCLIDEAN_SQUARED)
  FETCH  FIRST 10 PARTITIONS BY DD.AI_DOCUMENT_PK, 3 ROW ONLY) 
UNION ALL
 (SELECT DD.CARD_TITLE,
         DD.CARD_SUBTEXT,
         DD.CARD_TEXT
  FROM   DOC_DATA DD
  WHERE  :P1_METRIC = 'HAMMING'
  ORDER  BY VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, HAMMING)
  FETCH  FIRST 10 PARTITIONS BY DD.AI_DOCUMENT_PK, 3 ROW ONLY) 
UNION ALL
 (SELECT DD.CARD_TITLE,
         DD.CARD_SUBTEXT,
         DD.CARD_TEXT
  FROM   DOC_DATA DD
  WHERE  :P1_METRIC = 'MANHATTAN'
  ORDER  BY VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, MANHATTAN)
  FETCH  FIRST 10 PARTITIONS BY DD.AI_DOCUMENT_PK, 3 ROW ONLY) 
UNION ALL
 (SELECT DD.CARD_TITLE,
         DD.CARD_SUBTEXT,
         DD.CARD_TEXT
  FROM   DOC_DATA DD
  WHERE  :P1_METRIC = 'INNER_PRODUCT'
  ORDER  BY -1 * VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, DOT) DESC
  FETCH  FIRST 10 PARTITIONS BY DD.AI_DOCUMENT_PK, 3 ROW ONLY)

The first CTE retrieves my "vectored" document data and converts the search term in the :P1_SEARCH page item to the VECTOR datatype using the VECTOR_EMBEDDING function with the ONNX model I loaded earlier.

Each subquery within the UNION ALL is effectively doing the same thing just with a different metric applied depending upon which one was selected in the UI:

  • The embedded data from the documents uploaded and stored in AI_DOCUMENT_CHUNK.EMBED_DATA is compared to the now converted search text in the :P1_SEARCH page item through the VECTOR_DISTANCE function.

  • I order by this calculated “distance”, with the closest chunks first, while also partitioning by document, which makes this a multi-vector/document search implementation. It brings back the top 10 documents with up to 3 results per document.

  • I wanted to try out the different metric parameters for the VECTOR_DISTANCE function passing in the :P1_METRIC page item option selected.

  • I had to go with this rather verbose query because VECTOR_DISTANCE only accepts named constants for the metric; a CASE expression around the metric argument, like the following, returned error ORA-00909: invalid number of arguments:

<<snip>>
VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, CASE WHEN :P1_METRIC = 'COSINE' THEN COSINE END)
<</snip>>
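One possible alternative, which I have not benchmarked against the UNION ALL version, is to put the CASE around entire VECTOR_DISTANCE calls rather than around the metric argument, so each branch still passes a named constant. A hedged sketch, assuming the same DOC_DATA CTE as above:

```sql
SELECT DD.CARD_TITLE, DD.CARD_SUBTEXT, DD.CARD_TEXT
FROM   DOC_DATA DD
ORDER  BY CASE :P1_METRIC
            WHEN 'COSINE'    THEN VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, COSINE)
            WHEN 'DOT'       THEN VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, DOT)
            WHEN 'EUCLIDEAN' THEN VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, EUCLIDEAN)
            WHEN 'MANHATTAN' THEN VECTOR_DISTANCE(DD.EMBED_VECTOR, DD.QUESTION, MANHATTAN)
          END
FETCH  FIRST 10 PARTITIONS BY DD.AI_DOCUMENT_PK, 3 ROWS ONLY;
```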

If the screenshot of the application isn’t too small to read, you can see the results it found most semantically similar, but it didn’t quite bring back the sections I would have expected (regardless of which metric option I selected). For the built-in APEX substitution strings, I was really hoping it would find this particular help document somewhere in the Top-K hits.

I think the issue might be related to the parameters I used to divide the documents into chunks, or even the embedding model used, which prevented the full context of this document from being recognised and scored correctly. Had the result set seen in the screenshot earlier been sent to an LLM as context for a RAG-based prompt, it wouldn't have given me the answer I was looking for (assuming it hadn't been trained on this documentation already).

I wondered if creating a vector index (IVF type) might help improve the results but sadly, and perhaps unsurprisingly, all it did was bring back the same data, just quicker.

CREATE VECTOR INDEX AI_DOC_CHUNK_IVF1 
ON AI_DOCUMENT_CHUNK(EMBED_VECTOR) 
ORGANIZATION NEIGHBOR PARTITIONS DISTANCE COSINE
WITH TARGET ACCURACY 95 PARAMETERS(TYPE IVF, NEIGHBOR PARTITIONS 50);

-- NOTE: In the classic report query, the FETCH for the union all WHERE :P1_METRIC = 'COSINE'
-- was changed to use the APPROX keyword so the index would be used (the index
-- was created using the COSINE metric so would only have an influence on this search metric)
FETCH  APPROX FIRST 10 PARTITIONS BY DD.AI_DOCUMENT_PK, 3 ROW ONLY
💡
By setting the debug level to full trace in APEX I could see the explain plan that proved the index was actively used by the query. Also, I couldn’t create an INMEMORY HNSW index as there wasn’t sufficient space available for it.
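Outside of APEX debug, the same confirmation can be done with a plain EXPLAIN PLAN. This is a generic sketch; the exact plan operations shown for an IVF index vary by version:

```sql
EXPLAIN PLAN FOR
SELECT EMBED_DATA
FROM   AI_DOCUMENT_CHUNK
ORDER  BY VECTOR_DISTANCE(EMBED_VECTOR,
                          VECTOR_EMBEDDING(ALL_MINILM_L6 USING 'substitution strings' AS DATA),
                          COSINE)
FETCH  APPROX FIRST 10 ROWS ONLY;

-- Look for operations referencing AI_DOC_CHUNK_IVF1 (or its internal VECTOR$ tables)
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```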

In conclusion, even though the results weren't exactly what I hoped for, it was still a valuable exercise to become familiar with the new VECTOR datatype and learn how to use one of the most interesting features of Oracle 23ai. As I gain more experience and understanding, I hope to revisit this and fine-tune it to achieve more accurate results.


Written by

Phil Grinham

I am a Certified Senior Oracle Developer/Data Architect with a passion for all things data and an advocate for the Oracle APEX low code platform that I have seen mature to become a fully-fledged enterprise level solution. I have several decades of experience in multiple industries such as Finance, Energy and the Public Sector and hope that I can impart some of that experience in my blog posts.