Understanding Chunking, Embeddings, and Reconstruction in Data Processing


Introduction to Data Processing Techniques

If you've ever worked with large structured datasets, you know the struggle of indexing them efficiently while still retrieving meaningful, context-aware results. When it comes to unstructured data, like PDFs or text documents, the solution is relatively straightforward. Tools like LangChain’s RecursiveCharacterTextSplitter offer out-of-the-box chunking that maintains overlapping context across chunks. This works well because natural language follows a sequential flow, and breaking it down intelligently preserves meaning.
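
For reference, here is roughly what that looks like with LangChain's JavaScript splitter. This is a minimal sketch; the exact import path varies between LangChain versions, and longDocumentText is just a placeholder for your raw text.

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // maximum characters per chunk
  chunkOverlap: 200  // overlap carries context across chunk boundaries
});

const chunks = await splitter.splitText(longDocumentText);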

But what about structured tabular data? I quickly discovered that the same principles don’t apply. Unlike text, tables lack a natural order or flow, and splitting them arbitrarily can destroy the relationships between rows and columns. The usual method of dumping everything into a vector database at once didn’t work either—it led to context loss, inefficient querying, and an inability to retrieve relevant information correctly.

Initially, I tried several approaches, but they all fell short:

  • Byte-wise chunking for entire rows failed for large databases because individual rows could exceed chunk limits.

  • Naïve row-based chunking destroyed relationships between columns, leading to incomplete queries.

  • Storing full tables as single embeddings made retrieval useless, as we couldn’t fetch granular data efficiently.

I needed a way to break down large tables into smaller pieces while keeping context intact—similar to how RecursiveCharacterTextSplitter works for text, but tailored for tabular data. That’s when I developed an approach that chunks structured data by:

  • Identifying primary keys to maintain row context.

  • Splitting attributes while preserving their relationships.

  • Ensuring chunks remain queryable while staying under size constraints.

In this blog, I’ll walk you through how I solved this problem, step by step, with real-world examples of how structured data can be stored, chunked, and retrieved efficiently—while maintaining context in a vector database.

Let’s get started!


Chunking Data for Efficient Processing

Chunking is the process of breaking large data down into smaller, structured units so it can be processed efficiently and stay within memory and size limits.

Sample Dataset

Imagine we have the following user data:

[
  { "id": 1, "name": "Alice", "age": 25, "email": "alice@example.com" },
  { "id": 2, "name": "Bob", "age": 30, "email": "bob@example.com" },
  { "id": 3, "name": "Charlie", "age": 35, "email": "charlie@example.com" }
]

  • id is the primary key.

  • name, age, and email are attributes.

Step 1: Identifying Primary Keys

Before chunking, we need to determine the primary key (PK).

const commonPkNames = ['id', 'ID', '_id', 'uuid', 'key'];

→ We check if any column names match common primary key names.


const columnCounts = {};
const uniqueValueColumns = [];

→ We track unique values for each column.


Loop Through Each Column

columnCounts = {
  "id":    new Set([1, 2, 3]),                 // all values unique ✅
  "name":  new Set(["Alice", "Bob", "Charlie"]),
  "age":   new Set([25, 30, 35]),
  "email": new Set(["alice@example.com", "bob@example.com", "charlie@example.com"])
};

→ Since id is unique and matches common primary key names, we select it as the primary key.
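
Putting those pieces together, the detection logic looks roughly like this. It is a simplified sketch of the idea rather than the exact production code, and findPrimaryKey is an illustrative name.

const commonPkNames = ['id', 'ID', '_id', 'uuid', 'key'];

function findPrimaryKey(rows) {
  const columnCounts = {};

  // Collect the set of distinct values seen in each column
  for (const row of rows) {
    for (const [column, value] of Object.entries(row)) {
      columnCounts[column] = columnCounts[column] || new Set();
      columnCounts[column].add(value);
    }
  }

  // A column qualifies if every one of its values is unique across rows
  const uniqueValueColumns = Object.keys(columnCounts)
    .filter(column => columnCounts[column].size === rows.length);

  // Prefer a unique column whose name matches a common PK name
  return uniqueValueColumns.find(column => commonPkNames.includes(column))
    || uniqueValueColumns[0]   // otherwise fall back to any unique column
    || null;                   // no candidate at all → use fallback chunking (Step 4)
}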


Step 2: Splitting Data Into Chunks

Now, we split the dataset into smaller chunks, keeping each chunk under 4000 bytes.

Each row is transformed into a PK-attribute format:

[
  { "pk": { "id": 1 }, "attribute": { "name": "Alice" } },
  { "pk": { "id": 1 }, "attribute": { "age": 25 } },
  { "pk": { "id": 1 }, "attribute": { "email": "alice@example.com" } },
  { "pk": { "id": 2 }, "attribute": { "name": "Bob" } }
]

Each attribute is stored separately while preserving the primary key.
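
A minimal sketch of that transformation, assuming the primary key column found in Step 1 is passed in as pkColumn:

function toPkAttributeEntries(rows, pkColumn) {
  const entries = [];
  for (const row of rows) {
    const pk = { [pkColumn]: row[pkColumn] };
    for (const [column, value] of Object.entries(row)) {
      if (column === pkColumn) continue; // the PK rides along on every entry instead of being its own attribute
      entries.push({ pk, attribute: { [column]: value } });
    }
  }
  return entries;
}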


Step 3: Handling Large Data Entries

// Inside the loop that packs entries into chunks (see the sketch below):
if (entrySize > CHUNK_SIZE) {
  console.warn(`Oversized entry (${entrySize} bytes) detected`);
  chunks.push([entry]); // Store the oversized entry as its own chunk
  continue;             // Skip to the next entry
}

→ If an entry is too large, it is stored separately.
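
For context, this check sits inside a loop that packs entries into chunks until the next entry would push the current chunk past the 4000-byte limit. The sketch below shows one way that loop can look; measuring entry size with Buffer.byteLength is an assumption on my part.

const CHUNK_SIZE = 4000; // bytes

function packIntoChunks(entries) {
  const chunks = [];
  let current = [];
  let currentSize = 0;

  for (const entry of entries) {
    const entrySize = Buffer.byteLength(JSON.stringify(entry), 'utf8');

    if (entrySize > CHUNK_SIZE) {
      console.warn(`Oversized entry (${entrySize} bytes) detected`);
      chunks.push([entry]); // store it as its own chunk
      continue;
    }

    if (currentSize + entrySize > CHUNK_SIZE) {
      chunks.push(current); // current chunk is full → start a new one
      current = [];
      currentSize = 0;
    }

    current.push(entry);
    currentSize += entrySize;
  }

  if (current.length > 0) chunks.push(current);
  return chunks;
}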


Step 4: Fallback Chunking (If No PK Found)

If no primary key is found, we use row-based chunking:

[
  [{ "id": 1, "name": "Alice", "age": 25, "email": "alice@example.com" }],
  [{ "id": 2, "name": "Bob", "age": 30, "email": "bob@example.com" }]
]

→ This groups full rows into chunks.
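
If the packing helper is written generically (it only measures JSON byte size), the fallback can reuse the same packIntoChunks sketch from Step 3, just fed whole rows instead of PK-attribute entries:

const chunks = packIntoChunks(rows); // each chunk is an array of complete rows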


Understanding Query Embeddings & Reconstruction

Now, let's look at how we process embeddings and reconstruct data.


Step 1: Generating the Embedding

We convert a query into a vector representation.

const questionEmbedding = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: message
});

→ This transforms the query text into a numerical vector (an embedding).

Example

message = "Show me all users above 30";

The model outputs a vector representation (for text-embedding-ada-002, a list of 1,536 floating-point numbers):

[0.123, -0.456, 0.789, ...]

Step 2: Querying the Vector Database (Pinecone)

const queryResult = await index.query({
  vector: questionEmbedding.data[0].embedding,
  filter: { connectionId: String(connectionId) },
  topK: 15,
  includeMetadata: true
});

→ This retrieves the 15 most similar stored embeddings, restricted to the current connection via the connectionId filter.


Step 3: Processing Retrieved Results

We extract relevant matches:

{
  "matches": [
    {
      "id": "123",
      "score": 0.98,
      "metadata": {
        "type": "data",
        "data": "{ \"id\": 2, \"name\": \"Bob\", \"age\": 30, \"email\": \"bob@example.com\" }"
      }
    }
  ]
}

A higher score means a better match.
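
Before reconstruction, the stringified metadata has to be parsed back into objects. A small sketch, assuming each match stores its payload as a JSON string under metadata.data:

const retrievedData = queryResult.matches
  .filter(match => match.metadata && match.metadata.type === "data")
  .map(match => JSON.parse(match.metadata.data)); // JSON strings back into rows or PK-attribute entries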


Step 4: Reconstructing Data

We merge primary keys and attributes back into rows.

// For each retrieved chunk entry, merge its attribute into the row identified by its primary key
chunk.entries.forEach(entry => {
  const pkKey = JSON.stringify(entry.pk); // e.g. '{"id":2}'
  rowsByTable[tableName][pkKey] = rowsByTable[tableName][pkKey] || { ...entry.pk };
  Object.assign(rowsByTable[tableName][pkKey], entry.attribute);
});

Restores full rows from PK-attribute format.


Before Merging

[
  { "pk": { "id": 2 }, "attribute": { "name": "Bob" } },
  { "pk": { "id": 2 }, "attribute": { "age": 30 } },
  { "pk": { "id": 2 }, "attribute": { "email": "bob@example.com" } }
]

After Merging

[
  { "id": 2, "name": "Bob", "age": 30, "email": "bob@example.com" }
]

The original row is reconstructed successfully.
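
As a self-contained illustration of that merge, using the entries above (reconstructRows is an illustrative helper, not the exact production function):

function reconstructRows(entries) {
  const rowsByPk = {};
  for (const entry of entries) {
    const pkKey = JSON.stringify(entry.pk);
    rowsByPk[pkKey] = rowsByPk[pkKey] || { ...entry.pk }; // start the row from its primary key
    Object.assign(rowsByPk[pkKey], entry.attribute);      // fold each attribute back in
  }
  return Object.values(rowsByPk);
}

const entries = [
  { pk: { id: 2 }, attribute: { name: "Bob" } },
  { pk: { id: 2 }, attribute: { age: 30 } },
  { pk: { id: 2 }, attribute: { email: "bob@example.com" } }
];

console.log(reconstructRows(entries));
// → [ { id: 2, name: "Bob", age: 30, email: "bob@example.com" } ]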


Step 5: Returning Final Data

The final result:

{
  "schema": [{ "column": "id", "type": "integer" }, { "column": "name", "type": "string" }],
  "sampleData": [
    {
      "tableName": "users",
      "sampleData": [
        { "id": 2, "name": "Bob", "age": 30, "email": "bob@example.com" }
      ]
    }
  ]
}

Schema and sample data are returned in a structured format.


Conclusion

Effectively handling both structured and unstructured data is crucial for efficient querying and retrieval. By employing techniques like chunking, embedding, and reconstruction, we can maintain context and relationships within datasets, ensuring that information remains accessible and meaningful. Whether dealing with natural language or tabular data, understanding these processes allows us to optimize data storage and retrieval, ultimately enhancing performance and accuracy. As data continues to grow in complexity and volume, mastering these techniques will be essential for anyone working in data-driven fields.
