Making LLMs Efficient for Survey Cleaning: My Journey from Arrays to Choice Maps


Problem Statement
I was building an AI pipeline to clean survey responses. The data structure was like this:
Sample Question:
{
  "id": 3271,
  "text": "How satisfied are you with our service?",
  "choices": [
    { "id": 1, "label": "Very Satisfied" },
    { "id": 2, "label": "Neutral" },
    { "id": 3, "label": "Dissatisfied" }
  ]
}
Sample Response:
{
  "responseId": 1001,
  "responses": [
    { "questionId": 3271, "choiceId": "2" }
  ]
}
Simple, right? The user selected choiceId 2, meaning "Neutral".
Now, when sending batches of survey responses to an LLM for cleaning and fraud detection, I had one big question: how do I send questions and responses efficiently, without wasting tokens or slowing the model down?
My Thought Process
Initially, I thought: "Come on, just send the full questions array and the full responses array. Simple."
So I was packing:
- Full questions (with the choices array)
- Full responses (with choiceIds)
But slowly I realised...
Every batch was sending the same choices again and again. For every user response, the LLM had to read the question's choices, scan the array, and match the choiceId.
Even a small survey was eating 2k-3k tokens easily just for system context!
Then I thought:
"What if instead of sending same data again and again, I somehow make the choice lookup easier for the model?"
I explored three options.
Option 1: Keep Choices as Array (Default)
Each question has a choices: [{ id, label }] array. The response uses a choiceId. The LLM scans the array to find the match.
Pros: Tiny initial payload.
Cons:
- The model has to do an O(n) array scan.
- Slow reasoning.
- Wastes attention and tokens as the survey grows.
(Imagine scanning 10 choices manually every time... ugh.)
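To see the waste, here is a rough sketch of what a single batch payload looks like under Option 1. The field names mirror my samples above; the exact batch shape is illustrative:

// Option 1: the full choices array rides along in every single batch.
const batchPayload = {
  questions: [
    {
      id: 3271,
      text: "How satisfied are you with our service?",
      choices: [
        { id: 1, label: "Very Satisfied" },
        { id: 2, label: "Neutral" },
        { id: 3, label: "Dissatisfied" }
      ]
    }
    // ...every other question, choices and all, resent per batch
  ],
  responses: [
    { responseId: 1001, questionId: 3271, choiceId: "2" }
  ]
};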
Option 2: Expand Label Inside Every Response
Instead of sending the choiceId, I replace it with "Neutral", "Dissatisfied", etc. Responses become directly readable by the model.
Pros: Fast LLM understanding.
Cons:
- Response size doubles or triples.
- Huge token waste.
- Not good for 10k+ response batches.
(At small scale it's OK, but at big scale... RIP tokens! 🪦)
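For contrast, here is what Option 2's responses would look like. A sketch; the answer field name is my own illustrative choice:

// Option 2: labels are inlined into each response, so strings repeat endlessly.
const expandedResponses = [
  { responseId: 1001, questionId: 3271, answer: "Neutral" },
  { responseId: 1002, questionId: 3271, answer: "Dissatisfied" }
  // At 10k+ responses, these repeated label strings dwarf a one-time choice map.
];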
Option 3: Prebuilt Choice Map per Question
Build a map like:
{
  "3271": {
    "1": "Very Satisfied",
    "2": "Neutral",
    "3": "Dissatisfied"
  }
}
The response stays as a choiceId ("2").
The LLM just does an O(1) lookup in the map.
Pros:
- Small one-time cost.
- Fastest reasoning.
- Smallest token usage long term.
- Bulletproof at 100k or even 1M responses.
Cons:
- Slightly more work backend-side to generate the map.
(But hey, once it's done, it's clean and scalable!)
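Generating the map backend-side really is only a few lines. A minimal TypeScript sketch, assuming the question shape from my samples above (the name buildChoiceMap is mine, not from any library):

// Shapes matching the sample question JSON above.
interface Choice { id: number; label: string; }
interface Question { id: number; text: string; choices: Choice[]; }

// Build { questionId: { choiceId: label } } once, before any LLM call.
function buildChoiceMap(questions: Question[]): Record<string, Record<string, string>> {
  const map: Record<string, Record<string, string>> = {};
  for (const q of questions) {
    const labelsById: Record<string, string> = {};
    for (const c of q.choices) {
      labelsById[String(c.id)] = c.label;
    }
    map[String(q.id)] = labelsById;
  }
  return map;
}

With that in place, buildChoiceMap(questions)["3271"]["2"] resolves straight to "Neutral", with no scanning.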
Final Flow
Survey Questions (choices array)
↓
Preprocess into Choice Map (one time)
↓
Store Choice Map in System Context
↓
Send Responses with choiceId only
↓
LLM does O(1) lookup from Map
↓
Efficient fraud detection and response validation
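To make the flow concrete, here is a hedged sketch of how the prompt could be assembled, reusing the Question type and buildChoiceMap from the sketch above. The prompt wording and helper names are illustrative, not a fixed API:

// One-time: embed the choice map in the system context.
// `questions` is the same Question[] used in the sketch above.
const choiceMap = buildChoiceMap(questions);
const systemContext =
  "You are cleaning survey responses for fraud. " +
  "Resolve every choiceId through this map (questionId -> choiceId -> label):\n" +
  JSON.stringify(choiceMap);

// Per batch: responses stay lean; no choices travel with them.
type SlimResponse = { responseId: number; questionId: number; choiceId: string };

function buildUserMessage(batch: SlimResponse[]): string {
  return "Validate these responses and flag suspicious patterns:\n" + JSON.stringify(batch);
}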
So, what are the advantages?
- No duplicate choices in every batch.
- No ballooning response size.
- No array-scanning overhead for the LLM.
Key Benefits
| Approach | Token Usage | LLM Speed | Scale Readiness |
| --- | --- | --- | --- |
| Choices as Array | Medium | Medium | OK only for small surveys |
| Expanded Labels | High | Fast | Very costly at scale |
| Prebuilt Choice Map | Low | Fastest | Best for 100k+ responses |
💡 Final Thought
Sometimes, small design decisions, like whether to send a list vs. a map, matter A LOT when you want to scale cleanly.
I learned this by thinking deeply from the angle of:
- Token cost
- LLM cognitive load
- Real-world scaling to hundreds of thousands of survey responses
TL;DR
This idea is not only for surveys! It can be applied wherever structured choices are involved.
Some real examples:
- Auto-grading MCQ exams at scale (education apps).
- Screening candidate forms in HRTech startups.
- Cleaning healthcare intake forms efficiently.
- Processing ecommerce customer feedback forms cheaply.
- Analyzing product satisfaction surveys in SaaS platforms.
Main benefits of using Maps in AI pipelines:
✅ Save massive tokens.
✅ Make the LLM think faster.
✅ Scale to millions of records easily.
✅ Keep backend and API payloads clean and simple.
Thanks for reading!
If you're building AI pipelines like this, comment your thoughts and approaches.