LLM Prompting & Testing

Amit Sangwan

If you're looking to master prompt engineering, understand AI model behavior, and build testable, production-grade prompts, this is your field guide. It walks through everything from basic prompt anatomy to evaluations (Evals) and parameter tuning.


What is Prompt Engineering?

Prompt engineering is the process of crafting instructions to guide large language models (LLMs) in generating reliable, relevant, and safe responses.

A "prompt" is any input or instruction you send to the model.


LLM Request Structure – API Anatomy

Example Request (OpenAI SDK - Python):

from openai import OpenAI
client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    input="Write a one-sentence bedtime story about a unicorn."
)

print(response.output_text)

Key Parameters:

| Parameter | Description |
| --- | --- |
| model | The model version (e.g., gpt-4.1, gpt-3.5-turbo) |
| input | The user's prompt (text instruction) |
| instructions | High-level system-style instruction (e.g., "Talk like a pirate.") |
| temperature | Controls randomness in outputs |
| top_p | Controls the diversity of token selection (nucleus sampling) |
| max_tokens | (Optional) Limits response length |
| stop | (Optional) Specifies stop sequences |
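
Here's a minimal sketch that sets these knobs explicitly. It uses the Chat Completions endpoint, where the max_tokens and stop parameter names apply; the prompt and values are illustrative, not a recommendation:

from openai import OpenAI

client = OpenAI()

# Same kind of request, with the sampling and length knobs set explicitly.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Write a one-sentence bedtime story about a unicorn."}],
    temperature=0.2,   # low randomness for stable output
    top_p=0.9,         # nucleus sampling threshold
    max_tokens=60,     # cap the response length
    stop=["\n\n"],     # stop at the first blank line
)

print(response.choices[0].message.content)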

Message Roles in Chat Models

| Role | Description |
| --- | --- |
| system | Sets behavior or tone, e.g., "You are a helpful assistant." |
| user | The end-user's prompt |
| assistant | The model's response |
| developer | Special role that overrides and guides all other messages (OpenAI API only) |

These are not prompt types, but how the API structures the interaction.
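
Here's a minimal sketch of how the roles combine in a single Chat Completions request; the conversation content is invented for illustration:

from openai import OpenAI

client = OpenAI()

# A multi-turn request: the system message shapes behavior, and the prior
# assistant turn gives the model conversational context to build on.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is nucleus sampling?"},
        {"role": "assistant", "content": "Nucleus (top-p) sampling limits token choices to the smallest set whose probabilities sum to p."},
        {"role": "user", "content": "Give me a one-line summary."},
    ],
)

print(response.choices[0].message.content)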


Prompt Types (Techniques)

| Prompt Type | Example |
| --- | --- |
| Zero-shot | "Translate to French: Where is the pharmacy?" |
| One-shot | Provide one example before the task |
| Few-shot | Provide multiple examples |
| Chain-of-Thought (CoT) | Ask the model to reason step by step |
| Instruction-style | "Summarize the following in one sentence." |
| Role-based | "You are a financial advisor. Recommend a savings plan." |
| Contextual prompt | Include background context in the prompt |
| Reflexion | Let the model review and revise its own output |
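
As a concrete illustration, here's a minimal few-shot sketch: two invented example pairs teach the model the output format before the real input arrives. The sentiment task and reviews are made up:

from openai import OpenAI

client = OpenAI()

# Few-shot prompting: show the model worked examples as prior turns,
# then send the real input in the final user message.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
        {"role": "user", "content": "Review: The battery lasts all day."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "Review: The screen cracked within a week."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Review: Setup was painless and the camera is superb."},
    ],
    temperature=0.0,  # classification should be deterministic
)

print(response.choices[0].message.content)  # expected: "positive"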

Prompt Writing Best Practices

Do:

  • Be explicit about your expectations.

  • Use examples (few-shot) when accuracy matters.

  • Use system role for behavior shaping.

  • Break tasks into steps (CoT) for logic problems; a minimal sketch follows this list.

Avoid:

  • Vague instructions: "Explain this?"

  • Overloading the prompt with unrelated info.

  • Relying on temperature alone to fix response quality.
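
Here's a minimal chain-of-thought sketch; the word problem is invented for illustration:

from openai import OpenAI

client = OpenAI()

# Chain-of-thought: explicitly ask for step-by-step reasoning before the answer.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": (
                "A train leaves at 9:40 and the trip takes 2 hours 35 minutes. "
                "Think step by step, then state the arrival time on the last line."
            ),
        }
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)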


Temperature vs Top_p

| Parameter | Range | Use For | Higher Value Means... |
| --- | --- | --- | --- |
| temperature | 0.0 – 2.0 | Controls randomness | More creative / less deterministic |
| top_p | 0.0 – 1.0 | Probability sampling threshold | Broader token pool = more diversity |

For stable, accurate outputs, set temperature=0.2 and top_p=0.9 as a starting point.
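
A small sketch to see the difference yourself; the prompt and values are illustrative. Rerun it a few times and the high-temperature output should vary more:

from openai import OpenAI

client = OpenAI()

# Compare a conservative and a creative setting on the same prompt.
for temp in (0.2, 1.2):
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": "Name a fictional planet."}],
        temperature=temp,
        top_p=0.9,
    )
    print(f"temperature={temp}: {response.choices[0].message.content}")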


Summary Cheat Sheet

| Concept | Meaning |
| --- | --- |
| system/user/assistant | API message structure |
| Prompt types | How you format task instructions |
| Reflexion | Self-evaluation by the model |
| Few-shot prompting | Feed examples to improve accuracy |
| instructions param | Overrides normal prompts in some APIs |
| Evaluation (Evals) | Tools to benchmark and test prompts |
| Pin model snapshots | Ensure version consistency in production |

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that combines two powerful components:

  1. Retrieval system: This searches a large external database or document collection to find relevant information based on a user's question.

  2. Generative LLM (Large Language Model): This uses the retrieved information to generate a precise, context-aware response.

How RAG uses prompts

  • Normally, when you prompt an LLM, you send it a question or instruction, and it tries to answer based only on what it "knows" internally.

  • In RAG, before generating an answer, the system retrieves relevant documents related to your question.

  • These documents are then added into the prompt sent to the LLM.

Example:

User question:
“What is the company policy on remote work?”

  1. The retrieval system searches your company’s internal documents and finds a policy document about remote work.

  2. The prompt to the LLM becomes something like:

    Based on the following document excerpts, answer the question.

    Document excerpts:
    • "Our company allows remote work up to 3 days a week."
    • "Remote work requests should be approved by the team lead."

    Question: What is the company policy on remote work?
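
Here's a minimal end-to-end sketch of that flow. The POLICY_DOCS store and the keyword-overlap retriever are stand-ins for a real vector database and embedding search:

from openai import OpenAI

# Hypothetical in-memory "document store"; a real system would use a vector database.
POLICY_DOCS = [
    "Our company allows remote work up to 3 days a week.",
    "Remote work requests should be approved by the team lead.",
    "Office parking passes are issued by the facilities team.",
]

def retrieve(question, docs, k=2):
    # Naive keyword-overlap scoring, standing in for embedding search.
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def answer(question):
    excerpts = retrieve(question, POLICY_DOCS)
    # Stuff the retrieved excerpts into the prompt, exactly as in the example above.
    prompt = (
        "Based on the following document excerpts, answer the question.\n\n"
        "Document excerpts:\n"
        + "\n".join(f'• "{e}"' for e in excerpts)
        + f"\n\nQuestion: {question}"
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content

print(answer("What is the company policy on remote work?"))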

What's Next: Model Evaluation

Prompting is just the start. In a production-grade system, you’ll also:

  • Use Promptfoo or OpenAI Evals to measure:

    • Consistency

    • Accuracy

    • Regression (across model versions)

  • Save prompt snapshots and logs

  • Write test cases in YAML/JSON for automated evaluation (a minimal Python harness is sketched below)
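
Since this guide's examples use Python, here's a minimal hand-rolled harness in that spirit. The TEST_CASES dicts mirror what you would normally keep in a YAML/JSON file, and the cases themselves are invented:

import json
from openai import OpenAI

client = OpenAI()

# Hypothetical test cases; in practice you'd load these from a YAML/JSON file.
TEST_CASES = [
    {"prompt": "Translate to French: Where is the pharmacy?", "must_contain": "pharmacie"},
    {"prompt": "What is 2 + 2? Reply with the number only.", "must_contain": "4"},
]

def run_case(case):
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": case["prompt"]}],
        temperature=0.0,  # as deterministic as possible, for reproducible tests
    )
    output = response.choices[0].message.content or ""
    return case["must_contain"].lower() in output.lower()

results = {case["prompt"]: run_case(case) for case in TEST_CASES}
print(json.dumps(results, indent=2, ensure_ascii=False))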


Tools to explore:

| Tool | Purpose |
| --- | --- |
| Promptfoo | Prompt benchmarking CLI tool |
| DeepEval | Evaluation framework for LLM outputs |
| Jupyter Notebook | Interactive Python scripting/testing |
| LangChain / Evals SDK | Pipeline and testing framework |