Prompt Engineering and Token Management

Generative AI models have changed the way we interact with technology, but getting the most out of them means knowing how to communicate with them effectively. This post covers the essentials of prompt engineering, token management, and model selection in simple terms, with practical examples.
Prompt Engineering: The Art of Talking to AI
Prompt engineering is the practice of crafting clear and effective instructions to guide AI models toward producing the kind of output you want. Here are the core techniques:
1. Zero-shot Prompting
This is the most basic approach. You give the model a task without any examples; it relies entirely on its training.
Example:
Prompt: Translate "Hello, how are you?" to French.
Output: Bonjour, comment ça va ?
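If you are calling a model through an API, a zero-shot prompt is simply the task stated directly, with no examples attached. Here is a minimal sketch using the OpenAI Python SDK; the model name is an assumption, and the client reads an OPENAI_API_KEY environment variable:

```python
from openai import OpenAI

client = OpenAI()  # picks up the OPENAI_API_KEY environment variable

# Zero-shot: the task is stated directly, with no examples.
response = client.chat.completions.create(
    model="gpt-4",  # assumed model name; substitute whatever you use
    messages=[{"role": "user", "content": 'Translate "Hello, how are you?" to French.'}],
)
print(response.choices[0].message.content)  # e.g. Bonjour, comment ça va ?
```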
2. Few-shot Prompting
Here, you provide a few examples in the prompt to help the model recognize the pattern.
Example:
Prompt:
Apple → Pomme
Banana → Banane
Grape → ?
Output: Raisin
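In code, few-shot prompting just means packing the examples into the prompt ahead of the new input. A minimal sketch, reusing the fruit pairs above (model name is an assumption):

```python
from openai import OpenAI

client = OpenAI()

# Few-shot: demonstrate the pattern, then leave the last item blank.
few_shot_prompt = (
    "Translate each fruit name to French.\n"
    "Apple → Pomme\n"
    "Banana → Banane\n"
    "Grape → "
)
response = client.chat.completions.create(
    model="gpt-4",  # assumed model name
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)  # e.g. Raisin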
3. Chain-of-Thought Prompting
This technique asks the model to explain its reasoning step by step. It’s especially helpful for math, logic, or multi-step problems.
Example:
Prompt: If a train travels 300 miles in 5 hours, what is its speed? Show your work.
Output: To find the speed, divide the distance by the time: 300 miles ÷ 5 hours = 60 miles per hour.
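The only change in code is the step-by-step instruction appended to the task. A minimal sketch (model name is an assumption):

```python
from openai import OpenAI

client = OpenAI()

task = "If a train travels 300 miles in 5 hours, what is its speed?"
# Chain-of-thought cue: ask the model to show its intermediate steps.
response = client.chat.completions.create(
    model="gpt-4",  # assumed model name
    messages=[{"role": "user", "content": task + " Show your work step by step."}],
)
print(response.choices[0].message.content)
```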
4. Role-based Prompting
You assign a role to the AI, which helps tailor the response to a specific domain or tone.
Example:
Prompt: You are a historian. Explain the significance of the Renaissance in 3 sentences.
Output: The Renaissance marked a cultural rebirth in Europe, bridging the Middle Ages and modern history. It revived classical learning and spurred advancements in art, science, and philosophy. Figures like Leonardo da Vinci epitomized its spirit of innovation and humanism.
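With chat-style APIs, the cleanest way to assign a role is the system message rather than the user prompt. A minimal sketch (model name is an assumption):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name
    messages=[
        # The system message sets the persona for the whole conversation.
        {"role": "system", "content": "You are a historian."},
        {"role": "user", "content": "Explain the significance of the Renaissance in 3 sentences."},
    ],
)
print(response.choices[0].message.content)
```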
5. Iterative Prompting
This method involves refining your prompt based on the model’s earlier responses to get a more accurate or creative output.
Example:
First Prompt: Write a short story about a dragon.
Output: (A generic story)
Refined Prompt: Write a short story about a dragon who is afraid of fire and wants to become a chef.
Output: (A more unique and tailored story)
Token and Output Management
Generative models don’t see text the way humans do. Instead, they break input and output into tokens. These can be full words, parts of words, or punctuation. Managing tokens is key to controlling cost, output length, and coherence.
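You can inspect tokenization directly with OpenAI's tiktoken library, as in the sketch below (assuming tiktoken is installed; other providers ship their own tokenizers):

```python
import tiktoken

# Load the tokenizer used by a given model family.
enc = tiktoken.encoding_for_model("gpt-4")

text = "The sky is a brilliant azure, dotted with wispy clouds."
tokens = enc.encode(text)
print(len(tokens))         # number of tokens this text consumes
print(enc.decode(tokens))  # round-trips back to the original string
```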
Key Parameters
Temperature
Controls randomness in the output. Lower values make the model more predictable. Higher values introduce more creativity.
Example:
Temperature 0.2: The sky is blue and clear.
Temperature 0.8: The sky is a brilliant azure, dotted with wispy clouds.
Top_p (Nucleus Sampling)
Top_p controls how many of the most likely next-token choices are considered when the model generates text. Rather than sampling from every possible token, the model keeps the smallest set of top candidates whose probabilities add up to p, then picks one randomly from that set.
Think of top_p as:
"Only consider the most confident guesses, as long as they together cover 90% of the total probability."
How It Works (Step-by-Step):
Imagine the model is trying to pick the next word in this sentence:
"The sun is very ______"
The model assigns probabilities like:
| Word | Probability |
| ---- | ----------- |
| hot | 0.40 |
| bright | 0.25 |
| far | 0.15 |
| big | 0.10 |
| orange | 0.05 |
| old | 0.03 |
| dangerous | 0.02 |
If top_p = 0.9:
The model will start summing from highest to lowest:
hot (0.40) → cumulative: 0.40
bright (0.25) → 0.65
far (0.15) → 0.80
big (0.10) → 0.90
At this point, it stops.
It will now randomly pick from [hot, bright, far, big] only.
Remaining words are ignored.
If top_p = 0.5:
hot (0.40) → cumulative: 0.40
bright (0.25) → 0.65 → crosses the 0.5 threshold
Most implementations include the token that crosses the threshold, so the model samples from [hot, bright]; a stricter cut-off would leave only [hot]. Either way, the output is more focused and more repetitive.
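A tiny simulation of nucleus sampling, using the probabilities from the table above, makes the cut-off concrete. This is an illustrative sketch, not a production implementation:

```python
import random

# Next-word probabilities from the table above, highest first.
probs = {"hot": 0.40, "bright": 0.25, "far": 0.15, "big": 0.10,
         "orange": 0.05, "old": 0.03, "dangerous": 0.02}

def nucleus_sample(probs, top_p):
    """Keep the smallest set of top words whose cumulative probability
    reaches top_p, then sample from that set proportionally."""
    kept, cumulative = [], 0.0
    for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((word, p))
        cumulative += p
        if cumulative >= top_p:  # stop once the nucleus covers top_p
            break
    words, weights = zip(*kept)
    return random.choices(words, weights=weights)[0]

print(nucleus_sample(probs, 0.9))  # sampled from [hot, bright, far, big]
print(nucleus_sample(probs, 0.5))  # sampled from [hot, bright]
```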
Top_k
Restricts output to the top k most likely tokens at each step. Smaller values mean more control, but less variety.
Example:
top_k = 10: Chooses only from the 10 most probable words.
Max_tokens
Sets a hard limit on the length of the model’s response.
Example:
max_tokens = 50: The response will not exceed 50 tokens.
Stop Sequences
Tells the model when to stop generating output. Useful for stopping at specific patterns.
Example:
Stop Sequence = ###: The model halts output when it reaches ###.
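All of these knobs are ordinary request parameters. A minimal sketch with the OpenAI Python SDK (note that top_k is not exposed by OpenAI's chat API, though other providers such as Anthropic accept it; the model name is an assumption):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",    # assumed model name
    messages=[{"role": "user", "content": "Describe the sky in one sentence."}],
    temperature=0.8,  # higher = more varied word choice
    top_p=0.9,        # sample only from the 90% probability nucleus
    max_tokens=50,    # hard cap on the response length
    stop=["###"],     # generation halts if ### is produced
)
print(response.choices[0].message.content)
```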
Model Selection and API Integration
Choosing the right model depends on your specific use case. Different models are optimized for different needs.
Popular Models and Their Strengths
GPT-4
Strengths: Versatile, excellent reasoning, large context window
Best for: Creative writing, summarization, complex problem-solving
Claude (by Anthropic)
Strengths: Safe, good for dialogue
Best for: Conversational AI, customer service bots
LLaMA and Mistral
Strengths: Open-source and customizable
Best for: Research use, fine-tuning in enterprise or academic projects
Managing Cost and Usage
Cost Management
Track how many tokens your inputs and outputs consume. GPT-4, for example, is more expensive per token than GPT-3.5. Shorter, efficient prompts help reduce cost.
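A rough cost estimate only needs token counts and the provider's per-token prices. A sketch using tiktoken; the prices below are placeholders, not current rates, so check your provider's pricing page:

```python
import tiktoken

# Placeholder prices in dollars per 1,000 tokens -- NOT current rates.
PRICE_PER_1K_INPUT = 0.03
PRICE_PER_1K_OUTPUT = 0.06

enc = tiktoken.encoding_for_model("gpt-4")

def estimate_cost(prompt: str, expected_output_tokens: int) -> float:
    """Estimate the dollar cost of one request from token counts."""
    input_tokens = len(enc.encode(prompt))
    return (input_tokens * PRICE_PER_1K_INPUT
            + expected_output_tokens * PRICE_PER_1K_OUTPUT) / 1000

print(estimate_cost("Summarize the following article...", expected_output_tokens=200))
```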
Rate Limiting
API services often limit the number of requests you can send per minute. Exceeding this can delay or block responses. Implement retry strategies or queueing if needed.
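A common retry strategy is exponential backoff: wait briefly after the first rate-limit error, then progressively longer. A minimal sketch (the exception class shown is OpenAI's; adapt it to your SDK):

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def complete_with_retry(prompt: str, max_retries: int = 5) -> str:
    """Retry on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",  # assumed model name
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError("Rate limit still exceeded after retries")
```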
Secure API Calls
Always use environment variables to store API keys. Send requests over HTTPS to ensure your credentials and data are protected during communication.
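In practice, that means the key never appears in your source code. A minimal sketch:

```python
import os
from openai import OpenAI

# Read the key from the environment instead of hard-coding it.
# (OpenAI's SDK also does this automatically if you omit the argument.)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```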
By mastering prompt engineering, token management, and selecting the right model, you can unlock the full potential of generative AI. Whether you're building a chatbot, generating content, or solving complex problems, small improvements in how you design prompts can lead to major gains in performance and efficiency.