# OpenAI API terminology

Here’s an extensive list of OpenAI API terminology commonly used when interacting with models like `gpt-4` via the Chat Completions or Completions endpoints:

## Core Terminology

### Roles

- `system`: Sets the tone, behavior, and context of the assistant. Acts as the overarching "instruction manual."
- `user`: Represents input from the user, typically a query or instruction.
- `assistant`: Represents the model’s response in the conversation.
- `function`: A role used when defining functions for advanced interaction via plugins or programmatic APIs.

### Endpoints

- **Chat Completions** (`/v1/chat/completions`): Designed for conversation-based interactions using roles like `system`, `user`, and `assistant`.
- **Completions** (`/v1/completions`): Simpler endpoint designed for single-turn completions without roles. Uses plain prompts and completions.
- **Fine-Tunes** (`/v1/fine-tunes`): For creating fine-tuned models tailored to your dataset.

### API Parameters

- `model`: Specifies the model you want to use. Examples: `"gpt-4"`, `"gpt-4-32k"`, `"text-davinci-003"`. `"gpt-4-32k"` has a larger context window (32,768 tokens).
- `messages`: A list of messages in a conversation. Each message has a `role` and `content`. Example:

```json
[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is Python?"},
  {"role": "assistant", "content": "Python is a programming language."}
]
```
- `max_tokens`: Maximum number of tokens the model may generate in the response. This caps the output only; the model’s context window limits input and output combined.
- `temperature`: Controls randomness in the model’s output. Range: `0.0` (deterministic) to `2.0` (high variability); most prompts stay between `0.0` and `1.0`.
- `top_p`: Alternative to `temperature` for controlling randomness. Samples from the smallest subset of tokens whose cumulative probability is `top_p`. Range: `0.0` to `1.0`.
- `frequency_penalty`: Penalizes repeated phrases or words. Range: `-2.0` to `2.0`.
- `presence_penalty`: Encourages or discourages the introduction of new topics. Range: `-2.0` to `2.0`.
- `stop`: A sequence of characters or strings where the model should stop generating text. Example: `["\n"]` stops generation at a new line.
- `logprobs`: Returns the log probabilities of tokens in the response.
- `stream`: Enables streaming responses, sending data incrementally as it’s generated.
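
Putting the main parameters together, here is a minimal request sketch using the legacy pre-1.0 `openai` Python package (the key and parameter values are placeholders, not recommendations):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; load your real key from a secure location

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=150,        # cap on generated (output) tokens
    temperature=0.7,       # moderate randomness
    top_p=1.0,             # no nucleus-sampling restriction
    frequency_penalty=0.0, # no repetition penalty
    presence_penalty=0.0,  # no new-topic bias
    stop=None,             # no custom stop sequences
)

print(response["choices"][0]["message"]["content"])
```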

### Tokens and Context

- `tokens`: The smallest unit of text the model processes (words, punctuation, or parts of words).
- `context window`: The total number of tokens (input + output) the model can handle in a single request. Examples: `gpt-4`: 8,192 tokens; `gpt-4-32k`: 32,768 tokens.
- `prompt`: The input text provided to the model. In Chat Completions, the prompt is the combined `messages`.
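
To count tokens before sending a request, OpenAI’s `tiktoken` library is the usual tool (assuming it is installed via `pip install tiktoken`):

```python
import tiktoken

# Get the tokenizer that matches the target model.
enc = tiktoken.encoding_for_model("gpt-4")

text = "Python is a programming language."
tokens = enc.encode(text)
print(len(tokens), "tokens:", tokens)
```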

### Fine-Tuning Specific

- `prompt`: The input that the model should respond to during training.
- `completion`: The desired output for the given `prompt`.
- `epoch`: A full pass through the training data during fine-tuning.
- `hyperparameters`: Configuration settings for the fine-tuning process, such as learning rate and batch size.
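
For the legacy `/v1/fine-tunes` endpoint, training data is a JSONL file of prompt/completion pairs, one JSON object per line. A hypothetical two-example file might look like this (the `###` separator and trailing stop marker are illustrative conventions, not requirements):

```jsonl
{"prompt": "What is Python?\n\n###\n\n", "completion": " Python is a programming language. END"}
{"prompt": "What is an epoch?\n\n###\n\n", "completion": " A full pass through the training data. END"}
```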

### Response Fields

- `id`: Unique identifier for the API request.
- `object`: Type of object returned (e.g., `chat.completion`).
- `created`: Timestamp for the response creation.
- `choices`: List of generated outputs. Each choice includes:
  - `message`: The content and role of the response.
  - `finish_reason`: Why the generation stopped (e.g., `"stop"`, `"length"`, `"content_filter"`).
  - `index`: The index of the choice (useful for multi-output requests).
- `usage`: Token usage stats:
  - `prompt_tokens`: Tokens in the input prompt.
  - `completion_tokens`: Tokens in the generated response.
  - `total_tokens`: Combined token count.
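
Assembled, a Chat Completions response body looks roughly like this (the `id`, timestamp, and token counts are made-up illustrative values):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Python is a programming language."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28}
}
```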

### Content Moderation

- `moderation`: A feature to check input or output for policy violations. Uses the `/v1/moderations` endpoint. Flags content for categories like violence, hate speech, self-harm, etc.
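
A minimal moderation check with the legacy SDK might look like this (the input string is a placeholder):

```python
import openai

# Send text to the moderation endpoint and inspect the first result.
result = openai.Moderation.create(input="Some user-submitted text to check.")

print("Flagged:", result["results"][0]["flagged"])
print("Category scores:", result["results"][0]["category_scores"])
```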

### Function Calling (Advanced)

- `functions`: List of callable functions that the assistant can invoke during interaction. Example:

```json
[
  {
    "name": "get_weather",
    "description": "Fetch weather details for a location.",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string", "description": "City or location."}
      },
      "required": ["location"]
    }
  }
]
```

- `function_call`: The assistant's response, specifying the function to invoke and its arguments. Example:

```json
{"name": "get_weather", "arguments": "{\"location\": \"San Francisco\"}"}
```

Note that the API returns `arguments` as a JSON-encoded string, which you parse yourself before calling the function.

### Miscellaneous

- `rate limits`: Usage caps on API calls (varies by plan).
- `error codes`: Common API error codes:
  - `401`: Unauthorized (invalid API key).
  - `429`: Rate limit exceeded.
  - `500`: Internal server error.
- `streaming`: Incremental response generation for real-time applications.

## Context and Retention

### Context Window

- **Definition**: The maximum number of tokens (including both input and output) that the model can handle in a single request.
- **Implications**:
  - The larger the context window, the more conversation history or input details you can provide.
  - If the input exceeds the context window, the oldest parts of the input are truncated.
- **Context Window Sizes**:
  - `gpt-4`: 8,192 tokens
  - `gpt-4-32k`: 32,768 tokens
  - `gpt-3.5`: 4,096 tokens

### Context Retention

- **Definition**: How well the model "remembers" the ongoing conversation within the context window.
- **Details**:
  - The model does not have long-term memory; it "remembers" only what is included in the current `messages` list.
  - For longer conversations, you may need to summarize earlier interactions to conserve tokens.

### Summarization for Context Retention

- **Definition**: The process of summarizing older parts of a conversation to save space while maintaining key details within the context window.
- **Example**:

```python
import openai

def summarize_conversation(messages, model="gpt-3.5-turbo"):
    """Condense earlier messages into a short summary to free up tokens."""
    summary_prompt = (
        "Summarize the following conversation briefly:\n\n"
        + "\n".join(msg["content"] for msg in messages)
    )
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": summary_prompt}],
        max_tokens=200,
    )
    return response["choices"][0]["message"]["content"]
```

### Conversation History

- **Definition**: A chronological collection of `messages` (roles like `user`, `system`, `assistant`) passed to the API to simulate "memory."
- **Best Practice**: Maintain an external log of conversation history and include only relevant portions in the `messages` list to manage context size.

### Token Management

- **Definition**: Managing the input and output tokens to optimize for the context window limit.
- **Tips**:
  - Preprocess inputs to remove irrelevant information.
  - Use summarization to condense earlier interactions, or trim the oldest messages to a token budget (see the sketch below).
  - Monitor token usage using `response["usage"]`.
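
One way to implement the trimming tip: keep the `system` message plus only the newest messages that fit a token budget. A minimal sketch, assuming `tiktoken` is installed and counting only message content (real requests add a few tokens of per-message overhead):

```python
import tiktoken

def trim_history(messages, model="gpt-4", budget=6000):
    """Keep the system message plus the newest messages that fit the budget."""
    enc = tiktoken.encoding_for_model(model)
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(len(enc.encode(m["content"])) for m in system)
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```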

## Behavior and Tone

### System Instruction

- **Definition**: A `system` message at the start of a conversation that establishes the assistant’s behavior, tone, or personality.
- **Example**:

```json
{"role": "system", "content": "You are a helpful assistant who answers questions concisely and always provides examples."}
```

### Message Role

- **Definition**: Assigns responsibility to a message (e.g., `system`, `user`, `assistant`, `function`).
- **Roles**:
  - `system`: Defines the assistant’s behavior.
  - `user`: Represents user queries or instructions.
  - `assistant`: Represents model responses.
  - `function`: Indicates a callable function in the conversation.

### Behavioral Persistence

- **Definition**: Ensuring the model consistently follows a specific tone or format throughout interactions.
- **Techniques**:
  - Use a strong `system` message.
  - Periodically inject reminders as `assistant` messages:

```json
{"role": "assistant", "content": "Remember, respond concisely and provide examples."}
```

## Advanced Features

### Token Overlap

- **Definition**: Reusing tokens from the previous response in the next prompt to simulate continuity.
- **Why It Matters**: Ensures smooth interaction when summarizing or truncating earlier messages.

### Function Calling

- **Definition**: Enables the assistant to call a function and include the output in responses.
- **Example**: A user request and the assistant’s reply:

```json
{"role": "user", "content": "What's the weather in Paris?"}
```

```json
{"role": "assistant", "function_call": {"name": "get_weather", "arguments": "{\"location\": \"Paris\"}"}}
```

### Streaming

- **Definition**: A feature that enables incremental responses from the model, useful for real-time applications.
- **How to Use**: Set `stream=True` in the API request.
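
With the legacy SDK, a streamed response is an iterator of chunks whose `delta` fields carry the new text. A minimal sketch:

```python
import openai

# Request a streamed response and print it piece-by-piece.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is Python?"}],
    stream=True,
)

for chunk in response:
    delta = chunk["choices"][0]["delta"]  # incremental piece of the message
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()
```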

## Optimization Concepts

### Temperature

- **Definition**: Controls randomness in output. Higher values produce more creative responses.
  - `0.0`: Deterministic (minimal randomness).
  - `1.0`: High variability.

### Top-p Sampling

- **Definition**: Selects tokens from the smallest subset whose cumulative probability exceeds `top_p`.
  - `1.0`: No restriction (all tokens considered).
  - `0.9`: Only the top 90% of probability mass is considered.

### Frequency and Presence Penalties

- **Frequency Penalty**: Reduces repetition of existing phrases in the response.
- **Presence Penalty**: Encourages introducing new topics.

## Errors and Troubleshooting

### Truncation

- **Definition**: Occurs when the input or output exceeds the context window. Older parts of the conversation are removed first.
- **Solution**: Summarize earlier parts of the conversation or reduce the verbosity of inputs.

### Rate Limits

- **Definition**: Maximum number of API requests allowed per minute or per token usage.
- **Example**: Free-tier limit: 20 requests per minute. When you hit a `429`, retry with exponential backoff, as sketched below.
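
A minimal retry-with-backoff sketch using the legacy SDK’s `RateLimitError`:

```python
import time
import openai

def create_with_backoff(max_retries=5, **kwargs):
    """Call ChatCompletion.create, retrying on 429s with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(**kwargs)
        except openai.error.RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("Rate limit retries exhausted")
```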

### Token Overrun

- **Definition**: Exceeding the model’s token limit for a single request (sum of prompt tokens and requested `max_tokens`).
- **Solution**: Reduce the size of `messages` or lower `max_tokens`.

## Miscellaneous

### Moderation

- **Definition**: Checking inputs or outputs for violations of safety or ethical policies using the `/v1/moderations` endpoint.

### Embeddings

- **Definition**: Vector representations of text used for semantic similarity tasks.
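
For example, with the legacy SDK and the `text-embedding-ada-002` model (a common choice, not the only one), you can embed two sentences and compare them with cosine similarity:

```python
import openai

# Embed two sentences in one request.
resp = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=["Python is a programming language.", "I love coding in Python."],
)
a, b = (item["embedding"] for item in resp["data"])

# Cosine similarity between the two vectors.
dot = sum(x * y for x, y in zip(a, b))
norm = lambda v: sum(x * x for x in v) ** 0.5
print("cosine similarity:", dot / (norm(a) * norm(b)))
```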

## Summary of Key Terms

| Category | Term | Definition |
| --- | --- | --- |
| Context | Context Window | Total tokens the model can process at once. |
| Context | Retention | How the model "remembers" within the context window. |
| Context | Summarization | Reducing earlier content to fit within the context window. |
| Behavior | System Prompt | Instruction for model tone/behavior. |
| Behavior | Role | Defines responsibility for each message (system, user, assistant, function). |
| Optimization | Temperature | Controls randomness in responses. |
| Optimization | Top-p Sampling | Limits token choices to the most probable subset. |
| Advanced Features | Streaming | Enables real-time, incremental responses. |
| Advanced Features | Function Calling | Allows the assistant to invoke specific functions for user queries. |