# OpenAI API terminology

Here’s an extensive list of OpenAI API terminology commonly used when interacting with models like `gpt-4` via the Chat Completions or Completions endpoints:

## Core Terminology

### Roles

- `system`: Sets the tone, behavior, and context of the assistant. Acts as the overarching "instruction manual."
- `user`: Represents input from the user, typically a query or instruction.
- `assistant`: Represents the model’s response in the conversation.
- `function`: A role used when defining functions for advanced interaction via plugins or programmatic APIs.

### Endpoints

- **Chat Completions** (`/v1/chat/completions`): Designed for conversation-based interactions using roles like `system`, `user`, and `assistant`.
- **Completions** (`/v1/completions`): Simpler endpoint designed for single-turn completions without roles. Uses plain prompts and completions.
- **Fine-Tunes** (`/v1/fine-tunes`): For creating fine-tuned models tailored to your dataset.

### API Parameters

- `model`: Specifies the model you want to use. Examples: `"gpt-4"`, `"gpt-4-32k"`, `"text-davinci-003"`. `"gpt-4-32k"` has a larger context window (32,768 tokens).
- `messages`: A list of messages in a conversation. Each message has a `role` and `content`. Example:

```json
[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is Python?"},
  {"role": "assistant", "content": "Python is a programming language."}
]
```
- `max_tokens`: Maximum number of tokens the model may generate in the response. This caps the output only; the model’s context window limits input and output combined.
- `temperature`: Controls randomness in the model’s output. Range: `0.0` (deterministic) to `2.0` (high variability); most prompts stay between `0.0` and `1.0`.
- `top_p`: Alternative to `temperature` for controlling randomness. Samples from the smallest subset of tokens whose cumulative probability is `top_p`. Range: `0.0` to `1.0`.
- `frequency_penalty`: Penalizes repeated phrases or words. Range: `-2.0` to `2.0`.
- `presence_penalty`: Encourages or discourages the introduction of new topics. Range: `-2.0` to `2.0`.
- `stop`: A sequence of characters or strings where the model should stop generating text. Example: `["\n"]` stops generation at a new line.
- `logprobs`: Returns the log probabilities of tokens in the response.
- `stream`: Enables streaming responses, sending data incrementally as it’s generated.
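
Putting the main parameters together, here is a minimal request sketch using the legacy pre-1.0 `openai` Python package (the key and parameter values are placeholders, not recommendations):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; load your real key from a secure location

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=150,        # cap on generated (output) tokens
    temperature=0.7,       # moderate randomness
    top_p=1.0,             # no nucleus-sampling restriction
    frequency_penalty=0.0, # no repetition penalty
    presence_penalty=0.0,  # no new-topic bias
    stop=None,             # no custom stop sequences
)

print(response["choices"][0]["message"]["content"])
```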

### Tokens and Context

- `tokens`: The smallest unit of text the model processes (words, punctuation, or parts of words).
- `context window`: The total number of tokens (input + output) the model can handle in a single request. Examples: `gpt-4`: 8,192 tokens; `gpt-4-32k`: 32,768 tokens.
- `prompt`: The input text provided to the model. In Chat Completions, the prompt is the combined `messages`.
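
To count tokens before sending a request, OpenAI’s `tiktoken` library is the usual tool (assuming it is installed via `pip install tiktoken`):

```python
import tiktoken

# Get the tokenizer that matches the target model.
enc = tiktoken.encoding_for_model("gpt-4")

text = "Python is a programming language."
tokens = enc.encode(text)
print(len(tokens), "tokens:", tokens)
```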

### Fine-Tuning Specific

- `prompt`: The input that the model should respond to during training.
- `completion`: The desired output for the given `prompt`.
- `epoch`: A full pass through the training data during fine-tuning.
- `hyperparameters`: Configuration settings for the fine-tuning process, such as learning rate and batch size.
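
For the legacy `/v1/fine-tunes` endpoint, training data is a JSONL file of prompt/completion pairs, one JSON object per line. A hypothetical two-example file might look like this (the `###` separator and trailing stop marker are illustrative conventions, not requirements):

```jsonl
{"prompt": "What is Python?\n\n###\n\n", "completion": " Python is a programming language. END"}
{"prompt": "What is an epoch?\n\n###\n\n", "completion": " A full pass through the training data. END"}
```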

### Response Fields

- `id`: Unique identifier for the API request.
- `object`: Type of object returned (e.g., `chat.completion`).
- `created`: Timestamp for the response creation.
- `choices`: List of generated outputs. Each choice includes:
  - `message`: The content and role of the response.
  - `finish_reason`: Why the generation stopped (e.g., `"stop"`, `"length"`, `"content_filter"`).
  - `index`: The index of the choice (useful for multi-output requests).
- `usage`: Token usage stats:
  - `prompt_tokens`: Tokens in the input prompt.
  - `completion_tokens`: Tokens in the generated response.
  - `total_tokens`: Combined token count.
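
Assembled, a Chat Completions response body looks roughly like this (the `id`, timestamp, and token counts are made-up illustrative values):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Python is a programming language."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28}
}
```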

### Content Moderation

- `moderation`: A feature to check input or output for policy violations. Uses the `/v1/moderations` endpoint. Flags content for categories like violence, hate speech, self-harm, etc.
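
A minimal moderation check with the legacy SDK might look like this (the input string is a placeholder):

```python
import openai

# Send text to the moderation endpoint and inspect the first result.
result = openai.Moderation.create(input="Some user-submitted text to check.")

print("Flagged:", result["results"][0]["flagged"])
print("Category scores:", result["results"][0]["category_scores"])
```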

### Function Calling (Advanced)

- `functions`: List of callable functions that the assistant can invoke during interaction. Example:

```json
[
  {
    "name": "get_weather",
    "description": "Fetch weather details for a location.",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string", "description": "City or location."}
      },
      "required": ["location"]
    }
  }
]
```

- `function_call`: The assistant's response, specifying the function to invoke and its arguments. Example:

```json
{"name": "get_weather", "arguments": "{\"location\": \"San Francisco\"}"}
```

Note that the API returns `arguments` as a JSON-encoded string, which you parse yourself before calling the function.

### Miscellaneous

- `rate limits`: Usage caps on API calls (varies by plan).
- `error codes`: Common API error codes:
  - `401`: Unauthorized (invalid API key).
  - `429`: Rate limit exceeded.
  - `500`: Internal server error.
- `streaming`: Incremental response generation for real-time applications.

## Context and Retention

### Context Window

- **Definition**: The maximum number of tokens (including both input and output) that the model can handle in a single request.
- **Implications**:
  - The larger the context window, the more conversation history or input details you can provide.
  - If the input exceeds the context window, the oldest parts of the input are truncated.
- **Context Window Sizes**:
  - `gpt-4`: 8,192 tokens
  - `gpt-4-32k`: 32,768 tokens
  - `gpt-3.5`: 4,096 tokens

### Context Retention

- **Definition**: How well the model "remembers" the ongoing conversation within the context window.
- **Details**:
  - The model does not have long-term memory; it "remembers" only what is included in the current `messages` list.
  - For longer conversations, you may need to summarize earlier interactions to conserve tokens.

### Summarization for Context Retention

- **Definition**: The process of summarizing older parts of a conversation to save space while maintaining key details within the context window.
- **Example**:

```python
import openai

def summarize_conversation(messages, model="gpt-3.5-turbo"):
    """Condense earlier messages into a short summary to free up tokens."""
    summary_prompt = (
        "Summarize the following conversation briefly:\n\n"
        + "\n".join(msg["content"] for msg in messages)
    )
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": summary_prompt}],
        max_tokens=200,
    )
    return response["choices"][0]["message"]["content"]
```

### Conversation History

- **Definition**: A chronological collection of `messages` (roles like `user`, `system`, `assistant`) passed to the API to simulate "memory."
- **Best Practice**: Maintain an external log of conversation history and include only relevant portions in the `messages` list to manage context size.

### Token Management

- **Definition**: Managing the input and output tokens to optimize for the context window limit.
- **Tips**:
  - Preprocess inputs to remove irrelevant information.
  - Use summarization to condense earlier interactions, or trim the oldest messages to a token budget (see the sketch below).
  - Monitor token usage using `response["usage"]`.
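
One way to implement the trimming tip: keep the `system` message plus only the newest messages that fit a token budget. A minimal sketch, assuming `tiktoken` is installed and counting only message content (real requests add a few tokens of per-message overhead):

```python
import tiktoken

def trim_history(messages, model="gpt-4", budget=6000):
    """Keep the system message plus the newest messages that fit the budget."""
    enc = tiktoken.encoding_for_model(model)
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(len(enc.encode(m["content"])) for m in system)
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```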

## Behavior and Tone

### System Instruction

- **Definition**: A `system` message at the start of a conversation that establishes the assistant’s behavior, tone, or personality.
- **Example**:

```json
{"role": "system", "content": "You are a helpful assistant who answers questions concisely and always provides examples."}
```

### Message Role

- **Definition**: Assigns responsibility to a message (e.g., `system`, `user`, `assistant`, `function`).
- **Roles**:
  - `system`: Defines the assistant’s behavior.
  - `user`: Represents user queries or instructions.
  - `assistant`: Represents model responses.
  - `function`: Indicates a callable function in the conversation.

### Behavioral Persistence

- **Definition**: Ensuring the model consistently follows a specific tone or format throughout interactions.
- **Techniques**:
  - Use a strong `system` message.
  - Periodically inject reminders as `assistant` messages:

```json
{"role": "assistant", "content": "Remember, respond concisely and provide examples."}
```

## Advanced Features

### Token Overlap

- **Definition**: Reusing tokens from the previous response in the next prompt to simulate continuity.
- **Why It Matters**: Ensures smooth interaction when summarizing or truncating earlier messages.

### Function Calling

- **Definition**: Enables the assistant to call a function and include the output in responses.
- **Example**: A user request and the assistant’s reply:

```json
{"role": "user", "content": "What's the weather in Paris?"}
```

```json
{"role": "assistant", "function_call": {"name": "get_weather", "arguments": "{\"location\": \"Paris\"}"}}
```

### Streaming

- **Definition**: A feature that enables incremental responses from the model, useful for real-time applications.
- **How to Use**: Set `stream=True` in the API request.
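
With the legacy SDK, a streamed response is an iterator of chunks whose `delta` fields carry the new text. A minimal sketch:

```python
import openai

# Request a streamed response and print it piece-by-piece.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is Python?"}],
    stream=True,
)

for chunk in response:
    delta = chunk["choices"][0]["delta"]  # incremental piece of the message
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()
```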

## Optimization Concepts

### Temperature

- **Definition**: Controls randomness in output. Higher values produce more creative responses.
  - `0.0`: Deterministic (minimal randomness).
  - `1.0`: High variability.

### Top-p Sampling

- **Definition**: Selects tokens from the smallest subset whose cumulative probability exceeds `top_p`.
  - `1.0`: No restriction (all tokens considered).
  - `0.9`: Only the top 90% of probability mass is considered.

### Frequency and Presence Penalties

- **Frequency Penalty**: Reduces repetition of existing phrases in the response.
- **Presence Penalty**: Encourages introducing new topics.

## Errors and Troubleshooting

### Truncation

- **Definition**: Occurs when the input or output exceeds the context window. Older parts of the conversation are removed first.
- **Solution**: Summarize earlier parts of the conversation or reduce the verbosity of inputs.

### Rate Limits

- **Definition**: Maximum number of API requests allowed per minute or per token usage.
- **Example**: Free-tier limit: 20 requests per minute. When you hit a `429`, retry with exponential backoff, as sketched below.
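
A minimal retry-with-backoff sketch using the legacy SDK’s `RateLimitError`:

```python
import time
import openai

def create_with_backoff(max_retries=5, **kwargs):
    """Call ChatCompletion.create, retrying on 429s with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(**kwargs)
        except openai.error.RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("Rate limit retries exhausted")
```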

### Token Overrun

- **Definition**: Exceeding the model’s token limit for a single request (sum of prompt tokens and requested `max_tokens`).
- **Solution**: Reduce the size of `messages` or lower `max_tokens`.

## Miscellaneous

### Moderation

- **Definition**: Checking inputs or outputs for violations of safety or ethical policies using the `/v1/moderations` endpoint.

### Embeddings

- **Definition**: Vector representations of text used for semantic similarity tasks.
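
For example, with the legacy SDK and the `text-embedding-ada-002` model (a common choice, not the only one), you can embed two sentences and compare them with cosine similarity:

```python
import openai

# Embed two sentences in one request.
resp = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=["Python is a programming language.", "I love coding in Python."],
)
a, b = (item["embedding"] for item in resp["data"])

# Cosine similarity between the two vectors.
dot = sum(x * y for x, y in zip(a, b))
norm = lambda v: sum(x * x for x in v) ** 0.5
print("cosine similarity:", dot / (norm(a) * norm(b)))
```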

## Summary of Key Terms

| Category | Term | Definition |
| --- | --- | --- |
| Context | Context Window | Total tokens the model can process at once. |
| Context | Retention | How the model "remembers" within the context window. |
| Context | Summarization | Reducing earlier content to fit within the context window. |
| Behavior | System Prompt | Instruction for model tone/behavior. |
| Behavior | Role | Defines responsibility for each message (system, user, assistant, function). |
| Optimization | Temperature | Controls randomness in responses. |
| Optimization | Top-p Sampling | Limits token choices to the most probable subset. |
| Advanced Features | Streaming | Enables real-time, incremental responses. |
| Advanced Features | Function Calling | Allows the assistant to invoke specific functions for user queries. |