OpenAI API terminology

Anix LynchAnix Lynch
8 min read

Here’s an extensive list of OpenAI API terminology commonly used when interacting with models like gpt-4 via the Chat Completions or Completions endpoints:


Core Terminology

Roles

  • system: Sets the tone, behavior, and context of the assistant. Acts as the overarching "instruction manual."

  • user: Represents input from the user, typically a query or instruction.

  • assistant: Represents the model’s response in the conversation.

  • function: A role used when defining functions for advanced interaction via plugins or programmatic APIs.


Endpoints

  1. Chat Completions (/v1/chat/completions):

    • Designed for conversation-based interactions using roles like system, user, and assistant.
  2. Completions (/v1/completions):

    • Simpler endpoint designed for single-turn completions without roles. Uses plain prompts and completions.
  3. Fine-Tunes (/v1/fine-tunes):

    • For creating fine-tuned models tailored to your dataset.

API Parameters

  • model:

    • Specifies the model you want to use. Examples:

      • "gpt-4", "gpt-4-32k", "text-davinci-003".
    • "gpt-4-32k" has a larger context window (32,768 tokens).

  • messages:

    • A list of messages in a conversation. Each message has a role and content.

    • Example:

        [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is Python?"},
          {"role": "assistant", "content": "Python is a programming language."}
        ]
      
  • max_tokens:

    • Maximum number of tokens (words + punctuation) the model can generate in the response.

    • Includes both input and output tokens.

  • temperature:

    • Controls randomness in the model's output.

    • Range: 0.0 (deterministic) to 1.0 (high variability).

  • top_p:

    • Alternative to temperature for controlling randomness.

    • Samples from the smallest subset of tokens whose cumulative probability is top_p.

    • Range: 0.0 to 1.0.

  • frequency_penalty:

    • Penalizes repeated phrases or words.

    • Range: -2.0 to 2.0.

  • presence_penalty:

    • Encourages or discourages the introduction of new topics.

    • Range: -2.0 to 2.0.

  • stop:

    • A sequence of characters or strings where the model should stop generating text.

    • Example: ["\n"] stops generation at a new line.

  • logprobs:

    • Returns the log probabilities of tokens in the response.
  • stream:

    • Enables streaming responses, sending data incrementally as it’s generated.

Tokens and Context

  • tokens:

    • The smallest unit of text the model processes (words, punctuation, or parts of words).
  • context window:

    • The total number of tokens (input + output) the model can handle in a single request.

    • Examples:

      • gpt-4: 8,192 tokens.

      • gpt-4-32k: 32,768 tokens.

  • prompt:

    • The input text provided to the model. In Chat Completions, the prompt is the combined messages.

Fine-Tuning Specific

  • prompt:

    • The input that the model should respond to during training.
  • completion:

    • The desired output for the given prompt.
  • epoch:

    • A full pass through the training data during fine-tuning.
  • hyperparameters:

    • Configuration settings for the fine-tuning process, such as learning rate and batch size.

Response Fields

  • id:

    • Unique identifier for the API request.
  • object:

    • Type of object returned (e.g., chat.completion).
  • created:

    • Timestamp for the response creation.
  • choices:

    • List of generated outputs. Each choice includes:

      • message: The content and role of the response.

      • finish_reason: Why the generation stopped (e.g., "stop", "length", "content_filter").

      • index: The index of the choice (useful for multi-output requests).

  • usage:

    • Token usage stats:

      • prompt_tokens: Tokens in the input prompt.

      • completion_tokens: Tokens in the generated response.

      • total_tokens: Combined token count.


Content Moderation

  • moderation:

    • A feature to check input or output for policy violations.

    • Uses the /v1/moderations endpoint.

    • Flags content for categories like violence, hate speech, self-harm, etc.


Function Calling (Advanced)

  • functions:

    • List of callable functions that the assistant can invoke during interaction.

    • Example:

        [
          {
            "name": "get_weather",
            "description": "Fetch weather details for a location.",
            "parameters": {
              "type": "object",
              "properties": {
                "location": {"type": "string", "description": "City or location."}
              },
              "required": ["location"]
            }
          }
        ]
      
  • function_call:

    • The assistant's response, specifying the function to invoke and its parameters.

    • Example:

        {"name": "get_weather", "arguments": {"location": "San Francisco"}}
      

Miscellaneous

  • rate limits:

    • Usage caps on API calls (varies by plan).
  • error codes:

    • Common API error codes:

      • 401: Unauthorized (invalid API key).

      • 429: Rate limit exceeded.

      • 500: Internal server error.

  • streaming:

    • Incremental response generation for real-time applications.

Context and Retention

Context Window

  • Definition: The maximum number of tokens (including both input and output) that the model can handle in a single request.

  • Implications:

    • The larger the context window, the more conversation history or input details you can provide.

    • If the input exceeds the context window, the oldest parts of the input are truncated.

  • Context Window Sizes:

    • gpt-4: 8,192 tokens

    • gpt-4-32k: 32,768 tokens

    • gpt-3.5: 4,096 tokens


Context Retention

  • Definition: How well the model "remembers" the ongoing conversation within the context window.

  • Details:

    • The model does not have long-term memory; it "remembers" only what is included in the current messages list.

    • For longer conversations, you may need to summarize earlier interactions to conserve tokens.


Summarization for Context Retention

  • Definition: The process of summarizing older parts of a conversation to save space while maintaining key details within the context window.

  • Example:

      def summarize_conversation(messages, model="gpt-3.5"):
          summary_prompt = "Summarize the following conversation briefly:\n\n" + "\n".join([msg["content"] for msg in messages])
          response = openai.ChatCompletion.create(
              model=model,
              messages=[{"role": "user", "content": summary_prompt}],
              max_tokens=200
          )
          return response["choices"][0]["message"]["content"]
    

Conversation History

  • Definition: A chronological collection of messages (roles like user, system, assistant) passed to the API to simulate "memory."

  • Best Practice: Maintain an external log of conversation history and include only relevant portions in the messages list to manage context size.


Token Management

  • Definition: Managing the input and output tokens to optimize for the context window limit.

  • Tips:

    • Preprocess inputs to remove irrelevant information.

    • Use summarization to condense earlier interactions.

    • Monitor token usage using response["usage"].


Behavior and Tone

System Instruction

  • Definition: A system message at the start of a conversation that establishes the assistant’s behavior, tone, or personality.

  • Example:

      {"role": "system", "content": "You are a helpful assistant who answers questions concisely and always provides examples."}
    

Message Role

  • Definition: Assigns responsibility to a message (e.g., system, user, assistant, function).

  • Roles:

    • system: Defines the assistant’s behavior.

    • user: Represents user queries or instructions.

    • assistant: Represents model responses.

    • function: Indicates a callable function in the conversation.

Behavioral Persistence

  • Definition: Ensuring the model consistently follows a specific tone or format throughout interactions.

  • Techniques:

    • Use a strong system message.

    • Periodically inject reminders as assistant messages:

        {"role": "assistant", "content": "Remember, respond concisely and provide examples."}
      

Advanced Features

Token Overlap

  • Definition: Reusing tokens from the previous response in the next prompt to simulate continuity.

  • Why It Matters:

    • Ensures smooth interaction when summarizing or truncating earlier messages.

Function Calling

  • Definition: Enables the assistant to call a function and include the output in responses.

  • Example:

    • User request:

        {"role": "user", "content": "What's the weather in Paris?"}
      
    • Assistant response:

        {"role": "assistant", "function_call": {"name": "get_weather", "arguments": {"location": "Paris"}}}
      

Streaming

  • Definition: A feature that enables incremental responses from the model, useful for real-time applications.

  • How to Use: Set stream=True in the API request.


Optimization Concepts

Temperature

  • Definition: Controls randomness in output. Higher values produce more creative responses.

    • 0.0: Deterministic (minimal randomness).

    • 1.0: High variability.

Top-p Sampling

  • Definition: Selects tokens from the smallest subset whose cumulative probability exceeds top_p.

    • 1.0: No restriction (all tokens considered).

    • 0.9: Top 90% of tokens.

Frequency and Presence Penalties

  • Frequency Penalty: Reduces repetition of existing phrases in the response.

  • Presence Penalty: Encourages introducing new topics.


Errors and Troubleshooting

Truncation

  • Definition: Occurs when the input or output exceeds the context window. Older parts of the conversation are removed first.

  • Solution: Summarize earlier parts of the conversation or reduce the verbosity of inputs.

Rate Limits

  • Definition: Maximum number of API requests allowed per minute or per token usage.

  • Example:

    • Free-tier limit: 20 requests per minute.

Token Overrun

  • Definition: Exceeding the max_tokens limit for a single request (sum of prompt and response tokens).

  • Solution: Reduce messages size or increase max_tokens.


Miscellaneous

Moderation

  • Definition: Checking inputs or outputs for violations of safety or ethical policies using the /v1/moderations endpoint.

Embeddings

  • Definition: Vector representations of text used for semantic similarity tasks.

Summary of Key Terms:

CategoryTermDefinition
ContextContext WindowTotal tokens the model can process at once.
RetentionHow the model "remembers" within the context window.
SummarizationReducing earlier content to fit within the context window.
BehaviorSystem PromptInstruction for model tone/behavior.
RoleDefines responsibility for each message (system, user, assistant, function).
OptimizationTemperatureControls randomness in responses.
Top-p SamplingLimits token choices to the most probable subset.
Advanced FeaturesStreamingEnables real-time, incremental responses.
Function CallingAllows the assistant to invoke specific functions for user queries.

0
Subscribe to my newsletter

Read articles from Anix Lynch directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Anix Lynch
Anix Lynch