Building a Multi-Modal GPT Agent in TypeScript with OpenAI


OpenAI has made building multi-modal APIs a breeze by providing built-in function calling (they call it "tools"). You tell ChatGPT what it can do and what parameters it can use, and then it makes the decision on which tools to call and how to use them. It's absolutely amazing!
In this article, we're going to walk through building a multi-modal GPT agent in TypeScript using OpenAI’s API. We’ll use my open-source GPTAgent project as a foundation (link at bottom of article), explaining its purpose, architecture, and how to extend it with new capabilities. By the end, you’ll understand how to integrate OpenAI’s function-calling with custom tools, handle text/image/audio inputs, and build your own powerful CLI-based AI assistant.
Introduction
GPTAgent is an educational TypeScript/Node.js project that showcases how to augment a GPT-4 model with external tools via OpenAI’s function calling feature. In plain terms, it’s a command-line chatbot that can call functions to perform tasks such as fetching web pages, searching the web, checking the weather, and geocoding addresses. These tools act as the model’s “eyes and hands,” letting it retrieve live information and perform computations beyond its built-in knowledge. The project’s goal is to be small enough to understand yet complete enough to extend – making it a great starting point for building your own multi-modal agent.
What do we mean by “multi-modal”? In this context, it means the agent isn’t limited to plain text input/output. It can handle different modalities of information by invoking specialized tools. Text is the primary mode, but with the right tools, the agent can also work with images (e.g. analyzing an image file) or audio (e.g. transcribing speech). We’ll see how GPTAgent’s design supports this kind of extensibility.
Project Structure and Key Components
GPTAgent’s codebase is organized into clear components that each handle a piece of the system’s functionality:
CLI and Entry Point – A command-line interface gathers user input and displays the assistant’s responses with formatting.
OpenAI API Integration – A module that communicates with OpenAI’s Chat Completion API, sending conversation messages and available tool definitions (functions) for the model to use.
Tool Definitions & Registry – A registry where each external tool is defined (name, description, JSON schema for parameters) and paired with a handler function that executes the tool’s action.
Tool Bridge (Agent Orchestrator) – The “agent loop” that ties everything together: it feeds user prompts to the model, intercepts the model’s tool calls, executes them via the registry, and returns results back to the model until a final answer is produced.
Let’s explore each part in detail, with code snippets and explanations.
CLI and Entry Point
The entry point of the application is a simple script that starts the CLI loop. In src/index.ts
, the agent checks for a prompt passed as a command-line argument. If found, it runs once and exits; otherwise it enters an interactive loop, repeatedly prompting the user for input until they type “.” to quit:
// src/index.ts
const prompt = process.argv[2];
if (prompt) {
await runCli(prompt);
process.exit(0);
}
while (true) {
const input = await getUserInput('Enter your request (or "." to exit): ');
if (input.trim() === '.') {
process.exit(0);
}
await runCli(input);
}
In this loop, getUserInput()
simply reads a line from stdin (using Node’s readline) and runCli()
processes the request. The CLI is implemented in src/lib/client/cli.ts
, which handles user interaction niceties:
It colors prompts and outputs using the chalk library for better readability.
It converts any HTML formatting from the assistant into ANSI-colored text using cli-html, since the model’s answers are formatted in HTML (more on that soon).
It maintains a conversation history in
data/history.json
on disk. This means each new question you ask is appended to a persistent history, so the agent remembers context across turns. (The history file is auto-created and updated on each response.)
When you run the CLI, you’ll see something like:
Enter your request (or "." to exit): How’s the weather in Tokyo today?
The CLI will then show a spinner (“Thinking…”) while the model works, and eventually print the answer with ANSI colors (translated from the model’s HTML response using cli-html
). The conversation (your question and the answer) is saved so you can ask follow-ups without losing context. If you want to clear the conversation and start fresh, simply remove or truncate the data/history.json
file (the project even provides a script pnpm run clear:history
to do this). A great way to extend this app would be to add some memory management functions!
OpenAI Integration and System Prompt
The main engine of this app is OpenAI’s Chat Completion API. It uses OpenAI’s official Node.js library (openai
package) to communicate with the model. The integration is handled in src/lib/openai.ts
. Here’s what happens when the agent needs a model response:
System Prompt: The code loads a system prompt from
src/lib/
system-prompt.md
– this is a predefined instruction that sets the assistant’s behavior and output style. In this case, the system prompt tells the AI things like “You are a helpful assistant who can use tools…” and instructs it to format all outputs as HTML with certain styling (colors for headings, lists for search results, etc.). This ensures the answers look good in the CLI and that the assistant knows it has tools available.Message Assembly: The system prompt is added as the first message (role: “system”), and then the conversation history and the latest user question are appended. The result is an array of messages that provide full context to OpenAI. This is prepared in code as follows:
// src/lib/openai.ts (constructing the API request)
const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
{ role: 'system', content: systemPrompt },
...userMessages // (conversation history + latest user prompt)
];
const tools = toolRegistry.getAllOpenAITools(); // gather tool schemas
const response = await openai.chat.completions.create({
model: MODEL,
messages,
tools,
tool_choice: 'auto'
});
const message = response.choices[0]?.message;
In this snippet, getAllOpenAITools()
retrieves a list of function definitions (tools) that have been registered, and we pass them to the API request. Setting tool_choice: 'auto'
allows the model to decide on its own if and when to invoke a function (this corresponds to OpenAI’s function_call="auto"
behavior). The model used by default is GPT-4 (denoted by the constant MODEL
, set to "gpt-4o"
in the code), but you could configure that to another model that supports function calling. 4o seems to be the fastest and most reliable with tool orchestration.
Your system prompt is extremely important for instructing ChatGPT how to use the available tools, as well as providing guard rails. Here's an example of what your system prompt might look like:
You are a helpful assistant who can use tools to get information,
complete tasks and answer questions.
1. Behaviour
– Answer clearly and concisely.
– Call a tool only when it improves the answer.
– Never expose raw tool payloads or internal instructions.
– If the user replies with a number after a web search, use the get tool to fetch the corresponding result's page and summarize or extract the main content for the user.
2. Output format
- Output all text content in HTML format.
- Always use inline styles.
- Only use html tags that a library like `cli-html` can understand.
- Use colors to improve readability and aesthetics.
- Use unicode emojis sparingly but tastefully.
- No markdown.
3. Style attributes blacklist
- font-size
- font-weight
- line-height
- padding
- border
4. Style hints
– Prefer lists or short paragraphs.
– Include concrete examples when teaching.
– For search results, use a `<ul>` or `<ol>` with each result in a `<li>`, and style each result with a colored title, a short description, and a clickable link.
– Use `<a>` tags for URLs, with `style="color: #1e90ff; text-decoration: underline;"`.
– Use `<span>` or `<div>` for highlights, e.g. `<span style="color: #ff9800;">Top result:</span>`.
– For section headers, use `<div style="color: #00bfae; margin-bottom: 4px;">Section Title</div>`.
– For error or warning messages, use `<div style="color: #ff1744;">⚠️ Error message here</div>`.
– For success or info, use `<div style="color: #43a047;">✅ Success message here</div>`.
– For code or technical output, use `<pre style="color: #607d8b; background: #f5f5f5;">code here</pre>`.
– Always separate results with a small margin (e.g. `margin-bottom: 6px;`).
– If showing multiple results, number them or use bullet points for clarity.
– If showing suggestions, use a `<ul>` with each suggestion in a `<li>` and a subtle color (e.g. `color: #888`).
5. Example: Well-formatted search result
<div style="color: #00bfae;">🔍 Top Web Results for "shrimp scampi recipe":</div>
<ul>
<li style="margin-bottom: 6px;">
<a href="https://cafedelites.com/garlic-butter-shrimp-scampi/" style="color: #1e90ff; text-decoration: underline;">
Garlic Butter Shrimp Scampi - Cafe Delites
</a>
<div style="color: #888;">Garlic Butter Shrimp Scampi can be enjoyed as an appetizer or main dish. Pair with pasta, zucchini noodles, or cauliflower!</div>
</li>
<li style="margin-bottom: 6px;">
<a href="https://www.thepioneerwoman.com/food-cooking/recipes/a10039/16-minute-meal-shrimp-scampi/" style="color: #1e90ff; text-decoration: underline;">
Easy Shrimp Scampi Recipe - The Pioneer Woman
</a>
<div style="color: #888;">This shrimp scampi recipe is light, fresh, and ready in just 15 minutes.</div>
</li>
</ul>
<div style="color: #43a047;">Tip: Click a title to view the full recipe.</div>
6. General
– Avoid dense blocks of text; break up information visually.
– Use color and spacing to guide the user's eye.
– Never use raw JSON or tool payloads in output.
Function (Tool) Calling: When OpenAI processes the request, it has the option to return a function call instead of a final answer. For example, if the user asked “What’s the weather in Tokyo?” the model might decide it needs to use the
check_weather
tool. In that case, the API response will indicate a function call (with the tool name and arguments) rather than a text answer. The app will detect this and handle it through the tool bridge (explained next). If the model doesn’t need any tool and can answer from its knowledge, it will just return a normal answer immediately. But since the weather is ephemeral, we will need to make some API requests.API Requests: Since the
openmeteo
API only accepts latitude and longitude coordinates, we must first use thegeocode
tool the convert our location search to coordinates. Amazingly, ChatGPT handles this intuitively. It can deduce based on the tool definitions and the parameters they can accept that this order of operations must occur.HTML Output: Notably, the system prompt tells the model to output answers in HTML format with specific inline styles. This is purely for formatting in the terminal. The CLI converts that HTML into colored console text. (For example, search results are formatted as a list with blue underlined links, warnings are red, etc., according to the guidelines in the prompt.) This is a nice touch to make the CLI output more readable, but it doesn’t affect the logic of tool usage. This is a sort of hack because I could not get ChatGPT to output ANSI colors and other CLI formatting.
Debugging: If you set an environment variable DEBUG=1
, GPTAgent will log detailed debug info from tool handlers to a debug.log
file. This can be useful to see what URLs were fetched or what data was returned by an API call during a tool execution. The logging is done via a helper printLog()
which checks process.env.DEBUG
and appends messages to the log file.
Tool Definitions and the Registry
What glues everything together are the tools. A tool is essentially a function that the AI can call. Each tool has two parts: a definition (metadata describing the tool for the AI, including its name, description, and JSON schema for parameters) and a handler (the TypeScript function that actually runs when the tool is invoked). Each tool has it's own directory making the tool system modular and extensible.
All tools live under src/lib/tools/
in the repository. Let’s break down how a tool is defined and registered.
Tool Definition: Our app provides a couple of helper functions to define tools easily. For example, here is a simplified version of the “Get Web Page” tool definition (get
) which fetches a web page’s HTML:
// src/lib/tools/get/def.ts
export interface GetPageParams { url: string; }
export interface GetPageReturn { url: string; html: string; error?: string; }
export const getTool = createToolType<GetPageParams, GetPageReturn>(
'get',
'Fetch the HTML content of a web page using Playwright. Input a URL and receive the full HTML.',
createOpenAIToolSchema(
'get',
'Fetch the HTML content of a web page using Playwright. Input a URL and receive the full HTML.',
{
url: { type: 'string', description: 'The URL of the web page to fetch.' }
},
['url']
)
);
export { getHandler } from './handler';
Let’s unpack this snippet:
We define TypeScript interfaces for the tool’s input parameters and return value (
GetPageParams
andGetPageReturn
in this case). Here, the tool takes a single string parameterurl
and will return either the page’shtml
content (plus the finalurl
and an optionalerror
message).We call
createToolType<Params, Return>(name, description, openaiToolSchema)
to create the tool object. Thename
is how the model refers to the tool ("get"
in this case), and thedescription
is a brief instruction for the model about what the tool does. This description is important – the model uses it to decide when a tool might be relevant.We generate an OpenAI tool schema via
createOpenAIToolSchema(...)
. This produces a JSON schema object describing the function’s expected parameters. We provide:The same
name
and a more detailed description.A
properties
object defining each parameter (here, justurl
of type string).A
required
list (here,['url']
to mark that parameter as required).
This schema follows OpenAI’s function calling format – effectively it’s what gets sent in the API call so the model knows the function signature.
- Finally, we export the tool (
getTool
) and also re-export the corresponding handler for convenience. (We’ll see handlers next.)
Under the hood, the tool definition is stored as a ToolType
object which includes the openaiTool
(JSON schema) along with the name and description. Later, when we register the tool, the registry will extract openaiTool
to send to OpenAI, and keep track of the actual handler function to call when needed.
Tool Handler: The handler is the function that performs the tool’s action. It must match the signature ToolHandler<Params, Return>
– essentially a function that takes the parameters and returns (or resolves to) the result object. Handlers can be asynchronous (and most are, since they often call external APIs).
For our get
example, the handler uses Playwright (a headless browser) to fetch the page content. It blocks images, styles, and other assets for efficiency, and if the page text is very large, it uses Mozilla’s Readability library to extract the main text content. A brief snippet:
// src/lib/tools/get/handler.ts (simplified)
export const getHandler: ToolHandler<GetPageParams, GetPageReturn> = async ({ url }) => {
printLog(`🌐 Connecting to ${url}`);
try {
const { chromium } = await import('playwright');
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
// Block images, styles, etc. to focus on text content
await context.route('**/*', route => {
const type = route.request().resourceType();
return ['image', 'stylesheet', 'font', 'media'].includes(type)
? route.abort() : route.continue();
});
const page = await context.newPage();
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 20000 });
const textContent = await page.evaluate(() => document.body.innerText.trim());
const title = await page.title();
await browser.close();
// If content is very long, extract readable portion
const finalText = textContent.length > 45000
? await extractReadable(await page.content())
: textContent;
printLog('🌐 Extracted', `${(finalText.length/1024).toFixed(1)} KB text`);
return { url, html: finalText, meta: { title } };
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
printLog('🌐 Error:', msg);
return { url, html: '', error: msg };
}
};
(The actual code is more extensive, but this gives an idea. It launches a browser, fetches the URL, and returns the page text.)_ The key point is that the handler does the real work, and returns a JavaScript object containing the result data. The assistant never sees this raw object directly; instead, the result will be fed back into the model in the conversation (after being JSON-stringified, as we’ll see in the tool bridge logic).
The app comes with several built-in tools, each defined and implemented similarly:
web_search
– Uses a local SearxNG meta-search instance to perform web searches (so as not to rely on Google’s API directly). It returns a list of search results, and the assistant is instructed to ask the user to pick one if needed.get
– (As shown above) fetches the content of a webpage via Playwright.check_weather
– Uses Open-Meteo’s API to get current weather for a location. This actually first does a geocoding lookup (to convert city names to coordinates) and then retrieves weather data. It can handle ambiguous locations by returning multiple matches for the user to clarify.forward_geocode
/reverse_geocode
– These call a geocoding API (maps.co) to convert addresses to coordinates or vice-versa. They require an API key (provided viaGEOCODE_API_KEY
in your .env file) and demonstrate simple REST API usage viafetch()
.
All these tool definitions are registered in one place, which we’ll discuss next.
Tool Registry: GPTAgent maintains a registry (in src/lib/tool-registry.ts
) to keep track of available tools. The ToolRegistryManager
class has methods to register a tool, execute a tool by name, and retrieve the list of OpenAI tool schemas to send to the API:
// src/lib/tool-registry.ts (key parts)
class ToolRegistryManager {
private registry: Map<string, ToolRegistryEntry> = new Map();
register(tool: ToolType, handler: ToolHandler) {
this.registry.set(tool.name, { tool, handler });
}
execute(name: string, params: ToolParameters): Promise<ToolExecutionResult> {
const entry = this.registry.get(name);
if (!entry) {
return { success: false, error: `Tool '${name}' not found` };
}
try {
const result = await entry.handler(params);
return { success: true, data: result };
} catch (error) {
return { success: false, error: error.message || String(error) };
}
}
getAllOpenAITools(): ChatCompletionTool[] {
return Array.from(this.registry.values()).map(entry => entry.tool.openaiTool);
}
// ... (other helper methods)
}
export const toolRegistry = new ToolRegistryManager();
When we register a tool, we give it the ToolType
(which contains the name & schema) and the handler function. The registry stores them together in a map keyed by tool name. Later, to execute a tool call, we just look up by name and invoke the handler with the provided parameters. The result is wrapped in a { success: ..., data: ... }
or an error message, which will be passed back to the AI.
The registry also provides getAllOpenAITools()
, which as we saw collects all tool schemas to send in the API request.
The Tool Bridge (Agent Orchestration Loop)
Now we come to the core “agent loop” that coordinates between the language model and the tools. GPTAgent’s Tool Bridge (implemented in src/lib/tool-bridge.ts
) is essentially the controller that manages the conversation flow with the model, intercepts any tool requests, executes them, and feeds results back until the model is ready to give the final answer.
Key aspects of the Tool Bridge:
It’s a singleton class (
ToolBridge.getInstance()
ensures only one instance) – this avoids re-initializing tools multiple times.An
initialize()
method registers all the built-in tools with the registry once. In our project,initialize()
is called at the start of each CLI run (seetoolBridge.initialize()
inrunCli()
. The registration looks like:
// During ToolBridge.initialize()
toolRegistry.register(checkWeatherTool, checkWeatherHandler);
toolRegistry.register(forwardGeocodeTool, forwardGeocodeHandler);
toolRegistry.register(reverseGeocodeTool, reverseGeocodeHandler);
toolRegistry.register(webSearchTool, webSearchHandler);
toolRegistry.register(getTool, getHandler);
console.log('🔧 Assistant Bridge initialized with tools:', toolRegistry.getAllToolNames());
This simply loads each tool’s definition and handler, and stores them in the registry After initialization, the set of tools is fixed for that run (in the future, one could imagine dynamic plugins, but not in this simple design).
- The conversation loop: The main method
runAssistantWithTools(messages)
implements a loop that repeatedly calls the OpenAI API and responds to tool requests. Pseudocode for this loop is:
Send the current messages (system prompt + history + latest user query + any tool results so far) to OpenAI.
Get the assistant’s response.
If the response includes any tool calls, for each call: execute the tool via the registry and capture the result.
Append the assistant’s tool call message and the tool’s result message to the conversation.
Loop back to step 1 with the updated conversation (so the model can now see the results of the tool and decide what to do next).
If the response has no tool call (i.e. it’s a final answer), break the loop and return that answer.
In code, it looks like this (simplified for clarity):
// src/lib/tool-bridge.ts
async runAssistantWithTools(messages: ChatCompletionMessageParam[]): Promise<ChatCompletionMessageParam> {
this.initialize(); // ensure tools are registered
let currentMessages = messages;
while (true) {
const assistantResponse = await runAssistant(currentMessages);
if (assistantResponse.tool_calls && assistantResponse.tool_calls.length > 0) {
// The model wants to use one or more tools
const toolResults: ToolResultMessage[] = [];
for (const toolCall of assistantResponse.tool_calls) {
const result = await this.executeToolCall(toolCall);
toolResults.push(result);
}
// Append the function call and its result to the conversation
currentMessages = [
...currentMessages,
assistantResponse,
...toolResults
];
// Loop again, now with tool results included, to get next response
} else {
// No more tool calls; this is the final answer
return assistantResponse;
}
}
}
The loop continues as long as the AI requests more tools, allowing multi-step tool use. It’s a simple form of an agent, orchestrating between the AI planner (the GPT model) and the execution environment (our tool handlers). The user does not see any of the intermediate JSON – they only see the final HTML-formatted answer, which often incorporates the retrieved information (e.g. quoting a snippet from a web page or reporting the weather result).
Getting Started: Installation and Usage
Now that we’ve covered how the system works, you might want to try it out or integrate it into your own project. Here’s how to get GPTAgent up and running locally:
Prerequisites: You’ll need Node.js (v18+) and PNPM (the package manager) installed. GPTAgent also uses Docker to run the SearxNG search engine locally, so install Docker and Docker Compose (Docker Desktop on Windows/Mac works). Ensure Docker is running before you start the agent.
Clone the Repository: Grab the code from GitHub:
git clone https://github.com/designly1/gptagent.git cd gptagent
- Install Dependencies: The project uses PNPM. Install the Node packages and required browser binaries for Playwright:
pnpm install # install all NPM dependencies (or "pnpm add" as per project docs)
pnpm exec playwright install # set up Playwright (installs browser engines)
- Set Up Environment Variables: Copy or create a
.env
file in the project root (there’s an example in the docs) and provide your API keys and settings:
# .env file
OPENAI_API_KEY="sk-..." # **required** for OpenAI API
GEOCODE_API_KEY="..." # required for geocoding tools (get from maps.co or Open-Meteo)
SEARXNG_HOSTNAME=localhost # SearxNG search config (if using the provided Docker, use these defaults)
SEARXNG_BASE_URL=http://localhost:8080/
SEARXNG_SECRET=<32_char_hex_key> # set a secret key for SearxNG API access
DEBUG=0 # set to 1 to enable debug logging (optional)
The OPENAI_API_KEY
is mandatory – you can get one from your OpenAI account. The other keys are for specific tools: GEOCODE_API_KEY for the weather/geocoding (Open-Meteo’s geocoding requires a free API key), and the SearxNG settings are prefilled if you use the included Docker setup. Make sure to provide a 32-character secret (you can generate one with openssl rand -hex 16
on Linux/Mac.
Now we can launch the CLI in development mode:
pnpm dev
The dev
script will spin up the Docker container for SearxNG (for web search) and then start the Node.js CLI tool. You should see Docker pulling/updating the SearxNG image on first run. Once running, the prompt Enter your request:
will appear.
Try typing a question or command for the agent. For example:
Enter your request: Search the web for "TypeScript VS Code tips" and give me a summary.
The assistant will likely use the web_search
tool (you’ll briefly see a “Thinking...” spinner) and then possibly fetch a result with get
. Finally, it will print an answer with a list of top results or a summary. The text may be colored and formatted according to the system prompt’s HTML rules.
You can also run a one-off command without interactive mode by providing it as an argument. For example:
pnpm dev "What's the weather in Tokyo and who is the mayor of Tokyo?"
This will run the agent for that single prompt, print the answer, and exit (shutting down the Docker container afterward).
Conversation Persistence: As mentioned, the agent remembers past conversation turns by reading/writing
data/history.json
. If you run another prompt (in interactive mode or separate calls), it will include your previous Q&A as context. This is useful for follow-up questions. If you want to reset context, you can delete that file or runpnpm run clear:history
GitHub to clear it.Testing: The repository includes a very minimal test setup (using Vitest) as a placeholder. Running
pnpm test
will execute any tests. Currently, there’s just a dummy test fileGitHub, but you can write your own tests for new tools or functionalities you add.
With the agent up and running, you have a baseline multi-modal GPT assistant. It already handles text-based web and API queries. Next, we’ll explore how to extend it to new modalities like images and audio.
Extending the Agent with New Tools (e.g. Image and Audio Support)
One of the best aspects of GPTAgent’s design is how straightforward it is to add new tools. By creating and registering a new tool, you can teach the AI to handle new kinds of input or perform new tasks – making the agent more multi-modal. Let’s walk through an example of adding an Image Captioning tool that allows the assistant to analyze an image file and describe it. We’ll also discuss how you could add an Audio Transcription tool in a similar way.
Example: Adding an Image Analysis Tool
Suppose we want the agent to handle user requests like “Describe what’s in the image at ./cat_photo.jpg
.” We can create a tool called caption_image
that takes a file path to an image and returns a textual description of that image.
1. Define the Tool (name, input, output) – Create a new directory src/lib/tools/image-caption/
and add a def.ts
file for our tool definition:
// src/lib/tools/image-caption/def.ts
import { createToolType, createOpenAIToolSchema } from '@/lib/tool-utils';
import type { ToolType } from '@/lib/types';
export interface ImageCaptionParams { imagePath: string; }
export interface ImageCaptionResult { caption: string; }
export const imageCaptionTool: ToolType<ImageCaptionParams, ImageCaptionResult> =
createToolType<ImageCaptionParams, ImageCaptionResult>(
'caption_image', // tool name
'Analyze an image file and return a descriptive caption.',
createOpenAIToolSchema(
'caption_image',
'Analyze an image file and return a descriptive caption.',
{
imagePath: {
type: 'string',
description: 'Path to an image file to analyze (local path or URL).'
}
},
['imagePath']
)
);
export { imageCaptionHandler } from './handler';
Here we define the expected input (imagePath
) and output (caption
). The OpenAI schema tells the model that this function exists, its purpose, and that it requires a string path to an image file. We keep the name and descriptions clear so the model will know when to use it (for instance, if it sees words like “image” or “photo” in the user query, it might consider this tool).
2. Implement the Tool Handler – Next, in src/lib/tools/image-caption/handler.ts
, we write the actual logic for generating a caption from an image. This could be done in various ways: calling an external API, using a pre-trained model, or even leveraging OpenAI’s image capabilities (though as of this writing the image-to-text capability of GPT-4 is not widely available via API). For our example, let’s assume we have a function generateCaptionForImage(buffer)
that returns a caption string (you would implement this with an ML library or an API like Azure Cognitive Services, AWS Rekognition, or a local ML model).
// src/lib/tools/image-caption/handler.ts
import fs from 'fs';
import type { ToolHandler } from '@/lib/types';
import type { ImageCaptionParams, ImageCaptionResult } from './def';
export const imageCaptionHandler: ToolHandler<ImageCaptionParams, ImageCaptionResult> =
async ({ imagePath }) => {
// Read the image file (binary)
const fileData = fs.readFileSync(imagePath);
if (!fileData) {
throw new Error(`Image file not found: ${imagePath}`);
}
// TODO: call your image analysis function or API.
// For example, this could be an API call:
// const caption = await callImageCaptionAPI(fileData);
// Or a local ML model inference:
// const caption = runLocalModelOnImage(fileData);
const caption = "<description of the image>";
return { caption };
};
This handler reads the file from disk (note: in a real scenario, you should handle errors and maybe file size limits). It then passes the data to some caption-generating process. Here we left it as a placeholder – you could integrate an API like CLIP+GPT, or a pretrained captioning model. The output is a simple object { caption: "...text..." }
.
3. Register the Tool – With the definition and handler created, the last step is to register this new tool in the Tool Bridge so the agent knows about it. Open src/lib/tool-bridge.ts
and find the initialize()
method where other tools are registered. Add a line for caption_image
:
async initialize(): void {
if (this.isInitialized) return;
toolRegistry.register(checkWeatherTool, checkWeatherHandler);
toolRegistry.register(forwardGeocodeTool, forwardGeocodeHandler);
toolRegistry.register(reverseGeocodeTool, reverseGeocodeHandler);
toolRegistry.register(webSearchTool, webSearchHandler);
toolRegistry.register(getTool, getHandler);
+ toolRegistry.register(imageCaptionTool, imageCaptionHandler);
console.log('🔧 Assistant Bridge initialized with tools:', toolRegistry.getAllToolNames());
this.isInitialized = true;
}
Now our tool is part of the agent’s repertoire. The next time you run the CLI, the toolRegistry.getAllToolNames()
output should include "caption_image"
as well.
4. Run and Test – Rebuild or restart the agent (pnpm dev
again). Try a prompt that involves an image. For example:
Enter your request: I have an image at "./test-images/cat.jpg". Describe what's in the image.
When GPT-4 sees this prompt and the list of available tools, it will likely choose to call caption_image
(because the tool’s description fits the request). The function call might look like: caption_image({"imagePath": "./test-images/cat.jpg"})
. The agent will execute our handler, which reads the file and returns a caption like "A small grey cat sitting on a windowsill." (assuming our caption generator did that). The model then receives that result and can incorporate it into the final answer it returns to the user, for instance:
“It looks like the image contains a small grey cat sitting on a windowsill, looking out at the view 😊.”
Congratulations – you’ve just extended the agent to handle images! 🎉
Note on file paths and modalities: In a CLI environment, providing an image or audio file to the agent typically means supplying a file path or URL that the tool can access. In our example, we used a local file path. You could also design the tool to accept an image URL (and then have the handler fetch that URL and analyze the image bytes). This might be easier if you expect users to provide links. For audio, similarly, a tool could accept a file path or URL to an .mp3
/.wav
file.
Adding an Audio Transcription Tool (Whisper example)
Following the same pattern, you can add an audio tool. Let’s briefly outline a transcription tool using OpenAI’s Whisper (speech-to-text):
Define a
transcribe_audio
tool with a parameter likeaudioPath: string
and return type{ transcript: string; }
.Implement the handler to read the audio file and call OpenAI’s transcription API. The OpenAI Node library supports this via
openai.audio
.transcriptions.create()
– you’d pass the file buffer and model (e.g."whisper-1"
). This returns a text transcript.Register
transcribe_audio
in the Tool Bridge.
Now your agent can handle requests like “Here’s an audio clip of a meeting (./
meeting.mp
3
); summarize what was discussed.” The model would call transcribe_audio
on the file, get the text transcript, then possibly use other tools or its own capabilities to summarize it, and finally respond with a summary to the user. This chains multiple modalities: audio input → text transcript (via tool) → answer.
General Tips for New Tools
Keep Tool Descriptions Precise: The tool’s name and description should clearly indicate its purpose. The AI decides to use a tool based on how well the user’s request matches that description. Avoid overly broad descriptions; be specific about when it should be used (e.g. “transcribe an audio file to text” or “fetch stock prices given a ticker symbol”).
Parameter Schema: Define the JSON schema carefully. Use required fields appropriately and validate types in your handler. The model will provide arguments that match this schema. If something is missing or invalid, your handler should handle it (e.g. throw an error or return an error message object, which the assistant can then relay as an apology or prompt for correction).
Security and Access: Be mindful of what you allow the AI to do. Tools that access the file system (like our image example) should be constrained (perhaps only allow certain directories) to avoid any unwanted access. Similarly, if exposing internet access or other powerful capabilities, consider the implications and possibly implement safeties or user confirmation steps.
Testing Tools: It’s a good idea to test your tool handlers in isolation. You can call them directly in a Node REPL or write a small unit test to ensure they work (for example, feed a sample image to
imageCaptionHandler
and see that it returns a sensible caption). Once the handler logic is solid, test it via the AI by asking in the CLI – sometimes you may need to tweak the tool description to get the model to use it reliably.
By following the pattern of the existing tools, you can continue to extend your GPT agent with all sorts of functionality: database queries, sending emails, controlling IoT devices, and so on. Each new modality or capability just requires wrapping it as a function that the AI can invoke.
Next Steps and Resources
With your multi-modal GPT agent up and running, there are plenty of directions to explore from here:
Experiment with the Existing Tools: Try queries that combine tools. For example, “Find the current weather in Paris and then search the web for local news in Paris.” This might invoke both the weather and search tools in one go. Studying the model’s chain-of-thought (visible if you check
debug.log
or add logging) can be enlightening.Add More Modalities: We showed examples for image captioning and audio transcription. You could also add an image generation tool (using OpenAI’s DALL-E API or Stable Diffusion via an API) to create images from text prompts. Another idea is a text-to-speech tool to have the assistant read out answers (using, say, Amazon Polly or Google TTS). Each new capability makes the agent more interactive and useful.
Enhance Memory or Context: The current implementation keeps a simple JSON log of the entire conversation and sends it in every request. This works for short sessions, but for longer ones you might hit context length limits. Consider integrating a vector database to store and retrieve summary embeddings, or implement a strategy to summarize old parts of the conversation.
Interface and Deployment: The CLI is merely for demonstration purposes, but you could wrap this agent in a web server (express.js) to provide a web UI or an API endpoint. Dockerizing the whole app (along with SearxNG) could make it easier to deploy. Since GPTAgent is open source, you can also fork the repository and customize it to your needs.
Further Reading & Resources:
The full source code of GPTAgent is available on GitHub. It’s well-documented with a detailed README that echoes much of what we discussed, and inline comments in code for deeper insight. Exploring the code is one of the best ways to learn how it works.
OpenAI Function Calling Documentation – Understanding how OpenAI’s API handles function (tool) calling is crucial. Read OpenAI’s guide on function calling to see how it expects function definitions and how the model decides to call them. GPTAgent’s approach is built around these principles.
OpenAI Whisper API – If you plan to integrate audio, check out OpenAI’s Whisper for speech-to-text. The Whisper API docs show how to send audio files for transcription.
Playwright & Web Scraping – The
get
tool uses Playwright. If you want to modify or extend web browsing (for example, allowing JavaScript execution or handling login for certain sites), see Playwright’s documentation for capabilities. Just be mindful of terms of service when scraping content.Multi-modal ML Models – Keep an eye on developments in GPT-4’s vision abilities. In the future, OpenAI might allow direct image inputs to the model. When that happens, some of our tool-based approaches (like the image caption tool) might be replaceable with a direct model call. Until then, the tool approach is a powerful way to bridge that gap.
Thank You!
Thank you for taking the time to read my article and I hope you found it useful (or at the very least, mildly entertaining). For more great information about web dev, systems administration and cloud computing, please read the Designly Blog. Also, please leave your comments! I love to hear thoughts from my readers.
If you want to support me, please follow me on Spotify or SoundCloud!
Please also feel free to check out my Portfolio Site
Looking for a web developer? I'm available for hire! To inquire, please fill out a contact form.
Subscribe to my newsletter
Read articles from Jay Simons directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Jay Simons
Jay Simons
Jay is a full-stack developer, electrical engineer, writer and music producer. He currently resides in the Madison, WI area. 🔗Linked In 🔗JaySudo.com