#3 - AI Agents: What are they & How to code them

Mishal Alexander

While chatting with large language models (LLMs) like ChatGPT or Gemini, we often get the feeling that they are super-smart brains trapped in a digital box, with no way to reach the outside world. They can chat, write, and even code with impressive fluency. Alas, they have a limitation - they cannot work with live data. Why? Because their knowledge is frozen at whatever they were trained on before their knowledge cutoff. Now, what if I told you that there's a way to give these ‘digital brains’ a ‘body’, allowing them to interact with the real world? That's where the fascinating realm of AI agents begins.

As someone deeply curious about the cutting edge of artificial intelligence, I've been digging into how these intelligent entities are built, and let me tell you, it's like watching science fiction slowly become reality.

Road to AI Agents

The idea of autonomous agents isn't exactly brand new. Even before the rise of the powerful LLMs that we see today, computer engineers and researchers were exploring the concept of AI systems that could perceive their environment, make decisions, and take actions to achieve specific goals. Think back to early expert systems or even robotic process automation (RPA) – these were, in a way, the early ancestors of today's more sophisticated AI agents.

However, the recent explosion in the capabilities of LLMs has truly supercharged the field. These models, with their incredible ability to understand and generate human-like text, provide the "brainpower" needed for agents to reason, plan, and interact in complex ways.

The Brain and the Body: AI Agent

Imagine an LLM as a highly intelligent human brain, brimming with knowledge and the ability to process information. It can understand language, reason logically, and even be creative. But just like how a brain alone cannot physically interact with the world, an LLM in its raw form is limited to the digital realm of text and data it has been trained on.

Now, picture giving this brain a set of tools – arms to fetch information from the internet, legs to navigate different software applications, or hands to write and save files. These "tools" are just functions that the LLM can learn to use to extend its abilities beyond pure language processing.

This is the fundamental concept behind AI agents. They are essentially LLMs equipped with the ability to call upon and utilize external tools to perform actions in the real world or with dynamic data. This simple yet powerful combination unlocks a whole new level of potential.
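
To make this concrete, here is a minimal sketch of what a "tool" looks like in practice: just an ordinary function, plus a description the LLM can read. The function and its description below are hypothetical placeholders I made up for illustration, not part of any particular library.

import datetime

# A "tool" is just an ordinary function that our code runs on the LLM's behalf.
# Hypothetical example: an LLM cannot know the current time on its own,
# but this function can fetch it and hand the result back to the model.
def get_current_time() -> str:
    """Returns the current UTC date and time as an ISO 8601 string."""
    return datetime.datetime.now(datetime.timezone.utc).isoformat()

# The description is what the LLM actually "sees" when deciding whether to call the tool.
get_current_time_description = "Returns the current date and time in UTC."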

Unleashing the Power: LLMs Interacting with the Real World

The beauty of equipping LLMs with tools lies in their newfound ability to interact with things that are constantly changing. Think about it:

  • Fetching Real-Time Data

  • Interacting with different APIs

  • Managing Files (a small sketch of such a tool follows this list)

  • Automating Complex Workflows
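
To illustrate the "Managing Files" point above, here is a hedged sketch of what such a tool could look like. This is an illustrative example only; it is not part of the agent code shown later in this article.

# Hypothetical file-management tool: lets the agent save text it has generated.
def write_file(path: str, content: str) -> str:
    """Writes the given content to a file and reports what happened."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
        return f"Wrote {len(content)} characters to {path}"
    except OSError as error:
        return f"Failed to write {path}: {error}"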

Take Cursor, the AI-powered code editor, as an example. At its core it leverages an LLM, but its ability to edit code, refactor, and even generate new code snippets based on your project context comes from its internal "tools" that let it interact with your codebase. It's a prime example of how an AI agent can significantly enhance productivity in a specific domain.

Simplifying Agents: OpenAI's “Assistant Function Calling”

Creating these sophisticated AI agents might sound like something only a programmer could do, and traditionally, it did require a fair bit of coding. However, platforms like OpenAI have made this process significantly more accessible with features like Assistant function calling. Essentially, this allows you to describe functions (i.e. "tools") to the OpenAI Assistant. When you interact with the Assistant, it can intelligently determine if calling one of these functions would be helpful to answer your query. If so, it generates a JSON payload specifying which function to call and with what parameters. Your application then executes the function and provides the result back to the Assistant, which can then use this information to formulate its final response.

Think of it like this: instead of you explicitly telling the AI to use a specific tool, you simply ask your question, and the Assistant's internal reasoning, powered by function calling, figures out the best tool to use behind the scenes.
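
As a rough illustration of the flow described above, here is a minimal sketch using the tool-calling support in OpenAI's Chat Completions API. The get_weather function and its parameters are placeholders, and the Assistants API has a slightly different shape, so treat this as an approximation of the idea rather than a definitive recipe.

import json
from openai import OpenAI

client = OpenAI()

# Describe the tool so the model knows when it might help.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Gets the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in New York?"}],
    tools=tools,
)

# If the model decides a tool call would help, it returns the function name and
# JSON arguments; our code would run the function and send the result back
# in a follow-up message so the model can compose its final answer.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))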

While OpenAI's Assistants API has streamlined this process significantly, the underlying concept of equipping LLMs with tools is being explored across platforms. The exact implementation might differ, but the core idea remains the same: enabling LLMs to interact with the external world. Keep an eye out for similar features and functionalities emerging in Google's Gemini and Anthropic's Claude as the field continues to evolve. These platforms are constantly pushing the boundaries of what their models can do, and the ability to seamlessly integrate tools is a natural progression.
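
For instance, Anthropic's Claude API exposes a comparable tool-use mechanism. The sketch below reflects my understanding of that API at the time of writing; the model name is an assumption and may need updating against the current documentation.

import anthropic

client = anthropic.Anthropic()

# Tool definitions use a JSON schema, much like OpenAI's function calling.
response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model alias; check current docs
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Gets the current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    messages=[{"role": "user", "content": "What's the weather in New York?"}],
)

# When the model chooses to use a tool, the response contains a tool_use block.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)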

Building a Simple AI Agent with OpenAI

Now, let's dive into the coding and see how a basic AI agent comes to life using the OpenAI library. This example demonstrates the core principles we've been discussing. (see the code at the end of this article)

It might seem daunting at first, so let's break down what we need to code, step by step:

  1. Defining Tools:
    The code first defines a dictionary called available_tools. This dictionary holds the definitions of the tools our agent can use. Each tool has a name (‘get_weather’, ‘run_command’, etc.), a function (‘fn’) that actually performs the action, and a description that explains what the tool does. This ‘description’ is crucial because the LLM uses it to decide when and how to use the tool.

  2. The System Prompt:
    The system_prompt is the ‘brain’ of our agent's operation. It instructs the LLM on how to behave and enforces the "start, plan, action, observe" cycle, a common framework for building agents:

    • Start: The agent receives a user query.

    • Plan: The agent analyzes the query and decides on the steps needed to fulfill it, potentially including which tools to use.

    • Action: If a tool is deemed necessary, the agent selects the appropriate tool and provides the required input.

    • Observe: The agent receives the output from the tool call.

The prompt also specifies a JSON output format that the LLM must adhere to at each step. This structured output makes it easy for our code to parse the agent's reasoning and actions.

  3. The Conversation Loop:
    An outer while True loop keeps the agent running, waiting for user input. For each query:

    • The user's query is added to a list called messages, which keeps track of the conversation history.

    • The code then enters an inner while True loop to handle the multi-step interaction with the LLM.

    • The client.chat.completions.create() function sends the conversation history (including the system prompt and the user query) to the model (in this case, the OpenAI model gpt-4o).

    • The model responds with a JSON object that follows the format defined in the system_prompt.

    • The code parses this JSON output and acts accordingly:

      • If the step is "plan", it prints the agent's thought process.

      • If the step is "action", it extracts the function name and input, calls the corresponding function from available_tools, and then adds an "observe" step with the tool's output back to the conversation.

      • If the step is "output", it prints the final response to the user and breaks out of the inner loop to wait for the next user query.

This code, while simple, beautifully illustrates the core mechanism of how an LLM can be guided to use external tools to solve problems beyond its initial knowledge base. The system prompt acts as the director, and the structured JSON output allows for seamless communication between the LLM and the Python code that executes the tools.

The Future is ‘Agentic’

The development of AI agents is still in its early stages, but the potential is immense. As LLMs become even more sophisticated and tool integration becomes more seamless, we can expect to see agents that are capable of handling increasingly complex tasks autonomously. From personalized assistants that manage our schedules and finances to sophisticated problem-solving systems in various industries, the possibilities are truly exciting.

My journey into understanding AI agents has only just begun, and I'm eager to see how this field continues to evolve and shape the future of how we interact with technology. It's no longer just about intelligent language models; it's about giving them the means to act and interact with the world around us, and that's a prospect that truly sparks the imagination.

Agent Code

Below is the full code that creates the AI agent:

import json
import os

import requests
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# creating a client instance (reads OPENAI_API_KEY from the environment)
client = OpenAI()

# tool to run terminal commands
def run_command(command):
    result = os.system(command)
    return result

# tool to fetch the current weather using the wttr.in weather API
def get_weather(city: str):
    # print to console every time this tool is used
    print("🔨 Tool Called: get_weather", city)

    url = f"https://wttr.in/{city}?format=%C+%t"
    response = requests.get(url)

    if response.status_code == 200:
        return f"The weather in {city} is {response.text}."
    return "Something went wrong"

# dictionary of available tools, each with the function to call and a description of what it does
available_tools = {
    "get_weather": {
        "fn": get_weather,
        "description": "Takes a city name as an input and returns the current weather for the city"
    },
    "run_command": {
        "fn": run_command,
        "description": "Takes a command as input to execute on the system and returns the output"
    }
}

# system prompt which instructs the LLM to act as a helpful agent that can use tools
# if needed to resolve user queries
system_prompt = f"""
    You are a helpful AI Assistant who is specialized in resolving user queries.
    You work in start, plan, action, observe mode.
    For the given user query and available tools, plan the step by step execution. Based on the planning,
    select the relevant tool from the available tools, and based on the tool selection perform an action to call the tool.
    Wait for the observation and, based on the observation from the tool call, resolve the user query.

    Rules:
    - Follow the Output JSON Format.
    - Always perform one step at a time and wait for the next input
    - Carefully analyse the user query

    Output JSON Format:
    {{
        "step": "string",
        "content": "string",
        "function": "The name of function if the step is action",
        "input": "The input parameter for the function",
    }}

    Available Tools:
    - get_weather: Takes a city name as an input and returns the current weather for the city
    - run_command: Takes a command as input to execute on the system and returns the output

    Example:
    User Query: What is the weather of new york?
    Output: {{ "step": "plan", "content": "The user is interested in weather data of new york" }}
    Output: {{ "step": "plan", "content": "From the available tools I should call get_weather" }}
    Output: {{ "step": "action", "function": "get_weather", "input": "new york" }}
    Output: {{ "step": "observe", "output": "12 Degree Cel" }}
    Output: {{ "step": "output", "content": "The weather for new york seems to be 12 degrees." }}
"""

messages = [
    { "role": "system", "content": system_prompt }
]

# this is where we code the agent
while True:
    # get the query from the user
    user_query = input('> ')
    # and add it to the messages list as a user message
    messages.append({ "role": "user", "content": user_query })

    while True:
        # send the conversation so far to the model and ask for a JSON response
        response = client.chat.completions.create(
            model="gpt-4o",
            response_format={"type": "json_object"},
            messages=messages
        )
        # feed the output generated by the model back into the conversation for context
        parsed_output = json.loads(response.choices[0].message.content)
        messages.append({ "role": "assistant", "content": json.dumps(parsed_output) })

        # print to console if the step followed by the model is 'plan' to see its thinking
        if parsed_output.get("step") == "plan":
            print(f"🧠: {parsed_output.get('content')}")
            continue

        # if the step is 'action', look up the requested tool in available_tools
        # and call it with the input the model provided
        if parsed_output.get("step") == "action":
            tool_name = parsed_output.get("function")
            tool_input = parsed_output.get("input")

            if tool_name in available_tools:
                output = available_tools[tool_name]["fn"](tool_input)
                messages.append({ "role": "assistant", "content": json.dumps({ "step": "observe", "output": output }) })
                continue
        # break out of the inner loop if the step is 'output' and wait for the next user input
        if parsed_output.get("step") == "output":
            print(f"🤖: {parsed_output.get('content')}")
            break

Reference

  1. Gen AI Cohort by Hitesh Chaudhary and Piyush Garg

  2. Piyush's github repository for code
