🤖 Agents and Tokenizers in Generative AI (GenAI): Explained Simply


🤖 LLMs vs. Agents: Understanding the Difference (with a Simple Analogy)

In the world of AI, there's a lot of buzz around Large Language Models (LLMs) and Agents. But what's the real difference between the two?

Let's break it down in simple terms. 🧠💡

🧠 What is an LLM?

A Large Language Model (LLM), like GPT-4 or Claude, is essentially a powerful brain that has been trained on massive amounts of text data. It can:

  • Understand natural language

  • Generate human-like responses

  • Summarize, translate, and even write code

But here's the limitation:

LLMs do not have memory or the ability to access real-time data on their own.

Think of it like this:

โAn LLM is like a brain with no body. It knows a lot, but it canโ€™t act on anything in the real world.โž


🦾 What is an Agent?

An Agent is what happens when you give that brain a body.

An AI Agent is built on top of an LLM and is designed to:

  • Interact with real-time data (e.g., check the weather, stock prices)

  • Use tools/APIs

  • Remember previous interactions

  • Take actions based on goals

So in short:

โAn Agent is an LLM with tools, memory, and the ability to take actions.โž


🧩 Analogy: Brain vs. Assistant

  • LLM = Brain: Smart, well-read, but passive. Can answer questions but can't "do" anything.

  • Agent = Assistant with a Brain: Not only smart, but can also check your emails, book appointments, and tell you the weather in real time.


🆚 Key Differences

| Feature | LLM | Agent |
| --- | --- | --- |
| Access to real-time data | ❌ No | ✅ Yes |
| Has memory | ❌ No (unless extended) | ✅ Yes |
| Uses tools / APIs | ❌ No | ✅ Yes |
| Acts autonomously | ❌ No | ✅ Yes |
| Example | ChatGPT answering questions | AutoGPT planning and executing tasks |

💡 Why Does This Matter?

Understanding the distinction helps in:

  • Building smarter applications

  • Choosing the right tool for the job

  • Innovating in how AI is used (think AI assistants, automated workflows, etc.)

Example:

Ask an LLM for real-time weather data and you will notice that it asks you to do a Google search or check a weather website yourself.

An agent, on the other hand, will fetch the details for you.

Creating a weather agent:

from openai import OpenAI
from dotenv import load_dotenv
import os
import requests
import json

load_dotenv()

api_key = os.getenv("GOOGLE_GEMINI_API_KEY")
client = OpenAI(
    api_key=api_key,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

messages = []

def get_weather(city: str):
    # wttr.in returns a short plain-text weather summary, e.g. "Sunny +12°C"
    url = f"https://wttr.in/{city}?format=%C+%t"
    response = requests.get(url)
    if response.status_code == 200:
        return f"The weather in {city} is {response.text}"
    return "Something went wrong"

available_tools = {
    "get_weather": {
        "fn": get_weather,
        "description": "Takes a city name as input and returns the current weather for that city",
    }
}

system_prompt = """You are a powerful AI agent specialized in resolving user queries.
You work in start, plan, action, observe mode.
For the given user query and available tools, plan the step-by-step execution. Based on the planning,
select the relevant tool from the available tools, perform an action to call the tool,
wait for the observation, and based on the observation resolve the user query.

Rules:
- Follow the output JSON format.
- Always perform one step at a time and wait for the next input.
- Carefully analyse the user query.

Output JSON Format:
{
    "step": "string",
    "content": "string",
    "function": "the name of the function if the step is action",
    "input": "the input parameter for the function"
}

Available Tools:
- get_weather: Takes a city name as input and returns the current weather for that city

Example:
User Query: what is the weather of new york?
Output: {"step": "plan", "content": "The user is interested in weather data for new york"}
Output: {"step": "plan", "content": "From the available tools I should call get_weather"}
Output: {"step": "action", "function": "get_weather", "input": "new york"}
Output: {"step": "observe", "output": "12 degrees celsius"}
Output: {"step": "output", "content": "The weather in new york is 12 degrees celsius"}
"""

messages.append({"role": "system", "content": system_prompt})

user_query = input("Enter your query > ")
messages.append({"role": "user", "content": user_query})

while True:
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=messages,
        response_format={"type": "json_object"},
    )
    output = response.choices[0].message.content
    messages.append({"role": "assistant", "content": output})
    extracted_output = json.loads(output)

    if extracted_output.get("step") == "plan":
        print(extracted_output.get("content"))
        continue

    if extracted_output.get("step") == "action":
        tool_name = extracted_output.get("function")   # e.g. get_weather
        tool_input = extracted_output.get("input")     # e.g. the city name
        tool = available_tools.get(tool_name)
        if tool:
            # Call the tool once and feed the observation back to the model
            observation = tool["fn"](tool_input)
            messages.append({
                "role": "assistant",
                "content": json.dumps({"step": "observe", "output": observation}),
            })
        continue

    if extracted_output.get("step") == "output":
        print(extracted_output.get("content"))
        break

This is how you can create a basic weather agent.

Creating a mini Cursor that even supports voice input:

from openai import OpenAI
from dotenv import load_dotenv
import os
import json
import platform
import time

import speech_recognition as sr

load_dotenv()

api_key = os.getenv("GOOGLE_GEMINI_API_KEY")
client = OpenAI(
    api_key=api_key,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

recognizer = sr.Recognizer()
text = None  # will hold the transcribed voice input

with sr.Microphone() as source:
    print("🔧 Calibrating for background noise...")
    recognizer.adjust_for_ambient_noise(source, duration=2)
    print("✅ Calibration done.")

    print("🎤 You can speak after this countdown:")
    for i in range(3, 0, -1):
        print(i)
        time.sleep(1)

    print("🎙️ Listening now (speak up to 30 seconds)...")

    try:
        audio = recognizer.listen(
            source,
            timeout=60,               # wait up to 60s for the user to start talking
            phrase_time_limit=30      # let them speak up to 30s max
        )

        print("🔍 Recognizing...")
        text = recognizer.recognize_google(audio)
        print("✅ You said:", text)

    except sr.WaitTimeoutError:
        print("⌛ Timeout: No speech detected within the time window.")
    except sr.UnknownValueError:
        print("❌ Could not understand the audio.")
    except sr.RequestError as e:
        print(f"❌ Recognition error: {e}")


# Stop early if speech recognition failed, otherwise `text` would be undefined below
if text is None:
    raise SystemExit("No usable voice input, exiting.")

messages = []


def run_command(command: str):
    # Print the host OS so you can tell which platform the command runs on
    print(platform.system())
    # os.system returns the command's exit status (0 means success)
    return os.system(command)

available_tools = {
    "run_command": {
        "fn": run_command,
        "description": "Takes a shell command from the user and executes it",
    }
}
system_prompt = """You are a powerful AI agent specialized in resolving user queries.
You work in start, plan, action, observe mode.
For the given user query and available tools, plan the step-by-step execution. Based on the planning,
select the relevant tool from the available tools, perform an action to call the tool,
wait for the observation, and based on the observation resolve the user query.

Rules:
- Follow the output JSON format.
- Always perform one step at a time and wait for the next input.
- Carefully analyse the user query.

Output JSON Format:
{
    "step": "string",
    "content": "string",
    "function": "the name of the function if the step is action",
    "input": "the input parameter for the function"
}

Available Tools:
- run_command: Takes a command and executes it on a Windows or Linux system according to the system configuration

Example:
User Query: create a folder named demo
Output: {"step": "plan", "content": "The user wants a folder named demo to be created"}
Output: {"step": "plan", "content": "From the available tools I should call run_command"}
Output: {"step": "action", "function": "run_command", "input": "mkdir demo"}
Output: {"step": "observe", "output": "0"}
Output: {"step": "output", "content": "The folder demo has been created"}
"""
messages.append({"role": "system", "content": system_prompt})

user_query = text
messages.append({"role": "user", "content": user_query})
while True:
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=messages,
        response_format={"type": "json_object"},
    )
    output = response.choices[0].message.content
    messages.append({"role": "assistant", "content": output})
    extracted_output = json.loads(output)

    if extracted_output.get("step") == "plan":
        print(extracted_output.get("content"))
        continue

    if extracted_output.get("step") == "action":
        tool_name = extracted_output.get("function")   # e.g. run_command
        tool_input = extracted_output.get("input")     # e.g. the shell command
        tool = available_tools.get(tool_name)
        if tool:
            # Call the tool once and feed the observation back to the model
            observation = tool["fn"](tool_input)
            messages.append({
                "role": "assistant",
                "content": json.dumps({"step": "observe", "output": observation}),
            })
        continue

    if extracted_output.get("step") == "output":
        print(extracted_output.get("content"))
        break

How to run an LLM locally:

🔧 1. Key Differences: OpenAI vs. Open-Source LLMs

| Feature | OpenAI (like GPT) | Open-Source LLMs (e.g., LLaMA, Mistral) |
| --- | --- | --- |
| Hosting | Cloud/API only | Local or server |
| Fine-tuning | ⚠️ Limited (API-based, selected models) | ✅ Fully supported |
| Inference | ✅ Supported via API | ✅ Supported locally and via API |
| Privacy | ❌ Depends on API | ✅ Full local control |
| Cost | 💸 Pay-per-use | ✅ Free (one-time compute cost) |

🚀 2. What is Ollama?

Ollama is a CLI tool and runtime that lets you run LLMs (like LLaMA2, Mistral, Gemma, etc.) locally on your machine with just a few commands.

✅ Benefits of Ollama:

  • Runs entirely on your CPU or GPU

  • Easy to install and use

  • Supports many open-source models

  • Lightweight models like llama2:7b or mistral:7b work well even on mid-range laptops

Running Ollama with Docker Compose:

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - '11434:11434'
    volumes:
      - models:/root/.ollama/models

volumes:
  models:

Start the container with:

docker compose up

Using it in code

from fastapi import FastAPI, Body
from ollama import Client

app = FastAPI()
client = Client(
    host="http://localhost:11434"  # our Ollama container runs here
)
client.pull("gemma3:1b")  # pull this model from the Ollama registry

@app.post("/chat")
def chat(message: str = Body(..., description="Chat Message")):
    response = client.chat(model="gemma3:1b", messages=[
        {"role": "user", "content": message}
    ])
    return response["message"]["content"]

Install uvicorn and start the server (the code above is assumed to live in a file named ollama_api.py):

pip install uvicorn

uvicorn ollama_api:app --port 8000
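
Once the server is running, you can test the endpoint from Python. A minimal smoke-test sketch, assuming the server is on localhost port 8000 as started above:

import requests

# Since the endpoint declares `message: str = Body(...)`, the request body
# is a bare JSON string rather than a JSON object.
resp = requests.post(
    "http://localhost:8000/chat",
    json="Hello, who are you?",
)
print(resp.json())  # the model's reply text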

🔧 Basics of Fine-Tuning Transformers: Tokenizer and AutoModelForCausalLM Explained

Fine-tuning a transformer model sounds complex, but if you break it down into its key components, it's totally manageable, even exciting!

In this post, we'll cover the core concepts of fine-tuning a language model using Hugging Face Transformers, focusing on:

  • What is a Tokenizer?

  • What is AutoModelForCausalLM?

  • How they work together during fine-tuning


📚 What is Fine-Tuning in NLP?

Fine-tuning is the process of taking a pre-trained model (like GPT-2, LLaMA, or Falcon) and training it further on your specific dataset so it learns the patterns and context related to your task.

This could be:

  • A chatbot for your domain

  • A code generator

  • A language-specific assistant

  • Custom text completions


🔤 Tokenizer: Your Input Translator

Before we feed data to a model, we need to convert text into numbers (tokens). That's what a tokenizer does.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

✅ Why It's Important:

  • Ensures input is in the right format for the model

  • Handles special tokens (like padding, EOS, BOS)

  • Must match the model architecture (e.g., the GPT-2 tokenizer for a GPT-2 model)

Think of the tokenizer as a translator that converts human text into a format the model understands.
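
To make this concrete, here is a minimal round-trip sketch (shown with the GPT-2 tokenizer; the exact ids depend on the model's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode: text -> token ids (a list of integers, one per token)
ids = tokenizer("Hello, how are you?")["input_ids"]
print(ids)

# Decode: token ids -> text, so the mapping is reversible
print(tokenizer.decode(ids))  # Hello, how are you?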

🧠 AutoModelForCausalLM: Predict the Next Word

Once text is tokenized, it's fed into a causal language model, which is a type of model that predicts the next token in a sequence.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
outputs = model(**inputs)

  • Causal LM = "left to right" prediction

  • Used for text generation tasks

  • Common models: GPT2, LLaMA, Mistral, Falcon, etc.
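
Because the model is just a next-token predictor, you can inspect that prediction directly. A minimal single-step, greedy-decoding sketch (GPT-2 used purely for illustration):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# The scores at the last position rank every vocabulary token as a candidate next token
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))  # the greedy pick, e.g. " Paris"

Full text generation is just this step in a loop: append the chosen token to the input and predict again.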

Complete code:

!pip install transformers

import os
os.environ["HF_TOKEN"] = "Your Hugging Face token"  # needed for gated models like Gemma

model_name = "google/gemma-3-1b-it"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)  # downloads the tokenizer files (e.g. tokenizer.json) from Hugging Face

print(tokenizer("hello how are you"))  # token ids plus attention mask
print(tokenizer.get_vocab())           # the full token-to-id mapping (very large)
input_tokens = tokenizer("hello how are you")["input_ids"]

from transformers import AutoModelForCausalLM  # this is basically a next-token predictor
import torch
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

from transformers import pipeline
gen_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(gen_pipeline("hey there"))
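
Since the post is about the building blocks of fine-tuning, here is a hedged sketch of what an actual fine-tuning run looks like with the Hugging Face Trainer API. It uses GPT-2 and a two-example toy dataset purely for illustration; a real run needs a proper dataset (and ideally a GPU). Assumes transformers and datasets are installed:

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy dataset: in practice you would load thousands of domain-specific examples
texts = ["Hello, how can I help you today?", "The weather in Delhi is sunny."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True,
    remove_columns=["text"],
)

# For causal LM fine-tuning the labels are the inputs shifted by one position;
# this collator sets that up (mlm=False means plain next-token prediction)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./gpt2-finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()

After training, model.save_pretrained("./gpt2-finetuned") gives you checkpoint files you can load back with AutoModelForCausalLM.from_pretrained.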

Conclusion:

So this was the basic flow for agents, local LLMs, and tokenizers. Feel free to comment if you have any doubts.
