Agents, LLMs, and APIs: A Developer’s Guide to Local AI & Cloud Deployment


LLM vs AI Model

LLM: Language Models, particularly Large Language Models (LLMs) like GPT-3 or GPT-4, are designed specifically for processing and generating human language. They are trained on vast amounts of text data and can perform a wide range of language-related tasks, such as translation, summarization, question answering, and text generation.

AI Model: An AI model is any computational model designed to perform tasks that would normally require human intelligence. This covers a wide array of tasks beyond language processing, such as image recognition, decision-making, prediction, and more.

LLMs are a subset of AI models.


LLM Models / AI Models → AI Agents

Analogy: AI models are just a brain with pre-trained data to work on; all they do is predict the next thing based on your input. On their own, they can't act or build anything.

Body analogy: AI MODELS are just the brain. Give that brain hands and legs and you get AI AGENTS, which can actually act and build things.


To convert models into agents:

  • We can provide functions (tools) to the model and instruct it to use them when needed. This is how we give it the capability to act, beyond just predicting output from its pre-trained data.

  • With just a model, you can't ask for the current weather. But if you provide a function (one that calls a weather API), the model can tell you the weather too (done using the agent capability you provided), while still making use of its pre-trained abilities, like presenting that data in Fahrenheit (done using its model capability). A minimal sketch of this idea is shown below.
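Here is a minimal, hypothetical Python sketch of that weather example. The names ask_model and get_weather are placeholders I made up to illustrate the loop, not a real library API:

    # Hypothetical sketch: giving a "model" a tool so it behaves like a tiny agent.
    # ask_model and get_weather are placeholders, not a real SDK.

    def get_weather(city: str) -> dict:
        # Agent capability: in a real agent this would call a weather API.
        return {"city": city, "temp_c": 21}

    def ask_model(prompt: str) -> str:
        # Placeholder for an LLM call (OpenAI, Ollama, etc.).
        # Pretend the model answers with either plain text or a tool request.
        return 'TOOL:get_weather("Delhi")'

    reply = ask_model("What's the weather in Delhi, in Fahrenheit?")
    if reply.startswith("TOOL:get_weather"):
        data = get_weather("Delhi")               # the function we provided (agent capability)
        temp_f = data["temp_c"] * 9 / 5 + 32      # presenting the data, as the model would (model capability)
        print(f"It is {temp_f:.1f}°F in {data['city']}")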

For example:

GPT (from OpenAI) is a model.

ChatGPT is an agent that works on top of OpenAI's models; Perplexity is another agent you might know.

To give an example of a functionality ChatGPT has but the underlying model does not: "Search the web" is a capability added to ChatGPT on top of the model.


LLM Categorization - OPEN SOURCE vs CLOSED:

  • LLMs are categorized like software: open-source ones and closed ones.

  • Open-source AI models: Llama, Gemma, DeepSeek, Mistral, Falcon, etc.

  • Closed ones: OpenAI's GPT models, Gemini, etc.

  • You can read an article comparing open-source and closed LLMs here.

Closed models can only be used for inferencing (using them), but open-source models can be fine-tuned as well as used for inferencing.

Open-source models can be run locally and fine-tuned according to our needs.


Running Models LOCALLY:

  • For now, there are basically two ways:

    1. using OLLAMA: It is a developer-friendly wrapper around open-source LLMs (like LLaMA, Mistral, etc.) that:

      • Runs models locally, with model packaging inspired by Docker (a Modelfile instead of a Dockerfile)

      • Has super simple CLI and API

      • Makes using LLMs one-line easy

      • 🔧 How it works?

        To run a model from the terminal:

          ollama run mistral

        Or call the local API from code (Ollama serves it on port 11434):

          // to use in code
          const res = await fetch('http://localhost:11434/api/generate', {
            method: 'POST',
            body: JSON.stringify({ model: 'mistral', prompt: 'Hello!', stream: false }),
          });
          const data = await res.json();
        
    2. using Hugging Face: It is a platform and library (🤗 Transformers) that hosts thousands of open-source ML models, from LLMs like LLaMA, Mistral, and GPT-J to image models, audio models, and more.

      You can:

      • Download models locally

      • Run them with PyTorch / TensorFlow / ONNX / Transformers

      • Customize them

      • Fine-tune them on your data

        🧠 Local Usage

        To use a model locally from Hugging Face:

      • You need Python

      • Usually you install libraries like:

          pip install transformers torch

          # load a model and generate text
          from transformers import pipeline
          generator = pipeline("text-generation", model="gpt2")
          print(generator("Hello, world")[0]["generated_text"])
        

Difference: Ollama & Hugging Face - WHICH ONE TO USE WHEN?

Hugging Face can be used for both inferencing and fine-tuning, but Ollama can only be used for inferencing.

So why even use Ollama locally, if we can't fine-tune with it and our main motive for running models locally is to fine-tune them? Here are the cases where Ollama is useful:

  • Running open-source LLMs locally with almost zero config.

  • Building apps or tools for yourself or in closed environments.

  • Useful for:

    • Desktop chatbots

    • Prototyping

    • Local API testing

    • Offline usage

But not for production deployment unless you use Ollama as a base layer + Dockerize it yourself and host it (a rough sketch of that is shown below).
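If you do go that route, Ollama publishes an official Docker image. Roughly (check the current Ollama docs for the exact flags), running it looks like this:

    # run the official Ollama image, persisting downloaded models in a named volume
    docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

    # then pull/run a model inside that container
    docker exec -it ollama ollama run mistral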

And Why Hugging Face?

  • Customization

  • Fine-tuning

  • Inferencing

  • Running models

  • Deploying via:

    • Your own API

    • Docker containers

    • Cloud GPUs (AWS/GCP/etc.)

Hugging Face is like the entire lab — research, experiments, training, serving models, publishing, everything.


CONCEPTS Shown in Class:

Docker:

"Docker is like a Windows application, just that it doesn't need any machine to run on..."

Almost, Docker does need a machine (called a host), but it doesn’t depend on that machine’s environment. Think of Docker like this:

  • It brings its own environment (its own OS, libraries, configs)

  • It runs on top of any base (your PC, a server, your friend’s laptop, cloud)

  • The host system just needs Docker installed — nothing else!

"Docker doesn't touch any files on your system..."

Yes, unless you intentionally connect it to your system files (via volumes). By default, it's isolated 🔒
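For example, a volume mount is what explicitly connects a host folder to the container (both paths below are just placeholders):

    # mount a host folder into the container (paths are hypothetical)
    docker run -v /home/me/data:/app/data my-image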

"Docker is a portable app with its own system packed in..."

It’s like putting an entire app + its own mini Linux inside a carry-on backpack 🧳 and then running it anywhere.

"Docker just needs a base to sit and setup its stuff to run that application..."

Docker containers are platform-independent because of this.

"The base can be your computer / Cloud / Friends computer..."

Yes — anywhere Docker is installed.

How it works:

  1. Create a Dockerfile (it contains the base OS, dependencies, the command to run, etc.)

  2. Build the Docker image: docker build -t my-cool-app .

  3. Export the image, then share it wherever you want: with a friend, on the cloud…

    Since the image is a self-contained archive with everything packed inside it, Docker is portable anywhere. A minimal example Dockerfile is sketched below.
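To make step 1 concrete, here is a minimal sketch of a Dockerfile for a simple Python app (main.py and requirements.txt are assumed file names, purely for illustration):

    # Dockerfile - minimal illustrative sketch
    FROM python:3.11-slim                    # base OS + Python
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install -r requirements.txt      # dependencies
    COPY . .
    CMD ["python", "main.py"]                # command to run

You would then build it with docker build -t my-cool-app . and can export it with docker save -o my-cool-app.tar my-cool-app, so the file can be shared and run anywhere Docker is installed.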

Ollama is not for production unless you use [ Ollama as a base layer + Dockerize it yourself and host it ]


💡
Colab and Hugging Face both require Python. Fine-tuning is done using Python libraries like transformers, datasets, accelerate, etc. You cannot fine-tune models with JavaScript.
💡
Fine-tuning = Python (Colab or local with GPU) || Inference = JavaScript can be used, but limited

Ways to use LOCAL MODELS, FINE-TUNE them, and deploy them to get an API which can be used in our Products / Projects:

This is kind of a summary of what was explained in the last part of the class, since the next topic is fine-tuning.

There are two ways:

  1. Locally (if you have a good GPU)

  2. Using services like Google Colab (if you don't have a good GPU)

Now, Steps to do it Locally:

  1. Download model locally using Hugging Face on your desktop / laptop,

  2. Load it into Python (e.g., with transformers)

  3. Fine tune it

  4. Wrap it using FastAPI or Flask

  5. This gives you a local API on http://localhost:8000. You can use this API in your project to call your own local model, fine-tuned by you! (A sketch of such a wrapper is shown after this list.)

  6. But remember, this model is only available locally, so you can't use it online yet. To use it online, there are two ways:

    1. Use ngrok (for quick test APIs)

    2. Cloud deploy (explained below)
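For steps 4-5, a minimal sketch of such a FastAPI wrapper could look like this (the folder name my-finetuned-model, the route, and the field names are assumptions for illustration, not a fixed convention):

    # main.py - minimal sketch of wrapping a local model with FastAPI
    # assumes: pip install fastapi uvicorn transformers torch
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import pipeline

    app = FastAPI()
    generator = pipeline("text-generation", model="./my-finetuned-model")  # your fine-tuned folder

    class Prompt(BaseModel):
        text: str

    @app.post("/generate")
    def generate(prompt: Prompt):
        result = generator(prompt.text, max_new_tokens=100)
        return {"output": result[0]["generated_text"]}

    # run with: uvicorn main:app --port 8000

For the quick-test option above, something like ngrok http 8000 can temporarily expose this local API to the internet.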

Steps to do it on Google Colab:

Google Colab works on sessions; it deletes your data after about 90 minutes of inactivity, so save your model before the session ends.

  1. Fine-tune on Colab using Python + Hugging Face

  2. Save the model (model.save_pretrained(...))

  3. Download it, or upload it to the Hugging Face Hub or Google Drive (see the sketch after this list)

  4. Move it to your local machine/project

  5. Wrap it in FastAPI (just like above)

  6. Now you again have an API, but it's not accessible everywhere; for that we need cloud deployment
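To make steps 1-3 a little more concrete, here is a heavily simplified sketch of the Colab side using the Hugging Face libraries (gpt2 and my_data.txt are placeholder choices; real fine-tuning needs more care with data preparation, hyperparameters, and GPU memory):

    # Rough sketch of fine-tuning in Colab with Hugging Face (illustrative only)
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # my_data.txt is a hypothetical training file you would upload to Colab
    dataset = load_dataset("text", data_files={"train": "my_data.txt"})
    tokenized = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                            batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2),
        train_dataset=tokenized["train"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

    # step 2: save, then step 3: download it or push it to the Hugging Face Hub
    model.save_pretrained("my-finetuned-model")
    tokenizer.save_pretrained("my-finetuned-model")
    # model.push_to_hub("your-username/my-finetuned-model")  # needs a Hugging Face token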

Train in Colab (Python) ✅
         ⬇️
Download model locally ✅
         ⬇️
Create API using FastAPI ✅
         ⬇️
EITHER: Run locally for testing
   OR: Package in Docker & deploy to AWS

CLOUD DEPLOYMENT OF Fine-Tuned LLMs:

2 ways again: with or without Docker

  1. WITHOUT DOCKER: You can deploy your model directly on AWS with a Python server (FastAPI/Flask); it will give you an API which you can use in your projects.

  2. WITH DOCKER : Docker packages everything (your model, FastAPI code, dependencies) into an image.

    This image can be run anywhere — your PC, AWS, GCP, etc.

Inside the Docker image:

  • Your fine-tuned model (either saved in a folder or loaded from Hugging Face Hub)

  • Your API server (e.g., main.py using FastAPI)

  • A Dockerfile that defines the environment setup

Once it's Dockerized:

  • Push the Docker image to Docker Hub or AWS ECR

  • Deploy it to AWS EC2, Lambda (with container), or ECS

You’ll get a public IP or API Gateway endpoint, and now anyone can call your model API.
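As a rough sketch of that flow using Docker Hub (the image and account names below are made up; for AWS ECR the push URL would come from your AWS account instead):

    # build the image from your Dockerfile
    docker build -t my-model-api .

    # push it to a registry (Docker Hub here; your-username is a placeholder)
    docker tag my-model-api your-username/my-model-api:latest
    docker push your-username/my-model-api:latest

    # on the server (e.g., an EC2 instance with Docker installed)
    docker pull your-username/my-model-api:latest
    docker run -d -p 80:8000 your-username/my-model-api:latest

The -p 80:8000 mapping assumes the container serves the FastAPI app on port 8000, as in the earlier sketch.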

| Without Docker | With Docker |
| --- | --- |
| You install Python and dependencies manually on the server | You define a Dockerfile once, and it runs the same everywhere |
| Issues with versions, paths, environments | Everything is bundled in a container, no surprises |
| Harder to maintain, scale, or move | Portable, reproducible, and deployable |

Docker ensures:

  • You can run your model + server (e.g., FastAPI) as a self-contained unit

  • Same container works on your machine, a friend's machine, or AWS

  • You don't need to worry about the server’s environment

So Docker is not a must, but a huge help especially in real-world deployment.


Summaries:

Hugging Face vs Ollama

Deployment Stack

Overall:

Yes, you can fine-tune a model in Google Colab, export it, and deploy it as your own API using Docker. Once Dockerized, the model is portable and can be hosted anywhere (AWS, GCP, etc). The frontend (like React on Vercel) can call your API for live responses from your model.


Conclusion:

I have just tried to explain what I understood. I might be wrong, or the information might be incomplete, but I thought I should share this after covering these topics. I hope it adds something to your knowledge.

I have only explained the theoretical overview, not the hands-on "how to do it". I am sure you can find that once you know what to use where and which approach to take; knowing which option to pick is often the important part.


Credits:

I am very grateful to ChaiCode for providing all this knowledge, insight, and deep learning about AI: Piyush Garg, Hitesh Choudhary.

If you want to learn too, you can Join here → Cohort || Apply from ChaiCode & Use NAKUL51937 to get 10% off


Thanks:

Feel free to comment your thoughts; I love hearing feedback and improving. This is my second article :)

Thank you for giving your precious time and reading this article.


Connect:

Let’s learn something Together: LinkedIn

If you would like, you can check out my Portfolio.

