Agents, LLMs, and APIs: A Developer’s Guide to Local AI & Cloud Deployment

Table of contents
- LLM vs AI Model
- LLM Models / AI Models → AI Agents
- To convert Models to Agents
- LLM Categorization - OPEN SOURCE vs CLOSED
- Running Models LOCALLY
- 🔧 How it works?
- 🧠 Local Usage
- Difference: Ollama & Hugging Face - WHICH ONE TO USE WHEN?
- CONCEPTS Shown in Class
- Ways to use LOCAL MODELS for FINE-TUNING and Deployment, to get an API for our Products / Projects

LLM vs AI Model
LLM: Language models, particularly Large Language Models (LLMs) like GPT-3 or GPT-4, are designed specifically for processing and generating human language. They are trained on vast amounts of text data and can perform a wide range of language-related tasks, such as translation, summarization, question answering, and text generation.
AI Model: An AI model is any computational model designed to perform tasks that would normally require human intelligence. This includes a wide array of tasks beyond language processing, such as image recognition, decision making, prediction, and more.
LLMs are a subset of AI models.
LLM Models / AI Models → AI Agents
Analogy: AI models are just a brain with pre-trained data to work on. All they do is predict the next thing based on your input; on their own, they can't act or build anything.
Body analogy: AI MODELS ARE JUST THE BRAIN. Hands and legs along with the brain make an AI AGENT, which can actually function and build things.
To convert Models to Agents:
We can provide functions to a model and instruct it to use them when needed. This is how we give it the capability to act, instead of just predicting output based on its pre-trained data.
With just a model, you can't ask what the current weather is. But if you provide a function (one that calls a weather API), you can make it tell you the weather too (that part uses the agent capabilities you provided), while still making use of its pre-trained qualities, like showing that data in Fahrenheit (that part uses its model capability).
For example:
OpenAI's GPT-4 is a model.
ChatGPT is an agent that works on OpenAI's models; Perplexity is another agent you might know.
An example of functionality that ChatGPT has but the underlying model does not: "Search the web" is a functionality added to ChatGPT. A minimal sketch of the weather-agent idea follows.
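Here is a small, framework-agnostic sketch of that weather example in Python. The `ask_llm` helper is hypothetical (it stands in for whatever model API you use, such as OpenAI or Ollama), and wttr.in is just one public weather service used for illustration. The point is the loop: the model decides it needs the tool, your code runs it, and the result goes back to the model.

```python
import json
import urllib.request

def get_weather(city: str) -> str:
    """The 'hands': a real action the model cannot do alone.
    (wttr.in is a public weather service, used here for illustration.)"""
    with urllib.request.urlopen(f"https://wttr.in/{city}?format=%t") as resp:
        return resp.read().decode().strip()

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for your model call (OpenAI, Ollama, etc.)."""
    raise NotImplementedError

def weather_agent(question: str) -> str:
    # 1. Instruct the model to request the tool when it needs live data.
    plan = ask_llm(
        'Reply ONLY with JSON like {"tool": "get_weather", "city": "..."} '
        f"if live weather is needed. Question: {question}"
    )
    call = json.loads(plan)               # the model decided to use the tool
    reading = get_weather(call["city"])   # agent capability: run the function
    # 2. Model capability: format/convert the raw reading (e.g., Fahrenheit).
    return ask_llm(f"The temperature is {reading}. "
                   f"Answer '{question}', converting to Fahrenheit.")
```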
LLM Categorization - OPEN SOURCE vs CLOSED:
LLMs are also categorized like software: open-source and closed ones.
Open-source AI models: Llama, Gemma, DeepSeek, Mistral, Falcon, etc.
Closed ones: OpenAI's GPT models, Gemini, etc.
You can read more about open-source vs. closed LLMs here.
Closed models can only be used for inferencing (i.e., using them), but open-source models support fine-tuning as well as inferencing.
Open-source models can be run locally and fine-tuned according to our needs.
Running Models LOCALLY:
For now, I understood basically two ways.
Using OLLAMA: It is like a developer-friendly wrapper around open-source LLMs (like LLaMA, Mistral, etc.) that:
- Runs models locally, managed in a Docker-style way (pull, run, share)
- Has a super simple CLI and API
- Makes using LLMs one-line easy
🔧 How it works?
To run a model from the CLI:

```bash
ollama run mistral
```

Or, to use it in code:

```js
const res = await fetch('http://localhost:11434/api/generate', { ... });
```
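For instance, here is a small sketch of calling that same endpoint from Python. Ollama's local server listens on port 11434 by default, and the example assumes you have already run `ollama pull mistral`:

```python
import json
import urllib.request

# Ollama's local REST API: POST a prompt, get a completion back.
payload = json.dumps({
    "model": "mistral",   # assumes the model was already pulled
    "prompt": "Why is the sky blue?",
    "stream": False,      # return one JSON object instead of a stream
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```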
Using HuggingFace: It is a platform and library (🤗 Transformers) that hosts thousands of open-source ML models: LLMs like LLaMA, Mistral, and GPT-J, plus image models, audio models, etc.
You can:
- Download models locally
- Run them with PyTorch / TensorFlow / ONNX / Transformers
- Customize them
- Fine-tune them on your data
🧠 Local Usage
To use a model locally from Hugging Face:
You need Python.
Usually you install libraries like:

```bash
pip install transformers torch
```

Then load a model:

```python
from transformers import pipeline

model = pipeline("text-generation", model="gpt2")
```
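Once loaded, generating text is one call. A quick sketch (`max_new_tokens` just caps the output length):

```python
out = model("Hello, I'm a language model,", max_new_tokens=30)
print(out[0]["generated_text"])
```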
Difference: Ollama & Hugging Face - WHICH ONE TO USE WHEN?
Hugging Face can be used for both inferencing and fine-tuning, but Ollama can only be used for inferencing.
So why even use Ollama locally, if we can't fine-tune with it, when our main motive for running models locally is to be able to fine-tune them? Here are the cases where Ollama is useful:
- Running open-source LLMs locally with almost zero config.
- Building apps or tools for yourself or in closed environments.
- Useful for:
  - Desktop chatbots
  - Prototyping
  - Local API testing
  - Offline usage
But not for production deployment unless you use Ollama as a base layer + Dockerize it yourself and host it.
And Why Hugging Face?
- Customization
- Fine-tuning
- Inferencing
- Running models
- Deploying via:
  - Your own API
  - Docker containers
  - Cloud GPUs (AWS/GCP/etc.)
Hugging Face is like the entire lab: research, experiments, training, serving models, publishing, everything.
CONCEPTS Shown in Class:
Docker:
"Docker is like a Windows application, just that it doesn't need any machine to run on..."
Almost! Docker does need a machine (called a host), but it doesn't depend on that machine's environment. Think of Docker like this:
- It brings its own environment (its own OS layer, libraries, configs)
- It runs on top of any base (your PC, a server, your friend's laptop, the cloud)
- The host system just needs Docker installed, nothing else!
"Docker doesn't touch any files on your system..."
Yes, unless you intentionally connect it to your system files (via volumes). By default, it's isolated 🔒
"Docker is a portable app with its own system packed in..."
It’s like putting an entire app + its own mini Linux inside a carry-on backpack 🧳 and then running it anywhere.
"Docker just needs a base to sit and setup its stuff to run that application..."
Docker containers are platform-independent because of this.
"The base can be your computer / Cloud / Friends computer..."
Yes — anywhere Docker is installed.
How it works:
1. Create a Dockerfile (it contains the base OS, dependencies, the command to run, etc.)
2. Build the Docker image: `docker build -t my-cool-app .`
3. Export the image, then share it wherever you want: to a friend, on the cloud...
Since an image is like a zip file (one shareable archive), Docker is portable anywhere, with all its requirements packed inside it. The commands for these steps are sketched below.
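As a concrete sketch (assuming Docker is installed and you are in a folder containing a `Dockerfile`; `my-cool-app` is just an example name):

```bash
# Build an image from the Dockerfile in the current directory (.)
docker build -t my-cool-app .

# Export the image to a single archive file you can share
docker save my-cool-app > my-cool-app.tar

# On any other machine with Docker: load and run it
docker load < my-cool-app.tar
docker run my-cool-app
```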
Reminder: Ollama is not for production unless you use it as a base layer, Dockerize it yourself, and host it. Also, fine-tuning requires Python libraries like `transformers`, `datasets`, `accelerate`, etc. You cannot fine-tune models with JavaScript.

Ways to use LOCAL MODELS for FINE-TUNING and Deployment, to get an API for our Products / Projects:
This is a kind of summary of what was explained in the last part of class, since the next topic is fine-tuning.
There are two ways:
1. Locally (if you have a good GPU)
2. Using services like Google Colab (if you don't have a good GPU)
Now, the steps to do it locally:
1. Download the model locally using Hugging Face on your desktop / laptop.
2. Load it into Python (e.g., with `transformers`).
3. Fine-tune it.
4. Wrap it using FastAPI or Flask (a minimal sketch follows below).
This gives you a local API on `http://localhost:8000`; you can use this API in your project to call your own local model, fine-tuned by you! But remember, this model is only available locally, so you can't use it online yet. To use it online there are two ways:
1. Use ngrok (for quick test APIs)
2. Cloud deploy (explained below)
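Here is a minimal sketch of the FastAPI wrapper from step 4, assuming a fine-tuned model saved in a local folder (`./my-finetuned-model` and the endpoint name are placeholders):

```python
# main.py: a tiny API around a local Hugging Face model
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Load the fine-tuned model from a local folder (path is a placeholder)
generator = pipeline("text-generation", model="./my-finetuned-model")

@app.get("/generate")
def generate(prompt: str):
    out = generator(prompt, max_new_tokens=50)
    return {"text": out[0]["generated_text"]}
```

Run it with `uvicorn main:app --port 8000`; for a quick online test, `ngrok http 8000` gives you a temporary public URL.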
Steps to do it on Google Colab:
Google Colab works on sessions; it deletes your data after about 90 minutes of inactivity.
1. Fine-tune on Colab using Python + Hugging Face.
2. Save the model (`model.save_pretrained(...)`, sketched below).
3. Download it, or upload it to the Hugging Face Hub or Google Drive.
4. Move it to your local machine/project.
5. Wrap it in FastAPI (just like above).
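A quick sketch of steps 2-3. Here `gpt2` stands in for your fine-tuned model, the folder name is a placeholder, and pushing to the Hub requires a Hugging Face account plus `huggingface-cli login`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# In Colab, after training (gpt2 stands in for your fine-tuned model)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Save both model and tokenizer into one folder
model.save_pretrained("my-finetuned-model")
tokenizer.save_pretrained("my-finetuned-model")

# Option A: upload to the Hugging Face Hub (needs `huggingface-cli login`)
# model.push_to_hub("your-username/my-finetuned-model")

# Option B: zip the folder and download it, or copy it to Google Drive
```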
Now again you have an API, but it's not accessible everywhere; for that we need cloud deployment.
Train in Colab (Python) ✅
⬇️
Download model locally ✅
⬇️
Create API using FastAPI ✅
⬇️
EITHER: Run locally for testing
OR: Package in Docker & deploy to AWS
CLOUD DEPLOYMENT OF Fine-Tuned LLMs:
Two ways again: with or without Docker.
WITHOUT DOCKER: You can directly deploy your model to AWS with a Python server (FastAPI/Flask); it will give you an API endpoint, which you can use in your projects.
WITH DOCKER: Docker packages everything (your model, FastAPI code, dependencies) into an image.
This image can be run anywhere: your PC, AWS, GCP, etc.
Inside the Docker image:
- Your fine-tuned model (either saved in a folder or loaded from the Hugging Face Hub)
- Your API server (e.g., `main.py` using FastAPI)
- A `Dockerfile` that defines the environment setup (strictly speaking, the Dockerfile is the recipe used to build the image); a minimal example follows.
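For illustration, a minimal `Dockerfile` for the FastAPI server sketched earlier might look like this (the file name `main.py` and the saved-model folder are carried over from that sketch, not fixed requirements):

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Install the API server and model dependencies
RUN pip install fastapi uvicorn transformers torch

# Copy in the API code and the fine-tuned model folder
COPY main.py .
COPY my-finetuned-model ./my-finetuned-model

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```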
Once it's Dockerized:
- Push the Docker image to Docker Hub or AWS ECR (example commands below)
- Deploy it to AWS EC2, Lambda (with a container image), or ECS
You'll get a public IP or an API Gateway endpoint, and now anyone can call your model API.
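The Docker Hub route, for example, is just a tag and a push (`your-username` is a placeholder; run `docker login` first):

```bash
# Tag the local image with your registry username, then push it
docker tag my-cool-app your-username/my-cool-app:latest
docker push your-username/my-cool-app:latest

# On the server (e.g., an EC2 instance with Docker installed):
docker run -d -p 8000:8000 your-username/my-cool-app:latest
```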
| Without Docker | With Docker |
| --- | --- |
| You install Python and dependencies manually on the server | You define a Dockerfile once, and it runs the same everywhere |
| Issues with versions, paths, environments | Everything is bundled in a container, no surprises |
| Harder to maintain, scale, or move | Portable, reproducible, and deployable |
Docker ensures:
- You can run your model + server (e.g., FastAPI) as a self-contained unit
- The same container works on your machine, a friend's machine, or AWS
- You don't need to worry about the server's environment
So Docker is not a must, but it is a huge help, especially in real-world deployment.
Summaries:
- Hugging Face vs Ollama
- Deployment Stack
Overall:
Yes, you can fine-tune a model in Google Colab, export it, and deploy it as your own API using Docker. Once Dockerized, the model is portable and can be hosted anywhere (AWS, GCP, etc.). The frontend (like React on Vercel) can call your API for live responses from your model.
Conclusion:
I have just tried to explain what I understood. I might be wrong, or the information might be incomplete, but I thought I should share this after going through these topics! I hope it adds something to your knowledge.
I have only explained the theoretical overview, not the "how to do it" details. I am sure you can find those once you know what to use where and which approach to take; knowing which options to pick might be the most important thing.
Credits:
I am very grateful to ChaiCode for providing all this knowledge, insight, and deep learning about AI: Piyush Garg, Hitesh Choudhary.
If you want to learn too, you can join here → Cohort || Apply from ChaiCode & use NAKUL51937 to get 10% off.
Thanks:
Feel free to comment your thoughts; I love hearing feedback and improving. This is my second article :)
Thank you for giving your precious time to read this article.
Connect:
Let's learn something together: LinkedIn
If you'd like, you can check out my Portfolio.