Chat Models vs LLMs: A Deep Dive into the Brains Behind Generative AI

Abhishek Mishra

Introduction

Generative AI, or Generative Artificial Intelligence, refers to models that generate text, images, or other media using learned data patterns. At the heart of this are tokens—which can be characters, words, or subwords—and vectors, which are numerical representations of meaning and context. Tokens are embedded into these vectors so that models can understand relationships, such as how "dog" is closer to "pet" than "car" in semantic space.
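To make "closer in semantic space" concrete, here is a toy sketch. The three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions:

```python
import math

# Toy 3-dimensional embeddings, made up for illustration only.
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "pet": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine_similarity(embeddings["dog"], embeddings["pet"]))  # high, ~0.99
print(cosine_similarity(embeddings["dog"], embeddings["car"]))  # lower, ~0.30
```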

Models

Language models are AI systems designed to process, generate, and understand natural language input or queries.

Large Language Models (LLMs) are designed to produce outputs based on user input, and they remain the core architecture behind most modern AI systems.

Chat Models, on the other hand, are a specialized subclass of LLMs: language models fine-tuned for interactive dialogue and multi-turn conversational tasks.

In this blog, I will mainly focus on Chat Models. However, let's explore the key points and differences between both.

The main difference between the two is that LLMs are general-purpose and can support various NLP projects, such as text generation, summarization, code generation, and question answering. In contrast, Chat Models handle multiple messages or queries and generate suitable responses. LangChain v0.3+ favors chat-style interactions through its ChatModel interfaces, even for single-turn outputs; this shift simplifies memory integration and multi-turn state management.
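For instance, here is a hedged sketch of that chat-style interface in LangChain v0.3+ handling a single-turn task. It assumes the langchain-openai package is installed and an OPENAI_API_KEY environment variable is set; the model name is illustrative:

```python
from langchain_openai import ChatOpenAI

# A chat model used for a one-shot task: the input is still a message exchange.
chat = ChatOpenAI(model="gpt-4o-mini")  # illustrative model name

response = chat.invoke(
    "Summarize the difference between LLMs and chat models in one sentence."
)
print(response.content)  # the AIMessage's content holds the generated text
```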

LLMs do not have built-in memory, meaning they retain nothing from previous interactions. Chat Models are stateless under the hood too, but they accept a structured conversation history, allowing them to reference past exchanges and provide contextually relevant responses.
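In practice, the "memory" is simply the message list your application keeps and resends on every call. A minimal sketch of that pattern with LangChain's message types, under the same assumptions as above:

```python
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

chat = ChatOpenAI(model="gpt-4o-mini")  # illustrative model name

# The application keeps the history; the model re-reads it on every call.
history = [
    SystemMessage(content="You are a concise assistant."),
    HumanMessage(content="My name is Abhishek."),
]
reply = chat.invoke(history)
history.append(reply)  # reply is an AIMessage; append it to the transcript

history.append(HumanMessage(content="What is my name?"))
print(chat.invoke(history).content)  # can now answer from the history
```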

BaseLLM and BaseChatModel are the foundational classes in LangChain used when creating a Large Language Model (LLM) or a chat model, respectively. These classes serve as the building blocks for developing and implementing the specific capabilities of each model type.
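For illustration, here is a minimal sketch of a custom model built on LangChain's LLM helper class (a simpler wrapper around BaseLLM). The EchoLLM below is a toy stand-in, not a real model; a real subclass would call an API or local weights inside _call:

```python
from typing import Any, List, Optional

from langchain_core.language_models.llms import LLM

class EchoLLM(LLM):
    """Toy LLM that uppercases the prompt, to show the shape of the interface."""

    @property
    def _llm_type(self) -> str:
        return "echo"

    def _call(self, prompt: str, stop: Optional[List[str]] = None,
              **kwargs: Any) -> str:
        # A real implementation would run inference here.
        return prompt.upper()

print(EchoLLM().invoke("hello"))  # -> "HELLO"
```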

Temperature in LLMs is a parameter that controls the randomness of the output, affecting the creativity of the responses. Lower values (0-0.3) result in more deterministic and repetitive outputs, while higher values (1.5+) lead to more diverse and creative responses.

max_completion_tokens (often exposed as max_tokens in older APIs and client wrappers) is a parameter used to limit the number of tokens in the model's output, ensuring the response does not exceed a specified size.
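Both knobs usually appear together in client libraries. A hedged sketch with LangChain's ChatOpenAI wrapper, which exposes the limit as max_tokens (the model name is illustrative and an OPENAI_API_KEY is assumed):

```python
from langchain_openai import ChatOpenAI

# Low temperature: near-deterministic, good for factual or repeatable output.
precise = ChatOpenAI(model="gpt-4o-mini", temperature=0, max_tokens=100)

# High temperature: more diverse and creative, but less predictable.
creative = ChatOpenAI(model="gpt-4o-mini", temperature=1.5, max_tokens=100)

prompt = "Suggest a name for a coffee shop."
print(precise.invoke(prompt).content)
print(creative.invoke(prompt).content)
```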

Open Source & Closed Source (Proprietary) Models

These models come in two types: open source and closed source. Open source models can be downloaded and run on your local device, while closed source models are hosted on a server and accessed via an API. Major closed source models include OpenAI's GPT models, Anthropic's Claude, and Google's Gemini. Major open source models include Meta's LLaMA, DeepSeek's R1, and Mistral's models.

Choosing between open and closed source depends on your use case. If privacy, customization, and control are critical, open-source models like LLaMA or Mistral on HuggingFace are ideal. For robust performance out-of-the-box with minimal setup, closed models like GPT-4 or Claude are better suited.

The key difference between open source and closed source models lies in data privacy and control. Closed source models require API calls, meaning your data is sent to a remote server, which raises data privacy concerns. In contrast, open source models let you download and run the entire model on your device, giving you full control over it. This also supports fine-tuning, a concept from the Transformer architecture, since you own the model weights. Additionally, with open source models there is no risk of leaking API keys and no need to manage a key for every new project.

Closed source models can be found on their vendors' respective platforms. Open source models are available on HuggingFace, the largest repository of open-source LLMs. If you have storage constraints on your local system, HuggingFace also provides a free tier of inference API usage up to a certain limit, allowing you to use thousands of open-source LLMs without downloading them, with a large set of choices to select from.
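As a sketch of that hosted route, the huggingface_hub client can call a model without downloading anything. The model id is illustrative, and you need a (free) HuggingFace access token:

```python
from huggingface_hub import InferenceClient

# Runs on HuggingFace's servers: nothing is downloaded to your machine.
client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model id
    token="hf_...",  # your HuggingFace access token (placeholder)
)

answer = client.text_generation(
    "Explain open vs closed source models in one line.",
    max_new_tokens=80,
)
print(answer)
```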

The disadvantages of open-source LLMs are that they require substantial computational power, typically a capable GPU. They also involve setup complexity and may lack refined human feedback (RLHF), so they are not always well-tuned for a wide variety of inputs.

https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/

The Open LLM Leaderboard: a place where you can compare different open-source models and decide which one to use depending on your use-case scenario.

Open Source Model Nomenclature

To read the names of open-source models, you can follow these general guidelines. These names are often abbreviations or acronyms and usually follow a pattern that reflects their:

  • Architecture

  • Creator/organization

  • Version

  • Size or parameters

Example: Meta-Llama3-8B - “Meta LLaMA 3, 8 billion parameters”

  • Meta → Creator or organization (Meta AI)

  • LLaMA → "Large Language Model Meta AI"

  • 3 → Version (LLaMA 3)

  • 8B → 8 billion parameters (model size)
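Because this is only a loose convention rather than a standard, parsing names programmatically is approximate at best. A rough sketch with a regular expression (it will not match every model name on the Hub):

```python
import re

# Loose pattern: organization-family+version-size; many real names deviate.
NAME_RE = re.compile(
    r"^(?P<org>[A-Za-z]+)-(?P<family>[A-Za-z]+)"
    r"(?P<version>\d+(?:\.\d+)?)-(?P<size>\d+(?:\.\d+)?[BM])$"
)

m = NAME_RE.match("Meta-Llama3-8B")
if m:
    print(m.groupdict())
    # {'org': 'Meta', 'family': 'Llama', 'version': '3', 'size': '8B'}
```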

Prefixes and Suffixes

  • Instruct → fine-tuned for following instructions (e.g., chat-style use)

  • Base → raw pre-trained model (not instruction-tuned)

  • Chat → optimized for dialogue

  • Lite / Tiny / Small / XL → indicates relative size

  • v1 / v2 / v3 → version numbers

Example Name | Read As / Meaning
Mistral-7B | Mistral model, 7 billion parameters
Falcon-40B-Instruct | Falcon, 40B parameters, fine-tuned for instructions
GPT-J-6B | GPT variant by EleutherAI, 6 billion parameters
BLOOMZ-mt-7b1 | BLOOMZ multilingual, 7.1 billion parameters
Phi-2 | Phi model, version 2 (by Microsoft)

In the context of language models, "parameters" refer to the components of the model that are learned from the training data. Think of them as a painter's brushes: a model with fewer parameters (say, 10 brushes) can perform tasks adequately but with less detail and complexity, while a model with a large number of parameters (say, 10,000 brushes) can handle more complex tasks and produce more detailed and nuanced outputs, but it requires more computational resources and time to train and run.

Context Parameters (Context Window / Context Length)

Context parameters refer to the number of tokens a model can "see" at once in a single conversation or prompt. This is also known as the context window, context length, or attention span. Tokens are chunks of text, roughly equivalent to a word or part of a word. The context window determines:

  • How much of the past conversation the model can remember

  • The length of your prompt/input

  • The size of a document or multi-turn chat before the model starts forgetting earlier parts

Example:

If a chat model has:

  • 70B parameters: It is powerful and nuanced.

  • 128K context: It can remember an entire book while answering your question.

GPT-4 Turbo supports up to 128K tokens, which is roughly 300 pages of text. This makes it ideal for tasks like summarizing research papers, processing books, or handling long chat histories.
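Since limits are counted in tokens rather than words, it helps to measure usage directly. A short sketch with OpenAI's tiktoken library (cl100k_base is the encoding used by GPT-4-era models):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era tokenizer

text = "Tokens are chunks of text, roughly a word or part of a word."
tokens = enc.encode(text)

print(len(tokens))             # how many tokens this text consumes
print(enc.decode(tokens[:5]))  # decode the first few tokens back to text
```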

Difference between Model Parameters & Context Parameters

Model Parameters | Context Parameters
Internal weights (e.g., 7B, 70B) | Max input tokens
How smart/powerful the model is | How much text it can handle at once

Conclusion

Generative AI transforms human language into vectors that machines understand. While LLMs offer general-purpose NLP capabilities, chat models are optimized for interactive conversation. Whether open or closed source, each model brings unique trade-offs in performance, privacy, and scalability. Understanding model structure, naming, and parameters equips you to choose the right AI for your next project.
