Data Extraction with LangChain and Ollama Locally

In this guide, I'll show you how to extract and structure data using LangChain and Ollama on your local machine. This approach is particularly useful for automated data retrieval, market research, and AI-driven content analysis. We will leverage:
- Pydantic for schema validation
- LangChain's output parsers for structured response formatting
- Pandas for data manipulation
Let's dive in!
1. Installation
To get started, install the following packages:
pip install langchain
pip install langchain-ollama
pip install pydantic
pip install pandas
I assume you already have the Ollama app installed on your local machine. If not, you can download it from the official Ollama website (ollama.com). Once installed, open the app and start the server. A few useful Ollama command-line operations are shown below.
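For reference, these are the standard Ollama CLI commands for starting the server, pulling the model we'll use later in this guide, and checking what's installed:
ollama serve              # start the Ollama server (the desktop app does this for you)
ollama pull llama3.2:3b   # download the model used in this guide
ollama list               # list locally installed models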
2. Importing Required Libraries
First, let's import the necessary libraries for this task:
from pydantic import BaseModel, Field
from langchain_ollama.llms import OllamaLLM
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import HumanMessagePromptTemplate, ChatPromptTemplate
import pandas as pd
3. Define the Schema
We'll define a Pydantic model to validate the LLM's response, ensuring that the extracted data arrives in the expected format.
For example, let's extract a list of popular phone models:
class PhoneModel(BaseModel):
    brand: str = Field(description="Brand of the phone")
    model: str = Field(description="Model of the phone")
    year: int = Field(description="Release year of the phone")
    price: float = Field(description="Price of the phone")

class PhonesModel(BaseModel):
    phones: list[PhoneModel] = Field(description="List of phones")
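To see the validation in action, here is a quick sketch with made-up sample values, showing that Pydantic coerces compatible types and rejects malformed data:
from pydantic import ValidationError

# Pydantic coerces the string "2021" to the int field
phone = PhoneModel(brand="Apple", model="iPhone 13", year="2021", price=699.0)
print(phone.year)  # 2021

# Malformed data raises a ValidationError, so bad LLM output is caught early
try:
    PhoneModel(brand="Apple", model="iPhone 13", year="unknown", price=699.0)
except ValidationError as e:
    print(e)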
4. Formatting Output
Next, we create a PydanticOutputParser, which generates formatting instructions for the LLM and later parses its response into our schema:
output_parser = PydanticOutputParser(pydantic_object=PhonesModel)
format_instructions = output_parser.get_format_instructions()
print(format_instructions)
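To get a feel for what the parser does with a model response, here is a small illustration; the JSON string below is a hand-written sample, not real model output:
raw = '{"phones": [{"brand": "Apple", "model": "iPhone 13", "year": 2021, "price": 699.0}]}'
parsed = output_parser.parse(raw)
print(parsed.phones[0].brand)  # Apple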
5. Using an Ollama Model
To use a model from Ollama, pull it first if you haven't already (for example, with ollama pull llama3.2:3b) and then instantiate it. Here I use llama3.2:3b:
model = OllamaLLM(
    model='llama3.2:3b',
    temperature=0.0,  # deterministic output helps the model follow the JSON format
)
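Before wiring up the full chain, it's worth a quick sanity check that the model responds; the prompt here is just an arbitrary test string:
# Returns a plain string if the Ollama server is running and the model is available
print(model.invoke("Reply with one word: pong"))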
6. Generate Data
We'll now generate structured data by passing instructions and formatting details to the model. This process forms a chain (hence the name LangChain) connecting the prompt, the model, and the output parser.
For example, let's retrieve the 10 most popular phones:
human_text = "{instruction}\n{format_instructions}"
message = HumanMessagePromptTemplate.from_template(human_text)
prompt = ChatPromptTemplate.from_messages([message])
chain = prompt | model | output_parser
products = chain.invoke({
    "instruction": "Give me the top 10 most popular phones right now",
    "format_instructions": format_instructions,
})  # returns a PhonesModel instance
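Small local models don't always emit valid JSON on the first attempt. If you want automatic recovery, one option is LangChain's OutputFixingParser, which feeds unparseable output back to the LLM to repair it; this is an optional hardening step, not part of the minimal pipeline above:
from langchain.output_parsers import OutputFixingParser

# Wraps the original parser; on a parse failure, the bad output is sent back
# to the model with instructions to fix it, then parsed again
fixing_parser = OutputFixingParser.from_llm(parser=output_parser, llm=model)
robust_chain = prompt | model | fixing_parser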
7. Convert to DataFrame
Finally, we will convert the structured data into a Pandas DataFrame for further analysis and manipulation.
# model_dump() converts each Pydantic object into a plain dict (Pydantic v2)
df = pd.DataFrame([product.model_dump() for product in products.phones])
df
Output
      brand         model  year   price
0     Apple     iPhone 13  2021   699.0
1   Samsung    Galaxy S21  2021   799.0
2    Google       Pixel 6  2021   599.0
3   OnePlus         9 Pro  2021   969.0
4    Xiaomi         Mi 11  2021   749.0
5      Oppo   Find X3 Pro  2021  1149.0
6    Realme      GT 2 Pro  2022   599.0
7      Vivo       X70 Pro  2021   899.0
8      Sony  Xperia 1 III  2021  1299.0
9  Motorola   Edge 30 Pro  2022   799.0
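From here, standard Pandas operations apply; for example (the CSV file name is just a suggestion):
# Cheapest phones first
print(df.sort_values("price").head())

# Average price across the extracted phones
print(df["price"].mean())

# Persist the structured data for later use
df.to_csv("phones.csv", index=False)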
Conclusion
In this blog post, we showed how to extract and organize data using LangChain and Ollama on your local machine. We used Pydantic for validating data, LangChain's parsers for structured responses, and Pandas for data handling. This method is great for tasks like automated data collection, market research, and AI-based content analysis. I hope this guide was helpful!
print("Thank you for reading!")