Data Extraction with LangChain and Ollama Locally

Phanith LIM

In this guide, I'll show you how to extract and structure data using LangChain and Ollama on your local machine. This approach is particularly useful for automated data retrieval, market research, and AI-driven content analysis. We will leverage:

  • Pydantic for schema validation

  • LangChain’s output parsers for structured response formatting

  • Pandas for data manipulation

Let's dive in!

1. Installation

To get started, install the following packages (langchain-ollama provides the OllamaLLM integration we import below):

pip install langchain
pip install langchain-ollama
pip install ollama
pip install pydantic
pip install pandas

I assume you already have the Ollama app installed on your local machine; if not, you can download it from the official Ollama website (ollama.com). Once installed, open the app and start the server. Ollama also ships a command-line interface with some useful operations, a few of which appear below.
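
If you prefer the terminal, a quick sanity check looks like this (assuming the standard Ollama CLI; ollama serve is unnecessary if the desktop app is already running the server):

ollama serve   # start the server (skip if the app already runs it)
ollama list    # confirm the server responds and see installed models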

2. Importing Required Libraries

First, let's import the necessary libraries for this task:

from pydantic import BaseModel, Field
from langchain_ollama.llms import OllamaLLM
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import HumanMessagePromptTemplate, ChatPromptTemplate
import pandas as pd

3. Define the Schema

We'll define a Pydantic model to validate the LLM's response, ensuring the extracted data arrives in the expected format.

For example, let's extract a list of popular phone models:

class PhoneModel(BaseModel):
    brand: str = Field(description="Brand of the phone")
    model: str = Field(description="Model of the phone")
    year: int = Field(description="Year of the phone")
    price: float = Field(description="Price of the phone")

class PhonesModel(BaseModel):
    phones: list[PhoneModel] = Field(description="List of phones")
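
Before wiring this into LangChain, it's worth seeing what the schema buys us. Here is a minimal sketch (the sample values are made up) showing that well-formed data validates, while a wrong type raises an error:

from pydantic import ValidationError

sample = {"phones": [{"brand": "Apple", "model": "iPhone 13", "year": 2021, "price": 699.0}]}
phones = PhonesModel.model_validate(sample)
print(phones.phones[0].brand)  # Apple

try:
    # 'year' is not an integer, so validation fails
    PhonesModel.model_validate({"phones": [{"brand": "Apple", "model": "iPhone 13", "year": "unknown", "price": 699.0}]})
except ValidationError as err:
    print(err)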

4. Formatting Output

Next, we create a PydanticOutputParser, which turns the model's raw text output into an instance of our schema and also produces formatting instructions we can embed in the prompt:

output_parser = PydanticOutputParser(pydantic_object=PhonesModel)
format_instructions = output_parser.get_format_instructions()
print(format_instructions)
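
The printed instructions tell the LLM to emit JSON matching our schema; the parser then validates that JSON. As a minimal sketch (the JSON string here is hand-written, not real model output):

raw = '{"phones": [{"brand": "Google", "model": "Pixel 6", "year": 2021, "price": 599.0}]}'
parsed = output_parser.parse(raw)
print(parsed.phones[0].model)  # Pixel 6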

5. Using an Ollama Model

To use a model served by Ollama, you first need to pull it. In this example, I'll use llama3.2:3b.
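
Pull it from the terminal before instantiating the model:

ollama pull llama3.2:3b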

model = OllamaLLM(
    model='llama3.2:3b',
    temperature=0.0,
)
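
Before building the chain, a quick smoke test confirms the model is reachable (a minimal sketch; the prompt is arbitrary):

print(model.invoke("Reply with one word: pong"))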

6. Generate Data

We'll now generate structured data by passing instructions and formatting details to the model. This process forms a chain—hence the name LangChain—by connecting the prompt, model, and output parser seamlessly.

For example, let's retrieve the top 10 popular phones:

human_text = "{instruction}\n{format_instructions}"
message = HumanMessagePromptTemplate.from_template(human_text)
prompt = ChatPromptTemplate.from_messages([message])

chain = prompt | model | output_parser
products = chain.invoke({
    "instruction": "Give the top 10 most popular phones right now",
    "format_instructions": format_instructions,
})
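
Because the parser is the last link in the chain, products is already a validated PhonesModel instance, so you get typed attributes rather than raw text:

print(type(products).__name__)  # PhonesModel
print(len(products.phones))     # 10, assuming the model followed the instructions
print(products.phones[0])       # first PhoneModel entry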

7. Convert to a DataFrame

Finally, we will convert the structured data into a Pandas DataFrame for further analysis and manipulation.

df = pd.DataFrame([product.model_dump() for product in products.phones])
df

Output

      brand         model  year   price
0     Apple     iPhone 13  2021   699.0
1   Samsung    Galaxy S21  2021   799.0
2    Google       Pixel 6  2021   599.0
3   OnePlus         9 Pro  2021   969.0
4    Xiaomi         Mi 11  2021   749.0
5      Oppo   Find X3 Pro  2021  1149.0
6    Realme      GT 2 Pro  2022   599.0
7      Vivo       X70 Pro  2021   899.0
8      Sony  Xperia 1 III  2021  1299.0
9  Motorola   Edge 30 Pro  2022   799.0
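
From here, everything Pandas offers is available. For example (a minimal sketch; the file name is arbitrary):

df.sort_values("price", ascending=False).head(3)  # three most expensive phones
df.to_csv("phones.csv", index=False)              # persist the results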

Conclusion

In this blog post, we showed how to extract and organize data using LangChain and Ollama on your local machine. We used Pydantic for validating data, LangChain's parsers for structured responses, and Pandas for data handling. This method is great for tasks like automated data collection, market research, and AI-based content analysis. I hope this guide was helpful!

print("Thank you for reading!")