As an AI/NLP/ML platform engineer, I often need to prototype and iterate on text analytics pipelines. Entity extraction is a frequent, foundational step, whether I'm building quick R&D experiments or shaping components for production. I wanted a tool that was both fast and flexible: something I could use from the command line for ad-hoc analysis, or drop into Python code and larger frameworks like LangChain for more complex workflows.

That's why I built nergrep: a Python package and CLI for extracting and filtering named entities with spaCy, with integration into modern LLM pipelines (I started with a LangChain plugin).

While I'm still exploring whether nergrep will become a production staple or remain a rapid prototyping tool, it's already proven invaluable for shaping and testing new NLP pipelines in R&D.

Why `nergrep`?

Precision Filtering: Filter entities by type, fuzzy match, regex, black/whitelists, and more.
CLI & Python API: Use interactively or in scripts and pipelines.
LangChain Integration: Plug entity extraction into LLM-powered workflows.
Custom ML Model Integration: Extend entity extraction with custom ML models for enhanced precision and additional entity types.
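
To make the idea of composable filters concrete, here is a minimal sketch of how type, regex, and blacklist filters can be chained as plain predicates. It is not nergrep's implementation: the `Entity` class and the helpers `by_type`, `by_regex`, `not_in`, and `apply_filters` are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an extracted entity (not nergrep's own class)."""
    text: str
    label: str


Predicate = Callable[[Entity], bool]


def by_type(*labels: str) -> Predicate:
    """Keep entities whose label is one of the given types."""
    allowed = set(labels)
    return lambda e: e.label in allowed


def by_regex(pattern: str) -> Predicate:
    """Keep entities whose text matches the regular expression."""
    compiled = re.compile(pattern)
    return lambda e: compiled.search(e.text) is not None


def not_in(blacklist: Iterable[str]) -> Predicate:
    """Drop entities whose text appears in the blacklist (case-insensitive)."""
    blocked = {item.lower() for item in blacklist}
    return lambda e: e.text.lower() not in blocked


def apply_filters(entities: Iterable[Entity], predicates: List[Predicate]) -> List[Entity]:
    """An entity survives only if every predicate accepts it."""
    return [e for e in entities if all(p(e) for p in predicates)]


entities = [
    Entity("Microsoft", "ORG"),
    Entity("GitHub", "ORG"),
    Entity("2018", "DATE"),
]
kept = apply_filters(
    entities,
    [by_type("ORG"), by_regex(r"^[A-Z]"), not_in(["github"])],
)
print([e.text for e in kept])
```

Because each filter is just a function, adding a new rule means writing one more predicate and appending it to the list.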

Quick Demo: Entity Extraction from Apollo 11

Running nergrep on an excerpt about the Apollo 11 mission gives us clean named entity extraction:

 nergrep "Apollo 11 was the spaceflight that first landed humans on the Moon. Commander Neil Armstrong and lunar module pilot Buzz Aldrin formed the American crew. They landed the Apollo Lunar Module Eagl......"

Example output:

# Example: YAML output from nergrep entity extraction

- text: "Apollo"
  label: "ORG"
  sentence: "Apollo 11 was the spaceflight that first landed humans on the Moon."
- text: "Neil Armstrong"
  label: "PERSON"
  sentence: "Commander Neil Armstrong and lunar module pilot Buzz Aldrin formed the American crew."
- text: "July 20, 1969"
  label: "DATE"
  sentence: "They landed the Apollo Lunar Module Eagle on July 20, 1969, at 20:17 UTC."
- text: "Earth"
  label: "LOC"
  sentence: "They collected 47.5 pounds of lunar material to bring back to Earth."
- text: "NASA"
  label: "ORG"
  sentence: "The mission was launched by a Saturn V rocket from Kennedy Space Center in Florida, and fulfilled a national goal set by President John F. Kennedy in 1961."

nergrep can be used to extract, filter, or redact named entities across any text source.
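
As an illustration of the redaction use case, here is a small sketch that replaces each entity with a `[LABEL]` placeholder. It assumes entities arrive as dictionaries with `text` and `label` keys, mirroring the example output above; the `redact` helper is hypothetical and not part of nergrep's documented API.

```python
import re
from typing import Dict, List


def redact(text: str, entities: List[Dict[str, str]]) -> str:
    """Replace every occurrence of each entity's text with a [LABEL] placeholder."""
    labels = {ent["text"]: ent["label"] for ent in entities}
    if not labels:
        return text
    # Try longer names first so "Neil Armstrong" wins over a shorter overlapping name.
    alternatives = sorted(labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in alternatives))
    # A single pass avoids re-matching text inside placeholders already inserted.
    return pattern.sub(lambda m: "[" + labels[m.group(0)] + "]", text)


sentence = (
    "Commander Neil Armstrong and lunar module pilot Buzz Aldrin "
    "formed the American crew."
)
found = [
    {"text": "Neil Armstrong", "label": "PERSON"},
    {"text": "Buzz Aldrin", "label": "PERSON"},
]
print(redact(sentence, found))
```

Matching on the entity text is the simplest approach; if character offsets are available, slicing by offset would be more precise for names that also appear in unrelated contexts.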

Quickstart

Installation

pip install git+https://github.com/metawake/nergrep.git
python -m spacy download en_core_web_lg

CLI Usage

Extract organizations and people from a file, filter by fuzzy match, and output as JSON:

nergrep input.txt --types ORG,PERSON --fuzzy "micro" --format json

Filter entities by regex and minimum length:

nergrep "Microsoft acquired GitHub in 2018." --regex "^[A-Z]" --min-length 5
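
Once the CLI has written JSON, a few lines of Python are enough to post-process it. The sketch below assumes the JSON output is a list of objects with the same `text`, `label`, and `sentence` fields as the YAML example earlier; I have not pinned that schema down here, so treat the field names and the `sample_output` string as illustrative.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Hypothetical sample of what `--format json` might produce (assumed schema).
sample_output = """
[
  {"text": "Microsoft", "label": "ORG", "sentence": "Microsoft acquired GitHub in 2018."},
  {"text": "GitHub", "label": "ORG", "sentence": "Microsoft acquired GitHub in 2018."},
  {"text": "2018", "label": "DATE", "sentence": "Microsoft acquired GitHub in 2018."}
]
"""


def group_by_label(raw_json: str) -> Dict[str, List[str]]:
    """Group unique entity texts under their label, preserving first-seen order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in json.loads(raw_json):
        if ent["text"] not in grouped[ent["label"]]:
            grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


print(group_by_label(sample_output))
```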

Python API

from nergrep.extractor import extract_entities
from nergrep.filters import FilterConfig, filter_all

text = "Apple Inc. is working with Microsoft on AI projects."
entities = extract_entities(text, types=["ORG"])

# Advanced filtering
config = FilterConfig(
    entity_types={"ORG"},
    fuzzy_match="micro",
    min_length=5
)
filtered = filter_all(entities, config)
print([e.text for e in filtered])
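
For intuition about what a fuzzy filter does, here is a standalone sketch built on the standard library's `difflib`. It scores a query against every same-length window of a candidate name and keeps names above a threshold. This is one plausible approach, not a description of how nergrep implements `fuzzy_match`; the function names and the 0.8 threshold are my assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def fuzzy_score(query: str, candidate: str) -> float:
    """Best similarity between the query and any same-length window of the candidate."""
    q, c = query.lower(), candidate.lower()
    if not q or not c:
        return 0.0
    if len(q) >= len(c):
        return SequenceMatcher(None, q, c).ratio()
    window = len(q)
    return max(
        SequenceMatcher(None, q, c[i:i + window]).ratio()
        for i in range(len(c) - window + 1)
    )


def fuzzy_filter(names: List[str], query: str, threshold: float = 0.8) -> List[str]:
    """Keep names whose best window score reaches the threshold."""
    return [name for name in names if fuzzy_score(query, name) >= threshold]


names = ["Apple Inc.", "Microsoft", "Micron Technology"]
print(fuzzy_filter(names, "micro"))  # ['Microsoft', 'Micron Technology']
```

Lowering the threshold tolerates typos in the query at the cost of more false positives.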

LangChain Integration

nergrep fits naturally into LangChain pipelines for LLM-powered document processing:

from langchain_core.documents import Document
from nergrep.extractor import extract_entities

doc = Document(page_content="OpenAI and Meta are leading AI research.")
entities = extract_entities(doc.page_content, types=["ORG"])
print([e.text for e in entities])  # ['OpenAI', 'Meta']
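
A common next step is to store the extracted entities in each document's metadata so later pipeline stages can filter or route on them. The sketch below shows that pattern without any dependencies: `Doc` is a stand-in with the same `page_content` and `metadata` fields as LangChain's `Document`, and `toy_extract` is a placeholder for a real extractor. Both names, along with `annotate`, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Doc:
    """Stand-in exposing the same two fields as LangChain's Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def annotate(docs: Iterable[Doc], extract: Callable[[str], List[str]]) -> List[Doc]:
    """Attach the extractor's result to each document under metadata['entities']."""
    annotated = []
    for doc in docs:
        doc.metadata["entities"] = extract(doc.page_content)
        annotated.append(doc)
    return annotated


def toy_extract(text: str) -> List[str]:
    """Toy extractor for demonstration only: looks up a fixed list of names."""
    known = ["OpenAI", "Meta"]
    return [name for name in known if name in text]


docs = annotate([Doc("OpenAI and Meta are leading AI research.")], toy_extract)
print(docs[0].metadata["entities"])  # ['OpenAI', 'Meta']
```

In a real pipeline you would pass in a function that wraps the actual extractor and returns the entity texts, and use real `Document` objects instead of `Doc`.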

Custom ML Model Integration

nergrep supports integration with custom ML models to enhance entity extraction precision and extend the range of entity types:

from nergrep.extractor import extract_entities_with_model

text = "Apple Inc. is working with Microsoft on AI projects."

# Load your custom model (load_your_custom_model is a placeholder for your own loader)
custom_model = load_your_custom_model()

# Extract entities using the custom model
entities = extract_entities_with_model(text, custom_model)
print([e.text for e in entities])

Engineering for Real-World NLP

Building nergrep is about more than just code—it's about designing tools that fit real data workflows:

Modular, composable filters for production pipelines
CLI for quick experiments and batch jobs
Python API for integration with ML/AI platforms
LangChain support for LLM-centric architectures
Custom ML model integration for enhanced entity extraction

If you're building NLP or LLM applications and need robust, customizable entity extraction, check out nergrep. Feedback and contributions are welcome!

Potential TODOs

Performance Optimization: Explore multi-threading or GPU acceleration for faster entity extraction on large datasets.
Enhanced Filtering: Add support for more advanced filtering options, such as context-aware filtering or custom entity types.
Integration with More Frameworks: Extend nergrep to integrate with other NLP and ML frameworks beyond LangChain.

Reach out to me on LinkedIn for work and consulting opportunities.

Follow me on Hashnode for more on building scalable, production-grade AI/NLP/ML tools.

Flexible Entity Extraction for NLP Pipelines: Introducing nergrep

Alex Alexapolsky
