Flexible Entity Extraction for NLP Pipelines: Introducing nergrep

As an AI/NLP/ML platform engineer, I often need to prototype and iterate on text analytics pipelines. Entity extraction is a frequent, foundational step, whether I'm building quick R&D experiments or shaping components for production. I wanted a tool that was both fast and flexible: something I could use from the command line for ad-hoc analysis, or drop into Python code and larger frameworks like LangChain for more complex workflows.
That's why I built nergrep: a Python package and CLI for extracting and filtering named entities using spaCy, with integration into modern LLM pipelines (I started with a LangChain plugin).
While I'm still exploring whether nergrep will become a production staple or remain a rapid prototyping tool, it's already proven invaluable for shaping and testing new NLP pipelines in R&D.
Why nergrep?
Precision Filtering: Filter entities by type, fuzzy match, regex, black/whitelists, and more.
CLI & Python API: Use interactively or in scripts and pipelines.
LangChain Integration: Plug entity extraction into LLM-powered workflows.
Custom ML Model Integration: Extend entity extraction with custom ML models for enhanced precision and additional entity types.
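To make the filtering idea concrete, here is a minimal sketch of type, whitelist, and blacklist filtering in plain Python. It does not use nergrep's own classes: `Entity` and `filter_entities` are hypothetical names introduced only for illustration, and the case-insensitive comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an extracted entity (not nergrep's class)."""
    text: str
    label: str


def filter_entities(entities, types=None, whitelist=None, blacklist=None):
    """Keep entities that pass every configured check; None disables a check."""
    allow = {w.lower() for w in whitelist} if whitelist else None
    deny = {b.lower() for b in blacklist} if blacklist else set()
    kept = []
    for ent in entities:
        key = ent.text.lower()
        if types is not None and ent.label not in types:
            continue  # wrong entity type
        if allow is not None and key not in allow:
            continue  # a whitelist is set and this text is not on it
        if key in deny:
            continue  # explicitly blacklisted
        kept.append(ent)
    return kept


entities = [
    Entity("Neil Armstrong", "PERSON"),
    Entity("Buzz Aldrin", "PERSON"),
    Entity("NASA", "ORG"),
    Entity("Earth", "LOC"),
]
people = filter_entities(entities, types={"PERSON"}, blacklist=["buzz aldrin"])
print([e.text for e in people])  # ['Neil Armstrong']
```

Each check is independent, so new criteria can be added without touching the existing ones.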
Quick Demo: Entity Extraction from Apollo 11
Running nergrep on an excerpt about the Apollo 11 mission gives us clean named entity extraction:
nergrep "Apollo 11 was the spaceflight that first landed humans on the Moon. Commander Neil Armstrong and lunar module pilot Buzz Aldrin formed the American crew. They landed the Apollo Lunar Module Eagl......"
Example output:
# Example: YAML output from nergrep entity extraction
- text: "Apollo"
  label: "ORG"
  sentence: "Apollo 11 was the spaceflight that first landed humans on the Moon."
- text: "Neil Armstrong"
  label: "PERSON"
  sentence: "Commander Neil Armstrong and lunar module pilot Buzz Aldrin formed the American crew."
- text: "July 20, 1969"
  label: "DATE"
  sentence: "They landed the Apollo Lunar Module Eagle on July 20, 1969, at 20:17 UTC."
- text: "Earth"
  label: "LOC"
  sentence: "They collected 47.5 pounds of lunar material to bring back to Earth."
- text: "NASA"
  label: "ORG"
  sentence: "The mission was launched by a Saturn V rocket from Kennedy Space Center in Florida, and fulfilled a national goal set by President John F. Kennedy in 1961."
nergrep can be used to extract, filter, or redact named entities flexibly across any text source.
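As an illustration of the redaction use case, the sketch below replaces entity texts with `[LABEL]` placeholders. It assumes the records have already been loaded into dicts with the same `text` and `label` keys as the YAML output above. The `redact` helper is hypothetical and is not part of nergrep's API.

```python
import re


def redact(text, entities):
    """Replace each entity's text with a [LABEL] placeholder in one pass.

    `entities` is a list of dicts with "text" and "label" keys. Longer
    entity texts are tried first so a short entity never splits a longer one.
    """
    labels = {e["text"]: e["label"] for e in entities}
    if not labels:
        return text
    alternatives = sorted(labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(a) for a in alternatives))
    return pattern.sub(lambda m: f"[{labels[m.group(0)]}]", text)


sentence = "They landed the Apollo Lunar Module Eagle on July 20, 1969, at 20:17 UTC."
records = [
    {"text": "Apollo", "label": "ORG"},
    {"text": "July 20, 1969", "label": "DATE"},
]
print(redact(sentence, records))
# They landed the [ORG] Lunar Module Eagle on [DATE], at 20:17 UTC.
```

Matching on the surface text is the simplest approach. With character offsets from the NER model, one could redact exact spans instead.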
Quickstart
Installation
pip install git+https://github.com/metawake/nergrep.git
python -m spacy download en_core_web_lg
CLI Usage
Extract organizations and people from a file, filter by fuzzy match, and output as JSON:
nergrep input.txt --types ORG,PERSON --fuzzy "micro" --format json
Filter entities by regex and minimum length:
nergrep "Microsoft acquired GitHub in 2018." --regex "^[A-Z]" --min-length 5
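For intuition, here is roughly what a regex plus minimum-length filter does to a list of candidate entity texts. This is a sketch under assumptions: the candidate list is what an NER model might plausibly return for that sentence, `keep_entity` is a hypothetical helper, and whether nergrep applies the pattern with search or match semantics is not something this sketch knows.

```python
import re


def keep_entity(text, regex=None, min_length=0):
    """Return True if the entity text is long enough and matches the regex."""
    if len(text) < min_length:
        return False
    if regex is not None and re.search(regex, text) is None:
        return False
    return True


# Assumed candidates for "Microsoft acquired GitHub in 2018."
candidates = ["Microsoft", "GitHub", "2018"]
kept = [t for t in candidates if keep_entity(t, regex=r"^[A-Z]", min_length=5)]
print(kept)  # ['Microsoft', 'GitHub']
```

Here "2018" is dropped twice over: it is shorter than five characters and does not start with an uppercase letter.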
Python API
from nergrep.extractor import extract_entities
from nergrep.filters import FilterConfig, filter_all

text = "Apple Inc. is working with Microsoft on AI projects."
entities = extract_entities(text, types=["ORG"])

# Advanced filtering
config = FilterConfig(
    entity_types={"ORG"},
    fuzzy_match="micro",
    min_length=5,
)
filtered = filter_all(entities, config)
print([e.text for e in filtered])
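The article does not say how fuzzy matching is implemented, so the following is only one plausible interpretation, built on the standard library's `difflib`. The `fuzzy_match` function, its sliding-window strategy, and the 0.8 threshold are all assumptions for illustration, not nergrep's actual algorithm.

```python
from difflib import SequenceMatcher


def fuzzy_match(query, candidate, threshold=0.8):
    """True if `query` occurs in `candidate` or closely resembles a slice of it."""
    q, c = query.lower(), candidate.lower()
    if q in c:
        return True  # exact substring, no scoring needed
    # Slide a window of len(q) across the candidate and keep the best ratio.
    window = len(q)
    best = max(
        (
            SequenceMatcher(None, q, c[i:i + window]).ratio()
            for i in range(max(1, len(c) - window + 1))
        ),
        default=0.0,
    )
    return best >= threshold


names = ["Apple Inc.", "Microsoft"]
print([n for n in names if fuzzy_match("micro", n)])  # ['Microsoft']
```

Lowering `threshold` makes the filter more tolerant of typos at the cost of more false positives.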
LangChain Integration
nergrep fits naturally into LangChain pipelines for LLM-powered document processing:
from langchain_core.documents import Document
from nergrep.extractor import extract_entities
doc = Document(page_content="OpenAI and Meta are leading AI research.")
entities = extract_entities(doc.page_content, types=["ORG"])
print([e.text for e in entities]) # ['OpenAI', 'Meta']
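A common next step in a document pipeline is to store the extracted entities in each document's metadata, so downstream components can filter or route on them. The sketch below shows that pattern without requiring LangChain or a spaCy model: `Doc` is a stand-in with the same `page_content` and `metadata` fields as LangChain's `Document`, and `annotate` and `stub_extract` are hypothetical helpers.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the same two fields as LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def annotate(docs, extract):
    """Store extracted entity texts under metadata["entities"] for each doc."""
    for doc in docs:
        doc.metadata["entities"] = list(extract(doc.page_content))
    return docs


def stub_extract(text):
    """Toy extractor: returns known organization names found in the text."""
    known = ["OpenAI", "Meta"]
    return [name for name in known if name in text]


docs = annotate([Doc("OpenAI and Meta are leading AI research.")], stub_extract)
print(docs[0].metadata)  # {'entities': ['OpenAI', 'Meta']}
```

In a real pipeline, the stub could be swapped for a small wrapper such as `lambda t: [e.text for e in extract_entities(t, types=["ORG"])]`, and `Doc` for `langchain_core.documents.Document`.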
Custom ML Model Integration
nergrep supports integration with custom ML models to enhance entity extraction precision and extend the range of entity types:
from nergrep.extractor import extract_entities_with_model

text = "Apple Inc. is working with Microsoft on AI projects."

# Load your custom model (load_your_custom_model is a placeholder for your own loader)
custom_model = load_your_custom_model()

# Extract entities using the custom model
entities = extract_entities_with_model(text, custom_model)
print([e.text for e in entities])
Engineering for Real-World NLP
Building nergrep is about more than code. It's about designing tools that fit real data workflows:
Modular, composable filters for production pipelines
CLI for quick experiments and batch jobs
Python API for integration with ML/AI platforms
LangChain support for LLM-centric architectures
Custom ML model integration for enhanced entity extraction
If you're building NLP or LLM applications and need robust, customizable entity extraction, check out nergrep. Feedback and contributions are welcome!
Potential TODOs
Performance Optimization: Explore multi-threading or GPU acceleration for faster entity extraction on large datasets.
Enhanced Filtering: Add support for more advanced filtering options, such as context-aware filtering or custom entity types.
Integration with More Frameworks: Extend nergrep to integrate with other NLP and ML frameworks beyond LangChain.
Reach out to me on LinkedIn for work and consulting opportunities.
Follow me on Hashnode for more on building scalable, production-grade AI/NLP/ML tools.
Written by

Alex Alexapolsky
Ukrainian Python dev in Montenegro. https://www.linkedin.com/in/alexey-a-181a614/