DocQuery : Extracting data from documents using a query engine
DocQuery is a library and command-line tool that makes it easy to analyze semi-structured and unstructured documents (PDFs, scanned images, etc.) using large language models (LLMs). You simply point DocQuery at one or more documents and specify a question you want to ask.
Install Package :
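pip install docquery
apt install tesseract-ocr
Tesseract is the OCR engine that reads text from scanned documents and images. In a notebook environment such as Colab, prefix each command with ! to run it in the shell.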
With docquery scan, you can ask one or more questions about a single document or a directory of files.
Quickstart :
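from docquery import document, pipeline
# Build the question-answering pipeline
p = pipeline('document-question-answering')
# Load the document to query
doc = document.load_document("/path/to/document.pdf")
# Ask each question against the loaded document and print the result
for q in ["What is the invoice number?", "What is the invoice total?"]:
    print(q, p(question=q, **doc.context))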
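The Quickstart queries a single file. To ask the same questions across a directory of files from Python, the loop can be wrapped in a small helper. The sketch below is only an illustration: `ask_directory`, its parameters, and the default list of file suffixes are hypothetical names, not part of DocQuery. The loader and the pipeline are passed in as arguments, so the helper assumes only what the Quickstart already shows: a loader that returns an object with a `context` mapping, and a pipeline that is called with `question=` plus that context.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable


def ask_directory(
    directory: str,
    questions: Iterable[str],
    load_document: Callable[[str], Any],
    qa_pipeline: Callable[..., Any],
    suffixes: Iterable[str] = (".pdf", ".png", ".jpg", ".jpeg"),
) -> Dict[str, Dict[str, Any]]:
    """Ask every question about every matching file in a directory.

    Hypothetical helper: load_document and qa_pipeline are supplied by
    the caller, e.g. document.load_document and
    pipeline('document-question-answering') from the Quickstart.
    """
    questions = list(questions)
    allowed = {s.lower() for s in suffixes}  # assumed set of file types
    results: Dict[str, Dict[str, Any]] = {}

    # Sort the entries so the output order is stable between runs
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.suffix.lower() not in allowed:
            continue  # skip subdirectories and other file types
        doc = load_document(str(path))
        # Keep whatever the pipeline returns, keyed by question
        results[path.name] = {
            q: qa_pipeline(question=q, **doc.context) for q in questions
        }
    return results


# Example call, assuming docquery is installed:
#   from docquery import document, pipeline
#   p = pipeline('document-question-answering')
#   answers = ask_directory("/path/to/invoices",
#                           ["What is the invoice number?"],
#                           document.load_document, p)
```

The helper stores the pipeline's return value unchanged, so it makes no assumption about the shape of the answers. Files are processed one at a time, which keeps memory use low for large directories.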
Use cases :
DocQuery handles structured, semi-structured, and unstructured documents. You can ask questions about invoices, contracts, forms, emails, letters, receipts, and many other document types, and you can also classify documents. As the model evolves, more modeling options will be offered and the range of supported document types will expand.
Written by
Amogh Kawle