DocQuery : Extracting data from documents using a query engine

Amogh Kawle

DocQuery is a library and command-line tool that makes it easy to analyze semi-structured and unstructured documents (PDFs, scanned images, etc.) using large language models (LLMs). You point DocQuery at one or more documents and specify the question you want to ask.

Install Package :

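pip install docquery
# Tesseract handles OCR for scanned documents (Debian/Ubuntu shown; prefix both commands with ! inside a notebook)
apt install tesseract-ocr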

With the docquery scan command, you can ask one or more questions about a single document or a directory of files.

Quickstart :

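from docquery import document, pipeline

# Build the document question-answering pipeline
p = pipeline('document-question-answering')
# Load the document (replace the placeholder path with your own file)
doc = document.load_document("/path/to/document.pdf")
# Ask each question against the loaded document
for q in ["What is the invoice number?", "What is the invoice total?"]:
    print(q, p(question=q, **doc.context))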
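The same calls can be looped over a folder, roughly as the scan command does for a directory of files. The snippet below is only a sketch under assumptions: `ask_directory`, `run_with_docquery`, and the list of file suffixes are hypothetical names and choices made for illustration, not part of DocQuery. The loader and pipeline are passed in as arguments so the loop can be tried with stand-ins before the real models are downloaded.

```python
from pathlib import Path


def ask_directory(directory, questions, load_document, qa_pipeline,
                  suffixes=(".pdf", ".png", ".jpg", ".jpeg")):
    """Ask every question of every matching file in a directory.

    load_document and qa_pipeline are injected; in real use they would be
    docquery's document.load_document and the pipeline object from the
    quickstart. The suffix filter is an assumption for this sketch.
    """
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() not in suffixes:
            continue  # skip files that do not look like documents
        doc = load_document(str(path))
        # Same call shape as the quickstart: question plus the document context
        results[path.name] = {
            q: qa_pipeline(question=q, **doc.context) for q in questions
        }
    return results


def run_with_docquery(directory, questions):
    """Wire the helper to DocQuery itself (requires docquery to be installed)."""
    from docquery import document, pipeline

    p = pipeline("document-question-answering")
    return ask_directory(directory, questions, document.load_document, p)
```

Calling `run_with_docquery("/path/to/folder", ["What is the invoice number?"])` would return a dictionary keyed by file name, with one entry per question holding whatever the pipeline returned.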

Use cases :

DocQuery is useful across structured, semi-structured, and unstructured documents. You can ask questions about invoices, contracts, forms, emails, letters, receipts, and many other document types. You can also classify documents. As the underlying models evolve, more modeling options will be offered and the range of supported document types will expand.
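For something like an invoice, one way to use this is to map each field you care about to a question and collect the answers into a dictionary. This is a rough sketch, not DocQuery's own API: `best_answer` and `extract_fields` are hypothetical helpers, and the code assumes the pipeline returns either a single dict or a list of dicts with "answer" and "score" keys (the usual Hugging Face question-answering shape). Print one raw result first to confirm the shape you actually get.

```python
def best_answer(result):
    """Return the highest-scoring answer text, or None if there is none.

    Assumes result is a dict or a list of dicts with "answer" and "score"
    keys; that output shape is an assumption of this sketch.
    """
    candidates = [result] if isinstance(result, dict) else list(result)
    if not candidates:
        return None
    top = max(candidates, key=lambda c: c.get("score", 0.0))
    return top.get("answer")


def extract_fields(qa_pipeline, context, field_questions):
    """Ask one question per field and keep only the best answer for each."""
    return {
        field: best_answer(qa_pipeline(question=question, **context))
        for field, question in field_questions.items()
    }


# Hypothetical field-to-question mapping for an invoice
INVOICE_FIELDS = {
    "invoice_number": "What is the invoice number?",
    "invoice_total": "What is the invoice total?",
}
```

With the quickstart objects in scope, `extract_fields(p, doc.context, INVOICE_FIELDS)` would give a small dictionary with one answer per field.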

GitHub Repo

