UDA - Unstructured Document Analysis
UDA: A Benchmark Suite for Retrieval-Augmented Generation in Real-world Document Analysis
Introduction
In recent years, the use of Retrieval-Augmented Generation (RAG) has significantly enhanced the capabilities of Large Language Models (LLMs), enabling them to interact more effectively with external data. However, real-world applications present substantial challenges, particularly in fields like academic research and financial analysis, where data is often embedded in lengthy and unstructured formats such as HTML or PDF. These documents can contain a mix of raw text and complex tables, making them difficult to parse and analyze efficiently.
To address these challenges, we are excited to introduce the Unstructured Document Analysis (UDA) benchmark suite. This suite comprises 2,965 real-world documents and 29,590 expert-annotated Q&A pairs, designed to push the limits of existing LLM- and RAG-based solutions. By revisiting popular document analysis techniques, we evaluate their design choices and answer quality across various document domains and query types. Our findings highlight the crucial role of data parsing and retrieval in improving the performance of these systems.
The UDA Dataset
The UDA benchmark includes documents from finance, academia, and general knowledge domains, encompassing a variety of formats and structures. The dataset is designed to reflect real-world scenarios, maintaining the original, unparsed formats of documents to test the true capabilities of RAG systems. Here's a brief overview of the dataset:
Finance: Includes documents from the FinQA and TAT-DQA datasets, which focus on financial reports and numerical reasoning.
Academia: Draws from the Qasper dataset, featuring questions based on NLP research papers.
World Knowledge: Incorporates data from FetaQA and Natural Questions, with content from Wikipedia and other sources.
The diversity in document types and question formats ensures that UDA provides a comprehensive evaluation platform for document analysis tasks.
Key Challenges in Document Analysis
Unstructured Inputs
Parsing unstructured documents is a complex task. Unlike plain text, these documents often include intricate layouts and redundant symbols that can confuse conventional parsing methods. For example, financial reports may contain tables with various levels of complexity, requiring sophisticated parsing strategies to extract meaningful data.
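To make this concrete, here is a minimal sketch of pulling text and tables out of a PDF page. It uses pdfplumber purely for illustration (UDA does not prescribe a specific parsing library), and the file name is hypothetical.

```python
# Minimal sketch: extract running text and tables from one PDF page.
# pdfplumber and the file path are illustrative assumptions.
import pdfplumber

with pdfplumber.open("annual_report.pdf") as pdf:
    page = pdf.pages[0]
    raw_text = page.extract_text() or ""   # running text, layout flattened
    tables = page.extract_tables()         # list of tables; each table is rows of cells

    # Re-serialize each table with explicit delimiters so downstream
    # chunking and retrieval keep some of the original row/column structure.
    for table in tables:
        for row in table:
            print(" | ".join(cell or "" for cell in row))
```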
Lengthy Documents
Documents such as financial reports can span hundreds of pages, necessitating efficient embedding and retrieval mechanisms to handle the volume of data. Ensuring that the most relevant chunks of information are retrieved in response to a query is critical for effective analysis.
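A common first step is to split the document into overlapping chunks before embedding. The sketch below is a deliberately simple character-window splitter for illustration; production pipelines typically split on sentence or section boundaries instead.

```python
def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping character windows.

    Illustrative only: real systems usually respect sentence/section
    boundaries and tune chunk_size to the embedding model's context window.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # keep some overlap so facts aren't cut in half
    return chunks
```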
Diverse Query Types
User queries can range from straightforward extraction tasks to complex arithmetic reasoning, each requiring different strategies for accurate answers. Techniques like Chain-of-Thought reasoning and the use of external tools, such as code interpreters, can significantly impact the quality of answers generated.
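For arithmetic-heavy queries, the "code interpreter" idea is to have the model emit a small program rather than a final number. The sketch below illustrates that pattern under stated assumptions: `ask_llm` is a hypothetical stand-in for any chat-completion call, and the prompt wording is not taken from the UDA experiments.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call (OpenAI, local model, etc.)."""
    raise NotImplementedError("plug in your LLM client here")

def answer_with_interpreter(question: str, context: str) -> str:
    """Ask the model for Python code that computes the answer, then run it."""
    prompt = (
        "Using the context below, write Python that computes the answer and "
        "stores it in a variable named `answer`. Return only code.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    code = ask_llm(prompt)          # model-generated arithmetic snippet
    scope: dict = {}
    exec(code, scope)               # execute it (sandbox this in practice!)
    return str(scope.get("answer"))
```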
Evaluation and Findings
Data Parsing
We evaluated various table-parsing methods, including raw text extraction, computer vision-based approaches, and advanced multi-modal parsing using GPT-4-Omni. Interestingly, GPT-4-Omni outperformed other methods, but even simple raw text extraction showed promising results due to its ability to preserve structural markers in the text.
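The raw-text baseline works better than one might expect because even naive extraction can keep lightweight structural markers. The sketch below shows the idea for an HTML table; BeautifulSoup and the sample table are illustrative, not part of the benchmark's fixed tooling.

```python
# Sketch of the "raw text extraction" baseline for an HTML table: strip the
# markup but keep row breaks and cell separators as structural markers.
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2022</td><td>$1.2B</td></tr>
  <tr><td>2023</td><td>$1.5B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
lines = []
for row in soup.find_all("tr"):
    cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
    lines.append(" | ".join(cells))

print("\n".join(lines))
# Year | Revenue
# 2022 | $1.2B
# 2023 | $1.5B
```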
Retrieval Strategies
Effective retrieval of relevant data chunks is vital for accurate LLM responses. We tested multiple retrieval strategies, including sparse retrieval (BM-25) and dense embedding models. Our results showed that while dense embedding models like all-mpnet-base performed well, sparse retrieval methods often provided better initial results, especially for complex queries.
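The two retrieval styles can be compared side by side on the same chunks. The sketch below assumes the rank_bm25 and sentence-transformers packages and the all-mpnet-base-v2 checkpoint; these choices are illustrative rather than the benchmark's exact configuration.

```python
# Side-by-side sketch of sparse (BM25) vs. dense (embedding) retrieval.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

chunks = [
    "Total revenue for fiscal 2023 was $1.5 billion.",
    "The board approved a new share buyback program.",
    "Operating expenses rose 8% year over year.",
]
query = "What was the revenue in 2023?"

# Sparse retrieval: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([c.lower().split() for c in chunks])
sparse_scores = bm25.get_scores(query.lower().split())

# Dense retrieval: cosine similarity between sentence embeddings.
model = SentenceTransformer("all-mpnet-base-v2")
dense_scores = util.cos_sim(model.encode(query), model.encode(chunks))[0]

print("BM25 ranking:", sorted(range(len(chunks)), key=lambda i: -sparse_scores[i]))
print("Dense ranking:", sorted(range(len(chunks)), key=lambda i: -float(dense_scores[i])))
```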
Long-Context LLMs vs. RAG
We compared the performance of long-context LLMs, capable of directly handling lengthy documents, with traditional RAG approaches. While long-context LLMs showed potential, they often fell short compared to RAG methods, which benefit from more precise retrieval mechanisms.
End-to-End Evaluation
Our end-to-end evaluations covered a range of LLMs, including GPT-4, Llama-3, and other state-of-the-art models. We found that smaller retrieval models could perform reasonably well in certain applications, and that Chain-of-Thought approaches significantly improved answer quality in zero-shot numerical document analysis.
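As a rough illustration of the Chain-of-Thought setup, a zero-shot prompt for numerical questions over retrieved chunks might look like the template below. The wording is an assumption for illustration, not the exact prompt used in the UDA experiments.

```python
def build_cot_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a zero-shot Chain-of-Thought prompt over retrieved context."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "You are analyzing excerpts from a document.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Let's think step by step, then give the final answer on the last line."
    )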
Conclusion
The UDA benchmark suite is a powerful tool for evaluating and improving RAG-based document analysis systems. By addressing the challenges of unstructured inputs, lengthy documents, and diverse query types, UDA aims to drive advancements in real-world applications of LLMs. We believe that our benchmark will provide valuable insights and guide future research in this field.
Stay tuned for further updates as we continue to refine and expand the capabilities of UDA!