Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction

Chubby Tummy

Knowledge graphs are essential tools for organizing information, but building them traditionally required manual effort or complex systems that struggled to scale. This research introduces "Extract-Define-Canonicalize" (EDC), a novel framework leveraging large language models to automate knowledge graph construction from text. EDC breaks the process into three phases: Open Information Extraction, Schema Definition, and Schema Canonicalization, addressing challenges such as scalability and schema-less scenarios. The framework significantly outperforms existing methods on complex datasets, demonstrating its capacity to efficiently and accurately construct knowledge graphs without predefined schemas.

Knowledge graphs (KGs) are powerful tools for representing and organizing information. They connect entities (like people, places, or things) with relationships (like "works at," "located in," or "invented") in a structured way, forming a network of interconnected knowledge. This structure makes KGs incredibly useful for a wide range of applications, from powering search engines and question-answering systems to enabling sophisticated decision-making and recommendation engines. However, building these graphs has traditionally been a labor-intensive process, often relying on manual curation or complex, finely-tuned systems that struggle to scale to the vast amounts of textual information available in the real world. Recent progress in large language models (LLMs) has prompted researchers to explore new approaches to knowledge graph construction (KGC).

This research introduces a novel framework called "Extract-Define-Canonicalize" (EDC) that leverages the power of large language models (LLMs) to automate the construction of knowledge graphs from unstructured text. The core problem EDC tackles is a significant one: how can we efficiently and accurately extract relational triplets (Subject-Relation-Object, like ["Alan Shepard", "participatedIn", "Apollo 14"]) from text, especially when dealing with large, complex schemas or even situations where a predefined schema doesn't exist? Previous approaches often struggled because they required the entire schema (the list of all possible entity and relation types) to be included within the prompt given to the LLM. This becomes impractical when the schema is large, exceeding the LLM's context window (the amount of text it can process at once), or when we want the system to discover new relationships and build the schema on the fly.

EDC addresses these limitations by breaking down the knowledge graph construction process into three distinct, interconnected phases. The first phase, Open Information Extraction, uses an LLM in a few-shot prompting scenario to identify potential relational triplets within the input text. This step is "open" because it doesn't rely on a predefined schema; the LLM is free to extract any triplets it deems relevant. The result of this phase is an "open KG," a set of triplets that might contain redundant or ambiguous information. For example, the same relationship might be expressed using different phrases ("profession," "job," "occupation"). This is where the subsequent phases come into play. The paper's overview figure illustrates the full pipeline on a passage about Alan Shepard: triplets are first extracted from the text, each relation is then given a definition, and finally the triplets are canonicalized.
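To make the first phase concrete, here is a minimal sketch of what a few-shot open extraction call might look like in Python. The `call_llm` helper, the example text, and the prompt wording are illustrative assumptions, not the paper's actual prompts.

```python
import ast

# One illustrative few-shot example; the paper uses its own demonstrations.
FEW_SHOT = """Extract relational triplets [subject, relation, object] from the text.

Text: "Alan Shepard was born on November 18, 1923 in Derry, New Hampshire."
Triplets: [["Alan Shepard", "dateOfBirth", "November 18, 1923"], ["Alan Shepard", "birthPlace", "Derry, New Hampshire"]]
"""

def open_information_extraction(text: str, call_llm) -> list[list[str]]:
    """Ask the LLM for schema-free triplets and parse its list-of-lists answer."""
    prompt = f'{FEW_SHOT}\nText: "{text}"\nTriplets:'
    raw = call_llm(prompt)  # call_llm wraps whatever chat-completion API is used
    try:
        triplets = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return []  # fall back to no triplets if the output cannot be parsed
    # Keep only well-formed [subject, relation, object] entries.
    return [t for t in triplets if isinstance(t, list) and len(t) == 3]
```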

The remaining two phases turn this open KG into a usable graph: the Schema Definition phase generates a natural-language definition for each relation appearing in the extracted triplets, and the Schema Canonicalization phase uses those definitions to standardize the triplets into a consistent knowledge graph. Each is described in more detail below.

The second phase, Schema Definition, takes the "open KG" generated in the first phase and uses the LLM's ability to generate explanations to create natural language definitions for each of the relations and entity types identified. This is a crucial step because it provides a semantically rich description of what each relation means in the context of the input text. Instead of relying on external resources like WordNet (which might not capture the specific nuance of a relation in a particular domain), EDC leverages the LLM's understanding of the text to generate these definitions. For instance, if the open KG contains the triplet ["Alan Shepard", "participatedIn", "Apollo 14"], the Schema Definition phase might generate a definition for "participatedIn" like: "The subject entity took part in the mission specified by the object entity." These definitions become the key to the final, and most innovative, phase of EDC.
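A rough sketch of this definition step is shown below, again assuming the generic `call_llm` helper; the prompt wording is illustrative rather than taken from the paper.

```python
def define_relations(triplets: list[list[str]], call_llm) -> dict[str, str]:
    """Generate a one-sentence natural-language definition for each distinct relation."""
    definitions: dict[str, str] = {}
    for subj, rel, obj in triplets:
        if rel in definitions:
            continue  # define each relation only once
        prompt = (
            f"The triplet [{subj}, {rel}, {obj}] was extracted from a text. "
            f"Write a one-sentence definition of the relation '{rel}' "
            "in terms of its subject and object entities."
        )
        definitions[rel] = call_llm(prompt).strip()
    return definitions

# e.g. definitions["participatedIn"] might come back as:
# "The subject entity took part in the mission specified by the object entity."
```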

The third phase, Schema Canonicalization, is where EDC truly distinguishes itself from prior work. This phase aims to transform the potentially messy and redundant "open KG" into a clean, consistent, and canonicalized knowledge graph. The process works differently depending on whether a predefined target schema is available. If a target schema exists (the "Target Alignment" scenario), EDC uses the definitions generated in Phase 2 to find the closest matching relation (or entity type) within the target schema. It does this by using a sentence transformer to create vector embeddings of the definitions and then performing a similarity search against the embeddings of the target schema elements. Crucially, EDC doesn't just blindly replace relations based on embedding similarity. Instead, it uses the LLM again to verify whether the proposed transformation is semantically valid, given the context of the specific triplet. This prevents over-generalization, a common problem with previous canonicalization methods that often grouped semantically distinct relations together. For example, a system without this verification step might incorrectly conflate "is brother of" and "is son of."
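The sketch below illustrates the Target Alignment logic under some assumptions: the target schema is given as a dictionary mapping relations to definitions, `call_llm` is the same hypothetical helper as above, and the model name and similarity threshold are arbitrary defaults. The sentence-transformers library is used here as one natural way to embed the definitions.

```python
from sentence_transformers import SentenceTransformer, util

def align_to_target_schema(triplet, definition, target_schema, call_llm,
                           model_name="all-MiniLM-L6-v2", threshold=0.5):
    """Find the closest target relation by definition similarity, then have the LLM verify it."""
    model = SentenceTransformer(model_name)
    target_relations = list(target_schema.keys())
    # Embed every target-schema definition and the open relation's definition.
    target_embs = model.encode(list(target_schema.values()), convert_to_tensor=True)
    query_emb = model.encode(definition, convert_to_tensor=True)

    scores = util.cos_sim(query_emb, target_embs)[0]  # similarity to each target definition
    best = int(scores.argmax())
    if float(scores[best]) < threshold:
        return None  # nothing in the target schema is close enough

    candidate = target_relations[best]
    subj, rel, obj = triplet
    # Let the LLM check that the replacement is semantically valid for this triplet.
    verdict = call_llm(
        f"Triplet: [{subj}, {rel}, {obj}]. Definition of '{rel}': {definition}. "
        f"Can '{rel}' be replaced by '{candidate}' (defined as: {target_schema[candidate]}) "
        "without changing the meaning? Answer yes or no."
    )
    return [subj, candidate, obj] if verdict.strip().lower().startswith("yes") else None
```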

If no target schema is available (the "Self Canonicalization" scenario), EDC builds the schema dynamically. It starts with an empty canonical schema and iteratively examines the triplets from the open KG. It uses the same definition-based similarity search to find potential matches within the currently existing canonical schema. If a close match is found and the LLM verifies the transformation, the triplet is updated to use the canonical relation. If no suitable match is found, the relation (and its definition) are added to the canonical schema, effectively expanding the schema on the fly. This ability to operate without a predefined schema is a major advantage of EDC, making it applicable in situations where the desired knowledge structure is not known in advance. This flexibility allows EDC to adapt to a wider range of KGC tasks, from aligning extracted information with established knowledge bases like Wikidata to creating entirely new knowledge graphs from scratch based on the specific content of the input text.
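The self-canonicalization loop can be sketched as follows. Here `find_and_verify` is an assumed helper performing the same definition-embedding search and LLM verification as in the previous sketch, returning a canonicalized triplet or `None`.

```python
def self_canonicalize(open_kg, open_definitions, find_and_verify):
    """Iteratively build a canonical schema, starting from an empty one."""
    canonical_schema: dict[str, str] = {}  # relation -> definition
    canonical_kg: list[list[str]] = []
    for subj, rel, obj in open_kg:
        definition = open_definitions[rel]
        match = find_and_verify([subj, rel, obj], definition, canonical_schema)
        if match is not None:
            canonical_kg.append(match)          # reuse an existing canonical relation
        else:
            canonical_schema[rel] = definition  # no match: grow the schema on the fly
            canonical_kg.append([subj, rel, obj])
    return canonical_kg, canonical_schema
```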

To further improve extraction performance, the researchers added an optional Refinement phase. By including the previously extracted triplets and the relevant portion of the schema in the extraction prompt, they were able to improve results. To supply that schema information, they trained a Schema Retriever, a component that identifies which schema elements are relevant to a given input text.
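A minimal sketch of this refinement step is shown below; `retrieve_schema` stands in for the trained Schema Retriever and `call_llm` for the extraction model, both assumed interfaces, with illustrative prompt wording.

```python
def refine_extraction(text, previous_triplets, retrieve_schema, call_llm):
    """Re-run extraction with previously extracted triplets and retrieved schema hints."""
    relevant_relations = retrieve_schema(text)  # e.g. ["participatedIn", "birthPlace", ...]
    prompt = (
        f'Text: "{text}"\n'
        f"Previously extracted triplets: {previous_triplets}\n"
        f"Candidate relations that may apply: {relevant_relations}\n"
        "Extract an improved list of [subject, relation, object] triplets."
    )
    return call_llm(prompt)
```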

The researchers behind EDC rigorously evaluated its performance on three established knowledge graph construction datasets: WebNLG, REBEL, and Wiki-NRE. These datasets were chosen because they represent a significant step up in complexity compared to the smaller, domain-specific datasets often used to evaluate previous LLM-based KGC methods. They contain a much wider variety of relation types, making them a more realistic testbed for real-world applications. The experiments were conducted in both the Target Alignment setting (where a predefined schema was provided) and the Self Canonicalization setting (where EDC had to construct the schema itself).

The results demonstrate that EDC significantly outperforms existing state-of-the-art methods, particularly on the more challenging REBEL and Wiki-NRE datasets. In the Target Alignment setting, EDC achieved higher precision, recall, and F1 scores compared to specialized, trained models like REGEN and GenIE. This is particularly impressive because EDC achieves this without any parameter tuning of the underlying LLMs, relying solely on prompting. The researchers also explored the impact of using different LLMs for the Open Information Extraction phase, finding that GPT-4 yielded the best results, with Mistral-7B and GPT-3.5-turbo showing comparable performance. Furthermore, the addition of the refinement phase (EDC+R), which iteratively improves the extraction by incorporating previously extracted triplets and schema information, consistently boosted performance across all datasets and LLMs.

In the Self Canonicalization setting, EDC was evaluated through manual human evaluation, focusing on the precision, conciseness, and redundancy of the generated knowledge graphs. EDC demonstrated high precision, meaning the extracted triplets were accurate and meaningful. It also produced schemas that were significantly more concise and less redundant than those generated by a strong baseline clustering-based canonicalization method (CESI). This highlights EDC's ability to avoid the over-generalization problem that often plagues clustering-based approaches. The researchers also performed ablation studies, demonstrating the importance of the trained Schema Retriever in providing relevant schema information during the refinement phase.

In conclusion, EDC represents a significant advancement in automated knowledge graph construction. Its three-phase framework, leveraging the strengths of LLMs for open information extraction, schema definition, and post-hoc canonicalization, offers a flexible and powerful approach that can handle large, complex schemas and even scenarios where no predefined schema exists. The demonstrated performance improvements and the ability to scale to real-world text open up exciting possibilities for a wide range of applications, from enhancing search and question-answering systems to building comprehensive knowledge bases from vast amounts of textual data. While there are limitations, such as the computational cost associated with using LLMs, and areas for future work, such as incorporating entity de-duplication, EDC provides a substantial step forward toward the goal of truly automated knowledge acquisition.

Read Paper:

[2404.03868] Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction

Zhang, B., & Soh, H. (2024). Extract, define, canonicalize: An LLM-based framework for knowledge graph construction. arXiv preprint arXiv:2404.03868.
