Automated Ontology Mining from LLMs

Thomas Weitzel

Large Language Models (LLMs) are increasingly used to automate ontology mining – extracting structured knowledge (entities, relations, hierarchies) from text – and to generate knowledge graphs (KGs). Recent research spans from general frameworks to domain-specific applications, and new tools are emerging that integrate LLMs with existing ontologies. Below is a summary of key academic works, industry use cases, and open-source projects in this area.

Academic Research and Methods

LLM-Driven Ontology & KG Construction: Several recent papers propose pipelines in which an LLM helps build an ontology (schema) and populate a knowledge graph from unstructured text. For example, OntoKGen is a pipeline that uses an LLM with adaptive chain-of-thought prompting to iteratively extract an ontology and then generate a KG (arxiv.org). OntoKGen guides the LLM step by step: first identifying key concepts, then relationships, then properties, with user feedback at each stage (promptlayer.com). The approach was validated on technical documents (e.g. semiconductor manufacturing manuals), yielding a detailed KG that aligned with expert expectations (promptlayer.com). Another work by Shimizu and Hitzler (2024) outlines how LLMs can accelerate ontology modeling, extension, population, and alignment – tasks that traditionally require intensive expert effort (arxiv.org). They argue that modular ontologies and human-in-the-loop validation remain important even as LLMs take on more of the load (ar5iv.org).
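The staged, human-in-the-loop loop described above can be sketched as follows (a minimal illustration, not OntoKGen's actual code: the `llm` stub returns canned answers in place of a real model call, and `approve` stands in for the user-feedback step):

```python
def llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers for the demo."""
    canned = {
        "concepts": "Wafer, Etcher, Recipe",
        "relations": "Etcher processes Wafer; Recipe configures Etcher",
        "properties": "Wafer.diameter; Etcher.chamber_count",
    }
    for key, answer in canned.items():
        if key in prompt:
            return answer
    return ""

def extract_ontology(document: str, approve=lambda stage, out: out) -> dict:
    """Run the staged prompts, passing each result through a reviewer hook."""
    ontology = {}
    for stage in ("concepts", "relations", "properties"):
        raw = llm(f"From the text below, list the key {stage}:\n{document}")
        ontology[stage] = approve(stage, raw)  # human-in-the-loop feedback
    return ontology

result = extract_ontology("...semiconductor manufacturing manual...")
```

The `approve` hook is where the user feedback step would slot in: a reviewer can reject or edit a stage's output before the next prompt runs.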

Prompting Frameworks for Knowledge Graphs: LLMs have been applied to generate KG triples either directly via prompts or in stages. Zhang & Soh (EMNLP 2024) note that naive prompting struggles when schemas are large (exceeding the context length) or when no schema is predefined (aclanthology.org). They propose an “Extract-Define-Canonicalize” (EDC) framework: use open information extraction to get raw facts, have the LLM define a schema (ontology) from those facts, and finally canonicalize entities and relations (aclanthology.org). This three-phase pipeline can work with or without a given ontology, constructing one on the fly if needed. Another approach, by Kommineni et al. (2024), starts by having an LLM generate competency questions (CQs) for a domain, then uses those questions to derive an ontology and populate a KG (arxiv.org). Their semi-automatic pipeline was demonstrated on the “deep learning methods” domain using scholarly publications (arxiv.org). Notably, they use a “judge LLM” to evaluate the accuracy of the generated KG against ground truth, finding that LLMs greatly reduce human effort, though expert validation is still recommended (arxiv.org). Hu et al. (2024) introduce a Progressive Ontology Prompting (POP) algorithm combined with a dual-agent LLM system to automatically discover knowledge from scientific papers (arxiv.org). In their “LLM-Duo” setup, one agent (the explorer) extracts candidates while another (the evaluator) critiques and refines the output (arxiv.org). Guided by a predefined ontology (via breadth-first traversal of its terms), this method enabled large-scale extraction – e.g. identifying 2,421 medical interventions from 64k research articles – with higher accuracy and completeness than baselines (arxiv.org).
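The three EDC phases might be sketched like this (a toy illustration with stubbed extraction output; `canonical_map` stands in for the LLM-driven canonicalization step, and all names are invented):

```python
def extract(text: str) -> list[tuple[str, str, str]]:
    """Phase 1: open information extraction -> raw (subject, relation, object).
    Stubbed here; a real system would run an open IE model or LLM prompt."""
    return [("Marie Curie", "was awarded", "Nobel Prize"),
            ("M. Curie", "received", "Nobel Prize")]

def define(triples) -> dict:
    """Phase 2: an LLM would name and describe each raw relation,
    yielding a schema. Here we just register every observed relation."""
    return {rel: f"relation observed in text: {rel}" for _, rel, _ in triples}

def canonicalize(triples, schema, canonical_map) -> list[tuple[str, str, str]]:
    """Phase 3: merge surface variants of entities/relations, keeping only
    relations the schema defines."""
    out = set()
    for s, r, o in triples:
        r2 = canonical_map.get(r, r)
        if r in schema or r2 in schema:
            out.add((canonical_map.get(s, s), r2, canonical_map.get(o, o)))
    return sorted(out)

canonical_map = {"M. Curie": "Marie Curie", "received": "was awarded"}
raw = extract("...")
kg = canonicalize(raw, define(raw), canonical_map)
# the two surface variants collapse into a single canonical triple
```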

Structured Extraction with Zero-Shot LLMs: Some research focuses on using LLMs in a zero-shot or few-shot manner to fill a given schema or ontology. The SPIRES method (Caufield et al., 2024) – Structured Prompt Interrogation and Recursive Extraction of Semantics – has an LLM extract information conforming to a specified schema without any training (arxiv.org). In SPIRES, the LLM is prompted to output an instance (an ontology class instance) with attribute–value pairs according to a custom ontology model (academic.oup.com). This has been applied in bioinformatics to populate biomedical databases. Similarly, other works combine LLM outputs with knowledge graph completion techniques: for example, one hybrid approach uses an LLM to validate and explain edges predicted by a graph neural network, ensuring that new relations are supported by the literature (mdpi.com). This helped discover plausible protein interactions from PubMed papers while mitigating the LLM’s tendency to hallucinate by grounding its answers in real textual evidence (mdpi.com).
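A SPIRES-style schema fill can be sketched as follows (illustrative only: the stubbed `llm` returns a fixed JSON reply, and the grounding lexicon with its `DRUG:`/`COND:` identifiers is invented for the example, not taken from any real ontology):

```python
import json

SCHEMA = {"drug": str, "condition": str}  # fields the instance must provide

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call returning a schema-shaped JSON reply."""
    return json.dumps({"drug": "carvedilol", "condition": "hypertension"})

# Toy grounding lexicon mapping labels to made-up ontology identifiers.
LEXICON = {"carvedilol": "DRUG:0001", "hypertension": "COND:0001"}

def extract_instance(text: str) -> dict:
    reply = json.loads(llm(f"Fill the fields {list(SCHEMA)} from: {text}"))
    instance = {}
    for field, typ in SCHEMA.items():
        value = reply.get(field)
        if not isinstance(value, typ):
            raise ValueError(f"missing or ill-typed field: {field}")
        # Ground the free-text value to an ontology identifier when possible.
        instance[field] = {"label": value, "id": LEXICON.get(value)}
    return instance

result = extract_instance("One treatment for high blood pressure is carvedilol.")
```

The key idea is that the schema both shapes the prompt and validates the reply, so ungrounded or malformed output is caught before it reaches the knowledge base.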

Integrating LLMs with Existing Ontologies: A critical theme is aligning LLM-extracted knowledge with established ontologies or knowledge bases to improve consistency. Kim et al. (2024) demonstrate this in healthcare: they built a “patient” knowledge graph for a rare disease by having LLMs extract clinical facts in natural language, then mapping the outputs back to medical ontologies like MeSH and HPO (arxiv.org). This approach allowed them to capture nuances not explicitly covered by codified ontologies (using the LLM’s generalized language understanding) while still anchoring the results to standard vocabularies for interoperability (arxiv.org). Another study presents an ontology-grounded KG construction method that uses the Wikidata schema as guidance (arxiv.org). The authors combine LLM-driven competency question generation with ontology alignment to Wikidata, so that extracted relations are matched, wherever possible, to existing Wikidata properties (arxiv.org). The resulting KGs are “consistent, complete, and interoperable” with Wikidata, meaning they can be parsed as RDF and merged into larger knowledge bases (arxiv.org). In the legal domain, Li et al. (2024) introduce a Joint Knowledge-Enhanced Model (JKEM) that injects legal domain knowledge into an LLM via prefix-tuning (mdpi.com). By embedding law-ontology information as a prompt prefix (while freezing the LLM’s weights), they significantly improved the accuracy of extracting facts from Chinese legal documents (mdpi.com). The fine-tuned LLM then produced a legal knowledge graph (CLKG) with thousands of triples, showing the value of combining prior knowledge with language modeling in high-stakes domains.

Domain-Specific Applications

Healthcare & Biomedicine: Ontology mining with LLMs has seen active use in biomedical domains. Beyond the patient case study above, researchers have used LLMs to assist in curating disease ontologies and gene–disease relationships. For example, LLMs can extract medical entities from clinical notes and map them to codes (ICD, SNOMED) with minimal supervision (arxiv.org). One experiment found that a dynamically prompted GPT model could identify phenotype terms in patient text and correctly map them to the Human Phenotype Ontology (HPO) with high recall, whereas fine-tuning the LLM narrowly on the ontology terms sometimes hurt generalization (arxiv.org). In bioinformatics, the SPIRES zero-shot schema extraction (noted above) and similar LLM-based pipelines have been used to populate biochemical knowledge bases. LLMs have also been combined with structured biomedical knowledge for verification – e.g. using an LLM to read the literature and confirm whether a proposed protein–protein interaction is supported by experimental evidence (mdpi.com). These approaches leverage LLMs’ ability to interpret complex scientific text, while the ontologies (e.g. MeSH for diseases, GO for gene functions) provide a structure to slot the information into.

Legal and Regulatory: In the legal domain, accuracy and alignment with authoritative ontologies (like statutes or legal taxonomies) are paramount. The JKEM model for Chinese law is one example where an LLM was guided with legal knowledge to extract facts for a legal KG (mdpi.com). Other projects have explored LLMs summarizing and linking regulations or case law. For instance, an LLM might extract key entities (people, organizations, legal provisions) and relationships (e.g. cites, amends, violates) from court case texts to build a legal knowledge graph. Research in this area often involves a human in the loop to validate critical details, given the high cost of errors. Nonetheless, early case studies suggest LLMs can substantially speed up the assembly of legal knowledge graphs by drafting the skeleton of entities and relations, which experts then refine.

Science & Engineering: Ontology mining is also being applied to scientific literature to help researchers navigate complex, evolving knowledge. Oarga et al. (NeurIPS 2024 Workshop) demonstrated end-to-end ontology and KG generation from scientific papers using open-source LLMs (openreview.net). They showed that an LLM (without a human-defined schema) could reconstruct a known ontology of chemical elements and even propose a new ontology for the nascent field of single-atom catalysts (openreview.net). The automatically generated ontologies and KGs captured hierarchical relationships and facts from papers, suggesting LLMs can aid ontology creation in emerging scientific domains where no schema exists yet. Another ambitious effort in this space was the dual-agent POP approach by Hu et al. (2024) mentioned earlier, which, in a speech-therapy research scenario, extracted thousands of relevant interventions from the literature and built a structured knowledge base for practitioners (arxiv.org). These examples underscore that in domains like materials science, chemistry, or the social sciences, LLMs can help distill scattered knowledge into ontologies and graphs – accelerating discovery and literature review.

Industry Applications and Case Studies

Organizations are beginning to apply these techniques in real-world settings to manage large text corpora and enhance AI applications. A notable example is Microsoft’s GraphRAG approach, which integrates LLM-based KG construction into a retrieval-augmented generation system. In GraphRAG, an LLM first reads a collection of private documents and produces a knowledge graph of the key entities and relations (microsoft.com). That graph is then used at query time to augment the LLM’s context with relevant facts, dramatically improving question-answering accuracy on the documents (microsoft.com). Microsoft reported that GraphRAG yields more informed and grounded answers (fewer hallucinations) than standard RAG, especially on complex queries that require connecting information across multiple sources (microsoft.com). A case study on a news dataset (VIINA) showed that GraphRAG could discover indirect connections (e.g. identifying an entity like “Novorossiya” and linking it to related events) that a pure vector search missed, enabling the AI to answer questions with proper evidence (microsoft.com).
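At toy scale, the two GraphRAG phases look roughly like this (a conceptual sketch, not Microsoft's implementation; the triples and entity names are invented):

```python
from collections import defaultdict

graph = defaultdict(list)  # entity -> [(relation, neighbor)]

def index_triples(triples):
    """Indexing phase: store LLM-extracted (head, relation, tail) edges,
    plus reverse edges so the graph can be walked in either direction."""
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
        graph[tail].append((f"inverse of {rel}", head))

def build_context(entity: str, hops: int = 2) -> list[str]:
    """Query phase: collect facts within N hops of the question's entity,
    to be injected into the LLM's prompt instead of raw text chunks."""
    facts, frontier, seen = [], [entity], {entity}
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for rel, other in graph[node]:
                facts.append(f"{node} {rel} {other}")
                if other not in seen:
                    seen.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return facts

index_triples([("Novorossiya", "mentioned in", "Article 17"),
               ("Article 17", "reports on", "event X")])
context = build_context("Novorossiya")
```

The multi-hop walk is what lets graph retrieval surface indirect connections (Novorossiya → Article 17 → event X) that a single vector-similarity lookup would miss.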

Major tech companies highlight such integrations of KGs with LLMs as key to enterprise AI. NVIDIA, for instance, has discussed how LLM-driven knowledge graphs can enhance enterprise search and analytics (developer.nvidia.com). By transforming unstructured internal data into a structured graph, businesses can enable reasoning and complex querying that plain text search or FAQ bots cannot handle (developer.nvidia.com). This approach helps reduce LLM hallucinations by grounding responses in a factual graph, and it improves accuracy on multi-hop questions (developer.nvidia.com). NVIDIA’s technical blog notes that Microsoft’s GraphRAG demonstrated substantial gains in handling “narrative” private data, and that hybrid methods (combining vector search with graph-based retrieval) are emerging to tackle a variety of queries (developer.nvidia.com).

In industry settings, domain-specific knowledge graphs built with LLM assistance are becoming a valuable asset. For example, in healthcare, companies are interested in LLM-built KGs that map symptoms, diseases, treatments, and genes, enabling advanced clinical decision support (developer.nvidia.com). Finance and legal firms are experimenting with LLMs to parse regulatory texts and contracts into graph representations for easier compliance checking and Q&A. These case studies show that automated ontology mining isn’t just academic – it is beginning to power real-world applications, from intelligent document search to personalized recommendation systems, where organizing knowledge explicitly yields better results.

Open-Source Projects and Tools

The surge in research has been accompanied by open-source tools to perform LLM-based ontology extraction and KG construction:

  • OntoGPT (Monarch Initiative): An open-source Python package that uses LLMs plus ontology grounding to extract structured information from text (github.com). OntoGPT takes in raw text and a target ontology or schema type (for example, “drug”), and prompts an LLM to output instances of that ontology. It leverages existing biomedical ontologies behind the scenes – e.g. given “One treatment for high blood pressure is carvedilol.”, OntoGPT will recognize “carvedilol” as a Drug and output a structured object with the drug and the condition, grounded to identifiers in an ontology (github.com). It implements the SPIRES methodology and supports various LLM backends (OpenAI, Anthropic, etc.) (github.com). This tool has been used for biological knowledge base population and demonstrates how LLMs can serve as ontology-driven information extractors.

  • LLM-KG Construction Pipelines: The team at Fusion Jena released a repository for automatic KG construction with LLMs (github.com). Their code implements the pipeline of competency question generation → ontology building → KG population, tested with four different LLMs (GPT-4, GPT-3.5, and others) (github.com). The repo includes prompts, data, and evaluation results, allowing others to replicate or adapt the semi-automated ontology engineering process described in their papers. Similarly, researchers from UMBC (Padia et al.) have shared case-study code where an open LLM is used to suggest corrections to a knowledge graph for consistency (part of an AAAI-MAKE 2024 workshop) – showcasing how open models can be applied to refine KGs in an iterative loop.

  • Knowledge Graph Maker: Knowledge Graph Maker is an open-source library (available via pip) that simplifies text-to-graph conversion using LLMs. Notably, it allows the user to provide a custom ontology schema to constrain the LLM’s output (towardsdatascience.com). Rather than letting the LLM invent its own structure, the tool “coerces” the LLM to use the user-defined ontology when extracting triples (towardsdatascience.com). This is useful when a domain schema is known – e.g. one could supply an ontology of a personnel hierarchy and have the LLM extract all Person→Organization relationships from a document in that format. The project includes examples and a notebook for easy adoption.
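Schema-constrained extraction of this kind can be sketched as a simple filter over candidate triples (the names below are hypothetical and do not reflect Knowledge Graph Maker's actual API):

```python
# User-supplied ontology: allowed node types and edge types.
ONTOLOGY = {
    "node_types": {"Person", "Organization"},
    "edge_types": {"works_for", "manages"},
}

# Entity typing as a real pipeline might obtain it from NER or the LLM itself.
ENTITY_TYPES = {"Ada": "Person", "Grace": "Person", "Acme": "Organization"}

def conforms(triple) -> bool:
    """Keep a triple only if both endpoints and the relation fit the ontology."""
    subj, rel, obj = triple
    return (ENTITY_TYPES.get(subj) in ONTOLOGY["node_types"]
            and ENTITY_TYPES.get(obj) in ONTOLOGY["node_types"]
            and rel in ONTOLOGY["edge_types"])

raw_llm_triples = [
    ("Ada", "works_for", "Acme"),
    ("Ada", "likes", "coffee"),    # relation and object fall outside the ontology
    ("Grace", "manages", "Ada"),
]
kg = [t for t in raw_llm_triples if conforms(t)]
# only the two ontology-conformant triples survive
```

In practice the constraint is usually applied twice: the ontology is embedded in the prompt to steer generation, and a filter like this catches anything the LLM invents anyway.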

  • Neo4j LLM Graph Builder: Graph database vendor Neo4j has introduced an LLM Knowledge Graph Builder, an online tool that turns unstructured data (PDFs, web pages, etc.) into a Neo4j graph. It uses various LLMs (OpenAI GPT-4, Anthropic Claude, Llama 2, etc.) under the hood to parse content and generate a property graph of entities and relationships (neo4j.com). The user can either supply an existing ontology or let the system infer one, and the results can be visualized and queried in Neo4j’s interface. This is essentially a no-code/low-code solution that packages the techniques described in research for practitioners.

  • Other Notable Tools: There are many emerging projects in this space. For example, LLMGraph (github.com, by Dylan Hogg) lets you input a single seed entity and then uses an LLM to recursively fetch related entities and relations (an expanding knowledge graph of a topic). It can output the graph in formats like GraphML for analysis. This is useful for exploratory knowledge-graph creation from an LLM’s world knowledge (for instance, generating a mini knowledge graph of a historical figure and related events by interrogating ChatGPT). We’re also seeing LLM integrations in existing knowledge-base frameworks – for instance, Haystack and LlamaIndex have modules to incorporate KG-based querying, and LangChain provides patterns for “knowledge graph memory”. Many of these tools are open source and allow customization of prompts or schemas, lowering the barrier to experimenting with automated knowledge extraction.
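The seed-and-expand pattern behind such tools reduces to a breadth-first loop over LLM-suggested neighbors (a sketch with hypothetical function names; the canned `NEIGHBORS` table stands in for real LLM queries):

```python
# What an LLM might return when asked "which entities relate to X, and how?"
NEIGHBORS = {
    "Ada Lovelace": [("collaborated with", "Charles Babbage")],
    "Charles Babbage": [("designed", "Analytical Engine")],
    "Analytical Engine": [],
}

def ask_llm_for_neighbors(entity: str):
    """Stand-in for prompting an LLM about an entity's relations."""
    return NEIGHBORS.get(entity, [])

def expand(seed: str, max_depth: int = 2):
    """Breadth-first expansion from a seed entity, one LLM query per node."""
    edges, frontier, seen = [], [seed], {seed}
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for rel, other in ask_llm_for_neighbors(node):
                edges.append((node, rel, other))
                if other not in seen:
                    seen.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return edges

edges = expand("Ada Lovelace")
```

The `seen` set is what keeps the expansion from looping when the LLM suggests an entity it has already visited; the depth limit bounds the number of model calls.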

In summary, automated ontology mining with LLMs is a fast-evolving field. Academic research has demonstrated that LLMs can propose ontologies, extract structured triples, and even update knowledge bases with minimal human input. Domain-specific studies (in healthcare, legal, scientific domains) highlight both the promise and the need for careful alignment with existing expert knowledge. Industry adoption is underway, focusing on using LLM-generated knowledge graphs to improve search, QA, and decision support systems. Meanwhile, an ecosystem of open-source tools is making these capabilities accessible to practitioners. As LLM technology advances, we can expect even more robust methods for extracting structured knowledge, building knowledge graphs, and integrating LLMs with ontologies – moving closer to AI systems that can learn, organize, and reason with knowledge in a human-like yet formally structured way.

Sources:

  • Abolhasani & Pan (2024). Leveraging LLM for Automated Ontology Extraction and Knowledge Graph Generation (arxiv.org).

  • Shimizu & Hitzler (2024). Accelerating Knowledge Graph and Ontology Engineering with LLMs (arxiv.org).

  • Zhang & Soh (2024). Extract, Define, Canonicalize: An LLM-based Framework for KG Construction (aclanthology.org).

  • Kommineni et al. (2024). LLM-supported approach to ontology and KG construction (arxiv.org, github.com).

  • Hu et al. (2024). LLM-Duo with Progressive Ontology Prompting for Scientific Literature (arxiv.org).

  • Caufield et al. (2024). SPIRES: Zero-shot Schema Extraction with LLMs (arxiv.org, academic.oup.com).

  • Ivanisenko et al. (2024). Hybrid GNN-LLM for Extending Biomedical KGs (mdpi.com).

  • Kim et al. (2024). Structured Extraction of Real World Medical Knowledge using LLMs (arxiv.org).

  • Li et al. (2024). Legal Knowledge Graph Construction via Knowledge-Enhanced LLM (mdpi.com).

  • Oarga et al. (2024). Scientific KG and Ontology Generation using Open LLMs (openreview.net).

  • Microsoft Research Blog (2024). GraphRAG: LLMs + Knowledge Graphs for Private Data (microsoft.com).

  • NVIDIA Technical Blog (2024). LLM-Driven Knowledge Graphs in Enterprises (developer.nvidia.com).

  • Monarch Initiative (2023). OntoGPT – LLM-based Ontology Extraction Tool (github.com).

  • Fusion Jena (2024). Automatic KG Construction with LLM – Code Repository (github.com).

  • Rahul Nair (2023). Graph Maker library (towardsdatascience.com).

  • Neo4j (2023). LLM Knowledge Graph Builder (neo4j.com).
