Automated Ontology Mining from LLMs

Thomas Weitzel

Large Language Models (LLMs) are increasingly used to automate ontology mining – extracting structured knowledge (entities, relations, hierarchies) from text – and to generate knowledge graphs (KGs). Recent research spans from general frameworks to domain-specific applications, and new tools are emerging that integrate LLMs with existing ontologies. Below is a summary of key academic works, industry use cases, and open-source projects in this area.

Academic Research and Methods

LLM-Driven Ontology & KG Construction: Several recent papers propose pipelines in which an LLM helps build an ontology (schema) and populate a knowledge graph from unstructured text. For example, OntoKGen is a pipeline that uses an LLM with adaptive chain-of-thought prompting to iteratively extract an ontology and then generate a KG (arxiv.org). OntoKGen guides the LLM step by step: first identifying key concepts, then relationships, then properties, with user feedback at each stage (promptlayer.com). The approach was validated on technical documents (e.g. semiconductor manufacturing manuals), yielding a detailed KG that aligned with expert expectations (promptlayer.com). Another work by Shimizu and Hitzler (2024) outlines how LLMs can accelerate ontology modeling, extension, population, and alignment – tasks that traditionally require intensive expert effort (arxiv.org). They argue that modular ontologies and human-in-the-loop validation remain important even as LLMs take on more of the load (ar5iv.org).
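The staged, human-in-the-loop loop described above can be sketched as follows (a minimal illustration, not OntoKGen's actual code: the `llm` stub returns canned answers in place of a real model call, and `approve` stands in for the user-feedback step):

```python
def llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers for the demo."""
    canned = {
        "concepts": "Wafer, Etcher, Recipe",
        "relations": "Etcher processes Wafer; Recipe configures Etcher",
        "properties": "Wafer.diameter; Etcher.chamber_count",
    }
    for key, answer in canned.items():
        if key in prompt:
            return answer
    return ""

def extract_ontology(document: str, approve=lambda stage, out: out) -> dict:
    """Run the staged prompts, passing each result through a reviewer hook."""
    ontology = {}
    for stage in ("concepts", "relations", "properties"):
        raw = llm(f"From the text below, list the key {stage}:\n{document}")
        ontology[stage] = approve(stage, raw)  # human-in-the-loop feedback
    return ontology

result = extract_ontology("...semiconductor manufacturing manual...")
```

The `approve` hook is where the user feedback step would slot in: a reviewer can reject or edit a stage's output before the next prompt runs.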

Prompting Frameworks for Knowledge Graphs: LLMs have been applied to generate KG triples either directly via prompts or in stages. Zhang & Soh (EMNLP 2024) note that naive prompting struggles when schemas are large (exceeding the context length) or when no schema is predefined (aclanthology.org). They propose an “Extract-Define-Canonicalize” (EDC) framework: use open information extraction to get raw facts, have the LLM define a schema (ontology) from those facts, and finally canonicalize entities and relations (aclanthology.org). This three-phase pipeline can work with or without a given ontology, constructing one on the fly if needed. Another approach, by Kommineni et al. (2024), starts by having an LLM generate competency questions (CQs) for a domain, then uses those questions to derive an ontology and populate a KG (arxiv.org). Their semi-automatic pipeline was demonstrated on the “deep learning methods” domain using scholarly publications (arxiv.org). Notably, they use a “judge LLM” to evaluate the accuracy of the generated KG against ground truth, finding that LLMs greatly reduce human effort, though expert validation is still recommended (arxiv.org). Hu et al. (2024) introduce a Progressive Ontology Prompting (POP) algorithm combined with a dual-agent LLM system to automatically discover knowledge from scientific papers (arxiv.org). In their “LLM-Duo” setup, one agent (the explorer) extracts candidates while another (the evaluator) critiques and refines the output (arxiv.org). Guided by a predefined ontology (via breadth-first traversal of its terms), this method enabled large-scale extraction – e.g. identifying 2,421 medical interventions from 64k research articles – with higher accuracy and completeness than baselines (arxiv.org).
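The three EDC phases might be sketched like this (a toy illustration with stubbed extraction output; `canonical_map` stands in for the LLM-driven canonicalization step, and all names are invented):

```python
def extract(text: str) -> list[tuple[str, str, str]]:
    """Phase 1: open information extraction -> raw (subject, relation, object).
    Stubbed here; a real system would run an open IE model or LLM prompt."""
    return [("Marie Curie", "was awarded", "Nobel Prize"),
            ("M. Curie", "received", "Nobel Prize")]

def define(triples) -> dict:
    """Phase 2: an LLM would name and describe each raw relation,
    yielding a schema. Here we just register every observed relation."""
    return {rel: f"relation observed in text: {rel}" for _, rel, _ in triples}

def canonicalize(triples, schema, canonical_map) -> list[tuple[str, str, str]]:
    """Phase 3: merge surface variants of entities/relations, keeping only
    relations the schema defines."""
    out = set()
    for s, r, o in triples:
        r2 = canonical_map.get(r, r)
        if r in schema or r2 in schema:
            out.add((canonical_map.get(s, s), r2, canonical_map.get(o, o)))
    return sorted(out)

canonical_map = {"M. Curie": "Marie Curie", "received": "was awarded"}
raw = extract("...")
kg = canonicalize(raw, define(raw), canonical_map)
# the two surface variants collapse into a single canonical triple
```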

Structured Extraction with Zero-Shot LLMs: Some research focuses on using LLMs in a zero-shot or few-shot manner to fill a given schema or ontology. The SPIRES method (Caufield et al., 2024) – Structured Prompt Interrogation and Recursive Extraction of Semantics – has an LLM extract information conforming to a specified schema without any training (arxiv.org). In SPIRES, the LLM is prompted to output an instance (an ontology class instance) with attribute–value pairs according to a custom ontology model (academic.oup.com). This has been applied in bioinformatics to populate biomedical databases. Similarly, other works combine LLM outputs with knowledge graph completion techniques: for example, one hybrid approach uses an LLM to validate and explain edges predicted by a graph neural network, ensuring that new relations are supported by the literature (mdpi.com). This helped discover plausible protein interactions from PubMed papers while mitigating the LLM’s tendency to hallucinate by grounding its answers in real textual evidence (mdpi.com).
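A SPIRES-style schema fill can be sketched as follows (illustrative only: the stubbed `llm` returns a fixed JSON reply, and the grounding lexicon with its `DRUG:`/`COND:` identifiers is invented for the example, not taken from any real ontology):

```python
import json

SCHEMA = {"drug": str, "condition": str}  # fields the instance must provide

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call returning a schema-shaped JSON reply."""
    return json.dumps({"drug": "carvedilol", "condition": "hypertension"})

# Toy grounding lexicon mapping labels to made-up ontology identifiers.
LEXICON = {"carvedilol": "DRUG:0001", "hypertension": "COND:0001"}

def extract_instance(text: str) -> dict:
    reply = json.loads(llm(f"Fill the fields {list(SCHEMA)} from: {text}"))
    instance = {}
    for field, typ in SCHEMA.items():
        value = reply.get(field)
        if not isinstance(value, typ):
            raise ValueError(f"missing or ill-typed field: {field}")
        # Ground the free-text value to an ontology identifier when possible.
        instance[field] = {"label": value, "id": LEXICON.get(value)}
    return instance

result = extract_instance("One treatment for high blood pressure is carvedilol.")
```

The key idea is that the schema both shapes the prompt and validates the reply, so ungrounded or malformed output is caught before it reaches the knowledge base.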

Integrating LLMs with Existing Ontologies: A critical theme is aligning LLM-extracted knowledge with established ontologies or knowledge bases to improve consistency. Kim et al. (2024) demonstrate this in healthcare: they built a “patient” knowledge graph for a rare disease by having LLMs extract clinical facts in natural language, then mapping the outputs back to medical ontologies like MeSH and HPO (arxiv.org). This approach allowed them to capture nuances not explicitly covered by codified ontologies (using the LLM’s generalized language understanding) while still anchoring the results to standard vocabularies for interoperability (arxiv.org). Another study presents an ontology-grounded KG construction method that uses the Wikidata schema as guidance (arxiv.org). The authors combine LLM-driven competency question generation with ontology alignment to Wikidata, so that extracted relations are matched, wherever possible, to existing Wikidata properties (arxiv.org). The resulting KGs are “consistent, complete, and interoperable” with Wikidata, meaning they can be parsed as RDF and merged into larger knowledge bases (arxiv.org). In the legal domain, Li et al. (2024) introduce a Joint Knowledge-Enhanced Model (JKEM) that injects legal domain knowledge into an LLM via prefix-tuning (mdpi.com). By embedding law-ontology information as a prompt prefix (while freezing the LLM’s weights), they significantly improved the accuracy of extracting facts from Chinese legal documents (mdpi.com). The fine-tuned LLM then produced a legal knowledge graph (CLKG) with thousands of triples, showing the value of combining prior knowledge with language modeling in high-stakes domains.

Domain-Specific Applications

Healthcare & Biomedicine: Ontology mining with LLMs has seen active use in biomedical domains. Beyond the patient case study above, researchers have used LLMs to assist in curating disease ontologies and gene–disease relationships. For example, LLMs can extract medical entities from clinical notes and map them to codes (ICD, SNOMED) with minimal supervision (arxiv.org). One experiment found that a dynamically prompted GPT model could identify phenotype terms in patient text and correctly map them to the Human Phenotype Ontology (HPO) with high recall, whereas fine-tuning the LLM narrowly on the ontology terms sometimes hurt generalization (arxiv.org). In bioinformatics, the SPIRES zero-shot schema extraction (noted above) and similar LLM-based pipelines have been used to populate biochemical knowledge bases. LLMs have also been combined with structured biomedical knowledge for verification – e.g. using an LLM to read the literature and confirm whether a proposed protein–protein interaction is supported by experimental evidence (mdpi.com). These approaches leverage LLMs’ ability to interpret complex scientific text, while the ontologies (e.g. MeSH for diseases, GO for gene functions) provide a structure to slot the information into.

Legal and Regulatory: In the legal domain, accuracy and alignment with authoritative ontologies (like statutes or legal taxonomies) are paramount. The JKEM model for Chinese law is one example where an LLM was guided with legal knowledge to extract facts for a legal KG (mdpi.com). Other projects have explored LLMs summarizing and linking regulations or case law. For instance, an LLM might extract key entities (people, organizations, legal provisions) and relationships (e.g. cites, amends, violates) from court case texts to build a legal knowledge graph. Research in this area often involves a human in the loop to validate critical details, given the high cost of errors. Nonetheless, early case studies suggest LLMs can substantially speed up the assembly of legal knowledge graphs by drafting the skeleton of entities and relations, which experts then refine.

Science & Engineering: Ontology mining is also being applied to scientific literature to help researchers navigate complex, evolving knowledge. Oarga et al. (NeurIPS 2024 Workshop) demonstrated end-to-end ontology and KG generation from scientific papers using open-source LLMs (openreview.net). They showed that an LLM (without a human-defined schema) could reconstruct a known ontology of chemical elements and even propose a new ontology for the nascent field of single-atom catalysts (openreview.net). The automatically generated ontologies and KGs captured hierarchical relationships and facts from papers, suggesting LLMs can aid ontology creation in emerging scientific domains where no schema exists yet. Another ambitious effort in this space was the dual-agent POP approach by Hu et al. (2024) mentioned earlier, which, in a speech-therapy research scenario, extracted thousands of relevant interventions from the literature and built a structured knowledge base for practitioners (arxiv.org). These examples underscore that in domains like materials science, chemistry, or the social sciences, LLMs can help distill scattered knowledge into ontologies and graphs – accelerating discovery and literature review.

Industry Applications and Case Studies

Organizations are beginning to apply these techniques in real-world settings to manage large text corpora and enhance AI applications. A notable example is Microsoft’s GraphRAG approach, which integrates LLM-based KG construction into a retrieval-augmented generation system. In GraphRAG, an LLM first reads a collection of private documents and produces a knowledge graph of the key entities and relations (microsoft.com). That graph is then used at query time to augment the LLM’s context with relevant facts, dramatically improving question-answering accuracy on the documents (microsoft.com). Microsoft reported that GraphRAG yields more informed and grounded answers (fewer hallucinations) than standard RAG, especially on complex queries that require connecting information across multiple sources (microsoft.com). A case study on a news dataset (VIINA) showed that GraphRAG could discover indirect connections (e.g. identifying an entity like “Novorossiya” and linking it to related events) that a pure vector search missed, enabling the AI to answer questions with proper evidence (microsoft.com).
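At toy scale, the two GraphRAG phases look roughly like this (a conceptual sketch, not Microsoft's implementation; the triples and entity names are invented):

```python
from collections import defaultdict

graph = defaultdict(list)  # entity -> [(relation, neighbor)]

def index_triples(triples):
    """Indexing phase: store LLM-extracted (head, relation, tail) edges,
    plus reverse edges so the graph can be walked in either direction."""
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
        graph[tail].append((f"inverse of {rel}", head))

def build_context(entity: str, hops: int = 2) -> list[str]:
    """Query phase: collect facts within N hops of the question's entity,
    to be injected into the LLM's prompt instead of raw text chunks."""
    facts, frontier, seen = [], [entity], {entity}
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for rel, other in graph[node]:
                facts.append(f"{node} {rel} {other}")
                if other not in seen:
                    seen.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return facts

index_triples([("Novorossiya", "mentioned in", "Article 17"),
               ("Article 17", "reports on", "event X")])
context = build_context("Novorossiya")
```

The multi-hop walk is what lets graph retrieval surface indirect connections (Novorossiya → Article 17 → event X) that a single vector-similarity lookup would miss.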

Major tech companies highlight such integrations of KGs with LLMs as key to enterprise AI. NVIDIA, for instance, has discussed how LLM-driven knowledge graphs can enhance enterprise search and analytics (developer.nvidia.com). By transforming unstructured internal data into a structured graph, businesses can enable reasoning and complex querying that plain text search or FAQ bots cannot handle (developer.nvidia.com). This approach helps reduce LLM hallucinations by grounding responses in a factual graph, and it improves accuracy on multi-hop questions (developer.nvidia.com). NVIDIA’s technical blog notes that Microsoft’s GraphRAG demonstrated substantial gains in handling “narrative” private data, and that hybrid methods (combining vector search with graph-based retrieval) are emerging to tackle a variety of queries (developer.nvidia.com).

In industry settings, domain-specific knowledge graphs built with LLM assistance are becoming a valuable asset. For example, in healthcare, companies are interested in LLM-built KGs that map symptoms, diseases, treatments, and genes, enabling advanced clinical decision support (developer.nvidia.com). Finance and legal firms are experimenting with LLMs to parse regulatory texts and contracts into graph representations for easier compliance checking and Q&A. These case studies show that automated ontology mining isn’t just academic – it is beginning to power real-world applications, from intelligent document search to personalized recommendation systems, where organizing knowledge explicitly yields better results.

Open-Source Projects and Tools

The surge in research has been accompanied by open-source tools to perform LLM-based ontology extraction and KG construction:

  • OntoGPT (Monarch Initiative): An open-source Python package that uses LLMs plus ontology grounding to extract structured information from text (github.com). OntoGPT takes in raw text and a target ontology or schema type (for example, “drug”), and prompts an LLM to output instances of that ontology. It leverages existing biomedical ontologies behind the scenes – e.g. given “One treatment for high blood pressure is carvedilol.”, OntoGPT will recognize “carvedilol” as a Drug and output a structured object with the drug and the condition, grounded to identifiers in an ontology (github.com). It implements the SPIRES methodology and supports various LLM backends (OpenAI, Anthropic, etc.) (github.com). This tool has been used for biological knowledge base population and demonstrates how LLMs can serve as ontology-driven information extractors.

  • LLM-KG Construction Pipelines: The team at Fusion Jena released a repository for automatic KG construction with LLMs (github.com). Their code implements the pipeline of competency question generation → ontology building → KG population, tested with four different LLMs (GPT-4, GPT-3.5, and others) (github.com). The repo includes prompts, data, and evaluation results, allowing others to replicate or adapt the semi-automated ontology engineering process described in their papers. Similarly, researchers from UMBC (Padia et al.) have shared case-study code where an open LLM is used to suggest corrections to a knowledge graph for consistency (part of an AAAI-MAKE 2024 workshop) – showcasing how open models can be applied to refine KGs in an iterative loop.

  • Knowledge Graph Maker: Knowledge Graph Maker is an open-source library (available via pip) that simplifies text-to-graph conversion using LLMs. Notably, it allows the user to provide a custom ontology schema to constrain the LLM’s output (towardsdatascience.com). Rather than letting the LLM invent its own structure, the tool “coerces” the LLM to use the user-defined ontology when extracting triples (towardsdatascience.com). This is useful when a domain schema is known – e.g. one could supply an ontology of a personnel hierarchy and have the LLM extract all Person→Organization relationships from a document in that format. The project includes examples and a notebook for easy adoption.
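Schema-constrained extraction of this kind can be sketched as a simple filter over candidate triples (the names below are hypothetical and do not reflect Knowledge Graph Maker's actual API):

```python
# User-supplied ontology: allowed node types and edge types.
ONTOLOGY = {
    "node_types": {"Person", "Organization"},
    "edge_types": {"works_for", "manages"},
}

# Entity typing as a real pipeline might obtain it from NER or the LLM itself.
ENTITY_TYPES = {"Ada": "Person", "Grace": "Person", "Acme": "Organization"}

def conforms(triple) -> bool:
    """Keep a triple only if both endpoints and the relation fit the ontology."""
    subj, rel, obj = triple
    return (ENTITY_TYPES.get(subj) in ONTOLOGY["node_types"]
            and ENTITY_TYPES.get(obj) in ONTOLOGY["node_types"]
            and rel in ONTOLOGY["edge_types"])

raw_llm_triples = [
    ("Ada", "works_for", "Acme"),
    ("Ada", "likes", "coffee"),    # relation and object fall outside the ontology
    ("Grace", "manages", "Ada"),
]
kg = [t for t in raw_llm_triples if conforms(t)]
# only the two ontology-conformant triples survive
```

In practice the constraint is usually applied twice: the ontology is embedded in the prompt to steer generation, and a filter like this catches anything the LLM invents anyway.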

  • Neo4j LLM Graph Builder: Graph database vendor Neo4j has introduced an LLM Knowledge Graph Builder, an online tool that turns unstructured data (PDFs, web pages, etc.) into a Neo4j graph. It uses various LLMs (OpenAI GPT-4, Anthropic Claude, Llama 2, etc.) under the hood to parse content and generate a property graph of entities and relationships (neo4j.com). The user can either supply an existing ontology or let the system infer one, and the results can be visualized and queried in Neo4j’s interface. This is essentially a no-code/low-code solution that packages the techniques described in research for practitioners.

  • Other Notable Tools: There are many emerging projects in this space. For example, LLMGraph (github.com, by Dylan Hogg) lets you input a single seed entity and then uses an LLM to recursively fetch related entities and relations (an expanding knowledge graph of a topic). It can output the graph in formats like GraphML for analysis. This is useful for exploratory knowledge-graph creation from an LLM’s world knowledge (for instance, generating a mini knowledge graph of a historical figure and related events by interrogating ChatGPT). We’re also seeing LLM integrations in existing knowledge-base frameworks – for instance, Haystack and LlamaIndex have modules to incorporate KG-based querying, and LangChain provides patterns for “knowledge graph memory”. Many of these tools are open source and allow customization of prompts or schemas, lowering the barrier to experimenting with automated knowledge extraction.
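The seed-and-expand pattern behind such tools reduces to a breadth-first loop over LLM-suggested neighbors (a sketch with hypothetical function names; the canned `NEIGHBORS` table stands in for real LLM queries):

```python
# What an LLM might return when asked "which entities relate to X, and how?"
NEIGHBORS = {
    "Ada Lovelace": [("collaborated with", "Charles Babbage")],
    "Charles Babbage": [("designed", "Analytical Engine")],
    "Analytical Engine": [],
}

def ask_llm_for_neighbors(entity: str):
    """Stand-in for prompting an LLM about an entity's relations."""
    return NEIGHBORS.get(entity, [])

def expand(seed: str, max_depth: int = 2):
    """Breadth-first expansion from a seed entity, one LLM query per node."""
    edges, frontier, seen = [], [seed], {seed}
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for rel, other in ask_llm_for_neighbors(node):
                edges.append((node, rel, other))
                if other not in seen:
                    seen.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return edges

edges = expand("Ada Lovelace")
```

The `seen` set is what keeps the expansion from looping when the LLM suggests an entity it has already visited; the depth limit bounds the number of model calls.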

In summary, automated ontology mining with LLMs is a fast-evolving field. Academic research has demonstrated that LLMs can propose ontologies, extract structured triples, and even update knowledge bases with minimal human input. Domain-specific studies (in healthcare, legal, scientific domains) highlight both the promise and the need for careful alignment with existing expert knowledge. Industry adoption is underway, focusing on using LLM-generated knowledge graphs to improve search, QA, and decision support systems. Meanwhile, an ecosystem of open-source tools is making these capabilities accessible to practitioners. As LLM technology advances, we can expect even more robust methods for extracting structured knowledge, building knowledge graphs, and integrating LLMs with ontologies – moving closer to AI systems that can learn, organize, and reason with knowledge in a human-like yet formally structured way.

Sources:

  • Abolhasani & Pan (2024). Leveraging LLM for Automated Ontology Extraction and Knowledge Graph Generation (arxiv.org).

  • Shimizu & Hitzler (2024). Accelerating Knowledge Graph and Ontology Engineering with LLMs (arxiv.org).

  • Zhang & Soh (2024). Extract, Define, Canonicalize: An LLM-based Framework for KG Construction (aclanthology.org).

  • Kommineni et al. (2024). LLM-supported approach to ontology and KG construction (arxiv.org, github.com).

  • Hu et al. (2024). LLM-Duo with Progressive Ontology Prompting for Scientific Literature (arxiv.org).

  • Caufield et al. (2024). SPIRES: Zero-shot Schema Extraction with LLMs (arxiv.org, academic.oup.com).

  • Ivanisenko et al. (2024). Hybrid GNN-LLM for Extending Biomedical KGs (mdpi.com).

  • Kim et al. (2024). Structured Extraction of Real World Medical Knowledge using LLMs (arxiv.org).

  • Li et al. (2024). Legal Knowledge Graph Construction via Knowledge-Enhanced LLM (mdpi.com).

  • Oarga et al. (2024). Scientific KG and Ontology Generation using Open LLMs (openreview.net).

  • Microsoft Research Blog (2024). GraphRAG: LLMs + Knowledge Graphs for Private Data (microsoft.com).

  • NVIDIA Technical Blog (2024). LLM-Driven Knowledge Graphs in Enterprises (developer.nvidia.com).

  • Monarch Initiative (2023). OntoGPT – LLM-based Ontology Extraction Tool (github.com).

  • Fusion Jena (2024). Automatic KG Construction with LLM – Code Repository (github.com).

  • Rahul Nair (2023). Graph Maker library (towardsdatascience.com).

  • Neo4j (2023). LLM Knowledge Graph Builder (neo4j.com).
