Ontology Discovery¶
How to bootstrap an ontology from a corpus and how to propose schema additions as new documents arrive — the discovery side of the ontology lifecycle. The complement to Ontology Evolution, which covers safely changing an ontology that already exists.
If you've never built an ontology by hand and you don't know where to start, start here.
The Mental Model¶
There are four places an ontology can come from:
- The built-in default. If you construct
GraphRAGwithout anontology=argument and no ontology has been previously persisted to the graph, the SDK seeds the ontology graph withDEFAULT_ENTITY_TYPES(Person,Organization,Technology,Product,Location,Date,Event,Concept,Law,Dataset,Method) so extraction has something to anchor on. - Hand-authored. You write
Ontology(entities=[...], relations=[...])because you already know what your knowledge graph should look like. The fastest path when the domain is well understood. - Discovered from a corpus. You point the SDK at some documents and ask it to draft one. Use when you're exploring an unfamiliar corpus, or when manually enumerating types upfront is brittle.
- Discovered and then evolved. The realistic workflow: draft once with discovery, ingest with that draft, then as new documents arrive use
suggest_schema_extensionsto propose additions you review and apply via the evolution API.
This page is about (3) and (4). The same data model (Ontology) is used end-to-end — you can mix and match.
The API at a Glance¶
Two cooperating entry points. Both return artifacts you inspect before any data is touched.
Bootstrap — Ontology.from_sources¶
A classmethod on the Ontology data model. Pure function, no DB connection needed. Two discovery algorithms, selected via the method= argument.
method="llm" (default) — LLM-driven schema invention¶
draft = await Ontology.from_sources(
sources, # str or list[str], same as ingest()
llm, # any LLMInterface
method="llm", # explicit, but this is the default
boundaries="biotech papers about CRISPR", # free-text scope hint
existing=None, # optional structured prior
sample_chunks_per_doc=3,
max_retries=3,
concurrency=4,
)
Behavior:
- For each source: load → chunk → sample
sample_chunks_per_docchunks → run a per-document summary call (anchors the per-chunk prompts) → run a per-chunk proposal call (constrained Pydantic output, validated with retry) → merge in-document. - Across the corpus: merge all per-document drafts.
- Single normalization call: collapse synonyms (
Org+Organization→ one label), fix obvious direction reversals, drop entity types whose only role was to be a property of another type. - When
existingis supplied, its labels are passed into every prompt as a soft controlled vocabulary, and the returned draft is merged withexistingon the way out.
Returns: a new Ontology.
method="grounded" — catalog lookup, zero LLM calls¶
Adapts barakb/text-to-rdf's "find entities, look up their schemas" technique to schema discovery. The corpus tells you which types to include; the catalog tells you what their schemas are.
from graphrag_sdk.discovery.catalog import DBpediaCatalog
draft = await Ontology.from_sources(
sources,
method="grounded",
catalog=DBpediaCatalog(), # required for method="grounded"
sample_chunks_per_doc=3, # used for NER, not LLM
concurrency=4,
)
Behavior:
- For each source: load → chunk → sample → run NER on each sampled chunk with a small fixed anchor label list (
["person", "organization", "location", "event"]) to find entity mention strings. These anchor labels never appear in the output ontology — they only help local NER spot proper nouns. - For each unique mention name across the corpus, call
catalog.link_entity(name)— the catalog answers "what types is this entity?" (forDBpediaCatalogthat's a SPARQL query to DBpedia). - Aggregate the union of types returned across all linked entities.
- Ask the catalog for each detected type's schema definition (
Entitywith attributes + canonical URI). - Ask the catalog for every relation whose source and target are both in the detected set (or in
existing— bridge relations between newly-detected and pre-existing types are surfaced). - Merge with
existingif supplied.
Zero LLM calls when llm is not supplied. No hallucination, no drift — the schema is entirely catalog-derived; the corpus only chooses the subset. At the cost of being limited to what the catalog knows — domain-specific types that aren't in the catalog won't appear.
The default NER backend is GLiNERExtractor() (no API calls, local model). Override with entity_extractor= to plug in LLMExtractor or any custom EntityExtractor.
Returns: a new Ontology.
Optional: per-type property trim with an LLM¶
The catalog hands back every property Schema.org (or whatever catalog) declares for a type. For Person that means ~28 properties — including obscure ones like duns, vatID, callSign. To tailor the schema to your corpus, pass llm:
draft = await Ontology.from_sources(
sources,
llm=llm, # triggers the per-type trim
method="grounded",
catalog=DBpediaCatalog(),
)
With llm provided, after type detection the pipeline runs one LLM call per detected type: it shows the LLM the catalog's full property list plus up to 5 chunks where that type was detected, and asks "which of these properties are stated or implied in any chunk?" The result trims each entity to a corpus-specific subset.
Cost: ~1 LLM call per type, not per chunk — much cheaper than method="llm". Output is deterministic for label invention (still purely catalog) but stochastic for which properties survive (the LLM decides). On hard failure the per-type trim soft-fails to the catalog's full list — a flaky LLM never silently loses schema information.
Catalogs¶
A Catalog is the source of truth for ontology vocabulary. Concrete catalogs that ship today:
| Class | Per-entity source | Per-type source |
|---|---|---|
DBpediaCatalog |
DBpedia SPARQL (https://dbpedia.org/sparql) |
Schema.org JSON-LD (live) |
DBpediaCatalog is fully live — no bundled data. The work splits across two services:
link_entity(name)— given an entity mention NER found in a chunk (e.g."Albert Einstein"), SPARQL DBpedia for entities whoserdfs:labelmatches, filtered tohttp://dbpedia.org/ontology/types. Returns the local names (e.g.["Person", "Scientist"]). Results are cached in-process for the lifetime of the catalog instance — repeated mentions across chunks only get one SPARQL call.lookup(type)/relations_among(types)— first call downloads Schema.org's full JSON-LD vocabulary, processes it (appliesrdfs:subClassOfinheritance so e.g.ArticlecarriesCreativeWork's attributes), and caches the processed result under$XDG_CACHE_HOME/graphrag-sdk/schema_org.json(typically~/.cache/graphrag-sdk/schema_org.json). Cache TTL defaults to 30 days; passcache_ttl_days=Noneto disable expiry or0to force re-fetch every construction.
There is no offline fallback. If DBpedia or Schema.org is unreachable and the Schema.org cache is missing/stale, DBpediaFetchError is raised. Pre-warm the cache by running once with connectivity, or supply a pre-populated cache_path.
No hardcoded type list: types come from real entities found in your corpus, mapped through DBpedia's ontology, then defined by Schema.org. The corpus drives which types appear in the output.
The Catalog ABC has three abstract methods — link_entity(name) / lookup(type_name) / relations_among(type_names) — so adding a WikidataCatalog, a SPARQL-store catalog, or a domain-specific catalog is a focused subclass. The pipeline runs NER (default: GLiNERExtractor) to find mention strings, then asks the catalog the rest.
Choosing between method="llm" and method="grounded"¶
method="llm" |
method="grounded" |
|
|---|---|---|
| LLM cost | Yes (~D × (S + 1) + 1 calls per corpus) |
None |
| Catalog needed | No | Yes |
| Determinism | Stochastic (LLM) | Fully deterministic for a given corpus + catalog |
| Coverage | Whatever the LLM extracts from the text | Whatever the catalog defines for detected types |
| Domain-specific types | Yes (LLM can invent them) | Only if the catalog has them |
| Best for | Unfamiliar / domain-specific corpora | General web / news / biographies where Schema.org fits |
| Composition | Run method="grounded" first to anchor on the catalog, then a second method="llm" pass with existing=draft to add domain-specific types |
Same composition pattern, from either direction |
Live-graph delta — GraphRAG.suggest_schema_extensions¶
A method on GraphRAG. Operates against the currently committed ontology.
proposal = await rag.suggest_schema_extensions(
sources, # new docs that motivate additions
boundaries=...,
sample_chunks_per_doc=3,
max_retries=3,
concurrency=4,
)
print(proposal.summary())
# SchemaExtensionProposal(entities=+2, relations=+1, patterns=+0, attributes=+3, ...)
Internally runs the same discovery pipeline with the committed ontology as the prior, then diffs against the committed ontology and surfaces only the additions. Returns a SchemaExtensionProposal. Nothing is applied to the graph — you review and apply.
Apply — existing evolution API¶
You hand the accepted parts of the proposal to the v1.2.x mutation API:
for entity in proposal.new_entities:
await rag.add_entity(entity)
for relation in proposal.new_relations:
if not relation.patterns:
# Open-mode relations (no patterns) can't be applied via
# add_relation_pattern in v1 — see "What's Not in the API" below.
continue
for src, tgt in relation.patterns:
await rag.add_relation_pattern(relation.label, src, tgt)
# Preserve the proposed description once the relation has been
# declared via its first add_relation_pattern call.
if relation.description:
await rag.set_relation_description(relation.label, relation.description)
for rel_label, src, tgt in proposal.new_patterns:
await rag.add_relation_pattern(rel_label, src, tgt)
for owner, attribute in proposal.new_attributes:
await rag.add_attribute(owner, attribute) # atomic, LLM-backfilled
See Ontology Evolution for the invariants add_attribute enforces. Discovery never commits anything itself — the existing evolution machinery stays the single commit surface.
On new_attributes for relation owners: the v1 mutation API does not yet apply attributes to relations (see "What's Not in the API" in Ontology Evolution — add_attribute raises NotImplementedError for relation owners). Proposals may include them so you see what discovery found, but applying them requires either waiting for the relation-attribute path to land or hand-managing it via delete_all() + re-ingest() with the updated schema. The diff still surfaces them so the proposal is honest about what discovery saw.
How the LLM is Kept Honest¶
Open-vocabulary discovery is brittle by default. The pipeline applies three disciplines that make it tractable:
1. Structured output with validation-retry feedback. Every LLM call inside the discovery pipeline goes through a wrapper that parses the response into a Pydantic model and runs a semantic validator (relation patterns reference declared entities, attribute types are in the allowed set, no system-key shadowing — except for name, see below). On failure, the specific errors are sent back to the LLM as a new user message and the call is retried. The conversation history is preserved across retries so the model sees its own rejected output and the precise correction it needs to make.
2. Anchoring with a per-document summary. Before any per-chunk proposal is made, a one-shot per-document call extracts the document's central entities and a one-sentence "aboutness". That summary is prefixed onto every per-chunk prompt for the same document. Chunks share the doc's frame instead of each independently guessing what the document is about.
3. Normalization as a separate pass. Per-chunk proposals are merged naively (label-based union) and then a single cross-draft LLM call canonicalizes labels and direction. Without this, drafts accumulate synonyms across calls (Org + Organization + Company as three types) and the schema drifts. The normalization pass is also where the existing prior, if supplied, becomes a hard preference: "prefer these labels."
The wrapper is strict — on exhausted retries it raises OntologyDiscoveryError. The pipeline above it is soft-fail: a single bad chunk is logged and skipped, a failed summary degrades the doc to weaker anchoring, a failed normalization returns the un-normalized draft. Net effect: one noisy chunk or one rate-limited call does not kill the whole draft.
The name Attribute¶
Every entity type in a discovered ontology carries a name: STRING attribute. This is intentional — every node in the graph has a name (e.g. "Alice Liddell" for a Person, "Acme Corporation" for an Organization), and the schema honestly reflects that.
The catch: the SDK fills name automatically during extraction (from the entity span detected by GLiNER and verified in the LLM step). The LLM is never asked to extract name as a per-entity attribute — that would cause the value to be written twice, with possibly conflicting versions.
Practical consequences:
- In
ontology.jsonfiles, every entity hasnamelisted inproperties. This is documentation, not a knob — you don't need to do anything to populate it. - If discovery doesn't propose
nameon some entity type (small models sometimes omit it), the pipeline adds it automatically before returning. The schema is consistent regardless of model behavior. suggest_schema_extensionswill never proposenameas a new attribute. It's conceptually always present on every entity, so adding it is a no-op.- You should not declare
namewith a non-STRING type — the SDK writes a string. Discovery rejects non-STRINGnamedeclarations like any other invalid type.
Other reserved-system names (id, description, source_chunk_ids, spans, rel_type, fact, src_name, tgt_name, label) are not schema-visible — they are internal SDK plumbing and the discovery validator still rejects them in proposals.
When to Reach for Which¶
| Situation | Use |
|---|---|
| Brand-new project, no ontology, unfamiliar / domain-specific corpus | Ontology.from_sources(sources, llm, method="llm") |
| General web content / news / biographies — Schema.org vocabulary fits | Ontology.from_sources(sources, method="grounded", catalog=DBpediaCatalog()) |
| You have a draft and want to refresh it against more docs | Ontology.from_sources(new_sources, llm, existing=current_draft) |
| You've ingested with a committed ontology and new docs are coming in | rag.suggest_schema_extensions(new_sources) → review → apply with the mutation API |
| You already know the schema | Hand-author Ontology(...) — discovery only buys overhead |
| The corpus has hard structural constraints (e.g. a known taxonomy) | Hand-author the core; use suggest_schema_extensions to find what you missed |
End-to-End Example¶
import asyncio
from graphrag_sdk import (
ConnectionConfig, GraphRAG, LiteLLM, LiteLLMEmbedder, Ontology,
)
async def main():
llm = LiteLLM(model="gpt-4o-mini")
# 1. Bootstrap a draft from a small corpus.
draft = await Ontology.from_sources(
["docs/intro.md", "docs/history.pdf"],
llm,
boundaries="company history and biographies",
sample_chunks_per_doc=3,
)
draft.save_to_file("ontology.json")
# Review ontology.json by hand at this point — edit / curate / commit.
# 2. Ingest with the curated draft.
rag = GraphRAG(
connection=ConnectionConfig(host="localhost", graph_name="discover_demo"),
llm=llm,
embedder=LiteLLMEmbedder(model="text-embedding-3-large"),
embedding_dimension=256,
ontology=Ontology.from_file("ontology.json"),
)
async with rag:
await rag.ingest(source=["docs/intro.md", "docs/history.pdf"])
# 3. Later — a new document shows up.
proposal = await rag.suggest_schema_extensions(
"docs/acquisition_news.md",
boundaries="company history and biographies",
)
# 4. Apply the additions you accept.
for entity in proposal.new_entities:
await rag.add_entity(entity)
for owner, attribute in proposal.new_attributes:
await rag.add_attribute(owner, attribute)
asyncio.run(main())
A more comprehensive walkthrough lives in examples/10_ontology_discovery.py.
SchemaExtensionProposal Reference¶
Returned by suggest_schema_extensions. Carries additions only — never modifications or deletions of existing schema.
| Field | Type | Description |
|---|---|---|
new_entities |
list[Entity] |
Entity types not in the committed ontology |
new_relations |
list[Relation] |
Relation types not in the committed ontology |
new_patterns |
list[tuple[str, str, str]] |
Additional (rel_label, src, tgt) patterns for relation types that already exist |
new_attributes |
list[tuple[str, Attribute]] |
Additional (owner_label, attribute) pairs for entity types that already exist |
sources_scanned |
list[str] |
Source identifiers the proposal was derived from |
is_empty |
bool (property) |
True when the proposal has no additions to apply |
summary() |
-> str |
One-line summary for logs / CLI output |
sources_scanned is coarse-grained evidence — it tells you which inputs informed the proposal, not which input motivated which specific addition. Per-item evidence (proposal-id → chunk-ids) is a planned upgrade that requires plumbing chunk identifiers through the discovery pipeline; not in v1.
OntologyDiscoveryError Reference¶
Raised by the validation-retry wrapper inside the discovery pipeline when an individual LLM call exhausts its retry budget. The pipeline itself is soft-fail and catches these to keep going, so most users never see this exception. If you call extract_with_retry directly (from graphrag_sdk.discovery) you will.
| Field | Type | Description |
|---|---|---|
chunk_id |
str \| None |
Identifier of the unit being processed (chunk uid, "summary:<source>", or "normalize") |
attempts |
int |
How many LLM calls were made before giving up |
last_error |
Exception \| None |
The last validation/parse error encountered |
Subclasses RuntimeError.
Cost & Scale¶
Per call to Ontology.from_sources(sources) with D documents and S = sample_chunks_per_doc:
| Step | LLM calls |
|---|---|
| Per-document summary | D |
| Per-chunk proposal | D × S |
| Normalization | 1 |
| Total | D × (S + 1) + 1 |
Each call is bounded by max_retries + 1. Concurrency is capped by the concurrency parameter (default 4). For a 10-document corpus with sample_chunks_per_doc=3 and no retries, that's 41 LLM calls — fast enough to run interactively, cheap on gpt-4o-mini-class models.
suggest_schema_extensions has the same shape: it runs the full discovery pipeline on the new sources plus the diff at the end (zero LLM cost). The diff against the committed ontology is purely structural.
What's Not in the API (and Why)¶
| Removed / not added | Why |
|---|---|
Auto-application of SchemaExtensionProposal |
The evolution API enforces the data/ontology consistency invariant; auto-applying would re-introduce the drift the invariant is designed to prevent. Always proposal → review → mutation API. |
| Per-item evidence (proposal-id → chunk-ids) | Requires threading chunk identifiers through the pipeline. Coarse sources_scanned is the v1 compromise — enough to know what informed a proposal, not enough to spot-check individual claims. Planned upgrade. |
| Closed controlled vocabulary | The SDK is domain-agnostic. The dynamic equivalent — the existing ontology when supplied — gives you the same benefit (constrains the LLM, prevents drift) without locking you to a specific external ontology. |
| Relation-attribute discovery | Mirrors the v1 limitation on add_attribute for relation owners. Will lift when the evolution API does. |
See Also¶
- Ontology Evolution — what to do with the schema once you've discovered it
- Graph Schema — the structural reference for
Ontology,Entity,Relation,Attribute