Graph Schema: How the Knowledge Graph is Structured
When you ingest documents into GraphRAG SDK, the system builds a property graph in FalkorDB. This document explains exactly what that graph looks like — what node types exist, what edges connect them, what properties they carry, and what indexes make it searchable.
Understanding the graph structure helps you write custom Cypher queries, debug ingestion quality, and tune retrieval.
The Big Picture
┌──────────┐                  ┌──────────┐
│ Document │──PART_OF────────>│  Chunk   │
│          │                  │ (embed.) │
└──────────┘                  └─────┬────┘
                                    │
                                NEXT_CHUNK
                                    │
                              ┌─────v────┐
                              │  Chunk   │
                              │ (embed.) │
                              └──────────┘
                                    ^
                               MENTIONED_IN
                                    │
┌──────────┐                  ┌─────┴────┐
│  Person  │──RELATES────────>│   Org    │
│__Entity__│                  │__Entity__│
│ (embed.) │<──RELATES────────│ (embed.) │
└──────────┘                  └──────────┘
The graph has two layers:
1. Lexical layer: Document and Chunk nodes with provenance edges (the text you ingested)
2. Knowledge layer: Entity nodes with relationship edges (the structured knowledge extracted from that text)
The layers are connected by MENTIONED_IN edges, which link entities to the chunks where they were found.
Node Labels
Document
Purpose: Represents a source document that was ingested.
| Property | Type | Description |
|---|---|---|
| id | string | Unique ID (auto-generated UUID) |
| path | string | Original file path or source identifier |
| Any metadata | varies | Custom metadata passed via DocumentInfo |
Created in: Step 3 (Lexical Graph) of the ingestion pipeline.
Chunk
Purpose: A text fragment from a document. Chunks are the atomic unit of retrieval: when the system finds relevant information, it returns chunks.
| Property | Type | Description |
|---|---|---|
| id | string | Unique ID (auto-generated UUID) |
| text | string | The chunk's text content |
| index | integer | Position within the document (0-based) |
| embedding | vecf32 | Vector embedding for semantic search (added in Step 9) |
| start_char | integer | Start character offset in the source document (if FixedSizeChunking) |
| end_char | integer | End character offset |
| chunk_size | integer | Configured chunk size |
| chunk_overlap | integer | Configured overlap |
Created in: Step 3 (Lexical Graph). Embedding added in Step 9.
Entity Nodes (Person, Organization, etc.)
Purpose: Extracted knowledge entities. Each entity node has two labels: its domain type (e.g., Person) and __Entity__ (a secondary label shared by all extracted entities).
| Property | Type | Description |
|---|---|---|
| id | string | Deterministic: "name__type" (e.g., "alice__person") |
| name | string | Entity name as extracted |
| description | string | Rich description from LLM extraction |
| source_chunk_ids | list[string] | Chunks where this entity was found |
| spans | string (JSON) | Character offsets: {chunk_id: [{start, end}]} |
| embedding | vecf32 | Vector embedding of the entity name (added during finalize()) |
Created in: Step 7 (Write). __Entity__ label is automatically added. Embedding is backfilled during finalize().
Default entity types (11): Person, Organization, Technology, Product, Location, Date, Event, Concept, Law, Dataset, Method
Entities below the NER confidence threshold get the special label Unknown.
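The deterministic id format described above can be sketched as follows. The helper name sketch_entity_id and the exact normalization (trimming and lowercasing) are assumptions made for illustration; the SDK's own logic lives in compute_entity_id() and may normalize names differently.

```python
def sketch_entity_id(name: str, entity_type: str) -> str:
    """Build a "name__type" id.

    Hypothetical helper, not the SDK's compute_entity_id().
    """
    # Assumption: both parts are trimmed and lowercased before joining.
    return f"{name.strip().lower()}__{entity_type.strip().lower()}"


print(sketch_entity_id("Alice", "Person"))  # alice__person
```

Because the id depends only on the name and type, the same entity extracted from two different chunks resolves to one node instead of two.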
Edge Types
PART_OF (Document -> Chunk)
Purpose: Provenance. Tracks which document a chunk came from.
| Property | Type | Description |
|---|---|---|
| index | integer | Chunk position within the document |
NEXT_CHUNK (Chunk -> Chunk)
Purpose: Sequential ordering — preserves the reading order of chunks within a document. Used to fetch neighboring chunks for context expansion.
No additional properties.
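As a rough illustration of context expansion, the sketch below builds a Cypher query that returns a chunk together with its previous and next neighbors. The function name neighbor_chunks_query is hypothetical, and the sketch assumes chunk ids are stored as canonical hyphenated UUID strings; it is not the SDK's retrieval code.

```python
import uuid


def neighbor_chunks_query(chunk_id: str) -> str:
    """Return Cypher that fetches a chunk plus its previous and next chunk."""
    # Assumption: chunk ids are canonical hyphenated UUID strings.
    # uuid.UUID() raises ValueError for anything that is not a UUID, which
    # also keeps arbitrary text out of the query string.
    canonical = str(uuid.UUID(chunk_id))
    return (
        f"MATCH (c:Chunk {{id: '{canonical}'}}) "
        "OPTIONAL MATCH (prev:Chunk)-[:NEXT_CHUNK]->(c) "
        "OPTIONAL MATCH (c)-[:NEXT_CHUNK]->(nxt:Chunk) "
        "RETURN prev.text, c.text, nxt.text"
    )


print(neighbor_chunks_query("123e4567-e89b-12d3-a456-426614174000"))
```

OPTIONAL MATCH is used so that the first and last chunk of a document, which lack one neighbor, still produce a row.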
MENTIONED_IN (Entity -> Chunk)
Purpose: Co-occurrence — links entities to the chunks where they were extracted. This is a critical edge for retrieval: when you find an entity, you can traverse to its source chunks.
No additional properties. Deduplicated by (entity_id, chunk_id).
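Deduplication by (entity_id, chunk_id) can be pictured with a few lines of plain Python. This is only a sketch of the idea with a hypothetical helper name; how the SDK actually deduplicates (for example, inside the database write) is not shown in this document.

```python
from typing import Iterable, List, Tuple

Mention = Tuple[str, str]  # (entity_id, chunk_id)


def dedupe_mentions(mentions: Iterable[Mention]) -> List[Mention]:
    """Keep the first occurrence of each (entity_id, chunk_id) pair, in order."""
    # dict.fromkeys preserves insertion order and drops repeated keys.
    return list(dict.fromkeys(mentions))


raw = [
    ("alice__person", "chunk-1"),
    ("alice__person", "chunk-1"),  # same entity found twice in one chunk
    ("alice__person", "chunk-2"),
]
print(dedupe_mentions(raw))
```

An entity mentioned several times in one chunk therefore yields a single MENTIONED_IN edge to that chunk.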
RELATES (Entity -> Entity)
Purpose: All extracted relationships between entities. This is the only relationship type used for knowledge edges.
| Property | Type | Description |
|---|---|---|
| rel_type | string | Original relationship type (e.g., "WORKS_AT", "LOCATED_IN") |
| fact | string | Human-readable fact: "(Alice, WORKS_AT, Acme Corp): Alice is a senior engineer" |
| description | string | Relationship description from LLM |
| keywords | string | Comma-separated terms for fulltext search |
| weight | float | Confidence: 1.0 = explicit, 0.5 = implied, 0.2 = weak inference |
| src_name | string | Source entity name |
| tgt_name | string | Target entity name |
| source_chunk_ids | list[string] | Chunks containing evidence for this relationship |
| spans | string (JSON) | Character offsets of the evidence sentence |
| embedding | vecf32 | Vector embedding of the fact text (added during finalize()) |
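To make the fact format concrete, here is a minimal sketch that assembles the string from the other edge properties. The function name format_fact is hypothetical, and the layout is inferred from the single example in the property list, so treat it as an approximation of what the SDK writes.

```python
def format_fact(src_name: str, rel_type: str, tgt_name: str, description: str) -> str:
    """Combine the triple and its description into one human-readable string."""
    # Layout inferred from the documented example: "(src, TYPE, tgt): description"
    return f"({src_name}, {rel_type}, {tgt_name}): {description}"


print(format_fact("Alice", "WORKS_AT", "Acme Corp", "Alice is a senior engineer"))
# (Alice, WORKS_AT, Acme Corp): Alice is a senior engineer
```

Since the edge embedding is computed from this text, the entity names and the relationship type all contribute to semantic search over relationships.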
Why a Single RELATES Edge Type?
You might expect separate edge types like WORKS_AT, LOCATED_IN, and MARRIED_TO. Instead, all relationships use the single RELATES type with the original type stored in the rel_type property. Here's why:
- Index efficiency. Each edge type in FalkorDB needs its own vector index. With potentially hundreds of LLM-generated relationship types, you'd need hundreds of indexes. One RELATES type means one vector index that covers all relationships.
- Consistent retrieval. The retrieval system searches all relationships at once via the RELATES edge vector index. Having a single type means one query covers everything.
- No information loss. The original type is preserved in rel_type and appears in the fact string, so you can still filter by type in custom Cypher queries, for example with WHERE r.rel_type = 'WORKS_AT'.
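A filter on rel_type might be built like this. The helper relates_by_type_query is hypothetical; it only produces a Cypher string, which could then be passed to query_raw as in the Raw Cypher Queries section. Restricting rel_type to upper-case identifiers is an assumption of this sketch, made so the value can be inlined safely.

```python
import re

# Assumption: relationship types look like WORKS_AT or LOCATED_IN.
_REL_TYPE = re.compile(r"[A-Z][A-Z0-9_]*")


def relates_by_type_query(rel_type: str, limit: int = 10) -> str:
    """Return Cypher that lists RELATES edges with the given original type."""
    if not _REL_TYPE.fullmatch(rel_type):
        raise ValueError(f"unexpected relationship type: {rel_type!r}")
    return (
        "MATCH (a:__Entity__)-[r:RELATES]->(b:__Entity__) "
        f"WHERE r.rel_type = '{rel_type}' "
        f"RETURN a.name, r.fact, b.name LIMIT {int(limit)}"
    )


print(relates_by_type_query("WORKS_AT"))
# Usage, given an initialized `rag` object:
#   result = await rag.graph_store.query_raw(relates_by_type_query("WORKS_AT"))
```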
Indexes
The SDK creates 5 standard indexes during finalize() (or ensure_indices()). All are idempotent, so they are safe to create repeatedly.
Vector Indexes (3)
| Target | Property | Purpose |
|---|---|---|
| Chunk nodes | embedding | Semantic search over text passages |
| __Entity__ nodes | embedding | Semantic search over entity names |
| RELATES edges | embedding | Semantic search over relationship facts |
Syntax:
CREATE VECTOR INDEX FOR (n:Chunk) ON (n.embedding)
OPTIONS {dimension:256, similarityFunction:'cosine'}
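The three vector indexes differ only in their target pattern, which the sketch below makes explicit. The helper vector_index_statement is hypothetical, the edge pattern follows FalkorDB's relationship-index form as best understood here (check it against your FalkorDB version), and 256 is simply the dimension used in the example above; it has to match your embedding model.

```python
def vector_index_statement(kind: str, label: str, dimension: int = 256) -> str:
    """Return a CREATE VECTOR INDEX statement for a node label or edge type."""
    if kind == "node":
        pattern = f"(x:{label})"
    elif kind == "edge":
        # Assumed relationship-index form; verify against your FalkorDB version.
        pattern = f"()-[x:{label}]->()"
    else:
        raise ValueError("kind must be 'node' or 'edge'")
    return (
        f"CREATE VECTOR INDEX FOR {pattern} ON (x.embedding) "
        f"OPTIONS {{dimension:{dimension}, similarityFunction:'cosine'}}"
    )


for kind, label in [("node", "Chunk"), ("node", "__Entity__"), ("edge", "RELATES")]:
    print(vector_index_statement(kind, label))
```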
Fulltext Indexes (2)
| Target | Properties | Purpose |
|---|---|---|
| Chunk nodes | text | Keyword search over text passages |
| __Entity__ nodes | name, description | Keyword search over entity names and descriptions |
Syntax:
CALL db.idx.fulltext.createNodeIndex('Chunk', 'text')
CALL db.idx.fulltext.createNodeIndex('__Entity__', 'name', 'description')
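Once the fulltext index exists, a keyword lookup over chunks could look roughly like the following. The helper fulltext_chunk_query is hypothetical, and the use of db.idx.fulltext.queryNodes with YIELD node, score reflects FalkorDB's fulltext procedure as understood here rather than anything the SDK exposes directly. The sketch accepts only word characters and spaces, so no query-syntax escaping is needed.

```python
import re


def fulltext_chunk_query(terms: str, limit: int = 5) -> str:
    """Return Cypher for a keyword search over Chunk.text."""
    # Keep the sketch simple: reject quotes and search-operator characters.
    if not re.fullmatch(r"[\w ]+", terms):
        raise ValueError("terms may contain only letters, digits, '_' and spaces")
    return (
        f"CALL db.idx.fulltext.queryNodes('Chunk', '{terms}') "
        "YIELD node, score "
        f"RETURN node.id, node.text, score ORDER BY score DESC LIMIT {int(limit)}"
    )


print(fulltext_chunk_query("senior engineer"))
```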
The Provenance Chain
The graph structure ensures complete traceability from any answer back to its source:
Answer
↑ (generated by LLM from context)
Chunk text passages
↑ MENTIONED_IN (entity → chunk where it was found)
Entity relationships (RELATES)
↑ extracted from chunks
Chunk nodes
↑ PART_OF (document → chunk)
Document node
↑ (original source file)
Your Document
This is the Zero-Loss Data principle: every piece of source material is traceable in the graph. When the retrieval system provides context to the LLM, it can always point back to which document and which chunk the information came from.
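The chain can be walked in a single Cypher query: from an entity, through MENTIONED_IN to its chunks, and back along PART_OF to the source document. The query below is a sketch that uses only the labels, edges, and properties described on this page; the entity name 'Alice' is a placeholder.

```python
# Entity -> chunks where it was found -> documents those chunks came from.
# PART_OF points from Document to Chunk, so it is traversed against its direction.
PROVENANCE_QUERY = (
    "MATCH (e:__Entity__ {name: 'Alice'})-[:MENTIONED_IN]->(c:Chunk)"
    "<-[:PART_OF]-(d:Document) "
    "RETURN d.path, c.index, c.text "
    "ORDER BY d.path, c.index"
)

print(PROVENANCE_QUERY)
# Usage, given an initialized `rag` object:
#   result = await rag.graph_store.query_raw(PROVENANCE_QUERY)
```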
Defining Your Own Schema
A GraphSchema tells the extraction pipeline which entity and relationship types to look for, and the pruning step uses it to filter non-conforming data.
Basic Schema
from graphrag_sdk import GraphSchema, EntityType, RelationType

schema = GraphSchema(
    entities=[
        EntityType(label="Person", description="A human being"),
        EntityType(label="Organization", description="A company or institution"),
    ],
    relations=[
        RelationType(label="WORKS_AT", description="Employment relationship"),
    ],
)
Schema with Patterns
Patterns define which source-target pairs are valid for each relationship type. They are specified directly on RelationType:
from graphrag_sdk import GraphSchema, EntityType, RelationType

# Each pattern is a (source label, target label) pair.
schema = GraphSchema(
    entities=[
        EntityType(label="Person"),
        EntityType(label="Organization"),
        EntityType(label="Location"),
    ],
    relations=[
        RelationType(label="WORKS_AT", patterns=[("Person", "Organization")]),
        RelationType(label="LOCATED_IN", patterns=[("Organization", "Location")]),
    ],
)
A relationship with an empty patterns list is allowed between any entity types.
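The pattern rule can be expressed as a small predicate. This is a sketch of the rule as described here, not the SDK's pruning code: the function name is_allowed and the plain-dict representation are hypothetical, and rejecting relationship labels that the schema does not declare is an assumption about how non-conforming data is filtered.

```python
from typing import Dict, List, Tuple

# Relationship label -> allowed (source label, target label) pairs.
Patterns = Dict[str, List[Tuple[str, str]]]


def is_allowed(patterns: Patterns, rel_label: str, src_type: str, tgt_type: str) -> bool:
    """Check one extracted relationship against the schema's patterns."""
    if rel_label not in patterns:
        return False  # assumption: undeclared relationship types are pruned
    allowed = patterns[rel_label]
    # An empty pattern list means any source/target pair is acceptable.
    return not allowed or (src_type, tgt_type) in allowed


patterns: Patterns = {
    "WORKS_AT": [("Person", "Organization")],
    "LOCATED_IN": [("Organization", "Location")],
    "KNOWS": [],  # no patterns: allowed between any entity types
}
print(is_allowed(patterns, "WORKS_AT", "Person", "Organization"))  # True
print(is_allowed(patterns, "WORKS_AT", "Organization", "Person"))  # False
```

Patterns are directional, so ("Person", "Organization") does not permit the reversed pair.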
Open Schema Mode
If you create an empty schema (GraphSchema()), the pipeline operates in open schema mode:
- The LLM extracts any entities and relationships it finds
- The pruning step is skipped entirely
- The 11 default entity types are used for NER
This is good for exploration. For production, a defined schema produces cleaner, more consistent graphs.
Inspecting the Graph
Statistics
stats = await rag.graph_store.get_statistics()
# Returns: node_count, edge_count, entity_types, relationship_types,
# graph_density, embedded_relationship_count, mention_edge_count,
# relates_edge_count
Raw Cypher Queries
# Find all Person entities
result = await rag.graph_store.query_raw(
    "MATCH (p:Person) RETURN p.name, p.description LIMIT 10"
)

# Find relationships between two entities
result = await rag.graph_store.query_raw(
    "MATCH (a:Person {name: 'Alice'})-[r:RELATES]->(b) "
    "RETURN a.name, r.rel_type, b.name, r.fact"
)

# Count mentions per entity
result = await rag.graph_store.query_raw(
    "MATCH (e:__Entity__)-[m:MENTIONED_IN]->(c:Chunk) "
    "RETURN e.name, count(m) AS mentions "
    "ORDER BY mentions DESC LIMIT 20"
)
File Reference
| File | What it contains |
|---|---|
| core/models.py | GraphSchema, EntityType, RelationType, PropertyType |
| storage/graph_store.py | Node/relationship upserts, label hints, statistics |
| storage/vector_store.py | Index creation, vector search, fulltext search |
| ingestion/pipeline.py | Lexical graph construction, pruning logic |
| ingestion/extraction_strategies/entity_extractors.py | Default entity types, compute_entity_id() |