Storage — How Data Lives in FalkorDB¶
The storage layer is the bridge between the SDK's Python objects and FalkorDB's graph database. Two classes handle the core work: GraphStore manages node and relationship writes, and VectorStore manages embeddings, indexes, and search. A third class, EntityDeduplicator, merges duplicate entities after ingestion.
This document explains how each one works, what Cypher queries they generate, and how to tune them.
The Big Picture¶
SDK Python Objects FalkorDB
────────────────── ────────
GraphNode ───GraphStore───> (:Person {id, name, ...})
GraphRelationship ───GraphStore───> ()-[:RELATES {fact, ...}]->()
TextChunks ───VectorStore──> (:Chunk {embedding: vecf32(...)})
───VectorStore──> Vector + Fulltext Indexes
Query vectors <──VectorStore─── db.idx.vector.queryNodes(...)
Keyword queries <──VectorStore─── db.idx.fulltext.queryNodes(...)
Duplicate entities ─EntityDeduplicator─> DETACH DELETE + edge remap
GraphStore — Writing Nodes and Relationships¶
GraphStore is the single write path for all graph data. It uses parameterized Cypher to prevent injection and batched UNWIND for performance.
Batched UNWIND MERGE¶
Nodes and relationships are written in batches of 500 items per query. The pattern:
// Nodes
UNWIND $batch AS item
MERGE (n:`Person` {id: item.id})
SET n += item.properties
SET n:__Entity__
// Relationships (with label hints)
UNWIND $batch AS item
MATCH (a:`__Entity__` {id: item.start_id}), (b:`__Entity__` {id: item.end_id})
MERGE (a)-[r:`RELATES`]->(b)
SET r += item.properties
Why MERGE, not CREATE? MERGE is idempotent — if a node or relationship already exists, it updates the properties rather than creating a duplicate. This makes re-ingestion safe.
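The batching itself can be sketched in a few lines of Python. This is an illustration, not the SDK's code: `batched` and `node_upsert_query` are hypothetical helper names, and the label is assumed to be sanitized before it is interpolated.

```python
from typing import Any, Iterator

BATCH_SIZE = 500  # batch size stated above


def batched(items: list[dict[str, Any]], size: int = BATCH_SIZE) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def node_upsert_query(label: str) -> str:
    """Build the UNWIND/MERGE statement for one (already sanitized) label."""
    return (
        "UNWIND $batch AS item "
        f"MERGE (n:`{label}` {{id: item.id}}) "
        "SET n += item.properties "
        "SET n:__Entity__"
    )


# 1200 nodes are split into three queries, each sent with its slice as $batch.
nodes = [{"id": f"n{i}", "properties": {"name": f"Node {i}"}} for i in range(1200)]
batches = list(batched(nodes))
print([len(b) for b in batches])  # [500, 500, 200]
```

Only the label is formatted into the query string. The items travel as the `$batch` parameter.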
Label Hints¶
Relationship MATCH queries use label hints to speed up endpoint lookups. Instead of searching all nodes, FalkorDB only looks at nodes with the specified label:
| Edge Type | Source Label | Target Label |
|---|---|---|
| PART_OF | Document | Chunk |
| NEXT_CHUNK | Chunk | Chunk |
| MENTIONED_IN | __Entity__ | Chunk |
| RELATES | __Entity__ | __Entity__ |
Unknown edge types default to (__Entity__, __Entity__).
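The lookup amounts to a dictionary with a default. The sketch below mirrors the table. The names `LABEL_HINTS` and `endpoint_labels` are made up for illustration and may not match the SDK.

```python
# Hypothetical names; the mapping itself is copied from the table above.
LABEL_HINTS: dict[str, tuple[str, str]] = {
    "PART_OF": ("Document", "Chunk"),
    "NEXT_CHUNK": ("Chunk", "Chunk"),
    "MENTIONED_IN": ("__Entity__", "Chunk"),
    "RELATES": ("__Entity__", "__Entity__"),
}
DEFAULT_HINT = ("__Entity__", "__Entity__")


def endpoint_labels(edge_type: str) -> tuple[str, str]:
    """Return (source_label, target_label) for an edge type, with the default fallback."""
    return LABEL_HINTS.get(edge_type, DEFAULT_HINT)


print(endpoint_labels("MENTIONED_IN"))  # ('__Entity__', 'Chunk')
print(endpoint_labels("WORKS_AT"))      # ('__Entity__', '__Entity__')
```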
Per-Item Fallback¶
If a batch upsert fails (e.g., a single malformed property causes the whole batch to error), GraphStore falls back to per-item upserts. The behavior differs slightly: for nodes, the first per-item failure raises a DatabaseError (remaining items in that batch are not attempted); for relationships, failures are logged as warnings and processing continues through the batch.
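The control flow can be sketched as follows. This is an assumption-laden illustration: `upsert_with_fallback` and `fake_write` are hypothetical, and `DatabaseError` here is a local stand-in for the SDK's exception class.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("graph_store_sketch")


class DatabaseError(Exception):
    """Local stand-in for the SDK's DatabaseError."""


def upsert_with_fallback(
    batch: list[dict[str, Any]],
    write: Callable[[list[dict[str, Any]]], None],
    *,
    is_relationship: bool,
) -> int:
    """Try one batched write; on failure retry item by item. Returns items written."""
    try:
        write(batch)
        return len(batch)
    except Exception as exc:
        logger.warning("batch of %d failed (%s); retrying per item", len(batch), exc)

    written = 0
    for item in batch:
        try:
            write([item])
            written += 1
        except Exception as exc:
            if not is_relationship:
                # Nodes: the first failure aborts the rest of the batch.
                raise DatabaseError(f"node upsert failed for {item!r}") from exc
            # Relationships: log and keep going.
            logger.warning("skipping relationship %r: %s", item, exc)
    return written


def fake_write(items: list[dict[str, Any]]) -> None:
    """Simulated database call that rejects any item flagged as bad."""
    if any(item.get("bad") for item in items):
        raise ValueError("malformed property")


rels = [{"id": 1}, {"id": 2, "bad": True}, {"id": 3}]
print(upsert_with_fallback(rels, fake_write, is_relationship=True))  # 2
```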
None-ID Guard¶
Before writing, nodes with None or empty IDs are filtered out. These come from bad LLM extraction (the LLM sometimes returns entities without proper names). The guard prevents phantom nodes from polluting the graph.
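A guard of this kind might look like the sketch below. The helper name is hypothetical, and treating whitespace-only IDs as empty is an assumption the text above does not confirm.

```python
from typing import Any


def has_usable_id(node: dict[str, Any]) -> bool:
    """True when the node has an ID that is neither None nor empty.

    Assumption: whitespace-only IDs count as empty.
    """
    node_id = node.get("id")
    if node_id is None:
        return False
    return str(node_id).strip() != ""


extracted = [
    {"id": "alice", "name": "Alice"},
    {"id": None, "name": "unnamed"},
    {"id": "", "name": "blank"},
    {"name": "no id key"},
]
writable = [n for n in extracted if has_usable_id(n)]
print([n["id"] for n in writable])  # ['alice']
```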
Property Cleaning¶
All properties go through _clean_properties() before writing:
| Input Type | Treatment |
|---|---|
| str, int, float, bool | Stored as-is |
| list | Items filtered to primitives only, empty lists dropped |
| dict | Serialized to JSON string |
| None | Dropped entirely |
| Other | Converted to string |
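The table translates into a short function. The sketch below is not the SDK's `_clean_properties()`. It is a stand-in named `clean_properties`, and it assumes that a list left empty after filtering is dropped as well.

```python
import json
from typing import Any

PRIMITIVES = (str, int, float, bool)


def clean_properties(props: dict[str, Any]) -> dict[str, Any]:
    """Apply the treatment from the table above to every property value."""
    cleaned: dict[str, Any] = {}
    for key, value in props.items():
        if value is None:
            continue  # dropped entirely
        if isinstance(value, PRIMITIVES):
            cleaned[key] = value  # stored as-is
        elif isinstance(value, list):
            items = [v for v in value if isinstance(v, PRIMITIVES)]
            if items:  # assumption: lists that end up empty are dropped too
                cleaned[key] = items
        elif isinstance(value, dict):
            cleaned[key] = json.dumps(value)  # serialized to a JSON string
        else:
            cleaned[key] = str(value)  # anything else becomes a string
    return cleaned


raw = {
    "name": "Alice",
    "age": 30,
    "tags": ["a", None, {"x": 1}, 2],
    "empty": [],
    "meta": {"k": 1},
    "gone": None,
    "span": (1, 2),
}
print(clean_properties(raw))
```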
Cypher Safety¶
Node labels and relationship types are sanitized via sanitize_cypher_label() to prevent Cypher injection. Properties are applied via parameter maps (e.g., SET n += item.properties), so property keys and values are never interpolated into the Cypher string.
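Labels cannot be passed as query parameters, so they have to be cleaned before they are formatted into the query text. The sketch below shows one common allow-list approach. It is not the SDK's `sanitize_cypher_label()`, whose exact rules are not described here, and `sanitize_label_sketch` is a hypothetical name.

```python
import re

_UNSAFE = re.compile(r"[^A-Za-z0-9_]")


def sanitize_label_sketch(raw: str) -> str:
    """Replace every character outside [A-Za-z0-9_] with an underscore.

    Illustrative only; the real sanitizer may use different rules.
    """
    cleaned = _UNSAFE.sub("_", raw.strip())
    if not cleaned:
        raise ValueError("label is empty after sanitizing")
    return cleaned


# A backtick can no longer close the quoted label and smuggle in extra Cypher.
hostile = "Person`) DETACH DELETE (n"
print(sanitize_label_sketch(hostile))
```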
VectorStore — Embeddings, Indexes, and Search¶
VectorStore handles everything related to vector and fulltext operations.
Index Creation¶
Vector indexes use FalkorDB's native vector index syntax:
CREATE VECTOR INDEX FOR (n:Chunk) ON (n.embedding)
OPTIONS {dimension:256, similarityFunction:'cosine'}
Fulltext indexes use the RediSearch-based fulltext API:
CALL db.idx.fulltext.createNodeIndex('Chunk', 'text')
CALL db.idx.fulltext.createNodeIndex('__Entity__', 'name', 'description')
Relationship vector indexes use the edge index syntax:
CREATE VECTOR INDEX FOR ()-[e:`RELATES`]->() ON (e.embedding)
OPTIONS {dimension:256, similarityFunction:'cosine'}
All index creation is idempotent — if the index already exists, the error is silently caught and logged at debug level.
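The idempotent pattern can be sketched as below. This is a guess at the shape, not the SDK's code: `create_index_if_missing` is hypothetical, the matched substrings are borrowed from the retry section of this document, and re-raising other errors is an assumption.

```python
import logging
from typing import Callable

logger = logging.getLogger("vector_store_sketch")

ALREADY_THERE = ("already indexed", "already exists")


def create_index_if_missing(run_query: Callable[[str], None], statement: str) -> bool:
    """Run an index-creation statement; return False if the index already existed."""
    try:
        run_query(statement)
        return True
    except Exception as exc:
        if any(marker in str(exc).lower() for marker in ALREADY_THERE):
            logger.debug("index already present, skipping: %s", statement)
            return False
        raise  # assumption: anything else is a real failure


# In-memory stand-in for the database, used only to exercise the helper.
_created: set[str] = set()


def fake_run(statement: str) -> None:
    if statement in _created:
        raise RuntimeError("Attribute 'embedding' is already indexed")
    _created.add(statement)


stmt = "CREATE VECTOR INDEX FOR (n:Chunk) ON (n.embedding) OPTIONS {dimension:256, similarityFunction:'cosine'}"
print(create_index_if_missing(fake_run, stmt))  # True
print(create_index_if_missing(fake_run, stmt))  # False
```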
Chunk Indexing (index_chunks)¶
When chunks are ingested, their text is embedded and stored:
- Batch embed: All chunk texts are passed to the embedder via a single aembed_documents call. The underlying provider controls how these are internally batched (e.g., via a configurable batch_size) and may split them across multiple API requests.
- Batch write: Vectors are written to Chunk nodes using UNWIND (500 per batch).
- Fallback: If batch embedding fails, chunks are embedded one at a time. If batch writing fails, items are written individually.
Entity Embedding Backfill (backfill_entity_embeddings)¶
After all documents are ingested, entity nodes need embeddings for vector search. This is done during finalize():
- Query: Find entities missing embeddings: WHERE e.embedding IS NULL
- Embed: Batch-embed entity names via aembed_documents
- Write: Store vectors using UNWIND.
- Loop: Repeat until no more entities with NULL embeddings remain. Each batch naturally returns the next un-embedded set.
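The query, embed, write, loop cycle can be sketched with three injected coroutines. Everything here is hypothetical scaffolding (`backfill`, `fetch_missing`, `embed`, `write`, and the in-memory `store`); only the loop shape follows the steps above.

```python
import asyncio
from typing import Awaitable, Callable

Row = dict[str, str]


async def backfill(
    fetch_missing: Callable[[int], Awaitable[list[Row]]],
    embed: Callable[[list[str]], Awaitable[list[list[float]]]],
    write: Callable[[list[dict]], Awaitable[None]],
    batch_size: int = 500,
) -> int:
    """Embed entities until none are left without an embedding."""
    total = 0
    while True:
        rows = await fetch_missing(batch_size)  # WHERE e.embedding IS NULL
        if not rows:
            return total
        vectors = await embed([row["name"] for row in rows])
        await write([{"id": row["id"], "vector": vec} for row, vec in zip(rows, vectors)])
        total += len(rows)


# In-memory stand-ins so the loop can run without a database or an embedder.
store = {f"e{i}": {"name": f"Entity {i}", "embedding": None} for i in range(7)}


async def fetch_missing(limit: int) -> list[Row]:
    missing = [{"id": k, "name": v["name"]} for k, v in store.items() if v["embedding"] is None]
    return missing[:limit]


async def embed(texts: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in texts]  # toy one-dimensional vectors


async def write(items: list[dict]) -> None:
    for item in items:
        store[item["id"]]["embedding"] = item["vector"]


print(asyncio.run(backfill(fetch_missing, embed, write, batch_size=3)))  # 7
```

The loop terminates only because each write removes rows from the next query's result. A real implementation would also need to guard against a write that silently fails, otherwise the same rows would be fetched forever.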
Relationship Embedding (embed_relationships)¶
RELATES edges with a fact property but no embedding are batch-embedded:
- Query: Find edges with r.embedding IS NULL AND r.fact IS NOT NULL
- Embed: Batch-embed the fact text
- Write: Store vectors on each edge individually (using internal FalkorDB edge IDs).
Search Methods¶
Vector search on Chunk nodes:
CALL db.idx.vector.queryNodes('Chunk', 'embedding', $top_k, vecf32($vector))
YIELD node, score
RETURN node.id AS id, node.text AS text, score
ORDER BY score DESC
Vector search on Entity nodes:
CALL db.idx.vector.queryNodes('__Entity__', 'embedding', $top_k, vecf32($vector))
YIELD node, score
RETURN node.id AS id, node.name AS name, node.description AS description, score
ORDER BY score DESC
Vector search on RELATES edges (with fallback):
// Primary (FalkorDB >= 4.2):
CALL db.idx.vector.queryRelationships('RELATES', 'embedding', $top_k, vecf32($vector))
YIELD relationship AS r, score
RETURN r.src_name, r.rel_type, r.tgt_name, r.fact, score
// Fallback (Cypher-based cosine distance scan):
MATCH (a:__Entity__)-[r:RELATES]->(b:__Entity__)
WHERE r.embedding IS NOT NULL
WITH a, r, b, vecf32.distance.cosine(r.embedding, vecf32($vector)) AS dist
RETURN r.src_name, r.rel_type, r.tgt_name, r.fact, (1-dist) AS score
ORDER BY dist ASC LIMIT $top_k
Fulltext search:
CALL db.idx.fulltext.queryNodes('Chunk', $query_text)
YIELD node, score
RETURN node.id AS id, node.text AS text, score
ORDER BY score DESC LIMIT $top_k
Special characters in fulltext queries are escaped for RediSearch compatibility (commas, brackets, operators, etc.).
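Escaping of this kind usually means putting a backslash in front of each special character. The sketch below uses an illustrative character set. The SDK's exact list is not given here, and `escape_fulltext` is a hypothetical name.

```python
# Illustrative set of punctuation treated as special; the SDK's list may differ.
_SPECIAL = set(",.<>{}[]\"':;!@#$%^&*()-+=~|/\\")


def escape_fulltext(query: str) -> str:
    """Backslash-escape every special character in a fulltext query string."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in query)


print(escape_fulltext("state-of-the-art [2024]"))
```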
Stored Embedding Optimization¶
During retrieval, rerank_chunks() uses stored embeddings when possible. Instead of re-embedding all candidate chunks (which would require an expensive API call), it fetches the vectors already stored on Chunk nodes and computes cosine similarity locally. This path is taken when stored embedding coverage is >= 90%, and it makes reranking near-instant because no re-embedding call is needed.
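A minimal sketch of the coverage check and local scoring follows. It is not the SDK's `rerank_chunks()`: the function name, the `None` return as a signal to fall back to re-embedding, and the choice to rank chunks without a stored vector last are all assumptions.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank_with_stored(
    query_vec: Sequence[float],
    candidate_ids: list[str],
    stored: dict[str, Sequence[float]],
    min_coverage: float = 0.9,
) -> Optional[list[str]]:
    """Rank candidates by similarity to the query using stored vectors only.

    Returns None when too few candidates have a stored vector, so the caller
    can fall back to re-embedding.
    """
    if not candidate_ids:
        return []
    covered = [cid for cid in candidate_ids if cid in stored]
    if len(covered) / len(candidate_ids) < min_coverage:
        return None
    ranked = sorted(covered, key=lambda cid: cosine(query_vec, stored[cid]), reverse=True)
    missing = [cid for cid in candidate_ids if cid not in stored]
    return ranked + missing  # assumption: uncovered chunks are ranked last


stored_vectors = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
print(rerank_with_stored([1.0, 0.0], ["c2", "c3", "c1"], stored_vectors))  # ['c1', 'c3', 'c2']
```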
ensure_indices()¶
Creates all 5 standard indexes in one call. Tracks state internally (_indices_ensured) to avoid redundant creation. Called automatically after each ingest() call, and re-run during finalize() (which resets the flag).
EntityDeduplicator — Merging Duplicate Entities¶
After ingesting multiple documents, the same real-world entity might exist as multiple nodes (e.g., "Alice" from doc 1 and "Alice" from doc 5). The deduplicator merges them.
Phase 1: Exact Name Match (Always Runs)¶
- Fetch all __Entity__ nodes with their primary label (the non-__Entity__ label)
- Group by (normalized_name.lower(), label). Grouping by label prevents cross-type merging (Person "Paris" and Location "Paris" stay separate)
- For each group with duplicates:
  - Survivor: The entity with the longest description
  - Remap edges: All RELATES and MENTIONED_IN edges from duplicates are redirected to the survivor:
// Outgoing RELATES
MATCH (dup:__Entity__ {id: $dup_id})-[r:RELATES]->(b:__Entity__)
WHERE b.id <> $survivor_id
MERGE (s:__Entity__ {id: $survivor_id})-[nr:RELATES]->(b)
SET nr += properties(r)
DELETE r
// Incoming RELATES
MATCH (a:__Entity__)-[r:RELATES]->(dup:__Entity__ {id: $dup_id})
WHERE a.id <> $survivor_id
MERGE (a)-[nr:RELATES]->(s:__Entity__ {id: $survivor_id})
SET nr += properties(r)
DELETE r
// MENTIONED_IN
MATCH (dup:__Entity__ {id: $dup_id})-[r:MENTIONED_IN]->(c:Chunk)
MERGE (s:__Entity__ {id: $survivor_id})-[:MENTIONED_IN]->(c)
DELETE r
  - Delete the duplicate node:
MATCH (e:__Entity__ {id: $dup_id}) DETACH DELETE e
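The grouping and survivor selection can be sketched in plain Python before any Cypher runs. `plan_exact_merges` is a hypothetical helper, and `strip()` stands in for whatever name normalization the SDK applies.

```python
from collections import defaultdict
from typing import Any


def plan_exact_merges(entities: list[dict[str, Any]]) -> list[tuple[str, list[str]]]:
    """Return (survivor_id, duplicate_ids) for every group of same-name, same-label entities."""
    groups: dict[tuple[str, str], list[dict[str, Any]]] = defaultdict(list)
    for entity in entities:
        # Assumption: normalization is approximated by strip(); the label keeps types apart.
        key = (entity["name"].strip().lower(), entity["label"])
        groups[key].append(entity)

    plan = []
    for members in groups.values():
        if len(members) < 2:
            continue
        # Survivor: the entity with the longest description (first one wins ties).
        survivor = max(members, key=lambda e: len(e.get("description") or ""))
        duplicates = [m["id"] for m in members if m["id"] != survivor["id"]]
        plan.append((survivor["id"], duplicates))
    return plan


entities = [
    {"id": "p1", "name": "Alice", "label": "Person", "description": "Engineer"},
    {"id": "p2", "name": "alice ", "label": "Person", "description": "Engineer at Acme, based in Paris"},
    {"id": "p3", "name": "Paris", "label": "Person", "description": ""},
    {"id": "l1", "name": "Paris", "label": "Location", "description": "Capital of France"},
]
print(plan_exact_merges(entities))  # [('p2', ['p1'])]
```

Each `(survivor_id, duplicate_ids)` pair would then drive the remap and delete queries, with `$survivor_id` and `$dup_id` passed as parameters.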
Phase 2: Fuzzy Embedding Match (Optional)¶
Enabled via fuzzy=True. Catches near-duplicates that have slightly different names (e.g., "J. Doe" and "Jane Doe"):
- Fetch all surviving entities with their labels
- Batch-embed entity names
- Normalize vectors and compute pairwise cosine similarity in blocks (1000 entities per block to avoid OOM)
- Merge pairs above the similarity threshold (default: 0.9), but only within the same label (no cross-type merging)
- Remap and delete as in Phase 1
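The blocked similarity pass can be sketched in dependency-free Python. The function name and data layout are hypothetical, and a real implementation would more likely use an array library for the matrix products. Only one block of rows from the similarity matrix is held in memory at a time.

```python
import math
from typing import Sequence


def _unit(v: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors stay zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else [0.0] * len(v)


def similar_pairs(
    vectors: list[Sequence[float]],
    labels: list[str],
    threshold: float = 0.9,
    block: int = 1000,
) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, with the same label and cosine similarity above threshold."""
    unit = [_unit(v) for v in vectors]
    n = len(unit)
    pairs = []
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Similarity rows for this block only: at most block x n values at once.
        sims = [
            [sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n)]
            for i in range(start, stop)
        ]
        for offset, row in enumerate(sims):
            i = start + offset
            for j in range(i + 1, n):
                if labels[i] == labels[j] and row[j] > threshold:
                    pairs.append((i, j))
    return pairs


vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [1.0, 0.01]]
labels = ["Person", "Person", "Person", "Location"]
print(similar_pairs(vectors, labels))  # [(0, 1)]
```

Because the vectors are normalized first, the dot product of two rows is their cosine similarity. The fourth vector is nearly identical to the first but carries a different label, so it is not paired.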
FalkorDB-Specific Notes¶
Vector Storage¶
FalkorDB stores vectors as vecf32, a native 32-bit float vector type. All vectors in the SDK are wrapped in vecf32(...) when they are stored, for example (:Chunk {embedding: vecf32(...)}).
Vector Search API¶
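Node vector search uses 4 arguments (label, attribute, result count, query vector), as in db.idx.vector.queryNodes('Chunk', 'embedding', $top_k, vecf32($vector)).
Relationship vector search (FalkorDB >= 4.2) takes the relationship type in place of the label, as in db.idx.vector.queryRelationships('RELATES', 'embedding', $top_k, vecf32($vector)).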
Graph Deletion¶
For fast graph deletion, use GRAPH.DELETE (the Redis-level command) instead of running MATCH (n) DETACH DELETE n on large graphs.
Retry Behavior¶
The connection layer retries transient query failures up to 3 times using exponential backoff with jitter (retry_delay * 2^attempt * random(0.5, 1.5)) and employs a circuit breaker to short-circuit repeated failures. Non-transient errors (containing "already indexed", "already exists", or "unknown index") are raised immediately.
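The delay formula and the retry loop can be sketched as follows. This illustration omits the circuit breaker, treats every error without one of the listed substrings as transient, and reads "up to 3 times" as one initial attempt plus three retries. All three are assumptions, and the function names are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

NON_TRANSIENT_MARKERS = ("already indexed", "already exists", "unknown index")


def backoff_delay(retry_delay: float, attempt: int) -> float:
    """retry_delay * 2^attempt * random(0.5, 1.5), as given in the text above."""
    return retry_delay * (2 ** attempt) * random.uniform(0.5, 1.5)


def run_with_retry(
    op: Callable[[], T],
    *,
    retries: int = 3,
    retry_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op(), retrying failures with jittered exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return op()
        except Exception as exc:
            message = str(exc).lower()
            non_transient = any(marker in message for marker in NON_TRANSIENT_MARKERS)
            if non_transient or attempt == retries:
                raise  # raised immediately, or retries exhausted
            sleep(backoff_delay(retry_delay, attempt))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied


calls = {"n": 0}


def flaky() -> str:
    """Fails twice with a timeout, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return "ok"


delays: list[float] = []
print(run_with_retry(flaky, sleep=delays.append), len(delays))  # ok 2
```

The jitter factor spreads out retries from concurrent clients so they do not all hit the server at the same moment.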
File Reference¶
| File | What it contains |
|---|---|
| storage/graph_store.py | GraphStore: batched MERGE, label hints, statistics, cleanup |
| storage/vector_store.py | VectorStore: index management, chunk/entity/relationship embedding, search |
| storage/deduplicator.py | EntityDeduplicator: exact + fuzzy dedup, edge remapping |
| utils/cypher.py | sanitize_cypher_label() and other Cypher utilities |
| core/connection.py | FalkorDBConnection: async client, connection pooling, retry |