Benchmarking¶
This guide explains how to evaluate the GraphRAG SDK against academic benchmarks and your own datasets. It covers the evaluation methodology, dataset format, step-by-step reproduction with the SDK API, pipeline configuration options, and our published results on the GraphRAG-Bench Novel leaderboard.
Table of Contents¶
- Overview
- Prerequisites
- Datasets
- Reproducing with the SDK API
- Pipeline Configuration
- GraphRAG-Bench Novel Results
Overview¶
A benchmark run measures four dimensions:
| Dimension | What it captures |
|---|---|
| Accuracy | Answer quality against ground-truth references (ACC, ROUGE-L, coverage) |
| Ingestion throughput | Time to chunk, extract, resolve, and build the knowledge graph |
| Query latency | End-to-end time from question submission to final answer |
| Graph statistics | Nodes, edges, and chunks produced — a proxy for knowledge density |
GraphRAG-Bench scoring system¶
We use the official GraphRAG-Bench evaluation methodology. The primary leaderboard metric is ACC (answer correctness × 100), computed as:
$$\text{ACC} = \bigl(0.75 \times \text{factuality\_F1} + 0.25 \times \text{semantic\_similarity}\bigr) \times 100$$
| Component | How it works |
|---|---|
| Factuality F1 | An LLM decomposes both the generated answer and the ground truth into atomic statements, classifies each as TP / FP / FN, and computes F1 |
| Semantic similarity | Cosine similarity between answer and reference embeddings, scaled to [0, 1] |
| ROUGE-L | Longest common subsequence F1 — used for Fact Retrieval and Complex Reasoning |
| Coverage score | Fraction of reference facts present in the answer — used for Contextual Summarize and Creative Generation |
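For example, an answer judged to have a factuality F1 of 0.80 and a semantic similarity of 0.60 scores ACC = (0.75 × 0.80 + 0.25 × 0.60) × 100 = 75.0.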
Prerequisites¶
Infrastructure¶
Start a FalkorDB instance:
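For example, using the official Docker image:

```bash
docker run -p 6379:6379 -it --rm falkordb/falkordb:latest
```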
Environment Variables¶
Configure your LLM and embedding provider. Example for Azure OpenAI:
export AZURE_OPENAI_API_KEY="your-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2024-12-01-preview"
export AZURE_OPENAI_DEPLOYMENT="gpt-4o-mini"
export AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-3-small"
Any LiteLLM-supported provider works — OpenAI, Anthropic, local models, etc.
Dependencies¶
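Install the SDK into your Python environment. The package name below assumes the SDK is published as `graphrag_sdk` (adjust to your installation source); the GLiNER extractor and FastCoref resolver used later may also require their own packages (e.g., `gliner` and `fastcoref`):

```bash
pip install graphrag_sdk
```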
Datasets¶
We evaluate against GraphRAG-Bench, an academic benchmark for graph-based retrieval-augmented generation. The datasets and questions are published in the GraphRAG-Bench project — download them from the official repository and place them in your dataset directory.
Available datasets¶
| Dataset | Corpus file | Questions file | Docs | Questions |
|---|---|---|---|---|
| Novel (Full) | `novel.json` | `novel_questions.json` | 20 | 2,010 |
| Novel (Sample 100) | `novel.json` | `novel_questions_sample_100.json` | 20 | 100 |
| Medical | `medical.json` | `medical_questions.json` | 1 | 2,062 |
Note: These files are not included in this repository. Download them from GraphRAG-Bench and place them in a local directory (e.g., `datasets/`).
Data format¶
Corpus — a JSON array of documents:
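Each document carries a name and its full text; the field names below match what Step 5 reads during ingestion (`corpus_name` and `context`):

```json
[
  {
    "corpus_name": "Novel-44557",
    "context": "Full text of the novel..."
  }
]
```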
Questions — a JSON array of evaluation items:
[
{
"id": "Novel-73586ddc",
"source": "Novel-44557",
"question": "Which plant known as Erica vagans is also called...?",
"answer": "Cornish heath",
"question_type": "Fact Retrieval",
"evidence": "The plant known scientifically as Erica vagans...",
"evidence_relations": ["..."]
}
]
Four question types are evaluated, each with different metrics:
| Question Type | Metrics Used |
|---|---|
| Fact Retrieval | ROUGE-L + answer_correctness (ACC) |
| Complex Reasoning | ROUGE-L + answer_correctness (ACC) |
| Contextual Summarize | answer_correctness (ACC) + coverage_score |
| Creative Generation | answer_correctness (ACC) + coverage_score |
To benchmark your own domain, create two JSON files following the same schema.
Reproducing with the SDK API¶
The following walkthrough shows how to reproduce our benchmark results from scratch using the SDK's Python API. Running these steps with the same configuration and dataset should reproduce the results shown in the GraphRAG-Bench Novel Results section (small variations are possible due to LLM non-determinism).
Step 1 — Initialize providers¶
import asyncio
import json
from graphrag_sdk import ConnectionConfig, GraphRAG, LiteLLM, LiteLLMEmbedder
llm = LiteLLM(
model="azure/gpt-4o-mini",
api_key="...",
api_base="https://your-resource.openai.azure.com/",
api_version="2024-12-01-preview",
)
embedder = LiteLLMEmbedder(
model="azure/text-embedding-3-small",
api_key="...",
api_base="https://your-resource.openai.azure.com/",
api_version="2024-12-01-preview",
)
Step 2 — Create a GraphRAG instance¶
rag = GraphRAG(
connection=ConnectionConfig(host="localhost", port=6379, graph_name="novel_bench"),
llm=llm,
embedder=embedder,
)
Step 3 — Configure the ingestion pipeline¶
from graphrag_sdk.core.context import Context
from graphrag_sdk.ingestion.chunking_strategies.sentence_token_cap import SentenceTokenCapChunking
from graphrag_sdk.ingestion.extraction_strategies.graph_extraction import GraphExtraction
from graphrag_sdk.ingestion.extraction_strategies.entity_extractors import GLiNERExtractor
from graphrag_sdk.ingestion.extraction_strategies.coref_resolvers import FastCorefResolver
from graphrag_sdk.ingestion.resolution_strategies.base import ResolutionStrategy
from graphrag_sdk.ingestion.resolution_strategies.exact_match import ExactMatchResolution
from graphrag_sdk.ingestion.resolution_strategies.description_merge import DescriptionMergeResolution
from graphrag_sdk.ingestion.resolution_strategies.semantic_resolution import SemanticResolution
from graphrag_sdk.ingestion.resolution_strategies.llm_verified_resolution import LLMVerifiedResolution
chunker = SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)
extractor = GraphExtraction(
llm=llm,
entity_extractor=GLiNERExtractor(model_name="urchade/gliner_medium-v2.1"),
coref_resolver=FastCorefResolver(),
)
Step 4 — Chain resolution stages¶
Multiple resolution strategies run in sequence — each stage feeds its output into the next. Create a simple chained resolver:
from graphrag_sdk.core.models import GraphData, ResolutionResult
class ChainedResolution(ResolutionStrategy):
"""Run multiple resolution strategies in sequence."""
def __init__(self, *stages):
self._stages = stages
async def resolve(self, graph_data, ctx):
for stage in self._stages:
result = await stage.resolve(graph_data, ctx)
# Convert ResolutionResult back to GraphData for the next stage
graph_data = GraphData(
nodes=result.nodes,
relationships=result.relationships,
)
return ResolutionResult(
nodes=graph_data.nodes,
relationships=graph_data.relationships,
)
resolver = ChainedResolution(
ExactMatchResolution(resolve_property="name"),
DescriptionMergeResolution(llm=llm),
SemanticResolution(embedder=embedder, similarity_threshold=0.85),
LLMVerifiedResolution(llm=llm, embedder=embedder, hard_threshold=0.95, soft_threshold=0.60),
)
Step 5 — Ingest the corpus¶
corpus = json.load(open("datasets/novel.json"))
for doc in corpus:
await rag.ingest(
doc["corpus_name"],
text=doc["context"],
chunker=chunker,
extractor=extractor,
resolver=resolver,
ctx=Context(tenant_id=doc["corpus_name"]),
)
Step 6 — Finalize the graph¶
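Call `finalize()` once, after all documents have been ingested:

```python
await rag.finalize()
```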
This removes null/stub entities, deduplicates across documents, embeds all entities and relationships, and creates vector and fulltext indexes in FalkorDB.
Step 7 — Query¶
questions = json.load(open("datasets/novel_questions.json"))
results = []
for q in questions:
result = await rag.completion(q["question"])
results.append({
"question": q["question"],
"answer": result.answer,
"reference": q["answer"],
"question_type": q["question_type"],
})
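Note that Steps 5-7 use `await` at the top level for readability. In a standalone script, wrap them in an async entry point and run it with asyncio, for example (`main` is an illustrative name):

```python
async def main():
    # Steps 5-7: ingest the corpus, finalize the graph, run the queries
    ...

asyncio.run(main())
```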
Step 8 — Evaluate with GraphRAG-Bench metrics¶
For each question, compute the official GraphRAG-Bench metrics:
Answer Correctness (ACC) — the primary leaderboard metric:
- The LLM decomposes both the generated answer and the ground truth into atomic statements
- Each statement is classified as TP (true positive), FP (false positive), or FN (false negative)
- Factuality F1 is computed from the TP / FP / FN counts
- Semantic similarity is the cosine similarity between answer and reference embeddings, scaled to [0, 1]
- Final score:
ACC = 0.75 × factuality_F1 + 0.25 × semantic_similarity
ROUGE-L — longest common subsequence F1 between answer and reference. Applied to Fact Retrieval and Complex Reasoning questions.
Coverage Score — the LLM extracts facts from the reference and checks what fraction is covered in the answer. Applied to Contextual Summarize and Creative Generation questions.
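The aggregation below calls a `compute_answer_correctness` helper, which is not part of the SDK; port it from the official evaluation suite or implement it yourself. As a rough sketch of the final blending step only, assuming the factuality F1 has already been produced by the LLM TP / FP / FN judging step and that you have the answer and reference embeddings as arrays:

```python
import numpy as np

def blend_answer_correctness(factuality_f1: float,
                             answer_emb: np.ndarray,
                             reference_emb: np.ndarray) -> float:
    """Final ACC blend: 0.75 * factuality F1 + 0.25 * semantic similarity."""
    cosine = float(answer_emb @ reference_emb /
                   (np.linalg.norm(answer_emb) * np.linalg.norm(reference_emb)))
    # One way to scale cosine from [-1, 1] to [0, 1]; the official suite may differ.
    semantic_similarity = (cosine + 1.0) / 2.0
    return 0.75 * factuality_f1 + 0.25 * semantic_similarity
```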
After running evaluation across all questions, aggregate the results:
from collections import defaultdict
by_type = defaultdict(list)
for r in results:
acc = compute_answer_correctness(llm, embedder, r["question"], r["answer"], r["reference"])
by_type[r["question_type"]].append(acc)
# Per-type and overall ACC
for q_type, scores in by_type.items():
avg_acc = sum(scores) / len(scores) * 100
print(f"{q_type}: ACC = {avg_acc:.2f}")
all_scores = [s for scores in by_type.values() for s in scores]
overall_acc = sum(all_scores) / len(all_scores) * 100
print(f"Overall ACC: {overall_acc:.2f}")
This produces the accuracy tables, graph statistics, and leaderboard comparison shown in the results section below.
Pipeline Configuration¶
The ingestion and retrieval pipeline is fully composable. Each stage can be swapped independently.
Chunking strategies¶
| Strategy | Description |
|---|---|
| `SentenceTokenCapChunking(max_tokens, overlap_sentences)` | Splits on sentence boundaries with a configurable token cap. Best for most use cases. |
Extraction strategies¶
| Strategy | Description |
|---|---|
| `GraphExtraction(llm, entity_extractor=GLiNERExtractor(), coref_resolver=FastCorefResolver())` | Local NER (no API cost) + coreference resolution + LLM for relationships. Best accuracy. |
| `GraphExtraction(llm)` | LLM-only extraction. Higher API cost per document. |
Resolution strategies¶
Chain multiple resolvers in sequence using a ChainedResolution wrapper (see Step 4). Each stage feeds its deduplicated output into the next:
| Strategy | Description |
|---|---|
| `ExactMatchResolution(resolve_property="name")` | Merges entities with identical names. Zero API cost. |
| `DescriptionMergeResolution(llm)` | LLM merges entities with similar descriptions. |
| `SemanticResolution(embedder, similarity_threshold)` | Cosine similarity on embeddings with an hnswlib ANN index. No LLM calls. |
| `LLMVerifiedResolution(llm, embedder, hard_threshold, soft_threshold)` | Two-tier: auto-merge above the hard threshold, LLM-verify between the soft and hard thresholds. Uses Louvain community detection. |
Winning chain (used in our benchmark):
resolver = ChainedResolution(
ExactMatchResolution(resolve_property="name"),
DescriptionMergeResolution(llm=llm),
SemanticResolution(embedder=embedder, similarity_threshold=0.85),
LLMVerifiedResolution(llm=llm, embedder=embedder, hard_threshold=0.95, soft_threshold=0.60),
)
Retrieval strategies¶
| Strategy | Description |
|---|---|
| `MultiPathRetrieval` (default) | Multi-path entity discovery, 2-hop graph expansion, chunk retrieval, cosine rerank. No configuration required. |
Post-ingestion: finalize()¶
Always call `await rag.finalize()` after ingesting all documents:
- Removes null/stub entities
- Deduplicates across document boundaries
- Embeds all entities and relationships
- Creates vector and fulltext indexes in FalkorDB
GraphRAG-Bench Novel Results¶
The following results were produced by running the pipeline described above on the complete GraphRAG-Bench Novel dataset (20 novels, 2,010 questions).
Configuration¶
| Parameter | Value |
|---|---|
| LLM | gpt-4o-mini (Azure OpenAI) |
| Embeddings | text-embedding-3-small (Azure OpenAI) |
| Chunking | SentenceTokenCapChunking — max_tokens=512, overlap_sentences=2 |
| Extraction | GLiNER v2.1 + FastCoref + LLM relationship extraction |
| Resolution | ExactMatch (name) → DescriptionMerge → Semantic (0.85) → LLMVerified (0.95 / 0.60) |
| Retrieval | MultiPathRetrieval |
| Corpus | novel.json — 20 novels, 4.7 MB |
| Questions | novel_questions.json — 2,010 questions |
Accuracy (official GraphRAG-Bench ACC)¶
| Question Type | ACC (×100) | ROUGE-L | Coverage |
|---|---|---|---|
| Fact Retrieval | 65.22 | 35.95 | — |
| Complex Reasoning | 58.63 | 22.39 | — |
| Contextual Summarize | 69.54 | — | 55.21 |
| Creative Generation | 57.08 | — | 44.52 |
| Overall | 63.73 | — | — |
Leaderboard comparison¶
Note: Only the FalkorDB GraphRAG-SDK row was produced by us using the pipeline described above. All other system scores are taken from the GraphRAG-Bench published leaderboard as of April 2025. Leaderboard rankings may change as systems are updated and re-evaluated.
| System | Fact Retrieval | Complex Reasoning | Contextual Summarize | Creative Generation | Overall |
|---|---|---|---|---|---|
| FalkorDB GraphRAG-SDK | 65.22 | 58.63 | 69.54 | 57.08 | 63.73 |
| AutoPrunedRetriever | 45.99 | 62.80 | 83.10 | 62.97 | 63.72 |
| G-Reasoner | 60.07 | 53.92 | 71.28 | 50.48 | 58.94 |
| HippoRAG2 | 60.14 | 53.38 | 64.10 | 48.28 | 56.48 |
| Fast-GraphRAG | 56.95 | 48.55 | 56.41 | 46.18 | 52.02 |
| MS-GraphRAG (local) | 49.29 | 50.93 | 64.40 | 39.10 | 50.93 |
| RAG (w/ rerank) | 60.92 | 42.93 | 51.30 | 38.26 | 48.35 |
| LightRAG | 58.62 | 49.07 | 48.85 | 23.80 | 45.09 |
| HippoRAG | 52.93 | 38.52 | 48.70 | 38.85 | 44.75 |
Source: graphrag-bench.github.io — Novel leaderboard.
Graph statistics¶
| Metric | Value |
|---|---|
| Total nodes | 8,765 |
| Total edges | 25,895 |
| Total chunks | 2,782 |
| Documents | 20 |
Timing¶
| Phase | Duration |
|---|---|
| Avg. query latency | 3.6 s |
Evaluation methodology¶
Scores use the official GraphRAG-Bench evaluation suite,
ported from github.com/GraphRAG-Bench/GraphRAG-Benchmark/Evaluation:
| Component | How it works | Judge LLM |
|---|---|---|
| answer_correctness | 0.75 × Factuality F1 + 0.25 × Semantic Similarity. The LLM decomposes both the answer and reference into atomic statements, classifies TP / FP / FN, and computes F1. Semantic similarity is cosine similarity of answer vs reference embeddings. | gpt-4o-mini |
| rouge_score | ROUGE-L F1 (used for Fact Retrieval & Complex Reasoning) | — (algorithmic) |
| coverage_score | The LLM extracts facts from the reference and checks which are covered in the answer (used for Contextual Summarize & Creative Generation) | gpt-4o-mini |
ACC reported on the leaderboard = answer_correctness × 100, averaged per question type.