GraphRAG Benchmark: Proving Graph-Based Retrieval at Scale
A 2-million-token benchmark comparing LLM-only, Basic RAG, and NetworkX GraphRAG pipelines on long-form scientific papers.
Why GraphRAG matters
The benchmark asks whether graph-structured retrieval can keep answers grounded while reducing context bloat.
LLM-only
Fast to wire up, but answers are ungrounded and the model can hallucinate when it lacks source context.
Basic RAG
Retrieves semantically similar chunks, but often misses relationships spread across documents or entities.
GraphRAG
Traverses entity and chunk relationships, enabling compact context and multi-hop reasoning over the corpus.
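To make the contrast concrete, here is a minimal NetworkX sketch of the GraphRAG idea: entities and chunks become nodes, and a bounded multi-hop walk around a seed entity yields compact, relationship-aware context. The node names, attributes, and helper function are illustrative assumptions, not the benchmark's actual schema.

```python
import networkx as nx

# Toy entity/chunk graph; node names, attributes, and edge labels are
# illustrative assumptions, not the benchmark's actual schema.
G = nx.Graph()
G.add_node("BERT", kind="entity")
G.add_node("SQuAD", kind="entity")
G.add_node("chunk_12", kind="chunk", text="BERT is fine-tuned on SQuAD ...")
G.add_edge("BERT", "chunk_12", rel="mentioned_in")
G.add_edge("SQuAD", "chunk_12", rel="mentioned_in")
G.add_edge("BERT", "SQuAD", rel="evaluated_on")

def graph_context(G: nx.Graph, seed: str, hops: int = 2) -> list[str]:
    """Collect chunk text within `hops` edges of a seed entity (multi-hop)."""
    neighborhood = nx.ego_graph(G, seed, radius=hops)
    return [G.nodes[n]["text"] for n in neighborhood
            if G.nodes[n]["kind"] == "chunk"]

print(graph_context(G, "BERT"))  # ['BERT is fine-tuned on SQuAD ...']
```

The ego-graph radius bounds how far the traversal wanders, which is what keeps the assembled context compact relative to dumping whole documents into the prompt.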
A scientific-paper dataset with relationship density
Scientific papers naturally contain methods, datasets, tasks, assumptions, citations, and concepts that reward graph retrieval.
The corpus is large enough to stress token budgets and rich enough to expose the difference between nearest-neighbor retrieval and graph-aware traversal.
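As a toy illustration of why this relationship density rewards traversal (all names below are invented): a citation edge can connect a paper to a dataset it never mentions directly, a link nearest-neighbor retrieval sees as two unrelated chunks but a two-hop walk recovers.

```python
import networkx as nx

# Invented fragment of a paper graph: a citation edge links evidence that
# nearest-neighbor retrieval would treat as two unrelated pieces of text.
G = nx.DiGraph()
G.add_edge("paper_A", "paper_B", rel="cites")
G.add_edge("paper_B", "dataset_X", rel="introduces")
G.add_edge("paper_A", "task_QA", rel="addresses")

# Two-hop traversal recovers the paper_A -> dataset_X connection.
print(nx.shortest_path(G, "paper_A", "dataset_X"))
# ['paper_A', 'paper_B', 'dataset_X']
```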
Same questions, three retrieval strategies
Every evaluation question is run through all three pipelines and scored with identical metrics; a harness sketch follows the table.
| Pipeline | Retrieval strategy | Reasoning | Strength | Limitation |
|---|---|---|---|---|
| LLM-only | Direct prompt | General parametric reasoning | Simple baseline | No retrieval grounding |
| Basic RAG | ChromaDB similarity search | Single-hop semantic retrieval | Strong factual grounding | Can miss entity relationships |
| GraphRAG | NetworkX graph traversal | Multi-hop entity context | Compact relationship-aware evidence | Depends on graph quality |
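A minimal harness sketch of that per-question loop, assuming stand-in callables: `pipelines`, `judge`, and `count_tokens` are hypothetical placeholders for the benchmark's real components, included only to show that every pipeline is timed and scored the same way.

```python
import time

def run_question(question, reference, pipelines, judge, count_tokens):
    """Run one question through every pipeline, scoring each identically.

    `pipelines` maps a name to a callable returning (answer, context);
    `judge` and `count_tokens` are stand-ins for the real scorers.
    """
    results = {}
    for name, pipeline in pipelines.items():
        start = time.perf_counter()
        answer, context = pipeline(question)
        results[name] = {
            "latency_s": time.perf_counter() - start,
            "context_tokens": count_tokens(context),
            "judge": judge(question, answer, reference),  # PASS or FAIL
            "answer": answer,
        }
    return results
```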
Two independent quality checks
Token reduction only matters when answers remain correct, grounded, and semantically aligned with references.
LLM-as-a-Judge
A hosted Hugging Face model independently grades each answer PASS or FAIL against the reference answer.
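A minimal sketch of such a judge call using `huggingface_hub`'s `InferenceClient`; the model ID and prompt wording are assumptions for illustration, not the benchmark's exact configuration.

```python
from huggingface_hub import InferenceClient

# The model ID below is an assumption; the benchmark's hosted judge
# model may differ.
client = InferenceClient(model="meta-llama/Llama-3.1-8B-Instruct")

def judge(question: str, answer: str, reference: str) -> str:
    """Grade a candidate answer PASS/FAIL against the reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with exactly PASS if the candidate is correct and grounded "
        "relative to the reference, otherwise FAIL."
    )
    reply = client.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=5
    )
    return reply.choices[0].message.content.strip()
```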
BERTScore
Semantic similarity catches correct paraphrases using rescaled F1, reducing dependence on exact wording.
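A minimal sketch with the `bert-score` package; the example sentence pair is invented.

```python
from bert_score import score

candidates = ["The model is fine-tuned on the SQuAD dataset."]
references = ["Fine-tuning uses SQuAD."]

# rescale_with_baseline maps raw similarity onto a more interpretable
# range, matching the rescaled F1 the benchmark reports.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(round(float(F1.mean()), 3))
```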
A lightweight architecture judges can reproduce
The production deployment runs fully without TigerGraph; NetworkX is the benchmark's canonical GraphRAG engine.
Built from portable, judge-friendly pieces
Frontend: the metrics dashboard and landing page that judges interact with.
Backend: a Python service that bundles all storage artifacts.
Retrieval: ChromaDB similarity search plus NetworkX graph traversal.
Evaluation: LLM-as-a-Judge grading plus BERTScore similarity.
Deployment: runs entirely in Python with no external graph database.
Architectural Pivot: From TigerGraph to NetworkX
TigerGraph was explored as the graph database backend, but operational complexity distracted from the research question.
Authentication issues, Docker complexity, resource overhead, and infrastructure work made TigerGraph a poor default for hackathon judging. NetworkX preserves graph traversal and multi-hop reasoning while running entirely in Python.
The benchmark demonstrates graph-based retrieval methodology, not dependence on a particular graph database.
| Capability | NetworkX | TigerGraph |
|---|---|---|
| Multi-hop traversal | Yes | Yes |
| Benchmark ready | Yes | Yes |
| Zero setup | Yes | No |
| Laptop friendly | Yes | No |
| Enterprise scale | Limited | Excellent |
| Required for benchmark | Yes | No |
Efficiency, cost, and quality in one view
The dashboard preserves raw answers and context while the landing page summarizes judge-facing metrics.
| Pipeline | Badges | Judge | BERT F1 | Avg Tokens | Latency | Cost | Token Reduction | Latency Reduction | Cost Reduction |
|---|---|---|---|---|---|---|---|---|---|
| LLM-only | Lowest Tokens | N/A | N/A | 365 | 7.29s | $0.0007 | 0.0% | 0.0% | 0.0% |
| Basic RAG | None | N/A | N/A | 748 | 3.52s | $0.0015 | -104.7% | 51.7% | -104.7% |
| GraphRAG | Fastest, Best Overall | N/A | N/A | 378 | 2.00s | $0.0008 | -3.4% | 72.5% | -3.4% |
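The reduction columns measure each pipeline against the LLM-only baseline, so a negative value means an increase. A quick check in Python (recomputed from the rounded table figures; the dashboard averages unrounded per-question values, so these land close but not exact):

```python
def reduction(baseline: float, value: float) -> float:
    """Percent reduction versus the LLM-only baseline; negative = increase."""
    return (baseline - value) / baseline * 100

print(f"{reduction(365, 748):.1f}%")   # -104.9%: Basic RAG uses more tokens
print(f"{reduction(7.29, 2.00):.1f}%") # 72.6%: GraphRAG's latency reduction
```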
Dashboard charts summarize accuracy and semantic similarity, the normalized overall profile, average token usage, and average latency.
Free, reproducible, and TigerGraph-free by default
Storage
ChromaDB files, JSON artifacts, and NetworkX graph files are bundled with the backend.
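A minimal sketch of how a NetworkX graph round-trips through a plain JSON artifact of the kind the backend bundles; the filename and graph contents are illustrative, not the backend's actual layout.

```python
import json
import networkx as nx
from networkx.readwrite import json_graph

# Serialize the graph to a portable JSON file, then load it back.
G = nx.Graph([("BERT", "SQuAD")])
with open("graph.json", "w") as f:
    json.dump(json_graph.node_link_data(G), f)

with open("graph.json") as f:
    G2 = json_graph.node_link_graph(json.load(f))

assert set(G2.edges) == set(G.edges)
```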