
GraphRAG Benchmark: Proving Graph-Based Retrieval at Scale

A 2-million-token benchmark comparing LLM-only, Basic RAG, and NetworkX GraphRAG pipelines on long-form scientific papers.

2M Tokens · 3 Pipelines · 40 Evaluation Questions · LLM Judge + BERTScore · Vercel + Hugging Face Spaces
Corpus scale
2,004,563 tokens
Retrieval strategy
Vector similarity + graph traversal
Evaluation
Independent judge and semantic validator
Outcome
Efficiency gains without sacrificing accuracy
Problem

Why GraphRAG matters

The benchmark asks whether graph-structured retrieval can keep answers grounded while reducing context bloat.

LLM-only

Fast to wire up, but answers are ungrounded and the model can hallucinate when it lacks source context.

Basic RAG

Retrieves semantically similar chunks, but often misses relationships spread across documents or entities.

GraphRAG

Traverses entity and chunk relationships, enabling compact context and multi-hop reasoning over the corpus.
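The idea in miniature, as a hedged NetworkX sketch: seed an entity, walk a bounded neighborhood, and keep only the chunks it reaches. Node names and the chunk_ prefix are illustrative, not the benchmark's actual schema.

```python
# Toy entity/chunk graph; names are illustrative assumptions.
import networkx as nx

G = nx.Graph()
G.add_edge("BERT", "chunk_12", relation="mentioned_in")
G.add_edge("BERT", "GLUE", relation="evaluated_on")
G.add_edge("GLUE", "chunk_47", relation="mentioned_in")

def multi_hop_chunks(graph: nx.Graph, seed: str, hops: int = 2) -> list[str]:
    """Return chunk IDs reachable within `hops` of a seed entity."""
    reachable = nx.ego_graph(graph, seed, radius=hops)
    return [n for n in reachable.nodes if str(n).startswith("chunk_")]

print(multi_hop_chunks(G, "BERT"))  # chunk_12 directly, chunk_47 via GLUE
```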

Corpus

A scientific-paper dataset with dense relationships

Scientific papers naturally contain methods, datasets, tasks, assumptions, citations, and concepts that reward graph retrieval.

The corpus is large enough to stress token budgets and rich enough to expose the difference between nearest-neighbor retrieval and graph-aware traversal.

Dataset
armanc/scientific_papers (arxiv)
Domain
Scientific Research Papers
Total tokens
2,004,563
Papers selected
220
Evaluation questions
40
Source
Hugging Face arXiv split
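For orientation, a sketch of how the corpus statistics above could be reproduced. The cl100k_base encoding, the train split, and taking the first 220 papers are assumptions rather than documented choices.

```python
# Stream the arXiv config of armanc/scientific_papers and count tokens.
# Encoding, split, and the "first 220 papers" cutoff are assumptions.
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("cl100k_base")
ds = load_dataset("armanc/scientific_papers", "arxiv",
                  split="train", streaming=True)

total, papers = 0, 0
for record in ds:
    total += len(enc.encode(record["article"]))
    papers += 1
    if papers == 220:
        break

print(f"{papers} papers, {total:,} tokens")
```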
Benchmark Method

Same questions, three retrieval strategies

Every evaluation question is run through all three pipelines and scored with identical metrics.

Pipeline  | Retrieval strategy         | Reasoning                     | Strength                             | Limitation
LLM-only  | Direct prompt              | General parametric reasoning  | Simple baseline                      | No retrieval grounding
Basic RAG | ChromaDB similarity search | Single-hop semantic retrieval | Strong factual grounding             | Can miss entity relationships
GraphRAG  | NetworkX graph traversal   | Multi-hop entity context      | Compact, relationship-aware evidence | Depends on graph quality
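The harness this table implies is small. A sketch with hypothetical pipeline callables and result fields:

```python
# Hypothetical harness: identical questions and metrics for every pipeline.
# `answer_fn` stands in for each pipeline's entry point; field names are assumptions.
from typing import Callable

def run_benchmark(
    questions: list[dict],
    pipelines: dict[str, Callable[[str], dict]],
) -> list[dict]:
    rows = []
    for q in questions:
        for name, answer_fn in pipelines.items():
            # e.g. result = {"answer": ..., "tokens": ..., "latency_s": ...}
            result = answer_fn(q["question"])
            rows.append({"pipeline": name, "reference": q["reference"], **result})
    return rows
```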
Evaluation

Two independent quality checks

Token reduction only matters when answers remain correct, grounded, and semantically aligned with references.

LLM-as-a-Judge

A hosted Hugging Face model independently grades each answer PASS or FAIL against the reference answer.
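A minimal judge call might look like the following; the prompt template is an assumption, and only the model name (noted below) comes from this page.

```python
# PASS/FAIL grading via the Hugging Face Inference API.
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Llama-3.1-8B-Instruct")

def judge(question: str, reference: str, candidate: str) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word, PASS or FAIL."
    )
    out = client.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=5
    )
    return "PASS" if "PASS" in out.choices[0].message.content.upper() else "FAIL"
```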

BERTScore

Semantic similarity scoring with rescaled F1 credits correct paraphrases, reducing dependence on exact wording.

Judge model: meta-llama/Llama-3.1-8B-Instruct. Evaluation set: 40 grounded questions.
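For reference, rescaled F1 via the bert-score package, with illustrative strings:

```python
# Rescaled BERTScore F1; rescale_with_baseline maps raw similarity
# onto a more interpretable range.
from bert_score import score

candidates = ["Graph traversal gathers related entities across papers."]
references = ["GraphRAG follows entity links to collect cross-paper context."]

P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"rescaled F1: {F1.mean().item():.3f}")
```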
System Design

A lightweight architecture judges can reproduce

The production deployment works fully without TigerGraph. NetworkX is the canonical GraphRAG engine.

Frontend
Vercel, Next.js, TypeScript, Tailwind CSS, shadcn-style UI, Framer Motion
Backend
Hugging Face Spaces, FastAPI, Python
Retrieval
ChromaDB vectors plus NetworkX graph traversal
Evaluation
Hugging Face Inference API and BERTScore
Optional connector
TigerGraph for enterprise-scale graph storage
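A sketch of how the two retrieval pieces could compose: ChromaDB supplies seed chunks and NetworkX expands them through the entity graph. Collection contents and the chunk_ node naming are assumptions carried over from the earlier sketch.

```python
# Hybrid retrieval: vector seeds, then bounded graph expansion.
import chromadb
import networkx as nx

def retrieve(collection: chromadb.Collection, graph: nx.Graph,
             query: str, k: int = 3, hops: int = 2) -> set[str]:
    # Top-k semantically similar chunks from the vector store.
    seeds = collection.query(query_texts=[query], n_results=k)["ids"][0]
    chunks = set(seeds)
    # Expand each seed through the entity graph, keeping chunk nodes.
    for seed in seeds:
        if seed in graph:
            near = nx.ego_graph(graph, seed, radius=hops).nodes
            chunks |= {n for n in near if str(n).startswith("chunk_")}
    return chunks
```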
Tech Stack

Built from portable, judge-friendly pieces

Frontend

Next.js · TypeScript · Tailwind CSS · shadcn/ui · Framer Motion · Recharts

Backend

Python · FastAPI · datasets · tiktoken · sentence-transformers

Retrieval

ChromaDB · NetworkX

Evaluation

huggingface_hub · evaluate · bert-score

Deployment

Vercel · Hugging Face Spaces · Docker
Architecture Decision

Architectural Pivot: From TigerGraph to NetworkX

TigerGraph was explored as the graph database backend, but operational complexity distracted from the research question.

Authentication issues, Docker complexity, resource overhead, and infrastructure work made TigerGraph a poor default for hackathon judging. NetworkX preserves graph traversal and multi-hop reasoning while running entirely in Python.

The benchmark demonstrates graph-based retrieval methodology, not dependence on a particular graph database.

Capability             | NetworkX | TigerGraph
Multi-hop traversal    | Yes      | Yes
Benchmark ready        | Yes      | Yes
Zero setup             | Yes      | No
Laptop friendly        | Yes      | No
Enterprise scale       | Limited  | Excellent
Required for benchmark | Yes      | No
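One way to keep the backend swappable, sketched with a hypothetical interface:

```python
# Traversal coded against a minimal interface, so the graph backend is
# interchangeable. Protocol and class names here are hypothetical.
from typing import Protocol
import networkx as nx

class GraphStore(Protocol):
    def neighbors_within(self, node: str, hops: int) -> set[str]: ...

class NetworkXStore:
    def __init__(self, graph: nx.Graph) -> None:
        self.graph = graph

    def neighbors_within(self, node: str, hops: int) -> set[str]:
        return set(nx.ego_graph(self.graph, node, radius=hops).nodes)

# A TigerGraphStore implementing the same method (e.g. via a GSQL query)
# would slot in without touching the benchmark logic.
```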
Results

Efficiency, cost, and quality in one view

The dashboard preserves raw answers and context while the landing page summarizes judge-facing metrics.

Data source: Live backend final_summary.json
Pipeline  | Badges                | Judge | BERT F1 | Avg Tokens | Latency | Cost    | Token Reduction | Latency Reduction | Cost Reduction
LLM-only  | Lowest Tokens         | N/A   | N/A     | 365        | 7.29s   | $0.0007 | 0.0%            | 0.0%              | 0.0%
Basic RAG | (none)                | N/A   | N/A     | 748        | 3.52s   | $0.0015 | -104.7%         | 51.7%             | -104.7%
GraphRAG  | Fastest, Best Overall | N/A   | N/A     | 378        | 2.00s   | $0.0008 | -3.4%           | 72.5%             | -3.4%

Dashboard charts: accuracy and semantic similarity · normalized overall profile · average token usage · average latency

Deployment

Free, reproducible, and TigerGraph-free by default

Frontend

Vercel hosts the Next.js dashboard.


Backend

Hugging Face Spaces runs the Dockerized FastAPI API.


Storage

ChromaDB files, JSON artifacts, and NetworkX graph files are bundled with the backend.
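A startup sketch for loading those bundled artifacts; file names and paths are assumptions, apart from final_summary.json, which the dashboard already reads.

```python
# Load the artifacts shipped with the backend. Paths are illustrative.
import json
import chromadb
import networkx as nx

vectors = chromadb.PersistentClient(path="./chroma")      # ChromaDB files
graph = nx.read_gml("./artifacts/graph.gml")              # NetworkX graph file
with open("./artifacts/final_summary.json") as f:         # JSON artifacts
    summary = json.load(f)
```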