
GraphRAG Benchmark: Proving Graph-Based Retrieval at Scale

A 2-million-token benchmark comparing LLM-only, Basic RAG, and NetworkX GraphRAG pipelines on long-form scientific papers.

2M Tokens · 3 Pipelines · 40 Evaluation Questions · LLM Judge + BERTScore · Vercel + Hugging Face Spaces
Corpus scale
2,004,563 tokens
Retrieval strategy
Vector similarity + graph traversal
Evaluation
Independent judge and semantic validator
Outcome
Efficiency gains without sacrificing accuracy
Problem

Why GraphRAG matters

The benchmark asks whether graph-structured retrieval can keep answers grounded while reducing context bloat.

LLM-only

Fast to wire up, but answers are ungrounded and the model can hallucinate when it lacks source context.

Basic RAG

Retrieves semantically similar chunks, but often misses relationships spread across documents or entities.

GraphRAG

Traverses entity and chunk relationships, enabling compact context and multi-hop reasoning over the corpus.
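The idea in miniature, as a hedged NetworkX sketch: seed an entity, walk a bounded neighborhood, and keep only the chunks it reaches. Node names and the chunk_ prefix are illustrative, not the benchmark's actual schema.

```python
# Toy entity/chunk graph; names are illustrative assumptions.
import networkx as nx

G = nx.Graph()
G.add_edge("BERT", "chunk_12", relation="mentioned_in")
G.add_edge("BERT", "GLUE", relation="evaluated_on")
G.add_edge("GLUE", "chunk_47", relation="mentioned_in")

def multi_hop_chunks(graph: nx.Graph, seed: str, hops: int = 2) -> list[str]:
    """Return chunk IDs reachable within `hops` of a seed entity."""
    reachable = nx.ego_graph(graph, seed, radius=hops)
    return [n for n in reachable.nodes if str(n).startswith("chunk_")]

print(multi_hop_chunks(G, "BERT"))  # chunk_12 directly, chunk_47 via GLUE
```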

Corpus

A scientific-paper dataset with dense relationships

Scientific papers naturally contain methods, datasets, tasks, assumptions, citations, and concepts that reward graph retrieval.

The corpus is large enough to stress token budgets and rich enough to expose the difference between nearest-neighbor retrieval and graph-aware traversal.

Dataset
armanc/scientific_papers (arxiv)
Domain
Scientific Research Papers
Total tokens
2,004,563
Papers selected
220
Evaluation questions
40
Source
Hugging Face arXiv split
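For orientation, a sketch of how the corpus statistics above could be reproduced. The cl100k_base encoding, the train split, and taking the first 220 papers are assumptions rather than documented choices.

```python
# Stream the arXiv config of armanc/scientific_papers and count tokens.
# Encoding, split, and the "first 220 papers" cutoff are assumptions.
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("cl100k_base")
ds = load_dataset("armanc/scientific_papers", "arxiv",
                  split="train", streaming=True)

total, papers = 0, 0
for record in ds:
    total += len(enc.encode(record["article"]))
    papers += 1
    if papers == 220:
        break

print(f"{papers} papers, {total:,} tokens")
```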
Benchmark Method

Same questions, three retrieval strategies

Every evaluation question is run through all three pipelines and scored with identical metrics.

Pipeline  | Retrieval strategy         | Reasoning                     | Strength                             | Limitation
LLM-only  | Direct prompt              | General parametric reasoning  | Simple baseline                      | No retrieval grounding
Basic RAG | ChromaDB similarity search | Single-hop semantic retrieval | Strong factual grounding             | Can miss entity relationships
GraphRAG  | NetworkX graph traversal   | Multi-hop entity context      | Compact, relationship-aware evidence | Depends on graph quality
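The harness this table implies is small. A sketch with hypothetical pipeline callables and result fields:

```python
# Hypothetical harness: identical questions and metrics for every pipeline.
# `answer_fn` stands in for each pipeline's entry point; field names are assumptions.
from typing import Callable

def run_benchmark(
    questions: list[dict],
    pipelines: dict[str, Callable[[str], dict]],
) -> list[dict]:
    rows = []
    for q in questions:
        for name, answer_fn in pipelines.items():
            # e.g. result = {"answer": ..., "tokens": ..., "latency_s": ...}
            result = answer_fn(q["question"])
            rows.append({"pipeline": name, "reference": q["reference"], **result})
    return rows
```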
Evaluation

Two independent quality checks

Token reduction only matters when answers remain correct, grounded, and semantically aligned with references.

LLM-as-a-Judge

A hosted Hugging Face model independently grades each answer PASS or FAIL against the reference answer.
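A minimal judge call might look like the following; the prompt template is an assumption, and only the model name (noted below) comes from this page.

```python
# PASS/FAIL grading via the Hugging Face Inference API.
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Llama-3.1-8B-Instruct")

def judge(question: str, reference: str, candidate: str) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word, PASS or FAIL."
    )
    out = client.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=5
    )
    return "PASS" if "PASS" in out.choices[0].message.content.upper() else "FAIL"
```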

BERTScore

Semantic similarity scoring with rescaled F1 credits correct paraphrases, reducing dependence on exact wording.

Judge model: meta-llama/Llama-3.1-8B-Instruct. Evaluation set: 40 grounded questions.
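For reference, rescaled F1 via the bert-score package, with illustrative strings:

```python
# Rescaled BERTScore F1; rescale_with_baseline maps raw similarity
# onto a more interpretable range.
from bert_score import score

candidates = ["Graph traversal gathers related entities across papers."]
references = ["GraphRAG follows entity links to collect cross-paper context."]

P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"rescaled F1: {F1.mean().item():.3f}")
```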
System Design

A lightweight architecture judges can reproduce

The production deployment works fully without TigerGraph. NetworkX is the canonical GraphRAG engine.

Frontend
Vercel, Next.js, TypeScript, Tailwind CSS, shadcn-style UI, Framer Motion
Backend
Hugging Face Spaces, FastAPI, Python
Retrieval
ChromaDB vectors plus NetworkX graph traversal
Evaluation
Hugging Face Inference API and BERTScore
Optional connector
TigerGraph for enterprise-scale graph storage
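A sketch of how the two retrieval pieces could compose: ChromaDB supplies seed chunks and NetworkX expands them through the entity graph. Collection contents and the chunk_ node naming are assumptions carried over from the earlier sketch.

```python
# Hybrid retrieval: vector seeds, then bounded graph expansion.
import chromadb
import networkx as nx

def retrieve(collection: chromadb.Collection, graph: nx.Graph,
             query: str, k: int = 3, hops: int = 2) -> set[str]:
    # Top-k semantically similar chunks from the vector store.
    seeds = collection.query(query_texts=[query], n_results=k)["ids"][0]
    chunks = set(seeds)
    # Expand each seed through the entity graph, keeping chunk nodes.
    for seed in seeds:
        if seed in graph:
            near = nx.ego_graph(graph, seed, radius=hops).nodes
            chunks |= {n for n in near if str(n).startswith("chunk_")}
    return chunks
```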
Tech Stack

Built from portable, judge-friendly pieces

Frontend

Next.js · TypeScript · Tailwind CSS · shadcn/ui · Framer Motion · Recharts

Backend

Python · FastAPI · datasets · tiktoken · sentence-transformers

Retrieval

ChromaDB · NetworkX

Evaluation

huggingface_hub · evaluate · bert-score

Deployment

Vercel · Hugging Face Spaces · Docker
Architecture Decision

Architectural Pivot: From TigerGraph to NetworkX

TigerGraph was explored as the graph database backend, but operational complexity distracted from the research question.

Authentication issues, Docker complexity, resource overhead, and infrastructure work made TigerGraph a poor default for hackathon judging. NetworkX preserves graph traversal and multi-hop reasoning while running entirely in Python.

The benchmark demonstrates graph-based retrieval methodology, not dependence on a particular graph database.

Capability             | NetworkX | TigerGraph
Multi-hop traversal    | Yes      | Yes
Benchmark ready        | Yes      | Yes
Zero setup             | Yes      | No
Laptop friendly        | Yes      | No
Enterprise scale       | Limited  | Excellent
Required for benchmark | Yes      | No
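One way to keep the backend swappable, sketched with a hypothetical interface:

```python
# Traversal coded against a minimal interface, so the graph backend is
# interchangeable. Protocol and class names here are hypothetical.
from typing import Protocol
import networkx as nx

class GraphStore(Protocol):
    def neighbors_within(self, node: str, hops: int) -> set[str]: ...

class NetworkXStore:
    def __init__(self, graph: nx.Graph) -> None:
        self.graph = graph

    def neighbors_within(self, node: str, hops: int) -> set[str]:
        return set(nx.ego_graph(self.graph, node, radius=hops).nodes)

# A TigerGraphStore implementing the same method (e.g. via a GSQL query)
# would slot in without touching the benchmark logic.
```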
Results

Efficiency, cost, and quality in one view

The dashboard preserves raw answers and context while the landing page summarizes judge-facing metrics.

Data source: Live backend final_summary.json
Pipeline  | Badges                | Judge | BERT F1 | Avg Tokens | Latency | Cost    | Token Reduction | Latency Reduction | Cost Reduction
LLM-only  | Lowest Tokens         | N/A   | N/A     | 365        | 7.29s   | $0.0007 | 0.0%            | 0.0%              | 0.0%
Basic RAG | (none)                | N/A   | N/A     | 748        | 3.52s   | $0.0015 | -104.7%         | 51.7%             | -104.7%
GraphRAG  | Fastest, Best Overall | N/A   | N/A     | 378        | 2.00s   | $0.0008 | -3.4%           | 72.5%             | -3.4%

Dashboard charts: accuracy and semantic similarity · normalized overall profile · average token usage · average latency

Deployment

Free, reproducible, and TigerGraph-free by default

Frontend

Vercel hosts the Next.js dashboard.


Backend

Hugging Face Spaces runs the Dockerized FastAPI API.


Storage

ChromaDB files, JSON artifacts, and NetworkX graph files are bundled with the backend.
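A startup sketch for loading those bundled artifacts; file names and paths are assumptions, apart from final_summary.json, which the dashboard already reads.

```python
# Load the artifacts shipped with the backend. Paths are illustrative.
import json
import chromadb
import networkx as nx

vectors = chromadb.PersistentClient(path="./chroma")      # ChromaDB files
graph = nx.read_gml("./artifacts/graph.gml")              # NetworkX graph file
with open("./artifacts/final_summary.json") as f:         # JSON artifacts
    summary = json.load(f)
```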