What If Document Search Worked Like Navigation?
GraphRAG uses entity extraction to build navigable graphs. What if we used embedding geometry instead? A speculative exploration of topological data analysis (TDA) for document retrieval.
The Core Idea
GraphRAG builds knowledge graphs by extracting entities from text. It works great when you want to find "all documents mentioning John Smith."
But what if we skipped entity extraction and worked directly with embeddings?
GraphRAG needs an LLM to find entities. That's expensive and misses implicit themes—like documents about "startup culture" that don't mention specific companies or people.
The experiment: build the document graph from embedding geometry itself. A query then walks through semantic neighborhoods, following edges toward clusters that match it better.
This is totally speculative, but worth exploring.
Why GraphRAG Works (And What It Might Miss)
How GraphRAG Works
- Extract entities (people, companies, places) from text with an LLM
- Build a knowledge graph where nodes = entities, edges = relationships
- When you search, walk through the entity graph
What it's good at:
- "Find all mentions of John Smith"
- "Show companies connected to this investor"
- Microsoft uses it in production
What it's bad at:
- Costs a lot (LLM calls for every document)
- Misses themes without explicit entities
- LLMs hallucinate fake entities sometimes
The Gap
Imagine documents about "startup culture" that don't mention specific companies. GraphRAG can't connect them—no entities to extract. Vector search just finds similar documents without structure.
What if we built graphs from embedding geometry instead of entities?
How This Might Work
The Energy Function
Instead of raw cosine similarity, combine three factors into a single energy score (lower is better):
E(node) = semantic_distance + structural_penalty + time_decay
- Semantic: How close is this cluster to the query?
- Structural: Does this cluster fit the topic hierarchy?
- Temporal: Penalty for old documents
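A minimal sketch of what this could look like, assuming a networkx-style graph whose nodes carry a centroid, a hierarchy-depth estimate, and an age in years (these fields, the weights, and the exact form of each term are illustrative assumptions, not an established formula):
# Sketch - one possible energy function, purely illustrative
import numpy as np
def energy(query_emb, node, graph, w_sem=1.0, w_struct=0.5, w_time=0.1):
    attrs = graph.nodes[node]
    # Semantic: cosine distance between the query and the cluster centroid
    centroid = attrs['centroid']
    sem = 1.0 - np.dot(query_emb, centroid) / (
        np.linalg.norm(query_emb) * np.linalg.norm(centroid))
    # Structural: penalize clusters far from their expected place in the topic hierarchy
    struct = abs(attrs.get('depth', 0) - attrs.get('expected_depth', 0))
    # Temporal: penalize older clusters (age of the newest member document, in years)
    age = attrs.get('age_years', 0.0)
    return w_sem * sem + w_struct * struct + w_time * age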
The search process:
- Start at nearby clusters (using HNSW index)
- Walk to neighbors with lower energy
- Stop at local minimum
- Return those documents
Does this beat regular vector search? No idea—nobody's tested it.
Building the Document Graph
Use the Mapper algorithm to turn embeddings into a graph:
- Project embeddings to 2D (with UMAP)
- Create overlapping tiles covering the space
- Cluster documents in each tile
- Connect clusters that share documents
# Sketch - assumes `embeddings` is an (n_docs, dim) numpy array; parameters are illustrative
from umap import UMAP
import kmapper as km
from sklearn.cluster import DBSCAN
mapper = km.KeplerMapper()
# Lens: project the embeddings to 2D
lens = UMAP(n_components=2).fit_transform(embeddings)
# Cover the 2D lens with overlapping tiles, then cluster the documents within each tile
graph = mapper.map(lens, embeddings,
                   cover=km.Cover(n_cubes=10, perc_overlap=0.3),
                   clusterer=DBSCAN(eps=0.5, min_samples=3))
# graph['nodes']: cluster id -> member document indices
# graph['links']: connections between clusters that share documents
Warning: Mapper is super sensitive to parameters. Change the resolution slightly and you get a totally different graph. This is a known problem.
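If you want to see this for yourself, a quick check is to rebuild the graph at a few nearby resolutions and compare coarse statistics, reusing `mapper`, `lens`, and `embeddings` from the block above (the settings and the node-count comparison are just illustrative):
# Sketch - rebuild the graph at nearby resolutions and compare coarse statistics
for n_cubes in (8, 10, 12):
    g = mapper.map(lens, embeddings, cover=km.Cover(n_cubes=n_cubes, perc_overlap=0.3))
    print(n_cubes, "tiles ->", len(g['nodes']), "clusters")
# If the node counts (and the edges between them) jump around, the graph is unstable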
Finding Entry Points
Use HNSW (a fast nearest-neighbor index) to find where to start:
# Sketch - assumes 768-dim embeddings and integer doc_ids
import hnswlib
index = hnswlib.Index(space='cosine', dim=768)
index.init_index(max_elements=len(document_embeddings), ef_construction=200, M=16)
index.add_items(document_embeddings, doc_ids)
# Get starting points: ids and distances of the 5 nearest documents
entry_ids, distances = index.knn_query(query_embedding, k=5)
# Then walk through the graph from there
HNSW finds the nearest documents fast. Those hits seed the walk, and the energy function takes over from there.
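One detail the prose glosses over: HNSW returns document ids, while the graph's nodes are clusters, so you need a lookup from documents to the Mapper clusters that contain them (a small sketch, assuming the `graph` dict built above and doc ids that match the embedding row indices):
# Sketch - map each document id to the Mapper cluster(s) containing it
doc_to_clusters = {}
for cluster_id, member_ids in graph['nodes'].items():
    for doc_id in member_ids:
        doc_to_clusters.setdefault(doc_id, []).append(cluster_id)
# entry_ids from HNSW can now be converted into starting cluster nodes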
The Full Picture
# Pseudocode - assumes model, hnsw_index, doc_to_clusters, graph, and energy from above
def search(query_text, k=10):
    query_emb = model.encode(query_text)
    # Start near the query: nearest document ids from HNSW, mapped to their clusters
    entry_ids, _ = hnsw_index.knn_query(query_emb, k=5)
    current = doc_to_clusters[entry_ids[0][0]][0]
    path = [current]
    # Walk to lower-energy neighbors until nothing improves on the current cluster
    while True:
        neighbors = list(graph.neighbors(current))  # networkx-style view of the Mapper output
        if not neighbors:
            break
        best = min(neighbors, key=lambda n: energy(query_emb, n, graph))
        if energy(query_emb, best, graph) >= energy(query_emb, current, graph):
            break
        current = best
        path.append(current)
    return graph.nodes[current]['documents'][:k], path
The query walks through the graph following energy gradients. You can trace the path it took.
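For instance, a caller could trace the walk like this (a hypothetical query, using the search sketch above):
# Sketch - inspect the path a query took through the graph
docs, path = search("how early-stage startups shape engineering culture", k=10)
print("clusters visited:", path)   # the chain of cluster ids the walk followed
print("returned documents:", docs)  # documents from the final (lowest-energy) cluster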
What This Might Add vs GraphRAG
What You Might Gain
- Catches implicit themes - "Startup culture" docs without named entities
- Cheaper - No LLM calls per document, just embeddings
- Deterministic - No hallucination risk
- Shows structure - Reveals clusters and gaps in your corpus
What You Lose
- No entity tracking - Can't say "this mentions John Smith"
- Parameter hell - Mapper is finicky, different settings = different graphs
- Unproven - GraphRAG works in production, this doesn't
- Might be slower - TDA adds overhead
Real talk: This might just be worse. Nobody's tested it.
How You'd Build It
# Pseudocode - won't actually run
class DocumentNavigator:
    def __init__(self, documents, embeddings):
        self.graph = build_mapper_graph(embeddings)
        self.index = build_hnsw_index(embeddings)
    def search(self, query, k=10):
        entry = self.index.find_nearest(query, k=5)
        path = self.walk_graph(query, entry)
        return self.graph.get_docs(path[-1], k), path
Tools: kmapper, hnswlib, umap-learn, networkx
How You'd Test It
- Run on BEIR benchmarks (FiQA, SCIDOCS, NFCorpus)
- Compare to vector search, BM25, GraphRAG
- Test different Mapper parameters—see how much results vary
- Measure latency vs accuracy trade-offs
You'd probably find: Mapper instability kills it, or the overhead isn't worth it, or it just doesn't beat simpler methods.
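A rough sketch of what the first two steps could look like with the beir package (the dataset URL pattern, the rank-based scoring, and the `navigator` instance returning document ids are all assumptions; a real run needs the full pipeline wired up):
# Sketch - evaluate the navigator on one BEIR dataset (nfcorpus as a small example)
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
corpus, queries, qrels = GenericDataLoader(util.download_and_unzip(url, "datasets")).load(split="test")
results = {}
for qid, query in queries.items():
    doc_ids, _path = navigator.search(query, k=10)
    # Turn rank positions into scores (any monotone mapping works for nDCG)
    results[qid] = {doc_id: 1.0 / (rank + 1) for rank, doc_id in enumerate(doc_ids)}
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, [10])
print(ndcg)  # compare against a plain vector-search baseline on the same split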
Honest Comparison: GraphRAG vs TDA Approach vs Vector Search
| Aspect | Vector Search | GraphRAG | TDA Approach (Speculative) |
|---|---|---|---|
| Graph construction | None (flat space) | Entity extraction via LLM | Embedding geometry (Mapper) |
| Indexing cost | Low (just embeddings) | High (LLM per document) | Medium (embeddings + Mapper) |
| Query cost | Very low (ANN search) | Medium (graph traversal) | Medium (graph traversal) |
| Explicit entities | ✗ Not captured | ✓ Strong provenance | ✗ Not captured |
| Implicit themes | ✓ Via embeddings | Limited (needs entities) | ✓ Via geometry |
| Multi-hop reasoning | ✗ Flat similarity | ✓ Entity paths | ✓ Cluster paths |
| Explainability | Low (just scores) | High (entity paths) | Medium (cluster paths) |
| Proven results | ✓ Yes, widely used | ✓ Yes, in production | ✗ No validation |
| Hallucination risk | None | Yes (LLM extraction) | None (deterministic) |
| Hyperparameter sensitivity | Low | Low | High (Mapper params) |
| Best for | "Find similar docs" | "Find John Smith mentions" | "Find semantic region" (?) |
When to Use What
- Vector search: Fast, cheap, works great already
- GraphRAG: Need entity tracking, have LLM budget, proven results
- TDA approach: You're feeling experimental and have time to waste
Could you combine them? Sure, but why add complexity when you don't know if it helps?
Why This Is Interesting (Maybe)
It combines real techniques:
- TDA/Mapper: Used for visualizing high-dimensional data (Singh et al., 2007)
- BERTopic/Top2Vec: Already cluster docs by embeddings successfully
- HNSW: Fast nearest-neighbor search, proven at scale
The leap: What if we navigated through Mapper graphs instead of just clustering?
Does it work? No clue.
Implementation Phases (If You're Brave)
- Build Mapper graph from embeddings
- Add HNSW for entry points
- Implement energy function and graph walking
- Test on BEIR vs vector search and GraphRAG
- Find out it doesn't work (probably)
Tools: kmapper, hnswlib, umap-learn
When This Makes Sense
Hypothetically:
- Implicit themes matter more than entities
- LLM indexing is too expensive
- You're doing research
Reality:
- Just use vector search or GraphRAG
- This is unproven and probably slower
Bottom Line
GraphRAG uses entity extraction to build navigable graphs. This proposes using embedding geometry instead.
Why it might work:
- Captures implicit themes
- Cheaper than LLM calls
- No hallucination
Why it might fail:
- Mapper parameter sensitivity
- Overhead not worth it
- Simple methods already good enough
What to do: Someone should test it on BEIR. Until then, it's just speculation.
If you try it and it fails, publish that—negative results matter.
References
Graph-Based Retrieval
- Microsoft GraphRAG Documentation - Official GraphRAG implementation and docs
- Graph-Based RAG Explained - Weaviate's overview of graph retrieval
Topological Data Analysis
- Topological information retrieval with dilation-invariant bottleneck comparative measures - Monod et al., 2023, Information and Inference: A Journal of the IMA
- Topology and data - Carlsson, 2009, Bulletin of the American Mathematical Society
- Topological methods for the analysis of high dimensional data sets - Singh, Mémoli & Carlsson, 2007
Embedding-Based Document Organization
- BERTopic: Neural topic modeling with a class-based TF-IDF procedure - Grootendorst, 2022
- Top2Vec: Distributed representations of topics - Angelov, 2020
Approximate Nearest Neighbors
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs - Malkov & Yashunin, 2018, IEEE TPAMI
Evaluation Benchmarks
- BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models - Thakur et al., 2021, NeurIPS Datasets and Benchmarks
Disclaimer: This is purely speculative exploration. No empirical validation has been performed. Use established methods (vector search, GraphRAG) for production systems.