What If Document Search Worked Like Navigation?
GraphRAG uses entity extraction to build navigable graphs. What if we used embedding geometry instead? A speculative exploration of topological data analysis (TDA) for document retrieval.
The Core Idea
GraphRAG builds knowledge graphs by extracting entities from text. It works great when you want to find "all documents mentioning John Smith."
But what if we skipped entity extraction and worked directly with embeddings?
GraphRAG needs an LLM to find entities. That's expensive and misses implicit themes—like documents about "startup culture" that don't mention specific companies or people.
The experiment: build the document graph from embedding geometry itself. A query then walks through semantic neighborhoods, following edges toward clusters that match it better.
This is totally speculative, but worth exploring.
Why GraphRAG Works (And What It Might Miss)
How GraphRAG Works
- Extract entities (people, companies, places) from text with an LLM
- Build a knowledge graph where nodes = entities, edges = relationships
- When you search, walk through the entity graph
What it's good at:
- "Find all mentions of John Smith"
- "Show companies connected to this investor"
- Microsoft uses it in production
What it's bad at:
- Costs a lot (LLM calls for every document)
- Misses themes without explicit entities
- LLMs hallucinate fake entities sometimes
The Gap
Imagine documents about "startup culture" that don't mention specific companies. GraphRAG can't connect them—no entities to extract. Vector search just finds similar documents without structure.
What if we built graphs from embedding geometry instead of entities?
How This Might Work
The Energy Function
Instead of raw cosine similarity, combine three factors into a single energy score (lower is better):
E(node) = semantic_distance + structural_penalty + time_decay
- Semantic: How close is this cluster to the query?
- Structural: Does this cluster fit the topic hierarchy?
- Temporal: Penalty for old documents
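A minimal sketch of what this could look like, assuming a networkx-style graph whose nodes carry a centroid, a hierarchy-depth estimate, and an age in years (these fields, the weights, and the exact form of each term are illustrative assumptions, not an established formula):
# Sketch - one possible energy function, purely illustrative
import numpy as np
def energy(query_emb, node, graph, w_sem=1.0, w_struct=0.5, w_time=0.1):
    attrs = graph.nodes[node]
    # Semantic: cosine distance between the query and the cluster centroid
    centroid = attrs['centroid']
    sem = 1.0 - np.dot(query_emb, centroid) / (
        np.linalg.norm(query_emb) * np.linalg.norm(centroid))
    # Structural: penalize clusters far from their expected place in the topic hierarchy
    struct = abs(attrs.get('depth', 0) - attrs.get('expected_depth', 0))
    # Temporal: penalize older clusters (age of the newest member document, in years)
    age = attrs.get('age_years', 0.0)
    return w_sem * sem + w_struct * struct + w_time * age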
The search process:
- Start at nearby clusters (using HNSW index)
- Walk to neighbors with lower energy
- Stop at local minimum
- Return those documents
Does this beat regular vector search? No idea—nobody's tested it.
Building the Document Graph
Use the Mapper algorithm to turn embeddings into a graph:
- Project embeddings to 2D (with UMAP)
- Create overlapping tiles covering the space
- Cluster documents in each tile
- Connect clusters that share documents
# Sketch - assumes `embeddings` is an (n_docs, dim) numpy array; parameters are illustrative
from umap import UMAP
import kmapper as km
from sklearn.cluster import DBSCAN
mapper = km.KeplerMapper()
# Lens: project the embeddings to 2D
lens = UMAP(n_components=2).fit_transform(embeddings)
# Cover the 2D lens with overlapping tiles, then cluster the documents within each tile
graph = mapper.map(lens, embeddings,
                   cover=km.Cover(n_cubes=10, perc_overlap=0.3),
                   clusterer=DBSCAN(eps=0.5, min_samples=3))
# graph['nodes']: cluster id -> member document indices
# graph['links']: connections between clusters that share documents
Warning: Mapper is super sensitive to parameters. Change the resolution slightly and you get a totally different graph. This is a known problem.
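If you want to see this for yourself, a quick check is to rebuild the graph at a few nearby resolutions and compare coarse statistics, reusing `mapper`, `lens`, and `embeddings` from the block above (the settings and the node-count comparison are just illustrative):
# Sketch - rebuild the graph at nearby resolutions and compare coarse statistics
for n_cubes in (8, 10, 12):
    g = mapper.map(lens, embeddings, cover=km.Cover(n_cubes=n_cubes, perc_overlap=0.3))
    print(n_cubes, "tiles ->", len(g['nodes']), "clusters")
# If the node counts (and the edges between them) jump around, the graph is unstable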
Finding Entry Points
Use HNSW (a fast nearest-neighbor index) to find where to start:
# Sketch - assumes 768-dim embeddings and integer doc_ids
import hnswlib
index = hnswlib.Index(space='cosine', dim=768)
index.init_index(max_elements=len(document_embeddings), ef_construction=200, M=16)
index.add_items(document_embeddings, doc_ids)
# Get starting points: ids and distances of the 5 nearest documents
entry_ids, distances = index.knn_query(query_embedding, k=5)
# Then walk through the graph from there
HNSW finds the nearest documents fast. Those hits seed the walk, and the energy function takes over from there.
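One detail the prose glosses over: HNSW returns document ids, while the graph's nodes are clusters, so you need a lookup from documents to the Mapper clusters that contain them (a small sketch, assuming the `graph` dict built above and doc ids that match the embedding row indices):
# Sketch - map each document id to the Mapper cluster(s) containing it
doc_to_clusters = {}
for cluster_id, member_ids in graph['nodes'].items():
    for doc_id in member_ids:
        doc_to_clusters.setdefault(doc_id, []).append(cluster_id)
# entry_ids from HNSW can now be converted into starting cluster nodes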
The Full Picture
# Pseudocode - assumes model, hnsw_index, doc_to_clusters, graph, and energy from above
def search(query_text, k=10):
    query_emb = model.encode(query_text)
    # Start near the query: nearest document ids from HNSW, mapped to their clusters
    entry_ids, _ = hnsw_index.knn_query(query_emb, k=5)
    current = doc_to_clusters[entry_ids[0][0]][0]
    path = [current]
    # Walk to lower-energy neighbors until nothing improves on the current cluster
    while True:
        neighbors = list(graph.neighbors(current))  # networkx-style view of the Mapper output
        if not neighbors:
            break
        best = min(neighbors, key=lambda n: energy(query_emb, n, graph))
        if energy(query_emb, best, graph) >= energy(query_emb, current, graph):
            break
        current = best
        path.append(current)
    return graph.nodes[current]['documents'][:k], path
The query walks through the graph following energy gradients. You can trace the path it took.
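For instance, a caller could trace the walk like this (a hypothetical query, using the search sketch above):
# Sketch - inspect the path a query took through the graph
docs, path = search("how early-stage startups shape engineering culture", k=10)
print("clusters visited:", path)   # the chain of cluster ids the walk followed
print("returned documents:", docs)  # documents from the final (lowest-energy) cluster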
What This Might Add vs GraphRAG
What You Might Gain
- Catches implicit themes - "Startup culture" docs without named entities
- Cheaper - No LLM calls per document, just embeddings
- Deterministic - No hallucination risk
- Shows structure - Reveals clusters and gaps in your corpus
What You Lose
- No entity tracking - Can't say "this mentions John Smith"
- Parameter hell - Mapper is finicky, different settings = different graphs
- Unproven - GraphRAG works in production, this doesn't
- Might be slower - TDA adds overhead
Real talk: This might just be worse. Nobody's tested it.
How You'd Build It
# Pseudocode - won't actually run
class DocumentNavigator:
    def __init__(self, documents, embeddings):
        self.graph = build_mapper_graph(embeddings)
        self.index = build_hnsw_index(embeddings)
    def search(self, query, k=10):
        entry = self.index.find_nearest(query, k=5)
        path = self.walk_graph(query, entry)
        return self.graph.get_docs(path[-1], k), path
Tools: kmapper, hnswlib, umap-learn, networkx
How You'd Test It
- Run on BEIR benchmarks (FiQA, SCIDOCS, NFCorpus)
- Compare to vector search, BM25, GraphRAG
- Test different Mapper parameters—see how much results vary
- Measure latency vs accuracy trade-offs
You'd probably find: Mapper instability kills it, or the overhead isn't worth it, or it just doesn't beat simpler methods.
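A rough sketch of what the first two steps could look like with the beir package (the dataset URL pattern, the rank-based scoring, and the `navigator` instance returning document ids are all assumptions; a real run needs the full pipeline wired up):
# Sketch - evaluate the navigator on one BEIR dataset (nfcorpus as a small example)
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
corpus, queries, qrels = GenericDataLoader(util.download_and_unzip(url, "datasets")).load(split="test")
results = {}
for qid, query in queries.items():
    doc_ids, _path = navigator.search(query, k=10)
    # Turn rank positions into scores (any monotone mapping works for nDCG)
    results[qid] = {doc_id: 1.0 / (rank + 1) for rank, doc_id in enumerate(doc_ids)}
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, [10])
print(ndcg)  # compare against a plain vector-search baseline on the same split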
Honest Comparison: GraphRAG vs TDA Approach vs Vector Search
| Aspect | Vector Search | GraphRAG | TDA Approach (Speculative) |
|---|---|---|---|
| Graph construction | None (flat space) | Entity extraction via LLM | Embedding geometry (Mapper) |
| Indexing cost | Low (just embeddings) | High (LLM per document) | Medium (embeddings + Mapper) |
| Query cost | Very low (ANN search) | Medium (graph traversal) | Medium (graph traversal) |
| Explicit entities | ✗ Not captured | ✓ Strong provenance | ✗ Not captured |
| Implicit themes | ✓ Via embeddings | Limited (needs entities) | ✓ Via geometry |
| Multi-hop reasoning | ✗ Flat similarity | ✓ Entity paths | ✓ Cluster paths |
| Explainability | Low (just scores) | High (entity paths) | Medium (cluster paths) |
| Proven results | ✓ Yes, widely used | ✓ Yes, in production | ✗ No validation |
| Hallucination risk | None | Yes (LLM extraction) | None (deterministic) |
| Hyperparameter sensitivity | Low | Low | High (Mapper params) |
| Best for | "Find similar docs" | "Find John Smith mentions" | "Find semantic region" (?) |
When to Use What
- Vector search: Fast, cheap, works great already
- GraphRAG: Need entity tracking, have LLM budget, proven results
- TDA approach: You're feeling experimental and have time to waste
Could you combine them? Sure, but why add complexity when you don't know if it helps?
Why This Is Interesting (Maybe)
It combines real techniques:
- TDA/Mapper: Used for visualizing high-dimensional data (Singh et al., 2007)
- BERTopic/Top2Vec: Already cluster docs by embeddings successfully
- HNSW: Fast nearest-neighbor search, proven at scale
The leap: What if we navigated through Mapper graphs instead of just clustering?
Does it work? No clue.
Implementation Phases (If You're Brave)
- Build Mapper graph from embeddings
- Add HNSW for entry points
- Implement energy function and graph walking
- Test on BEIR vs vector search and GraphRAG
- Find out it doesn't work (probably)
Tools: kmapper, hnswlib, umap-learn
When This Makes Sense
Hypothetically:
- Implicit themes matter more than entities
- LLM indexing is too expensive
- You're doing research
Reality:
- Just use vector search or GraphRAG
- This is unproven and probably slower
Bottom Line
GraphRAG uses entity extraction to build navigable graphs. This proposes using embedding geometry instead.
Why it might work:
- Captures implicit themes
- Cheaper than LLM calls
- No hallucination
Why it might fail:
- Mapper parameter sensitivity
- Overhead not worth it
- Simple methods already good enough
What to do: Someone should test it on BEIR. Until then, it's just speculation.
If you try it and it fails, publish that—negative results matter.
References
Graph-Based Retrieval
- Microsoft GraphRAG Documentation - Official GraphRAG implementation and docs
- Graph-Based RAG Explained - Weaviate's overview of graph retrieval
Topological Data Analysis
- Topological information retrieval with dilation-invariant bottleneck comparative measures - Monod et al., 2023, Information and Inference: A Journal of the IMA
- Topology and data - Carlsson, 2009, Bulletin of the American Mathematical Society
- Topological methods for the analysis of high dimensional data sets - Singh, Mémoli & Carlsson, 2007
Embedding-Based Document Organization
- BERTopic: Neural topic modeling with a class-based TF-IDF procedure - Grootendorst, 2022
- Top2Vec: Distributed representations of topics - Angelov, 2020
Approximate Nearest Neighbors
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs - Malkov & Yashunin, 2018, IEEE TPAMI
Evaluation Benchmarks
- BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models - Thakur et al., 2021, NeurIPS Datasets and Benchmarks
Disclaimer: This is purely speculative exploration. No empirical validation has been performed. Use established methods (vector search, GraphRAG) for production systems.