
What If Document Search Worked Like Navigation?

GraphRAG uses entity extraction to build navigable graphs. What if we used embedding geometry instead? A speculative exploration of topological data analysis (TDA) for document retrieval.

The Core Idea

GraphRAG builds knowledge graphs by extracting entities from text. It works great when you want to find "all documents mentioning John Smith."

But what if we skipped entity extraction and worked directly with embeddings?

GraphRAG needs an LLM to find entities. That's expensive and misses implicit themes—like documents about "startup culture" that don't mention specific companies or people.

The experiment: Build document graphs from embedding geometry. Your query walks through semantic neighborhoods, following edges between related clusters instead of just ranking a flat list.

This is totally speculative, but worth exploring.

Why GraphRAG Works (And What It Might Miss)

How GraphRAG Works

  1. Extract entities (people, companies, places) from text with an LLM
  2. Build a knowledge graph where nodes = entities, edges = relationships
  3. When you search, walk through the entity graph
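For contrast, here's the kind of structure that pipeline produces: a tiny entity graph built with networkx. The entities, relations, and doc ids are all invented for illustration:

python
# Toy GraphRAG-style entity graph - entities/relations invented
import networkx as nx

g = nx.DiGraph()
g.add_edge("Jane Doe", "Acme Corp", relation="invested_in", source_doc="doc_31")
g.add_edge("John Smith", "Acme Corp", relation="founded", source_doc="doc_12")

# "Show companies connected to this investor" is just a traversal
print(list(g.successors("Jane Doe")))  # ['Acme Corp']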

What it's good at:

  • "Find all mentions of John Smith"
  • "Show companies connected to this investor"
  • Proven at scale (Microsoft uses it in production)

What it's bad at:

  • Costs a lot (LLM calls for every document)
  • Misses themes without explicit entities
  • LLMs hallucinate fake entities sometimes

The Gap

Imagine documents about "startup culture" that don't mention specific companies. GraphRAG can't connect them—no entities to extract. Vector search just finds similar documents without structure.

What if we built graphs from embedding geometry instead of entities?

How This Might Work

The Energy Function

Instead of ranking by cosine similarity alone, score each cluster with an energy that combines three penalties (lower energy = better match):

E(node) = semantic_distance + structural_penalty + time_decay
  • Semantic: cosine distance between the query and the cluster
  • Structural: penalty for clusters that fit poorly in the topic hierarchy
  • Temporal: penalty that grows with document age
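Here's a minimal sketch of what that could look like. The weights and the per-cluster attributes (centroid, depth_penalty, median_age_days) are invented for illustration; nothing about this form is validated:

python
# Sketch energy function - lower energy = better match.
# Weights and cluster attributes are invented for illustration.
import numpy as np

def energy(query_emb, attrs, alpha=1.0, beta=0.3, gamma=0.1):
    # Semantic: cosine distance from query to the cluster centroid
    centroid = attrs['centroid']
    semantic = 1.0 - np.dot(query_emb, centroid) / (
        np.linalg.norm(query_emb) * np.linalg.norm(centroid)
    )
    # Structural: precomputed penalty for poor fit in the topic hierarchy
    structural = attrs.get('depth_penalty', 0.0)
    # Temporal: penalty growing with the cluster's median document age
    temporal = attrs.get('median_age_days', 0.0) / 365.0
    return alpha * semantic + beta * structural + gamma * temporal

The later snippets call this with a cluster's attribute dict.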

The search process:

  1. Start at nearby clusters (using an HNSW index)
  2. Walk to neighbors with lower energy
  3. Stop at local minimum
  4. Return those documents

Does this beat regular vector search? No idea—nobody's tested it.

Building the Document Graph

Use the Mapper algorithm to turn embeddings into a graph:

  1. Project embeddings to 2D (with UMAP)
  2. Create overlapping tiles covering the space
  3. Cluster documents in each tile
  4. Connect clusters that share documents
python
# Sketch - assumes `embeddings` is an (n_docs, dim) NumPy array
import kmapper as km
from umap import UMAP

mapper = km.KeplerMapper()

# Lens: project embeddings to 2D
lens = UMAP(n_components=2, random_state=42).fit_transform(embeddings)

# Cover the lens with overlapping tiles, cluster within each tile,
# and link clusters that share documents
graph = mapper.map(lens, embeddings, cover=km.Cover(n_cubes=15, perc_overlap=0.4))

# graph['nodes']: cluster name -> member document indices
# graph['links']: cluster name -> connected cluster names

Warning: Mapper is super sensitive to parameters. Change the resolution slightly and you get a totally different graph. This is a known problem.
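You can see this for yourself by rerunning the map with a few cover settings and watching the graph change shape. A rough sketch, reusing `mapper`, `lens`, and `embeddings` from the snippet above:

python
# Rough instability check: same data, different cover parameters
import kmapper as km

for n_cubes in (10, 15, 20):
    for overlap in (0.2, 0.4):
        g = mapper.map(lens, embeddings,
                       cover=km.Cover(n_cubes=n_cubes, perc_overlap=overlap))
        n_edges = sum(len(v) for v in g['links'].values())
        print(f"n_cubes={n_cubes}, overlap={overlap}: "
              f"{len(g['nodes'])} clusters, {n_edges} edges")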

Finding Entry Points

Use HNSW (a fast nearest-neighbor index) to find where to start:

python
# Sketch - assumes `document_embeddings` is (n_docs, 768), `doc_ids` an int array
import hnswlib

index = hnswlib.Index(space='cosine', dim=768)
index.init_index(max_elements=len(doc_ids), ef_construction=200, M=16)
index.add_items(document_embeddings, doc_ids)

# knn_query returns (labels, distances); labels[0] holds the k nearest doc ids
labels, distances = index.knn_query(query_embedding, k=5)
entry_points = labels[0]

# Then walk through the graph from there

HNSW finds nearby documents fast; mapping those documents to the clusters that contain them gives you entry points, and from there the energy function takes over.
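That mapping is a detail the snippets gloss over: HNSW returns document ids, but the walk starts from clusters. A sketch, reusing the kmapper graph dict and hnswlib index from above:

python
# Map each doc id to the Mapper clusters containing it
# (tiles overlap, so a document can sit in several clusters)
from collections import defaultdict

doc_to_clusters = defaultdict(list)
for cluster_name, member_ids in graph['nodes'].items():
    for doc_id in member_ids:
        doc_to_clusters[doc_id].append(cluster_name)

# Turn HNSW document hits into candidate entry clusters
labels, _ = index.knn_query(query_embedding, k=5)
entry_clusters = {c for doc_id in labels[0] for c in doc_to_clusters[doc_id]}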

The Full Picture

python
# Sketch - assumes: `model` (sentence encoder), `hnsw_index`, the
# `doc_to_clusters` lookup from above, `energy()`, and the Mapper
# graph loaded into networkx with a 'documents' list on every node
def search(query_text, k=10):
    query_emb = model.encode(query_text)

    # Start at a cluster containing the nearest document
    labels, _ = hnsw_index.knn_query(query_emb, k=5)
    current = doc_to_clusters[labels[0][0]][0]
    path = [current]

    # Greedy descent: move to the lowest-energy neighbor until stuck
    while True:
        neighbors = list(graph.neighbors(current))
        if not neighbors:
            break
        best = min(neighbors, key=lambda n: energy(query_emb, graph.nodes[n]))
        if energy(query_emb, graph.nodes[best]) >= energy(query_emb, graph.nodes[current]):
            break  # local minimum: no neighbor improves on the current cluster
        current = best
        path.append(current)

    return graph.nodes[current]['documents'][:k], path

The query walks through the graph following energy gradients. You can trace the path it took.
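A hypothetical run (the query, cluster names, and the `titles` lookup are all made up):

python
# Hypothetical usage - output names are invented
docs, path = search("how early-stage startups shape company culture", k=5)

print(" -> ".join(path))            # e.g. cube12_cluster0 -> cube13_cluster1
for doc_id in docs:
    print(doc_id, titles[doc_id])   # titles: doc id -> document title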

What This Might Add vs GraphRAG

What You Might Gain

  1. Catches implicit themes - "Startup culture" docs without named entities
  2. Cheaper - No LLM calls per document, just embeddings
  3. Deterministic - No hallucination risk
  4. Shows structure - Reveals clusters and gaps in your corpus

What You Lose

  1. No entity tracking - Can't say "this mentions John Smith"
  2. Parameter hell - Mapper is finicky, different settings = different graphs
  3. Unproven - GraphRAG works in production, this doesn't
  4. Might be slower - TDA adds overhead

Real talk: This might just be worse. Nobody's tested it.

How You'd Build It

python
# Sketch - build_mapper_graph, build_hnsw_index, and walk_graph are
# the pieces from the snippets above, wrapped behind one interface
class DocumentNavigator:
    def __init__(self, documents, embeddings):
        self.documents = documents
        self.graph = build_mapper_graph(embeddings)   # Mapper over embeddings
        self.index = build_hnsw_index(embeddings)     # HNSW for entry points

    def search(self, query, k=10):
        entry = self.index.find_nearest(query, k=5)   # nearest-doc entry points
        path = self.walk_graph(query, entry)          # greedy energy descent
        return self.graph.nodes[path[-1]]['documents'][:k], path

Tools: kmapper, hnswlib, umap-learn, networkx

How You'd Test It

  1. Run on BEIR benchmarks (FiQA, SCIDOCS, NFCorpus)
  2. Compare to vector search, BM25, GraphRAG
  3. Test different Mapper parameters—see how much results vary
  4. Measure latency vs accuracy trade-offs
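For the evaluation half, the beir package handles data loading and metrics. A sketch, assuming `my_search` stands in for whatever retriever you're testing (returning doc-id-to-score dicts); the download URL pattern is the one BEIR publishes:

python
# Sketch: score any retriever on NFCorpus with BEIR's standard metrics
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# results: {query_id: {doc_id: score}} from the method under test
results = {qid: my_search(qtext) for qid, qtext in queries.items()}

ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, [10, 100])
print(ndcg)  # e.g. {'NDCG@10': ..., 'NDCG@100': ...}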

You'd probably find: Mapper instability kills it, or the overhead isn't worth it, or it just doesn't beat simpler methods.

Honest Comparison: GraphRAG vs TDA Approach vs Vector Search

| Aspect | Vector Search | GraphRAG | TDA Approach (Speculative) |
| --- | --- | --- | --- |
| Graph construction | None (flat space) | Entity extraction via LLM | Embedding geometry (Mapper) |
| Indexing cost | Low (just embeddings) | High (LLM per document) | Medium (embeddings + Mapper) |
| Query cost | Very low (ANN search) | Medium (graph traversal) | Medium (graph traversal) |
| Explicit entities | ✗ Not captured | ✓ Strong provenance | ✗ Not captured |
| Implicit themes | ✓ Via embeddings | Limited (needs entities) | ✓ Via geometry |
| Multi-hop reasoning | ✗ Flat similarity | ✓ Entity paths | ✓ Cluster paths |
| Explainability | Low (just scores) | High (entity paths) | Medium (cluster paths) |
| Proven results | ✓ Yes, widely used | ✓ Yes, in production | ✗ No validation |
| Hallucination risk | None | Yes (LLM extraction) | None (deterministic) |
| Hyperparameter sensitivity | Low | Low | High (Mapper params) |
| Best for | "Find similar docs" | "Find John Smith mentions" | "Find semantic region" (?) |

When to Use What

  • Vector search: Fast, cheap, works great already
  • GraphRAG: Need entity tracking, have LLM budget, proven results
  • TDA approach: You're feeling experimental and have time to waste

Could you combine them? Sure, but why add complexity when you don't know if it helps?

Why This Is Interesting (Maybe)

It combines real techniques:

  • TDA/Mapper: Used for visualizing high-dimensional data (Singh et al., 2007)
  • BERTopic/Top2Vec: Already cluster docs by embeddings successfully
  • HNSW: Fast nearest-neighbor search, proven at scale

The leap: What if we navigated through Mapper graphs instead of just clustering?

Does it work? No clue.

Implementation Phases (If You're Brave)

  1. Build Mapper graph from embeddings
  2. Add HNSW for entry points
  3. Implement energy function and graph walking
  4. Test on BEIR vs vector search and GraphRAG
  5. Find out it doesn't work (probably)

Tools: kmapper, hnswlib, umap-learn

When This Makes Sense

Hypothetically:

  • Implicit themes matter more than entities
  • LLM indexing is too expensive
  • You're doing research

Reality:

  • Just use vector search or GraphRAG
  • This is unproven and probably slower

Bottom Line

GraphRAG uses entity extraction to build navigable graphs. This proposes using embedding geometry instead.

Why it might work:

  • Captures implicit themes
  • Cheaper than LLM calls
  • No hallucination

Why it might fail:

  • Mapper parameter sensitivity
  • Overhead not worth it
  • Simple methods already good enough

What to do: Someone should test it on BEIR. Until then, it's just speculation.

If you try it and it fails, publish that—negative results matter.



Disclaimer: This is purely speculative exploration. No empirical validation has been performed. Use established methods (vector search, GraphRAG) for production systems.