We Built an Offline RAG AI That Reads Thousands of Technical Pages

We built a fully offline AI document intelligence system that reads thousands of pages of technical documentation and answers engineering queries, with cached answers returned in milliseconds: no cloud, no internet, and no hallucinations in our tests.

The Problem

Electrical engineers waste hours manually searching the massive IEC 61850 standard documents. Standard RAG systems break on compound queries and conversational follow-ups, and they fail completely in air-gapped environments where no cloud API can be used.

Our Approach

We built a fully offline RAG pipeline with local LLMs served through Ollama, a Qdrant vector database, and a custom double-caching engine. On top of it we engineered parent-child table mapping, multi-intent query splitting, conversational memory with pronoun resolution, and a self-healing streaming validation layer. Everything runs on-premise with zero external calls.
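To make the parent-child table mapping idea concrete, here is a minimal sketch in plain Python. It is not the production code: the class names (`ParentTable`, `ParentChildIndex`), the token-overlap scoring, and the sample rows are all assumptions made for illustration. In the real pipeline the child rows would be embedded and searched in the vector database, with the parent id stored alongside each row.

```python
from dataclasses import dataclass, field


@dataclass
class ParentTable:
    """A whole table, kept intact so the LLM sees complete context."""
    table_id: str
    title: str
    rows: list


@dataclass
class ParentChildIndex:
    """Hypothetical index: search small child rows, return the full parent."""
    parents: dict = field(default_factory=dict)
    # Each child is (row_text, parent_id). A real system would store the
    # row embedding in the vector store with parent_id as metadata.
    children: list = field(default_factory=list)

    def add_table(self, table: ParentTable) -> None:
        self.parents[table.table_id] = table
        for row in table.rows:
            self.children.append((row, table.table_id))

    def retrieve(self, query: str, top_k: int = 1) -> list:
        query_terms = set(query.lower().split())
        # Toy relevance score: number of shared lowercase tokens.
        # This stands in for vector similarity search (an assumption).
        scored = [
            (len(query_terms & set(row.lower().split())), parent_id)
            for row, parent_id in self.children
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)

        parent_ids = []
        for score, parent_id in scored:
            if score > 0 and parent_id not in parent_ids:
                parent_ids.append(parent_id)
            if len(parent_ids) == top_k:
                break
        # Return whole tables, not the individual rows that matched.
        return [self.parents[pid] for pid in parent_ids]


# Illustrative data only.
index = ParentChildIndex()
index.add_table(ParentTable("t1", "Logical nodes",
                            ["XCBR circuit breaker", "MMXU measurement unit"]))
index.add_table(ParentTable("t2", "Sample settings",
                            ["setting alpha default 5", "setting beta default 9"]))
hits = index.retrieve("which node models the circuit breaker")
```

The point of the design is that a single matching row is enough to pull the entire table into the prompt, so the model never has to answer from a row that has lost its headers and neighbouring rows.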

Results

Exact cache hits: under 1 ms

Semantic cache hits: ~10 ms

26-second startup bottleneck reduced to under 1 ms

Conversational query rewriting in under 150 ms

100% precision score across 20 complex technical questions

Zero hallucinations, zero cloud dependency
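The exact and semantic cache tiers reported above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the engine itself: `DoubleLayerCache`, `toy_embed`, and the 0.8 threshold are hypothetical, and the bag-of-words vector is a placeholder for a real embedding model. The sketch makes no claim about reproducing the latencies listed here.

```python
import hashlib
import math
import re
from collections import Counter
from typing import Optional, Tuple


def normalize(query: str) -> str:
    """Lowercase and drop punctuation and extra whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", query.lower()))


def toy_embed(text: str) -> Counter:
    """Bag-of-words vector; a placeholder for a real embedding model."""
    return Counter(normalize(text).split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DoubleLayerCache:
    """Layer 1: hash lookup on the normalized query.
    Layer 2: nearest cached query by cosine similarity."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.exact = {}             # sha256(normalized query) -> answer
        self.semantic = []          # list of (vector, answer)

    def _key(self, query: str) -> str:
        return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()

    def put(self, query: str, answer: str) -> None:
        self.exact[self._key(query)] = answer
        self.semantic.append((toy_embed(query), answer))

    def get(self, query: str) -> Optional[Tuple[str, str]]:
        # Layer 1: constant-time dictionary lookup.
        answer = self.exact.get(self._key(query))
        if answer is not None:
            return ("exact", answer)

        # Layer 2: linear scan here; a vector index would replace this.
        query_vec = toy_embed(query)
        best_score, best_answer = 0.0, None
        for cached_vec, cached_answer in self.semantic:
            score = cosine(query_vec, cached_vec)
            if score > best_score:
                best_score, best_answer = score, cached_answer
        if best_answer is not None and best_score >= self.threshold:
            return ("semantic", best_answer)
        return None  # miss: fall through to the full RAG pipeline


cache = DoubleLayerCache()
cache.put("What is a logical node?", "cached answer")
exact_hit = cache.get("what is a  LOGICAL node")
semantic_hit = cache.get("Explain what is a logical node")
miss = cache.get("How does time synchronization work")
```

Checking the cheap hash lookup first means a repeated question never pays for an embedding call, while the second tier catches rephrasings of questions that have already been answered.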

Tech Stack

Python, FastAPI, React, Ollama, Qwen2.5-7B, Qdrant, SQLite, PyMuPDF, mxbai-embed-large, BAAI/bge-reranker-base, openpyxl, SSE Streaming, Parent-Child RAG, Double-Layer Caching, Multi-Intent Query Splitting, Self-Healing Stream Validation, Offline LLM Inference, Local Vector Search, Cross-Encoder Reranking

Need something similar?

Let's talk about how we can solve your specific problem.