1. The Problem We Were Solving
2. Why Standard RAG Systems Fail Here
3. System Overview — What We Built
5. Ingestion Pipeline — How We Read Documents
7. Conversational Memory — Resolving Pronouns Without Bloat
8. Multi-Intent Query Splitting
9. Self-Healing Stream Validation
11. Hierarchical Reference Chaining
15. What We Learned
Imagine you are an electrical engineer working on a substation automation project. Your job requires you to look up definitions, attributes, and relationships inside the IEC 61850 standard — a massive collection of technical documents spanning dozens of parts, each hundreds of pages long.
The standard describes how power equipment should model itself digitally. Every device exposes a hierarchy of:
Physical Device
└── Logical Device
    └── Logical Node (e.g. XCBR = Circuit Breaker, MMXU = Measurement Unit)
        └── Data Object (e.g. Pos = Position, TotW = Total Active Power)
            └── Common Data Class (e.g. SPS = Single Point Status)
                └── Data Attributes (e.g. stVal, q, t)
A simple question like "What are all the mandatory data attributes of the XCBR logical node?" requires you to:
1. Find the XCBR table in Part 7-4 (pages deep into a PDF)
2. Cross-reference the Common Data Classes in Part 7-3
3. Look up each CDC's attribute list separately
4. Mentally merge everything together
Engineers were doing this manually. Every. Single. Time.
There were three hard constraints that made this non-trivial to solve with off-the-shelf tools:
Constraint 1 — Air-gapped environment. Substations are often offline for security reasons. No internet. No OpenAI API. No cloud anything. The entire system had to run locally on a single machine.
Constraint 2 — Scale. The documents are massive. Standard keyword search (Ctrl+F) doesn't understand meaning. You can't search for "how does position status flow" — it has to be an exact match.
Constraint 3 — Accuracy. Wrong information in a substation context is not just annoying — it's dangerous. Hallucinations were completely unacceptable.
Before we explain what we built, let's understand why a standard off-the-shelf RAG pipeline doesn't cut it.
A basic Retrieval-Augmented Generation pipeline works like this:
User Query
│
▼
Embed query into a vector
│
▼
Search vector DB for similar chunks
│
▼
Stuff top-K chunks into LLM prompt
│
▼
LLM generates answer
Simple, and it works well for unstructured text. But it breaks badly on structured technical documents for several reasons:
Say a user asks:
> "What is the XCBR logical node and what are its Pos data object's attributes?"
This is actually two questions in one. When you embed this into a single vector, the high-level concept ("XCBR logical node") dominates the embedding space. The secondary detail ("Pos data object attributes") gets buried and the retrieval returns chunks about XCBR — but misses the specific attribute table.
The user gets a partial answer and doesn't even know what's missing.
Standard RAG splits documents into chunks of ~500 tokens. When a table gets split:
Chunk A: [Table header row — LN Class: XCBR, Data Object, FC, Presence Condition]
Chunk B: [Row — Pos, ST, M]
Chunk C: [Row — BlkOpn, CO, O]
If Chunk B gets retrieved, it arrives at the LLM without Chunk A. The LLM has no idea what Pos, ST, and M mean — because the headers are missing.
Result: garbled, incomplete, or hallucinated answers.
Standard RAG does not track conversation history. So if a user asks:
> Turn 1: "What is SIMG?"
> Turn 2: "What are its pressure attributes?"
On Turn 2, the RAG system tries to search the vector DB for "What are its pressure attributes?" — but the word "its" means nothing to a vector search. The retrieval returns completely unrelated chunks.
When you run a local LLM on consumer hardware (like an RTX 3060 6GB), GPU memory pressure can cause the streaming response to cut off mid-sentence. The user sees:
> "The XCBR logical node contains the following data objects: Pos, BlkOpn, Bl"
Incomplete. Useless. And there's no built-in recovery mechanism.
These four problems were what we had to engineer our way around. Here's how we did it.
We built a fully offline, air-gapped document intelligence system with a conversational natural language interface.
At a high level, a user types a question, and the system:
1. Rewrites the question if it contains pronouns that reference prior conversation
2. Checks if it has answered this (or something very similar) before, and returns instantly if yes
3. Splits the question if it contains multiple intents
4. Searches the vector database with precise, pre-filtered queries
5. Heals any missing data from the source PDF on the fly
6. Reranks candidates with a cross-encoder
7. Assembles context and runs local LLM inference
8. Streams the answer back, and validates it — silently healing if anything got cut off
Here is the complete query flow:
            User Query
                 │
                 ▼
┌─────────────────────────────────┐
│ Temporal Memory Lookup          │  ← retrieves last 3 turns from SQLite
└─────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────┐
│ Pronoun Bypass Check (regex)    │  ← 0ms if no pronouns detected
└─────────────────────────────────┘
       │                   │
 Pronouns found       No pronouns
       │                   │
       ▼                   │
┌──────────────┐           │
│ Query        │           │
│ Condenser    │           │
│ qwen2.5:3b   │           │
│ (< 150ms)    │           │
└──────────────┘           │
       │                   │
       └─────────┬─────────┘
                 │
                 ▼
    Normalized Standalone Query
                 │
                 ▼
┌─────────────────────────────────┐
│ Double-Caching Check            │
│ ├─ Exact SHA-256   → < 1ms      │
│ └─ Semantic Qdrant → ~10ms      │
└─────────────────────────────────┘
       │                   │
   Cache Hit           Cache Miss
       │                   │
       ▼                   ▼
 Instant Return ┌─────────────────────┐
                │ Multi-Intent        │
                │ Query Splitter      │
                └─────────────────────┘
                      │         │
                   Sub-Q 1   Sub-Q 2
                      │         │
                      ▼         ▼
               Heuristic Router (< 1ms)
                      │         │
                      ▼         ▼
          Pre-filtered Vector Search (Qdrant)
                      │         │
                      └────┬────┘
                           │
                           ▼
               ┌──────────────────────┐
               │ Dynamic Healing      │  ← repairs missing descriptions
               │ Cache Layer          │    from PDF in < 1ms
               └──────────────────────┘
                           │
                           ▼
               ┌──────────────────────┐
               │ Cross-Encoder        │  ← BAAI/bge-reranker-base
               │ Reranker             │
               └──────────────────────┘
                           │
                           ▼
               ┌──────────────────────┐
               │ Context Deduplicator │
               │ & Merger             │
               └──────────────────────┘
                           │
                           ▼
               ┌──────────────────────┐
               │ Local LLM Inference  │  ← Qwen2.5-7B via Ollama
               │ (Qwen 7B, Ollama)    │
               └──────────────────────┘
                           │
                           ▼
               ┌──────────────────────┐
               │ SSE Stream → Client  │
               │ + Self-Heal Check    │  ← validates tokens, auto-recovers
               └──────────────────────┘
Now let's go through each major component in depth.
| Layer | Technology |
|---|---|
| LLM (Primary) | Qwen2.5-7B-Instruct via Ollama |
| LLM (Query Rewriter) | Qwen2.5-3B via Ollama |
| Embeddings | mxbai-embed-large (1024-dim) / nomic-embed-text (768-dim) |
| Vector Database | Qdrant (local file-based or Docker) |
| Exact Cache | SQLite (exact_cache.db) |
| Reranker | BAAI/bge-reranker-base (cross-encoder) |
| PDF Processing | PyMuPDF (fitz) + pymupdf4llm |
| Excel Export | openpyxl |
| Backend | FastAPI + uvicorn |
| Frontend | React (Vite) + TailwindCSS |
| Streaming | Server-Sent Events (SSE) |
Everything runs on a single machine. No external API calls. No internet dependency.
The first challenge is getting the documents into a format that supports accurate retrieval.
A typical RAG system splits documents using a sliding window — every 512 tokens, with some overlap. For prose text, this works reasonably well.
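A minimal sketch of such a sliding-window chunker, for illustration only. The function name is hypothetical, the token list stands in for a real tokenizer's output, and the 64-token overlap is the value this post uses later for prose blocks.

```python
from typing import List


def sliding_window_chunks(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (assumption).
tokens = [f"tok{i}" for i in range(1000)]
chunks = sliding_window_chunks(tokens)
print([len(c) for c in chunks])  # -> [512, 512, 104]
```

Each window shares its first 64 tokens with the tail of the previous one, which keeps sentences that straddle a boundary retrievable from at least one chunk.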
For technical PDFs with tables like this:
| Data Object | FC | Pres. Cond. | Description |
|---|---|---|---|
| Pos | ST | M | Position |
| BlkOpn | CO | O | Block opening |
| CBOpCap | OR | O | CB operating capability |
...naive chunking is catastrophic. A chunk that starts mid-table has no headers, no column names, no context. The vector search matches the right chunk — but the LLM can't make sense of it.
We built a generic layout classifier that reads any technical PDF and separates it into two content types:
Prose blocks — standard text, chunked with a sliding window (512 tokens, 64 overlap).
Table blocks — processed with a parent-child structure:
Table detected
│
├── Parent Chunk (ChunkType.TABLE_PARENT)
│   ├── Table name
│   ├── Column headers
│   ├── Page number
│   └── Section category
│
└── Child Chunks (ChunkType.TABLE_ROW), one per row
    ├── Flat JSON cell contents
    └── parent_id → links back to parent
Example:
For the XCBR table, we'd create:
Parent chunk: {table_name: "XCBR", headers: ["DataObject", "FC", "PresCond", "Description"], page: 47}
Child chunk: {DataObject: "Pos", FC: "ST", PresCond: "M", Description: "Position"}
When a child row is retrieved during search, the system automatically fetches its parent and merges the headers back in before passing context to the LLM. The LLM always sees a complete, coherent table, not an orphaned row.
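The merge step can be sketched roughly as follows. This is an illustration under assumptions, not the production code: `PARENTS` is a hypothetical in-memory stand-in for the vector store lookup, and the payload field names (`parent_id`, `cells`) are invented for the example.

```python
from typing import Dict

# Hypothetical stand-in for fetching parent chunks from the vector store.
PARENTS: Dict[str, dict] = {
    "xcbr-parent": {
        "table_name": "XCBR",
        "headers": ["DataObject", "FC", "PresCond", "Description"],
        "page": 47,
    }
}


def render_row_with_parent(child: dict, parents: Dict[str, dict]) -> str:
    """Rebuild a one-row markdown table from a child chunk and its parent."""
    parent = parents[child["parent_id"]]
    headers = parent["headers"]
    # Order the cells by the parent's header list so columns line up.
    cells = [str(child["cells"].get(h, "")) for h in headers]
    return "\n".join([
        f"Table: {parent['table_name']} (page {parent['page']})",
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
        "| " + " | ".join(cells) + " |",
    ])


child = {
    "parent_id": "xcbr-parent",
    "cells": {"DataObject": "Pos", "FC": "ST", "PresCond": "M", "Description": "Position"},
}
print(render_row_with_parent(child, PARENTS))
```

The LLM then receives the row together with its column names and table title, so `ST` and `M` are no longer bare tokens.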
Instead of random UUIDs for each chunk, we generate deterministic IDs using uuid.uuid5 based on the document name, table name, and page number:
import uuid

# Deterministic ID: the same document, table, and page always produce the same UUID.
chunk_id = uuid.uuid5(
    uuid.NAMESPACE_DNS,
    f"parent_table:{source_file}:{table_name}:{page_number}"
)
This means if you re-run the ingestion pipeline on the same document, it produces the exact same IDs. Qdrant's upsert simply overwrites with identical data — zero extra storage or embedding overhead. The pipeline is completely idempotent.
Raw PDF
   │
   ▼
PyMuPDF Parser → Markdown format
   │
   ▼
Segment Classifier
   │
   ├── Prose Blocks ──────────────────────────────────────┐
   │     └── Sliding window chunker (512t, 64 overlap)    │
   │                                                      │
   └── Table Blocks                                       │
         └── Parent-Child Splitter                        │
               ├── 1 parent chunk per table               │
               └── N child chunks (1 per row)             │
                                                          │
                             ┌────────────────────────────┘
                             │
                             ▼
             uuid5 Hash → unique ID per chunk
                             │
                             ▼
               Embed with mxbai-embed-large
                             │
                             ▼
                Upsert into Qdrant (local)
This is one of the most impactful components we built. Engineering documents get queried repeatedly — the same questions come up again and again across sessions and users.
A full RAG pipeline with local LLM inference takes 12–15 seconds on our hardware. That's acceptable for a first answer. It's painful if you're asking the same question twice.
We built a two-layer cache that sits between the query and the retrieval pipeline.
Every incoming query is normalized (lowercased, stripped of punctuation) and hashed with SHA-256:
import hashlib
import re

def normalize_and_hash(query: str) -> str:
    # Lowercase, trim, and drop punctuation so trivially different spellings share one key.
    clean = query.lower().strip()
    clean = re.sub(r'[^a-z0-9\s]', '', clean)
    return hashlib.sha256(clean.encode()).hexdigest()
The hash is looked up in a SQLite table:
CREATE TABLE exact_cache_chat (
    query_hash TEXT PRIMARY KEY,
    response   TEXT,
    created_at TIMESTAMP
);
If there's a hit → return immediately. Latency: under 1 millisecond.
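The lookup and write paths can be sketched with the standard `sqlite3` module. This is a hedged illustration: `open_cache`, `cache_get`, and `cache_put` are hypothetical helper names, the `DEFAULT CURRENT_TIMESTAMP` clause is an assumption added for convenience, and an in-memory database replaces `exact_cache.db` so the example is self-contained.

```python
import hashlib
import re
import sqlite3
from typing import Optional


def normalize_and_hash(query: str) -> str:
    clean = re.sub(r"[^a-z0-9\s]", "", query.lower().strip())
    return hashlib.sha256(clean.encode()).hexdigest()


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # DEFAULT CURRENT_TIMESTAMP is an assumption, not part of the schema above.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS exact_cache_chat ("
        "query_hash TEXT PRIMARY KEY, response TEXT, "
        "created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def cache_get(conn: sqlite3.Connection, query: str) -> Optional[str]:
    row = conn.execute(
        "SELECT response FROM exact_cache_chat WHERE query_hash = ?",
        (normalize_and_hash(query),),
    ).fetchone()
    return row[0] if row else None


def cache_put(conn: sqlite3.Connection, query: str, response: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO exact_cache_chat (query_hash, response) VALUES (?, ?)",
            (normalize_and_hash(query), response),
        )


conn = open_cache()
cache_put(conn, "What is XCBR?", "XCBR is the circuit breaker logical node.")
print(cache_get(conn, "what is xcbr"))   # hit: same normalized text
print(cache_get(conn, "What is MMXU?"))  # miss: None
```

Because the primary key is the hash, a lookup is a single indexed read, which is what keeps this layer in the sub-millisecond range.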
Example:
Not all repeat questions are worded identically. Someone might ask:
These are semantically identical. The exact cache misses them. The semantic cache catches them.
When a query misses the exact cache:
Query
│
▼
Embed with mxbai-embed-large
│
▼
Search Qdrant semantic cache collection
│
▼
If cosine similarity ≥ 0.99 → return cached response (~10ms)
If similarity < 0.99 → proceed to full RAG pipeline
After any new answer is generated, it gets stored in both the SQLite exact cache and the Qdrant semantic cache.
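To make the threshold logic concrete, here is a small in-memory sketch. It deliberately does not call Qdrant: the `SemanticCache` class is hypothetical, and the three-dimensional vectors are toy stand-ins for real 1024-dimensional embeddings. Only the 0.99 cutoff comes from the description above.

```python
import math
from typing import List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.99  # cutoff quoted in the flow above


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Linear-scan stand-in for a vector collection (illustration only)."""

    def __init__(self, threshold: float = SIMILARITY_THRESHOLD) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, vector: Sequence[float], response: str) -> None:
        self._entries.append((list(vector), response))

    def get(self, vector: Sequence[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine(vector, stored)
            if score > best_score:
                best_score, best_response = score, response
        # Only a near-identical query counts as a hit.
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put([1.0, 0.0, 0.0], "cached XCBR answer")
print(cache.get([0.999, 0.01, 0.0]))  # near-duplicate direction: hit
print(cache.get([0.7, 0.7, 0.0]))     # different direction: None
```

A cutoff this strict trades recall for safety: paraphrases that drift even slightly fall through to the full pipeline rather than risk returning an answer to a different question.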
This is critical. Sometimes the system genuinely doesn't find information — and returns:
> "I could not find this information in the indexed documents."
If we cached this, the next person asking the same question would get a wrong "not found" response even after we added more documents.
So we built a filter:
def is_negative_or_fallback_response(response: str) -> bool:
    negative_patterns = [
        "could not find",
        "not available in the indexed",
        "no information found",
        "unable to locate"
    ]
    return any(p in response.lower() for p in negative_patterns)
Negative responses are never written to cache. Only real, successful answers get stored.
The naive solution for conversational memory is to dump the entire chat history into every LLM prompt. But this causes two problems:
1. Token bloat — an exchange that includes a large markdown table (like a 39-row XCBR attribute table) can easily consume 3,000+ tokens of context. Do this for 5 turns and you've used your entire context window before the LLM even sees the new question.
2. Latency spikes — more input tokens means slower generation.
We solved this with a two-phase approach.
Most follow-up questions don't actually contain pronouns. "What is MMXU?" after "What is XCBR?" is completely standalone — it doesn't need context.
We run a fast regex check on every incoming query:
import re

# Words that suggest the query depends on earlier turns.
PRONOUN_PATTERN = re.compile(
    r'\b(it|its|they|them|their|this|that|these|those|here|there|'
    r'first|second|third|previous|above|former|latter|explain|'
    r'what about|how about)\b',
    re.IGNORECASE
)

def has_pronouns(query: str) -> bool:
    return bool(PRONOUN_PATTERN.search(query))
If no pronouns → the query is standalone → skip to caching/retrieval immediately. Zero overhead.
If pronouns are detected, we fetch the last 3 turns from SQLite (capped at 6 messages: 3 user + 3 assistant) and pass them to qwen2.5:3b, our small, fast rewriter model, with a very specific prompt:
You are a query rewriter. Rewrite the user's follow-up question into a
standalone question that contains no pronouns or references to previous messages.
Output ONLY the rewritten question. Maximum 24 tokens.
Conversation:
User: What is SIMG?
Assistant: SIMG is a logical node for gas insulation monitoring...
User: What are its pressure attributes?
Rewritten:
With max_tokens = 24, this completes in under 150ms.
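Assembling that prompt from stored history might look like the sketch below. The helper name and the `(user, assistant)` tuple layout are assumptions for illustration, and the actual call to the rewriter model is left out.

```python
from typing import List, Tuple

REWRITE_INSTRUCTIONS = (
    "You are a query rewriter. Rewrite the user's follow-up question into a\n"
    "standalone question that contains no pronouns or references to previous messages.\n"
    "Output ONLY the rewritten question. Maximum 24 tokens."
)
MAX_TURNS = 3  # 3 user + 3 assistant messages


def build_rewrite_prompt(history: List[Tuple[str, str]], follow_up: str) -> str:
    """`history` holds (user_message, assistant_message) turns, oldest first."""
    lines = [REWRITE_INSTRUCTIONS, "Conversation:"]
    for user_msg, assistant_msg in history[-MAX_TURNS:]:  # keep only the newest turns
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {assistant_msg}")
    lines.append(f"User: {follow_up}")
    lines.append("Rewritten:")
    return "\n".join(lines)


history = [
    ("What is SIMG?", "SIMG is a logical node for gas insulation monitoring..."),
]
print(build_rewrite_prompt(history, "What are its pressure attributes?"))
```

Ending the prompt on `Rewritten:` nudges the model to emit only the rewritten question, which is what makes the small output cap workable.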
Example:
Now this standalone query goes into the normal vector search pipeline — and retrieves the right chunks.
The crucial part: once we have real source chunks, we only inject the last 1 turn of history into the final LLM prompt. The rewriting resolved the pronoun — we don't need the full conversation history anymore.
Memory Strategy Summary:
For retrieval: rewritten standalone query (no history needed)
For generation: last 1 turn only (minimal context, natural dialogue flow)
Some queries are actually two or three questions disguised as one:
> "What is the XCBR logical node and what are the attributes of its Pos data object?"
If you embed this as a single vector, the high-level XCBR concept dominates. The secondary query about Pos attributes gets poor representation and the retrieval misses the specific CDC attribute table you need.
We detect compound queries by looking for separator patterns:
COMPOUND_SEPARATORS = [
    r',\s*also\b',
    r'\band\s+how\b',
    r'\band\s+what\b',
    r'\band\s+which\b',
    r'\bas\s+well\s+as\b',
    r'\bin\s+addition\b',
]
When detected, the query gets split into independent sub-queries:
"What is XCBR and what are the attributes of its Pos data object?"
│
▼
Sub-Query 1: "What is the XCBR logical node?"
Sub-Query 2: "What are the attributes of the Pos data object?"
Each sub-query then runs through its own independent retrieval and reranking pipeline in parallel. The results are merged using content-hash deduplication (so if both retrieve the same chunk, it appears once).
The merged context is assembled so both questions get equal representation before being passed to the LLM.
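A rough sketch of the split-and-merge idea follows. Several parts are assumptions rather than the system's actual logic: the combined regex uses a lookahead so the question word stays with the second sub-query, the round-robin interleave is one possible reading of "equal representation", and the split does not resolve pronouns inside the second half (note the leftover "its").

```python
import hashlib
import re
from itertools import zip_longest
from typing import List

# Same separators as above, merged into one pattern. The lookahead keeps
# "how/what/which" attached to the sub-query that follows (assumption).
SPLIT_RE = re.compile(
    r",\s*also\b|\band\s+(?=(?:how|what|which)\b)|\bas\s+well\s+as\b|\bin\s+addition\b",
    re.IGNORECASE,
)


def split_compound(query: str) -> List[str]:
    parts = [p.strip(" ,?.") for p in SPLIT_RE.split(query)]
    return [p + "?" for p in parts if p]


def merge_unique(result_sets: List[List[str]]) -> List[str]:
    """Interleave per-sub-query results and drop chunks with identical content."""
    seen, merged = set(), []
    for rank in zip_longest(*result_sets):  # rank 0 of each list, then rank 1, ...
        for text in rank:
            if text is None:
                continue
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                merged.append(text)
    return merged


print(split_compound(
    "What is the XCBR logical node and what are the attributes of its Pos data object?"
))
print(merge_unique([["chunk A", "chunk B"], ["chunk B", "chunk C"]]))
```

Interleaving by rank means the top hit for each sub-query lands near the front of the context, instead of one sub-query's results crowding out the other's.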
Local LLM streaming over SSE (Server-Sent Events) is fragile. On consumer hardware with tight VRAM, a response can stop mid-sentence, or the connection can drop before the done event fires.
The result: the user sees a truncated response. The experience is broken.
We built a three-part self-healing mechanism.
When the LLM finishes generating, the backend appends a special final SSE event:
# Normal stream events
yield f"data: {json.dumps({'type': 'token', 'content': token})}\n\n"
# Final event — includes the full answer for validation
yield f"data: {json.dumps({'type': 'done', 'full_answer': complete_response})}\n\n"
The React frontend accumulates every streamed token:
let accumulatedTokens = "";
let finished = false;

eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "token") {
    accumulatedTokens += data.content;
    updateChatBubble(accumulatedTokens);
  }
  if (data.type === "done") {
    finished = true;
    eventSource.close(); // stop EventSource from auto-reconnecting after the stream ends
    const normalizedAccumulated = normalize(accumulatedTokens);
    const normalizedFull = normalize(data.full_answer);
    if (normalizedAccumulated !== normalizedFull) {
      // Mismatch detected: trigger fallback
      triggerSilentFallback();
    }
  }
};

eventSource.onerror = () => {
  if (finished) return; // connection closed after a clean done event
  // Stream died without a done event: trigger fallback
  eventSource.close();
  triggerSilentFallback();
};
The fallback sends a synchronous POST request to /chat. Because the query was already processed, the answer is already in the semantic cache — it comes back in under 10ms. The chat bubble updates seamlessly.
The user never sees a broken experience.
Stream ends
│
├── done event received
│     ├── tokens match full_answer → ✅ done
│     └── mismatch detected → silent fallback (< 10ms) → update bubble ✅
│
└── error / no done event → silent fallback (< 10ms) → update bubble ✅
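The same decision tree can be expressed in a few lines of Python, for example in a backend test that replays a recorded stream. This is a sketch under assumptions: the frontend's `normalize` is not shown in this post, so whitespace collapsing is used here as a plausible stand-in, and `needs_fallback` is a hypothetical helper name.

```python
import json
import re
from typing import Iterable


def normalize(text: str) -> str:
    # Assumed behaviour: collapse whitespace runs and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def needs_fallback(sse_lines: Iterable[str]) -> bool:
    """Return True when the streamed tokens do not add up to the final answer."""
    accumulated = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and other SSE fields
        event = json.loads(line[len("data: "):])
        if event["type"] == "token":
            accumulated.append(event["content"])
        elif event["type"] == "done":
            return normalize("".join(accumulated)) != normalize(event["full_answer"])
    return True  # stream ended without a done event


def sse(payload: dict) -> str:
    return "data: " + json.dumps(payload)


ok_stream = [
    sse({"type": "token", "content": "Pos, "}),
    sse({"type": "token", "content": "BlkOpn"}),
    sse({"type": "done", "full_answer": "Pos, BlkOpn"}),
]
print(needs_fallback(ok_stream))      # False: tokens match the full answer
print(needs_fallback(ok_stream[:2]))  # True: no done event arrived
```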
Sometimes the vector database has rows with missing descriptions. This can happen because:
Instead of returning an incomplete answer, the system heals itself.
Every row chunk stores its source PDF filename and page number as metadata. When a retrieved row has an empty description field:
Row: {DataObject: "Pos", FC: "ST", PresCond: "M", description: ""}
                                                               ↑
                                                             Empty!
The healing service:
1. Looks up a disk-persistent JSON cache of page mappings (page_cache.json)
2. If the page mapping is cached → extracts the description directly in < 1ms
3. If not → runs PyMuPDF to scan the relevant PDF page, extracts the table, finds the row, and caches the result for next time
After healing, the description is stored persistently — so the next request for the same row is instant.
Before healing cache: 1.5 seconds per missing row (live PDF scan)
After healing cache: < 1 millisecond (from persistent JSON)
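The cache-then-scan pattern can be sketched as below. The `HealingCache` class, its key format, and the `extractor` callback are hypothetical; in the real system the callback would be the PyMuPDF page scan, which is replaced here by a stub so the example runs on its own. The PDF filename is also invented for the demo.

```python
import json
import os
import tempfile
from typing import Callable, Dict


class HealingCache:
    """Disk-backed lookup that only calls the slow extractor on a miss."""

    def __init__(self, path: str, extractor: Callable[[str, int, str], str]) -> None:
        self.path = path
        self.extractor = extractor  # stand-in for the live PDF page scan
        self._data: Dict[str, str] = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self._data = json.load(fh)

    def describe(self, source_file: str, page: int, data_object: str) -> str:
        key = f"{source_file}|{page}|{data_object}"
        if key not in self._data:
            self._data[key] = self.extractor(source_file, page, data_object)
            with open(self.path, "w", encoding="utf-8") as fh:
                json.dump(self._data, fh)  # persist so the next process start is warm
        return self._data[key]


calls = []


def fake_extractor(source_file: str, page: int, data_object: str) -> str:
    calls.append((source_file, page, data_object))
    return "Position"


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "page_cache.json")
    cache = HealingCache(cache_path, fake_extractor)
    first = cache.describe("part-7-4.pdf", 47, "Pos")   # slow path, runs the extractor
    second = cache.describe("part-7-4.pdf", 47, "Pos")  # served from memory
    reloaded = HealingCache(cache_path, fake_extractor).describe("part-7-4.pdf", 47, "Pos")
    print(first, second, reloaded, len(calls))
```

The extractor runs once; the second call and the freshly constructed cache both read the stored value.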
IEC 61850 signals are referenced using structured object paths like:
Mon.SIMG1.Pres.mag.f
This path means:
Mon                  → Logical Device (Monitor)
└── SIMG1            → Logical Node (Gas Insulation Monitoring, instance 1)
    └── Pres         → Data Object (Pressure)
        └── mag      → Common Data Attribute (Magnitude)
            └── f    → Basic Data Attribute (Float value)
Understanding this path requires recursively looking up 5 different levels across multiple standard documents. Engineers were doing this manually.
Our reference chaining engine:
1. Parses the path string into tokens using a structured regex
2. Identifies the Logical Node class (SIMG)
3. Queries Qdrant for the LN table → finds Pres Data Object
4. Follows the Pres CDC reference → retrieves the attribute table
5. Looks up mag.f within that CDC's data attributes
6. Returns a complete, structured deconstruction:
Mon.SIMG1.Pres.mag.f
├── Logical Device: Mon (Monitor)
├── Logical Node: SIMG1 — Gas Insulation Monitoring (IEC 61850-7-4)
├── Data Object: Pres — Pressure measurement [CDC: MV]
├── Data Attribute: mag — Magnitude of the measured value
└── Basic Attribute: f — 32-bit IEEE 754 float (actual pressure value)
What used to take 10–15 minutes of manual cross-referencing now takes seconds.
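The first two steps (tokenizing the path and isolating the Logical Node class) can be sketched like this. The regex is an assumption about naming, namely an optional prefix, a four-letter uppercase class, and an optional instance number; the `ObjectPath` type is hypothetical, and the lookups against the vector database are not shown.

```python
import re
from typing import NamedTuple, Tuple


class ObjectPath(NamedTuple):
    logical_device: str
    ln_prefix: str
    ln_class: str
    ln_instance: str
    data_object: str
    attributes: Tuple[str, ...]


# Assumed LN naming: optional prefix + 4 uppercase letters + optional digits.
LN_RE = re.compile(r"(?P<prefix>[A-Za-z0-9_]*?)(?P<ln_class>[A-Z]{4})(?P<instance>\d*)")


def parse_object_path(path: str) -> ObjectPath:
    parts = path.split(".")
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"expected LD.LN.DO[.attr...], got {path!r}")
    logical_device, logical_node, data_object, *attrs = parts
    match = LN_RE.fullmatch(logical_node)
    if match is None:
        raise ValueError(f"cannot find a logical node class in {logical_node!r}")
    return ObjectPath(
        logical_device,
        match["prefix"],
        match["ln_class"],
        match["instance"],
        data_object,
        tuple(attrs),
    )


print(parse_object_path("Mon.SIMG1.Pres.mag.f"))
```

With the class (`SIMG`) separated from the instance number, each later step has a clean key to search on: the class for the Logical Node table, `Pres` for the Data Object row, and `mag`, `f` for the attribute chain.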
Engineers often need to share query results as formatted tables — for reports, commissioning documentation, or compliance audits.
When the user clicks "Export to Excel", the system:
1. Sends the markdown Q&A response to the LLM with a JSON extraction prompt
2. The LLM outputs a structured JSON schema:
   {
     "title": "XCBR Logical Node Attributes",
     "headers": ["Data Object", "FC", "Presence Condition", "Description"],
     "rows": [
       ["Pos", "ST", "M", "Position of the circuit breaker"],
       ["BlkOpn", "CO", "O", "Block opening"],
       ["CBOpCap", "OR", "O", "CB operating capability"]
     ]
   }
3. openpyxl compiles this into a styled Excel workbook:
   - Header row: Slate Gray background (#1F2937), white bold text
   - Accent cells: Emerald Green (#059669)
   - Zebra row shading: alternating #F9FAFB / white
   - Auto-adjusted column widths
   - Soft cell borders (#E5E7EB)
4. The .xlsx file is streamed back to the client for download
No copy-pasting. No manual formatting. One click.
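Since the table JSON comes from an LLM, it is worth checking its shape before building a workbook from it. The sketch below uses only the standard library and covers two pieces: validating the payload and computing column widths. Both function names, the padding, and the width cap are assumptions; the openpyxl styling step itself is not shown.

```python
import json
from typing import List


def parse_table_payload(raw: str) -> dict:
    """Parse the LLM output and reject anything that is not a rectangular table."""
    payload = json.loads(raw)
    headers, rows = payload.get("headers"), payload.get("rows")
    if not isinstance(payload.get("title"), str):
        raise ValueError("missing title")
    if not isinstance(headers, list) or not isinstance(rows, list):
        raise ValueError("headers and rows must be lists")
    for i, row in enumerate(rows):
        if not isinstance(row, list) or len(row) != len(headers):
            raise ValueError(f"row {i} does not have {len(headers)} cells")
    return payload


def column_widths(payload: dict, padding: int = 2, max_width: int = 60) -> List[int]:
    """Width per column: longest cell or header, plus padding, capped at max_width."""
    widths = []
    for col, header in enumerate(payload["headers"]):
        longest = max([len(str(header))] + [len(str(row[col])) for row in payload["rows"]])
        widths.append(min(longest + padding, max_width))
    return widths


raw = json.dumps({
    "title": "XCBR Logical Node Attributes",
    "headers": ["Data Object", "FC", "Presence Condition", "Description"],
    "rows": [
        ["Pos", "ST", "M", "Position of the circuit breaker"],
        ["BlkOpn", "CO", "O", "Block opening"],
        ["CBOpCap", "OR", "O", "CB operating capability"],
    ],
})
table = parse_table_payload(raw)
print(column_widths(table))  # -> [13, 4, 20, 33]
```

A payload with a ragged row fails fast with a `ValueError` instead of producing a misaligned spreadsheet.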
All benchmarks on: Intel Core i7, NVIDIA RTX 3060 Laptop GPU (6GB VRAM)
| Operation | Before optimization | After optimization |
|---|---|---|
| Exact cache hit | 12–15 seconds (full pipeline) | < 1 millisecond |
| Semantic cache hit | 12–15 seconds | ~10 milliseconds |
| Pronoun query rewriting | 8–12 seconds (7B model) | < 150 milliseconds |
| PDF TOC startup scan | 26.4 seconds | < 1 millisecond |
| Missing attribute healing | 1.5 seconds per row | < 1 millisecond |
| Query reranking | 26.5–39.0 seconds | ~2.5–3.7 seconds |
| Full answer (warm, cache miss) | — | 12–15 seconds |
| Full answer (cold start, new PDF) | — | 40–90 seconds (one-time) |
We ran a comprehensive 20-question test suite covering:
Result: 100% correct, complete, verified answers. Zero hallucinations. Zero omissions.
For real-world substation deployment, a single laptop running everything isn't practical. We designed a Local Compute Client-Server model:
[Air-Gapped Substation LAN]
│
├── Substation Workstation (Server)
│   ├── GPU: RTX 4000 SFF Ada (20GB) or RTX 4090 (24GB)
│   ├── Hosts: FastAPI backend + Ollama + Qdrant
│   └── Warm query latency: < 2 seconds (high tensor core bandwidth)
│
├── Engineer Laptop 1 ──► Browser → HTTP → Workstation
├── Engineer Laptop 2 ──► Browser → HTTP → Workstation
└── Substation Gateway PC ──► REST API → Workstation
The field engineers don't need a GPU. They access the system through any web browser over the local substation LAN. All heavy computation happens on one dedicated workstation.
With a proper GPU (20–24GB VRAM vs the 6GB we benchmarked on), total conversational latency drops to under 2 seconds for warm queries — which feels near-instant in practice.
A few things that weren't obvious at the start:
1. Table structure is everything. The parent-child mapping was the single biggest accuracy improvement. Naive chunking on technical tables gives wrong results no matter how good your embedding model is.
2. Caching is not optional at local LLM speeds. On consumer hardware, a cache miss costs 12–15 seconds per query, and that is the best the full pipeline can do. Without a cache layer, a multi-user deployment would be unusable. The double-cache cut response time for repeat queries by more than 99% (from 12–15 seconds to about 10 ms or less, per the benchmarks above).
3. Small models can do specific jobs extremely well. Using qwen2.5:3b for query condensation only — with a strict 24-token output cap — gave us sub-150ms rewrites without burning VRAM on the primary task.
4. Stream validation is a real production concern. We discovered this the hard way. On a 6GB GPU running a 7B model, stream truncation happened often enough to be a real UX problem. The self-healing validation eliminated it entirely.
5. Generic parsers beat domain-specific ones. We deliberately built the layout classifier to have no knowledge of IEC 61850 specifically. It works on any technical PDF. This decision paid off immediately — adding new document types required zero code changes.
This project started as a solution to a very specific engineering bottleneck — and turned into a solid foundation for offline document intelligence in any air-gapped, privacy-sensitive domain.
The architecture we built handles:
And it does all of this on a single consumer GPU, with no internet dependency.
If you are working on a RAG system, document intelligence pipeline, or any AI agent that needs to run privately and offline — we would be glad to talk.
Visit daemlabs.com or reach out directly.
We build what others think is hard.
© DaemLabs · AI Agent Automation Agency