1. The Problem We Were Solving

Imagine you are an electrical engineer working on a substation automation project. Your job requires you to look up definitions, attributes, and relationships inside the IEC 61850 standard, a massive collection of technical documents spanning dozens of parts, each hundreds of pages long.

The standard describes how power equipment should model itself digitally. Every device exposes the following hierarchy:


Physical Device
  └── Logical Device
        └── Logical Node (e.g. XCBR = Circuit Breaker, MMXU = Measurement Unit)
              └── Data Object (e.g. Pos = Position, TotW = Total Active Power)
                    └── Common Data Class (e.g. SPS = Single Point Status)
                          └── Data Attributes (e.g. stVal, q, t)

A simple question like "What are all the mandatory data attributes of the XCBR logical node?" requires you to:

1. Find the XCBR table in Part 7-4 (buried deep inside a PDF)

2. Cross-reference the Common Data Classes in Part 7-3

3. Look up each CDC's attribute list separately

4. Mentally merge everything together

Engineers were doing this manually. Every. Single. Time.

There were three hard constraints that made this non-trivial to solve with off-the-shelf tools:

Constraint 1 — Air-gapped environment. Substations are often offline for security reasons. No internet. No OpenAI API. No cloud anything. The entire system had to run locally on a single machine.

Constraint 2 — Scale. The documents are massive. Standard keyword search (Ctrl+F) doesn't understand meaning. You can't search for "how does position status flow" — it has to be an exact match.

Constraint 3 — Accuracy. Wrong information in a substation context is not just annoying — it's dangerous. Hallucinations were completely unacceptable.

2. Why Standard RAG Systems Fail Here

Before we explain what we built, let's understand why a standard off-the-shelf RAG pipeline doesn't cut it.

What standard RAG does

A basic Retrieval-Augmented Generation pipeline works like this:


User Query
    │
    ▼
Embed query into a vector
    │
    ▼
Search vector DB for similar chunks
    │
    ▼
Stuff top-K chunks into LLM prompt
    │
    ▼
LLM generates answer

Simple, and it works well for unstructured text. But it breaks badly on structured technical documents for several reasons:

Problem 1 — Compound queries drown out details

Say a user asks:

> "What is the XCBR logical node and what are its Pos data object's attributes?"

This is actually two questions in one. When you embed it as a single vector, the high-level concept ("XCBR logical node") dominates the embedding. The secondary detail ("Pos data object attributes") gets buried, so retrieval returns general chunks about XCBR but misses the specific attribute table.

The user gets a partial answer and doesn't even know what's missing.

Problem 2 — Tables lose context when chunked

Standard RAG splits documents into chunks of ~500 tokens. When a table gets split:


Chunk A: [Table header row — LN Class: XCBR, Data Object, FC, Presence Condition]
Chunk B: [Row — Pos, ST, M]
Chunk C: [Row — BlkOpn, CO, O]

If Chunk B gets retrieved, it arrives at the LLM without Chunk A. The LLM has no idea what Pos, ST, and M mean — because the headers are missing.

Result: garbled, incomplete, or hallucinated answers.

Problem 3 — Follow-up questions break retrieval

Standard RAG does not track conversation history. So if a user asks:

> Turn 1: "What is SIMG?"

> Turn 2: "What are its pressure attributes?"

On Turn 2, the RAG system tries to search the vector DB for "What are its pressure attributes?" — but the word "its" means nothing to a vector search. The retrieval returns completely unrelated chunks.

Problem 4 — Streaming truncation

When you run a local LLM on consumer hardware (like an RTX 3060 Laptop GPU with 6GB of VRAM), GPU memory pressure can cause the streaming response to cut off mid-sentence. The user sees:

> "The XCBR logical node contains the following data objects: Pos, BlkOpn, Bl"

Incomplete. Useless. And there's no built-in recovery mechanism.

These four problems were what we had to engineer our way around. Here's how we did it.

3. System Overview — What We Built

We built a fully offline, air-gapped document intelligence system with a conversational natural language interface.

At a high level, a user types a question, and the system:

1. Checks if it has answered this (or something very similar) before — returns instantly if yes

2. Rewrites the question if it contains pronouns that reference prior conversation

3. Splits the question if it contains multiple intents

4. Searches the vector database with precise, pre-filtered queries

5. Heals any missing data from the source PDF on the fly

6. Reranks candidates with a cross-encoder

7. Assembles context and runs local LLM inference

8. Streams the answer back, and validates it — silently healing if anything got cut off

Here is the complete query flow:


User Query
    │
    ▼
┌─────────────────────────────────┐
│   Temporal Memory Lookup        │  ← retrieves last 3 turns from SQLite
└─────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────┐
│   Pronoun Bypass Check (regex)  │  ← 0ms if no pronouns detected
└─────────────────────────────────┘
    │                    │
  Pronouns found     No pronouns
    │                    │
    ▼                    │
┌──────────────┐         │
│ Query        │         │
│ Condenser    │         │
│ qwen2.5:3b   │         │
│ (< 150ms)    │         │
└──────────────┘         │
    │                    │
    └──────────┬─────────┘
               │
               ▼
    Normalized Standalone Query
               │
               ▼
┌─────────────────────────────────┐
│   Double-Caching Check          │
│   ├─ Exact SHA-256  → < 1ms    │
│   └─ Semantic Qdrant → ~10ms   │
└─────────────────────────────────┘
    │                    │
  Cache Hit          Cache Miss
    │                    │
    ▼                    ▼
Instant Return   ┌─────────────────────┐
                 │ Multi-Intent        │
                 │ Query Splitter      │
                 └─────────────────────┘
                      │         │
                 Sub-Q 1    Sub-Q 2
                      │         │
                      ▼         ▼
               Heuristic Router (< 1ms)
                      │         │
                      ▼         ▼
               Pre-filtered Vector Search (Qdrant)
                      │         │
                      └────┬────┘
                           │
                           ▼
                ┌──────────────────────┐
                │ Dynamic Healing      │  ← repairs missing descriptions
                │ Cache Layer          │     from PDF in < 1ms
                └──────────────────────┘
                           │
                           ▼
                ┌──────────────────────┐
                │ Cross-Encoder        │  ← BAAI/bge-reranker-base
                │ Reranker             │
                └──────────────────────┘
                           │
                           ▼
                ┌──────────────────────┐
                │ Context Deduplicator │
                │ & Merger             │
                └──────────────────────┘
                           │
                           ▼
                ┌──────────────────────┐
                │ Local LLM Inference  │  ← Qwen2.5-7B via Ollama
                │ (Qwen 7B, Ollama)    │
                └──────────────────────┘
                           │
                           ▼
                ┌──────────────────────┐
                │ SSE Stream → Client  │
                │ + Self-Heal Check    │  ← validates tokens, auto-recovers
                └──────────────────────┘

Now let's go through each major component in depth.

4. The Tech Stack

| Layer | Technology |
| --- | --- |
| LLM (Primary) | Qwen2.5-7B-Instruct via Ollama |
| LLM (Query Rewriter) | Qwen2.5-3B via Ollama |
| Embeddings | mxbai-embed-large (1024-dim) / nomic-embed-text (768-dim) |
| Vector Database | Qdrant (local file-based or Docker) |
| Exact Cache | SQLite (`exact_cache.db`) |
| Reranker | BAAI/bge-reranker-base (cross-encoder) |
| PDF Processing | PyMuPDF (fitz) + pymupdf4llm |
| Excel Export | openpyxl |
| Backend | FastAPI + uvicorn |
| Frontend | React (Vite) + TailwindCSS |
| Streaming | Server-Sent Events (SSE) |

Everything runs on a single machine. No external API calls. No internet dependency.

5. Ingestion Pipeline — How We Read Documents

The first challenge is getting the documents into a format that supports accurate retrieval.

Why naive chunking fails on technical PDFs

A typical RAG system splits documents using a sliding window — every 512 tokens, with some overlap. For prose text, this works reasonably well.

For technical PDFs with tables like this:

| Data Object | FC | Pres. Cond. | Description |
| --- | --- | --- | --- |
| Pos | ST | M | Position |
| BlkOpn | CO | O | Block opening |
| CBOpCap | OR | O | CB operating capability |

...naive chunking is catastrophic. A chunk that starts mid-table has no headers, no column names, no context. The vector search matches the right chunk — but the LLM can't make sense of it.

Our solution: Parent-Child Table Mapping

We built a generic layout classifier that reads any technical PDF and separates it into two content types:

Prose blocks — standard text, chunked with a sliding window (512 tokens, 64 overlap).

Table blocks — processed with a parent-child structure:


Table detected
    │
    ├── Parent Chunk (ChunkType.TABLE_PARENT)
    │       ├── Table name
    │       ├── Column headers
    │       ├── Page number
    │       └── Section category
    │
    └── Child Chunks (ChunkType.TABLE_ROW) — one per row
            ├── Flat JSON cell contents
            └── parent_id → links back to parent

Example:

For the XCBR table, we'd create:

1 parent chunk: {table_name: "XCBR", headers: ["DataObject", "FC", "PresCond", "Description"], page: 47}

39 child chunks: one per row, e.g. {DataObject: "Pos", FC: "ST", PresCond: "M", Description: "Position"}

When a child row is retrieved during search, the system automatically fetches its parent and merges the headers back in before passing context to the LLM. The LLM always sees a complete, coherent table — not an orphaned row.
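
The merge step can be pictured with a minimal sketch. The class names, field names, and the markdown rendering here are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass


@dataclass
class ParentChunk:
    chunk_id: str
    table_name: str
    headers: list[str]
    page: int


@dataclass
class RowChunk:
    parent_id: str
    cells: dict[str, str]


def render_row_with_parent(row: RowChunk, parents: dict[str, ParentChunk]) -> str:
    """Rebuild a one-row markdown table: parent headers plus the retrieved row."""
    parent = parents[row.parent_id]
    header_line = "| " + " | ".join(parent.headers) + " |"
    divider = "| " + " | ".join("---" for _ in parent.headers) + " |"
    # Missing cells become empty strings so the columns stay aligned.
    values = [row.cells.get(header, "") for header in parent.headers]
    row_line = "| " + " | ".join(values) + " |"
    title = f"Table: {parent.table_name} (page {parent.page})"
    return "\n".join([title, header_line, divider, row_line])


parents = {
    "p1": ParentChunk("p1", "XCBR", ["DataObject", "FC", "PresCond", "Description"], 47)
}
row = RowChunk(
    "p1",
    {"DataObject": "Pos", "FC": "ST", "PresCond": "M", "Description": "Position"},
)
context = render_row_with_parent(row, parents)
```

The point of the design is that the row never travels alone: whatever format the context takes, the table name and column headers are attached before the LLM sees it.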

Idempotent ingestion with uuid5 hashing

Instead of random UUIDs for each chunk, we generate deterministic IDs using uuid.uuid5 based on the document name, table name, and page number:


import uuid

# Same namespace + same name string always yields the same UUID
chunk_id = uuid.uuid5(
    uuid.NAMESPACE_DNS,
    f"parent_table:{source_file}:{table_name}:{page_number}"
)

This means if you re-run the ingestion pipeline on the same document, it produces the exact same IDs. Qdrant's upsert simply overwrites with identical data — zero extra storage or embedding overhead. The pipeline is completely idempotent.
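
To make the idempotency concrete, here is a small sketch. The parent key format matches the snippet above; the file name and the `row_chunk_id` scheme (deriving row IDs from the parent ID and a row index) are assumptions for illustration only:

```python
import uuid


def parent_chunk_id(source_file: str, table_name: str, page_number: int) -> uuid.UUID:
    # Same key format as the parent-table snippet above.
    return uuid.uuid5(
        uuid.NAMESPACE_DNS,
        f"parent_table:{source_file}:{table_name}:{page_number}",
    )


def row_chunk_id(parent_id: uuid.UUID, row_index: int) -> uuid.UUID:
    # Hypothetical scheme: use the parent ID as the namespace for its rows.
    return uuid.uuid5(parent_id, f"table_row:{row_index}")


# Two ingestion runs over the same (hypothetical) file produce identical IDs.
first_run = parent_chunk_id("iec61850-7-4.pdf", "XCBR", 47)
second_run = parent_chunk_id("iec61850-7-4.pdf", "XCBR", 47)
other_page = parent_chunk_id("iec61850-7-4.pdf", "XCBR", 48)
```

Because the ID depends only on the key string, any change to the file name, table name, or page number produces a different ID, while a re-run over unchanged input maps onto the existing points.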

The full ingestion flow


Raw PDF
    │
    ▼
PyMuPDF Parser → Markdown format
    │
    ▼
Segment Classifier
    │
    ├── Prose Blocks ──────────────────────────────────────┐
    │       └── Sliding window chunker (512t, 64 overlap)  │
    │                                                       │
    └── Table Blocks                                        │
            └── Parent-Child Splitter                       │
                    ├── 1 parent chunk per table            │
                    └── N child chunks (1 per row)          │
                                                            │
                              ┌─────────────────────────────┘
                              │
                              ▼
                    uuid5 Hash → unique ID per chunk
                              │
                              ▼
                    Embed with mxbai-embed-large
                              │
                              ▼
                    Upsert into Qdrant (local)

6. The Double-Caching Engine

This is one of the most impactful components we built. Engineering documents get queried repeatedly — the same questions come up again and again across sessions and users.

A full RAG pipeline with local LLM inference takes 12–15 seconds on our hardware. That's acceptable for a first answer. It's painful if you're asking the same question twice.

We built a two-layer cache that sits between the query and the retrieval pipeline.

Layer 1 — Exact Match Cache (SQLite + SHA-256)

Every incoming query is normalized (lowercased, stripped of punctuation) and hashed with SHA-256:


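import hashlib
import re

def normalize_and_hash(query: str) -> str:
    # Lowercase and trim so casing and stray whitespace don't cause misses
    clean = query.lower().strip()
    # Drop punctuation: keep only letters, digits, and whitespace
    clean = re.sub(r'[^a-z0-9\s]', '', clean)
    return hashlib.sha256(clean.encode()).hexdigest()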

The hash is looked up in a SQLite table:


CREATE TABLE exact_cache_chat (
    query_hash TEXT PRIMARY KEY,
    response   TEXT,
    created_at TIMESTAMP
);

If there's a hit → return immediately. Latency: under 1 millisecond.

Example:

Query 1: "What is XCBR?" → runs full pipeline (12s), stores result

Query 2: "What is XCBR?" → SHA-256 match, returns in < 1ms ✅
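
Putting the hash and the table together, a lookup and store cycle might look like the sketch below. It uses an in-memory SQLite database, and the helper names `open_cache`, `cache_get`, and `cache_put` are illustrative, not the production API:

```python
import hashlib
import re
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def normalize_and_hash(query: str) -> str:
    clean = re.sub(r"[^a-z0-9\s]", "", query.lower().strip())
    return hashlib.sha256(clean.encode()).hexdigest()


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS exact_cache_chat ("
        "query_hash TEXT PRIMARY KEY, response TEXT, created_at TIMESTAMP)"
    )
    return conn


def cache_get(conn: sqlite3.Connection, query: str) -> Optional[str]:
    row = conn.execute(
        "SELECT response FROM exact_cache_chat WHERE query_hash = ?",
        (normalize_and_hash(query),),
    ).fetchone()
    return row[0] if row else None


def cache_put(conn: sqlite3.Connection, query: str, response: str) -> None:
    # INSERT OR REPLACE keeps one row per hash, so re-storing is harmless.
    conn.execute(
        "INSERT OR REPLACE INTO exact_cache_chat VALUES (?, ?, ?)",
        (normalize_and_hash(query), response, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


conn = open_cache()
cache_put(conn, "What is XCBR?", "XCBR models a circuit breaker.")
hit = cache_get(conn, "what is xcbr")    # differs only in case and punctuation
miss = cache_get(conn, "What is MMXU?")  # never stored
```

Normalizing before hashing is what lets "What is XCBR?" and "what is xcbr" share one cache row.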

Layer 2 — Semantic Cache (Qdrant + Cosine Similarity)

Not all repeat questions are worded identically. Someone might ask:

"What is XCBR?"

"Tell me about the XCBR logical node"

"Explain XCBR"

These are semantically identical. The exact cache misses them. The semantic cache catches them.

When a query misses the exact cache:


Query
  │
  ▼
Embed with mxbai-embed-large
  │
  ▼
Search Qdrant semantic cache collection
  │
  ▼
If cosine similarity ≥ 0.99 → return cached response (~10ms)
If similarity < 0.99 → proceed to full RAG pipeline

After any new answer is generated, it gets stored in both the SQLite exact cache and the Qdrant semantic cache.
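
The threshold logic can be sketched without a vector database. The class below is a stand-in that does a linear scan over stored vectors; in the real system Qdrant performs the similarity search, and the three-dimensional vectors are toy values rather than real embeddings:

```python
import math
from typing import Optional, Sequence

SIMILARITY_THRESHOLD = 0.99  # the cutoff quoted above


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Illustrative stand-in for the Qdrant semantic cache collection."""

    def __init__(self, threshold: float = SIMILARITY_THRESHOLD) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for cached_vector, response in self._entries:
            score = cosine_similarity(embedding, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        # Only a near-identical query is allowed to reuse a cached answer.
        return best_response if best_score >= self.threshold else None


cache = InMemorySemanticCache()
cache.put([1.0, 0.0, 0.0], "XCBR models a circuit breaker.")
near_hit = cache.get([0.999, 0.01, 0.0])  # almost the same direction
far_miss = cache.get([0.6, 0.8, 0.0])     # similarity 0.6, below the cutoff
```

A high cutoff trades hit rate for safety: a rephrasing is served from cache only when its embedding is nearly indistinguishable from a stored one, and everything else falls through to the full pipeline.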

The Negative Response Guard

This is critical. Sometimes the system genuinely doesn't find information — and returns:

> "I could not find this information in the indexed documents."

If we cached this, the next person asking the same question would get a stale "not found" response even after more documents had been indexed.

So we built a filter:


def is_negative_or_fallback_response(response: str) -> bool:
    negative_patterns = [
        "could not find",
        "not available in the indexed",
        "no information found",
        "unable to locate"
    ]
    return any(p in response.lower() for p in negative_patterns)

Negative responses are never written to cache. Only real, successful answers get stored.

7. Conversational Memory — Resolving Pronouns Without Bloat

The naive solution for conversational memory is to dump the entire chat history into every LLM prompt. But this causes two problems:

1. Token bloat — an exchange that includes a large markdown table (like a 39-row XCBR attribute table) can easily consume 3,000+ tokens of context. Do this for 5 turns and you've used your entire context window before the LLM even sees the new question.

2. Latency spikes — more input tokens means slower generation.

We solved this with a two-phase approach.

Phase 1 — Heuristic Pronoun Bypass (0ms)

Most follow-up questions don't actually contain pronouns. "What is MMXU?" after "What is XCBR?" is completely standalone — it doesn't need context.

We run a fast regex check on every incoming query:


import re

# Pronouns plus other cues that usually point back at an earlier turn
# (ordinals, "previous", "what about", and so on)
PRONOUN_PATTERN = re.compile(
    r'\b(it|its|they|them|their|this|that|these|those|here|there|'
    r'first|second|third|previous|above|former|latter|explain|'
    r'what about|how about)\b',
    re.IGNORECASE
)

def has_pronouns(query: str) -> bool:
    return bool(PRONOUN_PATTERN.search(query))

If no pronouns → the query is standalone → skip to caching/retrieval immediately. Zero overhead.

Phase 2 — Sub-Second Query Condensation (< 150ms)

If pronouns are detected, we fetch the last 3 turns from SQLite (capped at 6 messages: 3 user + 3 assistant) and pass them to qwen2.5:3b, our small, fast rewriter model, with a very specific prompt:


You are a query rewriter. Rewrite the user's follow-up question into a
standalone question that contains no pronouns or references to previous messages.
Output ONLY the rewritten question. Maximum 24 tokens.

Conversation:
User: What is SIMG?
Assistant: SIMG is a logical node for gas insulation monitoring...
User: What are its pressure attributes?

Rewritten:

With max_tokens = 24, this completes in under 150ms.

Example:

Input: "What are its pressure attributes?"

Output: "What are the pressure attributes of the SIMG logical node?"

Now this standalone query goes into the normal vector search pipeline — and retrieves the right chunks.

The crucial part: once we have real source chunks, we inject only the most recent turn of history into the final LLM prompt. The rewriting has already resolved the pronoun, so the full conversation history is no longer needed.


Memory Strategy Summary:

For retrieval:   rewritten standalone query (no history needed)
For generation:  last 1 turn only (minimal context, natural dialogue flow)
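
The two phases can be tied together in a short sketch. The prompt text is the one shown above; the function names, the shortened cue list, and the `fake_rewriter` placeholder (standing in for the call to the small model) are assumptions for illustration:

```python
import re
from typing import Callable

# Shortened cue list for the sketch; the full pattern appears earlier.
REFERENCE_PATTERN = re.compile(
    r"\b(it|its|they|them|their|this|that|these|those)\b", re.IGNORECASE
)

REWRITE_INSTRUCTIONS = (
    "You are a query rewriter. Rewrite the user's follow-up question into a\n"
    "standalone question that contains no pronouns or references to previous messages.\n"
    "Output ONLY the rewritten question. Maximum 24 tokens."
)


def build_condenser_prompt(
    history: list[tuple[str, str]], question: str, max_turns: int = 3
) -> str:
    """history holds (user_message, assistant_message) turns, oldest first."""
    lines = [REWRITE_INSTRUCTIONS, "", "Conversation:"]
    for user_message, assistant_message in history[-max_turns:]:
        lines.append(f"User: {user_message}")
        lines.append(f"Assistant: {assistant_message}")
    lines.append(f"User: {question}")
    lines += ["", "Rewritten:"]
    return "\n".join(lines)


def to_standalone(
    question: str, history: list[tuple[str, str]], rewrite: Callable[[str], str]
) -> str:
    # No referential cue, or nothing to refer back to: skip the model call.
    if not history or not REFERENCE_PATTERN.search(question):
        return question
    return rewrite(build_condenser_prompt(history, question)).strip()


def fake_rewriter(prompt: str) -> str:
    # Placeholder for the rewriter model; returns a canned answer.
    return "What are the pressure attributes of the SIMG logical node?"


history = [("What is SIMG?", "SIMG is a logical node for gas insulation monitoring...")]
standalone = to_standalone("What is MMXU?", history, fake_rewriter)
rewritten = to_standalone("What are its pressure attributes?", history, fake_rewriter)
```

Passing the rewriter in as a callable keeps the routing logic testable without a model running.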

8. Multi-Intent Query Splitting

Some queries are actually two or three questions disguised as one:

> "What is the XCBR logical node and what are the attributes of its Pos data object?"

If you embed this as a single vector, the high-level XCBR concept dominates. The secondary query about Pos attributes gets poor representation and the retrieval misses the specific CDC attribute table you need.

The splitter

We detect compound queries by looking for separator patterns:


COMPOUND_SEPARATORS = [
    r',\s*also\b',
    r'\band\s+how\b',
    r'\band\s+what\b',
    r'\band\s+which\b',
    r'\bas\s+well\s+as\b',
    r'\bin\s+addition\b',
]

When detected, the query gets split into independent sub-queries:


"What is XCBR and what are the attributes of its Pos data object?"
    │
    ▼
Sub-Query 1: "What is the XCBR logical node?"
Sub-Query 2: "What are the attributes of the Pos data object?"

Each sub-query then runs through its own independent retrieval and reranking pipeline in parallel. The results are merged using content-hash deduplication (so if both retrieve the same chunk, it appears once).

The merged context is assembled so both questions get equal representation before being passed to the LLM.
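
A minimal sketch of the split and merge steps follows. The split pattern is adapted from the separator list above, with a lookahead so the question word stays with the second sub-query; the round-robin merge is one assumed way to give each sub-query equal representation. Unlike the example above, this sketch does not resolve a leftover "its" in a sub-query:

```python
import hashlib
import re

# Adapted separators: the lookahead keeps "how/what/which" in the next part.
_SPLIT_RE = re.compile(
    r",\s*also\s+"
    r"|\s+and\s+(?=(?:how|what|which)\b)"
    r"|\s+as\s+well\s+as\s+"
    r"|\s+in\s+addition,?\s+",
    re.IGNORECASE,
)


def split_query(query: str) -> list[str]:
    parts = [part.strip(" ?.,") for part in _SPLIT_RE.split(query)]
    # Re-capitalize and re-terminate each part so it reads as a question.
    return [part[0].upper() + part[1:] + "?" for part in parts if part]


def merge_chunks(*result_lists: list[str]) -> list[str]:
    """Interleave ranked results and drop chunks with identical content."""
    seen: set[str] = set()
    merged: list[str] = []
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):
                digest = hashlib.sha256(results[rank].encode()).hexdigest()
                if digest not in seen:
                    seen.add(digest)
                    merged.append(results[rank])
    return merged


sub_queries = split_query(
    "What is XCBR and what are the attributes of its Pos data object?"
)
merged = merge_chunks(["chunk A", "chunk B"], ["chunk B", "chunk C"])
```

Interleaving by rank means the top hit of every sub-query lands near the start of the context, so no single intent crowds out the others.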

9. Self-Healing Stream Validation

Local LLM streaming over SSE (Server-Sent Events) is fragile. On consumer hardware with tight VRAM:

GPU memory pressure can cause early stream termination

SSE events can be split across TCP reads, and a client that treats each read as a complete event can mangle or lose tokens

The connection can silently close before the done event fires

The result: the user sees a truncated response. The experience is broken.

We built a three-part self-healing mechanism.

Step 1 — Backend sends full answer alongside the stream

When the LLM finishes generating, the backend appends a special final SSE event:


# Normal stream events
yield f"data: {json.dumps({'type': 'token', 'content': token})}\n\n"

# Final event — includes the full answer for validation
yield f"data: {json.dumps({'type': 'done', 'full_answer': complete_response})}\n\n"
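
For context, here is a self-contained sketch of a generator that produces the two event shapes above, plus a small Python `replay` helper that mimics what the client does with them. Both function names are illustrative, and the real backend streams tokens from the LLM rather than from a list:

```python
import json
from typing import Iterable, Iterator, Optional


def sse_event(payload: dict) -> str:
    # One SSE message: a "data:" line terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


def stream_answer(tokens: Iterable[str]) -> Iterator[str]:
    parts: list[str] = []
    for token in tokens:
        parts.append(token)
        yield sse_event({"type": "token", "content": token})
    # Final event carries the complete text so the client can validate.
    yield sse_event({"type": "done", "full_answer": "".join(parts)})


def replay(events: Iterable[str]) -> tuple[str, Optional[str]]:
    """Client-side view: rebuild text from token events, read the done payload."""
    received, full_answer = "", None
    for raw in events:
        data = json.loads(raw[len("data: "):])
        if data["type"] == "token":
            received += data["content"]
        elif data["type"] == "done":
            full_answer = data["full_answer"]
    return received, full_answer


events = list(stream_answer(["The ", "XCBR ", "logical ", "node"]))
received, full_answer = replay(events)

# Simulate one lost token event: the accumulated text no longer matches.
lossy_received, lossy_full = replay(events[:2] + events[3:])
```

The `done` payload duplicates the answer on the wire, which costs a little bandwidth but gives the client a ground truth to compare against.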

Step 2 — Frontend accumulates and validates

The React frontend accumulates every streamed token:


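let accumulatedTokens = "";
let finished = false; // true once a done event arrived or the fallback ran

eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === "token") {
    accumulatedTokens += data.content;
    updateChatBubble(accumulatedTokens);
  }

  if (data.type === "done") {
    finished = true;
    // Close explicitly: EventSource reconnects on its own when the server ends the stream
    eventSource.close();

    const normalizedAccumulated = normalize(accumulatedTokens);
    const normalizedFull = normalize(data.full_answer);

    if (normalizedAccumulated !== normalizedFull) {
      // Mismatch detected: trigger fallback
      triggerSilentFallback();
    }
  }
};

eventSource.onerror = () => {
  // Stop the automatic reconnect attempts
  eventSource.close();

  if (!finished) {
    // Stream died without a done event: trigger fallback once
    finished = true;
    triggerSilentFallback();
  }
};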

Step 3 — Silent fallback from cache

The fallback sends a regular, non-streaming POST request to /chat. Because the query was already processed, the answer is already in the semantic cache, so it comes back in under 10ms. The chat bubble updates seamlessly.

The user never sees a broken experience.


Stream ends
    │
    ├── done event received
    │       ├── tokens match full_answer → ✅ done
    │       └── mismatch detected → silent fallback (< 10ms) → update bubble ✅
    │
    └── error / no done event → silent fallback (< 10ms) → update bubble ✅

10. Dynamic Healing Cache

Sometimes the vector database has rows with missing descriptions. This can happen because:

The source PDF table had a blank cell

The parser couldn't extract text from a scanned image row

The description was on a different page and wasn't linked correctly

Instead of returning an incomplete answer, the system heals itself.

How it works

Every row chunk stores its source PDF filename and page number as metadata. When a retrieved row has an empty description field:


Row: {DataObject: "Pos", FC: "ST", PresCond: "M", description: ""}
                                                                 ↑
                                                            Empty!

The healing service:

1. Looks up a disk-persistent JSON cache of page mappings (page_cache.json)

2. If the page mapping is cached → extracts the description directly in < 1ms

3. If not → runs PyMuPDF to scan the relevant PDF page, extracts the table, finds the row, and caches the result for next time

After healing, the description is stored persistently — so the next request for the same row is instant.

Before healing cache: 1.5 seconds per missing row (live PDF scan)

After healing cache: < 1 millisecond (from persistent JSON)
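
The lookup-then-scan flow can be sketched as follows. The class name, cache key format, and `fake_pdf_scan` (a placeholder for the PyMuPDF page scan) are assumptions; the sketch also stores the healed description directly, which simplifies the page-mapping cache described above:

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class HealingCache:
    def __init__(
        self, cache_path: Path, extract_from_pdf: Callable[[str, int, str], str]
    ) -> None:
        self.cache_path = cache_path
        self.extract_from_pdf = extract_from_pdf  # slow path: live PDF scan
        self.entries: dict[str, str] = (
            json.loads(cache_path.read_text()) if cache_path.exists() else {}
        )

    def heal(self, row: dict) -> dict:
        if row.get("description"):
            return row  # nothing to repair
        key = f"{row['source_file']}:{row['page']}:{row['DataObject']}"
        if key not in self.entries:
            self.entries[key] = self.extract_from_pdf(
                row["source_file"], row["page"], row["DataObject"]
            )
            # Persist immediately so the next process start is also fast.
            self.cache_path.write_text(json.dumps(self.entries, indent=2))
        return {**row, "description": self.entries[key]}


scan_calls: list[tuple[str, int, str]] = []


def fake_pdf_scan(source_file: str, page: int, data_object: str) -> str:
    scan_calls.append((source_file, page, data_object))
    return "Position"


cache_file = Path(tempfile.mkdtemp()) / "page_cache.json"
row = {
    "DataObject": "Pos", "FC": "ST", "PresCond": "M", "description": "",
    "source_file": "iec61850-7-4.pdf", "page": 47,  # hypothetical metadata
}
first = HealingCache(cache_file, fake_pdf_scan).heal(row)
second = HealingCache(cache_file, fake_pdf_scan).heal(row)  # fresh instance, reads disk
```

The second call never touches the PDF: the new instance loads the JSON file and answers from it, which is the behavior behind the before and after numbers above.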

11. Hierarchical Reference Chaining

IEC 61850 signals are referenced using structured object paths like:


Mon.SIMG1.Pres.mag.f

This path means:


Mon           → Logical Device (Monitor)
  └── SIMG1   → Logical Node (Gas Insulation Monitoring, instance 1)
        └── Pres    → Data Object (Pressure)
              └── mag     → Common Data Attribute (Magnitude)
                    └── f → Basic Data Attribute (Float value)

Understanding this path requires recursively looking up 5 different levels across multiple standard documents. Engineers were doing this manually.

Our reference chaining engine:

1. Parses the path string into tokens using a structured regex

2. Identifies the Logical Node class (SIMG)

3. Queries Qdrant for the LN table → finds Pres Data Object

4. Follows the Pres CDC reference → retrieves the attribute table

5. Looks up mag.f within that CDC's data attributes

6. Returns a complete, structured deconstruction:


Mon.SIMG1.Pres.mag.f
├── Logical Device:       Mon (Monitor)
├── Logical Node:         SIMG1 — Gas Insulation Monitoring (IEC 61850-7-4)
├── Data Object:          Pres — Pressure measurement [CDC: MV]
├── Data Attribute:       mag — Magnitude of the measured value
└── Basic Attribute:      f — 32-bit IEEE 754 float (actual pressure value)

What used to take 10–15 minutes of manual cross-referencing now takes seconds.
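
The first two steps, parsing the path and isolating the logical node class, can be sketched like this. The regex and the `ObjectReference` tuple are assumptions for illustration; the sketch handles the dotted form used above and the common prefix, four-letter class, instance number naming pattern, and it does not cover special names such as LLN0:

```python
import re
from typing import NamedTuple, Optional

# Optional prefix, four uppercase letters for the class, optional instance number.
LN_PATTERN = re.compile(
    r"^(?P<prefix>[A-Za-z0-9]*?)(?P<ln_class>[A-Z]{4})(?P<instance>\d*)$"
)


class ObjectReference(NamedTuple):
    logical_device: str
    ln_prefix: str
    ln_class: str
    ln_instance: Optional[int]
    data_object: str
    attribute_path: tuple[str, ...]


def parse_reference(path: str) -> ObjectReference:
    parts = path.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected at least LD.LN.DO, got {path!r}")
    logical_device, logical_node, data_object, *attributes = parts
    match = LN_PATTERN.match(logical_node)
    if match is None:
        raise ValueError(f"unrecognized logical node name: {logical_node!r}")
    instance = int(match["instance"]) if match["instance"] else None
    return ObjectReference(
        logical_device,
        match["prefix"],
        match["ln_class"],
        instance,
        data_object,
        tuple(attributes),
    )


ref = parse_reference("Mon.SIMG1.Pres.mag.f")
```

Once the class (`SIMG`), the data object (`Pres`), and the attribute path (`mag`, `f`) are separated, each one becomes a targeted lookup instead of a free-text search.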

12. Excel Export Engine

Engineers often need to share query results as formatted tables — for reports, commissioning documentation, or compliance audits.

When the user clicks "Export to Excel", the system:

1. Sends the markdown Q&A response to the LLM with a JSON extraction prompt

2. The LLM outputs a structured JSON object:


{
  "title": "XCBR Logical Node Attributes",
  "headers": ["Data Object", "FC", "Presence Condition", "Description"],
  "rows": [
    ["Pos", "ST", "M", "Position of the circuit breaker"],
    ["BlkOpn", "CO", "O", "Block opening"],
    ["CBOpCap", "OR", "O", "CB operating capability"]
  ]
}

3. openpyxl compiles this into a styled Excel workbook:

- Header row: Slate Gray background (#1F2937), white bold text

- Accent cells: Emerald Green (#059669)

- Zebra row shading: alternating #F9FAFB / white

- Auto-adjusted column widths

- Soft cell borders (#E5E7EB)

4. The .xlsx file is streamed back to the client for download

No copy-pasting. No manual formatting. One click.
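
Since the table comes from an LLM, it is worth checking its shape before building a workbook. The sketch below is an assumed validation step, not a documented part of the pipeline; `parse_table_payload` and `column_widths` are hypothetical helpers, and the width heuristic is one simple way to approximate auto-adjusted columns:

```python
import json


def parse_table_payload(raw: str) -> dict:
    """Parse the LLM output and reject tables whose rows don't match the headers."""
    payload = json.loads(raw)
    for key in ("title", "headers", "rows"):
        if key not in payload:
            raise ValueError(f"missing key: {key}")
    width = len(payload["headers"])
    for index, row in enumerate(payload["rows"]):
        if len(row) != width:
            raise ValueError(f"row {index} has {len(row)} cells, expected {width}")
    return payload


def column_widths(payload: dict, padding: int = 2, max_width: int = 60) -> list[int]:
    """Width per column: longest cell (header included) plus padding, capped."""
    columns = zip(payload["headers"], *payload["rows"])
    return [
        min(max(len(str(cell)) for cell in column) + padding, max_width)
        for column in columns
    ]


raw = json.dumps({
    "title": "XCBR Logical Node Attributes",
    "headers": ["Data Object", "FC", "Presence Condition", "Description"],
    "rows": [
        ["Pos", "ST", "M", "Position of the circuit breaker"],
        ["BlkOpn", "CO", "O", "Block opening"],
    ],
})
payload = parse_table_payload(raw)
widths = column_widths(payload)
```

Failing fast on a malformed payload means a retry of the extraction prompt, rather than a misaligned spreadsheet reaching a commissioning report.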

13. Performance Numbers

All benchmarks on: Intel Core i7, NVIDIA RTX 3060 Laptop GPU (6GB VRAM)

| Operation | Before optimization | After optimization |
| --- | --- | --- |
| Exact cache hit | 12–15 seconds (full pipeline) | < 1 millisecond |
| Semantic cache hit | 12–15 seconds | ~10 milliseconds |
| Pronoun query rewriting | 8–12 seconds (7B model) | < 150 milliseconds |
| PDF TOC startup scan | 26.4 seconds | < 1 millisecond |
| Missing attribute healing | 1.5 seconds per row | < 1 millisecond |
| Query reranking | 26.5–39.0 seconds | ~2.5–3.7 seconds |
| Full answer (warm, cache miss) | n/a | 12–15 seconds |
| Full answer (cold start, new PDF) | n/a | 40–90 seconds (one-time) |

Accuracy

We ran a comprehensive 20-question test suite covering:

Single logical node lookups

CDC attribute tables

Cross-reference queries (LN → CDC → attribute)

Multi-intent compound questions

Conversational follow-up chains

Hierarchical object reference parsing

Result: all 20 answers were correct, complete, and verified against the source documents, with no hallucinations and no omissions on this test set.

14. Deployment Architecture

For real-world substation deployment, a single laptop running everything isn't practical. We designed a Local Compute Client-Server model:


[Air-Gapped Substation LAN]
    │
    ├── Substation Workstation (Server)
    │       ├── GPU: RTX 4000 SFF Ada (20GB) or RTX 4090 (24GB)
    │       ├── Hosts: FastAPI backend + Ollama + Qdrant
    │       └── Warm query latency: < 2 seconds (high tensor core bandwidth)
    │
    ├── Engineer Laptop 1 ──► Browser → HTTP → Workstation
    ├── Engineer Laptop 2 ──► Browser → HTTP → Workstation
    └── Substation Gateway PC ──► REST API → Workstation

The field engineers don't need a GPU. They access the system through any web browser over the local substation LAN. All heavy computation happens on one dedicated workstation.

With a proper GPU (20–24GB VRAM vs the 6GB we benchmarked on), total conversational latency drops to under 2 seconds for warm queries — which feels near-instant in practice.

15. What We Learned

A few things that weren't obvious at the start:

1. Table structure is everything. The parent-child mapping was the single biggest accuracy improvement. Naive chunking on technical tables gives wrong results no matter how good your embedding model is.

2. Caching is not optional at local LLM speeds. On consumer hardware, about 12 seconds per uncached query is the best case. Without a cache layer, a multi-user deployment would be unusable. The double cache cut response time for repeat queries by more than 99% (from 12–15 seconds to roughly 10 milliseconds or less).

3. Small models can do specific jobs extremely well. Using qwen2.5:3b for query condensation only — with a strict 24-token output cap — gave us sub-150ms rewrites without burning VRAM on the primary task.

4. Stream validation is a real production concern. We discovered this the hard way. On a 6GB GPU running a 7B model, stream truncation happened often enough to be a real UX problem. The self-healing validation eliminated it entirely.

5. Generic parsers beat domain-specific ones. We deliberately built the layout classifier to have no knowledge of IEC 61850 specifically. It works on any technical PDF. This decision paid off immediately — adding new document types required zero code changes.

Wrapping Up

This project started as a solution to a very specific engineering bottleneck — and turned into a solid foundation for offline document intelligence in any air-gapped, privacy-sensitive domain.

The architecture we built handles:

Any technical PDF — not just one standard

Multi-turn conversations without context bloat

Compound queries without result quality degradation

Streaming failures without user-visible disruption

Missing data without hallucinations

And it does all of this on a single consumer GPU, with no internet dependency.

If you are working on a RAG system, document intelligence pipeline, or any AI agent that needs to run privately and offline — we would be glad to talk.

Visit daemlabs.com or reach out directly.

We build what others think is hard.

How We Built an Offline RAG AI That Reads 1000s of Technical Pages — And Never Hallucinates

Table of Contents

1. The Problem We Were Solving

2. Why Standard RAG Systems Fail Here

What standard RAG does

Problem 1 — Compound queries drown out details

Problem 2 — Tables lose context when chunked

Problem 3 — Follow-up questions break retrieval

Problem 4 — Streaming truncation

3. System Overview — What We Built

4. The Tech Stack

5. Ingestion Pipeline — How We Read Documents

Why naive chunking fails on technical PDFs

Our solution: Parent-Child Table Mapping

Idempotent ingestion with uuid5 hashing

The full ingestion flow

6. The Double-Caching Engine

Layer 1 — Exact Match Cache (SQLite + SHA-256)

Layer 2 — Semantic Cache (Qdrant + Cosine Similarity)

The Negative Response Guard

7. Conversational Memory — Resolving Pronouns Without Bloat

Phase 1 — Heuristic Pronoun Bypass (0ms)

Phase 2 — Sub-Second Query Condensation (< 150ms)

8. Multi-Intent Query Splitting

The splitter

9. Self-Healing Stream Validation

Step 1 — Backend sends full answer alongside the stream

Step 2 — Frontend accumulates and validates

Step 3 — Silent fallback from cache

10. Dynamic Healing Cache

How it works

11. Hierarchical Reference Chaining

12. Excel Export Engine

13. Performance Numbers

Accuracy

14. Deployment Architecture

15. What We Learned

Wrapping Up