A benchmarked document QA system for PDF and DOCX files that combines hybrid retrieval, document structure intelligence, reranking, and citation-aware grounded answers. On a 150-question held-out evaluation suite, it produced zero false-support claims and correctly abstained on every unanswerable question. It also hallucinated 6–20% less than OpenAI File Search and Vectara.
Architecture
Problem
Long documents such as policies, theses, and research papers are difficult to query reliably. Keyword search misses semantic meaning, while naive LLM chat over documents often produces ungrounded answers without clear evidence. Broad summary questions are especially hard because relevant information is spread across multiple sections rather than one exact chunk.
Solution
HelpmateAI is a grounded long-document QA system that indexes PDF and DOCX files, builds persistent local retrieval artifacts, and answers user questions with visible evidence and citations. It combines dense and lexical retrieval, fusion, reranking, an LLM retrieval orchestrator for explicit scope, indexing-time chunk semantics and document landmarks, and a strict support verifier, so responses stay tied to source material and hallucinations are reduced.
Workspace
How it works
1. Document Ingestion
Accepts uploaded PDF and DOCX files and extracts structured content via pypdf, python-docx, and a selective pdfplumber pass for table-heavy pages. Captures page labels, section headings, clause IDs, section paths, section kinds, and document style hints. An indexing-time chunk-semantics layer classifies suspicious candidates as metadata, definition, or table evidence (or noise), and a document-landmarks pass identifies title pages, forewords, abstracts, executive summaries, glossaries, and volume boundaries.
2. Structure & Index Build
Builds metadata-rich chunks, section records, deterministic section synopses, and lightweight topology artifacts. Sections are enriched with document-aware profile metadata (chapter numbers, section roles, page ranges, and scope aliases) so locally scoped questions can stay inside the requested chapter. The index is schema-versioned and keyed by document fingerprint, so an already indexed document can be reused without an unnecessary rebuild.
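A minimal sketch of how a schema-versioned, fingerprint-keyed index lookup could work. The SHA-256 hash, the `SCHEMA_VERSION` value, the directory layout, and the function names are all assumptions made for illustration, not the project's actual code.

```python
import hashlib
from pathlib import Path

SCHEMA_VERSION = "v1"  # hypothetical; bump when the artifact layout changes


def document_fingerprint(data: bytes) -> str:
    """Return a stable content hash for the raw document bytes."""
    return hashlib.sha256(data).hexdigest()


def index_dir(root: Path, data: bytes, schema_version: str = SCHEMA_VERSION) -> Path:
    """Directory that would hold artifacts for this document and schema version."""
    return root / schema_version / document_fingerprint(data)[:16]


def needs_rebuild(root: Path, data: bytes, schema_version: str = SCHEMA_VERSION) -> bool:
    """Rebuild only when no artifacts exist for this (schema, fingerprint) pair."""
    return not index_dir(root, data, schema_version).exists()
```

Keying on both values means that re-uploading the same file reuses its artifacts, while a schema change or any edit to the file's bytes leads to a fresh build.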
3. Retrieval Planning
Analyzes the question and produces a retrieval plan. An LLM retrieval orchestrator runs first on a compact document map and can resolve explicit local scope to validated section IDs; deterministic code then enforces safety checks and routes through chunk, synopsis, summary, section, or hybrid retrieval. Low-confidence orchestrator output is ignored: the deterministic layer never trusts unbounded LLM scope.
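The deterministic guard around the orchestrator could look roughly like the sketch below. The proposal dictionary shape, the `MIN_CONFIDENCE` threshold, the hybrid fallback, and the rule of dropping scope when any proposed section ID is unknown are assumptions for illustration only.

```python
from dataclasses import dataclass, field

ROUTES = {"chunk", "synopsis", "summary", "section", "hybrid"}
MIN_CONFIDENCE = 0.6  # hypothetical threshold


@dataclass
class RetrievalPlan:
    route: str = "hybrid"
    section_ids: list[str] = field(default_factory=list)  # empty means whole document


def build_plan(proposal: dict, known_section_ids: set[str]) -> RetrievalPlan:
    """Turn an orchestrator proposal into a plan the deterministic layer accepts."""
    fallback = RetrievalPlan()
    # Ignore low-confidence proposals entirely.
    if proposal.get("confidence", 0.0) < MIN_CONFIDENCE:
        return fallback
    route = proposal.get("route")
    if route not in ROUTES:
        return fallback
    proposed = proposal.get("section_ids", [])
    valid = [s for s in proposed if s in known_section_ids]
    # Assumed policy: if any proposed ID is unknown, distrust the scope
    # and search the whole document with the proposed route.
    if len(valid) != len(proposed):
        return RetrievalPlan(route=route)
    return RetrievalPlan(route=route, section_ids=valid)
```

The point of the design is that the LLM only proposes; every section ID it returns is checked against IDs that exist in the index before it can narrow retrieval.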
4. Hybrid Retrieval
Runs dense retrieval, TF-IDF lexical retrieval, fusion, metadata-aware ranking, and optional reranking. It also uses soft structural guidance and global fallback so broader or distributed questions do not collapse into narrow evidence misses.
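The write-up does not say which fusion method is used, so the sketch below uses reciprocal rank fusion (RRF) purely as a stand-in. The chunk IDs, the constant `k = 60`, and the function name are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs; each input list is ordered best-first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["c3", "c1", "c7"]    # hypothetical dense-retrieval order
lexical = ["c1", "c9", "c3"]  # hypothetical TF-IDF order
fused = reciprocal_rank_fusion([dense, lexical])
```

Rank-based fusion avoids having to normalize dense similarity scores against TF-IDF scores, which live on different scales. Chunks that both retrievers rank highly rise to the top.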
5. Evidence Grading & Answer Verification
Grades evidence as strong, weak, or unsupported. For weak, middle-band cases, retrieval can adapt without model-based query rewriting. A spread-triggered, reorder-only evidence selector promotes stronger evidence among the top candidates without pruning support. Final answers carry citations, retrieval notes, and explicit support status. A strict support verifier can recover a refused answer to full support only when grounded facts are visible and no missing facts or hedging language remain. This verifier is the mechanism behind the zero false-support result in the evaluation.
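The grading bands and the verifier's recovery rule can be sketched as follows. The score thresholds, the hedge-word list, and the `Verdict` fields are hypothetical; the write-up states only the conditions, not how they are implemented.

```python
import re
from dataclasses import dataclass, field

STRONG, WEAK = 0.75, 0.45  # hypothetical score thresholds
# Hypothetical hedge list; matched on word boundaries to avoid hits like "mayor".
HEDGE_PATTERN = re.compile(r"\b(?:may|might|possibly|unclear|not specified)\b", re.I)


def grade_evidence(top_score: float) -> str:
    """Map a retrieval score to one of the three support bands."""
    if top_score >= STRONG:
        return "strong"
    if top_score >= WEAK:
        return "weak"
    return "unsupported"


@dataclass
class Verdict:
    grounded_facts: list[str] = field(default_factory=list)
    missing_facts: list[str] = field(default_factory=list)


def can_recover(answer: str, verdict: Verdict) -> bool:
    """Allow recovery to full support only when all three conditions hold."""
    hedged = HEDGE_PATTERN.search(answer) is not None
    return bool(verdict.grounded_facts) and not verdict.missing_facts and not hedged
```

Because all three conditions must hold at once, the verifier errs toward keeping a refusal, which trades some answer coverage for fewer unsupported claims.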
Evaluation
The evaluation uses a 150-question held-out suite spanning NIST AI RMF, an arXiv climate ML paper, a public UPenn thesis, FOMC minutes, and the IRENA World Energy Transitions Outlook. Vendor baselines were run in their native answer modes.
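One way the two headline metrics could be computed from per-question records is sketched below. The record fields and metric definitions are assumptions: false support is taken to mean the system claimed support that a judge did not confirm, and abstention accuracy is the share of unanswerable questions the system declined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRecord:
    answerable: bool         # gold label: does the document contain the answer?
    abstained: bool          # did the system decline to answer?
    claimed_supported: bool  # did the system mark its answer as supported?
    judged_supported: bool   # did the judge confirm the cited evidence supports it?


def false_support_count(records: list[EvalRecord]) -> int:
    """Count answers marked supported that the judge found unsupported."""
    return sum(r.claimed_supported and not r.judged_supported for r in records)


def abstention_accuracy(records: list[EvalRecord]) -> Optional[float]:
    """Share of unanswerable questions on which the system abstained."""
    unanswerable = [r for r in records if not r.answerable]
    if not unanswerable:
        return None
    return sum(r.abstained for r in unanswerable) / len(unanswerable)
```

Tracking these separately from answer quality keeps grounding failures visible even when the overall accuracy looks good.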
Challenges
- Handling noisy academic and journal PDFs with weak section structure
- Making retrieval behavior more measurable and benchmarkable rather than relying on intuition
- Balancing architecture complexity, latency, and retrieval gains through ablations and threshold calibration
What I learned
This project pushed me beyond building a simple RAG demo. I learned how retrieval quality depends on structure, routing, evidence selection, and evaluation, not just embeddings or prompting. I also learned to treat benchmarking as part of the architecture itself, using retrieval metrics, abstention checks, baseline comparisons, and ragas to justify design decisions.
Tech Stack
- Next.js, FastAPI, Caddy + Docker on a VPS
- Document parsing: pypdf, python-docx, pdfplumber for table enrichment
- Vector Store: ChromaDB (local + optional cloud persistence)
- Retrieval / ML: scikit-learn, sentence-transformers, Python core logic






