A benchmarked document QA system for PDF and DOCX files that combines hybrid retrieval, document structure intelligence, reranking, and citation-aware grounded answers.

Problem

Long documents such as policies, theses, and research papers are difficult to query reliably. Keyword search misses semantic meaning, while naive LLM chat over documents often produces ungrounded answers without clear evidence. Broad summary questions are especially hard because relevant information is spread across multiple sections rather than one exact chunk.

Solution

HelpmateAI is a grounded long-document QA system that indexes PDF and DOCX files, builds persistent local retrieval artifacts, and answers user questions with visible evidence and citations. The system combines dense retrieval, lexical search, fusion, optional reranking, deterministic retrieval planning, and bounded answer generation so responses stay tied to source material rather than model memory.

How it works

1. Document Ingestion

Accepts uploaded PDF and DOCX files and extracts structured content, including page labels, section headings, clause IDs where possible, section paths, section kinds, and document-style hints.
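
As a rough illustration of the metadata described above, a per-chunk record might look like the sketch below. The class name, field names, and citation format are hypothetical and chosen only for this example; they are not taken from the HelpmateAI codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical metadata attached to one extracted chunk."""
    page_label: str                 # label as printed in the document, e.g. "12" or "iv"
    section_heading: str            # nearest heading above the chunk
    section_path: tuple[str, ...]   # headings from the document root down to this section
    section_kind: str               # e.g. "clause", "abstract", "appendix" (assumed values)
    clause_id: Optional[str] = None  # only present when a clause ID could be extracted

    def citation(self) -> str:
        """Build a short human-readable citation string from the metadata."""
        parts = [f"p. {self.page_label}", " > ".join(self.section_path)]
        if self.clause_id:
            parts.append(f"clause {self.clause_id}")
        return ", ".join(parts)


meta = ChunkMetadata(
    page_label="12",
    section_heading="Termination",
    section_path=("Policy", "Termination"),
    section_kind="clause",
    clause_id="4.2",
)
```

Keeping the clause ID optional mirrors the "where possible" caveat: not every document exposes clause numbering, so downstream code should not depend on it.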

2. Structure & Index Build

Builds metadata-rich chunks, section records, deterministic section synopses, and lightweight topology artifacts. The index is schema-versioned and keyed by document fingerprint so documents can be reused without unnecessary rebuilds.
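
One way to picture the fingerprint-keyed, schema-versioned index is the minimal sketch below. It assumes the fingerprint is a SHA-256 hash of the raw file bytes and that the schema version is a short string; both choices, along with the function names, are assumptions for illustration rather than the project's actual scheme.

```python
import hashlib

# Hypothetical schema version; bumping it forces a rebuild of every index.
SCHEMA_VERSION = "v1"


def document_fingerprint(data: bytes) -> str:
    """Content hash of the uploaded file, independent of its filename."""
    return hashlib.sha256(data).hexdigest()


def index_key(data: bytes, schema_version: str = SCHEMA_VERSION) -> str:
    """Combine schema version and fingerprint into one storage key."""
    return f"{schema_version}-{document_fingerprint(data)[:16]}"


def needs_rebuild(data: bytes, existing_keys: set[str]) -> bool:
    """True when no index exists for this content under the current schema."""
    return index_key(data) not in existing_keys
```

Under this scheme, re-uploading an identical file reuses the stored artifacts, while a changed file or a bumped schema version produces a new key and triggers a rebuild.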

3. Retrieval Planning

Analyzes the user’s question and produces a deterministic retrieval plan. Depending on query shape, the system routes the question to chunk, synopsis, summary, section, or hybrid retrieval. A lightweight LLM route refinement is used only when deterministic confidence is low.
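
A deterministic planner of this kind can be sketched with plain rules. The cue lists, confidence values, and the 0.6 threshold below are invented for illustration; only the idea of a rule-based route plus a low-confidence flag comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalPlan:
    route: str                  # one of the routes named above, e.g. "chunk" or "summary"
    confidence: float           # heuristic confidence in the chosen route
    needs_llm_refinement: bool  # True only when confidence falls below the floor


# Hypothetical cue lists and threshold, not the project's real rules.
SUMMARY_CUES = ("summarize", "summary", "overview", "main findings")
SECTION_CUES = ("section", "clause", "chapter", "appendix")
CONFIDENCE_FLOOR = 0.6


def plan_retrieval(question: str) -> RetrievalPlan:
    """Pick a route from surface cues; the same question always yields the same plan."""
    q = question.lower()
    summary_hits = sum(cue in q for cue in SUMMARY_CUES)
    section_hits = sum(cue in q for cue in SECTION_CUES)
    if summary_hits and section_hits:
        route, confidence = "hybrid", 0.7
    elif summary_hits:
        route, confidence = "summary", 0.9
    elif section_hits:
        route, confidence = "section", 0.8
    elif len(q.split()) <= 4:
        # Very short questions carry little signal, so confidence is low.
        route, confidence = "chunk", 0.5
    else:
        route, confidence = "chunk", 0.75
    return RetrievalPlan(route, confidence, confidence < CONFIDENCE_FLOOR)
```

Because the plan is a pure function of the question text, routing decisions can be replayed and compared across benchmark runs, and the LLM is consulted only for the flagged low-confidence cases.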

4. Hybrid Retrieval

Runs dense retrieval, TF-IDF lexical retrieval, fusion, metadata-aware ranking, and optional reranking. It also uses soft structural guidance and global fallback so broader or distributed questions do not collapse into narrow evidence misses.
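
The fusion step can be illustrated with reciprocal rank fusion (RRF), a common way to merge a dense ranking and a lexical ranking without calibrating their raw scores against each other. Whether HelpmateAI uses RRF specifically is an assumption; the chunk IDs and the constant k=60 below are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in, so chunks
    ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_ranking = ["c3", "c1", "c7"]    # hypothetical output of the dense retriever
lexical_ranking = ["c1", "c9", "c3"]  # hypothetical output of the TF-IDF retriever
fused = reciprocal_rank_fusion([dense_ranking, lexical_ranking])
# fused == ["c1", "c3", "c9", "c7"]
```

Here c1 and c3 appear in both lists and outrank c9 and c7, which each appear in only one. Metadata-aware ranking and reranking would then operate on this fused candidate list.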

5. Evidence Selection & Answering

Grades evidence as strong, weak, or unsupported. For weak middle-band cases, retrieval can adapt without model-based query rewriting. A bounded post-rerank evidence selector can promote more direct evidence already present in top candidates. Final answers are generated with citations, retrieval notes, and explicit support status.
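
The three-way grading can be pictured as a pair of thresholds over the best evidence score. The threshold values, the function names, and the mapping from "unsupported" to abstention are assumptions made for this sketch; the source states only that evidence is graded strong, weak, or unsupported and that weak cases trigger deterministic retrieval adaptation.

```python
# Hypothetical thresholds; in practice these would be calibrated on a benchmark.
STRONG_THRESHOLD = 0.75
WEAK_THRESHOLD = 0.40


def grade_evidence(top_score: float) -> str:
    """Map the best evidence score to a support grade."""
    if top_score >= STRONG_THRESHOLD:
        return "strong"
    if top_score >= WEAK_THRESHOLD:
        return "weak"
    return "unsupported"


def next_action(grade: str) -> str:
    """Choose a deterministic follow-up with no model-based query rewriting."""
    actions = {
        "strong": "answer",             # generate a cited answer
        "weak": "adapt_retrieval",      # e.g. widen the search, then grade again
        "unsupported": "abstain",       # assumed behavior: report missing support
    }
    return actions[grade]
```

Exposing the grade alongside the answer is what allows the final response to carry an explicit support status rather than an unqualified claim.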

Challenges

- Improving broad summary and synthesis-style questions without hurting strong factual retrieval

- Handling noisy academic and journal PDFs with weak section structure

- Making retrieval behavior more measurable and benchmarkable rather than relying on intuition

- Removing model-based query rewriting in favor of simpler, more predictable deterministic recovery and guardrails

- Balancing architecture complexity, latency, and retrieval gains through ablations and threshold calibration

What I learned

This project pushed me beyond building a simple RAG demo. I learned that retrieval quality depends on structure, routing, evidence selection, and evaluation, not just embeddings or prompting. I also learned to treat benchmarking as part of the architecture itself, using retrieval metrics, abstention checks, baseline comparisons, and ragas to justify design decisions.

Tech Stack

- App / API: Next.js, FastAPI

- Vector Store: ChromaDB

- LLM / Evaluation: OpenAI, ragas

- Retrieval / ML: scikit-learn, sentence-transformers, Python (core logic)

- Deployment path: Framer for marketing, Next.js for app, FastAPI for API, with Supabase and Chroma cloud persistence
