RAG4HealthQA is an educational tool that summarizes public health information. It does not provide medical advice, diagnosis, or treatment. For personal health decisions, consult a licensed clinician or local health authority.
Live demo: rag4healthapp-ussxwkag2zmbdvrtjzvsfq.streamlit.app/
Repo: https://github.com/Sankarraj-Subramani/RAG4HealthQA
Stack: LangChain · FAISS · Cohere (embeddings/LLM) · Streamlit · Python
NIW Alignment: Open, auditable, and privacy-aware access to trusted medical knowledge (WHO/NIH/CDC); supports health equity, AI transparency, and digital literacy.
Healthcare misinformation remains a public-health risk. Patients, educators, and even clinicians struggle to locate concise, reliable, and explainable answers. Generic LLMs can hallucinate and lack source traceability.
Research gap. Few open tools combine (i) curated medical corpora, (ii) auditable retrieval pipelines, and (iii) patient-friendly UX with explicit citations and refusal policies for unsafe queries.
Thesis. A domain-tuned Retrieval-Augmented Generation (RAG) system, grounded in WHO/NIH/CDC sources, can deliver faithful, explainable answers while keeping deployment simple (Streamlit) and privacy-preserving (local FAISS).
RAG4HealthQA is a healthcare Q&A assistant that uses RAG to answer questions grounded in curated public health documents. It ships with:

- An end-to-end RAG pipeline (`app/rag_pipeline.py`)
- A curated WHO/NIH/CDC corpus (`/data/*.txt|.md`)
- A scraper for KB refresh (`scripts/generate_health_kb.py`)
```mermaid
flowchart TD
    A["Streamlit Frontend (main.py)"]
    B["RAG Pipeline (app/rag_pipeline.py)"]
    C["Embeddings + Retriever (LangChain + Cohere)"]
    D["FAISS Vector Store (local semantic index)"]
    E["Curated Health KB (WHO/NIH/CDC .txt/.md)"]
    A --> B --> C --> D --> E
```
ASCII fallback

```
+-----------------------------------------------------+
|                    RAG4HealthQA                     |
+-----------------------------------------------------+
|  Streamlit Frontend (main.py)                       |
|                      |                              |
|                      v                              |
|  RAG Pipeline (app/rag_pipeline.py)                 |
|                      |                              |
|                      v                              |
|  LangChain + Cohere (Embeddings & Retrieval)        |
|                      |                              |
|                      v                              |
|  FAISS Vector Store (local semantic index)          |
|                      |                              |
|                      v                              |
|  WHO/NIH/CDC Knowledge Base (.txt / .md)            |
+-----------------------------------------------------+
```
- Add sources: drop `.md`/`.txt` files into `/data` and rebuild the index.
- Chunking: `chunk_size=800–1200` tokens with `overlap=120–160`.
- Metadata: `{source, title, url, section, date_scraped}` stored with each chunk for traceability.
- Versioning: since the corpus lives in `/data`, commit hashes serve as KB versions.
- Safety note: exclude PHI; include a medical-advice disclaimer in the UI and the prompt.
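The chunking step above can be sketched as follows; the whitespace-word token approximation and the `chunk_text` helper are illustrative only, not the repo's actual `app/utils.py` implementation:

```python
# Illustrative sliding-window chunker with per-chunk source metadata.
# "Tokens" are approximated by whitespace-split words for simplicity.

def chunk_text(text, source, size=1000, overlap=140):
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + size, len(words))
        chunks.append({
            "text": " ".join(words[start:end]),
            "source": source,           # traceability metadata per chunk
        })
        if end == len(words):
            break
        start = end - overlap           # slide window, keeping overlap
    return chunks

chunks = chunk_text("word " * 2500, source="who_nutrition.txt")
```

With 2500 words, `size=1000`, and `overlap=140`, this yields three overlapping chunks, each carrying its source for citation later.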
- Embedding model: Cohere Embed (English or Multilingual).
- Vector normalization: L2-normalize before storing; prefer Inner Product (IP) in FAISS when normalized.
- Index type: `IndexFlatIP` for exact search, or `IndexHNSWFlat` (`M=32`, `efConstruction=200`, `efSearch=64`) for faster approximate search.
- Store: FAISS + sidecar metadata (LangChain's `FAISS` wrapper).
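A numpy-only sketch of why the normalization choice matters: after L2 normalization, inner product equals cosine similarity, which is exactly what `IndexFlatIP` ranks by. The random 8-d vectors here stand in for Cohere embeddings:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Divide each vector by its L2 norm; eps guards against zero vectors.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)
docs = l2_normalize(rng.normal(size=(100, 8)))   # 100 fake 8-d embeddings
query = l2_normalize(rng.normal(size=(1, 8)))

# On normalized vectors, inner product == cosine similarity,
# so a FAISS IndexFlatIP over `docs` would produce the same ranking.
scores = (docs @ query.T).ravel()
top5 = np.argsort(-scores)[:5]                   # k=5 retrieval
```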
Search params: `k=5` (typical), with `score_threshold` (e.g., 0.2–0.3) to filter weak hits.
Reranking (optional): simple Reciprocal Rank Fusion (RRF) across sections or a light cross-encoder.
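The optional RRF reranker can be as simple as the sketch below; the doc ids are illustrative and `k=60` is the conventional RRF constant:

```python
# Minimal Reciprocal Rank Fusion: merge several ranked lists of doc ids.
# Each doc scores 1 / (k + rank + 1) per list; higher total wins.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse two retrieval passes (e.g., over different KB sections).
fused = rrf([["a", "b", "c"], ["a", "c", "d"]])
```

Documents that appear high in multiple lists ("a", then "c" here) bubble to the top even when individual scores are not comparable across retrievers.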
Retriever step
Generation step
System prompt (core policy)

```
You are a healthcare information assistant. Use ONLY the provided context
from trusted sources. If the answer is not in the context, say you don't
know and suggest consulting a clinician.
```

User prompt template

```
Question: {question}

Context: {context_chunks}

Return:
- A concise answer (plain language)
- Bullet list of citations with titles
- If needed, a cautionary note (not medical advice)
```
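Filling the template from retrieved hits might look like this minimal sketch; `build_prompt` and the hit fields are hypothetical names, not the repo's API:

```python
# Hypothetical prompt builder: formats retrieved chunks (with title/url
# for citations) into the user prompt template shown above.

TEMPLATE = """Question: {question}

Context:
{context}

Return:
- A concise answer (plain language)
- Bullet list of citations with titles
- If needed, a cautionary note (not medical advice)"""

def build_prompt(question, hits):
    context = "\n\n".join(
        f"[{h['title']}] ({h['url']})\n{h['text']}" for h in hits
    )
    return TEMPLATE.format(question=question, context=context)

prompt = build_prompt(
    "What is a healthy daily sodium intake?",
    [{"title": "CDC: Sodium", "url": "https://example.org", "text": "..."}],
)
```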
Post-processing: surface `date_scraped` with each answer and encourage cross-checking source pages.

```
RAG4HealthQA/
├── main.py                     # Streamlit frontend
├── app/
│   ├── rag_pipeline.py         # End-to-end RAG chain
│   ├── embedding_config.py     # Cohere embedding model setup
│   ├── vector_store.py         # Build/load FAISS index
│   └── utils.py                # IO, text cleaning, chunking
├── data/                       # WHO/NIH/CDC .txt/.md corpus
├── scripts/
│   └── generate_health_kb.py   # Scraper/loader for KB refresh
├── tests/                      # Pytest for retriever & prompt paths
└── .streamlit/config.toml      # Theme, layout
```
```python
# build_index.py
docs = load_docs("./data")
cleaned = [normalize(d) for d in docs]
chunks = chunk(cleaned, size=1000, overlap=140, keep_sections=True)

emb = cohere_embedder(model="embed-english-v3")
vectors = emb.embed(chunks)              # shape: (N, d)
vectors = l2_normalize(vectors)

index = build_faiss_index(vectors, kind="IndexFlatIP")
store = save_with_metadata(index, chunks, "./faiss_store")

# query.py
q_vec = l2_normalize(emb.embed([user_query])[0])
hits = store.search(q_vec, k=5, score_thresh=0.25)
context = format_context(hits)           # add titles/sections/urls
answer = cohere_generate(prompt_tpl(question, context), safety=True)
return answer, citations_from(hits)
```
Answer Faithfulness (AF)

AF = (# answers grounded in retrieved docs) / (total answers) × 100%
Coverage Rate (CR)
CR = (# queries answerable from KB) / (total queries) × 100%
Mean Retrieval Latency (MRL): time (ms) from query to top-k retrieval.
End-to-End Latency (E2E): retrieval + generation + post-processing.
Explainability Index (EI): ✓ citations, ✓ snippet relevance, ✓ readable tone.
Safety Score (SS): ✓ refusal on unsafe asks, ✓ disclaimer visible, ✓ no prescriptive advice.
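Computing AF from labeled eval records is straightforward; the `grounded` field is an assumed human annotation, not a field the repo defines:

```python
# Score Answer Faithfulness per the AF formula above:
# AF = grounded answers / total answers * 100.

def answer_faithfulness(records):
    grounded = sum(1 for r in records if r["grounded"])
    return 100.0 * grounded / len(records)

# 46 grounded out of 50 rated answers -> AF of 92%.
records = [{"grounded": True}] * 46 + [{"grounded": False}] * 4
af = answer_faithfulness(records)
```

CR, EI, and SS follow the same pattern with their own per-query labels.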
100 benchmark questions across: nutrition, hypertension, diabetes, mental health.
Hold-out set of 20 adversarial/ambiguous queries.
Three configs: exact `IndexFlatIP` with `k=5`; `IndexHNSWFlat` with `efSearch=64`; and `k=8` with RRF reranking.
Report medians + IQR; human raters score Faithfulness and Explainability.
| Config | AF ↑ | CR ↑ | MRL (ms) ↓ | E2E (s) ↓ | EI ↑ | SS ↑ |
|---|---|---|---|---|---|---|
| FlatIP k=5 | 92% | 94% | 65 | 1.7 | 0.92 | 1.00 |
| HNSW ef=64 | 91% | 94% | 34 | 1.6 | 0.91 | 1.00 |
| k=8 + RRF | 93% | 96% | 74 | 1.9 | 0.94 | 1.00 |
Interpretation: HNSW improves latency; RRF slightly improves AF/CR at a modest E2E cost.
Chunk size: 600–1400 tokens; smaller chunks help precision, larger ones help coverage.
k (top-k): 3β8; beyond 8 tends to add noise without reranking.
Score thresholds: too low raises hallucination risk; too high increases "I don't know" frequency.
Frequent failure modes: outdated guidance, ambiguous user ask, multi-topic questions.
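The threshold tradeoff can be illustrated with a toy example (the scores below are made up):

```python
# Toy illustration of the score-threshold tradeoff: a permissive
# threshold lets weak hits through (hallucination risk), while a strict
# one refuses the query with "I don't know".

hits = [0.41, 0.33, 0.27, 0.18, 0.09]   # retrieval scores for one query

def answerable(scores, threshold):
    return any(s >= threshold for s in scores)

permissive = answerable(hits, 0.05)     # weak hits pass -> risky answer
strict = answerable(hits, 0.50)         # nothing passes -> refusal
```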
```bash
# 1) Clone
git clone https://github.com/Sankarraj-Subramani/RAG4HealthQA
cd RAG4HealthQA

# 2) Install
pip install -r requirements.txt

# 3) Set credentials
echo "COHERE_API_KEY=your-cohere-api-key" > .env

# 4) Optional: refresh corpus
python scripts/generate_health_kb.py

# 5) Run app
streamlit run main.py
```
RAG4HealthQA demonstrates that reproducible, explainable, and safe health Q&A is practical with open tooling. By grounding LLM outputs in vetted medical sources and exposing citations, the system advances digital health literacy and aligns with national public-interest goals around AI transparency and health equity.
Author
Sankarraj Subramani · QA Automation Lead · AI/ML Innovator · EB2-NIW/EB1A Aspirant
GitHub • LinkedIn