RAG4HealthQA is an educational tool that summarizes public health information. It does not provide medical advice, diagnosis, or treatment. For personal health decisions, consult a licensed clinician or local health authority.
Live demo: rag4healthapp-ussxwkag2zmbdvrtjzvsfq.streamlit.app/
Repo: https://github.com/Sankarraj-Subramani/RAG4HealthQA
Stack: LangChain · FAISS · Cohere (embeddings/LLM) · Streamlit · Python
NIW Alignment: Open, auditable, and privacy-aware access to trusted medical knowledge (WHO/NIH/CDC); supports health equity, AI transparency, and digital literacy.
Healthcare misinformation remains a public-health risk. Patients, educators, and even clinicians struggle to locate concise, reliable, and explainable answers. Generic LLMs can hallucinate and lack source traceability.
Research gap. Few open tools combine (i) curated medical corpora, (ii) auditable retrieval pipelines, and (iii) patient-friendly UX with explicit citations and refusal policies for unsafe queries.
Thesis. A domain-tuned Retrieval-Augmented Generation (RAG) system, grounded in WHO/NIH/CDC sources, can deliver faithful, explainable answers while keeping deployment simple (Streamlit) and privacy-preserving (local FAISS).
RAG4HealthQA is a healthcare Q&A assistant that uses RAG to answer questions grounded in curated public health documents. It ships with:

- An end-to-end RAG pipeline (`app/rag_pipeline.py`)
- A curated WHO/NIH/CDC corpus (`/data/*.txt|.md`)
- A scraper for KB refresh (`scripts/generate_health_kb.py`)
```mermaid
flowchart TD
    A["Streamlit Frontend (main.py)"]
    B["RAG Pipeline (app/rag_pipeline.py)"]
    C["Embeddings + Retriever (LangChain + Cohere)"]
    D["FAISS Vector Store (local semantic index)"]
    E["Curated Health KB (WHO/NIH/CDC .txt/.md)"]
    A --> B --> C --> D --> E
```
ASCII fallback

```
+-----------------------------------------------------+
|                    RAG4HealthQA                     |
+-----------------------------------------------------+
|  Streamlit Frontend (main.py)                       |
|                      |                              |
|                      v                              |
|  RAG Pipeline (app/rag_pipeline.py)                 |
|                      |                              |
|                      v                              |
|  LangChain + Cohere (Embeddings & Retrieval)        |
|                      |                              |
|                      v                              |
|  FAISS Vector Store (local semantic index)          |
|                      |                              |
|                      v                              |
|  WHO/NIH/CDC Knowledge Base (.txt / .md)            |
+-----------------------------------------------------+
```
- Add sources: drop `.md`/`.txt` files into `/data` and rebuild the index.
- Chunking: `chunk_size=800–1200` tokens with `overlap=120–160`.
- Metadata: `{source, title, url, section, date_scraped}` stored with each chunk for traceability.
- Versioning: since the corpus lives in `/data`, commit hashes serve as KB versions.
- Safety note: exclude PHI; include a medical-advice disclaimer in the UI and the prompt.
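The chunking step above can be sketched as follows; the whitespace-word token approximation and the `chunk_text` helper are illustrative only, not the repo's actual `app/utils.py` implementation:

```python
# Illustrative sliding-window chunker with per-chunk source metadata.
# "Tokens" are approximated by whitespace-split words for simplicity.

def chunk_text(text, source, size=1000, overlap=140):
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + size, len(words))
        chunks.append({
            "text": " ".join(words[start:end]),
            "source": source,           # traceability metadata per chunk
        })
        if end == len(words):
            break
        start = end - overlap           # slide window, keeping overlap
    return chunks

chunks = chunk_text("word " * 2500, source="who_nutrition.txt")
```

With 2500 words, `size=1000`, and `overlap=140`, this yields three overlapping chunks, each carrying its source for citation later.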
- Embedding model: Cohere Embed (English or Multilingual).
- Vector normalization: L2-normalize before storing; prefer Inner Product (IP) in FAISS when normalized.
- Index type: `IndexFlatIP` for exact search, or `IndexHNSWFlat` (`M=32`, `efConstruction=200`, `efSearch=64`) for faster approximate search.
- Store: FAISS + sidecar metadata (LangChain's `FAISS` wrapper).
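A numpy-only sketch of why the normalization choice matters: after L2 normalization, inner product equals cosine similarity, which is exactly what `IndexFlatIP` ranks by. The random 8-d vectors here stand in for Cohere embeddings:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Divide each vector by its L2 norm; eps guards against zero vectors.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)
docs = l2_normalize(rng.normal(size=(100, 8)))   # 100 fake 8-d embeddings
query = l2_normalize(rng.normal(size=(1, 8)))

# On normalized vectors, inner product == cosine similarity,
# so a FAISS IndexFlatIP over `docs` would produce the same ranking.
scores = (docs @ query.T).ravel()
top5 = np.argsort(-scores)[:5]                   # k=5 retrieval
```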
Search params: `k=5` (typical), with `score_threshold` (e.g., 0.2–0.3) to filter weak hits.
Reranking (optional): simple Reciprocal Rank Fusion (RRF) across sections or a light cross-encoder.
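The optional RRF reranker can be as simple as the sketch below; the doc ids are illustrative and `k=60` is the conventional RRF constant:

```python
# Minimal Reciprocal Rank Fusion: merge several ranked lists of doc ids.
# Each doc scores 1 / (k + rank + 1) per list; higher total wins.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse two retrieval passes (e.g., over different KB sections).
fused = rrf([["a", "b", "c"], ["a", "c", "d"]])
```

Documents that appear high in multiple lists ("a", then "c" here) bubble to the top even when individual scores are not comparable across retrievers.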
Retriever step
Generation step
System prompt (core policy)

```
You are a healthcare information assistant. Use ONLY the provided context
from trusted sources. If the answer is not in the context, say you don't
know and suggest consulting a clinician.
```

User prompt template

```
Question: {question}

Context: {context_chunks}

Return:
- A concise answer (plain language)
- Bullet list of citations with titles
- If needed, a cautionary note (not medical advice)
```
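Filling the template from retrieved hits might look like this minimal sketch; `build_prompt` and the hit fields are hypothetical names, not the repo's API:

```python
# Hypothetical prompt builder: formats retrieved chunks (with title/url
# for citations) into the user prompt template shown above.

TEMPLATE = """Question: {question}

Context:
{context}

Return:
- A concise answer (plain language)
- Bullet list of citations with titles
- If needed, a cautionary note (not medical advice)"""

def build_prompt(question, hits):
    context = "\n\n".join(
        f"[{h['title']}] ({h['url']})\n{h['text']}" for h in hits
    )
    return TEMPLATE.format(question=question, context=context)

prompt = build_prompt(
    "What is a healthy daily sodium intake?",
    [{"title": "CDC: Sodium", "url": "https://example.org", "text": "..."}],
)
```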
Post-processing: surface `date_scraped` with each answer and encourage cross-checking source pages.

```
RAG4HealthQA/
├── main.py                     # Streamlit frontend
├── app/
│   ├── rag_pipeline.py         # End-to-end RAG chain
│   ├── embedding_config.py     # Cohere embedding model setup
│   ├── vector_store.py         # Build/load FAISS index
│   └── utils.py                # IO, text cleaning, chunking
├── data/                       # WHO/NIH/CDC .txt/.md corpus
├── scripts/
│   └── generate_health_kb.py   # Scraper/loader for KB refresh
├── tests/                      # Pytest for retriever & prompt paths
└── .streamlit/config.toml      # Theme, layout
```
```python
# build_index.py
docs = load_docs("./data")
cleaned = [normalize(d) for d in docs]
chunks = chunk(cleaned, size=1000, overlap=140, keep_sections=True)

emb = cohere_embedder(model="embed-english-v3")
vectors = emb.embed(chunks)              # shape: (N, d)
vectors = l2_normalize(vectors)

index = build_faiss_index(vectors, kind="IndexFlatIP")
store = save_with_metadata(index, chunks, "./faiss_store")

# query.py
q_vec = l2_normalize(emb.embed([user_query])[0])
hits = store.search(q_vec, k=5, score_thresh=0.25)
context = format_context(hits)           # add titles/sections/urls
answer = cohere_generate(prompt_tpl(question, context), safety=True)
return answer, citations_from(hits)
```
Answer Faithfulness (AF)

AF = (# answers grounded in retrieved docs) / (total answers) × 100%
Coverage Rate (CR)
CR = (# queries answerable from KB) / (total queries) × 100%
Mean Retrieval Latency (MRL): time (ms) from query to top-k retrieval.
End-to-End Latency (E2E): retrieval + generation + post-processing.
Explainability Index (EI): ✓ citations, ✓ snippet relevance, ✓ readable tone.
Safety Score (SS): ✓ refusal on unsafe asks, ✓ disclaimer visible, ✓ no prescriptive advice.
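Computing AF from labeled eval records is straightforward; the `grounded` field is an assumed human annotation, not a field the repo defines:

```python
# Score Answer Faithfulness per the AF formula above:
# AF = grounded answers / total answers * 100.

def answer_faithfulness(records):
    grounded = sum(1 for r in records if r["grounded"])
    return 100.0 * grounded / len(records)

# 46 grounded out of 50 rated answers -> AF of 92%.
records = [{"grounded": True}] * 46 + [{"grounded": False}] * 4
af = answer_faithfulness(records)
```

CR, EI, and SS follow the same pattern with their own per-query labels.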
100 benchmark questions across: nutrition, hypertension, diabetes, mental health.
Hold-out set of 20 adversarial/ambiguous queries.
Three configs: exact `IndexFlatIP` with `k=5`; `IndexHNSWFlat` with `efSearch=64`; and `k=8` with RRF reranking.
Report medians + IQR; human raters score Faithfulness and Explainability.
| Config | AF ↑ | CR ↑ | MRL (ms) ↓ | E2E (s) ↓ | EI ↑ | SS ↑ |
|---|---|---|---|---|---|---|
| FlatIP k=5 | 92% | 94% | 65 | 1.7 | 0.92 | 1.00 |
| HNSW ef=64 | 91% | 94% | 34 | 1.6 | 0.91 | 1.00 |
| k=8 + RRF | 93% | 96% | 74 | 1.9 | 0.94 | 1.00 |
Interpretation: HNSW improves latency; RRF slightly improves AF/CR at a modest E2E cost.
Chunk size: 600–1400 tokens; smaller chunks help precision, larger ones help coverage.
k (top-k): 3β8; beyond 8 tends to add noise without reranking.
Score thresholds: too low raises hallucination risk; too high increases "I don't know" frequency.
Frequent failure modes: outdated guidance, ambiguous user ask, multi-topic questions.
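The threshold tradeoff can be illustrated with a toy example (the scores below are made up):

```python
# Toy illustration of the score-threshold tradeoff: a permissive
# threshold lets weak hits through (hallucination risk), while a strict
# one refuses the query with "I don't know".

hits = [0.41, 0.33, 0.27, 0.18, 0.09]   # retrieval scores for one query

def answerable(scores, threshold):
    return any(s >= threshold for s in scores)

permissive = answerable(hits, 0.05)     # weak hits pass -> risky answer
strict = answerable(hits, 0.50)         # nothing passes -> refusal
```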
```bash
# 1) Clone
git clone https://github.com/Sankarraj-Subramani/RAG4HealthQA
cd RAG4HealthQA

# 2) Install
pip install -r requirements.txt

# 3) Set credentials
echo "COHERE_API_KEY=your-cohere-api-key" > .env

# 4) Optional: refresh corpus
python scripts/generate_health_kb.py

# 5) Run app
streamlit run main.py
```
RAG4HealthQA demonstrates that reproducible, explainable, and safe health Q&A is practical with open tooling. By grounding LLM outputs in vetted medical sources and exposing citations, the system advances digital health literacy and aligns with national public-interest goals around AI transparency and health equity.
Author
Sankarraj Subramani · QA Automation Lead · AI/ML Innovator · EB2-NIW/EB1A Aspirant
GitHub • LinkedIn