Extending the evidence-practice gap solution from RAG retrieval into coordinated clinical reasoning: using specialized AI agents to handle drug safety, risk stratification, and patient communication in parallel.

Phase I of this project, CardioCDSS, established a RAG-based guideline retrieval engine that transforms static cardiovascular guidelines into a queryable clinical knowledge system. It solved the first barrier of the evidence-practice gap: getting the right evidence in front of the right person at the right time.
But retrieval alone is not enough.
A clinician querying a guideline engine still has to:
These are not retrieval problems. They are reasoning and coordination problems, and they call for a different architecture.
Phase I retrieved the evidence. Phase II reasons over it.
The three persistent barriers that motivated this phase:
Time pressure. A clinician must synthesize guideline recommendations, check drug safety, and assess risk simultaneously. Doing these sequentially in a consultation is impractical. These tasks need to run in parallel, be coordinated, and be merged into a single output.
Medication safety at scale. A guideline recommendation is only safe in the context of what the patient is already taking. A system that returns recommendations without cross-checking the current medication list is incomplete and potentially harmful.
Patient compliance. Evidence-based recommendations written in clinical language do not improve outcomes if patients cannot understand or act on them. The last mile, translating clinical plans into plain language, is consistently underdone in CDSS design.
CardioSentinel MAS is a multi-agent layer that sits directly on top of the CardioCDSS RAG engine.
Where Phase I answers: "What does the guideline say?"
Phase II answers: "Given this specific patient, what is the safe, risk-stratified, actionable plan, explained at two levels: for the clinician, and for the patient?"
The system is a proof-of-concept architecture. The tools are mocked (see Limitations). No real patient data is used. This is not a clinical product; it is a demonstration of how a production multi-agent CDSS should be structured.
This is Phase II in the three-layer CardioSentinel ecosystem:
    ┌─────────────────────────┐
    │  Guideline RAG Engine   │  ← Phase I (CardioCDSS)
    │  Evidence Retrieval     │
    └────────────┬────────────┘
                 │
    ┌────────────▼────────────┐
    │  Multi-Agent System     │  ← Phase II (This project)
    │  Care Planning + Safety │
    └────────────┬────────────┘
                 │
    ┌────────────▼────────────┐
    │  Production System      │  ← Phase III (Future)
    │  (UI, EHR Integration)  │
    └─────────────────────────┘
The system decomposes a clinical query into four specialized agents, coordinated by a central orchestrator:
    User Query + Patient Data
            │
            ▼
    Orchestrator Agent
            │  Plans which agents are needed
            │  Calls agents in dependency order
            │  Merges outputs into final report
            │
            ├──▶ Guideline Agent ───▶ GuidelineRetrieverTool (RAG Engine)
            │      Returns: recommendations + citations + confidence
            │
            ├──▶ Risk Agent ────────▶ RiskScoreCalculator
            │      Returns: score (0-100) + classification + contributing factors
            │
            ├──▶ Medication Agent ──▶ DrugInteractionTool
            │                     ──▶ ContraindicationChecker
            │      Returns: interaction warnings + contraindication flags + safe_to_proceed
            │
            └──▶ Patient Agent ─────▶ Claude API
                   Returns: plain-language summary + lifestyle advice
Each agent is a self-contained class with a single .run() method. Agents do not call each other directly; all coordination goes through the orchestrator.
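The coordination rule above can be sketched in a few lines. This is a minimal illustration, not the project's orchestrator: the `Agent` protocol, the two `Echo*` stand-in agents, and the shape of the `context` dict are all assumptions made for the example.

```python
from typing import Any, Protocol


class Agent(Protocol):
    """Anything exposing a single run() method counts as an agent."""

    def run(self, context: dict[str, Any]) -> dict[str, Any]: ...


class EchoRiskAgent:
    # Hypothetical stand-in: classifies from the patient data it is handed.
    def run(self, context: dict[str, Any]) -> dict[str, Any]:
        age = context["patient"]["age"]
        return {"classification": "high" if age >= 65 else "low"}


class EchoPatientAgent:
    # Hypothetical stand-in: reads the risk result from the shared context,
    # never by calling the risk agent itself.
    def run(self, context: dict[str, Any]) -> dict[str, Any]:
        risk = context["results"]["risk"]["classification"]
        return {"summary": f"Your cardiovascular risk is {risk}."}


class Orchestrator:
    def __init__(self, agents: dict[str, Agent]) -> None:
        # Insertion order doubles as dependency order in this sketch.
        self.agents = agents

    def run(self, patient: dict[str, Any], query: str) -> dict[str, Any]:
        results: dict[str, Any] = {}
        for name, agent in self.agents.items():
            # Each agent sees the patient, the query, and earlier results only.
            context = {"patient": patient, "query": query, "results": results}
            results[name] = agent.run(context)
        return results


report = Orchestrator(
    {"risk": EchoRiskAgent(), "patient": EchoPatientAgent()}
).run({"age": 65}, "What is first-line therapy?")
```

Because every dependency flows through the `results` dict owned by the orchestrator, either agent can be swapped or tested alone by handing it a hand-built context.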
The Orchestrator Agent is the central brain of the system. It does not contain clinical logic. It plans which agents are needed, calls them in dependency order, and merges their outputs into a FinalReport.
Key design decision: The orchestrator uses dynamic agent selection, so not every query triggers every agent. A simple informational query may only need the Guideline Agent. A high-risk patient with polypharmacy triggers all four.
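One way dynamic agent selection could look is sketched below. The selection rules are assumptions chosen for illustration (the source does not specify them), as are the function name `select_agents` and the agent name strings.

```python
def select_agents(patient: dict) -> list[str]:
    """Return agent names in dependency order (hypothetical rules)."""
    selected = ["guideline"]  # every query needs evidence
    if not patient:
        # Purely informational query: no patient data, so nothing else to run.
        return selected
    selected.append("risk")
    if patient.get("conditions"):
        # Conditions imply candidate medications, so run the safety checks.
        selected.append("medication")
    selected.append("patient")  # plain-language summary for a real patient
    return selected


informational = select_agents({})
full = select_agents({"age": 65, "conditions": ["hypertension", "smoker"]})
```

Keeping the selection logic in one pure function makes it easy to test every patient profile without running any agent.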
The Guideline Agent interfaces with the RAG engine from Phase I.
It calls GuidelineRetrieverTool. When evidence is missing it returns insufficient_evidence and never fabricates.
Output schema:

    {
      "recommendations": ["Thiazide diuretics recommended as first-line..."],
      "evidence_sources": ["ACC/AHA 2023 Hypertension Guidelines"],
      "confidence": "high"
    }
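The output schema above maps naturally onto a typed dataclass. The sketch below assumes the class name `GuidelineOutput` (listed in schemas/outputs.py) carries exactly these three fields; the actual definition may differ, and `confidence` is typed as a plain string because the full set of allowed values is not documented.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GuidelineOutput:
    recommendations: list[str]
    evidence_sources: list[str]
    confidence: str  # e.g. "high"; other values are not documented here


output = GuidelineOutput(
    recommendations=["Thiazide diuretics recommended as first-line..."],
    evidence_sources=["ACC/AHA 2023 Hypertension Guidelines"],
    confidence="high",
)

# Omitting a required field fails loudly at construction time.
try:
    GuidelineOutput(recommendations=[], confidence="high")  # no evidence_sources
    missing_field_rejected = False
except TypeError:
    missing_field_rejected = True
```

Dataclasses do not validate field types at runtime, so a static type checker is still needed to catch a wrong type as opposed to a missing field.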
The Risk Agent quantifies patient cardiovascular risk to contextualize guideline recommendations.
It runs RiskScoreCalculator against the patient's age, blood pressure, LDL, and conditions. The risk score determines whether guideline recommendations are framed conservatively or urgently in the final report.
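An additive point formula of the kind described can be sketched as follows. The point weights, thresholds, and class labels here are invented for illustration; they are not the project's values and are not clinically validated.

```python
# Illustrative weights only: invented for this sketch, not clinically validated.
POINTS = {
    "age_65_plus": 25,
    "sbp_140_plus": 25,
    "ldl_160_plus": 20,
    "smoker": 20,
    "diabetes": 20,
}


def risk_score(patient: dict) -> dict:
    systolic = int(patient["bp"].split("/")[0])  # "150/95" -> 150
    conditions = set(patient.get("conditions", []))
    checks = {
        "age_65_plus": patient["age"] >= 65,
        "sbp_140_plus": systolic >= 140,
        "ldl_160_plus": patient["ldl"] >= 160,
        "smoker": "smoker" in conditions,
        "diabetes": "diabetes" in conditions,
    }
    factors = [name for name, hit in checks.items() if hit]
    score = min(100, sum(POINTS[name] for name in factors))  # cap at 100
    if score >= 60:
        classification = "high"
    elif score >= 30:
        classification = "moderate"
    else:
        classification = "low"
    return {
        "score": score,
        "classification": classification,
        "contributing_factors": factors,
    }


result = risk_score(
    {"age": 65, "bp": "150/95", "ldl": 160, "conditions": ["hypertension", "smoker"]}
)
```

Returning the contributing factors alongside the score lets the final report explain why a patient landed in a given class.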
The Medication Agent is the safety-critical agent. It runs two checks before any recommendation proceeds.
DrugInteractionTool: Checks every pair of inferred medications for known interactions. Severity-filtered: only major and contraindicated pairs block the pipeline.
ContraindicationChecker: Cross-references patient conditions against proposed medications. Examples: beta-blockers are contraindicated in asthma; ACE inhibitors are contraindicated in bilateral renal artery stenosis.
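The pairwise, severity-filtered interaction check can be sketched like this. The rule table is a hypothetical three-entry example with assumed severity labels, not the project's data and not a clinical reference; the function name `check_interactions` is also an assumption.

```python
from itertools import combinations

# Hypothetical rule table. frozenset keys make lookups order-independent.
INTERACTIONS = {
    frozenset({"ace_inhibitor", "potassium_sparing_diuretic"}): ("major", "hyperkalemia risk"),
    frozenset({"beta_blocker", "verapamil"}): ("major", "bradycardia / heart block risk"),
    frozenset({"ace_inhibitor", "nsaid"}): ("moderate", "reduced antihypertensive effect"),
}
BLOCKING = {"major", "contraindicated"}  # only these severities block the pipeline


def check_interactions(medications: list[str]) -> dict:
    warnings = []
    # Sorting gives a stable pair order regardless of input order.
    for a, b in combinations(sorted(set(medications)), 2):
        hit = INTERACTIONS.get(frozenset({a, b}))
        if hit:
            severity, note = hit
            warnings.append({"pair": (a, b), "severity": severity, "note": note})
    blocked = any(w["severity"] in BLOCKING for w in warnings)
    return {"warnings": warnings, "safe_to_proceed": not blocked}


major = check_interactions(["verapamil", "beta_blocker"])
moderate = check_interactions(["ace_inhibitor", "nsaid"])
```

A moderate interaction still appears in `warnings` so the clinician sees it, but it does not flip `safe_to_proceed`.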
Critical failure behavior: If either tool is unavailable, the agent returns a warning flag and sets safe_to_proceed = False. An unavailable safety check is treated as a failed safety check, not as a passed one.
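The fail-closed behavior can be sketched as a thin wrapper around the two checks. The exception type `ToolUnavailable`, the function `run_safety_checks`, and the convention that each check returns a list of blocking findings are assumptions for this example.

```python
class ToolUnavailable(Exception):
    """Hypothetical error raised when a safety tool cannot be reached."""


def run_safety_checks(medications, conditions, interaction_check, contraindication_check) -> dict:
    try:
        warnings = interaction_check(medications)
        flags = contraindication_check(conditions, medications)
    except ToolUnavailable as exc:
        # Fail closed: an unavailable check counts as a failed check.
        return {
            "warnings": [f"Safety check unavailable: {exc}"],
            "flags": [],
            "safe_to_proceed": False,
        }
    # In this sketch both checks return only blocking findings.
    return {
        "warnings": warnings,
        "flags": flags,
        "safe_to_proceed": not (warnings or flags),
    }


def broken_tool(*_args):
    raise ToolUnavailable("interaction database timed out")


ok = run_safety_checks(["thiazide"], ["hypertension"], lambda m: [], lambda c, m: [])
down = run_safety_checks(["thiazide"], ["hypertension"], broken_tool, lambda c, m: [])
```

The important property is that `safe_to_proceed` can only become true on a path where both checks actually ran.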
The Patient Agent converts the clinical output into language a patient can understand and act on.
All clinical knowledge lives in tools, not in agents. Agents are logic; tools are data. This separation means tools can be replaced with real data sources without touching agent code.
| Tool | What It Does Now |
|---|---|
| GuidelineRetrieverTool | Returns hardcoded strings from a Python dict keyed by condition |
| DrugInteractionTool | Checks a hardcoded dict of ~6 drug pairs |
| ContraindicationChecker | Checks a hardcoded dict of ~8 condition-to-drug mappings |
| RiskScoreCalculator | Additive point formula (not clinically validated) |
| Tool | Real Replacement |
|---|---|
| GuidelineRetrieverTool | Phase I CardioCDSS RAG engine (ChromaDB + Neo4j + Cohere reranker) |
| DrugInteractionTool | Lexicomp, DrFirst, or First Databank API (licensed) |
| ContraindicationChecker | Same licensed API + RxNorm drug normalization layer |
| RiskScoreCalculator | ACC/AHA Pooled Cohort Equations (Goff et al., JACC 2014) |
The tool interface contract (input/output schema) remains identical in both cases. Agents do not need to change.
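A structural interface is one way to express that contract. In the sketch below, the `GuidelineTool` protocol, `CallableGuidelineTool`, and the use of "insufficient_evidence" as a confidence value are assumptions; only the field names come from the output schema shown earlier.

```python
from typing import Protocol


class GuidelineTool(Protocol):
    def run(self, condition: str) -> dict: ...


def insufficient() -> dict:
    # Assumed shape of the abstention result.
    return {"recommendations": [], "evidence_sources": [], "confidence": "insufficient_evidence"}


class MockGuidelineTool:
    """Mirrors the current mock: a dict keyed by condition."""

    _DATA = {
        "hypertension": {
            "recommendations": ["Thiazide diuretics recommended as first-line..."],
            "evidence_sources": ["ACC/AHA 2023 Hypertension Guidelines"],
            "confidence": "high",
        }
    }

    def run(self, condition: str) -> dict:
        return self._DATA.get(condition, insufficient())


class CallableGuidelineTool:
    """Hypothetical replacement: wraps any retrieval function behind the same contract."""

    def __init__(self, retrieve) -> None:
        self.retrieve = retrieve

    def run(self, condition: str) -> dict:
        return self.retrieve(condition) or insufficient()


class GuidelineAgent:
    def __init__(self, tool: GuidelineTool) -> None:
        self.tool = tool  # the agent knows the contract, not the backend

    def run(self, condition: str) -> dict:
        return self.tool.run(condition)


mock_agent = GuidelineAgent(MockGuidelineTool())
swapped_agent = GuidelineAgent(CallableGuidelineTool(lambda condition: None))
```

Swapping the backend changes only the constructor argument; `GuidelineAgent` itself is untouched.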
| Decision | Choice | Reason |
|---|---|---|
| Framework | Plain Python classes, no LangGraph | LangGraph adds complexity and state management overhead that isn't justified at this scale. Simpler to test, easier to reason about. |
| Agent communication | All through orchestrator, never direct | Prevents spaghetti dependencies. Each agent can be tested and swapped independently. |
| Failure handling | Agents abstain, not guess | In healthcare, a wrong answer from a failed tool is worse than no answer. Abstention is the only safe default. |
| LLM usage | Only in PatientAgent | Clinical reasoning agents use deterministic tools. The LLM is used only for natural language reformatting, a lower-risk task. |
| Output schema | Typed Python dataclasses | Enforces structure at the code level. A missing citation field is a type error, not a runtime surprise. |
| Stateless pipeline | No session memory | Mirrors the Phase I decision. Prevents patient data cross-contamination between queries. |
| Test coverage | 54 tests including failure paths | Failure paths are tested as rigorously as happy paths. A system that fails silently in production is worse than one that visibly breaks in testing. |
    Phase I Output                  Phase II Input
    ──────────────────────────────────────────────────────────
    {                               GuidelineAgent.run()
      "recommendations": [...],  ──▶  calls GuidelineRetrieverTool
      "sources": [...],               which calls Phase I RAG API
      "confidence": "high"
    }
In the current implementation, GuidelineRetrieverTool is mocked. In a connected deployment, .run() makes an HTTP call to the Phase I CardioCDSS API. The agent receives the same output schema either way, so it cannot tell the difference.
This is intentional. The agent layer is decoupled from the retrieval layer by design.
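A connected version of the tool might look like the sketch below, using only the standard library. The environment variable names come from the setup notes; the `/query` path, the request payload, the Bearer header, the localhost default, and the class name `HttpGuidelineRetrieverTool` are all assumptions, since the Phase I API is not specified here.

```python
import json
import os
import urllib.request
from typing import Callable


def insufficient_evidence() -> dict:
    return {"recommendations": [], "sources": [], "confidence": "insufficient_evidence"}


def http_post_json(url: str, payload: dict, api_key: str, timeout: float = 10.0) -> dict:
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


class HttpGuidelineRetrieverTool:
    def __init__(self, post: Callable[[str, dict, str], dict] = http_post_json) -> None:
        self.post = post  # injectable so tests never touch the network
        base = os.environ.get("RAG_API_URL", "http://localhost:8000")
        self.url = base.rstrip("/") + "/query"  # endpoint path is assumed
        self.api_key = os.environ.get("RAG_API_KEY", "")

    def run(self, query: str) -> dict:
        try:
            body = self.post(self.url, {"query": query}, self.api_key)
        except (OSError, ValueError):
            # Network errors, timeouts, and bad JSON all lead to abstention.
            return insufficient_evidence()
        if not all(key in body for key in ("recommendations", "sources", "confidence")):
            return insufficient_evidence()  # schema mismatch: abstain as well
        return body
```

Any transport or schema failure collapses to the same abstention result, so the agent sees one output shape whether the backend is mocked, healthy, or down.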
    multi_agent_system/
    │
    ├── agents/
    │   ├── orchestrator.py           # Coordinates all agents, owns the pipeline
    │   ├── guideline_agent.py        # Evidence retrieval via RAG tool
    │   ├── medication_agent.py       # Drug interaction + contraindication checks
    │   ├── risk_agent.py             # Cardiovascular risk scoring
    │   └── patient_agent.py          # Plain-language output via Claude API
    │
    ├── tools/
    │   ├── rag_tool.py               # ⚠️ MOCK: replace with Phase I RAG API
    │   ├── interaction_tool.py       # ⚠️ MOCK: replace with Lexicomp/DrFirst
    │   ├── contraindication_tool.py  # ⚠️ MOCK: replace with licensed DB + RxNorm
    │   └── risk_tool.py              # ⚠️ SIMPLIFIED: replace with ACC/AHA PCE
    │
    ├── schemas/
    │   └── outputs.py                # Typed dataclasses: FinalReport, GuidelineOutput, etc.
    │
    ├── tests/
    │   ├── conftest.py               # Shared patient fixtures
    │   ├── test_tools.py             # Unit tests for all 4 tools (happy + edge cases)
    │   ├── test_agents.py            # Unit tests for all 4 agents (Claude API mocked)
    │   └── test_pipeline.py          # Integration tests incl. cascading failure scenarios
    │
    ├── pipeline.py                   # run_pipeline() entry point + print_report()
    ├── main.py                       # Example run with sample patient
    ├── requirements.txt
    ├── .env.example
    ├── .gitignore
    └── README.md
    git clone https://github.com/anaboset/cardiosentinel-mas
    cd cardiosentinel-mas
    python -m venv venv
    source venv/bin/activate  # Windows: venv\Scripts\activate
    pip install -r requirements.txt

    cp .env.example .env
    # Set ANTHROPIC_API_KEY for PatientAgent
    # Set RAG_API_URL and RAG_API_KEY to connect Phase I

    python main.py

    pytest tests/ -v
    # Expected: 54 passed
    patient = {
        "age": 65,
        "bp": "150/95",
        "ldl": 160,
        "conditions": ["hypertension", "smoker"],
    }
    query = "What is first-line therapy?"
Output:
    ============================================================
    CLINICAL DECISION SUPPORT REPORT
    ============================================================
    QUERY: What is first-line therapy?
    PATIENT: Age 65, BP 150/95, LDL 160 mg/dL
      Conditions: hypertension, smoker

    RISK STRATIFICATION
      Classification: Very High (Score: 72/100)
      • Age 65 (≥65 years)
      • Stage 2 hypertension (SBP 150)
      • High LDL (160 mg/dL)
      • Active smoker

    GUIDELINE RECOMMENDATIONS (Confidence: high)
      • Thiazide diuretics are recommended as first-line for uncomplicated hypertension.
      • Target BP < 130/80 mmHg for high-risk patients (ACC/AHA 2023).
      • Smoking cessation counseling is mandatory for all smokers.
      • High-intensity statin therapy for LDL > 190 mg/dL or ASCVD risk > 20%.
      Sources:
        [ACC/AHA 2023 Hypertension Guidelines]
        [USPSTF Tobacco Cessation Guidelines 2021]

    MEDICATION SAFETY: Safe to proceed
      No interactions or contraindications flagged.

    PATIENT COMMUNICATION
      Your blood pressure and cholesterol are both elevated, which puts you at
      high risk for a heart attack or stroke, but both are manageable with
      medication and lifestyle changes.
      Lifestyle Advice:
        - Quit smoking: this is the single highest-impact action you can take
        - Reduce salt intake to under 2g/day to help lower blood pressure
        - Walk 30 minutes daily, 5 days a week
        - Follow up in 4 weeks to check BP response to medication
Testing was designed around two principles: verify the happy path, then verify every way it can fail.
Tool unit tests (test_tools.py): Each tool is tested in isolation without any agent or pipeline context.
- GuidelineRetrieverTool: known conditions return recommendations; unknown conditions return insufficient_evidence; duplicate sources are deduplicated
- DrugInteractionTool: known pairs detected; order-independent (A+B = B+A); empty list returns no warnings
- ContraindicationChecker: asthma + beta_blocker flagged; safe combinations pass cleanly; unknown conditions produce no false flags
- RiskScoreCalculator: score capped at 100; smoker adds to score; low-risk young patient classified correctly

Agent unit tests (test_agents.py): Each agent is tested with its tool mocked, so failures in tools can be isolated from failures in agent logic.
- PatientAgent tested with mocked Claude API, covering both success and 401 failure cases
- MedicationAgent._infer_medications() tested directly

Integration tests (test_pipeline.py): Full pipeline run with the Claude API mocked.
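A failure-path test in this style might look like the sketch below. `SummaryAgent` is a hypothetical stand-in for PatientAgent, and `client.complete` is an invented method name, not the Anthropic SDK's API; the point is only the pattern of injecting a mock and asserting on the failure result.

```python
from unittest.mock import Mock


class SummaryAgent:
    """Hypothetical stand-in for PatientAgent: wraps an injected LLM client."""

    def __init__(self, client) -> None:
        self.client = client

    def run(self, clinical_summary: str) -> dict:
        try:
            text = self.client.complete(clinical_summary)  # invented method name
        except Exception as exc:  # e.g. an authentication error from the API
            return {"summary": None, "error": str(exc)}
        return {"summary": text, "error": None}


def test_success_path():
    client = Mock()
    client.complete.return_value = "Your blood pressure is high."
    result = SummaryAgent(client).run("Stage 2 hypertension")
    assert result == {"summary": "Your blood pressure is high.", "error": None}


def test_auth_failure_is_reported_not_raised():
    client = Mock()
    client.complete.side_effect = PermissionError("401 Unauthorized")
    result = SummaryAgent(client).run("Stage 2 hypertension")
    assert result["summary"] is None
    assert "401" in result["error"]
```

Both functions follow pytest's `test_` naming convention, so they would be collected automatically if placed in a test module.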
As with Phase I, evaluation is framed around what matters clinically, not chatbot metrics.
Does the orchestrator call the right agents for a given patient profile?
Does the medication safety layer catch what it should?
Does the system correctly decline to answer when evidence is missing?
Expected behavior: insufficient_evidence is returned.
Does one agent failure bring down the whole pipeline?
Target: full pipeline under 5 seconds excluding LLM call.
All tools are mocked. The drug interaction database has ~6 rules. The contraindication checker covers ~8 conditions. The risk scorer uses simplified arithmetic that is not clinically validated. The guideline tool returns hardcoded strings, not real retrieved evidence.
Patient schema is incomplete. A real cardiovascular risk calculator requires sex, race/ethnicity, HDL, total cholesterol, and BP treatment status. These fields are absent from the current patient schema.
Drug names are not normalized. The medication agent infers drug class names ("ace_inhibitor") from condition names. Real systems must normalize to RxNorm CUI codes before any lookup; otherwise medications with different brand/generic names will be missed.
No human-in-the-loop gate. The current orchestrator passes all output to the final report unconditionally. A production system needs a confidence threshold below which output is suppressed, not displayed.
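A confidence gate of the kind described could be as small as the sketch below. The confidence labels other than "high" and "insufficient_evidence", their ordering, the default threshold, and the function name `gate_report` are assumptions.

```python
# Assumed ordering of confidence labels, lowest first.
CONFIDENCE_RANK = {"insufficient_evidence": 0, "low": 1, "medium": 2, "high": 3}


def gate_report(report: dict, minimum: str = "medium") -> dict:
    # Unknown or missing labels rank lowest, so they are suppressed too.
    rank = CONFIDENCE_RANK.get(report.get("confidence"), 0)
    if rank < CONFIDENCE_RANK[minimum]:
        return {"status": "suppressed", "reason": "confidence below threshold"}
    return {"status": "released", "report": report}


released = gate_report({"confidence": "high", "recommendations": ["..."]})
held = gate_report({"confidence": "low", "recommendations": ["..."]})
```

A suppressed result carries no recommendations at all, so nothing below the threshold can be displayed by accident.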
No authentication or access control. Patient data is passed as a plain Python dict. Production deployment requires role-based access control, audit logging, and HIPAA-compliant data handling (US) or GDPR + MDR compliance (EU).
LLM-generated patient summaries are unreviewed. The PatientAgent output goes directly into the report. In clinical use, LLM output should not reach a patient without clinician review.
Phase III of CardioSentinel will address the remaining gap: deployment infrastructure.
Planned components:
This software is intended for research and architectural demonstration purposes only.
It is not a medical device and not intended for diagnosis, treatment, or clinical decision-making without qualified human oversight.
The tools are mocked. The drug interaction database is incomplete. The risk scoring is not clinically validated. No clinical expert was involved in the design of this system.
All clinical decisions must be made by licensed healthcare professionals. The author assumes no liability for clinical use of this system.