Academia Analyzer: Multi-Agent RAG System for Research Synthesis

1. Introduction and Project Purpose (What is this about?)

The volume of academic literature is growing exponentially, making it difficult for researchers to stay current. The Academia Analyzer Agent solves this problem by providing an automated, multi-agent system that ingests raw research documents (PDFs/Text) and synthesizes them into structured, high-value executive summaries.

This project serves as the Module 2 MVP deliverable, demonstrating mastery over Multi-Agent Orchestration and Tool Integration.

Core Objectives:

Automated Synthesis: Transforms dense academic text into clear Hypotheses, Methodologies, and Findings.
Grounded Generation: Uses Retrieval-Augmented Generation (RAG) to ensure every summary is backed by the actual content of the paper, minimizing hallucinations.
Orchestration: Utilizes LangGraph to coordinate file ingestion, data extraction, and final synthesis.

2. System Architecture: LangGraph Orchestration (Why does it matter?)

The system utilizes a sequential pipeline where three specialized agents collaborate to process the document.

| Agent | Role & Responsibility | Key Tool Integration |
|---|---|---|
| 1. DocumentIngestorAgent | Ingestion & Indexing. Reads local PDF/Text files, chunks the content, and builds the vector search index. | PDF Reader Tool (`pypdf`, `TextLoader`) & RAG Retriever (`FAISS`). |
| 2. ThesisExtractorAgent | Data Extraction. Identifies the paper's core hypothesis and technical keywords using NLP. | Keyword Extractor Tool (`nltk`). |
| 3. InsightSynthesizerAgent | Final Synthesis. Generates the Executive Summary and Key Findings using the RAG context. | LLM API (OpenRouter) & Pydantic Parser (Structured Output). |
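
The sequential hand-off between the three agents can be illustrated with a library-free sketch. This is not the repository's LangGraph code: the state fields (`raw_text`, `chunks`, `keywords`, `summary`) and the stage functions are hypothetical stand-ins, and each stage only imitates the shape of its agent (no FAISS index, no `nltk`, no LLM call). It mirrors the common graph-node pattern in which each node receives the shared state and returns only the keys it adds.

```python
from typing import Callable, TypedDict


class PipelineState(TypedDict, total=False):
    """Shared state passed between stages (field names are hypothetical)."""
    raw_text: str
    chunks: list[str]
    keywords: list[str]
    summary: dict


def ingest(state: PipelineState) -> PipelineState:
    # Stand-in for DocumentIngestorAgent: split on blank lines, no vector index.
    chunks = [p.strip() for p in state["raw_text"].split("\n\n") if p.strip()]
    return {"chunks": chunks}


def extract(state: PipelineState) -> PipelineState:
    # Stand-in for ThesisExtractorAgent: keep long words, first-seen order.
    words = [w.strip(".,;:()").lower() for w in state["raw_text"].split()]
    keywords = list(dict.fromkeys(w for w in words if len(w) > 8))
    return {"keywords": keywords}


def synthesize(state: PipelineState) -> PipelineState:
    # Stand-in for InsightSynthesizerAgent: assemble a record, no LLM call.
    return {"summary": {"n_chunks": len(state["chunks"]),
                        "keywords": state["keywords"]}}


STAGES: list[Callable[[PipelineState], PipelineState]] = [ingest, extract, synthesize]


def run_pipeline(raw_text: str) -> PipelineState:
    state: PipelineState = {"raw_text": raw_text}
    for stage in STAGES:
        state.update(stage(state))  # each stage returns only the keys it adds
    return state
```

Because every stage reads from and writes to one state object, a later stage (synthesis) can use the outputs of both earlier stages without the stages calling each other directly.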

Flow Narrative

Ingest: The user uploads a file via Streamlit. Agent 1 chunks the text and creates a FAISS index.
Extract: Agent 2 scans the raw text to extract frequency-based keywords and topic focus areas.
Synthesize: Agent 3 combines the RAG context and the extracted keywords to generate a structured InsightSummary JSON object.
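
As a rough illustration of the frequency-based keyword step, the sketch below counts non-stopword tokens using only the standard library. It is an assumption about the general technique, not the project's `nltk`-based tool: the stopword set is a tiny illustrative list, and the function name `top_keywords` is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real tool would likely use a fuller corpus.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "for",
             "is", "are", "we", "that", "this", "with"}


def top_keywords(text: str, k: int = 5) -> list[str]:
    """Return the k most frequent non-stopword tokens, most frequent first."""
    # Lowercase, then keep alphabetic tokens of at least two characters.
    tokens = re.findall(r"[a-z][a-z\-]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]
```

Ties are returned in first-seen order, so the output is deterministic for a given input text.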

3. Technical Credibility and Validation (Can I trust it?)

The document-processing settings and the source code are listed below so that the results can be inspected and reproduced.

A. RAG Configuration (Verifiable Evidence)

The RAG index is configured to handle complex academic language:

| Setting | Value | Rationale |
|---|---|---|
| Text Chunk Size | 1500 tokens | Larger chunk size chosen to capture full academic arguments and methodology sections without breaking context. |
| Text Chunk Overlap | 250 tokens | Ensures continuity between pages and paragraphs. |
| Embedding Model | `HuggingFaceEmbeddings(all-MiniLM-L6-v2)` | Fast, efficient local embedding model suitable for scientific text. |
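
To make the size and overlap settings concrete, here is a minimal sliding-window sketch. It assumes the text has already been split into a list of tokens (how the project tokenizes is not stated), and `chunk_tokens` is a hypothetical helper, not the splitter used in the repository. With a size of 1500 and an overlap of 250, each new chunk starts 1250 tokens after the previous one.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 1500,
                 overlap: int = 250) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # 1500 - 250 = 1250 with the defaults
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

For a 3000-token document this yields three chunks starting at tokens 0, 1250, and 2500, and the last 250 tokens of each chunk reappear at the start of the next.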

B. Code Evidence

The complete, working code is available in the linked GitHub repository.

Code Repository: https://github.com/maddiravi/academia-analyzer-agent

4. Usability and Operational Features (Can I use it?)

A. User Interface

The system is delivered as a Streamlit web application (app.py). Users upload research papers through a drag-and-drop interface and view the analysis in real time.

B. Quick-Start Guide

Clone the Repository:

git clone https://github.com/maddiravi/academia-analyzer-agent
cd academia-analyzer-agent
pip install -r requirements.txt

Configuration: Create a .env file with your OPENROUTER_API_KEY.
Launch:
```
python -m streamlit run app.py
```
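
For the configuration step, a check like the one below can confirm that the key is present before the app starts. This is a standard-library sketch under assumptions: the project may well load `.env` with a dedicated package instead, and `parse_env`, `load_env_file`, and `require_api_key` are hypothetical names. Only the variable name `OPENROUTER_API_KEY` comes from the guide above.

```python
import os
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Read and parse the file if it exists; return an empty dict otherwise."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    return parse_env(env_path.read_text(encoding="utf-8"))


def require_api_key(env: dict[str, str], name: str = "OPENROUTER_API_KEY") -> str:
    """Return the key from the parsed file or the process environment, else raise."""
    key = env.get(name) or os.environ.get(name, "")
    if not key:
        raise RuntimeError(f"{name} is not set; add it to .env before launching")
    return key
```

Calling `require_api_key(load_env_file())` at startup fails fast with a clear message instead of surfacing a missing key later as an API error.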

C. Successful Execution Proof

[Screenshots of a successful run: MOD 2.1.jpeg, MOD 2.2.jpeg, MOD 2.3.jpeg]

D. Future Directions

Future work will focus on:

Multi-Paper Comparison: Allowing users to upload multiple PDFs to compare methodologies.
Citation Extraction: Adding a specific tool to extract and format bibliography references.

5. Security, Safety, and Guardrails

To ensure the system is safe for deployment, we implemented multiple layers of protection:

Input Validation: The system uses strict typing and file extension checks (.pdf, .txt only) to prevent processing malicious file types.
Output Sanitization: A dedicated safety.py tool strips potential script injection patterns (e.g., HTML tags) from the AI's output before it is rendered on the frontend.
Secret Management: API keys are managed strictly via .env files and are never hardcoded. A pre-flight HealthCheck ensures the API connection is secure and valid before any user data is processed.
Logging: Comprehensive logging tracks every agent's entry and exit point, allowing administrators to trace errors or potential misuse without exposing sensitive user content.
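
The input validation and output sanitization layers described above can be sketched as follows. This is an illustration under assumptions, not the contents of the project's safety.py: the function names and regular expressions are hypothetical, and only the allowed extensions (.pdf, .txt) and the idea of stripping HTML tags come from the description.

```python
import html
import re
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".txt"}
# Remove whole <script>/<style> elements first so their bodies do not survive.
SCRIPT_BLOCK = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")


def is_allowed_upload(filename: str) -> bool:
    """Accept only files whose final extension is .pdf or .txt (case-insensitive)."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def sanitize_output(text: str) -> str:
    """Strip script/style blocks and remaining tags, then escape leftover markup."""
    without_blocks = SCRIPT_BLOCK.sub("", text)
    without_tags = TAG_PATTERN.sub("", without_blocks)
    return html.escape(without_tags, quote=False)
```

The extension check looks only at the final suffix, so a name such as `paper.pdf.exe` is rejected. The final escaping step covers stray `<` or `&` characters that the tag pattern does not match.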

BY RAVI KANTH MADDI
