A Retrieval-Augmented Generation (RAG) system that allows you to chat with multiple PDF documents using Google's Gemini AI. Ask questions about your documents and get intelligent answers based on their content.
Before you start, you'll need:
- The `Multi_PDF_RAG_with_LlamaIndex.ipynb` file from the GitHub repository, uploaded to Google Colab
- A Google Gemini API key stored in Colab secrets under the name `geminiapikey`
Cell 1: Install Dependencies
```python
!pip install -q llama-index pypdf
!pip install -q llama-index-embeddings-gemini
!pip install -q llama-index-llms-gemini
```
What this does:
- `llama-index`: Core framework for building RAG applications
- `pypdf`: Library for reading and processing PDF files
- `llama-index-embeddings-gemini`: Google Gemini embedding model integration
- `llama-index-llms-gemini`: Google Gemini language model integration
- `-q` flag: Quiet installation (less verbose output)

Cell 2: Import Libraries
```python
from pathlib import Path
import os
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.storage import StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import load_index_from_storage
```
What each import does:
- `pathlib.Path`: Handle file system paths
- `os`: Operating system interface for file operations
- `google.colab.userdata`: Access Colab secrets (API keys)
- `VectorStoreIndex`: Create a searchable vector database from documents
- `SimpleDirectoryReader`: Read PDF files from specified locations
- `Settings`: Global configuration for LlamaIndex
- `StorageContext`: Manage persistent storage of the vector index
- `SentenceSplitter`: Split documents into manageable chunks
- `GeminiEmbedding`: Convert text to vector embeddings using Gemini
- `Gemini`: Google's language model for generating responses
- `load_index_from_storage`: Reload a previously saved index from disk

Cell 3: Get API Key
```python
API_KEY = userdata.get('geminiapikey')
```
What this does:
- `userdata.get()` safely accesses the secret you stored earlier
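If you have not stored the key as a Colab secret yet, here is a small fallback sketch (the `getpass` prompt is an alternative of mine, not part of the original notebook):

```python
# Fallback sketch: prompt for the key if the Colab secret is missing
try:
    API_KEY = userdata.get('geminiapikey')
except Exception:
    from getpass import getpass
    API_KEY = getpass("Paste your Gemini API key: ")
```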
Next, upload your PDFs to Colab (drag them into the file browser in the left sidebar) and configure the file paths:
Cell 4: Set PDF Locations
```python
pdf_directory = ['/content/part-1.pdf', '/content/part-2.pdf']
```
Update this with your files:
```python
# Replace with your actual PDF file names
pdf_directory = [
    '/content/your-document-1.pdf',
    '/content/your-document-2.pdf',
    '/content/your-document-3.pdf'
]
```
What this does:
- `/content/` is Colab's default upload directory
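To confirm that your uploads landed where those paths expect, a quick optional check:

```python
# Optional check: list the PDFs currently in Colab's upload directory
import os
print([f for f in os.listdir('/content') if f.endswith('.pdf')])
```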
Cell 5-6: Set up storage and chunk size

```python
persist_dir = "./storage"
chunk_size = 1024
```
What these settings mean:
- `persist_dir`: Directory where the vector index will be saved
- `chunk_size`: How many characters/tokens each text chunk contains
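To get a feel for what `chunk_size` does before indexing anything, here is a small sketch (the sample text is made up for illustration):

```python
# Sketch: preview how chunk_size changes the number of chunks
from llama_index.core.node_parser import SentenceSplitter

sample_text = "LlamaIndex splits long documents into smaller chunks. " * 200

for size in (256, 1024):
    splitter = SentenceSplitter(chunk_size=size)
    print(f"chunk_size={size}: {len(splitter.split_text(sample_text))} chunks")
```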
Cell 7: Create storage directory
```python
Path(persist_dir).mkdir(exist_ok=True)
```
What this does:
- `exist_ok=True` prevents errors if the directory already exists

Cell 8: Set up embedding model
```python
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001",
    api_key=API_KEY
)
```
What this does:
- `embedding-001` is Google's text embedding model
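As a quick optional sanity check that the key and model are wired up correctly (the test string is arbitrary):

```python
# Optional sanity check: embed a short string and inspect the vector size
vec = Settings.embed_model.get_text_embedding("Hello, RAG!")
print(len(vec))  # dimensionality of the embedding vector
```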
Cell 9: Configure language model and text processing

```python
Settings.llm = Gemini(api_key=API_KEY, model_name="models/gemini-2.5-flash")
Settings.text_splitter = SentenceSplitter(chunk_size=chunk_size)
Settings.chunk_size = chunk_size
```
What each setting does:
- `Settings.llm`: The AI model that generates answers to your questions
- `gemini-2.5-flash`: Fast and efficient version of Gemini
- `SentenceSplitter`: Intelligently splits text at sentence boundaries
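Optionally, you can confirm the LLM responds before building the index (the prompt is arbitrary):

```python
# Optional sanity check: call the LLM directly, outside the RAG pipeline
print(Settings.llm.complete("In one sentence, what is retrieval-augmented generation?"))
```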
Cell 10: Main function to create or load index

```python
def load_or_create_index():
    """Load existing index or create new one if it doesn't exist"""
    if not os.listdir(persist_dir):
        print("Creating new index...")
        # Load PDF documents
        documents = SimpleDirectoryReader(input_files=pdf_directory).load_data()
        # Create and persist index
        index = VectorStoreIndex.from_documents(
            documents,
            show_progress=True
        )
        index.storage_context.persist(persist_dir=persist_dir)
    else:
        print("Loading existing index...")
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
    return index
```
What this function does:
Checks if an index already exists:
- `os.listdir(persist_dir)` checks whether the storage directory has files

Creating a new index (first run):
- `SimpleDirectoryReader(input_files=pdf_directory).load_data()`: reads and extracts the text from each PDF
- `VectorStoreIndex.from_documents()`: embeds the text chunks and builds the searchable index
- `show_progress=True`: shows a progress bar during creation
- `index.storage_context.persist()`: saves the index to disk for reuse

Loading an existing index (later runs):
- `StorageContext.from_defaults()`: loads the storage configuration
- `load_index_from_storage()`: reconstructs the index from the saved files
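If you later add or swap PDFs, the saved index goes stale. A hypothetical helper (the `rebuild_index` name is mine, not part of the notebook) that wipes storage so the create branch runs again:

```python
# Hypothetical helper: force a rebuild after changing your PDFs
import shutil

def rebuild_index():
    shutil.rmtree(persist_dir, ignore_errors=True)  # delete the saved index
    Path(persist_dir).mkdir(exist_ok=True)          # recreate the empty directory
    return load_or_create_index()                   # empty dir -> "create" branch runs
```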
Cell 11: Initialize the system

```python
index = load_or_create_index()
```
What happens here:
- On the first run, the function reads your PDFs, builds the vector index, and saves it to `./storage` (this can take a few minutes for large PDFs)
- On later runs, it reloads the saved index from disk, which is much faster
Cell 12: Query function
```python
def query_pdfs(question):
    """Query the PDF knowledge base"""
    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode="compact",
        verbose=True
    )
    response = query_engine.query(question)
    return response
```
What each parameter does:
- `similarity_top_k=3`: Retrieves the 3 most relevant text chunks
- `response_mode="compact"`: Generates concise, focused answers
- `verbose=True`: Shows which chunks were used for the answer
- `query_engine.query(question)`: Searches the index and generates the response

How the query process works:
1. Your question is converted into a vector embedding
2. The index is searched for the text chunks most similar to that embedding
3. The top chunks are passed to Gemini, which generates an answer grounded in them
Cell 13: Example query
response = query_pdfs("What challenges do the heroes face on their journey to recover the Crystal of Lumina?") print(response)
Try different types of questions:
```python
# Summarization questions
response = query_pdfs("What are the main topics covered in these documents?")
print(response)

# Specific factual questions
response = query_pdfs("What methodology was used in the research?")
print(response)

# Analytical questions
response = query_pdfs("What are the key findings and conclusions?")
print(response)

# Comparative questions
response = query_pdfs("How do the authors' recommendations differ between documents?")
print(response)
```
Understanding the output: with `verbose=True`, the printed output shows the source chunks that were retrieved before the generated answer. The returned response object also records those sources, as the sketch below demonstrates.
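A small sketch for inspecting the retrieved chunks and their similarity scores (`source_nodes`, `score`, and `get_content()` are standard LlamaIndex response fields):

```python
# Inspect which chunks the answer was built from
response = query_pdfs("What is this document about?")
for source in response.source_nodes:
    print("score:", source.score)
    print(source.node.get_content()[:200], "...\n")
```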
```python
# In Cell 6, modify the chunk size
chunk_size = 1024  # Default: 1024

# Options and their effects:
chunk_size = 512   # Smaller chunks = more precise answers, less context
chunk_size = 1024  # Balanced approach (recommended)
chunk_size = 2048  # Larger chunks = more context, potentially less precise
```
When to adjust chunk size: use smaller chunks when you need precise answers to narrow factual questions, and larger chunks when answers depend on longer stretches of surrounding context. After changing it, delete the `storage/` directory so the index gets rebuilt with the new setting.
```python
# In the query_pdfs function, modify these parameters:
def query_pdfs(question):
    query_engine = index.as_query_engine(
        similarity_top_k=3,        # Number of relevant chunks to retrieve
        response_mode="compact",   # How to format the response
        verbose=True               # Show source information
    )
    response = query_engine.query(question)
    return response
```
Parameter explanations:
```python
# Retrieve more or fewer relevant chunks
similarity_top_k=1   # Fast, but may miss context
similarity_top_k=3   # Good balance (recommended)
similarity_top_k=5   # More comprehensive, slower

# Different response modes
response_mode="compact"          # Concise answers
response_mode="tree_summarize"   # Hierarchical summarization
response_mode="accumulate"       # Detailed, comprehensive responses

# Verbose output control
verbose=True    # Shows which chunks were used (helpful for debugging)
verbose=False   # Cleaner output, just the answer
```
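Since `query_pdfs` rebuilds the query engine on every call, you can also construct it once and reuse it; a small sketch (the `query_pdfs_fast` name is mine):

```python
# Build the query engine once and reuse it across questions
query_engine = index.as_query_engine(similarity_top_k=3, response_mode="compact")

def query_pdfs_fast(question):
    return query_engine.query(question)
```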
```python
# Customize the embedding model (Cell 8)
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001",  # Google's embedding model
    api_key=API_KEY,
    # Optional: add custom parameters
)

# Customize the language model (Cell 9)
Settings.llm = Gemini(
    api_key=API_KEY,
    model_name="models/gemini-2.5-flash",  # Fast model
    # Alternative: "models/gemini-1.5-pro" for more complex reasoning
    temperature=0.1,  # Lower = more focused, Higher = more creative
    max_tokens=1000   # Maximum response length
)
```
```python
# Advanced text splitting options (Cell 9)
from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter

# Sentence-based splitting (default - recommended)
Settings.text_splitter = SentenceSplitter(
    chunk_size=chunk_size,
    chunk_overlap=20,  # Overlap between chunks for continuity
)

# Token-based splitting (alternative)
Settings.text_splitter = TokenTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=50,
)
```
```python
# Create a custom query engine with specific instructions
def query_pdfs_with_custom_prompt(question, custom_instruction=""):
    """Query with custom instructions for the AI"""
    # Custom system prompt
    system_prompt = f"""
    You are an expert document analyst. {custom_instruction}
    Always cite which document or section your information comes from.
    If you cannot find relevant information, say so clearly.
    """

    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode="compact",
        system_prompt=system_prompt
    )
    response = query_engine.query(question)
    return response

# Example usage
response = query_pdfs_with_custom_prompt(
    "Summarize the methodology",
    "Focus on technical details and be very specific about procedures."
)
print(response)
```
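For multi-turn conversations instead of one-off questions, LlamaIndex also provides a chat engine; a brief sketch (in `condense_question` mode, follow-up questions are rewritten using the chat history):

```python
# Multi-turn chat over the same index
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
print(chat_engine.chat("Who are the main characters?"))
print(chat_engine.chat("What obstacles do they overcome?"))  # follow-up uses history
```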
This system works great for:
- Research papers and academic documents
- Technical manuals and reports
- Books and long-form stories (like the Crystal of Lumina example above)

Your files in Colab will be organized like this:
```
Your Colab Environment/
├── Multi_PDF_RAG_with_LlamaIndex.ipynb   # Main notebook from GitHub
├── your-document-1.pdf                   # Your uploaded PDFs
├── your-document-2.pdf
├── storage/                              # Auto-created vector database
│   ├── docstore.json
│   ├── index_store.json
│   └── vector_store.json
└── README.md                             # This documentation
```
When you ask a question, the system will embed it, retrieve the most relevant chunks, and generate a grounded answer (see "How the query process works" above). If something goes wrong, check these common fixes:

- API key errors: make sure your Gemini key is saved in Colab secrets under the name `geminiapikey`
- File not found: double-check the paths in `pdf_directory`. Use the `/content/filename.pdf` format
- Out-of-memory errors: reduce `chunk_size` to 512 or process fewer PDFs at once
- Vague or incomplete answers: adjust `chunk_size` or increase `similarity_top_k` to retrieve more context

Here's the complete notebook code with detailed explanations:
```python
# ===== CELL 1: Install Dependencies =====
!pip install -q llama-index pypdf
!pip install -q llama-index-embeddings-gemini
!pip install -q llama-index-llms-gemini
```
```python
# ===== CELL 2: Import Required Libraries =====
from pathlib import Path
import os
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.storage import StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import load_index_from_storage
```
```python
# ===== CELL 3: Get API Key from Colab Secrets =====
API_KEY = userdata.get('geminiapikey')
```
```python
# ===== CELL 4: Define PDF File Paths =====
# IMPORTANT: Update these paths with your actual PDF files
pdf_directory = ['/content/part-1.pdf', '/content/part-2.pdf']

# Example with more files:
# pdf_directory = [
#     '/content/research-paper-1.pdf',
#     '/content/research-paper-2.pdf',
#     '/content/manual.pdf'
# ]
```
```python
# ===== CELL 5: Set Storage Directory =====
persist_dir = "./storage"  # Where the vector index will be saved
```
```python
# ===== CELL 6: Configure Chunk Size =====
chunk_size = 1024  # Size of text chunks for processing
```
```python
# ===== CELL 7: Create Storage Directory =====
Path(persist_dir).mkdir(exist_ok=True)
```
```python
# ===== CELL 8: Configure Embedding Model =====
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001",
    api_key=API_KEY
)
```
```python
# ===== CELL 9: Configure Language Model and Text Processing =====
Settings.llm = Gemini(api_key=API_KEY, model_name="models/gemini-2.5-flash")
Settings.text_splitter = SentenceSplitter(chunk_size=chunk_size)
Settings.chunk_size = chunk_size
```
```python
# ===== CELL 10: Main Function - Create or Load Vector Index =====
def load_or_create_index():
    """
    This function either:
    1. Creates a new vector index from your PDFs (first time)
    2. Loads an existing index from storage (subsequent times)
    """
    if not os.listdir(persist_dir):
        print("Creating new index...")
        print("This may take a few minutes for large PDFs...")

        # Read all PDF files
        documents = SimpleDirectoryReader(input_files=pdf_directory).load_data()
        print(f"Loaded {len(documents)} documents")

        # Create vector embeddings and searchable index
        index = VectorStoreIndex.from_documents(
            documents,
            show_progress=True  # Shows progress bar
        )

        # Save the index for future use
        index.storage_context.persist(persist_dir=persist_dir)
        print("Index created and saved successfully!")
    else:
        print("Loading existing index...")
        # Load previously created index
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
        print("Index loaded successfully!")

    return index
```
```python
# ===== CELL 11: Initialize the System =====
# This will either create a new index or load the existing one
index = load_or_create_index()
```
```python
# ===== CELL 12: Query Function =====
def query_pdfs(question):
    """
    Function to ask questions about your PDFs

    Args:
        question (str): Your question about the documents

    Returns:
        Response object with answer and source information
    """
    print(f"Question: {question}")
    print("-" * 50)

    # Create query engine
    query_engine = index.as_query_engine(
        similarity_top_k=3,        # Get 3 most relevant text chunks
        response_mode="compact",   # Generate concise response
        verbose=True               # Show source information
    )

    # Get response
    response = query_engine.query(question)
    return response
```
```python
# ===== CELL 13: Example Query =====
# Ask your first question
response = query_pdfs("What challenges do the heroes face on their journey to recover the Crystal of Lumina?")
print(response)
```
```python
# ===== CELL 14: Additional Example Queries =====
# Try different types of questions:

# Summarization
response = query_pdfs("Provide a summary of the main topics discussed")
print("SUMMARY:")
print(response)
print("\n" + "="*60 + "\n")

# Specific facts
response = query_pdfs("What specific methods or approaches are mentioned?")
print("METHODS:")
print(response)
print("\n" + "="*60 + "\n")

# Analysis
response = query_pdfs("What are the key conclusions or findings?")
print("CONCLUSIONS:")
print(response)
```
Phase 1: Setup (Cells 1-9). Installs dependencies, loads the API key, and configures the embedding model, language model, and text splitter.
Phase 2: Index Creation (Cells 10-11). Reads your PDFs, embeds the text chunks, and saves the index to `./storage`.
Phase 3: Querying (Cells 12-14). Retrieves the most relevant chunks for each question and generates answers with Gemini.
```python
# ===== OPTIONAL: Add this cell for debugging =====
def debug_index_info():
    """Display information about your vector index"""
    print("=== INDEX INFORMATION ===")
    print(f"Storage directory: {persist_dir}")
    print(f"Directory exists: {os.path.exists(persist_dir)}")

    if os.path.exists(persist_dir):
        files = os.listdir(persist_dir)
        print(f"Storage files: {files}")

    print(f"PDF files to process: {pdf_directory}")
    for pdf_path in pdf_directory:
        exists = os.path.exists(pdf_path)
        print(f"  {pdf_path}: {'✓ Found' if exists else '✗ Missing'}")

# Run this to check your setup
debug_index_info()
```
```python
# ===== OPTIONAL: Test with simple question first =====
def test_system():
    """Test the system with a simple question"""
    try:
        response = query_pdfs("What is this document about?")
        print("✓ System working correctly!")
        print("Response:", str(response)[:200] + "...")
        return True
    except Exception as e:
        print("✗ Error in system:")
        print(f"Error: {e}")
        return False

# Run this to test your setup
test_system()
```
We welcome contributions to improve this project! Here's how you can help:
1. Fork the repository
2. Make your improvements
3. Submit a pull request
MIT License - feel free to use and modify for your projects.
If you encounter issues:
- Work through the troubleshooting checklist above
- Run the optional `debug_index_info()` and `test_system()` cells to pinpoint the problem
- Open an issue on the GitHub repository with the full error message
If this project helped you, please consider giving it a ⭐ on GitHub!