A Retrieval-Augmented Generation (RAG) system that allows you to chat with multiple PDF documents using Google's Gemini AI. Ask questions about your documents and get intelligent answers based on their content.
Before you start, you'll need:
- The `Multi_PDF_RAG_with_LlamaIndex.ipynb` file from the GitHub repository, uploaded to Google Colab
- A Google Gemini API key stored in Colab secrets under the name `geminiapikey`
Cell 1: Install Dependencies
```python
!pip install -q llama-index pypdf
!pip install -q llama-index-embeddings-gemini
!pip install -q llama-index-llms-gemini
```
What this does:
- `llama-index`: Core framework for building RAG applications
- `pypdf`: Library for reading and processing PDF files
- `llama-index-embeddings-gemini`: Google Gemini embedding model integration
- `llama-index-llms-gemini`: Google Gemini language model integration
- `-q` flag: Quiet installation (less verbose output)

Cell 2: Import Libraries
```python
from pathlib import Path
import os
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.storage import StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import load_index_from_storage
```
What each import does:
- `pathlib.Path`: Handle file system paths
- `os`: Operating system interface for file operations
- `google.colab.userdata`: Access Colab secrets (API keys)
- `VectorStoreIndex`: Create a searchable vector database from documents
- `SimpleDirectoryReader`: Read PDF files from specified locations
- `Settings`: Global configuration for LlamaIndex
- `StorageContext`: Manage persistent storage of the vector index
- `SentenceSplitter`: Split documents into manageable chunks
- `GeminiEmbedding`: Convert text to vector embeddings using Gemini
- `Gemini`: Google's language model for generating responses
- `load_index_from_storage`: Reload a previously saved index from disk

Cell 3: Get API Key
```python
API_KEY = userdata.get('geminiapikey')
```
What this does:
- `userdata.get()` safely accesses the secret you stored earlier
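If you have not stored the key as a Colab secret yet, here is a small fallback sketch (the `getpass` prompt is an alternative of mine, not part of the original notebook):

```python
# Fallback sketch: prompt for the key if the Colab secret is missing
try:
    API_KEY = userdata.get('geminiapikey')
except Exception:
    from getpass import getpass
    API_KEY = getpass("Paste your Gemini API key: ")
```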
Next, upload your PDFs to Colab (drag them into the file browser in the left sidebar) and configure the file paths:
Cell 4: Set PDF Locations
```python
pdf_directory = ['/content/part-1.pdf', '/content/part-2.pdf']
```
Update this with your files:
```python
# Replace with your actual PDF file names
pdf_directory = [
    '/content/your-document-1.pdf',
    '/content/your-document-2.pdf',
    '/content/your-document-3.pdf'
]
```
What this does:
- `/content/` is Colab's default upload directory
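To confirm that your uploads landed where those paths expect, a quick optional check:

```python
# Optional check: list the PDFs currently in Colab's upload directory
import os
print([f for f in os.listdir('/content') if f.endswith('.pdf')])
```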
Cell 5-6: Set up storage and chunk size

```python
persist_dir = "./storage"
chunk_size = 1024
```
What these settings mean:
- `persist_dir`: Directory where the vector index will be saved
- `chunk_size`: How many characters/tokens each text chunk contains
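To get a feel for what `chunk_size` does before indexing anything, here is a small sketch (the sample text is made up for illustration):

```python
# Sketch: preview how chunk_size changes the number of chunks
from llama_index.core.node_parser import SentenceSplitter

sample_text = "LlamaIndex splits long documents into smaller chunks. " * 200

for size in (256, 1024):
    splitter = SentenceSplitter(chunk_size=size)
    print(f"chunk_size={size}: {len(splitter.split_text(sample_text))} chunks")
```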
Cell 7: Create storage directory
```python
Path(persist_dir).mkdir(exist_ok=True)
```
What this does:
- `exist_ok=True` prevents errors if the directory already exists

Cell 8: Set up embedding model
```python
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001",
    api_key=API_KEY
)
```
What this does:
- `embedding-001` is Google's text embedding model
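As a quick optional sanity check that the key and model are wired up correctly (the test string is arbitrary):

```python
# Optional sanity check: embed a short string and inspect the vector size
vec = Settings.embed_model.get_text_embedding("Hello, RAG!")
print(len(vec))  # dimensionality of the embedding vector
```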
Cell 9: Configure language model and text processing

```python
Settings.llm = Gemini(api_key=API_KEY, model_name="models/gemini-2.5-flash")
Settings.text_splitter = SentenceSplitter(chunk_size=chunk_size)
Settings.chunk_size = chunk_size
```
What each setting does:
- `Settings.llm`: The AI model that generates answers to your questions
- `gemini-2.5-flash`: Fast and efficient version of Gemini
- `SentenceSplitter`: Intelligently splits text at sentence boundaries
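Optionally, you can confirm the LLM responds before building the index (the prompt is arbitrary):

```python
# Optional sanity check: call the LLM directly, outside the RAG pipeline
print(Settings.llm.complete("In one sentence, what is retrieval-augmented generation?"))
```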
Cell 10: Main function to create or load index

```python
def load_or_create_index():
    """Load existing index or create new one if it doesn't exist"""
    if not os.listdir(persist_dir):
        print("Creating new index...")
        # Load PDF documents
        documents = SimpleDirectoryReader(input_files=pdf_directory).load_data()
        # Create and persist index
        index = VectorStoreIndex.from_documents(
            documents,
            show_progress=True
        )
        index.storage_context.persist(persist_dir=persist_dir)
    else:
        print("Loading existing index...")
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
    return index
```
What this function does:
Checks if an index already exists:
- `os.listdir(persist_dir)` checks whether the storage directory has files

Creating a new index (first run):
- `SimpleDirectoryReader(input_files=pdf_directory).load_data()`: reads and extracts the text from each PDF
- `VectorStoreIndex.from_documents()`: embeds the text chunks and builds the searchable index
- `show_progress=True`: shows a progress bar during creation
- `index.storage_context.persist()`: saves the index to disk for reuse

Loading an existing index (later runs):
- `StorageContext.from_defaults()`: loads the storage configuration
- `load_index_from_storage()`: reconstructs the index from the saved files
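If you later add or swap PDFs, the saved index goes stale. A hypothetical helper (the `rebuild_index` name is mine, not part of the notebook) that wipes storage so the create branch runs again:

```python
# Hypothetical helper: force a rebuild after changing your PDFs
import shutil

def rebuild_index():
    shutil.rmtree(persist_dir, ignore_errors=True)  # delete the saved index
    Path(persist_dir).mkdir(exist_ok=True)          # recreate the empty directory
    return load_or_create_index()                   # empty dir -> "create" branch runs
```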
Cell 11: Initialize the system

```python
index = load_or_create_index()
```
What happens here:
- On the first run, the function reads your PDFs, builds the vector index, and saves it to `./storage` (this can take a few minutes for large PDFs)
- On later runs, it reloads the saved index from disk, which is much faster
Cell 12: Query function
```python
def query_pdfs(question):
    """Query the PDF knowledge base"""
    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode="compact",
        verbose=True
    )
    response = query_engine.query(question)
    return response
```
What each parameter does:
- `similarity_top_k=3`: Retrieves the 3 most relevant text chunks
- `response_mode="compact"`: Generates concise, focused answers
- `verbose=True`: Shows which chunks were used for the answer
- `query_engine.query(question)`: Searches the index and generates the response

How the query process works:
1. Your question is converted into a vector embedding
2. The index is searched for the text chunks most similar to that embedding
3. The top chunks are passed to Gemini, which generates an answer grounded in them
Cell 13: Example query
response = query_pdfs("What challenges do the heroes face on their journey to recover the Crystal of Lumina?") print(response)
Try different types of questions:
```python
# Summarization questions
response = query_pdfs("What are the main topics covered in these documents?")
print(response)

# Specific factual questions
response = query_pdfs("What methodology was used in the research?")
print(response)

# Analytical questions
response = query_pdfs("What are the key findings and conclusions?")
print(response)

# Comparative questions
response = query_pdfs("How do the authors' recommendations differ between documents?")
print(response)
```
Understanding the output: with `verbose=True`, the printed output shows the source chunks that were retrieved before the generated answer. The returned response object also records those sources, as the sketch below demonstrates.
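A small sketch for inspecting the retrieved chunks and their similarity scores (`source_nodes`, `score`, and `get_content()` are standard LlamaIndex response fields):

```python
# Inspect which chunks the answer was built from
response = query_pdfs("What is this document about?")
for source in response.source_nodes:
    print("score:", source.score)
    print(source.node.get_content()[:200], "...\n")
```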
```python
# In Cell 6, modify the chunk size
chunk_size = 1024  # Default: 1024

# Options and their effects:
chunk_size = 512   # Smaller chunks = more precise answers, less context
chunk_size = 1024  # Balanced approach (recommended)
chunk_size = 2048  # Larger chunks = more context, potentially less precise
```
When to adjust chunk size: use smaller chunks when you need precise answers to narrow factual questions, and larger chunks when answers depend on longer stretches of surrounding context. After changing it, delete the `storage/` directory so the index gets rebuilt with the new setting.
```python
# In the query_pdfs function, modify these parameters:
def query_pdfs(question):
    query_engine = index.as_query_engine(
        similarity_top_k=3,        # Number of relevant chunks to retrieve
        response_mode="compact",   # How to format the response
        verbose=True               # Show source information
    )
    response = query_engine.query(question)
    return response
```
Parameter explanations:
```python
# Retrieve more or fewer relevant chunks
similarity_top_k=1   # Fast, but may miss context
similarity_top_k=3   # Good balance (recommended)
similarity_top_k=5   # More comprehensive, slower

# Different response modes
response_mode="compact"          # Concise answers
response_mode="tree_summarize"   # Hierarchical summarization
response_mode="accumulate"       # Detailed, comprehensive responses

# Verbose output control
verbose=True    # Shows which chunks were used (helpful for debugging)
verbose=False   # Cleaner output, just the answer
```
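Since `query_pdfs` rebuilds the query engine on every call, you can also construct it once and reuse it; a small sketch (the `query_pdfs_fast` name is mine):

```python
# Build the query engine once and reuse it across questions
query_engine = index.as_query_engine(similarity_top_k=3, response_mode="compact")

def query_pdfs_fast(question):
    return query_engine.query(question)
```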
```python
# Customize the embedding model (Cell 8)
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001",  # Google's embedding model
    api_key=API_KEY,
    # Optional: add custom parameters
)

# Customize the language model (Cell 9)
Settings.llm = Gemini(
    api_key=API_KEY,
    model_name="models/gemini-2.5-flash",  # Fast model
    # Alternative: "models/gemini-1.5-pro" for more complex reasoning
    temperature=0.1,  # Lower = more focused, Higher = more creative
    max_tokens=1000   # Maximum response length
)
```
```python
# Advanced text splitting options (Cell 9)
from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter

# Sentence-based splitting (default - recommended)
Settings.text_splitter = SentenceSplitter(
    chunk_size=chunk_size,
    chunk_overlap=20,  # Overlap between chunks for continuity
)

# Token-based splitting (alternative)
Settings.text_splitter = TokenTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=50,
)
```
```python
# Create a custom query engine with specific instructions
def query_pdfs_with_custom_prompt(question, custom_instruction=""):
    """Query with custom instructions for the AI"""
    # Custom system prompt
    system_prompt = f"""
    You are an expert document analyst. {custom_instruction}
    Always cite which document or section your information comes from.
    If you cannot find relevant information, say so clearly.
    """

    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode="compact",
        system_prompt=system_prompt
    )
    response = query_engine.query(question)
    return response

# Example usage
response = query_pdfs_with_custom_prompt(
    "Summarize the methodology",
    "Focus on technical details and be very specific about procedures."
)
print(response)
```
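For multi-turn conversations instead of one-off questions, LlamaIndex also provides a chat engine; a brief sketch (in `condense_question` mode, follow-up questions are rewritten using the chat history):

```python
# Multi-turn chat over the same index
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
print(chat_engine.chat("Who are the main characters?"))
print(chat_engine.chat("What obstacles do they overcome?"))  # follow-up uses history
```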
This system works great for:
- Research papers and academic documents
- Technical manuals and reports
- Books and long-form stories (like the Crystal of Lumina example above)

Your files in Colab will be organized like this:
```
Your Colab Environment/
├── Multi_PDF_RAG_with_LlamaIndex.ipynb   # Main notebook from GitHub
├── your-document-1.pdf                   # Your uploaded PDFs
├── your-document-2.pdf
├── storage/                              # Auto-created vector database
│   ├── docstore.json
│   ├── index_store.json
│   └── vector_store.json
└── README.md                             # This documentation
```
When you ask a question, the system will embed it, retrieve the most relevant chunks, and generate a grounded answer (see "How the query process works" above). If something goes wrong, check these common fixes:

- API key errors: make sure your Gemini key is saved in Colab secrets under the name `geminiapikey`
- File not found: double-check the paths in `pdf_directory`. Use the `/content/filename.pdf` format
- Out-of-memory errors: reduce `chunk_size` to 512 or process fewer PDFs at once
- Vague or incomplete answers: adjust `chunk_size` or increase `similarity_top_k` to retrieve more context

Here's the complete notebook code with detailed explanations:
```python
# ===== CELL 1: Install Dependencies =====
!pip install -q llama-index pypdf
!pip install -q llama-index-embeddings-gemini
!pip install -q llama-index-llms-gemini
```
```python
# ===== CELL 2: Import Required Libraries =====
from pathlib import Path
import os
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.storage import StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import load_index_from_storage
```
```python
# ===== CELL 3: Get API Key from Colab Secrets =====
API_KEY = userdata.get('geminiapikey')
```
```python
# ===== CELL 4: Define PDF File Paths =====
# IMPORTANT: Update these paths with your actual PDF files
pdf_directory = ['/content/part-1.pdf', '/content/part-2.pdf']

# Example with more files:
# pdf_directory = [
#     '/content/research-paper-1.pdf',
#     '/content/research-paper-2.pdf',
#     '/content/manual.pdf'
# ]
```
```python
# ===== CELL 5: Set Storage Directory =====
persist_dir = "./storage"  # Where the vector index will be saved
```
```python
# ===== CELL 6: Configure Chunk Size =====
chunk_size = 1024  # Size of text chunks for processing
```
```python
# ===== CELL 7: Create Storage Directory =====
Path(persist_dir).mkdir(exist_ok=True)
```
```python
# ===== CELL 8: Configure Embedding Model =====
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001",
    api_key=API_KEY
)
```
```python
# ===== CELL 9: Configure Language Model and Text Processing =====
Settings.llm = Gemini(api_key=API_KEY, model_name="models/gemini-2.5-flash")
Settings.text_splitter = SentenceSplitter(chunk_size=chunk_size)
Settings.chunk_size = chunk_size
```
```python
# ===== CELL 10: Main Function - Create or Load Vector Index =====
def load_or_create_index():
    """
    This function either:
    1. Creates a new vector index from your PDFs (first time)
    2. Loads an existing index from storage (subsequent times)
    """
    if not os.listdir(persist_dir):
        print("Creating new index...")
        print("This may take a few minutes for large PDFs...")

        # Read all PDF files
        documents = SimpleDirectoryReader(input_files=pdf_directory).load_data()
        print(f"Loaded {len(documents)} documents")

        # Create vector embeddings and searchable index
        index = VectorStoreIndex.from_documents(
            documents,
            show_progress=True  # Shows progress bar
        )

        # Save the index for future use
        index.storage_context.persist(persist_dir=persist_dir)
        print("Index created and saved successfully!")
    else:
        print("Loading existing index...")
        # Load previously created index
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
        print("Index loaded successfully!")

    return index
```
```python
# ===== CELL 11: Initialize the System =====
# This will either create a new index or load the existing one
index = load_or_create_index()
```
```python
# ===== CELL 12: Query Function =====
def query_pdfs(question):
    """
    Function to ask questions about your PDFs

    Args:
        question (str): Your question about the documents

    Returns:
        Response object with answer and source information
    """
    print(f"Question: {question}")
    print("-" * 50)

    # Create query engine
    query_engine = index.as_query_engine(
        similarity_top_k=3,        # Get 3 most relevant text chunks
        response_mode="compact",   # Generate concise response
        verbose=True               # Show source information
    )

    # Get response
    response = query_engine.query(question)
    return response
```
```python
# ===== CELL 13: Example Query =====
# Ask your first question
response = query_pdfs("What challenges do the heroes face on their journey to recover the Crystal of Lumina?")
print(response)
```
```python
# ===== CELL 14: Additional Example Queries =====
# Try different types of questions:

# Summarization
response = query_pdfs("Provide a summary of the main topics discussed")
print("SUMMARY:")
print(response)
print("\n" + "="*60 + "\n")

# Specific facts
response = query_pdfs("What specific methods or approaches are mentioned?")
print("METHODS:")
print(response)
print("\n" + "="*60 + "\n")

# Analysis
response = query_pdfs("What are the key conclusions or findings?")
print("CONCLUSIONS:")
print(response)
```
Phase 1: Setup (Cells 1-9). Installs dependencies, loads the API key, and configures the embedding model, language model, and text splitter.
Phase 2: Index Creation (Cells 10-11). Reads your PDFs, embeds the text chunks, and saves the index to `./storage`.
Phase 3: Querying (Cells 12-14). Retrieves the most relevant chunks for each question and generates answers with Gemini.
```python
# ===== OPTIONAL: Add this cell for debugging =====
def debug_index_info():
    """Display information about your vector index"""
    print("=== INDEX INFORMATION ===")
    print(f"Storage directory: {persist_dir}")
    print(f"Directory exists: {os.path.exists(persist_dir)}")

    if os.path.exists(persist_dir):
        files = os.listdir(persist_dir)
        print(f"Storage files: {files}")

    print(f"PDF files to process: {pdf_directory}")
    for pdf_path in pdf_directory:
        exists = os.path.exists(pdf_path)
        print(f"  {pdf_path}: {'✓ Found' if exists else '✗ Missing'}")

# Run this to check your setup
debug_index_info()
```
```python
# ===== OPTIONAL: Test with simple question first =====
def test_system():
    """Test the system with a simple question"""
    try:
        response = query_pdfs("What is this document about?")
        print("✓ System working correctly!")
        print("Response:", str(response)[:200] + "...")
        return True
    except Exception as e:
        print("✗ Error in system:")
        print(f"Error: {e}")
        return False

# Run this to test your setup
test_system()
```
We welcome contributions to improve this project! Here's how you can help:
1. Fork the repository
2. Make your improvements
3. Submit a pull request
MIT License - feel free to use and modify for your projects.
If you encounter issues:
- Work through the troubleshooting checklist above
- Run the optional `debug_index_info()` and `test_system()` cells to pinpoint the problem
- Open an issue on the GitHub repository with the full error message
If this project helped you, please consider giving it a ⭐ on GitHub!