# A Retrieval-Augmented Generation System for Spanish-Language Documents

Accessing information contained in PDF documents remains a challenge for users working with Spanish text. Many existing generative AI solutions are optimized for English and do not adequately handle the segmentation, semantics, or structure of Spanish academic or legal documents.
Leveraging NVIDIA's Llama 3.3 model, available through its NIM API, this project demonstrates how a custom RAG pipeline can improve comprehension and generate accurate answers even when the source text is in another language and uses complex formatting.
## What is this about?
This project implements a RAG (Retrieval-Augmented Generation) pipeline to answer questions from PDFs, specifically optimized for Spanish-language academic texts. It combines PyMuPDF-based ingestion, Spanish-aware text chunking, NVIDIA embeddings for semantic search, and answer generation with ChatNVIDIA (Llama 3.3).
Target Audience: Researchers, developers, and students working with Spanish PDFs (e.g., theses, legal docs).
## Why does it matter?
Most generative AI tooling is tuned for English. This pipeline adapts every stage to Spanish (e.g., chunk_size=200, accent-aware processing), so retrieval and answers respect the language's punctuation and segmentation.

## How was the system built?
This RAG pipeline was designed following a modular and reproducible structure. Below is a breakdown of each stage:
### PDF Ingestion
PyMuPDF is used to extract clean text, with quality control to skip empty or non-semantic sections.

### Text Chunking
RecursiveCharacterTextSplitter is applied with Spanish-specific separators: double newlines (\n\n) and punctuation marks (¿, ¡, ;, etc.). A chunk_size of 200 preserves semantic cohesion without exceeding context limits.

### Vectorization & Semantic Search
Chunks are embedded with NVIDIAEmbeddings under a max_tokens=400 cap to ensure compatibility with long texts; the chunks most similar to the question are retrieved.

### Answer Generation
The question and the retrieved chunks are passed to the ChatNVIDIA model (Llama 3.3) to generate a natural language response.

### Validation & Testing
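One dependency-free way to sanity-check the chunking stage described above is to sketch the splitter's logic directly and assert on its output. This is an illustrative approximation of how RecursiveCharacterTextSplitter walks a priority-ordered separator list; it is not the repo's code, and the separator list here mirrors the Spanish-specific ones named above.

```python
def split_text(text, chunk_size=200, separators=("\n\n", ";", "¿", "¡", " ")):
    """Greedy recursive splitter: try separators in priority order and
    pack the resulting pieces into chunks no longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for idx, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, buf = [], ""
        for piece in text.split(sep):
            candidate = f"{buf}{sep}{piece}" if buf else piece
            if len(candidate) <= chunk_size:
                buf = candidate
            else:
                if buf:
                    chunks.append(buf)
                if len(piece) > chunk_size:
                    # Piece still too big: recurse with lower-priority separators.
                    chunks.extend(split_text(piece, chunk_size, separators[idx + 1:]))
                    buf = ""
                else:
                    buf = piece
        if buf:
            chunks.append(buf)
        return chunks
    # No separator applies: hard cut at chunk_size boundaries.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Splitting on \n\n first keeps paragraphs whole; only oversized paragraphs fall through to sentence-level punctuation, which is why the separator priority matters for Spanish texts.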
## Can I trust it?
| Component | Implementation Details |
|---|---|
| Text Splitting | RecursiveCharacterTextSplitter tuned for Spanish (prioritizes \n\n, ;, ¿?¡!) |
| Embeddings | NVIDIA’s NVIDIAEmbeddings with token-length validation (max_tokens=400) |
| LLM | ChatNVIDIA with Llama 3.3 (49B params) |
| Error Handling | Fallback to direct LLM answers if RAG fails |
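The fallback behavior in the table above can be sketched as follows. Only the name safe_embedding comes from the repo's own checklist; the answer helper, its callback parameters, and the whitespace token proxy are hypothetical stand-ins for illustration.

```python
def safe_embedding(embed_fn, chunks, max_tokens=400):
    """Embed chunks defensively: drop chunks over the token cap (rough
    whitespace-token proxy) and return None if embedding fails entirely."""
    try:
        valid = [c for c in chunks if len(c.split()) <= max_tokens]
        return embed_fn(valid) if valid else None
    except Exception:
        return None

def answer(question, chunks, embed_fn, llm_fn, rag_fn):
    """Answer via RAG when embeddings are available; otherwise fall
    back to asking the LLM directly, as the table describes."""
    vectors = safe_embedding(embed_fn, chunks)
    if vectors is None:
        return llm_fn(question)       # fallback: direct LLM answer
    return rag_fn(question, vectors)  # normal RAG path
```

Returning None instead of raising lets the caller degrade gracefully: the user still gets an answer, just without retrieved context.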
Benchmark results and test cases are documented in the repository (see notebooks/benchmarks.ipynb).

## Can I use it?
```bash
git clone https://github.com/simsimi2143/Rag-nvidia-nim.git
cd Rag-nvidia-nim
pip install -r requirements.txt
export NVIDIA_API_KEY="your_key_here"
python rag_pipeline.py --pdf_path data/your_doc.pdf
```
Modify config/api_config.py:
```python
CHUNK_SIZE = 200  # Smaller for Spanish
MODEL_NAME = "llama-3.3-nemotron-super-49b-v1"
SEARCH_KWARGS = {"k": 2}  # Top-2 chunks for answers
```
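SEARCH_KWARGS = {"k": 2} means the retriever returns the two chunks most similar to the question. Conceptually, retrieval does something like the following dependency-free sketch (the repo's actual retriever is whatever vector store the pipeline wires up; this only illustrates the top-k idea):

```python
import math

def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunk vectors most similar to the
    query vector, ranked by cosine similarity."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    scores = [(cos(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]
```

Keeping k small (here 2) limits the context passed to the LLM, which pairs naturally with the small CHUNK_SIZE chosen for Spanish.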
Example session:

```
🌟 Asistente RAG - Basado en tu PDF 📄
🤔 Tu pregunta: ¿Cuál es la hipótesis principal?
💡 Respuesta: La hipótesis propone que...
📚 Fuentes relevantes:
1. Página 12: "La hipótesis H1 establece..."
```
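An interactive loop producing a transcript like the one above could look like this. The run_cli helper and its ask_rag callback are hypothetical illustrations, not the repo's code; the injectable input_fn/print_fn parameters just make the loop testable.

```python
def run_cli(ask_rag, input_fn=input, print_fn=print):
    """Minimal question-answer loop. ask_rag(question) is expected to
    return (answer, [(page, quote), ...]); empty input exits."""
    print_fn("🌟 Asistente RAG - Basado en tu PDF 📄")
    while True:
        question = input_fn("🤔 Tu pregunta: ")
        if not question.strip():
            break
        respuesta, fuentes = ask_rag(question)
        print_fn(f"💡 Respuesta: {respuesta}")
        print_fn("📚 Fuentes relevantes:")
        for i, (page, quote) in enumerate(fuentes, 1):
            print_fn(f'{i}. Página {page}: "{quote}"')
```

Printing the source page alongside each quote is what lets users verify answers against the original PDF.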
| Criteria | Where Addressed |
|---|---|
| Clear purpose | "What is this about?" section |
| Technical validation | "How was the system built?" + implementation table |
| Reproducibility | GitHub repo + Setup Guide |
| Error handling | Code: safe_embedding() function |
| Use-case examples | Example Output + README.md |
"First RAG pipeline optimized for Spanish academic texts with NVIDIA’s latest models."
An architecture diagram is included in the repository (docs/architecture.png). The project is distributed under the terms in LICENSE.