Support this project by becoming a GitHub Sponsor or buying me a coffee!
This project aims to enhance how we interact with information and make learning more efficient by leveraging open-source language models. While the current model is small, it’s effective for retrieving relevant information for studying, decision-making, and analysis. The system operates locally, ensuring privacy and offline access without relying on cloud services.
Initially designed for personal use, I believe the project can support a wide range of users, from professionals staying updated on industry trends to hobbyists exploring new topics. Moving forward, I plan to improve the system with more advanced features and welcome contributions from others.
Named PyGamgee in homage to Samwise Gamgee from The Lord of the Rings, the project may seem humble, but its goal is to assist and empower users in their journey to master knowledge and make informed decisions.
Ultimately, this project reflects my belief in accessible technology that streamlines knowledge acquisition and supports users in applying information that drives growth and learning. By making it open-source, I hope others will contribute to improving and adapting it to their needs.
This project implements a Retrieval-Augmented Generation (RAG) question-answering system designed to help students and analysts learn and prepare for their studies, exams, and analyses. It builds on Ollama (embeddings and the language model), FAISS (the vector index), and Gradio (the web interface).
The system ingests PDF study notes, creates embeddings using Ollama, stores them in a FAISS index, and then uses this index to answer your questions about the material. This allows for a more interactive and personalized learning experience.
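The ingest, embed, index, and answer flow described above can be sketched in a few lines. This is a minimal illustration and not the code in pygamgee.py: toy_embed is a hypothetical bag-of-words stand-in for the Ollama embeddings, ToyIndex is a brute-force stand-in for the FAISS index, and the sample chunks are made-up text rather than content from a real PDF.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector (NOT an Ollama embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Brute-force similarity search, standing in for a FAISS index."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.vectors = [toy_embed(c) for c in self.chunks]

    def search(self, question, k=2):
        """Return the k chunks most similar to the question."""
        q = toy_embed(question)
        scored = sorted(
            zip(self.chunks, self.vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in scored[:k]]


def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the question into one prompt for the model."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up sample chunks, as if extracted from PDF study notes.
chunks = [
    "Deferred tax assets arise from deductible temporary differences.",
    "Inventory is measured at the lower of cost and net realizable value.",
    "Goodwill is tested for impairment at least annually.",
]
index = ToyIndex(chunks)
question = "How is inventory measured?"
top = index.search(question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the real system the prompt would be sent to the local model; the retrieval step is what keeps the answer grounded in your own notes.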
Before running this project, ensure you have the following installed:
Ollama, with the current model (deepseek-r1:1.5b) downloaded. Use ollama pull deepseek-r1:1.5b to download the model.
While the current model (deepseek-r1:1.5b) works well for basic learning and analysis, if you require higher accuracy or need to run larger models, we recommend upgrading your hardware to support models that require 80GB+ of GPU memory. This will significantly improve the accuracy and performance of your system.
To run deepseek-r1:70b, your GPU must have at least 80GB of VRAM. Additionally, multiple GPUs will be needed for parallel processing to achieve optimal performance.
Larger models such as deepseek-r1:70b will provide much better outputs, but they require more powerful hardware to run effectively. To download it:
ollama pull deepseek-r1:70b
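Before running the script, it can help to confirm that the Ollama server is reachable and that the model has been pulled. The sketch below assumes Ollama's default local address (http://localhost:11434) and its /api/tags endpoint, which lists locally available models; the helper names list_local_models and has_model are hypothetical and not part of pygamgee.py.

```python
import json
import urllib.error
import urllib.request


def list_local_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running".
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


def has_model(names, wanted):
    """True if the server answered and the wanted model tag is in its list."""
    return names is not None and wanted in names


names = list_local_models()
if names is None:
    print("Ollama does not appear to be running; try starting it with: ollama serve")
elif not has_model(names, "deepseek-r1:1.5b"):
    print("Model not found locally; try: ollama pull deepseek-r1:1.5b")
else:
    print("Ollama is running and deepseek-r1:1.5b is available.")
```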
Clone the repository:
git clone https://github.com/boyac/pyGamgee.git
cd pyGamgee
Install the required Python packages:
pip install -r requirements.txt
It is highly recommended to use a virtual environment.
Prepare your study materials: Place your data sources (in PDF format) in a directory named data.
Configure the script:
Set the data_dir variable in the pygamgee.py script (or your main script file) to point to the directory containing your study notes. An absolute path is the most reliable; a relative path such as data_dir = r"data" is resolved against the directory you run the script from.
Check that the model name in the script matches the model you downloaded (deepseek-r1:1.5b).
Adjust the chunk_size and chunk_overlap parameters in the CharacterTextSplitter to optimize retrieval performance for your specific study materials.
Run the script:
python pygamgee.py
Start Learning: A Gradio interface will open in your web browser. Ask questions about the materials and receive answers based on your data!
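The chunk_size and chunk_overlap parameters from the configuration step are easier to tune once their effect is visible. The sketch below is a simplified fixed-window splitter written only for illustration; split_text is a hypothetical helper, and the actual CharacterTextSplitter splits on a separator before merging pieces, so its chunk boundaries will differ.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghij"
print(split_text(sample, chunk_size=4, chunk_overlap=2))
print(split_text(sample, chunk_size=4, chunk_overlap=0))
```

A larger overlap repeats more text between neighbouring chunks, which helps when an answer straddles a boundary but increases the number of chunks to embed and store.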
pygamgee.py: The main script that handles data loading, embedding generation, FAISS indexing, question answering, and the Gradio interface.
data/: Directory where your data sources (PDF documents) are stored. Example file: far2024.pdf
faiss_index/: Directory where the FAISS index is saved (created automatically during the first run). This allows for faster loading on subsequent runs.
Ensure faiss-cpu (or faiss-gpu) is correctly installed.
Ensure ollama serve is running in the background whenever you use the script.
Contributions are welcome! If you find a bug, have an idea for a new feature, or want to improve the documentation, please open an issue or submit a pull request.
Please see CONTRIBUTING.md for guidelines.
I have several ideas for future enhancements to this project, but limited time to implement them myself. I'd love to hear your feedback and welcome contributions!
These are just a few ideas, and I'm open to suggestions and contributions from the community!
This project is made possible by the generous support of its users. If you find this project helpful, please consider becoming a GitHub Sponsor or buying me a coffee.
This project is licensed under a modified MIT License; see the LICENSE file for full details.