This publication introduces a prototype of an end-to-end Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents with open-source AI models. LangChain orchestrates the workflow and handles both the retrieval and the generation steps. Ollama runs the language models locally and makes it straightforward to pull or swap models as newer ones are released. Nougat OCR converts complex document layouts, including tables and scientific formulas, into readable text. ChromaDB provides vector storage and similarity-based retrieval over the extracted document content. Because every component runs locally, the system requires no paid API services and keeps document data on the local machine, which makes it suitable for both commercial and personal use.
Repository URL: https://github.com/alexej09/rag_system_pdf
Install the required dependencies:
pip install -r requirements.txt
To process a PDF file and store its embeddings in a vector database, execute:
python vector_db_nougat.py
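Before text can be embedded and stored, it usually has to be split into smaller pieces. The sketch below illustrates one common approach, fixed-size character windows with overlap, so that a sentence cut at a boundary still appears whole in a neighbouring chunk. The function name `split_into_chunks` and the default sizes are assumptions made for illustration; vector_db_nougat.py may split the extracted text differently.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each returned chunk would then be embedded and written to the vector database together with any metadata, such as the source file name.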
To read the content of the vector database, use:
python read_vector_db.py
To chat with the database, follow these steps:
Pull a model from the Ollama model library, e.g. llama3.2:
ollama pull llama3.2
Set the model name in chat_with_vector_db.py, then run the script:
python chat_with_vector_db.py
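The model name in the script has to match the name used with `ollama pull`. As a rough illustration of what that name controls, the sketch below sends a prompt to a locally running Ollama server through its REST endpoint `/api/generate` on the default port 11434, using only the standard library. This is an assumption-laden sketch and not the repository's code, which goes through LangChain; `MODEL_NAME`, `build_generate_request`, and `ask_ollama` are hypothetical names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_NAME = "llama3.2"  # assumed to match the model pulled with `ollama pull`


def build_generate_request(prompt: str, model: str = MODEL_NAME) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt: str, model: str = MODEL_NAME, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running Ollama server)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

If the model named in the request has not been pulled, the server answers with an error, so changing models means pulling the new one first and then updating the name.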
Once the chatbot script is running, you can ask questions about the processed PDFs. Example:
Your question: What are the key findings in this research paper?
The system will return the most relevant answer based on the extracted document content.
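To make "most relevant" concrete, the sketch below ranks stored chunks by cosine similarity between a query vector and each chunk vector. It is a toy in-memory illustration with hand-written three-dimensional vectors; in the actual system the embeddings come from a model, and ChromaDB performs the search with its own index and configurable distance metric. The names `cosine_similarity` and `top_k` are assumptions for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query, best match first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: made-up vectors standing in for real embeddings.
chunks = [
    ("Methods section", [1.0, 0.0, 0.0]),
    ("Key findings section", [0.0, 1.0, 0.0]),
    ("References section", [0.0, 0.0, 1.0]),
]
query = [0.1, 0.9, 0.0]  # pretend embedding of "What are the key findings?"
best_text, best_score = top_k(query, chunks, k=1)[0]
```

The retrieved chunks are then passed to the language model as context, and the model formulates the answer from them.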
Enter your questions in the terminal to chat with the system. To exit the script, type exit.
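The terminal interaction described above can be sketched as a simple read-answer loop. The `answer` callable stands in for the retrieval and generation pipeline; `chat_loop`, its parameters, and the case-insensitive handling of `exit` are assumptions for illustration and may differ from chat_with_vector_db.py.

```python
from typing import Callable


def chat_loop(
    answer: Callable[[str], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> int:
    """Prompt for questions until the user types 'exit'; return the number answered."""
    turns = 0
    while True:
        try:
            question = read("Your question: ").strip()
        except EOFError:
            break  # input stream closed, e.g. Ctrl-D
        if question.lower() == "exit":
            break
        if not question:
            continue  # ignore empty lines
        write(answer(question))
        turns += 1
    return turns
```

Calling `chat_loop(my_pipeline)` with a function that maps a question to an answer starts the session; passing `read` and `write` explicitly makes the loop easy to exercise without a terminal.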
This project is open-source and available under the MIT License.