
This is a simple RAG-based AI assistant that answers questions based on the txt or pdf documents provided in the data folder. Those documents are chunked, embedded, and stored in a ChromaDB vector database. The assistant embeds the user's query and searches the chunk embeddings in the vector database by cosine similarity; the top n results (where n is any number set by the developer) are combined with a custom prompt and sent to whichever of the three supported APIs is available (Google, OpenAI, or Groq).
The LLM then provides an answer based on the relevant chunks.
The project consists of two main folders, src/ and data/.
The data folder contains the documents that serve as the assistant's sources of information; these documents can be either txt or pdf files.
The src folder contains the source code for the project: app.py, utils.py, vectordb.py, and prompts.py.
app.py is the main module. It implements the assistant, with the functions for adding documents to the vector database and for invoking the model with a query.
vectordb.py implements the vector database, with the functions for chunking, embedding, and adding docs to the database, plus the function that searches the database for relevant chunks.
utils.py contains simple functions for reading the contents of txt and pdf files.
prompts.py contains the prompt templates.
The project's GitHub repo: https://github.com/Haydara-Othman/AAIDC_Module1
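For orientation, the repository layout looks roughly like this (a sketch, not an exhaustive listing):

```
AAIDC_Module1/
├── data/              # txt and pdf source documents
├── src/
│   ├── app.py         # assistant class + inference loop
│   ├── vectordb.py    # vector database: chunking, embedding, search
│   ├── utils.py       # pdf/txt readers, JSON-output parser
│   └── prompts.py     # prompt templates
├── .env.example       # names of the required environment variables
└── requirements.txt   # dependencies
```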
The repository has a .env.example file with the names of the required environment variables (such as the API keys and the model choices).
A .env file with those variable names and their actual values must be created.
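As a sketch, a filled-in .env could look like the following. The three API key names come from the troubleshooting hints printed by app.py; the model-choice variable name is only illustrative, so check .env.example for the exact names:

```
# At least one of these keys is required
OPENAI_API_KEY=your-openai-key
GROQ_API_KEY=your-groq-key
GOOGLE_API_KEY=your-google-key

# Model choice (variable name is illustrative; see .env.example)
GROQ_MODEL=llama-3.1-8b-instant
```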
The repository also has a requirements.txt file that lists the required dependencies.
To install the required libraries and dependencies, run the following command:
pip install -r requirements.txt
After creating the .env file and filling in your API keys and model choices, the project is ready to run.
The code won't work without installing the dependencies and setting up the environment variables, so make sure to run the command above and to put your API keys in the .env file.
You can put any document you want in the data folder; the accepted formats are pdf and txt.
The data-processing pipeline is simple: the program chunks all the documents, embeds the chunks, and adds the embeddings to the vector database together with some metadata:
the title of the original document the chunk came from, the chunk's text, and a unique id for each chunk.
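Conceptually, each stored entry looks something like this (the exact field names are an assumption based on the description above):

```python
{
    "id": "artificial_intelligence_0",    # unique id of the chunk
    "title": "artificial_intelligence",   # title of the source document
    "content": "Artificial Intelligence (AI) is a branch of computer science that ...",
    # the chunk's embedding vector is stored alongside it in ChromaDB
}
```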
Later, the search method is used to find the chunks most relevant to the user's query using cosine similarity.
The data folder in the repository already contains 11 documents on various scientific topics, ranging from full books to very short articles.
You can keep these documents and add your own, or remove them if you want.
The prompts.py file contains only the prompt templates, stored as strings.
It has two templates. The first is the main one, which gives the LLM instructions about its role, tone, limitations, and permissions.
The prompt is designed so that the LLM responds normally to everyday greetings, but when asked for information, the answer must come from the documents.
The prompt also implements the ReAct reasoning technique so that the AI can connect documents and pieces of information together.
In the prompt we also explicitly instruct the LLM to generate its output in JSON format with two keys. The first is "thinking", which holds the text generated while the LLM applied the ReAct reasoning technique; if it didn't use ReAct, it is instructed to set this value to "". The thinking process should not be printed to the user.
The second key is "response", which contains the LLM's final answer.
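For illustration, a well-formed model output would look something like this (the wording of the values is made up):

```json
{
  "thinking": "Thought: ... Action: ... Observation: ... Reflection: ...",
  "response": "The final answer that gets shown to the user."
}
```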
The second prompt is a secondary one for the secondary chain (briefly explained below), which is sent when the first LLM response isn't properly structured.
The situation where the LLM response isn't well formatted and structured is very rare, yet it can happen, which is why there is a second call to the LLM API as a backup.
Calling the LLM a second time barely increases the cost, since this situation is so rare.
The utils.py file contains simple functions to read the text contents of pdf and txt files.
It also has a function to parse the JSON-formatted output of the LLM.
````python
import json

def extract_final_answer(response: str) -> tuple[str, str, bool]:
    cleaned = response.strip()
    # Strip the ```json / ``` fences some LLMs wrap around their output.
    if cleaned.startswith("```json"):
        cleaned = cleaned[7:]
    elif cleaned.startswith("```"):
        cleaned = cleaned[3:]
    if cleaned.endswith("```"):
        cleaned = cleaned[:-3]
    cleaned = cleaned.strip()
    try:
        data = json.loads(cleaned)
        final_answer = data.get("response", "").strip()
        thinking = data.get("thinking", "").strip()
        return final_answer, thinking, True
    except json.JSONDecodeError:
        return response.strip(), "", False
````
This function removes the ```json fences that some LLMs wrap around their JSON output, then tries to parse the string into an actual dictionary to get both the response and the thinking values, and returns them together with True to indicate that the string was parsed successfully.
If parsing the string as JSON fails, the function returns the response as it is, leaves the thinking empty, and returns False to indicate that the parsing failed.
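For a quick illustration, here is how the function behaves on a well-formed and a malformed response (both example strings are made up):

````python
good = '```json\n{"thinking": "", "response": "AI is a branch of computer science."}\n```'
bad = "Sorry, here is my answer without JSON."

print(extract_final_answer(good))  # ('AI is a branch of computer science.', '', True)
print(extract_final_answer(bad))   # ('Sorry, here is my answer without JSON.', '', False)
````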
In this code file, we define the vector database class vectordb. To create an instance of this class, we need to pass a name for the collection in which the embeddings and docs will be stored, and the embedding model used to embed the documents.
This vector database class has the following functions:
def chunk_text(self, text: str, title: str, chunk_size: int = 500):
This chunks large texts into chunks of the same size (determined by the developer), with overlap between consecutive chunks.
It needs the text to chunk, the title of the original document the text came from, and the chunk size.
It uses LangChain's RecursiveCharacterTextSplitter to split the text into chunks of size chunk_size, with an overlap between consecutive chunks of min(chunk_size / 10, 200). (This choice of overlap size is entirely up to the developer; in this project we used it because it worked well.)
Each chunk is then wrapped in a dictionary with the chunk's content, its title (the title of the original document the chunk came from), and a unique id.
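A minimal sketch of what chunk_text could look like under those assumptions (the dictionary field names and the id format are guesses based on the description above):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_text(self, text: str, title: str, chunk_size: int = 500):
    # Overlap between consecutive chunks: a tenth of the chunk size, capped at 200 characters.
    overlap = min(chunk_size // 10, 200)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    pieces = splitter.split_text(text)
    # Wrap every chunk with its source title and a unique id.
    return [
        {"id": f"{title}_{i}", "title": title, "content": piece}
        for i, piece in enumerate(pieces)
    ]
```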
def add_documents(self, documents: List, chunk_size: int = 500) -> None:
This function takes a list of documents, where each document is a dictionary containing the content and the title of a document in the data folder.
It iterates over the list of docs passed to it and does the following for each doc:
it chunks the document's content to get a list of chunks, then works in a batching-like manner to respect ChromaDB's limit on the number of chunks added at once (around 5400).
For each batch of chunks from the same document (with a maximum length of 5400), it computes their embeddings with the embedding model, builds an accompanying metadata list holding the title of the original document and a unique id (the order of the chunk in the database), and stores a copy of the chunk's original text so it can be retrieved again when searching the database.
So in the end, the database holds the embedding of each chunk, with a unique id, its text, and the title of the original document it came from.
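A rough sketch of that batching logic, assuming the embedding model is a SentenceTransformer and self.collection is a ChromaDB collection (both attribute names are assumptions):

```python
from typing import List

MAX_BATCH = 5400  # approximate ChromaDB limit on chunks added in one call

def add_documents(self, documents: List, chunk_size: int = 500) -> None:
    for doc in documents:
        chunks = self.chunk_text(doc["content"], doc["title"], chunk_size)
        for start in range(0, len(chunks), MAX_BATCH):
            batch = chunks[start:start + MAX_BATCH]
            texts = [c["content"] for c in batch]
            embeddings = self.embedding_model.encode(texts).tolist()
            self.collection.add(
                ids=[c["id"] for c in batch],
                embeddings=embeddings,
                documents=texts,  # keep a copy of the raw text for retrieval
                metadatas=[{"title": c["title"]} for c in batch],
            )
```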
def search(self, query: str, n_results: int = 5):
This function embeds the user's query, then searches the vector database for the top n_results (set by the developer) chunks most relevant to the query, using cosine similarity.
It returns the results as a list of dictionaries, one per chunk, each containing:
the text content of the chunk, its metadata (the title of the original document and the chunk's id), and the similarity between the chunk's embedding and the embedding of the user's query.
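A hedged sketch of search, shaped so that its output matches how invoke (shown later) consumes it; the conversion from ChromaDB's cosine distance to a similarity score is an assumption:

```python
def search(self, query: str, n_results: int = 5):
    query_embedding = self.embedding_model.encode([query]).tolist()
    results = self.collection.query(
        query_embeddings=query_embedding,
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )
    # Repackage ChromaDB's column-wise result into one dict per chunk.
    return [
        {"content": doc, "title": meta["title"], "similarity": 1 - dist}
        for doc, meta, dist in zip(
            results["documents"][0], results["metadatas"][0], results["distances"][0]
        )
    ]
```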
This is the main code file. It contains the RAGAssistant class implementation and the very simple inference loop.
The code contains the load_documents() function:
def load_documents() -> List:
which extracts the text of every pdf or txt document in the data folder along with its title, puts both in a dictionary, and finally returns the list of those dictionaries.
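A sketch of what load_documents might look like; the reader helpers from utils.py are hypothetical names (only their purpose is described above), and the data-folder path handling is an assumption:

```python
from pathlib import Path
from typing import List

DATA_DIR = Path(__file__).resolve().parent.parent / "data"  # assumed location of data/

def load_documents() -> List:
    documents = []
    for path in DATA_DIR.iterdir():
        if path.suffix.lower() == ".pdf":
            text = read_pdf(path)   # hypothetical helper from utils.py
        elif path.suffix.lower() == ".txt":
            text = read_txt(path)   # hypothetical helper from utils.py
        else:
            continue  # skip unsupported formats
        documents.append({"title": path.stem, "content": text})
    return documents
```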
The RAGAssistant constructor initializes the LLM, the vector database, and the full chain.
Below, we briefly discuss the functions of this class:
def _initialize_llm(self):
This function tries to read an API key from the environment variables to connect to an LLM, in the following order:
it first tries to get the OpenAI key; if it can't find it, it tries the Groq key; and if that is also missing, it tries the Google key. If no key is available at all, the code raises an error telling the user to provide an API key in the .env file.
If a key is available, the function returns the LLM client.
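A hedged sketch of that fallback order; the LangChain chat classes are a plausible choice, and the OpenAI and Google model names are assumptions (the run log later shows Groq's llama-3.1-8b-instant being used):

```python
import os

def _initialize_llm(self):
    if os.getenv("OPENAI_API_KEY"):
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o-mini")  # model choice is an assumption
    if os.getenv("GROQ_API_KEY"):
        from langchain_groq import ChatGroq
        return ChatGroq(model="llama-3.1-8b-instant")
    if os.getenv("GOOGLE_API_KEY"):
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model="gemini-1.5-flash")  # model choice is an assumption
    raise ValueError("No API key found. Add at least one API key to your .env file.")
```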
def add_documents(self, documents: List):
This function simply uses the add_documents method of the vector database to add the docs to the DB.
def invoke(self, input: str, n_results: int = 5):
This function takes the user's query (input) and the number of relevant text chunks to return when searching the vector database (default is 5).
The function uses the search method of the vector database class to get the results back.
Then it builds a new string variable, content, which is one large string holding every relevant chunk returned by the search method, each chunk's content preceded by the title of the original document it was extracted from.
That content is inserted into the prompt_template, and the resulting prompt is passed to the LLM to get its output.
The LLM should generate its output in JSON format; the output is then parsed into a dictionary using the extract_final_answer function from utils.py.
If parsing fails, extract_final_answer returns False together with the unparsed LLM output. In that case, the unparsed output is sent back to the LLM to extract the final response from it, since it may contain thinking text that shouldn't be printed to the user.
This situation doubles the cost because it invokes the LLM a second time, but it is so rare that its impact on the total cost is negligible.
The function returns a dictionary with the final LLM response, the thinking process, the full raw response, the relevant context, the sources, and is_sources,
where sources is the list of titles of the docs the relevant chunks came from, and is_sources is a Boolean indicating whether there are relevant sources to print.
```python
def invoke(self, input: str, n_results: int = 5):
    rc = self.vector_db.search(query=input, n_results=n_results)
    content = "\n\n".join(
        [f"From {chunk['title']} : \n {chunk['content']}" for chunk in rc]
    )
    sources = [chunk['title'] for chunk in rc if chunk["similarity"] > 0]
    is_sources = True if sources else False
    sources = list(set(sources))
    sources = ' \n '.join(sources)

    llm_response = self.chain.invoke({'content': content, 'query': input})

    # If the LLM didn't generate JSON format properly, resend the response
    # to the LLM and ask it to extract the final response.
    final_answer, thinking_process, parsed = extract_final_answer(llm_response)
    if not parsed:
        final_answer = StrOutputParser().invoke(self.secondary_chain.invoke(final_answer))

    return {
        'llm_final_response': final_answer,    # Only the final answer for display
        'thinking_process': thinking_process,  # The thinking process (can be empty)
        'full_response': llm_response,         # The complete raw response
        'relevant_context': content,
        'sources': sources,
        'is_sources': is_sources
    }
```
The main function initializes the RAGAssistant instance, loads the documents using the load_documents() function, and adds them to the vector database using the .add_documents() method.
Then the program enters a loop: it asks the user to enter a query, invokes the assistant with that query using the .invoke method, and prints the LLM response with the sources.
If an exception occurs, the program prints the exception alongside some troubleshooting advice and then stops.
```python
def main():
    try:
        print("Initializing RAG Assistant...")
        assistant = RAGAssistant()

        print("\nLoading documents...")
        sample_docs = load_documents()
        print(f"Loaded {len(sample_docs)} sample documents")
        assistant.add_documents(sample_docs)

        done = False
        while not done:
            print("")
            print("_" * 50)
            print("")
            question = input("Enter a question or 'quit' to exit: ")
            print("\n")
            if question.lower() == "quit":
                done = True
            else:
                result = assistant.invoke(question)
                print(result['llm_final_response'])
                if result["is_sources"]:
                    sources = result["sources"]
                    print(f"\n Sources : {sources}")
    except Exception as e:
        print(f"Error running RAG assistant: {e}")
        print("Make sure you have set up your .env file with at least one API key:")
        print("- OPENAI_API_KEY (OpenAI GPT models)")
        print("- GROQ_API_KEY (Groq Llama models)")
        print("- GOOGLE_API_KEY (Google Gemini models)")


if __name__ == "__main__":
    main()
```
Run the assistant with:
python src/app.py
In this section we'll analyze and test the outputs of the assistant:
When we run the app.py file, we get the following:
```
Initializing RAG Assistant...
Using Groq model: llama-3.1-8b-instant
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Vector database initialized with collection: rag_documents
RAG Assistant initialized successfully

Loading documents...
Loaded 11 sample documents
Processing 11 documents...
Documents added to vector database

__________________________________________________

Enter a question or 'quit' to exit:
```
We check that it works by asking a simple question whose answer is stated literally in the documents:
```
Enter a question or 'quit' to exit: What is AI

Artificial Intelligence (AI) is a branch of computer science that aims to create intelligent machines that can perform tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, perception, and language understanding.

 Sources : artificial_intelligence
```
We can see that it generated exactly the answer found in the mentioned source.
Now, let's test how it responds to an everyday greeting:
```
Enter a question or 'quit' to exit: Hello, how are you today

Hello, I'm an AI assistant. I'm here to help answer your questions based on the given content. How can I assist you today?
```
It responded to the greeting normally without any errors, and it didn't mention any source, which is exactly what we want.
For a more complex case, we ask a question whose answer isn't directly in the resources; the assistant has to connect information across the content it was given, so it must use ReAct as instructed in the prompt.
```
Enter a question or 'quit' to exit: List the top 3 inventions of the 19th century

The top 3 inventions of the 19th century are likely to be the steam engine, the electric telegraph, and improvements in iron and steel manufacturing.

 Sources : Discoveries and Inventions of the Nineteenth Century
```
Even though the documents don't contain an explicit list of the top 3 inventions, the LLM used the ReAct reasoning method and reached an answer.
For testing and inspecting the thinking process, if we print the answer together with the thinking process (the "thinking" key in the final result of the invoke function), we get the following:
```
Enter a question or 'quit' to exit: List the top 3 inventions of the 19th century

The top 3 inventions of the 19th century are likely to be the steam engine, the electric telegraph, and improvements in iron and steel manufacturing.

 Sources : Discoveries and Inventions of the Nineteenth Century

Thinking process for testing:
Thought: The content mentions several inventions, but it doesn't explicitly list the top 3. I need to find a way to narrow down the options.
Action: I'll look for any sections or paragraphs that discuss notable inventions of the 19th century.
Observation: The preface mentions that the author has aimed to present a popular account of remarkable discoveries and inventions of the 19th century. It also highlights the importance of the steam engine, improvements in iron and steel manufacturing, and the telegraph and telephone.
Reflection: Based on the content, I can infer that the top 3 inventions of the 19th century are likely to be the steam engine, the electric telegraph, and improvements in iron and steel manufacturing.
```
We can see that the assistant did what it was instructed to do when it can't find a direct answer in the content.
It applied the ReAct method, chose and followed an approach to get the information it needed, and produced the final response.
We can also see that in the following example:
```
Enter a question or 'quit' to exit: What is the name of the first story in the Arabian Nights book

The Story of the Merchant and the Genius

 Sources : The Arabian Nights Entertainments
```
And when printing the thinking process:
```
Enter a question or 'quit' to exit: What is the name of the first story in the Arabian Nights book

The Story of the Merchant and the Genius

 Sources : The Arabian Nights Entertainments

Thinking process for testing:
Thought: The content mentions the title of the book as 'The Arabian Nights Entertainments' and lists several stories. I need to find the first story mentioned in the content.
Action: I'll look for the first story in the list of contents.
Observation: The first story mentioned is 'The Story of the Merchant and the Genius'.
Reflection: Based on the content, I can infer that the first story in the Arabian Nights book is 'The Story of the Merchant and the Genius'.
```
This confirms that the assistant uses the ReAct reasoning method correctly when needed to answer questions, even slightly complicated ones.
This assistant responds to queries based on the data (documents) in the data folder.
It should generate its output in JSON format.
If the answer is found ready in the content, it outputs the response directly.
If the assistant can't find a direct answer, it uses the ReAct reasoning method to connect pieces of information in the provided content and build a full response.
If the output isn't valid JSON (which happens very rarely), the output (both the thinking and the final response) is resent to the LLM to extract the final response and return it to us.