Knowledge retention is a challenge for organizations across various sectors. The loss of intellectual capital when employees leave without effective knowledge transfer leads to inefficient operations and disrupts business continuity.
The traditional model of knowledge management has created data silos within organizations, leading to:
A Retrieval Augmented Generation (RAG) based application has been developed in collaboration with a Computer Aided Engineering (CAE) services company to utilize data from that domain. This publication gives an overview of the approach, testing methodology, results, and insights based on extensive testing and integration of the RAG components. The application has been validated by domain experts and delivers domain-specific, contextual, and relevant responses with good accuracy.
The knowledge base encompasses CAE domain-specific proprietary documents, technical manuals, online lecture transcripts, official websites, expert interviews, channel conversations, and Frequently Asked Questions (FAQs). The data is stored in a vector database, where it is indexed and made searchable for real-time retrieval during query operations. The system relies on a three-step process:
The following hardware specifications were utilized for the development:
Additional development costs include API pricing for the OpenAI and Claude models.
The vector database serves as the core component for managing unstructured data, transforming domain-specific documents into vector embeddings. This enables advanced semantic similarity search, ensuring relevant documents are retrieved efficiently. Milvus and Pinecone were tested due to their scalability and efficiency in managing high-dimensional data vectors. Additionally, they integrate seamlessly with existing Machine Learning pipelines, streamlining deployment in production environments.
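As an illustration of how such a collection might be configured, the sketch below defines a Milvus collection for document chunks and builds an IVF-FLAT index using cosine similarity. It assumes pymilvus 2.x and a local Milvus instance; the collection name, fields, and embedding dimension are placeholders rather than the production setup.

```python
# A minimal sketch of a Milvus collection for CAE document chunks, assuming
# pymilvus 2.x and a local Milvus instance. Collection name, field names, and
# the embedding dimension are illustrative, not the production configuration.
from pymilvus import (
    connections, FieldSchema, CollectionSchema, DataType, Collection
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="chunk_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=512),
]
schema = CollectionSchema(fields, description="CAE document chunks")
collection = Collection(name="cae_chunks", schema=schema)

# IVF_FLAT index with cosine similarity (the COSINE metric requires Milvus >= 2.3).
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_FLAT",
        "metric_type": "COSINE",
        "params": {"nlist": 128},
    },
)
collection.load()
```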
Embedding models are central to transforming documents into dense, multi-dimensional vector representations. These embeddings capture the semantic meaning of the documents, allowing the system to perform effective searches based on user queries. Embedding models like BERT and GTE are used to map text data into a searchable vector space. Snowflake and GTE embedding models were tested extensively and GTE was chosen for pipeline development.
LLMs such as GPT, Mistral, or Vicuna are used to generate responses. The model processes both the query and the retrieved documents, integrating domain-specific knowledge with the user's query to generate a reasonably accurate response. The combination of these components allows for a dynamic system where real-time information retrieval is integrated with pre-trained knowledge.
The following pipelines were evaluated to arrive at the right combination for CAE specific information:
The data loading system handles the ingestion and processing of documents into the vector database. It involves a step-by-step process that transforms raw data into searchable vectors:
A sentence can be transformed into a vector using an embedding model. The code that demonstrates this conversion, along with its output, is shown below.
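A minimal sketch of this conversion, using a GTE model through the sentence-transformers library, is shown below; the specific checkpoint and the example sentence are illustrative choices.

```python
# Convert a sentence into a dense vector with a GTE embedding model.
# The checkpoint "thenlper/gte-base" (768-dimensional) is an illustrative choice.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-base")
sentence = "Mesh refinement reduces discretization error in finite element analysis."
vector = model.encode(sentence)

print(vector.shape)   # (768,) for gte-base
print(vector[:5])     # first few components of the embedding
```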
The vector representation of each chunk is stored along with its metadata in the vector database.
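Continuing the illustrative schema and embedding model from the sketches above, the chunk vectors and their metadata could be inserted as follows.

```python
# Embed a batch of chunks and insert them column-wise into the Milvus collection
# defined earlier; the source filename is a placeholder metadata value.
chunks = [
    "Explicit solvers use small time steps governed by the stability limit.",
    "Implicit solvers allow larger time steps but require equilibrium iterations.",
]
embeddings = model.encode(chunks)

collection.insert([
    embeddings.tolist(),                  # "embedding" field
    chunks,                               # "text" field
    ["solver_guide.pdf"] * len(chunks),   # "source" metadata field
])
collection.flush()
```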
Images are extracted and sent to vision LLMs for generating descriptions. These descriptions are converted into vectors and stored in the database alongside the images. This structured approach ensures that all relevant data is searchable, allowing the AI assistant to retrieve and process information from diverse sources.
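The sketch below shows one way such descriptions could be generated with a vision-capable LLM through the OpenAI Python SDK (v1.x); the model name, prompt, and file path are illustrative assumptions, and the resulting description would then be embedded and stored like any other text chunk.

```python
# Generate a searchable text description of a CAE image with a vision LLM.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("stress_contour_plot.png", "rb") as f:  # illustrative file name
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this CAE result image for a searchable knowledge base."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
description = response.choices[0].message.content  # embedded and stored alongside the image
```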
The workflow is shown in the figure below.
The LLM is first used to process the user query, rephrasing it for optimal retrieval. An embedding model then converts the rephrased question into a vector representation, which makes it easier to find relevant text chunks in the Milvus vector database. The question is thus represented in a high-dimensional space, where each dimension captures a distinct semantic aspect of the text.
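A simplified sketch of this rephrase-and-embed step follows, assuming an OpenAI chat model for rephrasing and the same GTE embedder used for the documents; the prompt wording and example question are illustrative.

```python
# Rephrase the raw user question with an LLM, then embed the rephrased query.
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("thenlper/gte-base")

user_question = "why does my model blow up when I run it explicit?"
rephrased = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "Rewrite the user's question as a clear, self-contained query "
                    "about Computer Aided Engineering, suitable for document retrieval."},
        {"role": "user", "content": user_question},
    ],
).choices[0].message.content

query_vector = embedder.encode(rephrased)  # used for the vector search below
```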
Cosine similarity is used as the main comparison metric for finding relevant chunks. It measures the cosine of the angle between the query vector and each chunk vector: a higher score indicates a closer semantic association, and therefore a chunk that is more likely to be relevant to the query. The cosine similarity between two closely related sentences is displayed below.
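For instance, the snippet below computes the cosine similarity between two closely related sentences using the same GTE embeddings; the sentences are illustrative, and near-paraphrases typically score close to 1.

```python
# Cosine similarity between two semantically similar sentences.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-base")
v1 = model.encode("A finer mesh improves the accuracy of the stress results.")
v2 = model.encode("Refining the mesh gives more accurate stress predictions.")

score = util.cos_sim(v1, v2).item()
print(f"Cosine similarity: {score:.3f}")
```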
At the same time, the system checks for any image IDs in the metadata that would require fetching related images from MongoDB. To generate a comprehensive final response, the system combines the rephrased query with a prompt template that guides the LLM to formulate an answer using the retrieved content, including both text chunks and relevant images.
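A simplified version of this final generation step is sketched below; the prompt template wording and model name are illustrative assumptions rather than the production prompt.

```python
# Combine the rephrased query and retrieved content into a prompt and generate the answer.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are a CAE domain assistant. Answer the question using only
the context below. If the context is insufficient, say so.

Context:
{context}

Question: {question}
Answer:"""

def generate_answer(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(context=context, question=question)}],
    )
    return completion.choices[0].message.content
```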
The workflow is shown in the figure below.
The system was evaluated through a series of structured experiments. The primary goal was to determine the best combination of vector databases, embedding models, and large language models (LLMs) for providing contextual, accurate and relevant responses. The experiments focused on tuning the following parameters to optimize the pipeline:
The Top K parameter controls the number of most relevant documents retrieved from the vector database for any given query. We experimented with different Top K values (e.g., 5, 10, 50) to balance retrieval relevance and computational efficiency.
The similarity threshold determines how closely the retrieved documents match the query in semantic space. This was evaluated by adjusting cosine similarity thresholds (e.g., 30%, 60%) to find the optimal balance between retrieval accuracy and the quality of the generated response.
Different vector database index types were tested (IVF-FLAT, HNSW) to compare retrieval speeds and accuracy. IVF-FLAT provided a good trade-off between speed and similarity score, while HNSW (Hierarchical Navigable Small World) offered higher accuracy at the cost of speed.
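The sketch below, continuing the illustrative collection and query vector from the earlier sketches, shows how these parameters interact at query time: `limit` sets Top K, the threshold filters the returned cosine scores, and the search parameters differ by index type. The values shown mirror the ranges described above and are not the tuned production settings.

```python
# Vector search with Top K and a cosine-similarity threshold.
results = collection.search(
    data=[query_vector.tolist()],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 16}},  # IVF_FLAT search parameter
    # param={"metric_type": "COSINE", "params": {"ef": 64}},    # HNSW equivalent
    limit=10,                          # Top K: number of candidate chunks returned
    output_fields=["text", "source"],
)

# Keep only chunks whose cosine similarity clears the threshold (e.g., 0.6).
relevant_chunks = [
    hit.entity.get("text")
    for hit in results[0]
    if hit.distance >= 0.6
]
```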
Other iterations tested are listed below:
To evaluate the effectiveness of the various configurations, we used the following metrics:
The accuracy of responses generated by the LLMs was measured based on how well they addressed the user's query, incorporating the correct information retrieved from the vector database.
Each retrieved document’s relevance to the query was rated on a scale from 0 to 5, with 5 representing a perfect match and 0 indicating irrelevance. These relevance scores were aggregated to assess overall retrieval performance.
The response time from query to answer was measured to ensure the system could provide real-time responses. Latency was particularly important when testing different Top K values and indexing methods.
The system’s memory usage and computational load were monitored to assess the trade-offs between higher retrieval accuracy and processing efficiency, particularly when using more complex models such as LLaMA 70B.
In addition, RAGAS metrics were incorporated to provide deeper insights into the quality of the responses:
Measures how accurately the information retrieved from the vector database matches the query’s intent. High contextual precision indicates that the retrieved information is directly relevant to the user's question.
Evaluates how comprehensively the retrieved information addresses all aspects of the query. High contextual recall indicates that the system is not only retrieving relevant chunks but is also covering the full scope of the query.
Focuses on the relevance of the final generated answer in relation to both the query and the retrieved documents. This metric assesses how well the LLM integrates the retrieved information to form a coherent and contextually appropriate response.
Measures the factual consistency between the retrieved documents and the generated answer. High faithfulness ensures that the LLM does not introduce hallucinations or incorrect information in the final response, particularly critical in domain-specific applications like CAE.
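These metrics can be computed with the ragas library; the sketch below assumes a ragas 0.1.x-style API (newer versions rename the dataset columns) and a single illustrative question-answer record.

```python
# Compute RAGAS metrics over an evaluation dataset (requires an LLM API key,
# since ragas uses an LLM as the judge for several metrics).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision, context_recall, answer_relevancy, faithfulness
)

eval_data = Dataset.from_dict({
    "question": ["What does a mesh convergence study involve?"],
    "answer": ["It refines the mesh repeatedly until key results stop changing."],
    "contexts": [[
        "A mesh convergence study progressively refines the mesh and compares "
        "results until they stabilize within a tolerance."
    ]],
    "ground_truth": ["Refine the mesh progressively and check that results stabilize."],
})

report = evaluate(
    eval_data,
    metrics=[context_precision, context_recall, answer_relevancy, faithfulness],
)
print(report)
```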
Giskard RAG (Retrieval Augmented Generation) evaluation is a framework designed to assess the performance of RAG pipelines, especially in the context of NLP systems such as question answering or summarization. The idea behind RAG is to combine the strengths of two components: a retriever, which fetches relevant information from a knowledge base, and a generator, which formulates natural language responses based on the retrieved information.
Generator
Retriever
Rewriter
Router
Knowledge Base
Based on the text-based Giskard evaluation metrics, for closed-source models, GPT-4o mini outperforms Claude 3.5 Sonnet and GPT-4o, and for open-source models, LLaMA 3.1 8B outperforms LLaVA-Mistral. The results are shown below.
A dataset of 20 questions was generated from the files, and unit tests were conducted on the pipelines. Bias was introduced as an additional metric during the pytest runs. Along with the existing RAGAS metrics (contextual precision, contextual recall, answer relevancy, and faithfulness), Bias has now been incorporated as an evaluation criterion for the unit tests. A threshold of 0.5 was set for all metrics, meaning any test case scoring below 0.5 on even a single metric would result in a failed evaluation.
Note: The bias metric determines whether the LLM output contains gender, racial, or political bias. Therefore, the Bias score must fall below the 0.5 threshold, ideally at 0%, for a test case to pass.
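As an illustration, such a unit test could be written with pytest and an evaluation library like deepeval, which provides RAGAS-style metrics alongside a bias metric with per-metric thresholds; the library choice and the question, answer, and context strings below are assumptions, not the actual test suite.

```python
# One illustrative pytest case: the test passes only if every metric clears its threshold.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, BiasMetric
from deepeval.test_case import LLMTestCase

def test_cae_answer_quality():
    test_case = LLMTestCase(
        input="What is hourglassing in explicit finite element analysis?",
        actual_output=(
            "Hourglassing is a zero-energy deformation mode that can appear with "
            "reduced-integration elements and is controlled with hourglass stiffness."
        ),
        retrieval_context=[
            "Hourglassing refers to non-physical, zero-energy modes in "
            "reduced-integration elements, mitigated by hourglass control."
        ],
    )
    metrics = [
        AnswerRelevancyMetric(threshold=0.5),
        FaithfulnessMetric(threshold=0.5),
        BiasMetric(threshold=0.5),  # passes only when the bias score stays low
    ]
    assert_test(test_case, metrics)
```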
It is observed that LLaMA 3.1 passed 13 out of 20 test cases in the unit testing.
Considering that a test case is deemed successful only if all metrics exceed the 50% threshold, the following metrics analysis table was created to provide a better understanding of the unit testing results.
The table above presents the unit testing results, with metric definitions provided in the previous section. This table summarizes the outcomes of the 20 unit tests conducted on the pipeline. In each test case, the bias metric consistently scored 0, leading to a count of 20 in the 0.00 to 0.30 range. For bias, a score of 0% is considered a pass, as it indicates no presence of gender, racial, or political bias in the output, which is ideal. This result shows the LLM's responses are neutral and free from potentially problematic or offensive content. All three remaining metrics scored above 70%, and the overall unit test achieved a score of 65%, passing 13 out of 20 test cases. For a test case to pass, each metric must individually score above 50%.
GPT-4o produced identical results to LLaMA 3.1 in the unit testing, passing 13 out of 20 tests.
The table above presents the corresponding unit testing results for GPT-4o, summarizing the outcomes of the 20 unit tests conducted on the pipeline. As with LLaMA 3.1, the bias metric consistently scored 0 (a count of 20 in the 0.00 to 0.30 range), which counts as a pass since it indicates no gender, racial, or political bias in the output. All four remaining metrics scored above 70%, and the overall unit test achieved a score of 65%, passing 13 out of 20 test cases; for a test case to pass, each metric must individually score above 50%.
Milvus demonstrated high efficiency in retrieving documents at scale, with IVF-FLAT providing the best balance between retrieval time and accuracy. HNSW offered slightly better accuracy but introduced significant latency at larger scales, making it less practical for real-time applications.
In the initial development phase, numerous open-source and licensed large language models (LLMs) were evaluated. The availability of more advanced LLMs in a short timeframe enabled us to test increasingly powerful options. Initial iterations included models like Falcon, LLaMA-2, and Mistral 7B. The primary motivation for adopting a Retrieval Augmented Generation (RAG) approach was the rapid advancement of Generative AI, promising continual access to faster, more capable models suited to complex tasks.
The knowledge base contains approximately 2,500 files related to the CAE domain, enabling the LLM to deliver content-rich answers to user queries.
Key components were evaluated based on a variety of criteria, including retrieval accuracy, response quality, and system efficiency. While the detailed results of the final configuration remain proprietary, the exploration phase provided valuable insights. The system demonstrated strong performance in handling a variety of input types, including text, images, and mathematical queries. Evaluation metrics such as contextual precision, recall, and answer relevancy were used to measure the system's performance. Across all iterations, the system consistently met, and at times exceeded, the predefined threshold (0.5), ensuring reliable retrieval of relevant documents and quality response generation.
By incorporating multimodal capabilities, the pipeline effectively supports diverse use cases within the CAE domain. A user-friendly GUI enables seamless switching between input types, making domain knowledge easily accessible. A few snapshots of the application are shown below.
To address the challenge of knowledge retention in organizations, an AI-driven assistant with multimodal data handling capabilities was developed by integrating RAG techniques for the CAE domain. The system demonstrates strong performance in handling text, images, and mathematical queries, with GPT-4o-mini, GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.1 8B emerging as top performers.
For the given experimental environment setup, the model performed with good accuracy, validating the approach for knowledge retention. Deployment of the model in a real-time environment, where the knowledge base is constantly updated and multiple users access the application, is currently in progress.
The model will be improved to ensure the company never “forgets” critical information and keeps employees informed, thereby maintaining competitive advantage and improving financial performance.