Developing and Evaluating Multi-Agent RAG System for Central Bank of Libya Financial Reports
Abstract
Large Language Models (LLMs) have revolutionized open-domain question answering, yet they struggle with domain-specific queries—particularly when dealing with dynamic and specialized financial data. This paper introduces a Multi-Agent Retrieval-Augmented Generation (RAG) system designed specifically for processing Arabic financial reports from the Central Bank of Libya. Our approach integrates advanced document preprocessing techniques, including structural normalization, table reconstruction, and metadata stripping, to prepare a diverse dataset from the Central Bank of Libya's financial reports. The system employs a hierarchical framework where a top-level agent directs specialized low-level agents to perform refined, domain-targeted searches. An iterative refinement mechanism further enhances retrieval accuracy, ensuring that complex multi-hop queries yield coherent, context-aware responses. Comparative evaluations against a Naive RAG baseline—using metrics such as Answer Similarity, Correctness, Faithfulness, and Relevance—demonstrate the effectiveness of our approach. Notably, our Multi-Agent RAG system achieved an answer correctness score of 3.72 out of 5, significantly outperforming the Naive RAG baseline, which scored 2.31. This highlights the advantages of a structured, multi-agent retrieval process in handling complex financial queries with higher accuracy and reliability.
1. Introduction
Large Language Models (LLMs) have demonstrated remarkable proficiency in open-ended question answering, enabling them to generate human-like responses across a wide range of topics [1]. However, they face significant limitations when dealing with domain-specific queries, particularly in fields requiring up-to-date or highly specialized knowledge. These limitations stem from knowledge cut-offs, where LLMs are restricted to the data available at the time of training, as well as gaps in proprietary, niche, or dynamically evolving information.
A common approach to mitigating this issue is to enhance LLMs by integrating external knowledge sources, a technique known as Retrieval-Augmented Generation (RAG) [2]. This method enables LLMs to access real-time, domain-specific data by retrieving relevant documents from external repositories and incorporating them into their responses. By supplementing the model's inherent knowledge with curated financial reports, structured databases, and authoritative sources, RAG helps improve accuracy, reliability, and contextual relevance in domain-specific applications [3][4][5].
We define Naive RAG as the most basic form of RAG, where retrieval is performed using a single static semantic-similarity search. In this approach, a set of documents is retrieved based on the initial query and fed directly into the language model for answer generation, without any iterative refinement or contextual adaptation.
In contrast, advanced RAG incorporates agentic retrieval, where autonomous decision-making agents refine the retrieval process dynamically. These systems employ feedback loops to adjust search queries, assess source credibility, and prioritize contextually relevant information. By integrating adaptive mechanisms, advanced RAG enhances the accuracy and relevance of generated responses, particularly in complex domains such as finance.
This paper aims to enhance RAG for Arabic financial data, specifically from the Central Bank of Libya. Our objectives are to first clean and prepare Arabic financial documents to ensure high-quality data for retrieval. Next, we develop a multi-agent retrieval system tailored to the structure and complexity of these financial documents, enabling more accurate and context-aware information retrieval. Finally, we create a test dataset from our files and use it to systematically evaluate and compare the performance of multi-agent RAG against a simple RAG system. Through these objectives, we seek to improve retrieval accuracy and document understanding in Arabic financial contexts.
2. Methodology
2.1 Dataset
This study utilizes a collection of publicly available PDF documents from the CBL (Central Bank of Libya), covering various financial reports. The dataset encompasses a total of 85 files including annual reports, economic bulletins, and foreign exchange usage reports, all of which provide insights into the financial sector. These documents span a period from 2007 to 2025 and contain a mix of structured and unstructured data, including long and short tables, detailed paragraphs, and financial statistics. Additionally, the dataset consists of both selectable PDFs (digitally readable) and non-selectable PDFs (scanned or image-based documents), requiring different pre-processing techniques [6].
2.2 Parsing
To integrate the Central Bank of Libya's financial reports into our RAG pipeline, we used LlamaParse to convert PDFs to Markdown format, followed by LlamaIndex for structured data ingestion and indexing [7]. While the parsing process extracted content effectively in general, we encountered several challenges that required manual intervention. These included metadata artifacts like page numbers mixing with main text, incorrectly split multi-page tables causing misalignment, mis-rendered Arabic script characters (particularly currency symbols and numerals), and figures lacking proper captions. We implemented a comprehensive manual curation pipeline to address these issues, consisting of structural normalization to reorganize content into logical sections, table reconstruction through cross-referencing with original PDFs, metadata stripping to isolate relevant information, and annotation enrichment to add descriptive elements. This curation process demanded iterative validation against the source PDFs to ensure fidelity, with particular attention needed for Arabic-language content where right-to-left text formatting and ligatures presented additional parsing challenges after the initial LlamaParse conversion.
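For illustration, a minimal sketch of this parsing step is shown below; the API key, file path, and parser options are placeholders rather than our exact configuration, and LlamaParse option names may vary across versions.

```python
# Minimal sketch of the PDF-to-Markdown conversion (API key, path, and
# options are placeholders, not our exact configuration).
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",       # LlamaCloud API key
    result_type="markdown",  # emit Markdown instead of plain text
    language="ar",           # OCR hint for Arabic, scanned documents
)

# load_data returns one Document object per parsed file.
documents = parser.load_data("reports/annual_report_2022.pdf")
print(documents[0].text[:500])  # inspect before manual curation
```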
2.3 Naive RAG
Fig. 1. Naive RAG architecture
To effectively leverage the prepared dataset for retrieval-augmented generation (RAG) tasks, we adopted the Naive RAG paradigm. This approach comprises three primary phases: indexing, retrieval, and generation. We implemented this system using LlamaIndex, a framework for constructing, managing, and querying data-augmented language models. LlamaIndex provides essential tools for indexing, retrieving, and integrating external data sources, enabling more efficient and contextually aware responses in RAG systems [8].
In the indexing phase, the standardized Markdown documents derived from the original PDF dataset were segmented into smaller textual chunks using LlamaIndex's Markdown Splitter. Each chunk was subsequently transformed into numerical vector embeddings, employing OpenAI's GPT-based embedding models [9], and stored in a vector index for similarity search.
The retrieval phase involves embedding user queries using the same model utilized during indexing. These query embeddings are then compared against vectors stored in the database to identify the most semantically relevant chunks. The top-5 matching chunks are combined with the original user query to create a comprehensive prompt, providing enriched context for subsequent response generation. During generation, this prompt is presented to a large language model (LLM), which synthesizes the retrieved information with its internal knowledge to generate coherent and accurate responses. The system can also incorporate conversational contexts, supporting multi-turn interactions effectively.
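The following is a minimal sketch of this pipeline using LlamaIndex; the embedding and LLM model names and the directory path are illustrative assumptions, not reported configuration.

```python
# Sketch of the Naive RAG pipeline (model names and paths are assumptions).
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini")

# Indexing: split curated Markdown into chunks and embed each chunk.
docs = SimpleDirectoryReader("curated_markdown/").load_data()
nodes = MarkdownNodeParser().get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes)

# Retrieval + generation: the top-5 chunks are added to the prompt.
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What were foreign reserves in 2023?"))
```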
However, Naive RAG has inherent limitations. Retrieval processes can occasionally return irrelevant or incomplete chunks, and the generation phase is susceptible to inaccuracies, biases, or hallucinations. Additionally, integrating multiple retrieved chunks may lead to redundancy or coherence issues. Thus, the effectiveness of Naive RAG significantly depends on embedding quality, retrieval accuracy, and the comprehensiveness of the external document database.
2.4 Multi-Agent RAG
In this section, we explain our approach, which utilizes multiple agents to search over the data. The system consists of a hierarchy of agents, where a top-level agent controls low-level agents, and each low-level agent is connected to a group of query engines pertaining to a single report type. Each query engine contains the chunks of one file. This approach reduces the burden on vector search by letting the top-level agent select the most appropriate low-level agent for the query. The low-level agent then selects the relevant query engine and performs the vector search. Fig. 2 illustrates a flowchart of the system.
Fig. 2. Architecture of Multi-Agent RAG
An important advantage of this approach is that, rather than performing a search across all document chunks, the search is confined to a single document, which is selected by the agent responsible for the specific file type associated with the document. This method reduces the burden on the RAG system to retrieve the most relevant chunk from the entire dataset, as the agents effectively narrow the search domain, ensuring that only the most pertinent document is considered.
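To make the hierarchy concrete, the sketch below builds one low-level agent per report type, each equipped with per-file query engines as tools, and exposes those agents as tools to a top-level agent. The agent classes, the tool wrapping, and all names are hypothetical illustrations under the architecture described above, not our exact implementation.

```python
# Illustrative sketch of the two-level agent hierarchy (class choices,
# tool wrapping, and all names are hypothetical, not the exact build).
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool, QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

def build_low_level_agent(report_type: str, indices_by_file: dict) -> ReActAgent:
    """One agent per report type; each tool searches one file's chunks."""
    tools = [
        QueryEngineTool(
            query_engine=idx.as_query_engine(similarity_top_k=5),
            metadata=ToolMetadata(
                name=f"{report_type}_{file_name}",  # assume sanitized names
                description=f"Searches the {file_name} file ({report_type}).",
            ),
        )
        for file_name, idx in indices_by_file.items()
    ]
    return ReActAgent.from_tools(tools, llm=llm)

def as_top_level_tool(report_type: str, agent: ReActAgent) -> FunctionTool:
    """Expose a low-level agent to the top-level agent as a callable tool."""
    def ask(query: str) -> str:
        return str(agent.chat(query))
    return FunctionTool.from_defaults(
        fn=ask,
        name=f"ask_{report_type}",
        description=f"Answers questions from the {report_type} reports.",
    )

# `indices` maps report type -> {file name -> VectorStoreIndex}, built earlier.
low_level = {rt: build_low_level_agent(rt, f) for rt, f in indices.items()}
top_agent = ReActAgent.from_tools(
    [as_top_level_tool(rt, ag) for rt, ag in low_level.items()], llm=llm,
)
print(top_agent.chat("What was the inflation rate in the 2022 annual report?"))
```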
The system further incorporates an iterative refinement mechanism that enables the top-level agent to reassess and synthesize results returned by the low-level agents. This dynamic coordination allows for additional query disambiguation, ensuring that any overlapping or ambiguous requests are properly routed to the appropriate document subset. In cases where a query spans multiple report types, the top-level agent aggregates inputs from the relevant low-level agents, thus formulating a cohesive response that maintains the integrity of the underlying financial data.
Another critical enhancement in this multi-agentic architecture is its modularity. Each low-level agent is designed to operate autonomously within its specialized domain, which not only streamlines the search process but also simplifies system maintenance and scalability. New document types or report categories can be integrated by deploying additional agents and query engines without necessitating major alterations to the overall framework. This flexibility is particularly advantageous in the dynamic financial landscape, where continuous data updates and regulatory changes are common.
To further improve retrieval accuracy and robustness, the system leverages persistent vector indices managed by LlamaIndex. These indices are continuously updated, ensuring that each agent has immediate access to the most current version of the data. The integration with GPT-4o-mini plays a pivotal role in generating context-aware refinements and in facilitating seamless transitions between agents. This synergy between advanced language modeling and specialized retrieval mechanisms results in a system that is both efficient and highly responsive to the nuanced needs of financial data queries.
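As a brief sketch of the persistence mechanism (directory layout assumed), LlamaIndex indices can be written to and reloaded from disk so that agents always work against the stored version:

```python
# Sketch of index persistence (directory layout is illustrative).
from llama_index.core import StorageContext, load_index_from_storage

# Write a report type's index to disk after (re)building it.
index.storage_context.persist(persist_dir="indices/annual_reports")

# Later, agents reload the stored index without re-embedding the corpus.
storage = StorageContext.from_defaults(persist_dir="indices/annual_reports")
index = load_index_from_storage(storage)
```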
3. Evaluation Methodology
3.1 Test Dataset
To evaluate our system, we generated a test dataset using LLM-based tooling, specifically RAGAS (Retrieval-Augmented Generation Assessment System) and LlamaIndex. RAGAS provides a comprehensive framework for evaluating RAG-based systems by ensuring that responses are accurate, contextually relevant, and factually faithful [10]. The dataset consists of multiple queries paired with their referenced contexts and expected answers, enabling a systematic assessment of the system's response quality. This dataset is stored in Comma-Separated Values (CSV) format and contains key columns essential for evaluating multilingual performance in financial question-answering systems.
- user_input: The question or query submitted to the system. Queries are written in both Arabic and English to assess multilingual capabilities.
- reference_contexts: The source excerpts retrieved from financial documents. These contexts provide information used to assess the accuracy of responses.
- reference: The correct answer, extracted from official financial reports. This serves as the ground truth for evaluating system performance.
- synthesizer_name: The method used to generate the query, categorized as a single-hop specific, multi-hop abstract, or multi-hop specific query synthesizer. The single-hop synthesizer generates direct, fact-based questions, the multi-hop abstract synthesizer creates broader questions that require reasoning across contexts, and the multi-hop specific synthesizer produces detailed questions that span multiple sources.
This dataset is designed to facilitate research on information retrieval, question-answering accuracy, and multilingual financial analysis.
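A minimal sketch of this generation step is given below; it follows the RAGAS 0.2-era interface, the generator model is an assumption, and exact class and method names differ between RAGAS releases.

```python
# Sketch of synthetic test-set generation with RAGAS (0.2-era interface;
# class and method names differ between RAGAS releases).
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset import TestsetGenerator

generator = TestsetGenerator.from_langchain(
    ChatOpenAI(model="gpt-4o-mini"),  # generator LLM (assumed model)
    OpenAIEmbeddings(),               # embeddings used during generation
)

# `documents` holds the curated Markdown files as LangChain documents.
testset = generator.generate_with_langchain_docs(documents, testset_size=50)

# Columns include user_input, reference_contexts, reference,
# and synthesizer_name, as described above.
testset.to_pandas().to_csv("cbl_testset.csv", index=False)
```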
3.2 Evaluation Metrics
To assess the effectiveness of our multi-agentic RAG system, we employ four key evaluation metrics: Answer Similarity, Answer Correctness, Answer Relevance, and Faithfulness. These metrics ensure that the system not only retrieves accurate information but also maintains coherence and reliability in responses. To systematically track and manage these evaluations, we leverage MLflow, an open-source platform for machine learning lifecycle management. MLflow enables us to log, compare, and analyze our evaluation metrics efficiently, ensuring reproducibility and transparency in our assessment process [11].
The Answer Similarity metric evaluates the semantic resemblance between the model's output and the ground truth on a scale from 1 to 5, with higher scores indicating closer alignment in meaning.
The Answer Correctness metric assesses the factual accuracy of the model's output against the ground truth, considering both semantic similarity and factual correctness. Scores range from 1 to 5, with higher values indicating more accurate and factually correct responses.
The Answer Relevance metric measures how well the model's output addresses the input question, focusing on the appropriateness and applicability of the response. A higher score, within the 1 to 5 range, signifies greater relevance.
The Faithfulness metric evaluates how well the model's output adheres to the provided context, ensuring that responses are supported by the given information. Scores range from 1 to 5, with higher values indicating stronger adherence to the context.
By evaluating these four dimensions, we ensure that our system maintains high-quality information retrieval while minimizing errors and hallucinations.
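The sketch below illustrates how such an evaluation can be logged with MLflow's built-in LLM-judged GenAI metrics, which score on the same 1 to 5 scale; the `rag_answer` helper is hypothetical, the column mapping follows our CSV layout, and the judge model may need to be configured per MLflow version.

```python
# Sketch of evaluation logging with MLflow's LLM-judged GenAI metrics,
# which score answers on a 1-5 scale. `rag_answer` is a hypothetical
# helper that queries the system under test.
import mlflow
import pandas as pd
from mlflow.metrics.genai import (
    answer_correctness, answer_relevance, answer_similarity, faithfulness,
)

df = pd.read_csv("cbl_testset.csv")
df["predictions"] = [rag_answer(q) for q in df["user_input"]]
df = df.rename(columns={"user_input": "inputs",
                        "reference_contexts": "context",
                        "reference": "ground_truth"})

with mlflow.start_run(run_name="multi_agent_rag_eval"):
    results = mlflow.evaluate(
        data=df,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            answer_similarity(), answer_correctness(),
            answer_relevance(), faithfulness(),
        ],
        # Tell the faithfulness judge which column holds retrieved context.
        evaluator_config={"col_mapping": {"context": "context"}},
    )
    print(results.metrics)  # mean 1-5 score per metric
```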
4. Comparative Evaluation of Retrieval-Augmented Generation Methods
4.1 Results
In this study, we conducted a comparative evaluation of two Retrieval-Augmented Generation (RAG) methodologies—Multi-Agent RAG and Naive RAG—across multiple metrics and query synthesizer types. Evaluations were conducted using four critical metrics: Answer Similarity, Answer Correctness, Faithfulness, and Answer Relevance, and across three distinct query synthesizer scenarios: Multi-Hop Abstract, Multi-Hop Specific, and Single-Hop Specific queries.
Metrics Comparison
Fig. 3 summarizes the mean scores for each metric across all query types. Our analysis revealed that Multi-Agent RAG consistently outperformed Naive RAG across all evaluation dimensions. The most substantial performance gap emerged in Answer Relevance, where Multi-Agent RAG achieved a mean score of 4.37 compared to Naive RAG's 2.88, a 1.49-point improvement. This metric most clearly demonstrates the Multi-Agent system's superior ability to generate contextually appropriate content. The Faithfulness metric showed Multi-Agent RAG scoring 4.03 versus Naive RAG's 2.74 (a 1.29-point advantage), indicating its enhanced reliability. Similarly, Answer Correctness measurements revealed Multi-Agent RAG's stronger performance (3.72) compared to Naive RAG (2.31), a 1.41-point differential. Answer Similarity evaluations further confirmed the Multi-Agent advantage, with scores of 3.47 versus 2.34 for Naive RAG, a 1.13-point improvement.
Fig. 3. Comparison of Evaluation Metrics - Naive vs Multi-Agent RAG.
Performance Across Query Types
Fig. 4 compares mean correctness scores between Multi-Agent RAG and Naive RAG across three synthesizer types: Multi-Hop Abstract, Multi-Hop Specific, and Single-Hop Specific queries. Multi-Agent RAG consistently outperformed Naive RAG in all categories, with the largest advantage observed in Multi-Hop Abstract queries (4.29 vs. 3.06). For Multi-Hop Specific queries, Multi-Agent RAG scored 3.56 against Naive RAG's 2.00, while for Single-Hop Specific queries, the scores were 3.29 and 1.88, respectively.
Fig. 4. Mean Correctness Score by Synthesizer Type and RAG Type.
Table 1 provides detailed metric-specific insights supporting these findings. It highlights that Multi-Agent RAG substantially outperformed Naive RAG, particularly on Multi-Hop Abstract queries, with scores of 4.71 in Relevance and 4.41 in Faithfulness compared to Naive RAG's 3.65 in both metrics. Multi-Agent RAG also showed consistent superiority in Similarity and Correctness across all synthesizer types.
Table 1. Performance comparison of the Multi-Agent and Naive RAG approaches across Similarity, Correctness, Faithfulness, and Relevance metrics
| Synthesizer | RAG Type | Similarity | Correctness | Faithfulness | Relevance |
|---|---|---|---|---|---|
| Multi-hop abstract query | Multi-Agent | 3.97 | 4.29 | 4.41 | 4.71 |
| Multi-hop abstract query | Naive | 3.06 | 3.06 | 3.65 | 3.65 |
| Multi-hop specific query | Multi-Agent | 3.32 | 3.56 | 3.59 | 4.21 |
| Multi-hop specific query | Naive | 2.21 | 2.00 | 2.35 | 2.56 |
| Single-hop specific query | Multi-Agent | 3.12 | 3.29 | 4.09 | 4.21 |
| Single-hop specific query | Naive | 1.76 | 1.88 | 2.21 | 2.44 |
5. Discussion
Our results demonstrate that the Multi-Agent RAG approach consistently outperforms Naive RAG across all evaluation metrics, with particularly significant advantages in Answer Correctness and Faithfulness. This highlights the Multi-Agent system's superior ability to generate factually accurate and trustworthy responses, qualities essential for financial applications.
The Multi-Agent framework proved especially effective with complex multi-hop queries requiring abstract reasoning or detailed synthesis across multiple documents. In these scenarios, Naive RAG showed clear limitations in integrating diverse information sources and maintaining logical coherence. Importantly, Multi-Agent RAG maintained its performance edge even on simpler queries, indicating fundamental strengths in its retrieval and synthesis mechanisms.
Although the largest absolute gap appeared in Answer Relevance, the Multi-Agent system's value is equally evident in the accuracy and reliability metrics. This suggests its architecture effectively leverages specialized agents that cross-validate information, creating a more robust system for processing Arabic financial data from the Central Bank of Libya. The improved Faithfulness scores in particular demonstrate how the Multi-Agent approach reduces hallucination and maintains closer alignment with source documents, which is critical for applications where precision is paramount.
6. Conclusion and Future Work
This study conducted a comprehensive evaluation of two Retrieval-Augmented Generation (RAG) methodologies, Multi-Agent RAG and Naive RAG, by systematically assessing their effectiveness across multiple metrics and query synthesizer types. Our goal was to determine their ability to generate accurate, reliable, and contextually relevant responses within the domain of Arabic financial data from the Central Bank of Libya. To achieve this, we first focused on cleaning and preparing Arabic financial documents to ensure high-quality data for retrieval. Next, we developed a multi-agent retrieval system designed to handle the structure and complexity of these financial documents, enabling more precise and context-aware information retrieval. Finally, we created a test dataset derived from these financial records and employed it to systematically evaluate and compare the performance of Multi-Agent RAG against a simple Naive RAG system. Through extensive experimentation, we analyzed the performance of both approaches using four critical evaluation metrics: Answer Similarity, Answer Correctness, Faithfulness, and Answer Relevance. We also examined their effectiveness in handling different query types, including Multi-Hop Abstract, Multi-Hop Specific, and Single-Hop Specific queries. The results consistently demonstrated that Multi-Agent RAG outperformed Naive RAG across all evaluation dimensions, highlighting its superiority in maintaining contextual coherence, factual correctness, and faithful adherence to source documents. The Multi-Agent framework proved particularly effective in handling complex multi-hop queries that required abstract reasoning and synthesis across multiple documents.
The findings of this study underscore the potential of Multi-Agent RAG for high-stakes applications where accuracy and trustworthiness are paramount. Specifically, its ability to handle financial data reliably suggests promising applications in regulatory reporting, compliance monitoring, and financial analysis. While our results demonstrate significant improvements over Naive RAG, future research could focus on further optimizing the Multi-Agent framework to enhance efficiency and scalability. This includes refining retrieval strategies, incorporating more advanced cross-validation mechanisms, and leveraging external knowledge sources to improve contextual accuracy. Future work could also focus on developing an automated pipeline to fetch financial data in real time as updates become available, ensuring that the system always works with the most current and relevant information. Another potential enhancement involves incorporating additional features such as rerankers, which could refine and prioritize retrieved documents based on relevance and quality. Additionally, integrating financial data from the Libyan Audit Bureau could further strengthen the system's capabilities by providing a more comprehensive and authoritative dataset for retrieval. Exploring the integration of machine learning techniques to fine-tune retrieval models dynamically and mitigate biases in Arabic financial data processing could also be a valuable direction for future work.
7. References
[1] P. Shailendra, R. C. Ghosh, R. Kumar, and N. Sharma, “Survey of Large Language Models for Answering Questions Across Various Fields,” 10th International Conference on Advanced Computing and Communication Systems, ICACCS 2024, pp. 520–527, 2024, doi: 10.1109/ICACCS60874.2024.10717078.
[2] M. I. Jordan, Y. LeCun, and S. A. Solla, Eds., Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2001. Accessed: Mar. 18, 2025. [Online]. Available: https://mitpress.mit.edu/9780262561457/advances-in-neural-information-processing-systems/
[3] Y. Gao et al., “Retrieval-Augmented Generation for Large Language Models: A Survey,” Dec. 2023, Accessed: Mar. 18, 2025. [Online]. Available: https://arxiv.org/abs/2312.10997v5
[4] F. Cuconasu et al., “The Power of Noise: Redefining Retrieval for RAG Systems,” SIGIR 2024 - Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 719–729, Jul. 2024, doi: 10.1145/3626772.3657834.
[5] Y. Huang and J. X. Huang, “A Survey on Retrieval-Augmented Text Generation for Large Language Models,” Apr. 2024, Accessed: Mar. 18, 2025. [Online]. Available: https://arxiv.org/abs/2404.10981v2
[6] “Central Bank of Libya (مصرف ليبيا المركزي).” Accessed: Mar. 18, 2025. [Online]. Available: https://cbl.gov.ly/
[7] “Getting Started | LlamaCloud Documentation.” Accessed: Mar. 18, 2025. [Online]. Available: https://docs.cloud.llamaindex.ai/llamaparse/getting_started
[8] “LlamaIndex Documentation.” Accessed: Mar. 18, 2025. [Online]. Available: https://docs.llamaindex.ai/en/stable/
[9] “Vector embeddings - OpenAI API.” Accessed: Mar. 18, 2025. [Online]. Available: https://platform.openai.com/docs/guides/embeddings
[10] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “RAGAs: Automated Evaluation of Retrieval Augmented Generation,” 2024. Accessed: Mar. 18, 2025. [Online]. Available: https://aclanthology.org/2024.eacl-demo.16/
[11] “MLflow: A Tool for Managing the Machine Learning Lifecycle | MLflow.” Accessed: Mar. 18, 2025. [Online]. Available: https://mlflow.org/docs/latest/index.html