ERNIEBot Researcher is an autonomous agent designed to conduct comprehensive research on a given topic based on user-provided documents. It compiles detailed, authentic, and unbiased Chinese research reports, and offers deep customization for specific resources, structured outlines, and distilled experiences and lessons as needed. Drawing on the recently influential Plan-and-Solve technique and combining it with the strengths of the popular RAG (retrieval-augmented generation) approach, ERNIEBot Researcher overcomes challenges such as speed bottlenecks, decision uncertainty, and result reliability through multi-agent collaboration and efficient parallel processing.
The web page shown in the figure below is used for topic research. Users can input keywords or natural-language sentences; the backend retrieves relevant content from the given literature and then uses ERNIE to generate a research report.
Download link for a generated sample report: Report download
The main idea is to operate a "planner" agent and several "execution" agents. The planner generates research questions, and the execution agents seek the most relevant information for each question from the provided documents. The planner then filters and aggregates all relevant information and drafts a research report. To improve report quality, we generate multiple candidate reports in parallel and automatically select the best one; a reflection mechanism then revises and refines the selected report. To reduce factual errors, we apply the chain-of-verification method together with a search engine to verify every detail of the report. Finally, to improve readability, we polish every paragraph so that the report is easier for the audience to read. The final report is therefore fluent, authentic, and detailed.
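The plan-and-execute loop described above can be sketched roughly as follows. All function names here (`generate_questions`, `retrieve`, and so on) are hypothetical stand-ins for the project's agents, shown with stub implementations; a real run would call ERNIE-4.0 / ERNIE-LongText instead:

```python
# Minimal sketch of the planner/executor report pipeline, with stubbed
# agents in place of real LLM calls. Names are illustrative assumptions.

def generate_questions(topic):
    # Planner: break the topic into research questions.
    return [f"What is {topic}?", f"Why does {topic} matter?"]

def retrieve(question, documents):
    # Execution agent: pick documents relevant to one question.
    words = question.lower().replace("?", "").split()
    return [d for d in documents if any(w in d for w in words)]

def write_candidate_report(topic, evidence):
    # Draft one candidate report from the aggregated evidence.
    return f"Report on {topic}: " + " ".join(dict.fromkeys(evidence))

def score(report):
    # Stand-in for automatic candidate selection (longer = more detailed here).
    return len(report)

def research(topic, documents, num_candidates=3):
    questions = generate_questions(topic)            # plan
    evidence = []
    for q in questions:                              # execute per question
        evidence.extend(retrieve(q, documents))
    candidates = [write_candidate_report(topic, evidence)
                  for _ in range(num_candidates)]    # multiple candidates
    return max(candidates, key=score)                # keep the best one

docs = ["agents plan and act", "retrieval grounds the answer"]
print(research("agents", docs))
```

The reflection and chain-of-verification passes would wrap further revision steps around the selected candidate in the same style.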
Agents use ERNIE-4.0 and ERNIE-LongText to complete research tasks; ChatGPT is also supported. ERNIE-4.0 is primarily used for decision-making and planning, while ERNIE-LongText is mainly used for writing reports.
Note
git clone https://github.com/PaddlePaddle/ERNIE-SDK.git
cd ERNIE-SDK/ernie-agent/applications/erniebot_researcher
pip install -r requirements.txt
If the above command fails, please run the following command:
conda create -n researcher39 -y python=3.9 && conda activate researcher39
pip install -r requirements.txt
Install ernie-agent from source code:
cd ernie-agent
pip install -e .
wget https://paddlenlp.bj.bcebos.com/pipelines/fonts/SimSun.ttf
Two embedding types are supported: Azure openai_embedding and ernie_embedding. For ernie_embedding, you need to register and log in to an account on the AI Studio Galaxy Community, obtain an Access Token from the Access Token page on AI Studio, and then set the environment variables:
export EB_AGENT_ACCESS_TOKEN=<aistudio-access-token>
export AISTUDIO_ACCESS_TOKEN=<aistudio-access-token>
export EB_AGENT_LOGGING_LEVEL=INFO
To set up Azure OpenAI embedding, you need to configure the relevant OpenAI environment variables.
export AZURE_OPENAI_ENDPOINT="<your azure openai endpoint>"
export AZURE_OPENAI_API_KEY="<your azure openai api key>"
We support file formats such as docx, pdf, and txt. Users can place these files in the same folder and then run the following command to create an index. Subsequent reports will be generated based on these files.
For convenience in testing, we provide sample data:
wget https://paddlenlp.bj.bcebos.com/pipelines/erniebot_researcher_example.tar.gz
tar xvf erniebot_researcher_example.tar.gz
URL Data: If users have URLs corresponding to their files, they can provide a txt file containing these URLs. In the txt file, each line should store the URL link and the corresponding file path, for example:
https://zhuanlan.zhihu.com/p/659457816 erniebot_researcher_example/Ai_Agent的起源.md
If the user does not provide a URL file, the default file path will be used as the URL link.
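The URL file format above is simple enough to parse with a few lines of Python. This is a hypothetical helper, not part of the project's code; it assumes each non-empty line holds a URL and a file path separated by whitespace:

```python
# Hypothetical helper: parse the URL-mapping txt described above into a
# dict of {file_path: url}. Each line: "<url> <file_path>".

def load_url_map(lines):
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        url, path = line.split(maxsplit=1)
        mapping[path] = url
    return mapping

example = [
    "https://zhuanlan.zhihu.com/p/659457816 erniebot_researcher_example/Ai_Agent的起源.md"
]
print(load_url_map(example))
```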
Abstract Data: Users can use the path_abstract parameter to provide the storage path of the abstracts corresponding to their files. The abstracts must be stored in a JSON file containing a list of dictionaries, each with three key-value pairs:

- page_content (str): the file abstract.
- url (str): the file's URL link.
- name (str): the file name.

For example:

[{"page_content": "文件摘要", "url": "https://zhuanlan.zhihu.com/p/659457816", "name": "Ai_Agent的起源"}, ...]
If the user does not provide an abstract path, the default value of path_abstract can be left unchanged. We will use ERNIE-4.0 to automatically generate the abstracts, which will be stored in abstract.json.
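A user-supplied abstract file can be sanity-checked against the schema above before indexing. This validator is an illustrative sketch, not part of the project:

```python
import json

# Hypothetical check that an abstract file matches the schema described
# above: a JSON list of dicts, each with string-valued
# "page_content", "url", and "name" keys.

REQUIRED_KEYS = {"page_content", "url", "name"}

def validate_abstracts(raw_json):
    entries = json.loads(raw_json)
    assert isinstance(entries, list), "top level must be a JSON list"
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        assert not missing, f"missing keys: {missing}"
        assert all(isinstance(entry[k], str) for k in REQUIRED_KEYS), \
            "all three values must be strings"
    return entries

sample = ('[{"page_content": "文件摘要", '
          '"url": "https://zhuanlan.zhihu.com/p/659457816", '
          '"name": "Ai_Agent的起源"}]')
print(len(validate_abstracts(sample)))
```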
Next, run:
python ./tools/preprocessing.py \
    --index_name_full_text <the index name of your full text> \
    --index_name_abstract <the index name of your abstract text> \
    --path_full_text <the folder path of your full text> \
    --url_path <the path of your url text> \
    --path_abstract <the json path of your abstract text>
python demo.py --num_research_agent 1 \
    --index_name_full_text <your full text> \
    --index_name_abstract <your abstract text>
- index_name_full_text: Path to the full-text knowledge base index
- index_name_abstract: Path to the abstract knowledge base index
- index_name_citation: Path to the citation index
- num_research_agent: Number of agents generating the report
- iterations: Number of reflection iterations
- chatbot: Type of LLM; currently supports erniebot and chatgpt
- report_type: Type of report; currently supports research_report
- embedding_type: Type of embedding used; currently supports ernie_embedding and openai_embedding (Azure)
- save_path: Path to save the report
- server_name: IP address of the web UI
- server_port: Port number of the web UI
- log_path: Path to save the logs
- use_ui: Whether to use the web UI
- use_reflection: Whether to use the reflection process
- fact_checking: Whether to use the fact-checking process
- framework: Underlying framework; currently supports langchain

This project demonstrates a multi-agent approach to generating research reports from user-provided documents. It overcomes the outdated-information and hallucination limitations of a single-model agent, and improves research quality by generating multiple candidate reports, filtering out the weak ones, and refining the good ones. In short, this report-generation method offers a better experience for individual users.
[1] Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, Ee-Peng Lim:
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. ACL (1) 2023: 2609-2634
[2] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, Zhaochun Ren:
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. EMNLP 2023: 14918-14937
We learned from the excellent framework design of Assaf Elovic's GPT Researcher, and we would like to thank the authors of GPT Researcher and their open-source community.