ERNIEBot Researcher is an autonomous agent designed to conduct comprehensive research on a given topic based on user-provided documents. It compiles detailed, authentic, and unbiased Chinese research reports, and offers deep customization for specific resources, structured outlines, and distilled experiences and lessons as needed. Drawing on the recently influential Plan-and-Solve technique and combining it with the strengths of the popular RAG (retrieval-augmented generation) approach, ERNIEBot Researcher overcomes challenges such as speed bottlenecks, decision uncertainty, and result reliability through multi-agent collaboration and efficient parallel processing.
The web page shown in the figure below is used for topic research. Users can input keywords or natural-language sentences; the backend retrieves relevant content from the given literature and then uses ERNIE to generate a research report.
Download link for a generated sample report: Report download
The main idea is to operate a "planner" agent and several "execution" agents. The planner generates research questions, and the execution agents seek the most relevant information for each question from the provided documents. The planner then filters and aggregates all relevant information and drafts a research report. To improve report quality, we generate multiple candidate reports in parallel and automatically select the best one; a reflection mechanism then revises and refines the selected report. To reduce factual errors, we apply the chain-of-verification method together with a search engine to verify every detail of the report. Finally, to improve readability, we polish every paragraph so that the report is easier for the audience to read. The final report is therefore fluent, authentic, and detailed.
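The plan-and-execute loop described above can be sketched roughly as follows. All function names here (`generate_questions`, `retrieve`, and so on) are hypothetical stand-ins for the project's agents, shown with stub implementations; a real run would call ERNIE-4.0 / ERNIE-LongText instead:

```python
# Minimal sketch of the planner/executor report pipeline, with stubbed
# agents in place of real LLM calls. Names are illustrative assumptions.

def generate_questions(topic):
    # Planner: break the topic into research questions.
    return [f"What is {topic}?", f"Why does {topic} matter?"]

def retrieve(question, documents):
    # Execution agent: pick documents relevant to one question.
    words = question.lower().replace("?", "").split()
    return [d for d in documents if any(w in d for w in words)]

def write_candidate_report(topic, evidence):
    # Draft one candidate report from the aggregated evidence.
    return f"Report on {topic}: " + " ".join(dict.fromkeys(evidence))

def score(report):
    # Stand-in for automatic candidate selection (longer = more detailed here).
    return len(report)

def research(topic, documents, num_candidates=3):
    questions = generate_questions(topic)            # plan
    evidence = []
    for q in questions:                              # execute per question
        evidence.extend(retrieve(q, documents))
    candidates = [write_candidate_report(topic, evidence)
                  for _ in range(num_candidates)]    # multiple candidates
    return max(candidates, key=score)                # keep the best one

docs = ["agents plan and act", "retrieval grounds the answer"]
print(research("agents", docs))
```

The reflection and chain-of-verification passes would wrap further revision steps around the selected candidate in the same style.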
Agents use ERNIE-4.0 and ERNIE-LongText to complete research tasks; ChatGPT is also supported. ERNIE-4.0 is primarily used for decision-making and planning, while ERNIE-LongText is mainly used for writing reports.
Note
git clone https://github.com/PaddlePaddle/ERNIE-SDK.git
cd ERNIE-SDK/ernie-agent/applications/erniebot_researcher
pip install -r requirements.txt
If the above command fails, please run the following command:
conda create -n researcher39 -y python=3.9 && conda activate researcher39
pip install -r requirements.txt
Install ernie-agent from source code:
cd ernie-agent
pip install -e .
wget https://paddlenlp.bj.bcebos.com/pipelines/fonts/SimSun.ttf
Two embedding types are supported: Azure openai_embedding and ernie_embedding. For ernie_embedding, you need to register and log in to an account on the AI Studio Galaxy Community, obtain an Access Token from the Access Token page on AI Studio, and then set the environment variables:
export EB_AGENT_ACCESS_TOKEN=<aistudio-access-token>
export AISTUDIO_ACCESS_TOKEN=<aistudio-access-token>
export EB_AGENT_LOGGING_LEVEL=INFO
To set up Azure OpenAI embedding, you need to configure the relevant OpenAI environment variables.
export AZURE_OPENAI_ENDPOINT="<your azure openai endpoint>"
export AZURE_OPENAI_API_KEY="<your azure openai api key>"
We support file formats such as docx, pdf, and txt. Users can place these files in the same folder and then run the following command to create an index. Subsequent reports will be generated based on these files.
For convenience in testing, we provide sample data:
wget https://paddlenlp.bj.bcebos.com/pipelines/erniebot_researcher_example.tar.gz
tar xvf erniebot_researcher_example.tar.gz
URL Data: If users have URLs corresponding to their files, they can provide a txt file containing these URLs. In the txt file, each line should store the URL link and the corresponding file path, for example:
https://zhuanlan.zhihu.com/p/659457816 erniebot_researcher_example/Ai_Agent的起源.md
If the user does not provide a URL file, the default file path will be used as the URL link.
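The URL file format above is simple enough to parse with a few lines of Python. This is a hypothetical helper, not part of the project's code; it assumes each non-empty line holds a URL and a file path separated by whitespace:

```python
# Hypothetical helper: parse the URL-mapping txt described above into a
# dict of {file_path: url}. Each line: "<url> <file_path>".

def load_url_map(lines):
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        url, path = line.split(maxsplit=1)
        mapping[path] = url
    return mapping

example = [
    "https://zhuanlan.zhihu.com/p/659457816 erniebot_researcher_example/Ai_Agent的起源.md"
]
print(load_url_map(example))
```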
Abstract Data: Users can use the path_abstract parameter to provide the storage path of the abstracts corresponding to their files. The abstracts must be stored in a JSON file containing a list of dictionaries, each with three key-value pairs:

- page_content (str): the file abstract.
- url (str): the file's URL link.
- name (str): the file name.

For example:

[{"page_content": "文件摘要", "url": "https://zhuanlan.zhihu.com/p/659457816", "name": "Ai_Agent的起源"}, ...]
If the user does not provide an abstract path, the default value of path_abstract can be left unchanged. We will use ERNIE-4.0 to automatically generate the abstracts, which will be stored in abstract.json.
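A user-supplied abstract file can be sanity-checked against the schema above before indexing. This validator is an illustrative sketch, not part of the project:

```python
import json

# Hypothetical check that an abstract file matches the schema described
# above: a JSON list of dicts, each with string-valued
# "page_content", "url", and "name" keys.

REQUIRED_KEYS = {"page_content", "url", "name"}

def validate_abstracts(raw_json):
    entries = json.loads(raw_json)
    assert isinstance(entries, list), "top level must be a JSON list"
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        assert not missing, f"missing keys: {missing}"
        assert all(isinstance(entry[k], str) for k in REQUIRED_KEYS), \
            "all three values must be strings"
    return entries

sample = ('[{"page_content": "文件摘要", '
          '"url": "https://zhuanlan.zhihu.com/p/659457816", '
          '"name": "Ai_Agent的起源"}]')
print(len(validate_abstracts(sample)))
```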
Next, run:
python ./tools/preprocessing.py \
    --index_name_full_text <the index name of your full text> \
    --index_name_abstract <the index name of your abstract text> \
    --path_full_text <the folder path of your full text> \
    --url_path <the path of your url text> \
    --path_abstract <the json path of your abstract text>
python demo.py --num_research_agent 1 \
    --index_name_full_text <your full text> \
    --index_name_abstract <your abstract text>
- index_name_full_text: Path to the full-text knowledge base index
- index_name_abstract: Path to the abstract knowledge base index
- index_name_citation: Path to the citation index
- num_research_agent: Number of agents generating the report
- iterations: Number of reflection iterations
- chatbot: Type of LLM; currently supports erniebot and chatgpt
- report_type: Type of report; currently supports research_report
- embedding_type: Type of embedding used; currently supports ernie_embedding and openai_embedding (Azure)
- save_path: Path to save the report
- server_name: IP address of the web UI
- server_port: Port number of the web UI
- log_path: Path to save the logs
- use_ui: Whether to use the web UI
- use_reflection: Whether to use the reflection process
- fact_checking: Whether to use the fact-checking process
- framework: Underlying framework; currently supports langchain

This project demonstrates a multi-agent approach to generating research reports from user-provided documents. It overcomes the outdated-information and hallucination limitations of a single-model agent, and improves research quality by generating multiple candidate reports, filtering out the weak ones, and refining the good ones. In short, this report-generation method offers a better experience for individual users.
[1] Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, Ee-Peng Lim:
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. ACL (1) 2023: 2609-2634
[2] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, Zhaochun Ren:
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. EMNLP 2023: 14918-14937
We learned from the excellent framework design of Assaf Elovic's GPT Researcher, and we would like to thank the authors of GPT Researcher and their open-source community.