Shunyu Yao*, Jeffrey Zhao*, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
*Equal contribution to the original paper
Department of Computer Science, Princeton University
Google Research, Brain team
Emails: {shunyuy,karthikn}@princeton.edu, {jeffreyzhao,dianyu,dunan,izhak,yuancao}@google.com
Jovan Cvijanović
Paper reproducer
Faculty of Technical Sciences, University of Novi Sad, Novi Sad
This study validates the methodology and findings of the original ReAct paper [@yao2023react], providing:
(1) an intuitive description of ReAct-style prompting,
(2) details of the reproduction environment and implementation, and
(3) a discussion of the key challenges encountered.
The ReAct paradigm, introduced by [@yao2023react], represents a significant advance in large language model (LLM) capabilities, synergizing reasoning and acting for complex task solving. It addresses key limitations of prior work by interleaving verbal reasoning traces with environment interactions, creating a closed-loop system that enables real-time plan formulation, exception handling, and the integration of external observations with internal knowledge.
Traditional approaches to LLM reasoning and decision-making have typically treated these capabilities separately: Chain-of-Thought prompting demonstrated the value of explicit reasoning traces but lacked grounding in external facts, while action-only methods such as WebGPT enabled environment interaction but offered limited strategic planning.
The ReAct framework overcomes these limitations by establishing a continuous feedback loop where reasoning traces guide action selection through plan decomposition, while environment observations ground subsequent reasoning in external context.
This synergy is achieved through a unified prompting scheme that maintains human-interpretable reasoning traces while delivering state-of-the-art prompting performance across multiple benchmarks, with only 1-6 in-context examples.
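To make the prompting format concrete, a ReAct trajectory for a HotpotQA-style question interleaves free-form thoughts with actions from the paper's Wikipedia API (Search, Lookup, Finish) and the resulting observations. The schematic below follows that format; everything in angle brackets is a placeholder:

```text
Question: <multi-hop question>
Thought 1: I need to search <entity A> and find <fact>.
Action 1: Search[<entity A>]
Observation 1: <first sentences of the Wikipedia page for entity A>
Thought 2: The paragraph does not mention <fact>, so I should look it up.
Action 2: Lookup[<keyword>]
Observation 2: (Result 1 / 1) <sentence containing the keyword>
Thought 3: <reasoning over the observations>, so the answer is <answer>.
Action 3: Finish[<answer>]
```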
For the model implementation, I used OpenAI's gpt-3.5-turbo API as the closest available alternative to the original paper's text-davinci-002 model. All datasets can be downloaded alongside the implementation, which is available on GitHub. The reproduction environment was configured to match the experimental conditions of the original work as closely as hardware constraints allowed.
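Since text-davinci-002 was a completion-style model, the substitution amounts to wrapping the chat API behind a completion-like helper. The sketch below illustrates this, assuming the modern openai Python client; the helper name and prompt handling are illustrative rather than the repository's exact code:

```python
# Minimal sketch: emulating completion-style calls to text-davinci-002
# with the gpt-3.5-turbo chat API (illustrative; not the repository's code).
from typing import List, Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm(prompt: str, stop: Optional[List[str]] = None) -> str:
    """Send a ReAct-style prompt and return the model's continuation."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,   # deterministic decoding for single-sample runs
        max_tokens=128,
        stop=stop,         # e.g. ["\nObservation"] to stop after each action
    )
    return response.choices[0].message.content
```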
Table 1. Performance comparison of different prompt methods on HotpotQA and Fever benchmarks (%).
| Prompt Method | HotpotQA | Fever |
|---|---|---|
| Standard | 28.8 | 52.6 |
| CoT | 36.4 | 56.8 |
| CoT-SC | 38.2 | 59.4 |
| Act | 28.2 | 57.5 |
| ReAct | 29.4 | 62.0 |
| CoT-SC → ReAct | 37.7 | 64.7 |
| ReAct → CoT-SC | 31.5 | 63.5 |
| Supervised SoTA | 67.5 | 89.5 |
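The two hybrid rows in Table 1 combine methods with the simple back-off rules described in the original paper: CoT-SC → ReAct backs off to ReAct when fewer than half of the sampled CoT answers agree, while ReAct → CoT-SC backs off to CoT-SC when ReAct fails to return an answer within its step budget. A minimal sketch of these rules, with hypothetical run_react and run_cot_sc helpers standing in for full method runners:

```python
# Sketch of the hybrid back-off rules; run_react and run_cot_sc are
# hypothetical stand-ins for complete method implementations.
from collections import Counter
from typing import List, Optional

def run_react(question: str) -> Optional[str]:
    """Hypothetical ReAct runner; returns None if no Finish[] within budget."""
    raise NotImplementedError

def run_cot_sc(question: str) -> str:
    """Hypothetical CoT-SC runner (sample chains, majority-vote the answers)."""
    raise NotImplementedError

def cot_sc_to_react(question: str, sampled_answers: List[str]) -> str:
    """CoT-SC -> ReAct: trust the majority CoT answer only if it covers
    at least half of the samples; otherwise fall back to ReAct."""
    answer, count = Counter(sampled_answers).most_common(1)[0]
    return answer if count * 2 >= len(sampled_answers) else run_react(question)

def react_to_cot_sc(question: str) -> str:
    """ReAct -> CoT-SC: fall back to CoT-SC when ReAct returns no answer."""
    answer = run_react(question)
    return answer if answer is not None else run_cot_sc(question)
```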
Table 2. Performance on ALFWorld benchmark (%).
| Method | Pick | Clean | Heat | Cool | Look | Pick 2 | All |
|---|---|---|---|---|---|---|---|
| Act (avg) | 45 | 40 | 50 | 54 | 48 | 21 | 43 |
| ReAct (avg) | 71 | 60 | 82 | 75 | 38 | 30 | 59 |
| BUTLERg (best of 8) | 33 | 26 | 70 | 76 | 17 | 12 | 22 |
| BUTLER (best of 8) | 46 | 39 | 74 | 100 | 22 | 24 | 37 |
Table 3. Performance on WebShop benchmark (%); dashes denote experiments that were not reproduced.
| Method | Score | Success Rate (SR) |
|---|---|---|
| Act | - | - |
| ReAct | - | - |
| Human Expert | 82.1 | 59.6 |
Experimental Results
Because the text-davinci-002 model used in the original paper has been deprecated, I adopted the closest available alternative, gpt-3.5-turbo. While this substitution introduces performance variations, the results align with the original findings: ReAct outperforms the action-only Act baseline on every reproduced benchmark, and the comparative patterns reported in the original paper are preserved despite the model difference.
Limitations in Experimental Coverage
Due to computational budget constraints, I could not reproduce two experiments from the original paper: the Act and ReAct runs on the WebShop benchmark, which accounts for the missing entries in Table 3.
Reflection
This project marked my first systematic reproduction of a research paper and served as a valuable learning experience in research methodology. Working to align the original authors' implementation with the available computational resources also deepened my understanding of practical research constraints.
Clone the repository and install the dependencies:

```bash
git clone https://github.com/AStroCvijo/react_reproduction.git
cd react_reproduction
conda create --name react python=3.9
conda activate react
conda install -c conda-forge libstdcxx-ng
pip install -r requirements.txt
```
FEVER experiments:

| Name | Description | Command |
|---|---|---|
| Standard | Standard inference (no reasoning/acting) | ./scripts/fever/standard.sh |
| CoT | Chain-of-Thought (CoT) | ./scripts/fever/cot.sh |
| CoT-SC | CoT with self-consistency (21 samples; see the sketch after this table) | ./scripts/fever/cot_sc.sh |
| Act | Action-only (no reasoning) | ./scripts/fever/act.sh |
| ReAct | ReAct (reasoning + acting) | ./scripts/fever/react.sh |
| CoT-SC → ReAct | CoT with self-consistency and ReAct hybrid | ./scripts/fever/cot_sc_react.sh |
| ReAct → CoT-SC | ReAct and CoT with self-consistency hybrid | ./scripts/fever/react_cot_sc.sh |
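CoT-SC itself samples several chains of thought at a non-zero temperature and majority-votes their final answers (21 samples here, as in the original paper). A minimal self-contained sketch of that procedure; the answer-extraction convention is an illustrative assumption:

```python
# Sketch of CoT-SC: sample chains of thought, then majority-vote the answers.
from collections import Counter
from typing import List

from openai import OpenAI

client = OpenAI()

def sample_cot(prompt: str, temperature: float = 0.7) -> str:
    """Sample one chain-of-thought completion (temperature > 0 for diversity)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=256,
    )
    return response.choices[0].message.content

def extract_answer(completion: str) -> str:
    """Assumes chains end with '... the answer is <answer>.' (illustrative)."""
    marker = "the answer is"
    tail = completion.rsplit(marker, 1)[-1] if marker in completion else completion
    return tail.strip(" .\n")

def cot_sc(prompt: str, num_samples: int = 21) -> str:
    """Majority vote over num_samples sampled answers."""
    answers: List[str] = [extract_answer(sample_cot(prompt)) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]
```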
HotpotQA experiments:

| Name | Description | Command |
|---|---|---|
| Standard | Standard inference (no reasoning/acting) | ./scripts/hotpotqa/standard.sh |
| CoT | Chain-of-Thought (CoT) | ./scripts/hotpotqa/cot.sh |
| CoT-SC | CoT with self-consistency (21 samples) | ./scripts/hotpotqa/cot_sc.sh |
| Act | Action-only (no reasoning) | ./scripts/hotpotqa/act.sh |
| ReAct | ReAct (reasoning + acting) | ./scripts/hotpotqa/react.sh |
| CoT-SC → ReAct | CoT with self-consistency and ReAct hybrid | ./scripts/hotpotqa/cot_sc_react.sh |
| ReAct → CoT-SC | ReAct and CoT with self-consistency hybrid | ./scripts/hotpotqa/react_cot_sc.sh |
ALFWorld experiments:

| Name | Description | Command |
|---|---|---|
| Act | Action-only (no reasoning) | ./scripts/alfworld/act.sh |
| ReAct | ReAct (reasoning + acting) | ./scripts/alfworld/react.sh |
WebShop experiments:

| Name | Description | Command |
|---|---|---|
| Act | Action-only (no reasoning) | ./scripts/webshop/act.sh |
| ReAct | ReAct (reasoning + acting) | ./scripts/webshop/react.sh |
Command-line arguments:

| Argument | Description | Default | Options |
|---|---|---|---|
| -ds, --data_set | Dataset selection | FEVER | FEVER, HotpotQA, ALFWorld, WebShop |
| -ps, --prompt_style | Prompt style to use | ReAct | ReAct, Act, CoT, Standard, CoT-SC-ReAct, ReAct-CoT-SC |
| -ns, --num_samples | Number of samples to generate | 1 | Any positive integer |
| -t, --tempreture | Temperature setting for response variability | 0.0 | Any float value (0.0 to 1.0) |
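For orientation, these flags map onto an argparse definition along the following lines; the flag names, defaults, and options come from the table above (including the --tempreture spelling), while the surrounding script structure is an assumption:

```python
# Sketch of an argument parser matching the options table; only the flags,
# defaults, and choices are taken from the table, the rest is assumed.
import argparse

parser = argparse.ArgumentParser(description="ReAct reproduction experiments")
parser.add_argument("-ds", "--data_set", default="FEVER",
                    choices=["FEVER", "HotpotQA", "ALFWorld", "WebShop"],
                    help="Dataset selection")
parser.add_argument("-ps", "--prompt_style", default="ReAct",
                    choices=["ReAct", "Act", "CoT", "Standard",
                             "CoT-SC-ReAct", "ReAct-CoT-SC"],
                    help="Prompt style to use")
parser.add_argument("-ns", "--num_samples", type=int, default=1,
                    help="Number of samples to generate")
parser.add_argument("-t", "--tempreture", type=float, default=0.0,
                    help="Temperature for response variability (0.0 to 1.0)")
args = parser.parse_args()
```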
For detailed experiments and evaluations, please refer to the following document: