Shunyu Yao*, Jeffrey Zhao*, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
*Equal contribution to the original paper
Department of Computer Science, Princeton University
Google Research, Brain team
Emails: {shunyuy,karthikn}@princeton.edu, {jeffreyzhao,dianyu,dunan,izhak,yuancao}@google.com
Jovan Cvijanović
Paper reproducer
Faculty of Technical Sciences, University of Novi Sad, Novi Sad
This study validates the methodology and findings of the original ReAct paper [@yao2023react], providing:
(1) an intuitive description of ReAct-style prompting,
(2) details of the reproduction environment and implementation, and
(3) a discussion of the key challenges encountered.
The ReAct paradigm, introduced by [@yao2023react], represents a significant advance in large language model (LLM) capabilities, synergizing reasoning and acting for complex task solving. It addresses key limitations of prior work by interleaving verbal reasoning traces with environment interactions, creating a closed-loop system that enables real-time plan formulation, exception handling, and the integration of external observations with internal knowledge.
Traditional approaches to LLM reasoning and decision-making have typically treated these capabilities separately: Chain-of-Thought prompting demonstrated the value of explicit reasoning traces but lacked grounding in external facts, while action-only methods such as WebGPT enabled environment interaction but offered limited strategic planning.
The ReAct framework overcomes these limitations by establishing a continuous feedback loop where reasoning traces guide action selection through plan decomposition, while environment observations ground subsequent reasoning in external context.
This synergy is achieved through a unified prompting scheme that maintains human-interpretable reasoning traces while delivering state-of-the-art prompting performance across multiple benchmarks, with only 1-6 in-context examples.
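To make the prompting format concrete, a ReAct trajectory for a HotpotQA-style question interleaves free-form thoughts with actions from the paper's Wikipedia API (Search, Lookup, Finish) and the resulting observations. The schematic below follows that format; everything in angle brackets is a placeholder:

```text
Question: <multi-hop question>
Thought 1: I need to search <entity A> and find <fact>.
Action 1: Search[<entity A>]
Observation 1: <first sentences of the Wikipedia page for entity A>
Thought 2: The paragraph does not mention <fact>, so I should look it up.
Action 2: Lookup[<keyword>]
Observation 2: (Result 1 / 1) <sentence containing the keyword>
Thought 3: <reasoning over the observations>, so the answer is <answer>.
Action 3: Finish[<answer>]
```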
For the model implementation, I used OpenAI's gpt-3.5-turbo API as the closest available alternative to the original paper's text-davinci-002 model. All datasets can be downloaded alongside the implementation, which is available on GitHub. The reproduction environment was configured to match the experimental conditions of the original work as closely as hardware constraints allowed.
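Since text-davinci-002 was a completion-style model, the substitution amounts to wrapping the chat API behind a completion-like helper. The sketch below illustrates this, assuming the modern openai Python client; the helper name and prompt handling are illustrative rather than the repository's exact code:

```python
# Minimal sketch: emulating completion-style calls to text-davinci-002
# with the gpt-3.5-turbo chat API (illustrative; not the repository's code).
from typing import List, Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm(prompt: str, stop: Optional[List[str]] = None) -> str:
    """Send a ReAct-style prompt and return the model's continuation."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,   # deterministic decoding for single-sample runs
        max_tokens=128,
        stop=stop,         # e.g. ["\nObservation"] to stop after each action
    )
    return response.choices[0].message.content
```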
Table 1. Performance comparison of different prompt methods on HotpotQA and Fever benchmarks (%).
| Prompt Method | HotpotQA | Fever |
|---|---|---|
| Standard | 28.8 | 52.6 |
| CoT | 36.4 | 56.8 |
| CoT-SC | 38.2 | 59.4 |
| Act | 28.2 | 57.5 |
| ReAct | 29.4 | 62.0 |
| CoT-SC → ReAct | 37.7 | 64.7 |
| ReAct → CoT-SC | 31.5 | 63.5 |
| Supervised SoTA | 67.5 | 89.5 |
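The two hybrid rows in Table 1 combine methods with the simple back-off rules described in the original paper: CoT-SC → ReAct backs off to ReAct when fewer than half of the sampled CoT answers agree, while ReAct → CoT-SC backs off to CoT-SC when ReAct fails to return an answer within its step budget. A minimal sketch of these rules, with hypothetical run_react and run_cot_sc helpers standing in for full method runners:

```python
# Sketch of the hybrid back-off rules; run_react and run_cot_sc are
# hypothetical stand-ins for complete method implementations.
from collections import Counter
from typing import List, Optional

def run_react(question: str) -> Optional[str]:
    """Hypothetical ReAct runner; returns None if no Finish[] within budget."""
    raise NotImplementedError

def run_cot_sc(question: str) -> str:
    """Hypothetical CoT-SC runner (sample chains, majority-vote the answers)."""
    raise NotImplementedError

def cot_sc_to_react(question: str, sampled_answers: List[str]) -> str:
    """CoT-SC -> ReAct: trust the majority CoT answer only if it covers
    at least half of the samples; otherwise fall back to ReAct."""
    answer, count = Counter(sampled_answers).most_common(1)[0]
    return answer if count * 2 >= len(sampled_answers) else run_react(question)

def react_to_cot_sc(question: str) -> str:
    """ReAct -> CoT-SC: fall back to CoT-SC when ReAct returns no answer."""
    answer = run_react(question)
    return answer if answer is not None else run_cot_sc(question)
```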
Table 2. Performance on ALFWorld benchmark (%).
| Method | Pick | Clean | Heat | Cool | Look | Pick 2 | All |
|---|---|---|---|---|---|---|---|
| Act (avg) | 45 | 40 | 50 | 54 | 48 | 21 | 43 |
| ReAct (avg) | 71 | 60 | 82 | 75 | 38 | 30 | 59 |
| BUTLERg (best of 8) | 33 | 26 | 70 | 76 | 17 | 12 | 22 |
| BUTLER (best of 8) | 46 | 39 | 74 | 100 | 22 | 24 | 37 |
Table 3. Performance on WebShop benchmark (%); dashes denote experiments that were not reproduced.
| Method | Score | Success Rate (SR) |
|---|---|---|
| Act | - | - |
| ReAct | - | - |
| Human Expert | 82.1 | 59.6 |
Experimental Results
Because the text-davinci-002 model used in the original paper has been deprecated, I adopted the closest available alternative, gpt-3.5-turbo. While this substitution introduces performance variations, the results align with the original findings: ReAct outperforms the action-only Act baseline on every reproduced benchmark, and the comparative patterns reported in the original paper are preserved despite the model difference.
Limitations in Experimental Coverage
Due to computational budget constraints, I could not reproduce two experiments from the original paper: the Act and ReAct runs on the WebShop benchmark, which accounts for the missing entries in Table 3.
Reflection
This project marked my first systematic reproduction of a research paper and served as a valuable learning experience in research methodology. Working to align the original authors' implementation with the available computational resources also deepened my understanding of practical research constraints.
Clone the repository and install the dependencies:

```bash
git clone https://github.com/AStroCvijo/react_reproduction.git
cd react_reproduction
conda create --name react python=3.9
conda activate react
conda install -c conda-forge libstdcxx-ng
pip install -r requirements.txt
```
FEVER experiments:

| Name | Description | Command |
|---|---|---|
| Standard | Standard inference (no reasoning/acting) | ./scripts/fever/standard.sh |
| CoT | Chain-of-Thought (CoT) | ./scripts/fever/cot.sh |
| CoT-SC | CoT with self-consistency (21 samples; see the sketch after this table) | ./scripts/fever/cot_sc.sh |
| Act | Action-only (no reasoning) | ./scripts/fever/act.sh |
| ReAct | ReAct (reasoning + acting) | ./scripts/fever/react.sh |
| CoT-SC → ReAct | CoT with self-consistency and ReAct hybrid | ./scripts/fever/cot_sc_react.sh |
| ReAct → CoT-SC | ReAct and CoT with self-consistency hybrid | ./scripts/fever/react_cot_sc.sh |
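CoT-SC itself samples several chains of thought at a non-zero temperature and majority-votes their final answers (21 samples here, as in the original paper). A minimal self-contained sketch of that procedure; the answer-extraction convention is an illustrative assumption:

```python
# Sketch of CoT-SC: sample chains of thought, then majority-vote the answers.
from collections import Counter
from typing import List

from openai import OpenAI

client = OpenAI()

def sample_cot(prompt: str, temperature: float = 0.7) -> str:
    """Sample one chain-of-thought completion (temperature > 0 for diversity)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=256,
    )
    return response.choices[0].message.content

def extract_answer(completion: str) -> str:
    """Assumes chains end with '... the answer is <answer>.' (illustrative)."""
    marker = "the answer is"
    tail = completion.rsplit(marker, 1)[-1] if marker in completion else completion
    return tail.strip(" .\n")

def cot_sc(prompt: str, num_samples: int = 21) -> str:
    """Majority vote over num_samples sampled answers."""
    answers: List[str] = [extract_answer(sample_cot(prompt)) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]
```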
HotpotQA experiments:

| Name | Description | Command |
|---|---|---|
| Standard | Standard inference (no reasoning/acting) | ./scripts/hotpotqa/standard.sh |
| CoT | Chain-of-Thought (CoT) | ./scripts/hotpotqa/cot.sh |
| CoT-SC | CoT with self-consistency (21 samples) | ./scripts/hotpotqa/cot_sc.sh |
| Act | Action-only (no reasoning) | ./scripts/hotpotqa/act.sh |
| ReAct | ReAct (reasoning + acting) | ./scripts/hotpotqa/react.sh |
| CoT-SC → ReAct | CoT with self-consistency and ReAct hybrid | ./scripts/hotpotqa/cot_sc_react.sh |
| ReAct → CoT-SC | ReAct and CoT with self-consistency hybrid | ./scripts/hotpotqa/react_cot_sc.sh |
ALFWorld experiments:

| Name | Description | Command |
|---|---|---|
| Act | Action-only (no reasoning) | ./scripts/alfworld/act.sh |
| ReAct | ReAct (reasoning + acting) | ./scripts/alfworld/react.sh |
WebShop experiments:

| Name | Description | Command |
|---|---|---|
| Act | Action-only (no reasoning) | ./scripts/webshop/act.sh |
| ReAct | ReAct (reasoning + acting) | ./scripts/webshop/react.sh |
Command-line arguments:

| Argument | Description | Default | Options |
|---|---|---|---|
| -ds, --data_set | Dataset selection | FEVER | FEVER, HotpotQA, ALFWorld, WebShop |
| -ps, --prompt_style | Prompt style to use | ReAct | ReAct, Act, CoT, Standard, CoT-SC-ReAct, ReAct-CoT-SC |
| -ns, --num_samples | Number of samples to generate | 1 | Any positive integer |
| -t, --tempreture | Temperature setting for response variability | 0.0 | Any float value (0.0 to 1.0) |
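For orientation, these flags map onto an argparse definition along the following lines; the flag names, defaults, and options come from the table above (including the --tempreture spelling), while the surrounding script structure is an assumption:

```python
# Sketch of an argument parser matching the options table; only the flags,
# defaults, and choices are taken from the table, the rest is assumed.
import argparse

parser = argparse.ArgumentParser(description="ReAct reproduction experiments")
parser.add_argument("-ds", "--data_set", default="FEVER",
                    choices=["FEVER", "HotpotQA", "ALFWorld", "WebShop"],
                    help="Dataset selection")
parser.add_argument("-ps", "--prompt_style", default="ReAct",
                    choices=["ReAct", "Act", "CoT", "Standard",
                             "CoT-SC-ReAct", "ReAct-CoT-SC"],
                    help="Prompt style to use")
parser.add_argument("-ns", "--num_samples", type=int, default=1,
                    help="Number of samples to generate")
parser.add_argument("-t", "--tempreture", type=float, default=0.0,
                    help="Temperature for response variability (0.0 to 1.0)")
args = parser.parse_args()
```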
For detailed experiments and evaluations, please refer to the following document: