This paper presents a comparative study of deep reinforcement learning (DRL) algorithms for autonomous vehicle control in the OpenAI Gym Car Racing environment. We implement and compare two widely used DRL algorithms: Proximal Policy Optimization (PPO) and Deep Q-Network (DQN). Our experiments involve training runs of 500,000 and 2,000,000 timesteps to evaluate the performance and learning capabilities of each algorithm. The results demonstrate the effectiveness of DRL in autonomous racing tasks and provide insights into the comparative advantages of the two algorithms in this domain.
Introduction
Autonomous vehicle control presents a complex challenge in robotics and artificial intelligence. The task requires real-time decision-making, handling continuous control inputs, and processing high-dimensional visual information. Deep Reinforcement Learning (DRL) has emerged as a promising approach for such tasks, offering the ability to learn complex behaviors directly from raw sensory inputs. This work focuses on implementing and comparing DRL algorithms in the context of autonomous racing using the OpenAI Gym Car Racing environment.
Related Work
Previous research in autonomous racing has explored various approaches:
Traditional control methods using PID controllers and path planning
Supervised learning approaches using human demonstration data
Reinforcement learning with handcrafted features
End-to-end deep learning approaches
Our work builds upon these foundations by implementing and comparing modern DRL algorithms, specifically PPO and DQN, which have shown promising results in similar domains.
Methodology
Environment
We utilize the OpenAI Gym Car Racing environment (CarRacing-v2), which provides the following (a minimal interaction sketch is given after this list):
A top-down 2D racing environment with Box2D physics
Visual input (96x96x3 RGB images)
A continuous action space (steering, gas, brake)
A reward composed of a bonus for each visited track tile and a small per-frame penalty, which together encourage track completion and speed
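The following minimal sketch shows how the environment is created and stepped with a random policy, which also serves as the untrained baseline reported in the results. It assumes the gymnasium package (or gym >= 0.26) with Box2D support installed; exact return signatures may differ slightly between Gym and Gymnasium versions.

import gymnasium as gym

# Create the car racing environment (requires Box2D support, e.g. pip install "gymnasium[box2d]")
env = gym.make("CarRacing-v2", continuous=True)

print(env.observation_space)  # Box(0, 255, (96, 96, 3), uint8): raw RGB frames
print(env.action_space)       # Box(3,): steering in [-1, 1], gas in [0, 1], brake in [0, 1]

obs, info = env.reset(seed=0)
episode_reward = 0.0
for _ in range(1000):
    action = env.action_space.sample()  # random policy baseline
    obs, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
    if terminated or truncated:
        break
env.close()
print(f"random-policy episode reward: {episode_reward:.2f}")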
Algorithms
Proximal Policy Optimization (PPO)
Implementation using Stable Baselines3
Policy network architecture: CNN for visual processing
Value network for state-value estimation
Clipped surrogate objective for stable policy updates (written out below)
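For reference, the clipped surrogate objective maximized by PPO takes the standard form

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

where \hat{A}_t is the estimated advantage and \epsilon is the clipping range. Clipping the probability ratio keeps each update close to the data-collecting policy, which is what stabilizes training.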
Deep Q-Network (DQN)
Implementation using Stable Baselines3
CNN architecture for state processing
Discretized action interface (DQN requires a discrete action space)
Experience replay buffer
Target network for stable training (the bootstrapped target is written out below)
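For reference, the target network Q_{\bar{\theta}} supplies the bootstrapped regression target for transitions (s_t, a_t, r_t, s_{t+1}) sampled from the replay buffer:

y_t = r_t + \gamma \max_{a'} Q_{\bar{\theta}}(s_{t+1}, a')

The target parameters \bar{\theta} are synchronized with the online network only at a fixed frequency, which keeps the regression target from shifting at every gradient step.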
Training Setup
Training durations: 500k and 2M timesteps
Environment vectorization using DummyVecEnv
Periodic evaluation episodes during training
Model checkpointing and saving (a callback-based sketch follows this list)
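The sketch below illustrates this training harness with Stable Baselines3 callbacks. The save paths and evaluation/checkpoint frequencies are illustrative placeholders rather than the exact values used in our runs.

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback

# Single-worker vectorized environments for training and evaluation
train_env = DummyVecEnv([lambda: gym.make("CarRacing-v2")])
eval_env = DummyVecEnv([lambda: gym.make("CarRacing-v2")])

# Periodic checkpointing and evaluation (frequencies are illustrative)
checkpoint_cb = CheckpointCallback(save_freq=50_000, save_path="./checkpoints/", name_prefix="ppo_carracing")
eval_cb = EvalCallback(eval_env, eval_freq=25_000, n_eval_episodes=5, best_model_save_path="./best_model/")

model = PPO("CnnPolicy", train_env, verbose=1)
model.learn(total_timesteps=500_000, callback=[checkpoint_cb, eval_cb])
model.save("ppo_carracing_500k")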
Experiments
Training Configurations
PPO Training
Learning rate: 3e-4
Batch size: 64
Rollout length (n_steps): 2048
Discount factor (gamma): 0.99
A corresponding Stable Baselines3 call is sketched after this list.
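Under these settings, the PPO model can be constructed as follows; hyperparameters not listed above keep the Stable Baselines3 defaults, and the save name is illustrative.

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# DummyVecEnv-wrapped CarRacing-v2 environment, as in the training setup
train_env = DummyVecEnv([lambda: gym.make("CarRacing-v2")])

# PPO with the hyperparameters listed above; remaining arguments keep SB3 defaults
model = PPO(
    "CnnPolicy",
    train_env,
    learning_rate=3e-4,
    batch_size=64,
    n_steps=2048,
    gamma=0.99,
    verbose=1,
)
model.learn(total_timesteps=2_000_000)
model.save("ppo_carracing_2m")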
DQN Training
Learning rate: 1e-4
Batch size: 64
Replay buffer size: 100000
Target network update frequency: 1000 steps
A corresponding Stable Baselines3 call is sketched after this list.
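The analogous DQN construction is sketched below. Because Stable Baselines3's DQN supports only discrete action spaces, the sketch assumes the discretized variant of the environment (continuous=False, which exposes a small discrete action set: do nothing, steer left, steer right, gas, brake), as noted in the methodology. The listed target update frequency maps to the target_update_interval argument.

import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.vec_env import DummyVecEnv

# CarRacing-v2 with continuous=False provides the discrete action set DQN requires
dqn_env = DummyVecEnv([lambda: gym.make("CarRacing-v2", continuous=False)])

model = DQN(
    "CnnPolicy",
    dqn_env,
    learning_rate=1e-4,
    batch_size=64,
    buffer_size=100_000,
    target_update_interval=1_000,  # "target network update frequency" above
    verbose=1,
)
model.learn(total_timesteps=500_000)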
Evaluation Metrics
Average episode reward (computed as sketched after this list)
Track completion rate
Average speed
Training stability
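Average episode reward is computed with Stable Baselines3's evaluate_policy helper, as in the minimal sketch below; track completion rate and average speed require additional custom logging from the environment and are not shown here. The checkpoint name is the illustrative one from the earlier sketch.

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Load a saved checkpoint and report the mean and standard deviation of the return
model = PPO.load("ppo_carracing_2m")
eval_env = gym.make("CarRacing-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"average episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")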
Results
Performance Comparison
PPO Results
Random Policy (Baseline): -30.04 average reward
500k steps: +44.87 average reward
2M steps: +476.68 average reward
Training stability and convergence showed consistent improvement with longer training duration
DQN Results
Training was less stable than with PPO
Performance was more sensitive to hyperparameter tuning
Stable learning required a larger replay buffer
Performance Analysis
The reward progression shows clear improvement across training stages:
A random policy (no training) yields a negative average reward (-30.04), indicating poor performance
After 500k training steps, the agent achieves a positive average reward (+44.87), showing basic learning
After 2M training steps, the average reward rises to +476.68, a substantial further improvement
Implementation Observations
Model checkpointing was essential for managing the long training runs
Environment vectorization improved training throughput
Conclusion
Our study demonstrates the effectiveness of DRL algorithms in autonomous racing tasks. The comparative analysis of PPO and DQN provides valuable insights into their respective strengths and limitations in this domain. Future work could explore longer training runs, more systematic hyperparameter tuning for DQN, and additional DRL algorithms.