Reinforcement learning (RL) has emerged as a powerful method for training AI agents to make decisions in dynamic environments. The Gym LunarLander-v2 environment provides a benchmark problem in which an agent must fire a lander's thrusters to touch down safely on a landing pad. This project explores Deep Q-Learning, an extension of Q-learning that uses a deep neural network to approximate Q-values, allowing the agent to generalize over a continuous state space.
For more information on Deep Q-Learning, see: https://medium.com/@samina.amin/deep-q-learning-dqn-71c109586bae
For more information on traditional reinforcement learning, see: https://www.geeksforgeeks.org/what-is-reinforcement-learning/
Configuring the environment in Python:
import gym


class Environment():
    """
    The Environment class provides a simple interface for interacting with OpenAI Gym environments.
    It provides information about the state shape, state size, and the number of available actions
    in the environment.

    Attributes:
        env (gym.Env): An instance of the 'LunarLander-v2' environment created using OpenAI Gym.
        state_shape (tuple): The shape of the state space in the environment.
        state_size (int): The size of the state space in the environment.
        number_actions (int): The number of available actions in the environment.

    Methods:
        __init__(): Constructor method for initializing the Environment instance.

    Example:
        # Create an instance of the Environment class
        env_instance = Environment()
    """

    def __init__(self, env : str = 'LunarLander-v2') -> None:
        """
        Initializes the Environment instance with the specified OpenAI Gym environment and extracts
        information about the state shape, state size, and number of actions.

        Parameters:
            env (str, optional): The ID of the OpenAI Gym environment to initialize.
                Defaults to 'LunarLander-v2'.

        Example:
            # Create an instance of the Environment class with a custom environment ID
            env_instance = Environment(env='LunarLander-v2')
        """
        self.env : gym.Env = gym.make(id=env)
        self.state_shape : tuple[int, ...] | None = self.env.observation_space.shape
        self.state_size : int = self.state_shape[0]  # type: ignore
        self.number_actions : int = self.env.action_space.n  # type: ignore
        print(f'State Shape: {self.state_shape}\nState Size: {self.state_size}\nNumber of Actions: {self.number_actions}')

    def reset(self) -> tuple:
        """
        Resets the environment to its initial state and returns the initial observation.

        Returns:
            tuple: The (observation, info) tuple returned by the underlying Gym environment.

        Example:
            # Reset the environment and get the initial observation
            initial_observation, info = env_instance.reset()
        """
        return self.env.reset()

    def step(self, action) -> tuple:
        """
        Takes an action in the environment and returns the resulting transition.

        Parameters:
            action: The action to be taken in the environment.

        Returns:
            tuple: The (observation, reward, terminated, truncated, info) tuple returned by the
                underlying Gym environment.

        Example:
            # Take an action in the environment and unpack the resulting transition
            next_observation, reward, done, truncated, info = env_instance.step(selected_action)
        """
        return self.env.step(action)
For more information on the lunar lander environment, or if you want to try it yourself, visit: https://www.gymlibrary.dev/environments/box2d/lunar_lander/
Deep Q-Learning is an extension of traditional Q-learning that employs a neural network to approximate the Q-values for different state-action pairs. In this implementation, a feedforward neural network is used to estimate the Q-function, mapping the observed state space to expected future rewards. The network is trained using a mean squared error loss function, minimizing the difference between predicted Q-values and target Q-values derived from the Bellman equation.
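To make that update rule concrete, the following standalone sketch (not part of the project code; all tensor values are illustrative) computes the Bellman targets and the mean squared error loss for a toy batch of two transitions:

import torch
from torch.nn import functional as F

# Illustrative tensors for a batch of two transitions.
rewards = torch.tensor([[1.0], [0.5]])
dones = torch.tensor([[0.0], [1.0]])                            # 1.0 marks a terminal transition
next_q_max = torch.tensor([[2.0], [3.0]])                       # max_a' Q_target(s', a') from the target network
q_expected = torch.tensor([[2.5], [0.4]], requires_grad=True)   # Q(s, a) predicted by the local network

gamma = 0.99  # assumed discount factor for this example
# Bellman target: r + gamma * max_a' Q_target(s', a'), with no bootstrapping on terminal states.
q_targets = rewards + gamma * next_q_max * (1 - dones)
loss = F.mse_loss(q_expected, q_targets)  # mean squared error between prediction and target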
Building the neural network in Python:
from torch.nn import functional as F
from torch import nn
import torch


class NeuralNetwork(nn.Module):
    """
    The NeuralNetwork class defines a simple feedforward neural network architecture for a given
    reinforcement learning task.

    Attributes:
        state_size (int): The size of the input state space.
        action_size (int): The size of the output action space.
        seed (int): A seed for random number generation. Default is 42.

    Methods:
        __init__(state_size, action_size, seed): Constructor method for initializing the
            NeuralNetwork instance with the specified state size, action size, and seed.
        forward(state): Defines the forward pass of the neural network.

    Example:
        # Create an instance of the NeuralNetwork class
        neural_net = NeuralNetwork(state_size=8, action_size=4)
    """

    def __init__(self, state_size : int, action_size : int, seed : int = 42) -> None:
        """
        Initializes the NeuralNetwork instance with the specified state size, action size, and seed.

        Parameters:
            state_size (int): The size of the input state space.
            action_size (int): The size of the output action space.
            seed (int, optional): A seed for random number generation. Default is 42.

        Example:
            # Create an instance of the NeuralNetwork class with specific state and action sizes
            neural_net = NeuralNetwork(state_size=8, action_size=4)
        """
        super(NeuralNetwork, self).__init__()
        self.state_size : int = state_size
        self.seed : torch.Generator = torch.manual_seed(seed)
        self.fc1 : nn.Linear = nn.Linear(in_features=state_size, out_features=64)
        self.fc2 : nn.Linear = nn.Linear(in_features=self.fc1.out_features, out_features=64)
        self.fc3 : nn.Linear = nn.Linear(in_features=self.fc2.out_features, out_features=action_size)

    def forward(self, state):
        """
        Defines the forward pass of the neural network.

        Parameters:
            state (torch.Tensor): The input state tensor.

        Returns:
            torch.Tensor: The output tensor representing the Q-values for each action.

        Example:
            # Forward pass with a given state tensor
            output = neural_net.forward(torch.tensor([1.0, 2.0, 3.0]))
        """
        x : torch.Tensor = self.fc1(state.view(-1, self.state_size))
        x = F.relu(input=x)
        x = self.fc2(x)
        x = F.relu(input=x)
        return self.fc3(x)
To enhance stability and prevent divergence in training, two key techniques are incorporated: experience replay and target networks. Experience replay stores past experiences in a replay buffer and randomly samples them during training, which breaks up correlations between consecutive experiences and leads to more stable learning. The target network, a separate copy of the Q-network, is only nudged gradually toward the local network (a soft update) so that it provides consistent Q-value targets, further improving stability.
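As a minimal illustration of the target-network idea (the full version lives in the Agent class later in this section, and the value of tau here is only an assumed example), a soft update nudges the target parameters a small step toward the local parameters instead of copying them outright:

import torch
from torch import nn

def soft_update(local_model: nn.Module, target_model: nn.Module, tau: float = 1e-3) -> None:
    # theta_target <- tau * theta_local + (1 - tau) * theta_target
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)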
Implementing the memory replay in Python:
import random
import numpy
import torch


class ReplayMemory(object):
    """
    The ReplayMemory class represents a replay memory buffer for storing and sampling experiences
    for reinforcement learning.

    Attributes:
        capacity (int): The maximum capacity of the replay memory.
        device (torch.device): The device on which the memory is stored (CPU or CUDA).
        memory (list): List to store experiences.

    Methods:
        __init__(capacity): Constructor method for initializing the ReplayMemory instance.
        push(event): Adds an experience tuple to the replay memory.
        sample(batch_size): Randomly samples a batch of experiences from the replay memory.

    Example:
        # Create an instance of the ReplayMemory class with capacity 1000
        memory = ReplayMemory(capacity=1000)
    """

    def __init__(self, capacity : int) -> None:
        """
        Initializes the ReplayMemory instance with the specified capacity.

        Parameters:
            capacity (int): The maximum capacity of the replay memory.

        Example:
            # Create an instance of the ReplayMemory class with capacity 1000
            memory = ReplayMemory(capacity=1000)
        """
        self.device : torch.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
        self.capacity : int = capacity
        self.memory : list = []

    def push(self, event : tuple) -> None:
        """
        Adds an experience tuple to the replay memory.

        Parameters:
            event (tuple): The experience tuple to be added.

        Example:
            # Add an experience tuple to the replay memory
            memory.push((state, action, reward, next_state, done))
        """
        self.memory.append(event)
        if len(self.memory) > self.capacity:
            del self.memory[0]

    def sample(self, batch_size : int) -> tuple:
        """
        Randomly samples a batch of experiences from the replay memory.

        Parameters:
            batch_size (int): The number of experiences to be sampled in a batch.

        Returns:
            tuple: A tuple containing PyTorch tensors for states, next_states, actions, rewards, and dones.

        Example:
            # Sample a batch of experiences from the replay memory
            states, next_states, actions, rewards, dones = memory.sample(batch_size=64)
        """
        experiences : list = random.sample(self.memory, k=batch_size)
        states : torch.Tensor = torch.from_numpy(numpy.vstack([e[0] for e in experiences if e is not None])).float().to(self.device)
        actions : torch.Tensor = torch.from_numpy(numpy.vstack([e[1] for e in experiences if e is not None])).long().to(self.device)
        rewards : torch.Tensor = torch.from_numpy(numpy.vstack([e[2] for e in experiences if e is not None])).float().to(self.device)
        next_states : torch.Tensor = torch.from_numpy(numpy.vstack([e[3] for e in experiences if e is not None])).float().to(self.device)
        dones : torch.Tensor = torch.from_numpy(numpy.vstack([e[4] for e in experiences if e is not None]).astype(numpy.uint8)).float().to(self.device)
        return states, next_states, actions, rewards, dones
The exploration-exploitation balance is maintained using an epsilon-greedy strategy: the agent initially explores by taking random actions and gradually shifts toward exploiting its learned policy as training progresses. The environment's reward function encourages soft landings near the landing pad while penalizing crashes, engine firing (fuel use), and unstable orientation. Through iterative updates, the agent refines its decision-making and improves its ability to land safely.
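As a small standalone illustration of the exploration schedule (it mirrors the decay used in the training script below, with the same starting value, floor, and decay factor), epsilon shrinks geometrically after every episode until it reaches its floor:

epsilon: float = 1.0          # start fully exploratory
epsilon_end: float = 0.01     # floor on exploration
epsilon_decay: float = 0.995  # multiplicative decay applied after each episode

for episode in range(1, 6):
    # An epsilon-greedy agent would take a random action with probability epsilon here.
    epsilon = max(epsilon_end, epsilon_decay * epsilon)
    print(f'Episode {episode}: epsilon = {epsilon:.4f}')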
Building the AI agent in Python:
from neural_network import NeuralNetwork
from replay_memory import ReplayMemory
from hyperparameters import HyperParameters
from torch import optim
from torch.nn import functional as F
from torch import nn
import torch
import random
import numpy

parameters = HyperParameters()


class Agent:
    """
    Represents a reinforcement learning agent using Deep Q-Networks.

    Args:
    - state_size (int): The size of the state space.
    - action_size (int): The number of possible actions.

    Attributes:
    - device (torch.device): The device used for computation (GPU if available, else CPU).
    - state_size (int): The size of the state space.
    - action_size (int): The number of possible actions.
    - local_qnet (NeuralNetwork): The local Q-network for estimating Q-values.
    - target_qnet (NeuralNetwork): The target Q-network for stable learning.
    - optimizer (optim.Adam): The Adam optimizer for updating the local Q-network.
    - memory (ReplayMemory): Replay memory for storing experiences.
    - time_step (int): Counter for tracking time steps.

    Methods:
    - step(state, action, reward, next_state, done): Records a transition in the replay memory and updates the Q-network.
    - action(state, epsilon): Selects an action using an epsilon-greedy strategy.
    - learn(experiences, gamma): Performs a Q-learning update using a batch of experiences.
    - soft_update(local_model, target_model, interpolation_parameter): Updates target model parameters with a soft update.
    """

    def __init__(self, state_size : int, action_size : int) -> None:
        """
        Initializes the Agent with the given state and action sizes.
        """
        self.device : torch.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
        self.state_size : int = state_size
        self.action_size : int = action_size
        self.local_qnet : NeuralNetwork = NeuralNetwork(state_size, action_size).to(self.device)
        self.target_qnet : NeuralNetwork = NeuralNetwork(state_size, action_size).to(self.device)
        self.optimizer : optim.Adam = optim.Adam(self.local_qnet.parameters(), lr=parameters.learning_rate)
        self.memory : ReplayMemory = ReplayMemory(capacity=parameters.replay_buffer_size)
        self.time_step : int = 0

    def step(self, state : numpy.ndarray, action : int, reward : float, next_state : numpy.ndarray, done : bool) -> None:
        """
        Records a transition in the replay memory and updates the Q-network.

        Args:
        - state (numpy.ndarray): The current state.
        - action (int): The taken action.
        - reward (float): The received reward.
        - next_state (numpy.ndarray): The next state.
        - done (bool): Indicates whether the episode is done.
        """
        self.memory.push((state, action, reward, next_state, done))
        # Learn every 4 time steps, once enough experiences have been collected.
        self.time_step = (self.time_step + 1) % 4
        if self.time_step == 0:
            if len(self.memory.memory) > parameters.minibatch_size:
                experiences = self.memory.sample(parameters.minibatch_size)
                self.learn(experiences, parameters.gamma)

    def action(self, state : numpy.ndarray, epsilon : float = 0.) -> int:
        """
        Selects an action using an epsilon-greedy strategy.

        Args:
        - state (numpy.ndarray): The current state.
        - epsilon (float): Exploration-exploitation trade-off parameter.

        Returns:
        - int: The selected action.
        """
        state_tensor : torch.Tensor = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
        self.local_qnet.eval()
        with torch.no_grad():
            action_values = self.local_qnet(state_tensor)
        self.local_qnet.train()
        if random.random() > epsilon:
            # Exploit: choose the action with the highest estimated Q-value.
            return int(numpy.argmax(action_values.cpu().data.numpy()))
        else:
            # Explore: choose a random action.
            return int(random.choice(numpy.arange(self.action_size)))

    def learn(self, experiences : tuple, gamma : float) -> None:
        """
        Performs a Q-learning update using a batch of experiences.

        Args:
        - experiences (tuple): A tuple of (states, next_states, actions, rewards, dones).
        - gamma (float): The discount factor for future rewards.
        """
        states, next_states, actions, rewards, dones = experiences
        # Bellman target: r + gamma * max_a' Q_target(s', a'), with no bootstrapping on terminal states.
        next_q_targets = self.target_qnet(next_states).detach().max(1)[0].unsqueeze(1)
        q_targets = rewards + (gamma * next_q_targets * (1 - dones))
        q_expected = self.local_qnet(states).gather(1, actions)
        loss = F.mse_loss(q_expected, q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.soft_update(self.local_qnet, self.target_qnet, parameters.interpolation_parameter)

    def soft_update(self, local_model : nn.Module, target_model : nn.Module, interpolation_parameter : float) -> None:
        """
        Updates target model parameters with a soft update.

        Args:
        - local_model (nn.Module): The source model.
        - target_model (nn.Module): The target model.
        - interpolation_parameter (float): The interpolation parameter.
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(interpolation_parameter * local_param.data + (1.0 - interpolation_parameter) * target_param.data)
Training the agent in Python:
from environment import Environment
from agent import Agent
from collections import deque
import numpy
import torch

env : Environment = Environment(env='LunarLander-v2')
agent : Agent = Agent(state_size=env.state_size, action_size=env.number_actions)

number_episodes : int = 2500
maximum_timesteps_per_episode : int = 1000
epsilon_starting_value : float = 1.
epsilon_ending_value : float = 0.01
epsilon_decay_value : float = 0.995
epsilon : float = epsilon_starting_value
scores_on_100_episodes : deque = deque(maxlen=100)


def train(epsilon : float) -> None:
    """
    Trains the reinforcement learning agent using the specified exploration-exploitation strategy.

    The function runs training episodes, updating the agent's Q-network and monitoring performance.
    Training stops when the environment is considered solved or the maximum number of episodes is
    reached. During training, the function prints the episode number and the average score over the
    last 100 episodes. If the average score surpasses 200, the training is considered successful,
    and the agent's model is saved.

    Args:
    - epsilon (float): The initial exploration rate (epsilon-greedy strategy).

    Returns:
    - None

    Example:
        train(epsilon=1.0)
    """
    for episodes in range(1, number_episodes + 1):
        state, _ = env.reset()
        score = 0
        for _ in range(0, maximum_timesteps_per_episode):
            action = agent.action(state, epsilon)
            next_state, reward, done, _, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        scores_on_100_episodes.append(score)
        # Decay epsilon after each episode, but never below the ending value.
        epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)
        print(f'\rEpisode: {episodes}\tAverage Score: {numpy.mean(scores_on_100_episodes):.2f}', end='')
        if (episodes % 100 == 0):
            print(f'\rEpisode: {episodes}\tAverage Score: {numpy.mean(scores_on_100_episodes):.2f}')
        if numpy.mean(scores_on_100_episodes) >= 200.:  # type: ignore
            print(f'\nEnvironment Solved in {episodes:d} episodes!\tAverage Score: {numpy.mean(scores_on_100_episodes):.2f}')
            torch.save(agent.local_qnet.state_dict(), 'model.pth')
            break


def main() -> None:
    """
    Main entry point for training a reinforcement learning agent.

    This function initiates the training process by calling the train function with a specified
    exploration rate.

    Args:
    - None

    Returns:
    - None

    Example:
        main()
    """
    train(epsilon=epsilon)


if __name__ == "__main__":
    main()
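After training saves model.pth, the learned policy can be sanity-checked by loading the weights and acting greedily (epsilon of 0). The evaluation script below is a sketch rather than part of the project; it assumes the same Environment and Agent classes defined above:

from environment import Environment
from agent import Agent
import torch

env = Environment(env='LunarLander-v2')
agent = Agent(state_size=env.state_size, action_size=env.number_actions)
# Load the weights saved by the training script into the local Q-network.
agent.local_qnet.load_state_dict(torch.load('model.pth', map_location=agent.device))

state, _ = env.reset()
total_reward = 0.0
for _ in range(1000):  # cap the episode length, mirroring the training script
    action = agent.action(state, epsilon=0.0)  # act greedily: no exploration
    state, reward, done, truncated, _ = env.step(action)
    total_reward += reward
    if done or truncated:
        break
print(f'Evaluation episode reward: {total_reward:.2f}')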