Q-Learning in Reinforcement Learning | October 17th, 2023

Introduction to Q-Learning

Q-Learning is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov decision process (MDP). It works by learning an action-value function that gives the expected utility of taking a given action in a given state and following the optimal policy thereafter.

The Markov Property

A key feature of an MDP is the Markov Property, which states that the future is independent of the past given the present. In chess, for example, the current board position determines the consequences of the next move, regardless of the sequence of moves that produced that position. Mathematically, this is expressed as:

$$ P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t) $$

Basic Concept

The core of Q-Learning is the Q-function, defined as $Q: \text{State} \times \text{Action} \rightarrow \mathbb{R}$. This function quantifies the value of taking an action in a particular state.
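
In small environments the Q-function can be stored directly as a table. Here is a minimal sketch in Python; the state and action counts below are purely illustrative:

```python
import numpy as np

# Hypothetical problem size: a small grid world with 16 states
# and 4 actions (e.g., up, down, left, right).
N_STATES, N_ACTIONS = 16, 4

# In the tabular setting, Q: State x Action -> R is just a lookup table,
# commonly initialized to zeros.
Q = np.zeros((N_STATES, N_ACTIONS))

# Q[s, a] estimates the value of taking action a in state s and
# following the (current estimate of the) optimal policy afterwards.
```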

Q-Function Update Rule

The Q-function is updated as follows:

$$ Q^{new}(s_t, a_t) \leftarrow (1-\alpha) \cdot Q(s_t, a_t) + \alpha \cdot \left( r_{t+1} + \gamma \cdot \max_{a} Q(s_{t+1}, a) \right) $$

Where:

  • $ s_t $ and $ s_{t+1} $ are the current and next state, respectively.
  • $ a_t $ is the action taken at time $t$.
  • $ \alpha $ is the learning rate.
  • $ r_{t+1} $ is the reward received after moving from state $ s_t $ to $ s_{t+1} $.
  • $ \gamma $ is the discount factor.
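
The update rule translates almost line for line into code. As a sketch (the function name and the default values of $\alpha$ and $\gamma$ below are illustrative, and `Q` is the table from the earlier sketch):

```python
def q_update(Q, s_t, a_t, r_t1, s_t1, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update, mirroring the rule above."""
    # TD target: immediate reward plus the discounted value of the
    # best action available in the next state.
    target = r_t1 + gamma * Q[s_t1].max()
    # Blend the old estimate with the target, weighted by the learning rate.
    Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * target
```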

Q-Learning with Convolutional Neural Networks (CNNs)

In complex environments, the state space can be large, making a tabular approach infeasible. CNNs can be integrated with Q-Learning to handle such scenarios, particularly in image-based environments.

The Q-Network

The Q-network can be trained to approximate the Q-value function. The input is the state, and the output is one Q-value for each possible action.
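
A minimal sketch of such a network in PyTorch is shown below; the layer sizes loosely follow the classic DQN architecture, and all names are illustrative:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network sketch: image state in, one Q-value per action out."""

    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        # Convolutional layers extract spatial features from stacked frames.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Fully connected head maps the flattened features to Q-values.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),  # infers the flattened size on first call
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        # x: batch of states, shape (batch, in_channels, height, width)
        return self.head(self.conv(x))
```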

Network Update Rule

The loss function for updating the Q-network is defined as:

$$ L(\theta) = \mathbb{E} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] $$

Where:

  • $\mathbb{E}$ is the expected value over a mini-batch of experiences, i.e., a subset of the experiences stored in the replay buffer.
  • $ \theta $ and $ \theta^- $ are the parameters of the current and target Q-networks, respectively.
  • $ s, a, r, s' $ are the state, action, reward, and next state sampled from the replay buffer.
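
Assuming the QNetwork sketch above and pre-batched tensors, the loss can be computed roughly as follows. The `done` mask for terminal transitions is a standard implementation detail not shown in the formula:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Sketch of the mini-batch loss L(theta) defined above."""
    s, a, r, s_next, done = batch  # done: float mask, 1.0 on terminal transitions
    # Q(s, a; theta): pick out the Q-value of the action actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # r + gamma * max_a' Q(s', a'; theta^-), using the frozen target network;
    # (1 - done) zeroes the bootstrap term on terminal transitions.
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    # Squared Bellman error, averaged over the mini-batch.
    return F.mse_loss(q_sa, target)
```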

Experience Replay

Experience replay stores a replay buffer of past experiences and randomly samples mini-batches from it to train the network, breaking the correlation between sequential experiences.
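
A minimal buffer sketch follows; the capacity and batch size are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of past transitions with uniform random sampling."""

    def __init__(self, capacity=100_000):
        # deque discards the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```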

Variants of Q-Learning

  1. Deep Q-Network (DQN): Combines Q-Learning with deep learning, using a neural network to approximate the Q-function.
  2. Double Q-Learning: Addresses the overestimation of Q-values by decoupling the selection of an action from its evaluation (see the sketch after this list).
  3. Dueling Network Architectures: Separates the representation of state values and state-dependent action advantages.
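
To make the decoupling in variant 2 concrete, here is a sketch of the Double DQN bootstrap target, reusing the networks from the loss sketch above: the online network selects the next action, while the target network evaluates it.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Sketch of the Double Q-Learning bootstrap target."""
    with torch.no_grad():
        # Selection: argmax over the online network's Q-values.
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        # Evaluation: the target network's value of the selected action.
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)
        return r + gamma * q_eval * (1 - done)
```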

Mathematical Derivation of Q-Learning

Consider the Bellman equation for the Q-function:

$$ Q(s,a) = \sum_{s'} P(s' \mid s,a) \left[ r(s,a,s') + \gamma \max_{a'} Q(s',a') \right] $$

The iterative update rule in Q-Learning converges to the fixed point of this equation, which is the optimal Q-function.

Derivation Steps

  1. Initial Assumption: Start with an arbitrary Q-function $Q_0(s,a)$.

    $$ Q_0(s,a) \text{ arbitrary, for all } s \in S,\ a \in A(s) $$

  2. Bellman Update: Apply the Bellman update iteratively using the following update rule:

    $$ Q_{k+1}(s,a) = \sum_{s'} P(s' \mid s,a) \left[r(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right] $$

    where $Q_k$ is the Q-function at iteration $k$, $P(s' \mid s,a)$ is the transition probability from state $s$ to state $s'$ given action $a$, $r(s,a,s')$ is the immediate reward received after the transition, and $\gamma$ is the discount factor. (This update is sketched as code after the list.)

  3. Convergence Proof: Show that under certain conditions, such as a finite Markov Decision Process (MDP), sufficient exploration, and an appropriately chosen learning rate, the Q-function converges to the optimal Q-function $Q^*$:

    $$ \lim_{k \to \infty} Q_k(s,a) = Q^*(s,a) \text{ for all } s \in S, a \in A(s) $$

    This is often proved by showing that the update rule is a contraction mapping (with contraction factor $\gamma$) under the max-norm, which guarantees convergence by the Banach fixed-point theorem.
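
When the transition model $P$ and reward $r$ are known, steps 1–3 can be sketched directly as Q-value iteration. The array shapes below are assumptions for illustration:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Bellman updates from step 2, iterated to (near) convergence.

    Assumed inputs (illustrative):
      P : (n_states, n_actions, n_states), P[s, a, s'] = transition probability
      R : (n_states, n_actions, n_states), R[s, a, s'] = immediate reward
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))  # step 1: arbitrary Q_0
    while True:
        # Step 2: Q_{k+1}(s,a) = sum_s' P(s'|s,a) [r(s,a,s') + gamma max_a' Q_k(s',a')]
        Q_next = np.einsum("sax,sax->sa", P, R + gamma * Q.max(axis=1))
        # Step 3: each update shrinks the max-norm gap by a factor gamma,
        # so this difference decreases geometrically (Banach fixed point).
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
```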