Q-Learning in Reinforcement Learning | October 17th, 2023

Introduction to Q-Learning

Q-Learning is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov decision process (MDP). It works by learning an action-value function that gives the expected utility of taking a given action in a given state and following the optimal policy thereafter.

The Markov Property

A key feature of an MDP is the Markov Property, which states that the future is independent of the past given the present. In chess, for example, the current board position determines the consequences of the next move, regardless of the sequence of moves that produced that position. Mathematically, this is expressed as:

$$ P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t) $$

Basic Concept

The core of Q-Learning is the Q-function, defined as $Q: \text{State} \times \text{Action} \rightarrow \mathbb{R}$. This function quantifies the value of taking an action in a particular state.
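
In small environments the Q-function can be stored directly as a table. Here is a minimal sketch in Python; the state and action counts below are purely illustrative:

```python
import numpy as np

# Hypothetical problem size: a small grid world with 16 states
# and 4 actions (e.g., up, down, left, right).
N_STATES, N_ACTIONS = 16, 4

# In the tabular setting, Q: State x Action -> R is just a lookup table,
# commonly initialized to zeros.
Q = np.zeros((N_STATES, N_ACTIONS))

# Q[s, a] estimates the value of taking action a in state s and
# following the (current estimate of the) optimal policy afterwards.
```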

Q-Function Update Rule

The Q-function is updated as follows:

$$ Q^{new}(s_t, a_t) \leftarrow (1-\alpha) \cdot Q(s_t, a_t) + \alpha \cdot \left( r_{t+1} + \gamma \cdot \max_{a} Q(s_{t+1}, a) \right) $$

Where:

  • $ s_t $ and $ s_{t+1} $ are the current and next state, respectively.
  • $ a_t $ is the action taken at time $t$.
  • $ \alpha $ is the learning rate.
  • $ r_{t+1} $ is the reward received after moving from state $ s_t $ to $ s_{t+1} $.
  • $ \gamma $ is the discount factor.
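
The update rule translates almost line for line into code. As a sketch (the function name and the default values of $\alpha$ and $\gamma$ below are illustrative, and `Q` is the table from the earlier sketch):

```python
def q_update(Q, s_t, a_t, r_t1, s_t1, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update, mirroring the rule above."""
    # TD target: immediate reward plus the discounted value of the
    # best action available in the next state.
    target = r_t1 + gamma * Q[s_t1].max()
    # Blend the old estimate with the target, weighted by the learning rate.
    Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * target
```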

Q-Learning with Convolutional Neural Networks (CNNs)

In complex environments, the state space can be large, making a tabular approach infeasible. CNNs can be integrated with Q-Learning to handle such scenarios, particularly in image-based environments.

The Q-Network

The Q-network can be trained to approximate the Q-value function. The input is the state, and the output is one Q-value for each possible action.
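
A minimal sketch of such a network in PyTorch is shown below; the layer sizes loosely follow the classic DQN architecture, and all names are illustrative:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network sketch: image state in, one Q-value per action out."""

    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        # Convolutional layers extract spatial features from stacked frames.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Fully connected head maps the flattened features to Q-values.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),  # infers the flattened size on first call
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        # x: batch of states, shape (batch, in_channels, height, width)
        return self.head(self.conv(x))
```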

Network Update Rule

The loss function for updating the Q-network is defined as:

$$ L(\theta) = \mathbb{E} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] $$

Where:

  • $\mathbb{E}$ is the expected value over a mini-batch of experiences, i.e., a subset of the experiences stored in the replay buffer.
  • $ \theta $ and $ \theta^- $ are the parameters of the current and target Q-networks, respectively.
  • $ s, a, r, s' $ are the state, action, reward, and next state sampled from the replay buffer.
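
Assuming the QNetwork sketch above and pre-batched tensors, the loss can be computed roughly as follows. The `done` mask for terminal transitions is a standard implementation detail not shown in the formula:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Sketch of the mini-batch loss L(theta) defined above."""
    s, a, r, s_next, done = batch  # done: float mask, 1.0 on terminal transitions
    # Q(s, a; theta): pick out the Q-value of the action actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # r + gamma * max_a' Q(s', a'; theta^-), using the frozen target network;
    # (1 - done) zeroes the bootstrap term on terminal transitions.
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    # Squared Bellman error, averaged over the mini-batch.
    return F.mse_loss(q_sa, target)
```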

Experience Replay

Experience replay stores a replay buffer of past experiences and randomly samples mini-batches from it to train the network, breaking the correlation between sequential experiences.
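
A minimal buffer sketch follows; the capacity and batch size are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of past transitions with uniform random sampling."""

    def __init__(self, capacity=100_000):
        # deque discards the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```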

Variants of Q-Learning

  1. Deep Q-Network (DQN): Combines Q-Learning with deep learning, using a neural network to approximate the Q-function.
  2. Double Q-Learning: Addresses the overestimation of Q-values by decoupling the selection of an action from its evaluation (see the sketch after this list).
  3. Dueling Network Architectures: Separates the representation of state values and state-dependent action advantages.
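
To make the decoupling in variant 2 concrete, here is a sketch of the Double DQN bootstrap target, reusing the networks from the loss sketch above: the online network selects the next action, while the target network evaluates it.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Sketch of the Double Q-Learning bootstrap target."""
    with torch.no_grad():
        # Selection: argmax over the online network's Q-values.
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        # Evaluation: the target network's value of the selected action.
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)
        return r + gamma * q_eval * (1 - done)
```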

Mathematical Derivation of Q-Learning

Consider the Bellman equation for the Q-function:

$$ Q(s,a) = \sum_{s'} P(s' \mid s,a) \left[ r(s,a,s') + \gamma \max_{a'} Q(s',a') \right] $$

The iterative update rule in Q-Learning converges to the fixed point of this equation, which is the optimal Q-function.

Derivation Steps

  1. Initial Assumption: Start with an arbitrary Q-function $Q_0(s,a)$.

    $$ Q_0(s,a) \text{ arbitrary, for all } s \in S,\ a \in A(s) $$

  2. Bellman Update: Apply the Bellman update iteratively using the following update rule:

    $$ Q_{k+1}(s,a) = \sum_{s'} P(s' \mid s,a) \left[r(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right] $$

    where $Q_k$ is the Q-function at iteration $k$, $P(s' \mid s,a)$ is the transition probability from state $s$ to state $s'$ given action $a$, $r(s,a,s')$ is the immediate reward received after the transition, and $\gamma$ is the discount factor. (This update is sketched as code after the list.)

  3. Convergence Proof: Show that under certain conditions, such as a finite Markov Decision Process (MDP), sufficient exploration, and an appropriately chosen learning rate, the Q-function converges to the optimal Q-function $Q^*$:

    $$ \lim_{k \to \infty} Q_k(s,a) = Q^*(s,a) \text{ for all } s \in S, a \in A(s) $$

    This is often proved by showing that the update rule is a contraction mapping (with contraction factor $\gamma$) under the max-norm, which guarantees convergence by the Banach fixed-point theorem.
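
When the transition model $P$ and reward $r$ are known, steps 1–3 can be sketched directly as Q-value iteration. The array shapes below are assumptions for illustration:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Bellman updates from step 2, iterated to (near) convergence.

    Assumed inputs (illustrative):
      P : (n_states, n_actions, n_states), P[s, a, s'] = transition probability
      R : (n_states, n_actions, n_states), R[s, a, s'] = immediate reward
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))  # step 1: arbitrary Q_0
    while True:
        # Step 2: Q_{k+1}(s,a) = sum_s' P(s'|s,a) [r(s,a,s') + gamma max_a' Q_k(s',a')]
        Q_next = np.einsum("sax,sax->sa", P, R + gamma * Q.max(axis=1))
        # Step 3: each update shrinks the max-norm gap by a factor gamma,
        # so this difference decreases geometrically (Banach fixed point).
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
```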