Q-Learning in Reinforcement Learning | October 17th, 2023

Introduction to Q-Learning

Q-Learning is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter.

The Markov Property

A key feature of an MDP is the Markov Property, which states that the future is independent of the past given the present. Mathematically, this is expressed as:

P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)

Basic Concept

The core of Q-Learning is the Q-function, defined as Q : State × Action → ℝ. This function quantifies the value of taking an action in a particular state.
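For small, discrete problems the Q-function can literally be stored as a table. A minimal sketch in Python (the state and action counts below are made-up placeholders, not taken from any particular environment):

```python
import numpy as np

# Hypothetical sizes for a small, discrete environment.
n_states, n_actions = 16, 4

# Q-table: one row per state, one column per action.
# Q[s, a] estimates the value of taking action a in state s.
Q = np.zeros((n_states, n_actions))

# Greedy action selection for some state s.
s = 3
best_action = int(np.argmax(Q[s]))
```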

Q-Function Update Rule

The Q-function is updated as follows:

Q^{\text{new}}(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \right)

Where:

  • s_t and s_{t+1} are the current and next state, respectively.
  • a_t is the action taken at time t.
  • α is the learning rate.
  • r_{t+1} is the reward received after moving from state s_t to s_{t+1}.
  • γ is the discount factor.
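Putting the update rule into code, a minimal tabular sketch might look like the following. The `Q` array is the table from the earlier snippet, and the commented-out loop assumes a hypothetical Gym-like environment whose `step()` returns `(next_state, reward, done)`; that interface is an illustration, not a fixed API.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update to the table Q in place."""
    td_target = r + gamma * np.max(Q[s_next])        # r_{t+1} + γ max_a Q(s_{t+1}, a)
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * td_target

def epsilon_greedy(Q, s, epsilon=0.1):
    """Explore with probability epsilon, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

# Hypothetical training loop (environment interface is assumed):
# for episode in range(500):
#     s, done = env.reset(), False
#     while not done:
#         a = epsilon_greedy(Q, s)
#         s_next, r, done = env.step(a)
#         q_learning_update(Q, s, a, r, s_next)
#         s = s_next
```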

Q-Learning with Convolutional Neural Networks (CNN)

In complex environments, the state space can be large, making a tabular approach infeasible. CNNs can be integrated with Q-Learning to handle such scenarios, particularly in image-based environments.

The Q-Network

The Q-network can be trained to approximate the Q-value function. The input is the state, and the output is a vector of Q-values, one for each possible action.
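As a concrete sketch of such a network, the PyTorch module below follows the classic DQN layout for stacks of four 84×84 grayscale frames; the layer sizes are illustrative, and PyTorch itself is an assumption rather than something prescribed here.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN that maps a stack of frames (the state) to one Q-value per action."""

    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 feature map comes from 84x84 inputs
            nn.Linear(512, n_actions),               # one Q-value per action
        )

    def forward(self, x):
        return self.head(self.features(x))
```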

Network Update Rule

The loss function for updating the Q-network is defined as:

L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]

Where:

  • E is the expected value over a mini-batch of experiences. A mini-batch is simply a subset of the experiences stored in the replay buffer.
  • θ and θ⁻ are the parameters of the current and target Q-networks, respectively.
  • s, a, r, s' are the state, action, reward, and next state sampled from the replay buffer.
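A minimal sketch of this loss in code, assuming PyTorch, the `QNetwork` sketched above, and a mini-batch of transitions already sampled from the replay buffer (the tensor names are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD-error over a mini-batch of (s, a, r, s', done) transitions."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; θ): Q-values of the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # r + γ max_a' Q(s', a'; θ⁻): bootstrapped target from the target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)
```

The `dones` mask zeroes out the bootstrap term on terminal transitions, a detail the loss formula above leaves implicit.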

Experience Replay

Experience replay stores a replay buffer of past experiences and randomly samples mini-batches from it to train the network, breaking the correlation between sequential experiences.
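A replay buffer can be as simple as a bounded deque with uniform random sampling; the sketch below is one minimal way to write it (real implementations often add prioritization or pre-allocated arrays):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions collected along one trajectory.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```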

Variants of Q-Learning

  1. Deep Q-Network (DQN): Combines Q-Learning with deep learning, using a neural network to approximate the Q-function.
  2. Double Q-Learning: Addresses the overestimation of Q-values by decoupling action selection from action evaluation (see the target-computation sketch after this list).
  3. Dueling Network Architectures: Separates the representation of state values and state-dependent action advantages.
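To make the second variant concrete, the sketch below contrasts the standard DQN target with the Double DQN target, reusing the `q_net`/`target_net` naming from the loss example above (again an illustrative sketch, not a reference implementation):

```python
import torch

def dqn_target(target_net, rewards, next_states, dones, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the next action.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it, reducing overestimation bias.
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```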

Mathematical Derivation of Q-Learning

Consider the Bellman equation for the Q-function:

Q^*(s, a) = r + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')

The iterative update rule in Q-learning is designed to converge to the fixed point of this equation.
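When the transition probabilities and rewards are known, this fixed point can be computed directly by repeating the Bellman backup, which is essentially value iteration over Q. A minimal sketch, assuming the dynamics are supplied as NumPy arrays `P[s, a, s']` and `R[s, a, s']` (these array names are an assumption for illustration):

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Iterate the Bellman optimality backup until the Q-values stop changing.

    P[s, a, s'] is the transition probability, R[s, a, s'] the immediate reward.
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Q_{k+1}(s, a) = Σ_{s'} P(s'|s, a) [R(s, a, s') + γ max_{a'} Q_k(s', a')]
        Q_next = np.einsum("ijk,ijk->ij", P, R + gamma * Q.max(axis=1))
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
```

Q-learning performs the same backup from sampled transitions, without needing `P` or `R` explicitly.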

Derivation Steps

  1. Initial Assumption: Start with an arbitrary Q-function Q0(s,a).

    Q_0(s, a) \quad \text{for all } s \in S,\ a \in A(s)

  2. Bellman Update: Apply the Bellman update iteratively using the following update rule:

    Q_{k+1}(s, a) = \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]

    where Q_k is the Q-function at iteration k, P(s' | s, a) is the transition probability from state s to state s' given action a, r(s, a, s') is the immediate reward received after the transition, and γ is the discount factor.

  3. Convergence Proof: Show that under certain conditions, such as a finite Markov Decision Process (MDP), sufficient exploration, and an appropriately chosen learning rate, the Q-function converges to the optimal Q-function Q*:

    \lim_{k \to \infty} Q_k(s, a) = Q^*(s, a) \quad \text{for all } s \in S,\ a \in A(s)

    This is often proved by showing that the update rule is a contraction mapping under the max-norm, which guarantees convergence by the Banach fixed-point theorem.
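    Concretely, writing T for the Bellman backup operator used in step 2, one standard way to state the key step (not spelled out above) is the bound

    \| T Q_1 - T Q_2 \|_{\infty} \le \gamma \, \| Q_1 - Q_2 \|_{\infty}

    for any two Q-functions Q_1 and Q_2. Because 0 ≤ γ < 1 and Q* satisfies T Q* = Q*, each application of T shrinks the max-norm distance to Q* by at least a factor of γ, which is exactly the hypothesis of the Banach fixed-point theorem.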