Q-Learning in Reinforcement Learning | October 17th, 2023
Introduction to Q-Learning
Q-Learning is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov decision process (MDP). It works by learning an action-value function that gives the expected return of taking a given action in a given state and following the optimal policy thereafter.
The Markov Property
A key feature of an MDP is the Markov Property, which states that the future is independent of the past given the present. Mathematically, this is expressed as:

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$$
Basic Concept
The core of Q-Learning is the Q-function, defined as

$$Q(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right],$$

the expected discounted return obtained by taking action $a$ in state $s$ and following the optimal policy thereafter.
Q-Function Update Rule
The Q-function is updated as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$
Where:
- $s_t$ and $s_{t+1}$ are the current and next state, respectively.
- $a_t$ is the action taken at time $t$.
- $\alpha$ is the learning rate.
- $r_{t+1}$ is the reward received after moving from state $s_t$ to $s_{t+1}$.
- $\gamma$ is the discount factor.
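As a concrete illustration of this update, here is a minimal tabular sketch in Python. The environment interface (`env.step(a)` returning the next state, reward, and a done flag) and the hyperparameter values are assumptions made for the example, not something prescribed by the rule above.

```python
import numpy as np

# Hypothetical tabular setting: discrete states and actions.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))      # Q-table initialised to zero

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate

def q_learning_step(env, s):
    """Perform one epsilon-greedy Q-learning update and return the next state."""
    # Epsilon-greedy action selection
    if np.random.rand() < epsilon:
        a = np.random.randint(n_actions)
    else:
        a = int(np.argmax(Q[s]))

    s_next, r, done = env.step(a)        # assumed environment API

    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    return s_next, done
```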
Q-Learning with Convolutional Neural Networks (CNN)
In complex environments, the state space can be large, making a tabular approach infeasible. CNNs can be integrated with Q-Learning to handle such scenarios, particularly in image-based environments.
The Q-Network
A Q-network can be trained to approximate the Q-value function: the input is the state, and the output is a Q-value for each possible action.
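A minimal sketch of such a network in PyTorch follows, assuming image states of shape 4x84x84 (a stack of four grayscale frames, as is common in Atari-style benchmarks); the specific layer sizes are illustrative choices rather than anything fixed by the text.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN that maps an image state to one Q-value per action."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one output per action
        )

    def forward(self, x):
        # Normalise pixel values before the convolutional stack
        return self.head(self.conv(x / 255.0))
```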
Network Update Rule
The loss function for updating the Q-network is defined as:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2}\right]$$
Where:
- $\mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}$ is the expected value over a mini-batch of experiences drawn from the replay buffer $\mathcal{D}$. A mini-batch is simply a subset of the experiences stored in the replay buffer.
- $\theta$ and $\theta^{-}$ are the parameters of the current and target Q-networks, respectively.
- $(s, a, r, s')$ are the state, action, reward, and next state sampled from the replay buffer.
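Putting the pieces together, a sketch of one loss computation in PyTorch might look like the following, assuming `q_net` and `target_net` are two instances of a Q-network like the sketch above, and that the sampled mini-batch has already been converted to tensors (with `actions` as integer indices and `dones` as 0/1 floats).

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Compute the squared TD-error loss L(theta) on a sampled mini-batch."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # r + gamma * max_a' Q(s', a'; theta^-), with the target network held fixed
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)
```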
Experience Replay
Experience replay stores a replay buffer of past experiences and randomly samples mini-batches from it to train the network, breaking the correlation between sequential experiences.
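A minimal replay buffer could look like this sketch; the default capacity and batch size are arbitrary illustrative values.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly at random."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```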
Variants of Q-Learning
- Deep Q-Network (DQN): Combines Q-Learning with deep learning, using a neural network to approximate the Q-function.
- Double Q-Learning: Addresses the overestimation of Q-values by decoupling the selection of an action from its evaluation (a sketch of the resulting target computation follows this list).
- Dueling Network Architectures: Separates the representation of state values and state-dependent action advantages.
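To make the decoupling in Double Q-Learning concrete, here is a hedged sketch of how the training target changes relative to standard DQN: the online network selects the next action, while the target network evaluates it. It reuses the hypothetical `q_net`/`target_net` pair from the earlier sketches.

```python
import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: select the action with the online network, evaluate it with the target network."""
    with torch.no_grad():
        # Action selection uses the current (online) parameters theta
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation uses the target parameters theta^-
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```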
Mathematical Derivation of Q-Learning
Consider the Bellman equation for the Q-function:

$$Q^{*}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]$$
The iterative update rule in Q-learning aims to converge to this equation.
Derivation Steps
- Initial Assumption: Start with an arbitrary Q-function $Q_0(s, a)$.
- Bellman Update: Apply the Bellman update iteratively using the following update rule:

$$Q_{k+1}(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]$$

where $Q_k$ is the Q-function at iteration $k$, $P(s' \mid s, a)$ is the transition probability from state $s$ to state $s'$ given action $a$, $R(s, a, s')$ is the immediate reward received after the transition, and $\gamma$ is the discount factor.
- Convergence Proof: Show that under certain conditions, such as a finite Markov Decision Process (MDP), sufficient exploration, and an appropriately chosen learning rate, the Q-function converges to the optimal Q-function $Q^{*}$:

$$\lim_{k \to \infty} Q_k(s, a) = Q^{*}(s, a)$$

This is often proved by showing that the update rule is a contraction mapping under the max-norm, which guarantees convergence by the Banach fixed-point theorem.
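To illustrate this convergence numerically, the following sketch iterates the Bellman update on a small, randomly generated finite MDP until successive Q-functions agree to within a tolerance; the MDP itself is made up for the example.

```python
import numpy as np

# Hypothetical finite MDP: random transition probabilities and rewards
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(0, 1, size=(n_states, n_actions, n_states))       # R(s, a, s')

Q = np.zeros((n_states, n_actions))
for k in range(1000):
    # Q_{k+1}(s,a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * max_a' Q_k(s',a')]
    Q_next = np.einsum("san,san->sa", P, R + gamma * Q.max(axis=1))
    delta = np.max(np.abs(Q_next - Q))   # the contraction property drives this gap to zero
    Q = Q_next
    if delta < 1e-8:
        break

print(f"Converged after {k} iterations")
```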