is a popular branch of [[Method|Methods]] in [[Reinforcement Learning]] that optimizes the [[Bellman optimality equation]] through [[Temporal-Difference Learning]] updates. The focus is on learning a value or Q function and deriving the policy from it. Usually the Q function is approximated through [[Deep Learning]], producing **Deep Q-Learning** or the Deep Q Network (DQN). The gradient of this network follows the equation below:
$
\nabla_{\phi} ~\frac{1}{|\mathcal{D}|} \sum_{(s,a,r,s',d) \in \mathcal{D}} (Q_{\phi}(s,a) - TD(r,s',d))^2
$
Where $\mathcal{D}$ is the Data collected through [[Off-policy vs On-policy#Replay Buffer vs Replay Memory|Replay Memory]].
$TD(\cdot)$ is the temporal-difference target, which is computed differently depending on the architecture. In a standard DQN it looks like:
$
TD(r,s',d) = r + \gamma (1 - d) \max_{a'} Q_{\phi}(s',a')
$
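As a minimal sketch (the names `q`, `actions`, and the `d` convention are illustrative, not from any particular library), the TD target above can be computed as:

```python
def td_target(r, s_next, d, q, actions, gamma=0.99):
    """Bootstrapped TD target: r + gamma * (1 - d) * max_a' Q(s', a').

    `q` is any callable (state, action) -> float; `d` is 1.0 if the
    episode terminated at this transition, else 0.0, so terminal
    transitions contribute no bootstrapped future value.
    """
    return r + gamma * (1.0 - d) * max(q(s_next, a) for a in actions)
```

Note that `d` simply switches the bootstrap term off for terminal transitions, exactly as the $(1-d)$ factor does in the equation.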
See [[Value-based methods#Soft Double Q-Learning]] or [[Deep Deterministic Policy Gradient|DDPG]] as extensions to Q-learning.
> [!note] Off-policy vs On-policy
Methods based on this idea are considered **off-policy**, because the goal is only to learn some function Q or V that represents the environment; from there we derive a policy. Even in [[Actor-Critic]] methods, we use the value function directly (passing the gradient through it) to update our actor.
### Background
Q-learning follows "directly" from the Bellman optimality equation, as seen below:
![[Bellman optimality equation]]
Replacing the Q function with a neural-network approximator $Q_\phi$, we optimize through the [[Minimum Mean Squared Error]]:
$
L(\phi, {\mathcal D}) = \underset{(s,a,r,s',d) \sim {\mathcal D}}{{\mathrm E}}\left[
\Bigg( Q_{\phi}(s,a) - \left(r + \gamma (1 - d) \max_{a'} Q_{\phi}(s',a') \right) \Bigg)^2
\right]
$
Here $d$ indicates whether the episode has terminated, in which case no future reward follows. $\mathcal{D}$ is the dataset collected through [[@Prioritized Experience Replay]].
> [!note] From Bellman to Update
> This boils down to the update rule:
> $Q_{i+1}(s, a) = \mathrm{E}\left[r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a\right]$
> Such value iteration algorithms converge to the optimal action-value function, $Q_i \rightarrow Q^{*}$ as $i \rightarrow \infty$ [^dqn]
> The proof seems highly non-trivial; see [^book]. It essentially comes down to the Value Iteration algorithm; "simple" proofs can also be found [here](http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf) and [here](https://arxiv.org/abs/2108.02827).
The update here is the [[Temporal-Difference Learning|Temporal Difference]]: we compare our current estimate of the value against the new reward information $R_{t+1}$ and move towards it:
$
Q(S_t, A_t) \leftarrow
\underbrace{Q(S_t, A_t)}_{\text{Current Estimate}} +
\alpha (
\underbrace{R_{t+1}}_
{\substack{\text{New} \\ \text{Info}}} +
\gamma
\underbrace{\max_a Q(S_{t+1}, a)}_{\text{max, not $\pi(S_{t+1})$}}
- Q(S_t,A_t))
$
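A tabular sketch of this update rule (helper names are made up; `Q` is a plain dict keyed by `(state, action)` pairs):

```python
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning update: move Q(s,a) towards the TD target.

    Note the max over next actions, not an action sampled from a policy,
    which is what makes this off-policy.
    """
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error
```

The returned TD error is exactly the bracketed term the learning rate $\alpha$ scales in the equation above.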
> [!question] Does the max make it off-policy?
> Because $A_{t+1}$ is chosen through $\max_a Q(S_{t+1}, a)$ instead of through $\pi$, Q-learning is an off-policy learner. Or does this follow from the equation we are optimizing, i.e. we optimize the Q function for arbitrary state-action pairs instead of assuming samples drawn from a policy?
This is usually **combined with [[@Prioritized Experience Replay]]** and other optimization techniques to drastically improve stability. By running $\mathop{\mathrm{arg\,max}}_a$ on $Q_\phi$ we obtain the policy that chooses the "best" action:
$
\pi(s) = \underset{a}{\mathop{\mathrm{arg\,max}}} ~ Q_\phi(s,a)
$
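Extracting the greedy policy is then a plain argmax over actions (a sketch; `q` and the action set are stand-ins for the trained network and the environment's action space):

```python
def greedy_policy(q, s, actions):
    """pi(s) = argmax_a Q(s, a): pick the action with the highest Q-value."""
    return max(actions, key=lambda a: q(s, a))
```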
Also see: [[SARSA vs Q-learning]]
### Calculating the update
As we have been operating on the expectation so far, a [[sample mean]] over a minibatch $B$ is used to estimate the loss, and an optimizer such as Adam is run on its gradient:
*Important note*: the squared part matters, as otherwise the gradient would die.
$
\nabla_{\phi} ~\frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} (Q_{\phi}(s,a) - TD(r,s',d))^2
$
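To make the gradient concrete, here is a hand-written version for a *linear* function approximator $Q_\phi(s,a) = \phi^\top x(s,a)$, where the gradient of the squared error is simply $2\,(Q_\phi - TD)\,x(s,a)$. This is a sketch with invented names (`feat`, `batch` layout), not a full DQN; the TD target is treated as a constant, i.e. no gradient flows through it, as in standard DQN:

```python
def dqn_grad(phi, batch, feat, actions, gamma=0.99):
    """Gradient of the mean squared TD error for a linear Q-function.

    batch: list of (s, a, r, s_next, d) transitions.
    feat(s, a): feature vector x(s, a) as a list of floats.
    """
    q = lambda s, a: sum(p * x for p, x in zip(phi, feat(s, a)))
    grad = [0.0] * len(phi)
    for s, a, r, s_next, d in batch:
        # TD target: bootstrap term switched off on terminal transitions.
        target = r + gamma * (1.0 - d) * max(q(s_next, a2) for a2 in actions)
        err = q(s, a) - target
        x = feat(s, a)
        for i in range(len(phi)):
            grad[i] += 2.0 * err * x[i] / len(batch)
    return grad
```

An SGD step would then be `phi[i] -= lr * grad[i]`; in practice a deep network and an optimizer like Adam replace this hand-rolled version.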
## Extensions
The master version of Q-Learning combines many of these techniques into one agent: [[@Rainbow_ Combining improvements in deep reinforcement learning]].
### Soft Double Q-Learning
is an extension using a second network (the target network) to stabilize the targets of the online network. The loss is the same as before:
$
L(\phi, {\mathcal D}) = \underset{(s,a,r,s',d) \sim {\mathcal D}}{{\mathrm E}}\left[
(Q_{\phi}(s,a) -
TD_{targ}(r,s',d)
)^2
\right]
$
The only difference is that $TD$ is replaced by $TD_{targ}$, the temporal-difference target calculated with the target network instead of the online network:
$
TD_{targ}(r,s',d) = r + \gamma(1-d)~ \max_{a'} ~ Q_{\phi_{targ}}(s',a')
$
**Updating the target network** is done by slowly moving its parameters towards the online parameters $\phi$, with $\tau \in [0,1]$ and usually $\tau = 0.005$; see [[Experiments Reinforcement Learning DQN]] for more details.
$
\phi_{targ} \leftarrow \tau ~ \phi + (1 - \tau) \phi_{targ}
$
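The soft (Polyak) update is a one-liner per parameter. A sketch over plain Python lists; real code would loop over the network's tensors instead:

```python
def soft_update(phi, phi_targ, tau=0.005):
    """phi_targ <- tau * phi + (1 - tau) * phi_targ, elementwise.

    With a small tau the target network trails the online network slowly,
    which keeps the TD targets stable.
    """
    return [tau * p + (1.0 - tau) * pt for p, pt in zip(phi, phi_targ)]
```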
This update is also called **soft Double Q-Learning**, as the alternative would be to fully replace the target network periodically. In experiments the soft update simply seems to work better.
*Experimental* results from my own tests show a drastic difference between using the target network and not: in the CartPole environment, nothing is learned without it.
For a **graphic display** of the update of the policy net:
![[DQN 2023-01-30 16.42.13.excalidraw]]
### DDPG
[[Deep Deterministic Policy Gradient]].
### SAC
[[Soft Actor-Critic|SAC]]
### Difference Q-learning vs DDPG vs SAC
- When updating the Q network in Q-learning, the action for the next state ($a'$) is obtained through a max over all actions. This also introduces the **overestimation bias**: in an actual run we would not follow the max, since some actions are taken randomly for exploration.
- DDPG uses a network to produce the action, so we calculate Q for that specific action. This resembles the max operation, but since the actor is a trained network it may provide some-ish randomness.
- SAC *samples* its action from the actor. This is exactly how we would run the policy during training: we do not act randomly for exploration, our exploration is done through the sampling itself. As this is represented in our gradient, the update is much more representative of what actually happens!
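The three bullets above differ only in how $a'$ enters the bootstrap term of the TD target. A toy sketch (all names invented; SAC's entropy term is omitted, and its actor is crudely modeled as returning a list of candidate actions to sample from):

```python
import random

def next_value(q, s_next, actions=None, actor=None, mode="max"):
    """How a' enters the bootstrap target in the three method families."""
    if mode == "max":    # Q-learning / DQN: max over all actions
        return max(q(s_next, a) for a in actions)
    if mode == "ddpg":   # DDPG: a deterministic actor picks a'
        return q(s_next, actor(s_next))
    if mode == "sac":    # SAC: a' is *sampled* from the stochastic actor
        return q(s_next, random.choice(actor(s_next)))
    raise ValueError(mode)
```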
References:
https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
[^dqn]: https://www.cs.toronto.edu/%7Evmnih/docs/dqn.pdf (Playing Atari with Deep Reinforcement Learning)
[^book]: https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf (Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto)