Andrew G. Barto, Richard S. Sutton, Charles W. Anderson
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS
December 24, 2020
https://ieeexplore.ieee.org/abstract/document/9306925
---
Historical overview of the actor–critic family and its relationship to earlier “adaptive critic” / error-regulated learning ideas.
In modern notation, the critic provides a temporal-difference (TD) error signal. For a state-value critic $V_w$ with parameters $w$, reward $r_{t+1}$, and discount factor $\gamma$:
$$
\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)
$$
That scalar $\delta_t$ is used to update:
- the **critic** (to improve $V_w$), and
- the **actor** (to increase the probability of actions that led to positive $\delta_t$ and decrease it for actions that led to negative $\delta_t$), often with eligibility traces.
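
A minimal sketch of how one $\delta_t$ can drive both updates, not the paper's own algorithm. It assumes a tabular critic, a softmax actor over per-state action preferences, and no eligibility traces (one-step updates only). The names `actor_critic_step`, `theta`, `alpha_v`, and `alpha_pi` are illustrative choices, not taken from the paper.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def td_error(r_next, v_s, v_s_next, gamma):
    """delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)."""
    return r_next + gamma * v_s_next - v_s

def actor_critic_step(V, theta, s, a, r_next, s_next, done,
                      gamma=0.9, alpha_v=0.1, alpha_pi=0.1):
    """Update critic table V and actor preferences theta in place; return delta."""
    # Assumption: terminal states have value 0.
    v_next = 0.0 if done else V[s_next]
    delta = td_error(r_next, V[s], v_next, gamma)

    # Critic: move V(s) toward the one-step TD target.
    V[s] += alpha_v * delta

    # Actor: for a softmax policy, d log pi(a|s) / d theta[s][b]
    # equals (1 if b == a else 0) - pi(b|s).
    pi = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_pi * delta * grad_log_pi
    return delta

# Hypothetical two-state, two-action problem, all values starting at zero.
V = [0.0, 0.0]
theta = [[0.0, 0.0], [0.0, 0.0]]
delta = actor_critic_step(V, theta, s=0, a=1, r_next=1.0, s_next=1, done=False)
```

In this example the reward is better than the critic predicted, so $\delta_t$ is positive: $V(s_0)$ rises and the preference for the taken action grows relative to the other one. Adding eligibility traces would spread each $\delta_t$ over recently visited states and actions rather than only the latest one.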
The paper also explains the origin of the name **eligibility traces**.
Useful diagram from the paper:
![[Screenshot from 2021-05-20 19-22-46.png]]