Andrew G. Barto, Richard S. Sutton, Charles W. Anderson
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS
December 24, 2020
https://ieeexplore.ieee.org/abstract/document/9306925

---

Historical overview of the actor–critic family and its relationship to earlier “adaptive critic” / error-regulated learning ideas.

In modern notation, the critic provides a TD error signal (example for a state-value critic):

$$
\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)
$$

That scalar $\delta_t$ is used to update:
- the **critic** (to improve $V_w$), and
- the **actor** (to increase the probability of actions that led to positive $\delta_t$), often with eligibility traces.

The paper also explains the origin of the name **eligibility traces**.

Useful diagram from the paper:
![[Screenshot from 2021-05-20 19-22-46.png]]
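
A minimal sketch of how the single scalar $\delta_t$ can drive both updates, written for this note rather than taken from the paper. It assumes a tabular critic (a list `V` standing in for $V_w$), a softmax actor over per-state action preferences `theta`, and one-step updates without eligibility traces. The function names, the step sizes `alpha_v` and `alpha_pi`, and the one-state toy task in `demo` are all hypothetical.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def td_error(r_next, v_s, v_next, gamma):
    """delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)."""
    return r_next + gamma * v_next - v_s


def actor_critic_step(V, theta, s, a, r_next, s_next, done,
                      gamma=0.9, alpha_v=0.1, alpha_pi=0.1):
    """Apply one TD-error-driven update to the critic V and the actor theta."""
    # Assumption: the value of a terminal successor state is zero.
    v_next = 0.0 if done else V[s_next]
    delta = td_error(r_next, V[s], v_next, gamma)

    # Critic: move V(s) toward the one-step bootstrapped target.
    V[s] += alpha_v * delta

    # Actor: gradient of log softmax w.r.t. preference b is 1[b == a] - pi(b|s),
    # so a positive delta raises the probability of the action that was taken.
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_pi * delta * grad_log_pi
    return delta


def demo(seed=0, episodes=500):
    """Hypothetical one-state task: action 1 pays reward 1, action 0 pays 0."""
    rng = random.Random(seed)
    V = [0.0]
    theta = [[0.0, 0.0]]
    for _ in range(episodes):
        probs = softmax(theta[0])
        a = rng.choices([0, 1], weights=probs)[0]
        r = 1.0 if a == 1 else 0.0
        actor_critic_step(V, theta, 0, a, r, 0, done=True)
    return softmax(theta[0])[1]  # learned probability of the rewarded action
```

Calling `demo()` returns the actor's probability of choosing the rewarded action after training. In this toy task it should end up above 0.5, because every update either leaves the preferences unchanged or shifts them toward action 1. Adding eligibility traces would replace the per-step gradient terms with decaying accumulators; that is left out here to keep the sketch short.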