SAC - Soft Actor-Critic

> [!info]
> Off-policy approach for [[Actor-Critic Methods|deep actor-critic methods]]. Optimizes a **[[Policy|stochastic policy]]** under the objective
> $$\mathcal{J}_{\theta}^{\alpha}=\mathcal{J}_{\theta}+\alpha H(\pi_{\theta}),$$
> where $H(\pi_{\theta})=-\mathbb{E}_{s\sim d^{\pi},a\sim \pi}[\log(\pi_{\theta}(a|s))]$ is the policy [[Shannon Entropy|entropy]].

---

In contrast to [[DDPG - Deep Deterministic Policy Gradient|DDPG]]/[[TD3 - Twin Delayed DDPG|TD3]], SAC has **no target actor**, and the surrogate loss is computed from a **soft value function**

$$\mathcal{L}_{\theta}^{\alpha}=-\mathbb{E}_{s \sim u^{q},\textcolor{red}{a\sim \pi_{\theta}}}[Q^{\pi}(s,a)-\alpha \log(\pi_{\theta}(a|s))],$$

where the action is sampled from the policy. The term $-\alpha \log(\pi_{\theta}(a|s))$ is larger for distributions with higher uncertainty, so such distributions artificially increase the estimate and are therefore favored, leading to policies with higher entropy.

The following tricks are used:

- **[[The Reparametrization Trick|The Reparametrization Trick]]**
    - In order to reduce the variance of the [[Statistic and Estimator|estimator]], rewrite the action as $a=g(\epsilon;s,\theta)$, effectively splitting the random component $\epsilon$ from the deterministic ones ($s,\theta$). The gradient can then be written using the chain rule as
      $$\nabla_{\theta}\mathbb{E}_{a \sim \pi_{\theta}}[L(a)]=\mathbb{E}_{\epsilon}[\nabla_{\theta}g(\epsilon;s,\theta)\nabla_{g}L(g(\epsilon;s,\theta))].$$
- **Automatic tuning of the Entropy Parameter**
    - Tuning $\alpha$ is notoriously hard. It is better to specify a target entropy $\bar{H}$ and choose $\alpha$ by minimizing
      $$\mathcal{L}=-\mathbb{E}_{s\sim u^{q},a \sim \pi_{\theta}}[\alpha(\log(\pi_{\theta}(a|s))+\bar{H})]$$
      for each batch of data.
- **Squashed Gaussians**
    - To ensure the actions are sampled within bounds, use squashed [[Gaussian Distribution|Gaussians]]
      $$a=\tanh(u), \quad u \sim \mathcal{N}(\cdot|\mu_{\theta}(s),\Sigma_{\theta}(s)),$$
      and use online actors that output the state-dependent [[Expectations|mean]] and [[Covariance and Variance|standard deviation]] (implemented using two [[Artificial Neural Network|NNs]]).

![[Pasted image 20230716180104.png|center|400]]

---

### Pseudocode

![[Pasted image 20230802182613.png|center|550]]

![[Pasted image 20230802182642.png|center|550]]
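
As a complement to the pseudocode, the three tricks listed earlier can be illustrated with a minimal plain-Python sketch for a one-dimensional action. Everything here is an assumption made for illustration: the function names (`squashed_action`, `reparam_grads`, `alpha_loss`, `alpha_grad`) are hypothetical, `mu` and `sigma` stand in for the outputs of the two networks, the $\log(1-\tanh^2(u))$ term is the standard change-of-variables correction for the squashing (not stated in this note), and the small constant `1e-6` is only there for numerical stability.

```python
import math
import random


def squashed_action(mu, sigma, eps):
    """Deterministic map a = g(eps; s, theta) = tanh(mu + sigma * eps).

    Returns the action and log pi(a|s) for a 1-D squashed Gaussian.
    """
    u = mu + sigma * eps
    a = math.tanh(u)
    # Log-density of u under N(mu, sigma^2), written in terms of eps.
    log_prob_u = -0.5 * eps * eps - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    # Change of variables for a = tanh(u); 1e-6 is an assumed stability constant.
    log_prob_a = log_prob_u - math.log(1.0 - a * a + 1e-6)
    return a, log_prob_a


def sample_action(mu, sigma, rng):
    """Draw eps ~ N(0, 1) and push it through the deterministic map."""
    eps = rng.gauss(0.0, 1.0)
    return squashed_action(mu, sigma, eps)


def reparam_grads(mu, sigma, eps):
    """Gradient of g with respect to (mu, sigma) for a fixed eps."""
    a = math.tanh(mu + sigma * eps)
    d_tanh = 1.0 - a * a
    return d_tanh, eps * d_tanh


def alpha_loss(alpha, log_probs, target_entropy):
    """Batch estimate of L = -E[alpha * (log pi(a|s) + H_bar)]."""
    return -alpha * sum(lp + target_entropy for lp in log_probs) / len(log_probs)


def alpha_grad(log_probs, target_entropy):
    """dL/dalpha = -mean(log pi(a|s) + H_bar)."""
    return -sum(lp + target_entropy for lp in log_probs) / len(log_probs)


if __name__ == "__main__":
    rng = random.Random(0)
    batch = [sample_action(0.2, 0.5, rng) for _ in range(256)]
    log_probs = [lp for _, lp in batch]
    alpha = 0.2
    # One plain gradient-descent step on alpha (learning rate is an assumption).
    alpha -= 1e-2 * alpha_grad(log_probs, target_entropy=-1.0)
    print(alpha)
```

When the batch entropy estimate $-\operatorname{mean}(\log \pi)$ is below $\bar{H}$, `alpha_grad` is negative, so a descent step increases $\alpha$ and strengthens the entropy bonus. Practical implementations often optimize $\log\alpha$ instead to keep $\alpha$ positive; that variant is not shown here.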