Gradient Descent - zhikai - Obsidian Publish

gradient descent为什么长这样是一阶+二次距离正则项的argmin值 $x_{s+1}=\arg\min_{x\in\mathbb{R}^d}\left[\langle\nabla f(x_s),x-x_s\rangle+\frac{1}{2\underline{\eta_s}}\|x-x_s\|^2.\right].$ **Projected Gradient Descent** $\pi_{x\in\mathcal{X}}(y) = argmin_{x\in\mathcal{X}}||x-y||$ Iteration * $y_{s+1} =x_{s}-\eta g_{s}$ * $x_{s+1}=argmin_{x_{s+1}\in\mathcal{X}}||x_{s+1} - y_{s+1}||$ **收敛分析的证明** 凸函数+Lipschitz $ \begin{aligned} &\text { Theorem 1. If } f \text { is L-Lipschitz ( }\left\|g_x\right\| \leq L \text { for all } x \in \mathcal{X} \text { ) and convex, then for fixed } t \text { and } \eta_s=\eta=\frac{R}{L \sqrt{t}} \text {, }\\ &f\left(\frac{1}{t} \sum_{s=1}^t x_s\right)-f\left(x^*\right) \leq \frac{R L}{\sqrt{t}} \end{aligned} $ 证明: 1. [[Convergence rate#Gradient Descent Decomposition]] 2. [[Convergence rate#Gradient Descent Telescoping]] $R$是可行域的直径为什么左边这个可以说明收敛速率凸函数满足 $ f\left(\frac{1}{t}\sum x_s\right) \le \frac{1}{t}\sum f(x_s). $ 所以只要控制了 $\frac{1}{t}\sum f(x_s) - f(x^*)$, 就自动控制了 $f\left(\frac{1}{t}\sum x_s\right) - f(x^*)$. ### 收敛速率 $t = R^2 L^2 / \epsilon^2 = \mathcal{O}(1/\epsilon^2)$ [[strongly convex function]]的情况 $\textbf{Theorem 2. If f is L-Lipschitz and -strongly convex, }\eta_s=\frac{2}{\alpha(s+1)}$ $ f\left(\frac{2}{t(t+1)}\sum_{s=1}^ts\cdot x_s\right)-f(x^*)\leq\frac{2L^2}{\alpha(t+1)}.$ 变化来自于学习率的设定，也就是说在强凸的情况下，继续保持收敛的假设，我们可以把学习率的$\sqrt{s}$换成一个一次项$s$ ### [[smooth convex functions]] $\begin{aligned}\textbf{Lemma 1. }&\textit{If f is -smooth, then }\forall x,y\in\mathcal{X}\\\\&f(x)\leq f(y)+\langle\nabla f(y),x-y\rangle+\frac{\ell}{2}\|x-y\|^2.\end{aligned}$ Proof. Apply [[Convergence rate#Gradient Descent Change of variables]] and then the definition of [[smooth convex functions]] comes in Lemma 2 (Descent lemma). If $f$ is $\ell$-smooth and $\eta \ell \leq 1$, then GD with learning rate $\eta$ satisfies $f(x_{t+1}) \leq f(x_t) - \frac{\eta}{2} \|\nabla f(x_t)\|^2.$ Proof. Apply lemma 1 and plug in the definition of gradient Remark of lemma 2. This descent lemma demonstrates that GD always decreases the function value in the ℓ-smooth setting as long as we choose the learning rate to be no larger than 1/ℓ. Note that this results holds even if f is nonconvex. Lemma 3 (contraction). If $f$ is convex, $\ell$-smooth, and $\eta\ell \leq 1$, then $\|x_{t+1} - x^*\| \leq \|x_t - x^*\|,$ where $x^*$ is any minimizer of $f$. Proof idea. Use lemma 2 Theorem 1. If is convex, l-smooth, and $\eta \ell = 1, \textit{thenfor all }t> 1$ $f(x_t)-f(x^*)\leq\frac{2\ell\|x_1-x^*\|^2}{t-1}.$ Proof is a trivial application of lemmas above $Proof.$ Define $\delta _s: = f( x_s) - f( x^\star ) . \textbf{ By the descent lemma, we have}$ $\delta_{s+1}\leq\delta_s-\frac1{2\ell}\|\nabla f(x_s)\|^2.$ $\textbf{By convexity, Cauchy- Schwartz, and the contraction lemma, }$ $\delta_s\leq\langle\nabla f(x_t),x_t-x^*\rangle\leq\|x_s-x^*\|\cdot\|\nabla f(x_s)\|\leq\|x_1-x^*\|\cdot\|\nabla f(x_s)\|.$ Put two inequalities together, $\delta_{s+1}\leq\delta_s-\frac1{2\ell}\frac{\delta_s^2}{\|x_1-x^*\|^2}.$ Divide both sides by $\delta_s\delta_{s+1}$, $\frac{1}{\delta_{s+1}}-\frac{1}{\delta_{s}}\geq\frac{1}{2\ell\|x_1-x^*\|^2}\cdot\frac{\delta_s}{\delta_{s+1}}\geq\frac{1}{2\ell\|x_1-x^*\|^2}.$ Therefore, $\delta_t\geq\frac{t-1}{2\ell\|x_1-x^*\|^2}\Longrightarrow f(x_t)-f(x^*)\leq\frac{2\ell\|x_1-x^*\|^2}{t-1}.$ Convergence rate is thus $O(\frac{lR^2}{t})$ ### [[smooth convex functions]] + [[strongly convex function]] Theorem 2. If f is α-strongly convex ℓ-smooth and ηℓ = 1, then $f(x_{t}) - f(x^{*}) \leq \frac{\ell\|x_{1} - x^{*}\|^{2}}{2} \exp \left(-\frac{t-1}{\kappa}\right).$ Proof. By α-strongly convexity, $f(x_{t}) - f(x^{*}) + \frac{\alpha}{2}\|x_{t} - x^{*}\|^{2} \leq \langle \nabla f(x_{t}), x_{t} - x^{*} \rangle.$ Proof. 第二行最后那个会消掉 $\begin{aligned}\|\underline{x_{t+1}-x^{*}}\|^{2}&=\|x_{t}-\eta\nabla f(x_{t})-x^{*}\|^{2}\\&\overbrace{=\|x_{t}-x^{\star}\|^{2}}-2\eta\langle\nabla f(x_{t}),x_{t}-x^{*}\rangle+\eta^{2}\|\nabla f(x_{t})\|^{2}\\&\leq(1-\eta\alpha)\|x_{t}-x^{\star}\|^{2}-2\eta(f(x_{t+1})-\overline{f(x^{*}))}\\&\leq\exp\left(-\frac{1}{\kappa}\right)\|x_t-x^\star\|^2.\end{aligned}$ ![[Pasted image 20250613090545.png]] ### Acceleration 两种形式 - Heavy ball momentum: The next iterate $\mathbf{x}_{t+1}$ is given by $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t) + \gamma (\mathbf{x}_t - \mathbf{x}_{t-1}),$ for some parameter $\gamma \in [0, 1]$. $\gamma$ is called the momentum parameter and is usually set to $1 - \theta$ for $\theta \ll 1$. - Nesterov's accelerated gradient descent: The next iterate $\mathbf{x}_{t+1}$ is given by $ \mathbf{y}_t = \mathbf{x}_t + \gamma (\mathbf{x}_t - \mathbf{x}_{t-1})$ $\mathbf{x}_{t+1} = \mathbf{y}_t - \eta \nabla f(\mathbf{y}_t).$ Again, $\gamma$ is called the momentum parameter and is set in the range $[0, 1]$. The algorithm needs to start with two initial points $\mathbf{x}_0, \mathbf{x}_1$ and unless stated otherwise, we always assume $\mathbf{x}_0 = \mathbf{x}_1$. This algorithm improves the convergence time to $\varepsilon^{-1/2}$ and $\sqrt{\kappa} \log \varepsilon^{-1}$ for smooth convex and smooth strongly convex functions respectively. We will prove the convergence rates for strongly smooth convex functions in the next section. **为什么能加速** 例子考虑一个二次函数的情况 $\begin{aligned}&\textbf{Lemma 1. Given a quadratic function f, GD with learning rate }\eta=\frac{1}{\ell}\text{ will converge to an }\varepsilon\text{-optimal solution in time}\\&\tilde{\mathcal{O}}(\frac{1}{\kappa}\log(\frac{1}{\varepsilon})).\end{aligned}$ 使用Nesterov加速以后 Lemma 2. For the quadratic function f, Nesterov's AGD with learning rate η = 1/ℓ and momentum γ = √κ−1/√κ+1 will converge to an ε-optimal solution in time O(1/√κ log(1/ε)). 最后得到的结果是 $\left\|\begin{pmatrix}\mathbf{x}_{t+1}\\\mathbf{x}_t\end{pmatrix}\right\|\leq\left(1-\Theta(\frac{1}{\sqrt{\kappa}})\right)^t\cdot\left\|\begin{pmatrix}\mathbf{x}_1\\\mathbf{x}_0\end{pmatrix}\right\|.$ 加速的本质就在于动量项把原来的一阶线性迭代 $x_{t+1} = (I - \eta A)x_t$ 变成了二阶的迭代 $ x_{t+1} = (1 + \gamma)(I - \eta A)x_t - \gamma(I - \eta A)x_{t-1},$ 从而它在每个特征方向上满足的特征多项式是 $\mu^2 - (1 + \gamma)(1 - \eta \lambda_i)\mu + \gamma(1 - \eta \lambda_i) = 0.$ 求解这个方程，得到的两个根 $\mu_{1,2}$ 的模长（即谱半径）决定了每次迭代的收敛快慢： $ r_i = \max |\mu_{1,2}(\lambda_i)| = \sqrt{\gamma(1 - \eta \lambda_i)}.$ 为了让“最慢”的方向（也就是 $\lambda_i = \alpha$ 时）收敛最快，我们把 $\eta = \frac{1}{\ell}, \quad \kappa = \frac{\ell}{\alpha}, \quad \gamma = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ 代入，就能验证 $ r_{\max} = \sqrt{\gamma(1 - \frac{1}{\kappa})} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \approx 1 - \frac{2}{\sqrt{\kappa}}.$ 而普通梯度下降在最坏方向上的收敛因子是 $1 - \eta \alpha = 1 - \frac{1}{\kappa}.$ 因此每迭代一次，AGD 的误差收缩约为 $1 - \Theta(1/\sqrt{\kappa})$，比GD的 $1 - \Theta(1/\kappa)$ 快了 $\sqrt{\kappa}$ 倍。换句话说，动量参数 $\gamma$ 是专门选来“平衡”各个特征值方向上的收敛率，并使得迭代矩阵的谱半径最小，从而带来从 $O(\kappa \log \frac{1}{\varepsilon})$ 到 $O(\sqrt{\kappa} \log \frac{1}{\varepsilon})$ 的加速。 **strongly smooth convex** 时的收敛速率 Theorem . If a function f is $\ell$-smooth and $\alpha$-strongly convex, then running the Nesterov's AGD with parameters $\eta = \frac{1}{\ell}$ and $\gamma = 1 - \frac{1}{\sqrt{\kappa}}$ and initial iterate $\mathbf{x}_1 = \mathbf{x}_0$ to optimize f gives: $f(\mathbf{x}_t) - f(\mathbf{x}^*) \leq (1 - \frac{1}{\sqrt{\kappa}})^{t-1} \ell \|\mathbf{x}_1 - \mathbf{x}^*\|^2.$ Remark. 证明的关键是得到以下的不等式 $\left[f(\mathbf{x}_{t+1})-f(\mathbf{x}^*)+\frac{\alpha}{2}\|\mathbf{v}_{t+1}-\mathbf{x}^*\|^2\right]\leq\left(1-\frac{1}{\sqrt{\kappa}}\right)\cdot\left[f(\mathbf{x}_t)-f(\mathbf{x}^*)+\frac{\alpha}{2}\|\mathbf{v}_t-\mathbf{x}^*\|^2\right].$ Remark. 如何从这个式子得到收敛速率 $\begin{gathered}\text{我们希望}\\f(x_T)-f(x^*)\leq\varepsilon.\\\text{由上式只需}\\r^{T-1}E_0\leq\varepsilon\quad\Longleftrightarrow\quad(T-1)\ln r+\ln E_0\leq\ln\varepsilon.\\\text{因为 }\ln r=\ln(1-1/\sqrt{\kappa})\leq-1/\sqrt{\kappa}\text{,可得}\\(T-1)\left(-\frac{1}{\sqrt{\kappa}}\right)+\ln E_0\leq\ln\varepsilon\quad\Longrightarrow\quad T\geq1+\sqrt{\kappa}\ln\frac{E_0}{\varepsilon}\\\text{于是}\\T=O{\left(\sqrt{\kappa}\ln\frac{1}{\varepsilon}\right)}.\end{gathered}$ Remark. 动力学背景最经典的物理类比，是把优化轨迹 $x(t)$ 看作一个质量为 1 的质点在势场 $f(x)$ 中运动，受到两种力的作用： 1. 弹性恢复力 $-\nabla f(x)$ —— 把质点拉向势能最低点。 2. 阻尼（摩擦）力 $-\gamma \dot{x}$ —— 消耗能量，使运动最终停下来。写成连续时间的二阶 ODE，就是 $\ddot{x}(t) + \gamma \dot{x}(t) + \nabla f(x(t)) = 0.$ - 当 $\gamma = 0$ 时，它就是无阻尼的牛顿力学，能量守恒，质点会在最低点来回振荡。 - 当 $\gamma > 0$ 时，系统带阻尼，会逐渐“停”在势能最低处，也就是优化到极小值。 Nesterov 加速其实对应一个临界阻尼或者欠阻尼 / 过阻尼之间的精心调参版本，使得下降既快又不产生太大振荡。 2. 势能、动能和“优化能量” - 势能：在物理里，质点位置为 $x$ 时的势能是 $V(x)$。在优化里，我们把它对应成函数值差 $ V(x) \longleftrightarrow f(x) - f(x^*)$ 它度量“距离最优值还差多少能量”。 - 动能：质点有速度 $\dot{x}$，就是动能 $\frac{1}{2} \|\dot{x}\|^2$。在离散算法里，速度对应成迭代中“动量”项，所以通常会看到 $\frac{\alpha}{2} \|\dot{y}_t - \dot{x}_t\|^2$ 这样的二次项，被解读为“动能”或“惯性能量”。于是把 $ \underbrace{f(x_t) - f(x^*)}_{\text{势能}} + \underbrace{\frac{\alpha}{2} \|\dot{y}_t - \dot{x}^*\|^2}_{\text{动能}}$ 组合起来，就像经典力学里质点的总能量——这在推导加速收敛率时，扮演了一个“Lyapunov 函数”的角色。