gradient descent为什么长这样
是一阶+二次距离正则项的argmin值
$x_{s+1}=\arg\min_{x\in\mathbb{R}^d}\left[\langle\nabla f(x_s),x-x_s\rangle+\frac{1}{2\underline{\eta_s}}\|x-x_s\|^2.\right].$
**Projected Gradient Descent**
$\pi_{x\in\mathcal{X}}(y) = argmin_{x\in\mathcal{X}}||x-y||$
Iteration
* $y_{s+1} =x_{s}-\eta g_{s}$
* $x_{s+1}=argmin_{x_{s+1}\in\mathcal{X}}||x_{s+1} - y_{s+1}||$
**收敛分析的证明**
凸函数+Lipschitz
$
\begin{aligned}
&\text { Theorem 1. If } f \text { is L-Lipschitz ( }\left\|g_x\right\| \leq L \text { for all } x \in \mathcal{X} \text { ) and convex, then for fixed } t \text { and } \eta_s=\eta=\frac{R}{L \sqrt{t}} \text {, }\\
&f\left(\frac{1}{t} \sum_{s=1}^t x_s\right)-f\left(x^*\right) \leq \frac{R L}{\sqrt{t}}
\end{aligned}
$
证明: 1. [[Convergence rate#Gradient Descent Decomposition]]
2. [[Convergence rate#Gradient Descent Telescoping]]
$R$是可行域的直径
为什么左边这个可以说明收敛速率
凸函数满足
$
f\left(\frac{1}{t}\sum x_s\right) \le \frac{1}{t}\sum f(x_s).
$
所以只要控制了 $\frac{1}{t}\sum f(x_s) - f(x^*)$, 就自动控制了 $f\left(\frac{1}{t}\sum x_s\right) - f(x^*)$.
### 收敛速率
$t = R^2 L^2 / \epsilon^2 = \mathcal{O}(1/\epsilon^2)$
[[strongly convex function]]的情况
$\textbf{Theorem 2. If f is L-Lipschitz and -strongly convex, }\eta_s=\frac{2}{\alpha(s+1)}$
$
f\left(\frac{2}{t(t+1)}\sum_{s=1}^ts\cdot x_s\right)-f(x^*)\leq\frac{2L^2}{\alpha(t+1)}.$
变化来自于学习率的设定,也就是说在强凸的情况下,继续保持收敛的假设,我们可以把学习率的$\sqrt{s}$换成一个一次项$s$
### [[smooth convex functions]]
$\begin{aligned}\textbf{Lemma 1. }&\textit{If f is -smooth, then }\forall x,y\in\mathcal{X}\\\\&f(x)\leq f(y)+\langle\nabla f(y),x-y\rangle+\frac{\ell}{2}\|x-y\|^2.\end{aligned}$
Proof. Apply [[Convergence rate#Gradient Descent Change of variables]] and then the definition of [[smooth convex functions]] comes in
Lemma 2 (Descent lemma). If $f$ is $\ell$-smooth and $\eta \ell \leq 1$, then GD with learning rate $\eta$ satisfies
$f(x_{t+1}) \leq f(x_t) - \frac{\eta}{2} \|\nabla f(x_t)\|^2.$
Proof. Apply lemma 1 and plug in the definition of gradient
Remark of lemma 2. This descent lemma demonstrates that GD always decreases the function value in the ℓ-smooth setting as long as we choose the learning rate to be no larger than 1/ℓ. Note that this results holds even if f is nonconvex.
Lemma 3 (contraction). If $f$ is convex, $\ell$-smooth, and $\eta\ell \leq 1$, then
$\|x_{t+1} - x^*\| \leq \|x_t - x^*\|,$
where $x^*$ is any minimizer of $f$.
Proof idea. Use lemma 2
Theorem 1. If is convex, l-smooth, and $\eta \ell = 1, \textit{thenfor all }t> 1$
$f(x_t)-f(x^*)\leq\frac{2\ell\|x_1-x^*\|^2}{t-1}.$
Proof is a trivial application of lemmas above
$Proof.$ Define $\delta _s: = f( x_s) - f( x^\star ) . \textbf{ By the descent lemma, we have}$
$\delta_{s+1}\leq\delta_s-\frac1{2\ell}\|\nabla f(x_s)\|^2.$
$\textbf{By convexity, Cauchy- Schwartz, and the contraction lemma, }$
$\delta_s\leq\langle\nabla f(x_t),x_t-x^*\rangle\leq\|x_s-x^*\|\cdot\|\nabla f(x_s)\|\leq\|x_1-x^*\|\cdot\|\nabla f(x_s)\|.$
Put two inequalities together,
$\delta_{s+1}\leq\delta_s-\frac1{2\ell}\frac{\delta_s^2}{\|x_1-x^*\|^2}.$
Divide both sides by $\delta_s\delta_{s+1}$,
$\frac{1}{\delta_{s+1}}-\frac{1}{\delta_{s}}\geq\frac{1}{2\ell\|x_1-x^*\|^2}\cdot\frac{\delta_s}{\delta_{s+1}}\geq\frac{1}{2\ell\|x_1-x^*\|^2}.$
Therefore,
$\delta_t\geq\frac{t-1}{2\ell\|x_1-x^*\|^2}\Longrightarrow f(x_t)-f(x^*)\leq\frac{2\ell\|x_1-x^*\|^2}{t-1}.$
Convergence rate is thus $O(\frac{lR^2}{t})$
### [[smooth convex functions]] + [[strongly convex function]]
Theorem 2. If f is α-strongly convex ℓ-smooth and ηℓ = 1, then
$f(x_{t}) - f(x^{*}) \leq \frac{\ell\|x_{1} - x^{*}\|^{2}}{2} \exp \left(-\frac{t-1}{\kappa}\right).$
Proof. By α-strongly convexity,
$f(x_{t}) - f(x^{*}) + \frac{\alpha}{2}\|x_{t} - x^{*}\|^{2} \leq \langle \nabla f(x_{t}), x_{t} - x^{*} \rangle.$
Proof.
第二行最后那个会消掉
$\begin{aligned}\|\underline{x_{t+1}-x^{*}}\|^{2}&=\|x_{t}-\eta\nabla f(x_{t})-x^{*}\|^{2}\\&\overbrace{=\|x_{t}-x^{\star}\|^{2}}-2\eta\langle\nabla f(x_{t}),x_{t}-x^{*}\rangle+\eta^{2}\|\nabla f(x_{t})\|^{2}\\&\leq(1-\eta\alpha)\|x_{t}-x^{\star}\|^{2}-2\eta(f(x_{t+1})-\overline{f(x^{*}))}\\&\leq\exp\left(-\frac{1}{\kappa}\right)\|x_t-x^\star\|^2.\end{aligned}$
![[Pasted image 20250613090545.png]]
### Acceleration
两种形式
- Heavy ball momentum: The next iterate $\mathbf{x}_{t+1}$ is given by
$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t) + \gamma (\mathbf{x}_t - \mathbf{x}_{t-1}),$
for some parameter $\gamma \in [0, 1]$. $\gamma$ is called the momentum parameter and is usually set to $1 - \theta$ for $\theta \ll 1$.
- Nesterov's accelerated gradient descent: The next iterate $\mathbf{x}_{t+1}$ is given by
$
\mathbf{y}_t = \mathbf{x}_t + \gamma (\mathbf{x}_t - \mathbf{x}_{t-1})$
$\mathbf{x}_{t+1} = \mathbf{y}_t - \eta \nabla f(\mathbf{y}_t).$
Again, $\gamma$ is called the momentum parameter and is set in the range $[0, 1]$. The algorithm needs to start with two initial points $\mathbf{x}_0, \mathbf{x}_1$ and unless stated otherwise, we always assume $\mathbf{x}_0 = \mathbf{x}_1$. This algorithm improves the convergence time to $\varepsilon^{-1/2}$ and $\sqrt{\kappa} \log \varepsilon^{-1}$ for smooth convex and smooth strongly convex functions respectively. We will prove the convergence rates for strongly smooth convex functions in the next section.
**为什么能加速**
例子考虑一个二次函数的情况
$\begin{aligned}&\textbf{Lemma 1. Given a quadratic function f, GD with learning rate }\eta=\frac{1}{\ell}\text{ will converge to an }\varepsilon\text{-optimal solution in time}\\&\tilde{\mathcal{O}}(\frac{1}{\kappa}\log(\frac{1}{\varepsilon})).\end{aligned}$
使用Nesterov加速以后
Lemma 2. For the quadratic function f, Nesterov's AGD with learning rate η = 1/ℓ and momentum γ = √κ−1/√κ+1 will converge to an ε-optimal solution in time O(1/√κ log(1/ε)).
最后得到的结果是
$\left\|\begin{pmatrix}\mathbf{x}_{t+1}\\\mathbf{x}_t\end{pmatrix}\right\|\leq\left(1-\Theta(\frac{1}{\sqrt{\kappa}})\right)^t\cdot\left\|\begin{pmatrix}\mathbf{x}_1\\\mathbf{x}_0\end{pmatrix}\right\|.$
加速的本质就在于动量项把原来的一阶线性迭代
$x_{t+1} = (I - \eta A)x_t$
变成了二阶的迭代
$
x_{t+1} = (1 + \gamma)(I - \eta A)x_t - \gamma(I - \eta A)x_{t-1},$
从而它在每个特征方向上满足的特征多项式是
$\mu^2 - (1 + \gamma)(1 - \eta \lambda_i)\mu + \gamma(1 - \eta \lambda_i) = 0.$
求解这个方程,得到的两个根 $\mu_{1,2}$ 的模长(即谱半径)决定了每次迭代的收敛快慢:
$
r_i = \max |\mu_{1,2}(\lambda_i)| = \sqrt{\gamma(1 - \eta \lambda_i)}.$
为了让“最慢”的方向(也就是 $\lambda_i = \alpha$ 时)收敛最快,我们把
$\eta = \frac{1}{\ell}, \quad \kappa = \frac{\ell}{\alpha}, \quad \gamma = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$
代入,就能验证
$
r_{\max} = \sqrt{\gamma(1 - \frac{1}{\kappa})} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \approx 1 - \frac{2}{\sqrt{\kappa}}.$
而普通梯度下降在最坏方向上的收敛因子是
$1 - \eta \alpha = 1 - \frac{1}{\kappa}.$
因此每迭代一次,AGD 的误差收缩约为 $1 - \Theta(1/\sqrt{\kappa})$,比GD的 $1 - \Theta(1/\kappa)$ 快了 $\sqrt{\kappa}$ 倍。
换句话说,动量参数 $\gamma$ 是专门选来“平衡”各个特征值方向上的收敛率,并使得迭代矩阵的谱半径最小,从而带来从 $O(\kappa \log \frac{1}{\varepsilon})$ 到 $O(\sqrt{\kappa} \log \frac{1}{\varepsilon})$ 的加速。
**strongly smooth convex** 时的收敛速率
Theorem . If a function f is $\ell$-smooth and $\alpha$-strongly convex, then running the Nesterov's AGD with parameters $\eta = \frac{1}{\ell}$ and $\gamma = 1 - \frac{1}{\sqrt{\kappa}}$ and initial iterate $\mathbf{x}_1 = \mathbf{x}_0$ to optimize f gives:
$f(\mathbf{x}_t) - f(\mathbf{x}^*) \leq (1 - \frac{1}{\sqrt{\kappa}})^{t-1} \ell \|\mathbf{x}_1 - \mathbf{x}^*\|^2.$
Remark. 证明的关键是得到以下的不等式
$\left[f(\mathbf{x}_{t+1})-f(\mathbf{x}^*)+\frac{\alpha}{2}\|\mathbf{v}_{t+1}-\mathbf{x}^*\|^2\right]\leq\left(1-\frac{1}{\sqrt{\kappa}}\right)\cdot\left[f(\mathbf{x}_t)-f(\mathbf{x}^*)+\frac{\alpha}{2}\|\mathbf{v}_t-\mathbf{x}^*\|^2\right].$
Remark. 如何从这个式子得到收敛速率
$\begin{gathered}\text{我们希望}\\f(x_T)-f(x^*)\leq\varepsilon.\\\text{由上式只需}\\r^{T-1}E_0\leq\varepsilon\quad\Longleftrightarrow\quad(T-1)\ln r+\ln E_0\leq\ln\varepsilon.\\\text{因为 }\ln r=\ln(1-1/\sqrt{\kappa})\leq-1/\sqrt{\kappa}\text{,可得}\\(T-1)\left(-\frac{1}{\sqrt{\kappa}}\right)+\ln E_0\leq\ln\varepsilon\quad\Longrightarrow\quad T\geq1+\sqrt{\kappa}\ln\frac{E_0}{\varepsilon}\\\text{于是}\\T=O{\left(\sqrt{\kappa}\ln\frac{1}{\varepsilon}\right)}.\end{gathered}$
Remark. 动力学背景
最经典的物理类比,是把优化轨迹 $x(t)$ 看作一个质量为 1 的质点在势场 $f(x)$ 中运动,受到两种力的作用:
1. 弹性恢复力 $-\nabla f(x)$ —— 把质点拉向势能最低点。
2. 阻尼(摩擦)力 $-\gamma \dot{x}$ —— 消耗能量,使运动最终停下来。
写成连续时间的二阶 ODE,就是
$\ddot{x}(t) + \gamma \dot{x}(t) + \nabla f(x(t)) = 0.$
- 当 $\gamma = 0$ 时,它就是无阻尼的牛顿力学,能量守恒,质点会在最低点来回振荡。
- 当 $\gamma > 0$ 时,系统带阻尼,会逐渐“停”在势能最低处,也就是优化到极小值。
Nesterov 加速其实对应一个临界阻尼或者欠阻尼 / 过阻尼之间的精心调参版本,使得下降既快又不产生太大振荡。
2. 势能、动能和“优化能量”
- 势能:在物理里,质点位置为 $x$ 时的势能是 $V(x)$。在优化里,我们把它对应成函数值差
$
V(x) \longleftrightarrow f(x) - f(x^*)$
它度量“距离最优值还差多少能量”。
- 动能:质点有速度 $\dot{x}$,就是动能 $\frac{1}{2} \|\dot{x}\|^2$。在离散算法里,速度对应成迭代中“动量”项,所以通常会看到
$\frac{\alpha}{2} \|\dot{y}_t - \dot{x}_t\|^2$
这样的二次项,被解读为“动能”或“惯性能量”。
于是把
$
\underbrace{f(x_t) - f(x^*)}_{\text{势能}} + \underbrace{\frac{\alpha}{2} \|\dot{y}_t - \dot{x}^*\|^2}_{\text{动能}}$
组合起来,就像经典力学里质点的总能量——这在推导加速收敛率时,扮演了一个“Lyapunov 函数”的角色。