# Convergence Rates of Optimization Methods

The core of analyzing an optimization method is bounding how many iterations it needs to reach a target accuracy $\varepsilon$. The table below summarizes these iteration complexities up to constant factors, where $R = \|x_1 - x^*\|$, $\kappa = \ell/\alpha$ is the condition number, $\sigma^2$ bounds the variance of the stochastic gradients, and $m$ is the mini-batch size.

## Summary
| Method (assumptions) | Iteration complexity $O(\cdot)$ |
| --- | --- |
| GD, convex + $L$-Lipschitz | $(RL/\varepsilon)^2$ |
| GD, convex + $\ell$-smooth | $\ell R^2/\varepsilon$ |
| GD, $\alpha$-strongly convex + $L$-Lipschitz | $L^2/(\alpha\varepsilon)$ |
| GD, $\alpha$-strongly convex + $\ell$-smooth | $\kappa \log\bigl(\ell R^2/\varepsilon\bigr)$ |
| SGD, convex + $L$-Lipschitz | $(RL/\varepsilon)^2 + (R\sigma/\varepsilon)^2$ |
| SGD, convex + $\ell$-smooth | $\ell R^2/\varepsilon + (R\sigma/\varepsilon)^2$ |
| SGD, $\alpha$-strongly convex + $L$-Lipschitz | $L^2/(\alpha\varepsilon) + \sigma^2/(\alpha\varepsilon)$ |
| SGD, $\alpha$-strongly convex + $\ell$-smooth | $\kappa \log\bigl(\ell R^2/\varepsilon\bigr) + \sigma^2/(\alpha\varepsilon)$ |
| Mini-batch SGD, convex + $L$-Lipschitz | $m\bigl((RL/\varepsilon)^2 + (R\sigma/\varepsilon)^2\bigr)$ |
| Mini-batch SGD, convex + $\ell$-smooth | $m\bigl(\ell R^2/\varepsilon + (R\sigma/\varepsilon)^2\bigr)$ |
| Mini-batch SGD, $\alpha$-strongly convex + $L$-Lipschitz | $m\bigl(L^2/(\alpha\varepsilon) + \sigma^2/(\alpha\varepsilon)\bigr)$ |
| Mini-batch SGD, $\alpha$-strongly convex + $\ell$-smooth | $m\bigl(\kappa \log\bigl(\ell R^2/\varepsilon\bigr) + \sigma^2/(\alpha\varepsilon)\bigr)$ |
| SVRG, convex + $L$-Lipschitz | $R^2(L^2+\sigma^2)/\varepsilon^2$ |
| SVRG, convex + $\ell$-smooth | $R^2\sigma^2/\varepsilon^2 + \sqrt{\ell R^2/\varepsilon}$ |
| SVRG, $\alpha$-strongly convex + $L$-Lipschitz | $(L^2+\sigma^2)/(\alpha\varepsilon)$ |
| SVRG, $\alpha$-strongly convex + $\ell$-smooth | $\sigma^2/(\alpha\varepsilon) + \sqrt{\kappa}\log\bigl(\ell R^2/\varepsilon\bigr)$ |
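As a quick empirical cross-check of the strongly convex SGD row, here is a minimal sketch on an assumed toy problem (the function, constants, and variable names below are illustrative, not from the source): for $f(x)=\frac{1}{n}\sum_i \frac{1}{2}\|x-a_i\|^2$, which is $1$-strongly convex and $1$-smooth, SGD with step size $\eta_s = \frac{1}{\alpha(s+1)}$ should show an optimality gap on the order of $\sigma^2/(\alpha t)$, matching the $\sigma^2/(\alpha\varepsilon)$ term in the table.

```python
import numpy as np

# Toy instance (assumed for illustration): f(x) = (1/n) sum_i 0.5*||x - a_i||^2
# is 1-strongly convex and 1-smooth with minimizer x* = mean(a_i); the
# stochastic gradient x - a_i has variance sigma^2 = (1/n) sum_i ||a_i - a_bar||^2.
rng = np.random.default_rng(0)
n, d, t = 1000, 5, 100_000
a = rng.standard_normal((n, d))
a_bar = a.mean(axis=0)
sigma2 = np.mean(np.sum((a - a_bar) ** 2, axis=1))

f = lambda x: 0.5 * np.mean(np.sum((x - a) ** 2, axis=1))

x = np.zeros(d)
for s in range(t):
    i = rng.integers(n)
    x -= (x - a[i]) / (s + 1)       # SGD step with eta_s = 1/(alpha*(s+1)), alpha = 1
print(f(x) - f(a_bar), sigma2 / t)  # observed gap vs. the sigma^2/(alpha*t) scale
```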
## Techniques
### [[Gradient Descent]] Decomposition
$
\begin{aligned}
f\left(x_s\right)-f\left(x^*\right) & \leq g_s^T\left(x_s-x^*\right) \\
& =\frac{1}{\eta}\left(x_s-y_{s+1}\right)^T\left(x_s-x^*\right) \\
& =\frac{1}{2 \eta}\left[\left\|x_s-y_{s+1}\right\|^2+\left\|x_s-x^*\right\|^2-\left\|y_{s+1}-x^*\right\|^2\right]
\end{aligned}
$
Here the first inequality is convexity of $f$, the second line uses the gradient step $y_{s+1} = x_s - \eta g_s$, and the third applies the polarization identity
$
u^T v \;=\; \frac12\Bigl(\|u\|^2 + \|v\|^2 - \|u - v\|^2\Bigr),
$
with
$
u = x_s - y_{s+1}, \qquad
v = x_s - x^*.
$
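A quick numerical sanity check of this decomposition (the toy vectors and step size are arbitrary choices for illustration):

```python
import numpy as np

# Check g_s^T (x_s - x*) = (1/2eta)[||x_s - y_{s+1}||^2 + ||x_s - x*||^2
#                                   - ||y_{s+1} - x*||^2]
# when y_{s+1} = x_s - eta * g_s; the vectors and eta below are toy values.
rng = np.random.default_rng(0)
x_s, x_star, g_s = rng.standard_normal((3, 5))
eta = 0.1
y_next = x_s - eta * g_s                      # the gradient step y_{s+1}

sq = lambda z: z @ z                          # squared Euclidean norm
lhs = g_s @ (x_s - x_star)
rhs = (sq(x_s - y_next) + sq(x_s - x_star) - sq(y_next - x_star)) / (2 * eta)
assert np.isclose(lhs, rhs)
```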
### [[Gradient Descent]] Telescoping
$
\begin{align*}
\sum_{s=1}^t \left( \frac{1}{2\eta} [\|x_s - x^*\|^2 - \|x_{s+1} - x^*\|^2] + \frac{\eta L^2}{2} \right) &= \frac{1}{2\eta} \sum_{s=1}^t [\|x_s - x^*\|^2 - \|x_{s+1} - x^*\|^2] + \frac{\eta L^2}{2}t \\
&= \frac{1}{2\eta} [\|x_1 - x^*\|^2 - \|x_{t+1} - x^*\|^2] + \frac{\eta L^2}{2}t, \quad (5.1)
\end{align*}
$
where in the second line we used *telescoping*: a sum of the form $\sum_{s=1}^t (a_s - a_{s+1})$ collapses as
$\begin{align*}
\sum_{s=1}^t (a_s - a_{s+1}) &= \sum_{s=1}^t a_s - \sum_{s=1}^t a_{s+1} \\
&= \sum_{s=1}^t a_s - \sum_{s=2}^{t+1} a_s \\
&= \left[ a_1 + \sum_{s=2}^t a_s \right] - \left[ \sum_{s=2}^t a_s + a_{t+1} \right] \\
&= a_1 - a_{t+1}.
\end{align*}$
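Plugging $\|x_1 - x^*\| \leq R$ into (5.1) and choosing $\eta = R/(L\sqrt{t})$ yields the $O\!\left(RL/\sqrt{t}\right)$ guarantee from the summary table. Below is a minimal sketch that checks this empirically; the test function $f(x) = \|x\|_1$ and all constants are assumptions chosen for illustration:

```python
import numpy as np

# Subgradient descent on f(x) = ||x||_1 (convex, L-Lipschitz with L = sqrt(d)),
# started at x_1 = (1, ..., 1) so that x* = 0 and R = ||x_1 - x*|| = sqrt(d).
# With eta = R/(L*sqrt(t)), (5.1) implies min_s f(x_s) - f(x*) <= R*L/sqrt(t).
d, t = 10, 10_000
x = np.ones(d)
R = L = np.sqrt(d)
eta = R / (L * np.sqrt(t))

best = np.abs(x).sum()                 # track the best f(x_s) seen so far
for _ in range(t):
    x -= eta * np.sign(x)              # np.sign(x) is a subgradient of ||.||_1
    best = min(best, np.abs(x).sum())

print(best, R * L / np.sqrt(t))        # best gap vs. the R*L/sqrt(t) bound
```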
### [[Gradient Descent]] Change of variables
For example:
1. Parameterize the segment:
Define
$g(t) = f(y + t(x - y)), \quad t \in [0, 1].$
Then $g(0) = f(y)$ and $g(1) = f(x)$.
2. Differentiate along the segment:
By the multivariate chain rule,
$g'(t) = \nabla f(y + t(x - y))^T (x - y).$
3. Apply the fundamental theorem of calculus (FTC) and subtract off the linear part:
$f(x) - f(y) = g(1) - g(0) = \int_0^1 g'(t) \, dt = \int_0^1 \nabla f(y + t(x - y))^T (x - y) \, dt,$
$f(x) - f(y) - \langle \nabla f(x), x - y \rangle = \int_0^1 \left[ \nabla f(y + t(x - y)) - \nabla f(x) \right]^T (x - y) \, dt.$
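A minimal numerical check of this FTC identity, using $f(z) = \log \sum_i e^{z_i}$ as an assumed smooth test function and a midpoint-rule quadrature:

```python
import numpy as np

# Check f(x) - f(y) = int_0^1 grad_f(y + t(x - y))^T (x - y) dt numerically
# for the smooth convex test function f(z) = log sum_i exp(z_i).
def f(z):
    return np.log(np.exp(z).sum())

def grad_f(z):
    e = np.exp(z)
    return e / e.sum()               # the softmax

rng = np.random.default_rng(0)
x, y = rng.standard_normal((2, 4))

n = 4000                             # midpoint rule on [0, 1]
ts = (np.arange(n) + 0.5) / n
integral = np.mean([grad_f(y + t * (x - y)) @ (x - y) for t in ts])

assert np.isclose(integral, f(x) - f(y), atol=1e-6)
```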
### [[Gradient Descent]] Deriving rates for other settings from the strongly convex rate
1. Add a small quadratic term to the convex function
Let the original convex function be $f(x)$. Add a strongly convex (quadratic) regularizer to it, obtaining
$\tilde{f}(x) = f(x) + \frac{\tilde{\alpha}}{2} \|x - x_1\|^2, \quad \tilde{\alpha} = \frac{\varepsilon}{R^2}.$
Then $\tilde{f}$ is $\tilde{\alpha}$-strongly convex (while remaining smooth or Lipschitz).
2. Run GD on $\tilde{f}$
- Since $\tilde{f}$ is strongly convex, we already know that after enough gradient descent steps it converges to its optimum at an exponential rate ($\exp(-t/\kappa)$).
- Choose the number of steps $t$ so that $\tilde{f}(x_t)$ is within $\varepsilon/2$ of $\min \tilde{f}$, i.e. is "$\varepsilon/2$-accurate".
3. Transfer the accuracy on $\tilde{f}$ back to the original $f$
Following the derivation in (7.10)-(7.14), one can show
$
f(x_t) \leq \tilde{f}(x_t) \leq \min \tilde{f} + \frac{\varepsilon}{2} \leq f(x^*) + \varepsilon,$
that is, an "$\varepsilon/2$-optimal" point of $\tilde{f}$ is also an "$\varepsilon$-optimal" point of $f$. (The last inequality holds because $\min \tilde{f} \leq \tilde{f}(x^*) = f(x^*) + \frac{\tilde{\alpha}}{2}\|x^* - x_1\|^2 \leq f(x^*) + \frac{\varepsilon}{2}$.)
4. Therefore
To reach $\varepsilon$-accuracy on $f$, it suffices to reach $\varepsilon/2$-accuracy on $\tilde{f}$.
Since $\tilde{f}$ is strongly convex, we can directly invoke the exponential convergence result for the strongly convex + smooth (or strongly convex + Lipschitz) case to compute the required number of iterations. In the smooth case $\kappa = \tilde{\ell}/\tilde{\alpha} \approx \ell R^2/\varepsilon$, so
$t = O\left( \kappa \log \frac{1}{\varepsilon} \right) \implies O\left( \frac{\ell R^2}{\varepsilon} \log \frac{\ell R^2}{\varepsilon} \right) \quad \text{(smooth case)}.$
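A minimal end-to-end sketch of the trick (the rank-deficient least-squares instance and all variable names are assumptions for illustration, not the text's (7.10)-(7.14) derivation):

```python
import numpy as np

# Regularization trick on a rank-deficient least-squares problem: f is convex
# and ell-smooth but NOT strongly convex. We run GD on the surrogate
# f_tilde(x) = f(x) + (alpha_reg/2)*||x - x1||^2 with alpha_reg = eps/R^2,
# for t = kappa * log(ell * R^2 / eps) steps.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((3, 10)), rng.standard_normal(3)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)

x1 = np.zeros(10)
x_star = np.linalg.pinv(A) @ b                # minimizer of f closest to x1
R2 = np.sum((x1 - x_star) ** 2)               # R^2 = ||x1 - x*||^2
eps = 1e-2
alpha_reg = eps / R2                          # alpha_tilde = eps/R^2

ell = np.linalg.eigvalsh(A.T @ A).max()       # smoothness constant of f
kappa = (ell + alpha_reg) / alpha_reg         # condition number of f_tilde
t = int(kappa * np.log(ell * R2 / eps)) + 1

x, eta = x1.copy(), 1.0 / (ell + alpha_reg)   # GD with step size 1/ell_tilde
for _ in range(t):
    x -= eta * (grad_f(x) + alpha_reg * (x - x1))

print(f(x) - f(x_star), eps)                  # eps-optimal on the original f
```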