# Convergence Rates of Optimization Methods

Convergence rates are at the core of analyzing optimization methods. Throughout, $L$ is the Lipschitz constant of $f$, $\ell$ the smoothness constant, $\alpha$ the strong-convexity parameter, $\kappa = \ell/\alpha$ the condition number, $R = \|x_1 - x^*\|$ the distance from the initial point to the optimum, and $\sigma^2$ the variance of the stochastic gradients.

## Summary

| Method & assumptions | Iteration bound (to reach $\varepsilon$-accuracy, up to constants) |
| --- | --- |
| GD, convex + $L$-Lipschitz | $(RL/\varepsilon)^2$ |
| GD, convex + $\ell$-smooth | $\ell R^2/\varepsilon$ |
| GD, $\alpha$-strongly convex + $L$-Lipschitz | $L^2/(\alpha\varepsilon)$ |
| GD, $\alpha$-strongly convex + $\ell$-smooth | $\kappa \log\left(\ell R^2/\varepsilon\right)$ |
| SGD, convex + $L$-Lipschitz | $(RL/\varepsilon)^2 + (R\sigma/\varepsilon)^2$ |
| SGD, convex + $\ell$-smooth | $\ell R^2/\varepsilon + (R\sigma/\varepsilon)^2$ |
| SGD, $\alpha$-strongly convex + $L$-Lipschitz | $L^2/(\alpha\varepsilon) + \sigma^2/(\alpha\varepsilon)$ |
| SGD, $\alpha$-strongly convex + $\ell$-smooth | $\kappa \log\left(\ell R^2/\varepsilon\right) + \sigma^2/(\alpha\varepsilon)$ |
| Mini-batch SGD, convex + $L$-Lipschitz | $m\left((RL/\varepsilon)^2 + (R\sigma/\varepsilon)^2\right)$ |
| Mini-batch SGD, convex + $\ell$-smooth | $m\left(\ell R^2/\varepsilon + (R\sigma/\varepsilon)^2\right)$ |
| Mini-batch SGD, $\alpha$-strongly convex + $L$-Lipschitz | $m\left(L^2/(\alpha\varepsilon) + \sigma^2/(\alpha\varepsilon)\right)$ |
| Mini-batch SGD, $\alpha$-strongly convex + $\ell$-smooth | $m\left(\kappa \log\left(\ell R^2/\varepsilon\right) + \sigma^2/(\alpha\varepsilon)\right)$ |
| SVRG, convex + $L$-Lipschitz | $R^2(L^2 + \sigma^2)/\varepsilon^2$ |
| SVRG, convex + $\ell$-smooth | $R^2\sigma^2/\varepsilon^2 + \sqrt{\ell R^2/\varepsilon}$ |
| SVRG, $\alpha$-strongly convex + $L$-Lipschitz | $(L^2 + \sigma^2)/(\alpha\varepsilon)$ |
| SVRG, $\alpha$-strongly convex + $\ell$-smooth | $\sigma^2/(\alpha\varepsilon) + \sqrt{\kappa}\log\left(\ell R^2/\varepsilon\right)$ |

## Techniques

### [[Gradient Descent]] Decomposition

$$
\begin{aligned}
f\left(x_s\right)-f\left(x^*\right) & \leq g_s^T\left(x_s-x^*\right) \\
& =\frac{1}{\eta}\left(x_s-y_{s+1}\right)^T\left(x_s-x^*\right) \\
& =\frac{1}{2 \eta}\left[\left\|x_s-y_{s+1}\right\|^2+\left\|x_s-x^*\right\|^2-\left\|y_{s+1}-x^*\right\|^2\right]
\end{aligned}
$$

Here $g_s \in \partial f(x_s)$ and $y_{s+1} = x_s - \eta\, g_s$ is the gradient step; the first line is convexity. The last step uses the polarization identity

$$
u^T v = \frac{1}{2}\left(\|u\|^2 + \|v\|^2 - \|u - v\|^2\right),
$$

with

$$
u = x_s - y_{s+1}, \quad v = x_s - x^*.
$$

(Both this identity and the telescoping identity below are sanity-checked numerically after the Change of variables section.)

### [[Gradient Descent]] Telescoping

Summing the decomposition over $s$, and bounding $\frac{1}{2\eta}\|x_s - y_{s+1}\|^2 = \frac{\eta}{2}\|g_s\|^2 \leq \frac{\eta L^2}{2}$ in the Lipschitz case, gives

$$
\begin{aligned}
\sum_{s=1}^t \left( \frac{1}{2\eta} \left[\|x_s - x^*\|^2 - \|x_{s+1} - x^*\|^2\right] + \frac{\eta L^2}{2} \right) &= \frac{1}{2\eta} \sum_{s=1}^t \left[\|x_s - x^*\|^2 - \|x_{s+1} - x^*\|^2\right] + \frac{\eta L^2}{2}t \\
&= \frac{1}{2\eta} \left[\|x_1 - x^*\|^2 - \|x_{t+1} - x^*\|^2\right] + \frac{\eta L^2}{2}t,
\end{aligned}
$$

where in the second line we used *telescoping*: the sum $\sum_{s=1}^t (a_s - a_{s+1})$ collapses as

$$
\begin{aligned}
\sum_{s=1}^t (a_s - a_{s+1}) &= \sum_{s=1}^t a_s - \sum_{s=1}^t a_{s+1} \\
&= \sum_{s=1}^t a_s - \sum_{s=2}^{t+1} a_s \\
&= \left[ a_1 + \sum_{s=2}^t a_s \right] - \left[ \sum_{s=2}^t a_s + a_{t+1} \right] \\
&= a_1 - a_{t+1}.
\end{aligned}
$$

### [[Gradient Descent]] Change of variables

For example:

1. Parameterize the segment: define $g(t) = f(y + t(x - y)), \quad t \in [0, 1]$. Then $g(0) = f(y)$ and $g(1) = f(x)$.
2. Differentiate along the segment: by the multivariate chain rule, $g'(t) = \nabla f(y + t(x - y))^T (x - y)$.
3. Apply the fundamental theorem of calculus and subtract off the linear part:

   $$
   f(x) - f(y) = g(1) - g(0) = \int_0^1 g'(t) \, dt = \int_0^1 \nabla f(y + t(x - y))^T (x - y) \, dt,
   $$

   $$
   f(x) - f(y) - \langle \nabla f(x), x - y \rangle = \int_0^1 \left[ \nabla f(y + t(x - y)) - \nabla f(x) \right]^T (x - y) \, dt.
   $$
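As a quick sanity check, the short Python snippet below verifies the polarization and telescoping identities used above numerically; the vectors and dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Polarization identity: u^T v = (||u||^2 + ||v||^2 - ||u - v||^2) / 2
u, v = rng.normal(size=5), rng.normal(size=5)
assert np.isclose(u @ v, 0.5 * (u @ u + v @ v + -(u - v) @ (u - v)))

# Telescoping: sum_{s=1}^{t} (a_s - a_{s+1}) = a_1 - a_{t+1}
a = rng.normal(size=11)  # a[0] = a_1, ..., a[-1] = a_{t+1}, with t = 10
assert np.isclose(np.sum(a[:-1] - a[1:]), a[0] - a[-1])
```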
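The change-of-variables identity from step 3 can likewise be checked numerically; the log-sum-exp test function and the quadrature grid below are arbitrary illustrative choices.

```python
import numpy as np

# f(x) = log(sum_i exp(x_i)) is convex and smooth; its gradient is softmax(x)
def f(x):
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), rng.normal(size=4)

# Midpoint rule for the line integral int_0^1 grad f(y + t(x - y))^T (x - y) dt
ts = (np.arange(1_000) + 0.5) / 1_000
integral = np.mean([grad_f(y + t * (x - y)) @ (x - y) for t in ts])

print(f(x) - f(y), integral)  # the two values agree to high precision
```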
### [[Gradient Descent]] Deriving other rates from the strongly convex rate

1. Add a small quadratic term to the convex function. Given the original convex function $f(x)$, add a strongly convex (quadratic) regularizer to obtain

   $$
   \tilde{f}(x) = f(x) + \frac{\tilde{\alpha}}{2} \|x - x_1\|^2, \quad \tilde{\alpha} = \frac{\varepsilon}{R^2}.
   $$

   This makes $\tilde{f}$ an $\tilde{\alpha}$-strongly convex function (while it remains smooth or Lipschitz).
2. Run GD on $\tilde{f}$.
   - Since $\tilde{f}$ is strongly convex, we already know that after enough gradient-descent steps the iterates converge to its optimal value at an exponential rate ($\exp(-t/\kappa)$).
   - Choose the number of steps $t$ so that $\tilde{f}(x_t)$ is within $\varepsilon/2$ of $\min \tilde{f}$, i.e. $\varepsilon/2$-accurate.
3. Transfer the guarantee on $\tilde{f}$ back to the original $f$. Following (7.10)-(7.14) in the source text, one can show

   $$
   f(x_t) \leq \tilde{f}(x_t) \leq \min \tilde{f} + \frac{\varepsilon}{2} \leq f(x^*) + \varepsilon,
   $$

   where the last inequality holds because $\min \tilde{f} \leq \tilde{f}(x^*) = f(x^*) + \frac{\tilde{\alpha}}{2}\|x^* - x_1\|^2 \leq f(x^*) + \frac{\varepsilon}{2}$. In other words, an $\varepsilon/2$-optimal point for $\tilde{f}$ is also an $\varepsilon$-optimal point for $f$.
4. Therefore, to reach $\varepsilon$-accuracy on $f$, it suffices to reach $\frac{\varepsilon}{2}$-accuracy on $\tilde{f}$. Since $\tilde{f}$ is strongly convex, we can directly invoke the exponential-convergence result for strongly convex + smooth (or strongly convex + Lipschitz) functions to compute the required number of iterations:

   $$
   t = O\left( \kappa \log \frac{1}{\varepsilon} \right) \implies O\left( \frac{\ell R^2}{\varepsilon} \log \frac{\ell R^2}{\varepsilon} \right) \quad \text{(smooth case)},
   $$

   since here $\kappa = \ell/\tilde{\alpha} = \ell R^2/\varepsilon$. A runnable sketch of this reduction follows the list.
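Below is a minimal sketch of this reduction, not the source's implementation: $f$ is a toy rank-deficient least-squares objective (convex and smooth but not strongly convex), and the problem data, $\varepsilon$, and the radius bound $R$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy objective: rank-deficient least squares, f(x) = 0.5 ||Ax - b||^2.
# A has rank 2 < dim 4, so f is convex and smooth but NOT strongly convex.
A = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 4))
b = rng.normal(size=8)
f      = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
f_star = f(np.linalg.pinv(A) @ b)          # optimal value of the original f

eps = 1e-2                                 # target accuracy on f
x1  = np.zeros(4)                          # initial point
R   = 2.0 * np.linalg.norm(np.linalg.pinv(A) @ b)  # valid bound on ||x1 - x*||
alpha = eps / R**2                         # strong convexity of the regularizer

# Regularized objective f~(x) = f(x) + (alpha/2) ||x - x1||^2
grad_ft = lambda x: grad_f(x) + alpha * (x - x1)
ell   = np.linalg.norm(A.T @ A, 2) + alpha # smoothness constant of f~
kappa = ell / alpha                        # condition number of f~

# Run GD on f~ for ~ kappa * log(ell R^2 / eps) steps, per the strongly
# convex + smooth rate from the summary table.
x = x1.copy()
for _ in range(int(kappa * np.log(ell * R**2 / eps))):
    x -= (1.0 / ell) * grad_ft(x)

print(f(x) - f_star)                       # should be below eps: eps-optimal on f
```

With $x_1 = 0$, taking $R$ as twice the norm of the minimum-norm solution guarantees $\|x_1 - x^*\| \leq R$, so $\tilde{\alpha} = \varepsilon/R^2$ matches the construction in step 1.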