Bregman Divergence Introduction - Alto's Vault (Hongda)

- [Convex Sets Intro](Background/Convex%20Sets%20Intro.md) - [Gradient, Hessian Characterization of Convex Functions](CVX%20Analysis/Gradient,%20Hessian%20Characterization%20of%20Convex%20Functions.md) - [Simply Legendre Type](Simply%20Legendre%20Type.md) $ \newcommand{\R}{\mathbb{R}} \newcommand{\e}{\mathbf{\vec{e}}} $ --- ### **Intro** The Bregman distance generalizes the Euclidean distance. Recall that the Euclidean norm between 2 points appeared Rockafellar's proximal point method; it also appeared in the [Proximal Gradient, Mapping and Envelope](Proximal%20Methods/Proximal%20Gradient,%20Mapping%20and%20Envelope.md). In this article, we will faithfully follow Amir Beck's First order method[^1] chapter 9, and then we will augment it with classics Rockafellar in convex analysis, and Heinz's NoLips paper [^3]. The important part is, these ideas can generalize the concepts of strong convexity and smoothness into much general setting for optimizations and their analysis. #### **Definition (9.2) | Bregman Divergence Legendre Type** > Let $\omega$ be a proper closed convex function that is differentiable over $\text{dom}(\partial \omega)$. The Bregman distance associated with $\omega$ is the function $D_\omega: \text{dom}(\omega) \times \text{dom}(\partial \omega)\mapsto \mathbb R$, given by > > $ > \begin{aligned} > D_{\omega}(x,y) &= \omega(x) - \omega(y) - \langle \nabla \omega(y), x - y\rangle. > \end{aligned} > $ **Observations** $D_\omega(x, x_0)$ is, $\omega(x) - (\omega(x_0) + \langle \nabla\omega (x_0), x - x_0\rangle)$, it's the difference between the function $\omega(x)$ and the linearization of $\omega$ at $x_0$. It contains the second-order information for the differentiable function. Because $\omega$ is closed and convex, we immediately know that $D_\omega(x, y) \ge 0$, by convexity. **Remarks** The definition is taken from Beck [^1]. For more about Bregman divergence in general, visit [Wikipedia Bregman Divergence](https://en.wikipedia.org/wiki/Bregman_divergence#cite_note-cs.utexas.edu-1). #### **The Legendre Type** Bregman distance of $\omega$ function of Legendre Type has favorable properties. In convex analysis, a Legendre function has a one-to-one gradient mapping, satisfying $(\nabla f)^{-1} = \nabla f^\star$, enabling inversion. See more about the Legendre Type function in the prerequisite listed at the start of the file. **Remarks** The assumptions made in Amir's Beck [^1] writing differ entirely from those above. Here, we used the definition 2.1 in a paper by Heinz et. al.[^3]. Furthermore, recall that a function of Legendre type on a convex set $C$ is only differentiable on any convex subset of $\text{dom}(\partial f)$. The above fact should clarify the domain of the Bregman distance being $\text{dom}(\omega)\times \text{dom}(\partial \omega)$ because the domain of the subgradient may not be the domain of the function, for a differentiable convex function. #### **Some Subtleties** Using a Legendre function to induce Bregman Divergence is not absolute necessary. Its presence is for a certain degree of theoretical beauty and, necessary for algorithms that are relatively smooth. For this reason, the following weaker definition is presented. #### **Definition | Bregman Divergence of Differentiable functions** > Let $\omega: \R^n \rightarrow \overline \R$ be a differentiable function. > Then, the Bregman Divergence is a mapping $\R^n \times \text{dom}\; \nabla \omega(x) \rightarrow \R$ defined by: > $ > \begin{aligned} > D_\omega(x, y) = \omega(x) - \omega(y) - \langle \nabla \omega(y), x - y\rangle. > \end{aligned} > $ --- ### **The basic properties** Some of the basic properties of Bregman divergence can be direct observed. No Legendere assumption. #### **Claim | linear operators on the set of differentiable function** > Let $h: \R^n \rightarrow \R$ be a differentiable function. > Denote the class of differentiable functions to be $\mathcal C^1$. > Then, for all $\alpha, \beta \in \R^n$, and $h_1, h_2$ from $\mathcal C^1$, it satisfies > $ > \begin{aligned} > D_{\alpha h_1 + \beta h_2}(x, y) = \alpha D_{h_1}(x, y) + \beta D_{h_2} (x, y). > \end{aligned} > $ > And, the null space of the linear operators are linear functions $h \in \mathcal C^1$. **Proof** Obvious. $\blacksquare$ --- ### **The 3, 4 Points Lemma** Bregman divergence can generalizes the cosin formula in the Euclidean space. No Legendre Assumption. #### **Lemma | 3 Points lemma, Cosine law** > Let $\omega: \mathbb E\mapsto (-\infty, \infty]$, be differentiable, then for all $a, b \in \text{dom}(\partial \omega)$, and $c \in \text{dom}(\omega)$, we have > $ > \begin{aligned} > \langle \nabla \omega(b) - \nabla w(a), c - a\rangle > = > D_\omega(c, a) + D_\omega(a, b) - D_\omega(c, b). > \end{aligned} > $ **Observations** Pay attention to that, $a, b\in \text{dom}(\partial \omega)$ but $c \in \text{dom}(\omega)$, $a, b$ comes from a smaller set than where $c$ is coming from. **Proof** $ \begin{aligned} D_\omega(c, a) + D_\omega(a, b) &= \omega(c) - \omega(a) - \langle \nabla \omega(a), c - a\rangle + \omega(a) - \omega(b) - \langle \nabla \omega(b), a - b\rangle \\ &= \omega(c) - \omega(b) - \langle \nabla \omega(a), c- a\rangle - \langle \omega(b), a - b\rangle \\ &= \omega(c) - \omega(b) + \langle -\nabla \omega(a), c- a\rangle + \langle \omega(b), b - a\rangle \\ &= \omega(c) - \omega(b) + \langle -\nabla \omega(a), c- a\rangle + \langle \omega(b), b - c + c - a\rangle \\ &= \omega(c) - \omega(b) - \langle \nabla \omega(b), c - b\rangle + \langle \nabla \omega(b) - \nabla \omega(a), c - a\rangle \\ &= D_\omega(c, b) + \langle \nabla \omega(b) - \nabla \omega(a), c - a\rangle. \end{aligned} $ It is simple algebra. $\blacksquare$ **Remark** When $a, b, c$ are all on one line, the formula simplifies. If we have $c = b$, then the 3 points lemma becomes: $ \begin{aligned} \langle \nabla \omega(b) - \nabla \omega(a), b - a\rangle = D_\omega(b, a) + D_\omega(a, b), \end{aligned} $ #### **Example | Euclidean Cousin law** > Let $\omega = \Vert \cdot\Vert^2/2$, then the Bregman divergence cosin equality has, for all $a, b , c \in \R^n$: > $ > \begin{aligned} > (1/2)\Vert c - b\Vert^2 = (1/2)\Vert c- a\Vert^2 + (1/2)\Vert b - a\Vert^2 - > 2 \langle b - a, c - a\rangle. > \end{aligned} > $ We speculate that if there exist points $a, b \in \text{dom}(\partial f)$ and $c \in \text{dom}(f)$ such that $\langle \nabla \omega(b) - \nabla \omega(a), c - a\rangle = 0$, then it would recover the analogous Pythagoras theorem for the Bregman divergence. #### **Lemma | 4 points anchoring lemma** > Let $x, \bar x, y \in \R^n, \lambda \in \R$ be arbitrary. > Let $z = \lambda x + (1 - \lambda)\bar x$, then it has the following equality > $ > \begin{aligned} > D_f(z, y) &= D_f(\bar x, y) + \lambda \langle x - \bar x, \nabla f(\bar x) - \nabla f(y)\rangle + D_f(z, \bar x). \\ > &= D_f(\bar x, y) + \langle z - \bar x, \nabla f(\bar x) - \nabla f(y)\rangle + D_f(z, \bar x). > \end{aligned} > $ **Proof** It's not hard to verify the following intermediate result which is direct by the definition of Bregman Divergence: $ \begin{aligned} f(z) = f(\bar x) + \langle \lambda(x - \bar x), \nabla f(\bar x)\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x). \end{aligned}\tag{a} $ Then, it has $ \begin{aligned} & D_f(z, y) \\ &= f(z) - f(y) - \langle \nabla f(y), z - y\rangle \\ &\underset{\text{(a)}}{=} f(\bar x) + \langle \lambda(x - \bar x), \nabla f(\bar x)\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x) - f(y) - \langle \nabla f(y), z - y\rangle \\ &= f(\bar x) - f(y) + \lambda \langle\nabla f(\bar x) ,x - \bar x \rangle - \langle \nabla f(y), \lambda x + (1 - \lambda)\bar x - y\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x) \\ &= f(\bar x) - f(y) + \lambda \langle\nabla f(\bar x) ,x - \bar x \rangle - \langle \nabla f(y), \lambda(x - y) + (1 - \lambda)(\bar x - y)\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x) \\ &= f(\bar x) - f(y) + \lambda \langle\nabla f(\bar x) ,x - \bar x \rangle - \lambda\langle \nabla f(y), x - y \rangle - (1 - \lambda)\langle \nabla f(y), \bar x - y\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x) \\ &= f(\bar x) - f(y) + \lambda \langle\nabla f(\bar x) ,x - \bar x \rangle - \lambda\langle \nabla f(y), x - \bar x + \bar x - y \rangle \\ &\quad - (1 - \lambda)\langle \nabla f(y), \bar x - y\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x) \\ &= f(\bar x) - f(y) + \lambda \langle\nabla f(\bar x) - \nabla f(y) ,x - \bar x \rangle - \lambda\langle \nabla f(y), \bar x - y \rangle \\ &\quad - (1 - \lambda)\langle \nabla f(y), \bar x - y\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x) \\ &= f(\bar x) - f(y) + \lambda \langle\nabla f(\bar x) - \nabla f(y) ,x - \bar x \rangle - \langle \nabla f(y), \bar x - y\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x) \\ &= D_f(\bar x, y) + \lambda \langle \nabla f(\bar x) - \nabla f(y), x - \bar x\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x). \end{aligned} $ $\blacksquare$ --- ### **Bregman Divergence of Legendre Type** We drive some more properties when the function used to induced the Bregman Divergence is of Legendre type. --- ### **Bregman Proximal Mapping** One important possibility introduced by Bregman divergence is the generalization of Euclidean based proximal mapping. See [Bregman Proximal Mapping](Bregman%20Proximal%20Mapping.md) for more information. --- ### **Examples of Bregman Divergence** For the first example we take a look at the famous KL-Divergence. For more about KL-Divergence, visits [KL-Divergence](KL-Divergence.md). #### **Example 0 | KL Divergence is Bregman Div of Negative Entropy over Unit Simplex** > Let $\omega(x) =\sum x\odot \log(x)$ for all $x\in \mathbb R_{++}$, then $B_\omega(x)$ gives the KL-Divergence. **Demonstrations** We consider the gradient of $\omega$ directly giving $ \begin{aligned} \nabla \omega(y) &= \nabla \left[\sum x\odot\log(x)\right] \\ &=\sum_{i = 1}^{n} \partial_x[x\log(x)](y_i)\e_i \\ &= \sum_{i = 1}^{n} [1 + \log(\cdot)](y_i)\e_i \\ \langle \nabla \omega (y), y - x\rangle &= \sum_{ i =1}^{n} (1 + \log y_i) \langle y - x, \e_i\rangle \\ &= \sum_{i = 1}^{n} (1 + \log y_i)(y_i - x_i). \end{aligned} $ Then using the definition of a Bregman Divergence we have the expression that for all $x \in \mathbb E$: $ \begin{aligned} D_\omega(x, y) &= \omega(x) - \omega(y) - \langle \nabla \omega(y), y - x\rangle \\ &= \sum_{i = 1}^{n} x_i \log x_i - \sum_{ i =1}^{n} y_i \log y_i - \sum_{i = 1}^{n} (1 + \log y_i)(y_i - x_i) \\ &= \sum_{i = 1}^{n} x_i \log x_i - \sum_{ i =1}^{n} y_i \log y_i - \sum_{i = 1}^{n} (\log y_i) (y_i - x_i) - \left( \sum_{i = 1}^{n} y_i - x_i \right) \\ &= \sum_{i = 1}^{n} x_i \log x_i - \sum_{i = 1}^{n}(\log y_i)(-x_i) - \underbrace{\left( \sum_{i = 1}^{n} y_i - x_i \right)}_{=0} \\ &= \sum_{i = 1}^{n}x_i \log(x_i/y_i). \end{aligned} $ Which is what we aimed to demonstrate. #### **Example 1 | The Energy Function** Let's say that the energy function is $f$, defined as $\Vert x\Vert^2/2$. This is a Legendre function because it's differentiable on the entirety of $\mathbb R^n$. Using the definition, the Bregman divergence is $ \begin{aligned} D_f(x, y) &= \frac{\Vert x\Vert^2}{2} - \frac{\Vert y\Vert^2}{2} - \left\langle \nabla \left[ \frac{\Vert \cdot\Vert^2}{2}\right] (y), x - y \right\rangle \\ &= \frac{\Vert x\Vert^2}{2} - \frac{\Vert y\Vert^2}{2} - \left\langle \nabla y, x - y \right\rangle \\ &= \frac{\Vert x\Vert^2}{2} - \frac{\Vert y\Vert^2}{2} - \left\langle \nabla y, x \right\rangle + \Vert y\Vert^2 \\ &= \frac{\Vert x\Vert^2}{2} - \frac{\Vert y\Vert^2}{2} - \langle y, x\rangle \\ &= \frac{\Vert x - y\Vert^2}{2}. \end{aligned} $ Immediately observe that the above formula is used everywhere in analyzing optimization algorithms. If we generalize the above expression into different contexts, the Bregman divergence measures the proximity of 2 points, using a $f$, a function of Legendre Type. The above Euclidean distance measure is symmetric and isotropic. Other Legendre Type functions may not provide those properties. #### **Example 2 | Normed linear composite squared** The function $f(x) = 1/2\Vert Ax\Vert^2$ can also induce a Bregman Divergence and it will work very similar to the function $x \rightarrow (1/2) \Vert x\Vert^2$. It is a semi-norm: $ \begin{aligned} D_f(x, y) &= \frac{1}{2}\Vert Ax\Vert^2 - \frac{1}{2}\Vert Ay\Vert^2 - \langle Ay, A(x - y)\rangle \\ &= \frac{1}{2}\Vert Ax - Ay\Vert^2. \end{aligned} $ **WARNING**. This is not a Legendre typed function. [^1]: A. Beck, _First-Order Methods in Optimization | SIAM Publications Library_. in MOS-SIAM Series in Optimization. SIAM. Accessed: Oct. 19, 2023. [Online]. Available: [https://epubs.siam.org/doi/book/10.1137/1.9781611974997](https://epubs.siam.org/doi/book/10.1137/1.9781611974997) [^3]: H. H. Bauschke, J. Bolte, and M. Teboulle, “A Descent Lemma Beyond Lipschitz Gradient Continuity: First-Order Methods Revisited and Applications,” Mathematics of OR, vol. 42, no. 2, pp. 330–348, May 2017, doi: 10.1287/moor.2016.0817.