- [Convex Sets Intro](Background/Convex%20Sets%20Intro.md)
- [Gradient, Hessian Characterization of Convex Functions](CVX%20Analysis/Gradient,%20Hessian%20Characterization%20of%20Convex%20Functions.md)
- [Simply Legendre Type](Simply%20Legendre%20Type.md)
$
\newcommand{\R}{\mathbb{R}}
\newcommand{\e}{\mathbf{\vec{e}}}
$
---
### **Intro**
The Bregman distance generalizes the Euclidean distance. Recall that the Euclidean norm between 2 points appeared Rockafellar's proximal point method; it also appeared in the [Proximal Gradient, Mapping and Envelope](Proximal%20Methods/Proximal%20Gradient,%20Mapping%20and%20Envelope.md).
In this article, we will faithfully follow Amir Beck's First order method[^1] chapter 9, and then we will augment it with classics Rockafellar in convex analysis, and Heinz's NoLips paper [^3].
The important part is, these ideas can generalize the concepts of strong convexity and smoothness into much general setting for optimizations and their analysis.
#### **Definition (9.2) | Bregman Divergence Legendre Type**
> Let $\omega$ be a proper closed convex function that is differentiable over $\text{dom}(\partial \omega)$. The Bregman distance associated with $\omega$ is the function $D_\omega: \text{dom}(\omega) \times \text{dom}(\partial \omega)\mapsto \mathbb R$, given by
>
> $
> \begin{aligned}
> D_{\omega}(x,y) &= \omega(x) - \omega(y) - \langle \nabla \omega(y), x - y\rangle.
> \end{aligned}
> $
**Observations**
$D_\omega(x, x_0)$ is, $\omega(x) - (\omega(x_0) + \langle \nabla\omega (x_0), x - x_0\rangle)$, it's the difference between the function $\omega(x)$ and the linearization of $\omega$ at $x_0$.
It contains the second-order information for the differentiable function. Because $\omega$ is closed and convex, we immediately know that $D_\omega(x, y) \ge 0$, by convexity.
**Remarks**
The definition is taken from Beck [^1].
For more about Bregman divergence in general, visit [Wikipedia Bregman Divergence](https://en.wikipedia.org/wiki/Bregman_divergence#cite_note-cs.utexas.edu-1).
#### **The Legendre Type**
Bregman distance of $\omega$ function of Legendre Type has favorable properties.
In convex analysis, a Legendre function has a one-to-one gradient mapping, satisfying $(\nabla f)^{-1} = \nabla f^\star$, enabling inversion.
See more about the Legendre Type function in the prerequisite listed at the start of the file.
**Remarks**
The assumptions made in Amir's Beck [^1] writing differ entirely from those above. Here, we used the definition 2.1 in a paper by Heinz et. al.[^3].
Furthermore, recall that a function of Legendre type on a convex set $C$ is only differentiable on any convex subset of $\text{dom}(\partial f)$.
The above fact should clarify the domain of the Bregman distance being $\text{dom}(\omega)\times \text{dom}(\partial \omega)$ because the domain of the subgradient may not be the domain of the function, for a differentiable convex function.
#### **Some Subtleties**
Using a Legendre function to induce Bregman Divergence is not absolute necessary.
Its presence is for a certain degree of theoretical beauty and, necessary for algorithms that are relatively smooth.
For this reason, the following weaker definition is presented.
#### **Definition | Bregman Divergence of Differentiable functions**
> Let $\omega: \R^n \rightarrow \overline \R$ be a differentiable function.
> Then, the Bregman Divergence is a mapping $\R^n \times \text{dom}\; \nabla \omega(x) \rightarrow \R$ defined by:
> $
> \begin{aligned}
> D_\omega(x, y) = \omega(x) - \omega(y) - \langle \nabla \omega(y), x - y\rangle.
> \end{aligned}
> $
---
### **The basic properties**
Some of the basic properties of Bregman divergence can be direct observed.
No Legendere assumption.
#### **Claim | linear operators on the set of differentiable function**
> Let $h: \R^n \rightarrow \R$ be a differentiable function.
> Denote the class of differentiable functions to be $\mathcal C^1$.
> Then, for all $\alpha, \beta \in \R^n$, and $h_1, h_2$ from $\mathcal C^1$, it satisfies
> $
> \begin{aligned}
> D_{\alpha h_1 + \beta h_2}(x, y) = \alpha D_{h_1}(x, y) + \beta D_{h_2} (x, y).
> \end{aligned}
> $
> And, the null space of the linear operators are linear functions $h \in \mathcal C^1$.
**Proof**
Obvious. $\blacksquare$
---
### **The 3, 4 Points Lemma**
Bregman divergence can generalizes the cosin formula in the Euclidean space.
No Legendre Assumption.
#### **Lemma | 3 Points lemma, Cosine law**
> Let $\omega: \mathbb E\mapsto (-\infty, \infty]$, be differentiable, then for all $a, b \in \text{dom}(\partial \omega)$, and $c \in \text{dom}(\omega)$, we have
> $
> \begin{aligned}
> \langle \nabla \omega(b) - \nabla w(a), c - a\rangle
> =
> D_\omega(c, a) + D_\omega(a, b) - D_\omega(c, b).
> \end{aligned}
> $
**Observations**
Pay attention to that, $a, b\in \text{dom}(\partial \omega)$ but $c \in \text{dom}(\omega)$, $a, b$ comes from a smaller set than where $c$ is coming from.
**Proof**
$
\begin{aligned}
D_\omega(c, a) + D_\omega(a, b) &=
\omega(c) - \omega(a) - \langle \nabla \omega(a), c - a\rangle
+
\omega(a) - \omega(b) - \langle \nabla \omega(b), a - b\rangle
\\
&=
\omega(c) - \omega(b) - \langle \nabla \omega(a), c- a\rangle -
\langle \omega(b), a - b\rangle
\\
&=
\omega(c) - \omega(b) + \langle -\nabla \omega(a), c- a\rangle
+
\langle \omega(b), b - a\rangle
\\
&=
\omega(c) - \omega(b) + \langle -\nabla \omega(a), c- a\rangle
+ \langle \omega(b), b - c + c - a\rangle
\\
&=
\omega(c) - \omega(b) - \langle \nabla \omega(b), c - b\rangle +
\langle \nabla \omega(b) - \nabla \omega(a), c - a\rangle
\\
&=
D_\omega(c, b) + \langle \nabla \omega(b) - \nabla \omega(a), c - a\rangle.
\end{aligned}
$
It is simple algebra.
$\blacksquare$
**Remark**
When $a, b, c$ are all on one line, the formula simplifies.
If we have $c = b$, then the 3 points lemma becomes:
$
\begin{aligned}
\langle \nabla \omega(b) - \nabla \omega(a), b - a\rangle =
D_\omega(b, a) + D_\omega(a, b),
\end{aligned}
$
#### **Example | Euclidean Cousin law**
> Let $\omega = \Vert \cdot\Vert^2/2$, then the Bregman divergence cosin equality has, for all $a, b , c \in \R^n$:
> $
> \begin{aligned}
> (1/2)\Vert c - b\Vert^2 = (1/2)\Vert c- a\Vert^2 + (1/2)\Vert b - a\Vert^2 -
> 2 \langle b - a, c - a\rangle.
> \end{aligned}
> $
We speculate that if there exist points $a, b \in \text{dom}(\partial f)$ and $c \in \text{dom}(f)$ such that $\langle \nabla \omega(b) - \nabla \omega(a), c - a\rangle = 0$, then it would recover the analogous Pythagoras theorem for the Bregman divergence.
#### **Lemma | 4 points anchoring lemma**
> Let $x, \bar x, y \in \R^n, \lambda \in \R$ be arbitrary.
> Let $z = \lambda x + (1 - \lambda)\bar x$, then it has the following equality
> $
> \begin{aligned}
> D_f(z, y) &= D_f(\bar x, y) + \lambda \langle x - \bar x, \nabla f(\bar x) - \nabla f(y)\rangle + D_f(z, \bar x). \\
> &= D_f(\bar x, y) + \langle z - \bar x, \nabla f(\bar x) - \nabla f(y)\rangle + D_f(z, \bar x).
> \end{aligned}
> $
**Proof**
It's not hard to verify the following intermediate result which is direct by the definition of Bregman Divergence:
$
\begin{aligned}
f(z) = f(\bar x) + \langle \lambda(x - \bar x), \nabla f(\bar x)\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x).
\end{aligned}\tag{a}
$
Then, it has
$
\begin{aligned}
& D_f(z, y)
\\
&= f(z) - f(y) - \langle \nabla f(y), z - y\rangle
\\
&\underset{\text{(a)}}{=}
f(\bar x) + \langle \lambda(x - \bar x), \nabla f(\bar x)\rangle + D_f(\lambda(x - \bar x) + \bar x, \bar x) - f(y) - \langle \nabla f(y), z - y\rangle
\\
&= f(\bar x) - f(y) + \lambda \langle\nabla f(\bar x) ,x - \bar x \rangle - \langle \nabla f(y), \lambda x + (1 - \lambda)\bar x - y\rangle
+ D_f(\lambda(x - \bar x) + \bar x, \bar x)
\\
&= f(\bar x) - f(y)
+ \lambda \langle\nabla f(\bar x) ,x - \bar x \rangle
- \langle \nabla f(y), \lambda(x - y) + (1 - \lambda)(\bar x - y)\rangle
+ D_f(\lambda(x - \bar x) + \bar x, \bar x)
\\
&= f(\bar x) - f(y)
+ \lambda \langle\nabla f(\bar x) ,x - \bar x \rangle
- \lambda\langle \nabla f(y), x - y \rangle - (1 - \lambda)\langle \nabla f(y), \bar x - y\rangle
+ D_f(\lambda(x - \bar x) + \bar x, \bar x)
\\
&=
f(\bar x) - f(y)
+ \lambda \langle\nabla f(\bar x) ,x - \bar x \rangle
- \lambda\langle \nabla f(y), x - \bar x + \bar x - y \rangle \\ &\quad
- (1 - \lambda)\langle \nabla f(y), \bar x - y\rangle
+ D_f(\lambda(x - \bar x) + \bar x, \bar x)
\\
&=
f(\bar x) - f(y)
+ \lambda \langle\nabla f(\bar x) - \nabla f(y) ,x - \bar x \rangle
- \lambda\langle \nabla f(y), \bar x - y \rangle \\ &\quad
- (1 - \lambda)\langle \nabla f(y), \bar x - y\rangle
+ D_f(\lambda(x - \bar x) + \bar x, \bar x)
\\
&=
f(\bar x) - f(y)
+ \lambda \langle\nabla f(\bar x) - \nabla f(y) ,x - \bar x \rangle
- \langle \nabla f(y), \bar x - y\rangle
+ D_f(\lambda(x - \bar x) + \bar x, \bar x)
\\
&= D_f(\bar x, y) + \lambda \langle \nabla f(\bar x) - \nabla f(y), x - \bar x\rangle
+ D_f(\lambda(x - \bar x) + \bar x, \bar x).
\end{aligned}
$
$\blacksquare$
---
### **Bregman Divergence of Legendre Type**
We drive some more properties when the function used to induced the Bregman Divergence is of Legendre type.
---
### **Bregman Proximal Mapping**
One important possibility introduced by Bregman divergence is the generalization of Euclidean based proximal mapping.
See [Bregman Proximal Mapping](Bregman%20Proximal%20Mapping.md) for more information.
---
### **Examples of Bregman Divergence**
For the first example we take a look at the famous KL-Divergence.
For more about KL-Divergence, visits [KL-Divergence](KL-Divergence.md).
#### **Example 0 | KL Divergence is Bregman Div of Negative Entropy over Unit Simplex**
> Let $\omega(x) =\sum x\odot \log(x)$ for all $x\in \mathbb R_{++}$, then $B_\omega(x)$ gives the KL-Divergence.
**Demonstrations**
We consider the gradient of $\omega$ directly giving
$
\begin{aligned}
\nabla \omega(y) &=
\nabla \left[\sum x\odot\log(x)\right]
\\
&=\sum_{i = 1}^{n} \partial_x[x\log(x)](y_i)\e_i
\\
&= \sum_{i = 1}^{n} [1 + \log(\cdot)](y_i)\e_i
\\
\langle \nabla \omega (y), y - x\rangle
&=
\sum_{ i =1}^{n}
(1 + \log y_i) \langle y - x, \e_i\rangle
\\
&=
\sum_{i = 1}^{n} (1 + \log y_i)(y_i - x_i).
\end{aligned}
$
Then using the definition of a Bregman Divergence we have the expression that for all $x \in \mathbb E$:
$
\begin{aligned}
D_\omega(x, y) &=
\omega(x) - \omega(y) - \langle \nabla \omega(y), y - x\rangle
\\
&=
\sum_{i = 1}^{n} x_i \log x_i - \sum_{ i =1}^{n} y_i \log y_i
-
\sum_{i = 1}^{n} (1 + \log y_i)(y_i - x_i)
\\
&= \sum_{i = 1}^{n} x_i \log x_i - \sum_{ i =1}^{n} y_i \log y_i
-
\sum_{i = 1}^{n} (\log y_i) (y_i - x_i) -
\left(
\sum_{i = 1}^{n} y_i - x_i
\right)
\\
&=
\sum_{i = 1}^{n} x_i \log x_i - \sum_{i = 1}^{n}(\log y_i)(-x_i) -
\underbrace{\left(
\sum_{i = 1}^{n} y_i - x_i
\right)}_{=0}
\\
&=
\sum_{i = 1}^{n}x_i \log(x_i/y_i).
\end{aligned}
$
Which is what we aimed to demonstrate.
#### **Example 1 | The Energy Function**
Let's say that the energy function is $f$, defined as $\Vert x\Vert^2/2$.
This is a Legendre function because it's differentiable on the entirety of $\mathbb R^n$.
Using the definition, the Bregman divergence is
$
\begin{aligned}
D_f(x, y) &= \frac{\Vert x\Vert^2}{2} - \frac{\Vert y\Vert^2}{2}
- \left\langle \nabla
\left[ \frac{\Vert \cdot\Vert^2}{2}\right] (y), x - y
\right\rangle
\\
&= \frac{\Vert x\Vert^2}{2} - \frac{\Vert y\Vert^2}{2}
- \left\langle \nabla
y, x - y
\right\rangle
\\
&=
\frac{\Vert x\Vert^2}{2} - \frac{\Vert y\Vert^2}{2}
- \left\langle \nabla
y, x
\right\rangle + \Vert y\Vert^2
\\
&=
\frac{\Vert x\Vert^2}{2} - \frac{\Vert y\Vert^2}{2} - \langle y, x\rangle
\\
&= \frac{\Vert x - y\Vert^2}{2}.
\end{aligned}
$
Immediately observe that the above formula is used everywhere in analyzing optimization algorithms.
If we generalize the above expression into different contexts, the Bregman divergence measures the proximity of 2 points, using a $f$, a function of Legendre Type.
The above Euclidean distance measure is symmetric and isotropic.
Other Legendre Type functions may not provide those properties.
#### **Example 2 | Normed linear composite squared**
The function $f(x) = 1/2\Vert Ax\Vert^2$ can also induce a Bregman Divergence and it will work very similar to the function $x \rightarrow (1/2) \Vert x\Vert^2$.
It is a semi-norm:
$
\begin{aligned}
D_f(x, y) &= \frac{1}{2}\Vert Ax\Vert^2 - \frac{1}{2}\Vert Ay\Vert^2
- \langle Ay, A(x - y)\rangle
\\
&= \frac{1}{2}\Vert Ax - Ay\Vert^2.
\end{aligned}
$
**WARNING**. This is not a Legendre typed function.
[^1]: A. Beck, _First-Order Methods in Optimization | SIAM Publications Library_. in MOS-SIAM Series in Optimization. SIAM. Accessed: Oct. 19, 2023. [Online]. Available: [https://epubs.siam.org/doi/book/10.1137/1.9781611974997](https://epubs.siam.org/doi/book/10.1137/1.9781611974997)
[^3]: H. H. Bauschke, J. Bolte, and M. Teboulle, “A Descent Lemma Beyond Lipschitz Gradient Continuity: First-Order Methods Revisited and Applications,” Mathematics of OR, vol. 42, no. 2, pp. 330–348, May 2017, doi: 10.1287/moor.2016.0817.