# Label Smoothing

- The [Dense](Dense.md) layer is generally the last one; combined with softmax it outputs a [Probability](Probability.md) distribution
- Assume the true label is $y$; the ground-truth [Probability](Probability.md) distribution is then $p_i = 1$ if $i = y$ and $0$ otherwise
- During training, minimize the [Cross Entropy](Cross%20Entropy.md) loss to make these [Distributions](Distributions.md) similar
- With one-hot targets, $\ell(p, q) = -\log q_y = -z_y + \log\left(\sum^{K}_{i=1}\exp(z_i)\right)$
    - The optimal solution is $z^{\ast}_{y} = \infty$
    - The output scores are encouraged to be distinctive, which leads to overfitting
- Label smoothing instead uses the soft targets $q_i = \cases{1-\epsilon & \text{if } i=y\\ \frac{\epsilon}{K-1} & \text{otherwise}}$ (see the first sketch below)
- The optimal solution becomes
    - $z^{\ast}_{i} = \log\left((K-1)(1-\epsilon)/\epsilon\right) + \alpha$ if $i = y$
    - $z^{\ast}_{i} = \alpha$ otherwise
        - $\alpha$ is any real number
    - A finite output from the last layer that generalizes well
- If $\epsilon = 0$, the gap $\log\left((K-1)\frac{1-\epsilon}{\epsilon}\right)$ is $\infty$
- As $\epsilon$ increases, the gap decreases (tabulated in the second sketch below)
- If $\epsilon = \frac{K-1}{K}$, all optimal $z^{\ast}_{i}$ are identical
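
A minimal NumPy sketch of the smoothed targets and the resulting cross-entropy; the function names, the example logits, and $\epsilon = 0.1$ are illustrative choices, not from the source:

```python
import numpy as np

def smoothed_targets(y, K, eps):
    """Soft targets for label smoothing: 1 - eps on the true class,
    eps / (K - 1) spread over the remaining K - 1 classes."""
    q = np.full(K, eps / (K - 1))
    q[y] = 1.0 - eps
    return q

def cross_entropy(z, q):
    """Cross-entropy -sum_i q_i * log softmax(z)_i, computed stably."""
    z = z - z.max()                       # shift logits for numerical stability
    log_p = z - np.log(np.exp(z).sum())   # log-softmax
    return -(q * log_p).sum()

K, y, eps = 5, 2, 0.1                     # illustrative values
z = np.array([1.0, 0.5, 3.0, -0.2, 0.1])  # example logits
print(cross_entropy(z, smoothed_targets(y, K, eps)))  # smoothed loss
print(cross_entropy(z, smoothed_targets(y, K, 0.0)))  # hard one-hot loss
```

With $\epsilon = 0$ this reduces to the ordinary one-hot cross-entropy, so the smoothed loss is a strict generalization.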
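
To see the gap $z^{\ast}_{y} - z^{\ast}_{i \neq y}$ shrink as $\epsilon$ grows, one can tabulate the closed-form expression; $K = 10$ is an arbitrary choice here:

```python
import numpy as np

K = 10  # assumed number of classes
for eps in [1e-3, 0.1, 0.5, (K - 1) / K]:
    # gap between the optimal true-class logit and any other optimal logit
    gap = np.log((K - 1) * (1 - eps) / eps)
    print(f"eps = {eps:.3f} -> optimal logit gap = {gap:.3f}")
```

At $\epsilon = \frac{K-1}{K}$ the gap evaluates to $\log 1 = 0$, matching the note that all optimal $z^{\ast}_{i}$ become identical.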