# Learning Rate Warmup
- Use a small learning rate at the start of training, then switch to the target (initial) learning rate once training has stabilized
- The learning rate increases linearly from 0 to the initial rate
- If the first $m$ batches are used for warmup and the initial learning rate is $\eta$, then at batch $i$, $1 \leq i \leq m$, the learning rate is $\frac{i\eta}{m}$ (see the sketch below)
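A minimal Python sketch of this linear warmup schedule; the function name `warmup_lr` and the demo values `m = 5`, `eta = 0.1` are illustrative choices, not from the source:

```python
def warmup_lr(i: int, m: int, eta: float) -> float:
    """Linear warmup: scale the initial rate eta by i/m for the first m batches."""
    if i <= m:
        return i * eta / m  # ramp linearly from eta/m up to eta
    return eta              # after warmup, stay at the full initial rate


# Illustrative values (not from the source): warm up over m = 5 batches
# toward an initial learning rate eta = 0.1.
m, eta = 5, 0.1
for i in range(1, m + 2):
    print(f"batch {i}: lr = {warmup_lr(i, m, eta):.3f}")
# batch 1: lr = 0.020, ..., batch 5: lr = 0.100, then constant at 0.100
```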