ACT - Action Chunking with Transformers

From [[Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware]].

---

**Action Chunking**

Counters the [[The Compounding Error Problem in Imitation Learning|compounding error problem]] in [[- Imitation Learning -|imitation learning]] by predicting chunks of $k$ actions. The [[Policy|policy]] models $\pi_\theta(a_{t:t+k} \mid s_t)$ rather than $\pi_\theta(a_t \mid s_t)$.

**Temporal ensembling**

The policy is queried at every timestep, producing overlapping predictions for each future timestep. These are combined via a weighted average with exponentially decaying weights $w_i = \exp(-m \cdot i)$, where $i$ indexes the predictions by age and $w_0$ belongs to the oldest one. Older predictions therefore get the larger weights, and a smaller $m$ incorporates new observations faster. Algorithm 2 maintains a FIFO buffer of size $T$.

![[Pasted image 20251213100041.png]]

> [!brainwaves] Why Chunking can help
> Trades off some reactive flexibility for lower drift.

**Style Variable at Training Time**

- Full [[Transformer]] architecture, input state (images, joints, latent $z$ for multimodality at training time), output action chunk
- Encoder processes the observation
- Decoder generates the action sequence
- At training time, an additional encoder is used for a better training signal

![[Pasted image 20251213095949.png|center|697]]
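The temporal ensembling step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the names `temporal_ensemble` and `ChunkEnsembler` are hypothetical, the default `m=0.01` is an arbitrary illustrative value, and instead of a FIFO buffer of size $T$ the sketch simply keeps the chunks that still cover the current timestep. It assumes predictions are ordered oldest first, so that $w_0$ belongs to the oldest one.

```python
import numpy as np


def temporal_ensemble(predictions, m=0.01):
    """Weighted average of overlapping predictions for a single timestep.

    predictions: array-like of shape (n, action_dim), ordered oldest first,
                 so index i = 0 gets the largest weight exp(-m * 0) = 1.
    m:           decay rate (illustrative default, an assumption).
    """
    predictions = np.asarray(predictions, dtype=float)
    weights = np.exp(-m * np.arange(len(predictions)))
    weights /= weights.sum()  # normalise so the weights sum to 1
    return weights @ predictions


class ChunkEnsembler:
    """Hypothetical helper: call step() once per timestep with a fresh chunk."""

    def __init__(self, chunk_size, m=0.01):
        self.k = chunk_size
        self.m = m
        self.t = 0
        self.chunks = []  # (start_timestep, chunk of shape (k, action_dim))

    def step(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        self.chunks.append((self.t, chunk))
        # Drop chunks that no longer contain a prediction for timestep t.
        self.chunks = [(s, c) for s, c in self.chunks if s + self.k > self.t]
        # Collect every prediction made for timestep t, oldest chunk first.
        preds = np.stack([c[self.t - s] for s, c in self.chunks])
        action = temporal_ensemble(preds, self.m)
        self.t += 1
        return action


if __name__ == "__main__":
    ens = ChunkEnsembler(chunk_size=2, m=0.0)  # m = 0 gives a plain average
    for chunk in ([[0.0], [2.0]], [[4.0], [6.0]], [[10.0], [0.0]]):
        print(ens.step(chunk))
```

With `m = 0` every overlapping prediction counts equally; increasing `m` shifts weight toward the oldest prediction for each timestep, which is the trade-off between smoothness and reactivity that the weighting scheme controls.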