---
toc: true
title: CvT
tags: ['temp']
---
# CvT
- [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808)
- improves [Vision Transformer](Vision%20Transformer.md) in performance and efficiency by introducing [Conv](Conv.md) into it
- first change: a hierarchy of Transformers containing a new convolutional token [Embedding](Embedding.md)
- second change: a convolutional [Transformer](Transformer.md) block leveraging a convolutional projection
- gains CNN properties: shift, scale, and distortion invariance
- keeps Transformer merits: dynamic [Attention](Attention.md), global context, and better generalization
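- evaluated on [ImageNet](ImageNet.md)-1k: the paper reports state-of-the-art performance over other Vision Transformers and ResNets, with fewer parameters and lower FLOPs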
- [Position Encoding](Position%20Encoding.md), a crucial component in existing Vision Transformers, can be safely removed in CvT
- removing the position encoding is a potential advantage for adaptation to vision tasks with variable input resolution
- reason: with the built-in local context structure introduced by convolutions, CvT no longer requires a position [Embedding](Embedding.md)
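
The hierarchy of convolutional token embeddings can be illustrated by tracking how the token grid shrinks from stage to stage. The sketch below is plain Python and only does shape arithmetic; it is not the authors' implementation. The per-stage kernel, stride, and padding values, the 224-pixel input, the strided key/value projection settings, and the helper names `conv_out_size`, `token_grids`, and `kv_tokens` are all assumptions made for illustration.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a standard convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed (kernel, stride, padding) of the convolutional token embedding
# at each of three stages. These values are illustrative assumptions.
STAGES = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]


def token_grids(image_size: int, stages=STAGES):
    """Return the (height, width) of the token map after each stage."""
    grids = []
    size = image_size
    for kernel, stride, padding in stages:
        size = conv_out_size(size, kernel, stride, padding)
        grids.append((size, size))
    return grids


def kv_tokens(grid_size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Key/value token count after an assumed strided convolutional projection."""
    reduced = conv_out_size(grid_size, kernel, stride, padding)
    return reduced * reduced


if __name__ == "__main__":
    for stage, (h, w) in enumerate(token_grids(224), start=1):
        print(f"stage {stage}: {h}x{w} grid, "
              f"{h * w} query tokens, {kv_tokens(h)} key/value tokens")
```

Under these assumed settings, a 224x224 input gives token grids of 56x56, 28x28, and 14x14. Because each token's position in the grid is implied by the convolution itself, the same arithmetic applies to other input resolutions without any position embedding to resize.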