![[CleanShot 2025-02-23 at [email protected]]]
![[CleanShot 2025-02-23 at [email protected]]]
## 1. Introduction and Technical Overview
The Segment Anything Model 2 (SAM 2) represents a significant advancement in visual understanding by extending the capabilities of foundation models from static image segmentation to the temporal domain of video. Developed by Meta FAIR, SAM 2 addresses the complex challenge of promptable visual segmentation across both images and videos, offering a unified framework that can be prompted interactively with points, boxes, or masks to produce spatio-temporal masks (masklets) of target objects.
From a technical perspective, SAM 2 can be understood as a generalization of the original Segment Anything Model (SAM) (Kirillov et al., 2023) to process sequential visual data through a novel architecture that incorporates temporal context through a streaming memory mechanism. This design enables real-time processing of arbitrarily long videos while maintaining consistent object identity across frames - a fundamental requirement for effective video understanding systems.
The technical contributions of SAM 2 can be summarized across three key dimensions:
1. **Task Formulation**: The extension of promptable segmentation from static images to the spatio-temporal domain through the Promptable Visual Segmentation (PVS) task.
2. **Architectural Innovations**: A streaming transformer architecture with a memory attention module that efficiently retains information about past frames and predictions to inform current frame processing.
3. **Data Engine**: A novel data collection methodology that leverages the model itself in an iterative improvement loop with human annotators, resulting in the creation of the Segment Anything Video (SA-V) dataset - the largest video segmentation dataset to date.
![[CleanShot 2025-02-23 at [email protected]]]
FIGURE 1: Overview of SAM 2. This visualization illustrates the three core components of the SAM 2 system: (a) The Promptable Visual Segmentation task, showing how different types of prompts (box, points, mask) in arbitrary frames generate consistent object segmentation throughout a video; (b) The SAM 2 model architecture, highlighting the interconnections between the image encoder, prompt encoder, memory attention mechanism, mask decoder, and memory bank; and (c) The data engine and SA-V dataset statistics, demonstrating the scale and diversity of the training data. The figure demonstrates how SAM 2 leverages streaming memory to maintain object context across frames while enabling interactive refinement.
![[CleanShot 2025-02-23 at [email protected]]]
FIGURE 3: The SAM 2 architecture. This diagram illustrates the streaming transformer design with memory capabilities. For each frame, the image encoder extracts features that are processed through memory attention, which integrates information from previous frames. The mask decoder generates segmentation predictions conditioned on both the current prompt (encoded via the prompt encoder) and/or previously observed memories. The system operates in a streaming fashion, with new frames processed sequentially and their information encoded into the memory bank for future reference. This architecture enables both prompt-based and memory-based segmentation, allowing SAM 2 to maintain object continuity across frames while supporting interactive refinement.
This technical review examines the mathematical foundations, algorithmic design, and implementation details of SAM 2, with particular focus on the memory attention mechanism, temporal processing capabilities, and the data engine methodology. We will also analyze the model's performance characteristics across various benchmarks, highlighting its advantages over previous approaches while acknowledging its limitations.
## 2. Technical Background and Related Work
### 2.1 Evolution of Image Segmentation Models
The problem of image segmentation has undergone significant transformation through the advent of deep learning. Traditional approaches relied on handcrafted features and optimizations such as graph cuts (Wang et al., 2005) or level set methods (Bai & Sapiro, 2007), where segmentation was formulated as an energy minimization problem. Mathematically, these approaches can be represented as:
$
E(S) = \lambda_1 E_{\text{data}}(S) + \lambda_2 E_{\text{boundary}}(S)
$
where $E_{\text{data}}$ quantifies how well the segmentation $S$ matches image features, and $E_{\text{boundary}}$ penalizes complex boundaries, with $\lambda_1$ and $\lambda_2$ being weighting parameters.
The advent of deep convolutional networks reframed segmentation as a per-pixel classification task, with Fully Convolutional Networks (FCNs) establishing a dominant paradigm. The segmentation network $f_\theta$ with parameters $\theta$ can be expressed as:
$
f_\theta: \mathbf{I} \rightarrow \mathbf{M}
$
where $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ is the input image and $\mathbf{M} \in [0,1]^{H \times W}$ is the output probability map.
The Segment Anything Model (SAM) (Kirillov et al., 2023) introduced a significant paradigm shift by framing segmentation as a promptable task. Given an image $\mathbf{I}$ and a prompt $\mathbf{p}$ (which could be points, boxes, or masks), SAM generates a valid segmentation mask:
$
\text{SAM}(\mathbf{I}, \mathbf{p}) \rightarrow \mathbf{M}
$
SAM employs a transformer-based architecture with three main components:
1. An image encoder ($f_{\text{img}}$) that creates dense visual features
2. A prompt encoder ($f_{\text{prompt}}$) that converts user inputs into embeddings
3. A mask decoder ($f_{\text{dec}}$) that combines these to produce segmentation masks
The mathematical pipeline can be expressed as:
$
\mathbf{F}_{\text{img}} = f_{\text{img}}(\mathbf{I}), \quad \mathbf{P} = f_{\text{prompt}}(\mathbf{p}), \quad \mathbf{M} = f_{\text{dec}}(\mathbf{F}_{\text{img}}, \mathbf{P})
$
A key innovation in SAM was its training on the SA-1B dataset, which contains 1 billion masks from 11 million images, enabling zero-shot transfer to unseen domains.
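To make the three-component decomposition above concrete, the following is a minimal sketch of a SAM-style promptable pipeline in PyTorch. The module internals (a strided convolution standing in for the ViT encoder, a linear layer for point prompts) are placeholders chosen for brevity, not the released SAM implementation.

```python
import torch
from torch import nn

class PromptableSegmenter(nn.Module):
    """Sketch of the SAM pipeline: image encoder -> prompt encoder -> mask decoder."""
    def __init__(self, dim=256):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for the ViT
        self.point_embed = nn.Linear(2, dim)                               # stand-in prompt encoder
        self.mask_decoder = nn.Conv2d(dim, 1, kernel_size=1)               # stand-in decoder

    def forward(self, image, points):
        feats = self.image_encoder(image)               # F_img = f_img(I)
        prompt = self.point_embed(points).mean(1)       # P = f_prompt(p)
        fused = feats + prompt[:, :, None, None]        # condition features on the prompt
        return torch.sigmoid(self.mask_decoder(fused))  # M = f_dec(F_img, P)

model = PromptableSegmenter()
mask = model(torch.randn(1, 3, 1024, 1024), torch.rand(1, 3, 2))  # 3 point prompts
print(mask.shape)  # torch.Size([1, 1, 64, 64])
```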
### 2.2 Video Segmentation: Technical Challenges and Prior Approaches
Video object segmentation introduces additional complexities beyond static image segmentation, primarily due to temporal dynamics and the need to maintain consistent object identity across frames. The standard formulation of the semi-supervised Video Object Segmentation (VOS) task provides a ground-truth mask $\mathbf{M}_0$ for the first frame and requires propagating this segmentation to subsequent frames:
$
f_{\text{VOS}}: (\mathbf{I}_{0:T}, \mathbf{M}_0) \rightarrow \mathbf{M}_{1:T}
$
where $\mathbf{I}_{0:T}$ represents the video frames and $\mathbf{M}_{1:T}$ are the predicted segmentation masks.
Several technical approaches have emerged to address this task:
1. **Online Fine-tuning Methods**: These methods adapt a pre-trained model to the specific object of interest in the first frame through gradient updates:
$
\theta_t = \theta_{t-1} - \alpha \nabla_\theta \mathcal{L}(\mathbf{M}_0, f_{\theta_{t-1}}(\mathbf{I}_0))
$
where $\alpha$ is the learning rate and $\mathcal{L}$ is a segmentation loss function (typically a combination of cross-entropy and IoU-based losses). While effective, these methods are computationally expensive at inference time.
2. **Memory-based Propagation**: These approaches maintain a memory bank of features from previous frames to guide segmentation in the current frame:
$
\mathbf{M}_t = f_\theta(\mathbf{I}_t, \mathbf{F}_{0:t-1}, \mathbf{M}_{0:t-1})
$
where $\mathbf{F}_{0:t-1}$ represents image features from previous frames. This approach is exemplified by STM (Oh et al., 2019) and its derivatives.
3. **Transformer-based Methods**: Recent approaches leverage the attention mechanism of transformers to model temporal dependencies:
$
\mathbf{Z}_t = \text{Attention}(\mathbf{Q}_t, [\mathbf{K}_0; \mathbf{K}_1; \ldots; \mathbf{K}_{t-1}], [\mathbf{V}_0; \mathbf{V}_1; \ldots; \mathbf{V}_{t-1}])
$
where $\mathbf{Q}_t$ are query embeddings derived from the current frame, while $\mathbf{K}_i$ and $\mathbf{V}_i$ are key-value pairs from previous frames. This formulation is seen in works like STCN (Cheng et al., 2021a), AOT (Yang et al., 2021b), and XMem (Cheng & Schwing, 2022).
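In code, the memory-read formulations in items 2 and 3 above reduce to an attention call over keys and values concatenated across the stored frames. A minimal sketch (the `memory_read` helper and the shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def memory_read(query_feats, memory_keys, memory_values):
    """Generic memory read: attend from current-frame queries to keys/values
    concatenated over all stored past frames (cf. STM/STCN/XMem-style designs).
    Shapes: query_feats (B, N_query, D); each memory entry (B, N_tokens, D)."""
    K = torch.cat(memory_keys, dim=1)    # [K_0; K_1; ...; K_{t-1}]
    V = torch.cat(memory_values, dim=1)  # [V_0; V_1; ...; V_{t-1}]
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    return F.scaled_dot_product_attention(query_feats, K, V)

B, N, D = 1, 64 * 64, 256
q = torch.randn(B, N, D)
keys = [torch.randn(B, N, D) for _ in range(3)]  # three past frames
vals = [torch.randn(B, N, D) for _ in range(3)]
z = memory_read(q, keys, vals)                   # (B, N, D) memory-conditioned features
```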
### 2.3 Interactive Video Segmentation Approaches
Interactive Video Object Segmentation (iVOS) extends the VOS paradigm by allowing user interactions beyond the first frame. The formal definition becomes:
$
f_{\text{iVOS}}: (\mathbf{I}_{0:T}, \mathbf{P}_{t_1, t_2, \ldots, t_n}) \rightarrow \mathbf{M}_{0:T}
$
where $\mathbf{P}_{t_1, t_2, \ldots, t_n}$ represents user prompts at frames $t_1, t_2, \ldots, t_n$.
Existing approaches typically address this problem through a modular design:
1. A spatial segmentation module that handles user inputs at a given frame
2. A temporal propagation module that extends the segmentation across time
Mathematically, this can be expressed as:
$
\mathbf{M}_t =
\begin{cases}
f_{\text{spatial}}(\mathbf{I}_t, \mathbf{P}_t), & \text{if prompt exists at frame } t \\
f_{\text{prop}}(\mathbf{I}_t, \mathbf{I}_{t'}, \mathbf{M}_{t'}), & \text{otherwise}
\end{cases}
$
where $t'$ is the nearest frame with a prompt or prediction.
Recent works have attempted to combine SAM with video tracking modules (Cheng et al., 2023b; Yang et al., 2023). However, these systems face critical limitations in their ability to iteratively refine predictions and adapt to changing object appearances, as they lack a unified mechanism to incorporate prompts across multiple frames.
### 2.4 Limitations of Existing Approaches and Technical Gaps
Prior to SAM 2, several technical limitations persisted in the field:
1. **Unified Prompt Processing**: Existing methods lacked a unified architecture to handle various prompt types (points, boxes, masks) across arbitrary frames in a video sequence.
2. **Memory Efficiency**: Most VOS approaches store features from all previous frames, leading to linear memory growth with video length, which is impractical for long videos.
3. **Object Identity Modeling**: Maintaining consistent object identity through occlusions, appearance changes, and complex motions remained challenging.
4. **Dataset Limitations**: Existing video segmentation datasets were relatively small, category-focused, and lacked the diversity needed for general-purpose segmentation.
SAM 2 addresses these limitations through its novel architecture and data collection methodology, which we will examine in subsequent sections.
## 3. Promptable Visual Segmentation Task
The Promptable Visual Segmentation (PVS) task introduced in SAM 2 represents a significant generalization of image segmentation to the video domain, unifying and extending capabilities from both static image segmentation and video object segmentation approaches.
### 3.1 Formal Task Definition
Formally, the PVS task can be defined as a function $f_{\text{PVS}}$ that takes as input a video sequence $\mathbf{V} = \{\mathbf{I}_1, \mathbf{I}_2, ..., \mathbf{I}_T\}$ consisting of $T$ frames and a set of prompts $\mathbf{P} = \{\mathbf{p}_{t_1}, \mathbf{p}_{t_2}, ..., \mathbf{p}_{t_n}\}$ provided at arbitrary frames $t_1, t_2, ..., t_n \in \{1, 2, ..., T\}$. The function outputs a spatio-temporal mask (masklet) $\mathbf{M} = \{\mathbf{M}_1, \mathbf{M}_2, ..., \mathbf{M}_T\}$ spanning the entire video:
$
f_{\text{PVS}}: (\mathbf{V}, \mathbf{P}) \rightarrow \mathbf{M}
$
Each prompt $\mathbf{p}_t$ can be one of multiple types:
- Positive clicks (points): $\mathbf{p}_t^+ = \{(x_1, y_1), (x_2, y_2), ...\}$
- Negative clicks (points): $\mathbf{p}_t^- = \{(x_1, y_1), (x_2, y_2), ...\}$
- Bounding boxes: $\mathbf{p}_t^b = \{(x_{\text{min}}, y_{\text{min}}, x_{\text{max}}, y_{\text{max}})\}$
- Masks: $\mathbf{p}_t^m \in \{0, 1\}^{H \times W}$
A key characteristic of the PVS task is its interactive nature. The model processes prompts sequentially, with each new prompt potentially refining the prediction based on all previous interactions:
$
\mathbf{M}^{(i)} = f_{\text{PVS}}(\mathbf{V}, \{\mathbf{p}^{(1)}, \mathbf{p}^{(2)}, ..., \mathbf{p}^{(i)}\})
$
where $\mathbf{M}^{(i)}$ represents the masklet after the $i$-th interaction, and $\mathbf{p}^{(i)}$ is the prompt at interaction step $i$.
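The interactive loop can be pictured as a small stateful session object that accumulates prompts and re-solves the whole masklet after each interaction. The sketch below assumes a hypothetical `model.segment(video, prompts)` interface; it is not the released SAM 2 API.

```python
from dataclasses import dataclass, field

@dataclass
class PVSSession:
    """Minimal sketch of the interactive PVS loop: every new prompt is appended and
    the full masklet M^(i) is recomputed from all prompts collected so far."""
    model: object
    video: object                                # sequence of frames I_1 .. I_T
    prompts: list = field(default_factory=list)

    def add_prompt(self, frame_idx, prompt):
        """prompt: clicks, a box, or a mask for the given frame."""
        self.prompts.append((frame_idx, prompt))
        # M^(i) = f_PVS(V, {p^(1), ..., p^(i)})
        return self.model.segment(self.video, self.prompts)
```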
### 3.2 Technical Requirements and Challenges
The PVS task imposes several technical requirements that distinguish it from standard image segmentation or video object segmentation:
1. **Real-time Interactive Response**: Upon receiving a prompt at frame $t$, the model must immediately respond with a valid segmentation mask for that frame. Mathematically, this implies an online subfunction that maps the current frame, the prompt, and the accumulated context to a mask:
$
f_{\text{immediate}}(\mathbf{I}_t, \mathbf{p}_t, \mathcal{C}) \rightarrow \mathbf{M}_t
$
where $\mathcal{C}$ represents the context from previous frames and interactions.
2. **Spatio-Temporal Consistency**: The model must maintain object identity and segmentation consistency across the entire video, even through occlusions, appearance changes, and camera motion. This requires modeling temporal dynamics:
$
f_{\text{propagate}}(\mathbf{V}, \mathbf{M}_t, t) \rightarrow \mathbf{M}_{1:T}
$
3. **Multi-frame Prompt Integration**: The model must effectively incorporate prompts from multiple frames, using them as constraints to refine the entire spatio-temporal masklet. This can be expressed as an optimization problem:
$
\mathbf{M}^* = \underset{\mathbf{M}}{\arg\min} \sum_{i=1}^{n} \mathcal{L}(\mathbf{M}_{t_i}, \mathbf{p}_{t_i}) + \lambda \mathcal{R}(\mathbf{M})
$
where $\mathcal{L}$ is a loss function measuring consistency with prompts, and $\mathcal{R}$ is a regularization term encouraging temporal smoothness.
4. **Object Disappearance Handling**: Unlike standard VOS approaches that assume the target object is always present, PVS requires detecting when objects disappear from view and reappear. This necessitates an explicit visibility modeling component:
$
v_t = f_{\text{visible}}(\mathbf{I}_t, \mathcal{C}) \in \{0, 1\}
$
where $v_t$ indicates the object's visibility at frame $t$.
![[CleanShot 2025-02-23 at [email protected]]]
FIGURE 2: Interactive segmentation with SAM 2. This figure illustrates the two-stage interactive process of SAM 2: (1) Selection stage - initial prompting with positive/negative clicks to segment the target object (a dog's tongue) in frame 1, with automatic propagation to subsequent frames; (2) Refinement stage - when the object is lost (after frame 2), a single click in frame 3 is sufficient to recover the object, leveraging SAM 2's memory of past frames. This demonstrates a key advantage over decoupled approaches (SAM + video tracker), which would require complete re-annotation rather than minimal correction, exemplifying how the memory-based architecture enables efficient spatio-temporal segmentation with minimal user interaction.
### 3.3 Relationship to Existing Segmentation Paradigms
The PVS task generalizes several existing segmentation paradigms:
1. **Static Image Segmentation**: When applied to a single-frame video ($T = 1$), PVS reduces to standard interactive image segmentation:
$
f_{\text{PVS}}(\{\mathbf{I}\}, \{\mathbf{p}\}) = f_{\text{SAM}}(\mathbf{I}, \mathbf{p})
$
2. **Semi-supervised VOS**: When prompts are restricted to only the first frame ($t_1 = 1$), PVS becomes equivalent to the standard semi-supervised VOS formulation:
$
f_{\text{PVS}}(\mathbf{V}, \{\mathbf{p}_1\}) = f_{\text{VOS}}(\mathbf{V}, \mathbf{M}_1)
$
where $\mathbf{M}_1 = f_{\text{immediate}}(\mathbf{I}_1, \mathbf{p}_1, \emptyset)$.
3. **Interactive VOS**: When prompts can be provided at any frame but each prompt operates independently, PVS resembles traditional iVOS approaches, but with the additional benefit of maintaining a coherent spatio-temporal understanding of the target object.
By unifying these paradigms, the PVS task provides a more general and practical approach to video segmentation that aligns with how humans naturally interact with visual content - allowing iterative refinement through minimal interactions at strategically chosen frames.
## 4. Model Architecture and Mathematical Foundations
SAM 2 introduces a novel architecture that extends the capabilities of the original SAM model to the video domain while maintaining its strong performance on static images. The architecture is designed with several key principles in mind: streaming processing for real-time performance, memory-efficient operation for arbitrarily long videos, and the ability to integrate prompts from multiple frames.
### 4.1 Architecture Overview
At a high level, SAM 2 can be conceptualized as a streaming transformer with memory. The model processes video frames sequentially, maintaining a memory bank that stores information about past frames and predictions. The architecture consists of four primary components:
1. Image Encoder: Extracts visual features from each frame
2. Memory Attention: Conditions the current frame features on past frame features and predictions
3. Prompt Encoder and Mask Decoder: Processes user prompts and generates segmentation masks
4. Memory Encoder and Memory Bank: Encodes and stores information from previous frames
The information flow through the model can be expressed mathematically as follows:
For a frame $\mathbf{I}_t$ at time $t$ with optional prompt $\mathbf{p}_t$:
$
\begin{align}
\mathbf{F}_t &= f_{\text{img}}(\mathbf{I}_t) \\
\mathbf{F}_t^{\text{mem}} &= f_{\text{mem-attn}}(\mathbf{F}_t, \mathcal{M}_{t-1}) \\
\mathbf{P}_t &= f_{\text{prompt}}(\mathbf{p}_t) \text{ (if prompt exists)} \\
\mathbf{M}_t, v_t &= f_{\text{dec}}(\mathbf{F}_t^{\text{mem}}, \mathbf{P}_t) \\
\mathcal{M}_t &= f_{\text{mem-update}}(\mathcal{M}_{t-1}, \mathbf{F}_t, \mathbf{M}_t, v_t)
\end{align}
$
where $\mathcal{M}_t$ represents the memory bank at time $t$, and $v_t$ is a binary variable indicating whether the object is visible in frame $t$.
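This information flow translates directly into a streaming loop over frames. The sketch below is schematic, with the `f_*` callables standing in for the actual model components rather than the released implementation:

```python
def process_video(frames, prompts, f_img, f_mem_attn, f_prompt, f_dec, f_mem_update):
    """Streaming sketch of the Section 4.1 information flow.
    prompts: dict mapping frame index -> prompt (may be empty)."""
    memory = None                     # empty memory bank before the first frame
    masks, visibilities = [], []
    for t, frame in enumerate(frames):
        F_t = f_img(frame)                                              # F_t = f_img(I_t)
        F_mem = f_mem_attn(F_t, memory)                                 # condition on memory bank
        P_t = f_prompt(prompts[t]) if prompts.get(t) is not None else None
        M_t, v_t = f_dec(F_mem, P_t)                                    # mask + visibility score
        memory = f_mem_update(memory, F_t, M_t, v_t)                    # update memory bank
        masks.append(M_t)
        visibilities.append(v_t)
    return masks, visibilities
```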
### 4.2 Image Encoder
The image encoder in SAM 2 plays a crucial role in extracting high-quality visual features from each video frame. Unlike the original SAM which used a Vision Transformer (ViT), SAM 2 employs a hierarchical vision transformer called Hiera (Ryali et al., 2023), which was pre-trained using Masked Auto-Encoders (MAE) (He et al., 2022).
Hiera employs a hierarchical structure with windowed attention mechanisms, processing images through multiple stages with progressive downsampling. This design enables efficient multi-scale feature extraction, capturing both fine-grained local details and high-level semantic information. Unlike the standard ViT architecture, Hiera strategically uses global attention only in specific layers while relying on windowed attention for most computation, significantly reducing computational complexity.
Mathematically, the image encoder maps an input frame $\mathbf{I}_t \in \mathbb{R}^{H \times W \times 3}$ to a multi-scale feature representation:
$
\mathbf{F}_t^{(s)} = f_{\text{img}}^{(s)}(\mathbf{I}_t), \quad s \in \{1, 2, 3, 4\}
$
where $s$ indicates the feature scale, with corresponding spatial dimensions reduced by factors of 4, 8, 16, and 32, respectively. The primary feature map used for memory attention is $\mathbf{F}_t = \mathbf{F}_t^{(3,4)}$, which combines features from scales 3 and 4 using a Feature Pyramid Network (FPN) (Lin et al., 2017).
A key innovation in SAM 2's image encoder implementation is the removal of relative positional biases (RPB) from the attention layers, which enables the use of efficient attention kernels (Dao, 2023). This modification provides significant speedup without performance degradation, contributing to SAM 2's real-time processing capabilities.
### 4.3 Memory Attention Module
The memory attention module is the core innovation in SAM 2 that enables temporal reasoning. It conditions the current frame features on memories from previous frames through a series of transformer blocks with self-attention and cross-attention operations.
Given the image features $\mathbf{F}_t$ from the current frame and the memory bank $\mathcal{M}_{t-1} = \{\mathbf{M}_{1:t-1}, \mathbf{F}_{1:t-1}, \mathbf{O}_{1:t-1}\}$ containing information from previous frames (where $\mathbf{O}_{1:t-1}$ represents object pointer tokens), the memory attention module produces memory-conditioned features $\mathbf{F}_t^{\text{mem}}$:
$
\begin{align}
\mathbf{Z}_t^{(0)} &= \mathbf{F}_t \\
\mathbf{Z}_t^{(l)} &= \text{SA}^{(l)}(\text{LN}(\mathbf{Z}_t^{(l-1)})) + \mathbf{Z}_t^{(l-1)} \\
\mathbf{Z}_t^{(l)'} &= \text{CA}^{(l)}(\text{LN}(\mathbf{Z}_t^{(l)}), \mathcal{M}_{t-1}) + \mathbf{Z}_t^{(l)} \\
\mathbf{Z}_t^{(l)''} &= \text{MLP}^{(l)}(\text{LN}(\mathbf{Z}_t^{(l)'})) + \mathbf{Z}_t^{(l)'} \\
\mathbf{F}_t^{\text{mem}} &= \mathbf{Z}_t^{(L)}
\end{align}
$
where $\text{SA}^{(l)}$, $\text{CA}^{(l)}$, and $\text{MLP}^{(l)}$ represent the self-attention, cross-attention, and MLP operations in the $l$-th transformer block, $\text{LN}$ denotes layer normalization, and $L$ is the total number of blocks (typically 4).
A distinctive feature of the memory attention module is its use of 2D Rotary Position Embeddings (RoPE) (Su et al., 2021; Heo et al., 2024) in both self-attention and cross-attention layers. This enables the model to better understand spatial relationships while maintaining computational efficiency.
The cross-attention operation, which is key to integrating past information, can be expressed in detail as:
$
\text{CA}^{(l)}(\mathbf{Q}, \mathcal{M}) = \text{Softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d}}\right) \mathbf{V}
$
where $\mathbf{Q}$ is derived from the current frame features, while $\mathbf{K}$ and $\mathbf{V}$ are constructed from the memory bank. The memory bank includes both spatial memory from previous frames and object pointer tokens that capture high-level semantic information about the target object.
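A minimal sketch of one memory attention block, following the pre-LN residual structure above; 2D-RoPE and other details are omitted, and the layer sizes are illustrative assumptions:

```python
import torch
from torch import nn

class MemoryAttentionBlock(nn.Module):
    """One block of the memory attention stack: pre-LN self-attention,
    cross-attention to memory tokens, then an MLP, each with a residual."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z, memory):
        # z: (B, N_pixels, D) current-frame tokens; memory: (B, N_mem, D) memory tokens
        h = self.ln1(z)
        z = z + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.ln2(z)
        z = z + self.cross_attn(h, memory, memory, need_weights=False)[0]
        z = z + self.mlp(self.ln3(z))
        return z

blocks = nn.ModuleList(MemoryAttentionBlock() for _ in range(4))  # L = 4 blocks
```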
### 4.4 Prompt Encoder and Mask Decoder
The prompt encoder and mask decoder in SAM 2 extend the design from the original SAM model with adaptations for the video domain.
#### 4.4.1 Prompt Encoder
The prompt encoder $f_{\text{prompt}}$ processes various types of user prompts (points, boxes, masks) into embeddings that guide the segmentation process:
$
\mathbf{P}_t = f_{\text{prompt}}(\mathbf{p}_t)
$
For sparse prompts (points and boxes), positional encodings are combined with learned prompt-type embeddings:
$
\mathbf{P}_t^{\text{sparse}} = \{\mathbf{E}_{\text{pos}}(x_i, y_i) + \mathbf{E}_{\text{type}}(c_i)\}_{i=1}^{n}
$
where $(x_i, y_i)$ are the coordinates, $c_i \in \{\text{positive}, \text{negative}, \text{box}\}$ is the prompt type, and $n$ is the number of prompt points.
For dense prompts (masks), convolutional layers are used to embed the mask:
$
\mathbf{P}_t^{\text{dense}} = \text{Conv}(\mathbf{p}_t^m)
$
#### 4.4.2 Mask Decoder
The mask decoder $f_{\text{dec}}$ takes the memory-conditioned frame features $\mathbf{F}_t^{\text{mem}}$ and prompt embeddings $\mathbf{P}_t$ to produce a segmentation mask $\mathbf{M}_t$ and visibility score $v_t$:
$
\mathbf{M}_t, v_t = f_{\text{dec}}(\mathbf{F}_t^{\text{mem}}, \mathbf{P}_t)
$
SAM 2's mask decoder introduces several innovations beyond the original SAM design:
1. **Multi-scale Skip Connections**: Features from scales 1 and 2 of the image encoder ($\mathbf{F}_t^{(1)}$ and $\mathbf{F}_t^{(2)}$) are incorporated during upsampling to enhance fine-grained details:
$
\mathbf{U}_s = \text{UpConv}(\mathbf{U}_{s+1} + \mathbf{F}_t^{(s)}), \quad s \in \{1, 2\}
$
2. **Occlusion Prediction**: An additional token and MLP head predict whether the target object is visible in the current frame:
$
v_t = \sigma(\text{MLP}_{\text{vis}}(\mathbf{t}_{\text{vis}}))
$
where $\mathbf{t}_{\text{vis}}$ is a dedicated visibility token processed through the transformer blocks, and $\sigma$ is the sigmoid function.
3. **Ambiguity Handling**: For ambiguous prompts, the decoder predicts multiple masks along with their IoU scores:
$
\{\mathbf{M}_t^{(k)}, s_t^{(k)}\}_{k=1}^K = f_{\text{dec}}(\mathbf{F}_t^{\text{mem}}, \mathbf{P}_t)
$
where $K$ is the number of mask candidates (typically 3), and $s_t^{(k)}$ is the predicted IoU score for each mask.
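The occlusion and ambiguity heads described in items 2 and 3 above can be sketched as small output heads on top of the decoder tokens. Token routing through the transformer is elided, and all names are illustrative assumptions rather than the released implementation:

```python
import torch
from torch import nn

class DecoderHeads(nn.Module):
    """Sketch of SAM 2's extra decoder outputs: predicted IoU per mask candidate,
    selection of the best candidate, and an occlusion/visibility score."""
    def __init__(self, dim=256):
        super().__init__()
        self.iou_head = nn.Linear(dim, 1)
        self.vis_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, mask_logits, mask_tokens, vis_token):
        # mask_logits: (B, K, H, W); mask_tokens: (B, K, D); vis_token: (B, D)
        iou_scores = self.iou_head(mask_tokens).squeeze(-1)       # (B, K) predicted IoU
        best = iou_scores.argmax(dim=1)                           # highest-scoring candidate
        batch = torch.arange(mask_logits.size(0))
        masks = mask_logits[batch, best]                          # (B, H, W)
        visibility = torch.sigmoid(self.vis_head(vis_token)).squeeze(-1)  # v_t in [0, 1]
        return masks, iou_scores, visibility
```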
### 4.5 Memory Encoder and Memory Bank
The memory encoder generates compact representations of frame features and segmentation masks for storage in the memory bank, which the model accesses when processing future frames.
#### 4.5.1 Memory Encoder
The memory encoder $f_{\text{mem-enc}}$ transforms the image features $\mathbf{F}_t$ and the predicted mask $\mathbf{M}_t$ into a memory representation:
$
\mathbf{M}_t^{\text{mem}} = f_{\text{mem-enc}}(\mathbf{F}_t, \mathbf{M}_t)
$
Specifically, it first downsamples the mask using convolutional layers and then combines it with the image features through element-wise addition followed by additional convolutional fusion layers:
$
\mathbf{M}_t^{\text{mem}} = \text{ConvFuse}(\text{ConvDown}(\mathbf{M}_t) + \mathbf{F}_t)
$
Additionally, the mask decoder's output tokens are stored as "object pointers" to capture high-level semantic information:
$
\mathbf{O}_t = \mathbf{t}_{\text{mask}}
$
where $\mathbf{t}_{\text{mask}}$ is the token corresponding to the selected mask.
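A minimal sketch of the memory encoder's downsample-add-fuse structure is given below. Channel sizes reflect the 256-dimensional processing features and 64-dimensional memory mentioned later; the exact layer configuration is an assumption:

```python
import torch
from torch import nn

class MemoryEncoder(nn.Module):
    """Sketch of the memory encoder: downsample the predicted mask, add it
    element-wise to the frame features, and fuse with light conv layers."""
    def __init__(self, feat_dim=256, mem_dim=64, stride=16):
        super().__init__()
        self.mask_down = nn.Conv2d(1, feat_dim, kernel_size=stride, stride=stride)  # ConvDown
        self.fuse = nn.Sequential(                                                   # ConvFuse
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, mem_dim, 1),
        )

    def forward(self, feats, mask):
        # feats: (B, 256, H/16, W/16); mask: (B, 1, H, W) predicted mask probabilities
        return self.fuse(self.mask_down(mask) + feats)
```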
#### 4.5.2 Memory Bank
The memory bank $\mathcal{M}_t$ maintains three types of information:
1. **Recent Frame Memories**: A FIFO queue storing features from the $N$ most recent frames (typically $N = 6$). These memories include temporal position encodings to help the model understand short-term motion patterns.
2. **Prompted Frame Memories**: Features from frames that received user prompts (up to $M$ frames, typically $M = 5$). Unlike recent memories, these do not include temporal position encodings, as prompted frames may come from temporally distant parts of the video.
3. **Object Pointers**: Semantic tokens derived from the mask decoder's output for each processed frame, capturing high-level information about the target object.
The mathematical formulation of the memory bank can be expressed as:
$
\mathcal{M}_t = \{\mathcal{M}_t^{\text{recent}}, \mathcal{M}_t^{\text{prompted}}, \mathcal{M}_t^{\text{pointers}}\}
$
where:
$
\mathcal{M}_t^{\text{recent}} = \{\mathbf{M}_{t-N+1}^{\text{mem}}, \ldots, \mathbf{M}_t^{\text{mem}}\}
$
$
\mathcal{M}_t^{\text{prompted}} = \{\mathbf{M}_{t_1}^{\text{mem}}, \mathbf{M}_{t_2}^{\text{mem}}, \ldots, \mathbf{M}_{t_{\min(n, M)}}^{\text{mem}}\}
$
$
\mathcal{M}_t^{\text{pointers}} = \{\mathbf{O}_{t_1}, \mathbf{O}_{t_2}, \ldots, \mathbf{O}_{t-1}, \mathbf{O}_t\}
$
#### 4.5.3 Object Pointers and Recurrent Memory Ablations
The authors conducted detailed ablation studies on the memory architecture to determine the optimal design for temporal reasoning. Two key design choices were investigated:
1. **Object Pointers**: The use of object pointer tokens from the mask decoder output was found to be crucial for maintaining object identity across frames. Experiments showed that removing object pointers caused a significant performance drop of 3.8% J&F on the SA-V val dataset and 4.6% J&F on the LVOSv2 benchmark. This suggests that these semantic tokens capture high-level object characteristics that spatial features alone cannot represent, particularly for long-term object tracking.
2. **Recurrent Memory Integration**: Unlike previous approaches that often employ GRU-based recurrent architectures for temporal modeling (e.g., XMem, STCN), SAM 2's ablation studies revealed that adding a GRU to process memory features before adding them to the memory bank did not improve performance. This counterintuitive finding suggests that the cross-attention mechanism in the memory attention module provides sufficient temporal integration, making explicit recurrent connections unnecessary.
The mathematical comparison can be formalized as:
Without object pointers:
$
\mathcal{M}_t = \{\mathcal{M}_t^{\text{recent}}, \mathcal{M}_t^{\text{prompted}}\}
$
With object pointers (default in SAM 2):
$
\mathcal{M}_t = \{\mathcal{M}_t^{\text{recent}}, \mathcal{M}_t^{\text{prompted}}, \mathcal{M}_t^{\text{pointers}}\}
$
With recurrent processing (ablated and found unnecessary), the stored memory would instead be updated recurrently:
$
\tilde{\mathbf{M}}_t^{\text{mem}} = \text{GRU}(\mathbf{M}_t^{\text{mem}}, \tilde{\mathbf{M}}_{t-1}^{\text{mem}})
$
These ablations demonstrate that the simple design of directly storing multi-type memories (spatial and semantic) without recurrent processing strikes an optimal balance between performance and computational efficiency.
### 4.6 Training Procedure
SAM 2 is trained using a multi-stage process that combines image and video data:
1. **Pre-training on SA-1B**: The model is first pre-trained on the SA-1B dataset (Kirillov et al., 2023), focusing on static image segmentation.
2. **Joint Training**: The model is then trained on a combination of image and video data, with the sampling probability proportional to the size of each data source. For video training, sequences of 8 frames are sampled, with up to 2 frames randomly selected for prompting.
3. **Fine-tuning**: A final fine-tuning stage uses 16-frame sequences from challenging videos (those with many edited frames during annotation) to improve performance on longer videos.
The loss function combines several components:
$
\mathcal{L} = \lambda_1 \mathcal{L}_{\text{focal}} + \lambda_2 \mathcal{L}_{\text{dice}} + \lambda_3 \mathcal{L}_{\text{IoU}} + \lambda_4 \mathcal{L}_{\text{vis}}
$
where $\mathcal{L}_{\text{focal}}$ and $\mathcal{L}_{\text{dice}}$ are segmentation losses, $\mathcal{L}_{\text{IoU}}$ is a mean absolute error loss for IoU prediction, and $\mathcal{L}_{\text{vis}}$ is a cross-entropy loss for visibility prediction. The weighting coefficients are $\lambda_1 = 20$, $\lambda_2 = \lambda_3 = \lambda_4 = 1$.
During training, the model simulates interactive prompting by sampling clicks based on ground-truth masks and model predictions. Initial prompts are the ground-truth mask (50% probability), a positive click (25%), or a bounding box (25%).
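The combined loss can be sketched as follows in PyTorch. The focal modulation parameters and reduction details are assumptions rather than the exact published recipe, but the four terms and the 20:1:1:1 weighting follow the description above:

```python
import torch
import torch.nn.functional as F

def sam2_style_loss(mask_logits, gt_mask, pred_iou, vis_logit, gt_visible,
                    w_focal=20.0, w_dice=1.0, w_iou=1.0, w_vis=1.0):
    """Sketch of the combined loss: focal + dice + IoU regression + visibility.
    mask_logits, gt_mask: (B, H, W) with gt_mask a float tensor in {0, 1};
    pred_iou, vis_logit, gt_visible: (B,)."""
    p = torch.sigmoid(mask_logits)
    # focal segmentation term (gamma = 2 assumed)
    ce = F.binary_cross_entropy_with_logits(mask_logits, gt_mask, reduction="none")
    p_t = p * gt_mask + (1 - p) * (1 - gt_mask)
    focal = ((1 - p_t) ** 2 * ce).mean()
    # dice segmentation term
    inter = (p * gt_mask).sum(dim=(-1, -2))
    dice = 1 - (2 * inter + 1) / (p.sum(dim=(-1, -2)) + gt_mask.sum(dim=(-1, -2)) + 1)
    # IoU head: mean absolute error against the IoU of the thresholded prediction
    hard = (p > 0.5).float()
    union = ((hard + gt_mask) > 0).float().sum(dim=(-1, -2))
    actual_iou = (hard * gt_mask).sum(dim=(-1, -2)) / union.clamp(min=1)
    iou_loss = F.l1_loss(pred_iou, actual_iou)
    # visibility (occlusion) head: binary cross-entropy
    vis_loss = F.binary_cross_entropy_with_logits(vis_logit, gt_visible)
    return w_focal * focal + w_dice * dice.mean() + w_iou * iou_loss + w_vis * vis_loss
```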
An important aspect of the training procedure is the extensive data augmentation pipeline employed to improve model robustness. The augmentations include:
1. **Standard Image Augmentations**: Horizontal flipping and resizing to a fixed square resolution of 1024×1024.
2. **Video-specific Augmentations**:
- Random affine transforms (±25° rotation, ±20% shear)
- Color jittering (brightness: 0.1, contrast: 0.03, saturation: 0.03)
- Random grayscale conversion with 5% probability
- Per-frame color jittering to simulate appearance changes across frames (brightness: 0.1, contrast: 0.05, saturation: 0.05)
3. **Mosaic Augmentation**: With 10% probability, the same video is tiled into a 2×2 grid, with one of the four quadrants selected as the target object. This challenging augmentation forces the model to rely on temporal cues rather than just appearance, as identical-looking objects appear in different quadrants.
4. **Temporal Reversal**: With 50% probability, the temporal order of frames is reversed, which encourages the model to learn bidirectional temporal relationships.
These augmentation strategies significantly enhance the model's ability to handle challenging real-world scenarios, such as appearance changes, occlusions, and similar-looking objects.
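For reference, the frame-level augmentations listed above map roughly onto standard torchvision transforms as sketched below. In a real video pipeline the geometric and color parameters must be sampled once per clip (and geometric transforms applied to the masks as well); the mosaic and temporal-reversal steps operate at the video level and are omitted. Treat this as an approximation of the recipe, not the authors' exact pipeline.

```python
import torchvision.transforms as T

# Per-video transforms: parameters should be drawn once per clip and reused on every frame.
per_video_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=25, shear=20),
    T.ColorJitter(brightness=0.1, contrast=0.03, saturation=0.03),
    T.RandomGrayscale(p=0.05),
    T.Resize((1024, 1024)),
])
# Per-frame color jitter to simulate appearance changes across frames.
per_frame_jitter = T.ColorJitter(brightness=0.1, contrast=0.05, saturation=0.05)
```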
### 4.7 Inference Processing
During inference, SAM 2 operates in a streaming fashion, processing one frame at a time and maintaining its memory bank. For each new frame $\mathbf{I}_t$:
1. Extract image features $\mathbf{F}_t = f_{\text{img}}(\mathbf{I}_t)$
2. Apply memory attention to get memory-conditioned features $\mathbf{F}_t^{\text{mem}}$
3. If a prompt $\mathbf{p}_t$ exists for this frame, encode it using the prompt encoder
4. Generate the mask $\mathbf{M}_t$ and visibility score $v_t$ using the mask decoder
5. Update the memory bank with the new frame information
For interactive applications, the model can seamlessly incorporate new prompts at any frame, using the memory bank to maintain context from previous frames and interactions.
This streaming design enables SAM 2 to process videos of arbitrary length with constant memory requirements, making it suitable for real-time applications.
## 5. Data Engine and Dataset Construction
A key innovation of SAM 2 is its data collection methodology, which addresses the critical challenge of creating a large, diverse, and high-quality video segmentation dataset. The authors developed a novel "data engine" approach that leverages the model itself in an iterative improvement loop with human annotators to efficiently generate training data.
### 5.1 Technical Overview of the Data Engine
The data engine combines model-assisted annotation with human verification to create a scalable pipeline for producing video segmentation masks (masklets). Unlike traditional annotation approaches that require frame-by-frame manual segmentation, the data engine efficiently leverages the capabilities of the evolving model to reduce annotation time and effort.
Mathematically, the data engine can be conceptualized as an iterative optimization process:
$
\begin{align}
\mathcal{D}_0 &= \emptyset \\
\mathcal{M}_i &= \text{Train}(\mathcal{D}_{i-1}) \\
\mathcal{D}_i &= \mathcal{D}_{i-1} \cup \text{Annotate}(\mathcal{V}_i, \mathcal{M}_i)
\end{align}
$
where $\mathcal{D}_i$ represents the dataset after iteration $i$, $\mathcal{M}_i$ is the model trained on dataset $\mathcal{D}_{i-1}$, $\mathcal{V}_i$ is a set of new videos to be annotated, and $\text{Annotate}(\mathcal{V}_i, \mathcal{M}_i)$ is the annotation process that uses model $\mathcal{M}_i$ to assist human annotators.
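The iteration above can be sketched as a simple loop, with `train` and `annotate` standing in for the retraining pipeline and the model-assisted human annotation step:

```python
def run_data_engine(seed_model, video_batches, train, annotate):
    """Sketch of the data-engine loop: D_0 = {}, D_i = D_{i-1} + Annotate(V_i, M_i),
    M_{i+1} = Train(D_i). `seed_model` plays the role of the image-pretrained model
    used before any video data exists; all callables are placeholders."""
    dataset, model = [], seed_model
    for videos in video_batches:
        dataset = dataset + annotate(videos, model)  # model-assisted human annotation
        model = train(dataset)                       # retrain on the enlarged dataset
    return dataset, model
```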
The data engine evolved through three distinct phases, each characterized by increasing levels of model assistance:
### 5.2 Phase 1: SAM Per Frame
In the initial phase, annotators used the original SAM model to segment objects in individual frames without temporal propagation:
$
\mathbf{M}_t = \text{SAM}(\mathbf{I}_t, \mathbf{p}_t)
$
For each frame $\mathbf{I}_t$ at time $t$, annotators provided prompts $\mathbf{p}_t$ to the SAM model to generate a mask $\mathbf{M}_t$. If necessary, manual editing tools (brush, eraser) were used to refine the mask. This process had to be repeated independently for each frame, resulting in an average annotation time of 37.8 seconds per frame.
This approach yielded high-quality spatial annotations but was inefficient for video annotation due to the lack of temporal continuity. In this phase, 16K masklets were collected across 1.4K videos.
### 5.3 Phase 2: SAM + SAM 2 Mask
The second phase introduced a preliminary version of SAM 2 (referred to as "SAM 2 Mask") that accepted only mask prompts for propagation:
$
\begin{align}
\mathbf{M}_t &= \text{SAM}(\mathbf{I}_t, \mathbf{p}_t) \\
\mathbf{M}_{t+1:T} &= \text{SAM2Mask}(\mathbf{I}_{t+1:T}, \mathbf{M}_t)
\end{align}
$
In this workflow, annotators would first create a mask in a single frame using SAM, then use SAM 2 Mask to propagate this mask to subsequent frames. If the propagation produced incorrect results in a later frame $t'$, annotators would create a new mask from scratch using SAM and then use SAM 2 Mask to re-propagate from that frame:
$
\begin{align}
\mathbf{M}_{t'} &= \text{SAM}(\mathbf{I}_{t'}, \mathbf{p}_{t'}) \\
\mathbf{M}_{t'+1:T} &= \text{SAM2Mask}(\mathbf{I}_{t'+1:T}, \mathbf{M}_{t'})
\end{align}
$
This iterative process continued until the entire masklet was correct. The introduction of temporal propagation reduced the average annotation time to 7.4 seconds per frame—a 5.1× speedup compared to Phase 1. During Phase 2, 63.5K masklets were collected, and SAM 2 Mask was retrained twice using the growing dataset.
### 5.4 Phase 3: Full SAM 2
The final phase employed the complete SAM 2 model, which accepts various prompt types (points, boxes, masks) and maintains a memory of the target object across frames:
$
\mathbf{M}_{1:T} = \text{SAM2}(\mathbf{I}_{1:T}, \{\mathbf{p}_{t_1}, \mathbf{p}_{t_2}, ..., \mathbf{p}_{t_n}\})
$
This approach allowed annotators to provide minimal refinement prompts (often just a single click) in intermediate frames where propagation errors occurred, rather than having to re-annotate from scratch. The model's memory of the target object across frames enabled more efficient corrections:
$
\mathbf{M}_{t'} = \text{SAM2}(\mathbf{I}_{t'}, \mathbf{p}_{t'}, \mathcal{M}_{t'-1})
$
where $\mathcal{M}_{t'-1}$ represents the memory context from previous frames.
This further reduced the average annotation time to 4.5 seconds per frame—an 8.4× speedup compared to Phase 1. In Phase 3, 197K masklets were collected, and SAM 2 was retrained five times during the process.
### 5.5 Quality Verification and Auto-Generation
The data engine incorporated a verification step where separate annotators evaluated the quality of each annotated masklet as either "satisfactory" or "unsatisfactory." Unsatisfactory masklets were returned to the annotation pipeline for refinement.
Additionally, to increase dataset diversity, the authors implemented an automatic masklet generation process:
$
\{\mathbf{M}^{(1)}_{1:T}, \mathbf{M}^{(2)}_{1:T}, ..., \mathbf{M}^{(K)}_{1:T}\} = \text{SAM2Auto}(\mathbf{I}_{1:T}, \mathbf{P}_{\text{grid}})
$
where $\mathbf{P}_{\text{grid}}$ represents a regular grid of points in the first frame used to prompt the model, generating $K$ candidate masklets. The grid prompting was implemented at multiple scales:
- A 32×32 grid on the full first frame
- A 16×16 grid on 4 zoomed image crops (from a 2×2 overlapped window)
- A 4×4 grid on 16 zoomed image crops (from a 4×4 overlapped window)
This multi-scale approach ensured coverage of objects at different sizes and positions. After generation, two post-processing steps were applied: removal of tiny disconnected components (< 200 pixels) and filling of small holes (< 200 pixels). These auto-generated masklets were then verified by human annotators, with satisfactory ones added to the dataset and unsatisfactory ones sampled for manual refinement.
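A sketch of the multi-scale grid prompting is given below. The paper does not spell out the exact geometry of the overlapping crops, so the 50%-overlap window layout here is an assumption:

```python
import numpy as np

def multiscale_grid_prompts(width, height):
    """Sketch of multi-scale grid prompting: a 32x32 grid on the full frame,
    a 16x16 grid on each of 4 overlapping crops, and a 4x4 grid on each of
    16 overlapping crops (crop layout assumed, not from the paper)."""
    def grid(n, x0, y0, w, h):
        xs = x0 + (np.arange(n) + 0.5) * w / n
        ys = y0 + (np.arange(n) + 0.5) * h / n
        return [(float(x), float(y)) for y in ys for x in xs]

    prompts = grid(32, 0, 0, width, height)           # full first frame
    for n_windows, n_grid in [(2, 16), (4, 4)]:       # 2x2 and 4x4 crop layouts
        sx, sy = width / (n_windows + 1), height / (n_windows + 1)
        for i in range(n_windows):
            for j in range(n_windows):
                prompts += grid(n_grid, i * sx, j * sy, 2 * sx, 2 * sy)
    return prompts

points = multiscale_grid_prompts(1024, 1024)  # 1024 + 4*256 + 16*16 = 2304 point prompts
```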
The effectiveness of each phase was quantitatively assessed by measuring:
1. **Annotation time per frame**: Decreased from 37.8s (Phase 1) to 4.5s (Phase 3)
2. **Percentage of edited frames**: Decreased from 100% (Phase 1) to 19.04% (Phase 3)
3. **Clicks per clicked frame**: Decreased from 4.80 (Phase 1) to 2.68 (Phase 3)
4. **Mask alignment with Phase 1 reference**: Improved from 86.4% (Phase 2) to 89.1% (Phase 3)
![[CleanShot 2025-02-23 at [email protected]]]
TABLE 1: Evolution of data engine phases showing the average annotation time per frame (decreasing from 37.8s in Phase 1 to 4.5s in Phase 3), the percentage of edited frames per masklet (reducing from 100% to 19.04%), the number of manual clicks per clicked frame (decreasing from 4.80 to 2.68), and mask alignment to Phase 1 by object size (showing consistent improvements across small, medium, and large objects, with perfect alignment for large objects in Phase 3). This quantitative analysis demonstrates how each iteration of the data engine significantly improved annotation efficiency while maintaining or enhancing segmentation quality.
### 5.6 SA-V Dataset Characteristics
The resulting Segment Anything Video (SA-V) dataset represents the largest and most diverse video segmentation dataset to date, comprising:
- 50.9K videos with an average duration of 14 seconds (196 hours total)
- 642.6K masklets (190.9K manual + 451.7K automatic)
- 35.5M individual masks (53× more than any existing video segmentation dataset)
![[CleanShot 2025-02-23 at [email protected]]]
FIGURE 4: Example videos from the SA-V dataset with masklets overlaid (manual and automatic). Each row shows frames from a different video sequence with consistent color-coded segmentation masks tracking various objects across time. The examples demonstrate the dataset's remarkable diversity, including aquatic scenes with fish, people in indoor environments, crowds, animals, and everyday objects. This visualization illustrates how SAM 2 effectively segments and tracks objects through different environments, challenging poses, occlusions, and varying lighting conditions, highlighting the comprehensive nature of the SA-V dataset that enables robust video segmentation capabilities.
The dataset covers a wide variety of scenes, objects, and motions, with several distinctive characteristics:
1. **Object Diversity**: Unlike category-focused datasets, SA-V contains masks for arbitrary objects and parts without semantic constraints.
2. **Size Distribution**: 88% of masks have a normalized area less than 0.1, indicating a focus on smaller and more challenging objects.
3. **Disappearance Rate**: 42.5% of manual masklets contain objects that disappear and reappear, testing the model's ability to handle occlusions.
4. **Geographical Diversity**: Videos were collected across 47 countries to ensure cultural and environmental diversity.
The SA-V dataset is split into training, validation, and test sets based on video authors and geographic locations to minimize overlap. Validation and test videos were specifically selected to include challenging scenarios with fast-moving objects, complex occlusions, and disappearance/reappearance patterns.
![[CleanShot 2025-02-23 at [email protected]]]
TABLE 3: Comparison of SA-V dataset with open-source VOS datasets in terms of number of videos, duration, number of masklets, masks, frames, and disappearance rate. The table quantitatively demonstrates the unprecedented scale of SA-V (50.9K videos, 196 hours, 642.6K masklets, 35.5M masks) compared to existing datasets like DAVIS 2017 (0.2K videos, 0.1 hours, 0.4K masklets, 27.1K masks) or even larger datasets like BURST (2.9K videos, 28.9 hours, 16.1K masklets, 600.2K masks). Notably, SA-V Manual+Auto contains approximately 53× more masks than any previous dataset while maintaining a high disappearance rate (27.7%), ensuring the dataset effectively captures challenging tracking scenarios with object occlusions and reappearances.
### 5.7 Mathematical Analysis of Data Engine Efficiency
The efficiency of the data engine can be formalized by comparing the annotation cost across phases. Let $T$ be the number of frames in a video, $p_e$ be the probability of needing to edit a frame, and $c$ be the cost (in time) of editing a single frame. The total annotation cost can be expressed as:
$
C = T \cdot p_e \cdot c
$
In Phase 1, $p_e = 1$ (all frames require annotation) and $c = c_{\text{SAM}}$ (the cost of using SAM alone):
$
C_{\text{Phase1}} = T \cdot c_{\text{SAM}}
$
In Phase 3, $p_e \approx 0.19$ (only 19% of frames need editing) and $c = c_{\text{SAM2}} < c_{\text{SAM}}$ (the cost of using SAM 2 is lower due to fewer required clicks):
$
C_{\text{Phase3}} = T \cdot 0.19 \cdot c_{\text{SAM2}}
$
Given that $c_{\text{SAM}} \approx 37.8s$ and $c_{\text{SAM2}} \approx 23.7s$ (derived from the clicks per frame and time per click), the efficiency gain is:
$
\frac{C_{\text{Phase1}}}{C_{\text{Phase3}}} \approx \frac{37.8}{0.19 \cdot 23.7} \approx 8.4
$
This 8.4× efficiency improvement enabled the creation of a dataset at an unprecedented scale, which in turn enabled the training of a more capable model—creating a virtuous cycle of improvement.
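A quick check of this arithmetic, using the per-frame costs quoted above:

```python
t_sam, t_sam2, p_edit = 37.8, 23.7, 0.19     # seconds/frame (Phase 1, Phase 3) and edit probability
print(f"{t_sam / (p_edit * t_sam2):.1f}x")   # -> 8.4x
```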
## 6. Experimental Evaluation and Performance Analysis
The authors conducted extensive experiments to evaluate SAM 2's performance across multiple segmentation tasks and benchmarks. The results demonstrate significant improvements over previous approaches in both video and image domains.
### 6.1 Evaluation Metrics and Methodologies
The primary metrics used to evaluate SAM 2's performance are:
1. **J&F Metric**: The standard evaluation metric for video segmentation tasks, calculated as the average of the Jaccard index (J) and boundary precision-recall (F):
$
\text{J\&F} = \frac{1}{2} \left( \frac{1}{T} \sum_{t=1}^{T} \frac{|\mathbf{M}_t \cap \mathbf{G}_t|}{|\mathbf{M}_t \cup \mathbf{G}_t|} + \frac{1}{T} \sum_{t=1}^{T} \frac{2 \cdot \text{Precision}_t \cdot \text{Recall}_t}{\text{Precision}_t + \text{Recall}_t} \right)
$
where $\mathbf{M}_t$ and $\mathbf{G}_t$ are the predicted and ground-truth masks at frame $t$, respectively.
2. **mIoU (Mean Intersection over Union)**: For image segmentation tasks, calculated as:
$
\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\mathbf{M}_i \cap \mathbf{G}_i|}{|\mathbf{M}_i \cup \mathbf{G}_i|}
$
where $N$ is the number of images in the evaluation set.
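For reference, the region term J can be computed as below; the boundary term F additionally requires contour extraction and matching and is omitted here, and the empty-mask convention is an assumption:

```python
import numpy as np

def region_jaccard(pred, gt):
    """J term of J&F: per-frame IoU between binary masks, averaged over frames.
    pred, gt: boolean arrays of shape (T, H, W). Frames where both masks are
    empty are scored as 1.0 (convention assumed)."""
    inter = np.logical_and(pred, gt).sum(axis=(1, 2))
    union = np.logical_or(pred, gt).sum(axis=(1, 2))
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0).mean()
```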
The authors evaluated SAM 2 in three primary settings:
1. **Promptable Video Segmentation (PVS)**: Simulating interactive segmentation with prompts on multiple frames.
2. **Semi-supervised Video Object Segmentation (VOS)**: Using only first-frame prompts.
3. **Image Segmentation**: Evaluating on static images using the original SAM benchmark.
### 6.2 Promptable Video Segmentation Results
For the PVS task, the authors simulated interactive segmentation scenarios using two evaluation protocols:
1. **Offline Evaluation**: Multiple passes through the video, selecting the frame with the largest error for each new interaction.
2. **Online Evaluation**: A single pass through the video, pausing at frames with low-quality predictions (IoU < 0.75) for correction.
Both evaluations used 3 clicks per prompted frame and were conducted on 9 zero-shot video datasets. SAM 2 was compared against two strong baselines:
1. **SAM+XMem++**: Combining SAM for handling prompts with XMem++ for propagation.
2. **SAM+Cutie**: Combining SAM with Cutie, a state-of-the-art VOS model.
The results demonstrate SAM 2's superior performance:
1. **Higher Accuracy**: SAM 2 achieved an average J&F score of 80.3% across the 9 datasets, compared to 74.7% for SAM+Cutie and 71.7% for SAM+XMem++.
2. **Efficiency**: SAM 2 reached the same performance level with 3× fewer interactions than the baselines. This can be formalized as finding the minimum number of frames $n$ required to achieve a target accuracy $\alpha$:
$
n^* = \min\{n : \text{J\&F}(n) \geq \alpha\}
$
For $\alpha = 75\%$, SAM 2 required approximately $n^* = 2$ frames, while the baselines needed $n^* \approx 6$ frames.
3. **Time Efficiency**: When annotation time is accounted for using a cost model that assumes $T_{\text{loc}} = 1$ second to locate an object and $T_{\text{click}} = 1.5$ seconds per click, SAM 2 achieved better performance with significantly less annotation time.
### 6.3 Semi-supervised Video Object Segmentation Results
For the semi-supervised VOS task, where only the first frame is prompted, SAM 2 was evaluated on 17 video datasets using various prompt types: 1-click, 3-click, 5-click, bounding box, and ground-truth mask.
SAM 2 consistently outperformed the baselines across all prompt types:
1. **Click Prompts**: With a single click, SAM 2 achieved 64.7% J&F, compared to 56.9% for SAM+XMem++ and 56.7% for SAM+Cutie. With 3 clicks, SAM 2 reached 75.3% J&F (vs. 68.4% and 70.1% for the baselines).
2. **Box and Mask Prompts**: With bounding box prompts, SAM 2 achieved 74.4% J&F (vs. 67.6% and 69.4%). With ground-truth masks, SAM 2 reached 79.3% J&F (vs. 72.7% and 74.1%).
These results are particularly notable because the baselines were specifically designed for the semi-supervised VOS task, while SAM 2 is a more general model that excels across multiple segmentation paradigms.
### 6.4 Image Segmentation Results
Despite being designed for both video and image segmentation, SAM 2 also outperforms the original SAM on static image segmentation tasks. Evaluated on 37 zero-shot datasets (including the 23 original SAM benchmarks), SAM 2 demonstrated several advantages:
1. **Higher Accuracy**: On the original 23 SAM benchmarks, SAM 2 achieved 58.9% 1-click mIoU compared to 58.1% for SAM. With 5 clicks, SAM 2 reached 81.7% mIoU versus 81.3% for SAM. When trained on the combined dataset (SA-1B plus video data), SAM 2's performance further improved to 61.9% 1-click mIoU and 83.5% 5-click mIoU.
2. **Domain-specific Performance**: Performance breakdown by domain reveals particularly strong improvements in video-derived images:
- **Image-domain datasets**: SAM 2 achieved 63.3% 1-click mIoU (vs. 60.8% for SAM)
- **Video-domain datasets in SA-23**: SAM 2 reached 60.1% 1-click mIoU (vs. 54.5% for SAM)
- **14 additional video datasets**: SAM 2 attained 69.6% 1-click mIoU (vs. 59.1% for SAM)
3. **Computational Efficiency**: SAM 2 is significantly faster, processing images at 130.1 FPS compared to 21.7 FPS for SAM—a 6× speedup. This efficiency gain stems from three key factors:
- Replacement of the ViT image encoder with the more efficient hierarchical Hiera architecture
- Removal of relative positional biases (RPB) enabling use of optimized attention kernels
- Implementation of Flash Attention-2 for accelerated attention operations
4. **Encoder Size Efficiency**: The authors demonstrated that even smaller variants of SAM 2 maintain strong performance:
- SAM 2 (Hiera-B+): 58.9% 1-click mIoU at 130.1 FPS
- SAM 2 (Hiera-L): 60.0% 1-click mIoU at 61.4 FPS (still 2.8× faster than SAM)
- SAM (ViT-H): 58.1% 1-click mIoU at 21.7 FPS
- SAM (ViT-B): 55.9% 1-click mIoU at 76.7 FPS
5. **Video-to-Image Transfer**: The performance on image datasets improved when the model was trained with video data, indicating effective knowledge transfer from video understanding to static image segmentation. This finding suggests that temporal context provides useful priors even for static image tasks.
These results demonstrate that by unifying image and video segmentation in a single architecture, SAM 2 not only maintains but enhances performance on static image tasks while drastically improving computational efficiency. The significant speedup makes SAM 2 more practical for real-time applications, while the accuracy improvements highlight the benefits of cross-domain knowledge transfer between image and video tasks.
### 6.5 Comparison to State-of-the-Art in Semi-supervised VOS
The authors also compared SAM 2 against specialized state-of-the-art VOS methods on standard benchmarks including MOSE, DAVIS 2017, LVOS, and YouTubeVOS 2019. Two versions of SAM 2 were evaluated:
1. **SAM 2 (Hiera-B+)**: The standard model using a Hiera-B+ image encoder.
2. **SAM 2 (Hiera-L)**: A larger model using a Hiera-L image encoder.
The results show that SAM 2 significantly outperforms previous methods:
1. **MOSE val**: SAM 2 (Hiera-L) achieved 77.9% J&F, compared to 71.7% for Cutie-base+.
2. **DAVIS 2017 val**: SAM 2 (Hiera-L) reached 90.7% J&F, surpassing JointFormer's 90.1%.
3. **LVOS val**: SAM 2 (Hiera-L) obtained 78.0% J&F, substantially higher than Cutie-base's 66.0%.
4. **SA-V val/test**: Most notably, on the challenging SA-V benchmark, SAM 2 (Hiera-L) achieved 77.9%/78.4% J&F, far above the best previous method (Cutie-base+ with 61.3%/62.8%).
These results are particularly impressive considering that SAM 2 still runs at real-time speeds (43.8 FPS for Hiera-B+ and 30.2 FPS for Hiera-L on a single A100 GPU), while previous methods often require multiple passes or are significantly slower.
### 6.6 Ablation Studies and Technical Insights
The authors conducted extensive ablation studies that provide valuable technical insights:
1. **Data Mixture**: Training on the combination of existing VOS datasets, SA-V, and SA-1B yielded the best overall performance. Notably, training only on existing VOS datasets resulted in poor generalization, while adding SA-V data provided a substantial boost to zero-shot performance (+12.1% on 9 zero-shot video datasets).
2. **Data Composition Impact**: Detailed ablation studies on data mixtures revealed several important findings:
- Models trained solely on existing VOS datasets performed well on in-domain MOSE data (76.9% J&F) but poorly on zero-shot benchmarks (59.7% J&F).
- Adding SA-V data dramatically improved zero-shot performance from 59.7% to 70.9% J&F.
![[CleanShot 2025-02-23 at [email protected]]]
TABLE 2: Segmentation accuracy (J&F metric) improvement from adding data from each data engine phase. The table shows clear performance gains on both the SA-V validation set and 9 zero-shot datasets as more data is incorporated: starting with VOS+SA-1B as the baseline (50.0/62.5), then adding Phase 1 data (53.0/66.9), Phase 2 data (58.8/70.9), Phase 3 data (62.5/71.2), and finally auto-generated data (63.2/71.5). This progressive improvement demonstrates the value of each data collection phase and confirms the effectiveness of the iterative data engine approach, with particularly substantial gains coming from Phase 2 and 3 data.
- Including SA-1B image data improved image segmentation performance without degrading video capability.
- Performance on the challenging SA-V validation set increased progressively with the addition of each data source: 48.1% (VOS only) → 57.0% (SA-1B) → 63.0% (Internal data) → 63.6% (full mixture).
- The most balanced performance across all benchmarks was achieved by combining VOS, SA-V, Internal data, and SA-1B, demonstrating the complementary nature of these datasets.
3. **Data Scaling**: Performance follows a consistent power law relationship with the quantity of training data across all benchmarks, highlighting the importance of the large-scale SA-V dataset.
4. **Data Quality vs. Quantity**: Experiments with data filtering showed that curating 50K masklets based on the number of edited frames (a proxy for difficulty) outperformed random sampling of the same amount (66.2% vs. 63.7% J&F on SA-V val), though using the full dataset (190K masklets) yielded the best results (69.9% J&F).
5. **Model Capacity**: Increasing resolution, number of frames during training, memory size, and model capacity all contributed to performance improvements, with different components affecting video and image performance to varying degrees.
6. **Memory Architecture**: The use of object pointers significantly improved performance on the SA-V val dataset (+3.8% J&F) and on the challenging LVOSv2 benchmark (+4.6% J&F), indicating their importance for maintaining object identity in complex videos.
7. **Positional Encoding**: Removing relative positional biases from the image encoder and using 2D Rotary Position Embeddings in the memory attention module provided a good balance between performance and computational efficiency.
The results from these ablations informed the final architectural choices in SAM 2, resulting in a model that achieves state-of-the-art performance while maintaining real-time inference capabilities.
## 7. Technical Implementation Considerations
Implementing and deploying SAM 2 in real-world applications requires careful attention to several technical aspects. Based on insights from the paper and supplementary materials, we highlight key implementation considerations for machine learning practitioners.
### 7.1 Optimizing Attention Operations
A key factor in SAM 2's real-time performance is the efficient implementation of attention mechanisms. The authors highlight the importance of using optimized attention kernels such as FlashAttention-2 (Dao, 2023), which is enabled by removing Relative Positional Biases (RPB) from the image encoder.
For the memory attention module, the authors implemented 2D Rotary Position Embeddings (RoPE) for both self-attention and cross-attention operations, providing better spatial understanding while maintaining computational efficiency. This design choice was informed by ablation studies showing that 2D-RoPE outperformed other positional encoding schemes for video tasks.
When implementing these attention mechanisms, developers should:
- Use the most efficient attention kernels available for their specific hardware
- Apply 2D-RoPE only to spatial tokens (not to object pointer tokens)
- Optimize memory access patterns in cross-attention to avoid redundant computations
- Implement efficient batching strategies for processing multiple objects in parallel
The cross-attention read itself is the same $\text{Softmax}\left(\mathbf{Q}\mathbf{K}^T/\sqrt{d}\right)\mathbf{V}$ operation given in Section 4.3, with $\mathbf{Q}$ derived from the current frame features and $\mathbf{K}$, $\mathbf{V}$ constructed from the memory bank elements.
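In PyTorch, for example, dropping additive positional biases lets this attention go through the fused `scaled_dot_product_attention` path, which can dispatch to FlashAttention-style backends on supported hardware. A minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, heads, tokens, head_dim). 32x32 = 1024 current-frame query tokens,
# keys/values drawn from roughly 7 stored memory frames.
q = torch.randn(1, 8, 1024, 32)
k = torch.randn(1, 8, 7 * 1024, 32)
v = torch.randn_like(k)
out = F.scaled_dot_product_attention(q, k, v)  # fused softmax(QK^T / sqrt(d)) V
```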
### 7.2 Memory Bank Design
Efficient memory management is crucial for processing long videos. SAM 2's memory bank design maintains a bounded memory footprint regardless of video length through a carefully designed structure that preserves both short-term temporal context and long-term object identity information. The memory bank architecture allows the model to handle arbitrarily long videos while enabling interactive refinement at any frame.
The memory bank in SAM 2 maintains three types of information:
1. **Recent Frame Memories**: A FIFO queue storing features from the $N$ most recent frames (typically $N = 6$). These memories include temporal position encodings to help the model understand short-term motion patterns.
2. **Prompted Frame Memories**: Features from frames that received user prompts (up to $M$ frames, typically $M = 5$). Unlike recent memories, these do not include temporal position encodings, as prompted frames may come from temporally distant parts of the video.
3. **Object Pointers**: Semantic tokens derived from the mask decoder's output for each processed frame, capturing high-level information about the target object.
The mathematical formulation of the memory bank can be expressed as:
$
\mathcal{M}_t = \{\mathcal{M}_t^{\text{recent}}, \mathcal{M}_t^{\text{prompted}}, \mathcal{M}_t^{\text{pointers}}\}
$
where:
$
\mathcal{M}_t^{\text{recent}} = \{\mathbf{M}_{t-N+1}^{\text{mem}}, \ldots, \mathbf{M}_t^{\text{mem}}\}
$
$
\mathcal{M}_t^{\text{prompted}} = \{\mathbf{M}_{t_1}^{\text{mem}}, \mathbf{M}_{t_2}^{\text{mem}}, \ldots, \mathbf{M}_{t_{\min(n, M)}}^{\text{mem}}\}
$
$
\mathcal{M}_t^{\text{pointers}} = \{\mathbf{O}_{t_1}, \ldots, \mathbf{O}_{t_{\min(n, M)}}\} \cup \{\mathbf{O}_{t-N+1}, \ldots, \mathbf{O}_t\}
$
When implementing the memory bank, developers should consider:
- Using efficient data structures for the FIFO queues
- Downsampling memory features to reduce storage requirements (the authors used 64-dimensional features for memory storage versus 256-dimensional features for processing)
- Implementing efficient indexing mechanisms to quickly retrieve relevant memories
- Adding temporal position encodings only to recent memories, not to prompted memories
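A minimal sketch of such a memory bank is given below, assuming for simplicity that memories are stored as $(h, w, c)$ feature maps and that object pointers share the same channel dimension $c$ (a simplification: as noted above, the stored memory features are lower-dimensional than the processing features). The class and method names are illustrative rather than SAM 2's API.

```python
from collections import deque
import torch

class MemoryBank:
    """Three-part memory bank sketch: FIFO recent memories, a capped set of
    prompted-frame memories, and object pointer tokens. Illustrative only."""

    def __init__(self, n_recent: int = 6, m_prompted: int = 5):
        self.recent = deque(maxlen=n_recent)                  # oldest entry evicted automatically
        self.prompted = {}                                    # frame idx -> memory features
        self.m_prompted = m_prompted
        self.pointers = deque(maxlen=n_recent + m_prompted)   # keeps the footprint bounded

    def add_frame(self, idx: int, mem: torch.Tensor, pointer: torch.Tensor,
                  is_prompted: bool = False) -> None:
        if is_prompted and len(self.prompted) < self.m_prompted:
            self.prompted[idx] = mem   # prompted memories carry no temporal pos. encoding
        else:
            self.recent.append(mem)    # recent memories get temporal pos. encodings at read time (not shown)
        self.pointers.append(pointer)

    def as_tokens(self) -> torch.Tensor:
        """Flatten everything into one key/value sequence for memory cross-attention."""
        spatial = [m.flatten(0, 1) for m in list(self.prompted.values()) + list(self.recent)]
        ptrs = [p.unsqueeze(0) for p in self.pointers]         # (1, c) each
        return torch.cat(spatial + ptrs, dim=0)

# Toy usage with (h, w, c) = (64, 64, 64) memory features and 64-d pointers.
bank = MemoryBank()
for t in range(10):
    bank.add_frame(t, torch.randn(64, 64, 64), torch.randn(64), is_prompted=(t == 0))
kv = bank.as_tokens()   # (N_memory_tokens, 64)
```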
### 7.3 Deployment Considerations
For deploying SAM 2 in resource-constrained environments or real-time applications, several optimization techniques can be applied:
1. **Model Quantization**: Converting model weights from FP32 to lower precision (FP16 or INT8) can significantly reduce the memory footprint and increase inference speed with minimal accuracy loss. The image encoder, which dominates the per-frame compute, is typically the first candidate for quantization.
2. **Model Distillation**: A smaller, faster student model can be trained to mimic SAM 2's behavior. This approach is particularly effective for the memory attention module, where knowledge distillation can reduce the number of attention layers while maintaining performance.
3. **Adaptive Processing**: For long videos, implementing keyframe-based processing where full attention is only applied to frames with significant changes can improve efficiency. This can be formalized as:
$
\mathbf{F}_t^{\text{mem}} =
\begin{cases}
f_{\text{mem-attn}}(\mathbf{F}_t, \mathcal{M}_{t-1}), & \text{if } \|\mathbf{F}_t - \mathbf{F}_{t-1}\|_2 > \tau \\
\mathbf{F}_{t-1}^{\text{mem}}, & \text{otherwise}
\end{cases}
$
where $\tau$ is a threshold for detecting significant changes between consecutive frames (a minimal sketch of this gating appears after this list).
4. **Hardware Acceleration**: Leveraging specialized hardware like tensor cores on NVIDIA GPUs or Neural Engine on Apple Silicon can provide substantial speedups for attention operations and convolutions.
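Below is a minimal sketch of the keyframe gating from point 3; the change metric (RMS feature difference), the default threshold, and the function names are illustrative assumptions rather than a prescribed implementation.

```python
import torch

def gated_memory_attention(feat_t, prev_feat, prev_mem_feat, memory_bank,
                           mem_attn, tau: float = 0.5):
    """Run full memory attention only when the current frame embedding differs
    sufficiently from the previous one; otherwise reuse the previous result."""
    if prev_feat is None:
        return mem_attn(feat_t, memory_bank)       # first frame: no shortcut possible
    # RMS per-element difference as a cheap change detector (one possible choice).
    change = torch.linalg.vector_norm(feat_t - prev_feat) / feat_t.numel() ** 0.5
    if change > tau:
        return mem_attn(feat_t, memory_bank)       # keyframe: full memory attention
    return prev_mem_feat                           # near-static frame: reuse previous output
```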
The authors report that SAM 2 achieves real-time performance (30+ FPS) on a single A100 GPU, making it suitable for interactive applications. For edge deployment, the Hiera-T variant offers a reasonable trade-off between performance and computational requirements.
### 7.4 Image Segmentation Performance
![[CleanShot 2025-02-23 at
[email protected]]]
TABLE 5: Zero-shot accuracy on the Segment Anything (SA) task across 37 datasets. The table compares SAM 2 with the original SAM, showing average 1- and 5-click mIoU by domains (image/video). Results are reported for three configurations: original SAM trained on SA-1B, SAM 2 trained on SA-1B, and SAM 2 trained on "our mix" (including SA-V data). Key findings include: (1) When trained on the same SA-1B data, SAM 2 matches SAM on images while showing improvements on video frames; (2) When trained on the enhanced data mix, SAM 2 demonstrates substantial improvements across all categories, with particularly impressive gains on video datasets (+5.6% on SA-23 Video, +10.5% on 14 new Video datasets); (3) SAM 2 achieves these improvements while running 6× faster than the original SAM (130.1 FPS vs. 21.7 FPS). This demonstrates that SAM 2's temporal architecture does not sacrifice image segmentation performance, but rather enhances it, especially when trained on diverse data.
The results demonstrate that SAM 2 maintains or improves upon SAM's image segmentation quality while adding video segmentation capabilities. Notably, the largest gains appear on video frames, even under single-frame inference, suggesting that the architecture is inherently better suited to the visual characteristics of video content.
### 7.5 Interactive Segmentation Performance
The interactive segmentation scenario represents one of the most important use cases for SAM 2, as it demonstrates the model's ability to integrate user feedback efficiently across video frames. In this setting, the model receives sparse annotations (typically clicks or boxes) on a subset of frames and must propagate these annotations to all other frames.
![[CleanShot 2025-02-23 at
[email protected]]]
FIGURE 5: Zero-shot accuracy over 9 datasets in interactive offline and online evaluation settings. This comparison shows average J&F scores across datasets as the number of annotated frames (with 3 clicks per frame) increases from 1 to 8. The figure compares SAM 2 against two leading approaches: SAM + XMem++ and SAM + Cutie. The two panels represent: (a) offline evaluation, where all annotated frames are available simultaneously, and (b) online evaluation, where frames are processed sequentially. SAM 2 consistently outperforms competing methods in both settings, demonstrating superior accuracy with minimal user interaction - starting with a 7-8% advantage with just one annotated frame and maintaining a significant lead even as more frames are annotated. This performance gap highlights SAM 2's efficient integration of temporal information and robust memory mechanisms for tracking objects across frames.
![[CleanShot 2025-02-23 at
[email protected]]]
TABLE 4: Zero-shot accuracy across 17 video datasets using different prompts. The table compares SAM 2 against SAM+XMem++ and SAM+Cutie across five prompt types (1-click, 3-click, 5-click, bounding box, and ground-truth mask) in the first video frame. SAM 2 demonstrates substantial improvements across all prompt types, with particularly significant gains in the minimal interaction scenario (1-click: 64.7 vs 56.9/56.7). The consistent performance advantage persists even with more detailed prompts, culminating in a 5.2% improvement over the next best method when using ground-truth masks (79.3 vs 74.1). These results quantitatively demonstrate SAM 2's superior ability to leverage sparse user inputs while maintaining high segmentation quality throughout video sequences.
Experimental results demonstrate SAM 2's superiority over decoupled approaches that combine SAM with state-of-the-art video object segmentation methods. When evaluated on zero-shot datasets with varying levels of user interaction, SAM 2 consistently achieves higher J&F scores across different interaction budgets.
### 7.6 Semi-supervised Video Object Segmentation Performance
![[CleanShot 2025-02-23 at
[email protected]]]
TABLE 6: VOS comparison to prior work. This table compares SAM 2 with state-of-the-art video object segmentation methods across multiple benchmarks (MOSE val, DAVIS 2017 val, LVOS val, SA-V val/test, YTVOS 2019 val) using first-frame ground-truth mask prompts. The results demonstrate SAM 2's superior performance across all datasets, with both variants (Hiera-B+ and Hiera-L) significantly outperforming previous methods. Particularly noteworthy are the substantial gains on the MOSE validation set (+6.2% over Cutie-base+), LVOS validation set (+12% over Cutie-base), and SA-V val/test (+16.6%/+15.6% over the best competitor). These results are especially impressive considering that many competing methods are specialized for video object segmentation, while SAM 2 maintains strong performance on both image and video tasks. The consistent improvements across diverse benchmarks demonstrate the effectiveness of SAM 2's unified architecture and memory mechanisms for maintaining temporal coherence.
The semi-supervised video object segmentation results further validate SAM 2's capabilities in tracking objects across frames based solely on a ground-truth mask in the first frame. This is a particularly challenging scenario that tests the model's ability to handle object appearance changes, occlusions, and complex motion patterns without additional user guidance. The substantial performance improvements over specialized VOS methods highlight the effectiveness of SAM 2's memory attention mechanism and the benefits of the SA-V dataset for training robust video segmentation models.
## 8. Limitations and Future Research Directions
Despite SAM 2's impressive capabilities, the authors acknowledge several limitations that present opportunities for future research and development.
### 8.1 Technical Limitations
1. **Temporal Consistency in Complex Scenarios**: SAM 2 may struggle to maintain object identity across shot changes and can lose track of or confuse objects in crowded scenes, particularly after extended occlusions. The memory mechanism, while effective for moderate-length videos, has limitations in very long sequences where object appearance may change dramatically.
2. **Fine Detail Preservation**: The model exhibits difficulties in tracking objects with very thin or fine details, especially when these objects move rapidly. This limitation stems partly from the downsampling operations in the image encoder and the memory attention module.
3. **Object Disambiguation**: In scenarios with multiple similar-looking objects (e.g., identical juggling balls), SAM 2 may incorrectly switch between objects. This indicates limitations in the object identity modeling through the memory bank.
4. **Independent Object Processing**: While SAM 2 can track multiple objects in a video simultaneously, it processes each object independently, utilizing only shared per-frame embeddings without inter-object communication. This approach, while simple, fails to leverage potential contextual relationships between objects.
### 8.2 Future Research Directions
Building upon these limitations, several promising research directions emerge:
1. **Explicit Motion Modeling**: Incorporating dedicated motion modeling components, such as optical flow estimation or spatio-temporal convolutions, could enhance SAM 2's ability to track objects through fast movements, occlusions, and appearance changes. This could be formalized as:
$
\mathbf{M}_t = f_{\text{dec}}(\mathbf{F}_t^{\text{mem}}, \mathbf{P}_t, \mathbf{V}_{t-1:t})
$
where $\mathbf{V}_{t-1:t}$ represents explicit motion information between consecutive frames.
2. **Multi-object Reasoning**: Extending the architecture to perform joint reasoning across multiple objects could improve disambiguation in cluttered scenes. This could involve graph neural networks or transformer-based approaches that model relationships between objects:
$
\{\mathbf{M}_t^{(1)}, \mathbf{M}_t^{(2)}, ..., \mathbf{M}_t^{(K)}\} = f_{\text{multi-obj}}(\mathbf{F}_t^{\text{mem}}, \{\mathbf{P}_t^{(1)}, \mathbf{P}_t^{(2)}, ..., \mathbf{P}_t^{(K)}\})
$
3. **Hierarchical Temporal Modeling**: Implementing a hierarchical memory structure that captures both short-term and long-term temporal dependencies could improve performance on extended videos; a toy sketch of this idea appears after this list. This might involve maintaining representations at multiple temporal scales:
$
\mathcal{M}_t = \{\mathcal{M}_t^{\text{short}}, \mathcal{M}_t^{\text{medium}}, \mathcal{M}_t^{\text{long}}\}
$
4. **Automated Quality Assessment**: Developing mechanisms to automatically identify frames where segmentation quality might be low could guide interactive refinement more effectively, reducing the burden on users.
5. **Multi-modal Integration**: Incorporating additional modalities such as audio or text could provide complementary cues for object tracking and segmentation, particularly in ambiguous scenarios.
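To make the hierarchical idea in point 3 concrete, the following toy sketch maintains memories at several temporal strides. The strides, capacities, and class name are hypothetical illustrations of the concept and are not part of SAM 2.

```python
from collections import deque
import torch

class HierarchicalMemory:
    """Toy multi-rate memory: each level keeps a bounded window of memories,
    sampled at progressively coarser temporal strides."""

    def __init__(self, strides=(1, 8, 64), capacity: int = 6):
        self.strides = strides
        self.levels = [deque(maxlen=capacity) for _ in strides]

    def update(self, frame_idx: int, mem: torch.Tensor) -> None:
        for level, stride in zip(self.levels, self.strides):
            if frame_idx % stride == 0:
                level.append(mem)      # coarser levels retain older context for longer

    def read(self) -> torch.Tensor:
        """Concatenate all scales into one key/value sequence for memory attention."""
        toks = [m.flatten(0, 1) for level in self.levels for m in level]
        return torch.cat(toks, dim=0)

# Toy usage with (h, w, c) = (64, 64, 64) memory features.
hmem = HierarchicalMemory()
for t in range(256):
    hmem.update(t, torch.randn(64, 64, 64))
kv = hmem.read()   # short-, medium-, and long-term memories flattened together
```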
### 8.3 Application-specific Extensions
The authors suggest several domain-specific adaptations that could extend SAM 2's capabilities:
1. **Long-form Video Processing**: Specialized mechanisms for handling scene changes, shot boundaries, and persistent long-term object identities could enhance SAM 2's applicability to movie editing or long-form content analysis.
2. **Medical Imaging**: Adaptations for volumetric medical data (CT, MRI) could leverage SAM 2's interactive segmentation capabilities for 3D anatomical structures.
3. **AR/VR Applications**: Optimizing for very low-latency operation on edge devices could enable new interactive applications in augmented and virtual reality.
4. **Robotics Integration**: Extending the model to handle egocentric viewpoints and integrate with robotic control systems could enable more sophisticated object manipulation tasks.
The authors also conducted a fairness evaluation of SAM 2, assessing its performance across demographic groups using video data from the Ego-Exo4D dataset with self-reported demographic information. The evaluation focused on segmenting people across gender (male/female) and age groups (18-25, 26-50, 50+). Results showed minimal performance discrepancy based on perceived gender when using 3-click prompts or ground-truth masks (less than 1% J&F difference). Similarly, there was little variance among the three age groups evaluated. This fairness analysis provides a baseline for future work but would benefit from expansion to include additional demographic attributes and object categories beyond people.
These limitations and future directions highlight the evolving nature of video segmentation research and provide a roadmap for building upon SAM 2's foundation to develop even more capable systems.
## 9. Conclusion
SAM 2 represents a significant advancement in visual understanding by extending promptable segmentation from static images to the video domain. Through a careful synthesis of architectural innovations, training methodology, and data collection strategies, the authors have created a unified foundation model for segmentation that addresses fundamental challenges in video understanding while maintaining or improving performance on static images.
The key technical contributions that enable SAM 2's capabilities include:
1. **Memory-Augmented Architecture**: The streaming transformer design with a multi-component memory bank enables spatio-temporal reasoning while maintaining constant memory requirements, allowing the model to process arbitrarily long videos in real-time.
2. **Interactive Prompting Framework**: By generalizing prompt types across frames and maintaining object context through memories, SAM 2 enables natural interaction with video content through minimal prompts, dramatically reducing the annotation effort compared to previous approaches.
3. **Data Engine Methodology**: The iterative approach to data collection, using the evolving model to assist annotation, represents a scalable paradigm for building large-scale datasets for complex tasks where manual annotation would be prohibitively expensive.
From a theoretical perspective, SAM 2 advances our understanding of how transformer architectures can be adapted for streaming temporal data, providing insights into memory mechanisms, attention operations, and training strategies that may generalize to other domains beyond segmentation.
From a practical perspective, SAM 2 provides a foundation for numerous applications across video editing, AR/VR, robotics, and autonomous systems. The model's ability to work with various prompt types, adapt to user feedback, and operate in real-time makes it suitable for integration into interactive systems.
The demonstrated improvements over specialized methods for both video object segmentation and image segmentation suggest that unified approaches to visual understanding across domains may be more effective than domain-specific solutions. This aligns with the broader trend in foundation models, where versatile architectures trained on diverse data outperform specialized systems.
As research continues to address the limitations outlined in this paper, we can anticipate further advances in temporal understanding, multi-object reasoning, and hierarchical processing that will bring us closer to truly general-purpose visual understanding systems.
## 10. References
Bai, X., & Sapiro, G. (2007). A geodesic framework for fast interactive image and video segmentation and matting. In ICCV.
Bekuzarov, M., Bermudez, A., Lee, J. Y., & Li, H. (2023). XMem++: Production-level video segmentation from few annotated frames. In ICCV, 635-644.
Caelles, S., Maninis, K. K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., & Van Gool, L. (2016). One-shot video object segmentation. In CVPR, 5320-5329.
Cheng, H. K., Oh, S. W., Price, B., Lee, J. Y., & Schwing, A. (2023b). Tracking anything with decoupled video segmentation. In ICCV, 1316-1326.
Cheng, H. K., & Schwing, A. G. (2022). XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. In ECCV, 640-658.
Cheng, H. K., Tai, Y. W., & Tang, C. K. (2021a). Rethinking space-time networks with improved memory coverage for efficient video object segmentation. In NeurIPS.
Cheng, H. K., Tai, Y. W., & Tang, C. K. (2021b). Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. In CVPR.
Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
Delatolas, T., Kalogeiton, V., & Papadopoulos, D. P. (2024). Learning the what and how of annotation in video object segmentation. In WACV, 6951-6961.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In CVPR.
Heo, B., Park, S., Han, D., & Yun, S. (2024). Rotary position embedding for vision transformer. arXiv preprint arXiv:2403.13298.
Hong, L., Liu, Z., Chen, W., Tan, C., Feng, Y., Zhou, X., Guo, P., Li, J., Chen, Z., Gao, S., et al. (2024). LVOS: A benchmark for large-scale long-term video object segmentation. arXiv preprint arXiv:2404.19326.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W. Y., et al. (2023). Segment anything. In ICCV.
Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In CVPR.
Oh, S. W., Lee, J. Y., Xu, N., & Kim, S. J. (2019). Video object segmentation using space-time memory networks. In ICCV, 9225-9234.
Ryali, C., Hu, Y. T., Bolya, D., Wei, C., Fan, H., Huang, P. Y., Aggarwal, V., Chowdhury, A., Poursaeed, O., Hoffman, J., Malik, J., Li, Y., & Feichtenhofer, C. (2023). Hiera: A hierarchical vision transformer without the bells and whistles. In ICML.
Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2021). Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
Wang, J., Jiang, H., Yuan, Z., Cheng, M. M., Hu, X., & Zheng, N. (2019). Salient object detection: A discriminative regional feature integration approach. International Journal of Computer Vision, 123, 251-268.
Yang, J., Gao, M., Li, Z., Gao, S., Wang, F., & Zheng, F. (2023). Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968.
Yang, Z., Wei, Y., & Yang, Y. (2021b). Associating objects with transformers for video object segmentation. In NeurIPS.