Our method introduces high-level spatial and motion semantics into masked video modeling to enhance video representation learning. We achieve this through two core innovations: injecting explicit motion by overlaying synthetic object movements onto training clips, and replacing low-level pixel reconstruction with high-level semantic supervision from a frozen CLIP encoder.
During training, we apply a hybrid masking strategy.
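As a minimal sketch of one plausible hybrid scheme, masking could be biased toward patches covered by the overlaid object before filling the remaining budget at random; the function name `hybrid_mask`, the 0.75 ratio, and the policy itself are illustrative assumptions, not the exact strategy used in the paper:

```python
import numpy as np

def hybrid_mask(object_map, mask_ratio=0.75, rng=None):
    """Illustrative hybrid masking: prefer patches that overlap the overlaid
    synthetic object, then fill up to `mask_ratio` with random patches.
    object_map: (T, H, W) boolean array over the token grid marking patches
    covered by the pasted object. Ratios and policy are assumptions."""
    rng = rng or np.random.default_rng()
    t, h, w = object_map.shape
    n_tokens = t * h * w
    n_mask = int(mask_ratio * n_tokens)

    flat_obj = object_map.reshape(-1)
    obj_idx = np.flatnonzero(flat_obj)    # patches on the synthetic object
    bg_idx = np.flatnonzero(~flat_obj)    # background patches

    # Mask object patches first (up to the budget), then random background patches.
    rng.shuffle(obj_idx)
    rng.shuffle(bg_idx)
    chosen = np.concatenate([obj_idx, bg_idx])[:n_mask]

    mask = np.zeros(n_tokens, dtype=bool)
    mask[chosen] = True
    return mask.reshape(t, h, w)
```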
Our architecture follows a student-teacher design: the student is a transformer-based encoder-decoder trained to reconstruct masked features, while the teacher is a frozen CLIP encoder providing high-level supervision.
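The following is a minimal PyTorch sketch of this student-teacher setup, assuming the student predicts per-token CLIP-space features for masked positions; the module sizes, the smooth-L1 objective, and the `teacher` interface are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedFeatureStudent(nn.Module):
    """Toy stand-in for the transformer encoder-decoder student:
    encodes tokens (masked ones replaced by a learned mask token)
    and predicts teacher features for every position."""
    def __init__(self, dim=768):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.decoder = nn.Linear(dim, dim)   # predicts CLIP-space features
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens, mask):
        # tokens: (B, N, dim) patch embeddings; mask: (B, N) bool, True = masked.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        return self.decoder(self.encoder(x))

def training_step(student, teacher, tokens, mask):
    """One step of masked feature reconstruction against a frozen teacher."""
    with torch.no_grad():
        target = teacher(tokens)             # high-level semantic targets
    pred = student(tokens, mask)
    # Supervise only the masked positions (a common choice; an assumption here).
    return F.smooth_l1_loss(pred[mask], target[mask])

# Toy usage: a frozen linear map stands in for the CLIP teacher.
student = MaskedFeatureStudent()
teacher = nn.Linear(768, 768).eval().requires_grad_(False)
tokens, mask = torch.randn(2, 196, 768), torch.rand(2, 196) > 0.25
loss = training_step(student, teacher, tokens, mask)
```

In the actual method the teacher would be a frozen CLIP visual encoder producing semantic features for the unmasked clip; the sketch only illustrates the flow of supervision from teacher to student.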
By combining synthetic motion, semantic supervision, and targeted masking, our method learns temporally-aware and semantically-rich video representations. Notably, it achieves strong downstream performance even without access to natural video data.
Our method achieves state-of-the-art performance across multiple video understanding benchmarks, demonstrating strong generalization in both semantic and motion-centric tasks.
We achieve top results on both Something-Something V2 (SSv2) and Kinetics-400 (K400) benchmarks using full finetuning with a ViT-B backbone. Our method outperforms prior self-supervised video models, including MGMAE, MME, and MGM, by up to 2.5% on SSv2 and 2.3% on K400.
Even with fewer training epochs, our model surpasses existing methods trained for longer, thanks to CLIP-based semantic supervision that offers a more efficient and meaningful learning signal than traditional pixel-level reconstruction.
We evaluate our method on the SEVERE benchmark, which tests four key generalization factors: domain shift, sample efficiency, action granularity, and task shift. Our model consistently outperforms existing approaches across all four factors.
Overall, our full model with synthetic motion achieves a 3.0% average gain over strong baselines, demonstrating robust generalization across diverse video understanding scenarios.
Compared to CLIP-based methods like ViCLIP and UMT, our model achieves higher accuracy on motion-sensitive datasets, especially under linear probing. This suggests stronger video representations, learned with less data and without relying on video-text alignment.
We evaluate our model on two temporally-aware tasks: Unsupervised Video Object Segmentation (Un-VOS) and Temporal Action Localization (TAL). These tasks probe the model's ability to capture motion boundaries and maintain object consistency over time.
On Un-VOS (DAVIS, YTVOS), our method outperforms prior works like MGMAE, MGM, and even the clustering-based SIGMA—showing up to 7% improvement in mIoU. On TAL (THUMOS-14, ActivityNet-v1.3), we surpass VideoMAE by 7% and MGMAE by 9%, and even approach the performance of fully-supervised models trained with labeled Kinetics-400.
These results highlight our strong temporal modeling capabilities and broad applicability across diverse video understanding tasks.
We show that our method can learn strong video representations even without any natural video data. By overlaying synthetic object motions onto static backgrounds—including black and noise images—our model achieves 22.2% on K400m and 23.0% on SSv2m, significantly outperforming random initialization.
This highlights a new learning paradigm: masked video modeling driven purely by motion, without relying on real-world videos.
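As a rough sketch of how such a clip could be synthesized, assuming a segmented object crop is alpha-blended onto a static background along a simple linear trajectory (the function name `synthesize_motion_clip`, the trajectory, and the blending choices are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def synthesize_motion_clip(background, obj_rgba, num_frames=16, rng=None):
    """Paste a segmented object (RGBA crop) onto a static background along a
    linear trajectory, producing a clip whose only temporal signal is the
    synthetic object motion. The object is assumed smaller than the background.
    background: (H, W, 3) uint8 image (e.g. black or noise); obj_rgba: (h, w, 4)."""
    rng = rng or np.random.default_rng()
    H, W, _ = background.shape
    h, w = obj_rgba.shape[:2]
    alpha = obj_rgba[..., 3:4].astype(np.float32) / 255.0
    obj = obj_rgba[..., :3].astype(np.float32)

    # Random start and end points define a linear trajectory across the frame.
    start = np.array([rng.integers(0, H - h), rng.integers(0, W - w)])
    end = np.array([rng.integers(0, H - h), rng.integers(0, W - w)])

    clip = []
    for t in np.linspace(0.0, 1.0, num_frames):
        y, x = (start + t * (end - start)).astype(int)
        frame = background.astype(np.float32).copy()
        patch = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = alpha * obj + (1 - alpha) * patch
        clip.append(frame.astype(np.uint8))
    return np.stack(clip)                     # (num_frames, H, W, 3)

# Example: motion over a pure-noise background, as in the data-free experiments.
noise_bg = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
```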
To assess temporal dynamics in learned video representations, we visualize feature similarities across frames (see Figure 6). Compared to other self-supervised methods, our model exhibits greater feature variation over time—indicating stronger temporal awareness.
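A generic way to produce this kind of frame-to-frame similarity map, assuming per-frame features are obtained by spatially pooling the encoder tokens (the pooling choice and the random stand-in features below are assumptions):

```python
import torch
import torch.nn.functional as F

def frame_similarity_matrix(frame_features):
    """Cosine similarity between per-frame feature vectors of one clip.
    frame_features: (T, D) tensor, e.g. spatially pooled encoder tokens per frame.
    Lower off-diagonal similarity indicates features that vary more over time."""
    feats = F.normalize(frame_features, dim=-1)
    return feats @ feats.t()                  # (T, T) similarity map

# Example with random features standing in for real encoder outputs.
sim = frame_similarity_matrix(torch.randn(16, 768))
```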
@inproceedings{thoker2025smile,
author = {Thoker, Fida Mohammad and Jiang, Letian and Zhao, Chen and Ghanem, Bernard},
title = {SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning},
booktitle = {CVPR},
year = {2025},
}