Our method introduces high-level spatial and motion semantics into masked video modeling to enhance video representation learning. We achieve this through two core innovations: injecting explicit motion by overlaying synthetic object movements onto training clips, and replacing low-level pixel reconstruction with high-level semantic supervision from a frozen CLIP encoder.
During training, we apply a hybrid masking strategy.
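As a minimal sketch of one plausible hybrid scheme, masking could be biased toward patches covered by the overlaid object before filling the remaining budget at random; the function name `hybrid_mask`, the 0.75 ratio, and the policy itself are illustrative assumptions, not the exact strategy used in the paper:

```python
import numpy as np

def hybrid_mask(object_map, mask_ratio=0.75, rng=None):
    """Illustrative hybrid masking: prefer patches that overlap the overlaid
    synthetic object, then fill up to `mask_ratio` with random patches.
    object_map: (T, H, W) boolean array over the token grid marking patches
    covered by the pasted object. Ratios and policy are assumptions."""
    rng = rng or np.random.default_rng()
    t, h, w = object_map.shape
    n_tokens = t * h * w
    n_mask = int(mask_ratio * n_tokens)

    flat_obj = object_map.reshape(-1)
    obj_idx = np.flatnonzero(flat_obj)    # patches on the synthetic object
    bg_idx = np.flatnonzero(~flat_obj)    # background patches

    # Mask object patches first (up to the budget), then random background patches.
    rng.shuffle(obj_idx)
    rng.shuffle(bg_idx)
    chosen = np.concatenate([obj_idx, bg_idx])[:n_mask]

    mask = np.zeros(n_tokens, dtype=bool)
    mask[chosen] = True
    return mask.reshape(t, h, w)
```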
Our architecture follows a student-teacher design: the student is a transformer-based encoder-decoder trained to reconstruct masked features, while the teacher is a frozen CLIP encoder providing high-level supervision.
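The following is a minimal PyTorch sketch of this student-teacher setup, assuming the student predicts per-token CLIP-space features for masked positions; the module sizes, the smooth-L1 objective, and the `teacher` interface are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedFeatureStudent(nn.Module):
    """Toy stand-in for the transformer encoder-decoder student:
    encodes tokens (masked ones replaced by a learned mask token)
    and predicts teacher features for every position."""
    def __init__(self, dim=768):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.decoder = nn.Linear(dim, dim)   # predicts CLIP-space features
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens, mask):
        # tokens: (B, N, dim) patch embeddings; mask: (B, N) bool, True = masked.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        return self.decoder(self.encoder(x))

def training_step(student, teacher, tokens, mask):
    """One step of masked feature reconstruction against a frozen teacher."""
    with torch.no_grad():
        target = teacher(tokens)             # high-level semantic targets
    pred = student(tokens, mask)
    # Supervise only the masked positions (a common choice; an assumption here).
    return F.smooth_l1_loss(pred[mask], target[mask])

# Toy usage: a frozen linear map stands in for the CLIP teacher.
student = MaskedFeatureStudent()
teacher = nn.Linear(768, 768).eval().requires_grad_(False)
tokens, mask = torch.randn(2, 196, 768), torch.rand(2, 196) > 0.25
loss = training_step(student, teacher, tokens, mask)
```

In the actual method the teacher would be a frozen CLIP visual encoder producing semantic features for the unmasked clip; the sketch only illustrates the flow of supervision from teacher to student.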
By combining synthetic motion, semantic supervision, and targeted masking, our method learns temporally-aware and semantically-rich video representations. Notably, it achieves strong downstream performance even without access to natural video data.
Our method achieves state-of-the-art performance across multiple video understanding benchmarks, demonstrating strong generalization in both semantic and motion-centric tasks.
We achieve top results on both Something-Something V2 (SSv2) and Kinetics-400 (K400) benchmarks using full finetuning with a ViT-B backbone. Our method outperforms prior self-supervised video models, including MGMAE, MME, and MGM, by up to 2.5% on SSv2 and 2.3% on K400.
Even with fewer training epochs, our model surpasses existing methods trained for longer, thanks to CLIP-based semantic supervision that offers a more efficient and meaningful learning signal than traditional pixel-level reconstruction.
We evaluate our method on the SEVERE benchmark, which tests four key generalization factors: domain shift, sample efficiency, action granularity, and task shift. Our model consistently outperforms existing approaches across all four factors.
Overall, our full model with synthetic motion achieves a 3.0% average gain over strong baselines, demonstrating robust generalization across diverse video understanding scenarios.
Compared to CLIP-based methods like ViCLIP and UMT, our model achieves higher accuracy on motion-sensitive datasets, especially under linear probing. This suggests stronger video representations, learned with less data and without relying on video-text alignment.
We evaluate our model on two temporally-aware tasks: Unsupervised Video Object Segmentation (Un-VOS) and Temporal Action Localization (TAL). These tasks probe the model's ability to capture motion boundaries and maintain object consistency over time.
On Un-VOS (DAVIS, YTVOS), our method outperforms prior works like MGMAE, MGM, and even the clustering-based SIGMA—showing up to 7% improvement in mIoU. On TAL (THUMOS-14, ActivityNet-v1.3), we surpass VideoMAE by 7% and MGMAE by 9%, and even approach the performance of fully-supervised models trained with labeled Kinetics-400.
These results highlight our strong temporal modeling capabilities and broad applicability across diverse video understanding tasks.
We show that our method can learn strong video representations even without any natural video data. By overlaying synthetic object motions onto static backgrounds—including black and noise images—our model achieves 22.2% on K400m and 23.0% on SSv2m, significantly outperforming random initialization.
This highlights a new learning paradigm: masked video modeling driven purely by motion, without relying on real-world videos.
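As a rough sketch of how such a clip could be synthesized, assuming a segmented object crop is alpha-blended onto a static background along a simple linear trajectory (the function name `synthesize_motion_clip`, the trajectory, and the blending choices are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def synthesize_motion_clip(background, obj_rgba, num_frames=16, rng=None):
    """Paste a segmented object (RGBA crop) onto a static background along a
    linear trajectory, producing a clip whose only temporal signal is the
    synthetic object motion. The object is assumed smaller than the background.
    background: (H, W, 3) uint8 image (e.g. black or noise); obj_rgba: (h, w, 4)."""
    rng = rng or np.random.default_rng()
    H, W, _ = background.shape
    h, w = obj_rgba.shape[:2]
    alpha = obj_rgba[..., 3:4].astype(np.float32) / 255.0
    obj = obj_rgba[..., :3].astype(np.float32)

    # Random start and end points define a linear trajectory across the frame.
    start = np.array([rng.integers(0, H - h), rng.integers(0, W - w)])
    end = np.array([rng.integers(0, H - h), rng.integers(0, W - w)])

    clip = []
    for t in np.linspace(0.0, 1.0, num_frames):
        y, x = (start + t * (end - start)).astype(int)
        frame = background.astype(np.float32).copy()
        patch = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = alpha * obj + (1 - alpha) * patch
        clip.append(frame.astype(np.uint8))
    return np.stack(clip)                     # (num_frames, H, W, 3)

# Example: motion over a pure-noise background, as in the data-free experiments.
noise_bg = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
```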
To assess temporal dynamics in learned video representations, we visualize feature similarities across frames (see Figure 6). Compared to other self-supervised methods, our model exhibits greater feature variation over time—indicating stronger temporal awareness.
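A generic way to produce this kind of frame-to-frame similarity map, assuming per-frame features are obtained by spatially pooling the encoder tokens (the pooling choice and the random stand-in features below are assumptions):

```python
import torch
import torch.nn.functional as F

def frame_similarity_matrix(frame_features):
    """Cosine similarity between per-frame feature vectors of one clip.
    frame_features: (T, D) tensor, e.g. spatially pooled encoder tokens per frame.
    Lower off-diagonal similarity indicates features that vary more over time."""
    feats = F.normalize(frame_features, dim=-1)
    return feats @ feats.t()                  # (T, T) similarity map

# Example with random features standing in for real encoder outputs.
sim = frame_similarity_matrix(torch.randn(16, 768))
```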
@inproceedings{thoker2025smile,
author = {Thoker, Fida Mohammad and Jiang, Letian and Zhao, Chen and Ghanem, Bernard},
title = {SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning},
booktitle = {CVPR},
year = {2025},
}