SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

King Abdullah University of Science and Technology (KAUST)
[Teaser figure]

We introduce SMILE, a self-supervised video representation learning method built around semantic abstraction and motion dynamics. Existing masked video modeling approaches such as VideoMAE primarily reconstruct low-level pixels from natural videos, which often contain significant temporal redundancy and limited motion variation. SMILE addresses these limitations by combining spatial semantics from pretrained image-language models such as CLIP with synthetic motion augmentation that injects dynamic content into the training process. Instead of relying on raw pixel reconstruction, we guide the model to learn high-level visual concepts and motion-aware features. Notably, SMILE enables effective video representation learning even without natural videos, thanks to synthetic motion overlays and trajectory-based masking. Across 7 datasets and diverse downstream tasks, SMILE consistently outperforms state-of-the-art self-supervised methods, demonstrating superior generalization and motion sensitivity.

Method Overview

Our method introduces high-level spatial and motion semantics into masked video modeling to enhance video representation learning. We achieve this through two core innovations:

  • Motion Augmentation: We overlay moving synthetic objects onto original videos along randomly generated smooth trajectories. These motions introduce dynamic visual patterns that reduce temporal redundancy and encourage the model to capture meaningful motion cues (see the sketch after this list).
  • Semantic Reconstruction with CLIP: Instead of reconstructing low-level pixels, we guide the model to reconstruct CLIP features, which encode high-level semantic information such as scenes and object interactions. This leads to more transferable and meaningful video representations.
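
As a rough illustration of the motion-augmentation step, the sketch below pastes a small synthetic sprite onto a clip along a randomly sampled smooth trajectory. All names here (smooth_trajectory, overlay_moving_object, n_control) are illustrative placeholders rather than the actual SMILE implementation, and simple piecewise-linear interpolation stands in for whatever smoothing the real pipeline uses.

import numpy as np

def smooth_trajectory(num_frames, height, width, n_control=4, seed=None):
    """Sample a few random control points and interpolate a smooth path."""
    rng = np.random.default_rng(seed)
    ctrl_t = np.linspace(0, num_frames - 1, n_control)
    ctrl_y = rng.uniform(0, height, n_control)
    ctrl_x = rng.uniform(0, width, n_control)
    t = np.arange(num_frames)
    ys = np.interp(t, ctrl_t, ctrl_y)  # piecewise-linear; a spline would be smoother
    xs = np.interp(t, ctrl_t, ctrl_x)
    return ys.astype(int), xs.astype(int)

def overlay_moving_object(video, obj, seed=None):
    """video: (T, H, W, 3) uint8 frames, obj: (h, w, 3) uint8 sprite to paste."""
    T, H, W, _ = video.shape
    h, w, _ = obj.shape
    ys, xs = smooth_trajectory(T, H - h, W - w, seed=seed)
    out = video.copy()
    for t in range(T):
        y, x = ys[t], xs[t]
        out[t, y:y + h, x:x + w] = obj  # hard paste; alpha blending also works
    return out, (ys, xs)  # the trajectory can be reused for masking

# Toy usage: overlay a red square on a black 16-frame clip
clip = np.zeros((16, 224, 224, 3), dtype=np.uint8)
sprite = np.full((32, 32, 3), (255, 0, 0), dtype=np.uint8)
aug_clip, traj = overlay_moving_object(clip, sprite, seed=0)

Because the trajectory is returned alongside the augmented clip, the same path can drive the trajectory-aware masking described below.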

During training, we apply a hybrid masking strategy:

  • Tube masking for original video regions (as in VideoMAE).
  • Trajectory-aware masking for synthetic objects, forcing the model to infer motion dynamics (a sketch of the combined mask follows this list).
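
A minimal sketch of how the two masks could be combined over a ViT-style patch grid is given below; the function name hybrid_mask, the 90% masking ratio, the 16-pixel patch size, and the single-patch object footprint are simplifying assumptions, not SMILE's exact settings.

import numpy as np

def hybrid_mask(ys, xs, T=16, H=224, W=224, patch=16, ratio=0.9, seed=0):
    """ys, xs: per-frame object positions in pixels; returns a (T, gh, gw) bool mask."""
    rng = np.random.default_rng(seed)
    gh, gw = H // patch, W // patch  # spatial patch grid
    # 1) Tube mask: one random spatial pattern shared by every frame (as in VideoMAE)
    n_mask = int(ratio * gh * gw)
    chosen = rng.permutation(gh * gw)[:n_mask]
    tube = np.zeros((gh, gw), dtype=bool)
    tube[np.unravel_index(chosen, (gh, gw))] = True
    mask = np.repeat(tube[None], T, axis=0)
    # 2) Trajectory-aware mask: additionally hide the patch under the synthetic
    #    object in each frame (only its top-left patch here, for brevity)
    for t in range(T):
        mask[t, ys[t] // patch, xs[t] // patch] = True
    return mask  # True = token is masked and must be reconstructed

# Toy usage with a stand-in diagonal trajectory
ys = xs = np.linspace(0, 180, 16).astype(int)
mask = hybrid_mask(ys, xs)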

Our architecture follows a student-teacher design: the student is a transformer-based encoder-decoder trained to reconstruct masked features, while the teacher is a frozen CLIP encoder providing high-level supervision.
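
The training objective then amounts to predicting the teacher's token features at masked positions. The snippet below is a minimal sketch, assuming a cosine-style loss on masked tokens and random tensors standing in for the student decoder output and the frozen CLIP teacher features; none of these names come from the released code.

import torch
import torch.nn.functional as F

def reconstruction_loss(student_pred, teacher_feat, mask):
    """student_pred, teacher_feat: (B, N, D) token features; mask: (B, N) bool."""
    # Cosine-style objective on masked tokens only (an assumption; an L2 loss
    # on normalized features is equivalent up to a constant factor).
    p = F.normalize(student_pred, dim=-1)
    z = F.normalize(teacher_feat, dim=-1)
    per_token = 1.0 - (p * z).sum(dim=-1)        # (B, N)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

# Toy usage with random tensors in place of model outputs
B, N, D = 2, 1568, 512                            # 1568 = 8x14x14 tokens for a ViT-B clip
pred = torch.randn(B, N, D, requires_grad=True)   # stand-in for the student decoder output
with torch.no_grad():                             # teacher is frozen
    target = torch.randn(B, N, D)                 # stand-in for CLIP patch features
mask = torch.rand(B, N) > 0.1                     # True = masked, to be reconstructed
loss = reconstruction_loss(pred, target, mask)
loss.backward()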

By combining synthetic motion, semantic supervision, and targeted masking, our method learns temporally-aware and semantically-rich video representations. Notably, it achieves strong downstream performance even without access to natural video data.

[Figure: method overview and motion augmentation]

Highlights

Our method achieves state-of-the-art performance across multiple video understanding benchmarks, demonstrating strong generalization in both semantic and motion-centric tasks.

I. State-of-the-art on SSv2 and K400

We achieve top results on both Something-Something V2 (SSv2) and Kinetics-400 (K400) benchmarks using full finetuning with a ViT-B backbone. Our method outperforms prior self-supervised video models, including MGMAE, MME, and MGM, with significant gains—up to 2.5% on SSv2 and 2.3% on K400.

Even with fewer training epochs, our model surpasses existing methods trained for longer, thanks to CLIP-based semantic supervision that offers a more efficient and meaningful learning signal than traditional pixel-level reconstruction.

[Results on K400 and SSv2]

II. Downstream Generalization

We evaluate our method on the SEVERE benchmark, which tests four key generalization factors: domain shift, sample efficiency, action granularity, and task shift. Our model consistently outperforms existing approaches across all four factors.

  • Domain Shift: Best performance on SSv2 and Gym99 despite pretraining on K400.
  • Low-shot Learning: 5% higher accuracy on Gym99 with just 1,000 samples, outperforming MME.
  • Fine-grained Actions: Top results on FX-S1 and UB-S1, surpassing motion-focused models like MGMAE and MGM.
  • Task Adaptability: 9% improvement on multi-label Charades recognition and strong results on repetition counting.

Overall, our full model with synthetic motion achieves a 3.0% average gain over strong baselines, demonstrating robust generalization across diverse video understanding scenarios.

[Figure: Grad-CAM visualizations]

III. Better CLIP Adaptation

Compared to CLIP-based methods like ViCLIP and UMT, our model achieves higher accuracy on motion-sensitive datasets, especially under linear probing. This suggests stronger video representations, learned with less data and without relying on video-text alignment.

[Figure: data efficiency]

IV. Generalization to More Temporal Tasks

We evaluate our model on two temporally-aware tasks: Unsupervised Video Object Segmentation (Un-VOS) and Temporal Action Localization (TAL). These tasks probe the model's ability to capture motion boundaries and maintain object consistency over time.

On Un-VOS (DAVIS, YTVOS), our method outperforms prior works like MGMAE, MGM, and even the clustering-based SIGMA—showing up to 7% improvement in mIoU. On TAL (THUMOS-14, ActivityNet-v1.3), we surpass VideoMAE by 7% and MGMAE by 9%, and even approach the performance of fully-supervised models trained with labeled Kinetics-400.

These results highlight our strong temporal modeling capabilities and broad applicability across diverse video understanding tasks.


V. Learning Without Natural Videos

We show that our method can learn strong video representations even without any natural video data. By overlaying synthetic object motions onto static backgrounds—including black and noise images—our model achieves 22.2% on K400m and 23.0% on SSv2m, significantly outperforming random initialization.
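
For concreteness, a synthetic-only clip of this kind can be built by animating a sprite over a static noise or black image; the short sketch below reuses the hypothetical overlay_moving_object helper from the Method Overview sketch above and is an assumption about the setup, not the paper's exact data pipeline.

import numpy as np

rng = np.random.default_rng(0)
background = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)    # static noise image
clip = np.repeat(background[None], 16, axis=0)                      # "video" with no real motion
sprite = np.full((32, 32, 3), (0, 255, 0), dtype=np.uint8)
synthetic_clip, traj = overlay_moving_object(clip, sprite, seed=1)  # helper sketched earlier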

This highlights a new learning paradigm: masked video modeling driven purely by motion, without relying on real-world videos.


VI. Qualitative Analysis

To assess temporal dynamics in learned video representations, we visualize feature similarities across frames (see Figure 6). Compared to other self-supervised methods, our model exhibits greater feature variation over time—indicating stronger temporal awareness.
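
This probe is straightforward to reproduce for any encoder that yields one pooled feature per frame; the sketch below, assuming a (T, D) input and using toy tensors, computes the T x T cosine-similarity matrix that such visualizations are built from.

import torch
import torch.nn.functional as F

def frame_similarity(frame_feats):
    """frame_feats: (T, D) pooled per-frame features -> (T, T) cosine similarities."""
    f = F.normalize(frame_feats, dim=-1)
    return f @ f.t()

sim = frame_similarity(torch.randn(16, 768))  # toy stand-in for real encoder features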

[Figure 6: frame-to-frame feature similarity]

BibTeX

@inproceedings{thoker2025smile,
  author    = {Thoker, Fida Mohammad and Jiang, Letian and Zhao, Chen and Ghanem, Bernard},
  title     = {SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
}