Publications

Conference (International)

Data Collection-free Masked Video Modeling

Yuchi Ishikawa, Masayoshi Kondo, Yoshimitsu Aoki (Keio University)

The 18th European Conference on Computer Vision 2024 (ECCV 2024)

October 03, 2024

Pre-training video transformers generally requires a large amount of data, presenting significant challenges in terms of data collection costs and concerns related to privacy, licensing, and inherent biases. Synthesizing data is one of the promising ways to solve these issues, yet pre-training solely on synthetic data has its own challenges. In this paper, we introduce a self-supervised learning framework for videos that leverages readily available and less costly static images. Our approach utilizes a Pseudo Motion Generator (PMG) module to generate pseudo-motion videos from real images, simulating a variety of motion patterns. These pseudo-motion videos are then leveraged in self-supervised learning of masked video modeling. Furthermore, we extend our approach to synthetic images, freeing the pre-training of video models from data collection costs and concerns related to real data. Through experiments in action recognition tasks, we demonstrate that our method effectively learns spatio-temporal features from pseudo-motion videos, surpassing existing methods with static images and partially outperforming pre-training methods using real videos. These results uncover fragments of what video transformers learn through masked video modeling.
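To make the idea concrete, here is a minimal illustrative sketch of the two ingredients the abstract describes: turning a static image into a pseudo-motion clip, and tube-style masking for masked video modeling. This is an assumption-laden stand-in, not the paper's actual Pseudo Motion Generator; the simple linear-translation motion and the VideoMAE-style tube mask are placeholders chosen for clarity.

```python
import numpy as np

def pseudo_motion_clip(image, num_frames=16, max_shift=8, rng=None):
    """Turn a static image (H, W, C) into a pseudo-motion clip (T, H, W, C)
    by translating the image along a random linear trajectory over time.
    Illustrative only -- the paper's PMG simulates richer motion patterns."""
    rng = rng or np.random.default_rng()
    # Random per-clip translation direction, interpolated linearly over time.
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    frames = []
    for t in range(num_frames):
        ox = int(round(dx * t / (num_frames - 1)))
        oy = int(round(dy * t / (num_frames - 1)))
        # Shift via np.roll for simplicity; a real generator would crop or warp.
        frames.append(np.roll(image, shift=(oy, ox), axis=(0, 1)))
    return np.stack(frames)

def tube_mask(num_frames, num_patches, mask_ratio=0.9, rng=None):
    """Tube masking: one random spatial patch mask shared across all frames,
    a common choice in masked video modeling (e.g., VideoMAE-style)."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(num_patches, dtype=bool)
    hidden = rng.choice(num_patches, int(num_patches * mask_ratio), replace=False)
    mask[hidden] = True
    return np.broadcast_to(mask, (num_frames, num_patches))

# Usage: one static image yields a 16-frame clip; ~90% of patches are masked.
image = np.random.rand(32, 32, 3)
clip = pseudo_motion_clip(image)        # shape (16, 32, 32, 3)
mask = tube_mask(clip.shape[0], 8 * 8)  # shape (16, 64)
```

The clip and mask would then feed a masked-autoencoding objective: the video transformer sees only the visible patches and reconstructs the hidden ones, which is how spatio-temporal features are learned without any collected video data.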

Paper: Data Collection-free Masked Video Modeling (external link)