Publications
Conference (International) A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video
Keito Kudo (Tohoku Univ.), Haruki Nagasawa (Tohoku Univ.), Jun Suzuki (Tohoku Univ.), Nobuyuki Shimizu
The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
2023.12.10
This paper proposes a practical task setting of a multimodal video summarization and a dataset for training and evaluating the task. Our task requires us to summarize a given video into a predefined number of keyframe-caption pairs and display them in a listable format to quickly grasp the video content. This task aims to extract crucial scenes from the video in the form of images (keyframes) and generate a caption explaining each keyframe’s situation. This task is useful as an actual application and presents a highly challenging problem worthy of research. Specifically, achieving the simultaneous optimization of keyframe selection performance and caption quality necessitates careful consideration of the mutual dependence on both preceding and subsequent keyframes and captions. To facilitate research in this area, we construct a dataset by expanding upon existing datasets and propose an evaluation framework. Furthermore, we develop two types of baseline systems and report their respective performance.
Paper: A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video (external site)