SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation - LY Corporation R&D

CONFERENCE (INTERNATIONAL) SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

Ryosuke Matsuda (Tohoku University), Keito Kudo (Tohoku University), Haruto Yoshida (Tohoku University), Nobuyuki Shimizu, Jun Suzuki (Tohoku University)

The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

June 05, 2026

We present SLVMEval, a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval targets long videos, ranging from several minutes to a few hours, which remain challenging for current T2V models to generate. Starting from existing dense video captioning datasets, we degrade each video to create matched pairs of high- and low-quality versions, and we test whether a candidate evaluation system can reliably rank them. Experiments on SLVMEval reveal that current evaluation systems still struggle with long videos. For example, while CLIPScore is the best-performing baseline, it ignores temporal coherence, and VLM-as-a-judge suffers from position bias.
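The pairwise ranking test described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `pairwise_accuracy`, the `(prompt, original, degraded)` tuple layout, and the word-overlap `toy_scorer` are all assumptions introduced here for illustration, and the paper may define its metric differently. The sketch simply counts how often a candidate evaluation system scores the original video strictly higher than its degraded counterpart.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical layout: (text prompt, original video, degraded video).
# A "video" is whatever the scorer accepts; strings stand in for videos here.
Pair = Tuple[str, str, str]


def pairwise_accuracy(scorer: Callable[[str, str], float],
                      pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the original outscores the degraded version."""
    total = 0
    correct = 0
    for prompt, original, degraded in pairs:
        total += 1
        # Ties count as failures: the evaluator did not separate the pair.
        if scorer(prompt, original) > scorer(prompt, degraded):
            correct += 1
    if total == 0:
        raise ValueError("no pairs given")
    return correct / total


def toy_scorer(prompt: str, video: str) -> float:
    """Stand-in evaluator (assumption): share of prompt words found in the 'video'."""
    prompt_words = set(prompt.split())
    return len(prompt_words & set(video.split())) / len(prompt_words)


# Toy data: the second pair is deliberately one the scorer gets wrong.
toy_pairs = [
    ("a dog runs", "a dog runs fast", "a cat sleeps"),
    ("red car", "blue car", "red car"),
]

accuracy = pairwise_accuracy(toy_scorer, toy_pairs)
```

In a real setting, `scorer` would wrap a system such as CLIPScore or a VLM-based judge, and an accuracy near 0.5 would suggest the evaluator cannot separate high-quality from degraded long videos better than chance.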

Paper: SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation