Publications
CONFERENCE (INTERNATIONAL) SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
Ryosuke Matsuda (Tohoku University), Keito Kudo (Tohoku University), Haruto Yoshida (Tohoku University), Nobuyuki Shimizu, Jun Suzuki (Tohoku University)
The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)
June 05, 2026
We present SLVMEval, a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval targets long videos, ranging from several minutes to a few hours, which remain challenging for current T2V models to generate. Starting from existing dense video captioning datasets, we degrade each video to create matched pairs of high- and low-quality versions, and we test whether a candidate evaluation system can reliably rank them. Experiments on SLVMEval reveal that current evaluation systems still struggle with long videos. For example, while CLIPScore is the best-performing baseline, it ignores temporal coherence, and VLM-as-a-judge suffers from position bias.
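The pairwise ranking test described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `pairwise_accuracy`, the `(prompt, high, low)` tuple layout, the `score(prompt, video)` callable, and the choice to count ties as failures are all assumptions made here for clarity.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Video = TypeVar("Video")  # placeholder type; any video representation works


def pairwise_accuracy(
    pairs: Sequence[Tuple[str, Video, Video]],
    score: Callable[[str, Video], float],
) -> float:
    """Fraction of pairs where the evaluator prefers the high-quality video.

    Each pair is (prompt, high_quality_video, degraded_video).
    `score` is the candidate evaluation system under test (hypothetical
    interface): it maps a prompt and a video to a scalar quality score.
    Ties are counted as failures (an assumption of this sketch).
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        1 for prompt, high, low in pairs if score(prompt, high) > score(prompt, low)
    )
    return correct / len(pairs)
```

Under this sketch, an evaluator that scores at random would land near 0.5 when ties are rare, so values close to that level suggest the evaluator cannot separate the original videos from their degraded counterparts.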
Paper: SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation