Generative Universal Verifier as Multimodal Meta-Reasoner

¹Tsinghua University, ²ByteDance Seed, ³Princeton University

Introduction

We introduce the Generative Universal Verifier, a novel concept and plugin for next-generation multimodal reasoning in vision-language models (VLMs) and unified multimodal models. It provides the fundamental capability to reflect on and refine visual outcomes during reasoning and generation. This work makes three main contributions:

(1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification.

(2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically.

(3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, raising the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods such as Best-of-N.
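The difference between sequential and parallel test-time scaling described in (3) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `sequential_tts` and `best_of_n`, the `generate` / `verify` / `edit` / `score` callables, and the toy stand-ins (which model an image as a set of object names) are hypothetical and are not the OmniVerifier-TTS implementation or API.

```python
from typing import Callable, FrozenSet, Tuple, TypeVar

Img = TypeVar("Img")


def sequential_tts(
    prompt: str,
    generate: Callable[[str], Img],
    verify: Callable[[str, Img], Tuple[bool, str]],
    edit: Callable[[Img, str], Img],
    max_steps: int = 3,
) -> Img:
    """Generate once, then alternate verification and targeted editing."""
    image = generate(prompt)
    for _ in range(max_steps):
        aligned, feedback = verify(prompt, image)
        if aligned:
            break  # the verifier accepts the image, stop refining
        image = edit(image, feedback)  # refine using the verifier's feedback
    return image


def best_of_n(
    prompt: str,
    generate: Callable[[str], Img],
    score: Callable[[str, Img], float],
    n: int = 4,
) -> Img:
    """Parallel baseline: sample n independent candidates, keep the best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda img: score(prompt, img))


# --- Toy stand-ins (hypothetical): an "image" is a set of object names. ---
def required_objects(prompt: str) -> list:
    return [word.strip() for word in prompt.split(",")]


def toy_generate(prompt: str) -> FrozenSet[str]:
    return frozenset({"cat"})  # the generator only ever draws a cat


def toy_verify(prompt: str, image: FrozenSet[str]) -> Tuple[bool, str]:
    missing = [obj for obj in required_objects(prompt) if obj not in image]
    return (not missing, missing[0] if missing else "")


def toy_edit(image: FrozenSet[str], feedback: str) -> FrozenSet[str]:
    return image | {feedback}  # add the object the verifier asked for


def toy_score(prompt: str, image: FrozenSet[str]) -> float:
    return float(sum(obj in image for obj in required_objects(prompt)))


refined = sequential_tts("cat, dog, ball", toy_generate, toy_verify, toy_edit)
sampled = best_of_n("cat, dog, ball", toy_generate, toy_score)
```

In this toy setup the sequential loop repairs one missing object per step, while the parallel baseline can only return one of its unrefined samples. Real generators are stochastic, so the toy is meant only to show the control flow, not to reproduce the reported gains.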

By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.

Leaderboard on ViVerBench

Rule-based evaluation scores on ViVerBench. We report per-task performance and the Overall score.

| # | Model | Overall | CE-Obj | CE-Attr | CE-AbsP | OR-Spat | OR-NSpt | WD-SPhy | WD-DPhy | IA-BBox | IA-Point | IA-Count | SVE-Maze | SVE-FLake | SVE-Robot | SVE-GUI | STEM-Chart | STEM-LaTeX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro 🥇 | 0.745 | 0.763 | 0.750 | 0.856 | 0.875 | 0.761 | 0.746 | 0.532 | 0.875 | 0.863 | 0.698 | 0.580 | 0.804 | 0.563 | 0.912 | 0.540 | 0.799 |
| 2 | GPT-5 🥈 | 0.744 | 0.696 | 0.737 | 0.849 | 0.725 | 0.746 | 0.775 | 0.668 | 0.831 | 0.885 | 0.659 | 0.507 | 0.743 | 0.589 | 0.856 | 0.760 | 0.876 |
| 3 | OpenAI o3 🥉 | 0.735 | 0.723 | 0.728 | 0.801 | 0.713 | 0.754 | 0.729 | 0.682 | 0.802 | 0.885 | 0.643 | 0.517 | 0.671 | 0.627 | 0.875 | 0.732 | 0.887 |
| 4 | Seed 1.5-VL | 0.731 | 0.737 | 0.763 | 0.651 | 0.779 | 0.851 | 0.588 | 0.575 | 0.903 | 0.870 | 0.610 | 0.527 | 0.718 | 0.671 | 0.833 | 0.720 | 0.907 |
| 5 | OpenAI o4-mini | 0.727 | 0.745 | 0.746 | 0.781 | 0.763 | 0.754 | 0.646 | 0.654 | 0.843 | 0.819 | 0.604 | 0.560 | 0.650 | 0.658 | 0.833 | 0.700 | 0.876 |
| 6 | OpenAI o1 | 0.715 | 0.647 | 0.754 | 0.760 | 0.704 | 0.769 | 0.675 | 0.671 | 0.758 | 0.826 | 0.626 | 0.587 | 0.646 | 0.601 | 0.764 | 0.728 | 0.902 |
| 7 | InternVL3.5 A28B | 0.671 | 0.688 | 0.737 | 0.637 | 0.742 | 0.799 | 0.592 | 0.500 | 0.847 | 0.796 | 0.527 | 0.503 | 0.539 | 0.519 | 0.796 | 0.640 | 0.881 |
| 8 | Qwen 2.5-VL 72B | 0.661 | 0.696 | 0.642 | 0.678 | 0.550 | 0.813 | 0.600 | 0.507 | 0.839 | 0.744 | 0.615 | 0.517 | 0.507 | 0.513 | 0.796 | 0.628 | 0.922 |
| 9 | OmniVerifier 7B (Ours) | 0.653 | 0.728 | 0.711 | 0.514 | 0.742 | 0.679 | 0.517 | 0.618 | 0.802 | 0.670 | 0.566 | 0.563 | 0.482 | 0.728 | 0.662 | 0.548 | 0.912 |
| 10 | GPT-4o | 0.645 | 0.540 | 0.608 | 0.671 | 0.538 | 0.731 | 0.713 | 0.500 | 0.649 | 0.744 | 0.632 | 0.570 | 0.643 | 0.563 | 0.796 | 0.656 | 0.758 |
| 11 | Qwen 2.5-VL 7B | 0.570 | 0.531 | 0.591 | 0.500 | 0.504 | 0.694 | 0.529 | 0.471 | 0.673 | 0.633 | 0.467 | 0.527 | 0.404 | 0.671 | 0.625 | 0.556 | 0.742 |
| * | Human | 0.932 | 0.938 | 0.940 | 0.932 | 0.988 | 0.955 | 0.929 | 0.818 | 0.961 | 0.966 | 0.918 | 0.997 | 1.000 | 1.000 | 0.935 | 0.928 | 0.706 |
| * | Random | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 |
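
The page does not state how the Overall score is aggregated. As a sanity check, the short sketch below recomputes it as the unweighted mean of the 16 per-task scores for three rows copied from the leaderboard above. The aggregation rule is an assumption, not a documented definition, and the helper name `overall` is hypothetical.

```python
from statistics import mean

# Per-task scores copied from the leaderboard rows above (16 tasks each,
# in column order from CE-Obj to STEM-LaTeX).
PER_TASK = {
    "Gemini 2.5 Pro": [0.763, 0.750, 0.856, 0.875, 0.761, 0.746, 0.532, 0.875,
                       0.863, 0.698, 0.580, 0.804, 0.563, 0.912, 0.540, 0.799],
    "GPT-5": [0.696, 0.737, 0.849, 0.725, 0.746, 0.775, 0.668, 0.831,
              0.885, 0.659, 0.507, 0.743, 0.589, 0.856, 0.760, 0.876],
    "OmniVerifier 7B": [0.728, 0.711, 0.514, 0.742, 0.679, 0.517, 0.618, 0.802,
                        0.670, 0.566, 0.563, 0.482, 0.728, 0.662, 0.548, 0.912],
}

# Overall scores as reported in the leaderboard.
REPORTED = {"Gemini 2.5 Pro": 0.745, "GPT-5": 0.744, "OmniVerifier 7B": 0.653}


def overall(task_scores):
    """Assumed aggregation: unweighted mean over tasks, rounded to 3 decimals."""
    return round(mean(task_scores), 3)


for model, scores in PER_TASK.items():
    print(f"{model}: recomputed={overall(scores):.3f} reported={REPORTED[model]:.3f}")
```

For these three rows the recomputed mean agrees with the reported Overall value to three decimals, which is consistent with an unweighted average but does not rule out other aggregation schemes.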

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

BibTeX

@article{zhang2025generative,
  author  = {Zhang, Xinchen and Zhang, Xiaoying and Wu, Youbin and Cao, Yanbin and Zhang, Renrui and Chu, Ruihang and Yang, Ling and Yang, Yujiu},
  title   = {Generative Universal Verifier as Multimodal Meta-Reasoner},
  journal = {arXiv preprint arXiv:2510.13804},
  year    = {2025}
}