2026.1.5
Why We Built VIBE Bench: Rethinking Evaluation for Real Workloads
A full-stack application benchmark focused on real user experience
To measure a model’s full-stack capability to build complete, runnable applications from zero to one, MiniMax introduces a new benchmark: VIBE (Visual & Interactive Benchmark for Execution).
Unlike traditional benchmarks, VIBE automatically evaluates the interaction logic and visual presentation of generated applications in a real execution environment, providing a more faithful assessment of real user experience.

Background & Motivation
When we evaluate large language models today, most widely used benchmarks, such as SWE-bench and Terminal-bench, focus on static code correctness or command-line–level task completion.
These benchmarks have been extremely valuable for measuring coding ability. But they also rely on an implicit assumption:
If the generated code is logically correct and passes tests, the task is considered complete.
In real-world usage, that assumption often falls short.
What users actually care about is whether:
- the application can be successfully built and launched;
- core features work through real user interaction;
- interactions behave as expected;
- the interface looks modern, polished, and professional.
Despite this gap, there has been no benchmark that systematically evaluates whether model-generated applications truly deliver a usable end-to-end experience.
This is the motivation behind VIBE (Visual & Interactive Benchmark for Execution), a full-stack evaluation benchmark designed around real user experience.
VIBE Bench Overview
VIBE is built to evaluate the entire lifecycle of an application, focusing on how model-generated apps perform in real execution environments.
Unlike existing benchmarks that mainly target Web or backend development, VIBE deliberately includes several critical but often overlooked technical domains, such as:
- Native Android development (Kotlin / Java)
- Native iOS development (Swift / Objective-C)
- High-fidelity scientific simulations, where both precise computation and realistic rendering matter
To reflect the diversity of real-world development, the VIBE dataset is organized into the following subsets by technology stack:
- Web: Frontend applications that demand strong visual design and complex DOM interactions
- Simulation: Scientific simulation tasks requiring high-fidelity rendering and accurate numerical computation
- Android: Native Android application development (Kotlin / Java)
- iOS: Native iOS application development (Swift / Objective-C)
- Backend: Server-side systems that emphasize API completeness and overall system architecture
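For readers who want to inspect the tasks directly, the dataset is hosted on Hugging Face (see the link at the end of this post). A minimal loading sketch follows; the split name and per-task field are illustrative assumptions, not the documented schema:

```python
# Sketch: browsing the VIBE dataset with the Hugging Face `datasets` library.
# NOTE: the split name and the "category" field are assumptions made for
# illustration; consult the dataset README (linked below) for the real schema.
from collections import Counter

from datasets import load_dataset

vibe = load_dataset("MiniMaxAI/VIBE", split="train")  # split name assumed

# Count tasks per technology-stack subset (per-example field name assumed).
print(Counter(example.get("category") for example in vibe))
```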
Core Approach: Agent-as-a-Verifier (AaaV)
At the core of VIBE is a new verification paradigm we call Agent-as-a-Verifier (AaaV).

VIBE uses a vision-enabled agent to act as an automated QA tester, rather than relying on hand-written rules or static tests. This agent interacts with model-generated applications directly, observing both their behavior and visual output.
Running inside a sandboxed environment, the agent performs end-to-end evaluation of each application (coming soon), from launch and functional interaction to layout, rendering, and overall visual presentation.
By shifting verification from predefined rules to agent-driven interaction, AaaV allows VIBE to evaluate applications in a way that more closely mirrors how real users experience software.
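To make the paradigm concrete, here is a minimal sketch of what an AaaV-style check could look like for a web target, assuming a Playwright-driven browser and a hypothetical `vision_model` client for a vision-enabled LLM. This is our illustration of the idea, not MiniMax's actual harness; native Android and iOS targets would need a device emulator rather than a browser.

```python
# Conceptual AaaV sketch: a vision-enabled agent acts as an automated QA tester.
# `vision_model` is a hypothetical client; the prompt and PASS/FAIL protocol
# are illustrative assumptions, not VIBE's actual implementation.
from playwright.sync_api import sync_playwright

CHECK_PROMPT = (
    "You are a QA tester. Given this screenshot of the app under test, does it "
    "satisfy the requirement: '{req}'? Answer PASS or FAIL, then give a reason."
)

def verify(app_url: str, requirements: list[str], vision_model) -> dict[str, bool]:
    results = {}
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(app_url)  # launch the generated app in a sandboxed browser
        for req in requirements:
            screenshot = page.screenshot()  # observe the app's visual output
            verdict = vision_model.ask(CHECK_PROMPT.format(req=req), image=screenshot)
            results[req] = verdict.strip().upper().startswith("PASS")
    return results
```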
The Three Evaluation Layers of VIBE Bench (1)
Execution Layer
The first question VIBE asks is a very basic one:
Can the application actually survive?

At the execution layer, we check whether the generated app can make it through the most fundamental hurdles:
- Does the project compile successfully?
- Can the application be built without errors?
- Does it launch and run without crashing?
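A rough illustration of what such an execution-layer gate might look like, under assumed build and run commands (a sketch, not the official harness):

```python
# Illustrative execution-layer gate: build the project, start it, and confirm
# the process survives a short grace period. Commands are caller-supplied
# assumptions, e.g. execution_check("app/", ["npm", "run", "build"], ["npm", "start"]).
import subprocess
import time

def execution_check(project_dir: str, build_cmd: list[str], run_cmd: list[str]) -> bool:
    # Does the project compile / build without errors?
    build = subprocess.run(build_cmd, cwd=project_dir, capture_output=True)
    if build.returncode != 0:
        return False
    # Does it launch and keep running without crashing?
    proc = subprocess.Popen(run_cmd, cwd=project_dir)
    time.sleep(10)               # grace period before the liveness check
    alive = proc.poll() is None  # None means the process is still running
    proc.terminate()
    return alive
```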
The Three Evaluation Layers of VIBE Bench (2)
Interaction Layer
Validates whether core functions are "usable":
- Whether interactions are responsive
- Whether business workflows can be completed end-to-end
- Whether key functions align with user intent
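In VIBE these checks are driven by the agent rather than by fixed scripts, but a scripted example makes the checked properties concrete. The to-do app, selectors, and workflow below are hypothetical:

```python
# Hypothetical interaction-layer check for an assumed to-do app task:
# exercise a real user workflow end-to-end and verify the visible result.
from playwright.sync_api import sync_playwright

def todo_workflow_works(app_url: str) -> bool:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(app_url)
        page.fill("#new-todo", "buy milk")  # real user input, not a unit test
        page.click("#add-button")           # trigger the business workflow
        # Does the key function match user intent? The item must appear.
        return page.locator(".todo-item", has_text="buy milk").count() == 1
```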

The Three Evaluation Layers of VIBE Bench (3)
Visual & Aesthetics Layer
Validates whether the interface has "production-grade presentation":
- Whether the layout is reasonable and professional
- Whether the visual hierarchy is clear
- Whether the color scheme is harmonious
- Whether the overall style complies with modern UI design standards
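One plausible way to automate such judgments is to hand a vision model the final screenshot along with a grading rubric. The rubric wording, JSON protocol, and `vision_model` client here are our assumptions, not VIBE's actual grading prompt:

```python
# Illustrative visual-layer scoring: a vision model grades a screenshot against
# a rubric. The rubric text and response format are assumptions for this sketch.
import json

RUBRIC = """Rate this screenshot from 1-5 on each dimension; reply as JSON:
- layout: is the layout reasonable and professional?
- hierarchy: is the visual hierarchy clear?
- color: is the color scheme harmonious?
- modernity: does the overall style meet modern UI design standards?"""

def visual_score(screenshot_png: bytes, vision_model) -> float:
    raw = vision_model.ask(RUBRIC, image=screenshot_png)
    scores = json.loads(raw)  # e.g. {"layout": 4, "hierarchy": 5, ...}
    return sum(scores.values()) / len(scores)
```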

From "Is the code correct?" to "Is the application usable and deliverable?"

VIBE Bench reflects a critical phase transition in the evolution of model capabilities: from verifying that code is correct to verifying that the application is usable and deliverable.

VIBE Bench provides a unified, scalable standard for evaluating and training full-stack generative models for real-world scenarios, advancing model capabilities from "technical correctness" to "practical deployment value."
https://huggingface.co/datasets/MiniMaxAI/VIBE/blob/main/README.md?code=true
