Arena

Arena (formerly LMArena) is a community-driven platform for benchmarking and comparing AI models. Users evaluate the real-world performance of leading models such as GPT, Claude, and Gemini across text, vision, code, and other tasks through anonymous battles, user voting, and an Elo scoring system.
Tags: AI model evaluation, large model leaderboard, AI blind test battles, model performance comparison, Arena AI platform, AI benchmarking tool, multimodal model evaluation

Features of Arena

Battle Mode provides anonymous head-to-head battles where two models respond to the same prompt in parallel, with users voting based on answer quality.
Side by Side mode lets users pick two specific models and compare their responses directly.
Direct Chat mode enables direct dialogue and interaction with a single chosen model.
Specialized leaderboards cover text, vision, image generation, video generation, coding, search, and more.
Elo-based scoring dynamically updates model rankings based on millions of user votes.
The platform aggregates hundreds of cutting-edge AI models, including GPT, Claude, Gemini, Grok, and more.
User voting data is openly available, providing a real-use reference for AI research and development.
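
To illustrate how a single vote can move two ratings under Elo-based scoring, here is a minimal sketch of the textbook Elo update. The function names, the K-factor of 32, and the 400-point scale are conventional assumptions for illustration; the listing does not describe how Arena actually computes its scores, and the platform's method may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one vote.

    outcome is 1.0 if the voter preferred A, 0.0 if B, and 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    delta = k * (outcome - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models; the voter prefers model A.
print(update_elo(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

An upset moves the ratings further than an expected result: if a lower-rated model beats a higher-rated one, its expected score was below 0.5, so the rating change is larger than 16 points.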

Use Cases of Arena

When choosing an AI assistant, compare different models' answers on specific questions via anonymous battles.
Developers or researchers can benchmark multiple AI models against one another on tasks like code generation and debugging.
Content creators can compare text-to-image or image-to-video models on creativity and generation quality.
Enterprises evaluating AI models can reference performance leaderboards derived from millions of real user votes.
AI enthusiasts can freely explore and test the latest top-tier models such as GPT, Claude, and Gemini.
Academic researchers can access open, transparent community evaluation data and rankings.

FAQ about Arena

Q: What is Arena? What is it mainly used for?

Arena (formerly LMArena) is an open AI model benchmarking platform. It provides an ‘arena’ where users can anonymously compare the responses of different AI models (such as GPT and Claude), and it aggregates their votes into a leaderboard that reflects real-world performance.

Q: How do model battles (Battle Mode) work on Arena?

In Battle Mode, users submit a query or prompt and the system randomly selects two anonymous AI models to generate responses in parallel. Users vote for the better answer based on quality, and votes affect the models’ Elo scores and leaderboard rankings.
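
The flow described above can be sketched as follows. This is a hypothetical illustration, not Arena's implementation: the `Battle` class, the function names, the model names in the usage example, and the "Model A"/"Model B" labels are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Battle:
    prompt: str
    model_a: str  # real identity, hidden from the voter until a vote is cast
    model_b: str


def start_battle(prompt: str, model_pool: Sequence[str],
                 rng: Optional[random.Random] = None) -> Battle:
    """Randomly pick two distinct models to answer the same prompt."""
    rng = rng or random.Random()
    model_a, model_b = rng.sample(model_pool, 2)
    return Battle(prompt, model_a, model_b)


def voter_view(battle: Battle) -> dict:
    """What the voter sees before voting: neutral labels, no model names."""
    return {"prompt": battle.prompt, "left": "Model A", "right": "Model B"}


def cast_vote(battle: Battle, choice: str) -> dict:
    """Record a vote ("A", "B", or "tie") and reveal the identities."""
    if choice not in ("A", "B", "tie"):
        raise ValueError("choice must be 'A', 'B', or 'tie'")
    winner = {"A": battle.model_a, "B": battle.model_b, "tie": None}[choice]
    return {"winner": winner, "model_a": battle.model_a, "model_b": battle.model_b}


# Hypothetical pool; real model names are only revealed in the vote record.
battle = start_battle("Explain recursion.", ["model-x", "model-y", "model-z"])
print(voter_view(battle))
print(cast_vote(battle, "A"))
```

The vote record returned by `cast_vote` is the kind of input a rating update would consume; keeping the identities out of `voter_view` is what makes the comparison blind.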

Q: Is Arena free to use?

According to public information, the core evaluation and comparison features on Arena are currently free to use. You can try out and test the AI models integrated into the platform.

Q: How does Arena ensure fairness in model evaluation?

Battles are anonymous: voters don’t know which models they are comparing, which reduces brand bias. An Elo scoring system processes the large volume of votes, and all evaluation data and rankings are publicly auditable.

Q: What types of AI models does Arena evaluate?

Arena offers evaluations across multiple domains, including text dialogue, visual understanding, image generation, video generation, coding, web development, and search, covering the capabilities of mainstream models.

Q: How is user data handled when using Arena?

According to the platform’s policy, user input may be processed by third-party AI models and could be disclosed to the respective AI providers and publicly shared to support community development and AI research. Users are advised not to submit sensitive or personal data.

Q: How often is the Leaderboard updated on Arena?

Leaderboards are dynamically updated based on ongoing community votes. Each specialized leaderboard (e.g., Text, Vision) typically shows its most recent update time, such as 'updated 1 day ago', so users can see how current the rankings are.

Q: How does Arena differ from traditional AI benchmarks?

Traditional benchmarks use fixed standardized test items. Arena emphasizes evaluation based on real user tasks and subjective judgments, reflecting model performance in real-world scenarios through a large volume of anonymous votes and comparisons.