Arena

Research Services

San Francisco, California 15,806 followers

Where AI meets the real world.

About us

Created by researchers from UC Berkeley, Arena (formerly LMArena) is a community-powered platform to measure and advance the frontier of AI for real-world use. Tens of millions of builders, researchers, and creative professionals come to Arena to use frontier models and give feedback on their responses, shaping a public leaderboard grounded in real-world use.

Website
https://arena.ai
Industry
Research Services
Company size
51-200 employees
Headquarters
San Francisco, California
Type
Privately Held
Founded
2025
Specialties
AI evaluation, AI research, and AI community

Updates

  • View organization page for Arena

    Grok Build 0.1 ranks #15 and Grok 4.3 (High) #17 in the new Agent Arena leaderboard. Grok Build 0.1 improves meaningfully on bash capability over Grok 4.3. It is slightly less steerable and more prone to tool hallucinations, but looks to be successfully completing tasks more often overall.

    Agent Arena ranks models on real-world agentic tasks using a causal tracing methodology. A model’s net improvement indicates how much better or worse it is than the average model.

    Grok Build 0.1 ranks #15 overall (-5.3%):
    - #15 Confirmed Success (-6.3%)
    - #18 Praise vs. Complaint (-15.8%)
    - #15 Steerability (-7.0%)
    - #9 Bash Recovery (+6.1%)
    - #19 Tool Hallucination (-3.5%)

    Grok 4.3 (High) ranks #17 overall (-9.4%):
    - #20 Confirmed Success (-15.8%)
    - #19 Praise vs. Complaint (-16.6%)
    - #18 Steerability (-9.3%)
    - #16 Bash Recovery (-3.8%)
    - #17 Tool Hallucination (-1.6%)

    Dive into the Agent Arena leaderboard details at: https://lnkd.in/gYTs-Wjt

  • View organization page for Arena

    ICYMI: Agentic AI is now measured in the Arena. Agent Mode can handle deep research around competitive intelligence, market sizing & opportunity analysis, scientific & medical research and more. Every session shapes the Agent Arena leaderboard. Get a quick walkthrough of the causal tracing methodology with ML Researcher, Evan Frick. Watch the full video on YouTube: https://lnkd.in/gXaECJCJ Dig into the Agent Arena leaderboard details at: https://lnkd.in/gYTs-Wjt

  • View organization page for Arena

    Three new models entered the Image Arena Top 10 this past month (May -> June, Text-to-Image):
    - #2 Reve 2.0 by Reve (1,273), behind only GPT Image 2.
    - #4 MAI-Image-2.5 by Microsoft AI (1,253).
    - #9 Ideogram 4.0 Quality by Ideogram (1,204), the only open-weights model in the Top 10.

    Reve 2.0 and MAI-Image-2.5 displaced their own predecessors: both previous generations dropped out of the Top 10.

    The three new entries bring different strengths across the Text-to-Image categories:
    - Reve 2.0 has the broadest profile, leading the three models in six of eight categories. Its clearest strengths are Text Rendering, Commercial Design, and Photorealistic Imagery.
    - MAI-Image-2.5 leads in 3D Imaging and Art, while remaining competitive across the other categories.
    - Ideogram 4.0 Quality’s strongest relative results are in overall performance and Text Rendering.

    Dive into the Text-to-Image Arena leaderboard details, and filter for the data points that matter most to you at: https://lnkd.in/ghWU6WCP

  • View organization page for Arena

    Agentic AI is now evaluated in the Arena with Agent Mode and measured with Agent Arena. Watch as Founding Engineer Matthew Hovanec and Product Lead Ted Moran show you Agent Mode in action: deep research, complex bash operations, whatever you throw at it. Every session contributes to the Agent Arena leaderboard. This agentic capability streamlines complex tasks and multi-stage workflows, minimizing the need for multiple prompts. Read more about how and when to use Agent Mode on our blog: https://lnkd.in/gPWgVDj4 Dive into the Agent Arena leaderboard and see how agentic models perform across 5 different signals: arena.ai/leaderboard/agent

  • Arena reposted this

    Today we’re excited to launch Agent Arena - a major step towards evaluating AI in the agentic era. The frontier is no longer just about chatbots answering questions. It is about completing real tasks: using tools, adapting to user feedback, navigating errors, and producing useful artifacts. At Arena, we believe the best evaluations should be grounded in real-world use: production workloads, live user interactions, and the actual utility created through real work. Agent Arena puts models in live tool-use environments, captures rich signals from user-agent interaction, and uses causal inference to measure performance across millions of real agentic traces - orders of magnitude larger than static benchmarks. It is both a live leaderboard that continuously evolves with frontier AI uses, and a cutting-edge agentic AI product for everyone. Huge shoutout to the @Arena team for the incredible work. This is the biggest project at Arena ever, a massive cross-functional effort across research, engineering, and product to mark a major transformation of the platform. If you’re excited about this mission, come build with us!

    View organization page for Arena

    Introducing Agent Arena: real-world agentic evals at scale.

    How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

    Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

    Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination. This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code written by agents.

    Top labs in Agent Arena:
    - #1 OpenAI: GPT-5.5 (High)
    - #2 Anthropic: Claude-Opus-4.7 (Thinking)
    - #3 Z.ai: GLM-5.1
    - #4 Google DeepMind: Gemini-3.1-Pro
    - #5 Kimi (Moonshot AI): Kimi-K2.6

    More analysis in the thread, with the full technical blog below.

    What are people actually using agents for? We analyzed the task distribution in Agent Arena across a 7-day window: 160K real user tasks spanning coding, debugging, research, document creation, frontend development, file analysis, and long multi-step workflows. The largest categories were:
    - Code writing (17.5%)
    - Research and lookup (10.8%)
    - Planning and brainstorming (10.6%)
    - Multimodal image/video work (10.2%)
    - Document creation (9.1%)
    - Code debugging (8.9%)

    Agent usage is broad: it’s not just coding, but research, planning, content creation, file work, and complex workflows that combine multiple tools over many turns.

    The aggregate ranking combines multiple signals: task success, user praise vs. complaints, steerability, bash recovery, and tool hallucination. Top models win in different ways: some complete tasks more reliably, some recover better from errors, and some are easier for users to steer. Higher-cost models generally deliver stronger agentic performance, but not always. Agent Arena helps measure the trade-off: which models are strongest, which are most efficient, and how the frontier is moving.

    In a 7-day window, Agent Arena logged 2.06M tool calls across 160K+ real user tasks. The most-used tools were:
    - bash: 936K calls
    - write_file: 550K calls
    - web_search: 276K calls
    - read_file: 118K calls
    - fetch_page: 86K calls

    This gives us a new lens on agent behavior: not just what models answer, but how they search, code, edit, recover, and interact with the environment. This is the core motivation for Agent Arena: evaluate agents on real, messy, long-horizon work.

    Check out our technical blog for the Agent Arena methodology: https://lnkd.in/gjJ6yQNc
    The full Agent Arena Leaderboard is here: arena.ai/leaderboard/agent
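
The tool-call figures in this post can be turned into shares with a little arithmetic. The sketch below is illustrative only: it uses the rounded counts quoted above, treats "160K+" as exactly 160,000 tasks (an assumption, so the per-task average is an upper estimate), and the names `tool_shares`, `TOP_TOOL_CALLS`, and the "other" bucket are introduced here for illustration rather than taken from Arena.

```python
# Rounded figures quoted in the post for the 7-day window.
TOTAL_TOOL_CALLS = 2_060_000
TOTAL_TASKS = 160_000  # assumption: the post says "160K+", so this is a lower bound

TOP_TOOL_CALLS = {
    "bash": 936_000,
    "write_file": 550_000,
    "web_search": 276_000,
    "read_file": 118_000,
    "fetch_page": 86_000,
}


def tool_shares(calls, total):
    """Return each tool's share of all calls as a percentage, rounded to 0.1."""
    return {name: round(100 * count / total, 1) for name, count in calls.items()}


shares = tool_shares(TOP_TOOL_CALLS, TOTAL_TOOL_CALLS)

# Calls not attributed to the five listed tools (hypothetical "other" bucket).
other_calls = TOTAL_TOOL_CALLS - sum(TOP_TOOL_CALLS.values())

# Upper estimate, because the task count is a lower bound.
calls_per_task = TOTAL_TOOL_CALLS / TOTAL_TASKS

for name, pct in shares.items():
    print(f"{name:<11}{pct:>5.1f}%")
print(f"{'other':<11}{100 * other_calls / TOTAL_TOOL_CALLS:>5.1f}%")
print(f"tool calls per task: roughly {calls_per_task:.1f}")
```

On these rounded numbers, bash alone accounts for a little under half of all tool calls, and the five listed tools together cover all but about 94K of the 2.06M calls.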

  • Arena reposted this

    The evolution of Arena has been nothing short of enthralling. There’s something almost addictive about watching different models battle it out side by side. And having all the data fully on display from the aftermath is just good tech fun. With general-purpose agents now jumping into the fray, I can’t imagine how many leaps and bounds will be made when it comes to agentic workflows. Congratulations to the Arena team, I can’t wait to see the new rankings. Anastasios Angelopoulos Wei-Lin Chiang

  • Arena reposted this

    Agent Arena gives every model access to a Claude-Code-like harness and a computer. Our users went nuts, generating millions of real traces per week. We used this data to build the first large-scale benchmark of agent usefulness in the wild.

    We analyze agents by collecting many axes of feedback, explicit and implicit, including:
    - Confirmed success: the user marks the task as a success or failure.
    - Praise vs. complaint: the user praises or complains about the agent's output.
    - Steerability: the agent responds correctly to user requests.
    - Bash recovery: the time taken to recover from an error in bash.
    - Tool hallucination: the agent hallucinates a tool that does not exist.

    The longest tasks take multiple days and hundreds of turns, with nearly a thousand tool calls in a session (!), and give us a huge firehose of real-world agent traces to compute these signals.

    Our users are doing things like:
    - Building full-stack applications with backends and databases
    - Financial models involving market research pulled from the internet and .xlsx artifacts
    - Workflow automation, e.g. scraping all real-estate listings in an area and doing detailed data analysis on price as a function of parcel size and sqft
    - Deep research and scientific documents, pulling together .ppt presentations from careful research both from websites and academic publications

    By meeting our users where they work, Agent Arena can speak to the boundary between the possible and impossible with different agents.

    The leaderboards we calculate are based on a novel causal inference approach that looks at each subcomponent of the agent (orchestrator and harness) as a treatment, and calculates treatment effects for each. Soon we will release more on the harness side, sharing what effect different harnesses have on agent capabilities.

    Arena has gone far beyond a human preference benchmark and the voting mechanism. We are building signals of real post-deployment user value, and pushing the limits of evaluation. If you are interested in shaping the future of evaluation as a collaborator or colleague, please reach out. We’d love to hear from you!

  • View organization page for Arena

    Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, and file writing, and by asking follow-up questions. Take a peek with Product Lead Ted Moran and Founding Engineer Matthew Hovanec. Frontier models are waiting for you in Agent Mode to take on real-world tasks: GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Read more about Agent Mode, dig into the FAQ, and get a preview of what we've learned so far on our blog at: https://lnkd.in/gPWgVDj4 Start evaluating agentic AI on Arena today with Agent Mode at: arena.ai/agent

  • View organization page for Arena

    MiniMax M3 has landed in the Arena and has moved the Pareto frontier! Their latest model ranks #7 in Code Arena: Frontend with a score of 1531, neck and neck with GLM-5.1. At $0.60 input/$2.40 output per million tokens, it moves the Pareto frontier in its price class. MiniMax M3 also ranks #14 in the Document Arena, where models are ranked on document analysis and long-content reasoning. For its price point, it shifts the Pareto frontier here as well. Congrats to the MiniMax team on this achievement!
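
For readers unfamiliar with the term, a model is on the price/score Pareto frontier when no other model is both at least as cheap and at least as high-scoring (and strictly better on one of the two). The sketch below shows that check in general form. It is not Arena's code: only the MiniMax M3 row uses figures from the post (output price and Code Arena: Frontend score), comparing on output price alone is a simplifying assumption, and every `placeholder-*` row is made-up data included purely to exercise the function.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    price: float  # USD per million output tokens (lower is better); assumed axis
    score: float  # leaderboard score (higher is better)


def pareto_frontier(entries):
    """Return the entries that no other entry dominates, sorted by price.

    An entry is dominated when another entry is at least as cheap and at
    least as high-scoring, and strictly better on one of the two.
    """
    frontier = []
    for e in entries:
        dominated = any(
            o.price <= e.price
            and o.score >= e.score
            and (o.price < e.price or o.score > e.score)
            for o in entries
        )
        if not dominated:
            frontier.append(e)
    return sorted(frontier, key=lambda e: e.price)


entries = [
    Entry("MiniMax M3", 2.40, 1531),            # figures quoted in the post
    Entry("placeholder-cheap", 1.00, 1450),     # made-up illustrative value
    Entry("placeholder-mid", 3.00, 1500),       # made-up; dominated by MiniMax M3
    Entry("placeholder-premium", 15.00, 1580),  # made-up illustrative value
]

frontier_names = [e.name for e in pareto_frontier(entries)]
print(frontier_names)
```

In this toy data, "moving the frontier" corresponds to MiniMax M3 pushing `placeholder-mid` off the frontier: it costs less and scores higher, so the older point is no longer an efficient choice.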
