Agent Studio Benchmark

Agentic Search Leaderboard

We tested every major LLM on real shopping queries through Agent Studio, Algolia's platform for building search and discovery agents. Three dimensions of quality. Open methodology. The results speak for themselves.

Who leads agentic search?

Providers covered: Anthropic, OpenAI, Google, xAI, DeepSeek, and open-source models.

1. Model A: overall leader on Relevance, Hallucinations, and Language (4.1s, $0.024)
2. Model B: strong on Relevance, but a higher Hallucination rate (3.2s, $0.018)
3. Model C: best speed-to-quality ratio in the benchmark (1.8s, $0.006)

The real ranking is in the full leaderboard below.

Same question, different answers

Each metric tests a specific quality dimension. Pick one to see how models performed on a real query.

Speed and cost analysis available in the full leaderboard below.

From your catalog to ranked results

Three steps, fully automated. Click any step to see how it works.

Generated (from your catalog)

Real products from your index become realistic shopping queries across multiple difficulty levels.
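
For illustration, here is a minimal sketch of how a catalog record could be turned into graded-difficulty queries. It assumes an OpenAI-compatible chat client; the prompt, model name, and product record are hypothetical placeholders, not the generator Agent Studio actually runs.

```python
import json
from openai import OpenAI  # assumed client; any OpenAI-compatible provider works

client = OpenAI()

GENERATOR_PROMPT = """You are generating realistic shopping queries for a search benchmark.
Given the product below, write one query per difficulty level:
- easy: the shopper names the product almost directly
- medium: the shopper describes the need, not the product
- hard: the query is ambiguous or combines several constraints
Return JSON: {{"easy": "...", "medium": "...", "hard": "..."}}

Product:
{product}
"""

def generate_queries(product: dict, model: str = "gpt-4o-mini") -> dict:
    """Turn one catalog record into shopping queries at three difficulty levels."""
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": GENERATOR_PROMPT.format(product=json.dumps(product))}],
    )
    return json.loads(resp.choices[0].message.content)

# Illustrative catalog record, not taken from a real index
product = {"name": "Trail Runner 3 GTX", "brand": "Acme",
           "category": "running shoes",
           "attributes": {"waterproof": True, "terrain": "trail"}}
print(generate_queries(product))
```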

Graded (by LLM judges)

A calibrated LLM evaluates each response against clear pass/fail criteria. Agreement rate above 95%.
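
A hedged sketch of what one judge call can look like. The criterion names mirror the benchmark's dimensions (relevance, groundedness, language), but the prompt, model choice, and JSON schema here are illustrative assumptions, not the calibrated judge itself.

```python
import json
from openai import OpenAI  # assumed client; the real judge prompt and model are internal

client = OpenAI()

JUDGE_PROMPT = """Grade the shopping agent's answer. Apply each criterion independently.
Criteria:
- relevance: the recommended products satisfy the shopper's stated constraints
- groundedness: every claim about a product is supported by the retrieved catalog data
- language: the answer is fluent and written in the shopper's language
Return JSON: {{"relevance": "pass" or "fail", "groundedness": "pass" or "fail",
"language": "pass" or "fail", "reason": "one sentence"}}

Query: {query}
Retrieved products: {products}
Agent answer: {answer}
"""

def judge_response(query: str, products: list[dict], answer: str,
                   model: str = "gpt-4o") -> dict:
    """Return pass/fail verdicts for one agent response."""
    resp = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, products=json.dumps(products), answer=answer)}],
    )
    return json.loads(resp.choices[0].message.content)
```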

Ranked (with statistical confidence)

We resample each score 10,000 times. Overlapping confidence bands = same tier. No artificial distinctions.
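
The resampling step is a percentile bootstrap over per-query pass/fail scores. A self-contained sketch, using an illustrative score vector rather than real benchmark data:

```python
import random

def bootstrap_ci(scores: list[int], n_resamples: int = 10_000,
                 confidence: float = 0.95) -> tuple[float, float]:
    """Resample per-query pass/fail scores with replacement and return a
    percentile confidence interval for the model's true pass rate."""
    n = len(scores)
    means = sorted(sum(random.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((1 - confidence) / 2 * n_resamples)]
    hi = means[int((1 + confidence) / 2 * n_resamples) - 1]
    return lo, hi

# Illustrative scores: 1 = pass, 0 = fail for each graded query
model_a = [1] * 872 + [0] * 128   # an 87.2% observed pass rate
print(bootstrap_ci(model_a))      # roughly (0.85, 0.89)
```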


We resample each score 10,000 times to compute confidence intervals. The band shows where a model's true score likely falls. When bands overlap, we cannot statistically distinguish the models.

Model A: 87.2% (Tier 1)
Model B: 82.1% (Tier 1)
Model C: 74.4% (Tier 2)
Models A and B overlap. Their confidence bands intersect, meaning we cannot distinguish their true quality. They share Tier 1. Model C separates cleanly into Tier 2.
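
Under that rule, tier assignment reduces to an interval-overlap check. The bands below are illustrative values chosen to roughly match the chart above, not measured results:

```python
def same_tier(band_a: tuple[float, float], band_b: tuple[float, float]) -> bool:
    """Two models share a tier when their confidence bands overlap."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]

# Illustrative 95% bands (low, high), not measured results
bands = {"Model A": (0.845, 0.896),
         "Model B": (0.796, 0.846),
         "Model C": (0.717, 0.771)}
print(same_tier(bands["Model A"], bands["Model B"]))  # True  -> same tier
print(same_tier(bands["Model B"], bands["Model C"]))  # False -> separate tiers
```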

Every model, every dimension

Sort by any column. Bands show 95% confidence intervals. Hover a metric for its definition.

Build better agents. Measure what matters. Same benchmarks, your data.

Frequently asked questions

What does this leaderboard measure?
This leaderboard evaluates LLMs in real agent workflows, focusing on practical factors like cost, latency, tool use, and groundedness so you can make informed decisions about which model to use for search and discovery agents.

How should I use it?
Use it as a guide to narrow down model choices based on your specific constraints. It is a starting point, not a final answer. You should validate results in your own environment and select the model that best fits your needs.

Does Agent Studio require a specific model?
We are model agnostic. Agent Studio supports a bring-your-own-LLM approach, so you can use any model you prefer. This leaderboard exists to provide data and context, not to prescribe a single choice.

Is there a single best model?
No. The right model depends on your use case, budget, latency targets, and quality requirements. This leaderboard highlights tradeoffs so you can decide what matters most.

How reliable are these results?
These results are based on 1,500+ queries evaluated on internal ecommerce product catalogs, and are meant to be directional. Performance will vary depending on your data, setup, and implementation, and you are responsible for evaluating what works best in your environment.

How often is the leaderboard updated?
We update the leaderboard as we run new evaluations across models and scenarios. As the ecosystem evolves, the data will continue to reflect the latest findings.
Try Agent Studio