Agentic Search Leaderboard

Explore by metric

Same question, different answers

Each metric tests a specific quality dimension. Pick one to see how models performed on a real query.

Speed and cost analysis available in the full leaderboard below.

Under the hood

From your catalog to ranked results

Three steps, fully automated.

Generated

From your catalog

Real products from your index become realistic shopping queries across multiple difficulty levels.

Graded

By LLM judges

A calibrated LLM evaluates each response against clear pass/fail criteria. Agreement rate above 95%.

Ranked

With statistical confidence

We resample each score 10,000 times. Models whose confidence bands overlap share a tier, so the ranking makes no artificial distinctions.

We resample each score 10,000 times to compute confidence intervals. The band shows where the true score likely falls. When two models' bands overlap, we cannot distinguish their performance and treat them as equivalent.
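The resampling step can be illustrated with a percentile bootstrap. This is a minimal sketch, not the leaderboard's actual implementation: it assumes each score is a pass rate over per-query pass/fail verdicts, and the function name `bootstrap_ci`, the fixed seed, and the sample data are all hypothetical.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, confidence=0.95, seed=0):
    """Return (score, low, high) for a list of 1/0 pass/fail outcomes.

    Assumption: a percentile bootstrap over per-query verdicts. The source
    only says scores are resampled 10,000 times; the rest is illustrative.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Draw n outcomes with replacement, record the mean, repeat, then sort.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return sum(outcomes) / n, low, high


# Hypothetical data: 200 graded queries, 170 of which passed.
outcomes = [1] * 170 + [0] * 30
score, low, high = bootstrap_ci(outcomes)
print(f"score={score:.1%}, band=[{low:.1%}, {high:.1%}]")
```

The band narrows as the number of graded queries grows, which is why a larger query set lets more models separate into distinct tiers.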

Model A: 87.2% (Tier 1)
Model B: 82.1% (Tier 1)
Model C: 74.4% (Tier 2)

The confidence bands of Models A and B intersect, so we cannot distinguish their true quality and they share Tier 1. Model C's band does not overlap theirs, so it separates cleanly into Tier 2.
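One way to turn overlapping bands into tiers is sketched below. It rests on assumptions: a model joins the current tier when its band overlaps the band of that tier's top-scoring model, and the interval bounds are invented for illustration (the page gives only the point scores). The name `assign_tiers` is hypothetical, and the actual tiering rule may differ.

```python
def assign_tiers(models):
    """Map model name -> tier number (1 is best).

    Assumption: a model stays in the current tier if the top of its band
    reaches the bottom of the tier leader's band; otherwise it opens a
    new tier and becomes that tier's leader.
    """
    ranked = sorted(models, key=lambda m: m["score"], reverse=True)
    tiers = {}
    tier = 0
    leader_low = None
    for m in ranked:
        if leader_low is None or m["high"] < leader_low:
            tier += 1               # no overlap with the leader: new tier
            leader_low = m["low"]   # this model leads the new tier
        tiers[m["name"]] = tier
    return tiers


# Point scores come from the example above; low/high bounds are hypothetical.
models = [
    {"name": "Model A", "score": 87.2, "low": 83.0, "high": 91.0},
    {"name": "Model B", "score": 82.1, "low": 77.5, "high": 86.5},
    {"name": "Model C", "score": 74.4, "low": 70.0, "high": 77.0},
]
print(assign_tiers(models))
```

Comparing against the tier leader rather than the previous model keeps a long chain of pairwise overlaps from pulling clearly different models into a single tier.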

Full leaderboard

Every model, every dimension

Bands show 95% confidence intervals.

Frequently asked questions

This leaderboard evaluates LLMs in real agent workflows, focusing on practical factors like cost, latency, tool use, and groundedness so you can make informed decisions about which model to use for search and discovery agents.

Use it as a guide to narrow down model choices based on your specific constraints. It is a starting point, not a final answer. You should validate results in your own environment and select the model that best fits your needs.

We are model-agnostic. Agent Studio supports a bring-your-own-LLM approach, so you can use any model you prefer. This leaderboard exists to provide data and context, not to prescribe a single choice.

No. The right model depends on your use case, budget, latency targets, and quality requirements. This leaderboard highlights tradeoffs so you can decide what matters most.

These results are based on 1,500+ queries evaluated on internal ecommerce product catalogs, and are meant to be directional. Performance will vary depending on your data, setup, and implementation, and you are responsible for evaluating what works best in your environment.

We update the leaderboard as we run new evaluations across models and scenarios. As the ecosystem evolves, the data will continue to reflect the latest findings.