We tested every major LLM on real shopping queries through Agent Studio, Algolia's platform for building search and discovery agents. Three dimensions of quality. Open methodology. The results speak for themselves.
Explore by metric
Each metric tests a specific quality dimension. Pick one to see how models performed on a real query.
Speed and cost analysis available in the full leaderboard below.
Under the hood
Three steps, fully automated.
Real products from your index become realistic shopping queries across multiple difficulty levels.
A calibrated LLM evaluates each response against clear pass/fail criteria. Agreement rate above 95%.
We resample each score 10,000 times to compute confidence intervals. The band shows where the true score likely falls. When bands overlap, the models land in the same tier: no artificial distinctions.
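The resampling step can be illustrated with a minimal sketch. It assumes each model's score is the mean of per-query pass/fail verdicts and uses a percentile bootstrap. The page does not say which bootstrap variant is used, so treat the method, the function names (`bootstrap_ci`, `bands_overlap`), and the data as illustrative assumptions. The pass counts below are made up and are not benchmark results.

```python
import random

def bootstrap_ci(outcomes, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes.

    Assumption: the score is a simple pass rate; the real pipeline may differ.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the outcomes with replacement and record each resampled mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = means[round(alpha * (n_resamples - 1))]
    hi = means[round((1.0 - alpha) * (n_resamples - 1))]
    return lo, hi

def bands_overlap(a, b):
    """True when two (lo, hi) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical pass/fail verdicts for three models on 200 queries each.
model_a = [1] * 180 + [0] * 20   # 90% pass rate
model_b = [1] * 176 + [0] * 24   # 88% pass rate
model_c = [1] * 120 + [0] * 80   # 60% pass rate

ci_a = bootstrap_ci(model_a)
ci_b = bootstrap_ci(model_b)
ci_c = bootstrap_ci(model_c)

print("A:", ci_a, "B:", ci_b, "C:", ci_c)
print("A and B same tier:", bands_overlap(ci_a, ci_b))
print("A and C same tier:", bands_overlap(ci_a, ci_c))
```

In this toy data, a two-point gap on 200 queries is well inside the resampling noise, so the first two models would share a tier, while the third sits clearly below them.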
Full leaderboard
Bands show 95% confidence intervals.
Build better agents. Measure what matters. Same benchmarks, your data.