Dreamina Seedance 2.0 from ByteDance Seed takes the #1 spot across all modalities in the Artificial Analysis Video Arena, surpassing Kling 3.0, Grok Imagine, and Veo 3.1!

Dreamina Seedance 2.0 is the latest video generation model from ByteDance Seed, capable of generating videos of up to 15 seconds with native stereo audio support. It accepts text, images, and video as inputs, including multiple image references in a single generation.

Dreamina Seedance 2.0 is currently available to whitelisted customers on the Dreamina AI app, with general availability coming later.

See example generations from Dreamina Seedance 2.0 in the Artificial Analysis Video Arena 🧵
Artificial Analysis
Newark, Delaware · 25,567 followers
Independent analysis of AI: Understand the AI landscape and analyze AI technologies http://artificialanalysis.com/

About us
Leading independent analysis of AI. Understand the AI landscape to choose the best AI technologies for your use case. Backed by Nat Friedman, Daniel Gross and Andrew Ng.
- Website: https://artificialanalysis.ai
- Industry: Technology, Information and Internet
- Company size: 11-50 employees
- Headquarters: Newark, Delaware
- Type: Privately Held

Locations
131 Continental Dr, Suite 305, Newark, Delaware 19713, US
Updates
-
India enters the open-weights AI race with its largest models pre-trained from scratch: Sarvam 105B and Sarvam 30B

Sarvam's Sarvam 105B and Sarvam 30B score 18 and 12 on the Artificial Analysis Intelligence Index respectively. Announced at the India AI Impact Summit 2026 and open-sourced under Apache 2.0, both are Mixture-of-Experts models trained entirely in India using compute provided under the IndiaAI Mission. Both support reasoning and non-reasoning modes, and both are an improvement on Sarvam's previous model, Sarvam M (8 on Intelligence Index, 23.6B parameters), which was based on Mistral Small rather than pre-trained from scratch.

Sarvam 105B has 106B total parameters with ~10B active per token and a 128K context window. Sarvam 30B has 32B total parameters with ~2.4B active per token and a 65K context window. Alongside the text models, Sarvam also announced Saaras v3 (Speech to Text) and Bulbul v3 (Text to Speech) with a focus on Indic languages.

Key takeaways in reasoning mode:
➤ Sarvam 105B scores 18 on the Intelligence Index. Among ~100B-class open-weights reasoning models, it trails GLM-4.5-Air (23), INTELLECT-3 (22), Mistral Small 4 (27), and gpt-oss-120B (High, 33). All four peers also activate more parameters per token.
➤ Sarvam 30B scores 12 on the Intelligence Index. Among ~30B-class open-weights reasoning models, it trails GLM-4.7-Flash (30), Nemotron Cascade 2 30B A3B (28), Qwen3 30B A3B 2507 (22), and Qwen3 32B (17). Sarvam 30B activates fewer parameters than these peers.
➤ Sarvam 105B's relative strength is in select agentic tasks. Its agentic index of 25 places it ahead of INTELLECT-3 (20) and GLM-4.5-Air (21) despite trailing both on overall intelligence. Its GDPval index of 773 also edges ahead of GLM-4.5-Air (665). Both new models are a large step up from Sarvam M (Reasoning), which scored 8 on the Intelligence Index.
➤ Compared to peers, both models score lower on TerminalBench Hard (Agentic Coding & Terminal Use) and AA-Omniscience. Sarvam 105B scored 1.5% and Sarvam 30B scored 2.3% on TerminalBench Hard, compared to GLM-4.5-Air (20.5%) and INTELLECT-3 (9.1%). The AA-Omniscience Index is -60 for Sarvam 105B and -72 for Sarvam 30B. Both models have high hallucination rates relative to their accuracy and attempt to answer most questions rather than abstaining, which drives the negative scores.

Key model details:
➤ Modality: Text input and output only.
➤ Context window: 128K tokens (Sarvam 105B) and 65K tokens (Sarvam 30B).
➤ Pricing: Currently free on Sarvam's first-party API.
➤ License: Apache 2.0.
➤ Availability: Sarvam's first-party API; weights available on Hugging Face and AIKosh.

See how the models compare to other models you are using: https://lnkd.in/gVh662hY
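On why the AA-Omniscience scores go negative: an index of this kind rewards correct answers, penalizes incorrect ones, and treats abstentions neutrally, so a model that attempts nearly every question with modest accuracy is pushed below zero. A rough illustration with made-up numbers, assuming a simple +1/-1/0 scoring scaled to ±100 (the official index weighting may differ):

```python
# Rough illustration (not the official AA-Omniscience implementation) of how
# answering almost everything with modest accuracy drives a score negative.
# Assumed scoring: +1 per correct answer, -1 per incorrect answer, 0 per
# abstention, averaged over all questions and scaled to +/-100.

def omniscience_style_score(correct: int, incorrect: int, abstained: int) -> float:
    total = correct + incorrect + abstained
    return 100 * (correct - incorrect) / total

# Hypothetical behaviour: a model that attempts 95% of questions but is right
# on only ~20% of them ends up deeply negative...
print(omniscience_style_score(correct=20, incorrect=75, abstained=5))   # -55.0
# ...while a model that abstains when unsure stays positive despite the same
# number of correct answers.
print(omniscience_style_score(correct=20, incorrect=10, abstained=70))  #  10.0
```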
-
Microsoft has released MAI-Transcribe-1: a speech transcription model achieving a 3.0% word error rate on AA-WER (#4 overall) at a fast ~69x real-time processing speed

The model was developed by Microsoft AI (MAI)'s Superintelligence team and supports 25 languages including English, French, Arabic, Japanese, and Chinese. The MAI-Transcribe-1 API is currently available in public preview via Azure Speech on Microsoft Foundry.

On the Artificial Analysis Speech to Text (STT) leaderboard, MAI-Transcribe-1 achieves a 3.0% word error rate on AA-WER for speech transcription accuracy, positioning it 4th overall behind Mistral's Voxtral Small (2.9% AA-WER), Google's Gemini 3.1 Pro High (2.9% AA-WER) and ElevenLabs' Scribe v2 (2.3% AA-WER). It also stands out as one of the faster high-accuracy transcription models available, processing audio at ~69x real-time.

See more details below ⬇️
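The ~69x real-time figure means the model processes audio far faster than its playback duration. A minimal sketch of how such a speed multiple can be measured, assuming you time the transcription call yourself; `transcribe_fn` is a placeholder callable, not a method of any Microsoft SDK:

```python
import time

def realtime_factor(audio_duration_s: float, transcribe_fn) -> float:
    """Return how many seconds of audio are processed per second of wall-clock time."""
    start = time.perf_counter()
    transcribe_fn()  # placeholder for whatever client call produces the transcript
    elapsed = time.perf_counter() - start
    return audio_duration_s / elapsed

# Example: a 10-minute (600 s) file transcribed in ~8.7 s of wall-clock time
# corresponds to roughly 600 / 8.7 ≈ 69x real-time.
```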
-
Google has released Gemma 4, a new family of multimodal open-weights models including Gemma 4 E2B, Gemma 4 E4B, Gemma 4 31B and Gemma 4 26B A4B

Google DeepMind's new Gemma 4 family introduces four multimodal models supporting text, image, and video inputs. We evaluated Gemma 4 31B (dense) and Gemma 4 26B A4B (MoE), both with a 256K context window, while the two smaller models support up to 128K. With 31B and 26B parameters respectively, both evaluated models can run on a single H100.

On GPQA Diamond, our scientific reasoning evaluation, Gemma 4 31B (Reasoning) scores 85.7%, the second highest result we have recorded for an open-weights model with fewer than 40B parameters, just behind Qwen3.5 27B (Reasoning, 85.8%). It reaches this score using only ~1.2M output tokens, fewer than Qwen3.5 27B (~1.5M) and Qwen3.5 35B A3B (~1.6M). Gemma 4 26B A4B (Reasoning) scores 79.2%, ahead of gpt-oss-120B (high, 76.2%) but behind Qwen3.5 9B (Reasoning, 80.6%).

We are now running the Artificial Analysis Intelligence Index on all four Gemma 4 models and will share a full update once those results are complete.
-
KwaiKAT has released KAT-Coder-Pro V2, a non-reasoning model that scores 44 on the Artificial Analysis Intelligence Index, an 8 point improvement over KAT-Coder-Pro V1

KwaiKAT has updated their flagship proprietary coding model with the release of KAT-Coder-Pro V2. KAT-Coder-Pro V2 achieves 44 on the Artificial Analysis Intelligence Index, matching Claude Sonnet 4.6 (non-reasoning) and trailing only Claude Opus 4.6 (non-reasoning, 46) among non-reasoning models.

At ~9M output tokens, it is also more token efficient than Claude Opus 4.6 (~11M), Claude Sonnet 4.6 (~14M), and reasoning models of similar intelligence such as DeepSeek V3.2 (reasoning, ~61M) and Qwen3.5 397B A17B (reasoning, ~86M).

KAT-Coder-Pro V2 is a non-reasoning model, unlike most current frontier language models, which 'think' before answering. Typically, reasoning variants score higher on the Intelligence Index than their non-reasoning counterparts, but consume more output tokens and are less suited to latency-sensitive workloads.

More details in the article below.
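On why output-token efficiency matters in practice: for the same workload, output spend scales with tokens generated times the per-token rate. A toy comparison using the token totals above and an entirely hypothetical price (neither model's actual pricing is quoted in this post):

```python
# Toy illustration of why output-token efficiency matters for cost.
# Token counts are the evaluation totals quoted above; the per-token price
# below is a hypothetical placeholder, not either model's actual pricing.
PRICE_PER_M_OUTPUT_TOKENS_USD = 10.0  # assume both models charged the same rate

def eval_output_cost(output_tokens_m: float) -> float:
    return output_tokens_m * PRICE_PER_M_OUTPUT_TOKENS_USD

print(eval_output_cost(9))   # ~9M tokens (KAT-Coder-Pro V2)        -> $90 at the assumed rate
print(eval_output_cost(61))  # ~61M tokens (DeepSeek V3.2 reasoning) -> $610 at the same rate
```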
-
Cohere has released Cohere Transcribe: an open weights model achieving 4.7% on AA-WER, which is measured across 3 datasets including our proprietary AA-AgentTalk dataset

The 2B parameter model is based on a conformer encoder-decoder architecture. It was trained from scratch on 14 languages including English, French, Mandarin, Japanese, and Arabic.

On the Artificial Analysis Speech to Text (STT) leaderboard, Cohere Transcribe achieves a 4.7% word error rate on AA-WER for speech transcription accuracy, positioning it near NVIDIA's Canary Qwen 2.5B (4.4% AA-WER, 2.5B parameters) and OpenAI's Whisper Large v3 (4.2% AA-WER, 1.6B parameters). It is also among the faster transcription models available, processing ~60 seconds of audio in approximately one second.

Cohere Transcribe is currently available for free (subject to rate limits) via Cohere's API. The model is also available for download on Hugging Face under an Apache 2.0 license.

See more details below ⬇️
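Word error rate, the metric behind AA-WER, counts the word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, divided by the number of reference words. A minimal generic implementation (not Artificial Analysis' actual harness; real scoring pipelines typically normalize casing and punctuation first, which this sketch skips):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the model transcribes speech", "the model transcribed speech"))  # 0.25
```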
-
Introducing AA-AgentPerf - the hardware benchmark for the agent era.

Key details:
➤ Real agent workloads, not synthetic queries: we've captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens
➤ Production optimizations allowed: KV cache reuse, disaggregated prefill/decode, speculative decoding - we're allowing the optimizations that labs and inference providers are serving in production so that we can capture what real deployments should look like
➤ Measures what developers need to know: max concurrent users at each target output speed, expressed per accelerator, per kW TDP, per $/hr, and per rack
➤ Built for every kind of scale: designed to measure systems from a single accelerator up to a full rack, and to fairly evaluate every architecture from DRAM-only designs to SRAM-only designs and everything in between
➤ Live now: we're announcing AA-AgentPerf today and opening submissions of configurations for benchmarking effective immediately. The models supported at launch are gpt-oss-120b and DeepSeek V3.2. We'll be publishing results on a rolling basis.

AA-AgentPerf is a benchmark for real-world performance of AI accelerator hardware. We benchmark inference of particular models on a specific system with a specific config (i.e. inference stack, parallelism config, and more). AA-AgentPerf has been shaped by our work with inference providers and engagement with AI accelerator companies, developers, and enterprise buyers over the past year. Our goal is for anyone deploying models - whether buying or leasing accelerators - to be able to use AA-AgentPerf as the definitive resource for understanding real-world hardware performance.

AA-AgentPerf results will primarily be expressed as the maximum number of concurrent users serviceable at a given per-user token output speed (and vice versa). We will combine these results with several dimensions that are important to developers and customers:
➤ Users per accelerator: the most basic view - how many users can be serviced by a single accelerator at each output speed.
➤ Users per kW TDP: provides context on how power-efficient the accelerators are.
➤ Users per unit rental cost: provides context on how cost-efficient each accelerator is.
➤ Users per rack: provides context on how space-efficient the accelerators are.

We expect initial results to be available within the next 1-2 weeks, after submissions from hardware providers and QA from our team. Results will be visible at https://lnkd.in/gR4hmEgY

Get to know the evaluation methodology in detail at https://lnkd.in/gsJZYCpp
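To make the normalized views above concrete, here is a minimal sketch of how they can be derived from one raw measurement (maximum concurrent users at a target per-user output speed). The dataclass, field names, and figures are illustrative placeholders, not AA-AgentPerf's actual schema or any published result:

```python
from dataclasses import dataclass

@dataclass
class SystemResult:
    """Hypothetical raw result for one system config at one target output speed."""
    max_concurrent_users: int   # measured at the target per-user tokens/s
    num_accelerators: int
    total_tdp_kw: float
    rental_cost_per_hr: float   # $/hr for the whole system
    rack_fraction: float        # share of a rack the system occupies

def normalized_views(r: SystemResult) -> dict:
    # Express one measurement per accelerator, per kW TDP, per $/hr, and per rack,
    # mirroring the dimensions listed above.
    return {
        "users_per_accelerator": r.max_concurrent_users / r.num_accelerators,
        "users_per_kw_tdp": r.max_concurrent_users / r.total_tdp_kw,
        "users_per_dollar_hr": r.max_concurrent_users / r.rental_cost_per_hr,
        "users_per_rack": r.max_concurrent_users / r.rack_fraction,
    }

# Illustrative numbers only: an 8-accelerator node serving 400 concurrent users
# at the target speed, 10 kW TDP, $24/hr, occupying a quarter of a rack.
print(normalized_views(SystemResult(400, 8, 10.0, 24.0, 0.25)))
# -> {'users_per_accelerator': 50.0, 'users_per_kw_tdp': 40.0,
#     'users_per_dollar_hr': 16.66..., 'users_per_rack': 1600.0}
```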
-
Google has released Gemini 3.1 Flash Live Preview, achieving #2 in our Big Bench Audio Speech to Speech benchmark, and it now features configurable thinking levels

With the thinking level set to high, it scores 95.9% on Big Bench Audio, making it the second-highest scoring speech reasoning model behind Step-Audio R1.1 Realtime (97.0%) and ahead of Grok Voice Agent (92.9%). Switching to minimal thinking brings the score down to 70.5%, but opens up a faster option for latency-sensitive applications.

The flexibility in thinking levels also provides a range of latency profiles. On high, average Time to First Audio (TTFA) is 2.98 seconds, slower than Step-Audio R1.1 Realtime (1.51s) and Grok Voice Agent (0.78s). On minimal, TTFA drops to 0.96 seconds, closer to the pack but still behind Google's own Gemini 2.5 Flash Native Audio Dialog (0.63s), which trades ~5 points of intelligence for the fastest response time on our leaderboard.

Key takeaways:
➤ The model introduces configurable thinking levels (minimal, low, medium, high) that let developers dial reasoning depth up or down
➤ "High" thinking level: 95.9% Big Bench Audio score (2nd overall, behind only Step-Audio R1.1 Realtime), 2.98s TTFA
➤ "Minimal" thinking level: 70.5% score, 0.96s TTFA
➤ Pricing remains stable at $0.35 per hour of audio input and $1.38 per hour of audio output, matching Gemini 2.5 Flash Native Audio Dialog

See below for more details 👇
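Time to First Audio (TTFA), the latency figure quoted above, is the wall-clock gap between issuing a request and receiving the first audio chunk of the response. A minimal measurement sketch with a generic streaming callable; `stream_response_audio` is a placeholder, not part of Google's SDK:

```python
import time

def time_to_first_audio(stream_response_audio) -> float:
    """Seconds from issuing the request until the first audio chunk arrives.

    `stream_response_audio` is any callable returning an iterator of audio
    chunks (a placeholder for a real speech-to-speech client's streaming call).
    """
    start = time.perf_counter()
    for _chunk in stream_response_audio():
        return time.perf_counter() - start  # stop at the very first chunk
    raise RuntimeError("stream ended without producing audio")

# Averaging this over many prompts gives figures like the 2.98 s (high thinking)
# and 0.96 s (minimal thinking) reported above.
```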
-
Inworld, ElevenLabs, and MiniMax continue to lead our Text to Speech leaderboard for most preferred models

Recent checkpoints from each of these labs continue to push the frontier of TTS quality, with 4 of the top 5 models released this year. Leading TTS models are increasingly realistic, particularly on relatively straightforward text, with preference differences increasingly coming down to affinity for different voices. The latest results also reflect stronger bot vote filtering, confirmed via triangulation against third-party evaluators. We've also added rank ranges based on each model's 95% confidence interval, showing where a model could land given its Elo score range.

Key results:
➤ Most preferred: Current top 5 on our TTS leaderboard: 1. Inworld TTS 1.5 Max (Elo of 1,238); 2. ElevenLabs Eleven v3 (1,197); 3. Inworld TTS 1 Max (1,183); 4. Inworld TTS 1.5 Mini (1,182); 5. MiniMax Speech 2.8 HD (1,175)
➤ Price: Kokoro 82M v1.0 (Replicate) leads at $0.65 per 1M characters, followed by Inworld TTS 1 and 1.5 Mini at $5, and AsyncFlow V2 at $8.33
➤ Speed: WaveNet leads for batch generation at 419 characters processed per second, followed by Kokoro 82M v1.0 (Replicate) at 235, and Inworld TTS 1.5 Mini at 214

See below for further detail ⬇️
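Elo scores on the leaderboard translate into expected head-to-head preference rates. A minimal sketch using the standard Elo expectation formula, assuming the conventional 400-point scale (the exact model fitting behind the published scores may differ):

```python
# Standard Elo expected-win formula: probability that model A is preferred
# over model B in a head-to-head vote, assuming the conventional 400-point scale.
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# Using the scores quoted above: Inworld TTS 1.5 Max (1,238) vs
# MiniMax Speech 2.8 HD (1,175) -> roughly a 59% expected preference rate.
print(round(elo_win_probability(1238, 1175), 2))  # ~0.59
```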
-
Artificial Analysis reposted this
Had the pleasure of a 20-minute meeting with Jensen at NVIDIA GTC last week! We discussed the importance of independent benchmarks and how Artificial Analysis can further support the AI ecosystem, from agents to models to hardware. Our benchmarks were also referenced three times in Jensen's keynote.

He also gave Artificial Analysis an NVIDIA Blackwell GB200 chip signed 'The benchmark of excellence'. The only problem is that we might blow a fuse if we try plugging it in at our office. 😆

Certainly a GTC to remember!