Small Models Outperform GPT-4o on Long Context with Effective System Design


New from Together Research: small models can beat GPT-4o on long context with the right system design.

The instinct when context windows hit 128K or 1M tokens is to throw everything into one prompt. In practice, performance degrades as length grows. Our new paper, accepted at #ICLR2026, introduces a framework to study when and why "Divide & Conquer" works, and how to design it effectively.

The core insight: long-context failures come from three distinct noise sources:
1/ Model noise: confusion grows superlinearly with input length
2/ Task noise: chunks lose cross-document context
3/ Aggregator noise: the Manager fails to stitch partial answers together correctly

Naive "MapReduce" approaches collapse on that third point. The fix is a Planner agent that rewrites the task prompt so Workers return exactly what the Manager needs.

Results: Llama-3-70B and Qwen-72B using this framework consistently outperform single-shot GPT-4o on retrieval, QA, and summarization as context length scales. The smaller models win, and they're cheaper and faster.

The limit: tasks with high cross-chunk dependency, where a clue on page 1 connects to page 100, still favor the single-shot approach.

Blog, paper, and code in the comments.
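The Planner-Worker-Manager pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `call_model` stands in for any LLM API, and the `plan`/`worker`/`manager` functions, the `FOUND:`/`NONE` output format, and the toy needle-retrieval task are all hypothetical names invented for this example.

```python
# Hedged sketch of a Planner-Worker-Manager "Divide & Conquer" loop.
# `call_model` is a placeholder for any LLM call; all names here are
# illustrative, not taken from the paper's codebase.

def chunk(text: str, size: int) -> list[str]:
    """Split a long context into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def plan(task: str) -> str:
    """Planner: rewrite the task so Workers return exactly what the
    Manager needs (here: a fixed 'FOUND: <answer>' / 'NONE' format)."""
    return (task + "\nAnswer 'FOUND: <answer>' if this chunk contains "
            "the answer, otherwise answer 'NONE'.")

def worker(prompt: str, chunk_text: str, call_model) -> str:
    """Worker: solve the rewritten task on a single chunk."""
    return call_model(prompt + "\n\nContext:\n" + chunk_text)

def manager(partials: list[str]) -> str:
    """Manager: stitch structured partial answers; the rigid format
    from the Planner is what keeps this step from adding noise."""
    found = [p[len("FOUND: "):] for p in partials if p.startswith("FOUND: ")]
    return found[0] if found else "NONE"

def divide_and_conquer(task, context, call_model, chunk_size=1000):
    prompt = plan(task)
    partials = [worker(prompt, c, call_model) for c in chunk(context, chunk_size)]
    return manager(partials)

# Toy stand-in model so the sketch runs without an API: it "retrieves"
# a magic token if it appears in the prompt.
def toy_model(prompt: str) -> str:
    return "FOUND: 4217" if "needle=4217" in prompt else "NONE"

ctx = ("filler " * 500) + "needle=4217 " + ("filler " * 500)
print(divide_and_conquer("Find the needle value.", ctx, toy_model))  # → 4217
```

Note how the cross-chunk limitation mentioned in the post shows up directly here: if the clue were split across two chunks, no single Worker would see it, and the Manager could not recover it.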


Really interesting finding: that system design matters more than raw model size is something more practitioners should understand. The Planner-Worker-Manager pattern is elegant for long-context tasks, and this kind of research helps demystify what's actually driving performance.


Smart system design beats scale: divide and conquer unlocks real efficiency gains.

