Small Models Outperform GPT-4o on Long Context with Effective System Design


New from Together Research: small models can beat GPT-4o on long context with the right system design.

The instinct when context windows hit 128K or 1M tokens is to throw everything into one prompt. In practice, performance degrades as length grows. Our new paper, accepted at #ICLR2026, introduces a framework to study when and why "Divide & Conquer" works, and how to design it effectively.

The core insight: long-context failures come from three distinct noise sources:
1/ Model noise: confusion grows superlinearly with input length
2/ Task noise: chunks lose cross-document context
3/ Aggregator noise: the Manager fails to stitch partial answers together correctly

Naive "MapReduce" approaches collapse on that third point. The fix is a Planner agent that rewrites the task prompt so Workers return exactly what the Manager needs.

Results: Llama-3-70B and Qwen-72B using this framework consistently outperform single-shot GPT-4o on retrieval, QA, and summarization as context length scales. The smaller models win, and they're cheaper and faster.

The limit: tasks with high cross-chunk dependency, where a clue on page 1 connects to page 100, still favor the single-shot approach.

Blog, paper, and code in the comments.
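The Planner-Worker-Manager pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `call_model` stands in for any LLM API, and the `plan`/`worker`/`manager` functions, the `FOUND:`/`NONE` output format, and the toy needle-retrieval task are all hypothetical names invented for this example.

```python
# Hedged sketch of a Planner-Worker-Manager "Divide & Conquer" loop.
# `call_model` is a placeholder for any LLM call; all names here are
# illustrative, not taken from the paper's codebase.

def chunk(text: str, size: int) -> list[str]:
    """Split a long context into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def plan(task: str) -> str:
    """Planner: rewrite the task so Workers return exactly what the
    Manager needs (here: a fixed 'FOUND: <answer>' / 'NONE' format)."""
    return (task + "\nAnswer 'FOUND: <answer>' if this chunk contains "
            "the answer, otherwise answer 'NONE'.")

def worker(prompt: str, chunk_text: str, call_model) -> str:
    """Worker: solve the rewritten task on a single chunk."""
    return call_model(prompt + "\n\nContext:\n" + chunk_text)

def manager(partials: list[str]) -> str:
    """Manager: stitch structured partial answers; the rigid format
    from the Planner is what keeps this step from adding noise."""
    found = [p[len("FOUND: "):] for p in partials if p.startswith("FOUND: ")]
    return found[0] if found else "NONE"

def divide_and_conquer(task, context, call_model, chunk_size=1000):
    prompt = plan(task)
    partials = [worker(prompt, c, call_model) for c in chunk(context, chunk_size)]
    return manager(partials)

# Toy stand-in model so the sketch runs without an API: it "retrieves"
# a magic token if it appears in the prompt.
def toy_model(prompt: str) -> str:
    return "FOUND: 4217" if "needle=4217" in prompt else "NONE"

ctx = ("filler " * 500) + "needle=4217 " + ("filler " * 500)
print(divide_and_conquer("Find the needle value.", ctx, toy_model))  # → 4217
```

Note how the cross-chunk limitation mentioned in the post shows up directly here: if the clue were split across two chunks, no single Worker would see it, and the Manager could not recover it.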


Really interesting finding: that system design matters more than raw model size is something more practitioners should understand. The Planner-Worker-Manager pattern is elegant for long-context tasks, and this kind of research helps demystify what's actually driving performance.


Smart system design beats scale: divide and conquer unlocks real efficiency gains.

