Together AI's ATLAS Boosts Inference Speed 400% by Adapting to Workloads
The push to make AI inference faster just got smarter. Together AI today rolled out ATLAS, a speculative-execution system that learns from the workloads you feed it and can make inference up to four times faster than a static baseline. That's a significant departure from conventional static speculators, which tend to lose their edge once a company's workload starts to drift.
ATLAS keeps an eye on each incoming prompt, adjusts its prediction strategy on the fly, and effectively teaches itself what is likely to come next in a conversation or task: the system guesses, learns from the guess, and repeats that loop in real time. "Companies we work with generally, as they scale up, they see shifting workloads, and then they don't see as much speedup from speculative execution as before," explained Tri Dao, chief scientist at Together AI, in a VentureBeat interview.
So, while it’s not a magic bullet, ATLAS seems poised to keep pace with changing usage patterns better than its predecessors.
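In classic speculative decoding, a small draft model proposes several tokens and the large target model verifies them in a single batched pass, keeping only the matching prefix. The guess-learn-repeat loop described above can be sketched with toy stand-ins; this is a hypothetical illustration, not ATLAS's actual implementation, and the transition rule, lookup-table draft, and proposal length `k` are all invented for the example:

```python
# Toy sketch of adaptive speculative decoding (hypothetical; not Together AI's
# actual ATLAS implementation). A small "draft" predictor proposes k tokens,
# the "target" model verifies them in one batched step, and the draft learns
# from every verified token, so its acceptance rate tracks the workload.

def target_next(context):
    """Stand-in for the large target model: a deterministic next-token rule."""
    return (context[-1] * 31 + 7) % 50  # arbitrary toy transition

class AdaptiveDraft:
    """Draft predictor that learns the target's transitions as it sees them."""
    def __init__(self):
        self.table = {}  # last token -> predicted next token

    def propose(self, context, k):
        ctx, out = list(context), []
        for _ in range(k):
            guess = self.table.get(ctx[-1], 0)  # guess 0 for unseen tokens
            out.append(guess)
            ctx.append(guess)
        return out

    def learn(self, prev_token, actual):
        self.table[prev_token] = actual

def generate(prompt, n_tokens, k=4):
    draft, seq, verify_calls = AdaptiveDraft(), list(prompt), 0
    while len(seq) < len(prompt) + n_tokens:
        proposal = draft.propose(seq, k)
        verify_calls += 1  # one batched target pass checks the whole proposal
        for guess in proposal:
            actual = target_next(seq)
            draft.learn(seq[-1], actual)  # online update: adapt to the workload
            seq.append(actual)            # accepted, or corrected on a mismatch
            if actual != guess or len(seq) >= len(prompt) + n_tokens:
                break  # rejected guess: discard the rest of the proposal
    return seq[len(prompt):], verify_calls

tokens, calls = generate([1], n_tokens=200)
print(f"{len(tokens)} tokens in {calls} verify calls")  # far fewer than 200
```

Because the draft updates on every verified token, its proposals converge to the target's behavior, most batches are accepted in full, and the number of expensive verify passes falls well below one per generated token.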
"These speculators generally don't work well when their workload domain starts to shift," Dao added.

The workload drift problem no one talks about

Most speculators in production today are "static" models: trained once on a fixed dataset representing expected workloads, then deployed without any ability to adapt.
Companies like Meta and Mistral ship pre-trained speculators alongside their main models, and inference platforms like vLLM use these static speculators to boost throughput without changing output quality. But when an enterprise's AI usage evolves, the static speculator's accuracy plummets.
"If you're a company producing coding agents, and most of your developers have been writing in Python, all of a sudden some of them switch to writing Rust or C, then you see the speed starts to go down," Dao explained. "The speculator has a mismatch between what it was trained on versus what the actual workload is."

This workload drift is a hidden tax on scaling AI: enterprises either accept degraded performance or invest in retraining custom speculators.
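The mismatch Dao describes can be made concrete with a toy setup (hypothetical code, not Together AI's system): a speculator "trained" on one workload is scored by its token-acceptance rate when the traffic shifts, with and without online adaptation. The two workload functions below are invented stand-ins for, say, Python-heavy versus Rust-heavy traffic:

```python
# Hypothetical illustration of the "workload drift" failure mode: a static
# speculator trained on one workload sees its token-acceptance rate collapse
# when traffic shifts, while an adaptive one keeps learning. Toy models only.

def workload_a(tok):  # stand-in for the original traffic (e.g. Python-heavy)
    return (tok * 3 + 1) % 97

def workload_b(tok):  # stand-in for shifted traffic (e.g. Rust-heavy)
    return (tok * 5 + 2) % 97

def acceptance_rate(draft_table, target, steps=1000, adapt=False):
    """Fraction of next-token guesses the target model accepts."""
    tok, hits = 1, 0
    for _ in range(steps):
        guess = draft_table.get(tok)
        actual = target(tok)
        hits += (guess == actual)
        if adapt:
            draft_table[tok] = actual  # online update from verified tokens
        tok = actual
    return hits / steps

# "Train" a static draft table on workload A.
static = {t: workload_a(t) for t in range(97)}

print("static on A:  ", acceptance_rate(dict(static), workload_a))  # high
print("static on B:  ", acceptance_rate(dict(static), workload_b))  # collapses
print("adaptive on B:", acceptance_rate(dict(static), workload_b, adapt=True))
```

The static table's guesses stop matching once the transition rule changes, while the adaptive variant recovers after a single pass over the new workload, which is the failure mode and fix the article describes.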
This development arrives just as enterprise AI hits a key moment. Companies are moving past small pilots into full-scale rollouts, and the assumption that inference performance will simply scale linearly is proving wrong. Static infrastructure can't keep up with the way real-world traffic ebbs and flows: query types change, complexity spikes, and volume can swing wildly over a single day.
ATLAS’s on-the-fly adaptation hints that the industry may be heading toward inference engines that can tune themselves. I suspect the next step will be systems that not only react to the current load but also try to guess bigger patterns, seasonal sales cycles, demand spikes caused by time-zone differences, or the ripple effect of a new product launch on AI usage. Moving from a static accelerator to a context-aware one could finally give businesses the steady performance they need to make AI a reliable, cost-effective part of daily operations, instead of the current cycle of flashy benchmarks followed by disappointing real-world results.
Common Questions Answered
How does ATLAS achieve up to 400% faster inference speeds compared to traditional systems?
ATLAS uses a speculative-execution system that learns from workloads in real time, continuously analyzing incoming prompts and dynamically adjusting its prediction strategy. This adaptive approach allows it to maintain high performance even as a company's usage patterns change, unlike static speculators, which lose effectiveness as workloads drift.
What is the 'workload drift problem' that ATLAS addresses, according to Together AI's chief scientist?
The workload drift problem occurs when companies scale up and their AI usage patterns shift, causing static speculators trained on fixed datasets to become less effective. Tri Dao noted that these traditional systems don't work well when the workload domain starts to change, leading to reduced inference speedups.
Why is ATLAS particularly important for companies moving from pilot projects to full-scale AI deployment?
ATLAS is critical because static infrastructure cannot accommodate the dynamic nature of real-world usage where query types, complexity, and volume fluctuate unpredictably. Its adaptive capability ensures inference performance scales effectively during full deployment, addressing the false assumption that performance would scale linearly.
How does ATLAS's real-time learning capability differ from traditional static speculators?
Traditional static speculators are trained once on a fixed dataset and cannot adapt to changing workloads, while ATLAS continuously analyzes prompts and teaches itself how to anticipate patterns dynamically. This self-optimizing approach represents a shift toward intelligent infrastructure that maintains effectiveness as usage evolves.