Zyphra AI has released ZAYA1-8B, a Mixture‑of‑Experts language model with 8.4 billion total parameters, of which only 760 million are active per forward pass. The model was trained on AMD hardware, and its small active compute footprint lets it punch above its weight on math and coding benchmarks. The company has made the model available under an Apache 2.0 license on Hugging Face and as a serverless endpoint on Zyphra Cloud, so developers can experiment without building their own clusters.
Despite having under a billion active parameters, ZAYA1-8B scores competitively against first‑generation frontier reasoning models such as DeepSeek‑R1‑0528, Gemini‑2.5‑Pro and Claude 4.5 Sonnet on tough mathematical tasks. Its test‑time compute method, dubbed Markovian RSA, even nudges it past Claude 4.5 Sonnet and GPT‑5‑High on the HMMT’25 benchmark (89.6 vs 88.3). Dense models would need far more memory and incur higher latency to hit similar marks, whereas ZAYA1‑8B’s expert routing keeps inference lean and fast. The gap between “active” and “total” parameters is what makes this possible: only the routed experts run for each token, so per‑token compute tracks the 760 million active parameters rather than the full 8.4 billion. That opens the door to on‑device LLM applications that were previously out of reach.
ZAYA1-8B has 8.4B total parameters but only 760M are active per forward pass. This dramatically reduces inference compute and memory bandwidth requirements while retaining the representational capacity of a much larger model.
ZAYA1-8B can be deployed on-device for local LLM applications, run efficiently in test-time compute harnesses, and serve requests at lower latency compared to dense models with similar benchmark performance.
https://www.zyphra.com/post/zaya1-8b
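To put the active-versus-total distinction in numbers, here is a minimal back-of-the-envelope sketch. The two parameter counts come from the article; the "about 2 FLOPs per active parameter per token" rule is a common approximation rather than a figure from Zyphra, and the dense comparison model is hypothetical.

```python
# Figures taken from the article; everything else is an illustrative assumption.
TOTAL_PARAMS = 8.4e9    # total parameters
ACTIVE_PARAMS = 760e6   # parameters active per forward pass


def active_fraction(active: float, total: float) -> float:
    """Share of the weights that participate in a single forward pass."""
    return active / total


def approx_flops_per_token(active: float) -> float:
    """Rule-of-thumb estimate: roughly 2 FLOPs (multiply + add) per active
    parameter per token. Ignores the cost of attention scores."""
    return 2.0 * active


frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
moe_flops = approx_flops_per_token(ACTIVE_PARAMS)
# Hypothetical dense model with the same total parameter count.
dense_flops = approx_flops_per_token(TOTAL_PARAMS)

print(f"active fraction: {frac:.1%}")                               # 9.0%
print(f"compute ratio dense/MoE: {dense_flops / moe_flops:.1f}x")   # 11.1x
```

Note that this estimate covers per-token compute only. All 8.4 billion weights still have to be stored somewhere, so the saving shows up in FLOPs and in the weights read per token, not in the size of the checkpoint.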
Architecture: MoE++ and Three Key Innovations
ZAYA1-8B is built on Zyphra’s MoE++ architecture, which introduces three specific changes over standard MoE designs. Together, these form the basis of what Zyphra calls ZAYA1-8B’s intelligence efficiency: the design goal of maximizing the intelligence extracted per parameter and per FLOP.
Compressed Convolutional Attention (CCA), a sequence mixing mechanism developed by Zyphra that operates in a compressed latent space and achieves 8× KV-cache compression versus standard attention. The KV-cache is the memory used during inference to store intermediate attention states -- an 8× reduction directly lowers memory requirements at inference time and allows longer effective contexts within the same hardware envelope.
ZAYA1 MLP-based router with PID-controller bias balancing.
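As a rough illustration of what the 8× KV-cache compression mentioned above could mean in practice, the sketch below estimates cache size for a single sequence. The layer count, head count, head dimension and context length are hypothetical placeholders, not ZAYA1-8B's published configuration; only the 8× factor comes from the article. The formula is the standard estimate for ordinary attention and does not model how CCA actually stores its compressed states.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_value=2, compression=1.0):
    """Approximate KV-cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes 16-bit values.
    """
    raw = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
    return raw / compression


# Hypothetical model shape, NOT ZAYA1-8B's real configuration.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32_768)

baseline = kv_cache_bytes(**shape)
compressed = kv_cache_bytes(**shape, compression=8.0)  # 8x figure from the article

GIB = 1024 ** 3
print(f"baseline:   {baseline / GIB:.2f} GiB")    # 4.00 GiB
print(f"compressed: {compressed / GIB:.2f} GiB")  # 0.50 GiB
```

Under these assumed dimensions, the same memory budget that holds one 32,768-token cache without compression would hold roughly eight times as many tokens with it, which is the sense in which compression allows longer contexts on the same hardware.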
Why this matters
ZAYA1-8B shows that a Mixture of Experts approach can shrink active compute without sacrificing benchmark scores. Trained end‑to‑end on AMD hardware, the model delivers math and coding results that beat open‑weight models many times its size, yet only 760 million parameters fire per forward pass.
For developers seeking on‑device LLMs, the reduced memory bandwidth could lower deployment barriers. And because Zyphra released the model under Apache 2.0 on Hugging Face, we can experiment without licensing hurdles. Yet the announcement does not detail latency on typical edge devices, leaving performance claims unverified outside benchmark suites.
Can founders trust the reported gains for production workloads, or will hidden costs emerge when scaling to real‑world traffic? Moreover, the serverless endpoint offers a quick testbed, but its pricing and availability remain unclear. We remain cautious.
Our teams will watch early adopters closely until broader evaluations confirm the model’s versatility across domains beyond math and code.