
PyTorch and NVIDIA Supercharge AI Model Performance

PyTorch and NVIDIA BioNeMo add attn_input_format for flash-attention scaling


AI researchers have a new tool for accelerating large language model performance. PyTorch and NVIDIA's BioNeMo have introduced an optimization that can significantly speed up machine learning workloads, particularly in computational biology and generative AI applications.

The key idea centers on flash-attention scaling through sequence packing: instead of padding every sequence in a batch to the same length, variable-length sequences are concatenated so attention kernels skip padding tokens entirely. Developers working with transformer-based models now have a more efficient pathway to manage computational resources and improve inference speed.

This optimization targets a critical bottleneck in AI model training: how quickly and efficiently attention mechanisms can process input sequences. By introducing a smarter method of handling sequence lengths, the collaboration between PyTorch and NVIDIA promises to reduce computational overhead and enhance overall model performance.
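To make the bottleneck concrete, here is a small back-of-the-envelope sketch (with made-up sequence lengths, not figures from the article) of how much compute a padded batch wastes compared to a packed one:

```python
# Illustrative arithmetic (not from the article): token counts for a
# padded batch vs. a packed (THD) batch of variable-length sequences.
seq_lens = [5, 3, 7]  # hypothetical sequence lengths in one batch

# Padded layout: every sequence is padded to the longest one,
# so attention runs over len(seq_lens) * max(seq_lens) positions.
padded_tokens = len(seq_lens) * max(seq_lens)

# Packed (THD) layout: sequences are concatenated with no padding.
packed_tokens = sum(seq_lens)

waste = 1 - packed_tokens / padded_tokens
print(padded_tokens, packed_tokens, f"{waste:.0%} of padded compute wasted")
```

The gap grows with length variance: the more the batch mixes short and long sequences, the more of the padded compute is spent on padding tokens.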

The technical details reveal a nuanced solution that could reshape how researchers approach large-scale machine learning challenges. For teams pushing the boundaries of generative AI, this represents more than just an incremental improvement.

Transformer Engine (TE) makes this optimization relatively simple: relevant layers gain an attn_input_format parameter, and their forward pass then accepts standard flash-attention-style cumulative sequence length keyword arguments (cu_seqlens_q and friends). These can be generated using THD-aware collators, such as Hugging Face's DataCollatorWithFlattening, or the masking version implemented in BioNeMo Recipes. The keyword arguments are passed as a dictionary:

```python
{
    "cu_seqlens_q": cu_seqlens,
    "cu_seqlens_kv": cu_seqlens,
    "max_length_q": max_length,
    "max_length_kv": max_length,
}
```

TE and sequence packing on/off performance

Figure 2 shows the performance comparison with TE and sequence packing toggled on and off, with a significant uplift in token throughput when TE is employed.
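The cumulative sequence lengths in that dictionary are just running sums of the per-sequence lengths. Below is a minimal sketch of what a THD-aware collator computes for the self-attention case, where query and key/value boundaries coincide; the helper name make_attn_kwargs is illustrative and not part of TE or the collators:

```python
from itertools import accumulate

def make_attn_kwargs(seq_lens):
    """Build flash-attention-style kwargs from per-sequence lengths.

    cu_seqlens marks the boundaries of each packed sequence:
    sequence i occupies positions [cu_seqlens[i], cu_seqlens[i+1]).
    """
    cu_seqlens = [0] + list(accumulate(seq_lens))
    max_length = max(seq_lens)
    # Self-attention: query and key/value share the same boundaries.
    return {
        "cu_seqlens_q": cu_seqlens,
        "cu_seqlens_kv": cu_seqlens,
        "max_length_q": max_length,
        "max_length_kv": max_length,
    }

kwargs = make_attn_kwargs([5, 3, 7])
print(kwargs["cu_seqlens_q"])  # [0, 5, 8, 15]
```

For cross-attention, the query and key/value sides would carry separate boundary arrays; the symmetric case shown here is the common one for packed causal language modeling.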

This demonstrates TE's ability to maximize the computational efficiency of your NVIDIA GPUs. EvolutionaryScale integrated Transformer Engine across their next-generation models as well: "ESM3 is the largest foundation model trained on biological data. Integrating the NVIDIA Transformer Engine was crucial to training it at this 98B parameter scale with high throughput and GPU utilization," said Tom Sercu, co-founder and VP of Engineering at EvolutionaryScale.

"The precision and speed of FP8 acceleration, combined with optimized kernels for fused layers, allow us to push the boundaries of compute and model scale across NVIDIA GPUs. This leads to emergent understanding of biology in our frontier models for the scientific community." Hugging Face interoperability One of the key advantages of TE is its interoperability with existing machine learning ecosystems, including popular libraries like Hugging Face. This means you can use TE's performance benefits even when working with models loaded from the Hugging Face Transformers library.

TE layers can be embedded directly inside a Hugging Face Transformers PreTrainedModel, and are fully compatible with AutoModel.from_pretrained.

AI performance just got a quiet but meaningful upgrade. PyTorch and NVIDIA's BioNeMo have introduced a strategic optimization for attention mechanisms that could significantly improve computational efficiency.

The key idea centers on the attn_input_format parameter, which allows flash-attention scaling through standard cumulative sequence length arguments. Researchers can now generate these arguments using specialized tools like Hugging Face's DataCollatorWithFlattening or BioNeMo's masking recipes.

Technically speaking, this means developers can adopt flash-attention techniques with less effort. The optimization simplifies complex sequence processing by accepting standard keyword arguments such as cu_seqlens_q and cu_seqlens_kv directly.

What's interesting is how this approach potentially reduces computational overhead. By enabling more direct sequence handling, the method could help machine learning models process information more efficiently. Still, the full performance implications remain to be thoroughly tested in real-world scenarios.

The collaboration between PyTorch and NVIDIA signals a continued focus on refining AI infrastructure. Incremental improvements like these often drive meaningful advances in machine learning capabilities.


Common Questions Answered

How does the flash-attention optimization technique improve AI model performance?

The flash-attention optimization simplifies how neural networks process complex sequence data by introducing an attn_input_format parameter to neural network layers. This technique allows for more efficient processing of transformer-based models, particularly in computational biology and generative AI applications.

What tools can developers use to generate cumulative sequence length arguments for flash-attention?

Developers can generate cumulative sequence length arguments using specialized tools like Hugging Face's DataCollatorWithFlattening or the masking version implemented in BioNeMo Recipes. These tools help create the necessary cu_seqlens_q and cu_seqlens_kv parameters for optimizing attention mechanisms.
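For intuition, here is a sketch of the effect such a collator has on a batch, in pure Python with no dependency on the transformers library. The field names are illustrative, not the collator's exact output schema:

```python
def flatten_batch(batch):
    """Concatenate token id lists into one packed sequence.

    Mimics the effect of a THD-aware collator: a single flat
    input_ids list, position_ids that restart at 0 for each
    sequence, and cu_seqlens marking sequence boundaries.
    (Sketch only -- field names are illustrative.)
    """
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in batch:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return {"input_ids": input_ids,
            "position_ids": position_ids,
            "cu_seqlens": cu_seqlens}

out = flatten_batch([[101, 7, 8], [101, 9]])
print(out["position_ids"])  # [0, 1, 2, 0, 1]
print(out["cu_seqlens"])    # [0, 3, 5]
```

The restarting position_ids are what let a packed batch preserve per-sequence positional information, while cu_seqlens tells the attention kernel where one sequence ends and the next begins.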

What is the significance of the attn_input_format parameter in PyTorch and NVIDIA's BioNeMo optimization?

The attn_input_format parameter is a key innovation that enables flash-attention scaling by accepting standard cumulative sequence length keyword arguments. This parameter simplifies the process of optimizing attention mechanisms, potentially improving computational efficiency for AI models.