
PyTorch and NVIDIA BioNeMo add attn_input_format for flash‑attention scaling


When we try to train huge biology transformers, the attention layers often hit a snag: they struggle to keep up when sequence lengths swing wildly from one batch to the next. The hardware side isn't the problem; modern NVIDIA GPUs running PyTorch still deliver plenty of horsepower. The hiccup is on the software side, which has lacked a clean way to pass sequence-shape information to the attention layers without costly reshapes.

The latest tweak from the PyTorch-NVIDIA BioNeMo effort might help. By adding a new argument to the relevant layers, developers can now hand over the cumulative sequence-length tensor that flash attention expects. That tensor, passed as cu_seqlens_q, can be built on the fly by collators that emit the packed THD layout (total tokens, heads, head dimension) - for example, Hugging Face's DataCollatorWithFlattening.

In practice, this means a slimmer pipeline, fewer manual steps, and a likely boost in speed and memory efficiency for biology-focused transformer training.
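To make that concrete, here is a minimal sketch, in plain PyTorch, of the cumulative sequence-length tensor a packed batch carries. The variable names are illustrative rather than taken from the BioNeMo source.

import torch

# Per-sequence token counts for one packed batch, e.g. three protein sequences.
seq_lens = torch.tensor([512, 128, 384], dtype=torch.int32)

# Entry i marks where sequence i starts in the flattened token dimension;
# the leading zero and the final total-token count are both included.
cu_seqlens = torch.zeros(seq_lens.numel() + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)
max_length = int(seq_lens.max())

print(cu_seqlens)  # tensor([   0,  512,  640, 1024], dtype=torch.int32)
print(max_length)  # 512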

Transformer Engine (TE) makes this optimization relatively simple by adding an attn_input_format parameter to the relevant layers, which then accept standard flash-attention-style cumulative sequence-length keyword arguments (cu_seqlens_q, cu_seqlens_kv, max_length_q, and max_length_kv). These can be generated using THD-aware collators, such as Hugging Face's DataCollatorWithFlattening, or the masking version implemented in BioNeMo Recipes; a rough sketch of the wiring appears below.

Figure 2 (TE and sequence packing on/off performance) shows the comparison, with a significant uplift in token throughput when TE is employed.
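In code, feeding that metadata to a TE layer configured for the THD layout looks roughly like the sketch below. The layer construction and hyperparameters are assumptions for illustration, not the BioNeMo recipe itself, and the exact forward signature that consumes these kwargs depends on your Transformer Engine version.

import torch
import transformer_engine.pytorch as te  # requires an NVIDIA GPU and TE installed

# Packed-batch metadata, same values as the earlier sketch, built more compactly.
seq_lens = torch.tensor([512, 128, 384], dtype=torch.int32)
cu_seqlens = torch.nn.functional.pad(torch.cumsum(seq_lens, dim=0), (1, 0)).to(torch.int32)
max_length = int(seq_lens.max())

# A TE layer configured for the packed THD ([total_tokens, heads, head_dim]) layout.
layer = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
    attn_input_format="thd",
)

# The flash-attention-style keyword arguments from the snippet above, ready to be
# passed through to the attention layer along with the packed hidden states.
packed_kwargs = {
    "cu_seqlens_q": cu_seqlens,
    "cu_seqlens_kv": cu_seqlens,
    "max_length_q": max_length,
    "max_length_kv": max_length,
}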

That throughput uplift demonstrates TE's ability to maximize the computational efficiency of your NVIDIA GPUs. EvolutionaryScale integrated Transformer Engine across their next-generation models as well: "ESM3 is the largest foundation model trained on biological data. Integrating the NVIDIA Transformer Engine was crucial to training it at this 98B parameter scale with high throughput and GPU utilization," said Tom Sercu, co-founder and VP of Engineering at EvolutionaryScale.

"The precision and speed of FP8 acceleration, combined with optimized kernels for fused layers, allow us to push the boundaries of compute and model scale across NVIDIA GPUs. This leads to emergent understanding of biology in our frontier models for the scientific community." Hugging Face interoperability One of the key advantages of TE is its interoperability with existing machine learning ecosystems, including popular libraries like Hugging Face. This means you can use TE's performance benefits even when working with models loaded from the Hugging Face Transformers library.

TE layers can be embedded directly inside a Hugging Face Transformers PreTrainedModel, and are fully compatible with AutoModel.from_pretrained.
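As a rough illustration of that embedding pattern, a TE layer can sit inside a PreTrainedModel like any other torch module. The TinyTEConfig and TinyTEModel names below are hypothetical, and an NVIDIA GPU with the transformer_engine package installed is assumed; this is a sketch, not a drop-in recipe.

import transformer_engine.pytorch as te
from transformers import PretrainedConfig, PreTrainedModel

class TinyTEConfig(PretrainedConfig):
    model_type = "tiny_te"  # hypothetical model type for this sketch

    def __init__(self, hidden_size=512, ffn_hidden_size=2048, num_attention_heads=8, **kwargs):
        self.hidden_size = hidden_size
        self.ffn_hidden_size = ffn_hidden_size
        self.num_attention_heads = num_attention_heads
        super().__init__(**kwargs)

class TinyTEModel(PreTrainedModel):
    config_class = TinyTEConfig

    def __init__(self, config):
        super().__init__(config)
        # A Transformer Engine layer used as an ordinary torch.nn.Module.
        self.layer = te.TransformerLayer(
            config.hidden_size,
            config.ffn_hidden_size,
            config.num_attention_heads,
        )

    def forward(self, hidden_states, **kwargs):
        # Default TE layout is [seq, batch, hidden] unless attn_input_format says otherwise.
        return self.layer(hidden_states, **kwargs)

model = TinyTEModel(TinyTEConfig())
model.save_pretrained("tiny_te_checkpoint")            # standard Hugging Face serialization
model = TinyTEModel.from_pretrained("tiny_te_checkpoint")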


Can researchers really dodge the steep learning curve? The attn_input_format flag does make hooking things together a bit easier, but you still have to generate cu_seqlens_q with a THD-aware collator, which adds another moving part. A few groups say they saw faster convergence on benchmark sets, yet nobody has independently checked those numbers.

The speed-versus-code-familiarity trade-off is still fuzzy. Switching to BioNeMo's recipes will probably mean a decent amount of engineering work. Because FP8 and FP4 precision are supported, memory pressure might drop, but we haven't seen a full picture of the trade-offs involved.

If the collator could be generated automatically, the extra cost would shrink - the docs, however, are pretty thin on that. Parallelism choices are still left to the user, so the promised simplicity could be swallowed by the hassle of picking the right libraries. Still, the approach does give a tangible route to scale biology transformers inside existing PyTorch pipelines.

More benchmarking will be needed to see if the speed wins outweigh the integration effort. In the end, whether teams adopt it will hinge on how smoothly they can slip the new flag into their current code without a major rewrite. The attn_input_format addition is modest, and its real-world payoff remains to be proven.

Common Questions Answered

How does the new attn_input_format parameter improve flash‑attention scaling for large biology transformers?

The attn_input_format flag allows layers to accept flash‑attention‑style cumulative sequence-length arguments, eliminating costly reshaping operations. By feeding cu_seqlens_q and the related metadata directly, attention layers can handle wildly varying sequence lengths more efficiently on NVIDIA GPUs.

What role do THD‑aware collators like Hugging Face's DataCollatorWithFlattening play in using attn_input_format?

THD‑aware collators generate the cu_seqlens_q, cu_seqlens_kv, max_length_q, and max_length_kv values that attn_input_format-enabled layers expect; see the sketch below. This integration streamlines data preprocessing, ensuring the attention layers receive correctly formatted sequence-length information without manual reshaping.
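For illustration, such a collator can be exercised on its own. The return_flash_attn_kwargs flag below is an assumption about recent transformers releases, so check the version you have installed; the masking collator in BioNeMo Recipes plays the same role otherwise.

from transformers import DataCollatorWithFlattening

# Three variable-length examples get flattened into a single packed row.
features = [
    {"input_ids": [5, 6, 7, 8]},
    {"input_ids": [9, 10]},
    {"input_ids": [11, 12, 13]},
]

collator = DataCollatorWithFlattening(return_flash_attn_kwargs=True)
batch = collator(features)

# Alongside the flattened input_ids and position_ids, the batch carries the
# cumulative sequence-length and max-length entries that THD-format attention
# layers consume downstream.
print(sorted(batch.keys()))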

Can researchers expect faster convergence on benchmark datasets by adopting BioNeMo's attn_input_format optimization?

Some teams have reported faster convergence when using the attn_input_format flag with BioNeMo recipes, likely due to reduced overhead in attention computation. However, these results have not been independently verified, so the performance gain remains anecdotal.

What are the potential trade‑offs when integrating the attn_input_format flag into existing codebases?

While attn_input_format simplifies flash‑attention integration, it introduces a dependency on THD‑aware collators and BioNeMo's specific recipes, which may require additional engineering effort. Teams must balance the speed benefits against the learning curve and code familiarity required to adopt these new components.