
PyTorch and NVIDIA BioNeMo add attn_input_format for flash‑attention scaling


Why does this matter for anyone training large biology transformers? Scaling those models has long run into a bottleneck: when sequence lengths vary widely across a batch, attention layers waste compute on padding and reshaping. The underlying hardware, NVIDIA GPUs paired with PyTorch, offers plenty of raw speed, but the software stack still needs a way to feed the right shape information to the attention kernels without costly reshaping.

Recent work in the PyTorch‑NVIDIA BioNeMo collaboration introduces a small but consequential change. By exposing a new argument on the relevant layers, developers can hand off the cumulative sequence‑length data that flash‑attention expects. That data, passed as cu_seq_lens_q, can be produced on the fly by THD‑aware collators (THD being the packed total-tokens, heads, head-dim tensor layout), such as Hugging Face's DataCollatorWithFlattening.

The result is a cleaner pipeline, fewer manual steps, and a path toward faster, more memory‑efficient training of biology‑focused transformer models.

TE makes this optimization relatively simple by adding an attn_input_format parameter to relevant layers, which then accepts standard flash-attention-style cumulative sequence length keyword arguments (cu_seq_lens_q). These can be generated using THD-aware collators, such as Hugging Face's DataCollatorWithFlattening, or the masking version implemented in BioNeMo Recipes. The keyword arguments take the form:

```python
{
    "cu_seqlens_q": cu_seqlens,
    "cu_seqlens_kv": cu_seqlens,
    "max_length_q": max_length,
    "max_length_kv": max_length,
}
```

[Figure 2: TE and sequence packing on/off performance]

Figure 2 shows the performance comparison, with a significant uplift in token throughput when TE is employed.
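As a concrete illustration, the cumulative sequence lengths can be derived directly from the per-sequence token counts in a packed batch. The sketch below is not the BioNeMo Recipes implementation; the variable names and sizes are hypothetical, and only the final dictionary mirrors the keyword arguments shown above.

```python
import torch

# Hypothetical per-sequence token counts for one packed (THD-format) batch.
seq_lens = torch.tensor([128, 37, 512, 64], dtype=torch.int32)

# Cumulative sequence lengths with a leading zero, as flash-attention expects:
# [0, 128, 165, 677, 741]; entries i and i+1 delimit sequence i in the packed batch.
cu_seqlens = torch.cat([
    torch.zeros(1, dtype=torch.int32),
    torch.cumsum(seq_lens, dim=0).to(torch.int32),
])

# Length of the longest individual sequence in the batch.
max_length = int(seq_lens.max())

# Keyword arguments in the shape shown above; queries and keys/values share the
# same lengths here because this is self-attention over a single packed batch.
attn_kwargs = {
    "cu_seqlens_q": cu_seqlens,
    "cu_seqlens_kv": cu_seqlens,
    "max_length_q": max_length,
    "max_length_kv": max_length,
}
```

In a THD-format forward pass these values accompany the flattened token tensor, letting the attention kernels recover per-sequence boundaries without padding or reshaping.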

This demonstrates TE's ability to maximize the computational efficiency of your NVIDIA GPUs. EvolutionaryScale integrated Transformer Engine across their next-generation models as well: "ESM3 is the largest foundation model trained on biological data. Integrating the NVIDIA Transformer Engine was crucial to training it at this 98B parameter scale with high throughput and GPU utilization," said Tom Sercu, co-founder and VP of Engineering at EvolutionaryScale.

"The precision and speed of FP8 acceleration, combined with optimized kernels for fused layers, allow us to push the boundaries of compute and model scale across NVIDIA GPUs. This leads to emergent understanding of biology in our frontier models for the scientific community."

Hugging Face interoperability

One of the key advantages of TE is its interoperability with existing machine learning ecosystems, including popular libraries like Hugging Face. This means you can use TE's performance benefits even when working with models loaded from the Hugging Face Transformers library.

TE layers can be embedded directly inside a Hugging Face Transformers PreTrainedModel, and are fully compatible with AutoModel.from_pretrained.
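A minimal sketch of that pattern follows. The TinyTEConfig and TinyTEModel names, the sizes, and the registration step are hypothetical choices for illustration, and the code assumes transformer_engine.pytorch plus a CUDA-capable environment; it only shows the general shape of embedding TE layers in a PreTrainedModel so the usual Auto APIs can load it.

```python
import transformer_engine.pytorch as te
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel


class TinyTEConfig(PretrainedConfig):
    # Hypothetical model type used only for Auto-class registration.
    model_type = "tiny_te"

    def __init__(self, hidden_size=1024, ffn_hidden_size=4096,
                 num_attention_heads=16, **kwargs):
        self.hidden_size = hidden_size
        self.ffn_hidden_size = ffn_hidden_size
        self.num_attention_heads = num_attention_heads
        super().__init__(**kwargs)


class TinyTEModel(PreTrainedModel):
    config_class = TinyTEConfig

    def __init__(self, config):
        super().__init__(config)
        # A Transformer Engine block used directly as a submodule of a
        # Hugging Face PreTrainedModel.
        self.block = te.TransformerLayer(
            hidden_size=config.hidden_size,
            ffn_hidden_size=config.ffn_hidden_size,
            num_attention_heads=config.num_attention_heads,
        )

    def forward(self, hidden_states, **kwargs):
        return self.block(hidden_states)


# Register the hypothetical classes so the Auto classes can resolve checkpoints.
AutoConfig.register("tiny_te", TinyTEConfig)
AutoModel.register(TinyTEConfig, TinyTEModel)

# Round-tripping then uses the standard Hugging Face API:
#   model = TinyTEModel(TinyTEConfig())
#   model.save_pretrained("tiny-te-checkpoint")
#   model = AutoModel.from_pretrained("tiny-te-checkpoint")
```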

Related Topics: #PyTorch #NVIDIA #BioNeMo #flash-attention #Hugging Face #Transformer Engine #cu_seq_lens_q #ESM3 #EvolutionaryScale

Can researchers truly avoid the steep learning curve? While the attn_input_format parameter streamlines integration, the need to generate cu_seq_lens_q via THD‑aware collators adds another dependency. Some teams have reported higher training throughput on benchmark workloads, yet those results have not been independently verified.

Yet, the trade‑off between speed and code familiarity remains unclear. Adapting to BioNeMo's recipes may still require substantial engineering effort. Because low‑precision formats like FP8 and FP4 are supported, memory pressure could ease, but performance impacts have not been fully disclosed.
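For reference, FP8 execution in TE is enabled through its autocast context manager. The snippet below is a minimal sketch using a standalone te.Linear layer rather than a BioNeMo model, and it assumes a GPU with FP8 support (for example, Hopper-class hardware).

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling FP8 recipe; the HYBRID format uses E4M3 in the forward pass
# and E5M2 for gradients in the backward pass.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(1024, 1024, bias=True)      # FP8-capable TE module
x = torch.randn(16, 1024, device="cuda")

# GEMMs inside TE modules run in FP8 within this context, where supported.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.sum().backward()
```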

If the collator generation can be automated, the overhead might shrink, but current documentation offers limited guidance. And the choice of parallelism strategies still falls to the user. Consequently, the promised simplicity might be offset by the complexity of selecting optimal libraries.

Nevertheless, the approach offers a concrete path for scaling biology transformers within existing PyTorch workflows. Further benchmarking will be needed to confirm whether speed gains outweigh integration costs. Thus, adoption will likely depend on how easily existing pipelines can incorporate the new parameter without extensive rewrites.

Overall, the addition of attn_input_format is a modest step, but its practical benefit remains to be demonstrated across diverse research pipelines.


Common Questions Answered

How does the new attn_input_format parameter improve flash‑attention scaling for large biology transformers?

The attn_input_format parameter allows layers to accept flash‑attention‑style cumulative sequence length arguments, eliminating costly reshaping operations. By feeding cu_seq_lens_q and related metadata directly, attention layers can handle widely varying sequence lengths more efficiently on NVIDIA GPUs.
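For instance, assuming the parameter is set when a TE layer is constructed (hypothetical sizes; the exact constructor signature should be checked against the installed Transformer Engine version), a minimal sketch looks like this:

```python
import transformer_engine.pytorch as te

# "thd" selects the packed (total_tokens, heads, head_dim) input layout, so the
# layer expects cumulative sequence lengths instead of a padded batch dimension.
packed_layer = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
    attn_input_format="thd",
)
```

The cumulative-length keyword arguments described above are then supplied alongside the flattened hidden states at call time.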

What role do THD‑aware collators like Hugging Face's DataCollatorWithFlattening play in using attn_input_format?

THD‑aware collators generate the required cu_seq_lens_q, cu_seq_lens_kv, max_length_q, and max_length_kv values that attn_input_format expects. This integration streamlines data preprocessing, ensuring the attention layers receive correctly formatted sequence length information without manual reshaping.
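A small usage sketch of that collator is below; the token IDs are arbitrary, and output field names may vary slightly across transformers versions.

```python
from transformers import DataCollatorWithFlattening

collator = DataCollatorWithFlattening()

# Two already-tokenized examples of different lengths; no padding is added.
features = [
    {"input_ids": [101, 2023, 2003, 102]},
    {"input_ids": [101, 2178, 2742, 2007, 2062, 19204, 2015, 102]},
]

batch = collator(features)

# The collator concatenates the examples into one long sequence and emits
# position_ids that restart at 0 for each original example, which is enough
# to recover per-sequence boundaries and hence the cumulative lengths.
print(batch["input_ids"].shape)   # e.g. torch.Size([1, 12])
print(batch["position_ids"][0])   # 0..3 for the first example, 0..7 for the second
```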

Can researchers expect faster training on benchmark datasets by adopting BioNeMo's attn_input_format optimization?

Some teams have reported higher training throughput when using the attn_input_format parameter with BioNeMo recipes, likely due to reduced overhead in attention computation. However, these results have not been independently verified, so the performance gain remains anecdotal.

What are the potential trade‑offs when integrating the attn_input_format parameter into existing codebases?

While attn_input_format simplifies flash‑attention integration, it introduces a dependency on THD‑aware collators and BioNeMo's specific recipes, which may require additional engineering effort. Teams must balance the speed benefits against the learning curve and code familiarity required to adopt these new components.