
Black Forest Labs releases Flux 2 with Mistral‑3 24B vision‑language model


Black Forest Labs just released its newest model, and the timing feels deliberate. They rolled out Flux 2 with a “multi-reference” feature that seems to tie text and visual cues more tightly together. Most of the chatter still circles around raw parameter counts, but I keep wondering how the parts actually work.

The thing is, Flux 2 isn’t a single block; it’s split into two modules that each handle a different side of generation. One module deals with the meaning behind what you see and say, while the other pulls those pieces into a single layout. That separation looks like an attempt to preserve details - shapes, materials, spatial relationships - which older models often dropped.

For anyone building image-text pipelines, the architecture might end up mattering more than the headline numbers. The section that follows shows how the two modules talk to each other, and hints at why this hybrid style could be useful in real-world apps.

Hybrid architecture with Mistral vision language model

Flux 2 combines two core components. A vision-language model, Mistral-3 24B, interprets both text and image inputs, while a second module, the Rectified Flow Transformer, handles the logical layout and ensures that details like shapes and materials appear correctly. Flux 2 also uses a VAE image encoder to store and restore images efficiently without losing quality.
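To make the division of labor concrete, here is a minimal sketch of that two-stage flow in Python. All class and function names are illustrative stand-ins; Black Forest Labs has not published this interface.

```python
from dataclasses import dataclass


@dataclass
class Conditioning:
    """Joint text+image semantics, as the vision-language model might produce."""
    tokens: list


def vlm_encode(prompt: str, reference_images: list) -> Conditioning:
    # Stand-in for Mistral-3 24B: fuse text and image inputs
    # into a single conditioning signal.
    return Conditioning(tokens=[prompt] + reference_images)


def flow_transform(cond: Conditioning) -> list:
    # Stand-in for the Rectified Flow Transformer: turn the
    # conditioning into a latent image layout.
    return [f"latent<{t}>" for t in cond.tokens]


def vae_decode(latents: list) -> str:
    # Stand-in for the VAE decoder: reconstruct pixels from latents.
    return f"image({len(latents)} latent chunks)"


cond = vlm_encode("a copper kettle on slate", ["ref1.png", "ref2.png"])
image = vae_decode(flow_transform(cond))
print(image)  # image(3 latent chunks)
```

The point of the sketch is the hand-off: semantics are resolved once, up front, and the layout module only ever sees the fused conditioning, which is plausibly why details survive better than in single-module designs.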

These systems work together to let the model create new content or edit existing images.

Four models for different users

The Flux 2 family includes four main versions, each tuned for different performance needs and levels of control:

- Flux 2 [pro]: The highest-quality model, intended to match leading closed-source systems. It is available through the BFL Playground, the BFL API, and launch partners.
- Flux 2 [flex]: Designed for developers who want to adjust parameters like step count or guidance scale to trade speed for quality. It is also available through the Playground and API.
- Flux 2 [dev]: A 32-billion-parameter model released with open weights.
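For a sense of what the [flex] tier's tunable parameters look like in practice, here is a hypothetical request payload. The field names and model identifier are assumptions for illustration, not BFL's documented API schema.

```python
import json

# Hypothetical payload shape for a [flex]-style request; every key
# here is an assumption, not BFL's published schema.
payload = {
    "model": "flux-2-flex",                      # hypothetical identifier
    "prompt": "product shot of a ceramic mug",
    "steps": 28,                                  # [flex] exposes step count...
    "guidance": 4.0,                              # ...and guidance scale
    "reference_images": ["mug_a.png", "mug_b.png"],
}

body = json.dumps(payload)
print(body)
```

Lowering `steps` would trade quality for speed; raising `guidance` would push outputs closer to the prompt. That speed-versus-quality dial is the stated reason [flex] exists as a separate tier.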


Can one model really handle both vision and language? Black Forest Labs seems to think so, rolling out Flux 2 - a set of image generators that promise up to four-megapixel results and can take as many as ten reference pictures at once. The system mixes Mistral-3 24B, a vision-language model that reads text and images, with a Rectified Flow Transformer that tries to keep composition, shapes and material cues intact.

Developers get a choice: hit a lightweight API endpoint or grab the fully open weights and run it locally. Open weights might invite a lot of tinkering, but it’s still unclear how much the community will actually contribute. Likewise, the real benefit of feeding multiple references is hard to gauge - there are no public benchmarks or user studies yet.

The company touts high-resolution fidelity, yet we haven’t seen proof that the architecture holds up across very different subjects. All in all, Flux 2 adds some interesting tools to Black Forest Labs’ lineup, but we’ll have to wait and see how it performs in everyday use.

Common Questions Answered

What are the two core components of Flux 2 and how do they work together?

Flux 2 combines the Mistral‑3 24B vision‑language model, which interprets both text and image inputs, with a Rectified Flow Transformer that manages logical layout and ensures accurate shapes and material cues. Together they enable the system to generate coherent images while preserving detailed visual semantics.

How does the "multi‑reference" feature of Flux 2 enhance image generation?

The multi‑reference feature allows Flux 2 to ingest up to ten reference images simultaneously, providing richer visual context for the model. This capability helps the model produce more consistent and detailed outputs, especially when replicating complex compositions.
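The ten-image cap and the pooling of references can be sketched as follows. The pooling strategy (simple averaging) and the embedding function are assumptions; in Flux 2 the embedding role would fall to Mistral-3 24B.

```python
MAX_REFERENCES = 10  # the cap Flux 2 advertises


def embed(image_path: str) -> list:
    # Stand-in embedder: returns a tiny fake feature vector.
    return [float(len(image_path)), 1.0]


def pool_references(paths: list) -> list:
    # Enforce the advertised cap, then average the reference
    # embeddings into one conditioning vector (averaging is an
    # assumed strategy, not BFL's documented one).
    if len(paths) > MAX_REFERENCES:
        raise ValueError(f"Flux 2 accepts at most {MAX_REFERENCES} references")
    vecs = [embed(p) for p in paths]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]


cond = pool_references(["a.png", "bb.png"])
print(cond)  # [5.5, 1.0]
```

Whatever the real fusion mechanism is, the contract is the same: many references in, one conditioning signal out, with a hard limit of ten inputs.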

What role does the VAE image encoder play in Flux 2's architecture?

The VAE image encoder stores and restores images efficiently, compressing visual data without sacrificing quality. By integrating this encoder, Flux 2 can maintain high-fidelity outputs while managing the computational load of large image generation tasks.
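Some rough arithmetic shows why a VAE encoder matters at this scale. Latent-diffusion VAEs commonly downsample each spatial side by 8x into a handful of latent channels; both figures below are assumed, since BFL has not published Flux 2's exact factors.

```python
# Assumed values: 8x spatial downsampling, 16 latent channels.
width, height, channels = 2048, 2048, 3
downsample, latent_channels = 8, 16

pixel_values = width * height * channels
latent_values = (width // downsample) * (height // downsample) * latent_channels

print(pixel_values // latent_values)  # 12
```

Under these assumptions the flow transformer works on roughly one twelfth as many values as the raw image holds, which is what keeps four-megapixel generation computationally tractable.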

What resolution does Flux 2 claim to achieve, and why is this significant?

Flux 2 claims to generate images at four‑megapixel resolution, which is notable for a model that also processes multiple reference images and complex textual prompts. This high resolution demonstrates the effectiveness of its hybrid architecture in delivering detailed, large‑scale visuals.
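To put the four-megapixel figure in concrete terms, here is the pixel count for a square image at that scale. The 2048x2048 resolution is illustrative; BFL states only the megapixel figure.

```python
def megapixels(w: int, h: int) -> float:
    # One megapixel = one million pixels.
    return w * h / 1_000_000


# An illustrative square resolution near the claimed four megapixels.
print(round(megapixels(2048, 2048), 2))  # 4.19
```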