Black Forest Labs FLUX: Visual AI's Open Source Gambit

Black Forest Labs is building toward 'visual intelligence' with FLUX. The open-source framing is real—but so are the questions about consent, deepfakes, and enterprise data.

Written by AI. Rachel "Rach" Kovacs

May 9, 2026 · 8 min read

Black Forest Labs wants you to know they are a research company first. They publish in the open. They move fast. They release state-of-the-art. In a recent talk at AI Engineer, Stephen Batifol, BFL's developer relations engineer, walked through the FLUX model lineage—from FLUX.1 in August 2024 through Kontext, FLUX.2, and FLUX.2 Klein—and laid out where the company intends to go next. The pitch is genuinely impressive. It also raises questions that Batifol's talk, understandably, doesn't get around to answering.

Let me handle both.

What BFL actually built

The FLUX lineage matters because each release represents a distinct capability jump, not just a benchmark increment.

FLUX.1 launched in August 2024 as an open-source text-to-image model that ran on consumer hardware. Better anatomy than competitors, smaller footprint than most—Hugging Face's Clem Delangue gave it a public shout-out, and it briefly became the most-liked model on the platform. That's a real signal in a crowded space.

Then came Kontext. Batifol describes it as "the first open source editing model in the world that was like the combination of text to image and image editing at the same time." That framing deserves a flag: open-source image editing models predate Kontext—InstructPix2Pix, for instance, goes back to 2022. The "first" claim appears to be defined more narrowly than stated, possibly referring to a specific combination of reference-based editing and generation at this quality level, but Batifol doesn't specify. Worth knowing before you quote that line.

What Kontext demonstrably did was bring editing latency down to 7-8 seconds at a time when OpenAI's GPT image generation was taking 40-50 seconds. That's not a narrow technical win. Speed shapes what products get built.

FLUX.2, released in November 2025, pushed into multi-reference territory: feed it up to 10 images simultaneously, and it synthesizes coherent outputs, such as outfits assembled from component garments, furniture staged in realistic environments, and product shots that read as photography. Batifol claims the outputs are "impossible to basically tell they are AI generated," pointing to hand anatomy, skin texture, veins. Every image AI company says this now, but BFL has enough credibility that the claim is worth taking seriously rather than dismissing. Independent evaluation would be more useful than my skepticism here.

FLUX.2 Klein, released in January 2026, is the speed story: 300 milliseconds for generation, 500 milliseconds for editing, per Batifol's figures. He compared these numbers to Qwen at 15-20 seconds across equivalent tasks. These are company-reported figures from a demo environment, not third-party benchmarks, so real-world performance on varied hardware will differ. But even if Klein is half as fast as claimed, it's a different category of user experience.
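To make the "half as fast" discount concrete, here is a back-of-the-envelope sketch using only the figures Batifol quoted. The function and constant names are mine, chosen for illustration, and the numbers remain company-reported rather than measured. Taken at face value, Klein comes out roughly 50x to 67x faster than Qwen for generation and 30x to 40x for editing; doubling Klein's latency still leaves it 25x to 33x and 15x to 20x ahead.

```python
# Company-reported latencies from the talk, in seconds. Not third-party benchmarks.
KLEIN_LATENCY_S = {"generation": 0.3, "editing": 0.5}
QWEN_LATENCY_RANGE_S = (15.0, 20.0)


def speedup_range(fast_s, slow_range_s, slowdown=1.0):
    """Return (low, high) speedup of the fast model over the slow model's range.

    slowdown multiplies the fast model's latency, so slowdown=2.0 models
    the "half as fast as claimed" scenario.
    """
    effective_s = fast_s * slowdown
    slow_low_s, slow_high_s = slow_range_s
    return slow_low_s / effective_s, slow_high_s / effective_s


if __name__ == "__main__":
    for task, latency_s in KLEIN_LATENCY_S.items():
        for slowdown in (1.0, 2.0):
            low, high = speedup_range(latency_s, QWEN_LATENCY_RANGE_S, slowdown)
            print(f"{task}, slowdown x{slowdown:g}: {low:.0f}x to {high:.0f}x faster")
```

The point of the discount parameter is that the conclusion is not sensitive to the exact demo numbers: the gap stays above an order of magnitude even under a pessimistic reading.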

The research underneath: Selfflow

The part of Batifol's talk I found most technically substantive was his explanation of why current diffusion models have a structural blind spot—and what BFL is doing about it.

Here's the problem, explained without the jargon: when you train a generative model, you're teaching it to reconstruct images from noise. That process doesn't naturally teach the model what things are or how they relate physically. A glass sitting on a table, a person sitting on a chair: the model learns to render these things without understanding that solids don't pass through other solids. To compensate, researchers have been bolting on external "encoder" models (systems like DINOv2, a self-supervised vision encoder whose features transfer well to tasks such as segmentation and classification) to give generative models borrowed understanding of the world.

It works, but it creates problems. The encoder is a fixed checkpoint; when you scale your generative model, the encoder doesn't scale with it. You need separate encoders for each modality—image, video, audio—which gets messy fast. And the objectives are misaligned: your generative model wants to create, your encoder wants to classify.

BFL's Selfflow paper, published in early 2026 (Batifol placed it "about a month and a half" before the talk; verify the arXiv timestamp for precision), proposes collapsing this into a single model. The approach uses a teacher-student architecture where both models are trained simultaneously on the same data but with different noise levels. The student, working with high-noise inputs, learns to reconstruct while also learning representation (what things actually are) by being pulled toward the teacher's cleaner understanding. No external encoder needed.
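The dual-noise idea can be sketched as a toy loss function. This is an illustration of the description above, not BFL's code or the paper's actual objective. Everything specific here is an assumption of mine: the linear "networks", the linear blend used to add noise, the cosine-based alignment term, the 0.8 and 0.2 noise levels, the loss weighting, and especially the exponential-moving-average teacher update, which is a common choice in self-distillation but which the talk did not confirm.

```python
import math
import random


def add_noise(x, noise_level, rng):
    """Blend a clean sample with Gaussian noise; noise_level is in [0, 1]."""
    return [(1.0 - noise_level) * v + noise_level * rng.gauss(0.0, 1.0) for v in x]


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine_similarity(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm > 0.0 else 0.0


def student_loss(student_enc, student_dec, teacher_enc, x, rng,
                 student_noise=0.8, teacher_noise=0.2, align_weight=0.5):
    """Return (total, reconstruction, alignment) for one clean sample x."""
    # Same data, two noise levels: the student gets the harder input.
    student_hidden = linear(student_enc, add_noise(x, student_noise, rng))
    teacher_hidden = linear(teacher_enc, add_noise(x, teacher_noise, rng))

    # Reconstruction: mean squared error between the student's output and x.
    output = linear(student_dec, student_hidden)
    reconstruction = sum((o - v) ** 2 for o, v in zip(output, x)) / len(x)

    # Alignment: pull student features toward the teacher's cleaner features.
    alignment = 1.0 - cosine_similarity(student_hidden, teacher_hidden)
    return reconstruction + align_weight * alignment, reconstruction, alignment


def ema_update(teacher_w, student_w, decay=0.99):
    """ASSUMPTION: the teacher tracks the student as an exponential moving average."""
    return [[decay * t + (1.0 - decay) * s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(teacher_w, student_w)]
```

A training loop would minimize the total over the student's weights and refresh the teacher after each step. What matters for the argument is visible in `student_loss`: the alignment target comes from inside the same training process, so there is no frozen external checkpoint to fall out of step as the model scales.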

BFL's internal data, presented in the Selfflow paper, shows this approach outperforming the encoder-based baseline on image, audio, and video quality metrics simultaneously, with loss continuing to decrease rather than plateauing. Batifol also cited a 70x convergence speedup attributed to representation alignment generally—that figure comes from BFL's own research and hasn't been independently replicated to my knowledge. It's a significant enough claim that I'd want to see it stress-tested before treating it as established fact.

The most striking demo was the robotic action prediction: the same model trained on images and video also learned to predict physical manipulation tasks. A robot picking up a can—cleaner motion, less artifact, more purposeful behavior on the Selfflow side. This is where Batifol's "visual intelligence" framing becomes legible. BFL isn't just building better image generators. They're building toward models that understand the physical world well enough to act in it.

The part the talk didn't cover

Here's where I have to be direct about what Batifol's talk is and isn't.

It's a company presentation at a developer conference. It's a good one, technically honest and not particularly oversold. But it's not a complete picture of what these capabilities mean in the wild.

Take the character consistency features. Batifol demonstrates Kontext taking a single input image of a woman, removing a snowflake from her face, transporting her to Freiburg, changing her background to snow, adding snow back to her face. Same person, multiple scenes, coherent identity across edits. His use case is storyboarding and animation pipelines. That's legitimate. It's also, without any modification to the workflow, a functional deepfake production pipeline. The gap between "useful creative tool" and "non-consensual synthetic media" is zero steps here—it's the same capability, applied with different intent.

I'm not saying BFL built a deepfake machine. I'm saying that when a model achieves photorealistic character consistency at this quality level, the consent question isn't hypothetical anymore. Whose face can you input? What are the terms of the open-source license on that point? What do BFL's enterprise contracts with Adobe and Canva say about user-uploaded images?

That enterprise list is its own open question. Microsoft, Adobe, and Canva are not small customers; they collectively touch billions of users' photos. "Open source" at the model level and "enterprise contract" at the deployment level are not the same thing. The training data that produced FLUX's photorealistic outputs came from somewhere. What provenance does it have? Batifol didn't say, and the conference format didn't invite that question. I'm flagging it here because it's the question that determines whether "open research" means what it implies.

BFL's operating principle, as Batifol states it plainly: "We want to raise the bar on quality with every release we do." That's not a data governance policy. It's not wrong—quality is a legitimate goal—but quality and consent aren't the same axis, and the industry has a long track record of optimizing hard on one while treating the other as a future problem.

What to actually watch for

If you're a developer or creative professional evaluating FLUX for production use, the technical story is compelling and the open-source access is real—the models are on Hugging Face, the Selfflow paper is public. Klein's latency numbers, even discounted for demo conditions, point to interactive editing workflows that weren't feasible six months ago.

If you're a privacy researcher or policy person, the Selfflow architecture is the thing to understand. A single model that learns coherent representation across image, video, audio, and action—without external encoders that could be separately audited or constrained—is a fundamentally different governance challenge than the Frankenstein multi-encoder setups that came before. It's more capable and, structurally, more opaque to the kind of component-level analysis that makes AI auditing tractable.

If you're a person whose face has ever been uploaded to any platform that licenses content to AI companies, the character consistency demo is worth sitting with. Not because BFL is acting in bad faith, but because capability tends to outrun consent infrastructure, and this capability is now available, open, and fast.

The question I'd put to BFL's next public talk: what does "open research" mean for the people who are the subjects of the training data, not just the developers who use the outputs?


Rachel "Rach" Kovacs is Buzzrag's cybersecurity and privacy correspondent.
