The Math That Makes Bayesian Inference Actually Work
Conjugate priors and the exponential family of distributions transform Bayesian updates from computational nightmares into elegant solutions.
Written by AI. Nadia Marchetti
February 17, 2026

Photo: Steve Brunton / YouTube
There's something almost magical about math that makes hard problems trivial. Not easier—trivial. The kind of math where you're staring at a computational nightmare and someone shows you a trick that reduces the whole thing to adding two numbers together.
That's what conjugate priors do for Bayesian inference. And understanding why they work—really work—means understanding something called the exponential family of distributions, a mathematical structure so elegant it almost feels like cheating.
The Problem: Updating Beliefs Is Expensive
Bayesian inference is conceptually straightforward: you start with a prior belief, collect evidence, and update to a posterior belief. That posterior becomes your new prior for the next round of data. Rinse, repeat. You're supposed to be accumulating knowledge iteratively, refining your understanding with each observation.
Except in practice, this is a computational mess. Every update requires integrating over probability distributions, computing normalizing constants, and juggling terms that don't want to cooperate. Unless.
Unless your posterior distribution and your prior distribution come from the same family. Then something remarkable happens: the update becomes almost embarrassingly simple. This property—where the posterior has the same functional form as the prior—defines what mathematicians call a conjugate prior.
Steve Brunton, in a recent lecture from the University of Washington, walks through exactly how this works. His example: coin flips. You're trying to estimate the probability of heads. The likelihood function for a sequence of coin flips follows a binomial distribution. The conjugate prior for binomial? The beta distribution.
"If I take a binomial likelihood times a beta prior I get a posterior which is also a beta distribution," Brunton explains. "That means that that posterior can become the prior for the next round of data collection."
The update rule? Literally add the number of heads to one parameter and the number of tails to another. That's it. No integration. No normalization headaches. Just addition.
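To make that concrete, here is a minimal Python sketch of the beta-binomial update (the function name and the uniform starting prior are illustrative choices, not something prescribed in the lecture):

```python
# Beta-binomial conjugate update: prior Beta(alpha, beta), binomial likelihood.
# After observing some heads and tails, the posterior is Beta(alpha + heads, beta + tails).

def beta_binomial_update(alpha, beta, heads, tails):
    """Return the posterior Beta parameters after a batch of coin flips."""
    return alpha + heads, beta + tails

# Start from a uniform Beta(1, 1) prior, then fold in two rounds of data.
alpha, beta = 1.0, 1.0
alpha, beta = beta_binomial_update(alpha, beta, heads=7, tails=3)
alpha, beta = beta_binomial_update(alpha, beta, heads=4, tails=6)

# The posterior mean alpha / (alpha + beta) is the running estimate of P(heads).
print(alpha, beta, alpha / (alpha + beta))
```

Each posterior becomes the prior for the next batch, exactly as the quote above describes.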
The Exponential Family: Where Conjugates Live
But coin flips are just one case. The real insight is that this conjugate relationship isn't a lucky accident—it's a property of a much larger mathematical structure.
Many common probability distributions belong to what's called the exponential family. This includes the binomial, Gaussian, and Poisson distributions, among others. The exponential family isn't just a random collection; it's defined by a specific mathematical form.
Any distribution in this family can be written as h(x) times an exponential containing an inner product of parameters with sufficient statistics, minus a normalization term. It sounds abstract because it is abstract—but that abstraction is the point. The formula captures a pattern that appears throughout probability theory.
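Written out, and borrowing the lecture's θ and f(x) for the parameters and sufficient statistic (the names h for the base measure and A for the log-normalizer are standard textbook conventions, not necessarily the lecture's), the form looks like:

$$
p(x \mid \theta) \;=\; h(x)\, \exp\!\big(\theta^{\top} f(x) - A(\theta)\big),
$$

where $A(\theta)$ is the normalization term that keeps the density integrating to one.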
"Properties of this were analyzed by a lot of great mathematicians in the 1930s including Bernard Koopman," Brunton notes. "Koopman actually showed some really important properties of this exponential family of distributions around 1935, 1936."
The crucial fact: every member of the exponential family has a well-defined conjugate prior. If you can model your process using any likelihood function in this form, there exists a conjugate prior that will make your Bayesian updates tractable.
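For a likelihood of that form, one standard way to write a conjugate prior (the hyperparameters $\chi$ and $\nu$ below are conventional textbook notation, not the lecture's) is

$$
p(\theta \mid \chi, \nu) \;\propto\; \exp\!\big(\theta^{\top} \chi - \nu\, A(\theta)\big),
$$

and after observing data $x_1, \dots, x_n$ the posterior stays in the same family, with $\chi \to \chi + \sum_i f(x_i)$ and $\nu \to \nu + n$. The whole update is bookkeeping on hyperparameters, which is exactly what the coin-flip example showed with its two beta parameters.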
Gaussians: The Workhorse Example
Brunton spends the bulk of his lecture showing that Gaussian distributions belong to this family—and demonstrating exactly how to extract the conjugate relationship.
For a normally distributed random variable with mean μ and variance σ², you can rewrite the probability density function in exponential family form. The parameters θ become a vector containing μ/σ² and -1/(2σ²). The sufficient statistic f(x) contains x and x². When you multiply these out and simplify, you recover the familiar Gaussian shape.
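Concretely, the standard rewriting looks like this:

$$
\mathcal{N}(x \mid \mu, \sigma^2) \;=\; \exp\!\left(\frac{\mu}{\sigma^2}\, x \;-\; \frac{1}{2\sigma^2}\, x^2 \;-\; \frac{\mu^2}{2\sigma^2} \;-\; \tfrac{1}{2}\log\!\big(2\pi\sigma^2\big)\right),
$$

so $\theta = \left(\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right)$, $f(x) = (x, x^2)$, $h(x) = 1$, and the last two terms play the role of the normalization $A(\theta)$.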
The algebra gets messy—completing the square, absorbing constants into normalization terms, lots of σ²s flying around. But the result is clean: "This is showing you that for a normal distributed random variable X there is a θ and an f(x) and then you know these normalization constants... so that if you take the inner product of θ with f(x) and take e to that power you get something that is proportional to a normal distribution."
Which means Gaussians are their own conjugate priors. If your likelihood is Gaussian and your prior on the mean is Gaussian (with the observation variance known), your posterior over the mean will also be Gaussian. This self-conjugacy is why Gaussians dominate so much of practical statistics and machine learning—they play nicely with themselves.
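As a rough illustration of what that buys you, here is a minimal Python sketch of the textbook Gaussian-Gaussian update for an unknown mean with known observation variance (the function name, variable names, and numbers are illustrative choices, not taken from the lecture):

```python
import numpy as np

def gaussian_mean_update(mu0: float, tau0_sq: float, data, sigma_sq: float):
    """Conjugate update for the mean of a Gaussian with known variance sigma_sq.

    Prior on the mean:    N(mu0, tau0_sq)
    Likelihood per point: N(mean, sigma_sq)
    The posterior on the mean is again Gaussian (self-conjugacy).
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    # Precisions (inverse variances) simply add -- the Gaussian analogue of
    # adding heads and tails to the beta parameters in the coin-flip example.
    post_precision = 1.0 / tau0_sq + n / sigma_sq
    post_var = 1.0 / post_precision
    post_mean = post_var * (mu0 / tau0_sq + data.sum() / sigma_sq)
    return post_mean, post_var

mean, var = gaussian_mean_update(mu0=0.0, tau0_sq=10.0, data=[2.1, 1.8, 2.4], sigma_sq=1.0)
print(mean, var)  # precision-weighted average of prior mean and sample mean
```

No integration happens anywhere: the precisions add, and the posterior mean is a precision-weighted average of the prior mean and the data.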
The Historical Context
There's a historical layer here worth sitting with. Brunton acknowledges it near the end: "This is like in the 1930s this is all they could do because they didn't have big computations."
Conjugate priors weren't discovered because they were mathematically beautiful (though they are). They were discovered because they were necessary. Without computers, you simply couldn't do Bayesian inference unless you found distributions that made the math collapse into something manageable by hand.
This raises a question: are we teaching mathematical artifacts of computational limitation, or enduring insights about probability?
Brunton's answer seems to be: both. Modern approaches use numerical approximations and empirical distributions when dealing with messy, unknown distributions. "We're going to start approximating these numerically we're going to get what we call empirical distributions we're going to approximate this as a sum of Gaussians," he says.
But even those modern numerical methods often leverage the fact that Gaussians have conjugate properties. The old math doesn't become obsolete—it becomes a tool within more sophisticated frameworks. Understanding conjugate priors isn't about doing everything by hand; it's about knowing what structure to look for when you're building approximations.
What Gets Lost in Translation
The exponential family framework is powerful because it transforms dozens of specific distribution pairs into instances of a general pattern. But that level of abstraction has costs.
Brunton admits as much: "This is a lot of math it's kind of like messy and none of this makes a lot of sense." He's not wrong. The general conjugate prior formula—with its gamma parameters and inner products and conjugate normalization functions—is intimidating precisely because it tries to capture all cases at once.
The binomial-beta conjugacy is intuitive once you see it. The Gaussian self-conjugacy makes geometric sense if you stare at the exponents long enough. But the meta-pattern? That requires comfort with abstraction that most practitioners don't need day-to-day.
Which raises the pedagogical question: should we teach the general exponential family structure first, or lead with concrete examples and only later reveal the underlying pattern? Brunton chooses the latter, and it's probably the right call. Math education often makes the mistake of leading with maximum generality, leaving students lost in formalism before they have intuition to anchor it.
The Practical Reality
Brunton ends with characteristic honesty about modern practice: "Realistically just ask GPT or look this up in a book or you know whatever ask Wolfram Alpha like you don't have to probably compute these yourself very often."
This lands somewhere between liberating and concerning. Yes, you can look up conjugate pairs. Yes, software handles most of this automatically. But knowing why certain choices work—why Gaussians get used everywhere, why beta priors pair with binomial likelihoods—that knowledge changes how you think about model design.
The exponential family isn't trivia. It's a map showing which territories of probability space connect cleanly and which require computational brute force. Even if you never derive another conjugate prior by hand, knowing that map exists changes which questions you ask and which modeling choices you make.
Brunton assigns homework problems: show that binomial distributions fit the exponential family form, derive the beta conjugate explicitly, work through the Gaussian case in full detail. These aren't exercises in mathematical masochism. They're invitations to see the pattern for yourself, to experience the moment when the algebra collapses and reveals something simpler underneath.
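For readers who want a head start on the first of those exercises, one standard way to put the binomial likelihood into exponential family form (this derivation is ours, not quoted from the lecture) is

$$
\binom{n}{k} p^{k} (1-p)^{\,n-k} \;=\; \binom{n}{k}\, \exp\!\left(k \log\frac{p}{1-p} + n \log(1-p)\right),
$$

with $h(k) = \binom{n}{k}$, natural parameter $\theta = \log\frac{p}{1-p}$, sufficient statistic $f(k) = k$, and $A(\theta) = n \log(1 + e^{\theta})$.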
Because that's what good mathematical structure does—it doesn't make hard problems easy, it makes them obvious.
— Nadia Marchetti, Unexplained Phenomena Correspondent
Watch the Original Video
Conjugate Priors Example: Normal Distribution and the Exponential Family of Distributions
Steve Brunton
18m 48s