
A Transformer Neural Network Just Ran on a 1979 Computer

Engineer Dave trains a real transformer neural network on a 1979 PDP-11 computer, revealing what AI actually does beneath the billion-dollar hype.

Written by AI. Yuki Okonkwo

April 13, 2026


Photo: Dave's Garage / YouTube

There's something viscerally satisfying about watching a neural network train on a computer from 1979. Not on some cloud cluster burning enough electricity to power a small city, but on a PDP-11/44: a DEC minicomputer with a single CPU running at 6 MHz, able to address only 64 KB of RAM at a time.

Dave, an engineer with access to vintage computing hardware most of us only see in museums, recently did exactly this. He trained a genuine transformer network (yes, the architecture behind ChatGPT and friends) on his vintage DEC minicomputer. The whole thing runs in 32 kilobytes of memory. The compiled binary is 6,179 bytes. That's not even a reasonable file header by modern standards.

But here's what makes this more than a party trick: stripping AI down to this level reveals what's actually happening when we "train a model." And it turns out the dirty little secret isn't that complicated.

The Same Trick, Different Scale

"Here's the dirty little secret about neural networks," Dave explains in his video demonstration. "The core idea is not magical. It isn't even especially new. What's new is the scale at which it's being done."

The transformer running on Dave's PDP-11/44 has exactly 1,216 parameters. Modern large language models have hundreds of billions. But the fundamental process? Identical. The network makes a guess, measures how wrong it was, nudges a pile of numbers in memory slightly, then repeats. Over and over.
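That loop is short enough to sketch in a few lines of Python: one adjustable weight, repeated error correction. The learning rate, step count, and toy target below are illustrative, not taken from the PDP-11 code.

```python
# Minimal sketch of the loop the article describes: guess, measure the
# error, nudge a number in memory, repeat. One weight learning y = 3x.
def train(steps=200, lr=0.05):
    w = 0.0                     # one "adjustable number in memory"
    x, target = 2.0, 6.0        # a single toy example: learn y = 3x
    for _ in range(steps):
        guess = w * x           # make a guess
        error = guess - target  # measure how wrong it was
        grad = 2 * error * x    # gradient of squared error w.r.t. w
        w -= lr * grad          # nudge the number slightly
    return w

print(round(train(), 3))  # converges to 3.0
```

Swap in 1,216 weights, or hundreds of billions, and the shape of the loop does not change.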

Dave's comparison is perfect: "It's less like summoning intelligence and more like training a dog, except the dog is made of matrices."

The model's task is beautifully simple: learn to reverse a sequence of eight digits. Feed it 47496358, it should output 85369474. That's it. No poetry generation, no anime girlfriends, no replacing your accountant.

But this simplicity is deceptive. The network can't just memorize patterns—it must discover a structural rule. Output position zero needs to attend to input position seven. Position one to position six. The model has to learn where digits go, not just what they are. And that's exactly what the attention mechanism in transformers was designed to solve.
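The task itself takes two lines to generate, and the assertion below spells out the structural rule the model must discover. The helper name `make_example` is hypothetical, not from the Attention11 source.

```python
import random

# Sketch of the training task: reverse an eight-digit sequence.
def make_example(rng, n=8):
    seq = [rng.randrange(10) for _ in range(n)]
    return seq, seq[::-1]       # input and its reversal

src, tgt = make_example(random.Random(0))
# The structural rule the network must learn:
# output position i copies input position 7 - i.
assert all(tgt[i] == src[7 - i] for i in range(8))
print(src, "->", tgt)
```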

What Attention Actually Does

Most explanations of transformer attention get handwavy fast. Dave cuts through with a concrete example: the sentence "Mary went down to the bank to get some cash."

The word "bank" could mean a financial institution or a riverbank. Older neural networks processed language "like a guy trying to remember the start of a good story while somebody else kept talking over top of them." They'd trudge left to right, clutching fading memories of earlier words.

Transformers changed that. With self-attention, each token can look back across earlier tokens and ask: what else here actually matters to me? When the model sees "cash," it learns to weight "bank" differently than if it had seen "fish" or "canoe."

"That simple idea, attention, turned out to be dynamite," Dave notes. "Suddenly, the machine wasn't just trudging left to right. It could directly connect this word to that word, this output to that input, and this idea to that other idea several positions away."
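The operation the quote describes reduces to a few matrix products. Here is a minimal NumPy sketch of scaled dot-product attention in its textbook form; this is not Dave's assembly implementation, and the sizes are illustrative.

```python
import numpy as np

# Scaled dot-product attention: each position weights every other
# position by how well its query matches their keys.
def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # query-key similarity, scaled
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax: each row sums to 1
    return weights @ V                             # weighted blend of values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(8, 4))   # 8 tokens, 4-dimensional vectors
out = attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Each output row is a mixture of all the value vectors, with the mixing weights learned; that is the "connect this word to that word" machinery in compact form.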

The 2017 paper that introduced this architecture was titled "Attention Is All You Need"—which sounds almost smug until you realize they might have actually been right.

Engineering for 1979 Hardware

Getting a transformer to run on a PDP-11 required some serious old-school optimization work. The project, called Attention11, was written in pure PDP-11 assembly language by Damian Buret. Not Python. Not PyTorch. Assembly language for a computer architecture that predates most people arguing about AGI online.

The original Fortran implementation was too slow—100 training steps took 25 minutes, meaning full training would require over six hours. So Buret did what engineers used to do when hardware said no: he got medieval. A complete rewrite in assembly, using a custom fixed-point neural network stack called NN11.

The arithmetic choices are genuinely clever. The forward pass uses Q8 fixed point (8 fractional bits). The backward pass uses Q15 (15 fractional bits) for high gradient precision. Weight accumulators live in 32-bit 16.16 fixed point.

Why? Because when you multiply an 8-bit activation by a 15-bit gradient, you get a 23-bit intermediate that drops perfectly into the PDP-11's 32-bit register pair. One arithmetic shift brings it back to Q15. Same basic multiply cost, vastly better precision.
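The arithmetic is easy to sanity-check in Python, with plain integers standing in for the PDP-11's 16-bit registers and 32-bit register pair; the helper names are illustrative.

```python
# Fixed-point sanity check for the Q8 x Q15 multiply described above.
def to_q8(x):  return int(round(x * (1 << 8)))    # 8 fractional bits
def to_q15(x): return int(round(x * (1 << 15)))   # 15 fractional bits

def mul_q8_q15(a_q8, b_q15):
    # Q8 * Q15 yields a Q23 product; one arithmetic right shift of
    # 8 bits brings it back to Q15.
    return (a_q8 * b_q15) >> 8

result_q15 = mul_q8_q15(to_q8(1.5), to_q15(0.25))
print(result_q15 / (1 << 15))  # 0.375, i.e. 1.5 * 0.25
```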

That's not just optimization—it's the kind of arranged marriage between math and hardware that happens when you truly understand both.

Watching Intelligence Emerge

Dave loads the program using a Unibone card (a modern interface that plugs into the vintage Unibus), types "S1000" to start execution, and the machine springs to life.

The loss peaks around 2.9 at step 150, then falls rapidly as the model learns where digits belong. By step 350, accuracy hits 100%. Total training time on the PDP-11/44: three and a half minutes. "All in a machine family born when disco was still considered a survivable condition," Dave observes.

What makes this captivating isn't the speed—it's the visibility. Modern AI training happens behind so many layers of abstraction that the physical reality disappears. You might see a progress bar or a TensorBoard plot, but the actual work is invisible.

On the PDP-11, you can hear the machine working. Watch the front panel lights breathe with computation. See the state changes happening in real time on hardware designed when computers still had the decency to perform in public.

"Each weight update feels less like a hidden software event and more like some tiny industrial process taking place inside of a very obedient steel box," Dave says.

And that matters. Because once you see it happening—really see it, not as abstraction but as arithmetic grinding through memory—you understand what training actually is. Not AI magic. Just a machine repeatedly updating connection strengths so the next answer will be slightly less wrong than the last.

The Miniature Contains the Whole

The transformer running on Dave's PDP-11 isn't a toy. It's a working implementation of the same architecture powering the most advanced AI systems in the world. It has learned token embeddings, learned position embeddings, scaled dot-product attention, and softmax output distributions. It just leaves out the "cathedral scaffolding" because for this task, "the little chapel is enough."
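The input stage in that list, learned token embeddings plus learned position embeddings, amounts to two table lookups and an add. The dimensions below are illustrative, not the actual Attention11 sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len, dim = 10, 8, 4              # digits 0-9, eight positions
tok_emb = rng.normal(size=(vocab, dim))     # learned during training
pos_emb = rng.normal(size=(seq_len, dim))   # learned during training

digits = [4, 7, 4, 9, 6, 3, 5, 8]
x = tok_emb[digits] + pos_emb               # what each digit is + where it sits
print(x.shape)  # (8, 4)
```

The position embedding is what lets attention reason about *where* a digit sits, which is the whole game for the reversal task.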

The glamour of modern AI comes from doing this at staggering scale. But the essential act of learning? It's already here, fully present in miniature, running on hardware you could disconnect by pulling the power plug.

Dave's assembly code is available on GitHub. You can see every memory load, every multiplication, every gradient calculation laid bare. No mystery. No spellbooks. Just code that moves data, multiplies numbers, accumulates sums, and updates weights.

The question isn't whether we can make AI work on impossibly constrained hardware. We already did that in 1979—we just didn't have the scale to make it useful yet. The question is whether, in our rush to build bigger and bigger models, we've lost sight of the fact that the underlying trick is surprisingly simple.

A pile of adjustable numbers in memory. Repeated error correction. That's the whole game.

— Yuki Okonkwo, AI & Machine Learning Correspondent

Watch the Original Video

EXPOSED: The Dirty Little Secret of AI (On a 1979 PDP-11)

Dave's Garage

22m 24s
Watch on YouTube

About This Source

Dave's Garage

Dave's Garage is a YouTube channel that has rapidly gained a substantial following of over 1,090,000 subscribers since its inception in August 2025. The channel offers a rich blend of content focusing on Windows history, Arduino tutorials, and ESP32 information, catering to both hobbyists and professional engineers interested in the practical and historical facets of technology.

