10 CS Papers That Built Modern Computing and AI

Nobody sat down in 1936 and said, "I'm going to invent the computer today." Alan Turing was trying to answer a dry philosophical question about math. Claude Shannon was trying to measure surprise. Frank Rosenblatt was borrowing ideas from neuroscience. The trillion-dollar AI industry you're watching unfold right now is, at its core, a pile of accidental discoveries stacked on top of each other — each one solving a narrow problem that happened to unlock the next door.

Fireship's recent video walks through ten computer science papers spanning 1936 to 2020, making the case that this specific chain of work is what made modern AI possible. It's a tight argument, and it mostly holds up. Here's what those papers actually said, why they mattered, and where the story gets complicated.

The foundation: defining what "computing" even means

Before you can build anything, you need to know what can be built. Turing's 1936 paper, On Computable Numbers, with an Application to the Entscheidungsproblem, was answering David Hilbert's challenge: is there a general mechanical procedure that can decide, for any statement in formal logic, whether it is provable? Turing's answer was no, and to prove it, he had to define what an algorithm was in the first place.

His solution was the Turing Machine: a hypothetical device with an infinite tape, a read-write head, and a rulebook. Abstract, yes. But also the conceptual blueprint for every physical computer built since. The halting problem — can you write a program that determines whether any other program will finish or loop forever? — is provably unsolvable, which means there are hard limits baked into computation itself. That's a useful thing to know before you spend the next century building on top of the concept.
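To make the tape, head, and rulebook concrete, here is a minimal sketch of a simulator. The names (run_turing_machine, INCREMENT) and the rule format are my own illustration, not anything from Turing's paper, and the example machine simply adds one to a binary number. Note the step budget: because of the halting problem, the simulator cannot decide in general whether a machine will ever stop, so it gives up after max_steps.

```python
from collections import defaultdict

def run_turing_machine(rules, tape, state="start", blank="_", max_steps=1000):
    """Simulate a one-tape Turing machine (illustrative sketch).

    rules maps (state, symbol) -> (new_symbol, move, new_state),
    where move is -1 (left) or +1 (right). The machine stops when it
    reaches the state "halt" or when no rule applies.
    """
    cells = defaultdict(lambda: blank, enumerate(tape))  # unbounded tape
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            # We cannot know whether it would ever halt, so we give up.
            raise RuntimeError("step budget exhausted; machine may never halt")
        key = (state, cells[head])
        if key not in rules:
            break
        new_symbol, move, state = rules[key]
        cells[head] = new_symbol
        head += move
        steps += 1
    if not cells:
        return ""
    lo, hi = min(cells), max(cells)
    return "".join(cells[i] for i in range(lo, hi + 1)).strip(blank)

# Hypothetical rulebook: walk right to the end of the number,
# then add 1 while carrying leftward.
INCREMENT = {
    ("start", "0"): ("0", +1, "start"),
    ("start", "1"): ("1", +1, "start"),
    ("start", "_"): ("_", -1, "carry"),
    ("carry", "1"): ("0", -1, "carry"),
    ("carry", "0"): ("1", -1, "halt"),
    ("carry", "_"): ("1", -1, "halt"),
}

print(run_turing_machine(INCREMENT, "1011"))  # 1100
print(run_turing_machine(INCREMENT, "111"))   # 1000
```

A rulebook that only ever moves right over blanks would hit the step budget and raise, which is the practical face of the limit Turing proved.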

Twelve years later, Claude Shannon's A Mathematical Theory of Communication (1948) did something equally radical. He stripped meaning out of information entirely. As the video puts it: "'I love you' and 'the cat is on fire' carry the same information if they're equally surprising." He measured that surprise in bits, proved all information could be reduced to ones and zeros, and borrowed the concept of entropy from thermodynamics to quantify uncertainty.
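The "surprise" idea is easy to sketch in code. The two functions below are my own naming, not Shannon's notation: the first gives the information in one outcome of probability p, and the second averages that over a whole distribution, which is the entropy.

```python
import math

def surprisal_bits(p):
    """Information, in bits, carried by an outcome with probability p."""
    return -math.log2(p)

def entropy_bits(probs):
    """Average surprisal of a distribution, in bits (Shannon entropy)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Two messages with the same probability carry the same number of bits,
# whatever they happen to say.
print(surprisal_bits(1 / 8))        # 3.0
print(entropy_bits([0.5, 0.5]))     # 1.0  (a fair coin)
print(entropy_bits([0.9, 0.1]))     # about 0.47 (a predictable coin)
print(entropy_bits([1.0]))          # zero: a certain outcome carries no surprise
```

The less predictable the source, the more bits each message carries on average, and nothing in the calculation looks at what the messages mean.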

Here's the part that still gives me pause: Shannon wasn't building AI. He was solving a communications engineering problem. But in doing so, he wrote what Fireship calls "the spiritual ancestor to the loss function," the mathematical mechanism that tells every neural network today how wrong it is and nudges it toward being less wrong. Anthropic's model name, Claude, is widely taken to be a nod to Shannon. If so, it's more than a fun Easter egg: it's an acknowledgment of where this lineage starts.

The hype cycle that never goes away

In 1958, a Cornell psychologist named Frank Rosenblatt built the perceptron — the first machine that could actually learn from examples. It took inputs, weighted them, and adjusted those weights when it made mistakes. The Navy funded it. The New York Times declared the computer would soon be conscious. The hype was, in the video's words, "immediate and unhinged."

Eleven years later, Marvin Minsky and Seymour Papert at MIT published Perceptrons and methodically dismantled the excitement. Using straightforward math, they showed a single-layer perceptron couldn't even learn XOR — a trivially simple logical operation. Funding dried up. The first AI winter set in.
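The XOR limitation is easy to see for yourself. Below is a minimal sketch of a single-layer perceptron using the classic error-driven weight update; the function name, learning rate, and epoch cap are my own choices for illustration. It learns AND, which a single straight line can separate, and never settles on XOR, which no single line can.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Train a two-input perceptron; return (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            prediction = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - prediction
            if delta != 0:
                # Nudge the weights toward the correct answer.
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:
            return w, b, True   # a full pass with no mistakes
    return w, b, False          # never got through a clean pass

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(train_perceptron(AND)[2])  # True
print(train_perceptron(XOR)[2])  # False
```

Raising the epoch cap does not help with XOR: no setting of two weights and a bias classifies all four cases, so a clean pass never happens.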

What makes this episode historically interesting isn't just the crash; it's the buried footnote. Minsky and Papert actually noted that stacking layers of perceptrons would solve the problem. They just couldn't figure out how to train a stack. That answer wouldn't arrive for another seventeen years, and one of the people behind it was Geoffrey Hinton, whose name keeps appearing throughout the rest of this story.

The perceptron hype cycle and its collapse also rhyme uncomfortably with where we are now. One useful lens: every AI winter so far has been preceded by claims that outpaced what the underlying systems could actually deliver. That pattern isn't an argument against the current wave of progress (the capabilities are genuinely different now), but it's worth holding onto.

The infrastructure nobody talks about

Here's a paper that doesn't get nearly enough credit in pop-CS history: Leslie Lamport's 1978 Time, Clocks, and the Ordering of Events in a Distributed System.

The problem: separate computers with no shared clock can't agree on when things happened. For a distributed system trying to coordinate, that's catastrophic. Lamport's fix was elegant — stop using wall-clock time entirely and order events by causality instead. If event A could have caused event B, A comes first. He called these logical clocks.
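A logical clock is small enough to sketch in a few lines. The class and method names below are my own, but the two rules are the ones from Lamport's paper: bump the counter on every local event, and on receiving a message, jump past both your own counter and the timestamp the message carries.

```python
class LamportClock:
    """A per-process logical clock (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Record a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned value travels with the message."""
        return self.tick()

    def receive(self, message_time):
        """Move past both our own clock and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time

# Two machines with no shared wall clock.
a, b = LamportClock(), LamportClock()
a.tick()                 # a does some local work -> 1
stamp = a.send()         # a sends a message      -> 2
print(b.receive(stamp))  # 3: b's receive is ordered after a's send
```

One caveat worth keeping in mind: the guarantee runs in one direction. If A could have caused B, A gets the smaller timestamp, but a smaller timestamp alone does not prove that one event caused the other.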

Fireship makes a point here that's easy to miss: "You need thousands of GPUs that constantly stay in sync and agree on state without dissolving into chaos." Without Lamport's work, the massive parallel training runs behind GPT-3, Gemini, and every other frontier model become coordination nightmares. The paper predates modern AI infrastructure by decades, but it's quietly load-bearing.

The moment deep learning actually worked

In 1986, Rumelhart, Hinton, and Williams published the backpropagation paper, finally answering how to train a stack of layers. The mechanism: run data forward through the network, measure how wrong the output is, then push that error backward through every layer using calculus's chain rule, nudging each weight toward "slightly less wrong." Repeat millions of times. The network teaches itself.
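Here is a minimal sketch of that forward-then-backward loop on a tiny network with one hidden layer, trained on XOR. Everything specific is my assumption for illustration, not the paper's setup: sigmoid units, a squared-error loss, three hidden units, the learning rate, and the step count. Whether and how fast it settles depends on the random starting weights, so the script prints the loss before and after rather than promising a number.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Run one input forward; return hidden activations and the output."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

def loss(params, data):
    """Squared error summed over the dataset: how wrong the network is."""
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in data)

def backprop(params, data):
    """Gradient of the loss for every weight, via the chain rule."""
    W1, b1, W2, b2 = params
    gW1 = [[0.0] * len(row) for row in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [0.0] * len(W2)
    gb2 = 0.0
    for x, t in data:
        h, y = forward(params, x)
        d_out = (y - t) * y * (1.0 - y)      # error at the output unit
        gb2 += d_out
        for j, hj in enumerate(h):
            gW2[j] += d_out * hj
            # Push the error back through W2[j] and the hidden sigmoid.
            d_hid = d_out * W2[j] * hj * (1.0 - hj)
            gb1[j] += d_hid
            for i, xi in enumerate(x):
                gW1[j][i] += d_hid * xi
    return gW1, gb1, gW2, gb2

def train(params, data, lr=0.5, steps=5000):
    """Plain gradient descent: nudge every weight against its gradient."""
    for _ in range(steps):
        gW1, gb1, gW2, gb2 = backprop(params, data)
        W1, b1, W2, b2 = params
        W1 = [[w - lr * g for w, g in zip(row, grow)]
              for row, grow in zip(W1, gW1)]
        b1 = [b - lr * g for b, g in zip(b1, gb1)]
        W2 = [w - lr * g for w, g in zip(W2, gW2)]
        params = (W1, b1, W2, b2 - lr * gb2)
    return params

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
       ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
rng = random.Random(0)
HIDDEN = 3  # assumed size; two units can work but get stuck more often
init = ([[rng.uniform(-1, 1) for _ in range(2)] for _ in range(HIDDEN)],
        [0.0] * HIDDEN,
        [rng.uniform(-1, 1) for _ in range(HIDDEN)],
        0.0)
trained = train(init, XOR)
print("loss before:", loss(init, XOR))
print("loss after: ", loss(trained, XOR))
```

The only piece the 1958 perceptron lacked is the d_hid line: it tells each hidden unit how much of the output's error was its fault.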

The surprising discovery was emergence. The hidden middle layers started inventing their own features — edges, textures, shapes — that nobody explicitly programmed. XOR, the problem that killed the first wave of AI hype, became trivial.

But backpropagation still needed two things it didn't have in 1986: enough data and enough compute. Those arrived together, in a roundabout way, through a 1998 paper about a search engine.

Brin and Page's PageRank paper — The Anatomy of a Large-Scale Hypertextual Web Search Engine — is usually discussed as a business story. The ranking algorithm that became Google, built in a dorm room, etc. But there's a second-order effect that matters more for AI: PageRank helped organize and index the largest pile of human-generated text ever assembled. That indexed web became the training feedstock for everything that followed.
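For a feel of what the ranking algorithm itself does, here is a small sketch of the idea: a page is important if important pages link to it, computed by passing rank along links over and over until the numbers settle. This is the normalized variant common in textbooks, where ranks sum to 1, rather than the exact formula in the paper; the function name, the toy link graph, and the handling of pages with no outgoing links are my assumptions. The 0.85 damping factor is the value the paper suggests.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages in a link graph (illustrative sketch).

    links maps each page to the list of pages it links to.
    Every linked-to page is assumed to appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: three pages link to "c".
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))  # c
print(min(ranks, key=ranks.get))  # d (nothing links to it)
```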

The 2012 AlexNet paper is where these threads converge into a detonation. Alex Krizhevsky, Ilya Sutskever, and Hinton trained a deep convolutional neural network on the ImageNet dataset — millions of labeled images — using consumer-grade Nvidia gaming GPUs. They entered the annual ImageNet competition and, as the video describes, "dropped the error rate by 10 points in a single year." Everyone else was fighting over fractions of a percent. AlexNet didn't just win. It made the previous approach look obsolete overnight.

The architecture that ate everything

In 2017, Vaswani et al. at Google published Attention Is All You Need, introducing the transformer. The problem it solved: previous language models read tokens sequentially and kept losing context. A sentence that starts one way and ends another would have the model forgetting what it started with.

The transformer's fix: let every token attend to every other token simultaneously. Every word in a sequence can look at every other word and decide what's relevant. This made models dramatically more coherent and — critically — it scaled better as models got bigger. Google published the architecture openly, and as Fireship notes, "every AI lab uses it" now. That's where the T in ChatGPT comes from. Make of that business decision what you will.
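The "every token looks at every other token" step can be sketched in plain Python. This is scaled dot-product attention, the formula at the core of the paper, stripped down: a real transformer first passes tokens through learned projection matrices and runs several attention heads in parallel, and none of that is shown here. The function names and toy vectors are mine.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights). weights[i][j] is how much token i
    attends to token j; outputs[i] is the matching blend of values.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        blended = [sum(w * v[c] for w, v in zip(weights, values))
                   for c in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: three tokens, each a 2-dimensional vector,
# used as queries, keys, and values at once.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights covers the whole sequence at once, which is why nothing gets forgotten along the way. It is also why the cost grows with the square of the sequence length.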

Then in 2020, OpenAI published Language Models are Few-Shot Learners, documenting GPT-3. The bet was explicit: intelligence isn't a missing algorithm, it emerges from scale. Take the transformer, scale it to 175 billion parameters, feed it the internet, and see what happens. What happened was a model that could translate, summarize, and write code without being specifically trained for any of those tasks. Two years later, a fine-tuned successor from the same model family became the foundation for ChatGPT.

What's strange about this whole history is how much of it was accidental. Turing proved a negative and got the computer. Shannon stripped meaning from language and got the mathematical bones of AI. Brin and Page built a search algorithm and accidentally assembled the world's largest training dataset. Each of these people was solving a specific, bounded problem — and the aggregate effect is a technology that none of them intended or predicted.

That's either comforting or unsettling, depending on your temperament. The systems we're building now are intentional in ways these earlier breakthroughs weren't. Whether intentionality makes the outcomes more predictable, or just more concentrated, is the live question.

— Yuki Okonkwo, AI & Machine Learning Correspondent, Buzzrag