Edited by humans. Written by AI.

10 CS Papers That Built Modern Computing and AI

From Turing's 1936 thought experiment to GPT-3, these ten computer science papers form the chain reaction behind every AI system running today.

Yuki Okonkwo

June 18, 2026 · 8 min read

Nobody sat down in 1936 and said, "I'm going to invent the computer today." Alan Turing was trying to answer a dry philosophical question about math. Claude Shannon was trying to measure surprise. Frank Rosenblatt was borrowing ideas from neuroscience. The trillion-dollar AI industry you're watching unfold right now is, at its core, a pile of accidental discoveries stacked on top of each other — each one solving a narrow problem that happened to unlock the next door.

Fireship's recent video walks through ten computer science papers spanning 1936 to 2020, making the case that this specific chain of work is what made modern AI possible. It's a tight argument, and it mostly holds up. Here's what those papers actually said, why they mattered, and where the story gets complicated.


The foundation: defining what "computing" even means

Before you can build anything, you need to know what can be built. Turing's 1936 paper, On Computable Numbers, with an Application to the Entscheidungsproblem, was answering David Hilbert's challenge: is there a universal procedure that can decide, for any mathematical statement, whether it can be proved? Turing's answer was no, and to prove it, he had to define what an algorithm was in the first place.

His solution was the Turing Machine: a hypothetical device with an infinite tape, a read-write head, and a rulebook. Abstract, yes. But also the conceptual blueprint for every physical computer built since. The halting problem — can you write a program that determines whether any other program will finish or loop forever? — is provably unsolvable, which means there are hard limits baked into computation itself. That's a useful thing to know before you spend the next century building on top of the concept.
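To make the tape, head, and rulebook concrete, here is a minimal sketch of a Turing machine simulator in Python. The function name `run_turing_machine`, the rule format, and the binary-increment rulebook are illustrative assumptions, not anything from Turing's paper or the video. Note the `max_steps` guard: because the halting problem is unsolvable, the simulator cannot know in advance whether a rulebook will ever stop, so it simply gives up after a budget.

```python
from collections import defaultdict


def run_turing_machine(rules, tape, state="start", blank="_", max_steps=10_000):
    """Simulate a one-tape Turing machine (illustrative sketch).

    rules maps (state, symbol) -> (symbol_to_write, move, next_state),
    where move is -1 (left) or +1 (right). The machine stops when it
    reaches the state "halt" or when no rule matches.
    """
    # A defaultdict stands in for the infinite tape: unvisited cells are blank.
    cells = defaultdict(lambda: blank, enumerate(tape))
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted; the machine may never halt")
        key = (state, cells[head])
        if key not in rules:
            break
        symbol_to_write, move, state = rules[key]
        cells[head] = symbol_to_write
        head += move
        steps += 1
    if not cells:
        return ""
    return "".join(cells[i] for i in range(min(cells), max(cells) + 1)).strip(blank)


# Hypothetical rulebook: add 1 to a binary number written on the tape.
INCREMENT_RULES = {
    ("start", "0"): ("0", +1, "start"),   # scan right to the end of the number
    ("start", "1"): ("1", +1, "start"),
    ("start", "_"): ("_", -1, "carry"),   # step back onto the last digit
    ("carry", "1"): ("0", -1, "carry"),   # 1 + 1 = 0, carry continues left
    ("carry", "0"): ("1", -1, "halt"),    # absorb the carry and stop
    ("carry", "_"): ("1", -1, "halt"),    # ran off the left edge: new digit
}

print(run_turing_machine(INCREMENT_RULES, "1011"))  # prints 1100
```

Everything the machine "knows" lives in the rulebook dictionary; swapping in a different dictionary gives a different program on the same hardware, which is the sense in which the design is a blueprint for general-purpose computers.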

Twelve years later, Claude Shannon's A Mathematical Theory of Communication (1948) did something equally radical. He stripped meaning out of information entirely. As the video puts it: "'I love you' and 'the cat is on fire' carry the same information if they're equally surprising." He measured that surprise in bits, proved all information could be reduced to ones and zeros, and borrowed the concept of entropy from thermodynamics to quantify uncertainty.
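The "surprise measured in bits" idea fits in a few lines. The sketch below uses the standard definitions (surprisal is the negative base-2 log of a probability, entropy is the average surprisal); the function names and the example probabilities are assumptions chosen for illustration.

```python
import math


def surprisal_bits(probability):
    """Bits of surprise carried by an outcome with the given probability."""
    return -math.log2(probability)


def entropy_bits(probabilities):
    """Average surprise, in bits, of a source with these outcome probabilities."""
    return sum(p * surprisal_bits(p) for p in probabilities if p > 0)


# Two unrelated messages that are equally likely carry the same information.
p_i_love_you = 1 / 8          # hypothetical probability
p_cat_is_on_fire = 1 / 8      # hypothetical probability
print(surprisal_bits(p_i_love_you), surprisal_bits(p_cat_is_on_fire))  # 3.0 3.0

print(entropy_bits([0.5, 0.5]))   # fair coin: 1.0 bit per flip
print(entropy_bits([0.25] * 4))   # four equally likely outcomes: 2.0 bits
```

A rigged coin that almost always lands heads has entropy well under one bit: if you can already guess the outcome, seeing it tells you very little.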

Here's the part that still gives me pause: Shannon wasn't building AI. He was solving a communications engineering problem. But in doing so, he wrote what Fireship calls "the spiritual ancestor to the loss function," the mathematical mechanism that tells every neural network today how wrong it is and nudges it toward being less wrong. Anthropic's model Claude is widely reported to be named after him. If so, that's more than a fun Easter egg; it's an acknowledgment of where this lineage starts.


The hype cycle that never goes away

In 1958, a Cornell psychologist named Frank Rosenblatt built the perceptron, one of the first machines that could actually learn from examples. It took inputs, weighted them, and adjusted those weights when it made mistakes. The Navy funded it. The New York Times reported the Navy's expectation that the machine would one day be conscious of its own existence. The hype was, in the video's words, "immediate and unhinged."

Eleven years later, Marvin Minsky and Seymour Papert at MIT published Perceptrons and methodically dismantled the excitement. Using straightforward math, they showed a single-layer perceptron couldn't even learn XOR — a trivially simple logical operation. Funding dried up. The first AI winter set in.
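Both halves of this story, the learning rule and the XOR wall, can be reproduced in a few lines. This is a minimal sketch of a single-layer perceptron with the classic mistake-driven update; the function names, learning rate, and epoch count are assumptions for illustration, not Rosenblatt's original setup.

```python
def train_perceptron(samples, epochs=25, lr=1.0):
    """Single-layer perceptron: weights change only after a wrong answer."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - predicted  # 0 when correct, +1 or -1 when wrong
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b


def accuracy(samples, w, b):
    """Fraction of samples the trained perceptron classifies correctly."""
    hits = sum(
        (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == target
        for (x1, x2), target in samples
    )
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(accuracy(AND, *train_perceptron(AND)))  # 1.0: AND is linearly separable
print(accuracy(XOR, *train_perceptron(XOR)))  # always below 1.0
```

The XOR result is not a tuning problem. A single perceptron draws one straight line through the input space, and no single line puts (0,1) and (1,0) on one side with (0,0) and (1,1) on the other, so more epochs never help.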

What makes this episode historically interesting isn't just the crash. It's the buried footnote. Minsky and Papert actually noted that stacking layers of perceptrons would solve the problem. They just couldn't figure out how to train a stack. That answer wouldn't arrive for another seventeen years, and one of its authors would be Geoffrey Hinton, whose name keeps appearing through the rest of this story.

The perceptron hype cycle and its collapse also rhyme uncomfortably with where we are now. One useful lens: every AI winter so far has been preceded by claims that outpaced what the underlying systems could actually deliver. That pattern isn't an argument against the current wave of progress, since the capabilities are genuinely different now, but it's worth holding onto.


The infrastructure nobody talks about

Here's a paper that doesn't get nearly enough credit in pop-CS history: Leslie Lamport's 1978 Time, Clocks, and the Ordering of Events in a Distributed System.

The problem: separate computers with no shared clock can't agree on when things happened. For a distributed system trying to coordinate, that's catastrophic. Lamport's fix was elegant — stop using wall-clock time entirely and order events by causality instead. If event A could have caused event B, A comes first. He called these logical clocks.
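A logical clock is small enough to sketch in full. The version below follows the usual textbook statement of Lamport's rules (tick on every local event, attach the counter to outgoing messages, and on receipt jump past the larger of the two counters); the `Process` class and its method names are assumptions for illustration.

```python
class Process:
    """One machine in a distributed system, keeping a Lamport logical clock."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        # Any local step advances this process's own counter.
        self.clock += 1
        return self.clock

    def send(self):
        # Sending is an event; the timestamp travels with the message.
        self.clock += 1
        return self.clock

    def receive(self, message_timestamp):
        # Jump past whichever is later: our own history or the sender's.
        self.clock = max(self.clock, message_timestamp) + 1
        return self.clock


a = Process("A")
b = Process("B")

for _ in range(3):
    a.local_event()          # A has been busy: its clock reads 3

sent_at = a.send()           # 4
received_at = b.receive(sent_at)  # B's clock was 0, so it jumps to 5
print(sent_at, received_at)  # 4 5
```

No wall clock appears anywhere. The only guarantee is that a cause gets a smaller number than its effect: the send is always stamped earlier than the matching receive. The converse does not hold, since two unrelated events can carry any pair of numbers.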

Fireship makes a point here that's easy to miss: "You need thousands of GPUs that constantly stay in sync and agree on state without dissolving into chaos." Without Lamport's work, the massive parallel training runs behind GPT-3, Gemini, and every other frontier model become coordination nightmares. The paper predates modern AI infrastructure by decades, but it's quietly load-bearing.


The moment deep learning actually worked

In 1986, Rumelhart, Hinton, and Williams published the backpropagation paper, finally answering how to train a stack of layers. The mechanism: run data forward through the network, measure how wrong the output is, then push that error backward through every layer using calculus's chain rule, nudging each weight toward "slightly less wrong." Repeat millions of times. The network teaches itself.
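The forward pass, error measurement, and backward pass can be sketched on a toy network. This is a minimal illustration in plain Python, not the 1986 paper's setup: the three-unit hidden layer, sigmoid activations, squared-error loss, learning rate, and seed are all assumptions, and XOR is used as the dataset only because it is the case a single perceptron cannot handle.

```python
import math
import random

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(net, x):
    """Run one input forward; return hidden activations and the output."""
    hidden = [
        sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        for w, b in zip(net["w_hidden"], net["b_hidden"])
    ]
    out = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, out


def mean_squared_error(net, data):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in data) / len(data)


def train(data, hidden_units=3, epochs=5000, lr=0.5, seed=0):
    rng = random.Random(seed)
    net = {
        "w_hidden": [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(hidden_units)],
        "b_hidden": [0.0] * hidden_units,
        "w_out": [rng.uniform(-1, 1) for _ in range(hidden_units)],
        "b_out": 0.0,
    }
    initial_loss = mean_squared_error(net, data)
    for _ in range(epochs):
        for x, target in data:
            hidden, out = forward(net, x)
            # Chain rule at the output: error times the sigmoid's slope
            # (the constant factor from the squared error is folded into lr).
            delta_out = (out - target) * out * (1 - out)
            # Push the error backward through the output weights.
            delta_hidden = [
                delta_out * w * h * (1 - h) for w, h in zip(net["w_out"], hidden)
            ]
            # Nudge every weight a little in the direction that reduces the error.
            for j, h in enumerate(hidden):
                net["w_out"][j] -= lr * delta_out * h
                net["w_hidden"][j][0] -= lr * delta_hidden[j] * x[0]
                net["w_hidden"][j][1] -= lr * delta_hidden[j] * x[1]
                net["b_hidden"][j] -= lr * delta_hidden[j]
            net["b_out"] -= lr * delta_out
    return net, initial_loss, mean_squared_error(net, data)


net, loss_before, loss_after = train(XOR)
print(loss_before, loss_after)  # the second number should be smaller
```

Nobody tells the hidden units what to compute; they are adjusted only because doing so lowers the output error. A network this small can still stall on an unlucky initialization, so the sketch makes no promise about the exact final loss.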

The surprising discovery was emergence. The hidden middle layers started inventing their own features — edges, textures, shapes — that nobody explicitly programmed. XOR, the problem that killed the first wave of AI hype, became trivial.

But backpropagation still needed two things it didn't have in 1986: enough data and enough compute. The data started arriving, in a roundabout way, through a 1998 paper about a search engine.

Brin and Page's PageRank paper — The Anatomy of a Large-Scale Hypertextual Web Search Engine — is usually discussed as a business story. The ranking algorithm that became Google, built in a dorm room, etc. But there's a second-order effect that matters more for AI: PageRank helped organize and index the largest pile of human-generated text ever assembled. That indexed web became the training feedstock for everything that followed.

The 2012 AlexNet paper is where these threads converge into a detonation. Alex Krizhevsky, Ilya Sutskever, and Hinton trained a deep convolutional neural network on the ImageNet dataset — millions of labeled images — using consumer-grade Nvidia gaming GPUs. They entered the annual ImageNet competition and, as the video describes, "dropped the error rate by 10 points in a single year." Everyone else was fighting over fractions of a percent. AlexNet didn't just win. It made the previous approach look obsolete overnight.


The architecture that ate everything

In 2017, Vaswani et al. at Google published Attention Is All You Need, introducing the transformer. The problem it solved: previous language models read tokens sequentially and kept losing context. A sentence that starts one way and ends another would have the model forgetting what it started with.

The transformer's fix: let every token attend to every other token simultaneously. Every word in a sequence can look at every other word and decide what's relevant. This made models dramatically more coherent and, critically, it scaled better as models got bigger. Google published the architecture openly. Make of that business decision what you will: as Fireship notes, "every AI lab uses it" now, and it's where the T in ChatGPT comes from.
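Here is roughly what "every token attends to every other token" means in code. This is a stripped-down sketch of scaled dot-product self-attention in plain Python. As a loud simplification, each token vector serves as its own query, key, and value; a real transformer first passes tokens through learned projection matrices and runs several attention heads in parallel. The function names and toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """tokens: list of equal-length vectors. Returns (outputs, weights).

    Simplification: no learned projections, so query = key = value = token.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Score this token against every token in the sequence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted blend of all token vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, tokens)) for i in range(dim)]
        )
    return outputs, weights


# Toy sequence: the first and third tokens point the same way.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` says how much one token draws on every other token, and distance in the sequence plays no role: the first token weighs the third as heavily as it weighs itself, because their vectors match. Every row is computed independently, which is the property that lets this step run in parallel on GPUs.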

Then in 2020, OpenAI published Language Models are Few-Shot Learners, documenting GPT-3. The bet was explicit: intelligence isn't a missing algorithm, it emerges from scale. Take the transformer, scale it to 175 billion parameters, feed it the internet, and see what happens. What happened was a model that could translate, summarize, and write code without being specifically trained for any of those tasks. Two years later, a successor to that model became the foundation for ChatGPT.


What's strange about this whole history is how much of it was accidental. Turing proved a negative and got the computer. Shannon stripped meaning from language and got the mathematical bones of AI. Brin and Page built a search algorithm and accidentally assembled the world's largest training dataset. Each of these people was solving a specific, bounded problem — and the aggregate effect is a technology that none of them intended or predicted.

That's either comforting or unsettling, depending on your temperament. The systems we're building now are intentional in ways these earlier breakthroughs weren't. Whether intentionality makes the outcomes more predictable, or just more concentrated, is the live question.


— Yuki Okonkwo, AI & Machine Learning Correspondent, Buzzrag
