Differences between CPU, GPU and TPU

01 Jun 2026

This blackboard presentation by Reiner Pope walks through the basics of a multiply-accumulate (MAC) unit and explains the transistor cost of moving data around relative to doing arithmetic.

In a CPU, a lot of silicon is spent on general-purpose flexibility: control flow, caching, branch prediction, instruction scheduling, and routing data between many different kinds of operations. The arithmetic unit itself may be cheap compared with the surrounding machinery needed to keep a CPU broadly useful.

GPUs and TPUs shift that balance toward more computation per unit of control and data-movement overhead. GPUs do this through massive parallelism and high-throughput execution. TPUs go further for machine-learning workloads by using systolic arrays: grids of MAC units where data flows rhythmically through neighboring cells.

Systolic array design relies on two key ideas:

A grid of MAC nodes lets partial sums flow through the grid in a regular pattern, “systolic” by analogy with blood pulsing through the heart.
Weights can be preloaded into local registers or storage near the compute units, reducing repeated long-distance data movement once the array is filled.

2x2 systolic array

In this 2x2 example, external routing only needs two routed lanes (multiplexers) into the array and two out of it. The MAC units still perform the matrix multiplication work, but the B values get pre-loaded by chaining slowly through local MAC registers.

Curiosity is bliss Archive TIL Tools Search RSS About

Julien Couvreur's programming blog and more

Differences between CPU, GPU and TPU