Building Autograd From Scratch: Understanding Backpropagation
Building a minimal automatic differentiation engine from first principles to truly understand how modern ML frameworks compute gradients and train models.
Why build autograd from scratch?
Every modern ML framework—PyTorch, TensorFlow, JAX—relies on automatic differentiation to compute gradients. You call .backward() and somehow the framework knows exactly how to compute the derivative of your loss with respect to every parameter.
But how does it actually work? Not just conceptually, but mechanically, line by line?
I built a minimal autograd engine to find out. No libraries, no shortcuts, just the core algorithm. This is what I learned building v1 of ML-from-scratch.
Computational graphs: the foundation
Every mathematical expression can be represented as a directed acyclic graph (DAG). This insight is what makes automatic differentiation possible.
Consider a simple forward pass for linear regression:
y_hat = w * x + b
loss = (y - y_hat) ** 2
This creates a computational graph: each operation becomes a node, with edges from its inputs to its output.
The key insight:
- Forward pass: Compute values flowing DOWN the graph
- Backward pass: Compute gradients flowing UP the graph
This structure lets us compute ALL derivatives with respect to ALL inputs in a single backward traversal.
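To make the graph concrete, here is the same forward pass broken into one primitive operation per line; each intermediate variable corresponds to one node in the DAG (t1 and diff are names I've introduced purely for illustration):
# The same forward pass, one primitive operation (one graph node) per line
t1 = w * x          # multiplication node
y_hat = t1 + b      # addition node
diff = y - y_hat    # subtraction node
loss = diff ** 2    # power node (squared error)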
The core abstraction: autograd nodes
Every node in the computational graph is an autograd object that stores its value, its gradient, and knows how to backpropagate through itself.
class autograd:
    def __init__(self, value, parents=(), op=''):
        self.value = value              # Computed value (forward pass)
        self.parents = parents          # Input nodes
        self.op = op                    # Operation name (for debugging)
        self.grad = 0.0                 # Gradient (backward pass)
        self._backward = lambda: None   # Gradient computation function
The magic is in _backward. Each operation defines its own gradient computation function that gets called during backpropagation.
How operations work
Let's trace through multiplication in detail. For z = x * y, we need to implement the chain rule derivatives:
@staticmethod
def multiply(a, b):
    # Forward pass: compute output value
    out = autograd(a.value * b.value, (a, b), '*')

    # Define backward pass: compute gradients using chain rule
    def _backward():
        a.grad += b.value * out.grad  # ∂L/∂a = ∂L/∂out * ∂out/∂a
        b.grad += a.value * out.grad  # ∂L/∂b = ∂L/∂out * ∂out/∂b

    out._backward = _backward
    return out
What's happening:
- Forward: Compute out.value = a.value * b.value
- Backward: When we have out.grad (the gradient from the loss), propagate it to the inputs
- Chain rule: Multiply the incoming gradient by the local derivative
- Accumulation: Use += because gradients from multiple paths sum
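As a concrete trace (my example values, assuming the autograd class and multiply method above):
a = autograd(3.0, ())
b = autograd(4.0, ())
out = autograd.multiply(a, b)   # out.value = 12.0

out.grad = 2.0                  # pretend this gradient arrived from the loss
out._backward()

print(a.grad)   # 8.0 = b.value * out.grad
print(b.grad)   # 6.0 = a.value * out.grad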
All the derivatives
Each operation implements its derivative formula:
Addition:
# ∂z/∂x = 1, ∂z/∂y = 1
a.grad += 1.0 * out.grad
b.grad += 1.0 * out.grad
Subtraction:
# ∂z/∂x = 1, ∂z/∂y = -1
a.grad += 1.0 * out.grad
b.grad += -1.0 * out.grad # Negative!
Division:
# ∂z/∂x = 1/y, ∂z/∂y = -x/y²
a.grad += (1 / b.value) * out.grad
b.grad += -(a.value / (b.value ** 2)) * out.grad
Power:
# ∂z/∂x = n * x^(n-1)
a.grad += exp * (a.value ** (exp - 1)) * out.grad
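Each of these snippets sits inside a full operation built exactly like multiply. For example, here is a sketch of what autograd.power(a, exp) (used later via operator overloading) might look like; the body is my reconstruction from the derivative snippet above:
@staticmethod
def power(a, exp):
    # exp is a plain Python number, not an autograd node
    out = autograd(a.value ** exp, (a,), f'**{exp}')

    def _backward():
        # ∂out/∂a = exp * a^(exp-1), scaled by the incoming gradient
        a.grad += exp * (a.value ** (exp - 1)) * out.grad

    out._backward = _backward
    return out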
The backward pass: topological sort
Here's where it gets interesting. We can't just call _backward() on nodes in random order. We need to process them in reverse topological order—children before parents.
Why? We can't compute gradients for a node until we've computed gradients for all nodes that depend on it.
def backward(self):
    topo, visited = [], set()

    # Build topological ordering via DFS
    def build(v):
        if v not in visited:
            visited.add(v)
            for parent in v.parents:
                build(parent)
            topo.append(v)  # Add after visiting all dependencies

    build(self)

    # Initialize output gradient
    self.grad = 1.0

    # Process in reverse topological order
    for node in reversed(topo):
        node._backward()
Step-by-step example for loss = (w*x + b - y)**2:
- Build topological order: [w, x, [*], b, [+], y, [-], [**2], loss]
- Reverse it: [loss, [**2], [-], y, [+], b, [*], x, w]
- Initialize: loss.grad = 1.0
- Propagate backwards:
  - loss._backward() → updates the power node's gradient
  - Power node → updates the subtraction gradient
  - Subtraction → updates the addition and y gradients
  - Addition → updates the multiplication and b gradients
  - Multiplication → updates the w and x gradients
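Putting numbers on it (my example values; this assumes the static methods add, subtract, and power are implemented in the same style as multiply above):
w = autograd(2.0, ())
x = autograd(3.0, ())
b = autograd(1.0, ())
y = autograd(5.0, ())

prod = autograd.multiply(w, x)       # 6.0
y_hat = autograd.add(prod, b)        # 7.0
diff = autograd.subtract(y_hat, y)   # 2.0
loss = autograd.power(diff, 2)       # 4.0

loss.backward()
# ∂loss/∂diff = 2 * diff = 4, so:
print(w.grad)   # 12.0  (4 * x.value)
print(x.grad)   #  8.0  (4 * w.value)
print(b.grad)   #  4.0
print(y.grad)   # -4.0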
The math: chain rule
This implements the multivariable chain rule. For a loss L that reaches a parameter x through an intermediate value u, the chain rule gives ∂L/∂x = (∂L/∂u) · (∂u/∂x).
Each node multiplies its local derivative by the incoming gradient, then passes the result to its parents.
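When a parameter feeds into several downstream nodes, the contributions from every path add up. Written out in LaTeX (my restatement of the standard multivariable chain rule), for a parameter x consumed by nodes u_1, ..., u_n:

\frac{\partial L}{\partial x} = \sum_{i=1}^{n} \frac{\partial L}{\partial u_i} \cdot \frac{\partial u_i}{\partial x}

The += accumulation in every _backward function is exactly this sum.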
Training loop: putting it together
Now we can train a model. Here's linear regression with gradient descent:
# Initialize parameters
w = autograd(random.uniform(-0.1, 0.1), ())
b = autograd(random.uniform(-0.1, 0.1), ())

# Training loop
for epoch in range(num_epochs):
    for x_val, y_val in data:
        # Wrap data in autograd nodes
        x = autograd(x_val, ())
        y = autograd(y_val, ())

        # Forward pass: build computational graph
        y_hat = w * x + b

        # Compute loss
        diff = y - y_hat
        loss = diff ** 2

        # Backward pass: compute all gradients
        loss.backward()

        # Update parameters
        w.value -= learning_rate * w.grad
        b.value -= learning_rate * b.grad

        # Zero gradients for next iteration
        w.grad = 0.0
        b.grad = 0.0
What happens each iteration:
- Forward: Operations build graph, compute values
- Backward: Single call computes gradients for ALL parameters
- Update: Move parameters opposite to gradient (gradient descent)
- Reset: Zero gradients for next iteration
Why reset gradients?
Each _backward() call uses += to accumulate gradients. If we don't zero them, gradients from previous iterations keep summing up, which breaks the optimization.
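A minimal sketch of the failure mode, assuming the autograd class above (the values are mine, purely illustrative):
w = autograd(3.0, ())
x = autograd(2.0, ())

loss = autograd.multiply(w, x)
loss.backward()
print(w.grad)   # 2.0 -- correct gradient for this iteration

# Forgetting to zero: build a fresh graph and backprop again
loss = autograd.multiply(w, x)
loss.backward()
print(w.grad)   # 4.0 -- the stale 2.0 plus the new 2.0, which corrupts the next update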
The optimizer
Gradient descent is simple but powerful:
class optimizer:
    def __init__(self, learning_rate=0.0001):
        self.learning_rate = learning_rate

    def adjust(self, param):
        # Move opposite to gradient
        param.value -= self.learning_rate * param.grad
        # Reset for next iteration
        param.grad = 0
The math: the update rule is θ ← θ − η · ∂L/∂θ
Where:
- θ = parameter (w or b)
- η = learning rate
- ∂L/∂θ = gradient of the loss with respect to the parameter
Negative gradient points toward decreasing loss, so we subtract it.
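Dropped into the training loop above, the optimizer replaces the manual update and zeroing (the learning rate here is an arbitrary example value):
opt = optimizer(learning_rate=0.01)

# ...inside the training loop, after loss.backward():
opt.adjust(w)   # w.value -= learning_rate * w.grad, then w.grad is reset to 0
opt.adjust(b)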
Testing it: linear regression
I trained on synthetic data generated from a known linear function, with inputs and targets normalized to [0, 1].
Expected learning:
- The slope of the underlying line
- The intercept of the underlying line (after normalization)
Results after 50 epochs: Model converges successfully, demonstrating that:
- Gradients are computed correctly
- Backpropagation works through the entire graph
- Optimization updates parameters in the right direction
You can verify gradients manually:
# Create simple computational graph
x = autograd(2.0, ())
w = autograd(3.0, ())
y = w * x # y = 6.0
# Backward pass (backward() seeds y.grad = 1.0 itself)
y.backward()
# Check gradients
assert w.grad == x.value # Should be 2.0
assert x.grad == w.value # Should be 3.0
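A more general sanity check is to compare against a finite-difference estimate. A sketch, assuming the autograd class and power method discussed above (numeric_grad is a helper I've added for illustration):
# Finite-difference check: perturb the input and compare slopes
def numeric_grad(f, val, eps=1e-6):
    return (f(val + eps) - f(val - eps)) / (2 * eps)

# Analytic gradient from the engine
x = autograd(2.0, ())
y = autograd.power(x, 3)   # y = x^3
y.backward()

# Numeric estimate of d(x^3)/dx at x = 2 (expected: 3 * 2^2 = 12)
approx = numeric_grad(lambda v: v ** 3, 2.0)
assert abs(x.grad - approx) < 1e-4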
What I learned building this
1. Computational graphs are everywhere
Every ML framework uses this exact pattern:
- PyTorch: torch.Tensor nodes with .backward()
- TensorFlow: Operations in tf.Graph
- JAX: Functional transformations with grad()
The implementation details differ, but the core algorithm is identical.
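For comparison, here is the same two-node gradient check from earlier written against PyTorch's autograd:
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
y = w * x

y.backward()
print(w.grad)   # tensor(2.)
print(x.grad)   # tensor(3.)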
2. Backpropagation is just calculus
There's no magic—just systematic application of the chain rule. The clever part is the structure (computational graphs) that lets you automate it.
3. Forward and backward are separate concerns
- Forward pass: Compute values, build graph structure
- Backward pass: Traverse graph, compute gradients
- Can't mix them or compute them out of order
4. Gradient accumulation is critical
When multiple paths lead to a parameter, gradients must sum:
loss_1 = w * x_1
loss_2 = w * x_2
total_loss = loss_1 + loss_2
# w receives gradients from both paths
# w.grad = grad_from_loss_1 + grad_from_loss_2
This is why we use += instead of = when accumulating gradients.
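Concretely, with example values of my choosing (and assuming the static methods above), the shared parameter collects a gradient from each path:
w = autograd(2.0, ())
loss_1 = autograd.multiply(w, autograd(3.0, ()))
loss_2 = autograd.multiply(w, autograd(5.0, ()))
total = autograd.add(loss_1, loss_2)

total.backward()
print(w.grad)   # 8.0 = 3.0 (from loss_1) + 5.0 (from loss_2)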
5. Topological sort isn't optional
Process nodes in the wrong order and you get wrong gradients. Children must be processed before parents—always.
6. Operator overloading enables clean code
Python's __add__, __mul__, etc. let you write:
loss = (w * x + b - y) ** 2
Instead of:
temp1 = autograd.multiply(w, x)
temp2 = autograd.add(temp1, b)
temp3 = autograd.subtract(temp2, y)
loss = autograd.power(temp3, 2)
Much more readable!
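A sketch of how the overloads can be wired to the static methods shown earlier; the exact mapping is my guess at the obvious delegation, not the repository's verbatim code:
class autograd:
    ...
    def __mul__(self, other):
        return autograd.multiply(self, other)

    def __add__(self, other):
        return autograd.add(self, other)

    def __sub__(self, other):
        return autograd.subtract(self, other)

    def __pow__(self, exp):
        return autograd.power(self, exp)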
Limitations by design
This is v1—intentionally minimal:
- ❌ No tensors/arrays (only scalars)
- ❌ No batching (processes examples one at a time)
- ❌ No neural networks (just linear regression)
- ❌ Only basic SGD (no Adam, momentum, etc.)
- ❌ No GPU acceleration
Why these limitations? Focus on understanding the core algorithm first; tensors, batching, and other optimizations add complexity that can come later.
What's next
Future versions will add:
- v2: Tensor operations (proper matrix multiplication)
- v3: Neural network layers (fully connected, activation functions)
- v4: Batch processing for efficiency
- v5: Advanced optimizers (Adam, RMSprop)
- v6: GPU acceleration with compute shaders
Each version builds on the foundation established here.
Why this matters
You're implementing the same core algorithm that powers:
- PyTorch's autograd
- TensorFlow's GradientTape
- JAX's grad transformation
- Every modern deep learning framework
Understanding this means you know how ML training actually works, not just how to use libraries. It's foundational knowledge that makes you a better ML engineer.
When you call loss.backward() in PyTorch, you now know exactly what's happening under the hood: topological sort, chain rule, gradient accumulation. No more magic.
Check out the full implementation on GitHub to see all the code and run it yourself. Building this was one of the best learning experiences I've had—highly recommend trying it.
References
Core Concepts
- Automatic Differentiation in Machine Learning: A Survey - Comprehensive overview of autodiff techniques
- Calculus on Computational Graphs: Backpropagation - Visual explanation by Chris Olah
Implementations
- micrograd - Andrej Karpathy's minimal autograd engine
- PyTorch Autograd - How PyTorch implements automatic differentiation
Related Reading
- ML-from-scratch - My implementation with tests and examples