Machine Learning Research/October 22, 2025/12 min read

Building Autograd From Scratch: Understanding Backpropagation

Building a minimal automatic differentiation engine from first principles to truly understand how modern ML frameworks compute gradients and train models.

Why build autograd from scratch?

Every modern ML framework—PyTorch, TensorFlow, JAX—relies on automatic differentiation to compute gradients. You call .backward() and somehow the framework knows exactly how to compute the derivative of your loss with respect to every parameter.

But how does it actually work? Not just conceptually, but mechanically, line by line?

I built a minimal autograd engine to find out. No libraries, no shortcuts, just the core algorithm. This is what I learned building v1 of ML-from-scratch.

Computational graphs: the foundation

Every mathematical expression can be represented as a directed acyclic graph (DAG). This insight is what makes automatic differentiation possible.

Consider a simple forward pass for linear regression:

python
y_hat = w * x + b
loss = (y - y_hat) ** 2

This creates a computational graph:

[Figure: Computational Graph - Forward Pass. With w = 0.5, x = 2.0, b = 1.0, and y = 3.0, the graph multiplies w and x (1.0), adds b to get y_hat = 2.0, takes the difference with y, and squares it to get loss = 1.0.]

The key insight:

  • Forward pass: Compute values flowing DOWN the graph
  • Backward pass: Compute gradients flowing UP the graph

This structure lets us compute ALL derivatives with respect to ALL inputs in a single backward traversal.

The core abstraction: autograd nodes

Every node in the computational graph is an autograd object that stores its value, its gradient, and knows how to backpropagate through itself.

python
class autograd:
    def __init__(self, value, parents=(), op=''):
        self.value = value          # Computed value (forward pass)
        self.parents = parents      # Input nodes
        self.op = op                # Operation name (for debugging)
        self.grad = 0.0             # Gradient (backward pass)
        self._backward = lambda: None  # Gradient computation function

The magic is in _backward. Each operation defines its own gradient computation function that gets called during backpropagation.

How operations work

Let's trace through multiplication in detail. For z = x * y, we need to implement the chain rule derivatives:

  • \frac{\partial z}{\partial x} = y
  • \frac{\partial z}{\partial y} = x

python
@staticmethod
def multiply(a, b):
    # Forward pass: compute output value
    out = autograd(a.value * b.value, (a, b), '*')
    
    # Define backward pass: compute gradients using chain rule
    def _backward():
        a.grad += b.value * out.grad  # ∂L/∂a = ∂L/∂out * ∂out/∂a
        b.grad += a.value * out.grad  # ∂L/∂b = ∂L/∂out * ∂out/∂b
    
    out._backward = _backward
    return out

What's happening:

  1. Forward: Compute out.value = a.value * b.value
  2. Backward: When we have out.grad (gradient from loss), propagate it to inputs
  3. Chain rule: Multiply incoming gradient by local derivative
  4. Accumulation: Use += because gradients from multiple paths sum (see the example below)
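
Step 4 matters even within a single operation: if the same node is passed in twice, it receives a gradient contribution from each use. A minimal check, assuming the multiply method above and seeding the output gradient by hand (hypothetical value x = 3.0):

python
# z = x * x: the same node appears as both inputs
x = autograd(3.0, ())
z = autograd.multiply(x, x)   # z.value = 9.0

z.grad = 1.0    # seed the output gradient
z._backward()   # runs x.grad += x.value * z.grad twice

assert x.grad == 2 * x.value  # dz/dx = 2x = 6.0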

All the derivatives

Each operation implements its derivative formula:

Addition z = x + y:

python
# ∂z/∂x = 1, ∂z/∂y = 1
a.grad += 1.0 * out.grad
b.grad += 1.0 * out.grad

Subtraction z = x - y:

python
# ∂z/∂x = 1, ∂z/∂y = -1
a.grad += 1.0 * out.grad
b.grad += -1.0 * out.grad  # Negative!

Division z = \frac{x}{y}:

python
# ∂z/∂x = 1/y, ∂z/∂y = -x/y²
a.grad += (1 / b.value) * out.grad
b.grad += -(a.value / (b.value ** 2)) * out.grad

Power z = x^n:

python
# ∂z/∂x = n * x^(n-1)
a.grad += exp * (a.value ** (exp - 1)) * out.grad
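
Putting the forward value and the backward closure together, the power operation can be sketched in the same pattern as multiply above (a sketch; the exponent is assumed to be a plain Python number, not an autograd node):

python
@staticmethod
def power(a, exp):
    # Forward pass: compute output value
    out = autograd(a.value ** exp, (a,), '**')

    # Backward pass: ∂(a^exp)/∂a = exp * a^(exp - 1)
    def _backward():
        a.grad += exp * (a.value ** (exp - 1)) * out.grad

    out._backward = _backward
    return out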

The backward pass: topological sort

Here's where it gets interesting. We can't just call _backward() on nodes in random order. We need to process them in reverse topological order—children before parents.

Why? We can't compute gradients for a node until we've computed gradients for all nodes that depend on it.

python
def backward(self):
    topo, visited = [], set()
    
    # Build topological ordering via DFS
    def build(v):
        if v not in visited:
            visited.add(v)
            for parent in v.parents:
                build(parent)
            topo.append(v)  # Add after visiting all dependencies
    
    build(self)
    
    # Initialize output gradient
    self.grad = 1.0
    
    # Process in reverse topological order
    for node in reversed(topo):
        node._backward()

Step-by-step example for loss = (w*x + b - y)**2, with a gradient check after the list:

  1. Build topological order: [w, x, [*], b, [+], y, [-], [**2], loss]
  2. Reverse it: [loss, [**2], [-], y, [+], b, [*], x, w]
  3. Initialize: loss.grad = 1.0
  4. Propagate backwards:
    • loss._backward() → updates power node gradient
    • Power node → updates subtraction gradient
    • Subtraction → updates addition and y gradients
    • Addition → updates multiplication and b gradients
    • Multiplication → updates w and x gradients
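
The whole chain can be checked end to end against the analytic derivatives. A small sketch, assuming the operator overloads described later in this post and using the values from the forward-pass figure:

python
import math

w, x = autograd(0.5, ()), autograd(2.0, ())
b, y = autograd(1.0, ()), autograd(3.0, ())

loss = (w * x + b - y) ** 2   # (1.0 + 1.0 - 3.0)^2 = 1.0
loss.backward()

# Analytic: ∂loss/∂w = 2 * (w*x + b - y) * x = 2 * (-1.0) * 2.0 = -4.0
assert math.isclose(w.grad, -4.0)
# Analytic: ∂loss/∂b = 2 * (w*x + b - y) = -2.0
assert math.isclose(b.grad, -2.0)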

The math: chain rule

This implements the multivariable chain rule. For \text{loss} = f(g(h(x))):

\frac{\partial \text{loss}}{\partial x} = \frac{\partial \text{loss}}{\partial f} \cdot \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}

Each node multiplies its local derivative by the incoming gradient, then passes the result to its parents.
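
For the running example loss = (w*x + b - y)**2, the factors are exactly the local derivatives implemented by the power, add/subtract, and multiply nodes:

\frac{\partial \text{loss}}{\partial w} = \frac{\partial \text{loss}}{\partial \text{diff}} \cdot \frac{\partial \text{diff}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = 2\,\text{diff} \cdot 1 \cdot x = 2\,(wx + b - y)\,x

where \hat{y} = wx + b and \text{diff} = \hat{y} - y.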

Training loop: putting it together

Now we can train a model. Here's linear regression with gradient descent:

python
import random

# Initialize parameters
w = autograd(random.uniform(-0.1, 0.1), ())
b = autograd(random.uniform(-0.1, 0.1), ())

# Training loop
for epoch in range(num_epochs):
    for x_val, y_val in data:
        # Wrap data in autograd nodes
        x = autograd(x_val, ())
        y = autograd(y_val, ())
        
        # Forward pass: build computational graph
        y_hat = w * x + b
        
        # Compute loss
        diff = y - y_hat
        loss = diff ** 2
        
        # Backward pass: compute all gradients
        loss.backward()
        
        # Update parameters
        w.value -= learning_rate * w.grad
        b.value -= learning_rate * b.grad
        
        # Zero gradients for next iteration
        w.grad = 0.0
        b.grad = 0.0

[Figure: Training Loop - Gradient Descent. Flowchart: Initialize w, b → Forward Pass → Compute Loss → Backward Pass → Update Params → Zero Gradients → More Data? If yes, loop back to the forward pass; if no, training is done.]

What happens each iteration:

  1. Forward: Operations build graph, compute values
  2. Backward: Single call computes gradients for ALL parameters
  3. Update: Move parameters opposite to gradient (gradient descent)
  4. Reset: Zero gradients for next iteration

Why reset gradients?

Each _backward() call uses += to accumulate gradients. If we don't zero them, gradients from previous iterations keep summing up, which breaks the optimization.
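
Here is what that failure looks like in miniature (a sketch with hypothetical values, reusing the operator overloads):

python
w = autograd(2.0, ())
x = autograd(3.0, ())

(w * x).backward()
print(w.grad)   # 3.0 -- correct gradient for this graph

# A second pass without zeroing accumulates on top of the stale value
(w * x).backward()
print(w.grad)   # 6.0 -- includes the leftover gradient from the first pass

w.grad = 0.0    # zeroing restores correct behavior for the next iteration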

The optimizer

Gradient descent is simple but powerful:

python
class optimizer:
    def __init__(self, learning_rate=0.0001):
        self.learning_rate = learning_rate
    
    def adjust(self, param):
        # Move opposite to gradient
        param.value -= self.learning_rate * param.grad
        # Reset for next iteration
        param.grad = 0
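
With the optimizer in place, the manual update-and-zero lines from the training loop collapse into one call per parameter. A sketch, reusing the parameter and data names from the loop above:

python
opt = optimizer(learning_rate=0.0001)

for x_val, y_val in data:
    x, y = autograd(x_val, ()), autograd(y_val, ())
    loss = (w * x + b - y) ** 2
    loss.backward()
    opt.adjust(w)   # update w and reset w.grad
    opt.adjust(b)   # update b and reset b.grad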

The math: Update rule is:

\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta L

Where:

  • \theta = parameter (w or b)
  • \eta = learning rate
  • \nabla_\theta L = gradient of loss with respect to parameter

The gradient points toward increasing loss, so we subtract it to move toward decreasing loss.

Testing it: linear regression

I trained on synthetic data following y = x - 1, normalized to [0, 1].

Expected learning:

  • Slope w \approx 1.0
  • Intercept b \approx -1.0/100000 (after normalization)

Results after 50 epochs: Model converges successfully, demonstrating that:

  1. Gradients are computed correctly
  2. Backpropagation works through the entire graph
  3. Optimization updates parameters in the right direction

You can verify gradients manually:

python
# Create simple computational graph
x = autograd(2.0, ())
w = autograd(3.0, ())
y = w * x  # y = 6.0

# Backward pass (backward() seeds y.grad = 1.0 itself)
y.backward()

# Check gradients
assert w.grad == x.value  # Should be 2.0
assert x.grad == w.value  # Should be 3.0
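
A complementary check, not part of the original code, is to compare against a numerical finite-difference approximation of the derivative (a sketch; numerical_grad and eps are hypothetical helpers):

python
def numerical_grad(f, value, eps=1e-6):
    # Central difference: (f(v + eps) - f(v - eps)) / (2 * eps)
    return (f(value + eps) - f(value - eps)) / (2 * eps)

# Autograd gradient of loss = (w * 2.0 + 1.0 - 3.0)^2 with respect to w
w = autograd(0.5, ())
loss = (w * autograd(2.0, ()) + autograd(1.0, ()) - autograd(3.0, ())) ** 2
loss.backward()

# The same loss as a plain function of w's value
f = lambda wv: (wv * 2.0 + 1.0 - 3.0) ** 2

assert abs(w.grad - numerical_grad(f, 0.5)) < 1e-4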

What I learned building this

1. Computational graphs are everywhere

Every ML framework uses this exact pattern:

  • PyTorch: torch.Tensor nodes with .backward()
  • TensorFlow: Operations in tf.Graph
  • JAX: Functional transformations with grad()

The implementation details differ, but the core algorithm is identical.

2. Backpropagation is just calculus

There's no magic—just systematic application of the chain rule. The clever part is the structure (computational graphs) that lets you automate it.

3. Forward and backward are separate concerns

  • Forward pass: Compute values, build graph structure
  • Backward pass: Traverse graph, compute gradients
  • Can't mix them or compute them out of order

4. Gradient accumulation is critical

When multiple paths lead to a parameter, gradients must sum:

python
loss_1 = w * x_1
loss_2 = w * x_2  
total_loss = loss_1 + loss_2

# w receives gradients from both paths
# w.grad = grad_from_loss_1 + grad_from_loss_2

[Figure: Gradient Accumulation - Multiple Paths. The parameter w feeds both w * x₁ and w * x₂, which flow into loss₁, loss₂, and total_loss; during the backward pass grad₁ and grad₂ flow back along both paths and sum at w.]

This is why we use += instead of = when accumulating gradients.
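
Running a version of that example through the engine makes the accumulation concrete (a sketch with hypothetical values):

python
w = autograd(2.0, ())
x_1 = autograd(3.0, ())
x_2 = autograd(4.0, ())

loss_1 = w * x_1
loss_2 = w * x_2
total_loss = loss_1 + loss_2
total_loss.backward()

# One contribution per path: w.grad = x_1.value + x_2.value = 7.0
assert w.grad == x_1.value + x_2.value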

5. Topological sort isn't optional

Process nodes in the wrong order and you get wrong gradients. Children must be processed before parents—always.

6. Operator overloading enables clean code

Python's __add__, __mul__, etc. let you write:

python
loss = (w * x + b - y) ** 2

Instead of:

python
temp1 = autograd.multiply(w, x)
temp2 = autograd.add(temp1, b)
temp3 = autograd.subtract(temp2, y)
loss = autograd.power(temp3, 2)

Much more readable!
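
One way to wire this up is to have the dunder methods delegate to the static operation methods (a sketch; the names in the actual repository may differ, and __pow__ assumes a plain-number exponent):

python
class autograd:
    # ... __init__, backward, and the static operation methods from above ...

    def __mul__(self, other):
        return autograd.multiply(self, other)

    def __add__(self, other):
        return autograd.add(self, other)

    def __sub__(self, other):
        return autograd.subtract(self, other)

    def __pow__(self, exp):
        return autograd.power(self, exp)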

Limitations by design

This is v1—intentionally minimal:

  • ❌ No tensors/arrays (only scalars)
  • ❌ No batching (processes examples one at a time)
  • ❌ No neural networks (just linear regression)
  • ❌ Only basic SGD (no Adam, momentum, etc.)
  • ❌ No GPU acceleration

Why these limitations? Focus on understanding the core algorithm first. Tensors, batching, and optimizations are complexity that can come later.

What's next

Future versions will add:

  • v2: Tensor operations (proper matrix multiplication)
  • v3: Neural network layers (fully connected, activation functions)
  • v4: Batch processing for efficiency
  • v5: Advanced optimizers (Adam, RMSprop)
  • v6: GPU acceleration with compute shaders

Each version builds on the foundation established here.

Why this matters

You're implementing the same core algorithm that powers:

  • PyTorch's autograd
  • TensorFlow's GradientTape
  • JAX's grad transformation
  • Every modern deep learning framework

Understanding this means you know how ML training actually works, not just how to use libraries. This is foundational knowledge that makes you a better ML engineer.

When you call loss.backward() in PyTorch, you now know exactly what's happening under the hood: topological sort, chain rule, gradient accumulation. No more magic.

Check out the full implementation on GitHub to see all the code and run it yourself. Building this was one of the best learning experiences I've had—highly recommend trying it.

