Machine Learning Research/November 2, 2025/9 min read

Dopamine-Style Learning vs Regular Learning: Grid Pathfinding Benchmark

Comparing adaptive learning rate (dopamine-inspired) against traditional fixed learning rate on a 2D grid pathfinding task with obstacles—exploring whether modulating learning rate by improvement signal helps long-term learning.

Overview

This benchmark compares two neural network learning approaches on a 2D grid pathfinding task with obstacles:

  1. Regular Learning: Traditional loss minimization (optimizes for accuracy)
  2. Dopamine-Style Learning: Uses adaptive learning rate based on improvement signal (dopamine-driven)

Key Point: This is a minimal, from-scratch implementation without modern SOTA optimizations—and dopamine learning still wins!

The question

What if neural networks adapted their learning rate based on getting better instead of using a fixed rate?

I've been exploring neural networks from scratch—building everything with NumPy to understand each component. Along the way, I came across this question: what if we trained networks like biological brains, where dopamine signals reward improvement rather than correctness?

Traditional neural networks minimize error: they chase a fixed target, get graded on accuracy, and stop learning once they're correct. But dopamine in biological systems isn't about how wrong you are—it's about how much better or worse you're doing than before.

I wanted to see what happens. So I built a benchmark comparing the two approaches—using the simplest possible implementation to test whether the core mechanism works.

The core concept

Traditional loss-based learning

Every neural network you've used minimizes error:

python
loss = |predicted - actual|

The network tries to get closer to a fixed target. You predict 0.8, the real answer is 1.0, so you train to make that gap smaller. The direction of improvement is always fixed externally.

The problem: Once you're correct, learning stops. There's no reason to explore further.

Dopamine-style learning

Dopamine in biological systems isn't about how wrong you are—it's about how much better or worse you're doing than before.

python
dopamine = previous_error - current_error
  • Positive dopamine = getting better (reward)
  • Negative dopamine = getting worse (punishment)

Important clarification: Dopamine modulates the learning rate based on improvement, but still uses error-based gradients. When improving, it amplifies weight updates. When getting worse, it dampens them. It's an adaptive learning rate mechanism, not a fundamentally different optimization objective.

Regular LossDopamine-Style
Chases a specific answerChases progress/discovery
Goal fixed externallyGoal changes as agent learns
Stops learning once error = 0Keeps exploring new things to improve

This creates open-ended learning, not just optimization. Just like animals don't stop acting once they master something—dopamine fades, so they seek new challenges.

Implementation

Architecture

Everything built from scratch with NumPy—no ML frameworks, just the fundamentals:

model/
├── neural_network.py    # Core NN with dopamine option
├── layer.py             # Fully connected layer
├── activation.py        # Sigmoid, ReLU functions
└── pathfinder_agent.py  # Grid navigation agent

The neural network

  • Input: 4 features (distance to goal x/y, normalized position x/y)
  • Hidden Layer: 8 neurons with sigmoid activation
  • Output: 4 actions (up, down, left, right)

Dopamine implementation

When use_dopamine=True, the network:

  1. Tracks previous error for comparison
  2. Calculates dopamine signal: dopamine = previous_error - current_error
  3. Modulates gradient magnitude (still uses error-based gradients):
    • If improving (positive dopamine): amplifies weight updates
    • If getting worse (negative dopamine): dampens weight updates
python
# Actual implementation (simplified)
dopamine = previous_error - current_error
dopamine_factor = 1 + dopamine * 2  # Scale dopamine impact
dopamine_factor = np.clip(dopamine_factor, 0.1, 3.0)  # Clamp to reasonable range
gradients *= dopamine_factor  # Modulate gradient magnitude

The key insight: dopamine creates an adaptive learning rate—amplifying updates when improving, dampening when getting worse. It still minimizes error, but adjusts how aggressively based on recent progress.

Pathfinding task with obstacles

The agent learns to navigate a 50x50 grid:

  • Grid Size: 50x50 cells
  • Obstacle Density: 15% of grid filled with random obstacles
  • State representation: Normalized position and distance to goal
  • Action selection: Epsilon-greedy (10% exploration, 90% exploitation)
  • Training: Supervised learning with heuristic targets (direction toward goal, not learned from rewards)
  • Goal: Reach point B from point A while avoiding obstacles

Note: This is supervised learning, not reinforcement learning. We create training targets based on the direction toward the goal for each state in the episode trajectory.

The obstacles make this interesting—they force the agent to discover paths, not just memorize routes.

Experimental setup

  • Grid Size: 50x50 cells
  • Obstacle Density: 15% of grid filled with random obstacles
  • Episodes: 20,000 total episodes
  • Architecture: 2-layer neural network (4 inputs → 8 hidden → 4 outputs)
  • Learning Rate: 0.08 (fixed throughout training)
  • Hardware: M4 Max (14 parallel workers)
  • Speed: ~1,870 episodes/second
  • Runtime: 7.67 seconds for 20,000 episodes

What we didn't use (no fancy optimizations)

We kept it intentionally simple. Here's what we built vs what we didn't:

What we built:

  • Basic neural network layers (input → hidden → output)
  • Simple sigmoid activation function
  • Basic backpropagation (the standard way to update weights)
  • Fixed learning rate (no automatic adjustments)

What we didn't use:

  • No advanced optimizers (like Adam)
  • No learning rate scheduling
  • No batch normalization or dropout
  • No advanced activation functions (just basic sigmoid)
  • No fancy weight initialization tricks
  • No experience replay or memory buffers
  • No target networks or advanced RL techniques
  • No GPU acceleration or special frameworks
  • No automatic differentiation libraries

In short: no modern optimizations. Just the basics—matrix multiplication, gradients, and weight updates.

Why this matters

The bare-bones approach

We intentionally kept it simple:

python
# That's it. Basic matrix multiplication.
def forward(x):
    z1 = x @ weights1 + bias1
    a1 = sigmoid(z1)
    z2 = a1 @ weights2 + bias2
    return sigmoid(z2)

# Basic backprop - no tricks
def backward(error):
    # Standard chain rule
    # Update weights
    weights -= learning_rate * gradients

No fancy optimizers, no tricks, no shortcuts. Just the fundamentals.

Why dopamine still wins

Even with this bare-bones setup, dopamine learning outperforms regular learning:

ApproachSuccess RateLearning Improvement
Regular19.77%+0.0% (plateaued)
Dopamine20.18%+1.5% (still improving)

The signal is strong enough that even a simple implementation shows the advantage.

Key results

Success rates

ApproachSuccess RateWinner
Regular19.77%-
Dopamine20.18%Winner

Dopamine wins by +0.41% (20.18% vs 19.77%). While the overall difference is small and may not be statistically significant, the learning trajectory tells a more meaningful story—especially the late-game performance gap of +5.5%.

Learning trajectory

Early Performance (Episodes 1-1000):

  • Regular: 15.7%
  • Dopamine: 15.0%
  • Regular starts ahead

Late Performance (Episodes 19001-20000):

  • Regular: 20.7%
  • Dopamine: 26.2%
  • Dopamine dominates by 5.5%

This is the key finding: dopamine starts slower but ends significantly stronger.

Learning improvement

ApproachFirst 25%Last 25%Improvement
Regular23.2%23.2%+0.0% (plateaued)
Dopamine22.8%24.3%+1.5% (still improving)

Key Insight: Regular learning plateaus early, while dopamine learning continues to improve throughout training—even without modern optimizations!

Efficiency (Successful Episodes)

ApproachMean StepsSuccess Count
Regular50.4 steps3,953 episodes
Dopamine51.0 steps4,036 episodes

Dopamine finds more successful paths (4,036 vs 3,953), which matches the success rate difference (20.18% vs 19.77% of 20,000 episodes).

Consistency

ApproachStd DevNotes
Regular3.62%More stable, less exploratory
Dopamine4.38%More variable, more exploratory

The higher variance suggests dopamine creates more exploratory behavior—which might be why it finds more successful paths.

Why dopamine wins with obstacles

The obstacle effect

With obstacles (current benchmark):

  • Regular: 19.77% success
  • Dopamine: 20.18% success
  • Margin: Dopamine +0.41%

Preliminary experiments without obstacles showed similar performance (~23% for both approaches), suggesting the advantage becomes more pronounced when obstacles require discovering new paths rather than following known routes.

Key advantages (even without optimizations)

  1. Exploration-Driven: Adaptive learning rate encourages discovering new paths when improving
  2. Adaptive Learning Rate: Modulates update magnitude based on improvement signal
  3. Late-Game Dominance: Continues learning while regular approach plateaus (26.2% vs 20.7%)
  4. Better with Complexity: Performance gap appears more pronounced when obstacles require discovering new paths

How dopamine learning works

python
# Regular Learning:
loss = |predicted - actual|
gradients = compute_gradients(loss)
weights -= learning_rate * gradients  # Fixed learning rate

# Dopamine Learning:
loss = |predicted - actual|
dopamine = previous_loss - current_loss
gradients = compute_gradients(loss)  # Still error-based gradients
dopamine_factor = 1 + dopamine * 2  # Modulate learning rate
gradients *= dopamine_factor  # Amplify/dampen based on improvement
weights -= learning_rate * gradients

The key: dopamine modulates the magnitude of updates based on improvement, creating an adaptive learning rate. It still minimizes error, but adjusts how aggressively based on recent progress.

Biological inspiration

Dopamine neurons in the brain don't fire for "being correct" - they fire for surprising improvements:

  • Positive dopamine: "I'm getting better than expected!" → Repeat this behavior
  • Negative dopamine: "I'm getting worse than expected!" → Avoid this behavior
  • No dopamine: "Exactly as predicted" → Habituated, seek novelty

This creates open-ended learning - the agent never fully "solves" the task, but keeps discovering better strategies.

What we could add (future improvements)

If we wanted to optimize this further, we could add modern techniques like:

  • Better optimizers (Adam instead of basic gradient descent)
  • Learning rate scheduling
  • Experience replay buffers
  • Advanced network architectures
  • GPU acceleration

But we didn't need any of these to show dopamine learning works! The core mechanism is strong enough to show up even without modern optimizations.

Final verdict

DOPAMINE-STYLE LEARNING WINS

Scoring: Dopamine wins on 4 out of 5 key metrics

Wins on:

  • Success Rate (+0.41%)
  • Learning Improvement (+1.5% vs 0.0%)
  • Late-game performance (26.2% vs 20.7%)
  • Exploration (finds more successful paths: 4,036 vs 3,953)

Regular wins on:

  • Efficiency (slightly faster paths: 50.4 vs 51.0 steps)
  • Consistency (lower variance: 3.62% vs 4.38% std dev)

Statistical considerations

The overall success rate difference (+0.41%) is modest. With 20,000 episodes, this could be within noise. However, the late-game performance gap (+5.5% in final episodes) is more meaningful and suggests dopamine learning continues improving while regular learning plateaus.

What this means:

  • The overall difference may not be statistically significant in isolation
  • But the learning trajectory pattern (dopamine continues improving, regular plateaus) is consistent
  • The late-game advantage (+5.5%) is more substantial than the overall average (+0.41%)
  • This suggests the adaptive learning rate mechanism helps with long-term learning

Limitations:

  • Single benchmark run (no multiple runs for statistical validation)
  • No formal significance testing (p-values, confidence intervals)
  • Results are exploratory, not definitive proof
  • Simple task may not generalize to complex domains

Conclusion

The dopamine-style learning approach (adaptive learning rate based on improvement) demonstrates superior performance even with a minimal, unoptimized implementation. While the overall difference is modest (+0.41%), the learning trajectory shows dopamine continues improving while regular learning plateaus—especially in late-game performance (+5.5%).

Key takeaways:

  1. Simple is enough: Basic neural networks can show adaptive learning rate advantages
  2. Core mechanism matters: Modulating learning rate by improvement signal helps long-term learning
  3. Room for growth: Adding modern techniques would likely amplify the advantage
  4. Biological plausibility: Simple dopamine modulation is inspired by (but simplified from) real neural learning

The insight: You don't need sophisticated optimizations to see that adaptive learning rates based on improvement create more adaptive, exploratory behavior. The effect is strong enough to show up even in the simplest possible setup—especially in the learning trajectory pattern.

Regular learning plateaus around episode 5,000 and stays flat. Dopamine learning starts behind but accelerates, ending 5.5% ahead in the final episodes. This suggests dopamine's adaptive learning rate is particularly valuable when the task requires discovering new solutions over time.

This isn't a breakthrough—it's a small experiment with modest results. But it's interesting enough to share. What if we built learning systems that adaptively modulated learning rates based on progress? That's worth exploring further.


Benchmark completed: 20,000 episodes in 7.67 seconds (~1,870 eps/sec)
Implementation: Pure NumPy, basic backprop, no SOTA optimizations