Training adjusts network weights to minimize errors. It uses loss functions to measure errors. It uses backpropagation to compute gradients. It uses optimizers to update weights. Training continues until convergence or maximum iterations.
The training loop processes data in batches. Each batch updates weights once. Multiple epochs process all data multiple times. Early stopping prevents overfitting. Learning rate schedules adjust step sizes.
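A minimal, self-contained sketch of such a loop, using a toy linear model with MSE loss and plain gradient descent (the data, model, and hyperparameters here are illustrative, not part of the examples later in this section):
# Minimal training loop: forward pass, loss gradient, weight update, repeated over batches and epochs
import numpy as np

np.random.seed(0)
X = np.random.randn(200, 3)                        # 200 examples, 3 features
true_w = np.array([[1.5], [-2.0], [0.5]])
y = X @ true_w + 0.1 * np.random.randn(200, 1)     # noisy targets

w = np.zeros((3, 1))                               # weights to learn
lr = 0.1                                           # learning rate
batch_size = 32

for epoch in range(20):                            # multiple epochs over the data
    perm = np.random.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):     # one weight update per batch
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        y_pred = Xb @ w                            # forward pass
        grad = 2 * Xb.T @ (y_pred - yb) / len(Xb)  # gradient of the MSE loss
        w -= lr * grad                             # optimizer step (plain SGD)
    loss = np.mean((X @ w - y) ** 2)
print("final MSE:", round(float(loss), 4))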
The diagram shows training workflow. Data flows through forward pass. Loss computes errors. Backpropagation computes gradients. Optimizer updates weights. Process repeats until convergence.
Loss Functions
Loss functions measure prediction errors. They guide weight updates. Different problems use different losses. Regression uses MSE or MAE. Classification uses cross-entropy.
Mean squared error is MSE = (1/n) Σ(y_pred - y_true)². It emphasizes large errors. It works well for regression. Mean absolute error is MAE = (1/n) Σ|y_pred - y_true|. It treats all errors equally. It is robust to outliers.
Cross-entropy loss is CE = -Σ y_true × log(y_pred). It measures probability differences. It works well for classification. It penalizes confident wrong predictions.
Choose loss functions matching your problem. Regression problems use MSE or MAE. Classification problems use cross-entropy. Custom losses can encode domain knowledge.
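The three losses above can be written directly in NumPy (a minimal sketch; the function names and example values are illustrative):
# Loss functions: MSE, MAE, cross-entropy
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)           # squares emphasize large errors

def mae(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))          # treats all errors equally

def cross_entropy(y_pred, y_true, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)               # avoid log(0)
    return -np.sum(y_true * np.log(y_pred)) / len(y_true)

y_true = np.array([1.0, 0.0, 2.0])
y_pred = np.array([1.1, -0.2, 1.8])
print("MSE:", mse(y_pred, y_true))                   # approx. 0.03
print("MAE:", mae(y_pred, y_true))                   # approx. 0.167

# one-hot targets vs predicted probabilities for classification
t = np.array([[1, 0, 0], [0, 1, 0]])
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print("Cross-entropy:", cross_entropy(p, t))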
The diagram compares loss functions. MSE is quadratic. MAE is linear. Cross-entropy is logarithmic.
Backpropagation
Backpropagation computes gradients efficiently. It uses chain rule to propagate errors backward. It computes gradients for all weights in one pass. It enables training deep networks.
The process starts at output layer. It computes output error. It propagates error backward through layers. Each layer computes its gradient. Gradients accumulate using chain rule.
print("Gradient shapes: "+str([g.shape for g in grad_w]))
Backpropagation is the core of neural network training. It enables efficient gradient computation. It makes deep learning practical.
Detailed Backpropagation Mathematics
Backpropagation uses chain rule from calculus. For output layer, error is δᴸ = ∇ₐC ⊙ σ'(zᴸ). C is cost function. σ' is activation derivative. ⊙ is element-wise multiplication.
For hidden layer l, error propagates backward. δˡ = ((wˡ⁺¹)ᵀ δˡ⁺¹) ⊙ σ'(zˡ). The transposed weights of the next layer carry the next layer's error back. The result is multiplied element-wise by the activation derivative of the current layer.
Gradient for weight wˡᵢⱼ is ∂C/∂wˡᵢⱼ = aˡ⁻¹ⱼ × δˡᵢ: the activation from the previous layer times the error of the current layer. Gradient for bias bˡᵢ is ∂C/∂bˡᵢ = δˡᵢ. The bias gradient equals the error.
# Detailed Backpropagation Implementation
import numpy as np

class NeuralNetworkDetailed:
    def __init__(self, layers, learning_rate=0.01):
        self.layers = layers
        self.lr = learning_rate
        self.weights = []
        self.biases = []
        self.activations = []
        self.z_values = []
        # Initialize weights and biases
        for i in range(len(layers) - 1):
            w = np.random.randn(layers[i + 1], layers[i]) * 0.1
            b = np.zeros((layers[i + 1], 1))
            self.weights.append(w)
            self.biases.append(b)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)

    def forward(self, X):
        self.activations = [X.T]
        self.z_values = []
        for i in range(len(self.weights)):
            z = self.weights[i] @ self.activations[-1] + self.biases[i]
            self.z_values.append(z)
            self.activations.append(self.sigmoid(z))
        return self.activations[-1]

    def backward(self, y):
        # Output error (quadratic cost): delta_L = (a_L - y) * sigma'(z_L), as in the equations above
        delta = (self.activations[-1] - y.T) * self.sigmoid_derivative(self.z_values[-1])
        grad_w = [None] * len(self.weights)
        grad_b = [None] * len(self.biases)
        for l in range(len(self.weights) - 1, -1, -1):
            grad_w[l] = delta @ self.activations[l].T          # dC/dw = a_prev * delta
            grad_b[l] = np.sum(delta, axis=1, keepdims=True)   # dC/db = delta
            if l > 0:
                # delta_l = (w_{l+1}^T delta_{l+1}) * sigma'(z_l)
                delta = (self.weights[l].T @ delta) * self.sigmoid_derivative(self.z_values[l - 1])
        # Gradient descent step on all weights and biases
        self.weights = [w - self.lr * gw for w, gw in zip(self.weights, grad_w)]
        self.biases = [b - self.lr * gb for b, gb in zip(self.biases, grad_b)]
        return grad_w, grad_b
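A quick check of the class above (a sketch; the layer sizes and random data are illustrative):
# Example: one forward and backward pass on random data
net = NeuralNetworkDetailed([3, 4, 1])
X = np.random.randn(5, 3)      # 5 examples, 3 features
y = np.random.randn(5, 1)      # 5 targets
net.forward(X)
grad_w, grad_b = net.backward(y)
print("Gradient shapes: " + str([g.shape for g in grad_w]))
# Gradient shapes: [(4, 3), (1, 4)]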
Gradient vanishing occurs when gradients become very small. It happens in deep networks. It prevents early layers from learning. It occurs with sigmoid and tanh activations. Their derivatives are bounded and small.
Gradient exploding occurs when gradients become very large. It causes unstable training. It happens with large weights. It causes NaN values. It occurs in recurrent networks.
Solutions include proper weight initialization. Xavier initialization uses variance 1/n. He initialization uses variance 2/n. Batch normalization normalizes activations. Residual connections provide gradient highways. Gradient clipping limits gradient magnitude.
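The two initialization schemes amount to scaling random weights by the stated variances (a minimal sketch; the function names are illustrative):
# Weight initialization: Xavier (variance 1/n) and He (variance 2/n)
import numpy as np

def xavier_init(n_in, n_out):
    # variance 1/n_in keeps activation scale stable for sigmoid/tanh layers
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out):
    # variance 2/n_in compensates for ReLU zeroing roughly half the inputs
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)

w1 = xavier_init(256, 128)
w2 = he_init(256, 128)
print("Xavier std:", round(float(w1.std()), 3))   # about sqrt(1/256) = 0.062
print("He std:", round(float(w2.std()), 3))       # about sqrt(2/256) = 0.088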
The diagram shows gradient flow. Errors propagate backward. Each layer computes its contribution. Gradients accumulate through chain rule.
Optimizers
Optimizers update weights using gradients. Different optimizers have different update rules. SGD uses simple gradient descent. Momentum adds velocity term. Adam combines momentum and adaptive learning rates.
Stochastic gradient descent is w = w - α × ∇w, where ∇w is the gradient of the loss with respect to w. Learning rate α controls step size. It is simple but can be slow. Momentum is v = βv + ∇w, w = w - αv. The velocity v accumulates gradients, with decay factor β. It accelerates convergence.
Adam adapts learning rates per parameter. It maintains momentum and variance estimates. It works well for most problems. It is the default choice for many applications.
# Optimizers
import numpy as np

class SGD:
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, weights, gradients):
        return [w - self.learning_rate * g for w, g in zip(weights, gradients)]

class Adam:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.m = None   # first-moment (momentum) estimates
        self.v = None   # second-moment (variance) estimates
        self.t = 0      # time step for bias correction

    def update(self, weights, gradients):
        if self.m is None:
            self.m = [np.zeros_like(w) for w in weights]
            self.v = [np.zeros_like(w) for w in weights]
        self.t += 1
        self.m = [self.beta1 * m + (1 - self.beta1) * g for m, g in zip(self.m, gradients)]
        self.v = [self.beta2 * v + (1 - self.beta2) * g**2 for v, g in zip(self.v, gradients)]
        updated_weights = []
        for w, m, v in zip(weights, self.m, self.v):
            m_hat = m / (1 - self.beta1**self.t)   # bias-corrected momentum
            v_hat = v / (1 - self.beta2**self.t)   # bias-corrected variance
            w_new = w - self.learning_rate * m_hat / (np.sqrt(v_hat) + 1e-8)
            updated_weights.append(w_new)
        return updated_weights

# Example
weights = [np.array([[0.5], [0.3]])]
gradients = [np.array([[0.1], [0.2]])]
sgd = SGD(learning_rate=0.01)
adam = Adam(learning_rate=0.001)
w_sgd = sgd.update(weights, gradients)
w_adam = adam.update(weights, gradients)
print("SGD update: " + str(w_sgd[0].flatten()))
print("Adam update: " + str(w_adam[0].flatten()))
Choose optimizers based on problem characteristics. SGD works for simple problems. Adam works for most problems. Experiment to find the best optimizer.
The diagram compares optimizer paths. SGD follows gradients directly. Momentum follows smoothed gradients. Adam adapts step sizes per parameter.
Learning Rate Schedules
Learning rate schedules adjust step sizes during training. Fixed rates can be too large or too small. Adaptive rates improve convergence. Common schedules include step decay, exponential decay, and cosine annealing.
Step decay reduces rate at fixed intervals. Exponential decay reduces rate continuously. Cosine annealing follows cosine curve. Warmup starts with small rates. It prevents early instability.
The diagram shows different learning rate schedules. Fixed rate stays constant. Step decay reduces at intervals. Exponential decay reduces continuously. Each schedule has different convergence characteristics.
Detailed Learning Rate Schedule Strategies
Fixed learning rate uses constant value throughout training. It is simple to implement. It requires careful tuning. Too high causes instability. Too low causes slow convergence. It works for simple problems with stable loss landscapes.
Step decay reduces learning rate at fixed intervals. lr(epoch) = initial_lr × drop_factor^(epoch // drop_interval). Drop interval is typically 10-30 epochs. Drop factor is typically 0.1-0.5. It provides controlled reduction. It works well for many problems.
Exponential decay reduces learning rate continuously. lr(epoch) = initial_lr × decay_rate^epoch. Decay rate is typically 0.9-0.99. It provides smooth reduction. It requires tuning decay rate carefully. It works for problems needing gradual reduction.
Cosine annealing follows cosine curve. lr(epoch) = initial_lr × 0.5 × (1 + cos(π × epoch / max_epochs)). It starts high and ends low. It provides smooth transition. It works well for long training runs. It often improves final performance.
Warmup starts with small learning rate. It gradually increases to target rate. It prevents early instability. It helps with large batch training. Typical warmup is 5-10% of total epochs.
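The schedules above translate directly into small functions of the epoch number (a minimal sketch; the function names, defaults, and linear warmup shape are illustrative):
# Learning rate schedules
import numpy as np

def step_decay(epoch, initial_lr=0.1, drop_factor=0.5, drop_interval=10):
    return initial_lr * drop_factor ** (epoch // drop_interval)

def exponential_decay(epoch, initial_lr=0.1, decay_rate=0.95):
    return initial_lr * decay_rate ** epoch

def cosine_annealing(epoch, initial_lr=0.1, max_epochs=100):
    return initial_lr * 0.5 * (1 + np.cos(np.pi * epoch / max_epochs))

def warmup_then_cosine(epoch, initial_lr=0.1, warmup_epochs=5, max_epochs=100):
    if epoch < warmup_epochs:
        return initial_lr * (epoch + 1) / warmup_epochs   # linear warmup to the target rate
    return cosine_annealing(epoch - warmup_epochs, initial_lr, max_epochs - warmup_epochs)

for epoch in [0, 10, 50, 99]:
    print(epoch, round(step_decay(epoch), 5), round(exponential_decay(epoch), 5),
          round(cosine_annealing(epoch), 5))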
Learning rate finder identifies optimal learning rate range. It trains with exponentially increasing rates. It plots loss versus learning rate. Optimal range is where loss decreases fastest. It helps choose initial learning rate.
Process starts with very small learning rate. It increases exponentially each batch. It records loss for each rate. It stops when loss diverges. Plot shows loss curve. Steepest descent indicates good range.
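A sketch of that procedure, assuming a train_one_batch callback that runs one training step at a given learning rate and returns the loss (toy_step below is an illustrative stand-in, not a real model):
# Learning rate finder
import numpy as np

def lr_finder(train_one_batch, start_lr=1e-6, end_lr=1.0, num_steps=100, divergence_factor=4.0):
    # increase the learning rate exponentially each batch and record the loss
    lrs = np.geomspace(start_lr, end_lr, num_steps)
    losses = []
    best = np.inf
    for lr in lrs:
        loss = train_one_batch(lr)
        losses.append(loss)
        best = min(best, loss)
        if loss > divergence_factor * best:   # stop once the loss diverges
            break
    return lrs[:len(losses)], losses

def toy_step(lr, target=0.1):
    # toy "loss" that is smallest near lr = target
    return (np.log10(lr) - np.log10(target)) ** 2

lrs, losses = lr_finder(toy_step)
print("best learning rate (toy):", lrs[int(np.argmin(losses))])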
Batch Processing
Batch processing groups examples for efficiency. Large batches provide stable gradients. Small batches provide frequent updates. Mini-batches balance stability and speed.
Batch size affects training dynamics. Large batches converge smoothly but slowly. Small batches converge quickly but noisily. Typical batch sizes are 32, 64, or 128.
# Batch Processing
import numpy as np

def create_batches(X, y, batch_size=32):
    n_samples = X.shape[0]
    batches = []
    for i in range(0, n_samples, batch_size):
        end_idx = min(i + batch_size, n_samples)
        batches.append((X[i:end_idx], y[i:end_idx]))
    return batches

# Example
X = np.random.randn(100, 10)
y = np.random.randn(100, 1)
batches = create_batches(X, y, batch_size=32)
print("Number of batches: " + str(len(batches)))
print("Batch sizes: " + str([b[0].shape[0] for b in batches]))
# Result:
# Number of batches: 4
# Batch sizes: [32, 32, 32, 4]
Batch processing enables efficient training. It uses parallel computation. It provides stable gradient estimates. It is essential for large datasets.
Gradient Clipping
Gradient clipping prevents exploding gradients. It limits gradient magnitudes. It stabilizes training. It is essential for recurrent networks.
Clipping methods include norm clipping and value clipping. Norm clipping scales gradients to maximum norm. Value clipping clamps gradient values. Both prevent extreme updates.
# Gradient Clipping
import numpy as np

def clip_gradients_norm(gradients, max_norm=1.0):
    total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients))