Training adjusts network weights to minimize errors. It uses loss functions to measure errors. It uses backpropagation to compute gradients. It uses optimizers to update weights. Training continues until convergence or maximum iterations.
The training loop processes data in batches. Each batch updates weights once. Multiple epochs process all data multiple times. Early stopping prevents overfitting. Learning rate schedules adjust step sizes.
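A minimal, self-contained sketch of such a loop, using a toy linear model with MSE loss and plain gradient descent (the data, model, and hyperparameters here are illustrative, not part of the examples later in this section):
# Minimal training loop: forward pass, loss gradient, weight update, repeated over batches and epochs
import numpy as np

np.random.seed(0)
X = np.random.randn(200, 3)                        # 200 examples, 3 features
true_w = np.array([[1.5], [-2.0], [0.5]])
y = X @ true_w + 0.1 * np.random.randn(200, 1)     # noisy targets

w = np.zeros((3, 1))                               # weights to learn
lr = 0.1                                           # learning rate
batch_size = 32

for epoch in range(20):                            # multiple epochs over the data
    perm = np.random.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):     # one weight update per batch
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        y_pred = Xb @ w                            # forward pass
        grad = 2 * Xb.T @ (y_pred - yb) / len(Xb)  # gradient of the MSE loss
        w -= lr * grad                             # optimizer step (plain SGD)
    loss = np.mean((X @ w - y) ** 2)
print("final MSE:", round(float(loss), 4))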
The diagram shows training workflow. Data flows through forward pass. Loss computes errors. Backpropagation computes gradients. Optimizer updates weights. Process repeats until convergence.
Loss Functions
Loss functions measure prediction errors. They guide weight updates. Different problems use different losses. Regression uses MSE or MAE. Classification uses cross-entropy.
Mean squared error is MSE = (1/n) Σ(y_pred - y_true)². It emphasizes large errors. It works well for regression. Mean absolute error is MAE = (1/n) Σ|y_pred - y_true|. It treats all errors equally. It is robust to outliers.
Cross-entropy loss is CE = -Σ y_true × log(y_pred). It measures probability differences. It works well for classification. It penalizes confident wrong predictions.
Choose loss functions matching your problem. Regression problems use MSE or MAE. Classification problems use cross-entropy. Custom losses can encode domain knowledge.
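The three losses above can be written directly in NumPy (a minimal sketch; the function names and example values are illustrative):
# Loss functions: MSE, MAE, cross-entropy
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)           # squares emphasize large errors

def mae(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))          # treats all errors equally

def cross_entropy(y_pred, y_true, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)               # avoid log(0)
    return -np.sum(y_true * np.log(y_pred)) / len(y_true)

y_true = np.array([1.0, 0.0, 2.0])
y_pred = np.array([1.1, -0.2, 1.8])
print("MSE:", mse(y_pred, y_true))                   # approx. 0.03
print("MAE:", mae(y_pred, y_true))                   # approx. 0.167

# one-hot targets vs predicted probabilities for classification
t = np.array([[1, 0, 0], [0, 1, 0]])
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print("Cross-entropy:", cross_entropy(p, t))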
The diagram compares loss functions. MSE is quadratic. MAE is linear. Cross-entropy is logarithmic.
Backpropagation
Backpropagation computes gradients efficiently. It uses chain rule to propagate errors backward. It computes gradients for all weights in one pass. It enables training deep networks.
The process starts at output layer. It computes output error. It propagates error backward through layers. Each layer computes its gradient. Gradients accumulate using chain rule.
print("Gradient shapes: "+str([g.shape for g in grad_w]))
Backpropagation is the core of neural network training. It enables efficient gradient computation. It makes deep learning practical.
Detailed Backpropagation Mathematics
Backpropagation uses chain rule from calculus. For output layer, error is δᴸ = ∇ₐC ⊙ σ'(zᴸ). C is cost function. σ' is activation derivative. ⊙ is element-wise multiplication.
For hidden layer l, error propagates backward. δˡ = ((wˡ⁺¹)ᵀ δˡ⁺¹) ⊙ σ'(zˡ). The transposed weights of the next layer carry the next layer's error back. The result is multiplied element-wise by the activation derivative of the current layer.
Gradient for weight wˡᵢⱼ is ∂C/∂wˡᵢⱼ = aˡ⁻¹ⱼ × δˡᵢ: the activation from the previous layer times the error of the current layer. Gradient for bias bˡᵢ is ∂C/∂bˡᵢ = δˡᵢ. The bias gradient equals the error.
# Detailed Backpropagation Implementation
import numpy as np

class NeuralNetworkDetailed:
    def __init__(self, layers, learning_rate=0.01):
        self.layers = layers
        self.lr = learning_rate
        self.weights = []
        self.biases = []
        self.activations = []
        self.z_values = []
        # Initialize weights and biases
        for i in range(len(layers) - 1):
            w = np.random.randn(layers[i + 1], layers[i]) * 0.1
            b = np.zeros((layers[i + 1], 1))
            self.weights.append(w)
            self.biases.append(b)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)

    def forward(self, X):
        self.activations = [X.T]
        self.z_values = []
        for i in range(len(self.weights)):
            z = self.weights[i] @ self.activations[-1] + self.biases[i]
            self.z_values.append(z)
            self.activations.append(self.sigmoid(z))
        return self.activations[-1]

    def backward(self, y):
        # Output error (quadratic cost): delta_L = (a_L - y) * sigma'(z_L), as in the equations above
        delta = (self.activations[-1] - y.T) * self.sigmoid_derivative(self.z_values[-1])
        grad_w = [None] * len(self.weights)
        grad_b = [None] * len(self.biases)
        for l in range(len(self.weights) - 1, -1, -1):
            grad_w[l] = delta @ self.activations[l].T          # dC/dw = a_prev * delta
            grad_b[l] = np.sum(delta, axis=1, keepdims=True)   # dC/db = delta
            if l > 0:
                # delta_l = (w_{l+1}^T delta_{l+1}) * sigma'(z_l)
                delta = (self.weights[l].T @ delta) * self.sigmoid_derivative(self.z_values[l - 1])
        # Gradient descent step on all weights and biases
        self.weights = [w - self.lr * gw for w, gw in zip(self.weights, grad_w)]
        self.biases = [b - self.lr * gb for b, gb in zip(self.biases, grad_b)]
        return grad_w, grad_b
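A quick check of the class above (a sketch; the layer sizes and random data are illustrative):
# Example: one forward and backward pass on random data
net = NeuralNetworkDetailed([3, 4, 1])
X = np.random.randn(5, 3)      # 5 examples, 3 features
y = np.random.randn(5, 1)      # 5 targets
net.forward(X)
grad_w, grad_b = net.backward(y)
print("Gradient shapes: " + str([g.shape for g in grad_w]))
# Gradient shapes: [(4, 3), (1, 4)]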
Gradient vanishing occurs when gradients become very small. It happens in deep networks. It prevents early layers from learning. It occurs with sigmoid and tanh activations. Their derivatives are bounded and small.
Gradient exploding occurs when gradients become very large. It causes unstable training. It happens with large weights. It causes NaN values. It occurs in recurrent networks.
Solutions include proper weight initialization. Xavier initialization uses variance 1/n. He initialization uses variance 2/n. Batch normalization normalizes activations. Residual connections provide gradient highways. Gradient clipping limits gradient magnitude.
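The two initialization schemes amount to scaling random weights by the stated variances (a minimal sketch; the function names are illustrative):
# Weight initialization: Xavier (variance 1/n) and He (variance 2/n)
import numpy as np

def xavier_init(n_in, n_out):
    # variance 1/n_in keeps activation scale stable for sigmoid/tanh layers
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out):
    # variance 2/n_in compensates for ReLU zeroing roughly half the inputs
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)

w1 = xavier_init(256, 128)
w2 = he_init(256, 128)
print("Xavier std:", round(float(w1.std()), 3))   # about sqrt(1/256) = 0.062
print("He std:", round(float(w2.std()), 3))       # about sqrt(2/256) = 0.088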
The diagram shows gradient flow. Errors propagate backward. Each layer computes its contribution. Gradients accumulate through chain rule.
Optimizers
Optimizers update weights using gradients. Different optimizers have different update rules. SGD uses simple gradient descent. Momentum adds velocity term. Adam combines momentum and adaptive learning rates.
Stochastic gradient descent is w = w - α × ∇w, where ∇w is the gradient of the loss with respect to w. Learning rate α controls step size. It is simple but can be slow. Momentum is v = βv + ∇w, w = w - αv. The velocity v accumulates gradients, with decay factor β. It accelerates convergence.
Adam adapts learning rates per parameter. It maintains momentum and variance estimates. It works well for most problems. It is the default choice for many applications.
# Optimizers
import numpy as np

class SGD:
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate

    def update(self, weights, gradients):
        return [w - self.learning_rate * g for w, g in zip(weights, gradients)]

class Adam:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.m = None   # first-moment (momentum) estimates
        self.v = None   # second-moment (variance) estimates
        self.t = 0      # time step for bias correction

    def update(self, weights, gradients):
        if self.m is None:
            self.m = [np.zeros_like(w) for w in weights]
            self.v = [np.zeros_like(w) for w in weights]
        self.t += 1
        self.m = [self.beta1 * m + (1 - self.beta1) * g for m, g in zip(self.m, gradients)]
        self.v = [self.beta2 * v + (1 - self.beta2) * g**2 for v, g in zip(self.v, gradients)]
        updated_weights = []
        for w, m, v in zip(weights, self.m, self.v):
            m_hat = m / (1 - self.beta1**self.t)   # bias-corrected momentum
            v_hat = v / (1 - self.beta2**self.t)   # bias-corrected variance
            w_new = w - self.learning_rate * m_hat / (np.sqrt(v_hat) + 1e-8)
            updated_weights.append(w_new)
        return updated_weights

# Example
weights = [np.array([[0.5], [0.3]])]
gradients = [np.array([[0.1], [0.2]])]
sgd = SGD(learning_rate=0.01)
adam = Adam(learning_rate=0.001)
w_sgd = sgd.update(weights, gradients)
w_adam = adam.update(weights, gradients)
print("SGD update: " + str(w_sgd[0].flatten()))
print("Adam update: " + str(w_adam[0].flatten()))
Choose optimizers based on problem characteristics. SGD works for simple problems. Adam works for most problems. Experiment to find the best optimizer.
The diagram compares optimizer paths. SGD follows gradients directly. Momentum follows smoothed gradients. Adam adapts step sizes per parameter.
Learning Rate Schedules
Learning rate schedules adjust step sizes during training. Fixed rates can be too large or too small. Adaptive rates improve convergence. Common schedules include step decay, exponential decay, and cosine annealing.
Step decay reduces rate at fixed intervals. Exponential decay reduces rate continuously. Cosine annealing follows cosine curve. Warmup starts with small rates. It prevents early instability.
The diagram shows different learning rate schedules. Fixed rate stays constant. Step decay reduces at intervals. Exponential decay reduces continuously. Each schedule has different convergence characteristics.
Detailed Learning Rate Schedule Strategies
Fixed learning rate uses constant value throughout training. It is simple to implement. It requires careful tuning. Too high causes instability. Too low causes slow convergence. It works for simple problems with stable loss landscapes.
Step decay reduces learning rate at fixed intervals. lr(epoch) = initial_lr × drop_factor^(epoch // drop_interval). Drop interval is typically 10-30 epochs. Drop factor is typically 0.1-0.5. It provides controlled reduction. It works well for many problems.
Exponential decay reduces learning rate continuously. lr(epoch) = initial_lr × decay_rate^epoch. Decay rate is typically 0.9-0.99. It provides smooth reduction. It requires tuning decay rate carefully. It works for problems needing gradual reduction.
Cosine annealing follows cosine curve. lr(epoch) = initial_lr × 0.5 × (1 + cos(π × epoch / max_epochs)). It starts high and ends low. It provides smooth transition. It works well for long training runs. It often improves final performance.
Warmup starts with small learning rate. It gradually increases to target rate. It prevents early instability. It helps with large batch training. Typical warmup is 5-10% of total epochs.
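The schedules above translate directly into small functions of the epoch number (a minimal sketch; the function names, defaults, and linear warmup shape are illustrative):
# Learning rate schedules
import numpy as np

def step_decay(epoch, initial_lr=0.1, drop_factor=0.5, drop_interval=10):
    return initial_lr * drop_factor ** (epoch // drop_interval)

def exponential_decay(epoch, initial_lr=0.1, decay_rate=0.95):
    return initial_lr * decay_rate ** epoch

def cosine_annealing(epoch, initial_lr=0.1, max_epochs=100):
    return initial_lr * 0.5 * (1 + np.cos(np.pi * epoch / max_epochs))

def warmup_then_cosine(epoch, initial_lr=0.1, warmup_epochs=5, max_epochs=100):
    if epoch < warmup_epochs:
        return initial_lr * (epoch + 1) / warmup_epochs   # linear warmup to the target rate
    return cosine_annealing(epoch - warmup_epochs, initial_lr, max_epochs - warmup_epochs)

for epoch in [0, 10, 50, 99]:
    print(epoch, round(step_decay(epoch), 5), round(exponential_decay(epoch), 5),
          round(cosine_annealing(epoch), 5))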
Learning rate finder identifies optimal learning rate range. It trains with exponentially increasing rates. It plots loss versus learning rate. Optimal range is where loss decreases fastest. It helps choose initial learning rate.
Process starts with very small learning rate. It increases exponentially each batch. It records loss for each rate. It stops when loss diverges. Plot shows loss curve. Steepest descent indicates good range.
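A sketch of that procedure, assuming a train_one_batch callback that runs one training step at a given learning rate and returns the loss (toy_step below is an illustrative stand-in, not a real model):
# Learning rate finder
import numpy as np

def lr_finder(train_one_batch, start_lr=1e-6, end_lr=1.0, num_steps=100, divergence_factor=4.0):
    # increase the learning rate exponentially each batch and record the loss
    lrs = np.geomspace(start_lr, end_lr, num_steps)
    losses = []
    best = np.inf
    for lr in lrs:
        loss = train_one_batch(lr)
        losses.append(loss)
        best = min(best, loss)
        if loss > divergence_factor * best:   # stop once the loss diverges
            break
    return lrs[:len(losses)], losses

def toy_step(lr, target=0.1):
    # toy "loss" that is smallest near lr = target
    return (np.log10(lr) - np.log10(target)) ** 2

lrs, losses = lr_finder(toy_step)
print("best learning rate (toy):", lrs[int(np.argmin(losses))])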
Batch Processing
Batch processing groups examples for efficiency. Large batches provide stable gradients. Small batches provide frequent updates. Mini-batches balance stability and speed.
Batch size affects training dynamics. Large batches converge smoothly but slowly. Small batches converge quickly but noisily. Typical batch sizes are 32, 64, or 128.
# Batch Processing
import numpy as np

def create_batches(X, y, batch_size=32):
    n_samples = X.shape[0]
    batches = []
    for i in range(0, n_samples, batch_size):
        end_idx = min(i + batch_size, n_samples)
        batches.append((X[i:end_idx], y[i:end_idx]))
    return batches

# Example
X = np.random.randn(100, 10)
y = np.random.randn(100, 1)
batches = create_batches(X, y, batch_size=32)
print("Number of batches: " + str(len(batches)))
print("Batch sizes: " + str([b[0].shape[0] for b in batches]))
# Result:
# Number of batches: 4
# Batch sizes: [32, 32, 32, 4]
Batch processing enables efficient training. It uses parallel computation. It provides stable gradient estimates. It is essential for large datasets.
Gradient Clipping
Gradient clipping prevents exploding gradients. It limits gradient magnitudes. It stabilizes training. It is essential for recurrent networks.
Clipping methods include norm clipping and value clipping. Norm clipping scales gradients to maximum norm. Value clipping clamps gradient values. Both prevent extreme updates.
# Gradient Clipping
import numpy as np

def clip_gradients_norm(gradients, max_norm=1.0):
    total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients))