Linear Models Overview
Linear models assume linear relationships between features and targets. They use weighted sums of features to make predictions. Linear models are simple, interpretable, and fast. They work well when relationships are approximately linear. They need feature scaling for best performance.
Linear regression predicts continuous values. Logistic regression predicts probabilities for classification. Both are trained the same way: they minimize a cost function by updating weights along its gradient.
The diagram shows linear model structure. Input features multiply by weights. Results sum with bias. Output is prediction. Training adjusts weights to minimize errors.
Linear Regression
Linear regression predicts continuous target values. The model equation is y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b. Weights w represent feature importance. Bias b represents baseline value. Training finds optimal weights and bias.
The model assumes linear relationships. It works when features correlate linearly with the target and fails when relationships are non-linear. Feature engineering, such as adding polynomial or log-transformed features, can create linear relationships from non-linear data, as the sketch below illustrates.
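As a quick illustration of that last point, here is a minimal sketch using scikit-learn's PolynomialFeatures to turn a quadratic relationship into one a linear model can fit; the synthetic data and variable names are illustrative only.

# Linearizing a non-linear relationship with feature engineering (illustrative sketch)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship: y = 2*x^2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 2 * x[:, 0] ** 2 + rng.normal(0, 0.1, size=100)

# A plain linear model fits x poorly; adding x^2 as a feature
# makes the relationship linear in the engineered features
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)  # columns: [x, x^2]

print("R^2 with x only:   ", LinearRegression().fit(x, y).score(x, y))
print("R^2 with x and x^2:", LinearRegression().fit(x_poly, y).score(x_poly, y))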
# Linear Regression Example
from sklearn.linear_model import LinearRegression
import numpy as np

# Features: [square_feet, bedrooms] | Target: price
X = np.array([[1500, 2], [2000, 3], [2500, 4], [1800, 3]])
y = np.array([250000, 350000, 450000, 300000])

model = LinearRegression()
model.fit(X, y)

# Make prediction
price = model.predict([[2200, 3]])
print("Predicted price: $" + str(int(price[0])))

# View model parameters
print("Weights: " + str(model.coef_))
print("Bias: " + str(model.intercept_))

# Expected output (approximate; this data admits an exact linear fit):
# Predicted price: $400000
# Weights: [   250. -25000.]
# Bias: -75000.0
-- NeuronDB: Linear Regression
CREATE TABLE house_sales (
    id SERIAL PRIMARY KEY,
    square_feet INTEGER,
    bedrooms INTEGER,
    price NUMERIC
);

INSERT INTO house_sales (square_feet, bedrooms, price) VALUES
(1500, 2, 250000), (2000, 3, 350000),
(2500, 4, 450000), (1800, 3, 300000);

CREATE TEMP TABLE price_model AS
SELECT neurondb.train(
    'default',
    'linear_regression',
    'house_sales',
    'price',
    ARRAY['square_feet', 'bedrooms'],
    '{}'::jsonb
)::integer AS model_id;

SELECT neurondb.predict(
    (SELECT model_id FROM price_model),
    ARRAY[2200::NUMERIC, 3::NUMERIC]
) AS predicted_price;

-- Result: predicted_price near 400000 (this data has an exact linear fit;
-- the value returned depends on the trainer's default settings)
Linear regression minimizes mean squared error. MSE measures average squared differences between predictions and actual values. Lower MSE means better fit. Training adjusts weights to reduce MSE.
Detailed Linear Regression Mathematics
The linear regression model is y = Xw + b. X is the feature matrix with n samples and m features. w is the weight vector with m elements. b is the bias scalar. y is the target vector with n elements.
The cost function is J(w, b) = (1/2n) Σ(y_pred - y_true)². The factor 1/2 simplifies derivative calculations. The derivative with respect to weight wⱼ is ∂J/∂wⱼ = (1/n) Σ(y_pred - y_true) × xⱼ. The derivative with respect to bias b is ∂J/∂b = (1/n) Σ(y_pred - y_true).
The closed-form solution uses the normal equation: w = (XᵀX)⁻¹Xᵀy. This gives the exact solution in one step but requires computing a matrix inverse. It works for small to medium datasets and fails when XᵀX is singular (for example, when features are perfectly collinear); the pseudoinverse is a common workaround, as sketched below.
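The snippet below is a minimal sketch of the normal equation in NumPy. It uses np.linalg.pinv (the Moore-Penrose pseudoinverse) rather than a plain inverse so it still returns a solution when XᵀX is singular; the data reuses the house example above.

# Normal equation sketch: w = (X^T X)^+ X^T y, using the pseudoinverse for safety
import numpy as np

X = np.array([[1500., 2.], [2000., 3.], [2500., 4.], [1800., 3.]])
y = np.array([250000., 350000., 450000., 300000.])

# Append a column of ones so the bias is learned as an extra weight
Xb = np.c_[np.ones(len(X)), X]

# pinv handles rank-deficient X^T X, where a plain inverse would fail
w = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y
print("bias:", w[0], "weights:", w[1:])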
Gradient descent solution iteratively updates weights. w = w - α × ∇w J. It works for large datasets. It doesn't require matrix inversion. It converges to solution gradually. Learning rate α controls convergence speed.
# Detailed Linear Regression Implementation
import numpy as np

class LinearRegressionDetailed:
    def __init__(self, method='gradient_descent', learning_rate=0.01, max_iter=1000):
        self.method = method
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.weights = None
        self.bias = None
        self.cost_history = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize weights and bias
        self.weights = np.zeros(n_features)
        self.bias = 0

        if self.method == 'normal_equation':
            # Add bias column
            X_with_bias = np.c_[np.ones(n_samples), X]
            # Normal equation via the pseudoinverse: w = (X^T X)^+ X^T y
            # (pinv handles the singular X^T X that arises here because x2 = x1 + 1)
            weights_with_bias = np.linalg.pinv(X_with_bias.T @ X_with_bias) @ X_with_bias.T @ y
            self.bias = weights_with_bias[0]
            self.weights = weights_with_bias[1:]
        else:
            # Gradient descent
            for i in range(self.max_iter):
                # Predictions
                y_pred = X @ self.weights + self.bias
                # Compute gradients
                dw = (1/n_samples) * X.T @ (y_pred - y)
                db = (1/n_samples) * np.sum(y_pred - y)
                # Update weights
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db
                # Track cost
                cost = (1/(2*n_samples)) * np.sum((y_pred - y)**2)
                self.cost_history.append(cost)

    def predict(self, X):
        return X @ self.weights + self.bias

# Example
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([3, 5, 7, 9])

# Normal equation method
model_normal = LinearRegressionDetailed(method='normal_equation')
model_normal.fit(X, y)
print("Normal equation weights: " + str(model_normal.weights))
print("Normal equation bias: " + str(model_normal.bias))

# Gradient descent method
model_gd = LinearRegressionDetailed(method='gradient_descent', learning_rate=0.01, max_iter=1000)
model_gd.fit(X, y)
print("Gradient descent weights: " + str(model_gd.weights))
print("Gradient descent bias: " + str(model_gd.bias))
print("Final cost: " + str(model_gd.cost_history[-1]))
Assumptions and Diagnostics
Linear regression assumes linear relationships. It assumes independent observations. It assumes homoscedasticity. It assumes normally distributed errors. Violations affect model validity.
Check linearity using scatter plots. Plot residuals against predictions. Patterns indicate non-linearity. Check independence by examining residual autocorrelation. Time series data often violates independence.
Check homoscedasticity using residual plots. Constant variance appears as random scatter. Funnel shapes indicate heteroscedasticity. Check normality using Q-Q plots. Deviations from diagonal indicate non-normality.
# Regression Diagnostics
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred

# Residual plot
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')

# Q-Q plot for normality
plt.subplot(1, 3, 2)
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')

# Residual histogram
plt.subplot(1, 3, 3)
plt.hist(residuals, bins=10, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residual Distribution')

plt.tight_layout()
plt.show()

# Statistical tests
# Durbin-Watson for independence (close to 2 is good)
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(residuals)
print("Durbin-Watson statistic: " + str(dw_stat))

# Shapiro-Wilk for normality
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print("Shapiro-Wilk p-value: " + str(shapiro_p))
The diagram shows linear regression fitting. Data points scatter around a line. The model finds the line that minimizes squared distances. The line represents the learned relationship.
Logistic Regression
Logistic regression predicts probabilities for binary classification. It uses the sigmoid function to map linear combinations to 0-1 range. Probabilities above 0.5 predict class 1. Probabilities below 0.5 predict class 0.
The sigmoid function is σ(z) = 1 / (1 + e^(-z)). It transforms any number to 0-1 range. Large positive z gives probability near 1. Large negative z gives probability near 0. Zero z gives probability 0.5.
# Logistic Regression Example
from sklearn.linear_model import LogisticRegression
import numpy as np

# Features: [age, income] | Labels: 0=no loan, 1=loan approved
X = np.array([[25, 30000], [35, 50000], [45, 80000], [30, 40000]])
y = np.array([0, 1, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# Make prediction
prediction = model.predict([[40, 60000]])
probability = model.predict_proba([[40, 60000]])
print("Prediction: " + str(prediction[0]))
print("Probability: " + str(probability[0]))

# Example output (exact probabilities depend on the solver):
# Prediction: 1
# Probability: roughly [0.2 0.8]  -> [P(no loan), P(approved)]
-- NeuronDB: Logistic Regression
CREATE TABLE loan_applications (
    id SERIAL PRIMARY KEY,
    age INTEGER,
    income NUMERIC,
    approved BOOLEAN
);

INSERT INTO loan_applications (age, income, approved) VALUES
(25, 30000, false), (35, 50000, true),
(45, 80000, true), (30, 40000, false);

CREATE TEMP TABLE loan_model AS
SELECT neurondb.train(
    'default',
    'logistic_regression',
    'loan_applications',
    'approved',
    ARRAY['age', 'income'],
    '{"max_iters": 1000, "learning_rate": 0.01}'::jsonb
)::integer AS model_id;

SELECT neurondb.predict(
    (SELECT model_id FROM loan_model),
    ARRAY[40::NUMERIC, 60000::NUMERIC]
) AS prediction;

-- Result:
--  prediction
-- ------------
--  t
-- (1 row)
Logistic regression minimizes cross-entropy loss. Cross-entropy measures difference between predicted probabilities and true labels. It penalizes confident wrong predictions more than uncertain wrong predictions.
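A quick numeric check of that penalty behavior, using scikit-learn's log_loss on a single example whose true class is 1 (values in the comments are rounded):

# Cross-entropy penalizes confident mistakes far more than uncertain ones
from sklearn.metrics import log_loss

y_true = [1]  # the true class is 1

print(log_loss(y_true, [[0.4, 0.6]], labels=[0, 1]))    # mildly confident, correct -> ~0.51
print(log_loss(y_true, [[0.6, 0.4]], labels=[0, 1]))    # mildly confident, wrong   -> ~0.92
print(log_loss(y_true, [[0.99, 0.01]], labels=[0, 1]))  # very confident, wrong     -> ~4.61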
Detailed Logistic Regression Mathematics
The logistic regression model uses sigmoid activation. z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b. p = σ(z) = 1 / (1 + e^(-z)). p is the predicted probability of class 1.
The sigmoid function maps any real number to (0, 1). When z is large positive, p approaches 1. When z is large negative, p approaches 0. When z is zero, p equals 0.5. The derivative is σ'(z) = σ(z)(1 - σ(z)).
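A short numeric sketch of those properties, including a finite-difference check of the derivative identity σ'(z) = σ(z)(1 - σ(z)):

# Sigmoid values and a finite-difference check of its derivative
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in [-10, 0, 10]:
    print("sigmoid(" + str(z) + ") =", sigmoid(z))
# sigmoid(-10) ~ 0.00005, sigmoid(0) = 0.5, sigmoid(10) ~ 0.99995

z = 0.7
analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
print("analytic derivative:", analytic, "numeric derivative:", numeric)  # both ~0.222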
The cost function is cross-entropy. For binary classification: J = -(1/n) Σ[y log(p) + (1-y) log(1-p)]. This measures probability distribution difference. It penalizes confident wrong predictions heavily.
The gradient with respect to weight wⱼ is ∂J/∂wⱼ = (1/n) Σ(p - y) × xⱼ. The gradient with respect to bias b is ∂J/∂b = (1/n) Σ(p - y). These gradients have the same form as the linear regression gradients, a convenient consequence of pairing the sigmoid with cross-entropy loss.
# Detailed Logistic Regression Implementation
import numpy as np

class LogisticRegressionDetailed:
    def __init__(self, learning_rate=0.01, max_iter=1000, threshold=0.5):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.threshold = threshold
        self.weights = None
        self.bias = None
        self.cost_history = []

    def sigmoid(self, z):
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize weights and bias
        self.weights = np.zeros(n_features)
        self.bias = 0

        for i in range(self.max_iter):
            # Forward pass
            z = X @ self.weights + self.bias
            p = self.sigmoid(z)
            # Compute gradients
            dw = (1/n_samples) * X.T @ (p - y)
            db = (1/n_samples) * np.sum(p - y)
            # Update weights
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            # Track cost
            cost = -(1/n_samples) * np.sum(y * np.log(p + 1e-15) + (1-y) * np.log(1-p + 1e-15))
            self.cost_history.append(cost)

    def predict_proba(self, X):
        z = X @ self.weights + self.bias
        return self.sigmoid(z)

    def predict(self, X):
        probabilities = self.predict_proba(X)
        return (probabilities >= self.threshold).astype(int)

# Example
X = np.array([[25, 30000], [35, 50000], [45, 80000], [30, 40000], [50, 100000]])
y = np.array([0, 1, 1, 0, 1])

model = LogisticRegressionDetailed(learning_rate=0.0001, max_iter=10000)
model.fit(X, y)

predictions = model.predict(X)
probabilities = model.predict_proba(X)
print("Predictions: " + str(predictions))
print("Probabilities: " + str(probabilities))
print("Final cost: " + str(model.cost_history[-1]))
Multi-class Logistic Regression
Multi-class logistic regression extends binary classification. It uses softmax activation instead of sigmoid. Softmax converts logits to probability distributions. It ensures probabilities sum to one.
The softmax function is softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ. Each class gets a probability. The class with highest probability is predicted. This is called multinomial logistic regression.
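A minimal NumPy sketch of softmax, with the usual max-subtraction for numerical stability (which does not change the result):

# Softmax: convert a vector of logits into a probability distribution
import numpy as np

def softmax(z):
    z = z - np.max(z)   # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)             # approximately [0.659 0.242 0.099]
print(probs.sum())       # 1.0
print(np.argmax(probs))  # predicted class: 0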
One-vs-rest strategy trains separate binary classifiers. Each classifier distinguishes one class from all others, and predictions combine all classifier outputs. One-vs-one strategy trains a classifier for each pair of classes; it needs more classifiers but can be more accurate.
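Both strategies are available as wrappers in scikit-learn. The sketch below uses synthetic data similar to the next example; the dataset parameters are illustrative.

# One-vs-rest and one-vs-one wrappers around logistic regression
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = make_classification(n_samples=300, n_features=4, n_classes=3,
                           n_informative=3, n_redundant=1, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print("OvR classifiers trained:", len(ovr.estimators_))  # 3 (one per class)
print("OvO classifiers trained:", len(ovo.estimators_))  # 3 (one per pair of classes)
print("OvR training accuracy:", ovr.score(X, y))
print("OvO training accuracy:", ovo.score(X, y))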
# Multi-class Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Generate multi-class data
X, y = make_classification(n_samples=1000, n_features=4, n_classes=3,
                           n_informative=3, n_redundant=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Multi-class logistic regression
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

print("Classification Report:")
print(classification_report(y_test, predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
print("Class probabilities for first test sample:")
print(probabilities[0])
The diagram shows logistic regression decision boundary. Data points belong to two classes. The sigmoid curve separates classes. Points on one side predict class 0. Points on other side predict class 1.
Cost Functions
Cost functions measure prediction errors. They guide training by indicating error direction and magnitude. Different problems use different cost functions. Regression uses MSE or MAE. Classification uses cross-entropy.
Mean squared error is MSE = (1/n) Σ(y_pred - y_true)². It emphasizes large errors. A prediction off by 10 contributes 100 to cost. A prediction off by 1 contributes 1 to cost. MSE is sensitive to outliers.
Mean absolute error is MAE = (1/n) Σ|y_pred - y_true|. It treats all errors equally. A prediction off by 10 contributes 10 to cost. A prediction off by 1 contributes 1 to cost. MAE is robust to outliers.
# Cost Functions Example
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss

# Regression predictions
y_true_reg = np.array([100, 200, 300, 400])
y_pred_reg = np.array([110, 190, 310, 390])

mse = mean_squared_error(y_true_reg, y_pred_reg)
mae = mean_absolute_error(y_true_reg, y_pred_reg)
print("MSE: " + str(mse))
print("MAE: " + str(mae))
# Result:
# MSE: 100.0
# MAE: 10.0

# Classification predictions
y_true_clf = np.array([0, 1, 1, 0])
y_pred_proba = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1]])

ce_loss = log_loss(y_true_clf, y_pred_proba)
print("Cross-entropy: " + str(ce_loss))
# Result (approximately; the first and third samples are confidently wrong,
# which drives the loss up):
# Cross-entropy: 1.06
Cross-entropy loss is CE = -Σ y_true × log(y_pred). It measures probability distribution differences. It penalizes confident wrong predictions heavily. It rewards confident correct predictions. It works well for classification.
The diagram compares cost functions. MSE curve is quadratic. MAE curve is linear. Cross-entropy curve is logarithmic. Each has different sensitivity to errors.
Gradient Descent
Gradient descent minimizes cost functions. It calculates cost gradients with respect to weights. It updates weights in the direction that reduces cost. It repeats until convergence or maximum iterations.
The update rule is w = w - α × ∇w J. Learning rate α controls step size. The gradient ∇w J points in the direction of steepest cost increase, so stepping along the negative gradient reduces cost. Small learning rates converge slowly but precisely. Large learning rates converge quickly but may overshoot, as the sketch below shows.
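The toy sketch below runs gradient descent on the one-dimensional cost J(w) = w², whose gradient is 2w, to show the effect of the learning rate; the specific rates are illustrative.

# Effect of the learning rate on gradient descent for J(w) = w^2 (gradient 2w)
def run(learning_rate, steps=20, w=5.0):
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

print("lr=0.01 ->", run(0.01))  # converges slowly, still far from 0
print("lr=0.1  ->", run(0.1))   # converges smoothly toward 0
print("lr=1.1  ->", run(1.1))   # overshoots and diverges (|w| grows each step)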
Detailed Gradient Descent Variants
Batch gradient descent uses all training data per update. It computes gradients over entire dataset. It provides stable convergence. It requires memory for all data. It is slow for large datasets.
Stochastic gradient descent uses one sample per update. It computes gradients for single example. It updates weights immediately. It converges faster but noisier. It requires careful learning rate tuning.
Mini-batch gradient descent uses small batches. It balances stability and speed. Typical batch sizes are 32, 64, or 128. It provides smoother convergence than SGD. It is faster than batch gradient descent.
# Detailed Gradient Descent Variants
import numpy as np
import matplotlib.pyplot as plt

def generate_data():
    np.random.seed(42)
    X = np.random.randn(1000, 2)
    y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + 0.1 * np.random.randn(1000)
    return X, y

X, y = generate_data()

class GradientDescentVariants:
    def __init__(self, learning_rate=0.01):
        self.lr = learning_rate
        self.cost_history = []

    def compute_cost(self, X, y, w, b):
        predictions = X @ w + b
        return np.mean((predictions - y)**2) / 2

    def batch_gradient_descent(self, X, y, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        for i in range(max_iter):
            predictions = X @ w + b
            dw = (1/n_samples) * X.T @ (predictions - y)
            db = (1/n_samples) * np.sum(predictions - y)
            w -= self.lr * dw
            b -= self.lr * db
            cost = self.compute_cost(X, y, w, b)
            self.cost_history.append(cost)
        return w, b

    def stochastic_gradient_descent(self, X, y, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        for i in range(max_iter):
            for j in range(n_samples):
                x_sample = X[j:j+1]
                y_sample = y[j:j+1]
                prediction = x_sample @ w + b
                dw = x_sample.T @ (prediction - y_sample)
                db = (prediction - y_sample).item()  # scalar, so b stays a float
                w -= self.lr * dw
                b -= self.lr * db
            cost = self.compute_cost(X, y, w, b)
            self.cost_history.append(cost)
        return w, b

    def mini_batch_gradient_descent(self, X, y, batch_size=32, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        for i in range(max_iter):
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            for j in range(0, n_samples, batch_size):
                X_batch = X_shuffled[j:j+batch_size]
                y_batch = y_shuffled[j:j+batch_size]
                predictions = X_batch @ w + b
                dw = (1/len(X_batch)) * X_batch.T @ (predictions - y_batch)
                db = (1/len(X_batch)) * np.sum(predictions - y_batch)
                w -= self.lr * dw
                b -= self.lr * db
            cost = self.compute_cost(X, y, w, b)
            self.cost_history.append(cost)
        return w, b

# Compare methods
gd = GradientDescentVariants(learning_rate=0.01)

w_batch, b_batch = gd.batch_gradient_descent(X, y, max_iter=50)
cost_batch = gd.cost_history.copy()

gd.cost_history = []
w_sgd, b_sgd = gd.stochastic_gradient_descent(X, y, max_iter=50)
cost_sgd = gd.cost_history.copy()

gd.cost_history = []
w_mini, b_mini = gd.mini_batch_gradient_descent(X, y, batch_size=32, max_iter=50)
cost_mini = gd.cost_history.copy()

print("Batch GD final cost: " + str(cost_batch[-1]))
print("SGD final cost: " + str(cost_sgd[-1]))
print("Mini-batch GD final cost: " + str(cost_mini[-1]))
Advanced Optimization Techniques
Momentum accumulates gradient history: v = βv + ∇w J, then w = w - αv. The momentum coefficient β is typically 0.9. It smooths gradient updates, accelerates convergence in consistent directions, and helps escape shallow local minima.
Nesterov accelerated gradient looks ahead: it evaluates the gradient at the predicted position, v = βv + ∇w J(w - βv), then w = w - αv. This corrects momentum overshooting and typically converges faster than standard momentum.
RMSprop adapts learning rates per parameter. It maintains moving average of squared gradients. s = βs + (1-β)(∇w)², w = w - α∇w/√(s + ε). It reduces learning rate for large gradients. It increases learning rate for small gradients.
Adam combines momentum and RMSprop. It maintains both first and second moment estimates. It provides adaptive learning rates per parameter. It works well for most problems. It is the default choice for many applications.
# Advanced Optimizers Detailed
class AdvancedOptimizers:
    def __init__(self, learning_rate=0.001):
        self.lr = learning_rate
        self.cost_history = []

    def momentum(self, X, y, beta=0.9, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        v_w = np.zeros(n_features)
        v_b = 0
        for i in range(max_iter):
            predictions = X @ w + b
            dw = (1/n_samples) * X.T @ (predictions - y)
            db = (1/n_samples) * np.sum(predictions - y)
            v_w = beta * v_w + dw
            v_b = beta * v_b + db
            w -= self.lr * v_w
            b -= self.lr * v_b
            cost = np.mean((predictions - y)**2) / 2
            self.cost_history.append(cost)
        return w, b

    def rmsprop(self, X, y, beta=0.9, epsilon=1e-8, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        s_w = np.zeros(n_features)
        s_b = 0
        for i in range(max_iter):
            predictions = X @ w + b
            dw = (1/n_samples) * X.T @ (predictions - y)
            db = (1/n_samples) * np.sum(predictions - y)
            s_w = beta * s_w + (1 - beta) * dw**2
            s_b = beta * s_b + (1 - beta) * db**2
            w -= self.lr * dw / (np.sqrt(s_w) + epsilon)
            b -= self.lr * db / (np.sqrt(s_b) + epsilon)
            cost = np.mean((predictions - y)**2) / 2
            self.cost_history.append(cost)
        return w, b

    def adam(self, X, y, beta1=0.9, beta2=0.999, epsilon=1e-8, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        m_w = np.zeros(n_features)
        m_b = 0
        v_w = np.zeros(n_features)
        v_b = 0
        t = 0
        for i in range(max_iter):
            t += 1
            predictions = X @ w + b
            dw = (1/n_samples) * X.T @ (predictions - y)
            db = (1/n_samples) * np.sum(predictions - y)
            # Update biased first moment
            m_w = beta1 * m_w + (1 - beta1) * dw
            m_b = beta1 * m_b + (1 - beta1) * db
            # Update biased second moment
            v_w = beta2 * v_w + (1 - beta2) * dw**2
            v_b = beta2 * v_b + (1 - beta2) * db**2
            # Bias correction
            m_w_hat = m_w / (1 - beta1**t)
            m_b_hat = m_b / (1 - beta1**t)
            v_w_hat = v_w / (1 - beta2**t)
            v_b_hat = v_b / (1 - beta2**t)
            # Update parameters
            w -= self.lr * m_w_hat / (np.sqrt(v_w_hat) + epsilon)
            b -= self.lr * m_b_hat / (np.sqrt(v_b_hat) + epsilon)
            cost = np.mean((predictions - y)**2) / 2
            self.cost_history.append(cost)
        return w, b

# Compare optimizers
opt = AdvancedOptimizers(learning_rate=0.01)

w_mom, b_mom = opt.momentum(X, y, max_iter=100)
cost_mom = opt.cost_history.copy()

opt.cost_history = []
w_rms, b_rms = opt.rmsprop(X, y, max_iter=100)
cost_rms = opt.cost_history.copy()

opt.cost_history = []
w_adam, b_adam = opt.adam(X, y, max_iter=100)
cost_adam = opt.cost_history.copy()

print("Momentum final cost: " + str(cost_mom[-1]))
print("RMSprop final cost: " + str(cost_rms[-1]))
print("Adam final cost: " + str(cost_adam[-1]))
# Gradient Descent Example
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    m, n = X.shape
    weights = np.zeros(n)
    bias = 0

    for i in range(iterations):
        # Predictions
        y_pred = X.dot(weights) + bias
        # Calculate gradients
        dw = (1/m) * X.T.dot(y_pred - y)
        db = (1/m) * np.sum(y_pred - y)
        # Update weights
        weights -= learning_rate * dw
        bias -= learning_rate * db
        # Calculate cost
        if i % 100 == 0:
            cost = (1/(2*m)) * np.sum((y_pred - y)**2)
            print("Iteration " + str(i) + ", Cost: " + str(cost))

    return weights, bias

# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([3, 5, 7, 9])

weights, bias = gradient_descent(X, y)
print("Final weights: " + str(weights))
print("Final bias: " + str(bias))

# Expected behavior:
# Iteration 0, Cost: 20.5
# ... cost decreases toward zero ...
# The learned parameters approximate an exact fit such as y = x1 + x2
# (this data admits several exact fits because the features are collinear).
-- NeuronDB: Gradient Descent Training
CREATE TABLE training_data (
    id SERIAL PRIMARY KEY,
    feature1 NUMERIC,
    feature2 NUMERIC,
    target NUMERIC
);

INSERT INTO training_data (feature1, feature2, target) VALUES
(1, 2, 3), (2, 3, 5), (3, 4, 7), (4, 5, 9);

CREATE TEMP TABLE model AS
SELECT neurondb.train(
    'default',
    'linear_regression',
    'training_data',
    'target',
    ARRAY['feature1', 'feature2'],
    '{"learning_rate": 0.01, "max_iters": 1000}'::jsonb
)::integer AS model_id;

SELECT neurondb.get_weights((SELECT model_id FROM model));
-- Returns learned weights and bias
Gradient descent has variants. Batch gradient descent uses all data per update. Stochastic gradient descent uses one example per update. Mini-batch gradient descent uses small batches. Each variant has different convergence properties.
The diagram shows gradient descent optimization. The cost surface has a valley. The algorithm starts at a random point. It follows gradients downhill. It converges to the minimum.
Model Evaluation
You evaluate linear models using appropriate metrics. Regression uses MSE, MAE, or R-squared. Classification uses accuracy, precision, recall, or F1 score. Choose metrics matching your goals.
R-squared measures explained variance. R² = 1 - (SS_res / SS_tot). SS_res is sum of squared residuals. SS_tot is total sum of squares. R² near 1 means good fit. R² near 0 means poor fit.
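A small sketch computing R² directly from those sums of squares and checking it against scikit-learn's r2_score; the toy numbers are illustrative.

# R-squared from its definition, checked against sklearn
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print("manual R^2: ", r2_manual)
print("sklearn R^2:", r2_score(y_true, y_pred))   # identical value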
# Model Evaluation Example
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, precision_score, recall_score
import numpy as np

# Regression evaluation
X_reg = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y_reg = np.array([3, 5, 7, 9])

model_reg = LinearRegression()
model_reg.fit(X_reg, y_reg)
y_pred_reg = model_reg.predict(X_reg)

r2 = r2_score(y_reg, y_pred_reg)
mse = mean_squared_error(y_reg, y_pred_reg)
print("R-squared: " + str(r2))
print("MSE: " + str(mse))
# Result:
# R-squared: 1.0
# MSE: 0.0 (up to floating-point error)

# Classification evaluation
X_clf = np.array([[25, 30000], [35, 50000], [45, 80000], [30, 40000]])
y_clf = np.array([0, 1, 1, 0])

model_clf = LogisticRegression()
model_clf.fit(X_clf, y_clf)
y_pred_clf = model_clf.predict(X_clf)

accuracy = accuracy_score(y_clf, y_pred_clf)
precision = precision_score(y_clf, y_pred_clf)
recall = recall_score(y_clf, y_pred_clf)
print("Accuracy: " + str(accuracy))
print("Precision: " + str(precision))
print("Recall: " + str(recall))
# Result (evaluated on the training data, so scores are optimistic):
# Accuracy: 1.0
# Precision: 1.0
# Recall: 1.0
-- NeuronDB: Model Evaluation
CREATE TABLE test_data (
    id SERIAL PRIMARY KEY,
    feature1 NUMERIC,
    feature2 NUMERIC,
    target NUMERIC
);

INSERT INTO test_data (feature1, feature2, target) VALUES
(5, 6, 11), (6, 7, 13);

CREATE TEMP TABLE eval_model AS
SELECT neurondb.train(
    'default',
    'linear_regression',
    'training_data',
    'target',
    ARRAY['feature1', 'feature2'],
    '{}'::jsonb
)::integer AS model_id;

SELECT
    neurondb.predict((SELECT model_id FROM eval_model), ARRAY[feature1, feature2]) AS prediction,
    target AS actual,
    ABS(neurondb.predict((SELECT model_id FROM eval_model), ARRAY[feature1, feature2]) - target) AS error
FROM test_data;
Evaluation requires separate test data. Never evaluate on training data. Training data gives optimistic results. Test data gives realistic results. Use cross-validation for robust evaluation.
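A minimal cross-validation sketch with scikit-learn; the synthetic regression data is illustrative.

# K-fold cross-validation gives a more robust estimate than a single split
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("Per-fold R^2:", scores)
print("Mean R^2:", scores.mean(), "+/-", scores.std())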
The diagram shows evaluation workflow. You split data into train and test sets. You train on training data. You evaluate on test data. You compare predictions to actual values. You calculate metrics.
Feature Scaling Importance
Linear models benefit greatly from feature scaling, especially when trained with gradient descent or regularization. Without scaling, features with larger numeric ranges dominate the weight updates and penalty terms. Scaling makes all features contribute on comparable terms. Normalization maps values to the 0-1 range. Standardization centers each feature at zero with unit variance.
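A short sketch showing both transforms side by side on the same toy column; the example values are illustrative.

# Normalization (min-max to 0-1) versus standardization (zero mean, unit variance)
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[20.0], [30.0], [40.0], [60.0]])

print("normalized:  ", MinMaxScaler().fit_transform(ages).ravel())
# [0.   0.25 0.5  1.  ]
print("standardized:", StandardScaler().fit_transform(ages).ravel())
# approximately [-1.18 -0.51  0.17  1.52]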
Without scaling, income (0-1M) dominates age (0-100): weight updates and penalties are driven almost entirely by the income feature, so age patterns are largely ignored. With scaling, both features contribute and the model can learn from both.
# Feature Scaling Impact
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

# Unscaled data
X_unscaled = np.array([[25, 50000], [30, 75000], [35, 100000]])
y = np.array([100, 150, 200])

model_unscaled = LinearRegression()
model_unscaled.fit(X_unscaled, y)
print("Unscaled weights: " + str(model_unscaled.coef_))
# Result: Unscaled weights: approximately [0. 0.002]

# Scaled data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_unscaled)

model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y)
print("Scaled weights: " + str(model_scaled.coef_))
# Result: Scaled weights: approximately [20.4 20.4]
With scaling, the learned weights are comparable in magnitude and each feature contributes meaningfully. Without scaling, the weights are dominated by the feature with the largest numeric range, so its contribution swamps the others.
Regularization Basics
Regularization prevents overfitting. It adds penalty terms to cost functions. L1 regularization adds absolute weight penalties. L2 regularization adds squared weight penalties. Both reduce model complexity.
L1 regularization encourages sparsity. It drives some weights to zero. It performs feature selection automatically. L2 regularization shrinks all weights. It keeps all features but reduces their impact.
# Regularization Example
from sklearn.linear_model import Ridge, Lasso
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([3, 5, 7, 9])

# L2 regularization (Ridge)
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge weights: " + str(ridge.coef_))

# L1 regularization (Lasso)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso weights: " + str(lasso.coef_))

# Result (approximate):
# Ridge weights: [0.91 0.91]   <- L2 shrinks both collinear features evenly
# Lasso weights: [1.92 0.  ]   <- L1 zeroes out one of the collinear features
Regularization strength controls tradeoff. Strong regularization reduces overfitting but increases underfitting. Weak regularization reduces underfitting but increases overfitting. Tune regularization strength using validation data.
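One common way to tune the strength is scikit-learn's RidgeCV, which picks the alpha with the best cross-validated score; the synthetic data and candidate alphas below are illustrative.

# Choosing the regularization strength with cross-validation
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=10, noise=15, random_state=0)

alphas = np.logspace(-3, 3, 13)   # candidate strengths from 0.001 to 1000
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("Best alpha:", model.alpha_)
print("Shrunken weights:", model.coef_.round(2))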
The diagram shows regularization effects. Without regularization, weights can be large. Large weights cause overfitting. With regularization, weights shrink. Smaller weights reduce overfitting.
Complete Example: House Price Prediction
This example demonstrates complete linear regression workflow.
# Complete Linear Regression Example
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import pandas as pd
import numpy as np

# Load and prepare data
data = pd.DataFrame({
    'square_feet': [1500, 2000, 2500, 1800, 2200, 1900, 2100],
    'bedrooms': [2, 3, 4, 3, 3, 2, 3],
    'age': [5, 10, 15, 8, 12, 6, 9],
    'price': [250000, 350000, 450000, 300000, 380000, 280000, 360000]
})

X = data[['square_feet', 'bedrooms', 'age']]
y = data['price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("R-squared: " + str(round(r2, 3)))
print("MSE: " + str(int(mse)))
print("RMSE: $" + str(int(np.sqrt(mse))))

# Make prediction (use a DataFrame so column names match the fitted scaler)
new_house = scaler.transform(pd.DataFrame([[2000, 3, 10]],
                                          columns=['square_feet', 'bedrooms', 'age']))
predicted_price = model.predict(new_house)
print("Predicted price: $" + str(int(predicted_price[0])))

# Output values depend on the random train/test split; with only 7 rows
# the metrics vary noticeably from run to run.
-- NeuronDB: Complete Linear Regression
CREATE TABLE house_data (
    id SERIAL PRIMARY KEY,
    square_feet INTEGER,
    bedrooms INTEGER,
    age INTEGER,
    price NUMERIC
);

INSERT INTO house_data (square_feet, bedrooms, age, price) VALUES
(1500, 2, 5, 250000), (2000, 3, 10, 350000),
(2500, 4, 15, 450000), (1800, 3, 8, 300000),
(2200, 3, 12, 380000), (1900, 2, 6, 280000),
(2100, 3, 9, 360000);

-- Split into train and test
CREATE TABLE train_data AS
SELECT * FROM house_data WHERE id <= 5;

CREATE TABLE test_data AS
SELECT * FROM house_data WHERE id > 5;

-- Train model
CREATE TEMP TABLE price_model AS
SELECT neurondb.train(
    'default',
    'linear_regression',
    'train_data',
    'price',
    ARRAY['square_feet', 'bedrooms', 'age'],
    '{}'::jsonb
)::integer AS model_id;

-- Evaluate on test data
SELECT
    id,
    price AS actual_price,
    neurondb.predict((SELECT model_id FROM price_model),
        ARRAY[square_feet::NUMERIC, bedrooms::NUMERIC, age::NUMERIC]) AS predicted_price,
    ABS(price - neurondb.predict((SELECT model_id FROM price_model),
        ARRAY[square_feet::NUMERIC, bedrooms::NUMERIC, age::NUMERIC])) AS error
FROM test_data;

-- Make prediction
SELECT neurondb.predict(
    (SELECT model_id FROM price_model),
    ARRAY[2000::NUMERIC, 3::NUMERIC, 10::NUMERIC]
) AS predicted_price;
Summary
Linear models assume linear relationships between features and targets. Linear regression predicts continuous values. Logistic regression predicts classification probabilities. Both use gradient descent to minimize cost functions. Cost functions measure prediction errors. MSE works for regression. Cross-entropy works for classification. Feature scaling is essential for linear models. Regularization prevents overfitting. L1 encourages sparsity. L2 shrinks weights. Evaluation requires separate test data. Use appropriate metrics for your problem type.