Linear models assume linear relationships between features and targets. They use weighted sums of features to make predictions. Linear models are simple, interpretable, and fast. They work well when relationships are approximately linear. They need feature scaling for best performance.
Linear regression predicts continuous values. Logistic regression predicts probabilities for classification. Both use similar training methods. Both minimize cost functions. Both update weights using gradients.
The diagram shows the structure of a linear model. Input features are multiplied by weights. The weighted results are summed with a bias term. The output is the prediction. Training adjusts the weights to minimize errors.
Linear Regression
Linear regression predicts continuous target values. The model equation is y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b. Weights w represent feature importance. Bias b represents baseline value. Training finds optimal weights and bias.
The model assumes linear relationships. It works when features correlate linearly with target. It fails when relationships are non-linear. Feature engineering can create linear relationships from non-linear data.
Linear regression minimizes mean squared error. MSE measures average squared differences between predictions and actual values. Lower MSE means better fit. Training adjusts weights to reduce MSE.
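As a concrete illustration, the weighted-sum prediction and MSE can be computed directly with NumPy. The feature values, weights, and bias below are made up for illustration.
# Weighted-sum prediction and MSE (illustrative values)
import numpy as np
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])  # 3 samples, 2 features
w = np.array([0.4, 1.1])                             # weights (assumed)
b = 0.5                                              # bias (assumed)
y_true = np.array([3.0, 1.9, 3.4])
y_pred = X @ w + b                                   # y = w1*x1 + w2*x2 + b for each row
print("Predictions:", y_pred)
print("MSE:", np.mean((y_pred - y_true) ** 2))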
Detailed Linear Regression Mathematics
The linear regression model is y = Xw + b. X is the feature matrix with n samples and m features. w is the weight vector with m elements. b is the bias scalar. y is the target vector with n elements.
The cost function is J(w, b) = (1/2n) Σ(y_pred - y_true)². The factor 1/2 simplifies derivative calculations. The derivative with respect to weight wⱼ is ∂J/∂wⱼ = (1/n) Σ(y_pred - y_true) × xⱼ. The derivative with respect to bias b is ∂J/∂b = (1/n) Σ(y_pred - y_true).
The closed-form solution uses the normal equation: w = (XᵀX)⁻¹Xᵀy (appending a column of ones to X absorbs the bias into w). It gives the exact solution in one step. It requires solving a linear system involving XᵀX, so it works for small to medium datasets. It fails when XᵀX is singular.
Gradient descent solution iteratively updates weights. w = w - α × ∇w J. It works for large datasets. It doesn't require matrix inversion. It converges to solution gradually. Learning rate α controls convergence speed.
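A minimal sketch of the normal equation on synthetic data follows. It appends a column of ones to absorb the bias and uses np.linalg.solve rather than an explicit inverse; the data-generating coefficients are arbitrary.
# Normal equation on synthetic data (sketch)
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - 1 * X[:, 1] + 0.5 + 0.1 * rng.normal(size=100)
Xb = np.hstack([X, np.ones((100, 1))])           # column of ones absorbs the bias
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)     # solves (XᵀX)θ = Xᵀy
print("Weights:", theta[:2], "Bias:", theta[2])  # approximately [2, -1] and 0.5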
Linear regression assumes linear relationships. It assumes independent observations. It assumes homoscedasticity. It assumes normally distributed errors. Violations affect model validity.
Check linearity using scatter plots. Plot residuals against predictions. Patterns indicate non-linearity. Check independence by examining residual autocorrelation. Time series data often violates independence.
Check homoscedasticity using residual plots. Constant variance appears as random scatter. Funnel shapes indicate heteroscedasticity. Check normality using Q-Q plots. Deviations from diagonal indicate non-normality.
# Regression Diagnostics
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression
# Example data (synthetic, for illustration); replace with your own X and y
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.5, size=200)
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred
# Residual plot
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
# Q-Q plot for normality
plt.subplot(1, 3, 2)
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')
# Residual histogram
plt.subplot(1, 3, 3)
plt.hist(residuals, bins=10, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
plt.tight_layout()
plt.show()
# Statistical tests
# Durbin-Watson for independence (close to 2 is good)
from statsmodels.stats.stattools import durbin_watson
print("Durbin-Watson statistic:", durbin_watson(residuals))
The diagram shows linear regression fitting. Data points scatter around a line. The model finds the line that minimizes squared distances. The line represents the learned relationship.
Logistic Regression
Logistic regression predicts probabilities for binary classification. It uses the sigmoid function to map linear combinations to 0-1 range. Probabilities above 0.5 predict class 1. Probabilities below 0.5 predict class 0.
The sigmoid function is σ(z) = 1 / (1 + e^(-z)). It transforms any number to 0-1 range. Large positive z gives probability near 1. Large negative z gives probability near 0. Zero z gives probability 0.5.
# Logistic Regression Example
import numpy as np
from sklearn.linear_model import LogisticRegression
# Toy data: one feature, binary target (illustrative)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)
print("P(class 1) for x=2.5:", model.predict_proba([[2.5]])[0, 1])
Logistic regression minimizes cross-entropy loss. Cross-entropy measures difference between predicted probabilities and true labels. It penalizes confident wrong predictions more than uncertain wrong predictions.
Detailed Logistic Regression Mathematics
The logistic regression model uses sigmoid activation. z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b. p = σ(z) = 1 / (1 + e^(-z)). p is the predicted probability of class 1.
The sigmoid function maps any real number to (0, 1). When z is large positive, p approaches 1. When z is large negative, p approaches 0. When z is zero, p equals 0.5. The derivative is σ'(z) = σ(z)(1 - σ(z)).
The cost function is cross-entropy. For binary classification: J = -(1/n) Σ[y log(p) + (1-y) log(1-p)]. This measures probability distribution difference. It penalizes confident wrong predictions heavily.
The gradient with respect to weight wⱼ is ∂J/∂wⱼ = (1/n) Σ(p - y) × xⱼ. The gradient with respect to bias b is ∂J/∂b = (1/n) Σ(p - y). These gradients are simpler than linear regression gradients.
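The example that follows uses a LogisticRegressionDetailed class whose implementation is not shown in this section. A minimal sketch consistent with that usage (batch gradient descent on the cross-entropy cost, recording a cost_history) could look like this; treat it as an assumed implementation, not the original one.
# Minimal logistic regression from scratch (assumed interface for the example below)
import numpy as np

class LogisticRegressionDetailed:
    def __init__(self, learning_rate=0.01, max_iter=1000):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.cost_history = []

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.max_iter):
            p = self._sigmoid(X @ self.w + self.b)
            # Gradients of the cross-entropy cost: (1/n) Xᵀ(p - y) and (1/n) Σ(p - y)
            dw = (1 / n_samples) * X.T @ (p - y)
            db = (1 / n_samples) * np.sum(p - y)
            self.w -= self.lr * dw
            self.b -= self.lr * db
            # Cross-entropy cost (probabilities clipped to avoid log(0))
            p_clip = np.clip(p, 1e-12, 1 - 1e-12)
            cost = -np.mean(y * np.log(p_clip) + (1 - y) * np.log(1 - p_clip))
            self.cost_history.append(cost)
        return self

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)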
# Logistic Regression from Scratch Example
# Uses the LogisticRegressionDetailed class sketched above
X = np.array([[25, 30000], [35, 50000], [45, 80000], [30, 40000], [50, 100000]])
y = np.array([0, 1, 1, 0, 1])
model = LogisticRegressionDetailed(learning_rate=0.0001, max_iter=10000)
model.fit(X, y)
predictions = model.predict(X)
probabilities = model.predict_proba(X)
print("Predictions:", predictions)
print("Probabilities:", probabilities)
print("Final cost:", model.cost_history[-1])
Multi-class Logistic Regression
Multi-class logistic regression extends binary classification. It uses softmax activation instead of sigmoid. Softmax converts logits to probability distributions. It ensures probabilities sum to one.
The softmax function is softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ. Each class gets a probability. The class with highest probability is predicted. This is called multinomial logistic regression.
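A few lines of NumPy make the softmax formula concrete; the logits below are arbitrary.
# Softmax turns logits into a probability distribution
import numpy as np
z = np.array([2.0, 1.0, 0.1])                           # arbitrary logits for 3 classes
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()     # subtracting the max improves numerical stability
print("Probabilities:", p, "Sum:", p.sum())
print("Predicted class:", p.argmax())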
The one-vs-rest strategy trains separate binary classifiers. Each classifier distinguishes one class from all others. Predictions combine all classifier outputs. The one-vs-one strategy trains a classifier for each pair of classes, which requires more classifiers but can be more accurate.
# Multi-class Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
# Generate multi-class data
X, y = make_classification(n_samples=1000, n_features=4, n_classes=3,
n_informative=3, n_redundant=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Multi-class logistic regression
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
print("Classification Report:")
print(classification_report(y_test, predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
print("Class probabilities for first test sample:")
The diagram shows the logistic regression decision boundary. Data points belong to two classes. The fitted sigmoid marks the boundary between the classes. Points on one side predict class 0. Points on the other side predict class 1.
Cost Functions
Cost functions measure prediction errors. They guide training by indicating error direction and magnitude. Different problems use different cost functions. Regression uses MSE or MAE. Classification uses cross-entropy.
Mean squared error is MSE = (1/n) Σ(y_pred - y_true)². It emphasizes large errors. A prediction off by 10 contributes 100 to cost. A prediction off by 1 contributes 1 to cost. MSE is sensitive to outliers.
Mean absolute error is MAE = (1/n) Σ|y_pred - y_true|. It treats all errors equally. A prediction off by 10 contributes 10 to cost. A prediction off by 1 contributes 1 to cost. MAE is robust to outliers.
# Cost Functions Example
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss
# Regression errors (illustrative values)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])
print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
# Classification error: true labels vs. predicted probabilities of class 1
print("Cross-entropy:", log_loss([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3]))
Cross-entropy loss is CE = -Σ y_true × log(y_pred). It measures probability distribution differences. It penalizes confident wrong predictions heavily. It rewards confident correct predictions. It works well for classification.
The diagram compares cost functions. MSE curve is quadratic. MAE curve is linear. Cross-entropy curve is logarithmic. Each has different sensitivity to errors.
Gradient Descent
Gradient descent minimizes cost functions. It calculates cost gradients with respect to weights. It updates weights in the direction that reduces cost. It repeats until convergence or maximum iterations.
The update rule is w = w - α × ∇w J. Learning rate α controls step size. Gradient ∇w J points toward higher cost. Negative gradient points toward lower cost. Small learning rates converge slowly but precisely. Large learning rates converge quickly but may overshoot.
Detailed Gradient Descent Variants
Batch gradient descent uses all training data per update. It computes gradients over entire dataset. It provides stable convergence. It requires memory for all data. It is slow for large datasets.
Stochastic gradient descent uses one sample per update. It computes gradients for single example. It updates weights immediately. It converges faster but noisier. It requires careful learning rate tuning.
Mini-batch gradient descent uses small batches. It balances stability and speed. Typical batch sizes are 32, 64, or 128. It provides smoother convergence than SGD. It is faster than batch gradient descent.
# Detailed Gradient Descent Variants
import numpy as np
import matplotlib.pyplot as plt

def generate_data():
    np.random.seed(42)
    X = np.random.randn(1000, 2)
    y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + 0.1 * np.random.randn(1000)
    return X, y

X, y = generate_data()

class GradientDescentVariants:
    def __init__(self, learning_rate=0.01):
        self.lr = learning_rate
        self.cost_history = []

    def compute_cost(self, X, y, w, b):
        predictions = X @ w + b
        return np.mean((predictions - y) ** 2) / 2

    def batch_gradient_descent(self, X, y, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        for i in range(max_iter):
            predictions = X @ w + b
            dw = (1 / n_samples) * X.T @ (predictions - y)
            db = (1 / n_samples) * np.sum(predictions - y)
            w -= self.lr * dw
            b -= self.lr * db
            cost = self.compute_cost(X, y, w, b)
            self.cost_history.append(cost)
        return w, b

    def stochastic_gradient_descent(self, X, y, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        for i in range(max_iter):
            for j in range(n_samples):
                x_sample = X[j:j+1]
                y_sample = y[j:j+1]
                prediction = x_sample @ w + b
                dw = x_sample.T @ (prediction - y_sample)
                db = prediction - y_sample
                w -= self.lr * dw
                b -= self.lr * db
            # Record the cost once per pass over the data
            cost = self.compute_cost(X, y, w, b)
            self.cost_history.append(cost)
        return w, b

    def mini_batch_gradient_descent(self, X, y, batch_size=32, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        for i in range(max_iter):
            # Shuffle and iterate over mini-batches
            indices = np.random.permutation(n_samples)
            for start in range(0, n_samples, batch_size):
                batch = indices[start:start + batch_size]
                X_batch, y_batch = X[batch], y[batch]
                predictions = X_batch @ w + b
                dw = (1 / len(X_batch)) * X_batch.T @ (predictions - y_batch)
                db = (1 / len(X_batch)) * np.sum(predictions - y_batch)
                w -= self.lr * dw
                b -= self.lr * db
            cost = self.compute_cost(X, y, w, b)
            self.cost_history.append(cost)
        return w, b

# Compare methods
gd = GradientDescentVariants(learning_rate=0.01)
w_batch, b_batch = gd.batch_gradient_descent(X, y, max_iter=50)
cost_batch = gd.cost_history.copy()
gd.cost_history = []
w_sgd, b_sgd = gd.stochastic_gradient_descent(X, y, max_iter=50)
cost_sgd = gd.cost_history.copy()
gd.cost_history = []
w_mini, b_mini = gd.mini_batch_gradient_descent(X, y, batch_size=32, max_iter=50)
cost_mini = gd.cost_history.copy()
print("Batch GD final cost:", cost_batch[-1])
print("SGD final cost:", cost_sgd[-1])
print("Mini-batch GD final cost:", cost_mini[-1])
Advanced Optimization Techniques
Momentum accumulates gradient history. v = βv + ∇w, w = w - αv. The momentum coefficient β is typically 0.9. It smooths gradient updates. It accelerates convergence in consistent directions. It helps escape shallow local minima.
Nesterov accelerated gradient looks ahead. It computes gradient at predicted position. v = βv + ∇w(w - βv), w = w - αv. It corrects momentum overshooting. It converges faster than standard momentum.
RMSprop adapts learning rates per parameter. It maintains moving average of squared gradients. s = βs + (1-β)(∇w)², w = w - α∇w/√(s + ε). It reduces learning rate for large gradients. It increases learning rate for small gradients.
Adam combines momentum and RMSprop. It maintains both first and second moment estimates. It provides adaptive learning rates per parameter. It works well for most problems. It is the default choice for many applications.
# Advanced Optimizers Detailed
classAdvancedOptimizers:
def__init__(self, learning_rate=0.001):
self.lr = learning_rate
self.cost_history =[]
defmomentum(self, X, y, beta=0.9, max_iter=100):
n_samples, n_features = X.shape
w = np.zeros(n_features)
b =0
v_w = np.zeros(n_features)
v_b =0
for i inrange(max_iter):
predictions = X @ w + b
dw =(1/n_samples)* X.T @ (predictions - y)
db =(1/n_samples)* np.sum(predictions - y)
v_w = beta * v_w + dw
v_b = beta * v_b + db
w -= self.lr * v_w
b -= self.lr * v_b
cost = np.mean((predictions - y)**2)/2
self.cost_history.append(cost)
return w, b
defrmsprop(self, X, y, beta=0.9, epsilon=1e-8, max_iter=100):
n_samples, n_features = X.shape
w = np.zeros(n_features)
b =0
s_w = np.zeros(n_features)
s_b =0
for i inrange(max_iter):
predictions = X @ w + b
dw =(1/n_samples)* X.T @ (predictions - y)
db =(1/n_samples)* np.sum(predictions - y)
s_w = beta * s_w +(1- beta)* dw**2
s_b = beta * s_b +(1- beta)* db**2
w -= self.lr * dw /(np.sqrt(s_w)+ epsilon)
b -= self.lr * db /(np.sqrt(s_b)+ epsilon)
cost = np.mean((predictions - y)**2)/2
self.cost_history.append(cost)
return w, b
defadam(self, X, y, beta1=0.9, beta2=0.999, epsilon=1e-8, max_iter=100):
n_samples, n_features = X.shape
w = np.zeros(n_features)
b =0
m_w = np.zeros(n_features)
m_b =0
v_w = np.zeros(n_features)
v_b =0
t =0
for i inrange(max_iter):
t +=1
predictions = X @ w + b
dw =(1/n_samples)* X.T @ (predictions - y)
db =(1/n_samples)* np.sum(predictions - y)
# Update biased first moment
m_w = beta1 * m_w +(1- beta1)* dw
m_b = beta1 * m_b +(1- beta1)* db
# Update biased second moment
v_w = beta2 * v_w +(1- beta2)* dw**2
v_b = beta2 * v_b +(1- beta2)* db**2
# Bias correction
m_w_hat = m_w /(1- beta1**t)
m_b_hat = m_b /(1- beta1**t)
v_w_hat = v_w /(1- beta2**t)
v_b_hat = v_b /(1- beta2**t)
# Update parameters
w -= self.lr * m_w_hat /(np.sqrt(v_w_hat)+ epsilon)
b -= self.lr * m_b_hat /(np.sqrt(v_b_hat)+ epsilon)
cost = np.mean((predictions - y)**2)/2
self.cost_history.append(cost)
return w, b
# Compare optimizers
opt = AdvancedOptimizers(learning_rate=0.01)
w_mom, b_mom = opt.momentum(X, y, max_iter=100)
cost_mom = opt.cost_history.copy()
opt.cost_history =[]
w_rms, b_rms = opt.rmsprop(X, y, max_iter=100)
cost_rms = opt.cost_history.copy()
opt.cost_history =[]
w_adam, b_adam = opt.adam(X, y, max_iter=100)
cost_adam = opt.cost_history.copy()
print("Momentum final cost: "+str(cost_mom[-1]))
print("RMSprop final cost: "+str(cost_rms[-1]))
print("Adam final cost: "+str(cost_adam[-1]))
# Gradient Descent Example
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0
    for _ in range(iterations):
        predictions = X @ w + b
        dw = (1 / n_samples) * X.T @ (predictions - y)
        db = (1 / n_samples) * np.sum(predictions - y)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

SELECT neurondb.get_weights((SELECT model_id FROM model));
-- Returns learned weights and bias
Gradient descent has variants. Batch gradient descent uses all data per update. Stochastic gradient descent uses one example per update. Mini-batch gradient descent uses small batches. Each variant has different convergence properties.
The diagram shows gradient descent optimization. The cost surface has a valley. The algorithm starts at a random point. It follows gradients downhill. It converges to the minimum.
Model Evaluation
You evaluate linear models using appropriate metrics. Regression uses MSE, MAE, or R-squared. Classification uses accuracy, precision, recall, or F1 score. Choose metrics matching your goals.
R-squared measures explained variance. R² = 1 - (SS_res / SS_tot). SS_res is sum of squared residuals. SS_tot is total sum of squares. R² near 1 means good fit. R² near 0 means poor fit.
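The R² formula can be checked directly against sklearn's r2_score; the values below are illustrative.
# R-squared: 1 - (SS_res / SS_tot)
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.7])
ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print("Manual R^2:", 1 - ss_res / ss_tot)
print("sklearn R^2:", r2_score(y_true, y_pred))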
# Model Evaluation Example
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, precision_score, recall_score
SELECT
    neurondb.predict((SELECT model_id FROM eval_model), ARRAY[feature1, feature2]) AS prediction,
    target AS actual,
    ABS(neurondb.predict((SELECT model_id FROM eval_model), ARRAY[feature1, feature2]) - target) AS error
FROM test_data;
Evaluation requires separate test data. Never evaluate on training data. Training data gives optimistic results. Test data gives realistic results. Use cross-validation for robust evaluation.
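A minimal cross-validation sketch with sklearn's cross_val_score, using synthetic regression data (the generating coefficients are arbitrary):
# 5-fold cross-validation (sketch on synthetic data)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean())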
The diagram shows evaluation workflow. You split data into train and test sets. You train on training data. You evaluate on test data. You compare predictions to actual values. You calculate metrics.
Feature Scaling Importance
Linear models trained with gradient descent or with regularization need feature scaling. Without it, features with larger ranges dominate the gradient updates and the regularization penalty. Scaling lets all features contribute on a comparable footing. Normalization maps values to the 0-1 range. Standardization centers each feature at zero with unit variance.
Without scaling, income (0-1M) dominates age (0-100). The model learns income patterns. It ignores age patterns. With scaling, both features contribute. The model learns from both features.
Scaled features have balanced weights. Unscaled features have imbalanced weights. Balanced weights mean balanced contributions. Imbalanced weights mean imbalanced contributions.
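The effect of scaling is easy to see with sklearn's preprocessing transformers; the age and income values below are made up.
# Standardization vs. normalization (illustrative age/income values)
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
X = np.array([[25.0, 30000.0], [35.0, 50000.0], [45.0, 80000.0], [50.0, 100000.0]])
print("Standardized (zero mean, unit variance):")
print(StandardScaler().fit_transform(X))
print("Normalized (0-1 range):")
print(MinMaxScaler().fit_transform(X))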
Regularization Basics
Regularization prevents overfitting. It adds penalty terms to cost functions. L1 regularization adds absolute weight penalties. L2 regularization adds squared weight penalties. Both reduce model complexity.
L1 regularization encourages sparsity. It drives some weights to zero. It performs feature selection automatically. L2 regularization shrinks all weights. It keeps all features but reduces their impact.
# Regularization Example
from sklearn.linear_model import Ridge, Lasso
import numpy as np
X = np.array([[1,2],[2,3],[3,4],[4,5]])
y = np.array([3,5,7,9])
# L2 regularization (Ridge)
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge weights: "+str(ridge.coef_))
# L1 regularization (Lasso)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso weights: "+str(lasso.coef_))
# Ridge shrinks both (perfectly collinear) weights toward the same moderate value.
# Lasso tends to drive one of the collinear weights to zero and keep the other.
Regularization strength controls tradeoff. Strong regularization reduces overfitting but increases underfitting. Weak regularization reduces underfitting but increases overfitting. Tune regularization strength using validation data.
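One common way to tune the strength is to sweep several alpha values and keep the one with the best validation score. A sketch with RidgeCV on synthetic data (the alpha grid and generating coefficients are arbitrary):
# Tuning regularization strength with cross-validation (sketch)
import numpy as np
from sklearn.linear_model import RidgeCV
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
model.fit(X, y)
print("Best alpha:", model.alpha_)
print("Weights:", model.coef_)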
The diagram shows regularization effects. Without regularization, weights can be large. Large weights cause overfitting. With regularization, weights shrink. Smaller weights reduce overfitting.
Complete Example: House Price Prediction
This example demonstrates complete linear regression workflow.
# Complete Linear Regression Example
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
SELECT
    neurondb.predict((SELECT model_id FROM price_model), ARRAY[square_feet, bedrooms, age]) AS predicted_price,
    ABS(price - neurondb.predict((SELECT model_id FROM price_model), ARRAY[square_feet, bedrooms, age])) AS error
FROM test_data;
-- Make prediction
SELECT neurondb.predict(
    (SELECT model_id FROM price_model),
    ARRAY[2000::NUMERIC, 3::NUMERIC, 10::NUMERIC]
) AS predicted_price;
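For the Python side of the same workflow, here is a sketch on synthetic house data; the column names, generating coefficients, and noise level are assumptions made for illustration.
# Complete workflow sketch: synthetic house-price data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
rng = np.random.default_rng(42)
n = 500
square_feet = rng.uniform(800, 3500, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 50, n)
price = 150 * square_feet + 10000 * bedrooms - 1000 * age + 50000 + rng.normal(0, 20000, n)
X = np.column_stack([square_feet, bedrooms, age])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)               # fit the scaler on training data only
model = LinearRegression().fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
new_house = scaler.transform([[2000, 3, 10]])        # 2000 sq ft, 3 bedrooms, 10 years old
print("Predicted price:", model.predict(new_house)[0])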
Summary
Linear models assume linear relationships between features and targets. Linear regression predicts continuous values. Logistic regression predicts classification probabilities. Both use gradient descent to minimize cost functions. Cost functions measure prediction errors. MSE works for regression. Cross-entropy works for classification. Feature scaling is essential for linear models. Regularization prevents overfitting. L1 encourages sparsity. L2 shrinks weights. Evaluation requires separate test data. Use appropriate metrics for your problem type.