Linear models assume linear relationships between features and targets. They use weighted sums of features to make predictions. Linear models are simple, interpretable, and fast. They work well when relationships are approximately linear. They need feature scaling for best performance.
Linear regression predicts continuous values. Logistic regression predicts probabilities for classification. Both use similar training methods. Both minimize cost functions. Both update weights using gradients.
The diagram shows the structure of a linear model. Input features are multiplied by weights. The weighted results are summed with a bias term. The output is the prediction. Training adjusts the weights to minimize errors.
Linear Regression
Linear regression predicts continuous target values. The model equation is y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b. Weights w represent feature importance. Bias b represents baseline value. Training finds optimal weights and bias.
The model assumes linear relationships. It works when features correlate linearly with target. It fails when relationships are non-linear. Feature engineering can create linear relationships from non-linear data.
Linear regression minimizes mean squared error. MSE measures average squared differences between predictions and actual values. Lower MSE means better fit. Training adjusts weights to reduce MSE.
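As a concrete illustration, the weighted-sum prediction and MSE can be computed directly with NumPy. The feature values, weights, and bias below are made up for illustration.
# Weighted-sum prediction and MSE (illustrative values)
import numpy as np
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])  # 3 samples, 2 features
w = np.array([0.4, 1.1])                             # weights (assumed)
b = 0.5                                              # bias (assumed)
y_true = np.array([3.0, 1.9, 3.4])
y_pred = X @ w + b                                   # y = w1*x1 + w2*x2 + b for each row
print("Predictions:", y_pred)
print("MSE:", np.mean((y_pred - y_true) ** 2))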
Detailed Linear Regression Mathematics
The linear regression model is y = Xw + b. X is the feature matrix with n samples and m features. w is the weight vector with m elements. b is the bias scalar. y is the target vector with n elements.
The cost function is J(w, b) = (1/2n) Σ(y_pred - y_true)². The factor 1/2 simplifies derivative calculations. The derivative with respect to weight wⱼ is ∂J/∂wⱼ = (1/n) Σ(y_pred - y_true) × xⱼ. The derivative with respect to bias b is ∂J/∂b = (1/n) Σ(y_pred - y_true).
The closed-form solution uses the normal equation: w = (XᵀX)⁻¹Xᵀy (appending a column of ones to X absorbs the bias into w). It gives the exact solution in one step. It requires solving a linear system involving XᵀX, so it works for small to medium datasets. It fails when XᵀX is singular.
Gradient descent solution iteratively updates weights. w = w - α × ∇w J. It works for large datasets. It doesn't require matrix inversion. It converges to solution gradually. Learning rate α controls convergence speed.
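A minimal sketch of the normal equation on synthetic data follows. It appends a column of ones to absorb the bias and uses np.linalg.solve rather than an explicit inverse; the data-generating coefficients are arbitrary.
# Normal equation on synthetic data (sketch)
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - 1 * X[:, 1] + 0.5 + 0.1 * rng.normal(size=100)
Xb = np.hstack([X, np.ones((100, 1))])           # column of ones absorbs the bias
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)     # solves (XᵀX)θ = Xᵀy
print("Weights:", theta[:2], "Bias:", theta[2])  # approximately [2, -1] and 0.5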
Linear regression assumes linear relationships. It assumes independent observations. It assumes homoscedasticity. It assumes normally distributed errors. Violations affect model validity.
Check linearity using scatter plots. Plot residuals against predictions. Patterns indicate non-linearity. Check independence by examining residual autocorrelation. Time series data often violates independence.
Check homoscedasticity using residual plots. Constant variance appears as random scatter. Funnel shapes indicate heteroscedasticity. Check normality using Q-Q plots. Deviations from diagonal indicate non-normality.
# Regression Diagnostics
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression
# Example data (synthetic, for illustration); replace with your own X and y
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.5, size=200)
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred
# Residual plot
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
# Q-Q plot for normality
plt.subplot(1, 3, 2)
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')
# Residual histogram
plt.subplot(1, 3, 3)
plt.hist(residuals, bins=10, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
plt.tight_layout()
plt.show()
# Statistical tests
# Durbin-Watson for independence (close to 2 is good)
from statsmodels.stats.stattools import durbin_watson
print("Durbin-Watson statistic:", durbin_watson(residuals))
The diagram shows linear regression fitting. Data points scatter around a line. The model finds the line that minimizes squared distances. The line represents the learned relationship.
Logistic Regression
Logistic regression predicts probabilities for binary classification. It uses the sigmoid function to map linear combinations to 0-1 range. Probabilities above 0.5 predict class 1. Probabilities below 0.5 predict class 0.
The sigmoid function is σ(z) = 1 / (1 + e^(-z)). It transforms any number to 0-1 range. Large positive z gives probability near 1. Large negative z gives probability near 0. Zero z gives probability 0.5.
# Logistic Regression Example
import numpy as np
from sklearn.linear_model import LogisticRegression
# Toy data: one feature, binary target (illustrative)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)
print("P(class 1) for x=2.5:", model.predict_proba([[2.5]])[0, 1])
Logistic regression minimizes cross-entropy loss. Cross-entropy measures difference between predicted probabilities and true labels. It penalizes confident wrong predictions more than uncertain wrong predictions.
Detailed Logistic Regression Mathematics
The logistic regression model uses sigmoid activation. z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b. p = σ(z) = 1 / (1 + e^(-z)). p is the predicted probability of class 1.
The sigmoid function maps any real number to (0, 1). When z is large positive, p approaches 1. When z is large negative, p approaches 0. When z is zero, p equals 0.5. The derivative is σ'(z) = σ(z)(1 - σ(z)).
The cost function is cross-entropy. For binary classification: J = -(1/n) Σ[y log(p) + (1-y) log(1-p)]. This measures probability distribution difference. It penalizes confident wrong predictions heavily.
The gradient with respect to weight wⱼ is ∂J/∂wⱼ = (1/n) Σ(p - y) × xⱼ. The gradient with respect to bias b is ∂J/∂b = (1/n) Σ(p - y). These gradients are simpler than linear regression gradients.
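The example that follows uses a LogisticRegressionDetailed class whose implementation is not shown in this section. A minimal sketch consistent with that usage (batch gradient descent on the cross-entropy cost, recording a cost_history) could look like this; treat it as an assumed implementation, not the original one.
# Minimal logistic regression from scratch (assumed interface for the example below)
import numpy as np

class LogisticRegressionDetailed:
    def __init__(self, learning_rate=0.01, max_iter=1000):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.cost_history = []

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.max_iter):
            p = self._sigmoid(X @ self.w + self.b)
            # Gradients of the cross-entropy cost: (1/n) Xᵀ(p - y) and (1/n) Σ(p - y)
            dw = (1 / n_samples) * X.T @ (p - y)
            db = (1 / n_samples) * np.sum(p - y)
            self.w -= self.lr * dw
            self.b -= self.lr * db
            # Cross-entropy cost (probabilities clipped to avoid log(0))
            p_clip = np.clip(p, 1e-12, 1 - 1e-12)
            cost = -np.mean(y * np.log(p_clip) + (1 - y) * np.log(1 - p_clip))
            self.cost_history.append(cost)
        return self

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)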
# Logistic Regression from Scratch Example
# Uses the LogisticRegressionDetailed class sketched above
X = np.array([[25, 30000], [35, 50000], [45, 80000], [30, 40000], [50, 100000]])
y = np.array([0, 1, 1, 0, 1])
model = LogisticRegressionDetailed(learning_rate=0.0001, max_iter=10000)
model.fit(X, y)
predictions = model.predict(X)
probabilities = model.predict_proba(X)
print("Predictions:", predictions)
print("Probabilities:", probabilities)
print("Final cost:", model.cost_history[-1])
Multi-class Logistic Regression
Multi-class logistic regression extends binary classification. It uses softmax activation instead of sigmoid. Softmax converts logits to probability distributions. It ensures probabilities sum to one.
The softmax function is softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ. Each class gets a probability. The class with highest probability is predicted. This is called multinomial logistic regression.
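A few lines of NumPy make the softmax formula concrete; the logits below are arbitrary.
# Softmax turns logits into a probability distribution
import numpy as np
z = np.array([2.0, 1.0, 0.1])                           # arbitrary logits for 3 classes
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()     # subtracting the max improves numerical stability
print("Probabilities:", p, "Sum:", p.sum())
print("Predicted class:", p.argmax())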
The one-vs-rest strategy trains separate binary classifiers. Each classifier distinguishes one class from all others. Predictions combine all classifier outputs. The one-vs-one strategy trains a classifier for each pair of classes, which requires more classifiers but can be more accurate.
# Multi-class Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
# Generate multi-class data
X, y = make_classification(n_samples=1000, n_features=4, n_classes=3,
n_informative=3, n_redundant=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Multi-class logistic regression
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
print("Classification Report:")
print(classification_report(y_test, predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
print("Class probabilities for first test sample:")
The diagram shows the logistic regression decision boundary. Data points belong to two classes. The fitted sigmoid marks the boundary between the classes. Points on one side predict class 0. Points on the other side predict class 1.
Cost Functions
Cost functions measure prediction errors. They guide training by indicating error direction and magnitude. Different problems use different cost functions. Regression uses MSE or MAE. Classification uses cross-entropy.
Mean squared error is MSE = (1/n) Σ(y_pred - y_true)². It emphasizes large errors. A prediction off by 10 contributes 100 to cost. A prediction off by 1 contributes 1 to cost. MSE is sensitive to outliers.
Mean absolute error is MAE = (1/n) Σ|y_pred - y_true|. It treats all errors equally. A prediction off by 10 contributes 10 to cost. A prediction off by 1 contributes 1 to cost. MAE is robust to outliers.
# Cost Functions Example
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss
# Regression errors (illustrative values)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])
print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
# Classification error: true labels vs. predicted probabilities of class 1
print("Cross-entropy:", log_loss([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3]))
Cross-entropy loss is CE = -Σ y_true × log(y_pred). It measures probability distribution differences. It penalizes confident wrong predictions heavily. It rewards confident correct predictions. It works well for classification.
The diagram compares cost functions. MSE curve is quadratic. MAE curve is linear. Cross-entropy curve is logarithmic. Each has different sensitivity to errors.
Gradient Descent
Gradient descent minimizes cost functions. It calculates cost gradients with respect to weights. It updates weights in the direction that reduces cost. It repeats until convergence or maximum iterations.
The update rule is w = w - α × ∇w J. Learning rate α controls step size. Gradient ∇w J points toward higher cost. Negative gradient points toward lower cost. Small learning rates converge slowly but precisely. Large learning rates converge quickly but may overshoot.
Detailed Gradient Descent Variants
Batch gradient descent uses all training data per update. It computes gradients over entire dataset. It provides stable convergence. It requires memory for all data. It is slow for large datasets.
Stochastic gradient descent uses one sample per update. It computes gradients for single example. It updates weights immediately. It converges faster but noisier. It requires careful learning rate tuning.
Mini-batch gradient descent uses small batches. It balances stability and speed. Typical batch sizes are 32, 64, or 128. It provides smoother convergence than SGD. It is faster than batch gradient descent.
# Detailed Gradient Descent Variants
import numpy as np
import matplotlib.pyplot as plt

def generate_data():
    np.random.seed(42)
    X = np.random.randn(1000, 2)
    y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + 0.1 * np.random.randn(1000)
    return X, y

X, y = generate_data()

class GradientDescentVariants:
    def __init__(self, learning_rate=0.01):
        self.lr = learning_rate
        self.cost_history = []

    def compute_cost(self, X, y, w, b):
        predictions = X @ w + b
        return np.mean((predictions - y) ** 2) / 2

    def batch_gradient_descent(self, X, y, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        for i in range(max_iter):
            predictions = X @ w + b
            dw = (1 / n_samples) * X.T @ (predictions - y)
            db = (1 / n_samples) * np.sum(predictions - y)
            w -= self.lr * dw
            b -= self.lr * db
            cost = self.compute_cost(X, y, w, b)
            self.cost_history.append(cost)
        return w, b

    def stochastic_gradient_descent(self, X, y, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        for i in range(max_iter):
            for j in range(n_samples):
                x_sample = X[j:j+1]
                y_sample = y[j:j+1]
                prediction = x_sample @ w + b
                dw = x_sample.T @ (prediction - y_sample)
                db = prediction - y_sample
                w -= self.lr * dw
                b -= self.lr * db
            # Record the cost once per pass over the data
            cost = self.compute_cost(X, y, w, b)
            self.cost_history.append(cost)
        return w, b

    def mini_batch_gradient_descent(self, X, y, batch_size=32, max_iter=100):
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        b = 0
        for i in range(max_iter):
            # Shuffle and iterate over mini-batches
            indices = np.random.permutation(n_samples)
            for start in range(0, n_samples, batch_size):
                batch = indices[start:start + batch_size]
                X_batch, y_batch = X[batch], y[batch]
                predictions = X_batch @ w + b
                dw = (1 / len(X_batch)) * X_batch.T @ (predictions - y_batch)
                db = (1 / len(X_batch)) * np.sum(predictions - y_batch)
                w -= self.lr * dw
                b -= self.lr * db
            cost = self.compute_cost(X, y, w, b)
            self.cost_history.append(cost)
        return w, b

# Compare methods
gd = GradientDescentVariants(learning_rate=0.01)
w_batch, b_batch = gd.batch_gradient_descent(X, y, max_iter=50)
cost_batch = gd.cost_history.copy()
gd.cost_history = []
w_sgd, b_sgd = gd.stochastic_gradient_descent(X, y, max_iter=50)
cost_sgd = gd.cost_history.copy()
gd.cost_history = []
w_mini, b_mini = gd.mini_batch_gradient_descent(X, y, batch_size=32, max_iter=50)
cost_mini = gd.cost_history.copy()
print("Batch GD final cost:", cost_batch[-1])
print("SGD final cost:", cost_sgd[-1])
print("Mini-batch GD final cost:", cost_mini[-1])
Advanced Optimization Techniques
Momentum accumulates gradient history. v = βv + ∇w, w = w - αv. The momentum coefficient β is typically 0.9. It smooths gradient updates. It accelerates convergence in consistent directions. It helps escape shallow local minima.
Nesterov accelerated gradient looks ahead. It computes gradient at predicted position. v = βv + ∇w(w - βv), w = w - αv. It corrects momentum overshooting. It converges faster than standard momentum.
RMSprop adapts learning rates per parameter. It maintains moving average of squared gradients. s = βs + (1-β)(∇w)², w = w - α∇w/√(s + ε). It reduces learning rate for large gradients. It increases learning rate for small gradients.
Adam combines momentum and RMSprop. It maintains both first and second moment estimates. It provides adaptive learning rates per parameter. It works well for most problems. It is the default choice for many applications.
# Advanced Optimizers Detailed
classAdvancedOptimizers:
def__init__(self, learning_rate=0.001):
self.lr = learning_rate
self.cost_history =[]
defmomentum(self, X, y, beta=0.9, max_iter=100):
n_samples, n_features = X.shape
w = np.zeros(n_features)
b =0
v_w = np.zeros(n_features)
v_b =0
for i inrange(max_iter):
predictions = X @ w + b
dw =(1/n_samples)* X.T @ (predictions - y)
db =(1/n_samples)* np.sum(predictions - y)
v_w = beta * v_w + dw
v_b = beta * v_b + db
w -= self.lr * v_w
b -= self.lr * v_b
cost = np.mean((predictions - y)**2)/2
self.cost_history.append(cost)
return w, b
defrmsprop(self, X, y, beta=0.9, epsilon=1e-8, max_iter=100):
n_samples, n_features = X.shape
w = np.zeros(n_features)
b =0
s_w = np.zeros(n_features)
s_b =0
for i inrange(max_iter):
predictions = X @ w + b
dw =(1/n_samples)* X.T @ (predictions - y)
db =(1/n_samples)* np.sum(predictions - y)
s_w = beta * s_w +(1- beta)* dw**2
s_b = beta * s_b +(1- beta)* db**2
w -= self.lr * dw /(np.sqrt(s_w)+ epsilon)
b -= self.lr * db /(np.sqrt(s_b)+ epsilon)
cost = np.mean((predictions - y)**2)/2
self.cost_history.append(cost)
return w, b
defadam(self, X, y, beta1=0.9, beta2=0.999, epsilon=1e-8, max_iter=100):
n_samples, n_features = X.shape
w = np.zeros(n_features)
b =0
m_w = np.zeros(n_features)
m_b =0
v_w = np.zeros(n_features)
v_b =0
t =0
for i inrange(max_iter):
t +=1
predictions = X @ w + b
dw =(1/n_samples)* X.T @ (predictions - y)
db =(1/n_samples)* np.sum(predictions - y)
# Update biased first moment
m_w = beta1 * m_w +(1- beta1)* dw
m_b = beta1 * m_b +(1- beta1)* db
# Update biased second moment
v_w = beta2 * v_w +(1- beta2)* dw**2
v_b = beta2 * v_b +(1- beta2)* db**2
# Bias correction
m_w_hat = m_w /(1- beta1**t)
m_b_hat = m_b /(1- beta1**t)
v_w_hat = v_w /(1- beta2**t)
v_b_hat = v_b /(1- beta2**t)
# Update parameters
w -= self.lr * m_w_hat /(np.sqrt(v_w_hat)+ epsilon)
b -= self.lr * m_b_hat /(np.sqrt(v_b_hat)+ epsilon)
cost = np.mean((predictions - y)**2)/2
self.cost_history.append(cost)
return w, b
# Compare optimizers
opt = AdvancedOptimizers(learning_rate=0.01)
w_mom, b_mom = opt.momentum(X, y, max_iter=100)
cost_mom = opt.cost_history.copy()
opt.cost_history =[]
w_rms, b_rms = opt.rmsprop(X, y, max_iter=100)
cost_rms = opt.cost_history.copy()
opt.cost_history =[]
w_adam, b_adam = opt.adam(X, y, max_iter=100)
cost_adam = opt.cost_history.copy()
print("Momentum final cost: "+str(cost_mom[-1]))
print("RMSprop final cost: "+str(cost_rms[-1]))
print("Adam final cost: "+str(cost_adam[-1]))
# Gradient Descent Example
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0
    for _ in range(iterations):
        predictions = X @ w + b
        dw = (1 / n_samples) * X.T @ (predictions - y)
        db = (1 / n_samples) * np.sum(predictions - y)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

SELECT neurondb.get_weights((SELECT model_id FROM model));
-- Returns learned weights and bias
Gradient descent has variants. Batch gradient descent uses all data per update. Stochastic gradient descent uses one example per update. Mini-batch gradient descent uses small batches. Each variant has different convergence properties.
The diagram shows gradient descent optimization. The cost surface has a valley. The algorithm starts at a random point. It follows gradients downhill. It converges to the minimum.
Model Evaluation
You evaluate linear models using appropriate metrics. Regression uses MSE, MAE, or R-squared. Classification uses accuracy, precision, recall, or F1 score. Choose metrics matching your goals.
R-squared measures explained variance. R² = 1 - (SS_res / SS_tot). SS_res is sum of squared residuals. SS_tot is total sum of squares. R² near 1 means good fit. R² near 0 means poor fit.
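The R² formula can be checked directly against sklearn's r2_score; the values below are illustrative.
# R-squared: 1 - (SS_res / SS_tot)
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.7])
ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print("Manual R^2:", 1 - ss_res / ss_tot)
print("sklearn R^2:", r2_score(y_true, y_pred))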
# Model Evaluation Example
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, precision_score, recall_score
SELECT
    neurondb.predict((SELECT model_id FROM eval_model), ARRAY[feature1, feature2]) AS prediction,
    target AS actual,
    ABS(neurondb.predict((SELECT model_id FROM eval_model), ARRAY[feature1, feature2]) - target) AS error
FROM test_data;
Evaluation requires separate test data. Never evaluate on training data. Training data gives optimistic results. Test data gives realistic results. Use cross-validation for robust evaluation.
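A minimal cross-validation sketch with sklearn's cross_val_score, using synthetic regression data (the generating coefficients are arbitrary):
# 5-fold cross-validation (sketch on synthetic data)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean())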
The diagram shows evaluation workflow. You split data into train and test sets. You train on training data. You evaluate on test data. You compare predictions to actual values. You calculate metrics.
Feature Scaling Importance
Linear models trained with gradient descent or with regularization need feature scaling. Without it, features with larger ranges dominate the gradient updates and the regularization penalty. Scaling lets all features contribute on a comparable footing. Normalization maps values to the 0-1 range. Standardization centers each feature at zero with unit variance.
Without scaling, income (0-1M) dominates age (0-100). The model learns income patterns. It ignores age patterns. With scaling, both features contribute. The model learns from both features.
Scaled features have balanced weights. Unscaled features have imbalanced weights. Balanced weights mean balanced contributions. Imbalanced weights mean imbalanced contributions.
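The effect of scaling is easy to see with sklearn's preprocessing transformers; the age and income values below are made up.
# Standardization vs. normalization (illustrative age/income values)
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
X = np.array([[25.0, 30000.0], [35.0, 50000.0], [45.0, 80000.0], [50.0, 100000.0]])
print("Standardized (zero mean, unit variance):")
print(StandardScaler().fit_transform(X))
print("Normalized (0-1 range):")
print(MinMaxScaler().fit_transform(X))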
Regularization Basics
Regularization prevents overfitting. It adds penalty terms to cost functions. L1 regularization adds absolute weight penalties. L2 regularization adds squared weight penalties. Both reduce model complexity.
L1 regularization encourages sparsity. It drives some weights to zero. It performs feature selection automatically. L2 regularization shrinks all weights. It keeps all features but reduces their impact.
# Regularization Example
from sklearn.linear_model import Ridge, Lasso
import numpy as np
X = np.array([[1,2],[2,3],[3,4],[4,5]])
y = np.array([3,5,7,9])
# L2 regularization (Ridge)
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("Ridge weights: "+str(ridge.coef_))
# L1 regularization (Lasso)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("Lasso weights: "+str(lasso.coef_))
# Ridge shrinks both (perfectly collinear) weights toward the same moderate value.
# Lasso tends to drive one of the collinear weights to zero and keep the other.
Regularization strength controls tradeoff. Strong regularization reduces overfitting but increases underfitting. Weak regularization reduces underfitting but increases overfitting. Tune regularization strength using validation data.
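One common way to tune the strength is to sweep several alpha values and keep the one with the best validation score. A sketch with RidgeCV on synthetic data (the alpha grid and generating coefficients are arbitrary):
# Tuning regularization strength with cross-validation (sketch)
import numpy as np
from sklearn.linear_model import RidgeCV
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
model.fit(X, y)
print("Best alpha:", model.alpha_)
print("Weights:", model.coef_)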
The diagram shows regularization effects. Without regularization, weights can be large. Large weights cause overfitting. With regularization, weights shrink. Smaller weights reduce overfitting.
Complete Example: House Price Prediction
This example demonstrates complete linear regression workflow.
# Complete Linear Regression Example
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
SELECT
    neurondb.predict((SELECT model_id FROM price_model), ARRAY[square_feet, bedrooms, age]) AS predicted_price,
    ABS(price - neurondb.predict((SELECT model_id FROM price_model), ARRAY[square_feet, bedrooms, age])) AS error
FROM test_data;
-- Make prediction
SELECT neurondb.predict(
    (SELECT model_id FROM price_model),
    ARRAY[2000::NUMERIC, 3::NUMERIC, 10::NUMERIC]
) AS predicted_price;
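For the Python side of the same workflow, here is a sketch on synthetic house data; the column names, generating coefficients, and noise level are assumptions made for illustration.
# Complete workflow sketch: synthetic house-price data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
rng = np.random.default_rng(42)
n = 500
square_feet = rng.uniform(800, 3500, n)
bedrooms = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 50, n)
price = 150 * square_feet + 10000 * bedrooms - 1000 * age + 50000 + rng.normal(0, 20000, n)
X = np.column_stack([square_feet, bedrooms, age])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)               # fit the scaler on training data only
model = LinearRegression().fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
new_house = scaler.transform([[2000, 3, 10]])        # 2000 sq ft, 3 bedrooms, 10 years old
print("Predicted price:", model.predict(new_house)[0])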
Summary
Linear models assume linear relationships between features and targets. Linear regression predicts continuous values. Logistic regression predicts classification probabilities. Both use gradient descent to minimize cost functions. Cost functions measure prediction errors. MSE works for regression. Cross-entropy works for classification. Feature scaling is essential for linear models. Regularization prevents overfitting. L1 encourages sparsity. L2 shrinks weights. Evaluation requires separate test data. Use appropriate metrics for your problem type.