Multiple Linear Regression

Predict outcomes using multiple features simultaneously. Learn how combining multiple inputs creates more accurate and nuanced predictions.

From One to Many: The Power of Multiple Features

Evolution from Simple Linear Regression

While simple linear regression uses one feature (like house size) to predict an outcome (like price), multiple linear regression combines many features for more accurate predictions.

Simple vs Multiple:

Simple: \hat{y} = w \cdot x + b
Multiple: \hat{y} = b + w_1x_1 + w_2x_2 + ... + w_nx_n

Real-World Example

Predicting house prices becomes much more accurate when you consider:

  • Square footage (x_1)
  • Number of bedrooms (x_2)
  • Location quality score (x_3)
  • Age of house (x_4)
  • Garage spaces (x_5)

Each feature contributes to the final prediction with its own weight.
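To make this concrete, here is a minimal sketch of a multi-feature prediction in Python. The weights and bias are made-up numbers for illustration, not coefficients fitted to real data:

```python
# Illustrative only: hand-picked weights, not fitted coefficients
features = {
    "square_footage": 2000,   # x1
    "bedrooms": 3,            # x2
    "location_score": 8,      # x3 (1-10 scale)
    "age_years": 15,          # x4
    "garage_spaces": 2,       # x5
}

weights = {
    "square_footage": 120.0,  # dollars per extra square foot
    "bedrooms": 8_000.0,
    "location_score": 15_000.0,
    "age_years": -1_200.0,    # older houses predict lower prices
    "garage_spaces": 6_000.0,
}

bias = 50_000.0               # base prediction when every feature is 0

# y_hat = b + w1*x1 + w2*x2 + ... + wn*xn
y_hat = bias + sum(weights[name] * value for name, value in features.items())
print(f"Predicted price: ${y_hat:,.0f}")
```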

Mathematical Foundation

The Equation

\hat{y} = b + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_nx_n

Or in vector notation (more compact):

\hat{y} = \mathbf{w}^T\mathbf{x} + b

Components:

  • \hat{y} = predicted value
  • b = bias (base prediction)
  • w_i = weight for feature i
  • x_i = value of feature i

Interpretation:

Each weight w_i tells you how much the prediction changes when feature x_i increases by 1, holding all other features constant. This is the “partial effect” of that feature.
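A short NumPy sketch of the vector form, reusing the illustrative weights from above, shows both the compact prediction and the “holding all other features constant” interpretation:

```python
import numpy as np

w = np.array([120.0, 8_000.0, 15_000.0, -1_200.0, 6_000.0])  # one weight per feature
b = 50_000.0
x = np.array([2000, 3, 8, 15, 2], dtype=float)                # one house

y_hat = w @ x + b   # vector form: y_hat = w^T x + b
print(y_hat)

# Partial effect: add one bedroom (x_2), hold everything else fixed
x_extra_bedroom = x.copy()
x_extra_bedroom[1] += 1
print((w @ x_extra_bedroom + b) - y_hat)   # equals w[1] = 8,000
```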

Geometric Interpretation

While simple linear regression fits a line through 2D space (one feature plus the target), multiple linear regression fits a hyperplane through higher-dimensional space:

  • 2 features: fits a plane in 3D space
  • 3 features: fits a hyperplane in 4D space
  • n features: fits a hyperplane in (n+1)D space

Key Concepts & Challenges

Feature Independence

Multiple linear regression works best when features are independent of one another. When features are correlated (multicollinearity), the individual weights become unstable and it becomes hard to isolate each feature's effect (see the sketch after the examples below).

Example of Multicollinearity:

  • House size and number of rooms (highly correlated)
  • Age and mileage in cars (usually correlated)
  • Height and weight in people (often correlated)
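A quick way to spot pairs like these is a correlation matrix. Here is a sketch on synthetic data, where the link between house size and number of rooms is invented purely to illustrate the point:

```python
import numpy as np

rng = np.random.default_rng(0)
size_sqft = rng.uniform(800, 3500, size=200)
# Room count is largely driven by size, plus noise -> strong correlation
num_rooms = size_sqft / 500 + rng.normal(0, 0.5, size=200)

print(np.corrcoef(size_sqft, num_rooms)[0, 1])  # close to 1 -> multicollinearity risk
```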

Feature Scaling

When features have different scales (e.g., age in years vs income in dollars), you need to normalize them for stable training.

Common Scaling Methods:

  • Standardization: z = \frac{x - \mu}{\sigma}
  • Min-Max: x' = \frac{x - x_{min}}{x_{max} - x_{min}}
  • Robust: Uses median and IQR
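A sketch of the first two methods in NumPy, using an assumed two-column matrix of age and income just to show the mechanics:

```python
import numpy as np

X = np.array([[25,  40_000],
              [47, 120_000],
              [33,  65_000]], dtype=float)   # columns: age (years), income (dollars)

# Standardization: zero mean, unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: squashes each column into [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)
print(X_minmax)
```

In practice, scikit-learn's StandardScaler, MinMaxScaler, and RobustScaler do the same job and remember the training-set statistics so the identical transformation can be applied to new data.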

Curse of Dimensionality

As you add more features, your data points become increasingly sparse in the feature space, so you need substantially more samples to maintain the same prediction quality:

  • 10 features → need ~100 samples minimum
  • 100 features → need ~1,000 samples minimum
  • 1,000 features → need ~10,000 samples minimum

Rule of thumb: at least 10-20 samples per feature

Feature Selection

Not all features improve predictions. Some add noise:

  • Forward selection: Add features one by one
  • Backward elimination: Remove features one by one
  • L1 regularization: Automatically zeros out weak features
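As an example of the last option, L1 regularization is available through scikit-learn's Lasso. The sketch below uses synthetic data in which only two of six features actually drive the target; the alpha value is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
# Only features 0 and 2 actually drive the target; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # weights on the noise features are driven to (near) zero
```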

Training Process

Cost Function

Same as in simple linear regression, except that each prediction now sums over all features:

J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2

Where \hat{y}^{(i)} = b + \sum_{j=1}^{n} w_j x_j^{(i)}
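This translates directly into NumPy. The sketch assumes X is an m × n feature matrix and y a length-m target vector:

```python
import numpy as np

def predict(X, w, b):
    # y_hat^(i) = b + sum_j w_j * x_j^(i), vectorized over all m examples
    return X @ w + b

def cost(X, y, w, b):
    m = len(y)
    errors = predict(X, w, b) - y
    return (errors ** 2).sum() / (2 * m)
```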

Optimization Methods

Normal Equation

\begin{bmatrix} b \\ \mathbf{w} \end{bmatrix} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

where \mathbf{X} includes a leading column of ones so the bias is estimated along with the weights.

  • Direct solution
  • No iterations needed
  • Slow for many features (>10,000)
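A sketch of the normal equation in NumPy. A column of ones is prepended so the bias is solved together with the weights, and np.linalg.lstsq is used instead of forming the explicit inverse, which is numerically safer:

```python
import numpy as np

def fit_normal_equation(X, y):
    m = X.shape[0]
    X_aug = np.hstack([np.ones((m, 1)), X])   # leading column of 1s handles the bias
    # Solves the least-squares system without explicitly inverting X^T X
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    b, w = theta[0], theta[1:]
    return w, b
```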

Gradient Descent

Update rules:

w_j := w_j - \alpha \frac{\partial J}{\partial w_j}
b := b - \alpha \frac{\partial J}{\partial b}

  • Scales to large datasets
  • Memory efficient
  • Requires learning rate tuning
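A minimal batch gradient descent loop implementing these update rules. The learning rate and iteration count are arbitrary illustrative choices, and the features are assumed to be scaled already:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        errors = X @ w + b - y          # shape (m,)
        grad_w = (X.T @ errors) / m     # dJ/dw_j for every feature j at once
        grad_b = errors.sum() / m       # dJ/db
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b
```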

Interactive Playground

Experiment with multiple linear regression using real or synthetic multi-feature datasets. Notice how different features contribute to the final prediction.


When to Use Multiple Linear Regression

Perfect For:

  • Predicting continuous values with multiple relevant features
  • Understanding feature importance and relationships
  • Business metrics (sales, revenue, costs) with multiple drivers
  • Scientific measurements with multiple variables
  • Real estate pricing, demand forecasting, risk assessment

Consider Alternatives When:

  • Features have complex non-linear relationships → Try polynomial regression
  • Too many features relative to samples → Use regularization
  • Features are highly correlated → Consider PCA or feature selection
  • Categorical target variable → Use logistic regression
  • Very complex patterns → Try neural networks or tree-based models

Practical Tips

Feature Engineering

  • Create interaction terms (x₁ × x₂)
  • Add domain-specific features
  • Transform skewed features (log, sqrt)
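A sketch of the first and third ideas, with column meanings assumed for illustration:

```python
import numpy as np

size_sqft = np.array([1500.0, 2200.0, 3100.0])
location_score = np.array([6.0, 8.0, 9.0])
price = np.array([310_000.0, 540_000.0, 880_000.0])

# Interaction term: extra square footage may be worth more in better locations
size_x_location = size_sqft * location_score

# Log transform: compresses a right-skewed variable such as price
log_price = np.log(price)
```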

Validation Strategy

  • Use cross-validation for small datasets
  • Check residual plots for patterns
  • Test for multicollinearity (VIF)
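Variance inflation factors can be computed with statsmodels; values above roughly 5-10 are a common warning sign. A sketch on synthetic data with two deliberately correlated features:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(0, 0.3, size=100)   # deliberately correlated with x1
x3 = rng.normal(size=100)

# Include an intercept column; compute VIFs for the real features only
X = np.column_stack([np.ones(100), x1, x2, x3])
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)   # the first two values will be noticeably inflated
```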

Interpretation

  • Standardize features for comparing weights
  • Check confidence intervals
  • Consider partial dependence plots

Remember: More features isn't always better. Start simple, add complexity gradually, and always validate that each new feature actually improves out-of-sample performance.