Multiple Linear Regression

Predict outcomes using multiple features simultaneously. Learn how combining multiple inputs creates more accurate and nuanced predictions.

From One to Many: The Power of Multiple Features

Evolution from Simple Linear Regression

While simple linear regression uses one feature (like house size) to predict an outcome (like price), multiple linear regression combines many features for more accurate predictions.

Simple vs Multiple:

Simple: \hat{y} = w \cdot x + b
Multiple: \hat{y} = b + w_1x_1 + w_2x_2 + ... + w_nx_n

Real-World Example

Predicting house prices becomes much more accurate when you consider:

  • Square footage (x_1)
  • Number of bedrooms (x_2)
  • Location quality score (x_3)
  • Age of house (x_4)
  • Garage spaces (x_5)

Each feature contributes to the final prediction with its own weight.
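To make this concrete, here is a minimal sketch of a multi-feature prediction in Python. The weights and bias are made-up numbers for illustration, not coefficients fitted to real data:

```python
# Illustrative only: hand-picked weights, not fitted coefficients
features = {
    "square_footage": 2000,   # x1
    "bedrooms": 3,            # x2
    "location_score": 8,      # x3 (1-10 scale)
    "age_years": 15,          # x4
    "garage_spaces": 2,       # x5
}

weights = {
    "square_footage": 120.0,  # dollars per extra square foot
    "bedrooms": 8_000.0,
    "location_score": 15_000.0,
    "age_years": -1_200.0,    # older houses predict lower prices
    "garage_spaces": 6_000.0,
}

bias = 50_000.0               # base prediction when every feature is 0

# y_hat = b + w1*x1 + w2*x2 + ... + wn*xn
y_hat = bias + sum(weights[name] * value for name, value in features.items())
print(f"Predicted price: ${y_hat:,.0f}")
```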

Mathematical Foundation

The Equation

\hat{y} = b + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_nx_n

Or in vector notation (more compact):

\hat{y} = \mathbf{w}^T\mathbf{x} + b

Components:

  • \hat{y} = predicted value
  • b = bias (base prediction)
  • w_i = weight for feature i
  • x_i = value of feature i

Interpretation:

Each weight w_i tells you how much the prediction changes when feature x_i increases by 1, holding all other features constant. This is the “partial effect” of that feature.
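A short NumPy sketch of the vector form, reusing the illustrative weights from above, shows both the compact prediction and the “holding all other features constant” interpretation:

```python
import numpy as np

w = np.array([120.0, 8_000.0, 15_000.0, -1_200.0, 6_000.0])  # one weight per feature
b = 50_000.0
x = np.array([2000, 3, 8, 15, 2], dtype=float)                # one house

y_hat = w @ x + b   # vector form: y_hat = w^T x + b
print(y_hat)

# Partial effect: add one bedroom (x_2), hold everything else fixed
x_extra_bedroom = x.copy()
x_extra_bedroom[1] += 1
print((w @ x_extra_bedroom + b) - y_hat)   # equals w[1] = 8,000
```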

Geometric Interpretation

While simple linear regression fits a line through 2D space (one feature plus the target), multiple linear regression fits a hyperplane through higher-dimensional space:

  • 2 features: fits a plane in 3D space
  • 3 features: fits a hyperplane in 4D space
  • n features: fits a hyperplane in (n+1)D space

Key Concepts & Challenges

Feature Independence

Multiple linear regression works best when features are independent of one another. When features are correlated (multicollinearity), the individual weights become unstable and it becomes hard to isolate each feature's effect (see the sketch after the examples below).

Example of Multicollinearity:

  • House size and number of rooms (highly correlated)
  • Age and mileage in cars (usually correlated)
  • Height and weight in people (often correlated)
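A quick way to spot pairs like these is a correlation matrix. Here is a sketch on synthetic data, where the link between house size and number of rooms is invented purely to illustrate the point:

```python
import numpy as np

rng = np.random.default_rng(0)
size_sqft = rng.uniform(800, 3500, size=200)
# Room count is largely driven by size, plus noise -> strong correlation
num_rooms = size_sqft / 500 + rng.normal(0, 0.5, size=200)

print(np.corrcoef(size_sqft, num_rooms)[0, 1])  # close to 1 -> multicollinearity risk
```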

Feature Scaling

When features have different scales (e.g., age in years vs income in dollars), you need to normalize them for stable training.

Common Scaling Methods:

  • Standardization: z = \frac{x - \mu}{\sigma}
  • Min-Max: x' = \frac{x - x_{min}}{x_{max} - x_{min}}
  • Robust: Uses median and IQR
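A sketch of the first two methods in NumPy, using an assumed two-column matrix of age and income just to show the mechanics:

```python
import numpy as np

X = np.array([[25,  40_000],
              [47, 120_000],
              [33,  65_000]], dtype=float)   # columns: age (years), income (dollars)

# Standardization: zero mean, unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: squashes each column into [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)
print(X_minmax)
```

In practice, scikit-learn's StandardScaler, MinMaxScaler, and RobustScaler do the same job and remember the training-set statistics so the identical transformation can be applied to new data.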

Curse of Dimensionality

As you add more features, your data points become increasingly sparse in the feature space, so you need substantially more samples to maintain the same prediction quality:

  • 10 features → need ~100 samples minimum
  • 100 features → need ~1,000 samples minimum
  • 1,000 features → need ~10,000 samples minimum

Rule of thumb: at least 10-20 samples per feature

Feature Selection

Not all features improve predictions. Some add noise:

  • Forward selection: Add features one by one
  • Backward elimination: Remove features one by one
  • L1 regularization: Automatically zeros out weak features
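As an example of the last option, L1 regularization is available through scikit-learn's Lasso. The sketch below uses synthetic data in which only two of six features actually drive the target; the alpha value is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
# Only features 0 and 2 actually drive the target; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # weights on the noise features are driven to (near) zero
```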

Training Process

Cost Function

Same as in simple linear regression, except that each prediction now sums over all features:

J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2

Where \hat{y}^{(i)} = b + \sum_{j=1}^{n} w_j x_j^{(i)}
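This translates directly into NumPy. The sketch assumes X is an m × n feature matrix and y a length-m target vector:

```python
import numpy as np

def predict(X, w, b):
    # y_hat^(i) = b + sum_j w_j * x_j^(i), vectorized over all m examples
    return X @ w + b

def cost(X, y, w, b):
    m = len(y)
    errors = predict(X, w, b) - y
    return (errors ** 2).sum() / (2 * m)
```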

Optimization Methods

Normal Equation

\begin{bmatrix} b \\ \mathbf{w} \end{bmatrix} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

where \mathbf{X} includes a leading column of ones so the bias is estimated along with the weights.

  • Direct solution
  • No iterations needed
  • Slow for many features (>10,000)
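A sketch of the normal equation in NumPy. A column of ones is prepended so the bias is solved together with the weights, and np.linalg.lstsq is used instead of forming the explicit inverse, which is numerically safer:

```python
import numpy as np

def fit_normal_equation(X, y):
    m = X.shape[0]
    X_aug = np.hstack([np.ones((m, 1)), X])   # leading column of 1s handles the bias
    # Solves the least-squares system without explicitly inverting X^T X
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    b, w = theta[0], theta[1:]
    return w, b
```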

Gradient Descent

Update rules:

w_j := w_j - \alpha \frac{\partial J}{\partial w_j}
b := b - \alpha \frac{\partial J}{\partial b}

  • Scales to large datasets
  • Memory efficient
  • Requires learning rate tuning
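A minimal batch gradient descent loop implementing these update rules. The learning rate and iteration count are arbitrary illustrative choices, and the features are assumed to be scaled already:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        errors = X @ w + b - y          # shape (m,)
        grad_w = (X.T @ errors) / m     # dJ/dw_j for every feature j at once
        grad_b = errors.sum() / m       # dJ/db
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b
```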

Interactive Playground

Experiment with multiple linear regression using real or synthetic multi-feature datasets. Notice how different features contribute to the final prediction.


When to Use Multiple Linear Regression

Perfect For:

  • Predicting continuous values with multiple relevant features
  • Understanding feature importance and relationships
  • Business metrics (sales, revenue, costs) with multiple drivers
  • Scientific measurements with multiple variables
  • Real estate pricing, demand forecasting, risk assessment

Consider Alternatives When:

  • Features have complex non-linear relationships → Try polynomial regression
  • Too many features relative to samples → Use regularization
  • Features are highly correlated → Consider PCA or feature selection
  • Categorical target variable → Use logistic regression
  • Very complex patterns → Try neural networks or tree-based models

Practical Tips

Feature Engineering

  • Create interaction terms (x₁ × x₂)
  • Add domain-specific features
  • Transform skewed features (log, sqrt)
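A sketch of the first and third ideas, with column meanings assumed for illustration:

```python
import numpy as np

size_sqft = np.array([1500.0, 2200.0, 3100.0])
location_score = np.array([6.0, 8.0, 9.0])
price = np.array([310_000.0, 540_000.0, 880_000.0])

# Interaction term: extra square footage may be worth more in better locations
size_x_location = size_sqft * location_score

# Log transform: compresses a right-skewed variable such as price
log_price = np.log(price)
```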

Validation Strategy

  • Use cross-validation for small datasets
  • Check residual plots for patterns
  • Test for multicollinearity (VIF)
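Variance inflation factors can be computed with statsmodels; values above roughly 5-10 are a common warning sign. A sketch on synthetic data with two deliberately correlated features:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(0, 0.3, size=100)   # deliberately correlated with x1
x3 = rng.normal(size=100)

# Include an intercept column; compute VIFs for the real features only
X = np.column_stack([np.ones(100), x1, x2, x3])
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)   # the first two values will be noticeably inflated
```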

Interpretation

  • Standardize features for comparing weights
  • Check confidence intervals
  • Consider partial dependence plots

Remember: More features isn't always better. Start simple, add complexity gradually, and always validate that each new feature actually improves out-of-sample performance.