
Multiple Linear Regression in Python – Tutorial
Multiple linear regression is a fundamental statistical technique for predicting a target variable from multiple input features, which makes it useful for data analysis tasks like server load prediction, resource optimization, and performance monitoring. Unlike simple linear regression, which uses only one predictor, multiple linear regression considers several variables simultaneously to build more accurate models. In this tutorial, you’ll learn how to implement multiple linear regression in Python, handle real-world datasets, troubleshoot common issues, and apply the technique to practical scenarios relevant to system administration and development work.
How Multiple Linear Regression Works
Multiple linear regression extends the basic linear regression concept by fitting a linear relationship between multiple independent variables (features) and one dependent variable (target). The mathematical formula is:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where y is the target variable, x₁, x₂, …, xₙ are the input features, β₀ is the intercept, β₁, β₂, …, βₙ are the coefficients, and ε represents the error term. Python’s scikit-learn library handles the matrix calculations behind the scenes, solving the ordinary least-squares problem (LinearRegression uses a direct least-squares solver rather than gradient descent) to find the optimal coefficients.
The algorithm minimizes the sum of squared residuals (differences between predicted and actual values) to find the best-fitting hyperplane through your data points. This makes it particularly useful for predicting server metrics like CPU usage based on factors such as active connections, memory usage, and disk I/O.
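To see what the fit actually computes, here is a minimal sketch using NumPy’s least-squares solver on a tiny made-up dataset (the numbers are purely illustrative); it recovers the intercept and coefficients directly, which is the same problem scikit-learn solves for you:

```python
import numpy as np

# Tiny made-up dataset: 5 observations of 2 features (illustrative only)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([8.0, 7.0, 15.0, 14.0, 19.0])

# Prepend a column of ones so the intercept β₀ is estimated as well
X_design = np.column_stack([np.ones(len(X)), X])

# Solve min ||X_design @ beta - y||² (ordinary least squares)
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Intercept:", round(beta[0], 3))
print("Coefficients:", np.round(beta[1:], 3))
```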
Step-by-Step Implementation Guide
Let’s start with a practical example using server performance data. First, install the required libraries:
```bash
pip install numpy pandas scikit-learn matplotlib seaborn
```
Here’s a complete implementation that predicts server response time based on multiple factors:
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Create sample server performance dataset
np.random.seed(42)
n_samples = 1000

# Generate synthetic server metrics
cpu_usage = np.random.normal(50, 20, n_samples)
memory_usage = np.random.normal(60, 15, n_samples)
active_connections = np.random.poisson(100, n_samples)
disk_io = np.random.exponential(10, n_samples)

# Create target variable (response time) with realistic relationships
response_time = (
    0.5 * cpu_usage +
    0.3 * memory_usage +
    0.02 * active_connections +
    0.8 * disk_io +
    np.random.normal(0, 5, n_samples)  # Add noise
)

# Create DataFrame
df = pd.DataFrame({
    'cpu_usage': cpu_usage,
    'memory_usage': memory_usage,
    'active_connections': active_connections,
    'disk_io': disk_io,
    'response_time': response_time
})

# Clean data (remove negative values for realistic server metrics)
df = df[(df >= 0).all(axis=1)]

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

# Define features and target
X = df[['cpu_usage', 'memory_usage', 'active_connections', 'disk_io']]
y = df['response_time']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"\nModel Performance:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

# Display coefficients
print(f"\nModel Coefficients:")
print(f"Intercept: {model.intercept_:.2f}")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")
```
This code creates a realistic server performance dataset and trains a multiple linear regression model. The coefficients tell you how much each factor contributes to response time, which is valuable for server optimization.
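Once trained, the model can also score a single hypothetical reading. The values below are invented for illustration, and the snippet assumes the `model` object and column names from the script above:

```python
# Hypothetical server reading (illustrative values, same columns as X)
new_reading = pd.DataFrame({
    'cpu_usage': [75.0],
    'memory_usage': [70.0],
    'active_connections': [150],
    'disk_io': [12.0]
})
estimated = model.predict(new_reading)
print(f"Estimated response time: {estimated[0]:.1f}")
```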
Real-World Examples and Use Cases
Multiple linear regression has several practical applications in system administration and development:
- Server Capacity Planning: Predict when you’ll need to upgrade your VPS or dedicated server based on current usage trends
- Performance Monitoring: Identify which system metrics most strongly correlate with application performance
- Cost Optimization: Predict cloud infrastructure costs based on usage patterns
- Load Balancing: Distribute traffic based on predicted server load
- Database Query Optimization: Predict query execution time based on table size, indexes, and complexity
Here’s a practical example for predicting database query performance:
```python
# Database query performance prediction
query_data = pd.DataFrame({
    'table_rows': [1000, 5000, 10000, 50000, 100000, 500000],
    'num_joins': [0, 1, 2, 3, 1, 4],
    'index_count': [2, 3, 5, 4, 6, 8],
    'query_complexity': [1, 2, 3, 4, 2, 5],  # Subjective 1-5 scale
    'execution_time_ms': [10, 45, 120, 380, 95, 1200]
})

# Prepare features and target
X_query = query_data[['table_rows', 'num_joins', 'index_count', 'query_complexity']]
y_query = query_data['execution_time_ms']

# Train model
query_model = LinearRegression()
query_model.fit(X_query, y_query)

# Predict execution time for a new query
# (use a DataFrame with the same column names to avoid scikit-learn's
# feature-name warning)
new_query = pd.DataFrame(
    [[25000, 2, 4, 3]],  # 25k rows, 2 joins, 4 indexes, complexity 3
    columns=X_query.columns
)
predicted_time = query_model.predict(new_query)
print(f"Predicted execution time: {predicted_time[0]:.1f}ms")
```
Comparison with Alternative Approaches
Understanding when to use multiple linear regression versus other techniques is crucial for choosing the right tool:
| Method | Best For | Pros | Cons | Performance |
|---|---|---|---|---|
| Multiple Linear Regression | Linear relationships, interpretability | Fast, interpretable, no hyperparameters | Assumes linearity, sensitive to outliers | Training: O(n³), Prediction: O(n) |
| Polynomial Regression | Non-linear relationships | Captures curves, still interpretable | Overfitting risk, high complexity | Training: O(n³), Prediction: O(n) |
| Random Forest | Non-linear, feature interactions | Handles non-linearity, robust | Less interpretable, more parameters | Training: O(n log n), Prediction: O(log n) |
| Support Vector Regression | High-dimensional data | Effective in high dimensions | Difficult to interpret, sensitive to scaling | Training: O(n²), Prediction: O(n) |
Here’s a comparison implementation showing performance differences:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import time

# Prepare models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'SVR': Pipeline([
        ('scaler', StandardScaler()),
        ('svr', SVR(kernel='rbf'))
    ])
}

# Compare performance (the loop variable is named `candidate` so the fitted
# linear regression stored in `model` above is not overwritten)
results = {}
for name, candidate in models.items():
    start_time = time.time()
    candidate.fit(X_train, y_train)
    training_time = time.time() - start_time

    start_time = time.time()
    predictions = candidate.predict(X_test)
    prediction_time = time.time() - start_time

    r2 = r2_score(y_test, predictions)
    results[name] = {
        'R²': r2,
        'Training Time (s)': training_time,
        'Prediction Time (s)': prediction_time
    }

# Display results
comparison_df = pd.DataFrame(results).T
print("\nModel Comparison:")
print(comparison_df.round(4))
```
Best Practices and Common Pitfalls
Avoiding common mistakes will save you hours of debugging and improve your model’s reliability:
- Feature Scaling: While linear regression doesn’t require feature scaling for accuracy, standardizing features makes coefficient magnitudes directly comparable (see the sketch after this list)
- Multicollinearity: Highly correlated features can make coefficients unstable and difficult to interpret
- Outlier Detection: A few extreme values can significantly skew your model
- Assumption Validation: Check for linearity, homoscedasticity, and normal residuals
- Cross-Validation: Use k-fold cross-validation for more robust performance estimates
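For the scaling point, one convenient pattern is to standardize the features inside a pipeline so the coefficients end up on a common scale (each one is then the expected change in the target per standard deviation of that feature). This sketch assumes the `X_train`/`y_train` split from the earlier example:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Standardize features, then fit; assumes X_train and y_train from above
scaled_model = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
scaled_model.fit(X_train, y_train)

# Coefficients are now comparable and easier to rank by influence
for feature, coef in zip(X_train.columns, scaled_model.named_steps['regressor'].coef_):
    print(f"{feature}: {coef:.2f} (per 1 standard deviation)")
```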
Here’s code to detect and handle common issues:
```python
# Check for multicollinearity
correlation_matrix = X.corr()
print("Feature Correlations:")
print(correlation_matrix)

# Detect high correlations (> 0.8)
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i + 1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            high_corr_pairs.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

if high_corr_pairs:
    print("\nHigh correlation pairs found:")
    for pair in high_corr_pairs:
        print(f"{pair[0]} - {pair[1]}: {pair[2]:.3f}")

# Outlier detection using IQR method
def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers

# Check each feature for outliers
for column in X.columns:
    outliers = detect_outliers(df, column)
    if len(outliers) > 0:
        print(f"\n{column} has {len(outliers)} outliers")

# Residual analysis
residuals = y_test - y_pred

# Plot residuals vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='r', linestyle='--')
plt.show()

# Cross-validation for robust performance estimation
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"\nCross-validation R² scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
```
For production deployments, consider implementing automated model retraining when performance degrades, and always validate your assumptions before trusting the results. The scikit-learn documentation provides comprehensive information about linear regression parameters and advanced techniques.
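As a rough illustration of that retraining idea (the file name and threshold below are hypothetical, and the test split stands in for freshly collected data), you can persist the fitted model with `joblib` and refit when its score drops:

```python
import joblib
from sklearn.metrics import r2_score

MODEL_PATH = 'response_time_model.joblib'  # hypothetical path
R2_THRESHOLD = 0.7                         # hypothetical quality floor

# Persist the trained model (assumes `model` and the splits from above)
joblib.dump(model, MODEL_PATH)

# Later, on newer data (the test split is used here as a stand-in)
loaded = joblib.load(MODEL_PATH)
current_r2 = r2_score(y_test, loaded.predict(X_test))

if current_r2 < R2_THRESHOLD:
    loaded.fit(X_train, y_train)  # refit on the latest available data
    joblib.dump(loaded, MODEL_PATH)
    print(f"Model retrained (R² had dropped to {current_r2:.2f})")
else:
    print(f"Model still healthy (R² = {current_r2:.2f})")
```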
Multiple linear regression remains one of the most practical tools for understanding relationships in your data, especially when you need interpretable results for system optimization decisions. Whether you’re predicting server load, optimizing database performance, or planning infrastructure capacity, this technique provides a solid foundation for data-driven decision making.
