
Logistic Regression with Scikit-Learn – Tutorial and Examples
Logistic regression is one of the fundamental machine learning algorithms for binary and multiclass classification, making it essential knowledge for developers working on data-driven applications. Unlike linear regression, which predicts continuous values, logistic regression outputs probabilities between 0 and 1 via the logistic function, which makes it well suited to tasks like spam detection, medical diagnosis, and user behavior prediction. In this tutorial, you’ll learn how to implement logistic regression with Scikit-Learn, understand its mathematical foundations, explore real-world applications, and pick up best practices for deploying these models on production servers.
How Logistic Regression Works
Logistic regression passes the output of a linear equation through the sigmoid (logistic) function so that predictions stay between 0 and 1. The core mathematical concept is the logistic function:
p = 1 / (1 + e^(-(β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ)))
Where p represents the probability of the positive class, β values are coefficients learned during training, and x values are input features. The algorithm uses maximum likelihood estimation to find optimal coefficients that best separate classes.
The decision boundary occurs at p = 0.5, though you can adjust this threshold based on your specific requirements. For multiclass problems, Scikit-Learn implements one-vs-rest or multinomial approaches automatically.
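To make the formula concrete, here is a minimal sketch (using made-up coefficient values, not a trained model) that computes the sigmoid by hand and applies a custom decision threshold:
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)"""
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients: β₀ = -1.5, β₁ = 0.8, and a single feature x = 3
z = -1.5 + 0.8 * 3
p = sigmoid(z)
print(f"Predicted probability: {p:.3f}")  # ≈ 0.711

# The default decision rule uses p >= 0.5, but the threshold can be adjusted
threshold = 0.7
print("Positive class" if p >= threshold else "Negative class")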
Step-by-Step Implementation Guide
Let’s start with a basic binary classification example using the classic iris dataset, then progress to more complex scenarios.
Basic Binary Classification
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
# Load and prepare data
iris = load_iris()
X = iris.data[:, :2] # Using only first two features for visualization
y = (iris.target != 0) * 1 # Convert to binary problem
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train the model
logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(X_train, y_train)
# Make predictions
y_pred = logistic_model.predict(X_test)
y_pred_proba = logistic_model.predict_proba(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
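The confusion_matrix and matplotlib imports above can also be put to use here; a quick sketch for inspecting where the binary model goes wrong:
# Confusion matrix for the binary model
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Optional visualization with matplotlib
plt.imshow(cm, cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.colorbar()
plt.show()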
Multiclass Classification Example
# Full multiclass iris classification
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load full dataset
iris = load_iris()
X, y = iris.data, iris.target
# Feature scaling (recommended for logistic regression)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Multiclass logistic regression
multi_logistic = LogisticRegression(
    multi_class='multinomial',  # Multinomial rather than one-vs-rest (note: in recent scikit-learn versions multinomial is already the default and this argument is deprecated)
    solver='lbfgs',             # Good solver for multiclass problems
    max_iter=1000,
    random_state=42
)
multi_logistic.fit(X_train, y_train)
# Predictions and evaluation
y_pred_multi = multi_logistic.predict(X_test)
accuracy_multi = accuracy_score(y_test, y_pred_multi)
print(f"Multiclass Accuracy: {accuracy_multi:.3f}")
print(f"Classes: {iris.target_names}")
Real-World Examples and Use Cases
Email Spam Detection
Here’s a practical example for email classification that you might deploy on your VPS server:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import joblib
# Sample email data (in practice, load from database)
emails = [
    "Free money! Click here now!",
    "Meeting scheduled for tomorrow at 2 PM",
    "WIN BIG! Claim your prize today!!!",
    "Please review the quarterly report",
    "URGENT: Your account will be suspended",
    "Coffee break in the kitchen"
]
labels = [1, 0, 1, 0, 1, 0] # 1 = spam, 0 = not spam
# Create pipeline for text processing and classification
spam_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('classifier', LogisticRegression(random_state=42))
])
# Train the model
spam_pipeline.fit(emails, labels)
# Test new emails
new_emails = [
    "Important meeting reminder",
    "You've won $1000000! Click now!"
]
predictions = spam_pipeline.predict(new_emails)
probabilities = spam_pipeline.predict_proba(new_emails)
for email, pred, prob in zip(new_emails, predictions, probabilities):
    spam_prob = prob[1]  # Probability of the spam class
    print(f"Email: '{email}'")
    print(f"Prediction: {'Spam' if pred == 1 else 'Not Spam'} (spam probability: {spam_prob:.3f})")
# Save model for production use
joblib.dump(spam_pipeline, 'spam_classifier.pkl')
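Because the TF-IDF vectorizer is bundled inside the pipeline, the saved file is self-contained; here is a minimal sketch of reloading it on the serving side (the printed values are illustrative):
# Reload the saved pipeline, e.g. in a separate serving process
loaded_spam_model = joblib.load('spam_classifier.pkl')
incoming = ["Limited time offer, act now!"]
print(loaded_spam_model.predict(incoming))        # e.g. [1] -> spam
print(loaded_spam_model.predict_proba(incoming))  # class probabilities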
Customer Churn Prediction
# Customer churn prediction example
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Create sample customer data
customer_data = {
    'tenure': [12, 24, 6, 36, 3, 48],
    'monthly_charges': [70.5, 85.2, 45.0, 95.8, 35.5, 110.0],
    'total_charges': [846, 2044.8, 270, 3449, 106.5, 5280],
    'contract_type': ['Month-to-month', 'Two year', 'Month-to-month', 'Two year', 'Month-to-month', 'One year'],
    'payment_method': ['Credit card', 'Bank transfer', 'Electronic check', 'Credit card', 'Electronic check', 'Bank transfer'],
    'churned': [1, 0, 1, 0, 1, 0]
}
df = pd.DataFrame(customer_data)
# Encode categorical variables
le_contract = LabelEncoder()
le_payment = LabelEncoder()
df['contract_encoded'] = le_contract.fit_transform(df['contract_type'])
df['payment_encoded'] = le_payment.fit_transform(df['payment_method'])
# Prepare features
features = ['tenure', 'monthly_charges', 'total_charges', 'contract_encoded', 'payment_encoded']
X = df[features]
y = df['churned']
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train churn model
churn_model = LogisticRegression(random_state=42)
churn_model.fit(X_scaled, y)
# Feature importance (coefficients)
feature_importance = pd.DataFrame({
    'feature': features,
    'coefficient': churn_model.coef_[0],
    'abs_coefficient': np.abs(churn_model.coef_[0])
}).sort_values('abs_coefficient', ascending=False)
print("Feature Importance:")
print(feature_importance)
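Because logistic regression is linear in the log-odds, exponentiating a coefficient yields an odds ratio, which is often easier to explain than the raw coefficient; a small sketch extending the churn example above:
# Convert coefficients to odds ratios for easier interpretation
feature_importance['odds_ratio'] = np.exp(feature_importance['coefficient'])
print("\nOdds ratios (per one standard deviation, since the features were scaled):")
print(feature_importance[['feature', 'odds_ratio']])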
Comparison with Alternative Algorithms
| Algorithm | Training Speed | Prediction Speed | Interpretability | Probability Output | Best Use Case |
|---|---|---|---|---|---|
| Logistic Regression | Fast | Very Fast | High | Yes | Linear relationships, baseline model |
| Random Forest | Medium | Fast | Medium | Yes | Non-linear relationships, feature importance |
| SVM | Slow | Fast | Low | No (by default) | High-dimensional data, complex boundaries |
| Neural Networks | Very Slow | Fast | Very Low | Yes | Complex patterns, large datasets |
| Naive Bayes | Very Fast | Very Fast | High | Yes | Text classification, small datasets |
Performance Benchmarking
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import time
# Generate larger dataset for benchmarking
X_bench, y_bench = make_classification(n_samples=10000, n_features=20, n_classes=2, random_state=42)
X_train_bench, X_test_bench, y_train_bench, y_test_bench = train_test_split(X_bench, y_bench, test_size=0.2, random_state=42)
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'Naive Bayes': GaussianNB()
}
results = []
for name, model in models.items():
    # Training time
    start_time = time.time()
    model.fit(X_train_bench, y_train_bench)
    train_time = time.time() - start_time
    # Prediction time
    start_time = time.time()
    y_pred_bench = model.predict(X_test_bench)
    predict_time = time.time() - start_time
    # Accuracy
    accuracy = accuracy_score(y_test_bench, y_pred_bench)
    results.append({
        'Model': name,
        'Training Time (s)': f"{train_time:.3f}",
        'Prediction Time (s)': f"{predict_time:.4f}",
        'Accuracy': f"{accuracy:.3f}"
    })
benchmark_df = pd.DataFrame(results)
print(benchmark_df.to_string(index=False))
Advanced Configuration and Hyperparameter Tuning
Scikit-Learn’s LogisticRegression offers several important parameters for optimization:
from sklearn.model_selection import GridSearchCV, cross_val_score
# Comprehensive hyperparameter tuning
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],       # Regularization strength
    'penalty': ['l1', 'l2', 'elasticnet'],      # Regularization type
    'solver': ['liblinear', 'lbfgs', 'saga'],   # Optimization algorithm
    'max_iter': [1000, 2000, 5000]              # Maximum iterations
}
# Note: l1 penalty requires 'liblinear' or 'saga' solver
# elasticnet requires 'saga' solver
adjusted_param_grid = [
    {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l2'], 'solver': ['liblinear', 'lbfgs']},
    {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l1'], 'solver': ['liblinear', 'saga']},
    {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.5]}
]
# Grid search with cross-validation
grid_search = GridSearchCV(
    LogisticRegression(random_state=42),
    adjusted_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
# Use the best model
best_model = grid_search.best_estimator_
final_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f"Test accuracy with best model: {final_accuracy:.3f}")
Production Deployment and Best Practices
When deploying logistic regression models on production servers, especially on dedicated servers handling high traffic, consider these implementation patterns:
Model Serialization and Loading
import joblib
import pickle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Create production-ready pipeline
production_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(
        C=1.0,
        penalty='l2',
        solver='lbfgs',
        max_iter=1000,
        random_state=42
    ))
])
# Train and save
production_pipeline.fit(X_train, y_train)
# Save using joblib (recommended for scikit-learn)
joblib.dump(production_pipeline, 'logistic_model_v1.pkl')
# Alternative: pickle (more universal but less efficient)
with open('logistic_model_v1.pickle', 'wb') as f:
    pickle.dump(production_pipeline, f)
# Loading in production
loaded_model = joblib.load('logistic_model_v1.pkl')
# Batch prediction function
def predict_batch(model, data_batch):
    """Optimized batch prediction for production"""
    try:
        predictions = model.predict(data_batch)
        probabilities = model.predict_proba(data_batch)
        return {
            'predictions': predictions.tolist(),
            'probabilities': probabilities.tolist(),
            'status': 'success'
        }
    except Exception as e:
        return {'status': 'error', 'message': str(e)}
# Example usage
sample_data = [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]]
result = predict_batch(loaded_model, sample_data)
print(result)
RESTful API Implementation
# Flask API for model serving
from flask import Flask, request, jsonify
import numpy as np
import joblib
app = Flask(__name__)
# Load model once at startup
model = joblib.load('logistic_model_v1.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Get JSON data
        data = request.get_json()
        # Convert to numpy array
        features = np.array(data['features']).reshape(1, -1)
        # Make prediction
        prediction = model.predict(features)[0]
        probability = model.predict_proba(features)[0].max()
        return jsonify({
            'prediction': int(prediction),
            'confidence': float(probability),
            'status': 'success'
        })
    except Exception as e:
        return jsonify({
            'status': 'error',
            'message': str(e)
        }), 400

@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({'status': 'healthy', 'model': 'logistic_regression_v1'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
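Assuming the service is running locally on port 5000, a client can exercise the /predict endpoint with the requests library (the feature values and response shown are illustrative):
# Example client call, run from another process or machine
import requests

response = requests.post(
    'http://localhost:5000/predict',
    json={'features': [5.1, 3.5, 1.4, 0.2]}
)
print(response.json())
# e.g. {'prediction': 0, 'confidence': 0.98, 'status': 'success'}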
Common Pitfalls and Troubleshooting
Feature Scaling Issues
Logistic regression is sensitive to feature scaling. Always check for convergence warnings:
# Common scaling problems and solutions
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Problem: Features with different scales
problematic_data = np.array([
    [1, 1000, 0.01],
    [2, 2000, 0.02],
    [3, 1500, 0.015]
])
# Solution 1: Standard scaling (most common)
standard_scaler = StandardScaler()
X_standard = standard_scaler.fit_transform(problematic_data)
# Solution 2: Min-max scaling (for bounded features)
minmax_scaler = MinMaxScaler()
X_minmax = minmax_scaler.fit_transform(problematic_data)
# Solution 3: Robust scaling (for outliers)
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(problematic_data)
print("Original data:")
print(problematic_data)
print("\nStandardized:")
print(X_standard)
Handling Convergence Warnings
# Fix convergence issues
from sklearn.exceptions import ConvergenceWarning
import warnings
# Catch convergence warnings
with warnings.catch_warnings():
    warnings.simplefilter("error", ConvergenceWarning)
    try:
        model = LogisticRegression(max_iter=100)  # Low iteration count to trigger the warning
        model.fit(X_train, y_train)
    except ConvergenceWarning:
        print("Convergence warning detected. Increasing max_iter...")
        model = LogisticRegression(max_iter=2000)
        model.fit(X_train, y_train)
        print("Model trained successfully with increased iterations.")
Multicollinearity Detection
# Check for multicollinearity issues
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_vif(X_df):
    """Calculate the Variance Inflation Factor for each feature"""
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X_df.columns
    vif_data["VIF"] = [variance_inflation_factor(X_df.values, i)
                       for i in range(len(X_df.columns))]
    return vif_data.sort_values('VIF', ascending=False)
# Example with correlated features
correlated_data = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
})
correlated_data['feature3'] = correlated_data['feature1'] * 2 + np.random.randn(100) * 0.1 # Highly correlated
vif_scores = calculate_vif(correlated_data)
print("VIF Scores (>5 indicates multicollinearity):")
print(vif_scores)
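One common remedy, sketched below, is to drop the feature with the highest VIF and recompute the scores:
# Drop the worst offender and re-check the VIF scores
worst_feature = vif_scores.iloc[0]['Feature']
reduced_data = correlated_data.drop(columns=[worst_feature])
print(f"Dropping '{worst_feature}' and recalculating VIF:")
print(calculate_vif(reduced_data))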
Performance Optimization and Monitoring
Memory-Efficient Training
# For large datasets that don't fit in memory
from sklearn.linear_model import SGDClassifier
# Stochastic Gradient Descent version (memory efficient)
sgd_logistic = SGDClassifier(
    loss='log_loss',  # Logistic regression loss (named 'log' in older scikit-learn versions)
    learning_rate='adaptive',
    eta0=0.01,
    random_state=42
)
# Simulate mini-batch training
batch_size = 1000  # Assumes a reasonably large training set; use a smaller batch size for small datasets
n_batches = max(1, len(X_train) // batch_size)
for epoch in range(3):  # Multiple epochs over the data
    for i in range(n_batches):
        start_idx = i * batch_size
        end_idx = start_idx + batch_size
        X_batch = X_train[start_idx:end_idx]
        y_batch = y_train[start_idx:end_idx]
        # Partial fit (incremental learning)
        sgd_logistic.partial_fit(X_batch, y_batch, classes=np.unique(y_train))
# Evaluate SGD model
sgd_accuracy = accuracy_score(y_test, sgd_logistic.predict(X_test))
print(f"SGD Logistic Regression Accuracy: {sgd_accuracy:.3f}")
Model Monitoring and Drift Detection
# Simple drift detection implementation
class ModelMonitor:
    def __init__(self, model, reference_data):
        self.model = model
        self.reference_mean = np.mean(reference_data, axis=0)
        self.reference_std = np.std(reference_data, axis=0)

    def detect_drift(self, new_data, threshold=2.0):
        """Detect whether new data differs significantly from the reference data"""
        new_mean = np.mean(new_data, axis=0)
        # Calculate z-scores for feature means
        z_scores = np.abs((new_mean - self.reference_mean) / self.reference_std)
        drift_detected = np.any(z_scores > threshold)
        return {
            'drift_detected': drift_detected,
            'max_drift_score': np.max(z_scores),
            'drift_features': np.where(z_scores > threshold)[0].tolist()
        }

    def log_predictions(self, X, y_pred, y_true=None):
        """Log prediction statistics for monitoring"""
        timestamp = pd.Timestamp.now()
        log_entry = {
            'timestamp': timestamp,
            'n_predictions': len(y_pred),
            'avg_confidence': np.mean(self.model.predict_proba(X).max(axis=1)),
            'prediction_distribution': np.bincount(y_pred).tolist()
        }
        if y_true is not None:
            log_entry['accuracy'] = accuracy_score(y_true, y_pred)
        return log_entry
# Usage example
monitor = ModelMonitor(loaded_model, X_train)
drift_result = monitor.detect_drift(X_test)
print(f"Drift detection result: {drift_result}")
For comprehensive documentation on Scikit-Learn’s logistic regression implementation, refer to the official Scikit-Learn documentation. The linear models user guide provides additional theoretical background and implementation details.
When deploying these models in production environments, ensure your server infrastructure can handle the computational requirements. Logistic regression models are generally lightweight and perform well even on modest hardware configurations, making them excellent choices for real-time prediction services.
