20 Statsmodels concepts with Before-and-After Examples

Anix Lynch


1. Linear Regression (OLS) 📈

Boilerplate Code:

import statsmodels.api as sm

X = sm.add_constant(X)  # Add constant to X variables
model = sm.OLS(y, X)
results = model.fit()

Use Case: Perform a basic linear regression using OLS (Ordinary Least Squares).

Goal: Analyze the relationship between a dependent variable and independent variables. 🎯

Sample Code:

X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

Before Example:
You manually try to figure out how the variables relate without a structured model. 🤔

y = [3, 4, 5, 6]
X = [[1], [2], [3], [4]]
# No real model in place.

After Example:
With OLS, you have a fitted linear model and can interpret relationships. 📈

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.960
Model:                            OLS   Adj. R-squared:                  0.940
Method:                 Least Squares   F-statistic:                     48.00
==============================================================================
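
Here is a minimal, runnable sketch of the full OLS workflow (the tiny arrays are made up for illustration):

import numpy as np
import statsmodels.api as sm

# Made-up data: y grows by ~1 for each unit of x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 4.0, 5.0, 6.0])

X = sm.add_constant(X)        # prepend the intercept column
results = sm.OLS(y, X).fit()
print(results.params)         # [intercept, slope]
print(results.summary())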

Challenge: 🌟 Fit an OLS model with multiple predictor variables and try interpreting the p-values and coefficients.


2. Logistic Regression 🔄

Boilerplate Code:

import statsmodels.api as sm

model = sm.Logit(y, X)
results = model.fit()

Use Case: Perform logistic regression for binary classification tasks.

Goal: Predict binary outcomes (e.g., success/failure) using logistic regression. 🎯

Sample Code:

model = sm.Logit(y, X)
results = model.fit()
print(results.summary())

Before Example:
You manually assign probabilities without understanding how they relate to the data. 😕

y = [1, 0, 1, 0]
X = [[3], [2], [4], [1]]

After Example:
With Logit, you get coefficients and probabilities for binary outcomes. 🔄

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   Pseudo R-squ.:                   0.624
Model:                          Logit   Log-Likelihood:                 -9.125
Method:                           MLE   LL-Null:                       -23.456
==============================================================================
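
Note that a tiny, perfectly separated sample (like the four points above) will make the MLE blow up, so a runnable sketch needs a bit more made-up data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)            # arbitrary seed
x = rng.normal(size=100)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))    # made-up true log-odds: 0.5 + 1.5x
y = rng.binomial(1, p)

X = sm.add_constant(x)
results = sm.Logit(y, X).fit()
print(results.params)           # estimated intercept and slope (log-odds scale)
print(results.predict(X)[:5])   # predicted probabilities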

Challenge: 🌟 Perform logistic regression on a customer churn dataset to predict customer retention.


3. Time Series Analysis (ARIMA) 🕒

Boilerplate Code:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(y, order=(1, 1, 1))
results = model.fit()

Use Case: Model and forecast time series data using ARIMA (AutoRegressive Integrated Moving Average).

Goal: Use ARIMA to make accurate forecasts of future data points. 🎯

Sample Code:

model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
print(results.summary())

Before Example:
You analyze raw data without considering trends or patterns over time. 🤔

y = [10, 12, 14, 16, 15, 20]
# Manually forecasting next values without capturing seasonality or trends.

After Example:
With ARIMA, you get fitted models and forecasts that capture temporal dependencies. 🕒

                              ARIMA Model Results                             
==============================================================================
Dep. Variable:                      y   No. Observations:                    6
Model:                 ARIMA(1, 1, 1)   Log Likelihood                 -10.012
                                        AIC                             26.024
                                        BIC                             26.946
==============================================================================
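
A runnable sketch on a made-up trending series, including a short forecast:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)                # arbitrary seed
y = np.cumsum(rng.normal(loc=0.5, size=60))   # synthetic upward-drifting series

results = ARIMA(y, order=(1, 1, 1)).fit()
print(results.summary())
print(results.forecast(steps=5))   # next five forecasted values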

Challenge: 🌟 Fit an ARIMA model to predict monthly sales data and compare forecasts with actual results.


4. Autoregressive (AR) Model 🔄

Boilerplate Code:

from statsmodels.tsa.ar_model import AutoReg

model = AutoReg(y, lags=2)
results = model.fit()

Use Case: Model time series data using an autoregressive model, which predicts values based on previous observations.

Goal: Capture lagged relationships in time series data with AR. 🎯

Sample Code:

model = AutoReg(y, lags=2)
results = model.fit()
print(results.summary())

Before Example:
Manually trying to predict values without considering the past observations. 😕

y = [100, 120, 130, 125]
# No lagged model to capture autocorrelation.

After Example:
AR model captures autocorrelations, leading to more accurate forecasts. 🔄

                            AutoReg Model Results                             
==============================================================================
Dep. Variable:                      y   No. Observations:                    4
Model:                     AutoReg(2)   Log Likelihood                 -10.234
                                        AIC                             24.469
                                        BIC                             26.893
==============================================================================
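
A runnable sketch on a made-up series where each value depends on the previous two (coefficients and seed are arbitrary):

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(0)
y = [100.0, 120.0]
for _ in range(58):   # synthetic AR(2)-style process
    y.append(10 + 0.6 * y[-1] + 0.3 * y[-2] + rng.normal(scale=2.0))
y = np.asarray(y)

results = AutoReg(y, lags=2).fit()
print(results.params)                                  # const, lag-1, lag-2
print(results.predict(start=len(y), end=len(y) + 4))   # five steps ahead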

Challenge: 🌟 Fit an AR model to monthly temperature data and forecast future values based on lagged observations.


5. Poisson Regression 🧮

Boilerplate Code:

import statsmodels.api as sm

model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

Use Case: Model count data where the dependent variable represents counts (e.g., number of events).

Goal: Use Poisson regression to predict the frequency of events. 🎯

Sample Code:

model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())

Before Example:
Predicting counts (e.g., website clicks) without considering the distribution of count data. 🤔

y = [5, 7, 9, 12]
X = [[1], [2], [3], [4]]

After Example:
Poisson regression models count data with non-negative integer outputs. 🧮

                          Poisson Regression Results                          
==============================================================================
Dep. Variable:                      y   No. Observations:                    4
Model:                            GLM   Df Residuals:                        2
Model Family:                 Poisson   Df Model:                            1
==============================================================================
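
A runnable sketch using the made-up counts from the before example:

import numpy as np
import statsmodels.api as sm

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5, 7, 9, 12])   # made-up counts that rise with x

X = sm.add_constant(X)
results = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(results.params)              # coefficients on the log scale
print(np.exp(results.params[1]))   # multiplicative change in the count per unit of x
print(results.predict(X))          # expected counts, always non-negative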

Challenge: 🌟 Apply Poisson regression to model the number of daily website visits based on ad spending.


6. ANOVA (Analysis of Variance) 🎲

Boilerplate Code:

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

model = ols('y ~ C(group)', data=df).fit()
anova_results = anova_lm(model)

Use Case: Test for significant differences between group means.

Goal: Use ANOVA to compare the means of two or more groups. 🎯

Sample Code:

model = ols('y ~ C(group)', data=df).fit()
anova_results = anova_lm(model)
print(anova_results)

Before Example:
You try to compare means of groups manually without statistical testing. 😬

group1 = [3, 4, 5]
group2 = [7, 8, 9]

After Example:
ANOVA tests whether differences between group means are statistically significant. 🎲

                            ANOVA Table                                
=========================================================================
        df    sum_sq   mean_sq         F    PR(>F)
group    1   102.667   102.667  19.35413   0.003
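
A runnable sketch stacking the two made-up groups from the before example into long form:

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "y": [3, 4, 5, 7, 8, 9],
    "group": ["g1", "g1", "g1", "g2", "g2", "g2"],
})

model = ols("y ~ C(group)", data=df).fit()
print(anova_lm(model))   # F-test for a difference between group means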

Challenge: 🌟 Run an ANOVA test on test scores across different teaching methods and interpret the significance of results.


7. Durbin-Watson Test (Autocorrelation) 🔍

Boilerplate Code:

from statsmodels.stats.stattools import durbin_watson

dw_stat = durbin_watson(results.resid)

Use Case: Test for autocorrelation in the residuals of a regression model.

Goal: Ensure residuals are independent and not autocorrelated using the Durbin-Watson Test. 🎯

Sample Code:

dw_stat = durbin_watson(results.resid)
print(f'Durbin-Watson statistic: {dw_stat}')

Before Example:
Residuals exhibit autocorrelation, leading to biased results. 😬

Residuals: [0.1, 0.2, 0.15, 0.3, 0.25]  # Showing some autocorrelation.

After Example:
The Durbin-Watson test helps check that residuals are independent. 🔍

Durbin-Watson statistic: 2.12  # A value near 2 indicates no autocorrelation.
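
The test needs a fitted model, so here is a self-contained sketch on made-up data with independent errors by construction:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=50)   # independent noise

results = sm.OLS(y, sm.add_constant(x)).fit()
dw_stat = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw_stat:.2f}")   # ~2 suggests no autocorrelation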

Challenge: 🌟 Apply the Durbin-Watson test on a regression model's residuals to detect autocorrelation and assess the model's assumptions.


8. Ljung-Box Test (Autocorrelation in Residuals) 🔎

Boilerplate Code:

from statsmodels.stats.diagnostic import acorr_ljungbox

ljung_box_results = acorr_ljungbox(results.resid, lags=[10])

Use Case: Test whether residuals of a time series model are autocorrelated.

Goal: Use the Ljung-Box Test to determine if residuals are independent and not autocorrelated over time. 🎯

Sample Code:

ljung_box_results = acorr_ljungbox(results.resid, lags=[10])
print(ljung_box_results)

Before Example:
Residuals in a time series model show patterns over time. 😬

Residuals: [0.3, 0.25, 0.35, 0.3, 0.4]  # Residuals follow some trend.

After Example:
The Ljung-Box Test detects whether the residuals are independent. 🔎

Ljung-Box p-value: 0.08  # A p-value above 0.05 means no significant autocorrelation was detected.
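
A self-contained sketch on made-up white-noise residuals, where the test should not fire:

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
resid = rng.normal(size=100)   # white-noise "residuals" for illustration

# Returns a DataFrame with lb_stat and lb_pvalue columns
print(acorr_ljungbox(resid, lags=[10]))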

Challenge: 🌟 Use the Ljung-Box test on ARIMA model residuals to check for autocorrelation and validate model assumptions.


9. Quantile Regression 📊

Boilerplate Code:

import statsmodels.api as sm

model = sm.QuantReg(y, X)
results = model.fit(q=0.5)  # Median regression

Use Case: Perform quantile regression to model relationships between variables at different quantiles (e.g., median, 90th percentile).

Goal: Use Quantile Regression to understand how predictors affect different parts of the outcome distribution. 🎯

Sample Code:

model = sm.QuantReg(y, X)
results = model.fit(q=0.5)  # Median regression
print(results.summary())

Before Example:
Modeling only the mean of a distribution might miss insights on the tails. 😕

Mean regression results: Coefficient = 2.0  # Captures only central tendency.

After Example:
Quantile regression provides insights into different parts of the distribution (e.g., median, 90th percentile). 📊

Quantile regression (median): Coefficient = 1.5  # Different insights at different quantiles.
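
A runnable sketch on made-up data whose spread widens with x, so the quantile slopes differ:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=x)   # noise grows with x on purpose

X = sm.add_constant(x)
model = sm.QuantReg(y, X)
for q in (0.1, 0.5, 0.9):
    print(f"q={q}: slope = {model.fit(q=q).params[1]:.2f}")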

Challenge: 🌟 Apply quantile regression on income data to understand disparities at the 10th, 50th, and 90th percentiles.


10. Autoregressive Integrated Moving Average (ARIMA) 🕒

Boilerplate Code:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(y, order=(1, 1, 1))
results = model.fit()

Use Case: Model and forecast time series data with trends, seasonality, and noise using ARIMA.

Goal: Use ARIMA to forecast time series data, accounting for autocorrelation and trends (pass a seasonal_order for seasonal patterns). 🎯

Sample Code:

model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
print(results.summary())

Before Example:
Manually trying to forecast without a structured model for time series. 😩

Future values based on simple extrapolation, ignoring seasonality.

After Example:
With ARIMA, you model the trend and noise structure in time series data for accurate forecasts. 🕒

ARIMA Results: AIC = 250.0, BIC = 260.0, coefficients estimated for the AR and MA terms.
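
A forecasting-focused sketch on a made-up monthly-style series (seed and series length are arbitrary):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(loc=1.0, size=48))   # made-up "monthly" series

results = ARIMA(y, order=(1, 1, 1)).fit()
forecast = results.get_forecast(steps=12)
print(forecast.predicted_mean)   # 12 point forecasts
print(forecast.conf_int())       # 95% prediction intervals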

Challenge: 🌟 Fit an ARIMA model on monthly sales data and use it to forecast the next 12 months.


11. Pooled OLS (Panel Data) 📊

Boilerplate Code:

from linearmodels import PooledOLS

pooled_model = PooledOLS(y, X).fit()

Use Case: Fit a pooled OLS model to analyze panel data, where observations are repeated over time across entities (e.g., multiple companies, countries).

Goal: Use Pooled OLS when you are willing to assume the same coefficients apply to all entities and time periods, ignoring entity-specific effects. 🎯

Sample Code:

pooled_model = PooledOLS(y, X).fit()
print(pooled_model.summary)  # summary is a property in linearmodels, not a method

Before Example:
You have panel data but aren't considering the time dimension, treating the data as cross-sectional. 🤔

Only analyzing data without accounting for the multiple periods or entities.

After Example:
With Pooled OLS, you account for repeated observations, fitting the model to panel data. 📊

                           PooledOLS Model Results                            
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                      PooledOLS   Log Likelihood                -234.567
                                        AIC                             492.34
==============================================================================
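
A runnable sketch with a made-up panel of 4 firms over 5 years; linearmodels expects the data indexed by a (entity, time) MultiIndex:

import numpy as np
import pandas as pd
from linearmodels import PooledOLS

idx = pd.MultiIndex.from_product([range(4), range(5)], names=["firm", "year"])
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=20)}, index=idx)
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(scale=0.5, size=20)

X = df[["x"]].assign(const=1.0)   # PooledOLS does not add a constant for you
pooled_model = PooledOLS(df["y"], X).fit()
print(pooled_model.summary)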

Challenge: 🌟 Try using Pooled OLS on a dataset with multiple companies' financial data over 5 years and interpret the output.


12. Fixed Effects Model (Panel Data) 🔍

Boilerplate Code:

from linearmodels import PanelOLS

fixed_effects_model = PanelOLS(y, X, entity_effects=True).fit()

Use Case: Control for time-invariant differences across entities (like companies or countries) in panel data.

Goal: Use Fixed Effects to eliminate the influence of unobserved, time-invariant characteristics that differ across entities. 🎯

Sample Code:

fixed_effects_model = PanelOLS(y, X, entity_effects=True).fit()
print(fixed_effects_model.summary)  # summary is a property in linearmodels

Before Example:
Ignoring unobserved heterogeneity across entities, leading to biased results. 😕

Same coefficients applied across all entities, ignoring individual differences.

After Example:
Fixed Effects account for these time-invariant characteristics, offering a more refined model. 🔍

Fixed Effects Model Results: slope estimates purged of entity-specific intercepts, with an improved within R-squared.
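
A runnable sketch where the made-up data has entity-specific shifts that the fixed effects should absorb:

import numpy as np
import pandas as pd
from linearmodels import PanelOLS

idx = pd.MultiIndex.from_product([["a", "b", "c"], range(6)],
                                 names=["entity", "time"])
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=18)}, index=idx)
shift = {"a": 0.0, "b": 5.0, "c": -3.0}   # made-up entity intercepts
base = np.array([shift[e] for e in df.index.get_level_values("entity")])
df["y"] = base + 2.0 * df["x"] + rng.normal(scale=0.5, size=18)

fixed_effects_model = PanelOLS(df["y"], df[["x"]], entity_effects=True).fit()
print(fixed_effects_model.summary)   # slope on x estimated net of the entity shifts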

Challenge: 🌟 Compare Fixed Effects to Pooled OLS on a dataset with company profits over time to see the effect of controlling for entity-specific variables.


13. Random Effects Model (Panel Data) 🎯

Boilerplate Code:

from linearmodels import RandomEffects

random_effects_model = RandomEffects(y, X).fit()

Use Case: Analyze panel data where unobserved effects are random and assumed to be uncorrelated with the independent variables.

Goal: Use Random Effects when you believe that differences across entities have a random component and are uncorrelated with your predictors. 🎯

Sample Code:

random_effects_model = RandomEffects(y, X).fit()
print(random_effects_model.summary)  # summary is a property in linearmodels

Before Example:
Fitting a simple pooled OLS model and assuming no random variation across entities. ๐Ÿค”

Unaccounted for random variations across entities.

After Example:
Random Effects model handles random variations across entities, improving model fit. 🎯

                          RandomEffects Model Results                         
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                  RandomEffects   Log Likelihood                -204.123
                                        AIC                            412.245
==============================================================================
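
A runnable sketch where the made-up entity intercepts are random and uncorrelated with x by construction:

import numpy as np
import pandas as pd
from linearmodels import RandomEffects

idx = pd.MultiIndex.from_product([range(10), range(5)],
                                 names=["entity", "time"])
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=50)}, index=idx)
u = np.repeat(rng.normal(size=10), 5)   # made-up random entity intercepts
df["y"] = u + 2.0 * df["x"] + rng.normal(scale=0.5, size=50)

X = df[["x"]].assign(const=1.0)
random_effects_model = RandomEffects(df["y"], X).fit()
print(random_effects_model.summary)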

Challenge: 🌟 Fit both Random Effects and Fixed Effects models to panel data on customer satisfaction surveys and compare the results.


14. Durbin-Wu-Hausman Test (Fixed vs Random Effects) ⚖️

Boilerplate Code:

from linearmodels.panel import compare

results = compare({'FE': fixed_effects_model, 'RE': random_effects_model})

Use Case: Decide between using Fixed Effects or Random Effects for panel data analysis by testing for endogeneity.

Goal: Apply the Durbin-Wu-Hausman Test to check whether to use fixed or random effects based on whether the entity-specific effects are correlated with your independent variables. 🎯

Sample Code:

results = compare({'FE': fixed_effects_model, 'RE': random_effects_model})
print(results)  # side-by-side comparison; the Hausman statistic itself is computed by hand (see below)

Before Example:
Uncertainty whether to apply fixed or random effects to panel data. 😩

Analysis proceeds without checking for endogeneity, leading to model bias.

After Example:
The Hausman Test helps choose the correct model by checking whether the entity effects are correlated with the regressors. ⚖️

Hausman Test Results: p-value > 0.05: Random effects is preferable.
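
linearmodels' compare() only lines the fits up side by side; here is a hand-rolled sketch of the classic Hausman statistic, assuming the fitted fixed_effects_model and random_effects_model from the previous two sections (both sharing the slope "x"):

import numpy as np
from scipy import stats

def hausman(fe_res, re_res, common=("x",)):
    # Classic Hausman statistic over the coefficients both models share.
    # Caveat: v - V may not be positive definite in small samples.
    common = list(common)
    b, B = fe_res.params[common], re_res.params[common]
    v, V = fe_res.cov.loc[common, common], re_res.cov.loc[common, common]
    diff = (b - B).values
    stat = float(diff @ np.linalg.inv((v - V).values) @ diff)
    return stat, stats.chi2.sf(stat, df=len(common))

stat, pval = hausman(fixed_effects_model, random_effects_model)
print(f"Hausman stat = {stat:.3f}, p-value = {pval:.3f}")
# Small p-value: effects correlate with the regressors, prefer fixed effects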

Challenge: 🌟 Perform the Hausman Test on panel data from multiple industries and interpret the p-value to decide between fixed or random effects.


15. Heteroskedasticity Test (Breusch-Pagan Test) 📊

Boilerplate Code:

from statsmodels.stats.diagnostic import het_breuschpagan

bp_test = het_breuschpagan(results.resid, results.model.exog)

Use Case: Test for heteroskedasticity (non-constant variance) in the residuals of a regression model.

Goal: Use the Breusch-Pagan Test to detect whether the variance of the residuals is constant across observations. 🎯

Sample Code:

bp_test = het_breuschpagan(results.resid, results.model.exog)
print(f'Breusch-Pagan p-value: {bp_test[1]}')

Before Example:
You fit a regression model but don't check for heteroskedasticity, leading to inefficient estimates. 😕

Residuals: [0.3, 0.25, 0.9, 0.8, 0.75]  # Possible signs of heteroskedasticity.

After Example:
With the Breusch-Pagan test, you confirm whether or not heteroskedasticity is present. 📊

Breusch-Pagan p-value: 0.002  # A low p-value indicates heteroskedasticity is present.
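
A self-contained sketch on made-up data where the noise deliberately grows with x:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=x)   # heteroskedastic by construction

results = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid,
                                                    results.model.exog)
print(f"Breusch-Pagan p-value: {lm_pval:.4f}")   # small p flags heteroskedasticity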

Challenge: 🌟 Use the Breusch-Pagan Test on a dataset where heteroskedasticity might be present, such as housing prices or income distribution.


16. Ljung-Box Test (Autocorrelation in Residuals) 🔎

Boilerplate Code:

from statsmodels.stats.diagnostic import acorr_ljungbox

ljung_box_results = acorr_ljungbox(results.resid, lags=[10])

Use Case: Test for autocorrelation in the residuals of a time series model.

Goal: Use the Ljung-Box Test to ensure residuals are independent and not autocorrelated. 🎯

Sample Code:

ljung_box_results = acorr_ljungbox(results.resid, lags=[10])
print(ljung_box_results)

Before Example:
Residuals show autocorrelation, causing bias in time series predictions. 😩

Residuals: [0.4, 0.35, 0.45, 0.5, 0.55]  # Residuals show a clear trend over time.

After Example:
The Ljung-Box Test confirms if residuals are independent. 🔎

Ljung-Box p-value: 0.15  # No significant autocorrelation in residuals.
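
For contrast with the white-noise sketch in concept 8, here is the opposite case: made-up AR(1) residuals, where the test should fire at every lag:

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
resid = np.zeros(200)
for t in range(1, 200):                    # made-up autocorrelated residuals
    resid[t] = 0.7 * resid[t - 1] + rng.normal()

# One row per tested lag; tiny p-values flag the autocorrelation
print(acorr_ljungbox(resid, lags=[5, 10, 20]))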

Challenge: 🌟 Apply the Ljung-Box Test on ARIMA model residuals to validate independence assumptions in time series analysis.


17. Seasonal Decomposition of Time Series (STL Decomposition) 📅

Boilerplate Code:

from statsmodels.tsa.seasonal import STL

stl = STL(y, period=12, seasonal=13)  # period (e.g. 12 for monthly) is required unless y has a dated index
result = stl.fit()

Use Case: Decompose a time series into its trend, seasonal, and residual components.

Goal: Use STL Decomposition to understand the underlying patterns in a time series, especially for data with strong seasonal components. 🎯

Sample Code:

stl = STL(y, period=12, seasonal=13)
result = stl.fit()
result.plot()

Before Example:
You struggle to separate seasonal patterns from the underlying trend. 😕

Time series plot shows seasonal fluctuation, but hard to identify the pattern.

After Example:
STL Decomposition reveals trend, seasonality, and residuals separately. 📅

STL Decomposition plot shows clear trend, seasonal, and residual components.
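
A runnable sketch on a made-up monthly series built from a linear trend plus a yearly cycle plus noise:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
y = pd.Series(np.linspace(100, 160, 96)                        # trend
              + 10 * np.sin(2 * np.pi * np.arange(96) / 12)    # yearly cycle
              + rng.normal(scale=2.0, size=96), index=idx)

result = STL(y, seasonal=13).fit()   # period inferred from the monthly index
print(result.trend.head())           # smoothed trend component
print(result.seasonal.head())        # repeating seasonal component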

Challenge: 🌟 Apply STL Decomposition to retail sales data to identify seasonal trends in product demand.


18. Quantile Regression 📊

Boilerplate Code:

import statsmodels.api as sm

model = sm.QuantReg(y, X)
results = model.fit(q=0.5)  # Median regression

Use Case: Perform quantile regression to model different quantiles (e.g., median, 90th percentile) of the response variable.

Goal: Use Quantile Regression to understand how predictors affect different parts of the distribution (e.g., the tails or median). 🎯

Sample Code:

model = sm.QuantReg(y, X)
results = model.fit(q=0.5)  # Median regression
print(results.summary())

Before Example:
You only model the mean of the distribution, missing insights about extreme values. 😞

Mean regression shows a relationship between predictors and the response variable.

After Example:
With Quantile Regression, you model different parts of the distribution, gaining insight into the extremes. 📊

Quantile regression (median) results show a different relationship compared to the mean.
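
A sketch contrasting the mean (OLS) slope with the median (QuantReg) slope on made-up, right-skewed salary-style data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
# Made-up right-skewed noise, so the mean and median fits diverge
y = 2000 + 300 * x + rng.exponential(scale=200 * (1 + x / 5))

X = sm.add_constant(x)
print(f"OLS (mean) slope:        {sm.OLS(y, X).fit().params[1]:.1f}")
print(f"QuantReg (median) slope: {sm.QuantReg(y, X).fit(q=0.5).params[1]:.1f}")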

Challenge: 🌟 Apply quantile regression on a salary dataset to understand disparities at the 10th, 50th, and 90th percentiles.


19. Generalized Estimating Equations (GEE) 📉

Boilerplate Code:

from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.families import Gaussian

model = GEE(y, X, groups=group_var, family=Gaussian())
results = model.fit()

Use Case: Fit marginal models for correlated data, such as repeated measures or clustered data.

Goal: Use GEE to handle within-group correlation and produce robust estimates in the presence of correlated observations. 🎯

Sample Code:

model = GEE(y, X, groups=group_var, family=Gaussian())
results = model.fit()
print(results.summary())

Before Example:
You ignore within-group correlation, leading to biased estimates for repeated measures data. 😕

Repeated measures data shows correlation within groups, but it's not accounted for.

After Example:
With GEE, you account for within-group correlation and produce more accurate estimates. 📉

GEE model results show adjusted coefficients accounting for within-group correlation.
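
A runnable sketch with made-up repeated measures (30 patients, 4 visits each) and an exchangeable working correlation:

import numpy as np
import statsmodels.api as sm
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.families import Gaussian
from statsmodels.genmod.cov_struct import Exchangeable

rng = np.random.default_rng(0)
group_var = np.repeat(np.arange(30), 4)          # patient id per observation
x = rng.normal(size=120)
patient_fx = np.repeat(rng.normal(size=30), 4)   # induces within-patient correlation
y = 1.0 + 0.5 * x + patient_fx + rng.normal(scale=0.5, size=120)

X = sm.add_constant(x)
model = GEE(y, X, groups=group_var, family=Gaussian(),
            cov_struct=Exchangeable())
print(model.fit().summary())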

Challenge: 🌟 Fit a GEE model to analyze patient health outcomes measured over multiple visits to the clinic.


20. Autoregressive Integrated Moving Average (ARIMA) 🕒

Boilerplate Code:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(y, order=(1, 1, 1))
results = model.fit()

Use Case: Model and forecast time series data using ARIMA, which captures autoregressive, integrated, and moving average components.

Goal: Use ARIMA to handle trends, autocorrelation, and noise in time series data. 🎯

Sample Code:

model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
print(results.summary())

Before Example:
Without ARIMA, you're unable to accurately forecast time series data because trends and patterns are ignored. 😕

Time series predictions based on a simple model ignore trends and autocorrelations.

After Example:
With ARIMA, you model the trend and noise structure in time series data for more accurate forecasts. 🕒

ARIMA results: Captured autocorrelations, trends, and moving averages for better forecasting.
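
A sketch of picking the ARIMA order by AIC on made-up data (the small grid and seed are arbitrary illustrative choices):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(loc=0.3, size=120))   # made-up consumption-style series

# A small grid search over (p, 1, q), keeping the lowest-AIC fit
best_order, best_aic = None, np.inf
for p in range(3):
    for q in range(3):
        aic = ARIMA(y, order=(p, 1, q)).fit().aic
        if aic < best_aic:
            best_order, best_aic = (p, 1, q), aic
print(f"Best order by AIC: {best_order} (AIC = {best_aic:.1f})")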

Challenge: 🌟 Fit an ARIMA model on energy consumption data to forecast future usage based on historical patterns.

