- Efficiency Loss: While OLS estimators remain unbiased under heteroskedastic errors, they forfeit their BLUE (Best Linear Unbiased Estimator) status: other linear unbiased estimators may have lower variance, so OLS estimates are no longer the most precise available.
- Inference Inaccuracy: The usual OLS standard errors of the regression coefficients, which underpin hypothesis tests and confidence intervals, become biased. Test statistics and p-values are then misleading, potentially leading to erroneous conclusions about the relationships between variables.
Identifying heteroskedasticity is a critical first step before taking corrective measures. This diagnosis typically involves both visual inspection and formal statistical tests:
- Visual Inspection: Plotting residuals against fitted values or one of the independent variables can reveal patterns indicative of heteroskedasticity. A fan or funnel shape in these plots suggests increasing or decreasing variance of residuals.
- Statistical Tests: Several tests, such as the Breusch-Pagan and White tests, offer a formal mechanism to detect heteroskedasticity. These tests assess the null hypothesis of homoskedasticity against the alternative of heteroskedasticity, providing a p-value to guide conclusions.
Upon detecting heteroskedasticity, you can employ several strategies to mitigate its effects:
- Robust Standard Errors (RSE): One of the most straightforward approaches involves adjusting the standard errors of the regression coefficients to account for heteroskedasticity, enabling valid hypothesis testing without altering the original OLS estimates. This is the approach we'll be applying in this blog, using Python.
- Transformation of Variables: Applying transformations to the dependent variable, such as taking logarithms or square roots, can sometimes stabilize the variance of the errors (a brief sketch follows this list).
- Generalized Least Squares (GLS): This method transforms the original equation so that observations are weighted differently, effectively neutralizing the problem of unequal variances; weighted least squares (WLS) is its most common special case (also sketched below).
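To make the last two remedies concrete, here is a minimal, self-contained sketch on simulated data whose error variance grows with the predictor. It assumes, purely for illustration, that the error variance is proportional to x squared, which is what the 1/x**2 weights in the WLS fit encode; in practice the weights would have to be estimated or justified from the data.
import numpy as np
import statsmodels.api as sm
np.random.seed(0)
x = np.linspace(1, 100, 100)
y = 2 + 3*x + np.random.normal(0, 0.1*x, size=100)  # error SD grows with x
X = sm.add_constant(x)
# Remedy: transform the dependent variable to compress the spread at large x
ols_log = sm.OLS(np.log(y), X).fit()
print(ols_log.params)
# Remedy: weighted least squares, down-weighting the high-variance observations
# (weights = 1/variance; here we assume Var(error_i) is proportional to x_i**2)
wls = sm.WLS(y, X, weights=1.0/x**2).fit()
print(wls.params)
With those alternatives in mind, the walkthrough below focuses on the robust-standard-error approach: it simulates heteroskedastic data, diagnoses the problem visually and with the Breusch-Pagan test, and then re-estimates the standard errors robustly.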
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
# Set seed for reproducibility
np.random.seed(123)
# Generate heteroskedastic data
x = np.linspace(1, 100, 100)
errors = np.random.normal(0, 0.1*x, size=100)  # error standard deviation grows with x
y = 2 + 3*x + errors
# Fit an OLS regression model
X = sm.add_constant(x) # Adds a constant term to the predictor
model = sm.OLS(y, X).fit()
# Summary of the model
print(model.summary())
# Plot residuals vs fitted values to visually check for heteroskedasticity
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.show()
# Perform Breusch-Pagan test
bp_test = het_breuschpagan(model.resid, model.model.exog)
labels = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
print(dict(zip(labels, bp_test)))
# Address heteroskedasticity by using robust standard errors
model_robust = model.get_robustcov_results(cov_type='HC1')  # heteroskedasticity-consistent (White/HC1) covariance
print(model_robust.summary())
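The Breusch-Pagan test used above is one of the two formal tests mentioned earlier. For completeness, here is a minimal sketch of the other, the White test, run on the same fitted model via statsmodels' het_white:
from statsmodels.stats.diagnostic import het_white
# White test: regresses squared residuals on the regressors, their squares,
# and (with multiple regressors) their cross products
white_test = het_white(model.resid, model.model.exog)
print(dict(zip(labels, white_test)))
Both tests return an LM statistic and an F statistic with their p-values, so the same labels can be reused when printing the results.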
Beyond its immediate effects on statistical inference, heteroskedasticity signals potential specification errors or opportunities for model improvement. It prompts researchers to revisit their models, considering whether key variables have been omitted, if non-linear relationships have been inadequately modeled, or if there's an underlying pattern in the data not captured by the current model form.
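As a purely hypothetical illustration of that re-checking loop, one could re-specify the model from the walkthrough above, for example by adding a squared term, and re-run the Breusch-Pagan test to see whether the apparent heteroskedasticity was merely an artifact of a misspecified functional form. In the simulated data used here the unequal variance is genuine, so the test should still reject; the point is the workflow, not the outcome.
# Hypothetical re-specification: add a quadratic term, refit, and re-test
X_quad = sm.add_constant(np.column_stack([x, x**2]))
model_quad = sm.OLS(y, X_quad).fit()
bp_quad = het_breuschpagan(model_quad.resid, model_quad.model.exog)
print(dict(zip(labels, bp_quad)))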
Heteroskedasticity, a persistent challenge in regression analysis, underscores the importance of diligent model checking and refinement. By understanding its implications, diagnosing its presence, and applying appropriate remedies, researchers can enhance the robustness and reliability of their findings.