You can view the Jupyter notebook file (.ipynb) here.
This notebook analyzes the Auto dataset to investigate how vehicle characteristics relate to fuel efficiency (mpg).
We apply simple linear regression (Q8) and multiple linear regression (Q9), including diagnostic plots and transformations.
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Load Auto.csv (adjust the path to wherever the file is stored)
auto = pd.read_csv("/assets/Auto/Auto.csv")
# Convert horsepower to numeric; '?' entries become NaN
auto['horsepower'] = pd.to_numeric(auto['horsepower'], errors='coerce')
auto = auto.dropna()  # drop rows with missing values
auto.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |
We model mpg as the response and horsepower as the predictor.
# Simple linear regression
X = sm.add_constant(auto["horsepower"])
y = auto["mpg"]
model_simple = sm.OLS(y, X).fit()
model_simple.summary()
| Dep. Variable: | mpg | R-squared: | 0.606 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.605 |
| Method: | Least Squares | F-statistic: | 599.7 |
| Date: | Fri, 21 Nov 2025 | Prob (F-statistic): | 7.03e-81 |
| Time: | 10:22:31 | Log-Likelihood: | -1178.7 |
| No. Observations: | 392 | AIC: | 2361. |
| Df Residuals: | 390 | BIC: | 2369. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 39.9359 | 0.717 | 55.660 | 0.000 | 38.525 | 41.347 |
| horsepower | -0.1578 | 0.006 | -24.489 | 0.000 | -0.171 | -0.145 |
| Omnibus: | 16.432 | Durbin-Watson: | 0.920 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 17.305 |
| Skew: | 0.492 | Prob(JB): | 0.000175 |
| Kurtosis: | 3.299 | Cond. No. | 322. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Predict mpg at horsepower = 98, with 95% confidence and prediction intervals
new_value = pd.DataFrame({"const": [1], "horsepower": [98]})
pred_simple = model_simple.get_prediction(new_value).summary_frame(alpha=0.05)
pred_simple
| | mean | mean_se | mean_ci_lower | mean_ci_upper | obs_ci_lower | obs_ci_upper |
|---|---|---|---|---|---|---|
| 0 | 24.467077 | 0.251262 | 23.973079 | 24.961075 | 14.809396 | 34.124758 |
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(auto["horsepower"], auto["mpg"], s=20, alpha=0.7)
ax.plot(auto["horsepower"], model_simple.predict(X), color='red')
ax.set_xlabel("Horsepower")
ax.set_ylabel("MPG")
ax.set_title("MPG vs Horsepower with Regression Line")
plt.show()

fig = plt.figure(figsize=(12,10))
sm.graphics.plot_regress_exog(model_simple, "horsepower", fig=fig)
plt.show()

We now include all other variables (except name) to predict mpg.
We also explore correlations, interactions, and transformations.
# Re-apply the horsepower cleanup ('?' values become NaN); harmless if already done above
auto['horsepower'] = pd.to_numeric(auto['horsepower'], errors='coerce')
# Drop rows with missing values
auto = auto.dropna()
# Create numeric-only dataframe (drop 'name' column)
auto_numeric = auto.drop(columns=['name'])
# Select key variables for scatterplot matrix
subset = ["mpg", "horsepower", "weight", "year"]
# Create the scatterplot matrix
sns.pairplot(auto_numeric[subset], height=2.5)
plt.suptitle("Scatterplot Matrix: Key Predictors vs MPG", y=1.02)
plt.show()
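As a numerical complement to the scatterplot matrix, we can also compute the pairwise correlation matrix for the numeric variables (a small addition here, not part of the original notebook output):

# Pairwise correlations between mpg and the numeric predictors
auto_numeric.corr()

Strong negative correlations of mpg with horsepower and weight, visible in the scatterplots above, show up here as correlation values well below zero.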

# Multiple regression
X_multi = auto_numeric.drop(columns=['mpg'])
X_multi = sm.add_constant(X_multi)
y_multi = auto_numeric['mpg']
model_multi = sm.OLS(y_multi, X_multi).fit()
model_multi.summary()
# Diagnostic plots for multiple regression
fig = plt.figure(figsize=(12,10))
sm.graphics.plot_regress_exog(model_multi, "weight", fig=fig)
plt.show()
We can try interactions (e.g., horsepower × weight) or transformations (log, square root, squared) to improve the model. We then check the p-values for significance and whether the diagnostic plots look better; a quick numerical comparison of the fits follows below.
# Example: interaction between horsepower and weight
X_inter = auto_numeric.copy()
X_inter['hp_weight'] = X_inter['horsepower'] * X_inter['weight']
X_inter = sm.add_constant(X_inter.drop(columns=['mpg']))
y_inter = auto_numeric['mpg']
model_inter = sm.OLS(y_inter, X_inter).fit()
model_inter.summary()
# Example: log and squared transformations
X_trans = auto_numeric.copy()
X_trans['log_horsepower'] = np.log(X_trans['horsepower'])
X_trans['weight_squared'] = X_trans['weight'] ** 2
X_trans = sm.add_constant(X_trans.drop(columns=['mpg']))
y_trans = auto_numeric['mpg']
model_trans = sm.OLS(y_trans, X_trans).fit()
model_trans.summary()
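To see whether the interaction or the transformations actually help, one quick check is to compare adjusted R-squared and AIC across the fits. This is a minimal sketch added here (the variable name `comparison` is introduced for illustration), assuming the four models above have been fit:

# Compare the fits: higher adjusted R-squared and lower AIC indicate a better model
comparison = pd.DataFrame({
    "adj_R2": [model_simple.rsquared_adj, model_multi.rsquared_adj,
               model_inter.rsquared_adj, model_trans.rsquared_adj],
    "AIC": [model_simple.aic, model_multi.aic, model_inter.aic, model_trans.aic],
}, index=["simple", "multiple", "interaction", "transformed"])
comparison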
Working through this analysis helped me understand how vehicle characteristics like horsepower, weight, and year influence fuel efficiency. I learned how to interpret regression coefficients, evaluate model fit using diagnostic plots, and explore improvements through interactions and transformations. This project also strengthened my skills in presenting data analysis clearly in a professional blog format.