Linear Regression

(Based on this PDF tutorial by Andrew Andrade)

📌 Introduction

In this blog post, I walk through the Linear Regression Tutorial and explain the core concepts in an easy-to-understand way. I also include answers to the questions asked in the notebook and demonstrate what I learned while executing the notebook cells.

Additionally, here is the link to my GitHub repository where the linear-regression-tutorial.ipynb is running with all cells executed successfully:

➡️ GitHub Link: here

📈 What is Linear Regression?

Linear regression is a method used to model the relationship between two numeric variables. It finds the best straight line that fits a set of data points. This line can be written as:

y = m*x + b

m: Slope (how much y changes when x changes)
b: Intercept (value of y when x = 0)
x: Independent variable
y: Predicted dependent variable

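The goal is to choose m and b so that the residuals, meaning the differences between the observed values and the predictions, are as small as possible. Ordinary least squares does this by minimizing the sum of the squared residuals.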

residual = y_observed − y_predicted
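To make the line equation and the residual definition concrete, here is a minimal sketch that fits a line with the closed-form least squares formulas and then computes the residuals. The data points and the fit_line helper are made up for illustration; they are not taken from the tutorial notebook.

```python
# Hypothetical data, made up for illustration (not from the tutorial).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Return (m, b) that minimize the sum of squared vertical residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Centered sums: co-variation of x and y, and variation of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx          # slope
    b = y_mean - m * x_mean  # intercept
    return m, b


m, b = fit_line(xs, ys)

# residual = y_observed - y_predicted, one value per data point
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]

print(f"slope={m:.3f}, intercept={b:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
```

With an intercept in the model, the least squares residuals always sum to zero (up to floating-point error), which is a quick sanity check on any fit.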

🔍 Understanding Residuals

Residuals show how far off our model’s predictions are. We plot the residuals to check whether linear regression is a good fit.

A good linear regression model has:

Residuals evenly scattered around zero
A roughly bell-shaped histogram
No pattern or shape in the residual plot

If the residuals look random, linear regression is appropriate. If there is a pattern, consider transforming the data or using a different model.
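As a rough illustration of what "a pattern" means, the sketch below fits a straight line to data that is actually curved (y = x squared) and inspects the residuals. The data, the helper names, and the sign-change count are my own assumptions for the example; counting sign changes is only a crude stand-in for looking at a residual plot, not a formal statistical test.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx
    return m, y_mean - m * x_mean


def residuals_of(xs, ys):
    """Residuals (observed minus predicted) of the fitted line."""
    m, b = fit_line(xs, ys)
    return [y - (m * x + b) for x, y in zip(xs, ys)]


def sign_changes(residuals):
    """Count how often consecutive residuals switch sign."""
    signs = [r > 0 for r in residuals]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Hypothetical curved data: a straight line is the wrong model here.
xs = [float(x) for x in range(9)]
curved_ys = [x ** 2 for x in xs]

curved_res = residuals_of(xs, curved_ys)
print([round(r, 2) for r in curved_res])
print("sign changes:", sign_changes(curved_res))
```

Here the residuals are positive at both ends and negative in the middle, so they switch sign only twice across nine points. That U shape is exactly the kind of structure that suggests a transformation or a different model.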

🧠 Different Types of Regression

Method	When to use	What is minimized
Vertical Least Squares (standard)	y depends on x	Vertical residuals
Horizontal Least Squares	x depends on y	Horizontal residuals
Total Least Squares	Both x & y contain error	Perpendicular residuals

Total Least Squares is often the more realistic choice when both variables include measurement noise, because the standard method assumes that x is measured without error.
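To see how the three methods in the table differ, here is a small sketch that computes all three slopes for the same data. The data is hypothetical, and the total least squares formula used here assumes equal error variance in x and y (orthogonal regression) and a non-zero covariance between x and y.

```python
import math

# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def centered_sums(xs, ys):
    """Return (s_xx, s_yy, s_xy), the sums of squares and products about the means."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return s_xx, s_yy, s_xy


s_xx, s_yy, s_xy = centered_sums(xs, ys)

# Vertical least squares: regress y on x.
slope_vertical = s_xy / s_xx

# Horizontal least squares: regress x on y (slope s_xy / s_yy),
# then invert so the result is expressed as change in y per unit x.
slope_horizontal = s_yy / s_xy

# Total least squares (orthogonal regression, equal error variances assumed).
slope_total = (s_yy - s_xx + math.sqrt((s_yy - s_xx) ** 2 + 4 * s_xy ** 2)) / (2 * s_xy)

print(f"vertical:   {slope_vertical:.4f}")
print(f"total:      {slope_total:.4f}")
print(f"horizontal: {slope_horizontal:.4f}")
```

For positively correlated data like this, the total least squares slope falls between the vertical and horizontal slopes. The three values get closer together as the points get closer to a perfect line.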

📊 Evaluating the Model — Key Metrics

Statistic	Meaning	Why it matters
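R²	Proportion of the variation in y explained by the model	Closer to 1 = better fit
Adj. R²	R² adjusted for the number of predictors and the sample size	Useful for comparing models
p-value	Probability of seeing a slope at least this extreme if the true slope were zero	< 0.05 means the slope is statistically significant at the 5% level
Confidence Interval	Range of plausible values for the true slope	Narrow = more certainty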
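The first two rows of the table can be computed directly from the residuals. The sketch below does that for hypothetical data, and also computes the slope's standard error and t statistic, which are the ingredients behind the p-value and the confidence interval. Turning the t statistic into a p-value or an interval needs the t distribution, which the Python standard library does not provide, so that step is left out here.

```python
import math

# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
p = 1  # number of predictors in simple linear regression

x_mean = sum(xs) / n
y_mean = sum(ys) / n
s_xx = sum((x - x_mean) ** 2 for x in xs)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

m = s_xy / s_xx
b = y_mean - m * x_mean

# Residual and total sums of squares.
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - y_mean) ** 2 for y in ys)

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Standard error of the slope and its t statistic (n - 2 degrees of freedom).
slope_se = math.sqrt(ss_res / (n - 2) / s_xx)
t_stat = m / slope_se

print(f"R^2={r2:.4f}, adjusted R^2={adj_r2:.4f}")
print(f"slope={m:.3f}, standard error={slope_se:.4f}, t={t_stat:.1f}")
```

Adjusted R² is never larger than R², and the gap shrinks as the sample size grows relative to the number of predictors.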

❓ Answers to the Notebook Questions

What is the R² value?

The R² value from the results is 0.667, meaning the model explains 66.7% of the variation in the data.

What is the Adjusted R²?

The Adjusted R² value is 0.629, slightly lower than R² because it penalizes the fit for the number of predictors relative to the sample size.

What is the p-value for the slope?

The slope p-value is 0.002, which is less than 0.05. This means the slope is statistically significant: if the true slope were zero, a slope this far from zero would be very unlikely to appear by chance alone.

What are the 95% confidence intervals for the slope?

The interval is approximately between 0.233 and 0.767, meaning we are 95% confident that the true slope lies in that range.
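A quick way to read this interval is to look at its midpoint and at whether it contains zero. The sketch below uses only the two endpoints reported above. It assumes the notebook reports the usual symmetric t-based interval, in which case the midpoint is the estimated slope.

```python
# Endpoints of the 95% confidence interval reported above.
ci_low, ci_high = 0.233, 0.767

# For a symmetric interval, the midpoint is the point estimate of the slope.
midpoint = (ci_low + ci_high) / 2
half_width = (ci_high - ci_low) / 2

# A 95% interval that excludes zero agrees with a two-sided p-value below 0.05.
excludes_zero = ci_low > 0 or ci_high < 0

print(f"midpoint={midpoint:.3f}, half width={half_width:.3f}")
print("excludes zero:", excludes_zero)
```

Under that assumption, the interval is centered on a slope of about 0.5, and because it excludes zero it is consistent with the p-value of 0.002 reported for the slope.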

🎯 Key Takeaways

Linear regression fits a straight line to data by minimizing residuals.
Checking residuals helps determine if the model is valid.
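R² tells us how much of the variation the model explains, and the p-value tells us whether the slope is statistically distinguishable from zero.
Different types of regression exist depending on which variable is treated as dependent and which variables contain error.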
Total least squares considers error in both directions and may better represent real-world scenarios.

📚 Further Reading

An Introduction to Statistical Learning (Chapter 3, Linear Regression)
https://stattrek.com/regression/
https://scikit-learn.org/stable/modules/linear_model.html

🏁 Conclusion

Linear Regression is one of the simplest yet most powerful statistical modeling techniques. Understanding residuals, model assumptions, and key metrics is essential before applying it. By completing this tutorial and running the notebook, I developed a stronger understanding of how regression works and how to evaluate the model results.