Regression analysis is a statistical process for estimating the relationships between variables. It can be used to build a model to predict the value of the target variable from the predictor variables.
Mathematically, a regression model is represented as y = f(X), where y is the target or dependent variable and X is the set of predictors or independent variables (x1, x2, …, xn).
If a linear regression model involves only one predictor variable, it is called a Simple Linear Regression model.
f(X) = β0 + β1*x1 + ε
The β values are known as weights (β0 is also called the intercept, and the subsequent β1, β2, etc. are called coefficients). The error term ε is assumed to be normally distributed with constant variance.
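To make the model concrete, here is a minimal sketch that simulates data from this model. The values β0 = 2.0 and β1 = 0.5, the sample size, and the noise scale are all illustrative assumptions, not values from the text:

import numpy as np

rng = np.random.default_rng(42)

# Assumed illustrative parameters: intercept (beta0) and slope (beta1)
beta0, beta1 = 2.0, 0.5

# One predictor x1 and a normally distributed error with constant variance
x1 = rng.uniform(0, 10, size=100)
epsilon = rng.normal(loc=0.0, scale=1.0, size=100)

# y = beta0 + beta1*x1 + epsilon
y = beta0 + beta1 * x1 + epsilon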
Assumptions of Linear Regression
Assumption 1: The target (dependent) variable and the predictor (independent) variables should be continuous numerical values.
Assumption 2: There should be a linear relationship between the predictor variable and the target variable. A scatterplot with the predictor variable on the x-axis and the target variable on the y-axis can be used as a simple check to validate this assumption, as sketched below.
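A minimal sketch of this check with Matplotlib, regenerating the assumed synthetic data from the earlier snippet so the example is self-contained:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x1 + rng.normal(0.0, 1.0, size=100)

# A roughly straight-line cloud of points suggests a linear relationship
plt.scatter(x1, y)
plt.xlabel("predictor (x1)")
plt.ylabel("target (y)")
plt.title("Linearity check")
plt.show()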
Assumption 3: There should not be any significant outliers in the data.
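One simple and common way to screen for outliers is a boxplot; this is an assumption of the sketch below, not a check prescribed by the text. Points drawn beyond the whiskers are potential outliers worth investigating:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x1 + rng.normal(0.0, 1.0, size=100)

# Points beyond the whiskers flag potential outliers in each variable
plt.boxplot([x1, y], tick_labels=["x1", "y"])
plt.title("Outlier check")
plt.show()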
Assumption 4: The data should be iid (independent and identically distributed). In other words, one observation should not depend on another.
Assumption 5: The residuals (the differences between the actual and predicted values) of a regression should not exhibit any pattern. That is, they should be homoscedastic (exhibit equal variance across all instances). This assumption can be validated with a scatter plot of the residuals against the predicted values: a random, patternless spread indicates homoscedasticity, while a visible pattern (for example, a funnel shape) indicates heteroscedasticity.
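A minimal sketch of this residual check, fitting a line with scikit-learn on the same assumed synthetic data and plotting residuals against predictions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x1 + rng.normal(0.0, 1.0, size=100)

model = LinearRegression().fit(x1.reshape(-1, 1), y)
y_pred = model.predict(x1.reshape(-1, 1))
residuals = y - y_pred

# A patternless band around zero suggests homoscedastic residuals
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.title("Residual plot")
plt.show()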
Assumption 6: The residuals of the regression should be approximately normally distributed. This assumption can be checked by plotting a normal Q-Q plot of the residuals.
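A sketch of this Q-Q check using scipy.stats.probplot, one common way to draw the plot; the residuals come from the same assumed fit as the previous snippet:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x1 + rng.normal(0.0, 1.0, size=100)

model = LinearRegression().fit(x1.reshape(-1, 1), y)
residuals = y - model.predict(x1.reshape(-1, 1))

# Points lying close to the diagonal suggest approximately normal residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal Q-Q plot of residuals")
plt.show()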
Implementation:
SLR Implementation
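Since the original code listing is not shown here, below is a minimal end-to-end sketch of an SLR fit with scikit-learn. The synthetic data, train/test split, and metric choices are illustrative assumptions, not the author's original listing:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Assumed synthetic data generated from y = 2.0 + 0.5*x1 + noise
rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 2.0 + 0.5 * x1.ravel() + rng.normal(0.0, 1.0, size=200)

# Hold out a test set to evaluate the fitted line on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    x1, y, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# The fitted intercept (beta0) and coefficient (beta1) recovered from the data
print("intercept (beta0):", model.intercept_)
print("coefficient (beta1):", model.coef_[0])

# Evaluate predictions on the held-out test set
y_pred = model.predict(X_test)
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))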