In this post, you will learn the basics behind the most famous regression model: linear regression.
What is Linear Regression?
Linear regression is a statistical model that analyzes the relationship between a dependent variable (often called y) and one or more independent variables (often called x), possibly including their interactions.
Linear regression attempts to model the relationship between two or more variables by fitting a linear equation to observed data. A linear regression line has an equation of the form Y = β0 + β1*X1 + β2*X2 …, where X1 and X2 are the independent variables and Y is the dependent variable; β0 is the intercept, and β1 and β2 are the slopes of X1 and X2 respectively.
Examples: we can predict the height of a mango tree from the size of its trunk (the thicker the trunk, the taller the tree), or we can predict the mileage (mpg) of a car from its weight or power (horsepower): the lighter the car, the higher the mileage.
Regression has many practical uses. If the goal is prediction, forecasting, or error reduction, linear regression can be used to fit a predictive model to an observed data set of values of the dependent and independent variables.
Now we are going to build a linear model to predict the number of people employed from other economic variables, using the longley dataset (available in R).
Creating Linear Regression Model
Before creating a regression model, we should consider the points below for accurate predictions:
- Understand the data – know what is stored in each column, the volume of the data, and the data types of the fields. In our example we have assigned the dataset to the variable “lon”; we can check the structure of the data with the command str(dataset), which shows the structure of the longley dataset.
- Clean the data – fill in missing values and exclude outliers from the model for better prediction. Missing values can be found in various ways; one of them is the summary(dataset) command. This dataset has no missing values, as no NA values appear in the summary.
- Identify the independent variables – not every problem can be solved with the same model. To build a linear regression model, there should be a relationship between the dependent variable and the independent variable(s) (single or multiple).
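The checks above can be sketched in R. This is a minimal sketch assuming the built-in longley dataset; the variable name lon follows the text:

```r
# Load the built-in longley dataset and assign it to "lon", as in the text
lon <- longley

# Structure: column names, data types, and number of observations
str(lon)

# Summary statistics; NA counts would appear here if values were missing
summary(lon)

# Explicit check for missing values across the whole data frame
sum(is.na(lon))
```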
First, let's build a linear model with a single independent variable. In the plot below, we can see that there is a linear relation between GNP and Employed.
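A scatter plot like this can be drawn with base R graphics; a sketch, with the fitted line overlaid via abline:

```r
lon <- longley

# Scatter plot of GNP against Employed
plot(lon$GNP, lon$Employed,
     xlab = "GNP", ylab = "Employed",
     main = "Employed vs GNP")

# Overlay the least-squares line to make the linear relation visible
abline(lm(Employed ~ GNP, data = lon), col = "red")
```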
Splitting the dataset
Divide the dataset into test and train data. Training data is the data on which we train and fit our model (basically, to fit the parameters), whereas test data is used only to assess the performance of the model.
There are many approaches to splitting the dataset. The commands below can be used to split the dataset into testing and training data. The caTools package contains several basic utility functions; one of them is sample.split, which we use here. sample.split divides the dataset in the ratio 7:3, where 70% of the records are marked TRUE and the remaining 30% are marked FALSE, and stores these values in a variable (spl in this case). The subset command then assigns all the records from dataset lon where spl is TRUE to lon_train, and similarly assigns all the records from lon where spl is FALSE to lon_test.
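The split described above can be sketched as follows (assuming the caTools package is installed; set.seed makes the split reproducible):

```r
library(caTools)

lon <- longley
set.seed(123)  # reproducible split

# sample.split marks ~70% of rows TRUE and ~30% FALSE,
# balancing the split on the outcome variable Employed
spl <- sample.split(lon$Employed, SplitRatio = 0.7)

lon_train <- subset(lon, spl == TRUE)   # training data
lon_test  <- subset(lon, spl == FALSE)  # test data
```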
Linear regression can be calculated in R with the lm command, which takes the variables in the format:
lm([dependent variable] ~ [independent variables], data = [data source])
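Following that format, the single-variable model predicting Employed from GNP can be sketched as below. The post fits on lon_train; the full dataset is used here so the snippet runs on its own:

```r
lon <- longley

# Fit a simple linear regression: Employed as a function of GNP
modEmployed <- lm(Employed ~ GNP, data = lon)

# Intercept (β0) and slope (β1)
coef(modEmployed)
```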
With the command summary(modEmployed) you can see detailed information on the model's performance and coefficients.
In the model summary, you can see the values of the intercept (the “β0” value) and the slope (the “β1” value) for GNP. These “β0” and “β1” values define a line through the data points. In this case, if the GNP is 300, β0 is 51.927277 and β1 is 0.034520, so the model predicts (on average) that the number of employed for GNP 300 is around 51.93 + (0.0345 * 300) ≈ 62.28.
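That back-of-the-envelope prediction is just the line equation evaluated at GNP = 300, using the coefficient values quoted above:

```r
b0 <- 51.927277  # intercept reported in the model summary
b1 <- 0.034520   # slope for GNP

# Predicted Employed at GNP = 300
b0 + b1 * 300    # ≈ 62.28
```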
In R, to add another independent variable, add the symbol “+” for every additional variable you want to include in the model.
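For example, adding Armed.Forces as a second predictor; a sketch on the full dataset so it is self-contained:

```r
lon <- longley

# Two independent variables joined with "+"
modEmployed <- lm(Employed ~ GNP + Armed.Forces, data = lon)

# Coefficients, p-values, R-squared, residual summary
summary(modEmployed)
```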
The summary also shows the p-values for the coefficients GNP and Armed.Forces. In simple terms, a p-value indicates whether you can reject or accept a hypothesis. The hypothesis, in this case, is that the independent variable is not meaningful for your model. The number of stars indicates how significant the variable is for your model. In this case we can see that GNP is very significant for the prediction, while Armed.Forces does not contribute to the prediction of Employed.
- The p-value for GNP is 6.88e-13. A very small value means that GNP is probably an excellent addition to your model.
- The p-value for Armed.Forces is 0.3. In other words, there's a high chance that this predictor is not meaningful for the regression.
A standard way to test whether a predictor is meaningful is to check if its p-value is smaller than 0.05.
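The p-values discussed above sit in the Pr(>|t|) column of the coefficient table, which can also be pulled out programmatically; a sketch:

```r
lon <- longley
modEmployed <- lm(Employed ~ GNP + Armed.Forces, data = lon)

# Coefficient table: estimate, std. error, t value, Pr(>|t|)
coefs <- summary(modEmployed)$coefficients
coefs

# Predictors with p-values below the usual 0.05 threshold
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
```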
A good way to test the quality of the model is to look at the residuals, the differences between the real values and the predicted values. Graphically, the fitted straight line represents the predicted values, and the vertical distance from that line to an observed data point is the residual.
The idea here is that the sum of the residuals is approximately zero, or as low as possible. In real life, most cases will not follow a perfectly straight line, so residuals are expected. In the R summary of the lm function, you can see descriptive statistics about the residuals of the model; following the same example, they show that the residuals are approximately zero. The residuals are distributed roughly normally (with a median of approximately 0) even if our predicted variable is skewed.
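The residual summary shown by summary(lm(...)) can also be obtained directly; a sketch on the full dataset:

```r
lon <- longley
modEmployed <- lm(Employed ~ GNP, data = lon)

# Residuals = observed values - fitted values
res <- residuals(modEmployed)
summary(res)

# For an ordinary least-squares fit with an intercept,
# the residuals sum to (numerically) zero
sum(res)
```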
The predict function is used to predict the dependent variable for the test data. Calling predict(model, newdata = dataset, interval = "confidence") gives us 3 values: the best fit, the lower range, and the upper range. If we omit the third parameter, as in predict(model, newdata = dataset), the output is only the best-fit value.
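A sketch of both calls; the model is fitted on the full dataset here so the snippet runs on its own, whereas the post predicts on lon_test:

```r
lon <- longley
modEmployed <- lm(Employed ~ GNP, data = lon)

# Point predictions only (best fit)
predict(modEmployed, newdata = lon)

# With interval = "confidence": fit, lwr and upr columns
predict(modEmployed, newdata = lon, interval = "confidence")
```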
How to test if your linear model has a good fit?
The most common value used to check how good your model is, is the coefficient of determination, or R-squared. R-squared is a statistical measure of how close the data are to the fitted regression line. R-squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%. An R-squared of 100% means that all movements of the dependent variable (Employed in this case) are completely explained by movements in the independent variables (GNP and Armed.Forces in this case).
SSE = ∑ᵢ (yᵢ − ŷᵢ)²  and  SST = ∑ᵢ (yᵢ − ȳ)², with R² = 1 − SSE/SST.
Here, ŷᵢ is the fitted value for observation i and ȳ is the mean of Y.
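R² = 1 − SSE/SST can be computed by hand and checked against the value summary() reports; a sketch:

```r
lon <- longley
modEmployed <- lm(Employed ~ GNP, data = lon)

y    <- lon$Employed
yhat <- fitted(modEmployed)

SSE <- sum((y - yhat)^2)       # sum of squared errors
SST <- sum((y - mean(y))^2)    # total sum of squares

r2 <- 1 - SSE / SST
r2
summary(modEmployed)$r.squared  # matches r2
```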
In the model summary, notice that there are two different R² values, one multiple and one adjusted. The multiple R² is the one described above. One problem with this R² is that it cannot decrease as you add more independent variables to your model; it will keep increasing as you make the model more complex, even if those variables don't add anything to your predictions (like Armed.Forces in our example). For this reason, the adjusted R² is probably the better one to look at if you are adding more than one variable to the model, since it only increases if the added variable reduces the overall error of the predictions. A high R-squared value combined with low, approximately zero, residuals makes a good model.
Linear regression is one of the simplest and most important regression algorithms. You can type “?lm” in the console to view all the parameters that have not been covered here. It has been around since the 19th century and will stay around for a long time. If you are interested in statistical models, start with linear regression.
Let me know your views on this post by commenting below.