# Linear Regression with R

In this post, you will learn the basics behind the most famous regression model: linear regression.

What is Linear Regression?

A linear regression is a statistical model that analyzes the relationship between a dependent variable (often called y) and one or more variables and their interactions (often called x or independent variables).

Linear regression attempts to model the relationship between two or more variables by fitting a linear equation to observed data. A linear regression line has an equation of the form Y = β0 + β1*X1 + β2*X2 …, where X1 and X2 are the independent variables and Y is the dependent variable. Here β0 is the intercept, and β1 and β2 are the slopes for X1 and X2 respectively.

Examples: we can predict the height of a mango tree from the size of its trunk (the thicker the trunk, the taller the tree), or the mileage (mpg) of a car from its weight or power (horsepower): the lighter the car, the higher the mileage.

Regression has many practical uses. If the goal is prediction, forecasting, or error reduction, linear regression can be used to fit a predictive model to an observed data set of values of the dependent and independent variables.

Now we are going to build a linear model to predict the number of people employed from other economic variables, using the longley dataset (available in R).

Creating Linear Regression Model

Before creating a regression model, we should consider the points below for accurate predictions:

• Understand the data: know what is stored in each column, the volume of the data, and the data types of the fields. In our example we have assigned the dataset to the variable `lon`, and we can check the structure of the data with the command `str(lon)`.
• Clean the data: fill in the missing values and exclude outliers from the model for better prediction. Missing values can be found in various ways; one of them is the `summary(lon)` command. This dataset has no missing values, as no NA values appear in the summary.
• Identify the independent variables: not every problem can be solved with the same model. To build a linear regression model, there should be a linear relationship between the dependent variable and the independent variable (or variables).
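The first two checks can be sketched as follows. This is a minimal example that assumes only base R and the built-in `longley` data; the extra `colSums(is.na(...))` check is an addition for illustration, not part of the original walkthrough.

```r
# Load the built-in longley data under the name used in this post
lon <- longley

# Structure: column names, data types and the first few values
str(lon)

# Per-column summaries; columns with missing values would show an NA count
summary(lon)

# Explicit count of missing values per column (illustrative extra check)
na_counts <- colSums(is.na(lon))
print(na_counts)
```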

First let's build a linear model with a single independent variable.

A plot of Employed against GNP shows that there is a roughly linear relationship between the two.
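A plot like this can be drawn with base graphics. The sketch below assumes the built-in `longley` data; the title and axis labels are illustrative choices, and the correlation is printed only as a numeric companion to the visual check.

```r
lon <- longley

# Scatter plot of the dependent variable against the candidate predictor
plot(Employed ~ GNP, data = lon,
     main = "Employed vs GNP (longley)",
     xlab = "GNP", ylab = "Employed")

# Pearson correlation as a rough numeric check of linearity
gnp_cor <- cor(lon$GNP, lon$Employed)
print(gnp_cor)
```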

Splitting the dataset

Divide the dataset into train and test data. Training data is what we use to fit the model, that is, to estimate its parameters, whereas test data is used only to assess the performance of the model.

There are many approaches to splitting the dataset. The caTools package contains several basic utility functions; one of them is `sample.split`, which we use here. With a split ratio of 0.7, `sample.split` divides the dataset in the ratio 7:3: about 70% of the records are marked TRUE and the remaining 30% are marked FALSE, and these values are stored in a variable (`spl` in this case). One `subset` call then assigns all the records from `lon` where `spl` is TRUE to `lon_train`, and a second `subset` call assigns all the records where `spl` is FALSE to `lon_test`.
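The split described above might look like the sketch below. The seed value is arbitrary, and the base-R fallback for machines without caTools is an assumption added here so the example runs anywhere; it is not part of the original post.

```r
lon <- longley
set.seed(123)  # arbitrary seed so the split is reproducible

if (requireNamespace("caTools", quietly = TRUE)) {
  # TRUE for roughly 70% of the rows, FALSE for the rest
  spl <- caTools::sample.split(lon$Employed, SplitRatio = 0.7)
} else {
  # Fallback (assumption): draw about 70% of the row indices with base R
  train_rows <- sample(nrow(lon), size = floor(0.7 * nrow(lon)))
  spl <- seq_len(nrow(lon)) %in% train_rows
}

lon_train <- subset(lon, spl == TRUE)   # rows marked TRUE
lon_test  <- subset(lon, spl == FALSE)  # rows marked FALSE

print(c(train = nrow(lon_train), test = nrow(lon_test)))
```

With only 16 rows in longley, the test set is very small, so any performance estimate from it will be noisy.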

Model

Linear regression can be calculated in R with the command `lm`. The `lm` command takes the variables in the format:

```lm([dependent variable] ~ [independent variables], data = [data source])```

With the command `summary(modEmployed)` you can see detailed information on the model’s performance and coefficients.
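A minimal sketch of the fitting step follows. The formula `Employed ~ GNP + Armed.Forces` is inferred from the variables discussed in this post, and the model is fitted on the full dataset to keep the example self-contained, so the numbers printed will differ somewhat from the ones quoted in the text (which come from a training subset).

```r
lon <- longley

# Single-predictor model: Employed explained by GNP only
modGNP <- lm(Employed ~ GNP, data = lon)

# Two-predictor model, using the model name from this post
modEmployed <- lm(Employed ~ GNP + Armed.Forces, data = lon)

# Coefficients, standard errors, p-values, residual summary and R-squared
print(summary(modEmployed))
```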

Coefficients

In the Coefficients table of the summary, you can see the values of the intercept (the "β0" value) and the slope (the "β1" value) for GNP. These β0 and β1 values define the straight line that best fits the points of the data. In this case, β0 is 51.927277 and β1 is 0.034520, so if the GNP is 300 the model predicts (on average) that the number of employed is around 51.927 + (0.03452 * 300) ≈ 62.28.
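The arithmetic can be checked directly in R. The two coefficient values are the ones quoted in this post; the variable names are illustrative.

```r
# Coefficients as quoted in the text
b0 <- 51.927277  # intercept
b1 <- 0.034520   # slope for GNP

gnp <- 300

# Point prediction from the fitted line: y = b0 + b1 * x
predicted_employed <- b0 + b1 * gnp
print(round(predicted_employed, 2))
```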

The `Pr(>|t|)` column of the Coefficients table holds the p-values for the coefficients GNP and Armed.Forces. In simple terms, a p-value tells you how strong the evidence is against a hypothesis. The (null) hypothesis, in this case, is that the independent variable is not meaningful for your model, i.e. that its coefficient is zero. The number of stars indicates how significant the variable is for your model. In this case we can see that GNP is very significant for the prediction, while Armed.Forces does not contribute significantly to the prediction of Employed.

• The p-value for GNP is 6.88e-13 (that is, 0.000000000000688). A very small value means that GNP is probably an excellent addition to your model.
• The p-value for Armed.Forces is 0.3. In other words, there is no strong evidence that this predictor is meaningful for the regression.

A standard way to test whether a predictor is meaningful is to check whether its p-value is smaller than 0.05.
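The p-values can also be pulled out of the summary programmatically. This sketch fits the two-predictor model on the full longley data (an assumption made for self-containment, so the values will not match the ones quoted above exactly) and applies the 0.05 threshold.

```r
lon <- longley
modEmployed <- lm(Employed ~ GNP + Armed.Forces, data = lon)

# Coefficient table: estimate, standard error, t value and p-value per term
coef_table <- coef(summary(modEmployed))

# The p-values live in the column named "Pr(>|t|)"
p_values <- coef_table[, "Pr(>|t|)"]
print(p_values)

# Flag the terms that pass the conventional 0.05 threshold
is_significant <- p_values < 0.05
print(is_significant)
```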

Residuals

A good way to test the quality of the model is to look at the residuals, the differences between the real values and the predicted values. In a plot of the data with the fitted line, the straight line represents the predicted values, and the vertical distance from the line to each observed data value is the residual.

The idea here is that the residuals should be as small as possible and centered on zero. In real life, most cases will not follow a perfectly straight line, so residuals are expected. In the R summary of the `lm` function, you can see descriptive statistics about the residuals of the model. In this example, the Residuals section of the summary shows that the residuals are centered close to zero (the median is approximately 0), which is what we expect from a well-behaved model even if the predicted variable itself is skewed.
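The residual summary can be reproduced by hand. This sketch uses a single-predictor model on the full longley data (an assumption for self-containment); for a least-squares fit with an intercept, the residuals on the fitting data sum to zero up to floating-point error.

```r
lon <- longley
model <- lm(Employed ~ GNP, data = lon)

# Residual = observed value minus fitted value
res <- residuals(model)

# Same five-number style summary that summary(model) prints
print(summary(res))

# Sum of residuals: zero up to floating-point error
res_sum <- sum(res)
print(res_sum)

# Residuals against fitted values; a patternless band around 0 is a good sign
plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```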

Prediction

The predict function is used to predict the dependent variable for the test data.

```predict(model, newdata=dataset, interval = "prediction")```

This gives us three values: the best fit, the lower bound and the upper bound of the prediction interval. If we do not give the third parameter, i.e. `predict(model, newdata=dataset)`, the output will be the best-fit values only.
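Both calls can be compared side by side. The sketch below uses a simple random split with an arbitrary seed and a single-predictor model; these choices are assumptions for illustration rather than the exact setup of this post.

```r
lon <- longley
set.seed(123)  # arbitrary seed for a reproducible split

# Simple random split: 11 rows for training, the remaining 5 for testing
train_idx <- sample(nrow(lon), size = 11)
lon_train <- lon[train_idx, ]
lon_test  <- lon[-train_idx, ]

model <- lm(Employed ~ GNP, data = lon_train)

# Matrix with columns fit, lwr and upr (95% prediction interval by default)
pred_int <- predict(model, newdata = lon_test, interval = "prediction")
print(pred_int)

# Without the interval argument only the fitted values are returned
pred_fit <- predict(model, newdata = lon_test)
print(pred_fit)
```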

How to test if your linear model has a good fit?

The most common value used to check how good your model is, is the coefficient of determination, or R-squared. R-squared is a statistical measure of how close the data are to the fitted regression line: it is the proportion of the variance in the dependent variable that the model explains. R-squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%. An R-squared of 100% means that all movements of the dependent variable (Employed in this case) are completely explained by movements in the independent variables (GNP and Armed.Forces in this case).

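R-squared Formula:

$$R^2 = 1 - \frac{SSE}{SST}$$

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{and} \quad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Here, $\hat{y}_i$ is the fitted value for observation i and $\bar{y}$ is the mean of Y.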
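The formula can be verified against the value R reports. This sketch computes SSE and SST by hand for a single-predictor model on the full longley data (an assumption for self-containment) and compares the result with `summary(model)$r.squared`.

```r
lon <- longley
model <- lm(Employed ~ GNP, data = lon)

y     <- lon$Employed
y_hat <- fitted(model)

# Sum of squared errors: observed minus fitted values
sse <- sum((y - y_hat)^2)

# Total sum of squares: observed values minus their mean
sst <- sum((y - mean(y))^2)

r_squared_manual <- 1 - sse / sst
r_squared_lm     <- summary(model)$r.squared

print(c(manual = r_squared_manual, from_lm = r_squared_lm))
```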

At the bottom of the model summary, notice that there are two different R² values, one multiple and one adjusted. The multiple R² is the one described above. One problem with this R² is that it cannot decrease as you add more independent variables to your model; it will continue increasing as you make the model more complex, even if these variables don't add anything to your predictions (like the example of Armed.Forces). For this reason, the adjusted R² is probably better to look at if you are adding more than one variable to the model, since it penalizes every added variable and increases only when the new variable improves the fit more than would be expected by chance. A high R-squared value combined with small residuals suggests a good model.
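The difference between the two measures can be seen by comparing a one-predictor and a two-predictor model. Both models here are fitted on the full longley data, which is an assumption made to keep the example self-contained.

```r
lon <- longley

m1 <- lm(Employed ~ GNP, data = lon)                 # one predictor
m2 <- lm(Employed ~ GNP + Armed.Forces, data = lon)  # two predictors

r2_compare <- data.frame(
  model    = c("GNP", "GNP + Armed.Forces"),
  multiple = c(summary(m1)$r.squared,     summary(m2)$r.squared),
  adjusted = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared)
)
print(r2_compare)
```

The multiple R² of the larger model can never be lower than that of the smaller one, whereas the adjusted R² is free to go down when the extra variable does not help.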

Conclusion

Linear regression is one of the simplest and most important regression algorithms. You can type `?lm` in the console to view all the parameters that have not been covered here. It has been around since the 19th century and will stay for a long time. If you are interested in statistical models, start with linear regression.

Let me know your views on this post by commenting below.

#### Gaurav Tiwari

My name is Gaurav Tiwari. I have been working in the IT industry for over 3.5 years. I completed my B.E. at Mumbai University in 2015; since then I have been working with Accenture Solutions PVT. LTD. as a data analyst.
I've started writing blogs as a hobby.


