Data Analytics with Linear Regression Modeling in R

Data Analytics in R

Before I stepped into the corporate world, I always felt the use of languages like R was just not justified. One question kept echoing in my mind: “Why R?” Why can’t I just do the exact same work in Excel? Hell, it’s a powerful tool! What do I have to gain with R?

These are valid questions, but the answer changes quickly when you have to work with a dataset that has 3 million+ rows and 90+ columns. When you have to work with a CSV file larger than 1.5 GB, an urgent need for R arises.
If you tried to open a file of that size in Excel, your computer would first cuss at you in different sets of 1’s and 0’s and then momentarily stop working :D!

Excel supports 1,048,576 rows by 16,384 columns, which proves ineffective when working on data sets of this size. R, on the other hand, is a powerful language used widely for data analysis and statistical computing.

R was developed at the University of Auckland, New Zealand by Ross Ihaka and Robert Gentleman. Current development of the R platform is handled by the R Development Core Team.

R has come a long way from those early builds. It’s more user-friendly now than ever.

R has a lot going for it:

  • The style of coding is quite easy to pick up.
  • It’s open source, so there are no subscription charges to pay.
  • You get instant access to over 7,800 packages customized for various computation tasks.
  • The community support is overwhelming, with numerous forums to help you out.
  • You can get a high-performance computing experience (with the right packages).
  • It’s one of the most highly sought-after skills by analytics and data science companies.

R can be used seamlessly on many platforms, but RStudio and Jupyter Notebook are the most widely used environments.

I would recommend Jupyter Notebook to anyone who wants to get into data analytics from scratch. It’s open source and very basic in design, and using Jupyter Notebook you always have the option to work in Python whenever you want.

Let’s Get it Running

I would like to give you some simple steps to make your life easier.

Step 1: Make sure you download the Python 3 package from python.org and install it using the settings given below.

Step 2: After installing it, open up a command prompt and type the following commands in this exact order:

  • pip install jupyterlab
  • pip install notebook

Step 3: Download and install R for your system from here. DO NOT CHANGE ANY SETTINGS. Just keep clicking Next after reading each page carefully.

Step 4: After installation is done, you will find a folder like the one shown below in your Start menu. Open R x64.

Step 5: In the R console, run the following commands in this exact order:

  • install.packages('IRkernel')
  • IRkernel::installspec()
  • IRkernel::installspec(user = FALSE)

Step 6: Open up a new command prompt and write

  • jupyter notebook

This should open up a tab in your default browser which will look like this.

Congratulations! You can now run R and Python on your PC.
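If you want to make sure the R kernel is wired up correctly, open a new notebook with the R kernel selected and run a couple of throwaway commands. This is just a sanity check, not part of the setup itself:

R.version.string   # should print the R version you just installed
1 + 1              # any expression should evaluate right below the cell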


Hands-on R

If you are a beginner or someone who is very new to the field of programming, here is a video that can help you take your very first step in a clean and concise manner.


Exploratory Data Analysis with R

Exploratory Data Analysis (EDA) is the process of analyzing and visualizing the data to get a better understanding of the data and glean insight from it. Data Exploration is a crucial stage of a predictive model. You can’t build great and practical models unless you learn to explore the data from beginning to end. This stage forms a concrete foundation for data manipulation (the very next stage).
The process consists of four steps (a minimal sketch in R follows the list):

  1. Import the data
  2. Clean the data
  3. Process the data
  4. Visualize the data
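
Here is a minimal sketch of what those four steps can look like in R. The file name, sheet name, and the Response column are hypothetical placeholders, not the dataset used later in this post:

# 1. Import the data (hypothetical file and sheet)
library(readxl)
library(data.table)
raw <- read_excel("C:/Jupyter Workbooks/sample.xlsx", sheet = "Data")
setDT(raw)

# 2. Clean the data: drop rows with a missing response
clean <- raw[!is.na(Response)]

# 3. Process the data: keep only the numeric columns for modelling
numeric_cols <- names(clean)[sapply(clean, is.numeric)]
model_data <- clean[, ..numeric_cols]

# 4. Visualize the data: a quick histogram of the response
hist(model_data$Response, main = "Distribution of Response", xlab = "Response")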

Linear Regression but in R!

Linear Regression is one of the simplest yet most effective ways to run predictive analysis on any dataset, irrespective of its size. Here is a sheet I use regularly to identify the model I want to use in my regression analysis.

You can definitely run a linear regression model in Excel, but as we discussed earlier, it has its own limitations.

First of all, we need to get familiar with some terms:
Response Variable (Dependent Variable): In a data set, the response variable (y) is the one on which we make predictions.
Predictor Variable (Independent Variable): In a data set, predictor variables (Xi) are the ones used to make predictions about the response variable.
Train Data: The predictive model is always built on the train data set. An intuitive way to identify the training data is that it always has the response variable included.
Test Data: Once the model is built, its accuracy is ‘tested’ on the test data. This data typically contains fewer observations than the train data set, and it does not include the response variable.
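
The World Bank data used later in this post is not split into train and test sets, but here is a minimal sketch of how such a split typically looks in R; the data table dt, the 80/20 ratio, and the seed are arbitrary choices for illustration:

set.seed(42)                                        # make the split reproducible
train_idx <- sample(seq_len(nrow(dt)), size = floor(0.8 * nrow(dt)))
train <- dt[train_idx]                              # 80% of rows, response included
test  <- dt[-train_idx]                             # remaining 20% of rows
# in practice the response column is dropped (or hidden) from 'test' when scoring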

Linear regression analysis is used to predict the value of a variable based on the value of one or more other variables. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable’s value is called the independent variable. Now that we know what these terms mean, let’s look at the data itself.

Before running the regression, there are some very basic checks we can do to verify the validity and linearity of the data. In real-world scenarios we usually run multivariate regressions, as the dependent variable can depend on many other factors.

Multivariate regression presents another problem: the independent variables can be correlated with each other, which can distort the estimates and give us unreliable results. So, breaking these steps down, let’s go over them one by one and understand them with an example dataset.

Before getting started, let’s load all the packages we might need during our processing.

# data manipulation
library(data.table)
library(dplyr)
library(plyr)
# string handling
library(stringr)
library(stringi)
# importing / exporting data (Excel, CSV, and other formats)
library(rio)
library(readxl)
library(writexl)
library(readr)
library(readxlsb)
library(openxlsx)
library(arrow)
# dates and reshaping
library(lubridate)
library(anytime)
library(reshape2)
# correlation plots
library(ggcorrplot)

If you don’t have one of these packages, install it by running install.packages("package_name").
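
If you would rather not call install.packages() by hand for each one, a small helper like the sketch below installs only the packages that are missing; the vector simply lists the packages loaded above:

pkgs <- c("data.table", "dplyr", "plyr", "stringr", "stringi", "rio", "readxl",
          "writexl", "readr", "readxlsb", "openxlsx", "arrow", "lubridate",
          "anytime", "reshape2", "ggcorrplot")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)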

Loading the Dataset and Making It a Data Table

# note: R file paths need forward slashes (or doubled backslashes)
d1 <- read_excel("C:/Jupyter Workbooks/Book1.xlsx", sheet = "Data")
setDT(d1)
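
Before moving on to the correlation matrix, it is worth taking a quick look at what was just loaded. These are standard base R checks rather than part of the original walkthrough:

dim(d1)       # number of rows and columns
str(d1)       # column names and types
head(d1)      # first few rows
summary(d1)   # basic summary statistics for every column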

Correlation Matrix

Command: cor(data), where data is a numeric matrix or data frame containing the dependent and all the independent variables.

corr1 <- cor(d1[, -c("Month")])

(You will be able to see the correlation that exists between the different independent and dependent variables.)
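
For a quick look inside the notebook itself, you can print a rounded version of the matrix before moving on to a formatted plot; this is just a convenience step:

round(corr1, 2)   # correlations rounded to two decimals for easier reading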

You can either create a conditionally formatted correlation plot by writing this data to Excel, or, for client presentations, use the following command.

ggcorrplot(cor(d1[, -c("Month")][, c('S&P', 'CRUDE_PETRO', 'LEAD', 'CRUDE_BRENT', 'CRUDE_DUBAI', 'CRUDE_WTI', 'COAL_AUS', 'COAL_SAFRICA', 'NGAS_US', 'NGAS_EUR', 'NGAS_JP', 'ALUMINUM', 'IRON_ORE', 'COPPER', 'Tin', 'NICKEL', 'Zinc', 'GOLD', 'PLATINUM', 'SILVER')]))

Correlation plots are extremely useful for spotting correlated variables and coming up with different solutions to remove the correlation before putting them in our final model.

Linearity

We can create intuitive histograms to see the range of our data set and understand the data even better.

d4 <- copy(d1)
hist(d4$`S&P`)
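
If you want to check the ranges of all the series at once instead of one column at a time, a small loop over the numeric columns works; the 3 x 3 layout is an arbitrary choice:

numeric_cols <- names(d4)[sapply(d4, is.numeric)]
par(mfrow = c(3, 3))                     # show nine histograms per screen
for (col in numeric_cols) {
  hist(d4[[col]], main = col, xlab = col)
}
par(mfrow = c(1, 1))                     # reset the plotting layout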

Apart from histograms, we can also plot the variables against each other in scatter plots.

plot(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data = d4)

This command will create multiple scatter plots and line them up in your Jupyter notebook. Since these plots are very useful during modelling, I usually save them in a PDF file for future reference. Use the following commands to generate a PDF with all the plots.

1. Open the PDF file

pdf("linearity_plots.pdf")

2. Create the plot

plot(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data = d4)

3. Close the file

dev.off()

You will get the file shown below.

Linear Regression Model

worldbankdata.lm <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data = d4)

summary(worldbankdata.lm)

The model summary above shows every detail about this model, from the R-squared to the p-value of every independent variable, along with the residual standard error. As we can see, this is not a great model, as there are many independent variables for which the p-value is very high. To create a stable model we would have to resort to some outlier removal approaches, which are beyond the scope of this article, but we will try to come back to the topic of outlier removal soon in another post.
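
If you prefer to pull specific numbers out of the summary rather than reading the printed output, the pieces are available programmatically. The VIF check at the end uses the car package, which is not loaded above, so treat it as an optional extra for quantifying the multicollinearity discussed earlier:

model_summary <- summary(worldbankdata.lm)
model_summary$r.squared          # R-squared
model_summary$adj.r.squared      # adjusted R-squared
coef(model_summary)              # coefficient table with estimates and p-values

# optional: variance inflation factors (requires install.packages("car"))
# library(car)
# vif(worldbankdata.lm)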

You can get the process to calculate percentage impact in the Excel file below. (Note that this is not a good model at all, but it is an illustration of how to run a linear model in R.)

We can also plot the predicted values against the actual values of the model with the following command.

plot(predict(worldbankdata.lm), d4$`S&P`,
     xlab = "Predicted Values",
     ylab = "Observed Values")
abline(a = 0, b = 1, lwd = 2,
       col = "green")

From the predicted-vs-actual chart you can easily check the fit of your line (here we can clearly see there are some outliers which, when removed, would result in a better linear regression model).
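
To put a number on the fit instead of eyeballing the chart, you can compute the usual error metrics from the same predicted and observed values; this is an addition on top of the original walkthrough:

predicted <- predict(worldbankdata.lm)
observed  <- d4$`S&P`
sqrt(mean((observed - predicted)^2))   # root mean squared error (RMSE)
mean(abs(observed - predicted))        # mean absolute error (MAE)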

That’s it! You have successfully run a linear regression model and learned how to do so effectively in R. I hope you had a great time, like I did while writing this post. Do reach out in case you run into any questions.
