Outlier Removal Techniques for Linear Regression Model in R

PRE-READ

If you have been following this series on learning to run regression models in R, you already have the context for what follows. If not, please refer to the post below as the starting point for exploring data analytics in R. DO REMEMBER we will be using the same dataset from my previous post to build our understanding further.

https://rajatmshukla.wordpress.com/2022/11/27/data-analytics-with-linear-regression-modeling-in-r/

Why do Outlier Removal?

Outlier removal, which builds on outlier detection (also called outlier analysis), is the process of identifying and removing extreme values that differ significantly from the majority of the data. These values, known as outliers, can have a significant impact on the results of a statistical model and can sometimes lead to incorrect conclusions.

There are several reasons why outlier removal is performed in a model:

  1. Outliers can distort the overall distribution of the data and affect statistical measures such as the mean and standard deviation, which can lead to inaccurate conclusions and predictions (see the short demo after this list).
  2. Outliers can have a disproportionate influence on the model’s performance, especially in regression models. For example, a single outlier with a large value may significantly alter the slope of the regression line, leading to a model that does not accurately capture the relationship between the predictor and response variables.
  3. Outliers can be caused by errors in data collection or data entry, and removing them can help ensure that the model is based on accurate and reliable data.
  4. In some cases, the presence of outliers may indicate the need for a different model or a different approach to modeling the data.
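
As a quick demonstration of the first point, here is a minimal, self-contained sketch (using simulated data rather than the World Bank dataset) of how a single extreme value can shift the mean and inflate the standard deviation:

# simulate 99 well-behaved observations plus one extreme value
set.seed(42)
x <- rnorm(99, mean = 10, sd = 1)
mean(x); sd(x)   # roughly 10 and 1

x_with_outlier <- c(x, 100)
mean(x_with_outlier); sd(x_with_outlier)   # mean pulled upward, sd inflated dramatically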

It’s important to note that outlier removal should be done carefully, as it can also discard useful information. Before removing any outliers, thoroughly understand the data and carefully weigh the potential consequences of removing them.

Outlier Removal Techniques for a Linear Regression Model in R

There are several different techniques that can be used to identify and remove outliers from a dataset before fitting a linear regression model. Some of the most common outlier removal techniques include the following:

  • Standardized residuals
  • Cook’s distance
  • DFBETA
  • Boxplots
  • Hat values

Standardized residuals: This method involves calculating the standardized residual for each observation in the fitted regression model and flagging the observations with the largest absolute standardized residuals as potential outliers. To implement this method in R, you can use the rstandard() function from base R’s stats package to calculate the standardized residuals, and then use the which() function to identify the observations that exceed a chosen threshold.

# fit the linear regression model
worldbankdata.lm <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4)

summary(worldbankdata.lm)

# calculate the standardized residuals
std_residuals <- rstandard(worldbankdata.lm)

library(olsrr)
ols_plot_resid_stand(worldbankdata.lm)


# identify observations with absolute standardized residuals above 2
# (standardized residuals already have a standard deviation of roughly 1)
outliers <- which(abs(std_residuals) > 2)

# remove the identified outliers from the dataset
d4_std_residual_greater2 <- d4[-outliers, ]

# Run the linear regression model again to see the results after outlier removal
worldbankdata.lm_std_residual_greater2 <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4_std_residual_greater2)

summary(worldbankdata.lm_std_residual_greater2)
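
A closely related alternative, sketched below, is the studentized (deleted) residual from base R’s rstudent() function, which rescales each residual using a variance estimate computed with that observation left out and is often preferred for flagging outliers:

# studentized residuals: each residual is scaled by an estimate of sigma
# computed with that observation removed from the fit
stud_residuals <- rstudent(worldbankdata.lm)

# the same |value| > 2 rule of thumb applies
which(abs(stud_residuals) > 2)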


Cook’s distance: This method uses Cook’s distance, a measure of each observation’s influence on a fitted regression model. For each observation, it is computed from the sum of squared differences between the fitted values of the full model and the fitted values of the model refit with that observation removed, scaled by the number of coefficients and the residual variance. Observations with a large Cook’s distance are considered potential outliers and can be removed from the dataset.

# fit the linear regression model
worldbankdata.lm <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4)

summary(worldbankdata.lm)

# calculate the Cook’s distance for each observation
cooks_distance <- cooks.distance(worldbankdata.lm)

library(olsrr)
ols_plot_cooksd_bar(worldbankdata.lm)


# identify observations whose Cook's distance exceeds the 4/(n - 2) cutoff
outliers <- which(cooks_distance > 4 / (nrow(d4) - 2))

# remove the identified outliers from the dataset
d4_cooksdistance <- d4[-outliers, ]

# Run the linear regression model again to see the results after outlier removal
worldbankdata.lm_cooksdistance <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4_cooksdistance)

summary(worldbankdata.lm_cooksdistance)
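
If you prefer to avoid the olsrr dependency, base R’s plot() method for lm objects produces a comparable Cook’s distance diagnostic; a minimal sketch:

# which = 4 selects the Cook's distance plot from the standard lm diagnostics
plot(worldbankdata.lm, which = 4)

# draw the 4/(n - 2) cutoff used above for reference
abline(h = 4 / (nrow(d4) - 2), lty = 2, col = "red")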


DF Beta: DF Beta (DFBETA) outlier removal is a method for identifying and removing influential observations in a regression model. DFBETA measures the change in each estimated regression coefficient when an individual observation is removed from the data. Observations with large DFBETA values are considered to have a high degree of influence on the model and may be potential outliers.
To perform DFBETA outlier removal in R, you can use the dfbeta() function from base R’s stats package. This function takes a fitted linear model object as input and returns a matrix of DFBETA values, with one row per observation and one column per coefficient. You can then use these values to identify and remove influential observations from the data.

For example, you can use the following code to fit a linear regression model to a dataset and remove observations with DFBETA values greater than a certain threshold:


#Fit linear regression model
worldbankdata.lm <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4)

summary(worldbankdata.lm)

#Calculate the DFBETA values for each observation (one column per coefficient)
dfbeta_values <- dfbeta(worldbankdata.lm)

#Identify observations where any DFBETA value exceeds 2 standard deviations in magnitude
outliers <- unique(which(abs(dfbeta_values) > 2 * sd(dfbeta_values), arr.ind = TRUE)[, "row"])

#Remove the identified outliers from the data
d4_dfbeta <- d4[-outliers, ]

#Fit the model to the cleaned data
worldbankdata.lm_dfbeta <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4_dfbeta)

summary(worldbankdata.lm_dfbeta)
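
A common variant, sketched below, uses the standardized dfbetas() function (also in base R’s stats package) with the conventional 2/sqrt(n) cutoff, which puts all coefficients on a comparable scale instead of mixing their raw units:

# standardized DFBETA values: one row per observation, one column per coefficient
dfbetas_values <- dfbetas(worldbankdata.lm)

# flag any row whose largest absolute standardized DFBETA exceeds 2/sqrt(n)
cutoff <- 2 / sqrt(nrow(d4))
outliers_std <- which(apply(abs(dfbetas_values), 1, max) > cutoff)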

Note that DFBETA outlier removal should be used in conjunction with other diagnostic tools, such as residual plots, to assess the fit of a regression model. Removing influential observations can improve the fit, but it can also discard important information, so carefully consider the potential consequences before removing anything.


boxplot: A boxplot is a graphical representation of a dataset that can be used to identify and visualize potential outliers. In a boxplot, the central box represents the middle 50% of the data, and the whiskers extend to the most extreme values that fall within 1.5 times the interquartile range of the box. Outliers are represented as individual points beyond the whiskers.
To remove outliers from a dataset using a boxplot in R, you can use the boxplot() function in the base graphics package to create a boxplot of the data, identify the outliers, and then remove them from the dataset.

For example, you can use the following code to create a boxplot of a dataset and remove observations that are outside the whiskers of the plot:

#Fit linear regression model
worldbankdata.lm <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4)

summary(worldbankdata.lm)

# Create a boxplot of the residuals
boxplot(worldbankdata.lm$residuals)

#Identify the residuals flagged as outliers by boxplot.stats()
outliers <- which(worldbankdata.lm$residuals %in% boxplot.stats(worldbankdata.lm$residuals)$out)

#Remove the identified outliers from the data
d4_boxplot <- d4[-outliers, ]

#Fit the model to the cleaned data
worldbankdata.lm_boxplot <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4_boxplot)

summary(worldbankdata.lm_boxplot)
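
For reference, the whiskers of the default boxplot follow the 1.5 * IQR rule, which you can reproduce directly if you want an explicit numeric cutoff rather than relying on boxplot.stats(); a minimal sketch:

res <- worldbankdata.lm$residuals

# quartiles and interquartile range of the residuals
q <- quantile(res, probs = c(0.25, 0.75))
iqr <- IQR(res)

# observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] match the boxplot's flagged points
outliers_iqr <- which(res < q[1] - 1.5 * iqr | res > q[2] + 1.5 * iqr)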


hatvalues: The hat values, also known as leverage values, are measures of the influence of individual observations on a fitted linear regression model. Observations with large hat values are considered to have a high degree of influence on the model and may be potential outliers.

To perform outlier removal using hat values in R, you can use the hatvalues() function to calculate the hat values for a fitted linear model object. You can then use these values to identify and remove influential observations from the data.

For example, you can use the following code to fit a linear regression model to a dataset and remove observations with hat values greater than a certain threshold:

# Fit the linear regression model
worldbankdata.lm <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4)

summary(worldbankdata.lm)

# Calculate the hat values for each observation
hat_values <- hatvalues(worldbankdata.lm)

plot(hat_values)

# Identify observations with large hat values
# (0.05 is a dataset-specific cutoff here; see the rule of thumb below)
outliers <- which(hat_values > 0.05)

# Remove the identified outliers from the data
d4_hatvalues <- d4[-outliers, ]

# Fit the model to the cleaned data
worldbankdata.lm_hatvalues <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4_hatvalues)

summary(worldbankdata.lm_hatvalues)
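
Rather than a fixed cutoff such as 0.05, a common rule of thumb flags hat values above two (or three) times the average leverage, which for a model with k predictors equals (k + 1)/n; a minimal sketch:

# average leverage equals the number of coefficients divided by the sample size
avg_leverage <- length(coef(worldbankdata.lm)) / nrow(d4)

# flag observations above twice the average leverage
which(hat_values > 2 * avg_leverage)

# which = 5 gives the residuals-vs-leverage plot with Cook's distance contours
plot(worldbankdata.lm, which = 5)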


What I Prefer and Practice

To stay true to our data analysis and prediction modeling, the key is to remove as few observations of the dependent variable as possible.

I prefer creating DFBETA plots for all the variables and doing selective outlier removal.

# 1. Open the PDF device

pdf("dfbeta_plots.pdf")

# 2. Create the plots

ols_plot_dfbetas(worldbankdata.lm)

# 3. Close the device

dev.off()

Output file: dfbeta_plots.pdf (one DFBETA plot per predictor).

The process involves strategically removing specific points that destabilize the model, for example where a coefficient’s p-value is extremely high. This takes a bit of trial and error to identify the right points to remove in order to arrive at stable and acceptable linear regression results.

The best approach is to go variable by variable. Instead of removing all points above the threshold value, we can be selective with a unique combination of points.
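
One way to support that variable-by-variable inspection programmatically is to rank observations by their absolute standardized DFBETA for a single coefficient; a minimal sketch (the GOLD column is just one example from this model):

# standardized DFBETA values, one column per coefficient
dfb <- dfbetas(worldbankdata.lm)

# the most influential observations for the GOLD coefficient, for example
head(sort(abs(dfb[, "GOLD"]), decreasing = TRUE))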

# Fit the model after removing the selected points from the dataset to arrive at a stable model
worldbankdata.lm_selectivedfbeta <- lm(`S&P` ~ CRUDE_PETRO + LEAD + CRUDE_BRENT + CRUDE_DUBAI + CRUDE_WTI + COAL_AUS + COAL_SAFRICA + NGAS_US + NGAS_EUR + NGAS_JP + ALUMINUM + IRON_ORE + COPPER + Tin + NICKEL + Zinc + GOLD + PLATINUM + SILVER, data=d4[-c(615,126,125,631,618), ])

summary(worldbankdata.lm_selectivedfbeta)


I hope you learnt a bit more about how to do outlier removal in linear regression modeling in R from this piece. Happy Modeling!
