Tools to help your Linear Regression Model predict better!

A statistical model is an exceptional data-investigation tool and can surface insights from your data with tremendous success. That said, there are times when the model needs your help to even out the data before it can serve up those insights. How do you do that while still driving the analysis forward? There are several statistically sound techniques that can be a lifesaver in situations like these, and we discuss them in this blog.

But before we deep dive into these techniques, I recommend going through the following pieces to get a gist of regression modeling.

Here we learn how to get R running on your system.
https://rajatmshukla.wordpress.com/2022/11/27/data-analytics-with-linear-regression-modeling-in-r/

This piece will help you understand the process of linear regression modeling in greater detail.
https://rajatmshukla.wordpress.com/2022/12/20/outlier-removal-techniques-for-linear-regression-model-in-r/

Explore the Data

Before we get into regression modeling, statistical exploration of the data is of utmost importance. The DataExplorer package in R is what I prefer for creating a neat summary of a dataset to share with the entire team. It's easy to use and visually appealing.

#Loading all the libraries
library(data.table)
library(dplyr)
library(stringr)
library(stringi)
library(rio)
library(readxl)
library(writexl)
library(readr)
library(readxlsb)
library(plyr)
library(lubridate)
library(reshape2)
library(anytime)
library(arrow)
library(openxlsx)
library(tidyverse)
library(broom)
library(ggcorrplot)
library(DataExplorer)
library(ggplot2)

#Read the file and Create Report using DataExplorer
adv <- read_excel("C:/Jupyter Workbooks/advertising.xlsx")
setDT(adv)
create_report(adv[, !"Date"])

#Convert the HTML file it produces to a PDF

#Data Prep and keeping Data from June onwards
adv1 <- copy(adv)
adv1$Date <- as.Date(adv1$Date, format = "%d-%m-%Y")
adv1$YRMO <- format(as.Date(adv1$Date), "%Y%m")
head(adv1)

Carryover

In marketing, a “carryover” effect refers to the influence that past marketing activities have on current and future sales or other marketing outcomes. This effect can be difficult to quantify, but it can be important to consider when analyzing the performance of marketing campaigns and making decisions about future marketing investments.

One way to account for carryover effects in marketing analysis is through the use of a marketing mix model. A marketing mix model is a statistical tool that helps marketers understand the relative impact of different marketing activities on sales or other outcomes. The model estimates the effects of various marketing mix elements (such as advertising, price, and promotion) on sales or other outcomes, taking into account the interactions and dependencies among these elements.

To account for carryover effects in a marketing mix model, you can include lagged variables in the model to represent the influence of past marketing activities on current outcomes. For example, you could include a lagged variable for advertising expenditure to represent the influence of past advertising on current sales. The coefficient on this lagged variable would represent the carryover effect of past advertising on current sales. By including lagged variables in the model, you can estimate the carryover effects of past marketing activities and use this information to inform your marketing strategy and budgeting decisions. However, it is important to carefully consider the appropriate lag periods to include in the model, as the magnitude and duration of carryover effects can vary depending on the specific marketing activities and product or market being analyzed.

To create a lagged variable in R, you can use the lag() function from the dplyr library. Here is an example of how you can use lag() to create a lagged variable:

#Loading the required library
library(dplyr)
#Create a small example vector
x <- c(10, 20, 30, 40, 50)
#Shift the values down by 1 position (the first value becomes NA)
y <- lag(x, 1)

# For our processing we are hypothesizing that the carryover effect lasts for 15 days in our day-level model.
adv1$Carryover <- lag(adv1$Sales, 15)
adv2 <- adv1[!(YRMO == "202205")]

#Let's run our first iteration of this model to see what results we are getting.
m1 <- lm(Sales~Carryover+TV+Radio+Newspaper,data=adv2)
summary(m1)

We clearly see that our model has high p-values for the Carryover and Newspaper variables. To predict the sales uplift even better, we might need to bring in some business sense and control for changes in marketing tactics over time.

The following tools can be a great way to bring uniformity to your data while preserving its variance, which helps your linear regression model predict better.

Time Crediting and Adstock

Time crediting and adstock are two related concepts used in marketing and advertising to measure the effectiveness of advertising campaigns.

Time crediting refers to the idea that the effects of an advertising campaign can last for a certain period of time after the campaign has ended. This means that even if a campaign has ended, it can still have an impact on consumer behavior and sales. Time crediting is used to account for this delayed effect, and to help marketers understand how long an advertising campaign will continue to have an impact.

Adstock is a measure of the residual effect of an advertisement on sales over time. The idea is that the impact of an ad on sales is not instantaneous; it builds up while the campaign runs and decays after it ends. Adstock is typically calculated by combining the campaign's "base" effectiveness with a "decay" rate over time. This helps quantify the lasting impact of an advertising campaign and can be used to predict future sales and consumer behavior.

One way to model the adstock effect is to use a time-crediting approach. In this approach, you assign a credit to each time period in which the advertisement was exposed to the audience. The credit for each period is equal to the ad's decay factor raised to the power of the number of periods since the ad was exposed. The decay factor is a number between 0 and 1 that represents the rate at which the ad's impact on sales decays over time.

For example, suppose we have an ad that was exposed to the audience for one time period, and we want to model the adstock effect for three time periods after the ad was exposed. We can use the following equation to calculate the credit for each time period:

Credit_t = Decay_factor^(t-1)

where t is the time period (1, 2, or 3) and Decay_factor is the decay factor for the ad.

We can then use these credits to calculate the total adstock effect by summing the credits for each time period. For example, if the decay factor is 0.5 and the ad was exposed in time period 1, the adstock effect for each time period would be as follows:
Time period 1: Credit_1 = 0.5^(1-1) = 1
Time period 2: Credit_2 = 0.5^(2-1) = 0.5
Time period 3: Credit_3 = 0.5^(3-1) = 0.25

Total adstock effect = 1 + 0.5 + 0.25 = 1.75

This is just one way to model the adstock effect, and there are many other approaches that can be used as well.
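
To make the time-crediting arithmetic above concrete, here is a minimal R sketch of the same calculation. The decay factor of 0.5 and the three-period window mirror the worked example; the helper name adstock_credits is just an illustrative choice.

#Credit for period t is decay_factor^(t-1)
adstock_credits <- function(decay_factor, n_periods) {
  decay_factor^(seq_len(n_periods) - 1)
}

credits <- adstock_credits(decay_factor = 0.5, n_periods = 3)
credits      #1.00 0.50 0.25
sum(credits) #1.75, the total adstock effect from the example above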

Poor Man’s Curve

The poor man's curve is a simple visual tool for quickly assessing the relationship between two variables in a dataset. It lets you see at a glance whether a relationship exists, what shape it takes, and whether there are any potential outliers or unusual points in the data.

It is called a "poor man's curve" because it is a simple and inexpensive way to visualize the relationship, compared to more sophisticated techniques such as non-linear regression or smoothing splines. It is not a substitute for more advanced statistical methods, such as regression analysis, which can provide more detailed and accurate information about the relationship between the variables.

#Create Poor Man's Plot
ggplot(adv2, aes(x = TV, y = Sales)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

This creates a scatter plot of the data points and overlays a LOESS-smoothed curve that shows the general trend in the data.

Transformation

Before we dive deeper into the transformation realm, it is important to understand why we transform at all.
In linear regression, the goal is to fit a linear model of the form y = mx + b, where y is the response variable, x is the predictor variable, m is the slope of the line, and b is the y-intercept. The slope and intercept are chosen to minimize the sum of squared residuals between the observed and predicted values of y. Sometimes the relationship between the predictor and response variables is not strictly linear. In these cases, it may be useful to transform the variables in order to linearize the relationship and improve the fit of the model. This is known as the transformation method.

One reason to use Transformation is that it can help to normalize the data, which can make it easier to model and analyze. For example, if your data has a skewed distribution, taking the log of the variable can help to make the distribution more symmetrical, which can make it easier to apply certain statistical techniques.
Another reason to use transformation is that it can help to stabilize the variance of a variable. This is especially important in cases where the variance of a variable is not constant, which can make it difficult to model and analyze the data. Transforming the variable can help to reduce this variance, which can make it easier to work with the data.

In terms of real world applications, Transformation is often used in fields such as finance, economics, and biology, where the data may have a skewed distribution or where the variance is not constant. For example, in finance, transformation is often used to analyze stock prices, which can have a highly skewed distribution. Overall, transformation is a useful technique that can help to normalize and stabilize data, making it easier to model and analyze. This can be helpful in a wide variety of fields and applications where data analysis is important.

There are many different types of transformations that can be applied, depending on the nature of the data and the form of the relationship. Some common transformations include:

  • Log transformation: This transformation is often used when the relationship between the variables is multiplicative rather than additive. Taking logs turns such a relationship into a linear one that can be modeled as log(y) = m*log(x) + b.
    Example: y_log <- log(y)
  • Square root transformation: This transformation is often used when the relationship between the variables is quadratic. In that case the relationship can be modeled as sqrt(y) = m*sqrt(x) + b.
    Example: y_sqrt <- sqrt(y)
  • Box-Cox transformation: This transformation is a generalization of the log transformation that can be used to model relationships that are more complex than a simple power law. The Box-Cox transformation is defined as y' = (y^lambda - 1)/lambda, where lambda is a parameter that can be estimated from the data.
    Example:
    #install and load the MASS package
    library(MASS)
    #boxcox() profiles the log-likelihood over a range of lambda values for a fitted model
    bc <- boxcox(lm(y ~ x))
    #pick the lambda with the highest log-likelihood and apply the transformation
    lambda <- bc$x[which.max(bc$y)]
    y_bc <- (y^lambda - 1) / lambda
  • Power transformation: This transformation is used when the relationship between the variables follows a power law. The power transformation is defined as y' = y^a, where a is a parameter that can be estimated from the data.
    Example: y_power <- y^a
  • Inverse transformation: This transformation is used to model relationships that are inverse. For example, if the response variable y is the reciprocal of the predictor variable x, the relationship can be modeled as 1/y = m/x + b.
    Example: y_inv <- 1/y
  • Exponential transformation: This transformation is used to model relationships that are exponential. For example, if the response variable y grows exponentially with the predictor variable x, the relationship can be written as y = exp(m*x)*exp(b), which linearizes to log(y) = m*x + b.
    Example: y_exp <- exp(y)
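
To show how a transformation slots into the regression itself, here is a hedged sketch that refits the day-level model from earlier with a log-transformed TV variable. Whether a log (or any other) transform is the right choice depends on your data; log1p() is used here only so that any zero-spend days do not turn into -Inf.

#Refit the model with a log-transformed TV variable
#log1p(TV) = log(1 + TV), which keeps zero-spend days finite
m_log <- lm(Sales ~ Carryover + log1p(TV) + Radio + Newspaper, data = adv2)
summary(m_log)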

Normalization

Normalization can be useful in modeling when you are working with large datasets that may contain redundant or correlated features. By normalizing the features in the dataset, you can improve the performance of your model and avoid some common problems such as overfitting.

For example, imagine you are building a machine learning model to predict the price of a house based on various features such as the size of the house, the number of bedrooms, and the location. If the size of the house is measured in square feet, and the number of bedrooms is an integer, then these two features will be on different scales. This can make it difficult for your model to learn effectively, because the size of the house will dominate the other feature. By normalizing the features, you can put them on the same scale, which allows your model to learn more effectively and make better predictions. This can be especially important when working with large datasets, where the scale of the features can vary significantly.

#Normalization example: express one variable relative to another (a simple ratio)
y_by_x_norm <- y/x
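
If the goal is instead to put the media variables on a comparable scale, a simple min-max rescaling to the 0-1 range is one common option. The sketch below assumes the TV, Radio and Newspaper columns from the advertising data used earlier; the helper name min_max is illustrative.

#Min-max rescaling: one common way to put features on a common 0-1 scale
min_max <- function(v) (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
adv2$TV_norm        <- min_max(adv2$TV)
adv2$Radio_norm     <- min_max(adv2$Radio)
adv2$Newspaper_norm <- min_max(adv2$Newspaper)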

Normalization can also be useful in modeling when you want to identify and eliminate correlated features. Correlated features can cause problems in a model because they can introduce redundant information, which can make it harder for the model to learn effectively. By normalizing the data and identifying correlated features, you can reduce the complexity of your model and improve its performance.
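
One quick way to spot correlated features is to look at a correlation matrix; ggcorrplot, which we loaded at the top, can visualize it in one line. The sketch below assumes the numeric columns from the advertising data.

#Check for correlated features among the numeric variables
corr_mat <- cor(adv2[, .(TV, Radio, Newspaper, Sales)], use = "complete.obs")
ggcorrplot(corr_mat, lab = TRUE)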

Hope this piece was helpful.
Happy Modeling!
