Here \(\hat{\beta}_1 = 10\) means that, holding all other variables constant, if \(x_1\) increases by 1 unit, \(Y\) is predicted to increase by 10 units. Similarly, \(\hat{\beta}_2 = 8\) means that, holding all other variables constant, if \(x_2\) increases by 1 unit, \(Y\) is predicted to increase by \(8\) units. Finally, \(\hat{\beta}_3 = 9\) means that, holding all other variables constant, if \(x_3\) increases by 1 unit, \(Y\) is predicted to increase by 9 units.
# define the function
y_hat <- function(x1, x2, x3){
output <- 25 + 10*x1 + 8*x2 + 9*x3
return(output)
}
# test the function
y_hat(2, 1, 5)
## [1] 98
# predict Y
y_hat(15, 10, 5)
## [1] 300
We can also pass the arguments by name; the benefit is that we don't have to remember their order.
# predict Y
y_hat(x2 = 10, x1 = 15, x3 = 5)
## [1] 300
Let's calculate SSE first.
# SSE = SST - SSR
SST <- 6724.125
SSR <- 6216.375
SSE <- SST - SSR
SSE
## [1] 507.75
We can calculate MSR with the following formula
\[ \text{MSR} = \frac{\text{SSR}}{p} \]
# MSR
p <- 2 # the number of predictors
MSR <- SSR / p
MSR
## [1] 3108.188
Now we can also calculate MSE with the following formula
\[ \text{MSE} = \frac{\text{SSE}}{n-p-1} \]
n <- 10
p <- 2
MSE <- SSE / (n - p - 1)
MSE
## [1] 72.53571
Finally we can calculate \(F\) with the following formula
\[ F = \frac{\text{MSR}}{\text{MSE}} \]
Fcalc <- MSR / MSE
Fcalc
## [1] 42.85044
Now to do the F-test we need to find the critical value from the \(F\) distribution with \(p = 2\) and \(n - p - 1 = 7\) degrees of freedom (here \(n = 10\)). We can use the qf() function to find the critical value for the F-test.
# find the critical value of F
alpha <- 0.05
n <- 10
p <- 2
# the F-test is always one-tailed, so we need the (1 - alpha) quantile
Fcrit <- qf(p = 1 - alpha, df1 = p, df2 = n - p - 1)
Fcalc > Fcrit
## [1] TRUE
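Equivalently, we can skip the critical value and compute the p-value of the F-test directly with pf(); a quick check using the objects defined above:
# p-value of the F-test (always upper tail)
pvalue_F <- 1 - pf(Fcalc, df1 = p, df2 = n - p - 1)
pvalue_F < alpha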
Since \(F_{calc} > F_{crit}\), we reject the null hypothesis. Recall the Null hypothesis here is
\[ H_0: \beta_1 = \beta_2 = 0 \]
So this means at least one of the coefficients is non-zero. Now similarly we can do the t-tests for \(\beta_1\) and \(\beta_2\). The t-test for \(\beta_1\) is
# t-test for beta_1
tcalc_beta1 <- (0.5906 - 0) / 0.0813
# it's a two tail test, so let's calculate the p value directly
pvalue_beta1 <- 2 * (1 - pt( abs(tcalc_beta1) , df = n - p - 1) )
pvalue_beta1
## [1] 0.0001678217
pvalue_beta1 < alpha
## [1] TRUE
So we reject the Null hypothesis that \(\beta_1 = 0\). Similarly, we can do the t-test for \(\beta_2\).
# t-test for beta_2
tcalc_beta2 <- (0.4980 - 0) / 0.0567
pvalue_beta2 <- 2 * (1 - pt( abs(tcalc_beta2) , df = n - p - 1) )
pvalue_beta2
## [1] 4.997936e-05
pvalue_beta2 < alpha
## [1] TRUE
So we reject the Null hypothesis that \(\beta_2 = 0\).
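As a cross-check, we could also compare these t-statistics with the critical value from qt(); a small sketch using the quantities above:
# two-tailed critical value with n - p - 1 = 7 degrees of freedom
tcrit <- qt(p = 1 - alpha/2, df = n - p - 1)
abs(tcalc_beta1) > tcrit
abs(tcalc_beta2) > tcrit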
In answer a. we need to load the data, fit the simple linear regression model with tv, and then write the estimated regression equation.
# load the library
library(readxl)
# directly load the data
Showtime <- read_excel("/home/tanvir/Documents/ownCloud/Git_Repos/EWU_repos/3_Fall_2023/eco_204/ewu-eco204.github.io/pdf_files/ps/04_mlr/Solutions/Showtime.xlsx")
We can view the data
Showtime
## # A tibble: 8 × 5
## revenue tv newspaper magazines leaflets
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 96 5 1.5 4.3 3.01
## 2 90 2 2 2.3 2
## 3 95 4 1.5 4.4 2.5
## 4 92 2.5 2.5 3.3 2.3
## 5 95 3 3.3 2.4 3.5
## 6 94 3.5 2.3 3.5 1.7
## 7 94 2.5 4.2 2.4 2.5
## 8 94 3 2.5 5.3 2.7
It’s also good to see some summary stats
summary(Showtime)
## revenue tv newspaper magazines
## Min. :90.00 Min. :2.000 Min. :1.500 Min. :2.300
## 1st Qu.:93.50 1st Qu.:2.500 1st Qu.:1.875 1st Qu.:2.400
## Median :94.00 Median :3.000 Median :2.400 Median :3.400
## Mean :93.75 Mean :3.188 Mean :2.475 Mean :3.487
## 3rd Qu.:95.00 3rd Qu.:3.625 3rd Qu.:2.700 3rd Qu.:4.325
## Max. :96.00 Max. :5.000 Max. :4.200 Max. :5.300
## leaflets
## Min. :1.700
## 1st Qu.:2.225
## Median :2.500
## Mean :2.526
## 3rd Qu.:2.777
## Max. :3.500
Looks OK. Now let's regress revenue on tv.
# fit the model
model_slr <- lm(revenue ~ tv, data = Showtime)
# view the results with stargazer package
library(stargazer)
stargazer(model_slr, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## revenue
## -----------------------------------------------
## tv 1.604**
## (0.478)
##
## Constant 88.638***
## (1.582)
##
## -----------------------------------------------
## Observations 8
## R2 0.653
## Adjusted R2 0.595
## Residual Std. Error 1.215 (df = 6)
## F Statistic 11.269** (df = 1; 6)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
So the estimated regression equation is
\[ \widehat{revenue} = 88.63 + 1.604 \text{tv} \]
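If you prefer not to read the numbers off the table, the estimates can also be pulled straight from the fitted object:
# extract the estimated coefficients
coef(model_slr)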
We need to develop an estimated equation using both tv and newspaper advertising. So let's fit the MLR (notice the dot after the tilde sign; this means all the variables in the data set other than the dependent variable revenue will be used in the model).
# fit the model
model_mlr <- lm(revenue ~ ., data = Showtime)
# view the results with stargazer package
library(stargazer)
stargazer(model_mlr, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## revenue
## -----------------------------------------------
## tv 1.952**
## (0.360)
##
## newspaper 1.243**
## (0.360)
##
## magazines 0.311
## (0.272)
##
## leaflets 0.537
## (0.487)
##
## Constant 82.010***
## (1.714)
##
## -----------------------------------------------
## Observations 8
## R2 0.956
## Adjusted R2 0.898
## Residual Std. Error 0.610 (df = 3)
## F Statistic 16.397** (df = 4; 3)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
So the estimated regression equation is
\[ \widehat{revenue} = 82.010 + 1.952 \text{tv} + 1.243 \text{newspaper} + 0.311 \text{magazines} + 0.537 \text{leaflets} \]
The two estimated coefficients of tv are different. I am not writing out the interpretation, but you should be able to do this (see the slides).
For multiple linear regression, getting the ANOVA table in R requires a bit more work. If we simply run anova(model_mlr), it will not give the overall ANOVA table we want (you can try!). We need to put two regression models in anova().
# fit null model
model_null <- lm(revenue ~ 1, data = Showtime)
# we already have the complete model
# anova function will compare between two models
anova(model_null, model_mlr)
## Analysis of Variance Table
##
## Model 1: revenue ~ 1
## Model 2: revenue ~ tv + newspaper + magazines + leaflets
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7 25.5000
## 2 3 1.1153 4 24.385 16.397 0.02227 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Careful here: RSS (residual sum of squares) is actually SSE, so the anova table is showing SSE. For the null model \(\text{SSE} = 25.50\) with \(\text{Df} = 7\). Recall that the SSE of the null model is the same as the SST of the full model. So for the full model \(\text{SST} = 25.50\) with \(\text{Df} = n - 1 = 7\). The second row shows the SSE of the full model, \(\text{SSE} = 1.1153\), with \(\text{Res.Df} = n - p - 1 = 3\); the Sum of Sq column gives \(\text{SSR} = \text{SST} - \text{SSE} = 24.385\) with \(\text{Df} = p = 4\).
With this it is possible to calculate the other things asked in the question (do it!). The important point here is \(\text{SSR} = \text{SSE}_{R} - \text{SSE}\), because \(\text{SSE}_R = \text{SST}\).
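For example, here is a sketch of how the remaining quantities could be computed from the numbers in the table above:
# ANOVA quantities for the full model (n = 8, p = 4)
SST <- 25.5000           # SSE of the null model
SSE <- 1.1153            # SSE of the full model
SSR <- SST - SSE         # matches the Sum of Sq column, 24.385
MSR <- SSR / 4           # divide by p
MSE <- SSE / (8 - 4 - 1) # divide by n - p - 1
MSR / MSE                # F-statistic, about 16.397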
The \(R^2\) in the SLR is 0.653 and the \(R^2\) in the MLR is 0.956. The \(R^2\) is higher in the MLR; this is expected, because we have added more variables to the model, and \(R^2\) never decreases when a variable is added. The adjusted \(R^2\) of the MLR model is 0.898.
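Both numbers can be verified from the ANOVA quantities; a quick check with the values computed above:
# R-squared and adjusted R-squared of the MLR model
1 - SSE / SST                             # about 0.956
1 - (SSE / (8 - 4 - 1)) / (SST / (8 - 1)) # about 0.898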
Answer of i. Clearly in the MLR results, tv and newspaper are both significant at the 5% level of significance. But magazines and leaflets are not significant, not even at the 10% level of significance.
Answer of ii. For overall significance testing we need to do the F-test. Looking at the summary of the regression output, the joint hypothesis that all slope coefficients are \(0\) can be rejected at the 5% significance level. We can also see the value of the F-statistic and its p-value in the anova table (this is in answer d).
Answer of iii. In this case we have two restrictions, so we need to use the restricted/unrestricted F-test. The null hypothesis is
\[ H_0: \beta_3 = \beta_4 = 0 \]
The alternative is the same as before. Again we need to use the anova table to do the F-test. First fit the model with the restrictions; this means we need to drop magazines and leaflets from the model.
model_restrict <- lm(revenue ~ tv + newspaper, data = Showtime)
Now compare this with the full model
anova( model_restrict, model_mlr)
## Analysis of Variance Table
##
## Model 1: revenue ~ tv + newspaper
## Model 2: revenue ~ tv + newspaper + magazines + leaflets
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5 2.0646
## 2 3 1.1153 2 0.94926 1.2766 0.3971
Here \(\text{SSE}_R - \text{SSE} = 0.94926\) and \(\text{Df} = 2\). So the F-statistic is \(1.2766\) and the p-value is \(0.3971\). We fail to reject the null hypothesis that \(\beta_3 = \beta_4 = 0\) at the 5% level of significance. So we can say that magazines and leaflets are jointly insignificant at the 5% level, and perhaps we can drop them.
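To see where this F comes from, here is the restricted vs. unrestricted calculation done by hand with the numbers from the table:
SSE_R <- 2.0646   # SSE of the restricted model
SSE_U <- 1.1153   # SSE of the unrestricted (full) model
q <- 2            # number of restrictions
((SSE_R - SSE_U) / q) / (SSE_U / (8 - 4 - 1)) # about 1.2766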
This is the same prediction problem, but be careful: we need to divide the numbers by \(1000\) because the model is in thousands of dollars.
test_data <- data.frame(tv = 3500/1000, newspaper = 2300/1000, magazines = 1000/1000, leaflets = 500/1000)
predict(model_mlr, newdata = test_data)
## 1
## 92.28007
The question asks for a confidence interval for the mean of revenue given values for the tv and newspaper advertisement. So we need to use the restricted model to predict (not the full model).
predict(model_restrict, newdata = data.frame(tv = 3500/1000, newspaper = 2300/1000), interval = "confidence")
## fit lwr upr
## 1 94.23801 93.61968 94.85633
Same question, but this time a prediction interval
predict(model_restrict, newdata = data.frame(tv = 3500/1000, newspaper = 2300/1000), interval = "prediction")
## fit lwr upr
## 1 94.23801 92.47425 96.00177
Let’s see the residual plot, here we need to plot the residuals against the fitted values.
plot(model_mlr$fitted.values, model_mlr$residuals)
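A dashed reference line at zero (an optional extra) makes it easier to judge whether the residuals scatter randomly around zero:
# add a horizontal reference line at zero
abline(h = 0, lty = 2)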
This is a simulation exercise. Simulation means the data world is in your control: you can generate the data. In this case we will generate data from a given Data Generating Process (DGP). Here is the DGP
\[ Y = 3 + 5X_1+ 2 X_2 + \epsilon \]
We call it a DGP because, if we know how \(\epsilon\), \(X_1\) and \(X_2\) are generated, then we can generate the \(Y\) variable. In this case we will generate \(X_1\), \(X_2\) and \(\epsilon\) from given distributions. Note that here \(\beta_0 = 3\), \(\beta_1 = 5\) and \(\beta_2 = 2\).
# set the seed for reproducibility
set.seed(1238818)
# fix number of observations
n <- 50
# generate epsilon
epsilon <- rnorm(n, mean = 0, sd = sqrt(.25))
x2 <- rnorm(n, mean = 0, sd = sqrt(.35))
x1 <- runif(n, min = 0, max = 1)
# generate y
beta0 <- 3
beta1 <- 5
beta2 <- 2
y <- beta0 + beta1*x1 + beta2*x2 + epsilon
# create a data frame
sim_data <- data.frame(sales = y, tv = x1, newspaper = x2)
# view the data
sim_data
## sales tv newspaper
## 1 4.488870 0.67475860 -0.88409171
## 2 4.622145 0.23868809 -0.07856125
## 3 6.758594 0.84156399 -0.53307615
## 4 4.672921 0.32841212 -0.09015324
## 5 7.531821 0.66415995 0.73722200
## 6 6.888550 0.97950962 -0.23011098
## 7 7.013012 0.54656473 0.54823801
## 8 4.179907 0.14094410 0.70350506
## 9 4.109454 0.26633477 -0.10927090
## 10 3.911798 0.07834296 -0.24842969
## 11 8.682285 0.96228672 0.17763126
## 12 4.129723 0.48018464 -0.66100598
## 13 5.294358 0.51607777 -0.07072806
## 14 3.754744 0.34645097 -0.51126184
## 15 4.108006 0.03651132 0.27954052
## 16 6.159030 0.73164819 -0.01979241
## 17 4.857479 0.36765320 0.30927280
## 18 5.363396 0.03544635 1.07404090
## 19 5.402776 0.58979273 -0.23550138
## 20 2.668866 0.32410347 -0.74674305
## 21 6.462291 0.58577540 0.29084035
## 22 4.192903 0.64075905 -1.08463652
## 23 6.139809 0.68333499 -0.20025186
## 24 7.022200 0.82843651 0.26080553
## 25 4.639332 0.13464558 0.40706327
## 26 5.577750 0.53171211 -0.07781594
## 27 6.952884 0.07929914 1.18350233
## 28 6.397108 0.64287492 0.55759626
## 29 8.057331 0.74319998 0.18020091
## 30 4.627889 0.63376057 -0.79549043
## 31 7.705306 0.51871720 0.43074176
## 32 9.096185 0.67693402 1.27519794
## 33 5.896129 0.85302258 -0.31919664
## 34 5.200500 0.41885106 0.52612226
## 35 4.927863 0.32940194 0.48290405
## 36 7.426120 0.84504846 0.10357934
## 37 5.658618 0.80828391 -0.54442856
## 38 1.883194 0.11722543 -0.96721930
## 39 5.397571 0.58166589 -0.32375446
## 40 8.388477 0.90007700 0.46233869
## 41 3.371092 0.19324682 -0.25581953
## 42 5.907344 0.51609563 0.37455582
## 43 7.224768 0.62993903 0.53770501
## 44 5.412905 0.43644514 -0.03660434
## 45 6.359813 0.52871944 0.28302049
## 46 6.956927 0.75649477 -0.14057255
## 47 6.887140 0.02564032 1.36090442
## 48 5.003667 0.23858327 0.13021873
## 49 5.259506 0.91529906 -0.94328946
## 50 2.863081 0.04504938 -0.39662903
Since we need to generate the data for \(n = 50\), then \(n = 100\), then \(n = 300\) and then \(n = 500\), rather than copying and pasting we will write a function first and generate the data just by changing the value of \(n\).
# function to generate the data
generate_data <- function(n){
# generate epsilon
epsilon <- rnorm(n, mean = 0, sd = sqrt(.25))
x2 <- rnorm(n, mean = 0, sd = sqrt(.35))
x1 <- runif(n, min = 0, max = 1)
# generate y
y <- 3 + 5*x1 + 2*x2 + epsilon
# create a data frame
sim_data <- data.frame(y, x1, x2)
# return the data
return(sim_data)
}
So every time we call this function it will return a data frame for us.
# generate the data for n = 50
# give the data a name
data_50 <- generate_data(50)
# see the data
data_50
## y x1 x2
## 1 7.160021 0.8643315108 -0.19719144
## 2 9.734567 0.9969846567 0.58983081
## 3 7.991663 0.6065868419 1.00988904
## 4 6.169271 0.6150654876 0.07433327
## 5 4.942380 0.3112173709 -0.07363607
## 6 3.040282 0.2318498383 -0.55072124
## 7 6.053363 0.5054107767 0.09598662
## 8 4.010350 0.1884302876 0.02694795
## 9 7.921904 0.6750195345 0.68835958
## 10 7.405118 0.5041511960 0.77197236
## 11 7.389173 0.5789139955 0.78967562
## 12 4.347276 0.4874288528 -0.35773069
## 13 5.709370 0.3924997861 0.30472539
## 14 2.000691 0.4026356814 -1.61490029
## 15 6.494485 0.3747238233 0.81481434
## 16 5.466603 0.1296375915 1.25576296
## 17 6.228939 0.6780066260 -0.08746064
## 18 6.505348 0.4280622276 0.63755284
## 19 8.548070 0.9217563251 0.08847848
## 20 6.046780 0.6598597541 -0.18164946
## 21 7.414852 0.6765351992 0.40062186
## 22 4.802390 0.2228303547 0.39143820
## 23 3.772313 0.2202511090 0.05416289
## 24 6.289815 0.5026718231 0.39166579
## 25 6.870056 0.6968779189 -0.04820738
## 26 3.700308 0.0770797962 0.15239624
## 27 3.064961 0.0514335553 0.07420946
## 28 3.549345 0.2654457928 -0.24022771
## 29 8.432483 0.6817292878 0.82744826
## 30 4.365981 0.1752574323 0.18817417
## 31 4.801454 0.0008933423 0.76074385
## 32 4.425263 0.1407824385 0.10744378
## 33 2.435871 0.1311974372 -0.42757508
## 34 4.767181 0.5122128390 0.14078620
## 35 8.822670 0.9512334329 0.55533630
## 36 3.847990 0.3305822532 -0.22609786
## 37 5.568301 0.5961034803 -0.23829744
## 38 8.049991 0.9791409415 0.04746287
## 39 5.075841 0.7545945263 -0.74532186
## 40 6.015720 0.6723935523 0.09361944
## 41 8.098380 0.9895856185 -0.06795263
## 42 3.034823 0.0246400333 -0.05446336
## 43 3.431211 0.5167486565 -0.78645788
## 44 8.071740 0.8496184449 0.80016113
## 45 2.962801 0.1994201830 -0.52511673
## 46 5.591551 0.6358970397 -0.42060792
## 47 5.024179 0.0855034578 0.79432019
## 48 8.304139 0.8475595561 0.26167367
## 49 5.239059 0.3890425721 -0.07001134
## 50 8.412478 0.7185553380 0.72129103
Let’s do it for other values of \(n\).
# generate the data for n = 100
data_100 <- generate_data(100)
# generate the data for n = 300
data_300 <- generate_data(300)
# generate the data for n = 500
data_500 <- generate_data(500)
Now finally we will fit the model for each of the data sets and see the results.
# fit the model for n = 50
model_50 <- lm(y ~ x1 + x2, data = data_50)
summary(model_50)
##
## Call:
## lm(formula = y ~ x1 + x2, data = data_50)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10598 -0.19676 0.08032 0.23143 0.60772
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.87250 0.10523 27.30 <2e-16 ***
## x1 5.30474 0.18596 28.53 <2e-16 ***
## x2 2.01375 0.09929 20.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3706 on 47 degrees of freedom
## Multiple R-squared: 0.9648, Adjusted R-squared: 0.9633
## F-statistic: 644.8 on 2 and 47 DF, p-value: < 2.2e-16
# fit the model for n = 100
model_100 <- lm(y ~ x1 + x2, data = data_100)
summary(model_100)
##
## Call:
## lm(formula = y ~ x1 + x2, data = data_100)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12078 -0.38419 0.01002 0.36810 1.13314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.99315 0.11044 27.10 <2e-16 ***
## x1 4.93396 0.18659 26.44 <2e-16 ***
## x2 2.04248 0.08212 24.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5164 on 97 degrees of freedom
## Multiple R-squared: 0.9264, Adjusted R-squared: 0.9249
## F-statistic: 610.4 on 2 and 97 DF, p-value: < 2.2e-16
# fit the model for n = 300
model_300 <- lm(y ~ x1 + x2, data = data_300)
summary(model_300)
##
## Call:
## lm(formula = y ~ x1 + x2, data = data_300)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.62477 -0.31072 0.00712 0.31896 1.17742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.91282 0.05883 49.51 <2e-16 ***
## x1 5.07259 0.09549 53.12 <2e-16 ***
## x2 2.05007 0.04601 44.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4787 on 297 degrees of freedom
## Multiple R-squared: 0.9429, Adjusted R-squared: 0.9425
## F-statistic: 2451 on 2 and 297 DF, p-value: < 2.2e-16
# fit the model for n = 500
model_500 <- lm(y ~ x1 + x2, data = data_500)
summary(model_500)
##
## Call:
## lm(formula = y ~ x1 + x2, data = data_500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31164 -0.32046 -0.01006 0.31560 1.27403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.05733 0.04150 73.67 <2e-16 ***
## x1 4.92437 0.07255 67.87 <2e-16 ***
## x2 1.99314 0.03658 54.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4784 on 497 degrees of freedom
## Multiple R-squared: 0.9373, Adjusted R-squared: 0.9371
## F-statistic: 3715 on 2 and 497 DF, p-value: < 2.2e-16
For any fitted model we can also get the estimated coefficients using the following command,
# get the estimated coefficients for n = 50
model_50$coefficients
## (Intercept) x1 x2
## 2.872500 5.304738 2.013745
Now we will generate the data for sample size \(n = 50\), \(1000\) times, fit the model each time, save \(\hat{\beta}_1\), and plot the histogram.
# set the seed for reproducibility
set.seed(1238818)
# create an empty vector to save the results
beta_1_hat_50 <- c()
# generate the data for n = 50, 1000 times
for(i in 1:1000){
# generate the data
data_50 <- generate_data(50)
# fit the model
model_50 <- lm(y ~ x1 + x2, data = data_50)
# save the results
beta_1_hat_50[i] <- model_50$coefficients[2]
}
Now we have all the estimated coefficients for \(\hat{\beta}_1\) saved in the vector beta_1_hat_50. There will be \(1000\) of them, since we did repeated sampling \(1000\) times, with sample size 50 each time. Let's plot the histogram.
beta_1_hat_50 # the 1000 beta1 hats from 1000 rounds of repeated sampling
## [1] 4.485478 5.304738 4.959747 5.251674 4.895635 5.101398 4.754430 5.054136
## [9] 5.306305 5.236520 4.901062 4.991120 5.166903 5.254400 5.007742 5.168038
## [17] 4.743377 4.666466 5.135399 4.790026 4.898384 5.093273 5.101810 4.836331
## [25] 4.437754 4.828734 5.081656 5.162919 4.781128 4.649088 5.181499 5.069309
## [33] 4.987362 5.620613 5.013005 5.116759 4.665046 4.862916 4.757111 4.997230
## [41] 5.093257 4.670814 5.486029 5.231187 5.063979 5.250513 5.085916 4.924652
## [49] 5.347803 5.268144 4.563011 4.908670 5.452769 4.518165 4.598946 5.317104
## [57] 4.928645 5.285162 4.812861 5.078343 4.714498 4.790625 4.832180 4.642192
## [65] 4.696364 4.898946 4.822505 4.919054 5.172281 5.204671 4.976395 4.921758
## [73] 4.866372 4.761117 4.646155 5.493639 5.326390 5.067666 4.902544 4.786867
## [81] 4.850499 5.553196 4.865611 4.733506 5.062203 4.855864 5.367243 5.187651
## [89] 4.834076 5.056856 4.929961 5.397157 5.027832 4.582912 5.038255 5.386291
## [97] 5.088605 4.954704 4.737006 5.112566 5.399548 5.163330 4.889125 5.071413
## [105] 5.252471 5.158623 5.345738 5.141599 4.672340 4.980025 4.994469 4.905299
## [113] 5.374069 4.629335 4.885723 4.806755 4.962482 4.887558 5.065308 4.597303
## [121] 4.717842 5.102193 4.959834 5.132458 5.556404 4.746247 5.381661 4.892779
## [129] 5.153609 5.583849 4.508429 4.925393 5.031795 5.537385 5.143640 5.098437
## [137] 4.840591 4.568493 4.848978 5.260674 4.993063 5.111483 4.756304 5.084543
## [145] 5.085331 4.621612 5.126513 4.868906 4.927173 4.802615 4.816373 4.769819
## [153] 4.698057 4.757790 5.071649 4.994194 5.018386 5.094399 5.291332 5.261723
## [161] 4.833200 5.337813 5.318991 4.708214 4.643773 5.368306 5.053251 4.843891
## [169] 5.207936 4.872206 5.185777 5.130650 4.880765 4.874002 4.908826 5.069372
## [177] 5.236570 4.901159 4.869120 4.724201 4.769751 4.979136 5.051217 5.076313
## [185] 5.249442 5.141515 4.580691 4.836354 5.752185 4.668689 5.183913 5.374048
## [193] 5.364288 4.776408 5.179440 4.548132 5.070780 5.092629 5.098910 4.991300
## [201] 5.240991 5.132287 5.300620 5.163879 4.844649 5.018457 5.201190 5.206082
## [209] 5.228265 5.262440 4.857969 4.934296 4.772525 4.906537 4.461187 5.227305
## [217] 5.109712 5.055504 5.141304 5.015812 4.920844 5.186784 5.040651 5.442981
## [225] 5.175959 4.455601 4.873602 4.952239 5.105042 4.525350 4.904833 5.179180
## [233] 5.067308 4.930890 4.699570 4.762481 4.844586 4.823260 4.833469 4.826104
## [241] 5.232760 4.966485 4.608799 4.623180 4.920504 4.648176 4.802458 4.531290
## [249] 5.251741 5.069689 4.269026 4.532997 5.285251 5.350430 5.170922 4.898040
## [257] 5.135269 4.949178 5.197332 4.889503 4.949457 4.842371 5.420308 5.032928
## [265] 5.061658 4.849542 4.824760 5.026706 4.857946 4.993099 4.726667 4.824758
## [273] 4.749239 5.039691 5.057484 5.132294 5.256295 5.253414 4.773786 4.836962
## [281] 4.381831 4.381285 4.937885 5.007333 5.460112 5.143479 4.810318 4.918331
## [289] 4.741700 4.801868 4.903011 5.052805 5.025423 4.658228 5.264722 5.010361
## [297] 5.199477 4.689741 5.159770 5.613585 4.999266 5.374725 5.448959 5.130768
## [305] 5.172403 5.165188 5.245571 4.684331 5.185454 4.944623 4.798659 4.537141
## [313] 4.979890 4.845987 5.566050 5.370217 4.866928 5.132236 4.766464 4.934627
## [321] 5.277240 5.212898 5.181595 4.728540 5.645782 4.894996 5.025202 4.986876
## [329] 5.429623 4.773761 4.710793 5.233495 5.184020 4.902078 5.181114 5.487995
## [337] 4.993412 5.373381 4.988172 5.287826 4.917209 4.995167 4.938579 4.871048
## [345] 5.301386 4.826808 4.554810 4.970244 5.156826 5.358136 4.855491 4.855747
## [353] 4.979132 5.477457 4.872498 5.194213 4.834971 5.065981 5.079478 4.957526
## [361] 4.519404 5.053165 4.679644 5.308541 4.829895 5.303690 5.397950 5.037411
## [369] 4.920102 4.881469 5.054026 4.503253 5.088003 5.470697 5.068837 4.582705
## [377] 5.255538 4.450715 4.785105 5.080632 5.148690 4.562165 5.195823 4.754681
## [385] 4.777834 5.038562 4.935939 5.198320 5.160737 5.100383 4.542710 4.938678
## [393] 5.357185 5.097343 4.832971 4.855398 4.917219 5.174243 4.586963 5.118714
## [401] 4.996982 5.145911 5.278410 5.333532 5.031475 5.017665 5.022780 4.684846
## [409] 4.932596 5.510468 4.642632 4.565029 4.682752 5.381128 4.837027 5.051457
## [417] 5.298639 4.766839 5.361065 5.097074 5.287559 4.910771 5.360722 4.740435
## [425] 5.036110 5.133384 4.777498 5.435280 4.907954 4.824628 5.035225 4.654535
## [433] 5.240856 4.824306 5.078145 5.155043 4.901482 5.046447 5.109161 5.125047
## [441] 5.125059 4.829513 5.292919 5.296713 5.044832 5.300289 5.116186 5.459428
## [449] 4.930736 5.012760 5.715140 4.970881 4.885448 4.681220 4.971130 5.176769
## [457] 5.123961 5.576833 4.815896 5.151657 4.892834 5.045430 5.225182 4.945834
## [465] 4.594349 5.049462 5.134431 4.734553 4.798456 5.264664 5.054971 5.058237
## [473] 4.703205 4.563755 5.124378 4.901899 4.720079 4.974602 4.857145 4.640791
## [481] 5.718206 4.975588 5.302425 4.577156 4.667167 4.971344 4.662398 4.747257
## [489] 5.061702 4.596287 5.074772 5.185422 5.037521 4.992930 4.982168 4.746163
## [497] 5.011922 5.009470 4.873124 4.911402 4.995777 5.012149 5.056535 4.735705
## [505] 4.910249 4.882824 4.676207 5.132500 5.116744 4.809766 5.058672 4.988212
## [513] 4.895384 4.424062 4.593222 4.959165 4.972550 4.997310 5.160204 5.126934
## [521] 4.747962 4.832251 4.817506 5.213387 4.937497 5.051790 4.727891 5.046995
## [529] 5.001643 5.090624 4.587650 5.140236 4.853415 4.881284 5.210002 5.131030
## [537] 5.011962 5.159495 4.496885 4.895233 5.134878 4.659649 5.094232 4.944533
## [545] 5.007720 5.204960 4.808499 4.858506 4.500230 4.986009 4.856290 5.134735
## [553] 5.207438 4.573706 5.121951 4.970375 4.953576 5.475451 5.319224 5.060667
## [561] 4.614044 5.205308 5.000751 5.240097 4.621917 5.371268 5.302424 4.773791
## [569] 5.366924 5.129211 5.029783 5.141469 5.300557 5.366798 4.828968 5.283011
## [577] 5.129549 5.080294 4.920634 4.973983 5.371933 5.515251 4.922464 4.940064
## [585] 5.292185 5.188624 5.074337 4.934006 5.320070 4.991679 4.952876 5.083044
## [593] 5.341746 5.168061 4.789659 4.571647 5.581421 4.871725 5.239086 4.727561
## [601] 5.055505 4.736164 5.286184 5.255170 4.600747 4.809100 5.199217 5.082497
## [609] 4.847345 4.809383 5.113691 4.708357 5.008076 4.846232 5.088397 4.855276
## [617] 4.884021 4.856612 4.997421 5.515449 5.126846 4.792163 4.961163 5.094427
## [625] 4.811079 4.885611 4.537607 5.002173 4.715811 4.727964 5.547906 5.517590
## [633] 5.103533 5.169705 4.615584 4.771699 5.556764 4.695611 5.068552 4.270519
## [641] 5.192190 5.084283 5.008164 4.972422 4.856107 5.387147 4.613194 4.897642
## [649] 4.402707 5.152640 4.985800 4.789159 5.062928 5.135107 5.166197 4.980036
## [657] 5.047420 5.156798 5.389144 5.244218 4.947095 5.454278 5.465370 5.342624
## [665] 5.600024 5.268523 4.757993 5.104025 4.977780 5.396636 5.316958 4.887372
## [673] 5.025701 4.668983 4.848392 4.796946 4.908208 5.187225 4.749204 5.142773
## [681] 4.890062 4.622630 5.025661 4.824953 4.553671 5.214686 5.267459 5.336312
## [689] 5.258944 4.871090 4.610731 5.192959 5.148209 4.910966 4.521401 4.921669
## [697] 4.973159 4.785430 5.217412 5.102504 4.805551 4.746484 5.326392 4.840600
## [705] 5.225079 5.099077 4.535507 4.845539 4.742769 5.017666 4.741661 4.827594
## [713] 5.051057 4.896021 4.963346 4.822602 5.016697 5.317157 4.723515 4.816076
## [721] 4.754528 4.657783 4.722018 5.229558 5.039458 4.757617 4.709079 5.003111
## [729] 5.187133 5.385466 5.516635 5.195196 5.058186 5.013071 5.209867 5.248617
## [737] 5.198047 4.748710 5.111081 5.287853 5.077555 4.858578 4.891701 5.575714
## [745] 5.455877 5.012015 5.024091 4.911574 4.784502 4.928050 5.154982 5.331623
## [753] 5.428720 5.006266 4.992775 5.002264 5.653871 5.353800 5.068550 5.201087
## [761] 5.573610 4.931420 5.210005 5.084679 5.064386 5.274057 4.733860 5.251118
## [769] 4.936122 4.482423 5.069234 4.906082 5.061590 5.659901 4.460655 5.434459
## [777] 4.957917 5.598068 5.113459 5.073720 4.726930 5.071080 5.483408 5.497302
## [785] 5.373398 4.633976 5.033450 5.263387 4.300644 4.557961 5.546296 4.790253
## [793] 4.586537 5.213404 5.330883 4.601047 4.972378 5.042581 5.134534 4.607962
## [801] 5.136604 5.987094 5.172229 5.293352 4.975573 4.886609 4.931155 4.488223
## [809] 5.119127 4.825748 5.245991 5.147783 5.515031 5.111311 4.988581 4.540124
## [817] 4.500462 5.055904 5.003116 4.670038 4.460839 5.086395 4.733326 5.372004
## [825] 4.786599 5.067048 4.675092 5.122879 5.248668 5.126275 4.823130 4.536572
## [833] 4.728632 4.726170 5.385931 5.005362 4.996158 4.997981 4.865040 4.939332
## [841] 4.966626 4.975067 5.304325 5.008067 5.205826 4.885080 5.043048 5.379153
## [849] 4.637116 4.704769 5.097440 5.382212 4.732754 5.195657 4.850238 5.086946
## [857] 5.026773 4.609905 5.210246 5.175134 5.157766 5.155974 4.768472 4.558817
## [865] 5.353287 5.105809 5.130621 5.066071 4.701898 5.025615 4.994343 5.198548
## [873] 5.322911 5.472932 4.887698 5.179708 4.853338 5.013853 5.139585 4.818475
## [881] 5.472540 4.931023 5.398423 4.849495 4.718906 5.172996 4.933374 5.299378
## [889] 5.488836 4.823664 5.230227 4.664390 4.923887 4.989167 5.026607 5.358095
## [897] 4.971740 4.836640 4.585121 5.324230 5.091373 4.799231 4.490879 5.077438
## [905] 5.060028 5.109613 5.431736 5.053831 5.440969 4.879381 4.967715 4.821120
## [913] 4.743480 4.861706 5.380455 4.639657 5.102443 4.726643 5.362494 4.857883
## [921] 4.979018 4.952605 4.662066 5.118392 4.425176 4.546790 4.497086 5.286347
## [929] 4.753523 5.198947 5.033935 5.233569 5.095337 5.288266 4.774368 5.072913
## [937] 4.717581 5.472029 5.074954 4.944193 4.745364 5.250638 4.628851 5.256483
## [945] 4.616283 5.362248 4.971833 5.102638 4.963357 5.228972 4.582668 5.309639
## [953] 4.975145 4.970081 5.311372 4.773843 5.347762 4.379886 5.120061 5.407706
## [961] 5.170946 4.949518 5.198010 4.951369 5.072183 4.944847 5.332221 5.025111
## [969] 4.612138 5.023289 5.131394 5.127251 4.577943 5.043374 4.789041 5.215886
## [977] 5.075715 5.123780 4.937445 4.922138 5.637467 5.239811 5.665027 4.801432
## [985] 4.973035 4.954486 4.864773 4.742852 4.771484 5.163935 4.677989 4.762401
## [993] 5.156840 5.480446 4.666277 5.178436 4.933347 5.091997 4.787239 4.548936
# plot the histogram of the 1000 beta1 hats
hist(beta_1_hat_50, breaks = 20)
This is the sampling distribution of \(\hat{\beta}_1\) for \(n = 50\). Notice it looks like a Normal distribution. We can also calculate the standard error from this simulation, which is the standard deviation of the sampling distribution.
# calculate the standard error
sd(beta_1_hat_50)
## [1] 0.2604082
In this way you can also get the sampling distribution for \(\hat{\beta}_0\) and \(\hat{\beta}_2\). Note that R indexing starts at 1, so you need the model_50$coefficients[1] command for \(\hat{\beta}_0\) and the model_50$coefficients[3] command for \(\hat{\beta}_2\).
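For instance, here is a minimal sketch of the same simulation for \(\hat{\beta}_2\), reusing generate_data() from above:
# repeated sampling for beta_2_hat with n = 50
set.seed(1238818)
beta_2_hat_50 <- c()
for(i in 1:1000){
  data_50 <- generate_data(50)
  model_50 <- lm(y ~ x1 + x2, data = data_50)
  beta_2_hat_50[i] <- model_50$coefficients[3]
}
hist(beta_2_hat_50, breaks = 20)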
Now we will do the same for \(n = 1000\) and compare the sampling distributions of \(\hat{\beta}_1\) for \(n = 50\) and \(n = 1000\).
# set the seed for reproducibility
set.seed(1238818)
# create an empty vector to save the results
beta_1_hat_1000 <- c()
# generate the data for n = 1000, 1000 times
for(i in 1:1000){
# generate the data
data_1000 <- generate_data(1000)
# fit the model
model_1000 <- lm(y ~ x1 + x2, data = data_1000)
# save the results
beta_1_hat_1000[i] <- model_1000$coefficients[2]
}
Let's calculate the standard error.
# calculate the standard error
sd(beta_1_hat_1000)
## [1] 0.05418711
Notice the standard error is significantly reduced. This is the same story as with sample means: as we have more and more data, we get a more accurate estimate.
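In fact, theory says the standard error shrinks at the rate \(\sqrt{n}\); a rough check comparing the two simulated standard errors with \(\sqrt{1000/50}\):
# ratio of simulated standard errors vs the theoretical sqrt(n) rate
sd(beta_1_hat_50) / sd(beta_1_hat_1000) # roughly 4.8
sqrt(1000 / 50)                         # about 4.47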
Now let’s compare the two histograms
# plot the histogram for n = 50
hist(beta_1_hat_50, breaks = 10)
# plot the histogram for n = 1000
hist(beta_1_hat_1000, breaks = 200)
Notice the histogram for \(n = 1000\) is much more concentrated around the true value of \(\beta_1 = 5\).
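Because the two histograms use different breaks and axis ranges, a fairer visual comparison would put them on a common x-axis; one possible way:
# common x-axis so the spreads are directly comparable
hist(beta_1_hat_50, breaks = 20, xlim = c(4, 6), main = "n = 50")
hist(beta_1_hat_1000, breaks = 20, xlim = c(4, 6), main = "n = 1000")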
We can include the interaction term by running the following command,
# fit the model
model_interaction <- lm(revenue ~ tv*newspaper, data = Showtime)
# view the results with stargazer package
library(stargazer)
stargazer(model_interaction, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## revenue
## -----------------------------------------------
## tv 1.089
## (1.020)
##
## newspaper -0.667
## (1.631)
##
## tv:newspaper 0.724
## (0.589)
##
## Constant 86.531***
## (3.077)
##
## -----------------------------------------------
## Observations 8
## R2 0.941
## Adjusted R2 0.897
## Residual Std. Error 0.612 (df = 4)
## F Statistic 21.349*** (df = 3; 4)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Again we can write the estimated model as,
\[ \widehat{revenue} = 86.531 + 1.089 \times tv - 0.667 \times newspaper + 0.724 \times (tv \times newspaper) \]
Now let's interpret the results. This is an interaction model, so we need to be careful. First let's slightly rewrite the estimated model as
\[ \widehat{revenue} = 86.531 + (1.089 + 0.724\; newspaper) \times tv - 0.667 \times newspaper \]
This means if we increase the TV advertisement by 1 unit, the revenue is predicted to increase by \(1.089 + 0.724\; newspaper\) units. Notice this is a function of the newspaper advertisement, so the effect of TV advertisement on revenue depends on the level of newspaper advertisement.
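To make this concrete, we can evaluate the estimated marginal effect of tv at a few levels of newspaper, pulling the coefficients from the fitted object (the chosen levels are just illustrative):
# marginal effect of tv at different newspaper levels
b <- coef(model_interaction)
newspaper_levels <- c(1.5, 2.5, 4.2)
b["tv"] + b["tv:newspaper"] * newspaper_levels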
This is for the Auto data set that you already saw before. In this task we will do some non-linear regression, in particular polynomial regression with a given degree. Interestingly, we will see that polynomial regression is actually a special case of multiple linear regression.
The first task is the same: load the data.
# load the library, load the data
library(readxl)
Auto <- read_excel("/home/tanvir/Documents/ownCloud/Git_Repos/EWU_repos/3_Fall_2023/eco_204/ewu-eco204.github.io/pdf_files/ps/04_mlr/Solutions/Auto.xlsx")
summary(Auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Length:397
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.0 Class :character
## Median :23.00 Median :4.000 Median :146.0 Mode :character
## Mean :23.52 Mean :5.458 Mean :193.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0
## Max. :46.60 Max. :8.000 Max. :455.0
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2223 1st Qu.:13.80 1st Qu.:73.00 1st Qu.:1.000
## Median :2800 Median :15.50 Median :76.00 Median :1.000
## Mean :2970 Mean :15.56 Mean :75.99 Mean :1.574
## 3rd Qu.:3609 3rd Qu.:17.10 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
## name
## Length:397
## Class :character
## Mode :character
##
##
##
Surprisingly, horsepower is showing up as character; we need to fix it. The summary stats show Class :character, which means character or text in horsepower. This is a problem. Let's see the data for horsepower,
sort(Auto$horsepower)
## [1] "?" "?" "?" "?" "?" "100" "100" "100" "100" "100" "100" "100"
## [13] "100" "100" "100" "100" "100" "100" "100" "100" "100" "100" "102" "103"
## [25] "105" "105" "105" "105" "105" "105" "105" "105" "105" "105" "105" "105"
## [37] "107" "108" "110" "110" "110" "110" "110" "110" "110" "110" "110" "110"
## [49] "110" "110" "110" "110" "110" "110" "110" "110" "112" "112" "112" "113"
## [61] "115" "115" "115" "115" "115" "116" "120" "120" "120" "120" "122" "125"
## [73] "125" "125" "129" "129" "130" "130" "130" "130" "130" "132" "133" "135"
## [85] "137" "138" "139" "139" "140" "140" "140" "140" "140" "140" "140" "142"
## [97] "145" "145" "145" "145" "145" "145" "145" "148" "149" "150" "150" "150"
## [109] "150" "150" "150" "150" "150" "150" "150" "150" "150" "150" "150" "150"
## [121] "150" "150" "150" "150" "150" "150" "150" "152" "153" "153" "155" "155"
## [133] "158" "160" "160" "165" "165" "165" "165" "167" "170" "170" "170" "170"
## [145] "170" "175" "175" "175" "175" "175" "180" "180" "180" "180" "180" "190"
## [157] "190" "190" "193" "198" "198" "200" "208" "210" "215" "215" "215" "220"
## [169] "225" "225" "225" "230" "46" "46" "48" "48" "48" "49" "52" "52"
## [181] "52" "52" "53" "53" "54" "58" "58" "60" "60" "60" "60" "60"
## [193] "61" "62" "62" "63" "63" "63" "64" "65" "65" "65" "65" "65"
## [205] "65" "65" "65" "65" "65" "66" "67" "67" "67" "67" "67" "67"
## [217] "67" "67" "67" "67" "67" "67" "68" "68" "68" "68" "68" "68"
## [229] "69" "69" "69" "70" "70" "70" "70" "70" "70" "70" "70" "70"
## [241] "70" "70" "70" "71" "71" "71" "71" "71" "72" "72" "72" "72"
## [253] "72" "72" "74" "74" "74" "75" "75" "75" "75" "75" "75" "75"
## [265] "75" "75" "75" "75" "75" "75" "75" "76" "76" "76" "76" "77"
## [277] "78" "78" "78" "78" "78" "78" "79" "79" "80" "80" "80" "80"
## [289] "80" "80" "80" "81" "81" "82" "83" "83" "83" "83" "84" "84"
## [301] "84" "84" "84" "84" "85" "85" "85" "85" "85" "85" "85" "85"
## [313] "85" "86" "86" "86" "86" "86" "87" "87" "88" "88" "88" "88"
## [325] "88" "88" "88" "88" "88" "88" "88" "88" "88" "88" "88" "88"
## [337] "88" "88" "88" "89" "90" "90" "90" "90" "90" "90" "90" "90"
## [349] "90" "90" "90" "90" "90" "90" "90" "90" "90" "90" "90" "90"
## [361] "91" "92" "92" "92" "92" "92" "92" "93" "94" "95" "95" "95"
## [373] "95" "95" "95" "95" "95" "95" "95" "95" "95" "95" "95" "96"
## [385] "96" "96" "97" "97" "97" "97" "97" "97" "97" "97" "97" "98"
## [397] "98"
There is actually a question mark "?" in the data; this denotes missing values. We need to fix it. The following command will remove the rows with question marks,
# there are many ways to do this (e.g., with the dplyr package), but here is a simple one
Auto <- Auto[Auto$horsepower != "?", ]
However, it is still character; we need to convert it to numeric. The function in R is as.numeric(). For example,
"134"
## [1] "134"
as.numeric("134")
## [1] 134
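Note that as.numeric() turns anything it cannot parse into NA (with a coercion warning), which is exactly why we removed the question marks first:
as.numeric("?") # returns NA with a warning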
Let’s now do it for horsepower
Auto$horsepower <- as.numeric(Auto$horsepower)
Now let’s see the summary statistics again
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
Looks good
Let’s plot the data,
plot(Auto$horsepower, Auto$mpg)
So the pattern shows some non-linearities
Now let's perform polynomial regression. We will use I() to incorporate polynomials inside lm().
quad_fit <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
library(stargazer)
stargazer(quad_fit, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## mpg
## -----------------------------------------------
## horsepower -0.466***
## (0.031)
##
## I(horsepower2) 0.001***
## (0.0001)
##
## Constant 56.900***
## (1.800)
##
## -----------------------------------------------
## Observations 392
## R2 0.688
## Adjusted R2 0.686
## Residual Std. Error 4.374 (df = 389)
## F Statistic 428.018*** (df = 2; 389)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
plot(Auto$horsepower, Auto$mpg)
lines(sort(Auto$horsepower), fitted(quad_fit)[order(Auto$horsepower)], col = "red", lwd = 2)
What if we want to perform cubic regression? We can do that too,
cubic_fit <- lm(mpg ~ horsepower + I(horsepower^2) + I(horsepower^3), data = Auto)
stargazer(cubic_fit, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## mpg
## -----------------------------------------------
## horsepower -0.569***
## (0.118)
##
## I(horsepower2) 0.002**
## (0.001)
##
## I(horsepower3) -0.00000
## (0.00000)
##
## Constant 60.685***
## (4.563)
##
## -----------------------------------------------
## Observations 392
## R2 0.688
## Adjusted R2 0.686
## Residual Std. Error 4.375 (df = 388)
## F Statistic 285.481*** (df = 3; 388)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Plot all the lines
plot(Auto$horsepower, Auto$mpg)
lines(sort(Auto$horsepower), fitted(quad_fit)[order(Auto$horsepower)], col = "red", lwd = 2)
lines(sort(Auto$horsepower), fitted(cubic_fit)[order(Auto$horsepower)], col = "blue", lwd = 2)
The improvement is not much. We can also fit higher-order polynomials in this way with I() inside lm(), but there is another function in R called poly() that can do it for us.
poly_fit <- lm(mpg ~ poly(horsepower, 3, raw = TRUE ), data = Auto)
stargazer(poly_fit, type = "text")
##
## ============================================================
## Dependent variable:
## ---------------------------
## mpg
## ------------------------------------------------------------
## poly(horsepower, 3, raw = TRUE)1 -0.569***
## (0.118)
##
## poly(horsepower, 3, raw = TRUE)2 0.002**
## (0.001)
##
## poly(horsepower, 3, raw = TRUE)3 -0.00000
## (0.00000)
##
## Constant 60.685***
## (4.563)
##
## ------------------------------------------------------------
## Observations 392
## R2 0.688
## Adjusted R2 0.686
## Residual Std. Error 4.375 (df = 388)
## F Statistic 285.481*** (df = 3; 388)
## ============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
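One design note: raw = TRUE gives raw powers, which is why the estimates match the I() fits above. By default poly() uses orthogonal polynomials; the individual coefficients change, but the fitted values (and hence R-squared) stay the same. A quick check:
# orthogonal polynomial fit
poly_fit_orth <- lm(mpg ~ poly(horsepower, 3), data = Auto)
# same fitted values as the raw fit, up to numerical precision
all.equal(fitted(poly_fit), fitted(poly_fit_orth))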
The first task is the same: load the data.
Auto <- read_excel("/home/tanvir/Documents/ownCloud/Git_Repos/EWU_repos/3_Fall_2023/eco_204/ewu-eco204.github.io/pdf_files/ps/04_mlr/Solutions/Auto_clean.xlsx")
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
The origin variable should be a factor; let's fix it
Auto$origin <- factor(Auto$origin, labels = c("USA", "Europe", "Japan"))
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year origin name
## Min. : 8.00 Min. :70.00 USA :245 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 Europe: 68 Class :character
## Median :15.50 Median :76.00 Japan : 79 Mode :character
## Mean :15.54 Mean :75.98
## 3rd Qu.:17.02 3rd Qu.:79.00
## Max. :24.80 Max. :82.00
# without interaction
model1 <- lm(mpg ~ horsepower + origin, data = Auto)
summary(model1)
##
## Call:
## lm(formula = mpg ~ horsepower + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.322 -3.292 -0.584 2.526 14.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.944129 0.867303 41.444 < 2e-16 ***
## horsepower -0.133648 0.006863 -19.474 < 2e-16 ***
## originEurope 2.425339 0.677860 3.578 0.00039 ***
## originJapan 5.176352 0.647817 7.990 1.55e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.554 on 388 degrees of freedom
## Multiple R-squared: 0.6621, Adjusted R-squared: 0.6595
## F-statistic: 253.4 on 3 and 388 DF, p-value: < 2.2e-16
# with interaction
model2 <- lm(mpg ~ horsepower + origin + origin*horsepower, data = Auto)
summary(model2)
##
## Call:
## lm(formula = mpg ~ horsepower + origin + origin * horsepower,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7415 -2.9547 -0.6389 2.3978 14.2495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.476496 0.890665 38.709 < 2e-16 ***
## horsepower -0.121320 0.007095 -17.099 < 2e-16 ***
## originEurope 10.997230 2.396209 4.589 6.02e-06 ***
## originJapan 14.339718 2.464293 5.819 1.24e-08 ***
## horsepower:originEurope -0.100515 0.027723 -3.626 0.000327 ***
## horsepower:originJapan -0.108723 0.028980 -3.752 0.000203 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.422 on 386 degrees of freedom
## Multiple R-squared: 0.6831, Adjusted R-squared: 0.679
## F-statistic: 166.4 on 5 and 386 DF, p-value: < 2.2e-16
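From the interaction model, the origin-specific slopes of horsepower can be recovered by combining coefficients; a small sketch:
# slope of horsepower for each origin
b <- coef(model2)
b["horsepower"]                                # USA (the baseline)
b["horsepower"] + b["horsepower:originEurope"] # Europe
b["horsepower"] + b["horsepower:originJapan"]  # Japan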