clear the env
rm(list = ls())
Let’s load the service data
library(readxl)
# Load the data
data1 <- read_excel("/home/tanvir/Documents/ownCloud/Git_Repos/EWU_repos/3_Fall_2023/eco_204/ewu-eco204.github.io/pdf_files/slides/05_anova/Codes/service_data_1.xlsx")
# this data is in wide format
# look at the column names
data1
## # A tibble: 7 × 3
## `Service A` `Service B` `No Service`
## <dbl> <dbl> <dbl>
## 1 61 91 99
## 2 65 88 83
## 3 71 74 75
## 4 52 56 95
## 5 67 76 62
## 6 58 81 69
## 7 78 66 80
This data is in wide format and we need to convert this data to long
format before running codes for ANOVA (to understand the difference
between the wide and long format you can read this
page (click here) ). There is a classic function in R called
reshape
to convert the data from wide to long format. But I
think now there is an excellent package called tidyr
which
does the job beautifully and also it’s a good package to learn (check this). We will use a
function pivot_longer
from the tidyr
package
to convert these kind of data sets in a long format.
# load the tidyr package
library(tidyr)
# reshape the data to long format
data1_long <- pivot_longer(data1, cols = 1:3, names_to = c("service"), values_to = c("score"))
Here - data1 is our datain wideformat - cols = 1:3 means we want to convert column 1 to 3 to long - names_to = c(“service”) the converted column will be named as service - values_to = the \(Y\) values in the converted column will be named as score
Now let’s look at the data
data1_long
## # A tibble: 21 × 2
## service score
## <chr> <dbl>
## 1 Service A 61
## 2 Service B 91
## 3 No Service 99
## 4 Service A 65
## 5 Service B 88
## 6 No Service 83
## 7 Service A 71
## 8 Service B 74
## 9 No Service 75
## 10 Service A 52
## # ℹ 11 more rows
We also need to convert the service variable to factor before running the ANOVA
# convert the service to factor
data1_long$service <- as.factor(data1_long$service)
# check the summary stats
summary(data1_long)
## service score
## No Service:7 Min. :52.00
## Service A :7 1st Qu.:65.00
## Service B :7 Median :74.00
## Mean :73.67
## 3rd Qu.:81.00
## Max. :99.00
Now finally we will run the aov()
function to run the
ANOVA, notice the syntax is similar to the lm()
function
# now we are ready to run anova
anova_result1 <- aov(score ~ service, data =data1_long)
summary(anova_result1)
## Df Sum Sq Mean Sq F value Pr(>F)
## service 2 937.2 468.6 3.51 0.0516 .
## Residuals 18 2403.4 133.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now let’s do it for two way. The procedure is similar
# Load the data
data2 <- read_excel("/home/tanvir/Documents/ownCloud/Git_Repos/EWU_repos/3_Fall_2023/eco_204/ewu-eco204.github.io/pdf_files/slides/05_anova/Codes/service_data_2.xlsx")
# look at the column names
names(data2)
## [1] "Student" "Service A" "Service B" "Service C" "Service D"
# reshape from wide to long, we will use pivot_longer from tidyr package
data2_long <- pivot_longer(data2, cols = starts_with("Service"), names_to = c("service"), values_to = c("score"))
# rename the columns
names(data2_long) <- c("student", "service", "score")
# convert the service and stduent variable to factor
data2_long$service <- as.factor(data2_long$service)
data2_long$student <- as.factor(data2_long$student)
summary(data2_long)
## student service score
## 1:4 Service A:5 Min. : 7.00
## 2:4 Service B:5 1st Qu.: 9.00
## 3:4 Service C:5 Median :10.00
## 4:4 Service D:5 Mean :10.15
## 5:4 3rd Qu.:12.00
## Max. :14.00
Now the run the aov function
anova_result2 <- aov(score ~ service + student , data = data2_long)
summary(anova_result2)
## Df Sum Sq Mean Sq F value Pr(>F)
## service 3 31.35 10.450 3.931 0.0363 *
## student 4 3.30 0.825 0.310 0.8656
## Residuals 12 31.90 2.658
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Actually ANOVA is a special case of LM. To see this you can run the
lm()
function for the two way ANOVA data and run
anova()
function, then you will get the same result as the
aov()
function
lm_result <- lm(score ~ service + student , data = data2_long)
summary(lm_result)
##
## Call:
## lm(formula = score ~ service + student, data = data2_long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3000 -0.8250 -0.0250 0.6875 2.3500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.650e+00 1.031e+00 9.358 7.3e-07 ***
## serviceService B 1.032e-15 1.031e+00 0.000 1.0000
## serviceService C -1.000e+00 1.031e+00 -0.970 0.3513
## serviceService D 2.400e+00 1.031e+00 2.327 0.0382 *
## student2 -5.000e-01 1.153e+00 -0.434 0.6722
## student3 2.500e-01 1.153e+00 0.217 0.8320
## student4 7.500e-01 1.153e+00 0.651 0.5276
## student5 2.500e-01 1.153e+00 0.217 0.8320
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.63 on 12 degrees of freedom
## Multiple R-squared: 0.5207, Adjusted R-squared: 0.241
## F-statistic: 1.862 on 7 and 12 DF, p-value: 0.1642
# in this case the anova function will give you the same result as the aov function
anova(lm_result)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## service 3 31.35 10.4500 3.9310 0.03631 *
## student 4 3.30 0.8250 0.3103 0.86558
## Residuals 12 31.90 2.6583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now you should understand what does the anova
function
do in R. It runs the lm()
function removing each of the
variable one by one and then compares the model with the full model and
gives you the result.