clear the env

rm(list = ls())

Doing one way ANOVA

Let’s load the service data

library(readxl)
# Load the data
data1 <- read_excel("/home/tanvir/Documents/ownCloud/Git_Repos/EWU_repos/3_Fall_2023/eco_204/ewu-eco204.github.io/pdf_files/slides/05_anova/Codes/service_data_1.xlsx")

# this data is in wide format
# look at the column names
data1
## # A tibble: 7 × 3
##   `Service A` `Service B` `No Service`
##         <dbl>       <dbl>        <dbl>
## 1          61          91           99
## 2          65          88           83
## 3          71          74           75
## 4          52          56           95
## 5          67          76           62
## 6          58          81           69
## 7          78          66           80

This data is in wide format and we need to convert this data to long format before running codes for ANOVA (to understand the difference between the wide and long format you can read this page (click here) ). There is a classic function in R called reshape to convert the data from wide to long format. But I think now there is an excellent package called tidyr which does the job beautifully and also it’s a good package to learn (check this). We will use a function pivot_longer from the tidyr package to convert these kind of data sets in a long format.

# load the tidyr package
library(tidyr)

# reshape the data to long format
data1_long <- pivot_longer(data1, cols = 1:3, names_to = c("service"), values_to = c("score"))

Here - data1 is our datain wideformat - cols = 1:3 means we want to convert column 1 to 3 to long - names_to = c(“service”) the converted column will be named as service - values_to = the \(Y\) values in the converted column will be named as score

Now let’s look at the data

data1_long
## # A tibble: 21 × 2
##    service    score
##    <chr>      <dbl>
##  1 Service A     61
##  2 Service B     91
##  3 No Service    99
##  4 Service A     65
##  5 Service B     88
##  6 No Service    83
##  7 Service A     71
##  8 Service B     74
##  9 No Service    75
## 10 Service A     52
## # ℹ 11 more rows

We also need to convert the service variable to factor before running the ANOVA

# convert the service to factor
data1_long$service <- as.factor(data1_long$service)


# check the summary stats
summary(data1_long)
##        service      score      
##  No Service:7   Min.   :52.00  
##  Service A :7   1st Qu.:65.00  
##  Service B :7   Median :74.00  
##                 Mean   :73.67  
##                 3rd Qu.:81.00  
##                 Max.   :99.00

Now finally we will run the aov() function to run the ANOVA, notice the syntax is similar to the lm() function

# now we are ready to run anova
anova_result1 <- aov(score ~ service, data =data1_long)
summary(anova_result1)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## service      2  937.2   468.6    3.51 0.0516 .
## Residuals   18 2403.4   133.5                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Doing two way ANOVA

Now let’s do it for two way. The procedure is similar

# Load the data
data2 <- read_excel("/home/tanvir/Documents/ownCloud/Git_Repos/EWU_repos/3_Fall_2023/eco_204/ewu-eco204.github.io/pdf_files/slides/05_anova/Codes/service_data_2.xlsx")

# look at the column names
names(data2)
## [1] "Student"   "Service A" "Service B" "Service C" "Service D"
# reshape from wide to long, we will use pivot_longer from tidyr package
data2_long <- pivot_longer(data2, cols = starts_with("Service"), names_to = c("service"), values_to = c("score"))


# rename the columns
names(data2_long) <- c("student", "service", "score")

# convert the service and stduent variable to factor
data2_long$service <- as.factor(data2_long$service)
data2_long$student <- as.factor(data2_long$student)

summary(data2_long)
##  student      service      score      
##  1:4     Service A:5   Min.   : 7.00  
##  2:4     Service B:5   1st Qu.: 9.00  
##  3:4     Service C:5   Median :10.00  
##  4:4     Service D:5   Mean   :10.15  
##  5:4                   3rd Qu.:12.00  
##                        Max.   :14.00

Now the run the aov function

anova_result2 <- aov(score ~ service + student , data = data2_long)
summary(anova_result2)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## service      3  31.35  10.450   3.931 0.0363 *
## student      4   3.30   0.825   0.310 0.8656  
## Residuals   12  31.90   2.658                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA Vs. LM

Actually ANOVA is a special case of LM. To see this you can run the lm() function for the two way ANOVA data and run anova() function, then you will get the same result as the aov() function

lm_result <- lm(score ~ service + student , data = data2_long)

summary(lm_result)
## 
## Call:
## lm(formula = score ~ service + student, data = data2_long)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3000 -0.8250 -0.0250  0.6875  2.3500 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.650e+00  1.031e+00   9.358  7.3e-07 ***
## serviceService B  1.032e-15  1.031e+00   0.000   1.0000    
## serviceService C -1.000e+00  1.031e+00  -0.970   0.3513    
## serviceService D  2.400e+00  1.031e+00   2.327   0.0382 *  
## student2         -5.000e-01  1.153e+00  -0.434   0.6722    
## student3          2.500e-01  1.153e+00   0.217   0.8320    
## student4          7.500e-01  1.153e+00   0.651   0.5276    
## student5          2.500e-01  1.153e+00   0.217   0.8320    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.63 on 12 degrees of freedom
## Multiple R-squared:  0.5207, Adjusted R-squared:  0.241 
## F-statistic: 1.862 on 7 and 12 DF,  p-value: 0.1642
# in this case the anova function will give you the same result as the aov function
anova(lm_result)
## Analysis of Variance Table
## 
## Response: score
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## service    3  31.35 10.4500  3.9310 0.03631 *
## student    4   3.30  0.8250  0.3103 0.86558  
## Residuals 12  31.90  2.6583                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now you should understand what does the anova function do in R. It runs the lm() function removing each of the variable one by one and then compares the model with the full model and gives you the result.