Student’s t-test for one sample
One of the most common statistical hypothesis tests, when we are analyzing populations, is the Student’s t-test. We can mainly use this test to evaluate a characteristic’s value in a population based on a sample. It’s important to highlight that the sample needs to follow a normal distribution for carrying out the Student’s t-test (also called normality). Let’s see three examples.
First example
Firstly, we load the data set employees_for_ttest
that we can find in the package HRdatsets
. This data set contains a sample with 550 observations.
library(tidyverse)
devtools::install_github("vicencfernandez/HRdatasets")
library(HRdatasets)
employees_for_ttest
## # A tibble: 550 x 6
## employeeID gender salary20 salary21 performance20 performance21
## <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 female 1850. 1867. 58.9 63.0
## 2 2 female 2449. 2481. 73.0 73.8
## 3 3 female 2154. 2179. 91.2 87.9
## 4 4 female 2011. 2032. 55.9 60.7
## 5 5 female 1910. 1929. 69.6 71.2
## 6 6 female 2324. 2353. 72.3 73.3
## 7 7 female 1939. 1958. 79.8 79.1
## 8 8 female 1919. 1939. 67.5 69.6
## 9 9 female 1912. 1931. 96.4 91.8
## 10 10 female 2012. 2034. 68.8 70.6
## # … with 540 more rows
We want to know if the average salary of the female employees is equal to or different than 1,930 euros in 2021. So, the first step is to define the null and the alternative hypotheses:
- Null Hypothesis: The average salary of female employees in 2021 is equal to 1,930 euros
- Alternative Hypothesis: The average salary of female employees in 2021 is different than 1,930 euros
We can also write them mathematically.
- H0: \(\mu_{salary} = 1930\)
- H1: \(\mu_{salary} \neq 1930\)
The second step is to check the normality of the sample. Then, we can carry out a Shapiro-Wilk test to evaluate its distribution. Of course, when we have samples greater than 30, it’s unnecessary to analyze the normality due to the central limit theorem. However, here we are going to test it.
female_employees <- employees_for_ttest %>% filter(gender == "female")
female_employees$salary21 %>% shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.99466, p-value = 0.5296
As the p-value (0.5296) is high (greater than 0.05), we cannot reject the hypothesis that the sample comes from a normal distribution population. So, we can accept that the population follows a normal distribution.
The third step is to set the maximum probability of error that we accept (significance level). In this example, we can use the 5% (it’s the most common value). Remember that we want to say something about the population based on a sample.
Finally, the fourth and last step is to carry out the Student t-test with the function t.test()
and the following parameters:
- The data set - In this scenario, the data set is
female_employess$salary21
- The value to compare:
mu
- In this case, it’s \(1930\) - The type of the alternative hypothesis:
alternative
- We can face three types of alternative hypothesis: (1) different to (“two.sided”), (2) greater than (“greater”), and (3) less than (“less”). In this case, we have a ‘different to’ hypothesis. - The confidence level of the interval:
conf.level
- Its value is one minus the significance level. In this case, the confidence level is \(0.95\).
female_employees$salary21 %>% t.test(mu = 1930, alternative = "two.sided", conf.level = 0.95)
##
## One Sample t-test
##
## data: .
## t = 1.9721, df = 249, p-value = 0.0497
## alternative hypothesis: true mean is not equal to 1930
## 95 percent confidence interval:
## 1930.034 1980.719
## sample estimates:
## mean of x
## 1955.376
The p-value tells us the probability of making a mistake if we reject the null hypothesis. As the p-value (0.0497) is less than our significance level (0.05), we can reject the null hypothesis, and so, we can accept the alternative hypothesis. So, the average salary of female employees in 2021 is different than 1,930 euros.
Second example
Let’s see another example with the same data set. Now, we want to test if the average salary of female employees in 2021 is greater than 1,930 euros. So, we define the following hypotheses:
- H0: \(\mu_{salary} = 1930\)
- H1: \(\mu_{salary} > 1930\)
Now, we can use again the function t.test()
, but with the parameter alternative = "greater"
.
female_employees$salary21 %>% t.test(mu = 1930, alternative = "greater", conf.level = 0.95)
##
## One Sample t-test
##
## data: .
## t = 1.9721, df = 249, p-value = 0.02485
## alternative hypothesis: true mean is greater than 1930
## 95 percent confidence interval:
## 1934.132 Inf
## sample estimates:
## mean of x
## 1955.376
As the p-value (0.02485) is less than our significance level (0.05), we can reject the null hypothesis, and so, we can accept the alternative hypothesis. So, the average salary of female employees in 2021 is greater than 1,930 euros.
Third example
Let’s see the last example with the same data set. Now, we want to test if the average salary of female employees in 2021 is less than 1,930 euros. So, we define the following hypotheses:
- H0: \(\mu_{salary} = 1930\)
- H1: \(\mu_{salary} < 1930\)
Now, we can use again the function t.test()
, but with the parameter alternative = "less"
.
female_employees$salary21 %>% t.test(mu = 1930, alternative = "less", conf.level = 0.95)
##
## One Sample t-test
##
## data: .
## t = 1.9721, df = 249, p-value = 0.9752
## alternative hypothesis: true mean is less than 1930
## 95 percent confidence interval:
## -Inf 1976.62
## sample estimates:
## mean of x
## 1955.376
As the p-value (0.9752) is greater than our significance level (0.05), we can’t reject the null hypothesis, and so, we can’t accept the alternative hypothesis. So, we cannot say that the average salary of female employees in 2021 is less than 1,930 euros.