
MATH 4720/MSSC 5720 Introduction to Statistics
In many research settings, we'd like to compare 3 or more population means.
Example: 4 types of devices are used to determine the pH of soil samples.
We want to determine whether there are differences in the mean readings of those 4 devices.


A factor is a property or characteristic (categorical variable) that allows us to distinguish the different populations from one another.
Type of devices and treatment of trees are factors.
One-way ANOVA examines the effect of a categorical variable on the mean of a numerical variable (response).
We use analysis of variance to test the equality of 3 or more population means.
The method is one-way because we use one single property (categorical variable) for categorizing the populations.
The populations of each category are normally distributed.
The populations have the same variance \(\sigma^2\) (the same assumption as in the two-sample pooled \(t\)-test).
The samples are random samples.
The samples are independent of each other. (not matched or paired in any way)
Data 1 and Data 2 have the same group sample means \(\bar{y}_1\), \(\bar{y}_2\) and \(\bar{y}_3\) denoted as red dots.
For which data set are you more confident in saying that the population means \(\mu_1\), \(\mu_2\) and \(\mu_3\) are not all the same?
Data 1: Variability between samples is large in comparison to the variation within samples.
Data 2: Variation between samples is small relative to the variation within samples.
We are more confident in concluding that there is a difference in population means when the variation between samples is large relative to the variation within samples.
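To make the comparison concrete, here is a small sketch with two made-up data sets (hypothetical numbers, not the data shown in the plots). Both share the same group means but differ in how spread out each group is. The summary used here, the variance of the group means against the average within-group variance, is only a rough stand-in for the formal between and within variances.

```python
from statistics import mean, variance

# Hypothetical data: both sets have group means 10, 12 and 14.
data1 = [[9.5, 10.0, 10.5], [11.5, 12.0, 12.5], [13.5, 14.0, 14.5]]  # tight groups
data2 = [[5.0, 10.0, 15.0], [7.0, 12.0, 17.0], [9.0, 14.0, 19.0]]    # spread-out groups

def spread_summary(groups):
    """Return (variance of the group means, average within-group variance)."""
    group_means = [mean(g) for g in groups]
    between = variance(group_means)
    within = mean([variance(g) for g in groups])
    return between, within

for name, groups in [("Data 1", data1), ("Data 2", data2)]:
    between, within = spread_summary(groups)
    print(name, "between:", between, "within:", within, "ratio:", between / within)
```

The group means vary by the same amount in both sets, but the ratio is far larger for the tight groups, which is the situation where a difference in population means is more convincing.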
There are 5 populations.
Within each population, 4 data points are collected.
Data \(y_{ij}\) is the \(j\)-th data point in the \(i\)-th group.
| Population | Data | Sample Mean | Population Mean |
|---|---|---|---|
| 1 | \(y_{11}\) \(\quad\) \(y_{12}\) \(\quad\) \(y_{13}\) \(\quad\) \(y_{14}\) | \(\bar{y}_{1}\) | \(\mu_{1}\) |
| 2 | \(y_{21}\) \(\quad\) \(y_{22}\) \(\quad\) \(y_{23}\) \(\quad\) \(y_{24}\) | \(\bar{y}_{2}\) | \(\mu_{2}\) |
| 3 | \(y_{31}\) \(\quad\) \(y_{32}\) \(\quad\) \(y_{33}\) \(\quad\) \(y_{34}\) | \(\bar{y}_{3}\) | \(\mu_{3}\) |
| 4 | \(y_{41}\) \(\quad\) \(y_{42}\) \(\quad\) \(y_{43}\) \(\quad\) \(y_{44}\) | \(\bar{y}_{4}\) | \(\mu_{4}\) |
| 5 | \(y_{51}\) \(\quad\) \(y_{52}\) \(\quad\) \(y_{53}\) \(\quad\) \(y_{54}\) | \(\bar{y}_{5}\) | \(\mu_{5}\) |
This is NOT a tidy data matrix. We may need to save the data in another format before we do ANOVA.
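One possible way to reshape the table into a tidy (long) layout is sketched below, with one row per observation. The names `wide` and `long_rows` are illustrative, and the observation labels stand in for the actual numbers.

```python
# Wide layout from the table above: one entry per population, with the
# observation labels standing in for the actual measurements.
wide = {i: [f"y{i}{j}" for j in range(1, 5)] for i in range(1, 6)}

# Long (tidy) layout: one row per observation, holding the group label
# and the value.
long_rows = [(group, value) for group, values in wide.items() for value in values]

for row in long_rows[:5]:
    print(row)
```

In the long layout the group label becomes a column of its own, which is the form most ANOVA routines expect: one categorical column (the factor) and one numerical column (the response).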
\[\frac{\text{variance between samples}}{\text{variance within samples}} \sim F_{df_B,\, df_W}\]
Key: Define variance between samples and variance within samples so that the ratio is \(F\) distributed.
\[s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}\]
What if there are \(k\) samples in general?
\[\boxed{s_W^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2}{n_1 + n_2 + \cdots + n_k - k}}\] where \(s_i^2\), \(i = 1, \dots ,k\), is the sample variance of group \(i\).
\[\boxed{s^2_{B} = \frac{\sum_{i=1}^k n_i (\bar{y}_{i} - \bar{y}_{})^2}{k-1}}\]
\(\bar{y}_{i}\) is the \(i\)-th sample mean.
\(\bar{y}_{}\) is the grand sample mean with all data points in all groups combined.
Under \(H_0\), \(s^2_{B}\) is also an estimate of \(\sigma^2\). It measures variability among the sample means of the \(k\) groups.
If \(H_0\) is true \((\mu_1 = \cdots = \mu_k = \mu)\), any variation in the sample means is due to chance and randomness, and shouldn't be too large.
A variance estimate has the form \(\frac{\text{Sum of Squares}}{\text{Degrees of Freedom}}\), which is also called a \(\text{Mean Square (MS)}\).
\(s_B^2 = \frac{\sum_{i=1}^k n_i (\bar{y}_{i} - \bar{y}_{})^2}{k-1} = \frac{\text{Sum of Squares Between Samples (SSB)}}{df_B} = MSB\)
\(s_W^2 = \frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{n_1 + n_2 + \cdots + n_k - k} = \frac{\text{Sum of Squares Within Samples (SSW)}}{df_W} = MSW\) \((N = n_1 + \cdots + n_k)\)
\[\text{Total Sum of Squares (SST)} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_{}\right)^2 = SSB + SSW\]
\[df_{T} = df_{B} + df_{W} \implies N - 1 = (k-1) + (N - k)\]
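The sum-of-squares decomposition can be checked by splitting each deviation from the grand mean into a within-group part and a between-group part. A sketch of the argument, in the notation above:

\[\begin{aligned}
\sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2
&= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left[(y_{ij} - \bar{y}_{i}) + (\bar{y}_{i} - \bar{y})\right]^2 \\
&= \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i})^2}_{SSW}
+ \underbrace{\sum_{i=1}^{k} n_i (\bar{y}_{i} - \bar{y})^2}_{SSB}
+ 2\sum_{i=1}^{k} (\bar{y}_{i} - \bar{y}) \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i})
\end{aligned}\]

The last term is zero because deviations from a group's own mean sum to zero, \(\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i}) = 0\). The first term equals \(\sum_{i=1}^{k} (n_i - 1)s_i^2\), which is the numerator of \(s_W^2\).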
\(F_{test} = \frac{MSB}{MSW}\)
Under \(H_0\), \(\frac{S^2_{B}}{S_W^2} \sim F_{k-1, \, N-k}\)
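Reject \(H_0\) if \(F_{test}\) is larger than the upper-\(\alpha\) critical value of the \(F_{k-1, \, N-k}\) distribution, or equivalently if the \(p\)-value \(P(F_{k-1, \, N-k} > F_{test})\) is less than \(\alpha\).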



A sample of 5 is taken from each of 3 food items, and the amount of isoflavones is measured.
Is there sufficient evidence to conclude that the mean isoflavone levels vary among these food items? Use \(\alpha = 0.05\).
| Row | Food 1 | Food 2 | Food 3 |
|---|---|---|---|
| 1 | 3 | 19 | 25 |
| 2 | 17 | 10 | 15 |
| 3 | 12 | 9 | 12 |
| 4 | 10 | 7 | 9 |
| 5 | 4 | 5 | 8 |
Here columns represent food items and rows are samples.
So tell me what is the value of \(y_{23}\)!
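As a check on the formulas, the sums of squares, mean squares and \(F\) statistic for these data can be computed by hand. The sketch below uses only Python's standard library. The variable names (`foods`, `ssb`, `ssw`, and so on) are illustrative choices, not part of the original notes.

```python
from statistics import mean, variance

# Isoflavone measurements from the table above; key = food item (column).
foods = {
    1: [3, 17, 12, 10, 4],
    2: [19, 10, 9, 7, 5],
    3: [25, 15, 12, 9, 8],
}

# y_ij is the j-th data point in the i-th group (1-based indices).
y_23 = foods[2][3 - 1]

k = len(foods)                                  # number of groups
N = sum(len(g) for g in foods.values())         # total sample size
grand_mean = mean([y for g in foods.values() for y in g])

# Sum of squares between samples and within samples.
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in foods.values())
ssw = sum((len(g) - 1) * variance(g) for g in foods.values())

msb = ssb / (k - 1)      # s_B^2, with df_B = k - 1
msw = ssw / (N - k)      # s_W^2, with df_W = N - k
f_test = msb / msw

print("y_23 =", y_23)
print(f"SSB = {ssb:.1f}, SSW = {ssw:.1f}")
print(f"MSB = {msb:.2f}, MSW = {msw:.2f}, F = {f_test:.3f}")
```

With \(k = 3\) and \(N = 15\), the degrees of freedom are \(df_B = 2\) and \(df_W = 12\), and the script gives SSB = 60.4, SSW = 437.6 and \(F_{test} \approx 0.828\).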
Do all calculations and generate an ANOVA table using just one line of code!

Analysis of Variance Table
Response: y
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| food | 2 | 60.4 | 30.20 | 0.828 | 0.46 |
| Residuals | 12 | 437.6 | 36.47 | | |
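The table above appears to be in the format of R's ANOVA output. As a rough Python analogue, assuming SciPy is installed, `scipy.stats.f_oneway` runs the same one-way \(F\) test in a single call. It returns only the \(F\) statistic and the \(p\)-value, not the full table.

```python
from scipy import stats

# Isoflavone measurements, one list per food item.
food1 = [3, 17, 12, 10, 4]
food2 = [19, 10, 9, 7, 5]
food3 = [25, 15, 12, 9, 8]

# The one-line analysis: one-way ANOVA F test across the three groups.
result = stats.f_oneway(food1, food2, food3)

alpha = 0.05
print(f"F = {result.statistic:.3f}, p-value = {result.pvalue:.2f}")
print("Reject H0" if result.pvalue < alpha else "Fail to reject H0")
```

The \(p\)-value of about 0.46 is well above \(\alpha = 0.05\), so we fail to reject \(H_0\): the data do not provide sufficient evidence that the mean isoflavone levels differ among the food items.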