When you have three or more groups to compare, the temptation is to run a t-test on every pair. The better move is a one-way analysis of variance (ANOVA). It tests a single null hypothesis across all the groups at once, holds the overall false-positive rate at the chosen α, and is equivalent to the familiar pooled two-sample t-test when there are only two groups. This walkthrough shows when to use ANOVA, why it works, and how to compute the F-statistic by hand.
Why Not Just Run a Bunch of T-Tests?
If you have four groups and you compare every pair, that is six separate t-tests. Each test carries its own 5% chance of a false positive at α = 0.05. If the six tests were independent, the chance that at least one of them rejects a true null would be 1 − 0.95⁶ ≈ 0.26, or 26%. The pairwise tests share data, so they are not fully independent and 26% is an approximation, but the inflation is real either way. That inflated false-positive rate is the "multiple comparisons problem."
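The arithmetic above can be sketched in a few lines of Python. The function name `familywise_error_rate` is made up for this illustration, and the calculation assumes the pairwise tests are independent, which is only an approximation.

```python
from math import comb


def familywise_error_rate(num_groups: int, alpha: float = 0.05) -> tuple[int, float]:
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes the pairwise tests are independent, which is only an
    approximation when the pairs share groups.
    """
    num_tests = comb(num_groups, 2)  # number of distinct pairs
    return num_tests, 1 - (1 - alpha) ** num_tests


for k in (3, 4, 5, 6):
    tests, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {tests} tests, familywise rate ~ {fwer:.2f}")
```

For four groups this reproduces the six tests and the roughly 26% rate quoted above, and it shows how quickly the rate climbs as groups are added.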
One-way ANOVA collapses the question into a single test: are any of the group means different? The null hypothesis covers every group:
- H₀: μ₁ = μ₂ = μ₃ = ... = μₖ
- H₁: at least one μ is different from the others
If ANOVA rejects, you then run follow-up comparisons (Tukey's HSD is a standard choice) to find which groups differ. If ANOVA fails to reject, you stop: the data do not give significant evidence of any difference, and chasing individual pairs would be fishing.
What ANOVA Actually Compares
The name is exact: ANOVA analyzes variance. It splits the total variability in the data into two pieces and compares them.
- Between-group variance. How spread out the group means are around the overall mean. If the groups really do come from populations with different means, this variance is large.
- Within-group variance. How spread out the data are inside each group, around that group's own mean. This is the baseline "noise" — variability the group label cannot explain.
The F-statistic is the ratio:
F = MSB / MSW
where MSB is the mean square between groups and MSW is the mean square within groups. If the population means really differ, MSB tends to be much larger than MSW and F is large. If the groups all come from the same population, MSB and MSW both estimate the same population variance, so F tends to be near 1 and the test is unlikely to reject.
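Written out in full, the two mean squares can be defined as follows. The notation is an assumption introduced here for clarity: k is the number of groups, N the total number of observations, n_i the size of group i, x_ij the j-th observation in group i, x̄_i the mean of group i, and x̄ the overall mean.

```latex
\begin{aligned}
\mathrm{SSB} &= \sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2, &
\mathrm{MSB} &= \frac{\mathrm{SSB}}{k - 1},\\
\mathrm{SSW} &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2, &
\mathrm{MSW} &= \frac{\mathrm{SSW}}{N - k},\\
F &= \frac{\mathrm{MSB}}{\mathrm{MSW}}.
\end{aligned}
```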
The Worked Example
A psychology study tests reaction time (in milliseconds) under three caffeine doses: 0 mg, 100 mg, and 200 mg. Five subjects per group. The data:
- Group A (0 mg): 320, 340, 330, 350, 360. Mean x̄_A = 340.
- Group B (100 mg): 310, 300, 320, 290, 280. Mean x̄_B = 300.
- Group C (200 mg): 270, 280, 290, 250, 260. Mean x̄_C = 270.
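Because the groups are the same size, the overall mean x̄ across all 15 observations equals the average of the group means: (340 + 300 + 270) / 3 = 303.33 ms. With unequal group sizes, weight each group mean by its group size instead. Test at α = 0.05.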
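A quick way to check the group means and the overall mean is sketched below, using only Python's standard library. The dictionary keys are labels chosen for this illustration.

```python
from statistics import mean

# Reaction times (ms) from the worked example.
groups = {
    "A (0 mg)": [320, 340, 330, 350, 360],
    "B (100 mg)": [310, 300, 320, 290, 280],
    "C (200 mg)": [270, 280, 290, 250, 260],
}

group_means = {name: mean(values) for name, values in groups.items()}
all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Size-weighted average of the group means; this matches the grand mean
# for any group sizes, not only equal ones.
weighted = sum(len(v) * group_means[n] for n, v in groups.items()) / len(all_values)

print(group_means)
print(round(grand_mean, 2), round(weighted, 2))
```

The weighted form is the one to reach for when group sizes differ; the simple average of group means is a shortcut that happens to work here.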
Step 1 — Hypotheses. H₀: μ_A = μ_B = μ_C. H₁: at least one mean differs.
Step 2 — Sum of squares between (SSB). For each group, take (group mean − overall mean)² × group size, then add. Carry the unrounded deviations (110/3, −10/3, and −100/3) through the squares so rounding error does not accumulate.
- (340 − 303.33)² × 5 ≈ 1344.44 × 5 ≈ 6722.2
- (300 − 303.33)² × 5 ≈ 11.11 × 5 ≈ 55.6
- (270 − 303.33)² × 5 ≈ 1111.11 × 5 ≈ 5555.6
SSB ≈ 6722.2 + 55.6 + 5555.6 ≈ 12,333.3
Step 3 — Sum of squares within (SSW). For each group, sum the squared deviations from that group's own mean.
- Group A: (320−340)² + (340−340)² + (330−340)² + (350−340)² + (360−340)² = 400 + 0 + 100 + 100 + 400 = 1000
- Group B: (310−300)² + (300−300)² + (320−300)² + (290−300)² + (280−300)² = 100 + 0 + 400 + 100 + 400 = 1000
- Group C: (270−270)² + (280−270)² + (290−270)² + (250−270)² + (260−270)² = 0 + 100 + 400 + 400 + 100 = 1000
SSW = 1000 + 1000 + 1000 = 3000
Step 4 — Degrees of freedom. With k = 3 groups and N = 15 observations:
- df_between = k − 1 = 2
- df_within = N − k = 12
Step 5 — Mean squares.
- MSB = SSB / df_between = 12,333.3 / 2 ≈ 6166.7
- MSW = SSW / df_within = 3000 / 12 = 250
Step 6 — F-statistic.
F = MSB / MSW = 6166.7 / 250 ≈ 24.67
Step 7 — Compare to F-critical. For α = 0.05 with df₁ = 2 and df₂ = 12, F-critical ≈ 3.89. Our F ≈ 24.67 is far past it; the p-value is well under 0.001. Reject H₀.
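Steps 2 through 7 can be reproduced with the standard-library sketch below. All function names are invented for this illustration. The tail-probability and critical-value helpers rely on a closed form that holds only when the numerator degrees of freedom equal 2, as they do here; for other designs a statistics library would be the usual route.

```python
def one_way_anova(groups):
    """Return (SSB, SSW, df_between, df_within, F) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between: squared distance of each group mean from the grand mean,
    # weighted by group size.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within: squared distance of each observation from its own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ssb / df_between) / (ssw / df_within)
    return ssb, ssw, df_between, df_within, f_stat


def f_upper_tail_df1_is_2(f_value, df_within):
    """P(F > f_value) for an F distribution with (2, df_within) df.

    Closed form that is valid only when the numerator df is exactly 2.
    """
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)


def f_critical_df1_is_2(alpha, df_within):
    """Invert the closed form above to get the critical value."""
    return (df_within / 2) * (alpha ** (-2 / df_within) - 1)


data = [
    [320, 340, 330, 350, 360],  # 0 mg
    [310, 300, 320, 290, 280],  # 100 mg
    [270, 280, 290, 250, 260],  # 200 mg
]
ssb, ssw, df_b, df_w, f_stat = one_way_anova(data)
p_value = f_upper_tail_df1_is_2(f_stat, df_w)
f_crit = f_critical_df1_is_2(0.05, df_w)

print(f"SSB={ssb:.1f} SSW={ssw:.1f} df=({df_b}, {df_w}) F={f_stat:.2f}")
print(f"F-critical={f_crit:.2f} p={p_value:.6f}")
```

If SciPy is installed, `scipy.stats.f_oneway` applied to the same three lists should report the same F-statistic and serves as a cross-check.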
Conclusion in plain English: "There is statistically significant evidence at the 0.05 level that mean reaction time differs across the three caffeine doses." A follow-up Tukey test would show that 0 mg, 100 mg, and 200 mg all differ from each other.
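The Tukey claim can be checked by hand with a minimal sketch like the one below. It assumes equal group sizes and takes the studentized-range critical value for 3 groups and 12 degrees of freedom at α = 0.05 as about 3.77, read from a standard table rather than computed; that value and the variable names are assumptions of this illustration.

```python
from itertools import combinations
from math import sqrt

# MSW and the group size come from the worked example. q_crit is the
# studentized-range critical value (3 groups, 12 df, alpha = 0.05),
# taken from a standard table; treat it as an assumed input.
msw, n_per_group, q_crit = 250.0, 5, 3.77
means = {"0 mg": 340.0, "100 mg": 300.0, "200 mg": 270.0}

# Honestly significant difference: the smallest gap between two group
# means that counts as significant.
hsd = q_crit * sqrt(msw / n_per_group)

results = {}
for a, b in combinations(means, 2):
    diff = abs(means[a] - means[b])
    results[(a, b)] = diff > hsd
    print(f"{a} vs {b}: |diff| = {diff:.0f}, HSD = {hsd:.2f}, differ = {diff > hsd}")
```

The three gaps are 40, 70, and 30 ms, and each exceeds the threshold of roughly 26.7 ms, which is why all three pairs are flagged. Recent SciPy releases also provide `scipy.stats.tukey_hsd`, which reports adjusted p-values directly.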
Conditions to Check
ANOVA's validity rests on three conditions:
- Independence. Observations within and between groups are independent. Random assignment to groups is the cleanest way to satisfy this.
- Normality. Each group's population is approximately normal. ANOVA is fairly robust to mild violations, especially when group sizes are equal.
- Equal variances (homoscedasticity). Each group has roughly the same population variance. A common rule: the largest sample SD should be no more than about twice the smallest. If variances are clearly different, use Welch's ANOVA, which is the multi-group analogue of Welch's t-test.
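The equal-variance rule of thumb is easy to check numerically. The sketch below applies it to the caffeine data; it covers only that one condition, since independence comes from the study design and normality needs a plot or a separate test.

```python
from statistics import stdev

# Reaction times (ms) from the worked example.
groups = {
    "A": [320, 340, 330, 350, 360],
    "B": [310, 300, 320, 290, 280],
    "C": [270, 280, 290, 250, 260],
}

# Sample standard deviation (n - 1 denominator) for each group.
sds = {name: stdev(values) for name, values in groups.items()}
ratio = max(sds.values()) / min(sds.values())

print({name: round(sd, 2) for name, sd in sds.items()})
print(f"largest SD / smallest SD = {ratio:.2f} (rule of thumb: at most about 2)")
```

Here every group has the same sample SD, about 15.8 ms, so the ratio is 1 and the condition is comfortably met.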
Why the Two-Group Case Reduces to a T-Test
With k = 2 groups, one-way ANOVA gives an F-statistic that equals the square of the pooled two-sample t-statistic: F = t². The ANOVA p-value is identical to the two-sided p-value from the t-test. So ANOVA is not a different test for two groups; it is the same machinery, written in a way that generalizes to more groups.
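The identity can be confirmed numerically. The sketch below uses only the 0 mg and 100 mg groups from the caffeine example and computes both statistics from scratch; the variable names are illustrative.

```python
from math import sqrt

a = [320, 340, 330, 350, 360]  # 0 mg
b = [310, 300, 320, 290, 280]  # 100 mg

mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
ss_a = sum((x - mean_a) ** 2 for x in a)
ss_b = sum((x - mean_b) ** 2 for x in b)

# Pooled two-sample t-statistic.
df_within = len(a) + len(b) - 2
pooled_var = (ss_a + ss_b) / df_within
t_stat = (mean_a - mean_b) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))

# One-way ANOVA F-statistic for the same two groups.
grand_mean = (sum(a) + sum(b)) / (len(a) + len(b))
ssb = len(a) * (mean_a - grand_mean) ** 2 + len(b) * (mean_b - grand_mean) ** 2
msb = ssb / 1          # df_between = k - 1 = 1
msw = pooled_var       # MSW is exactly the pooled variance
f_stat = msb / msw

print(f"t = {t_stat:.2f}, t^2 = {t_stat ** 2:.2f}, F = {f_stat:.2f}")
```

The key observation is in the comments: with two groups, MSW is the pooled variance from the t-test, so the two statistics are built from the same ingredients.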
Getting Help
ANOVA assumes you already know how to set up hypothesis tests, so if any of the framework feels shaky, the guide to setting up a hypothesis test covers the structure. To see the two-group t-test that ANOVA generalizes, the two-sample t-test walkthrough is the right companion piece.
Conclusion
One-way ANOVA compares three or more group means with a single F-test that controls the overall false-positive rate. The F-statistic is the ratio of between-group variance to within-group variance, so a large F means the group means are spread out relative to the noise inside each group. In the caffeine example F ≈ 24.7 on df (2, 12), far beyond the critical value of 3.89 at α = 0.05. Once ANOVA rejects, a Tukey follow-up identifies the specific pairs that differ.