Frequently Asked Questions

When should I use ANOVA instead of multiple t-tests?

Whenever you have three or more groups to compare on one quantitative outcome. ANOVA controls the family-wise false-positive rate, which running k(k−1)/2 separate t-tests does not. If ANOVA rejects, follow up with a post-hoc test like Tukey's HSD to identify which specific pairs differ.

What does it mean if F is close to 1?

It means the between-group variance is about the same as the within-group variance — the group means are no more spread out than the noise inside each group. That is exactly what you would expect if the null hypothesis were true. F values around 1 give large p-values; you fail to reject H₀.

Can I use ANOVA with unequal sample sizes?

Yes. The sum-of-squares formulas weight each group's contribution by its sample size, so groups with different n still work. ANOVA is more robust to unequal variances when sample sizes are equal, however, so balance the design if you can.

What is the difference between one-way and two-way ANOVA?

One-way ANOVA has a single grouping factor (e.g., caffeine dose). Two-way ANOVA has two crossed factors (e.g., caffeine dose and time of day) and lets you test each main effect plus the interaction between them. The setup and interpretation are different — start with one-way until you are comfortable, then move up.

One-Way ANOVA: When to Use It and How the F-Statistic Works

When you have three or more groups to compare, the temptation is to run a t-test on every pair. The better move is a one-way analysis of variance (ANOVA). It tests a single null hypothesis across all the groups at once, keeps the overall false-positive rate at the chosen α, and reduces to the familiar pooled two-sample t-test when there are only two groups. This walkthrough shows when to use ANOVA, why it works, and how to compute the F-statistic by hand.

Why Not Just Run a Bunch of T-Tests?

If you have four groups and you compare every pair, that is 4 × 3 / 2 = 6 separate t-tests. Each test has its own 5% probability of a false positive at α = 0.05. Treating the six tests as independent, the chance that at least one of them rejects by chance alone is much higher: 1 − 0.95⁶ ≈ 0.26, or 26%. That inflated false-positive rate is the "multiple comparisons problem."
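The arithmetic behind that 26% figure can be sketched in a few lines of Python. The function name familywise_error_rate is made up for illustration, and the formula assumes the tests are independent, which pairwise t-tests that share groups are not exactly, so treat the output as a rough guide rather than an exact rate.

```python
from math import comb


def familywise_error_rate(k_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes the tests are independent, which is only an approximation
    for pairwise t-tests that share groups.
    """
    n_tests = comb(k_groups, 2)          # k(k-1)/2 pairs
    fwer = 1 - (1 - alpha) ** n_tests    # P(at least one false positive)
    return n_tests, fwer


for k in (3, 4, 5, 6):
    n_tests, fwer = familywise_error_rate(k)
    print(f"{k} groups: {n_tests} tests, family-wise error rate ~ {fwer:.2f}")
```

The rate climbs quickly with the number of groups, which is the practical argument for a single overall test.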

One-way ANOVA collapses the question into a single test: are any of the group means different? The null hypothesis covers every group:

H₀: μ₁ = μ₂ = μ₃ = ... = μₖ
H₁: at least one μ is different from the others

If ANOVA rejects, you then run follow-up comparisons (Tukey's HSD is the standard choice) to find which groups differ. If ANOVA fails to reject, you stop: the data do not provide sufficient evidence of any difference, and chasing individual pairs would be fishing.

What ANOVA Actually Compares

The name is exact: ANOVA analyzes variance. It splits the total variability in the data into two pieces and compares them.

Between-group variance. How spread out the group means are around the overall mean. If the groups really do come from populations with different means, this variance is large.
Within-group variance. How spread out the data are inside each group, around that group's own mean. This is the baseline "noise" — variability the group label cannot explain.

The F-statistic is the ratio:

F = MSB / MSW

where MSB is the mean square between groups and MSW is the mean square within groups. If the groups really differ, MSB is much larger than MSW and F is large. If the groups are all from the same population, MSB and MSW estimate the same thing, F is near 1, and the p-value is large.
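Written out in symbols, the two mean squares look like this. The notation is an assumption chosen for this sketch rather than something fixed by the text: k groups, n_j observations in group j, N observations in total, x̄_j for the mean of group j, and x̄ for the overall mean.

```latex
\begin{aligned}
\mathrm{SSB} &= \sum_{j=1}^{k} n_j\,(\bar{x}_j - \bar{x})^2, &
\mathrm{MSB} &= \frac{\mathrm{SSB}}{k-1} \\
\mathrm{SSW} &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2, &
\mathrm{MSW} &= \frac{\mathrm{SSW}}{N-k} \\
F &= \frac{\mathrm{MSB}}{\mathrm{MSW}}
\end{aligned}
```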

Figure: Three side-by-side dot plots showing the data spread inside each group versus the spread of the three group means.

The Worked Example

A psychology study tests reaction time (in milliseconds) under three caffeine doses: 0 mg, 100 mg, and 200 mg. Five subjects per group. The data:

Group A (0 mg): 320, 340, 330, 350, 360. Mean x̄_A = 340.
Group B (100 mg): 310, 300, 320, 290, 280. Mean x̄_B = 300.
Group C (200 mg): 270, 280, 290, 250, 260. Mean x̄_C = 270.

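Overall mean x̄ across all 15 observations = (340 + 300 + 270) / 3 = 303.33 ms. Averaging the three group means works here only because every group has the same size; with unequal groups, average all the observations directly. Test at α = 0.05.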
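A minimal Python sketch of the same bookkeeping, using only the standard library. The dictionary keys and variable names are illustrative choices, not part of the study.

```python
from statistics import mean

# Reaction times (ms) from the worked example
groups = {
    "A (0 mg)": [320, 340, 330, 350, 360],
    "B (100 mg)": [310, 300, 320, 290, 280],
    "C (200 mg)": [270, 280, 290, 250, 260],
}

group_means = {name: mean(values) for name, values in groups.items()}

# Grand mean: pool every observation, which is safe for any group sizes
all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Shortcut: average the group means (matches only because the n's are equal)
mean_of_means = mean(list(group_means.values()))

print(group_means)
print(round(grand_mean, 2), round(mean_of_means, 2))
```

Pooling the observations is the safer habit, since it stays correct when group sizes differ.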

Step 1 — Hypotheses. H₀: μ_A = μ_B = μ_C. H₁: at least one mean differs.

Step 2 — Sum of squares between (SSB). For each group, take (group mean − overall mean)² × group size, then add.

(340 − 303.33)² × 5 = (36.67)² × 5 = 1344.4 × 5 = 6722
(300 − 303.33)² × 5 = (−3.33)² × 5 = 11.1 × 5 = 55.5
(270 − 303.33)² × 5 = (−33.33)² × 5 = 1110.9 × 5 = 5554.5

SSB ≈ 6722 + 55.5 + 5554.5 = 12,332 (12,333.3 if the intermediate values are not rounded)
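Step 2 can be checked with a short script. This is a sketch under the assumption that the data are held as a plain list of lists; it carries full floating-point precision, so its total can differ from a hand calculation in the last digit.

```python
from statistics import mean

groups = [
    [320, 340, 330, 350, 360],  # A, 0 mg
    [310, 300, 320, 290, 280],  # B, 100 mg
    [270, 280, 290, 250, 260],  # C, 200 mg
]

grand_mean = mean([x for g in groups for x in g])

# Each group's contribution: n_j * (group mean - grand mean)^2
contributions = [len(g) * (mean(g) - grand_mean) ** 2 for g in groups]
ssb = sum(contributions)

print([round(c, 1) for c in contributions])
print(round(ssb, 1))
```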

Step 3 — Sum of squares within (SSW). For each group, sum the squared deviations from that group's own mean.

Group A: (320−340)² + (340−340)² + (330−340)² + (350−340)² + (360−340)² = 400 + 0 + 100 + 100 + 400 = 1000
Group B: (310−300)² + (300−300)² + (320−300)² + (290−300)² + (280−300)² = 100 + 0 + 400 + 100 + 400 = 1000
Group C: (270−270)² + (280−270)² + (290−270)² + (250−270)² + (260−270)² = 0 + 100 + 400 + 400 + 100 = 1000

SSW = 1000 + 1000 + 1000 = 3000
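The within-group sum can be checked the same way. Again a sketch with illustrative variable names: each observation is compared with its own group's mean, never with the overall mean.

```python
from statistics import mean

groups = [
    [320, 340, 330, 350, 360],  # A, 0 mg
    [310, 300, 320, 290, 280],  # B, 100 mg
    [270, 280, 290, 250, 260],  # C, 200 mg
]

# Squared deviations of each observation from its own group's mean
ss_per_group = [sum((x - mean(g)) ** 2 for x in g) for g in groups]
ssw = sum(ss_per_group)

print(ss_per_group, ssw)
```

All three groups contribute the same amount here, which is a property of this tidy example rather than something to expect from real data.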

Step 4 — Degrees of freedom. With k = 3 groups and N = 15 observations:

df_between = k − 1 = 2
df_within = N − k = 12

Step 5 — Mean squares.

MSB = SSB / df_between = 12,332 / 2 = 6166
MSW = SSW / df_within = 3000 / 12 = 250

Step 6 — F-statistic.

F = MSB / MSW = 6166 / 250 ≈ 24.66 (24.67 if the intermediate values are not rounded)
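Steps 2 through 6 fit in one small function. The name one_way_anova and the dictionary it returns are assumptions made for this sketch, not a library API; it uses only the standard library and handles unequal group sizes because each group is weighted by its own length.

```python
from statistics import mean


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom, mean squares and F."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [mean(g) for g in groups]

    # Between groups: spread of group means around the grand mean
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: spread of observations around their own group mean
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    msb, msw = ssb / df_between, ssw / df_within
    return {
        "ssb": ssb, "ssw": ssw,
        "df_between": df_between, "df_within": df_within,
        "msb": msb, "msw": msw, "f": msb / msw,
    }


caffeine = [
    [320, 340, 330, 350, 360],  # 0 mg
    [310, 300, 320, 290, 280],  # 100 mg
    [270, 280, 290, 250, 260],  # 200 mg
]

result = one_way_anova(caffeine)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```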

Step 7 — Compare to F-critical. For α = 0.05 with df₁ = 2 and df₂ = 12, F-critical ≈ 3.89. Our F = 24.66 is far past it; the p-value is well under 0.001. Reject H₀.
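The critical value and p-value normally come from a table or statistical software. When the numerator degrees of freedom equal 2, as in this example, the upper-tail probability of the F distribution has a simple closed form, P(F > f) = (1 + 2f / df₂)^(−df₂ / 2), so both numbers can be checked without any library. The function names below are invented for this sketch, and the shortcut does not apply to other numerator degrees of freedom; for the general case a routine such as scipy.stats.f.sf is the usual choice.

```python
def f_upper_tail_df1_2(f_value, df_within):
    """P(F > f_value) for an F distribution with 2 and df_within degrees of freedom."""
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)


def f_critical_df1_2(alpha, df_within):
    """Value c with P(F > c) = alpha, valid only for numerator df = 2."""
    return (df_within / 2) * (alpha ** (-2 / df_within) - 1)


critical = f_critical_df1_2(0.05, 12)
p_value = f_upper_tail_df1_2(24.66, 12)

print(f"F-critical ~ {critical:.2f}")
print(f"p-value ~ {p_value:.6f}")
```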

Conclusion in plain English: "There is statistically significant evidence at the 0.05 level that mean reaction time differs across the three caffeine doses." A follow-up Tukey test would show that 0 mg, 100 mg, and 200 mg all differ from each other.
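The Tukey claim can be checked by hand with the honestly significant difference, HSD = q × √(MSW / n), where n is the common group size. The sketch below hard-codes q ≈ 3.77, an assumption read from a standard studentized-range table for α = 0.05, 3 groups, and 12 degrees of freedom; it is not computed here. Statistical packages (for example scipy.stats.tukey_hsd in recent SciPy releases) calculate the adjusted p-values directly.

```python
from itertools import combinations
from math import sqrt

means = {"0 mg": 340, "100 mg": 300, "200 mg": 270}
msw, n_per_group = 250, 5

# Assumption: table value of the studentized range for
# alpha = 0.05, k = 3 groups, 12 within-group degrees of freedom
Q_CRIT = 3.77

hsd = Q_CRIT * sqrt(msw / n_per_group)

for a, b in combinations(means, 2):
    diff = abs(means[a] - means[b])
    verdict = "differ" if diff > hsd else "no clear difference"
    print(f"{a} vs {b}: |difference| = {diff}, HSD = {hsd:.1f} -> {verdict}")
```

Any pair of means more than one HSD apart is declared different at the 0.05 family-wise level.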

Conditions to Check

ANOVA's validity rests on three conditions:

Independence. Observations within and between groups are independent. Random assignment to groups is the cleanest way to satisfy this.
Normality. Each group's population is approximately normal. ANOVA is fairly robust to mild violations, especially when group sizes are equal.
Equal variances (homoscedasticity). Each group has roughly the same population variance. A common rule: the largest sample SD should be no more than about twice the smallest. If variances are clearly different, use Welch's ANOVA, which is the multi-group analogue of Welch's t-test.
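The twice-the-smallest-SD rule of thumb is easy to check before running the test. A minimal sketch on the caffeine data, with illustrative variable names; it checks only the equal-variance condition, not independence or normality.

```python
from statistics import stdev

groups = {
    "0 mg": [320, 340, 330, 350, 360],
    "100 mg": [310, 300, 320, 290, 280],
    "200 mg": [270, 280, 290, 250, 260],
}

# Sample standard deviation of each group
sds = {name: stdev(values) for name, values in groups.items()}
ratio = max(sds.values()) / min(sds.values())

for name, sd in sds.items():
    print(f"{name}: SD = {sd:.2f}")
print(f"largest / smallest SD = {ratio:.2f}, within rule of thumb: {ratio <= 2}")
```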

Why the Two-Group Case Reduces to a T-Test

With k = 2 groups, one-way ANOVA gives an F-statistic that equals the square of the pooled two-sample t-statistic: F = t². The p-values are identical. So ANOVA is not a different test for two groups — it is the same machinery, written in a way that generalizes to more groups.
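The identity can be verified numerically. The sketch below takes just the 0 mg and 100 mg groups from the caffeine example as a convenient two-group data set (that choice is mine, for illustration) and computes the pooled t-statistic and the ANOVA F-statistic separately.

```python
from math import sqrt
from statistics import mean, variance

a = [320, 340, 330, 350, 360]  # 0 mg
b = [310, 300, 320, 290, 280]  # 100 mg
n_a, n_b = len(a), len(b)

# Pooled two-sample t-statistic
pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# One-way ANOVA F-statistic with k = 2 groups
grand_mean = mean(a + b)
ssb = n_a * (mean(a) - grand_mean) ** 2 + n_b * (mean(b) - grand_mean) ** 2
ssw = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
f = (ssb / (2 - 1)) / (ssw / (n_a + n_b - 2))

print(f"t = {t:.4f}, t squared = {t ** 2:.4f}, F = {f:.4f}")
```

The equality relies on the pooled-variance t-test; Welch's unequal-variance t-test does not satisfy F = t² against the classical ANOVA.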

Getting Help

ANOVA assumes you already know how to set up hypothesis tests, so if any of the framework feels shaky, the guide on setting up a hypothesis test covers the structure. To see the two-group t-test that ANOVA generalizes, the two-sample t-test walkthrough is the right companion piece.

Conclusion

One-way ANOVA compares three or more group means with a single F-test that controls the overall false-positive rate. The F-statistic is the ratio of between-group variance to within-group variance, so a large F means the group means are spread out relative to the noise inside each group. In the caffeine example F ≈ 24.66 on (2, 12) degrees of freedom, far beyond the critical value of 3.89 at α = 0.05. Once ANOVA rejects, a Tukey follow-up identifies the specific pairs that differ.
