"Ice cream sales and drowning deaths rise together — so ice cream causes drowning." Stated that bluntly, the error is obvious. But the same mistake hides inside serious-sounding claims every day. Understanding correlation vs. causation means knowing exactly why a strong relationship between two variables does not, on its own, prove one drives the other — and what evidence would.

What Correlation Actually Tells You

Correlation measures whether two variables move together. The correlation coefficient, written r, runs from −1 to +1. An r near +1 means they rise and fall together; near −1 means one rises as the other falls; near 0 means no linear relationship.
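
To make the coefficient concrete, here is a minimal sketch that computes Pearson's r directly from its definition: the sum of products of deviations from the mean, divided by the square root of the product of the two sums of squared deviations. The function name pearson_r and the small data lists are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how the two variables deviate from their means together.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator pieces: how much each variable varies on its own.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)

# Made-up illustrative data, not from any real study.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # rises in lockstep with hours
falling = [10, 8, 6, 4, 2]   # falls as hours rises

r_up = pearson_r(hours, rising)
r_down = pearson_r(hours, falling)
print(r_up, r_down)  # prints 1.0 -1.0
```

Perfectly straight-line data like this sits at the two extremes of the scale. Real data almost always lands somewhere in between.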

What correlation does not tell you is why they move together. A correlation of r = 0.85 between two variables is a fact about the data: it is real and measurable. The leap to "therefore A causes B" is an interpretation, and it is the interpretation that fails. Correlation can be a clue that a causal link exists, but it is nowhere near sufficient to establish one.

Notice too that r only captures linear association. Two variables can be tightly related in a curved pattern and still produce an r near 0, so a small correlation does not even rule out a relationship.
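
A quick way to see this is to feed r a relationship that is perfectly predictable but curved. The sketch below uses y = x squared over x values symmetric around zero, an example chosen for illustration. The helper pearson_r is a hand-written implementation of the usual formula, not a library call.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = list(range(-5, 6))       # -5, -4, ..., 5
ys = [x * x for x in xs]      # y is completely determined by x

r_curve = pearson_r(xs, ys)
print(round(r_curve, 6))      # zero: no linear association at all
```

The downward half of the parabola cancels the upward half, so r comes out at zero even though knowing x tells you y exactly. Plotting the data is the simplest safeguard against this blind spot.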

[Figure: scatter plot of dots showing a loose upward trend.] A scatter plot can show a clear correlation without revealing any cause.

The Three Reasons a Correlation Need Not Mean Cause

When variables A and B are correlated, there are at least four possible explanations. Only one of them is "A causes B"; the other three are described below.

A confounding variable drives both

A confounder is a hidden third variable that influences both A and B, creating a correlation between them even though neither causes the other. This is the ice-cream-and-drowning case: hot weather is the confounder. Hot days increase ice cream sales and increase swimming, which increases drownings. Ice cream and drowning are correlated, but the link runs through temperature.

Confounders are among the most common reasons correlations mislead. Another classic: fires attended by more firefighters tend to involve more damage. The confounder is the size of the fire: bigger fires summon more firefighters and cause more damage. Concluding "firefighters cause damage" would be wrong, yet the correlation is genuine and strong.
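
The ice cream case can be simulated. The sketch below is a toy model under invented assumptions: temperature is drawn uniformly between 10 and 35 degrees, and both ice cream sales and drownings depend on temperature plus independent noise. Neither variable appears in the other's formula, and all the coefficients are made up.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_days = 20000

# Hypothetical model: temperature is the only shared influence.
temps = [rng.uniform(10, 35) for _ in range(n_days)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temps]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temps]

# Across all days the two outcomes look strongly related.
r_all = pearson_r(ice_cream, drownings)

# Hold the confounder roughly fixed: keep only days between 24 and 26 degrees.
band = [(i, d) for t, i, d in zip(temps, ice_cream, drownings) if 24 <= t <= 26]
r_band = pearson_r([i for i, _ in band], [d for _, d in band])

print(f"all days: r = {r_all:.2f}")
print(f"24-26 degree days only: r = {r_band:.2f}")
```

Across all days the correlation is strong. Among days with nearly the same temperature it all but vanishes, which is what you would expect if temperature alone links the two.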

The causation runs the other way

Sometimes B causes A, not the reverse. Studies find that people who exercise report lower stress. It is tempting to conclude exercise reduces stress — but it is also plausible that low-stress people simply have more energy and time to exercise. The correlation alone cannot distinguish "exercise → less stress" from "less stress → more exercise." This is reverse causation.
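
One way to see why the correlation cannot settle the direction is to simulate both stories. In the sketch below, one invented world has exercise driving stress down and the other has stress driving exercise down. The variable names and coefficients are assumptions made up for the illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(4)  # fixed seed so the run is repeatable
n = 5000

# World 1: exercise is generated first, and stress depends on it.
exercise_1 = [rng.gauss(0, 1) for _ in range(n)]
stress_1 = [-x + rng.gauss(0, 1) for x in exercise_1]

# World 2: stress is generated first, and exercise depends on it.
stress_2 = [rng.gauss(0, 1) for _ in range(n)]
exercise_2 = [-s + rng.gauss(0, 1) for s in stress_2]

r_world_1 = pearson_r(exercise_1, stress_1)
r_world_2 = pearson_r(exercise_2, stress_2)
print(f"exercise -> stress: r = {r_world_1:.2f}")
print(f"stress -> exercise: r = {r_world_2:.2f}")
```

The two worlds have opposite causal arrows but nearly identical correlations, so r by itself carries no information about which way the arrow points.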

It is coincidence

With enough variables, some will correlate by pure chance. Compare hundreds of unrelated trends over a decade and a few will track each other closely with no mechanism connecting them at all. A correlation found by sifting through many comparisons is far weaker evidence than one predicted in advance — which is why a result is more trustworthy when the relationship was hypothesized before the data was examined.
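
The effect of sifting is easy to reproduce. The sketch below generates 200 random walks of 10 points each, standing in for unrelated yearly trends over a decade, and searches every pair for the strongest correlation. The series are pure noise by construction. The counts and the seed are arbitrary choices for illustration.

```python
import random
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable

def random_walk(length):
    """A trend built from independent random steps; it has no outside cause."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        path.append(level)
    return path

trends = [random_walk(10) for _ in range(200)]

# Compare every pair and keep the strongest correlation in either direction.
best_r, best_pair = max(
    (abs(pearson_r(trends[i], trends[j])), (i, j))
    for i, j in combinations(range(len(trends)), 2)
)
n_pairs = len(trends) * (len(trends) - 1) // 2
print(f"searched {n_pairs} pairs; strongest |r| = {best_r:.3f} for pair {best_pair}")
```

With nearly twenty thousand pairs to choose from, the winner looks impressive even though every series is noise. A single pair named in advance would rarely do as well.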

What It Takes to Establish Causation

If correlation is not enough, what is? The gold standard is the randomized controlled experiment.

In a randomized experiment, the researcher assigns subjects to groups by chance — a treatment group and a control group. Random assignment is the key move: it makes the groups equivalent, on average, for every variable including the ones nobody thought to measure. Because the only systematic difference between groups is the treatment, a difference in outcomes can be attributed to the treatment itself. Random assignment is what neutralizes confounders.

This is why "observational" data — data simply collected without assignment — supports causal claims so weakly. When people choose for themselves whether to exercise, take a supplement, or attend a program, the people who choose differ in countless other ways. Those differences are confounders waiting to happen.
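
The contrast between self-selection and random assignment can be sketched in a simulation. In this toy model a supplement has no effect at all, outcomes depend on a hidden "health" variable, and healthier people are more likely to choose the supplement. All names and numbers here are assumptions invented for the illustration.

```python
import random
from statistics import mean

rng = random.Random(2)  # fixed seed so the run is repeatable
n = 10000
TRUE_EFFECT = 0.0       # by construction, the supplement does nothing

# Hidden confounder: nobody in the "study" measures this.
health = [rng.gauss(0, 1) for _ in range(n)]

def outcome(h, treated):
    """Outcome depends on health, the (zero) treatment effect, and noise."""
    return h + TRUE_EFFECT * treated + rng.gauss(0, 1)

def difference_in_means(treated_flags):
    """Average outcome of the treated group minus that of the control group."""
    outcomes = [outcome(h, t) for h, t in zip(health, treated_flags)]
    treated = [y for y, t in zip(outcomes, treated_flags) if t]
    control = [y for y, t in zip(outcomes, treated_flags) if not t]
    return mean(treated) - mean(control)

# Observational: healthier people tend to opt in on their own.
self_selected = [h + rng.gauss(0, 1) > 0 for h in health]
# Experimental: a coin flip decides, regardless of health.
randomized = [rng.random() < 0.5 for _ in health]

diff_observational = difference_in_means(self_selected)
diff_randomized = difference_in_means(randomized)
print(f"self-selected groups: difference = {diff_observational:.2f}")
print(f"randomized groups:    difference = {diff_randomized:.2f}")
```

The self-selected comparison shows a sizable "benefit" that comes entirely from who chose the supplement. The randomized comparison lands near the true effect of zero, because the coin flip spreads health evenly across both groups.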

When experiments are impossible or unethical, researchers strengthen observational evidence by controlling for known confounders statistically, testing whether the effect holds across many settings, and checking that the cause precedes the effect in time. None of this proves causation the way an experiment does, but together it builds a case.
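
Statistical control can be sketched with a simple regression adjustment. The toy model below assumes a true effect of exercise on stress of -0.5, plus a measured confounder called free time that raises exercise and lowers stress. Every name and coefficient is invented for the illustration. The adjustment removes the straight-line influence of the confounder from both variables and then relates what is left, which gives the same slope as a regression that includes the confounder.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable
n = 5000
TRUE_SLOPE = -0.5       # assumed real effect of exercise on stress

# Hypothetical model: free time raises exercise and independently lowers stress.
free_time = [rng.gauss(0, 1) for _ in range(n)]
exercise = [z + rng.gauss(0, 1) for z in free_time]
stress = [TRUE_SLOPE * x - 1.0 * z + rng.gauss(0, 1)
          for x, z in zip(exercise, free_time)]

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def residuals(xs, ys):
    """What remains of ys after subtracting its straight-line fit on xs."""
    b = slope(xs, ys)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return [(y - mean_y) - b * (x - mean_x) for x, y in zip(xs, ys)]

# Naive estimate: ignores free time, so its influence is blamed on exercise.
naive = slope(exercise, stress)
# Adjusted estimate: strip free time out of both variables first.
adjusted = slope(residuals(free_time, exercise), residuals(free_time, stress))

print(f"naive slope:    {naive:.2f}")
print(f"adjusted slope: {adjusted:.2f}  (true value {TRUE_SLOPE})")
```

The naive slope overstates the effect, while the adjusted slope comes close to the assumed true value. The adjustment works here only because the confounder was known and measured; an unmeasured one would remain in the estimate.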

Spotting the Error in the Wild

On exams and in headlines, the tell is a study described as observational — "researchers surveyed," "data showed," "people who did X also had Y" — followed by a causal verb: causes, boosts, prevents, leads to. When you see that pattern, ask the three questions: Could a confounder explain both? Could the causation run backward? Could it be coincidence? If any answer is "plausibly yes," the causal claim is not supported by the correlation alone.

Getting Help

Correlation is often the starting point for reading regression output, which quantifies how variables relate and carries exactly the same warning about not over-reading a relationship as cause.

Conclusion

Correlation vs. causation is the most reliable trap in introductory statistics because the mistake feels so natural. A correlation tells you two variables move together; it never tells you why. A confounding variable, reverse causation, or sheer coincidence can each produce a correlation without A causing B. A randomized controlled experiment, where random assignment neutralizes confounders, is the strongest basis for claiming that one variable truly causes the other; careful observational evidence can build a case but falls short of that standard.