Simpson's paradox

From SubSurfWiki
Revision as of 17:45, 19 April 2011 by Matt (talk | contribs) (added content from blog)

Simpson's paradox is an apparent paradox in statistics in which a correlation reverses, or the ranking of two proportions flips, when groups of data are combined. It is sometimes called the Yule–Simpson effect.


Simpson’s paradox is my favourite example of something simple, something we know we understand, indeed have always understood, suddenly turning on us.

Exploration geophysicists often use information extracted from seismic data, called attributes, to help predict rock properties in the subsurface. Suppose you are a geophysicist comparing two new seismic attributes, truth and beauty, each purported to predict fluid type. You compare their hydrocarbon-predicting success rates on 35 discoveries and it’s close, but beauty has an 83% hit rate, while truth manages only 77%. There's not much in it, but since you only need one attribute, all else being equal, beauty it is.

But then someone asks you about predicting oil in particular. You dig out your data and drill down:

             Truth              Beauty
             Score      %       Score      %
Oil          8/8      100%      25/29     86%
Gas          19/27     70%      4/6       67%
Overall      27/35     77%      29/35     83%

Apparently, truth does better when you look only at oil: 100% against beauty's 86%. And what about gas, they ask? Well, the data show that truth is also better than beauty at predicting gas, 70% against 67%. So truth does a better job on both oil and gas, but somehow beauty edges it out overall.
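The arithmetic is easy to check for yourself. Here is a minimal Python sketch that recomputes the hit rates from the scores in the table; the dictionary layout and the `hit_rate` and `overall` helpers are illustrative choices of mine, not part of any library.

```python
from fractions import Fraction

# (hits, trials) for each attribute and fluid type, copied from the table
scores = {
    "truth":  {"oil": (8, 8),   "gas": (19, 27)},
    "beauty": {"oil": (25, 29), "gas": (4, 6)},
}

def hit_rate(hits, trials):
    """Exact success rate as a fraction."""
    return Fraction(hits, trials)

def overall(groups):
    """Pool hits and trials across all groups, then divide."""
    hits = sum(h for h, _ in groups.values())
    trials = sum(n for _, n in groups.values())
    return hit_rate(hits, trials)

for name, groups in scores.items():
    per_group = ", ".join(
        f"{fluid} {float(hit_rate(*ht)):.0%}" for fluid, ht in groups.items()
    )
    print(f"{name}: {per_group}, overall {float(overall(groups)):.0%}")
# truth: oil 100%, gas 70%, overall 77%
# beauty: oil 86%, gas 67%, overall 83%
```

Using `Fraction` keeps the comparison exact, so the reversal cannot be blamed on rounding.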

Impossible? Clearly not: these numbers are real and plausible, and I haven't done anything sneaky. In this case, hydrocarbon type is a confounding variable, and it’s important to look for such groupings in your data. Improbable? No, it’s quite common in all kinds of data, and this trap is well known among statisticians.
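One way to see where the reversal comes from is to write each overall rate as a weighted average of the group rates, with the group sizes as weights. The notation is my own: n is the number of tests in a group and p is the hit rate in that group.

```latex
p_{\text{overall}}
  = \frac{n_{\text{oil}}\,p_{\text{oil}} + n_{\text{gas}}\,p_{\text{gas}}}
         {n_{\text{oil}} + n_{\text{gas}}}

\text{truth:}\quad
  \frac{8 \cdot \tfrac{8}{8} + 27 \cdot \tfrac{19}{27}}{35} = \frac{27}{35} \approx 77\%
\qquad
\text{beauty:}\quad
  \frac{29 \cdot \tfrac{25}{29} + 6 \cdot \tfrac{4}{6}}{35} = \frac{29}{35} \approx 83\%
```

Truth's average is dominated by its 27 gas tests, where both attributes do worse, while beauty's is dominated by its 29 oil tests, where both do better. The weights, not the attributes, decide the overall ranking.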

Avoiding Simpson's paradox

How can you avoid it? Be especially wary when the sample size in one or more of the groups you are interested in is much smaller than in the others. Be even more alert if group sizes are inconsistent across the variables, as in my example: oil is under-sampled for truth (8 of 35 tests), gas for beauty (6 of 35).
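As a rough screening aid, the warning signs above can be sketched in Python. The function names `is_reversal` and `group_shares` and the input layout are hypothetical, chosen only for illustration, and the sketch assumes both candidates are scored on the same set of groups. It flags the case where one candidate wins in every group but loses on the pooled data, and reports how each candidate's tests are split across the groups.

```python
def beats(x, y):
    """True if rate x = (hits, trials) strictly exceeds rate y (exact integer test)."""
    return x[0] * y[1] > y[0] * x[1]

def pooled(groups):
    """Total (hits, trials) over a dict of group -> (hits, trials)."""
    return (sum(h for h, _ in groups.values()),
            sum(n for _, n in groups.values()))

def group_shares(groups):
    """Fraction of a candidate's trials that fall in each group."""
    total = sum(n for _, n in groups.values())
    return {g: n / total for g, (_, n) in groups.items()}

def is_reversal(a, b):
    """True if one candidate wins every group but loses on the pooled data."""
    a_wins_groups = all(beats(a[g], b[g]) for g in a)
    b_wins_groups = all(beats(b[g], a[g]) for g in a)
    a_wins_pooled = beats(pooled(a), pooled(b))
    b_wins_pooled = beats(pooled(b), pooled(a))
    return (a_wins_groups and b_wins_pooled) or (b_wins_groups and a_wins_pooled)

# Scores from the example in this article
truth = {"oil": (8, 8), "gas": (19, 27)}
beauty = {"oil": (25, 29), "gas": (4, 6)}

print(is_reversal(truth, beauty))  # True
for name, groups in (("truth", truth), ("beauty", beauty)):
    print(name, {g: f"{s:.0%}" for g, s in group_shares(groups).items()})
# truth {'oil': '23%', 'gas': '77%'}
# beauty {'oil': '83%', 'gas': '17%'}
```

A large mismatch in the shares, like 23% against 83% for oil here, is the cue to compare the candidates group by group rather than on the pooled totals.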

Ultimately, there's no guarantee this effect won’t crop up; that’s just how proportions are. All you can do is make sure you ask your data the questions you care about.

Agile links

External links