Imagine this card trick. A statistician divides a regular deck of cards into two sets: one of 20 and one of 32 cards. Next, he asks two groups of students to investigate the cards, and hands one set to each group. Both groups start counting the cards and cross-tabulating them in every way they can come up with. Quite rapidly, a member of the group with 20 cards observes an interesting pattern among the cards his group is studying: a disproportionally large number of black court cards. A hypothesis is formulated: could it be that their ‘blackness’ causes them to be a ‘court’ card more frequently?
The results are drawn up in a table and shown to the entire group:
| Color | Plain | Court | Total |
|---|---|---|---|
| Red | 8 (66.7%) | 4 (33.3%) | 12 (100%) |
| Black | 5 (62.5%) | 3 (37.5%) | 8 (100%) |
| Total | 13 (65.0%) | 7 (35.0%) | 20 (100%) |
The math is correct: the numbers add up to 20, so no card was missed or counted twice. For proper interpretation, the percentages were calculated by row. The percentage of ‘court’ cards amongst the black cards (37.5%) is larger than amongst the red cards (33.3%). To make sure, an odds ratio was computed as well: (8 * 3) / (5 * 4) = 1.2. Because it exceeds 1, this measure of association also seems to point to ‘black’ causing ‘court’. The group was satisfied with the support for their hypothesis.
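The group's arithmetic can be reproduced in a few lines. The sketch below is illustrative only: the names `row_percentages` and `odds_ratio` and the dictionary layout are assumptions made for this example, not part of the original exercise.

```python
from fractions import Fraction

# Counts from the 20-card table: color -> (plain, court).
group_20 = {"red": (8, 4), "black": (5, 3)}


def row_percentages(table):
    """Return, per color, the row-wise percentages of plain and court cards."""
    result = {}
    for color, (plain, court) in table.items():
        total = plain + court
        result[color] = (
            round(100 * plain / total, 1),
            round(100 * court / total, 1),
        )
    return result


def odds_ratio(table):
    """Odds of 'court' among black cards divided by the odds among red cards."""
    red_plain, red_court = table["red"]
    black_plain, black_court = table["black"]
    # Cross-product form: (red plain * black court) / (black plain * red court).
    return Fraction(red_plain * black_court, black_plain * red_court)


print(row_percentages(group_20))
print(float(odds_ratio(group_20)))  # (8 * 3) / (5 * 4)
```

Using `Fraction` keeps the odds ratio exact, so a value of exactly 1 (no association) can be tested without floating-point noise.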
“Oh, come on!” a member of the other group cries out, “Use your common sense. Every deck of cards has an equal number of black and red cards, and the court cards are distributed equally over the colors. If you find a disproportionally large number of black court cards, we should simply find a correspondingly large number of red court cards in our set. Right?” And so, a contrasting hypothesis was raised. The second group quickly tabulated their set of cards, and produced the following table:
| Color | Plain | Court | Total |
|---|---|---|---|
| Red | 12 (85.7%) | 2 (14.3%) | 14 (100%) |
| Black | 15 (83.3%) | 3 (16.7%) | 18 (100%) |
| Total | 27 (84.4%) | 5 (15.6%) | 32 (100%) |
How could that be? To their surprise, the only valid conclusion seems to be that this subset of cards also contains a disproportionally large number of black court cards. The percentage of court cards is again larger amongst the black cards (16.7%) than amongst the red cards (14.3%), and the odds ratio is again (12 * 3) / (15 * 2) = 1.2. Or, in a more liberal interpretation of the findings, amongst these cards it was found again that their ‘blackness’ causes them to be a ‘court’ card. This resulted in a very puzzling situation: a deck of cards was split into two subsets, and in both subsets a positive association was found between the cards’ ‘blackness’ and their being a ‘court’ card. Could it be that the deck of cards was rigged?
The two groups teamed up, and aggregated their two tables:
| Color | Plain | Court | Total |
|---|---|---|---|
| Red | 20 (76.9%) | 6 (23.1%) | 26 (100%) |
| Black | 20 (76.9%) | 6 (23.1%) | 26 (100%) |
| Total | 40 (76.9%) | 12 (23.1%) | 52 (100%) |
No, the deck of cards was not rigged. The aggregated numbers represent a typical deck of cards: 40 plain cards, 12 court cards, and an equal number of red and black cards. In this contingency table, absolutely no association is present between the color of a card and its being ‘court’ or ‘plain’: the percentage of court cards is 23.1% for both colors, and the odds ratio is (20 * 6) / (20 * 6) = 1.
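To see the three results side by side, the two subsets can be added up cell by cell and the odds ratio recomputed for each table. This is a minimal sketch under assumed names (`aggregate`, `odds_ratio`, and the dictionary layout are hypothetical); the counts are the ones reported in the tables.

```python
from fractions import Fraction

# Counts per subset: color -> (plain, court), as reported in the tables.
group_20 = {"red": (8, 4), "black": (5, 3)}
group_32 = {"red": (12, 2), "black": (15, 3)}


def aggregate(*tables):
    """Add up the cell counts of several color -> (plain, court) tables."""
    combined = {}
    for table in tables:
        for color, (plain, court) in table.items():
            old_plain, old_court = combined.get(color, (0, 0))
            combined[color] = (old_plain + plain, old_court + court)
    return combined


def odds_ratio(table):
    """Odds of 'court' among black cards divided by the odds among red cards."""
    red_plain, red_court = table["red"]
    black_plain, black_court = table["black"]
    return Fraction(red_plain * black_court, black_plain * red_court)


full_deck = aggregate(group_20, group_32)

for name, table in [
    ("20 cards", group_20),
    ("32 cards", group_32),
    ("full deck", full_deck),
]:
    print(f"{name}: odds ratio = {float(odds_ratio(table))}")
```

Both subsets give an odds ratio above 1, while the aggregated table gives exactly 1.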
This must be magic: in the complete deck of cards no association is present, while in the two subsets of cards a positive association is found. Since this association is in the same direction in both subsets, we cannot simply argue that the two associations cancel each other out upon aggregation. But it is no magic, it’s statistics: the two sets of cards were selected by the statistician E.H. Simpson (1951).
Of course, the cards in the two subsets were selected carefully, so that the smaller subset had both a disproportionally small number of black cards and a disproportionally large number of court cards. Simpson selected these subsets to illustrate a paradox that is as fascinating as it is relevant to our analytical practice. Finding an association in each subset (with correct math!) while no such association is present in the aggregated set is so counter-intuitive that we can easily make a mistake.
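How lopsided the two subsets are can be checked directly from the counts. The sketch below computes, for each subset and for the full deck, the share of black cards and the share of court cards; the function name `composition` and the data layout are assumptions made for illustration.

```python
from fractions import Fraction

# Counts per subset: color -> (plain, court), as reported in the tables.
group_20 = {"red": (8, 4), "black": (5, 3)}
group_32 = {"red": (12, 2), "black": (15, 3)}
full_deck = {"red": (20, 6), "black": (20, 6)}


def composition(table):
    """Return (share of black cards, share of court cards) as exact fractions."""
    total = sum(plain + court for plain, court in table.values())
    black = sum(table["black"])
    court = sum(court for _, court in table.values())
    return Fraction(black, total), Fraction(court, total)


for name, table in [
    ("20 cards", group_20),
    ("32 cards", group_32),
    ("full deck", full_deck),
]:
    black_share, court_share = composition(table)
    print(
        f"{name}: {100 * float(black_share):.1f}% black, "
        f"{100 * float(court_share):.1f}% court"
    )
```

Compared with the full deck, the 20-card subset holds fewer black cards and more court cards, and the 32-card subset holds the reverse; that imbalance is what lets the association appear in both subsets and vanish on aggregation.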
Now imagine that the cards in Simpson’s deck represent observations in the project you are currently working on. The deck of cards represents the overall population you are interested in, and the two subsets represent two sub-populations you might be studying separately (we often do, for instance if we have separate samples). Even if your sub-populations encompass all people in the population (e.g. you are studying men and women separately), and even if the findings in both sub-populations are consistent, you should not simply conclude that these consistent findings hold true in the complete population. With Simpson’s paradox in mind, you know you’re just one (dis)aggregation away from being led astray.
Simpson, E.H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society. Series B (Methodological), 13(2), 238-241.