Here's a nice exemplar (posted at Andrew Gelman's blog) of how even descriptive statistics require attention to detail.
A recent PNAS article includes a figure illustrating “a marked increase in the all-cause mortality of middle-aged white non-Hispanic men and women in the United States between 1999 and 2013.” The data in the figure, however, were “not age-adjusted within the 10-y 45-54 age group.” That is, each annual mortality rate was calculated by dividing the total number of deaths in the age group by the population of the age group. Also notable is that the NYT ran a story about this paper and prominently featured the (now misleading) figure (see below).
Enter Andrew Gelman (Columbia--Statistics) and Jonathan Auerbach (Columbia--Poli Sci), who correctly "suspected an aggregation bias and examined whether much of the increase in aggregate mortality rates for this age group could be due to the changing composition of the 45–54 year old age group over the 1990 to 2013 time period. If this were the case, the change in the group mortality rate over time may not reflect a change in age-specific mortality rates. Adjusting for age confirmed this suspicion. Contrary to Case and Deaton’s figure, we find there is no longer a steady increase in mortality rates for this age group. Instead there is an increasing trend from 1999–2005 and a constant trend thereafter. Moreover, stratifying age-adjusted mortality rates by sex shows a marked increase only for women and not men, contrary to the article’s headline" (emphasis added).
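To see how a shift in age composition alone can move an unadjusted rate, here is a minimal Python sketch. All numbers are hypothetical and chosen only to illustrate the mechanism; they are not the PNAS or CDC data. The sub-group labels, the function names, and the choice of the 1999 population as the reference are likewise assumptions made for this illustration, and this is not Gelman and Auerbach's actual procedure. Age-specific death rates are held identical in both years, so any change in the crude rate comes purely from the group getting older on average.

```python
# Hypothetical data: death rates per 100,000 within each sub-group.
# The rates are deliberately the SAME in both years.
rates = {
    1999: {"45-49": 300.0, "50-54": 500.0},
    2013: {"45-49": 300.0, "50-54": 500.0},
}

# Hypothetical population counts: the 45-54 group skews older in 2013.
population = {
    1999: {"45-49": 600_000, "50-54": 400_000},
    2013: {"45-49": 450_000, "50-54": 550_000},
}


def crude_rate(year_rates, year_pop):
    """Total deaths divided by total population, per 100,000."""
    deaths = sum(year_rates[a] * year_pop[a] / 100_000 for a in year_pop)
    return deaths / sum(year_pop.values()) * 100_000


def age_adjusted_rate(year_rates, reference_pop):
    """Direct standardization: weight each sub-group's rate by its share
    of a fixed reference population, so composition cannot drive the result."""
    total = sum(reference_pop.values())
    return sum(year_rates[a] * reference_pop[a] / total for a in year_rates)


reference = population[1999]  # assumption: use the 1999 age mix as the standard

crude = {y: crude_rate(rates[y], population[y]) for y in rates}
adjusted = {y: age_adjusted_rate(rates[y], reference) for y in rates}

for year in sorted(rates):
    print(f"{year}: crude={crude[year]:.1f}  age-adjusted={adjusted[year]:.1f}")
```

With these made-up inputs the crude rate rises from about 380 to about 410 per 100,000, while the age-adjusted rate stays at about 380 in both years. Nobody's risk of dying changed; the group simply contains more 50-54-year-olds in the later year.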
The important teaching take-away, as Gelman notes, is that "when performing reverse causal inference, remember that people move, and, as we’ve discussed before, the cohorts are changing. 45-54-year-olds in 1999 aren’t the same people as 45-54-year-olds in 2013. We adjust for changing age distributions (ya gotta do that) but we’re still talking about different cohorts."