Central to most empirical research is the control of plausible rival hypotheses. If we are trying to show empirically that the world is or operates in one way, we need to convince the reader that our results are not likely to be attributable to explanations that undercut or are inconsistent with the description or explanation we provide. The key concept here is plausible. We need not dispose of all possible explanations for our results. If we had to do this, we could never publish. But some law professors, many lawyers, and too often courts are adept at ignoring the plausibility constraint when they want to reject an empirically based argument. Instead they treat, and get away with treating, possible alternative explanations as if their mere possibility seriously undercuts reliable data and well-done analysis.
David Baldus and his coauthors, if I recall correctly, included more than 600 independent variables in a model aimed at investigating whether death sentences in Georgia were tainted by race effects. The goal, I assume, was to fend off this kind of uninformed (or bad faith) prosecutorial and judicial nitpicking. From a scientific standpoint, the far more parsimonious model they also included, with 32 variables as I recall, was a better way of examining the data. But their tactic worked – in a way. When the McCleskey case, in which the Baldus study was the key evidence, reached the Supreme Court, the conservative majority chose not to contend with the analysis, but instead held that in order to show the racial discrimination needed to reverse a death sentence a defendant had to show that in his particular case discrimination played a role in his sentence. This meant the Court never had to confront the Baldus finding of substantial race-of-victim discrimination when cases were of intermediate seriousness. It had become irrelevant.
Potential plausible rival hypotheses are numerous and, for the most part, study specific, but there are two methodologically motivated hypotheses so ubiquitous in social science research that I shall comment on them specifically. The first is that the conclusions drawn in a study are not supported by the evidence because the variables used in the analysis do not have the meaning the study’s author claims for them. This is the problem of operationalization. Seldom can empirical researchers measure directly or with precision the concepts that most concern them. Indeed, many important concepts are neither natural categories nor numerical quantities, and without transformation to a categorical or quantitative form they cannot be included in quantitative models. So instead of the variables that precisely interest us, we use, and often create, quantifiable indicators of the concepts that figure in the research, variables that can be operated on. But we write, and tend to read, as if we have hold of the concepts themselves.
U.S. News, for example, may in a given year tell us that Yale is a more prestigious school than Harvard, using language like “Among American law schools Yale ranks highest in faculty prestige. Harvard is second.” They don’t write, “Among the 23% of law professors surveyed who returned our questionnaires, only 2 failed to rate Yale in the top 25% of the nation’s law schools, while 5 didn’t place Harvard there.” The numbers are made up, but the portrait is sound. This is, or was years ago when last I looked, what differences in the faculty prestige accorded different law schools meant. Is there something wrong with this picture? Who are the 2 people who did not think Yale was among the top quarter of the nation’s law schools in prestige, and the 5 people who didn’t place Harvard in the top 25%, and what were they thinking? Is relative prestige being determined by a small group of law professors so out of touch with reality that they belong in mental institutions? Or maybe what we have here is a rating not of prestige but of the treatment of visitors, and since Harvard has many more visitors than Yale, it also has more “disgruntees”.
It’s fun to pick on U.S. News and usually safe as well, but the problem is ours, not theirs. We too often treat our variables as if they capture precisely the concepts they have been chosen to represent. We are too often lazy readers, and we accept the English-language characterization of a variable rather than ask what it precisely represents; that is, what it specifically measures and how it does so. Whether we are law professors, lawyers or judges, most of us, if we like results, uncritically accept and cite the author’s characterization of them without considering what has been measured and hence without considering precisely what they mean. Indeed, even when challenging unwelcome results, lawyers often fail to examine thoroughly how concepts are operationalized and instead seek to find weaknesses elsewhere. Yet in quantitative research everything rests on the adequacy of our measures and the fairness with which we interpret them. I have read any number of articles where I saw substantial room for improvement.
The other rival hypothesis that must be disposed of in most empirical research is the hypothesis that the relations we have found do not reflect patterned behavior but are simply the result of chance. Whenever we are working with samples rather than populations this is potentially a plausible rival hypothesis, and depending on the question we are asking of population data it can be a threat there as well. Fortunately we have a well-studied and precise way of evaluating the seriousness of this threat – tests of statistical significance.
It is too late to ban the term “significance test,” but if we could we should, and we can in our own minds. These statistical tests do not tell us the significance of anything, at least not in the sense of importance. Rather, they tell us how likely we are to see a certain distribution or association, or one more extreme than what we are seeing, if the pattern in the data reflects only chance variation. Significance tests are particularly valuable in investigations that have intentionally introduced chance into the data, either through random sampling or the random administration of a stimulus. In such designs, the possibility that a relationship has resulted from chance may be the only rival hypothesis that is plausible, and even when it is not, it is one that must always be disposed of.
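To make concrete what a significance test reports, here is a minimal sketch of a permutation test on invented treatment and control scores; the numbers and the one-sided test are purely illustrative and are not drawn from any study discussed here.

```python
# A sketch of what a significance test reports: the probability of seeing an
# association at least as extreme as the one observed if only chance is at work.
# The scores below are invented purely for illustration.
import random

random.seed(1)
treated = [78, 85, 82, 90, 88, 76, 84, 91]   # hypothetical outcome scores
control = [74, 80, 79, 83, 77, 81, 75, 78]

observed_diff = sum(treated) / len(treated) - sum(control) / len(control)

# Permutation test: shuffle the pooled scores and ask how often a purely
# random split produces a difference at least as large as the observed one.
pooled = treated + control
n_treated = len(treated)
n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = (sum(pooled[:n_treated]) / n_treated
            - sum(pooled[n_treated:]) / (len(pooled) - n_treated))
    if diff >= observed_diff:
        extreme += 1

p_value = extreme / n_perm
print(f"observed difference: {observed_diff:.2f}, one-sided p = {p_value:.3f}")
```

The reported p-value answers only the chance question; it says nothing about whether the difference is large enough to matter.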
Significance tests are thus immensely valuable, but they unfortunately carry many traps for statistical novices and for the unwary. The easiest trap is to confuse significance with importance; cognitively the words are so close that we want to make the substitution. But a highly significant relationship may be unimportant, while one that fails to achieve significance may matter. Both overweighting and underweighting the implications of statistical significance often result from failing to give close attention to the numbers on which significance tests are based. In very large samples, even trivial relationships may be highly significant yet have little in the way of explanatory power. In very small samples, variables that contribute a lot to explaining outcomes may nonetheless fail to achieve statistical significance. If the pattern of explanation makes sense, the proper treatment is not to discard the variable as insignificant but rather to acknowledge that it may be important – indeed very important – while recognizing that there are too few available cases to assert that conclusion.
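A small simulation makes the point. The group sizes, effect sizes, and data below are invented solely to illustrate how sample size, not importance, drives significance.

```python
# A sketch of how sample size drives significance, using simulated data:
# a tiny effect in a huge sample versus a large effect in a tiny sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Huge sample, trivial true difference (0.05 standard deviations).
big_a = rng.normal(0.00, 1, 50_000)
big_b = rng.normal(0.05, 1, 50_000)
t_big, p_big = stats.ttest_ind(big_a, big_b)

# Tiny sample, substantial true difference (0.8 standard deviations).
small_a = rng.normal(0.0, 1, 8)
small_b = rng.normal(0.8, 1, 8)
t_small, p_small = stats.ttest_ind(small_a, small_b)

print(f"n=50,000 per group, trivial effect:   p = {p_big:.4f}")   # typically "significant"
print(f"n=8 per group, substantial effect:    p = {p_small:.4f}") # often not significant
```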
Empirical work on the law often has the practical aim of affecting policy. No study that aims at affecting policy should rely on significance alone as a measure of a variable’s importance or of the value of an explanatory model. Far more important is an understanding of how much a variable contributes to a condition that policy seeks to rectify or an end that policy seeks to achieve, or how well a model, taken as a whole, does in explaining the phenomenon under investigation. Explained variance and other statistics, not significance tests, help answer these questions.
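As a hypothetical illustration of why a p-value and explained variance answer different questions, the sketch below fits a simple regression to simulated data in which the relationship is real but trivially weak.

```python
# A sketch contrasting a p-value with explained variance (R^2) for a weak but
# "highly significant" relationship in a large simulated sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(size=n)
y = 0.03 * x + rng.normal(size=n)   # x explains very little of y

result = stats.linregress(x, y)
print(f"p-value:            {result.pvalue:.2e}")       # far below .05
print(f"explained variance: {result.rvalue ** 2:.4f}")  # on the order of 0.001
```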
Tests of statistical significance can also mislead if they draw attention away from more plausible rival hypotheses. If a treatment is not introduced randomly, for example, but is biased, the fact that highly significant treatment effects emerge may deserve no more than a yawn. Suppose a law professor develops a new teaching tool – an interactive internet activity, say – and asks the top students in his class to test it, hypothesizing that it will enhance learning. The fact that these students then do far better on the final exam than the others who did not use the tool, and that the difference in grades is significant beyond, say, the .0001 level, matters not at all, for chance is not the important rival hypothesis. The same would be true if the students were allowed to decide for themselves whether to use the new technology, because the better students might be more likely to use it. In either case it is the likelihood of selection bias that must first be addressed. Unless this is controlled for, the significance level of the grade difference merits no attention, unless of course the relationship is not significant. But in the latter instance the sophisticated analyst would not conclude that the new modality had no educational impact, which is what the failure to achieve significance would conventionally be taken to suggest. Rather, in light of the likely selection bias, the researcher should consider the possibility that the professor has developed a technology detrimental to learning.
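The sketch below simulates this hypothetical classroom: the tool has no true effect at all, but because it goes to the strongest students the grade gap is still “significant” at a very demanding level. Every number in it is invented.

```python
# Selection bias in the teaching-tool example: the tool does nothing, yet the
# grade difference between users and non-users is highly "significant" because
# the strongest students were chosen to use it. All data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_students = 100
ability = rng.normal(size=n_students)

# The professor recruits the top quarter of the class to try the tool.
used_tool = ability >= np.quantile(ability, 0.75)

# Final exam grades depend on ability and noise; the tool adds nothing.
grades = 70 + 10 * ability + rng.normal(scale=5, size=n_students)

t, p = stats.ttest_ind(grades[used_tool], grades[~used_tool])
gap = grades[used_tool].mean() - grades[~used_tool].mean()
print(f"mean gap: {gap:.1f} points, p = {p:.2e}")
```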
Significance tests can also mean less or more than they seem to mean depending on how research has been conducted or, indeed, on the body of research into which a particular study fits. If a hypothesis has been tested in a number of ways and only some tests have yielded significant results, it may be that even the significant results do not refute the hypothesis of a chance relationship. If enough tests have been done, even if the hypothesized relationship does not exist in the data, some results can be expected to reach significance by chance alone.
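The arithmetic here is simple, and a few lines make it vivid: under a conventional .05 threshold, the chance of at least one “significant” result among several independent tests of true null hypotheses grows quickly.

```python
# Probability of at least one p < .05 among k independent tests when every
# null hypothesis is true.
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - 0.05) ** k
    print(f"{k:>2} independent tests: P(at least one 'significant' result) = {p_any:.2f}")
```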
Some readers may be aware that I am one of a number of people who have been contesting the viability of Richard Sander’s mismatch theory of affirmative action and the validity of the statistical analyses on which it rests. (Professor Sander’s contributions may be found on his web site, and ours are on the web site of the Equal Justice Society and on the Michigan SSN website.) I don’t think it appropriate to use my blogging opportunity to make my case against his research, but I do want to refer to one aspect of his work because it so nicely illustrates the point I wish to make here. In his reply to criticism by a group I am part of and by others, Sander changed his original approach and focused on the difference between the performance of students attending their first and second choice law schools. For reasons I need not go into, and which I reject, he argues that if second choice students do better than first choice students that result supports his mismatch hypothesis. He then looks at how well second choice students do on the bar the first time they take it, finds that on several measures they do significantly better than first choice students, and claims support for his hypothesis. However, if eventual bar passage rather than first time bar passage is the dependent variable, the significant difference between first and second choice student performance for the most part disappears. Now suppose this study had come out differently, and second choice students had done significantly better on eventual bar passage but no differently than first choice students on first time bar passage. Again Sander could have claimed support for his theory. What this implies is that the significance levels he reports when first time bar passage is the dependent variable are more impressive than they should be, because they do not take into account that he had another opportunity to support his hypothesis. There is nothing particularly wrong here, or at least there would not be if he gave equal emphasis to the eventual bar passage results, for reporting without discounting is common when people test theories in different ways or with different dependent variables. But the wary reader should note this, and when a hypothesis has been tested in more than one way, only some of which yield significant results, the reader should consider what degree of discounting the failed tests make appropriate.
The converse situation may also pertain. A series of tests may yield no significant effects of a given variable. But if the tests are independent and if all or, with a large enough number of tests, almost all results point in the same hypothesized direction, the results taken together may be significant and, particularly if the individual tests were of low power (i.e., unlikely to reveal a moderate-sized relationship even if one existed), they may identify an important relationship. Meta-analyses aggregate in this way, controlling if they can for data and analytic quality.
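A crude but illustrative version of this logic is a sign test on the direction of effects across independent studies; the study counts below are hypothetical.

```python
# If individually non-significant results from independent studies keep
# pointing the same hypothesized way, the pattern itself can be unlikely
# under chance. Counts are invented for illustration.
from math import comb

n_studies = 10
n_positive = 9   # studies whose (non-significant) estimates point the hypothesized way

# Under the null, each study is equally likely to point either way, so the
# chance of 9 or more of 10 pointing the hypothesized direction is:
p = sum(comb(n_studies, k) for k in range(n_positive, n_studies + 1)) / 2 ** n_studies
print(f"one-sided sign-test p = {p:.4f}")   # about 0.011
```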
I expect many readers will find that I have said little that is new to them. My purpose is rather to share a research perspective and give certain reminders. I think the matters I have discussed are important enough that one cannot be reminded of them too often.
Rick
Tomorrow: Thoughts on the Past, Present and Future of Empirical Legal Studies
You're right. This is nothing much new, but it always bears repeating.
One caveat, however: you are treating significance tests in the context of random samples. That is not the only way to make stat inferences or even the best one. The use of significance tests becomes more important and - a bonus - more understandable if they are generated by resampling techniques. Bootstrapped estimates are more directly relevant to interpreting the stability of substantive estimates. The jackknife is also handy in some situations, though the widespread adoption of resampling in stat programs makes it less necessary.
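For readers who have not seen it, a minimal sketch of the bootstrap idea on invented data: resample with replacement and look at how the estimate varies across resamples.

```python
# Bootstrap sketch: resample the observed data with replacement to gauge the
# stability of an estimate (here, the mean). Data are invented.
import random

random.seed(3)
data = [2.1, 3.4, 2.8, 4.0, 3.1, 2.5, 3.7, 2.9, 3.3, 4.2]
n_boot = 10_000

boot_means = []
for _ in range(n_boot):
    resample = [random.choice(data) for _ in data]
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
lo, hi = boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]
print(f"sample mean = {sum(data)/len(data):.2f}, bootstrap 95% interval = ({lo:.2f}, {hi:.2f})")
```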
Other than that: spot on.
Posted by: Tracy Lightcap | 14 August 2006 at 11:56 AM
A Further Thought:
One should also note that while significance tests should not be used to measure the importance or policy relevance of relationships, measures of the magnitude of effects may also be misinterpreted or misused. For example, data transformations, such as those commonly used in studies of the deterrent effects of capital punishment, mean that coefficients are properly interpreted as elasticities, and accounts of what has been found typically report the tradeoffs these elasticities imply. Thus Ehrlich reported that in his data each additional execution appeared to prevent seven or eight homicides. Assume there were no problems with the data or analysis, and that this estimate was correct. It would still not necessarily mean that if policies were changed to execute more people, each additional execution would save eight lives. The effect found is known to hold only over the range of cases studied. Marginal effects caused by policy changes could be very different. Thus the trade-off found cannot be magically translated into ideal policy even when there is substantial value consensus. Doubling the execution rate in a state might diminish the homicide rate substantially less than a study of the status quo ante implies, and it would increase the risk that an innocent person would be executed. A policy maker who values both increased deterrence and low risk to the innocent might find the trade-off at the margin unacceptable even if he were willing to pay the cost of increased risk to the innocent if for each execution eight lives were saved. This does not mean that the hypothesized research would not support policy change, but policy change based on research about past states should be the occasion for continuing research on what is happening in the changed world, not a marker that we now know what we needed to know to devise wise policy and so should spend our research money on other problems.
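The point about marginal effects can be illustrated with an entirely invented diminishing-returns relationship (not Ehrlich's model or anyone's data): a study of the status quo recovers roughly eight homicides prevented per execution, while the marginal return to doubling executions is far smaller.

```python
# Hypothetical illustration: an average effect estimated over the observed
# range need not hold at the margin. The functional form and numbers are
# invented purely for illustration.
import math

def homicides(executions):
    # invented diminishing-returns relationship
    return 1000 - 35 * math.log1p(executions)

baseline, doubled = 10, 20   # hypothetical execution rates

avg_effect = (homicides(0) - homicides(baseline)) / baseline
marginal = (homicides(baseline) - homicides(doubled)) / (doubled - baseline)
print(f"average effect over studied range: {avg_effect:.1f} homicides per execution")
print(f"marginal effect of doubling:       {marginal:.1f} homicides per execution")
```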
One must also be cautious about measures of effect because researchers, in presenting their work, naturally emphasize the effects of the variables whose impacts they are focusing on. Yet for the policy maker the effects of control variables may be just as important, and perhaps should shape the policy implications drawn from the effects of the focal variables. Suppose, for example, that a student of capital punishment found that each additional execution seems to translate into 8 fewer homicides but that each year beyond 8th grade that the average student stays in school seems to translate into 20 fewer homicides. Is it either wise or moral in these circumstances to invest political and monetary capital in increasing execution rates rather than in programs designed to keep children in school? The policy maker should, and should want to, consider this question. But the effects of policy-relevant control variables may not be evident from the reported research, either because information on the effects of control variables is not given or because it is given only in tabular form and not highlighted in the abstracts, executive summaries or conclusions that grab our attention.
A final example involves the interpretation of logistic regressions. Because it is difficult to intuit the implications of logistic regression coefficients, information is often presented on their effects at the mean values of the other independent variables. In a particular jurisdiction, however, all the other variables are unlikely to be at their means, and the implications of changes made in response to the research may, even on the model's own terms, be quite different from what the presentation of results suggests.
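A short sketch with invented coefficients shows how much the effect of the same one-unit change can differ at the means of the other variables and away from them.

```python
# Logistic regression effects depend on where the other covariates sit.
# The coefficients and covariate values below are invented for illustration.
import math

def prob(x1, x2, b0=-2.0, b1=0.8, b2=1.5):
    return 1 / (1 + math.exp(-(b0 + b1 * x1 + b2 * x2)))

# Effect of moving x1 from 0 to 1 when x2 is at its (hypothetical) mean of 0:
at_mean = prob(1, 0) - prob(0, 0)
# The same change when x2 = 4, as it might be in a particular jurisdiction:
off_mean = prob(1, 4) - prob(0, 4)

print(f"change in predicted probability at the mean of x2: {at_mean:.3f}")
print(f"change in predicted probability when x2 = 4:       {off_mean:.3f}")
```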
The implication I draw from what I have said is that policy makers must be cautious in drawing conclusions about the likely effectiveness of policy changes from even well-conducted quantitative research, even when the variables that suggest change appear to be both statistically significant and substantively important. But these cautions are decidedly NOT a call for refraining from social science research, ignoring social science findings or preferring softer over harder data. Rather they are a call for the careful and sophisticated interpretation of what research does and does not tell us, and for treating research on important policy-relevant issues as an ongoing project in which findings from past studies should be continually updated and where there will almost always be more that is relevant to be learned.
Posted by: Richard Lempert | 11 August 2006 at 02:29 PM