Natural and social scientists have long had a practice of using asterisks ("stars") to indicate statistical significance in tables. I'll leave aside (for now) the question of whether this is a good practice or a bad one (and, for that matter, whether we ought to be doing statistical significance testing at all), and instead focus on the (mis)use of "stars."
Here's another example from a recent, well-respected law review. If we focus on the use of "stars," the table illustrates three potential issues:
(1) First, there are simply too many levels of them. Having separate numbers of asterisks for P < 0.20, 0.10, 0.05, and 0.01 is overkill. One or -- at most -- two asterisks is always enough; convention says to flag P < 0.05 and/or P < 0.01, and stop.
(2) Second, there is no indication of the "tailedness" of the significance tests. This is important, if for no other reason than that it goes directly to how many stars appear next to each estimate; the sketch after this list shows how.
(3) Finally, there's the issue of "stars" next to estimates for the constant term. Scholars routinely and unthinkingly include them (and, for that matter, P-values) in their tables even when they have no hypothesis about (or even any interest in) that term. Here, for example, the star on the constant in Model Two indicates that we can reject the hypothesis that the constant term is zero (i.e., that the probability of a lawsuit is 0.5 when all the other covariates are equal to zero) at P < 0.2. One's initial impulse to say "who cares?" is bolstered by the fact that such a condition implies that the hypothetical IPO amount was $0, which (I'd wager) is not an in-sample value.
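To make points (1) and (2) concrete, here is a minimal sketch in Python (with a made-up coefficient and standard error, not the table's actual numbers) of the conventional one-or-two-star scheme, and of how the choice of tails changes the P-value -- and therefore the stars -- attached to the very same estimate.

```python
from scipy import stats

def stars(p):
    """Conventional scheme: two asterisks for P < .01, one for P < .05, and stop."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# Made-up coefficient and standard error, purely for illustration.
beta, se = 0.42, 0.24
z = beta / se                        # z (or asymptotic t) statistic

p_two = 2 * stats.norm.sf(abs(z))    # two-tailed P-value
p_one = stats.norm.sf(abs(z))        # one-tailed P-value (half as large)

print(f"z = {z:.2f}")
print(f"two-tailed: P = {p_two:.3f} '{stars(p_two)}'")
print(f"one-tailed: P = {p_one:.3f} '{stars(p_one)}'")
```

With those made-up numbers the estimate earns no asterisk under a two-tailed test and one asterisk under a one-tailed test, which is exactly why a table should say which test it is reporting.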
I don't want to come across as completely negative. The authors managed to avoid a few other common "star"-related mistakes, such as placing them next to the standard errors / t-statistics (rather than the parameter estimates) or using a gaggle of arcane symbols (daggers, double daggers, etc.) in place of asterisks. And, in general, this is a pretty good table; while the variable names are a bit cryptic, having variable descriptions in the table notes helps. I'll talk more about table-related matters in my next post.
It's not bad, but I'm old school: I want to see some measures of fit, especially when the N is below 200. If you have survey data with 1K or so cases, fit will always be pretty poor and you should go directly to the coeffs to see if they have a substantive effect (imho; significance is important, but secondary). But here I'd want something to see if there is any juice in the equation.
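Something along these lines, say -- a minimal sketch in Python/statsmodels on simulated stand-in data (the paper's actual data aren't reproduced here), showing the overall-fit numbers I'd want to see for a logit:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data; the paper's actual dataset is not available here.
rng = np.random.default_rng(0)
n = 150                                        # deliberately a small-N setting
X = sm.add_constant(rng.normal(size=(n, 3)))
y = (rng.normal(size=n) + X[:, 1] > 0).astype(int)

res = sm.Logit(y, X).fit(disp=False)

# Is there any juice in the equation? Overall-fit diagnostics:
print(f"log-likelihood     = {res.llf:.2f} (null model: {res.llnull:.2f})")
print(f"LR chi-square      = {res.llr:.2f} on {res.df_model:.0f} df, p = {res.llr_pvalue:.4f}")
print(f"McFadden pseudo-R2 = {res.prsquared:.3f}")
```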
I also don't like the footnote for the indicator descriptions. Straightforward short descriptions would be much better, even if you had to (horrors!) use two lines to do it. This is all SPSS's fault; we've gotten lazy and just reuse the cryptic indicator names we plug into it.
A final complaint: we have to assume here that the usual frequentist tests are legit. When I do this kind of thing, I usually report the standard tests; everybody at the journals seems to want them. I always check those readings by bootstrapping the model, however. We really need to get reviewers to start accepting bootstrapped errors more often; given the kinds of "samples" social scientists work with these days, it's the only legit route.
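To be concrete about the check I mean: a plain case-resampling bootstrap of the coefficients, something like this sketch (Python/statsmodels, simulated stand-in data again, since the actual model isn't available):

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data; the actual model and data are not available here.
rng = np.random.default_rng(1)
n = 150
X = sm.add_constant(rng.normal(size=(n, 3)))
y = (rng.normal(size=n) + X[:, 1] > 0).astype(int)

res = sm.Logit(y, X).fit(disp=False)            # conventional (asymptotic) SEs

# Nonparametric case-resampling bootstrap of the logit coefficients.
B = 1000
boot = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)            # resample observations with replacement
    boot[b] = sm.Logit(y[idx], X[idx]).fit(disp=False).params

boot_se = boot.std(axis=0, ddof=1)
for name, est, se_asy, se_boot in zip(res.model.exog_names, res.params, res.bse, boot_se):
    print(f"{name:>6}: b = {est:6.3f}  asymptotic SE = {se_asy:.3f}  bootstrap SE = {se_boot:.3f}")
```

If the two SE columns diverge much, that's the red flag I have in mind.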
Which would lead us into a discussion of significance testing tout court, I suppose.
Posted by: Tracy Lightcap | 27 June 2008 at 10:14 PM
We had a blog forum on significance testing in early 2007. Unfortunately, there is no category tag that pulls up the old posts on this topic. Maybe we need one.
Here are the posts I could find:
http://www.elsblog.org/the_empirical_legal_studi/2007/01/blog_forum_sign.html
http://www.elsblog.org/the_empirical_legal_studi/2007/01/a_brief_overvie.html
http://www.elsblog.org/the_empirical_legal_studi/2007/01/the_uses_of_sig.html
http://www.elsblog.org/the_empirical_legal_studi/2007/01/the_social_scie.html
It would be worth revisiting the topic with another blog forum.
Posted by: William Ford | 18 June 2008 at 10:01 PM
Christopher: True as well. I'm very picky with my own Ph.D. students when it comes to making them state explicitly whether their hypotheses are directional, and (so) the "tailedness" of their tests. But I also agree that confidence/credible intervals are generally better.
As for the general subject, perhaps we can/should hold a little blog forum on that topic in the future...
Posted by: C. Zorn | 18 June 2008 at 08:38 AM
Given that most (if not all) researchers use pre-programmed estimators to generate their empirical results, the reader should probably assume that the significance test is two-tailed. It seems to me, along the lines of Jeremy's comment, that what we should discuss more is the relative significance of statistical significance. Even if a point estimate passes the p < .05 test, it may still be a poor estimate if the standard error is large enough. Although it requires more table space, I would like to see more use of confidence intervals rather than SEs and asterisks so that the reader can easily determine the precision of reported estimates.
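By way of illustration, turning a reported coefficient and standard error into a 95% (Wald) confidence interval is trivial; here is a minimal Python sketch with made-up numbers and generic variable names, not the table's actual estimates:

```python
from scipy import stats

def wald_ci(beta, se, level=0.95):
    """Wald confidence interval from a point estimate and its standard error."""
    z = stats.norm.ppf(0.5 + level / 2)
    return beta - z * se, beta + z * se

# Made-up estimates and standard errors, purely for illustration.
estimates = {"covariate_1": (0.42, 0.24), "covariate_2": (-0.15, 0.31)}

for name, (b, se) in estimates.items():
    lo, hi = wald_ci(b, se)
    print(f"{name}: {b:5.2f}  95% CI [{lo:5.2f}, {hi:5.2f}]")
```

The reader can then see at a glance both whether the interval excludes zero and how precisely the effect is estimated.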
Posted by: C. Griffin | 17 June 2008 at 10:53 PM
Reasonable points, Jeremy. Actually, this table is a bit unusual in going the P<.10, P<.20 route. The more common means of proliferation for stars is along the lines of "one asterisk indicates P<.05, two indicate P<.01, three indicate P<.001, four indicate P<.0001, five indicate P<.00001, etc."
As for "tails," it seems easy enough to note e.g. "(one-tailed)" after the description of the P-values. And, that's consistent with the general rule that a table should "stand on its own" whenever possible.
Posted by: C. Zorn | 17 June 2008 at 02:09 PM
A couple of quibbles and a request... The request: please don't put off for too long a discussion of the problems of over-emphasizing statistical significance! The more (and more often) we note the potential shortcomings of focusing ONLY on p-values, I think, the better.
Which relates to one of the quibbles: it's certainly possible that an effect can be significant at p-values higher than .05, but for that reason be reflexively dismissed despite being potentially interesting. Thus, I don't know that it's necessarily a problem to report p-values of <.10, though perhaps .20 is stretching it a little. (Whether it's necessary to do it in the table is another question.) The second quibble is on the tailedness point: no question an author should make clear whether a one- or two-tailed test is being reported, but that seems appropriate to put in the text, rather than a table, for reasons of (1) clutter and (2) the opportunity to explain why the particular test is being used.
Posted by: Jeremy A. Blumenthal | 17 June 2008 at 01:18 PM