Professor Henkel has set out the basic logic of significance testing. His background, however, includes much thought and experience connected with the uses and misuses of testing, a topic to which I hope he will return in this discussion. Let me start by offering a few thoughts of my own on this slightly more advanced topic.

As I see it, there are two legitimate technical uses of significance testing in social science, inference to a population parameter and inference to causality. The first is the one that is chiefly covered in Prof. Henkel's introduction. Some would say that the second use, causality, is just a subset of the first one. Be that as it may, I think it's at least convenient to view it separately.

To use testing in the first way, one must have a probability sample, call it a random sample for convenience, of a specific larger population. One calculates a statistic on the sample and uses testing to make an inference to the population. In the overwhelming majority of cases in social science, this inference is about a relationship between two or more variables. Again in the overwhelming majority of cases, the null hypothesis tested is that the relationship in the population is zero. If the result is significant at, say, the 5% level, then we can infer with small chance of error that the relationship in the population is not zero. This use is completely legitimate. It comes up in survey research based on random samples all the time. Note, however, that the test tells us nothing about whether the nonzero relationship in the population is causal. Most survey researchers know that they have to jump through a lot of hoops in order to reach this further conclusion even tentatively.

The second legitimate use is to infer causality in a randomized experiment. The logic involved is not that of statistics alone, but has to do as well with the nature of the experimental design. If, say, an experimental and a control group are selected by randomization and if an experimental treatment is then administered only to the first group, then the two groups should differ on the outcome variable only (a) by virtue of a random sampling or randomization vagary that leaves the two groups somewhat unequal on this outcome variable or (b) by virtue of the experimental treatment. If a test is carried out with a significant result, it tells us that there is less than, say, a 5% chance that a randomization vagary alone would yield a difference as large as the one we observed. In that case, we can legitimately infer that the cause of at least some of the observed difference was the experimental treatment, i.e., that the causal connection is greater than zero. Note that the result in itself says nothing about causality in any larger population or even in this very group at a different time. This second legitimate use is accessed often in psychology and sometimes but not as often in the other social sciences.
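The randomization logic described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the outcome scores are invented, the function name `randomization_test` is hypothetical, and the statistic (a difference in group means) is one choice among many. The sketch reshuffles the group labels many times and counts how often a "randomization vagary alone" yields a difference at least as large as the observed one.

```python
import random

def randomization_test(treated, control, n_shuffles=10_000, seed=0):
    """Two-sided randomization test for a difference in group means."""
    rng = random.Random(seed)
    n_t = len(treated)
    observed = sum(treated) / n_t - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_c = len(pooled) - n_t
    extreme = 0
    for _ in range(n_shuffles):
        # Re-assign subjects to groups purely at random.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c
        # Count reshuffles that differ at least as much as the real groups.
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_shuffles

# Invented outcome scores, for illustration only.
treated = [12, 15, 14, 16, 13, 17, 15, 14]
control = [10, 11, 12, 9, 13, 10, 11, 12]

observed_diff, p_value = randomization_test(treated, control)
print(f"observed difference = {observed_diff:.2f}, p = {p_value:.4f}")
```

A small p-value here says only that random assignment alone would rarely produce so large a difference in this particular group; it says nothing about any larger population.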

Before going on to illegitimate uses, note how puny the first two will usually be. It has often been justly remarked that a nonzero causal or population relationship tells us very little. Minuscule relationships may well not be very important. Again, there are hoops to jump through in order to ascertain that the causal or population relationship is strong enough to be interesting.

There are several illegitimate uses of testing in abundant evidence. One type is to infer something about a population parameter on the basis of a sample that is not a random sample, as emphasized in Henkel's introduction. Another is to infer that a factor introduced or occurring in one of two observed groups is causal when the two groups have not been selected by randomization. Unfortunately, an investigator will rarely come out and make one of these illegitimate inferences explicitly, but it is not uncommon, in my experience, to hint at it or imply it by the subtleties of the language used. Explicit or implied, the inference is illegitimate.

The illegitimate use I want to emphasize here, however, is the common practice of testing when there is no random sample and there is no randomization. Testing is almost knee-jerk in social science, but random samples and randomization are not all that common. We look at some group and find a relationship between two variables and test it for significance. Why? What can we possibly learn? Without random sampling or randomization, nothing. (I'll qualify this below.) Far better to forget the significance testing entirely and look at other aspects of the data for interesting findings. Furthermore, the testing might be considered dangerous because it can subtly lull us into thinking that, on this basis, the relationship is perhaps more general or perhaps causal. Too often, a significant result can lead us to think that the observed relationship is nonchance and therefore "real". Of course it's real; you observed it. You don't need a significance test to tell you that. If "real" is meant to imply more general or causal, the test is being used illegitimately.

However, this apparently vacuous and potentially misleading application of the test is so very common that one is tempted to think there may be some hidden justification behind it. I would urge that there is indeed a hidden justification, but since it's not in the textbooks and not often spoken about very explicitly, it may well be hidden even from the investigator who is in some sense leaning on it.

The justification I have in mind is not very powerful, perhaps, but it is legitimate and can be both convenient and appealing. It is the use of the significance test as a universal measure of strength. If one looks at a group of interest and finds a relationship between two (or more) variables that is statistically significant, say the variables X and Y, one knows that the relationship is in some sense strong. The sense is the following: The relationship is strong enough that if groups had been assigned by randomization instead of by values of the X variable, the corresponding pattern of differences on the Y variable would not have occurred as much as, say, 5% of the time. That relation might often be considered strong enough to be interesting in context. If the result is significant at, say, the .001 level, the relationship is even stronger, and so on. If it is not statistically significant, it is in this same sense weak, and perhaps too weak to figure prominently in further thinking on the subject. There is no pretending here that the significant relation is causal or more general, just strong in one narrow sense.

The sense is appealing, though, because it's so easy to find the test results and they can be used in this way with any statistic that has a known sampling distribution. In a rough but nontrivial way, one can even compare results of two X variables for the same Y using two different statistics, a regression coefficient on one, for example, and a Kendall's tau on the other. If one statistic is significant at the 1% level and the other is not even significant at the 10% level, then the relation of Y to the first X is strong and to the second weak, or at least the first X is, in this rough but perhaps meaningful way, considerably more strongly related to Y than the second. There are important caveats having to do with sample sizes and variances that do not bear exploring in this context, but if the test is used in this way with a certain amount of expertise, both in methods and in the subject matter area, and with a certain amount of caution and common sense, it is a huge boon to research.

My suggestion is that significance testing may be used in just this way quite often, but without the investigator's being explicit about it. Even in that case, he or she could be deriving profit from the application. Avoiding the misuses reviewed above and using the test intelligently and explicitly in this tentative and preliminary way could introduce a healthy dose of transparency into social science research.

Certainly, Professor Henkel is right. When I said that a causal connection can "legitimately" be inferred on the basis of a significant result I meant legitimate in terms of accepted practice, not mathematics. The idea of legitimacy here is to contrast the case of randomization with other, nonrandom designs.

Posted by: Lawrence Mohr | 05 February 2007 at 03:58 PM

As a very belated comment on Professor Mohr's post that occurred to me after my initial comment, I disagree with a statement Professor Mohr makes about the interpretation of a "significant" result. The statement is the following: "If a test is carried out with a significant result, it tells us that there is less than, say, a 5% chance that a randomization vagary alone would yield a difference as large as the one we observed. In that case, we can legitimately infer that the cause of at least some of the observed difference was the experimental treatment, i.e., that the causal connection is greater than zero." Why I disagree is that 100% of the sampling distribution is the result of chance ALONE if the null hypothesis is true. It is only our sense of credibility that then makes the leap to the conclusion that such a "significant result", in other words, rare event, must be the result of something other than chance. Put another way, "Why should the unlikely happen to me?" There is nothing in the mathematics of significance testing that says a significant result is anything other than the result of chance factors.
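The point that the whole sampling distribution is the product of chance when the null hypothesis is true can be checked with a small simulation. What follows is a hedged sketch, not anyone's published procedure: both groups are drawn from the same standard normal distribution (so the null is true by construction), the test is a simple two-sample z-test with known variance, and the function name `false_positive_rate` is hypothetical.

```python
import random
from statistics import NormalDist, fmean

def false_positive_rate(n_per_group=20, n_trials=4000, alpha=0.05, seed=1):
    """Share of 'significant' z-tests when both groups share one distribution."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of a difference in means with known unit variance.
    se = (2 / n_per_group) ** 0.5
    hits = 0
    for _ in range(n_trials):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(a) - fmean(b)) / se
        p = 2 * (1 - std_normal.cdf(abs(z)))
        if p < alpha:
            hits += 1
    return hits / n_trials

rate = false_positive_rate()
print(f"share of 'significant' results under a true null: {rate:.3f}")
```

The share should land near the chosen alpha: every one of those "significant" results is chance and nothing else, which is the sense in which the mathematics alone cannot certify a cause.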

In response to Professor Lightcap's comments regarding bootstrap sampling, my knowledge of bootstrap sampling is at the level of knowing the technique exists and how it is implemented, not much more. It may have considerable utility in the situations in which decision theory is applicable, but what it contributes, or may contribute, to investigations of the validity of theory is something I have only vague ideas about and they are mostly negative. One website illustrates the nature of at least one of the concerns that underlie my uneasiness in that it indicates that an assumption underlying the use of the technique is that "Your sample is a valid representative of the population", whatever that may mean. If it means a miniature version of the population, or something close, then one still has the problem of basing the results on a very tenuous assumption when employing the technique in theory validation. (the URL is long, so probably will require some pasting http://people.revoledu.com/kardi/tutorial/Bootstrap/bootstrap.htm#assumption)

Posted by: Ramon Henkel | 05 February 2007 at 10:17 AM

This is a response both to Henkel and Lightcap. I must apologize for not making myself clear in the post. I was supporting the position that a significance test can legitimately be used as a measure of strength. What I didn't emphasize strongly enough is that this use has nothing whatever to do with statistical inference based on the test.

When a test is run, you get a number -- a significance or probability level -- and in most cases you then go on to make an inference. If the result is significant, the inference would usually be that the population or causal parameter is not zero. In this use as a strength measure we don't take that second step. We stop with just the information that the result observed is "significant."

In context, this means that if the observation were based on random sampling or randomization, which it is not, it would fall into, say, the 5% tail of the appropriate sampling distribution. This information is then used simply and only as a basis of comparison, not for any inference. Shifting our sights now to that sampling distribution, we know that the relationship in the hypothetical observation just like ours but based on randomization has a certain strength, label it "strong". Everything else being equal (mainly sample sizes and variances), it is necessarily stronger than a relationship not significant at the 10% level and weaker than one that is significant at the 1% level.

By comparison, we can now say that our observed relation in the non-randomized study is "strong", meaning that if it had been a randomized study it would have been "strong" and that it is stronger in this sense than, everything else being equal, a similar result not based on randomness that was not significant at the 10% level, and weaker than one significant at the 1% level -- where the term "significant", again, does not carry the usual implication of proceeding to some stage of inference about a population or about causality. We have only a measure of strength, comparable to a correlation coefficient or a standardized beta coefficient.

My claim, however, is that its near universality or ubiquity makes it exceptionally useful -- and legitimate as long as the "everything else equal" caveat is observed or managed with proper discounting of differences. I gave one example in the post of comparing regression coefficients and Kendall's tau-b's. Another example is sorting out the value of looking further into each of a whole lot of chi-squares (without randomization), where the chi-square values themselves tell us little but the significance levels provide just the sort of exploratory handle we need. And I further claimed that the test is wittingly or unwittingly used in this way overwhelmingly often in the analysis of data, and the use is justified if it is exploratory and used with a good consciousness that sample sizes and variances make a difference.
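The "everything else equal" caveat about sample sizes can be made concrete with a rough calculation. The sketch below is an illustration only: it uses the Fisher z approximation for the p-value of a Pearson correlation (a swapped-in textbook approximation, not a method named in this discussion), the function name `approx_p_for_correlation` is hypothetical, and the values r = 0.2, n = 30 and n = 400 are invented.

```python
import math
from statistics import NormalDist

def approx_p_for_correlation(r, n):
    """Approximate two-sided p-value for a Pearson r via the Fisher z transform."""
    # Under the null, atanh(r) is roughly normal with variance 1 / (n - 3).
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same observed correlation in a small group and in a large one.
p_small = approx_p_for_correlation(0.2, 30)
p_large = approx_p_for_correlation(0.2, 400)
print(f"r = 0.2, n = 30:  p = {p_small:.3f}")
print(f"r = 0.2, n = 400: p = {p_large:.6f}")
```

The identical correlation is far from significant in the small group and highly significant in the large one, so significance levels compare strength fairly only when sample sizes are similar or the difference is discounted.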

Posted by: Lawrence Mohr | 01 February 2007 at 03:09 PM

Professor Henkel's response here tracks his response to my inquiry of a few days ago. I quite agree, but I do think there is a way out: resampling statistics. No doubt Professor Henkel will correct me if I misstep, but, if I'm not mistaken, resampling estimates don't depend in any way on the character of samples themselves; they are drawn entirely from the characteristics of the data being analyzed and have no further inferential extension. In other words, when we see a bootstrapped estimate of standard error, it is telling us that, given the dataset we are working with, the coefficient we are working with has less than, say, 5 chances in 100 of occurring by chance. There is no inference here to anything but a series of "resamples" of the original data; i.e. the estimates are non-parametric and assume nothing about the data collection methods or the structure of the data themselves.

Such estimates would be limited in application - no doubt, one reason why so few of us use them (I avoid them myself) - but they avoid many of the controversial aspects of significance testing. Or so people tell me: finding out more about Professor Henkel's views would be instructive.
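For readers who have not seen the technique, a bootstrapped standard error can be sketched as follows. This is a minimal illustration under stated assumptions: the paired data are invented, the statistic is a Pearson correlation, and the helper names `pearson_r` and `bootstrap_se` are hypothetical. The sketch resamples the observed pairs with replacement and reports how much the coefficient varies across those resamples.

```python
import random
from statistics import stdev

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def bootstrap_se(xs, ys, n_resamples=2000, seed=2):
    """Standard deviation of r across resamples of the observed pairs."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < n_resamples:
        # Draw n pairs with replacement from the data on hand.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        # Skip degenerate resamples in which r is undefined.
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue
        estimates.append(pearson_r(bx, by))
    return stdev(estimates)

# Invented paired observations, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ys = [2.1, 2.9, 3.2, 4.8, 4.1, 6.3, 5.9, 7.7, 8.2, 8.0, 10.1, 9.6]

r = pearson_r(xs, ys)
se = bootstrap_se(xs, ys)
print(f"r = {r:.3f}, bootstrap standard error = {se:.3f}")
```

The number it produces describes variability across resamples of these particular observations; whether that carries over to any population rests on the representativeness assumption raised elsewhere in this thread.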

Posted by: Tracy Lightcap | 31 January 2007 at 10:38 PM

I have been puzzling over Professor Mohr's statement that there is a legitimate interpretation of a significance test on a non-probability sample or non-randomized experiment. Finally, I have concluded that what he is referring to is what might be called a random process model of a phenomenon of interest. In other words, the substantive hypothesis (research hypothesis) is that in the population the variables are randomly paired. Thus a "significant" result would imply that the variables were not randomly paired, or put another way, the variables are related in some nonrandom fashion. If my interpretation of Professor Mohr's position is correct, his thinking parallels that of others. David Gold in an article in The American Sociologist in February of 1969 titled "Statistical Tests and Substantive Significance" presents this perspective. Though I cannot recover a source, it is my recollection that H. Blalock has made a similar suggestion and I think it likely that there are others who have done the same. I am of the opinion, however, that such a perspective has a minefield of caveats and cautions to cross (as, seemingly, does Professor Mohr since he suggests using this approach intelligently and explicitly) if one is to use it "properly". One of my concerns is that it is unlikely that, say in sociological research, variables of interest are not related, and one ends up detecting the obvious by rejecting a random process model, and of course rejecting such a model will often be a function of the size of the non-probability sample or experimental groups. There is in addition the question of the "representativeness" of the sample or pool of experimental subjects of any possible population of interest and thus whether or not there is something to be followed up in future research. But, I'm also confident that cautions such as those I've just expressed are part of what Professor Mohr refers to as "intelligent" use of the approach.

Posted by: Ramon Henkel | 31 January 2007 at 12:58 PM