Professor Henkel has set out the basic logic of significance testing. His background, however, includes much thought and experience connected with the uses and misuses of testing, a topic to which I hope he will return in this discussion. Let me start by offering a few thoughts of my own on this slightly more advanced topic.
As I see it, there are two legitimate technical uses of significance testing in social science: inference to a population parameter and inference to causality. The first is the one chiefly covered in Prof. Henkel's introduction. Some would say that the second use, causality, is just a subset of the first. Be that as it may, I think it's at least convenient to view it separately.
To use testing in the first way, one must have a probability sample, call it a random sample for convenience, of a specific larger population. One calculates a statistic on the sample and uses testing to make an inference to the population. In the overwhelming majority of cases in social science, this inference is about a relationship between two or more variables. Again in the overwhelming majority of cases, the null hypothesis tested is that the relationship in the population is zero. If the result is significant at, say, the 5% level, then we can infer with small chance of error that the relationship in the population is not zero. This use is completely legitimate. It comes up in survey research based on random samples all the time. Note, however, that the test tells us nothing about whether the nonzero relationship in the population is causal. Most survey researchers know that they have to jump through a lot of hoops in order to reach this further conclusion even tentatively.
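The logic of this first use can be sketched in a few lines of code. The sketch below is my own illustration, not anything from Prof. Henkel's introduction, and the "random sample" in it is fabricated; it tests the null hypothesis that the population correlation is zero, using the Fisher z-transform as an approximation:

```python
import math
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def corr_pvalue(xs, ys):
    """Two-sided p-value for the null hypothesis that the population
    correlation is zero, via the Fisher z-transform (an approximation
    adequate for moderate sample sizes)."""
    n = len(xs)
    z = math.atanh(corr(xs, ys)) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

# Fabricated "random sample" of 100 cases with a built-in relationship.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100)]
ys = [0.8 * x + random.gauss(0, 1) for x in xs]

p = corr_pvalue(xs, ys)
if p < 0.05:
    print("infer, with small chance of error, a nonzero population relationship")
```

A small p-value here licenses only the inference that the population relationship is not zero; as the text stresses, it says nothing about whether that relationship is causal.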
The second legitimate use is to infer causality in a randomized experiment. The logic involved is not that of statistics alone, but has to do as well with the nature of the experimental design. If, say, an experimental and a control group are selected by randomization and if an experimental treatment is then administered only to the first group, then the two groups should differ on the outcome variable only (a) by virtue of a random sampling or randomization vagary that leaves the two groups somewhat unequal on this outcome variable or (b) by virtue of the experimental treatment. If a test is carried out with a significant result, it tells us that there is less than, say, a 5% chance that a randomization vagary alone would yield a difference as large as the one we observed. In that case, we can legitimately infer that the cause of at least some of the observed difference was the experimental treatment, i.e., that the causal connection is greater than zero. Note that the result in itself says nothing about causality in any larger population or even in this very group at a different time. This second legitimate use is encountered often in psychology and sometimes, but not as often, in the other social sciences.
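The randomization logic just described can be mimicked directly on a computer. The following sketch, my own illustration with fabricated outcome scores, asks exactly the question posed above: how often would random re-assignment alone produce a group difference as large as the one observed?

```python
import random

def randomization_pvalue(treat, control, n_perm=10000, seed=1):
    """Share of random re-assignments of subjects to two groups that
    yield a mean difference at least as large as the observed one."""
    observed = abs(sum(treat) / len(treat) - sum(control) / len(control))
    pooled = treat + control
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t, c = pooled[:len(treat)], pooled[len(treat):]
        if abs(sum(t) / len(t) - sum(c) / len(c)) >= observed:
            hits += 1
    return hits / n_perm

# Fabricated outcome scores: the treatment group is shifted upward.
rng = random.Random(0)
treat = [rng.gauss(1.5, 1.0) for _ in range(30)]
control = [rng.gauss(0.0, 1.0) for _ in range(30)]

p = randomization_pvalue(treat, control)
if p < 0.05:
    print("a randomization vagary alone is an unlikely explanation")
```

Again, a small p-value licenses only the inference that the treatment caused some of the difference in this group at this time; it says nothing about any larger population.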
Before going on to illegitimate uses, note how puny these first two uses will usually be. It has often been justly remarked that a nonzero causal or population relationship tells us very little. Minuscule relationships may well not be very important. Again, there are hoops to jump through in order to ascertain that the causal or population relationship is strong enough to be interesting.
There are several illegitimate uses of testing in abundant evidence. One type is to infer something about a population parameter on the basis of a sample that is not a random sample, as emphasized in Henkel's introduction. Another is to infer that a factor introduced or occurring in one of two observed groups is causal when the two groups have not been selected by randomization. Unfortunately, an investigator will rarely come out and make one of these illegitimate inferences explicitly, but it is not uncommon, in my experience, for one to be hinted at or implied by the subtleties of the language used. Explicit or implied, the inference is illegitimate.
The illegitimate use I want to emphasize here, however, is the common practice of testing when there is no random sample and there is no randomization. Testing is almost knee-jerk in social science, but random samples and randomization are not all that common. We look at some group and find a relationship between two variables and test it for significance. Why? What can we possibly learn? Without random sampling or randomization, nothing. (I'll qualify this below.) Far better to forget the significance testing entirely and look at other aspects of the data for interesting findings. Furthermore, the testing might be considered dangerous because it can subtly lull us into thinking that, on this basis, the relationship is perhaps more general or perhaps causal. Too often, a significant result can lead us to think that the observed relationship is nonchance and therefore "real". Of course it's real; you observed it. You don't need a significance test to tell you that. If "real" is meant to imply more general or causal, the test is being used illegitimately.
However, this apparently vacuous and potentially misleading application of the test is so very common that one is tempted to think there may be some hidden justification behind it. I would urge that there is indeed a hidden justification, but since it's not in the textbooks and not often spoken about very explicitly, it may well be hidden even from the investigator who is in some sense leaning on it.
The justification I have in mind is not very powerful, perhaps, but it is legitimate and can be both convenient and appealing. It is the use of the significance test as a universal measure of strength. If one looks at a group of interest and finds a statistically significant relationship between two (or more) variables, say X and Y, one knows that the relationship is in some sense strong. The sense is the following: The relationship is strong enough that if groups had been assigned by randomization instead of by values of the X variable, the corresponding pattern of differences on the Y variable would not have occurred as much as, say, 5% of the time. That relation might often be considered strong enough to be interesting in context. If the result is significant at, say, the .001 level, the relationship is even stronger, and so on. If it is not statistically significant, it is in this same sense weak, and perhaps too weak to figure prominently in further thinking on the subject. There is no pretending here that the significant relation is causal or more general, just strong in one narrow sense.
The sense is appealing, though, because it's so easy to find the test results, and they can be used in this way with any statistic that has a known sampling distribution. In a rough but nontrivial way, one can even compare results of two X variables for the same Y using two different statistics, a regression coefficient on one, for example, and a Kendall's tau on the other. If one statistic is significant at the 1% level and the other is not even significant at the 10% level, then the relation of Y to the first X is strong and to the second weak, or at least the first X is, in this rough but perhaps meaningful way, considerably more strongly related to Y than the second. There are important caveats having to do with sample sizes and variances that do not bear exploring in this context, but if the test is used in this way with a certain amount of expertise, both in methods and in the subject matter area, and with a certain amount of caution and common sense, it is a huge boon to research.
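A sketch of this comparative use might look as follows. The data are fabricated, with one X built to relate strongly to Y and the other barely at all, and the permutation p-values stand in, approximately, for the exact sampling distributions of the two different statistics:

```python
import random

def kendall_tau(xs, ys):
    """Kendall's tau from concordant/discordant pairs (no ties assumed)."""
    n = len(xs)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            s += 1 if (xs[i] - xs[j]) * (ys[i] - ys[j]) > 0 else -1
    return 2 * s / (n * (n - 1))

def slope(xs, ys):
    """Least-squares regression coefficient of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def perm_pvalue(stat, xs, ys, n_perm=2000, seed=1):
    """Two-sided permutation p-value of any statistic: the share of
    random re-pairings of x with y that give a value as extreme as
    the observed one."""
    observed = abs(stat(xs, ys))
    ys = list(ys)  # work on a copy; leave the caller's data intact
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ys)
        if abs(stat(xs, ys)) >= observed:
            hits += 1
    return hits / n_perm

# Fabricated data: y depends strongly on x1, barely on x2.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(60)]
x2 = [rng.gauss(0, 1) for _ in range(60)]
y = [a + 0.05 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

p1 = perm_pvalue(slope, x1, y)        # regression coefficient for x1
p2 = perm_pvalue(kendall_tau, x2, y)  # Kendall's tau for x2
```

In the narrow sense of the text, the much smaller p-value for x1 marks it as more strongly related to Y than x2, even though the two p-values come from different statistics; the caveats about sample sizes and variances noted above still apply.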
My suggestion is that significance testing may be used in just this way quite often, but without the investigator's being explicit about it. Even in that case, he or she could be deriving profit from the application. Avoiding the misuses reviewed above and using the test intelligently and explicitly in this tentative and preliminary way could introduce a healthy dose of transparency into social science research.