The social science controversy over significance testing arises from two, often overlapping, sources, so there is some arbitrariness in saying a given criticism stems from one or the other. The first source concerns the technical requirements of significance tests. The second is essentially a philosophy-of-science question: whether significance tests provide information of value in developing and validating scientific theories (call this process scientific inference). Some hold that in social science research one rarely meets the technical requirements for appropriate use of significance tests, and thus that such tests should not be used; others hold that significance tests can legitimately be used. Likewise, some hold that significance testing in social science, even when the technical requirements are met, contributes nothing of value to the process of scientific inference; others hold that significance testing is a necessary and important part of scientific inference.
From the perspective of technical requirements, one of the most basic conditions for the results of a significance test to be meaningful is that the sample be an appropriate probability sample. As noted in earlier material, this essentially means that the sample must be a simple random sample (SRS). The reason is that the significance-level tables and the calculation formulas for the various test statistics (such as Chi square, Normal, F, t) are derived assuming SRS. As a result, if one employs a probability sampling procedure other than SRS, the calculated probability will be wrong, but generally one does not know by how much or in which direction: too large or too small. If one has not used a probability sampling procedure at all, there is no basis for the probability calculation short of assuming random factors other than those introduced by probability sampling. Such assumptions, though one may make them, cannot be demonstrated to be valid or invalid, and employing them as a basis for statistical inference leads to confusion, since classical statistical inference locates randomness only in the sampling process.
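The effect of violating the SRS assumption can be sketched with a short simulation. The clustered population below is entirely invented: members of a "neighbourhood" resemble one another, and the same sample size n is drawn two ways. The sample mean is far more variable under cluster sampling than the SRS-based formulas assume, so a significance level computed from those formulas would be wrong:

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 100 neighbourhoods ("clusters") of 50 people each.
# A neighbourhood-level shift makes members of a cluster resemble each other.
population = []
for _ in range(100):
    shift = random.gauss(0, 5)
    population.append([random.gauss(50 + shift, 10) for _ in range(50)])
everyone = [x for cluster in population for x in cluster]

def srs_mean(n=50):
    # Simple random sample of 50 individuals from the whole population.
    return statistics.mean(random.sample(everyone, n))

def cluster_mean():
    # A probability sample that is NOT an SRS: one whole neighbourhood (n = 50).
    return statistics.mean(random.choice(population))

srs_sd = statistics.stdev([srs_mean() for _ in range(2000)])
clu_sd = statistics.stdev([cluster_mean() for _ in range(2000)])
print(f"SD of the sample mean under SRS:              {srs_sd:.2f}")
print(f"SD of the sample mean under cluster sampling: {clu_sd:.2f}")
```

Both procedures are probability samples of the same size, yet the cluster-sample mean varies several times as much, which is the sense in which the calculated probability can be wrong by an unknown amount.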
Regarding the issue of sampling, it is difficult to obtain an SRS for the human populations studied by social scientists, since one rarely, if ever, has a list of the population. Even when a list is available, such as a list of eligible voters, such lists are rarely current and complete.
Procedurally, there are frequent misapplications and misinterpretations of significance tests that, for those critical of significance testing, are the basis for rejecting its use.
Among the misuses of inferential techniques: independent-sample techniques applied to correlated samples, use of the wrong sampling distribution, techniques applied to data at an inappropriate level of measurement, and one-tailed tests used when non-directional tests are appropriate.
There are usually population distribution assumptions underlying the use of a particular significance test, such as a normally distributed population. These assumptions are infrequently met in social science research with usually unknown effects on the sampling distribution of the statistic and thus unknown effects on the probability calculation.
Statistically, one may make inferences only to the population from which the proper probability sample has been drawn. Attempts to infer beyond the sampled population, or from non-probability samples, are obviously incorrect, but they occur.
There are often errors in the interpretation of the significance level. A common error is to interpret significance in terms of importance (significance at the .01 level is evidence of a more important result than significance at the .05 level) or as an indication of the strength of a relationship or the probability of the replicability of results.
The choice of the level of significance is not a result of mathematical theory or substantive theory; the level chosen is essentially arbitrary and at best reflects a researcher's sense that the improbable will not happen to him or her.
The power of the test is essentially never known in social science because social science theory is not sufficiently developed to provide the requisite parameters for the null and alternative hypotheses; thus one does not have the sampling distributions needed to calculate power. Without an estimate of the power of the test, one might as well use a box containing red and green beads in proportions reflecting the chosen level of significance, for example 5 red and 95 green for a test at the .05 level, with the null hypothesis rejected on the random selection of a red bead in a single draw from the box (Robert Chandler, "The Statistical Concepts of Confidence and Significance," Psychological Bulletin, 54(5), 1957, pp. 429-430).
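Chandler's bead-box comparison is easy to verify with a short simulation (a sketch; the 5-red/95-green mix follows the example above):

```python
import random

random.seed(1)

# The bead-box "test": 5 red and 95 green beads mimic a test at the .05
# level of significance; drawing a red bead counts as rejecting the null.
box = ["red"] * 5 + ["green"] * 95

def bead_test():
    return random.choice(box) == "red"   # True means "reject the null"

trials = 100_000
rate = sum(bead_test() for _ in range(trials)) / trials
print(f"Rejection rate over {trials} draws: {rate:.3f}")
```

The long-run rejection rate matches the nominal .05 level, which is exactly the point: absent any knowledge of power, the bead box and the significance test behave alike.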
Finally, it is not uncommon for researchers to run all possible combinations of variables and select those which show significance. The practice is usually called data dredging and changes the actual level of significance used to a less stringent level.
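The inflation that data dredging produces can be sketched with a small simulation. It relies on the standard fact that, when the null hypothesis is true, a p-value is uniformly distributed on (0, 1); the choice of 20 tests per run and 20,000 runs is arbitrary:

```python
import random

random.seed(2)

# Model a "dredging" run: 20 unrelated tests on pure noise, each at .05.
# Under a true null a p-value is uniform on (0, 1), so a rejection is
# simply a uniform draw falling below alpha.
def dredge(n_tests=20, alpha=0.05):
    return any(random.random() < alpha for _ in range(n_tests))

trials = 20_000
hit_rate = sum(dredge() for _ in range(trials)) / trials
print(f"Chance of at least one 'significant' result: {hit_rate:.2f}")
print(f"Theoretical value 1 - 0.95**20:              {1 - 0.95**20:.2f}")
```

Reporting only the winners from such a run means the effective level of significance is around .64, not the nominal .05.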
From the perspective of the philosophy of science, there are a number of issues. In any science, developing and validating theories (scientific inference) is generally considered a primary goal, and the concern here is with how significance tests contribute, or fail to contribute, to this goal. Loosely, scientific inference is the process of modifying our degree of belief in the validity of scientific theories. It is important to note that this is a cognitive process of varying our degree of belief in theories and is an incremental process, not an either-or process.
From a positivist perspective [Karl Popper, The Logic of Scientific Discovery], we validate (or invalidate) a theory by using it to make predictions about the real world and then obtaining data from the real world to compare with those predictions. If the real-world data are close to what the theory predicts, we tend to view the data as supporting, or consistent with, the theory; in other words, we have at least not invalidated the theory, though we may be some way from considering it valid. If the data are not close to what the theory predicts, the lack of support is consistent with the theory being invalid. In other words, science is concerned with the change in (or resulting) degree of belief in a theory as a result of an empirical analysis of data bearing on it.
The slipperiest aspect of this oversimplified characterization of how theory is validated is the question of how close is close enough? How many units of whatever is being measured (fertility, crime rate, prejudice ...) may we be from what theory predicts before we begin to doubt the validity of the theory? How big a difference is a difference? Unfortunately, there is no set of universal criteria which can determine what constitutes "close enough" in terms of units of whatever is being measured. If we were to stick to comparisons of closeness in terms of units of measurement of our variables, judgments of "close enough" would probably always have a very large "subjective judgment" component.
It is in this context, the context of how we judge "close enough" that statistical inference (in terms of significance testing) offers one possible criterion. Closeness is evaluated in terms of the probability of differences of a given or greater magnitude between the real world data and the predictions of the theory, not in terms of absolute difference in whatever units of measure we are using.
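As a concrete (invented) illustration of this criterion: suppose a theory predicts a population mean of 100, and a sample of n = 100 yields a mean of 103 with a standard deviation of 15. The raw difference of 3 units says little by itself; the significance-test criterion converts it into a probability:

```python
import math

# Hypothetical numbers: theory predicts a mean of 100; the sample gives
# mean 103, standard deviation 15, n = 100.
predicted, observed, sd, n = 100.0, 103.0, 15.0, 100

# "Closeness" judged as the probability of a difference this large or
# larger if the theory were true, rather than as 3 raw units.
z = (observed - predicted) / (sd / math.sqrt(n))
p_two_tailed = math.erfc(abs(z) / math.sqrt(2))   # two-tailed normal tail area
print(f"z = {z:.2f}, two-tailed p = {p_two_tailed:.4f}")
```

The same 3-unit gap would yield an entirely different probability with a different n or standard deviation, which is what it means to judge closeness probabilistically instead of in units of measurement.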
Returning to the idea of scientific inference, we may characterize scientific inference as addressing the question "What is the probability of the theory, given the data?" (P[T|D]), since we are considering the validity of the theory in terms of the correspondence between theory and empirical data. What significance testing gives us, however, is the probability of the data given the theory (P[D|T]): how likely the results are, assuming the theory tested is true. Thus significance tests provide information not directly relevant to the process of scientific inference, and would seem to contribute little toward it.
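That P[D|T] and P[T|D] can differ sharply is easy to show with Bayes' theorem; all the probabilities below are invented purely for illustration:

```python
# Invented numbers chosen only to show that P[D|T] and P[T|D] need not
# be close; nothing here comes from any real study.
p_T = 0.01             # prior probability that the theory is true
p_D_given_T = 0.80     # probability of the observed data if the theory holds
p_D_given_notT = 0.10  # probability of the same data if it does not

# Bayes' theorem: P[T|D] = P[D|T] * P[T] / P[D]
p_D = p_D_given_T * p_T + p_D_given_notT * (1 - p_T)
p_T_given_D = p_D_given_T * p_T / p_D

print(f"P[D|T] = {p_D_given_T:.2f}")
print(f"P[T|D] = {p_T_given_D:.3f}")
```

Here the data are quite likely if the theory is true (0.80), yet the theory remains quite improbable given the data (about 0.075), because converting one conditional probability into the other requires prior information that a significance test does not use.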
Compounding this disjunction between statistical inference and scientific inference are several additional problems:
Theories apply to hypothetical populations, in other words, to populations that are neither time nor space bound. Statistical inference applies only to currently existing populations as these are the only ones available for sampling. Thus inferential techniques do not inform us about populations of theoretical importance without making assumptions about the extent to which existent populations can be considered probability samples of hypothetical populations.
The significance-test model does not intrinsically allow for the accumulation of knowledge: no Chi square test, t-test, or other standard test incorporates information from prior tests of the same hypothesis, so the model does not reflect the actual scientific process of cumulating knowledge. Were Bayesian approaches sufficiently accepted, the application of Bayes' theorem would allow calculation of the posterior probability of a theory from its prior probability.
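As a sketch of what such a Bayesian accumulation of knowledge would look like, the posterior after one study becomes the prior for the next; all the numbers below are invented for illustration:

```python
# Bayes' theorem applied sequentially: each study's posterior becomes
# the next study's prior.  All probabilities are invented.
def update(prior, p_data_if_true, p_data_if_false):
    evidence = p_data_if_true * prior + p_data_if_false * (1 - prior)
    return p_data_if_true * prior / evidence

belief = 0.10                  # initial degree of belief in the theory
for study in range(1, 4):      # three replications, each mildly supportive
    belief = update(belief, 0.60, 0.30)
    print(f"After study {study}: P[theory] = {belief:.2f}")
```

Each mildly supportive replication raises the degree of belief (here from 0.10 toward roughly 0.47), which is precisely the incremental, cumulative change in belief that the stand-alone significance test does not model.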
The null hypothesis tested in social science is scientifically uninformative as it is known beforehand to be essentially false, and with large enough samples one can usually reject any null hypothesis.
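The large-sample point can be illustrated by simulation: with a trivially small true effect (a mean of 50.1 against a null of 50.0, standard deviation 10; all values invented), the test statistic grows without bound as n increases:

```python
import math
import random
import statistics

random.seed(3)

# Invented numbers: the true population mean is 50.1, the null says 50.0,
# and the standard deviation is 10 -- a substantively trivial effect.
def z_statistic(n, true_mean=50.1, null_mean=50.0, sd=10.0):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    se = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - null_mean) / se

results = {n: z_statistic(n) for n in (100, 10_000, 1_000_000)}
for n, z in results.items():
    print(f"n = {n:>9,}: z = {z:5.1f}")
```

At n = 100 the tiny effect is invisible; at n = 1,000,000 it is "highly significant", even though the effect itself is no more important than before.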
The purpose of a significance test is to reach a decision to reject or fail to reject the tested hypothesis, while scientific inference involves a cognitive change, a change in degree of belief in a theory.
Finally, placing one's faith in a mechanical process (significance testing) instead of careful thought is absurd, particularly when the fate of a theory hangs on the second, third, or subsequent decimal place of a probability calculation.
Many of the technical issues raised by those who consider significance testing not to contribute to the goals of science might be dismissed as nit-picking, but the philosophical issues are less easily set aside. Put another way, proper training might eliminate most of the technical problems (except the difficulty of obtaining proper probability samples), but the philosophical concerns remain valid even if no technical problems exist.