My training is in sociology and, though I have an undergraduate background in mathematics and a graduate background in statistics, I do not consider myself a statistician. As a result, my perspective on significance testing may be different from that of the professional statistician and those trained in the physical sciences or business.
To better understand significance tests and their use and misuse, it seems useful to provide a relatively elementary presentation of where significance tests fit in the framework of statistics and where statistical inference fits in the framework of scientific inference, as well as of other contexts in which statistical inference is used. This elementary (oversimplified) presentation involves a consideration of what comprises statistics, some statistical terminology, some criteria that determine whether significance tests are properly used, and a little philosophy of science to set the stage for evaluating the utility of significance testing in, at least, the social sciences. The proper use of significance tests minimally involves a discussion of sampling procedures, while consideration of the utility of significance testing involves understanding the distinction between statistical and scientific inference.
Statistics, as a discipline, is usually classified into two general areas, descriptive statistics and statistical inference, with statistical inference usually subdivided into hypothesis testing and estimation, with some differences in procedure between scientific research and decision theory uses.
Descriptive statistics are generally numbers which describe some aspect or aspects of some phenomenon of interest, but can also include other ways of presenting a description of the phenomenon such as charts, tables and graphs. It is descriptive statistics which provide substantive knowledge about a population, such as averages (income, education level) and other parameters and indications of relation between variables (crime and level of education as an example).
Statistical inference, for the most part, is a process that does nothing more than attach a probability to a descriptive statistic. That probability tells us how likely it is that one would obtain that calculated (or a more extreme) value for the descriptive statistic using data obtained from the specified population of values by employing a proper probability sampling procedure. In other words, what is obtained is the probability of the data given the hypothesis about the population (P[D|H]). Put another way, the calculated probability assumes that the hypothesis correctly specifies the population; if it does not, the probability is meaningless. These probabilities may be attached either to a point estimate (a single-value descriptive statistic, such as an average, calculated from the sample) or to an interval (usually called a confidence interval) usually centered on the point estimate, constructed such that a set proportion (called the confidence level) of such intervals, properly constructed, will contain the parameter; for any single interval so constructed, however, the probability that it contains the parameter is either 0 or 1.
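The long-run interpretation of a confidence interval can be illustrated by simulation. The sketch below is a hypothetical example using only Python's standard library: it repeatedly draws samples from a population whose mean we know (because we built it), constructs a conventional 95% interval each time, and counts how often the interval covers the parameter. All numbers (mean 50, standard deviation 10, sample size 30) are assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(42)
population_mean = 50.0   # the parameter, known here only because we constructed the population
population_sd = 10.0
n = 30                   # sample size
trials = 1000

covered = 0
for _ in range(trials):
    sample = [random.gauss(population_mean, population_sd) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    lo, hi = m - 1.96 * se, m + 1.96 * se  # 95% interval (z approximation)
    if lo <= population_mean <= hi:
        covered += 1

print(covered / trials)  # close to 0.95
```

Roughly 95% of the intervals cover the parameter, yet for any one interval the statement "it contains the parameter" is simply true or false, which is the 0-or-1 point made above.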
Though most readers of this material will be familiar with at least some statistical terminology, it is probably useful to note that a distinction is made between populations and samples where a population is some well defined set of values and a sample is some subset of these values obtained by some selection process. Characteristics of populations are called parameters while characteristics of samples are called statistics.
Samples are drawn mostly because enumerating a population is often simply infeasible due to lack of sufficient resources and time to do such an enumeration. Intuitively, what one wants in a sample is a miniature version of the population, so that anything true of the sample should be true of the population. Such a sample might be called a representative sample, though "representative sample" is not a part of the technical vocabulary of statistics. The question is, can we develop procedures which will give us "representative" samples in the sense of being a miniature of the population every time we employ them? The answer, generally speaking, is no. There are, however, probability sampling procedures which, "on average in the long run", give us samples that provide reasonably accurate estimates of population characteristics or some means of assessing the likelihood that the estimates are reasonable.
In probability sampling procedures, some random process is employed in selecting elements of the population, while in non-probability sampling procedures randomness is not employed in the selection of population elements. Another way of characterizing probability and non-probability samples is that if it is theoretically possible to calculate the probability of an element of the population ending up in the sample, then the sample is a probability sample. Note that being theoretically possible does not mean that it is simple and easy to do. If it is not theoretically possible to calculate the probability of an element of the population ending up in the sample, the sample is a non-probability sample. Though the process of selecting a probability sample can be quite complex (books are written on the subject), an oversimplified classification of probability procedures for sampling human populations (the underlying ideas can be applied to other types of populations) is provided below.
For most people the term "random sample" applies to what is more specifically called a simple random sample (SRS). Its characteristics are that each element of the population has an equal probability of selection, the selection of each element in the sample is independent of the selection of any other element, and sampling is without replacement (each selected element of the population is not returned to the population and so cannot end up in the sample more than once). An equivalent way of characterizing SRS is that all samples of the same size have an equal probability of selection.
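A simple random sample in this sense is what Python's standard-library `random.sample` produces from a list: selection is without replacement, and every subset of the given size is equally likely. The population below is a hypothetical list of 100 element identifiers used purely for illustration.

```python
import random

random.seed(0)
population = list(range(1, 101))        # hypothetical list of 100 population elements
sample = random.sample(population, 10)  # SRS: without replacement, all size-10 subsets equally likely

# without replacement: no element can appear twice
assert len(set(sample)) == len(sample)
print(sorted(sample))
```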
However, for economic reasons (economy of time and resources) one can group elements of the population by physical proximity or stratify by characteristics of the elements such as age, level of education, or other characteristics of interest.
When a list of the population is not available the procedure of choice is that of grouping by physical proximity. Examples of such groupings are counties, townships, city blocks or other areas delimited by easily identified boundaries such as rivers and roads. The initial sampling will be a random selection of some of the delimited physical areas. The final sample would then be obtained by probability methods within the selected areas. Such samples are often called cluster or area probability samples.
When grouping (stratifying) by characteristics of elements of the population such as age, race, or level of education, one will select elements by some probability process within each stratum of interest. Such samples are often called stratified random samples.
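A minimal sketch of stratified random sampling follows, with hypothetical strata (grouped by level of education), hypothetical sizes, and a 10% sampling fraction within each stratum (proportional allocation); all of these choices are assumptions for illustration.

```python
import random

random.seed(1)
# hypothetical population grouped into strata by level of education
strata = {
    "high_school": [f"hs_{i}" for i in range(60)],
    "college": [f"col_{i}" for i in range(30)],
    "graduate": [f"grad_{i}" for i in range(10)],
}

sample = []
for name, members in strata.items():
    k = max(1, round(0.10 * len(members)))    # 10% sampling fraction within each stratum
    sample.extend(random.sample(members, k))  # simple random sampling within the stratum
print(len(sample))  # 6 + 3 + 1 = 10
```

Because a probability process is used within every stratum, the probability of any element entering the sample is calculable, which is what makes this a probability sample.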
There are other probability procedures, and of course, one can employ more than one probability process in obtaining the sample and such samples are often referred to as complex samples.
Note that both simple random sampling and stratified random sampling require lists of the population.
Non-probability samples can be obtained in many ways. One can employ available cases, such as inmates of institutions like prisons, or members of an organization such as the PTA or a union. One can obtain a haphazard sample by standing on a street corner and including those passers-by who are willing to be included, or one can select cases on the basis that the resulting sample is, in one's judgment, representative of the population one wishes to sample. One can obtain a "snowball" sample by asking each case selected to provide one or more additional cases to be included. The important issue here is that these procedures do not provide the information needed to calculate the probability of an element of the target population ending up in the sample.
Experimental designs introduce randomness by employing some randomization process in assigning cases to the various groups employed in the experiment, minimally an experimental group and a control group. However, randomization of cases in an experiment does not turn a non-probability sample into a probability sample; thus, unless the original pool of cases was selected on a probability basis, one cannot statistically generalize the results to any population.
The concept of a random variable is one of the most important concepts in statistical inference. For statistical inference about a statistic to be legitimate, the statistic must be a random variable. A random variable is one whose obtained values are the result of chance factors and has the characteristic that we cannot predict the next value obtained, but do know the frequency distribution for the hypothetical case of an infinite number of such obtained values. The latter is usually referred to as the sampling distribution of the statistic. This theoretical distribution presents all the theoretically possible values the statistic may take and the relative frequency of occurrence of each of these values.
It is because statistical inference is about random variables that drawing the sample by an appropriate probability process or employing randomization in experimental situations is absolutely critical to the legitimate use of inferential techniques, as it is through this probability process that the obtained statistic attains the randomness needed to qualify as a random variable. However, just because one has employed a probability process, it is not the case that any statistic calculated for the sample is a random variable. One still has to know the theoretical frequency distribution of that statistic for the sampling procedure used, and this frequency distribution is obtained through mathematical derivations, not empirically. In other words, it is not based on a large number of actual samples for which the statistic is calculated and tabulated.
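Although, as just noted, the sampling distributions actually used in inference are mathematical derivations rather than tabulations of repeated samples, the concept itself can be illustrated by brute-force simulation. The hypothetical sketch below builds a skewed population, draws many simple random samples, and tabulates the sample means; the simulated distribution of the statistic centers on the population mean. All numbers are assumptions for illustration.

```python
import random
import statistics

random.seed(2)
# a skewed hypothetical population of 100,000 values (say, incomes in $1000s)
population = [random.expovariate(1 / 40) for _ in range(100_000)]
mu = statistics.mean(population)

# approximate the sampling distribution of the mean for samples of size 50
means = [statistics.mean(random.sample(population, 50)) for _ in range(2000)]

# the simulated sampling distribution is centered on the population mean
print(round(statistics.mean(means), 1), round(mu, 1))
```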
An analytic study or analytic statistic is one in which two or more variables are simultaneously analyzed (correlation, other measures of relationship, analysis of variance, multiple correlation and regression as examples).
Though it is theoretically possible to derive sampling distributions for almost any statistic as long as a probability process was used in drawing the sample, the sampling distributions of analytic statistics are difficult to derive mathematically and may be virtually impossible to derive when the sampling design is complex, though this is not necessarily the case for univariate statistics. As a result, the published sampling distributions available for use by researchers, such as the chi-square, normal, F, and t distributions, assume simple random sampling has been employed.
One of the important consequences of sampling procedures is the scope of the population to which a statistical inference applies. Statistically, one can make inferences only to the population from which a proper probability sample was drawn. Given that one has used a non-probability sample, there is no population to which one can statistically infer based on a test of significance on that non-probability sample. In this connection it is useful to note that statistical inference, as considered from the perspective of the mathematical statistician, treats chance/random factors in only the sampling process. Measurement error and other assumed sources of randomness come into play in other interpretations of significance tests.
Substantive hypotheses are statements about phenomena of interest in one's discipline. A sociologist might hypothesize that level of education varies by social class. In a business context, one might hypothesize that magazine advertisements are more effective at attracting customers than radio advertisements.
Statistical hypotheses are simply statements about population parameter values for one or more populations. An example involving a single parameter (single population) is that the population mean age is 50. An example involving more than one parameter (more than one population, same characteristic) is that there is no difference between the mean ages of populations A and B.
The situation with statistical hypotheses becomes a bit complicated because in terms of logic, one cannot prove a hypothesis to be true. The best one can do is determine that a hypothesis is false (check modus tollens and modus ponens in a logic text). As a result, R. A. Fisher concocted the idea of the null hypothesis, which involves stating the hypothesis to be tested as the logical complement of the hypothesis one wants to demonstrate as valid. Call the hypothesis that one wants to validate the research hypothesis. Fisher called the logical complement of the research hypothesis the null hypothesis. In his scheme, the hypothesis to be validated and its complement exhaust all possibilities, so by rejecting the null hypothesis, the only possibility remaining is that the research hypothesis is valid. Also, in his scheme the null hypothesis was never accepted (i.e., assumed true). It was either rejected or a decision was held in abeyance.
Unfortunately, over time the term "null hypothesis" has taken on more than one meaning, leading to some confusion. One meaning is the one attributable to Fisher, but a common meaning is that the parameter of interest has a value of zero, and in terms of decision theory (briefly discussed below), the null hypothesis is simply the hypothesis tested.
In the Neyman-Pearson perspective on hypothesis testing there are two possible wrong decisions, which are called errors. One can reject a true null hypothesis, an error called a Type I or alpha error (α) (from Fisher's perspective, one can only make a Type I error), or one can fail to reject a false null hypothesis, an error called a Type II or beta error (β). The probability of making a Type I error is referred to as the level of significance. The power of the test is the probability of rejecting a false null hypothesis and is calculated as 1 - β.
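These definitions can be made concrete with a small calculation. The sketch below computes β and the power of a one-sided z-test under purely hypothetical assumptions (null mean 100, true mean 105, σ = 15, n = 36, α = 0.05), using `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

z = NormalDist()
mu0, mu1, sigma, n, alpha = 100.0, 105.0, 15.0, 36, 0.05  # hypothetical values

se = sigma / n ** 0.5                    # standard error of the mean (here 2.5)
crit = mu0 + z.inv_cdf(1 - alpha) * se   # smallest sample mean that rejects the null
beta = z.cdf((crit - mu1) / se)          # P(fail to reject | true mean is mu1): Type II error
power = 1 - beta                         # P(reject | the null is false)
print(round(power, 3))
```

Increasing the sample size or the level of significance shrinks β and raises power, which is the trade-off the Neyman-Pearson framework makes explicit.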
Decision theory is an approach in which one has two point hypotheses (hypotheses specifying a single value for the parameter), one must make a decision between the two, and one can determine the cost of making both a Type I and Type II error. The latter is generally not feasible in scientific research. The two hypotheses are the one to be tested (the null) and an alternative.
The probability required to perform a significance test, depending on the statistic, can be obtained in one of three ways: by directly calculating the probability (for example, using the binomial expansion); by obtaining a test statistic (for example, an F value) which is then used to obtain a probability from a table; or by obtaining a critical value from a table such that the critical value reflects a result occurring at the probability level set as the level of significance. This critical value will then be compared to the statistic calculated from the sample.
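The first route, direct calculation, can be shown with a hypothetical sign-test-style example: under a null hypothesis of p = 0.5 with n = 10 trials, the one-tailed probability of observing 9 or more successes comes straight from the binomial expansion.

```python
from math import comb

n, k, p = 10, 9, 0.5  # hypothetical data: 9 successes in 10 trials; H0: p = 0.5
# one-tailed p-value: probability of k or more successes under the null
p_value = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))
print(p_value)  # 11/1024, about 0.0107
```

Because the probability is computed exactly from the hypothesized population, no table of a test statistic is needed.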
Procedurally, a test of significance involves two hypotheses, the null hypothesis and an alternative hypothesis, where the alternative may be non-directional (when theory or prior experience cannot specify in which direction from the parameter specified by the null hypothesis the true parameter lies), directional (requires theory or prior experience to predict direction), or a point hypothesis. In order to reject the null hypothesis, the results must simultaneously be unlikely if the null hypothesis is true and likely if the alternative is true. Given these two conditions, the critical region(s) are always put in the tails (or tail, if a directional or point alternative) of the sampling distribution. Since the alternative must by definition be other than the null (the two would not posit identical parameters), the critical region is placed where it maximizes the probability of detecting a false null, which means away from the parameter representing the null and in the direction of the parameter specified by the alternative.