My training is in sociology and, though I have an
undergraduate background in mathematics and a graduate background in statistics,
I do not consider myself a statistician. As a result, my perspective on significance testing may be different
from that of the professional statistician and those trained in the physical
sciences or business.

To better understand significance tests and their use and misuse, it seems useful to give a relatively elementary presentation of where significance tests fit in the framework of statistics, and of where statistical inference fits in the framework of scientific inference and in the other contexts in which it is used. This elementary (admittedly oversimplified) presentation involves a consideration of what comprises statistics, some statistical terminology, and some criteria that determine whether significance tests are properly used, along with a little philosophy of science to set the stage for evaluating the utility of significance testing in, at least, the social sciences. The proper use of significance tests minimally involves a discussion of sampling procedures, while consideration of the utility of significance testing involves understanding the distinction between statistical and scientific inference.

**Statistics**, as a discipline, is usually classified into two general areas, descriptive statistics and statistical inference, with statistical inference usually subdivided into hypothesis testing and estimation, and with some differences in procedure between scientific research and decision-theoretic uses.

**Descriptive statistics** are generally numbers which
describe some aspect or aspects of some phenomenon of interest, but can also
include other ways of presenting a description of the phenomenon such as
charts, tables and graphs. It is
descriptive statistics which provide substantive knowledge about a population,
such as averages (income, education level) and other parameters and indications
of relation between variables (crime and level of education as an example).

**Statistical inference**, for the most part, is a
process that does nothing more than attach a probability to a descriptive
statistic. That probability tells us how
likely it is that one would obtain that calculated (or more extreme) value for
the descriptive statistic using data obtained from the specified population of
values by employing a proper probability sampling procedure. In other words, what is obtained is the
probability of the data given the hypothesis about the population
(P[D|H]). Put even another way, the
probability calculated assumes that the hypothesis correctly specifies the
population; if it does not, the probability is meaningless. These probabilities may be attached to either
a point estimate (a single value descriptive statistic such as an average,
calculated from the sample) or to an interval (usually called a **confidence
interval**) usually centered on the point estimate, such that a set proportion
(called the **confidence level**) of
such intervals, properly constructed, will contain the point estimate, but for
any interval so constructed, the probability that it contains the parameter is
either 0 or 1.
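This long-run property of confidence intervals can be illustrated with a minimal simulation. The population values, sample size, and confidence level below are invented for illustration; a normal approximation is assumed for the interval.

```python
import random
import statistics

# Hypothetical population with a knowable parameter (invented for
# illustration; in practice the parameter is unknown).
random.seed(1)
population = [random.gauss(50, 10) for _ in range(100_000)]
mu = statistics.mean(population)  # the parameter

n, z = 100, 1.96  # sample size; z for a 95% confidence level
trials = 1000
covered = 0
for _ in range(trials):
    sample = random.sample(population, n)  # simple random sample
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    # Any one interval either contains mu or it does not (probability
    # 0 or 1); the 95% refers to the long-run proportion of such
    # intervals that contain mu.
    if xbar - z * se <= mu <= xbar + z * se:
        covered += 1

print(covered / trials)  # close to 0.95
```

The point of the simulation is exactly the distinction made above: the confidence level describes the procedure over many repetitions, not any single computed interval.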

Though most readers of this material will be familiar with
at least some **statistical terminology**,
it is probably useful to note that a distinction is made between populations
and samples, where a population is some well-defined set of values and a sample
is some subset of these values obtained by some selection process. Characteristics of populations are called *parameters* while characteristics of
samples are called *statistics*.

**Samples** are drawn mostly because enumerating a
population is often simply infeasible due to lack of sufficient resources and
time to do such an enumeration. Intuitively, what one wants in a sample is a miniature version of the population, so that anything true of the sample should be true of the population. Such a sample might be called a representative sample, though "representative sample" is not part of the technical vocabulary of statistics. The question is, can we develop procedures which will give us "representative" samples, in the sense of being miniatures of the population, every time we employ them? The answer, generally speaking, is **no**. There are, however, probability sampling procedures which, "on average in the long run", give us samples that provide reasonably accurate estimates of population characteristics or some means of assessing the likelihood that the estimates are reasonable.

In **probability sampling** procedures, some random
process is employed in selecting elements of the population while in **non-probability
sampling** procedures randomness is not employed in the selection of
population elements. Another way of characterizing probability and
non-probability samples is this: if it is theoretically possible to calculate the
probability of an element of the population ending up in the sample, then the
sample is a probability sample. Note
that being theoretically possible does not mean that it is simple and easy to
do. If it is not theoretically possible
to calculate the probability of an element of the population ending up in the
sample, the sample is a non-probability sample. Though the process of selecting a probability sample can be quite complex (books are written on the subject), an oversimplified classification of probability procedures for sampling human populations (the underlying ideas can be applied to other types of populations) is provided below.

For most people the term "random sample" applies to what is more specifically called a **simple random sample** (**SRS**). Its characteristics are that each element of the population has an equal
probability of selection, the selection of each element in the sample is independent
of the selection of any other element, and sampling is without replacement
(each selected element of the population is not returned to the population and
so cannot end up in the sample more than once). An equivalent way of characterizing SRS is that all samples of the same
size have an equal probability of selection.
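The equal-probability characteristic of an SRS can be checked empirically with a small simulation (the population and sample size are invented for illustration). Each element's long-run selection rate should be about n/N.

```python
import random
from collections import Counter

# Repeated simple random samples from a small population, tallying how
# often each element is selected.
random.seed(0)
population = list(range(10))   # N = 10 elements (invented)
n = 3                          # sample size, so each element's
                               # selection probability is n/N = 0.3
counts = Counter()
draws = 20_000
for _ in range(draws):
    # random.sample draws without replacement, as SRS requires.
    for element in random.sample(population, n):
        counts[element] += 1

rates = {e: counts[e] / draws for e in population}
print(rates)  # each close to 0.3
```

Note that this checks only the equal-probability property; the stronger characterization is that every sample of size n is equally likely.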

However, for economic reasons (economy of time and
resources) one can *group elements of the
population by physical proximity* or *stratify
by characteristics of the elements* such as age, level of education, or
other characteristics of interest.

When a list of the population is not available the procedure
of choice is that of grouping by physical proximity. Examples of such groupings are counties,
townships, city blocks or other areas delimited by easily identified boundaries
such as rivers and roads. The initial sampling will be a random selection of
some of the delimited physical areas. The final sample would then be obtained by probability methods within
the selected areas. Such samples are
often called **cluster** or **area probability** samples.

When grouping (stratifying) by characteristics of elements
of the population, such as age, race, or level of education, one will select
elements by some probability process within each stratum of interest. Such samples are often called **stratified random** samples.

There are other probability procedures, and of course, one
can employ more than one probability process in obtaining the sample and such
samples are often referred to as **complex
samples**.

Note that both simple random sampling and stratified random sampling require lists of the population.

**Non-probability samples** can be obtained in many
ways. One can employ available cases,
inmates of institutions such as prisons, members of an organization such as the
PTA or a union. One can obtain haphazard samples by standing on a street corner and including those passers-by who are willing to be included, or one can select cases on the basis that the resulting sample is, in one's judgment, representative of the population one wishes to sample. One can obtain a "snowball" sample by asking each case selected to provide one or more additional cases to be included. The important issue here is that these procedures *do not provide the information needed to calculate the probability of an element of the target population ending up in the sample*.

**Experimental designs** introduce randomness by
employing some randomization process in assigning cases to the various groups
employed in the experiment, minimally an experimental group and a control
group. However, randomization of cases
in an experiment does *not* turn a non-probability sample into a
probability sample; thus, unless the original pool of cases was selected on a
probability basis, one cannot generalize the results *statistically* to
any population.

The concept of a **random variable** is one of the most
important concepts in statistical inference. For statistical inference about a statistic to be legitimate, the
statistic must be a random variable. A
random variable is one whose obtained values are the result of chance factors
and has the characteristic that we cannot predict the next value obtained, but
do know the frequency distribution for the hypothetical case of an infinite
number of such obtained values. The
latter is usually referred to as the **sampling distribution** of the
statistic. This theoretical distribution
presents all the theoretically possible values the statistic may take and the
relative frequency of occurrence of each of these values.
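The concept can be made concrete by simulation. To be clear, as noted below, actual sampling distributions are obtained through mathematical derivation, not empirically; the simulation here (with an invented population) only illustrates what the concept means for the sample mean.

```python
import random
import statistics

# Invented skewed population with mean about 50.
random.seed(3)
population = [random.expovariate(1 / 50) for _ in range(50_000)]

# Draw many simple random samples and record the statistic (the mean)
# for each; the tabulated values approximate the sampling distribution.
n = 64
means = [statistics.mean(random.sample(population, n))
         for _ in range(5000)]

# The sampling distribution of the mean centers on the population mean,
# with spread roughly sigma / sqrt(n) (the standard error).
print(statistics.mean(means), statistics.stdev(means))
```

Any single sample mean is unpredictable (it is a random variable), but the frequency distribution of the statistic over hypothetical repetitions is knowable, which is exactly what a significance test relies on.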

It is because statistical inference is about random variables that drawing the sample by an appropriate probability process or employing randomization in experimental situations is absolutely critical to the legitimate use of inferential techniques, as it is through this probability process that the obtained statistic attains the randomness needed to qualify as a random variable. However, just because one has employed a probability process, it is not the case that any statistic calculated for the sample is a random variable. One still has to know the theoretical frequency distribution of that statistic for the sampling procedure used, and this frequency distribution is obtained through mathematical derivations, not empirically. In other words, it is not based on a large number of actual samples for which the statistic is calculated and tabulated.

An *analytic study* or *analytic statistic* is one
in which two or more variables are simultaneously analyzed (correlation, other
measures of relationship, analysis of variance, multiple correlation and
regression as examples).

Though it is theoretically possible to derive the sampling distribution of almost any statistic, so long as a probability process was used in drawing the sample, it is difficult to mathematically derive the sampling distributions of analytic statistics, and it may be virtually impossible when the sampling design is complex, though this is not necessarily the case for univariate statistics. As a result, the published sampling distributions for various statistics available for use by researchers, such as chi-square, normal, F, and t, assume that simple random sampling has been employed.

One of the important consequences of sampling procedures is
the scope of the population to which a statistical inference applies. *Statistically*, one can make inferences
only to the population from which a proper probability sample was drawn. Given that one has used a non-probability
sample, there is no population to which one can statistically infer based on a
test of significance on that non-probability sample. In this connection it is useful to note that
statistical inference, as considered from the perspective of the mathematical
statistician, treats chance/random factors in **only** the sampling
process. *Measurement error and other assumed sources of randomness* come into
play in other interpretations of significance tests.

**Substantive hypotheses** are statements about phenomena
of interest in one's discipline. A sociologist might
hypothesize that level of education varies by social class. In a business context, one might hypothesize
that magazine advertisements are more effective at attracting customers than
radio advertisements.

**Statistical hypotheses** are simply statements about
population parameter values for one or more populations. An example involving a single parameter
(single population) is "the population mean age is 50." An example involving more than one parameter
(more than one population, same characteristic) is "there is no difference
between the mean ages of populations A and B."

The situation with statistical hypotheses becomes a bit
complicated because in terms of logic, one cannot prove a hypothesis to be
true. The best one can do is determine
that a hypothesis is false (check modus tollens and modus ponens in a logic
text). As a result, R. A. Fisher
concocted the idea of the **null hypothesis**, which involves stating the
hypothesis to be tested as the *logical complement* of the hypothesis one
wants to demonstrate as valid. Call the
hypothesis that one wants to validate the research hypothesis. Fisher called the logical complement of the
research hypothesis the null hypothesis. In his scheme, the hypothesis to be validated and its complement exhaust
all possibilities, so by rejecting the null hypothesis, the only possibility
remaining is that the research hypothesis is valid. Also, in his scheme the null hypothesis was
never accepted (i.e., assumed true). It
was either rejected or a decision was held in abeyance.

Unfortunately, over time the term "null hypothesis" has taken on more than one meaning, leading to some confusion. One meaning is the one attributable to Fisher, but a common meaning is that the parameter of interest has a value of zero, and in terms of decision theory (briefly discussed below), the null hypothesis is simply the hypothesis tested.

In the Neyman-Pearson perspective on hypothesis testing there are two possible wrong decisions, which are called
errors. One can reject a true null hypothesis,
and this type of error is called a **Type I or alpha error (α)** (from
Fisher's perspective, one can only make a Type I error), or one can fail to reject a
false null hypothesis, and this error is called a **Type II or beta error (β)**. The probability of making a Type I error is
referred to as the **level of significance**. The **power
of the test** is the probability of rejecting a false null hypothesis and is
calculated as 1 - **β**.
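These quantities can be computed directly for a simple case. The sketch below works out β and power for a one-sided z test of a population mean; the hypothesized and alternative values, σ, and n are invented for illustration.

```python
from statistics import NormalDist

# One-sided z test of H0: mu = 50 against the specific alternative
# mu = 53, with known sigma (all values invented for illustration).
sigma, n, alpha = 10, 25, 0.05
se = sigma / n ** 0.5  # standard error of the mean

z = NormalDist()
# Critical value: reject H0 when the sample mean exceeds this cutoff.
# By construction, P(reject | H0 true) = alpha (the Type I error rate).
cutoff = 50 + z.inv_cdf(1 - alpha) * se

# Beta: probability of failing to reject when mu is really 53.
beta = NormalDist(53, se).cdf(cutoff)
power = 1 - beta
print(round(beta, 3), round(power, 3))
```

Note how power depends on the specific alternative: moving the alternative mean further from 50, or increasing n, shrinks β and raises 1 - β.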

**Decision theory** is an approach in which one has two
point hypotheses (hypotheses specifying a single value for the parameter), one
must make a decision between the two, and one can determine the cost of making both
a Type I and Type II error. The latter
is generally not feasible in scientific research. The two hypotheses are the one to be tested
(the null) and an alternative.

The probability required to perform a significance test,
depending on the statistic, can be obtained in one of three ways: by directly calculating the probability (for
example using the binomial expansion); by obtaining a **test statistic**
(for example an F value) which is then used to obtain a probability from a
table, or by obtaining a **critical value** from a table such that the
critical value reflects a result occurring at the probability level set as the
level of significance. This critical
value will then be compared to the statistic calculated from the sample.
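The first approach, direct calculation, can be shown with the binomial expansion. The coin-tossing numbers below are invented for illustration.

```python
from math import comb

# Direct calculation of the probability of the observed or a more
# extreme result under the null hypothesis (a fair coin, p = 0.5).
# Observed (invented): 9 heads in 10 tosses; "more extreme" here means
# 9 or 10 heads.
n, k, p = 10, 9, 0.5
p_value = sum(comb(n, i) * p**i * (1 - p)**(n - i)
              for i in range(k, n + 1))
print(p_value)  # (10 + 1) / 1024, about 0.0107
```

With a 0.05 level of significance, this result would lead to rejecting the null hypothesis; no table or test statistic is needed because the probability is computed exactly.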

Procedurally, a test of
significance involves two hypotheses, the null hypothesis and an alternative
hypothesis where the alternative may be non‑directional (when theory or prior
experience cannot specify in which direction from the parameter specified by
the null hypothesis the true parameter lies), directional (requires theory or
prior experience to predict direction) or a point hypothesis. In order to reject the null hypothesis,
simultaneously the results must be unlikely if the null hypothesis is true and
likely if the alternative is true. Given
these two conditions, the critical region(s) are always put in the tails (or
tail if a directional or point alternative) of the sampling distribution. Since the alternative must by definition be
other than the null (the two would not posit identical parameters) the critical
region is placed where it maximizes the probability of detecting a false null
which means away from the parameter representing the null and in the direction of the parameter specified
by the alternative.
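The placement of the critical region can be made concrete with critical values for a z statistic, assuming (for illustration) a standard normal sampling distribution and a 0.05 level of significance.

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()

# Non-directional alternative: alpha is split between the two tails.
two_tailed = z.inv_cdf(1 - alpha / 2)   # reject if |z| > ~1.96
# Directional alternative: all of alpha goes in the predicted tail.
one_tailed = z.inv_cdf(1 - alpha)       # reject if z > ~1.645

print(round(two_tailed, 3), round(one_tailed, 3))
```

This is why a directional test rejects at a smaller departure from the null (1.645 versus 1.96 here): concentrating the critical region in the predicted tail maximizes the probability of detecting a false null, provided theory or prior experience justifies the predicted direction.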

Professor Henkel:

You did indeed understand my question and I share your reservations. I think the re-sampling techniques may show more promise. Thanks for taking the time to share your expertise with us.

Posted by: Tracy Lightcap | 30 January 2007 at 11:51 AM

Ramon, this was quite an interesting and informative primer. Thanks for such a clear exposition. bh.

Posted by: William Henderson | 29 January 2007 at 09:02 PM

Professor Henkel:

What's the current thinking about the use of significance testing with cross-sections of populations? When I was back in school, we normally got the Blalock take on this: that population cross-sections could be seen as a sample of a set of possible population values, presumably extending over time. Thus, normal significance testing could be applied, provided the limitations were kept in mind. Today, if I'm not mistaken, there has been a vogue for non-parametric estimates of error, usually drawn from bootstrap procedures. These new estimates can then be used to conduct the usual significance tests.

Of course, if I showed more industry and had the maths, I could determine the state of opinion on these matters for myself. Lacking both, might I ask you to expound on this? The question is still pressing; one sees studies using usual significance tests on populations regularly. Thanks for participating here.

Professor Lightcap:

Since I retired from teaching over a decade ago and have not kept up with the literature, I can't really say what the current thinking is regarding population cross-sections, and I'm not certain I understand your question. I think your question is "Can an existing population be considered a probability sample of the hypothetical population unbound by time or space?" If that is the question, my position is that a yes answer involves making an additional assumption that cannot be demonstrated to be either true or false. But from the perspective of a social scientist (yes, Blalock was a social scientist, as was Hagood, who raised this justification many, many years ago), I'm not at all convinced that existing populations are samples, for many characteristics of interest, from the hypothetical population unbound by space and time to which theories usually apply.

As to the non-parametric approach, if the sample isn't a probability sample of the population to which one wishes to make inferences, one simply cannot use mathematical techniques to make the sample a probability sample, making the use of probability theory (statistical inference) questionable at best.

Posted by: Tracy Lightcap | 29 January 2007 at 08:38 PM