A test suitable for the simultaneous testing of hypotheses concerning the equality of three or more population means. When samples have been taken from several populations, a question of interest is whether the populations all have the same mean. In the case of m populations, with the mean of population j denoted by μj, the null hypothesis is
H0: μ1=μ2=…=μm,
with the alternative being that H0 is false.
In the case m=2, an appropriate test statistic (assuming the populations have the same variance) is T given by T=(ȳ1−ȳ2)/√{s²(1/n1+1/n2)}, where ȳj is the mean of the nj values sampled from population j, and s² is the pooled estimate of the common variance (see pooled estimate of common mean). The statistic T has an approximate t-distribution with ν=(n1+n2−2) degrees of freedom (the distribution is exact for samples from normal distributions). Denoting the upper 100α % point of a t-distribution with ν degrees of freedom by t(α, ν), H0 is rejected at the 200α % level if |T|>t(α, ν).
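As a minimal sketch of the two-sample case, the statistic T and its degrees of freedom can be computed as follows. The function name and the sample values are illustrative assumptions; the critical value t(α, ν) would still be taken from tables or a statistics library.

```python
import math

def pooled_t_statistic(sample1, sample2):
    """Return (T, nu) for two samples, assuming a common population variance."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Sums of squared deviations about each sample mean.
    ss1 = sum((y - mean1) ** 2 for y in sample1)
    ss2 = sum((y - mean2) ** 2 for y in sample2)
    nu = n1 + n2 - 2                      # degrees of freedom
    s2 = (ss1 + ss2) / nu                 # pooled estimate of the common variance
    t = (mean1 - mean2) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    return t, nu

# Hypothetical data, for illustration only.
t_stat, nu = pooled_t_statistic([5.1, 4.9, 5.6, 5.8], [4.2, 4.4, 4.8, 4.6])
print(t_stat, nu)
```

H0 would then be rejected at the 200α % level if the absolute value of the returned statistic exceeds t(α, ν).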
In the case of m populations, the null hypothesis can be rewritten in the form:
H0: μ1=μ2, μ1=μ3, …, μm−1=μm,
which demonstrates that there are c=½m(m−1) pairs of populations that could be compared. However, if c independent t-tests are performed, each at the α level, then the overall significance level is 1−(1−α)^c rather than α.
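The growth of the overall significance level with the number of populations can be illustrated with a short sketch. It assumes, as the text does, that the c tests are independent; the function name is an illustrative choice.

```python
def overall_significance_level(alpha, m):
    """Overall level when all c = m(m-1)/2 pairs are tested independently at level alpha."""
    c = m * (m - 1) // 2          # number of pairwise comparisons
    return 1 - (1 - alpha) ** c

for m in (2, 3, 5, 10):
    print(m, overall_significance_level(0.05, m))
```

With α=0.05 and m=3 (so c=3), the overall level is about 0.143, nearly three times the nominal value.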
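In the case of equal sample sizes (all n), the quantity t(½α, ν)√(2s²/n), where s² is now the estimate of the common variance pooled over all m samples and ν=m(n−1), is called the least significant difference (LSD). If no two sample means differ by more than this, then H0 may be accepted at the α level.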
One way of reducing the overall significance level is to reduce the value of α for the individual tests. The Bonferroni inequality leads to the replacement of α by α/c: the resulting test is variously known as the Dunn test or as the Bonferroni t-test. A preferable alternative uses the Šidák correction, in which α is replaced by 1−(1−α)^(1/c). However, both tests have rather low power when m is large.
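The two corrections can be compared numerically. The following is a small sketch with illustrative function names: it computes the per-test level for each correction and confirms that, for independent tests, the Šidák level restores an overall level of exactly α.

```python
def bonferroni_level(alpha, c):
    """Per-test level under the Bonferroni correction: alpha / c."""
    return alpha / c

def sidak_level(alpha, c):
    """Per-test level under the Sidak correction: 1 - (1 - alpha)^(1/c)."""
    return 1 - (1 - alpha) ** (1 / c)

alpha, m = 0.05, 4
c = m * (m - 1) // 2                       # 6 pairwise comparisons
b = bonferroni_level(alpha, c)
s = sidak_level(alpha, c)
print(b, s)
# Overall level recovered from the Sidak per-test level (independent tests).
print(1 - (1 - s) ** c)
```

The Šidák level is always at least as large as the Bonferroni level, so the Šidák version is slightly less conservative.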
Tukey suggested using the Studentized range distribution in place of the t-distribution. The resulting test is familiarly called either the Tukey test, the honestly significant difference test, or the HSD test. This test assumes equal sample sizes (all n); modifications for unequal sizes are the Tukey–Kramer test, which uses 1/ni+1/nj in place of 2/n when comparing populations i and j, and the Spjøtvoll–Stoline test, which uses 2/n*, where n* is the smallest of the m sample sizes. The Tukey tests are probably the best choices of all the multiple comparison tests. Similar in spirit to the Tukey tests are the Hochberg test and the Gabriel test; their test statistics are compared with the distribution of the maximum absolute value rather than with that of the Studentized range. The Waller–Duncan test is a test based on the F-test for overall differences between treatments.
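The role of the sample-size term can be sketched as follows, assuming the usual form of the comparison margin, q√(s²·d/2), where q is the critical value of the Studentized range, s² is the pooled variance estimate, and d is 2/n for the HSD test or 1/ni+1/nj for the Tukey–Kramer test. The function names and the numerical values (including the critical value, which in practice is read from tables or a statistics library) are illustrative assumptions.

```python
import math

def comparison_margin(q_crit, s2, size_term):
    """Margin that a difference of two sample means must exceed to be declared significant."""
    return q_crit * math.sqrt(s2 * size_term / 2)

def hsd_margin(q_crit, s2, n):
    """Equal sample sizes: size term 2/n."""
    return comparison_margin(q_crit, s2, 2 / n)

def tukey_kramer_margin(q_crit, s2, n_i, n_j):
    """Unequal sample sizes: size term 1/n_i + 1/n_j."""
    return comparison_margin(q_crit, s2, 1 / n_i + 1 / n_j)

q_crit = 3.77   # assumed illustrative value; take the real one from Studentized range tables
s2 = 4.0        # hypothetical pooled variance estimate
print(hsd_margin(q_crit, s2, 5))
print(tukey_kramer_margin(q_crit, s2, 4, 6))
```

When ni=nj=n the two margins coincide, which is why the Tukey–Kramer test reduces to the HSD test for equal sample sizes.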
An alternative to comparing all pairs simultaneously is to use a multistage test. Suppose that the samples are labelled in order of their means, so that sample 1 has the smallest mean and sample m the largest mean. Initially all m samples are compared. If H0 is accepted, then testing ceases. However, if it is rejected, then the hypotheses μ1=μ2=…=μm−1 and μ2=μ3=…=μm are considered, using the Studentized range values for the comparison of m−1 populations. If a hypothesis is rejected, then comparisons of m−2 populations are made. Successive reductions are made until acceptable hypotheses are found. Examples of this type are Duncan's test (which uses the significance level 1−(1−α)^(l−1) when l means are compared), the Newman–Keuls test (which uses α throughout), and the Ryan–Einot–Gabriel–Welsch (R–E–G–W) test, which uses 1−(1−α)^(l/m) for l<m−1 and α otherwise. A compromise between the Newman–Keuls test and the HSD test is the Tukey wholly significant difference test, which is also called the WSD test or Tukey's b-test.
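The significance levels used at each stage by the three multistage tests can be tabulated with a short sketch. The function names are illustrative, and the R–E–G–W level is taken in its commonly quoted form, 1−(1−α)^(l/m) for l<m−1 and α otherwise, which is an assumption here.

```python
def duncan_level(alpha, l):
    """Duncan's test: level 1 - (1 - alpha)^(l - 1) when l means are compared."""
    return 1 - (1 - alpha) ** (l - 1)

def newman_keuls_level(alpha, l):
    """Newman-Keuls test: level alpha at every stage."""
    return alpha

def regw_level(alpha, l, m):
    """R-E-G-W test: 1 - (1 - alpha)^(l / m) for l < m - 1, and alpha otherwise."""
    if l < m - 1:
        return 1 - (1 - alpha) ** (l / m)
    return alpha

alpha, m = 0.05, 5
for l in range(m, 1, -1):   # from all m means down to pairs
    print(l, duncan_level(alpha, l), newman_keuls_level(alpha, l), regw_level(alpha, l, m))
```

For l>2 Duncan's level exceeds α, whereas the R–E–G–W level is below α for small subsets; this is why Duncan's test is the most liberal of the three and the R–E–G–W test the most conservative.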