Introducing Power Analysis
A hypothesis test tells us the probability of our result (or a more extreme result) occurring if the null hypothesis is true. If that probability is lower than a pre-specified value (alpha, usually 0.05), the null hypothesis is rejected. Hypothesis testing can be likened to a search process: we are searching for evidence against the null hypothesis, in the same way that we might search for (say) the presence of a chemical in an environment.
The ability to reject the null hypothesis depends upon three factors: alpha, the sample size, and the effect size.
Power analysis allows us to make sure that we have looked hard enough to find the effect, if there is enough of it there to bother us. The size of the thing we are looking for is known as the "effect size." Several methods exist for deciding what effect size we would be interested in; different statistical tests have different effect size measures, but the general principle is the same.
Use conventions. Cohen (again) has defined small, medium and large effect sizes for many types of test. These form useful conventions and can guide you if you know approximately how strong the effect is likely to be.
Three types of power analysis exist: a priori, post hoc, and compromise. This workshop will focus on the first two; compromise power analysis is more complex, rarely used, and slightly controversial.
1. A Priori Power Analysis
Ideally, power analysis is carried out a priori, that is, during the design stage of the study. A study can conclude that the null hypothesis is true or false; the real answer, i.e. the state of the world, can also be that the hypothesis is true or false, giving four possible outcomes.
Given the three factors alpha, sample size and effect size, a fourth variable can be calculated, called beta. Where alpha is the probability of a Type I error (rejection of a true null hypothesis), beta is the probability of a Type II error (acceptance of a false null hypothesis). Power, the probability of correctly rejecting a false null hypothesis, is 1 − beta.
A power analysis program can be used to determine power given the values of alpha, sample size and effect size. If the power is deemed to be insufficient, steps can be taken to increase the power (most commonly, but not exclusively, by increasing the sample size).
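The relationship between these quantities can be sketched in a few lines of code. The following is a minimal illustration for a one-tailed two-sample test, using a normal approximation rather than the exact noncentral t distribution that programs such as SamplePower use, so the figures it produces are close to, but not identical to, the exact values:

```python
from statistics import NormalDist
from math import sqrt

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a one-tailed two-sample test for standardized
    effect size d (normal approximation to the noncentral t distribution)."""
    z = NormalDist().inv_cdf
    z_crit = z(1 - alpha)               # critical value for one-tailed alpha
    ncp = d * sqrt(n_per_group / 2)     # expected value of the test statistic
    return NormalDist().cdf(ncp - z_crit)

# power rises with sample size for a medium effect (d = 0.5)
for n in (20, 50, 100):
    print(n, round(approx_power(0.5, n), 3))
```

Holding alpha and the effect size fixed, the loop shows power climbing towards 1 as the sample size grows, which is the trade-off the power analysis program explores for us.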
We will use the previous example of Kraemer and Thiemann, in which we are interested in an increase, approximately equivalent to 1 decade, caused by drinking coffee. Figure 1 shows the first screen of the SamplePower program in which we tell the program what sort of statistical test we will be using.
In this case it is a t-test with two independent groups. We know that the average score for men in their 40s on the CMI is 8, and that the SD is 7. We have decided that an increase of 3.5 (the increase associated with ageing one decade) will be sufficient to concern us. (This increase would, in Cohen's classifications, be a medium effect size.)
Figure 2 shows the sample size for 80% power (usually considered to be sufficient). This would require an N of 51 per group, or a total N of 102.
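This sample size can be approximated by hand. The sketch below uses the standard normal-approximation formula for a one-tailed two-sample test; it slightly understates the answer (the exact calculation with the noncentral t distribution, as SamplePower performs, gives 51 per group):

```python
from statistics import NormalDist
from math import ceil

z = NormalDist().inv_cdf
sd, difference = 7.0, 3.5
d = difference / sd                 # Cohen's d = 0.5, a medium effect
alpha, power = 0.05, 0.80

# normal-approximation sample size per group, one-tailed two-sample test
n_per_group = 2 * ((z(1 - alpha) + z(power)) / d) ** 2
print(ceil(n_per_group))  # ~50; the exact noncentral-t answer is 51
```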
A graph can be a useful way of making the best decision regarding the trade-off between power and sample size. The graph drawn by SamplePower is shown in Figure 3. It can be seen that the rate of increase in power starts to reduce dramatically at around the 80% power figure. An appropriate sample size can then be determined, based on expense and desired power.
A final and very useful feature of SamplePower is its ability to generate a text report, to communicate the findings of the power analysis.
One goal of the proposed study is to test the null hypothesis that the two population means are equal. The criterion for significance (alpha) has been set at 0.05. The test is 1-tailed, which means that only an effect in the expected direction will be interpreted.
With the proposed sample size of 51 and 51 for the two groups, the study will have power of 80.6% to yield a statistically significant result.
This computation assumes that the mean difference is -3.5 (corresponding to means of 8.0 versus 11.5) and the common within-group standard deviation is 7.0.
This effect was selected as the smallest effect that would be important to detect, in the sense that any smaller effect would not be of clinical or substantive significance. It is also assumed that this effect size is reasonable, in the sense that an effect of this magnitude could be anticipated in this field of research.
Precision for estimating the effect size
A second goal of this study is to estimate the mean difference between the two populations. On average, a study of this design would enable us to report the mean difference with a precision (95.0% confidence level) of plus/minus 2.29 points.
For example, an observed difference of -3.5 would be reported with a 95.0% confidence interval of 1.21 to infinity, or (alternatively, per the a priori hypothesis) of minus infinity to 5.79. (Since the confidence interval has been defined as one tailed, only one boundary is meaningful).
The precision estimated here is the median precision. Precision will vary as a function of the observed standard deviation (as well as sample size), and in any single study will be narrower or wider than this estimate.
Computational option: Variance is estimated (t-test)
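The precision figure in the report can be checked with a rough calculation. The sketch below uses a normal critical value, giving about 2.28; the small gap from the report's 2.29 arises because SamplePower uses the t distribution and the median of the estimated standard deviation:

```python
from statistics import NormalDist
from math import sqrt

z = NormalDist().inv_cdf
sd, n = 7.0, 51                        # common SD and per-group sample size
se = sd * sqrt(1 / n + 1 / n)          # standard error of the mean difference
margin = z(0.95) * se                  # one-tailed 95% confidence bound
print(round(margin, 2))
```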
2. Post Hoc Power Analysis
Imagine that a study had been carried out to see if CMI scores were significantly correlated with coffee consumption (measured as the average number of cups per day). A researcher carries out a study using 40 subjects, fails to find a significant correlation, and therefore concludes that coffee consumption does not alter CMI score.
The effect size for correlation coefficients is simply r, the correlation coefficient. A power analysis can be carried out to find out what effect size it would have been likely to detect. This is most easily done by examining a graph of power as a function of effect size, for the sample size used in the study. This graph is shown in Figure 4. The x-axis goes from 0.1 to 0.5. (Cohen defines a small effect size to be r=0.1, a medium effect size to be r = 0.3, and a large effect size to be r=0.5.)
It can be seen from the graph that the power to detect a large effect size is very high, above 0.95. It could therefore be safely concluded from this study that coffee consumption does not have a large effect on CMI score. At the medium effect size (r = 0.3) the power to detect a significant result is slightly above 0.6. Although this is reasonable power, it is not sufficient to safely assume that there is not a medium effect. Finally, at a small effect size, the power is very low – around 0.15. This is certainly not sufficient to conclude that there is not a small effect.
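These values read from the graph can be approximated with Fisher's z transformation of the correlation coefficient. This is a normal-approximation sketch, not the exact method the graphing program uses, but it reproduces the three figures closely:

```python
from statistics import NormalDist
from math import atanh, sqrt

def corr_power(r, n, alpha=0.05):
    """Approximate one-tailed power to detect a population correlation r
    with n subjects, via Fisher's z transformation (SE = 1/sqrt(n - 3))."""
    z = NormalDist().inv_cdf
    return NormalDist().cdf(atanh(r) * sqrt(n - 3) - z(1 - alpha))

for r in (0.1, 0.3, 0.5):   # Cohen's small, medium and large effect sizes
    print(r, round(corr_power(r, 40), 2))
```

With n = 40 this gives roughly 0.15, 0.6 and 0.96 for the small, medium and large effect sizes, matching the graph.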
Whether the power of the study was sufficient to decide that there is no effect in the population therefore depends upon the degree of correlation that would be considered large enough to be important.
Note that there is also a slightly different definition of post hoc power that is sometimes used. This is given, for example, by the SPSS GLM procedure when you ask for power. However, the power this gives is the power you had to detect the effect size that you found, not the effect size that might have been. This means that the power is a function of the p-value: if p = 0.05, power = 0.50. (Note that this is not necessarily the case when you have a multivariate outcome.)
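This relationship is easy to demonstrate for a simple one-tailed z test, where the observed effect equals the critical value exactly when p = alpha (this is an illustrative sketch, not the GLM computation itself):

```python
from statistics import NormalDist

def observed_power(p, alpha=0.05):
    """'Observed power' to detect the effect size actually found, for a
    one-tailed z test: a pure function of the p-value."""
    z = NormalDist().inv_cdf
    return NormalDist().cdf(z(1 - p) - z(1 - alpha))

print(round(observed_power(0.05), 2))  # exactly 0.5 when p equals alpha
```

Because observed power carries no information beyond the p-value, it cannot tell you what the study's power was to detect effects of substantive interest.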
Although studies with excessive power exist, they tend to be few and far between. The main problem in designing and carrying out studies is to gain sufficient power.
1. Increase Sample Size
As we have seen, the main way of increasing power is to increase sample size.
4. Use ANCOVA
Adding covariates to an experimental study statistically reduces the error variance, and therefore increases the relative effect size.
In the following table the SD is always equal to 1 and alpha = 0.05.
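The mechanism can be sketched numerically. If a covariate correlates r with the outcome, the error SD shrinks by a factor of sqrt(1 − r²), inflating the standardized effect size; the correlations below are illustrative values, not figures from any particular study:

```python
from math import sqrt

d = 0.5                            # unadjusted standardized effect size
for r in (0.0, 0.3, 0.6):          # hypothetical covariate-outcome correlations
    d_adj = d / sqrt(1 - r ** 2)   # error SD reduced by sqrt(1 - r^2)
    print(r, round(d_adj, 3))      # larger relative effect -> fewer subjects
```

For example, a covariate correlating 0.6 with the outcome turns a d of 0.5 into an effective d of 0.625, with a corresponding reduction in the required sample size.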
Books about Power Analysis
Kraemer, H.C. and Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Newbury Park, CA: Sage.
Murphy, K.R. and Myors, B. (2003). Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests, 2nd Edition. Lawrence Erlbaum Associates, Inc (that's my current favourite book on power).
Books Which Discuss Power Analysis
Abelson, R.P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum. ISBN: 0-8058-0528-1
Chow, S.L. (1996). Statistical Significance: Rationale, Validity and Utility. London: Sage. ISBN 0-7619-5205-5
Several chapters in the two volume set:
Tversky, A. and Kahneman, D. Belief in the law of small numbers. (Methodological Issues – this is an edited reprint of their classic article from the Psychological Bulletin).
Greenwald, A.G. Consequences of prejudice against the null hypothesis. (Methodological Issues).
Tatsuoka, M. Effect Size. (Methodological Issues).
Appendix 2: Computer Programs
Power analysis calculations are fairly difficult to carry out because they often have no closed-form solution. It is possible, although not easy, to calculate power using the noncentral distribution functions in SPSS. The two computer programs used to produce the results for this paper were GPower and SamplePower.