# Getting the Sample Size Right:

## A Brief Introduction to Power Analysis

Queries:  Email Jeremy Miles

Introducing Power Analysis

A hypothesis test tells us the probability of our result (or a more extreme result) occurring if the null hypothesis is true. If that probability is lower than a pre-specified value (alpha, usually 0.05), the null hypothesis is rejected. Hypothesis testing can be likened to a search process: we are searching for evidence to reject the null hypothesis, in the same way that we may search for (say) the presence of a chemical in an environment.

The ability to reject the null hypothesis depends upon:

• Alpha (α): Usually set to 0.05, although this is somewhat arbitrary. This is the probability of a type I error, that is, the probability of rejecting the null hypothesis given that the null hypothesis is true. To use the search analogy, it is the probability of thinking we have found something when it is not really there.
• Sample size: A larger sample size leads to more accurate parameter estimates, which leads to a greater ability to find what we were looking for. The harder we look, the more likely we are to find it.
• Effect Size: The size of the effect in the population. The bigger it is, the easier it will be to find.

However, the above is not strictly correct. Jacob Cohen (author of several books and articles on power analysis) has pointed out that "all null hypotheses, at least in their 2-tailed forms, are false." Whatever we are looking for is always going to be there; it might just be there in such small quantities that we are not bothered about finding it.

Power analysis allows us to make sure that we have looked hard enough to find it, if there is enough of it there to bother us. The size of the thing we are looking for is known as the "effect size." Several methods exist for deciding what effect size we would be interested in. Different statistical tests have different effect sizes developed for them; however, the general principle is the same.

• Base it on substantive knowledge. Kraemer and Thiemann (see end) provide the following example. It is hypothesised that 40-year-old men who drink more than three cups of coffee per day will score more highly on the Cornell Medical Index (CMI) than men who do not drink coffee. The CMI ranges from 0 to 195, and previous research has shown that scores on the CMI increase by about 3.5 points for every decade of life. It is decided that an increase, caused by drinking coffee, which was equivalent to about 10 years of age would be enough to warrant concern, and so an effect size can be calculated based on that assumption.
• Base it on previous research. See what effect sizes other researchers studying similar fields have found. Use this as an estimate of the effect size.
• Use conventions. Cohen (again) has defined small, medium and large effect sizes for many types of test. These form useful conventions, and can guide you, if you know approximately how strong the effect is likely to be.

Doing Power Analysis

Three types of power analysis exist: a priori, post hoc, and compromise. This workshop will focus on the first two. Compromise power analysis is a more complex issue; it is rarely used and slightly controversial.

1. A Priori Power Analysis

Ideally, power analysis is carried out a priori, that is, during the design stage of the study. A study can conclude that a null hypothesis was true or false. The real answer, i.e. the state of the world, can be that the hypothesis is true or false.

Given the three factors alpha, sample size and effect size, a fourth variable can be calculated, called beta (β). Where alpha (α) is the probability of a type I error (i.e. rejection of a correct null hypothesis), beta is the probability of a type II error (acceptance of a false null hypothesis).

| Research finding | State of the world: H0 true | State of the world: H0 false |
|---|---|---|
| H0 true | ✓ | Type II error (p = β) |
| H0 false | Type I error (p = α) | ✓ |

The probability of correctly accepting the null hypothesis is equal to 1 - α, which is fixed; the probability of incorrectly accepting the null hypothesis is β. The probability of correctly rejecting the null hypothesis is equal to 1 - β, which is called power. The power of a test refers to its ability to detect what it is looking for. To return to the search analogy, the power of a test is our probability of finding what we were looking for, given its size.

A power analysis program can be used to determine power given the values of α, sample size and effect size. If the power is deemed to be insufficient, steps can be taken to increase it (most commonly, but not exclusively, by increasing the sample size).
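
The relationship between α, sample size, effect size and power can be sketched in a few lines of Python. This is only an illustration, not what SamplePower does internally: it assumes two equal-sized independent groups and uses the normal approximation to the t-test, and the function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_group_power(effect_size, n_per_group, alpha=0.05, two_tailed=True):
    """Approximate power for comparing the means of two independent groups.

    effect_size is the standardised mean difference (difference / SD).
    Normal approximation: dedicated programs use the noncentral t
    distribution, so their answers can differ slightly.
    """
    z = NormalDist()
    tail_alpha = alpha / 2 if two_tailed else alpha
    z_crit = z.inv_cdf(1 - tail_alpha)
    # Expected value of the test statistic if the effect really exists
    expected_z = effect_size * sqrt(n_per_group / 2)
    power = z.cdf(expected_z - z_crit)
    if two_tailed:
        # Small chance of rejecting in the "wrong" direction
        power += z.cdf(-expected_z - z_crit)
    return power


# Hypothetical values, chosen only to show the direction of each influence.
baseline = two_group_power(effect_size=0.5, n_per_group=30)
print(f"baseline (d = 0.5, n = 30 per group): {baseline:.2f}")
print(f"larger sample (n = 60 per group):     {two_group_power(0.5, 60):.2f}")
print(f"larger effect (d = 0.8):              {two_group_power(0.8, 30):.2f}")
print(f"larger alpha (0.10):                  {two_group_power(0.5, 30, alpha=0.10):.2f}")
print(f"beta at baseline (1 - power):         {1 - baseline:.2f}")
```

Each change (more subjects, a bigger effect, a more lenient α) raises power relative to the baseline, and β is simply whatever is left over.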

Figure 1: Picking an Appropriate Test
Example Using SamplePower

We will use the previous example of Kraemer and Thiemann, in which we are interested in an increase, approximately equivalent to 1 decade, caused by drinking coffee. Figure 1 shows the first screen of the SamplePower program in which we tell the program what sort of statistical test we will be using.

In this case it is a t-test with 2 independent groups. We know that the average score for men in their 40s on the CMI is 8, and that the SD is 7. We have decided that an increase of 3.5 (the increase associated with ageing 1 decade) will be sufficient to concern us. (This increase would, in Cohen’s classifications, be a medium effect size.)
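
As a rough check on that classification, the standardised effect size for a comparison of two means (Cohen's d, here assuming the SD of 7 applies within both groups) works out as:

$$
d = \frac{\bar{x}_{\text{coffee}} - \bar{x}_{\text{no coffee}}}{SD} = \frac{(8 + 3.5) - 8}{7} = \frac{3.5}{7} = 0.5
$$

A value of 0.5 is Cohen's conventional "medium" effect for a difference between two means.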

Figure 2: Sample Size Calculation for 80% Power

Figure 2 shows the sample size for 80% power (usually considered to be sufficient power). This would require an N of 51 per group, or a total N of 102.
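
The same calculation can be approximated by hand. The sketch below is an assumption-laden stand-in for SamplePower rather than a reproduction of it: it uses the normal approximation instead of the t distribution, assumes a 1-tailed test at α = 0.05 (as stated in the program's report), and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, power=0.80, alpha=0.05, two_tailed=False):
    """Approximate subjects needed per group for a two-group comparison.

    Normal approximation: n = 2 * ((z_alpha + z_beta) / d) ** 2.
    """
    z = NormalDist()
    tail_alpha = alpha / 2 if two_tailed else alpha
    z_alpha = z.inv_cdf(1 - tail_alpha)  # critical value of the test
    z_beta = z.inv_cdf(power)            # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


d = 3.5 / 7.0  # coffee example: 3.5-point difference, SD of 7
n = n_per_group(d)
print(f"per group: {n}, total: {2 * n}")
```

The approximation comes out one subject per group below the figure from SamplePower, which is expected because the t-based calculation is slightly more conservative. Switching to `two_tailed=True` shows how much a 2-tailed test would add to the required sample.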

A graph can be a useful way of making the best decision regarding the trade-off between power and sample size. The graph drawn by SamplePower is shown in Figure 3. It can be seen that the rate of increase in power starts to reduce dramatically at around the 80% power figure. An appropriate sample size can then be determined, based on expense and desired power.

Figure 3: Graph Drawn by SamplePower

A final and very useful feature of SamplePower is its ability to generate a text report, to communicate the findings of the power analysis.

Power for a test of the null hypothesis

One goal of the proposed study is to test the null hypothesis that the two population means are equal. The criterion for significance (alpha) has been set at 0.05. The test is 1-tailed, which means that only an effect in the expected direction will be interpreted.

With the proposed sample size of 51 and 51 for the two groups, the study will have power of 80.6% to yield a statistically significant result.

This computation assumes that the mean difference is -3.5 (corresponding to means of 8.0 versus 11.5) and the common within-group standard deviation is 7.0.

This effect was selected as the smallest effect that would be important to detect, in the sense that any smaller effect would not be of clinical or substantive significance. It is also assumed that this effect size is reasonable, in the sense that an effect of this magnitude could be anticipated in this field of research.

Precision for estimating the effect size

A second goal of this study is to estimate the mean difference between the two populations. On average, a study of this design would enable us to report the mean difference with a precision (95.0% confidence level) of plus/minus 2.29 points.

For example, an observed difference of -3.5 would be reported with a 95.0% confidence interval of 1.21 to infinity, or (alternatively, per the a priori hypothesis) of minus infinity to 5.79. (Since the confidence interval has been defined as one tailed, only one boundary is meaningful).

The precision estimated here is the median precision. Precision will vary as a function of the observed standard deviation (as well as sample size), and in any single study will be narrower or wider than this estimate.

Notes

Computational option: Variance is estimated (t-test)

A priori power analysis can ensure that you do not waste time and resources carrying out a study which has very little chance of finding a significant effect, and can also ensure that you do not waste time and resources testing more subjects than are necessary to detect an effect.

Post Hoc Analysis
Whereas a priori analysis is done before a study has been carried out, post hoc analysis is done after a study has been carried out, to help to explain the results of a study which did not find any significant effects.

Imagine that a study had been carried out to see if CMI scores were significantly correlated with coffee consumption (measured in average number of cups per day.) A researcher carries out a study, using 40 subjects, fails to find a significant correlation, and therefore concludes that coffee consumption does not alter CMI score.

The effect size for correlation coefficients is simply r, the correlation coefficient. A power analysis can be carried out to find out what effect size it would have been likely to detect. This is most easily done by examining a graph of power as a function of effect size, for the sample size used in the study. This graph is shown in Figure 4. The x-axis goes from 0.1 to 0.5. (Cohen defines a small effect size to be r=0.1, a medium effect size to be r = 0.3, and a large effect size to be r=0.5.)

It can be seen from the graph that the power to detect a large effect size is very high, above 0.95. It could therefore be safely concluded from this study that coffee consumption does not have a large effect on CMI score. At the medium effect size (r = 0.3) the power to detect a significant result is slightly above 0.6. Although this is reasonable power, it is not sufficient to safely assume that there is not a medium effect. Finally at a small effect size, the power is very low – around 0.15. This is certainly not sufficient to conclude that there is not a small effect.
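
Since the graph itself is not reproduced here, the values read from it can be approximated with the Fisher z transformation. This is a sketch under stated assumptions: the function name is hypothetical, the test is assumed to be 1-tailed at α = 0.05 (the text does not say which was used for the graph), and the approximation will not match an exact calculation to the last decimal place.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05, two_tailed=False):
    """Approximate power to detect a population correlation r with n subjects.

    Uses Fisher's z transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    z = NormalDist()
    tail_alpha = alpha / 2 if two_tailed else alpha
    z_crit = z.inv_cdf(1 - tail_alpha)
    expected_z = atanh(r) * sqrt(n - 3)
    power = z.cdf(expected_z - z_crit)
    if two_tailed:
        power += z.cdf(-expected_z - z_crit)
    return power


# Cohen's conventional small, medium and large correlations, N = 40
for label, r in [("small", 0.1), ("medium", 0.3), ("large", 0.5)]:
    print(f"{label:6s} r = {r}: power = {correlation_power(r, 40):.2f}")
```

Under these assumptions the three results land close to the values described above: above 0.95 for a large effect, about 0.6 for a medium effect and about 0.15 for a small one.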

Figure 4: Power as a function of effect size

Whether the power of the study was sufficient to decide that there is no effect in the population would depend upon the degree of correlation which would be determined to be large enough to be important.

Note that there is also a slightly different definition of post hoc power that is sometimes used. This is given, for example, by the SPSS GLM procedure when you ask for power. However, the power that this gives is the power that you had to detect the effect size that you found, not the effect size that might have been. This means that the power is a function of the p-value. If p = 0.05, power = 0.50. (Note that this is not necessarily the case when you have a multivariate outcome.)
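
Why this kind of "observed" power is just a restatement of the p-value can be seen in a simplified case. The sketch below assumes a 2-tailed z-test rather than the GLM procedure itself, and the function name is hypothetical: it converts a p-value back into the test statistic that produced it, treats that as the true effect, and asks how often such an effect would reach significance.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Power to detect the effect actually observed (2-tailed z-test)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Test statistic that would give exactly this p-value
    z_observed = z.inv_cdf(1 - p_value / 2)
    return z.cdf(z_observed - z_crit) + z.cdf(-z_observed - z_crit)


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
```

When p equals α the observed statistic sits exactly on the critical value, so half of all replications would fall above it and half below, which is where the figure of 0.50 comes from. Nothing in the calculation uses any information beyond the p-value itself.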

Increasing Power

Although studies with excessive power exist, they tend to be few and far between. The main problem in designing and carrying out studies is to gain sufficient power.

1. Increase Sample Size

As we have seen, the main way of increasing power is to increase sample size.

2. Increase Alpha

All of the studies we have mentioned have used an alpha of 0.05. Routinely using an alpha of 0.05 disregards the role of alpha in determining the level of beta.

If we assume a correlation value of 0.3 (medium effect size) to be worth finding:

α = 0.05 (2-tailed), N = 40, Power = 0.49
α = 0.10 (2-tailed), N = 40, Power = 0.62
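
The direction of this trade-off can be checked with a Fisher z approximation. This is a sketch, not the calculation used to produce the figures above: the function name is hypothetical, and the approximation runs a few hundredths below the quoted values.

```python
from math import atanh, sqrt
from statistics import NormalDist


def power_for_r(r, n, alpha):
    """Approximate 2-tailed power to detect a correlation r with n subjects."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    expected_z = atanh(r) * sqrt(n - 3)  # Fisher z, standard error 1/sqrt(n - 3)
    return z.cdf(expected_z - z_crit) + z.cdf(-expected_z - z_crit)


for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: power = {power_for_r(0.3, 40, alpha):.2f}")
```

Each relaxation of α buys extra power at the cost of a higher type I error rate, so the choice depends on which kind of error is more costly in the study at hand.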

3. Shrink Standard Deviations
By using more homogeneous groups (in an experimental study) the relative effect size increases. Similarly, increasing the reliability of the measures will have the same effect.

4. Use ANCOVA

Adding covariates to an experimental study statistically reduces the error variance, and therefore increases the relative effect size.

In the following table, SD is always equal to 1 and α = 0.05:

| Mean Group 1 | Mean Group 2 | Effect Size | Total N | R² of Covariates | Power |
|---|---|---|---|---|---|
| 0.0 | 0.2 | Small | 100 | 0.00 | 0.17 |
| 0.0 | 0.2 | Small | 100 | 0.49 | 0.28 |
| 0.0 | 0.2 | Small | 200 | 0.00 | 0.28 |
| 0.0 | 0.5 | Medium | 60 | 0.00 | 0.47 |
| 0.0 | 0.5 | Medium | 60 | 0.25 | 0.59 |
| 0.0 | 0.5 | Medium | 80 | 0.00 | 0.59 |

Addition of covariates with an R² of 0.49 (i.e. a correlation of 0.7 with the DV) increases power to the same extent as a doubling in sample size. Addition of covariates with an R² of 0.25 (correlation of 0.5) increases power to the same extent as an increase in sample size of one third.
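
The reason for these equivalences can be sketched as follows. Covariates that explain a proportion R² of the variance shrink the error SD by a factor of sqrt(1 - R²), which acts like multiplying the sample size by 1 / (1 - R²). The code is an illustration under assumptions not stated in the text: a 2-tailed test, two equal-sized groups, the normal approximation, and no allowance for the degrees of freedom used up by the covariates. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ancova_power(effect_size, total_n, r2_covariates=0.0, alpha=0.05):
    """Approximate 2-tailed power for two equal groups, with covariates."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Covariates reduce the error SD, inflating the standardised effect
    adjusted_effect = effect_size / sqrt(1 - r2_covariates)
    # Two groups of total_n / 2 each
    expected_z = adjusted_effect * sqrt(total_n / 4)
    return z.cdf(expected_z - z_crit) + z.cdf(-expected_z - z_crit)


rows = [(0.2, 100, 0.00), (0.2, 100, 0.49), (0.2, 200, 0.00),
        (0.5, 60, 0.00), (0.5, 60, 0.25), (0.5, 80, 0.00)]
for d, total_n, r2 in rows:
    print(f"d = {d}, N = {total_n:3d}, R2 = {r2:.2f}: "
          f"power = {ancova_power(d, total_n, r2):.2f}")
```

The results should agree with the table to within a few hundredths. With R² = 0.25 the multiplier is 1 / 0.75, exactly one third more subjects; with R² = 0.49 it is 1 / 0.51, just under a doubling.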

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd Ed. Hillsdale, NJ: Erlbaum.

Kraemer, H.C. and Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Newbury Park, CA: Sage.

Murphy, K.R. and Myors, B. (2003).  Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests, 2nd Edition. Lawrence Erlbaum Associates, Inc (that's my current favourite book on power).

Books Which Discuss Power Analysis

Abelson, R.P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum. ISBN: 0-8058-0528-1

Chow, S.L. (1996).  Statistical Significance: Rationale, Validity and Utility. London: Sage. ISBN 0-7619-5205-5

Several chapters in the two volume set:
Keren, G. and Lewis, C. (1993). A Handbook for Data Analysis in the Behavioural Sciences. Hillsdale, NJ: Erlbaum.
Or separately:
Methodological Issues ISBN: 0-8058-1037-4
Statistical Issues ISBN: 0-8058-1093-5

These include: Gigerenzer, G. The superego, the ego and the id in statistical reasoning. (Methodological Issues)

Tversky, A. and Kahneman, D. Belief in the law of small numbers. (Methodological Issues – this is an edited reprint of their classic article from the Psychological Bulletin).

Greenwald, A.G. Consequences of prejudice against the null hypothesis. (Methodological Issues).

Tatsuoka, M. Effect Size. (Methodological Issues).

## Appendix 2: Computer Programs

Power analysis calculations are fairly difficult to carry out, because they often have no closed form solution. It is possible to calculate power using the non-central distribution functions in SPSS, although it is not easy. The two computer programs used to produce the results for this paper were GPower and SamplePower.

#### Commercial

SamplePower 2.0 is produced by, and is available from, SPSS.
NCSS/PASS (power and sample size).
NQuery

#### Free
