Friday, December 09, 2005

Rambling about normal distributions

I posted a message to MedStats about normality, and thought I might elaborate on it here, because I can add some pictures. If you want to look at the original post for some reason, it's here.

Someone asked "How do I know if my data are normally distributed? How do I know if it matters?"

It's worth playing around with sample sizes and distributions to see what matters. It's easy to do in programs like SPSS.

Here's a bit of SPSS syntax:
input program. loop #case = 1 to 100 * 50.
compute vnorm =NORMAL(1).
compute vchisq1 = rv.chisq(1) - 1.
compute vchisq2 = rv.chisq(2) - 2.

end case.

end loop.

end file.

end input program.

compute group = mod($casenum, 100).

GRAPH /ERRORBAR( CI 95 )=vnorm BY group .

GRAPH /ERRORBAR( CI 95 )=vchisq1 BY group .

GRAPH /ERRORBAR( CI 95 )=vchisq2 BY group .

It generates 100 sets of data, of sample size 50, from three distributions: normal, chi-square with 1 df (pretty highly skewed, skewness = approx 3), and chi-square with 2 df (moderately skewed, skewness = approx 2). The mean of each variable is zero.

It then draws three graphs, one for each variable, of the mean, with 95% confidence intervals. We would expect that in 95% of the samples, the population mean (zero) would lie within the CIs. If fewer (or more, even) do not lie within the CIs, then the assumption would seem to have been violated sufficiently badly that it matters.

Here's the normal distribution. Of the 100 samples, the 95% CIs of five of them do not include the population value of zero. Pretty good, exactly what we would expect.

Here's the chi-square with 1 df. Of the 100 samples, the 95% CIs do not include the population value in 5 of them. Pretty good - exactly what we would expect, if we hadn't violated any assumptions. In other words, this much skew didn't seem to matter. (Acutally, it usually matters a bit - we were lucky this time. 100 samples isn't really enough to see it well, we should run it a few more times to check).

Change the sample sizes by changing the number 50, on the 2nd line (bold and red). A sample size of 100 also does fine, even with highly skewed data.

Two more thoughts:

First, it's often not the data that are assumed to be normally distributed, it's the residuals, so be careful about that.

Second, Wilcox and other proponents of robust estimation, say that it's not skew that's the problem, it's mixture distributions. I think it's in his 2005 Introduction to robust estimation and hypothesis testing (London, Academic Press) that he gives an example of a mixture distribution. 90% of the sample is sampled from a population where the mean is 0 and the sd = 1, and 10% from a population where the mean is 0 and sd = 10. In this case, there is no skew, but the percentage of times the 95% CIs include the population mean veers away from what we would expect.

Here's another simulation (written this time in R - it's free, and powerful and flexible). We first generate some data, to match that described above.

Here's the distribution (it's got a normal distribution imposed on it, with the same mean and SD):

Which looks pretty good.

The syntax runs 100,000 samples (rather than 100) and handily counts the number of times the confidence intervals contain the population mean - it gives the result that 96.9% of the time, the CIs contain the population mean.

Is this a problem? It doesn't sound like a problem. It is a problem because it means that our power is reduced - we are not finding a significant result as often as we should, and therefore we are making more type II errors than is necessary.

sampleSize <- 40 #Set size of samples
nSamples <- 100000 #set number of samples
lowerSD <- 1 #set SD for group with lower SD
upperSD <- 10 #setSD for group with higher SD
proportionInUpper = 0.1 #set proportio of people with higher SD
data <- rnorm(sampleSize * nSamples, 0, lowerSD) + rnorm(sampleSize * nSamples, 0, (upperSD - lowerSD)) * rbinom(n = sampleSize * nSamples, size = 1, p = proportionInUpper)

dim(data) <- c(nSamples, sampleSize)
sampleMeans <- rowSums(data) / sampleSize

xAxisLabels <- seq(min(sampleMeans), max(sampleMeans), length = 100)
xAxisLabels <- xAxisLabels - mean(sampleMeans) / sd(sampleMeans)
NormalHeights <- dnorm(xAxisLabels, mean = mean(sampleMeans), sd = sd(sampleMeans))

#hist(sampleMeans, probability = TRUE)
heights <- table(round(sampleMeans, 2))

plot(table(round(sampleMeans,1 )))
lines(xAxisLabels, NormalHeights * (nSamples / 10), col = "red")

dataSquared <- data^2
dataRowSums <- rowSums(data)
dataRowSumsSquared <- rowSums(dataSquared)
sampleSD <- sqrt((sampleSize * dataRowSumsSquared - dataRowSums^2) / (sampleSize * (sampleSize - 1)))
sampleSE <- sampleSD/sqrt(sampleSize)
lowerCIs <- sampleMeans - (sampleSE * (qt(0.975, sampleSize - 1)))
upperCIs <- sampleMeans + (sampleSE * (qt(0.975, sampleSize - 1)))
within95CIs <- lowerCIs <> 0