Huber-White estimates in SPSS
Statistical tests usually assume that the measurements are independently and identically distributed (i.i.d.). This means that if you know the residual from one person, you should not be able to make any prediction about the residual from any other person. One occasion where this is violated is in cluster randomised trials. In a cluster randomised trial, we don't randomise individuals; instead we randomise groups, or clusters. This happens, for example, when we randomise a hospital to a group but measure the effects on patients, or when we randomise a teacher to a group and measure the effect on children.
Cluster randomised trials are becoming more common in medical and other research, and there is increasing recognition that we need to analyse them in appropriate ways. This can be done in a number of ways. One approach that is commonly used is the Huber-White sandwich estimator, implemented in Stata as 'robust estimates'. It's also possible to use multilevel modelling in SPSS (although it's a bit fiddly - you really need to use syntax) or generalised estimating equations, which will be available in SPSS 15.
However, it's possible to get the same results as Stata's robust estimates using the SPSS Complex Samples procedure, although it's a bit fiddly. I'll demonstrate this using some data that were the basis of a paper by Richards et al. in the BMJ.
When people telephone their doctor's surgery out of hours, they usually get through to a member of staff at the surgery. In this study, some weeks this remained the case, and some weeks they were put through to NHS Direct. One of the outcomes was how much it then cost to treat the person. The assumption of i.i.d. is violated - a person phoning in a given week is likely to be more similar in cost to another person phoning in that week than to a person phoning in a different week. (For example, if there is a flu bug going around, a lot of people may phone the doctor, to be told that they need to go to bed and take paracetamol, which is cheap. If there is a very high pollen count in a week, people may be hospitalised with asthma, which is expensive.) We don't need to know if this happened, we just need to know if it might have happened.
(Although we could check using the intra class correlation - ICC, sometimes called the intra-unit correlation. I did, and it was 0.02. Not very high, but big enough to matter, as we shall see).
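To get a feel for why an ICC of 0.02 can matter, here is a rough Python sketch of the usual design effect approximation, 1 + (m - 1) x ICC, where m is the average cluster size. The cluster size of 100 callers per week is a made-up number for illustration, not a figure from the study, and the formula assumes roughly equal cluster sizes.

```python
import math

def design_effect(icc, mean_cluster_size):
    """Approximate variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (mean_cluster_size - 1.0) * icc

icc = 0.02          # the ICC reported in the text
cluster_size = 100  # HYPOTHETICAL callers per week, not taken from the study

deff = design_effect(icc, cluster_size)
se_inflation = math.sqrt(deff)  # factor by which naive standard errors are too small
print(f"design effect = {deff:.2f}, SE inflation = {se_inflation:.2f}")
```

Under that assumed cluster size, even a small ICC roughly triples the variance, which is why the clusters cannot safely be ignored.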
I first did the analysis with SPSS, ignoring the clustering of the data. Here, I find that the difference between the groups was £2.40 per person, with 95% confidence intervals £0.58 to £4.22, p = 0.010, with NHS Direct costing more. Thus, we would conclude that there is evidence that people who telephone NHS Direct cost more to treat (in total) than people who telephone the surgery.
I then did the analysis with Stata, requesting robust standard errors, with week as a clustering variable - this gives Huber-White sandwich estimators. The syntax for this is horribly easy:
regress totcost group, robust cluster(week)
The difference between the two groups is the same - NHS Direct costs £2.40 more (which is good; if that wasn't the case, then we would have done something wrong). However, the 95% confidence intervals are now -£0.85 to £5.65, p = 0.141. The interval now includes zero, so the evidence for a difference has gone. Ignoring the clustering would have meant that we risked an inflated type I error rate.
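For anyone curious what the sandwich estimator is doing under the bonnet, here is a minimal Python sketch for a regression with an intercept and one predictor. The data are made up (four weeks, two callers each), the function name is my own, and the finite-sample factor G/(G-1) x (n-1)/(n-k) is what I understand Stata to apply, so treat that as an assumption rather than a guarantee of matching any package.

```python
import math
from collections import defaultdict

def cluster_robust_slope(y, x, cluster):
    """Regress y on an intercept and x; return (slope, naive SE, clustered SE)."""
    n, k = len(y), 2
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))

    # "Bread": the inverse of X'X for the design matrix [1, x].
    det = n * sxx - sx * sx
    bread = [[sxx / det, -sx / det], [-sx / det, n / det]]

    # OLS estimates, b = (X'X)^-1 X'y.
    b0 = bread[0][0] * sy + bread[0][1] * sxy
    b1 = bread[1][0] * sy + bread[1][1] * sxy

    # Residuals, plus each cluster's summed score X_g' u_g.
    rss = 0.0
    scores = defaultdict(lambda: [0.0, 0.0])
    for yi, xi, g in zip(y, x, cluster):
        u = yi - b0 - b1 * xi
        rss += u * u
        scores[g][0] += u
        scores[g][1] += u * xi

    # Naive SE: assumes i.i.d. residuals.
    se_naive = math.sqrt(rss / (n - k) * bread[1][1])

    # "Meat": sum over clusters of the outer product of the scores.
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for s in scores.values():
        for i in range(2):
            for j in range(2):
                meat[i][j] += s[i] * s[j]

    # Sandwich variance for the slope, with the finite-sample factor
    # G/(G-1) * (n-1)/(n-k) (ASSUMED to match Stata's correction).
    n_clusters = len(scores)
    factor = n_clusters / (n_clusters - 1) * (n - 1) / (n - k)
    row = bread[1]
    var_slope = factor * sum(row[i] * meat[i][j] * row[j]
                             for i in range(2) for j in range(2))
    return b1, se_naive, math.sqrt(var_slope)

# MADE-UP data: 4 weeks (clusters), 2 callers each, treatment set by week.
group = [0, 0, 0, 0, 1, 1, 1, 1]
week = [1, 1, 2, 2, 3, 3, 4, 4]
totcost = [1, 1, 3, 3, 4, 4, 8, 8]

slope, se_naive, se_cluster = cluster_robust_slope(totcost, group, week)
print(slope, se_naive, se_cluster)
```

In this toy example the slope is untouched by the correction, but the clustered standard error is larger than the naive one, which is the same pattern as in the real data.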
I can also do this in SPSS (although it's not widely known, and the SPSS website support doesn't tell you this). We have to use the Complex Samples procedure, which appeared in SPSS 13.0. (And you have to have this module - it's an 'optional extra', as is much of SPSS).
First, we need to create a variable which is full of 1s. The syntax is:
compute constant = 1.
Or you can use the menu commands.
We then choose Analyse, Complex Samples, Prepare for Analysis.
Select 'Create a plan file' (which is the default), click 'Browse' and type a file name (I'll call it example, and SPSS will add '.csaplan' to it). Click Next.
Choose the constant variable as the Sample Weight (so that this is always 1 - we want everyone to be sampled) and in the cluster variable, put (surprisingly) the cluster variable - this is week. Click Next.
We want sampling with replacement. This is the default on the next screen, so click Next.
Then click Finish. SPSS produces a chunk of syntax which looks like this:
CSPLAN ANALYSIS
/PLAN FILE='example.csaplan'
/PLANVARS ANALYSISWEIGHT=constant
/PRINT PLAN
/DESIGN CLUSTER= week
/ESTIMATOR TYPE=WR.
Next, we run the analysis. To do this, select Analyse, Complex Samples, General Linear Model.
In the Plan File box, we choose the file that we just created, and then click Continue.
Put the outcome into the dependent variable box, and the predictor into the covariate box (even though our predictor is categorical I prefer to put it into covariate, because you never know how SPSS is going to code it.)
Click on the Statistics button, and ask for model: parameters, standard error, confidence interval and t-test (why oh why oh why would you not want these?). Then click Continue and OK.
We get the following chunk of syntax:
CSGLM totcost WITH group
/PLAN FILE = 'example.csaplan'
/MODEL group
/INTERCEPT INCLUDE=YES SHOW=YES
/STATISTICS PARAMETER SE CINTERVAL TTEST
/PRINT SUMMARY VARIABLEINFO SAMPLEINFO
/TEST TYPE=F PADJUST=LSD
/MISSING CLASSMISSING=EXCLUDE
/CRITERIA CILEVEL=95.
The estimate for the difference is £2.40, with 95% CIs -£0.85 to £5.65, and p = 0.141, which is the same as we got in Stata.
If the residuals are independently distributed, but are not identically distributed (and are still normal) this means that there is heteroscedasticity, and the correction can still be used. In this case, repeat the above procedure, but just don't put anything into the clusters box.
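As I understand it, leaving the clusters box empty amounts to treating each person as their own cluster. Here is a rough Python sketch of that heteroscedasticity-only version for a regression with an intercept and one predictor. The data and the function name are made up, and the n/(n-k) factor (often called HC1) is an assumption about what the packages apply.

```python
import math

def hc1_slope_se(y, x):
    """Regress y on an intercept and x; return (slope, heteroscedasticity-robust SE)."""
    n, k = len(y), 2
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))

    # "Bread": the inverse of X'X for the design matrix [1, x].
    det = n * sxx - sx * sx
    bread = [[sxx / det, -sx / det], [-sx / det, n / det]]

    # OLS estimates, b = (X'X)^-1 X'y.
    b0 = bread[0][0] * sy + bread[0][1] * sxy
    b1 = bread[1][0] * sy + bread[1][1] * sxy

    # "Meat": each observation contributes its own squared residual,
    # so no constant error variance is assumed.
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for yi, xi in zip(y, x):
        u2 = (yi - b0 - b1 * xi) ** 2
        meat[0][0] += u2
        meat[0][1] += u2 * xi
        meat[1][0] += u2 * xi
        meat[1][1] += u2 * xi * xi

    # Sandwich variance for the slope with the n/(n-k) factor (ASSUMED, "HC1").
    row = bread[1]
    var_slope = n / (n - k) * sum(row[i] * meat[i][j] * row[j]
                                  for i in range(2) for j in range(2))
    return b1, math.sqrt(var_slope)

# MADE-UP data: the second group is visibly more variable than the first.
group = [0, 0, 1, 1]
totcost = [1, 3, 2, 6]

slope, se_robust = hc1_slope_se(totcost, group)
print(slope, se_robust)
```

The only change from the clustered version is that the scores are squared person by person instead of being summed within each week first.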
(Veronica Morton helped me with this.)


