Friday, November 6, 2009

No, REALLY don't do statistics with Excel

There was a query on the SAS mailing list today - someone got inconsistent results for confidence intervals between Excel and SAS. In Excel, they were using the confidence() function, which I'd not come across before. And I'm glad about that.

See, to calculate a confidence interval, you multiply the standard error of the distribution for the critical value from the t-distribution.

You can find that value using (say) R, with the qt() function or Excel, with the tinv() function. The t-distribution approximates the normal distribution as the sample size increases - you need a sample size of infinity for them to be exactly the same, but if the same size is large enough, then it's close. With the normal distribution, if you want a 95% confidence interval, the critical value is 1.96, which is so close to 2 that you can pretty much us 2 and get away with it. (Around 95% of cases lie within 2 SDs of the mean in a normal distribution).

If you have a sample of 500, then the critical value for t is 1.96 - the same as for the normal distribution. But even if your sample is as low as 50, the critical value is 2.0, which is close enough for almost anything (if you're using a computer it will work it out, and use it, to 16 decimal places, and we don't need to worry anyway).

However, this person had a sample size of 6. With a sample size of 6, the critical value is 2.6, that's a pretty long way off. Well, it turns out that Excel doesn't use the t distribution, it uses the normal distribution. If you have a large sample, this is going to cause you no problems, but if you've got a small sample, you are going to be off - off by almost 30%, which is quite a long way for a confidence interval to be off.

I googled (not Binged, of course) around this a bit, because I don't believe I'm the only person to have noticed, and I found this web page. Which is called "Function Consistency Improvements in Excel 2010", and renames the confidence() function to confidence.t(), and says: "Consistent definition with industry best practices. Confidence function assuming a Student’s t distribution."

Oh, that's great. It wasn't wrong or a cock up or a mistake. It was just inconsistent with best industry practices. Well, I'll try that at home next time I tread dog turd into the living room / spill tomato soup on the sofa / allow the children to play with the rat poison. That wasn't wrong, or a mistake, I will claim, as I am reprimanded. In the future I will eat tomato soup at the table, in order to ensure that my eating behavior is consistent with best dining practice .

2 Comments:

At November 13, 2009 12:44 PM , Blogger Palinurus said...

The confidence() function is new to me too.

I use Excel all the time for confidence intervals and It is useful, but it important to build the intervals up using functions like tinv() and so forth. It is also import to check the spreadsheet works (against R, SPSS or a textbook example). Finally, its important to understand the limitations of Excel (e.g., with respect to rounding).

The main advantages of ExceI, I think, are i) as a teaching aid and ii) as a way of storing examples for future use.* I used Excel today in class and it seemed to work well.

* You can do this in SPSS or R, but Excel has a lower overhead when you go back to the example after a few weeks or months.

 
At November 15, 2009 6:03 AM , Blogger Yang said...

I only use excel to do data mining

 

Post a Comment

<< Home