Wednesday, March 28, 2007

Standard Deviations (Sample and Population)

There are two different kinds of standard deviations, and they are a bit prone to getting confused. Here's the text of an email that Darren Van Laar sent to me:

There's a bit on p.42 that i'm unsure of though - as the population and sample standard deviation terms seem to be a bit conflated unless i'm going crazy (possible!).
Sample Sd is the unbiased estimator (is divided by n-1) and is denoted by s. Pop Sd is the other one.
Also, this makes Excel correct, but probably just me...

As I haven't got the book yet, I'm not exactly sure what's on page 42, but I can guess what it's about. And this is where there are slightly different terminologies for the standard deviation.

If you have a sample taken from the population (which is what you almost always have), then you use the SD which divides by N - 1. Dividing by N - 1 makes the variance an unbiased estimate of the population variance, and its square root is the usual estimate of the population standard deviation.
Because it's the estimate of the population standard deviation, it's sometimes called the population standard deviation (this is what we call it). But it's the estimate that you should use when you have a sample, so you could call it the sample standard deviation as well.

Similarly, if you have measured the entire population, then your standard deviation is not divided by N-1, it's divided by N. Sometimes (and Excel is included), this is called the population standard deviation, because it's the standard deviation that's used when you have measured the population.

In fact, it's mostly a lot of worrying about not very much, because we never have an entire population, so we never divide by N; we always divide by N - 1.
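If you want to see the difference for yourself, here's a minimal sketch in Python (not something from the book; the scores are made up). It works out both versions by hand and checks them against the standard library's statistics.pstdev (divides by N) and statistics.stdev (divides by N - 1).

```python
import math
import statistics

# Made-up scores, purely for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(scores)
mean = sum(scores) / n
# Sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in scores)

sd_n = math.sqrt(ss / n)                # divide by N (whole population measured)
sd_n_minus_1 = math.sqrt(ss / (n - 1))  # divide by N - 1 (sample from a population)

print(sd_n)                    # 2.0
print(round(sd_n_minus_1, 3))  # 2.138

# The standard library agrees with the hand calculation.
print(math.isclose(sd_n, statistics.pstdev(scores)))         # True
print(math.isclose(sd_n_minus_1, statistics.stdev(scores)))  # True
```

With only 8 scores the two versions differ noticeably; as N gets larger the difference shrinks.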

However, it's possible that it's wrong in the book - I haven't seen it to check, and we had such a horrible time with the typesetter that all kinds of things changed - between what we wrote and the proofs that we saw.

Book is Out

I heard that the book I have written with Phil Banyard "Understanding and Using Statistics in Psychology" is available for purchase in your favourite bookshop.

You'd imagine that someone like the publisher would tell me stuff like that - after all. But you'd be wrong. I discovered it was out because Darren Van Laar, from Portsmouth University, emailed to ask me a question about something on Page 42.

Unfortunately, as I haven't actually got a copy of the book yet, I was only able to help partially. Also unfortunately, this means that all the things I said I was going to put on the book's website, because they wouldn't fit in the book, will actually have to be put on the book's website.

Scoring questionnaires in SPSS and Stata.

I sent an email to the psych-postgrads list, about scoring questionnaires in SPSS. I thought I'd reproduce it here, and elaborate a bit to include Stata.

First, use syntax. You can check syntax for errors, you can rerun it when you find another late questionnaire, you can save it and re-use it years later. You can give it to your friends and make yourself a more popular person.

Here are some tips for syntax:

First, SPSS doesn't mind "whitespace", so you can make your syntax more legible, and less prone to mistakes:

Instead of:

COMPUTE AQPHYSIC = q1aq + q5aq + q7aq + q12aq + q13aq + q20aq + q24aq + q27aq + q29aq.


You can write:

COMPUTE AQPHYSIC =
q1aq +
q5aq +
q7aq +
q12aq +
q13aq +
q20aq +
q24aq +
q27aq +
q29aq.

Secondly, items often need to be reversed, so you need to create a new variable for that, which you might call q29aq.rev. But if you do that, you've got to remember to use the reversed variable, not the original, in the syntax that scores the questionnaire. Instead of doing that, create a new variable for every item in your scale (or subscale), whether or not it needs reversing, and score the scale from the new variables.

For example, I wrote some syntax (in 1994) for scoring the EPQ-R (short form). Here's the bit that creates the lie scale:

COMPUTE = epq3 .
COMPUTE = epq8 .
COMPUTE = epq12 .
COMPUTE = epq16.
COMPUTE = epq20 .
COMPUTE = epq24 .
COMPUTE = epq29 .
COMPUTE = epq33 .
COMPUTE = epq37 .
COMPUTE = epq40 .
COMPUTE = epq45 .
COMPUTE = epq47 .

I create items to (I call them .jm to make it obvious what they were, and because I might be using this on anyone's dataset, and I don't know what their variables will be called.)

Next you need to reverse the items that need to be reversed:

RECODE (1=0) (0=1) .
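The logic of the recode is simple enough to sketch outside SPSS. This is an illustrative Python version, not the original syntax: reverse_binary is a made-up helper name, None stands in for a missing answer, and it assumes the items are scored 0 or 1.

```python
def reverse_binary(score):
    """Reverse a 0/1 item, like RECODE (1=0) (0=1); missing stays missing."""
    if score is None:
        return None
    return 1 - score  # assumes the item is scored 0 or 1

# Made-up answers to one item from four people; None means no answer.
answers = [1, 0, None, 1]
reversed_answers = [reverse_binary(a) for a in answers]
print(reversed_answers)  # [0, 1, None, 0]
```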

When you've done that, you can create the sums very easily, using compute statements.

There's an additional advantage to using new variables, and that is that instead of writing all the variables out, you can use "to".


compute = sum( to

And SPSS knows that means all the between and

BUT WAIT, there's another problem, which is missing data. Software handles this in one of two ways: Excel gives it a zero. SPSS either gives it a zero, or makes the sum variable missing. Neither of these is right.

The solution is to use the mean score, and then multiply the mean by the number of items. If all variables are completed, then the mean score multiplied by the number of items will equal the total. If a score is missing, then that item will be given the average of all the items that the person did complete.
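Here's the arithmetic as a small Python sketch, in case it helps to see it with numbers. The function name prorated_total and the data are made up for illustration, and None stands in for a missing answer.

```python
def prorated_total(responses):
    """Mean of the answered items multiplied by the total number of items."""
    answered = [r for r in responses if r is not None]
    if not answered:
        return None  # nothing answered, so there is nothing to average
    return sum(answered) / len(answered) * len(responses)

# Made-up four-item scale scored 0/1.
complete = [1, 0, 1, 1]
one_missing = [1, None, 1, 1]

print(prorated_total(complete))     # 3.0, the same as the simple sum
print(prorated_total(one_missing))  # 4.0, the missing item gets the mean (1.0)
```

Treating the missing item as zero would have given the second person a total of 3, which quietly assumes they would have answered 0.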

However, if a person only completed one item out of 50, we probably don't want to give them a score for the total. The solution is mean.x, where x is the number of items that must have been completed for a score to be given.

So, if you want to only give people a score on the lie scale if they have completed 8 items, you use:

compute = mean.8( to * 12.

However, it's often better not to use the total score anyway. The mean score for an item is more useful, as it's more interpretable, so just miss off the *12, and use:

compute = mean.8( to .
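To make the minimum-items rule concrete, here is a rough Python equivalent of what mean.8 does. This is a sketch, not SPSS: mean_min is a made-up name, the responses are invented, and None stands in for a missing answer.

```python
import math

def mean_min(responses, min_items):
    """Mean of the answered items, or None if fewer than min_items were answered."""
    answered = [r for r in responses if r is not None]
    if len(answered) < min_items:
        return None
    return sum(answered) / len(answered)

# Two made-up people on a 12-item scale scored 0/1.
nine_answered = [1, 0, 1, None, 0, 0, 1, None, 1, 0, None, 0]
seven_answered = [1, 0, None, None, 0, None, 1, None, 1, 0, None, 0]

item_mean = mean_min(nine_answered, 8)    # 9 items answered, so a score is given
no_score = mean_min(seven_answered, 8)    # only 7 answered, so no score

print(math.isclose(item_mean, 4 / 9))  # True
print(no_score)                        # None

# Multiply by the number of items if you want a total instead of an item mean.
print(math.isclose(item_mean * 12, 16 / 3))  # True
```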

You can also do this in Stata, and it's far, far easier. In Stata, you use the -alpha- command. The command works out which items need to be reversed (very, very occasionally it gets it wrong though), and calculates a mean score.

Stata does care about carriage returns (unless you tell it not to) so you can't use the same trick to clarify your code.

In Stata, you would write (all on one line):
alpha epq8 epq12 epq16 epq20 epq24 epq29 epq33 epq37 epq40 epq45 epq47 , generate ( min(8) item

The generate() option tells Stata to generate a new variable. min(8) tells Stata to only include cases that have answered at least 8 items, and item tells Stata to do an item analysis, so that you can see which items Stata thought should be reversed.

Wednesday, March 07, 2007


If you're the kind of person who reads about stuff on the internet (which obviously you are, 'cos you're reading this), you'll have encountered Conservapedia, and various websites that have mocked some of the entries.

I had a look for some articles on statistical sorts of things, so that I could join in the mocking, that being my idea of fun. But I failed. The only article I found was one on statistics, which is brief but which I couldn't really knock for accuracy. I couldn't find anything else: mean, median, mode, regression, correlation, Fisher, Pearson all have no entry.

I tried more words more distantly related to statistics - causality takes you to a page on non-sequiturs, which does a half-hearted job of explaining the relation between correlation and causality. 'Evidence' takes you to a page about Jesus.

There are lots of pages to mock (read the Jesus page, but also note that the discussion shows that there are some more sensible people there), but really, it's too easy. However, I'll point out that on the Macroevolution page they make the error of describing macroevolution as "the unproven theory". Of course it's unproven. Theories can be lots of things - supported, useful, good, elaborate, but what they can't be is proven.