Thursday, November 10, 2005

Rubik's Cube (and trimmed means)

First of all, who would have thought that Rubik's cubes were still about? And who would have thought that there was still a world championship?

But that's not what this post is about. There is a world championship, and each person has to solve 5 cubes - the mean time is the score that is used to determine the ranks. However, the problem with using time in any measure is that it tends to be positively skewed - you can't take less than no time, but you can take an awfully long time. So what the cube people do is use a trimmed mean - that is, they remove the best and worst times, and find the average of all the rest. It's basically a way of discarding outliers, but doing it fairly.
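To make that concrete, here is a minimal sketch of the calculation in Python. The function name and the five solve times are made up for illustration, and it assumes exactly one time is dropped from each end, as described above.

```python
def trimmed_mean(times):
    """Drop the single best and single worst time, then average the rest."""
    if len(times) < 3:
        raise ValueError("need at least three times to trim one from each end")
    ordered = sorted(times)
    kept = ordered[1:-1]  # discard the fastest and the slowest solve
    return sum(kept) / len(kept)


# Hypothetical solve times in seconds; one solve went badly wrong.
times = [12.0, 13.0, 14.0, 15.0, 40.0]

plain_mean = sum(times) / len(times)  # dragged upwards by the 40-second solve
trimmed = trimmed_mean(times)         # uses only 13, 14 and 15

print(plain_mean, trimmed)
```

The one slow solve pulls the ordinary mean well above the typical time, while the trimmed mean stays with the middle three.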

Wednesday, November 09, 2005

Prosecutor's Fallacy

The Prosecutor's Fallacy might be raising its head in the trial of Bradley John Murdoch for the murder of Peter Falconio in the Australian outback. BBC News reported that "Blood on Joanne Lees' T-shirt was 150 million billion times more likely to be Bradley Murdoch's than any other local white male's, a Darwin court has heard." However, the Brisbane Courier Mail gives a quote from the prosecutor: "It's 150 quadrillion times more likely that it's come from Bradley John Murdoch than another person from the population selected at random," she said.

But hold on! These aren't the same thing. Another person selected from the population at random isn't the same as any other local white male. The probability that my birthday matches that of one randomly selected person is 1 in 365, but the probability that I share my birthday with someone else who lives in the area is almost 1 - it is virtually certain that someone who lives near me shares my birthday, which makes this probability rather different.
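The birthday comparison can be checked with a couple of lines of Python. This is only a sketch: it assumes 365 equally likely birthdays and independent people, and the figure of 5,000 neighbours is an arbitrary number chosen for illustration.

```python
def p_someone_shares_my_birthday(n_people, days=365):
    """Probability that at least one of n_people has the same birthday as me."""
    p_one_person_differs = 1 - 1 / days
    return 1 - p_one_person_differs ** n_people


# One randomly selected person: this is just 1 in 365.
print(p_someone_shares_my_birthday(1))

# Anyone at all among a hypothetical 5,000 people living nearby.
print(p_someone_shares_my_birthday(5000))
```

The first number is the "randomly selected person" probability; the second is the "anyone in the area" probability, and it is very close to 1.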

Let's try to think about this in an easier way. Say that a murder has been carried out, and my DNA has been found to match. Let's say that this DNA sample matches one in a million people, and that (to make life easier) we know the murderer lived in the UK, and that the population of the UK is 50 million (it's not, but it makes the sums easier).

The probability that I would match the DNA of the murderer is 1 in a million. Therefore the chance I didn't do the murder is one in a million, so it's off to prison for me? Wrong.

There are three ways to think about this. First is the way we just described - the probability of it matching me is one in a million, if I am a randomly selected person.

But this means that we would expect about 49 other people in the country to match this DNA sample as well. So the probability that I did the murder is actually only about 2% (1 in 50). So we could say that the chance I did the murder is 99.9999%, or that it's 2%.

But what is the probability that there doesn't exist someone in the country who matches the DNA, who didn't do the murder (or, to put it more simply, what's the probability that the only person who matches the DNA in the country is the murderer)? This is trickier to work out (we need to use the binomial distribution), but the answer is 0.0000000000000000009836%.
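For anyone who wants to play with these numbers, here is a rough sketch in Python using the made-up figures from the example (a population of 50 million and a match rate of one in a million). It assumes matches are independent from person to person, and it reads the third question as "nobody among the other 49,999,999 people matches"; the exact figure depends on how that event is defined, so the number it prints need not equal the one quoted above.

```python
import math

POPULATION = 50_000_000  # made-up UK population from the example
MATCH_RATE = 1e-6        # the sample matches one person in a million

# 1. The fallacious reading: one minus the match rate, taken as "chance of guilt".
p_fallacy = 1 - MATCH_RATE

# 2. Expected number of matching people, and my share of the suspicion.
expected_matches = POPULATION * MATCH_RATE      # about 50 people
p_guilty_given_match = 1 / expected_matches     # about 0.02

# 3. Probability that none of the other people match (a binomial "zero successes").
#    Work on the log scale so the tiny result is computed accurately.
log_p = (POPULATION - 1) * math.log1p(-MATCH_RATE)
p_no_other_match = math.exp(log_p)

print(p_fallacy, p_guilty_given_match, p_no_other_match)
```

Whatever the exact definition, the third probability is vanishingly small: with 50 million people it is all but certain that somebody else matches.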

So we have three rather different probabilities of my guilt. Now, IANAL, and it isn't looking good for the accused, but if people are going to chuck numbers about, they should make sure that they do it properly.

However, all of this has a bit more relevance for us, doing statistics in a research context, because we use probability, and a lot of the time this isn't really understood - researchers throw probabilities around in journal articles, sometimes without really understanding them.

We are going to use the symbol | to mean "given that". Let's use our DNA example again.

We have a null hypothesis - you did not do the murder - so we'll call this H. And we have a DNA match, so we'll call this D. We want to know the probability that you did the murder.

The prosecutor's fallacy is this: We have a 1 in 1,000,000 probability of a match, and they think that the probability that you did not do the murder (H), given that we have a DNA match, is 1 in a million, i.e. 0.000001. We could write that as p(H|D) = 0.000001.

However, this isn't what the test, on its own, tells us. Instead, it tells us the probability that we have a DNA match (D), given that you didn't do the murder (H). That is, it tells us p(D|H) = 0.000001. The probability we actually care about, p(H|D), is much, much higher - about 98% in the example above, where 49 of the 50 people who match are innocent.

In a statistical test, we test a null hypothesis (H), using data (D). We want our significance value from our statistical test to tell us p(H|D), that is the probability that our null hypothesis is true. But it doesn't do this, it only tells us p(D|H) - the probability of obtaining our result, given that the null hypothesis is true, which is quite a different thing.

So, how do we find out what we really want to know - p(H|D)? The answer is that we have to use Bayes' theorem. The big problem with Bayes' theorem is that we have to have a subjective prior probability that the null hypothesis is true - that's not such a problem in a court case. In science, it is rather a problem. What is your subjective prior probability that homeopathy is effective? Mine is zero (or as close to as makes no difference); yours might be different, in which case we are going to get different values of p(H|D), based on the same data, which makes life a bit tricky, really.
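Here is a small sketch of that calculation in Python, applied to the DNA example. The function name is made up, H is the null hypothesis that you did not do the murder, and two things are assumed rather than given: that the real murderer is certain to match, so p(D|not H) = 1, and that before seeing the DNA everyone in the population of 50 million is equally likely to be the murderer.

```python
def posterior_null(prior_null, p_data_given_null, p_data_given_alt):
    """Bayes' theorem: p(H|D) from the prior p(H), p(D|H) and p(D|not H)."""
    numerator = p_data_given_null * prior_null
    evidence = numerator + p_data_given_alt * (1 - prior_null)
    return numerator / evidence


POPULATION = 50_000_000
MATCH_RATE = 1e-6  # p(D|H): an innocent person matches one time in a million

# Prior: you are one of 50 million equally likely suspects (an assumption).
prior_innocent = 1 - 1 / POPULATION
p_innocent = posterior_null(prior_innocent, MATCH_RATE, 1.0)
print(p_innocent)  # roughly 0.98, nowhere near one in a million

# A different prior gives a different answer from the same data.
p_innocent_5050 = posterior_null(0.5, MATCH_RATE, 1.0)
print(p_innocent_5050)
```

With the "anyone in the country" prior, the DNA match leaves you about 98% likely to be innocent, which is close to the rough 2% figure for guilt from earlier. Start from a 50:50 prior instead and the same match leaves almost no room for innocence, which is exactly the difficulty with subjective priors.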