Applying Regression Blog
Sunday, January 20, 2008
The Ghost Map

If you're anything like me, then when you learned about John Snow and the Broad Street pump, you learned something like this:
- There was a cholera epidemic in an area of Soho in London in 1854.
- John Snow was a doctor there, who drew a map of where the deaths were, and realized that water from the Broad Street pump was the source.
- Upon realizing that, he removed the handle from the pump, to stop people using it.
- Snow showed that cholera was an infectious disease.
- He thereby also founded the discipline of epidemiology.
The first thing you learn from the book is that the popular account of the Broad Street pump incident is wrong in many, many ways. The events happened in almost the reverse order. Epidemiology existed before the incident (not long before, and Snow was a founder). Some people thought that cholera was infectious before this incident. Most didn't. Most didn't think it was infectious afterwards either. Snow removed the pump handle - but before he drew the map. He didn't even draw the best map - someone else drew an improved version later on.
The book also goes into detail about what was known about the minor details of people's lives: where they worked (this was important, because it affected what they drank), where their relatives lived. But sometimes details are lost - no one knows the name of the first victim. It examines the epidemic on a broader, political scale - why things had got to a state where a cholera epidemic was possible. It examines how the situation arose that cholera could spread - the city got too big, the carts that had to carry the, errrmmm..., waste had too far to go, and removing the 'night soil' became too expensive; people filled their cellars with shit instead. And it looks at what happened as a result - sewage systems removed sewage from people's houses, and dumped it in the river.
So why am I writing about it here? It shows how three things that are required for statistical analysis to have an effect came together. First, there was a theory - a theory that could be tested by the collection of data. Snow (and others) spent a long time collecting that data - finding out who had died, where they had lived, what they did - but the theory was needed to guide the data collection. Second, the data had to be presented clearly - in this case on the map, which was a form of bar chart, with a bar on each house. (Nowadays we'd call that a Geographical Information System (GIS).) Third, the consumers of the research had to be persuaded - in this case, the government departments who needed to build sewers to improve public health.
Finally, the book is extraordinarily well written. It manages to keep the text flowing without becoming turgid, and without deviating from the facts. Here's a brief example: "Word of the outbreak had traveled through the wider city and beyond. The chemist's son who had enjoyed his pudding days before on Wardour Street died on that Sunday at his home in Willesden."
There's more information in the John Snow entry on Wikipedia.
Saturday, November 17, 2007
Effect Sizes

A Davies sent a message to the psych-methods list:
I have carried out three hierarchical regression analyses (linear for attitudes and intention, and logistic for behaviour). Within these regressions I have controlled for past behaviour and baseline psychological variables, such that I want to know what the effect of the intervention is once these variables are controlled for in the early steps.
I am familiar with using means and standard deviations to calculate cohen’s d, but am unsure how to obtain these values from a regression analysis? Are there other approaches that can be used from data analysed using hierarchical regression? I understand that the R2 value can be used but as I understand it that is an indicator of the effect of adding the variable to the regression. I've also read about using the t-test statistic but when I have used it it seems to indicate a large effect of the intervention (d=.45) when actually the intervention is only marginally significant (p<.07) as a predictor of change. Does anyone have any suggestions?
To answer this question, we should think about what an effect size is, and what it's for. Sometimes the units of measurement are meaningful. If eating a tomato every day is associated with living (on average) two years longer, then there's no need to turn this into an effect size. You understand what two years means. It makes sense. You don't need an effect size, and you can easily compare it to other studies that examined the influence of eating a carrot every day.
In psychology, and other social sciences, we often have measures that are not meaningful. If one study has looked at the efficacy of Prozac for depression, and found that Prozac was associated with a difference of 8 points on the CESD, and a second study has looked at the efficacy of CBT and found a difference of 4 points on the PHQ, it's hard to compare those two effects.
The solution is an effect size, something like Cohen's d. Cohen's d is very simple. It's the difference between the two groups, divided by the standard deviation (that's a pooled standard deviation, so it's a tiny bit fiddlier, but not much).
If the SD of the CESD was 16, then 8 is half of that, so the effect of Prozac was d = 0.5. If the SD of the PHQ was 8, then 4 is half of that, so the effect of CBT was also d = 0.5, and the two effects were the same.
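In symbols, using the usual pooled-SD formulation, that's:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

where $s_p$ is the pooled standard deviation I mentioned.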
And as long as we have a dichotomous predictor, we are happy.
However, when we move to regression and multiple regression, it gets trickier, and we need something different. There are several things that we can use.
The first is the standardized effect, what SPSS calls beta. If you've only got one predictor, it's the correlation. Correlations are nice. We understand correlations. We might not like the standardized effect for a couple of reasons. One of them is that it destroys the units. If we have units we are interested in, then the standardized effect hides them. If the standardized effect of the relationship between tomatoes eaten per day and longevity is moderate (say 0.3) I might go and eat a lot of tomatoes. However, you might then tell me that eating one more tomato per day increases longevity by 12 seconds. That's pretty poor. If I only knew the standardized effect, that would be hidden from me. If your predictor is dichotomous, then the standardized effect is very silly.
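If you want this from your software: in Stata, for instance, the beta option to regress reports standardized coefficients alongside the raw ones. A minimal sketch, using the auto data that ships with Stata (mpg and weight are just illustrative variables):

* load the example dataset that ships with Stata
sysuse auto, clear
* report standardized (beta) coefficients alongside the raw ones
regress mpg weight, beta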
The second choice is a partially standardized effect. Here, you standardize only the outcome variable, and keep the predictor unstandardized. This effect is the difference, in standard deviation units, associated with a 1 unit change in the predictor. If your predictor is dichotomous, that's Cohen's d. If it's not dichotomous, it's analogous to Cohen's d.
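Most packages won't do the partially standardized effect for you, but it's easy by hand: standardize the outcome, leave the predictor alone, and rerun the regression. A sketch in Stata, again with the auto data (the variables are illustrative):

* load the auto data and standardize the outcome only
sysuse auto, clear
egen z_mpg = std(mpg)
* the coefficient on weight is now in SD units of the outcome
regress z_mpg weight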
The third choice if you're using hierarchical regression is the change in R2. You ask how much additional variance is explained by the predictor (or predictors) that you added to the model at each step.
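If your package doesn't report the change in R2 directly, you can calculate it by fitting the model with and without the predictor of interest. A sketch in Stata (illustrative variables again):

sysuse auto, clear
* step 1: control variable(s) only
regress mpg weight
scalar r2_step1 = e(r2)
* step 2: add the predictor of interest
regress mpg weight foreign
* the difference is the R-squared change for that step
display "R-squared change: " e(r2) - r2_step1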
In the case of logistic regression, things get harder (as a general rule, everything is harder with logistic regression). Every software package gives you the parameter estimates of the logistic regression as output, but not every package gives the odds ratio (sometimes called Exp(B)) unless you ask for it. (In Stata, for example, you need to use the or option; in SPSS and SAS it's automatic; in R you have to exponentiate the coefficients yourself.)
You should always present the odds ratio, because it makes more sense than the estimate. But it doesn't make a lot of sense.
We can't standardize our outcome, because it's dichotomous. We could standardize our predictor, if that made sense, and present the partially standardized OR.
However, odds ratios are funny. They're funny because people don't know how to interpret them, so you should give them help. And you give them help by converting the parameter estimates into probabilities: choose sensible values for the covariates, and calculate the probabilities associated with them.
I'm going to use the auto data in Stata to demonstrate this (I'll give the Stata code at the end). I'm going to regress foreign (a dichotomous variable indicating whether a car was made in the USA or not) on price (in $1000s) and mpg.
Here's the Stata output:
foreign | Coef.
price | .2660188
mpg | .2338353
_cons | -7.648111
Of course, we want the odds ratios rather than the estimates on the log-odds scale.
foreign | Odds Ratio
price | 1.30476
mpg | 1.263436
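The odds ratios are just the exponentiated coefficients, so you can check one for yourself in Stata:

* e^.2660188, the odds ratio for price
display exp(.2660188)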
The two variables are in real units, so there's no need to standardize them. A price increase of $1000 is associated with the odds of a car being foreign increasing by 1.3 times (these are odds ratios, so they are multiplicative; holding mpg constant), and one more mpg is associated with odds of being foreign 1.26 times higher. But what does that mean, in terms of the probability (because that's what we're interested in) of a car being foreign?
Let's compare two cars. One that costs $4000, and one that costs $5000 (these aren't new data). We can calculate the probability that each of those cars is foreign, and that will give us an effect size.
We just need to plug our numbers into the regression equation. But hold on, we need numbers for mpg. Let's pick a low value, say 14. (I'm not going to go through all the calculations, because it will take too long. If you're not familiar with how this is done, you can either believe me, or look it up.)
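If you do want to check my numbers, the probability is the inverse logit of the linear predictor, p = 1/(1 + exp(-(b0 + b1*price + b2*mpg))), and Stata's invlogit() function will do the work:

* the $4k car doing 14 mpg
display invlogit(-7.648111 + .2660188*4 + .2338353*14)
* the $5k car doing 14 mpg
display invlogit(-7.648111 + .2660188*5 + .2338353*14)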
Our $4k car which does 14 mpg has a 0.035 (3.5%) probability of being foreign.
Our $5k car which does 14 mpg has a 0.045 (4.5%) probability of being foreign.
So adding $1k to the price of a car increases the probability that it's foreign by one percentage point.
But what if we chose a different number of mpg? Let's use 25.
Our $4k car which does 25 mpg has a 0.32 (32%) probability of being foreign.
Our $5k car which does 25 mpg has a 0.38 (38%) probability of being foreign.
Which means that in this range, adding $1k increases the probability by six percentage points. That's rather a large difference.
You need to calculate those probabilities, and you need to calculate them at appropriate values of the other covariates, in order to ensure that the reader can interpret your regression. However, it's a difficult issue, and you'll want to read more. Two good references are:
Regression Models for Categorical and Limited Dependent Variables, by J. Scott Long. (There's a 2nd edition of this book out, which I haven't seen, but it's published by Stata Press, and therefore might be Stata specific.)
Data Analysis Using Regression and Multilevel/Hierarchical Models, by Andrew Gelman and Jennifer Hill.
(Tip of the Hat: Some of my thinking on this has been influenced by Thom Baguley, and thanks to Greg Meyer for a correction).
Here's the Stata code:

* load the auto dataset that ships with Stata
sysuse auto, clear
* rescale price into $1000s
replace price = price / 1000
* logistic regression, coefficients on the log-odds scale
logit foreign price mpg
* the same model, reported as odds ratios
logit foreign price mpg, or
Thursday, July 05, 2007
Start them young

Sean Carey teaches a course called Introduction to Regression at the Essex Summer School in Social Science Data Analysis. Here's a picture of his offspring that he uses in his initial presentation. I liked it, so I thought I'd share it here.
Friday, June 01, 2007
Historical Regression

There's been a story on the news recently that a report suggests that childhood obesity is increasing because more mothers are working, and not staying at home with their children. It's been reported in a few places; one of them is here. It says:
"We saw that start to happen. We could track childhood obesity. There's a direct correlation," said Terry Mason, Chicago's public health commissioner."This might well be true, or it might not (that's the sort of thing that other blogs can worry about), they mentioned correlation, so we're going to talk about it on this blog.
So, a direct correlation, eh? And what's the sample size here? Let's see: well, we've got before, and we've got after. So we'll give them N=2. That's not a lot, is it? In fact, any two measures (which change) taken before and after some point in time have to have a correlation of r=+1 or r=-1. They don't even give a specific date - it's 'the 1980s'. As far as that evidence goes, you might as well say that it was Ronald Reagan getting elected, the release of Windows version 1.0 (anyone remember that? It had a clock that was pretty cool, but nothing else), or the fact that West Germany made it to the World Cup final 3 times (I'm including 1990, 'cos we're allowed to be a bit fuzzy here).
In other words, if there WASN'T a direct correlation, we should be surprised.
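You can check this for yourself with any two made-up data points (the numbers here are invented, and don't matter):

* any two (changing) before/after measures correlate at +1 or -1
clear
input year obesity working
1975 10 40
1995 20 60
end
correlate obesity working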
This is an example of a more general problem (well, two more general problems).
The first is inspecting the data, and then making up a theory about the causal relationships that exist. But you've cheated: you looked at the data. Another example of this sort of thing is the clustering of leukaemia cases around nuclear power stations. When people first theorised this, they looked at the data first, and said: "Let's choose leukaemia (not any other sort of cancer) amongst children (we think of leukaemia as a children's disease, because that's what makes the news, but it isn't necessarily), and let's make the children aged, errmm, less than 3, and make it, errrmmm, 5 miles from a nuclear power station. No! 10 miles! Ah look, we've got a cluster!" Obesity is a problem now, but they identify the root of it in the 1980s - 20 years ago - so any event between now and then would appear to do the job.
The second (related) problem is that of the sample size. Any time you say "Y happened, because X happened before", where X and Y are one-off national events, you have an N of 2, and you can't say anything. Donohue and Levitt famously argued in 1999 that legalising abortion had decreased murder rates 20 (or so) years later (it's famous because they discuss it in Freakonomics). But they've got an N of 2. Murder rates are unstable; they weren't going to stay the same, so there was always going to be a correlation. This is discussed in a paper by Ted Goertzel called Myths of Murder and Multiple Regression (and if I ever write a paper with a title that good, I'll die happy). If you want to make this sort of statement, you need a bigger sample - you need to have abortion criminalised again a few years later (the murder rate should then rise), and then legalised again (it should drop). Alternatively, you need to have abortion legalised in different countries at different times, and then see what happens to the murder rate in those countries over the same interval afterwards.
Social scientists do have a habit of being able to make up a plausible sounding theoretical explanation of any result that they happened to find. What if Donohue and Levitt had found that murder rates went up, as a result of abortion being legalized? Do you think they could have made up a plausible sounding theory to account for that too? (Go on, you try. I bet you can.) Well, it happens that other researchers have found that, by controlling for a couple of other variables, legalizing abortion really did make the murder rate go up.
It's unlikely that this will ever replicate, at least enough times to get a sensible sample, and it's unlikely that there will be changes in the workforce that mean that most mothers no longer work, so we'll never know who (if anyone) is correct. And a hallmark of science is that it has to be able to be wrong - for a theory or a finding or a fact to be considered to be scientific, we must be able to state what evidence would make us say "Oh, that was wrong then". For statistical analysis of this sort of result, there isn't anything that would enable us to reject the theory, and if we can't reject it, we might as well rely on psychoanalysis. ("You hate your father? That's because of repressed sexual urges. You love your father? That's because of unrepressed sexual urges.")
Friday, May 04, 2007
Stepwise regression

There has been a little exchange on the SAS-L list about stepwise regression, and whether you should use it.
Jerry Davis (defending stepwise regression, in applied settings) wrote:
Stepwise selection methods are just statistical tools and like any tool, may be used inappropriately. I doubt if Walmart gives a rip about multicollinearity, they just want better predictions. If stepwise does the job, so be it.
Peter Flom wrote:
The analogy to a tool is common.
Stepwise methods ARE like a tool. They are like a hammer whose head flies off at frequent but irregular intervals.
Thursday, May 03, 2007
Mediation

The latest issue of Psychological Methods (Volume 12, issue 1) has two (count them! Two!) articles on mediation and moderation.
The first, by Jeffrey Edwards and Lisa Schurer Lambert, "Methods for Integrating Moderation and Mediation: A General Analytical Framework Using Moderated Path Analysis" (link is to the abstract only), is about how to most appropriately combine analyses of moderation and mediation. The article looks a bit tricky - there are lots of equations - but they are not nasty hard equations, and working through them gets you what you want.
The second, by Scott Maxwell and David Cole, "Bias in Cross-Sectional Analyses of Longitudinal Mediation", is possibly more important, and is about whether mediation analyses are biased if everything is measured cross-sectionally (i.e. at the same time). The short answer is yes.
This is problematic for the many researchers who measure three things at one time point, and then carry out a mediation analysis to show that X causes M and M causes Y. The estimate of almost every effect that we are interested in turns out to be biased if you don't measure longitudinally.