Regression Using R
This page is one of a (small) set of additional appendices to the book Applying
Regression and Correlation, by Jeremy Miles
and Mark Shevlin.
R is not the most user friendly program in the known universe.
But it is one of the most powerful. And it's free (as in
beer,
and also as in speech). To make it a little easier for
beginners, here is a guide to doing regression in R.
- First, you need to get hold of R. Typing R into google
will find you the R homepage, or you can go straight to the Comprehensive R Archive
Network and download and install it from a mirror near you.
- When you have R installed, and you run it, it should
look a bit like:

- We need to get some data into R. We'll
cover two types of data, first tab-delimited, then SPSS
For either kind of file, it will make life easier if you change the
working directory to the same directory that your data are currently
stored in. To do this, click on the file menu, and choose
Change
Dir:

Then choose the directory where your data are stored.
- To read a tab delimited file, with variable names at
the top, we'll assume that your data look like:
HASSLES ANX 38 10 10 12 60 21 90 16 88 27 96 30 1 9 41 7 86 32 59 11 ...
You can find the whole file here.
We use the command read.table,
and type:
data1
<- read.table("data6_1a.dat", header = TRUE)
If we go through that a bit at a time, we'll see
what it means.
- data1
is
the name of the data object that we are creating in R.
Effectively, R can have lots of datasets that it is working
with
at the same time, and it needs to know what to call each one.
Because we aren't very imaginative, we are going to call it data1.
- <-
is what we would call the assignment operator, if we wanted
to
use computery talk It's like using the = sign.
(Except
that the problem with the equals sign is that it is ambiguous.
If
we write x = 4
does that
mean "put the value 4 into the box labelled x", or does it mean "is x
equal to 4"? In R, for the first, we would write x <- 4,
and for the second, x
== 4).
- Next we have our file name, in speech marks.
We
changed the directory that R would look in, so that we didn't have to
type in the whole path (e.g. "c:\my
documents\statistics\examples\R\data6_1a.dat").
- Finally, we put header = TRUE,
which means that it is true that the variable names are included in the
header at the top of the file.
For an SPSS file, we have to be introduced to the world of R packages.
R packages are rather cool, and are what really make R what
it
is. To read an SPSS file, you need to have a package loaded
called 'foreign'. Click on the Packages menu, and then choose
Load Package. You are presented with a list of packages that
are
available. If it's there, choose foreign, and click OK.
If it's not there, you need to get it. Click on Packages and
choose Set CRAN Mirror. If you are in a UK university, choose
UK
(Bristol). If you aren't, then choose the one that's nearest
to
you.
Then choose packages, Install Packages, and choose the package
'foreign'. R will now download the appropriate file over the
web.
But it won't install it. You have to go back to
packages,
load package, foreign.
The you just type:
data1
<- read.spss("data6_1a.sav")
And we have successfully read in our file.
- Each variable can now be referred to using the name
of the
dataset, and the name of the variable, separated by a $ sign.
So
the first variable is called data1$HASSLES.
Notice that R is case sensitive, and variable names (when
read from SPSS) are always in capitals.
If you type:
data1$HASSLES
R will give you the values of the figures in the first variable.
You can refer to this in other ways (which is useful later on, when you
want to program).
data1[1]
Will also give the values of the first variables, with the following
output:
> data1[1]
$HASSLES
[1]
38 10 60 90 88 96 1 41 86 59 25 5 3 16 22
41 29 72 55 36 96 36 91 47 63
[26] 64 98
81 99 90 95 5 71 82 97 47 30 75 35 78
If you use double square brackets, you get:
> data1[[1]]
[1]
38 10 60 90 88 96 1 41 86 59 25 5 3 16 22
41 29 72 55 36 96 36 91 47 63
[26] 64 98
81 99 90 95 5 71 82 97 47 30 75 35 78
Notice the differeence? The second
time, we didn't get the variable name at the top.
We can also draw a histogram of the variable:We can also draw a
histogram of the variable:
> hist(data1$HASSLES)
Will give:

And:
>
plot(data1$HASSLES, data1$ANX)
Will give

- Now we've got our data in, we can go ahead and do our
regression, but first we are going to do an extra step, because we are,
fundamentally, lazy.
It's a bit dull having to retype data1$ each time we want to access a
variable. (It's useful, because we can have multiple datasets
open, but it's still dull.) So, we are going to attach our
dataset. When a dataset is attached, you only need to type the variable
name, and R takes it from the dataset which is attached. We
type:
> attach(data1)
and our data are attached.
- The basic regression command in R is GLM (general, or
generalised, linear model). We use the command:
glm(outcome
~ predictor1 + predictor2 + predictor3 )
For our first regression, the analysis we want
to do is:
glm(ANX ~ HASSLES)
And R outputs:
Call:
glm(formula = ANX ~ HASSLES)
Coefficients:
(Intercept)
HASSLES
5.4226
0.2526
Degrees of
Freedom: 39 Total (i.e. Null); 38 Residual
Null
Deviance: 4627
Residual
Deviance:
2159
AIC: 279
But this is now gone - we have the estimates, but no more. We
need to have the output stored somewhere, so we can do something with
it. To do this, we use the assignment operator, as before:
glm.linear <- glm(ANX ~ HASSLES)
We are naming the output of our regression
glm.linear - in R language, we are creating an object, called glm.linear .
If you are used to almost any other program, the result will
seem a little strange, because there isn't any. If you want
to know what happened, you have to ask.
One way to ask is to just type the name of the object, and the object
will be give to you. (Just like when we typed the name of the dataset,
the dataset was output).
>
glm.linear
Call:
glm(formula = ANX ~ HASSLES)
Coefficients:
(Intercept)
HASSLES
5.4226
0.2526
Degrees of
Freedom: 39 Total (i.e. Null); 38 Residual
Null
Deviance: 4627
Residual
Deviance:
2159
AIC: 279
But we can do other things with the object. The function summary(), for
example, when applied to a variable, gave a certain kind of result.
When applied to a dataset, it gave a different kind of
result. When applied to a glm object, it gives a different
kind of result:
>
summary(glm.linear)
Call:
glm(formula
= ANX ~ HASSLES)
Deviance
Residuals:
Min
1Q
Median
3Q
Max
-13.3153
-5.0549 -0.3794
4.5765 17.5913
Coefficients:
Estimate Std. Error t value
Pr(>|t|)
(Intercept)
5.42265 2.46541
2.199 0.034 *
HASSLES
0.25259 0.03832 6.592
8.81e-08 ***
---
Signif.
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion
parameter for gaussian family taken to be 56.80561)
Null deviance: 4627.1 on 39 degrees of freedom
Residual
deviance: 2158.6 on 38 degrees of freedom
AIC: 279.05
Number of
Fisher Scoring iterations: 2
And that is what we would consider to be our
answer from the regression.
- However, we can do more. There are other
functions that we can apply, to get different information from our
regression analysis. We can ask for residuals, using the
command (and this will surprise you):
>
residuals(glm.linear)
Trouble is that the output isn't very helpful:
1
2
3
4
5
6
-5.02121610
4.05141414 0.42171728 -12.15610083
-0.65091296 0.32833554
7
8
9
10
11
12
3.32475957 -8.77899791
4.85427492 -9.32568878
1.26250508 -1.68561618
13
14
15
16
17
18
13.81957170
-0.46414949 0.02028689
7.22100209 -6.74787067 -2.60940996
19
20
21
22
23
24
-13.31531303
4.48397177 -12.67166446 -1.51602823
17.59130523 -0.29456154
25
26
27
28
29
30
-5.33606453
1.41134153 9.82314767
4.11724460 12.57055373 -5.15610083
31
32
33
34
35
36
9.58092948 9.31438382
-10.35681603 4.86465066
7.07574161 -1.29456154
37
38
39
40
-5.00046461
-8.36719178 -3.26343429 -2.12497359
But we can do stuff with that output to make it more useful.
Specifically, we could put it into a variable, using the
command:
>
glm.linear.resids <- residuals(glm.linear)
> hist(glm.linear.resids)
Which gives the following:
>
hist(glm.linear.resids)
And then we could draw a histogram of that variable:

- Finally, we could find the predicted
values, using the
fitted.values() function.
> glm.linear.preds
<- fitted.values(glm.linear)
And draw a scatterplot of residuals against predicted values, to
examine heteroscedasticity, linearity and multivariate outliers.
>
plot(glm.linear.preds, glm.linear.resids)

That was all rather hard work, wasn't it? Why bother, when we
(probably) have access to SPSS, Excel, or other programs that can do
regression?
There are, as I see it, two reasons.
First, R is more than a statistics package, it's a programming
environment. You can access parts of the regression output,
and do whatever you want to with them. Because glm.linear is
an object, you can access it like a dataset. The first
element in that object is referred to as glm.linear[1] - if you type
this you get:
> glm.linear[1]
$coefficients
(Intercept)
HASSLES
5.4226465 0.2525939
The coefficients, are another object. You can extract a part
of that object, by asking for the first element of it,
>
glm.linear[[1]][[1]]
[1] 5.422646
And because that's a bit of output, you can make it
a variable, and do stuff with it.
> x
<- glm.linear[[1]][[1]]
> x
[1] 5.422646
> x
<- glm.linear[[1]][[1]]
> x * 3
[1] 16.26794
It's a bit like the output management system in SPSS, but much easier
to use. (do you use SPSS? Have you ever used the output
management system? Thought not - because it's fiddly.)
The second reason to bother with this is that, although R has a steep
learning curve to start with, it gets flatter, faster, than other
programs, because all of the commands are very similar. To do
a logistic regression in (say) SPSS, you need to use a different
interface, with different rules. In R, to do a logistic
regression, you add 'family
= binomial', to do a Poisson regression, you add 'family = poisson'.
Multilevel models in R are similar: instead of the glm()
command, we use the lme()
command (for linear mixed effects) or nlme() for
non-linear mixed effects. The commands are very similar, so
once you've learned one, you've learned them all.
|
|