Regression with Excel / Multiple Regression with Excel |
A guide to regression analysis using MS Excel 2010. For older versions of Excel, see here. This is a version of Appendix 2 of the book “Applying regression and correlation: A guide for students and researchers.” This is a version of Appendix 2 of the book “Applied Regression Analysis: A guide for students and researchers.” It is intended to be read with the book, but should work as a standalone guide to doing regression with Excel Go Back to In this section we will describe how to use the Excel 2010 for Window to carry out some of the procedures we have described in the book. While this approach uses Excel 2010, I don't think that the regression procedure has changed (much) between different versions. To do regression in Excel, you need the Analysis Toolpak add-in to be installed in Excel.Check that the add-in is installed, and added-in, by clicking the File menu, and choosing Options. ![]() Click the button that says "Go" at the bottow of the window. ![]() Make sure that "Analysis ToolPak" is selected. ![]() You can now use the data analysis functions in Excel, which include multiple regression. The example that we will work through is taken from dataset 6.1b in the book "Applying regression and correlation" (if you jumped straight in here, that is what these web pages is about. Don't worry about that though, you can download the data from here .) To get to the data analysis functions in Excel, you select the Data menu, and then choose Data Analysis (it's on the right).. This gives the following Dialog, click on Regression and then click OK. The following window appears: In here, we tell Excel about the data that we would like to analyse. The first box is the input Y range. Here, we tell Excel about our outcome variable. The ou variable must be a column, 1 cell wide, and N cells long (where N is the number of individuals that we are analysing). The the dataset we are using, the dependent variable is Anx, which is the column which goes from cell D1 to Cell D41. You can either type this information in directly as D1:D41, or you can select the appropriate data from the spreadsheet with the mouse. Because we have included row 1, which includes the variable name, we are going to have to tell Excel this, by clicking on the "Labels" checkbox. The next stage is to input the predictor variables. The predictor variables must be a block of data, of k columns (where k is the number of predictor variables) and N rows (where N is still the number of people). In the dataset we are using we have three independent variables: hassles, hassles2 and hassles3. (These represent the linear, quadratic and cubic effects of hassles - we are analysing a non-linear relationship here,) These are held in rows 1 - 41 of columns A, B and C. Again, we can type in A1:C41 or select the data from the spreadsheet with the mouse- it will have the same effect. Next we tell Excel where we want the results to be written. It is best to ask for a new sheet - you don't want to accidentally overwrite some of your precious data, and have to go to all of the effort of restoring it from a backup,. (You do have a backup, don't you?) We can ask fro residuals and standardised residuals to be saved - these will be new columns of numbers created in the the new spreadsheet. Two types of graphs will be drawn automatically if you ask for them.
The dialog box now looks like this: . So, finally, we click OK. And we get a lot of output, written to a new sheet. A note about this output - output from analysis in Excel is usually "live" that is to say, the data are linked to the output. If you change the data, you will change the output. This is not the case for this type of output in Excel. The results of the analysis are "dead" and will not change. The ResultsThe results presented below have been copied from Excel to HTML, so they don't look identical. You can download the Excel file exactly as Excel wrote it for me here . |
| Multiple R | 0.807582 |
| R Square | 0.652189 |
| Adjusted R Square | 0.623205 |
| Standard Error | 6.686129 |
| Observations | 40 |
| The first part of the output is the
regression statistics.
These are standard statistics which are given by most
programs. |
| ANOVA | |||||
| df | SS | MS | F | Significance F | |
| Regression | 3 | 3017.745 | 1005.915 | 22.50151 | 2.22E-08 |
| Residual | 36 | 1609.355 | 44.70432 | ||
| Total | 39 | 4627.1 |
| The ANOVA table comes next.
This gives a test
of significance of the R2. Note that
Excel uses scientific
notation, by default, so when it says 2.22E-08 it means, 2.22 * 10-8
. (i.e. 0.0000000222). |
| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |
|
Intercept
|
12.39
|
3.90
|
3.18
|
0.00
|
4.48
|
20.29
|
4.48
|
20.29
|
|
HASSLES
|
-0.06
|
0.34
|
-0.19
|
0.85
|
-0.75
|
0.62
|
-0.75
|
0.62
|
|
HASSLES2
|
0.00
|
0.01
|
0.06
|
0.95
|
-0.02
|
0.02
|
-0.02
|
0.02
|
|
HASSLES3
|
0.00
|
0.00
|
0.51
|
0.61
|
0.00
|
0.00
|
0.00
|
0.00
|
| The next stage is the coefficients.
Note that
here I have converted the numbers to 2 decimal places to save space).
It gives the coefficient for each parameter, including the intercept
(the
constant). The standard errors, and the t-values follow (the
t-value
is the coefficient divided by the standard error). Next comes
the p-value
associated with the variable, and the confidence intervals of the
parameter
estimates (Excel gave these to me twice, even though I didn't ask for
them.) |
|
Observation
|
Predicted ANX
|
Residuals
|
Standard Residuals
|
|
1.00
|
12.08
|
-2.08
|
-0.32
|
|
2.00
|
11.82
|
0.18
|
0.03
|
|
3.00
|
15.98
|
5.02
|
0.78
|
|
4.00
|
29.86
|
-13.86
|
-2.16
|
|
5.00
|
28.56
|
-1.56
|
-0.24
|
|
6.00
|
34.16
|
-4.16
|
-0.65
|
|
7.00
|
12.32
|
-3.32
|
-0.52
|
|
8.00
|
12.37
|
-5.37
|
-0.84
|
|
9.00
|
27.31
|
4.69
|
0.73
|
|
10.00
|
15.64
|
-4.64
|
-0.72
|
| The final part of the output is the
residual information.
The observation in the left had column is the case number -
although
Excel never told us about this, it has labelled the first person
Observation
1, the second Observation 2, etc. (Note
that this is NOT
the original row number - Observation 1 was row 2). The predicted anxiety score is the score that was predicted from the regression equation. The residual is the raw residual - that is the difference between the predicted score and the actual score on the dependent variable. The final value is the standardised residual (the residuals adjusted to ensure that they have a standard deviation of 1; they have a mean of zero already). |
| Graphs Finally we will have a quick look at the graphs. We asked for (and we got) residual plots - but what we really wanted was the plot of the residuals against the predicted values. In linear regression, this would be fine. In multiple regression, it's not what we want. The second graphs shows the predicted and actual anxiety scores plotted against hassles. This graph is useful, as it shows us the non-linear nature of the relationship. ![]() |
| A Note to End On I have written this file which shows you how to do regression in Excel, but this does not mean that I think that you should be doing regression in Excel. Regression in Excel has a number of shortcomings, which include:
|