Datasets: from Applying Regression and Correlation 
This page contains datasets from the book Applying Regression and Correlation: A guide for students and researchers, published by Sage. Note that all datasets contain fictional data, designed to demonstrate different aspects of statistical analysis in psychology, and not to make any theoretical or empirical point.

The datasets can be downloaded individually, in SPSS (.sav), Excel (.xls) or tab-delimited format, or you can download them all at once, in zip format. Download a zip file containing all datasets in [SPSS format, 10kb] [Excel format, 17kb] [tab-delimited format, 7kb].

Part 1: I need to do regression analysis tomorrow

Chapter 1: Building Models with Regression and Correlation

Dataset 1.1: Bivariate regression
This dataset is used to illustrate the idea of regression analysis. It contains 3 variables (books, attend and grade) and has 40 cases. Books represents the number of books read by students on a statistics course, attend represents the number of lectures they attended and grade represents their final grade on the course. In Chapter 1 this dataset is used as an example to show how grade can be predicted using books.
SPSS (data1_1.sav), Excel (data1_1.xls), tab-delimited (data1_1.dat).

Chapter 2: Multiple Regression
The first example in this chapter uses dataset 1.1, but shows that when the variables are entered together, in a multiple regression, the parameter estimates are altered. (It is wrongly labelled dataset 2.1 in the book.)

Dataset 2.1: Multiple regression
This dataset is used as an example of hierarchical regression. It contains 4 variables: sex, age, extro and car. Sex and age are the sex and age of participants, extro is an extroversion score and car is the amount of time people spend looking after their cars. The dataset is used to demonstrate that when we are more sure about the causal hierarchy of a set of variables we can analyse them hierarchically: entering sex and age first, and then entering extroversion, reduces the estimate of the proportion of variance in car that is accounted for by extroversion. It is wrongly labelled dataset 2.2 in the book.
SPSS (data2_1.sav), Excel (data2_1.xls), tab-delimited (data2_1.dat).
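As an illustration of how the tab-delimited files might be used, the sketch below reads a tab-delimited table and fits the bivariate regression of grade on books described for dataset 1.1. It assumes that the .dat file has a header row holding the variable names and that every cell is numeric, neither of which this page confirms. The helper names read_tab_delimited and simple_regression are hypothetical.

```python
import csv


def read_tab_delimited(handle):
    """Read a tab-delimited table with a header row into a dict of float columns.

    Assumption: the first row holds variable names and all cells are numeric.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns


def simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Possible use, assuming data1_1.dat has been downloaded to the working directory:
# with open("data1_1.dat", newline="") as handle:
#     data = read_tab_delimited(handle)
# intercept, slope = simple_regression(data["books"], data["grade"])
```

The regression is written out by hand so that the sketch needs only the standard library. A statistics package would also report standard errors and R squared.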
Chapter 3: Categorical Independent Variables in Regression

Dataset 3.1: t-test as regression
A very simple dataset designed to show the equivalence of t-tests and regression. It contains the results of an experiment examining the memory of 20 participants, 10 of whom were told how to use a mnemonic and 10 of whom were not. Two variables, group and score.
SPSS (data3_1.sav), Excel (data3_1.xls), tab-delimited (data3_1.dat).

Dataset 3.2a: One-way ANOVA as regression (dummy variable coding)
This is the same data as dataset 3.1, with a third experimental group added (aromatherapy). It shows the equivalence of ANOVA and regression.
SPSS (data3_2a.sav), Excel (data3_2a.xls), tab-delimited (data3_2a.dat).

Dataset 3.2b
Same as dataset 3.2a, but with the dummy variables already recoded (group_1 and group_2).
SPSS (data3_2a.sav), Excel (data3_2a.xls), tab-delimited (data3_2a.dat).

Dataset 3.3a: One-way ANOVA as regression (indicator coding)
This dataset is designed to demonstrate different types of coding for categorical variables in regression. It contains the stress ratings (stress) for five different types of educator (job): primary school teacher (1), secondary school teacher (2), college lecturer (3), old university lecturer (4) and new university lecturer (5).
SPSS (data3_3a.sav), Excel (data3_3a.xls), tab-delimited (data3_3a.dat).

Dataset 3.3b
This contains the same data as dataset 3.3a, but the data have been recoded to indicator coding, adding four new variables (group_1, group_2, group_3, group_4).
SPSS (data3_3b.sav), Excel (data3_3b.xls), tab-delimited (data3_3b.dat).

Dataset 3.4: Analysis of change
This dataset demonstrates the problems of analysing data that contain change scores using ANOVA. The dataset contains data from an experiment where memory was tested using a pretest-posttest design. If the data are analysed using a t-test on the difference scores, or a two-way mixed ANOVA, they just fail to achieve significance (at the 0.05 level). If a regression (or ANCOVA) based approach is used, the analysis just achieves significance at the 0.05 level. Contains 4 variables: group, unprimed (pre-manipulation score), primed (post-manipulation score) and diff (difference).
SPSS (data3_4.sav), Excel (data3_4.xls), tab-delimited (data3_4.dat).
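Datasets 3.2b and 3.3b supply recoded group variables ready for regression. A minimal sketch of how such 0/1 columns could be derived from a single group code follows. The page does not say which category the book treats as the reference group, so the reference argument is an assumption, as are the function name dummy_code and the example codes, which are illustrative and not taken from the datasets.

```python
def dummy_code(codes, reference):
    """Return one 0/1 column per non-reference level, keyed by that level.

    A case scores 1 on the column for its own level and 0 elsewhere;
    cases in the reference level score 0 on every column.
    """
    levels = sorted(set(codes) - {reference})
    return {level: [1 if code == level else 0 for code in codes]
            for level in levels}


# Illustrative job codes only (1 to 5, as in the coding scheme of dataset 3.3a).
job = [1, 2, 3, 4, 5, 1, 3]
columns = dummy_code(job, reference=1)  # reference level chosen arbitrarily
```

Five categories give four columns, which matches the four added variables listed for dataset 3.3b.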
Part 2: I need to do regression analysis next week

Chapter 4: Assumptions in regression analysis

Dataset 4.1: Transforming skewed data
One variable showing reaction times. The data are highly positively skewed, but a log transform normalises them.
SPSS (data4_1.sav), Excel (data4_1.xls), tab-delimited (data4_1.dat).

Dataset 4.2a: Bivariate Outlier
Contains the IQ scores of couples, to demonstrate that two variables may each be normally distributed without having a bivariate normal distribution. Contains two variables, female and male.
SPSS (data4_2a.sav), Excel (data4_2a.xls), tab-delimited (data4_2a.dat).

Dataset 4.2b
Same as dataset 4.2a, but also contains male predicted scores (from female) and residual.
SPSS (data4_2b.sav), Excel (data4_2b.xls), tab-delimited (data4_2b.dat).

Dataset 4.3: Multivariate outlier
This dataset is designed to demonstrate a multivariate outlier. It contains measures of life events (events), daily hassles (hassles), social support (support) and depression (dep). The variables are all univariate normal, and the scattergraphs all show that the data are bivariate normal, but when a regression analysis is carried out, a multivariate outlier emerges.
SPSS (data4_3.sav), Excel (data4_3.xls), tab-delimited (data4_3.dat).

Dataset 4.4: Heteroscedasticity
Demonstrates heteroscedasticity, and shows how the presence of heteroscedasticity means that a misspecification has taken place, which may be in terms of a missing interaction effect. The data contain 4 variables: cash is the amount of money that an individual earns, import is the importance to that person of the work that a particular charity does, and given is the amount of money that individual donated to the charity. The dataset also includes a variable di_cash, a dichotomised cash variable, to enable the interaction to be explored.
SPSS (data4_4.sav), Excel (data4_4.xls), tab-delimited (data4_4.dat).

Dataset 4.5: Nonlinearity
Contains 2 variables, labelled x and y. Demonstrates how nonlinearity can reduce our ability to predict a DV from an IV. The scattergraph shows that prediction should be almost perfect, but the value of R squared shows that although the prediction is good, it is not perfect.
SPSS (data4_5.sav), Excel (data4_5.xls), tab-delimited (data4_5.dat).

Dataset 4.6: More nonlinearity
This is a similar dataset to 4.5, but shows nonlinearity in relation to the surface area of houses and the amount of money they cost. Contains 2 variables, size and price.
SPSS (data4_6.sav), Excel (data4_6.xls), tab-delimited (data4_6.dat).

Chapter 5: Issues in regression analysis

Dataset 5.1: Collinearity
This dataset is an expansion of dataset 1.1: it contains an additional variable (late) which correlates with both attend and books. When all three variables are entered into the equation, the parameter estimates are no longer significant.
SPSS (data5_1.sav), Excel (data5_1.xls), tab-delimited (data5_1.dat).
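One simple first check for the kind of collinearity described for dataset 5.1 is to look at the correlations among the predictors themselves. The sketch below computes every pairwise Pearson correlation for a set of named predictor columns. The function names are hypothetical and the technique is offered as an illustration, not as the procedure used in the book.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_table(predictors):
    """Return {(name_a, name_b): r} for every distinct pair of predictor columns."""
    names = list(predictors)
    return {(a, b): pearson_r(predictors[a], predictors[b])
            for i, a in enumerate(names)
            for b in names[i + 1:]}


# Possible use with columns already loaded from data5_1.dat (loading not shown):
# table = correlation_table({"books": books, "attend": attend, "late": late})
```

Pairwise correlations can miss collinearity that involves three or more predictors jointly, so large values here are a warning sign and small values are not a guarantee.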
Part 3: I would like to know more of the things that regression analysis can do

Chapter 6: Nonlinear and Logistic Regression

Dataset 6.1a: Nonlinear model
This dataset contains a measure of daily hassles (hassles) and a measure of anxiety (anx). It shows how the effect of hassles on anxiety may be nonlinear, as the first hassle may have little impact on psychological wellbeing, but the 50th has a much greater impact.
SPSS (data6_1a.sav), Excel (data6_1a.xls), tab-delimited (data6_1a.dat).

Dataset 6.1b
This is the same as dataset 6.1a, with the addition of quadratic (hassles^2) and cubic (hassles^3) variables.
SPSS (data6_1b.sav), Excel (data6_1b.xls), tab-delimited (data6_1b.dat).
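The quadratic and cubic variables in dataset 6.1b are powers of the hassles score. A minimal sketch of how such columns could be built is given below. The page does not say whether the book computed them from raw or centred scores, so the use of raw scores is an assumption, and the function name add_power_terms is hypothetical.

```python
def add_power_terms(x, max_power=3):
    """Return {power: column} holding x ** power for powers 2 up to max_power.

    Assumption: the powers are taken of the raw (uncentred) scores.
    """
    return {power: [value ** power for value in x]
            for power in range(2, max_power + 1)}


# Illustrative scores only, not taken from the dataset.
hassles = [1, 2, 3]
terms = add_power_terms(hassles)
quadratic, cubic = terms[2], terms[3]
```

Entering hassles, then the quadratic term, then the cubic term into a regression lets each step test whether the extra curvature improves prediction.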

Dataset 6.2: Logistic Regression
Demonstrates logistic regression. Score represents scores on an aptitude test for a course, exp represents months of relevant previous experience, and pass indicates whether the individual passed the exam at the end of the course.
SPSS (data6_2.sav), Excel (data6_2.xls), tab-delimited (data6_2.dat).

Chapter 7: Moderator and Mediator Analysis

Dataset 7.1: Moderator effect with two categorical independent variables
Contains data from the classic context-dependent memory study, showing the interaction effect. Contains two independent variables: learning environment (learn, with two levels, dry and wet) and testing environment (test, with two levels, dry and wet). Also contains scores, indicator coding of learn and test (indlearn and indtest, respectively) and an interaction term (lxt).
SPSS (data7_1.sav), Excel (data7_1.xls), tab-delimited (data7_1.dat).

Dataset 7.2: Moderator effect with one categorical and one continuous variable
Contains three variables. Events is a person's score on a life event scale, indicating the number and severity of recent life events. Status is a measure of whether a person cohabits with a partner (a 0 indicates that they do not, and a 1 indicates that they do). Stress is the score on a self-report measure of experienced stress.
SPSS (data7_2.sav), Excel (data7_2.xls), tab-delimited (data7_2.dat).

Dataset 1.1 (repeated): Moderator effect with two continuous IVs
There is a significant interaction effect in this dataset: the attend * books interaction is significant, showing that it is not sufficient just to read books, or just to attend lectures; students must do both.

Dataset 7.3: Mediator effect
Contains three variables: enjoy indicates how much a person reports they enjoy reading books, read indicates the number of books that they have read, and buy indicates the number of books that they have purchased in the previous 12 months.
SPSS (data7_3.sav), Excel (data7_3.xls), tab-delimited (data7_3.dat).
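The moderator analyses in this chapter rely on an interaction term, such as lxt in dataset 7.1 or attend * books in dataset 1.1. A common way to form such a term is to multiply the two predictors case by case, as sketched below. That lxt was computed as the product of indlearn and indtest is an assumption, the optional mean-centring is a general practice for continuous predictors that the page does not mention, and the function name product_term is hypothetical.

```python
def product_term(a, b, center=False):
    """Return the case-by-case product of two predictor columns.

    With center=True each column has its mean subtracted before multiplying.
    """
    if center:
        mean_a = sum(a) / len(a)
        mean_b = sum(b) / len(b)
        a = [value - mean_a for value in a]
        b = [value - mean_b for value in b]
    return [x * y for x, y in zip(a, b)]


# Illustrative 0/1 codes for a 2 x 2 design, not taken from the dataset.
indlearn = [0, 0, 1, 1]
indtest = [0, 1, 0, 1]
lxt = product_term(indlearn, indtest)
```

With 0/1 coding the product is 1 only for cases that score 1 on both predictors, so its regression coefficient captures the extra effect of that combination.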



Contact
Got a comment on this web site? Contact webmaster@jeremymiles.co.uk
Got a comment on the book? Contact jeremy@jeremymiles.co.uk