Datasets: from Applying Regression and Correlation

 
 
This page contains datasets from the book Applying Regression and Correlation: A guide for students and researchers, published by Sage.  Note that all datasets contain fictional data, designed to demonstrate different aspects of statistical analysis in psychology; they are not intended to make any theoretical or empirical point.
The datasets can be downloaded individually, in SPSS (.sav), Excel (.xls) or tab-delimited format, or you can download them all at once, in zip format.

Download a zip file containing all datasets in [SPSS format - 10kb] [Excel format - 17kb] [tab-delimited format - 7kb]. 
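
For readers working outside SPSS or Excel, a tab-delimited file can be read with a few lines of Python. This is only a sketch: it assumes the first row of each .dat file holds the variable names and that every value is numeric, neither of which is stated on this page. The function name read_tab_delimited is hypothetical and the sample values are made up.

```python
import csv
import io

def read_tab_delimited(handle):
    """Return a list of dicts, one per case, with every value converted to float."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{name: float(value) for name, value in row.items()} for row in reader]

# Stand-in for open("data1_1.dat", newline=""); the values here are made up.
sample = io.StringIO("books\tattend\tgrade\n2\t10\t55\n4\t14\t70\n")
cases = read_tab_delimited(sample)
```

If a file turns out to have no header row, csv.reader with delimiter="\t" can be used instead and the variable names supplied by hand.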
 

Part 1: I need to do regression analysis tomorrow

Chapter 1: Building Models with Regression and Correlation

Dataset 1.1: Bivariate regression

This dataset is used to illustrate the idea of regression analysis.  It contains 3 variables (books, attend and grade) and has 40 cases.  Books represents the number of books read by students on a statistics course, attend represents the number of lectures they attended and grade represents their final grade on the course.  In Chapter 1 this dataset is used as an example to show how grade can be predicted using books.

SPSS (data1_1.sav), Excel (data1_1.xls) tab delimited (data1_1.dat).
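
The idea of predicting grade from books can be sketched as an ordinary least-squares line. The numbers below are made up for illustration and are not the values in data1_1; the function name fit_line is hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: number of books read and final grade for five students.
books = [0, 1, 2, 3, 4]
grade = [40, 50, 55, 65, 70]

intercept, slope = fit_line(books, grade)
# Predicted grade for a student who read 2 books.
predicted = intercept + slope * 2
```

With these made-up values each extra book is associated with 7.5 more grade points, starting from a predicted grade of 41 for a student who read none.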

Chapter 2: Multiple Regression

The first example in this chapter uses dataset 1.1, but shows that when the variables are entered together, in a multiple regression, the parameter estimates are altered.  (It is wrongly labelled dataset 2.1 in the book.)

Dataset 2.1: Multiple regression

This dataset is used as an example of hierarchical regression.  It contains 4 variables: sex, age, extro and car.  Sex and age are the sex and age of participants, extro is an extroversion score and car is the amount of time people spend looking after their cars.  The dataset is used to demonstrate that when we are more sure about the causal hierarchy of a set of variables, we can analyse them hierarchically.  Entering sex and age first, and then entering extroversion, reduces the estimate of the proportion of variance in car that is accounted for by extroversion.  It is wrongly labelled dataset 2.2 in the book.

SPSS (data2_1.sav), Excel (data2_1.xls), tab-delimited (data2_1.dat).

Chapter 3: Categorical Independent Variables in Regression

Dataset 3.1: T-test as regression

A very simple dataset designed to show the equivalence of t-tests and regression.  It contains the results of an experiment examining the memory of 20 participants, 10 of whom were told how to use a mnemonic and 10 of whom were not.  There are two variables, group and score.

SPSS (data3_1.sav), Excel (data3_1.xls) tab-delimited (data3_1.dat).
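
The equivalence can be checked by hand. The sketch below computes a pooled-variance independent-samples t statistic and then the t statistic for the slope when score is regressed on a 0/1 group variable. The scores are made up (five per group rather than ten) and the 0/1 coding of group is an assumption, not taken from data3_1.

```python
import math

def mean(values):
    return sum(values) / len(values)

# Made-up memory scores.
control = [10, 12, 9, 11, 13]
mnemonic = [14, 15, 13, 16, 12]

# Independent-samples t-test with pooled variance.
n1, n2 = len(control), len(mnemonic)
m1, m2 = mean(control), mean(mnemonic)
ss1 = sum((v - m1) ** 2 for v in control)
ss2 = sum((v - m2) ** 2 for v in mnemonic)
pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
t_test = (m2 - m1) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Regression of score on group (0 = control, 1 = mnemonic).
group = [0] * n1 + [1] * n2
score = control + mnemonic
g_bar, s_bar = mean(group), mean(score)
sxx = sum((g - g_bar) ** 2 for g in group)
sxy = sum((g - g_bar) * (s - s_bar) for g, s in zip(group, score))
slope = sxy / sxx
intercept = s_bar - slope * g_bar
residual_ss = sum((s - (intercept + slope * g)) ** 2 for g, s in zip(group, score))
se_slope = math.sqrt(residual_ss / (len(score) - 2) / sxx)
t_regression = slope / se_slope
```

The slope equals the difference between the two group means, the intercept equals the mean of the group coded 0, and the two t statistics are the same.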

Dataset 3.2a: One-way ANOVA as regression (dummy variable coding)

This is the same data as dataset 3.1, with a third experimental group added (aromatherapy).  It shows the equivalence of ANOVA and regression.

SPSS (data3_2a.sav), Excel (data3_2a.xls) tab-delimited (data3_2a.dat).

Dataset 3.2b

Same as dataset 3.2a, but with the dummy variables already recoded (group_1 and group_2).

SPSS (data3_2b.sav), Excel (data3_2b.xls) tab-delimited (data3_2b.dat).
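
Dummy variable coding for three groups can be sketched as follows. Which group acts as the reference category, and which group each of group_1 and group_2 stands for, are assumptions here; check the data file for the coding actually used. The function name dummy_code is hypothetical.

```python
def dummy_code(group, coded_groups=("mnemonic", "aromatherapy")):
    """Return 0/1 dummy variables; any group not listed is the reference (all zeros)."""
    return {
        f"group_{i}": int(group == name)
        for i, name in enumerate(coded_groups, start=1)
    }

# The control group is assumed to be the reference category.
control_row = dummy_code("control")
mnemonic_row = dummy_code("mnemonic")
aroma_row = dummy_code("aromatherapy")
```

Three groups need only two dummy variables, because membership of the reference group is implied when both are zero.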

Dataset 3.3a: One-way ANOVA as regression (indicator coding)

This dataset is designed to demonstrate different types of coding for categorical variables in regression.  It contains the stress ratings (stress) for five different types of educator (job): primary school teacher (1), secondary school teacher (2), college lecturer (3), old university lecturer (4) and new university lecturer (5).

SPSS (data3_3a.sav), Excel (data3_3a.xls) tab-delimited (data3_3a.dat).

Dataset 3.3b

This contains the same data as dataset 3.3a, but the data have been recoded to indicator coding, adding four new variables (group_1, group_2, group_3, group_4). 

SPSS (data3_3b.sav), Excel (data3_3b.xls) tab-delimited (data3_3b.dat).

Dataset 3.4: Analysis of change

This dataset demonstrates the problems of analysing data that contain change scores using ANOVA.  The dataset contains data from an experiment where memory was tested using a pre-test, post-test design.  If the data are analysed using a t-test on the difference scores, or two-way mixed ANOVA, they just fail to achieve significance (at the 0.05 level).  If a regression (or ANCOVA) based approach is used, the analysis just achieves significance at the 0.05 level.  Contains 4 variables: group, unprimed (pre-manipulation score), primed (post-manipulation score) and diff (difference).

SPSS (data3_4.sav), Excel (data3_4.xls) tab-delimited (data3_4.dat).
 

Part 2: I need to do regression analysis next week

Chapter 4: Assumptions in regression analysis

Dataset 4.1: Transforming skewed data

One variable showing reaction times.  The data are highly positively skewed, but a log-transform normalises them.

SPSS (data4_1.sav), Excel (data4_1.xls) tab-delimited (data4_1.dat).
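
The effect of a log-transform on positive skew can be sketched with a simple moment-based skewness statistic. The reaction times below are made up and are not the values in data4_1, so the transform only reduces the skew here rather than removing it.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up reaction times in milliseconds, with a long right tail.
rt = [310, 325, 340, 360, 385, 420, 480, 590, 820, 1500]
log_rt = [math.log(v) for v in rt]

raw_skew = skewness(rt)
log_skew = skewness(log_rt)
```

Comparing raw_skew with log_skew shows how far the transform pulls in the right tail.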

Dataset 4.2a: Bivariate Outlier

Contains the IQ scores of couples, to demonstrate that two variables may be normally distributed, but this does not mean that the variables will have a bivariate normal distribution.  Contains two variables, female and male.

SPSS (data4_2a.sav), Excel (data4_2a.xls) tab-delimited (data4_2a.dat).

Dataset 4.2b

Same as dataset 4.2a, but also contains male predicted scores (from female) and residual.

SPSS (data4_2b.sav), Excel (data4_2b.xls) tab-delimited (data4_2b.dat).
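
Predicted scores and residuals of this kind can be sketched as below. The IQ scores are made up: the last couple has unremarkable scores on each variable separately, but the pairing is unusual, so it produces the largest residual.

```python
# Made-up IQ scores for eight couples.
female = [85, 90, 95, 100, 105, 110, 115, 88]
male = [87, 89, 96, 101, 104, 111, 114, 112]

n = len(female)
mean_f = sum(female) / n
mean_m = sum(male) / n
sxy = sum((f - mean_f) * (m - mean_m) for f, m in zip(female, male))
sxx = sum((f - mean_f) ** 2 for f in female)
slope = sxy / sxx
intercept = mean_m - slope * mean_f

# Male score predicted from female score, and the residual (observed minus predicted).
predicted = [intercept + slope * f for f in female]
residuals = [m - p for m, p in zip(male, predicted)]

# Index of the couple with the largest absolute residual.
outlier_index = max(range(n), key=lambda i: abs(residuals[i]))
```

Neither 88 nor 112 is extreme within its own column, which is why the case only stands out once the two variables are considered together.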

Dataset 4.3: Multivariate outlier

This dataset is designed to demonstrate a multivariate outlier. 
It contains measures of life events (events), daily hassles (hassles), social support (support) and depression (dep).  The variables are all univariate normal, and the scattergraphs all show that the data are bivariate normal, but when a regression analysis is carried out, a multivariate outlier emerges.

SPSS (data4_3.sav), Excel (data4_3.xls) tab-delimited (data4_3.dat).

Dataset 4.4: Heteroscedasticity

Demonstrates heteroscedasticity, and shows how the presence of heteroscedasticity means that a misspecification has taken place, which may be in terms of a missing interaction effect.  The data contain 4 variables: cash is the amount of money that an individual earns, import is the importance to that person of the work that a particular charity does, and given is the amount of money that individual donated to the charity.  The fourth variable, di_cash, is a dichotomised version of cash, included to enable the interaction to be explored.

SPSS (data4_4.sav), Excel (data4_4.xls) tab-delimited (data4_4.dat).

Dataset 4.5: Non-linearity

Contains 2 variables, labelled x and y.  Demonstrates how non-linearity can reduce our ability to predict a DV from an IV.  The scattergraph shows that prediction should be almost perfect, but the value of R squared shows that although the prediction is good, it is not perfect.

SPSS (data4_5.sav), Excel (data4_5.xls) tab-delimited (data4_5.dat).
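
A sketch of the point: when y is completely determined by x but the relationship is curved, a straight-line fit still falls short of R squared = 1. The quadratic relationship used here is an assumption chosen for illustration, not the form of the relationship in data4_5.

```python
# y is an exact (noise-free) function of x, but not a linear one.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# For a bivariate linear regression, R squared is the squared correlation.
r_squared = sxy ** 2 / (sxx * syy)
```

For these values R squared is roughly 0.95: good, but short of the perfect prediction that the scattergraph would suggest.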

Dataset 4.6: More non-linearity

This is a similar dataset to 4.5, but shows non-linearity in relation to the surface area of houses, and the amount of money they cost.  Contains 2 variables, size and price.

SPSS (data4_6.sav), Excel (data4_6.xls) tab-delimited (data4_6.dat).

Chapter 5: Issues in regression analysis

Dataset 5.1: Collinearity

This dataset is an expansion of dataset 1.1.  It contains an additional variable (late), which correlates with both attend and books.  When all three variables are entered into the equation, the parameter estimates are no longer significant.

SPSS (data5_1.sav), Excel (data5_1.xls) tab-delimited (data5_1.dat).
 

Part 3: I would like to know more of the things that regression analysis can do

Chapter 6: Nonlinear and Logistic Regression

Dataset 6.1a: Non-linear model

This dataset contains a measure of daily hassles (hassles) and a measure of anxiety (anx).  It shows how the effect of hassles on anxiety may be non-linear, as the first hassle may have little impact on psychological wellbeing, but the 50th has a much greater impact.

SPSS (data6_1a.sav), Excel (data6_1a.xls) tab-delimited (data6_1a.dat).
 

Dataset 6.1b

This is the same as dataset 6.1a, with the addition of quadratic (hassles^2) and cubic (hassles^3) variables.

SPSS (data6_1b.sav), Excel (data6_1b.xls) tab-delimited (data6_1b.dat).
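
Computing the power terms is a one-line transformation. In this sketch the variable names hassles_sq and hassles_cu are hypothetical (the names used in data6_1b may differ) and the hassles values are made up.

```python
# Made-up hassles scores.
hassles = [0, 1, 2, 3, 10]

# Add the squared and cubed terms alongside the original variable.
rows = [
    {"hassles": h, "hassles_sq": h ** 2, "hassles_cu": h ** 3}
    for h in hassles
]
```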
 

Dataset 6.2: Logistic Regression

Demonstrates logistic regression.  Score represents scores on an aptitude test for a course, exp represents months of relevant previous experience, and pass indicates whether the individual passed the exam at the end of the course.

SPSS (data6_2.sav), Excel (data6_2.xls) tab-delimited (data6_2.dat).
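
In a logistic regression the predictors are combined linearly on the log-odds scale and then converted to a probability. The sketch below shows only that conversion; the coefficient values are hypothetical and are not estimates from data6_2, and the function name pass_probability is made up.

```python
import math

def pass_probability(score, exp, b0, b_score, b_exp):
    """Convert a linear predictor (log-odds) into a probability of passing."""
    logit = b0 + b_score * score + b_exp * exp
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients, for illustration only.
p_low = pass_probability(score=20, exp=0, b0=-4.0, b_score=0.1, b_exp=0.05)
p_high = pass_probability(score=60, exp=12, b0=-4.0, b_score=0.1, b_exp=0.05)
```

A log-odds value of zero corresponds to a probability of 0.5; positive values give probabilities above 0.5 and negative values give probabilities below it.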

Chapter 7: Moderator and Mediator Analysis

Dataset 7.1: Moderator effect with two categorical independent variables

Contains data from the classic context dependent memory study, showing the interaction effect.  Contains two independent variables: learning environment (learn, with two levels, dry and wet) and testing environment (test, with two levels, dry and wet).  Also contains scores, indicator coding of learn and test (indlearn and indtest, respectively) and an interaction term (lxt).

SPSS (data7_1.sav), Excel (data7_1.xls) tab-delimited (data7_1.dat).
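
An interaction term for two indicator-coded variables is their product. In the sketch below, coding wet as 1 and dry as 0 is an assumption; the data file may use the opposite coding. The function name code_row is hypothetical.

```python
def code_row(learn, test):
    """Indicator-code the two environments and form their product."""
    indlearn = 1 if learn == "wet" else 0
    indtest = 1 if test == "wet" else 0
    return {"indlearn": indlearn, "indtest": indtest, "lxt": indlearn * indtest}

# The four cells of the 2 x 2 design.
cells = [code_row(learn, test) for learn in ("dry", "wet") for test in ("dry", "wet")]
```

Under this coding lxt is 1 only in the cell where both environments are coded 1, so its coefficient captures how that cell departs from what the two main effects alone would predict.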

Dataset 7.2: Moderator effect with one categorical and one continuous variable

Contains three variables: Events is a person's score on a life event scale, indicating the number and severity of recent life events.  Status is a measure of whether a person co-habits with a partner (a 0 indicates that they do not, and a 1 indicates that they do).  Stress is the score on a self-report measure of experienced stress.

SPSS (data7_2.sav), Excel (data7_2.xls) tab-delimited (data7_2.dat).

Dataset 1.1 (repeated): Moderator effect with two continuous IVs

There is a significant interaction effect in this dataset - the attend * books interaction is significant, showing that it is not sufficient to just read books, or just to attend lectures, students must do both.

Dataset 7.3: Mediator effect

Contains three variables: enjoy indicates how much a person reports they enjoy reading books, read indicates the number of books that they have read, and buy indicates the number of books that they have purchased in the previous 12 months.

SPSS (data7_3.sav), Excel (data7_3.xls) tab-delimited (data7_3.dat).
 

 

Contents

Click a link to jump straight to that part of the page

Part 1: I need to do regression analysis tomorrow

Chapter 1: Building Models with Regression and Correlation
Dataset 1.1: Bivariate regression

Chapter 2: Multiple Regression
Dataset 2.1: Multiple regression

Chapter 3: Categorical Independent Variables in Regression
Dataset 3.1: T-test as regression
Dataset 3.2a: One-way ANOVA as regression (dummy variable coding)
Dataset 3.3: One-way ANOVA as regression (indicator coding)
Dataset 3.4: Analysis of change

Part 2: I need to do regression analysis next week

Chapter 4: Assumptions in regression analysis
Dataset 4.1: Transforming skewed data
Dataset 4.2: Bivariate Outlier
Dataset 4.3: Multivariate outlier
Dataset 4.4: Heteroscedasticity
Dataset 4.5: Non-linearity
Dataset 4.6: More non-linearity

Chapter 5: Issues in regression analysis
Dataset 5.1: Collinearity

Part 3: I would like to know more of the things that regression analysis can do

Chapter 6: Nonlinear and Logistic Regression
Dataset 6.1: Non-linear model
Dataset 6.2: Logistic Regression

Chapter 7: Moderator and Mediator Analysis
Dataset 7.1: Moderator effect with two categorical independent variables
Dataset 7.2: Moderator effect with one categorical and one continuous variable
Dataset 1.1 (repeated): Moderator effect with two continuous IVs
Dataset 7.3: Mediator effect

Contact

Got a comment on this web site?  Contact webmaster@jeremymiles.co.uk
Got a comment on the book?  Contact jeremy@jeremymiles.co.uk