## Lesson 11: Linear Regression

#### Objectives

- Determine the regression equation.
- Compute predicted *Y* values.
- Compute and interpret residuals.

#### Overview

Closely related to correlation is the topic of linear regression. As you learned in Lesson 10, the correlation coefficient is an index of linear relationship. If the correlation coefficient is significant, that is an indication that a linear equation can be used to model the relationship between the predictor *X* and the criterion *Y*. In this lesson you will learn how to determine the equation of the line of best fit between the predictor and the criterion, how to compute predicted values based on that linear equation, and how to calculate and interpret residuals.

#### Example Problem and Data

This spring term you are in a large introductory psychology class. You observe an apparent relationship between the outside temperature and the number of people who skip class on a given day. More people seem to be absent when the weather is warmer, and more seem to be present when it is cooler outside. You randomly select 10 class periods and record the outside temperature reading 10 minutes before class time and then count the number of students in attendance that day. If you determine that there is a significant linear relationship, you would like to impress your professor by predicting how many people will be present on a given day, based on the outside temperature. The data you collect are the following:

| Temp | Attendance |
| --- | --- |
| 50 | 87 |
| 77 | 60 |
| 67 | 73 |
| 53 | 86 |
| 75 | 59 |
| 70 | 65 |
| 83 | 65 |
| 85 | 62 |
| 80 | 58 |
| 64 | 89 |

#### Entering the Data in SPSS

These pairs of data must be entered as separate variables. The data file may look something like the following (see Figure 11-1):

Figure 11-1 Data in SPSS

If you prefer, you can download a copy of the data. As you learned in Lesson 10, you should first determine whether there is a significant correlation between temperature and attendance. Running the Correlation procedure (see Lesson 10 for details), you find that the correlation is -.87, and is significant at the .01 level (see Figure 11-2).

Figure 11-2 Significant correlation
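The correlation reported above can also be checked by hand. The following is a minimal sketch in plain Python, not part of the SPSS workflow; the function name `pearson_r` is an illustrative choice, and only the coefficient is computed, not its significance test.

```python
from math import sqrt

# Data from the example problem.
temp = [50, 77, 67, 53, 75, 70, 83, 85, 80, 64]
attendance = [87, 60, 73, 86, 59, 65, 65, 62, 58, 89]

def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(temp, attendance)
print(round(r, 2))  # -0.87
```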

A scatterplot is helpful in visualizing the relationship (see Figure 11-3). Clearly, there is a negative relationship between attendance and temperature.

Figure 11-3 Scatterplot

#### Linear Regression

The correlation and scatterplot indicate a strong, though by no means perfect, relationship between the two variables. Let us now turn our attention to regression. We will "regress" the attendance (*Y*) on the temperature (*X*). In linear regression, we are seeking the equation of a straight line that best fits the observations. The usefulness of such a line may not be immediately apparent, but if we can model the relationship by a straight line, we can use that line to predict a value of *Y* for any value of *X*, even those that have not yet been observed. For example, looking at the scatterplot in Figure 11-3, what attendance would you predict for a temperature of 60 degrees? The regression line can answer that question. This line will have an intercept term and a slope coefficient and will be of the general form

$$\hat{Y} = a + bX$$

where *a* is the intercept and *b* is the slope.

The intercept and slope (regression) coefficient are derived in such a way that the sum of the squared deviations of the actual data points from the line is minimized. This is called "ordinary least squares" estimation or OLS. Note that the predicted value of *Y* (read "*Y*-hat") is a linear combination of two constants, the intercept term and the slope term, and the value of *X*, so that the only thing that varies is the value of *X*. Therefore, the correlation between the predicted *Y*s and the observed *Y*s will have the same magnitude as the correlation between the observed *Y*s and the observed *X*s; its sign is positive even when, as here, the relationship between *X* and *Y* is negative. If we subtract the predicted value of *Y* from the observed value of *Y*, the difference is called a "residual." A residual represents the part of the *Y* variable that cannot be explained by the *X* variable. Visually, the vertical distance between an observed data point and the line of best fit represents the residual.
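To make the least-squares idea concrete, here is a small sketch in plain Python that computes the intercept and slope from the usual closed-form OLS formulas and confirms that nudging the slope away from the fitted value makes the sum of squared residuals larger. The helper names `fit_line` and `sum_squared_residuals` are illustrative assumptions; SPSS performs the equivalent computation internally.

```python
# Data from the example problem.
temp = [50, 77, 67, 53, 75, 70, 83, 85, 80, 64]
attendance = [87, 60, 73, 86, 59, 65, 65, 62, 58, 89]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

intercept, slope = fit_line(temp, attendance)
best = sum_squared_residuals(temp, attendance, intercept, slope)
nudged = sum_squared_residuals(temp, attendance, intercept, slope + 0.05)

print(round(intercept, 3), round(slope, 3))  # 133.556 -0.897
print(best < nudged)  # True
```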

SPSS's Regression procedure allows us to determine the equation of the line of best fit, to calculate predicted values of *Y*, and to calculate and interpret residuals. Optionally, you can save the predicted values of *Y* and the residuals as either standard scores or raw-score equivalents.

#### Running the Regression Procedure

Open the data file in SPSS. Select Analyze, Regression, and then Linear (see Figure 11-4).

Figure 11-4 Performing the Regression procedure

The Regression procedure outputs a value called "Multiple R," which will always range from 0 to 1. In the bivariate case, Multiple R is the absolute value of the Pearson *r*, and is thus .87. The square of *r* or of Multiple R is .752, and represents the proportion of variance shared between *Y* and *X*. When we run the regression tool, we can optionally ask for either standardized or unstandardized (raw-score) predicted values of *Y* and residuals to be calculated and saved as new variables (see Figure 11-5).

Figure 11-5 Save options in the Regression procedure

Click OK to run the Regression procedure. The output is shown in Figure 11-6. In the ANOVA table summarizing the regression, the omnibus *F* test tests the hypothesis that the population Multiple R is zero. We can safely reject that null hypothesis. Notice that dividing the regression sum of squares, which is based on the predicted values of *Y*, by the total sum of squares, which is based on the observed values of *Y*, produces the same value as R Square. The value of R Square thus represents the proportion of variance in the criterion that can be explained by the predictor. The residual sum of squares represents the variation in the criterion that remains unexplained.

Figure 11-6 Regression procedure output
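The relationship between the sums of squares and R Square described above can be verified directly. This is a rough sketch in plain Python under the assumption of one predictor (so the regression has 1 degree of freedom and the residual has *n* - 2); the variable names are illustrative and are not SPSS output labels.

```python
# Data from the example problem.
temp = [50, 77, 67, 53, 75, 70, 83, 85, 80, 64]
attendance = [87, 60, 73, 86, 59, 65, 65, 62, 58, 89]

n = len(temp)
mean_x = sum(temp) / n
mean_y = sum(attendance) / n

# Least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, attendance))
sxx = sum((x - mean_x) ** 2 for x in temp)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in temp]

# Partition the total variation in attendance.
ss_total = sum((y - mean_y) ** 2 for y in attendance)
ss_regression = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum((y - p) ** 2 for y, p in zip(attendance, predicted))

r_squared = ss_regression / ss_total
# Omnibus F: regression mean square over residual mean square.
f_ratio = (ss_regression / 1) / (ss_residual / (n - 2))

print(round(r_squared, 3))  # 0.752
print(f_ratio)
```

The regression and residual sums of squares add up to the total sum of squares, which is why their ratio to the total can be read as "explained" and "unexplained" proportions.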

In Figure 11-7 you can see that the residuals and predicted values are now saved as new variables in the SPSS data file.

Figure 11-7 Saving predicted values and residuals
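The saved variables can be reproduced outside SPSS. The sketch below, in plain Python, computes an unstandardized predicted value and residual for each of the 10 class periods; the list names `predicted` and `residuals` are illustrative and are not the names SPSS assigns to its saved variables.

```python
# Data from the example problem.
temp = [50, 77, 67, 53, 75, 70, 83, 85, 80, 64]
attendance = [87, 60, 73, 86, 59, 65, 65, 62, 58, 89]

n = len(temp)
mean_x = sum(temp) / n
mean_y = sum(attendance) / n

# Least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, attendance))
sxx = sum((x - mean_x) ** 2 for x in temp)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Raw-score predicted values and residuals (observed minus predicted).
predicted = [intercept + slope * x for x in temp]
residuals = [y - p for y, p in zip(attendance, predicted)]

for x, y, p, e in zip(temp, attendance, predicted, residuals):
    print(f"{x:>4} {y:>4} {p:8.3f} {e:8.3f}")
```

With an intercept in the model, the least-squares residuals sum to zero apart from floating-point rounding, which is a quick sanity check on the computation.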

The regression equation for predicting attendance from the outside temperature is Predicted attendance = 133.556 - .897 × Temp. So for a temperature of 60 degrees, you would predict the attendance to be about 80 students, because 133.556 - .897 × 60 ≈ 79.7 (see Figure 11-8 in which this is illustrated graphically). Note that this process of using a linear equation to predict attendance from the temperature has some obvious practical limits. You would never predict attendance higher than 100 percent, for example, and there may be a point at which the temperature becomes so hot as to be unbearable, and the attendance could begin to rise simply because the classroom is air-conditioned.

Figure 11-8 Linear trend line and regression equation

To impress your professor, assume that the outside temperature on a class day is 72 degrees. Substituting 72 for *X* in the regression equation, you predict that there will be 69 students in attendance that day.
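The two predictions worked in this lesson amount to substituting a temperature into the regression equation. A minimal sketch in plain Python, using the rounded coefficients reported above (the function name `predict_attendance` is an illustrative choice):

```python
# Coefficients as reported in the regression output, rounded to three decimals.
INTERCEPT = 133.556
SLOPE = -0.897

def predict_attendance(temperature):
    """Predicted attendance for a given outside temperature in degrees."""
    return INTERCEPT + SLOPE * temperature

print(round(predict_attendance(60)))  # 80
print(round(predict_attendance(72)))  # 69
```

The function does nothing to guard against temperatures far outside the observed range of 50 to 85 degrees, where the practical limits noted earlier apply.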

#### Examining Residuals

A residual is the difference between the observed and predicted values for the criterion variable (Hair, Black, Babin, Anderson, & Tatham, 2006). Bivariate linear regression and multiple linear regression make four key assumptions about these residuals.

- The phenomenon (i.e., the regression model being considered) is linear, so that the relationship between *X* and *Y* is linear.
- The residuals have equal variances at all levels of the predicted values of *Y*.
- The residuals are independent. This is another way of saying that the successive observations of the dependent variable are uncorrelated.
- The residuals are normally distributed with a mean of zero.

Thus it can be very instructive to examine the residuals when you perform a regression analysis. It is helpful to examine a histogram of the standardized residuals (see Figure 11-9), which can be created from the Plots menu. The normal curve can be superimposed for visual reference.

Figure 11-9 Histogram of standardized residuals

These residuals appear to be approximately normally distributed. Another useful plot is the normal p-p plot produced as an option in the Plots menu. This plot compares the observed cumulative probabilities of the residuals to the cumulative probabilities that would be expected if the residuals were normally distributed. Substantial departures from a straight line would indicate nonnormality in the residuals (see Figure 11-10). In this case the residuals appear once again to be fairly normally distributed.

Figure 11-10 Normal p-p plot of observed and expected cumulative probabilities of residuals
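The coordinates behind a normal p-p plot can be approximated by hand. The sketch below, in plain Python, makes two assumptions that may not match what SPSS does internally: residuals are standardized by dividing by the square root of the residual mean square, and the observed cumulative proportion for the *i*-th smallest residual is estimated with Blom's formula, (*i* - 3/8) / (*n* + 1/4).

```python
from math import sqrt
from statistics import NormalDist

# Data from the example problem.
temp = [50, 77, 67, 53, 75, 70, 83, 85, 80, 64]
attendance = [87, 60, 73, 86, 59, 65, 65, 62, 58, 89]

n = len(temp)
mean_x = sum(temp) / n
mean_y = sum(attendance) / n

# Least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, attendance))
sxx = sum((x - mean_x) ** 2 for x in temp)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(temp, attendance)]

# Assumption: standardize by the root of the residual mean square (df = n - 2).
mse = sum(e ** 2 for e in residuals) / (n - 2)
standardized = sorted(e / sqrt(mse) for e in residuals)

# Assumption: Blom's formula for the observed cumulative proportions.
observed = [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]
# Cumulative probability each standardized residual would have under normality.
expected = [NormalDist().cdf(z) for z in standardized]

pp_points = list(zip(observed, expected))
for obs, exp in pp_points:
    print(f"{obs:.3f}  {exp:.3f}")
```

If the residuals are close to normal, the printed pairs fall near the diagonal where the observed and expected values are equal.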

When there are substantial departures from normality, homoscedasticity, or linearity, data transformations or the introduction of polynomial terms such as quadratic or cubic values of the original independent or dependent variables can often be of help (Edwards, 1976).

#### References

Edwards, A. L. (1976). *An introduction to linear regression and correlation*. San Francisco: Freeman.

Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). *Multivariate data analysis* (6th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.