## Lesson 12: Multiple Correlation and Regression

#### Objectives

1. Perform and interpret a multiple regression analysis.
2. Test the significance of the regression and the regression coefficients.
3. Examine residuals for diagnostic purposes.

#### Overview

Multiple regression involves one continuous criterion (dependent) variable and two or more predictors (independent variables). The equation for the line of best fit is derived so as to minimize the sum of the squared deviations of the observed values from the line. Although there are multiple predictors, there is only one predicted Y value, and the correlation between the observed and predicted Y values is called Multiple R. The value of Multiple R ranges from zero to one. In the case of bivariate correlation, a regression analysis yields a value of Multiple R equal to the absolute value of the Pearson product-moment correlation coefficient between X and Y, as discussed in Lesson 11. The multiple linear regression equation takes the following general form:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

Instead of using a to represent the Y intercept, it is common practice in multiple regression to call the intercept term $b_0$. The significance of Multiple R, and thus of the entire regression, must be tested. In addition, the significance of the individual regression coefficients must be examined to verify that each independent variable adds significantly to the prediction.
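The least-squares idea and the definition of Multiple R described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the SPSS procedure: the helper names (`fit_two_predictors`, `pearson_r`) and the six data points are made up for the example, and the solver is restricted to exactly two predictors.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def fit_two_predictors(x1, x2, y):
    """Least-squares estimates (b0, b1, b2) for Y' = b0 + b1*X1 + b2*X2."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    # Sums of squares and cross-products of deviations from the means
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero when the predictors are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

def pearson_r(u, v):
    """Pearson product-moment correlation between two equal-length lists."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

# Made-up illustrative values (NOT the lesson's 32-student dataset)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 2.9, 7.2, 6.8, 11.1, 10.9]

b0, b1, b2 = fit_two_predictors(x1, x2, y)
predicted = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

# Multiple R is the correlation between observed and predicted Y
multiple_r = pearson_r(y, predicted)
print(f"b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}, Multiple R={multiple_r:.3f}")
```

Squaring `multiple_r` gives R-Square, the proportion of variation in Y accounted for by the predictors.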

As in simple linear regression, residual plots are helpful in diagnosing the degree to which the linearity, normality, and homoscedasticity assumptions have been met. Various data transformations can be attempted to accommodate situations of curvilinearity, non-normality, and heteroscedasticity. In multiple regression we must also consider the potential impact of multicollinearity, which is the degree of linear relationship among the predictors. When there is a high degree of collinearity in the predictors, the regression equation will tend to be distorted, and may lead to inappropriate conclusions regarding which predictors are statistically significant (Lind, Marchal, and Wathen, 2006). For this reason, we will ask for collinearity diagnostics when we run our regression. As a rule of thumb, if the variance inflation factor (VIF) for a given predictor is very high or if the absolute value of the correlation between two predictors is greater than .70, one or more of the predictors should be dropped from the analysis, and the regression equation should be recomputed.
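To make the link between predictor correlation and the VIF concrete, here is a minimal sketch for the two-predictor case. In general, the VIF for predictor j is 1 / (1 - R_j²), where R_j² comes from regressing that predictor on the others; with exactly two predictors this reduces to 1 / (1 - r12²). The function name `vif_two_predictors` and the correlation values are chosen only for illustration.

```python
def vif_two_predictors(r12):
    """VIF shared by both predictors when there are exactly two: 1 / (1 - r12^2)."""
    return 1.0 / (1.0 - r12 ** 2)

# Illustrative correlations between the two predictors (not taken from the lesson's output)
for r12 in (0.0, 0.30, 0.70, 0.90):
    print(f"r12 = {r12:.2f} -> VIF = {vif_two_predictors(r12):.2f}")
```

With only two predictors, a correlation of .70 corresponds to a VIF of about 1.96, so the two rules of thumb above need not flag the same cases.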

Multiple regression is in actuality a general family of techniques, and the mathematical and statistical underpinnings of multiple regression make it an extremely powerful and flexible tool. By using group membership or treatment level qualitative coding variables as predictors, one can easily use multiple regression in place of t tests and analyses of variance. In this tutorial we will concentrate on the simplest kind of multiple regression, a forced or simultaneous regression in which all predictor variables are entered into the regression equation at one time. Other approaches include stepwise regression in which variables are entered according to their predictive ability and hierarchical regression in which variables are entered according to theory or hypothesis. We will examine hierarchical regression more closely in Lesson 14 on analysis of covariance.
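The claim that coded group membership lets regression stand in for a t test can be illustrated with a small sketch. Assuming two groups coded 0 and 1 (dummy coding), the regression intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means, which is the quantity an independent-samples t test evaluates. The scores below are hypothetical.

```python
from statistics import mean

# Hypothetical scores for two groups (not from the lesson's dataset)
group_a = [72.0, 75.0, 78.0, 81.0]
group_b = [80.0, 84.0, 86.0, 90.0]

# Dummy-coded predictor: 0.0 for group A, 1.0 for group B
dummy = [0.0] * len(group_a) + [1.0] * len(group_b)
scores = group_a + group_b

# Simple least-squares regression of the scores on the dummy variable
md, ms = mean(dummy), mean(scores)
slope = sum((d - md) * (s - ms) for d, s in zip(dummy, scores)) / sum(
    (d - md) ** 2 for d in dummy
)
intercept = ms - slope * md

print(f"intercept = {intercept:.2f} (mean of group A = {mean(group_a):.2f})")
print(f"slope = {slope:.2f} (mean difference = {mean(group_b) - mean(group_a):.2f})")
```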

#### Example Data

The following data (see Figure 12-1) represent statistics course grades, GRE Quantitative scores, and cumulative GPAs for 32 graduate students at a large public university in the southern U.S. (source: data collected by the webmaster). A copy of the entire dataset is available from the original lesson page.

Figure 12-1 Statistics course grades, GREQ, and GPA (partial data)

#### Preparing for the Regression Analysis

We will determine whether quantitative ability (GREQ) and cumulative GPA can be used to predict performance in the statistics course. A very useful first step is to calculate the zero-order correlations among the predictors and the criterion. We will use the Correlate procedure for that purpose. Select Analyze, Correlate, Bivariate (see Figure 12-2).

Figure 12-2 Calculate intercorrelations as preparation for regression analysis

In the Options menu of the resulting dialog box, you can request descriptive statistics if you like. The resulting intercorrelation matrix reveals that GREQ and GPA are both significantly related to the course grade, but are not significantly related to each other. Thus our initial impression is that collinearity will not be a problem (see Figure 12-3).

Figure 12-3 Descriptive statistics and intercorrelations

#### Conducting the Regression Analysis

To conduct the regression analysis, select Analyze, Regression, Linear (see Figure 12-4).

Figure 12-4 Selecting the Linear Regression procedure

In the Linear Regression dialog box, move Grade to the Dependent variable field and GPA and GREQ to the Independent(s) list, as shown in Figure 12-5.

Figure 12-5 Linear Regression dialog box

Click on the Statistics button and check the box in front of collinearity diagnostics (see Figure 12-6).

Figure 12-6 Requesting collinearity diagnostics

Select Continue and then click on Plots to request standardized residual plots and scatter diagrams. Request a histogram and a normal probability plot of the standardized residuals. You can also plot the standardized residuals against the standardized predicted values to check the assumption of homoscedasticity (see Figure 12-7). Click OK to run the regression analysis. The results are excerpted in Figure 12-8.

Figure 12-8 Regression procedure output (excerpt)

#### Interpreting the Regression Output

The significant overall regression indicates that a linear combination of GREQ and GPA predicts grades in the statistics course. The value of R-Square is .513, which indicates that about 51 percent of the variation in grades is accounted for by knowledge of GPA and GREQ. The significant t values for the regression coefficients for GREQ and GPA show that each variable contributes significantly to the prediction. Examining the unstandardized regression coefficients is not very instructive, because these are based on raw scores and their values are influenced by the units of measurement of the predictors. Thus, the raw-score regression coefficient for GREQ is much smaller than that for GPA because the two variables use different scales. On the other hand, the standardized coefficients are quite interpretable, because each shows the relative contribution of the given variable to the prediction with the other variable held constant. These are technically standardized partial regression coefficients. In the present case, we can conclude that GREQ has more predictive value than GPA, though both are significant.
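The relationship between raw-score and standardized coefficients can be sketched as follows. A standardized coefficient is the raw coefficient rescaled by the ratio of the predictor's standard deviation to the criterion's standard deviation. All numbers below are hypothetical and chosen only to show how a predictor on a wide scale can have a tiny raw coefficient yet a larger standardized one; they are not the values from Figure 12-8.

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert a raw-score coefficient b to a standardized coefficient."""
    return b * sd_x / sd_y

# Hypothetical standard deviations and raw coefficients (NOT the lesson's output)
sd_grade = 10.0   # criterion
sd_greq = 100.0   # predictor measured on a wide scale
sd_gpa = 0.4      # predictor measured on a narrow scale
b_greq = 0.05
b_gpa = 10.0

beta_greq = standardized_beta(b_greq, sd_greq, sd_grade)
beta_gpa = standardized_beta(b_gpa, sd_gpa, sd_grade)
print(f"GREQ: b = {b_greq}, beta = {beta_greq:.2f}")
print(f"GPA:  b = {b_gpa}, beta = {beta_gpa:.2f}")
```

In this made-up case the raw coefficient for the wide-scale predictor is 200 times smaller than the other, yet its standardized coefficient is the larger of the two.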

The collinearity diagnostics indicate a low degree of overlap between the predictors (as we predicted). If the two predictor variables were orthogonal (uncorrelated), the variance inflation factor (VIF) for each would be 1. Because the VIFs obtained here are close to that value, we conclude that there is not a problem with collinearity in this case.

The histogram of the standardized residuals shows that the departure from normality is not too severe (see Figure 12-9).

Figure 12-9 Histogram of standardized residuals

The normal P-P plot indicates some departure from normality and may suggest a curvilinear relationship between the predictors and the criterion (see Figure 12-10).

Figure 12-10 Normal P-P plot

The plot of standardized predicted values against the standardized residuals indicates a large degree of heteroscedasticity (see Figure 12-11). This is mostly the result of a single outlier, case 11 (Participant 118), whose GREQ and grade scores are substantially lower than those of the remainder of the group. Eliminating that case and recomputing the regression increases Multiple R slightly and also reduces the heteroscedasticity.

Figure 12-11 Plot of predicted values against residuals
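Outliers of the kind described above can also be spotted numerically. The sketch below standardizes each residual by dividing it by the standard error of estimate (a common definition of a standardized residual) and flags cases beyond an assumed cutoff of 2 in absolute value. The observed and predicted values are hypothetical, and the cutoff is a convention adopted for this example, not a value given in the lesson.

```python
from math import sqrt

def standardized_residuals(observed, predicted, n_predictors):
    """Residuals divided by the standard error of estimate."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    df = len(residuals) - n_predictors - 1  # residual degrees of freedom
    std_error = sqrt(sum(e ** 2 for e in residuals) / df)
    return [e / std_error for e in residuals]

# Hypothetical observed and predicted values (NOT the lesson's data)
predicted = [71.0, 74.0, 81.0, 84.0, 89.0, 72.0, 87.0, 80.0]
observed = [73.0, 76.0, 83.0, 85.0, 91.0, 60.0, 89.0, 81.0]

z = standardized_residuals(observed, predicted, n_predictors=2)

# Cases (numbered from 1) whose standardized residual exceeds 2 in absolute value;
# the cutoff of 2 is an assumed convention, not a value from the lesson.
flagged = [i for i, value in enumerate(z, start=1) if abs(value) > 2]
print("Flagged cases:", flagged)
```

A flagged case is a candidate for closer inspection, not automatic removal; the regression can then be recomputed with and without it to compare the results.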