LAB SESSION 5

CORRELATION AND REGRESSION

INTRODUCTION:  Not only is it important to analyze single variables, but frequently one needs to determine if and how two variables are related.  The correlation coefficient is a measure of the strength of the linear relationship between two variables.  In these exercises you will use Excel to analyze this statistic, and these exercises will also give you a very brief introduction to linear regression.

INVESTIGATIONS OF THE CORRELATION COEFFICIENT

The data set below is a sample of weight and waist size for 11 women.  You will            use that data to estimate the correlation between a woman's weight and her waist      size.  Once that value has been determined you will show that this value is                         independent of the scale of the two variables.

Weights and Waist Sizes

weight(lbs): 110  143  120  127  143  111  137  154  123  104 140

waist (ins):    22    29    27   26     27   24    28    28    26    25   23

Enter the data into the worksheet and name the two variables.

Get a scatter diagram of the bivariate data set.  The variable 'WEIGHT' should be

on the x-axis and 'WAIST' on the y-axis.

Choose:  Chart Wizard > XY(Scatter) > 1st picture > Next

Enter:      Data Range:  select cells > Next

We can edit the scatter plot (since all the points are in a corner), and rescale the axes to reflect the data range.

Right click on the X-axis

Select: Format Axes

Click on the Scale tab:  Change the minimum and maximum values for X.

Minimum:  100

Maximum:  160

Major unit:  20

Minor unit:  5

Also make appropriate changes to the y-axis scale

Let’s also generate descriptive statistics for each of these variables:

Choose:   Tools > Data Analysis > Descriptive Statistics > OK

Enter:       Input Range:  select cells

Output Range:  select cell

Choose:    Summary statistics > OK

* the output shown below is edited for clarity

 Weight waist (ins) Mean 128.3636 Mean 25.90909 Median 127 Median 26 Mode 143 Mode 27 Standard Deviation 16.21279 Standard Deviation 2.21154 Range 50 Range 7 Minimum 104 Minimum 22 Maximum 154 Maximum 29 Count 11 Count 11

Calculate the correlation coefficient, r.

Choose:  Insert function, fx > Statistical > Correl > OK

Enter:     Array 1:  x data range

Array 2:  y data range > OK

*

 correlation 0.603436716

QUESTIONS:

1.         Would you say that the variables were positively or negatively correlated?  Is there a strong or weak correlation?

2.         If you were to add an equal amount of weight to each woman (assume no change in waist size), would the value of r, the correlation coefficient, change?  Test your conjecture by adding 25 lbs. to each woman's weight and recalculate r.  The necessary commands are:

Activate cell C2 and type “= A2 + 25”, then drag right corner down to perform the same calculation on all of column A.  Redo the correlation using column C for Array 1.

3.         If you were to change the scale of the variables: weight to kg and waist size to meters, would the value of r change?  Test your conjecture by multiplying 'WEIGHT' by 0.453 and 'WAIST' by .0254  and recalculate r.  How will the scatter diagram change when you change the scales?

4.         The last observation in your data set was for a model known for her especially thin figure.  If you eliminated it from the data set, how much would r change?  Would you say that the statistic, r, is sensitive to extreme observations?  Explain.

INTERPRETATION OF THE CORRELATION COEFFICIENT

In this next section, we will be examining some scatter diagrams of computer-generated data to gain a more thorough understanding of just what the value of the correlation coefficient means. For each pair of variables, you will calculate r and look at the corresponding scatter diagram.

Enter the values from 0 to 50 for your first variable and name your variable “x”.

Enter 0 in A1, enter 1 in A2, then right click on lower right corner and drag to

A52

Choose:  Fill series

In cell B1 enter the name Random, then activate cell B2, and continue with:

Enter:   =rand( )

Click and drag:  lower right corner of B2 cell to row 52

Get a scatter diagram of the two variables and calculate r.

correlation

 0.047592

When comparing your output to that presented here, remember you are working            with random data and there will be variation in results.

Next, generate a set of y values which has no random component:

Activate cell C2, type =2+A2*.5, click and drag lower right corner of cell

down through row 52.

 correlation 1

Generate a set of y values that have a small random component and repeat above          procedure.

fill column D with  = 2 + 0.5 * A2 + B2

Generate a set of y values that are negatively correlated, and repeat above procedure.

fill column E with  = 2 - 0.5 * A2 + B2

Generate a set of y values that have a large random component and repeat previous procedure.

fill column F with  = 5 + 0.5 * A2 + 2 * B2

Generate a set of y values that are non-linearly related to x.

fill column G with  = SQRT(0.1*A2)

 correlation 0.973675

Generate a second set of y values which are related but not linearly related to x

and repeat previous procedure.

fill column H with  = 9 - (A2 - 25)**2

 correlation -1.25106E-17

QUESTIONS:

1.   Using the results from above, what type of relationship can you determine between the correlation coefficient and the scatterplot?  What type of pattern do     you see in the scatter diagram when r is close to zero?  When r is close to one?  What is the pattern like when r is negative?

2.   Does r being close to zero imply that the two variables are unrelated?  Check column H versus column A before answering this question.

LINEAR REGRESSION

Example: Consider the two variables: a person’s current age and the expected number of remaining years of living independently.

 Age, x 65 67 69 71 73 75 77 79 81 82 Years remaining, y 16.5 15.1 13.7 12.4 11.2 10.1 9 8.4 7.1 6.4

Get a feeling for whether years remaining and the person’s age are correlated by doing a scatter plot and calculating correlation.

 correlation -0.99561

The least squares line can be added to the plot, along with its equation and the value of r2.  Right click on one of the data points shown in the scatter plot.  A drop-down menu will appear.

Select:  Type : Linear

Select:  Options,

then check Display equation of chart and Display R-squared on chart > OK.

What is the expected years remaining for a person who is 70 years old?  To answer this question we’ll use the Forecast command from the insert function.

Choose:   Insert function, fx

Select:     Or select a category : Statistical

Select:      FORECAST > OK

Fill in the dialog box as shown:

The worksheet shows the results.

If we want to obtain the values of slope and intercept without using the Chart Wizard, we can use LINEST(y range, x range).  Activate two horizontally adjacent cells on the worksheet.  Type =LINEST(B2:B11, A2:A11) in the formula bar and press Ctrl+Shift+Enter to generate the values of both slope and intercept.

*

 slope intercept -0.559696919 52.40757151

Here we will illustrate the default output generated by the Regression command for Exercise 3.61.  Notice that a great deal of information is generated, but at this point we would need only the coefficients highlighted in red.

Scrolling down would show the rest of the residual output.

ASSIGNMENT: Do Exercises 3.22, 3.41, 3.42, 3.61, and 3.86 in your text.