LAB SESSION 5

CORRELATION AND REGRESSION

 

INTRODUCTION:  Not only is it important to analyze single variables, but frequently one needs to determine if and how two variables are related.  The correlation coefficient is a measure of the strength of the linear relationship between two variables.  In these exercises you will use Excel to analyze this statistic, and these exercises will also give you a very brief introduction to linear regression.

 

INVESTIGATIONS OF THE CORRELATION COEFFICIENT

The data set below is a sample of weight and waist size for 11 women.  You will            use that data to estimate the correlation between a woman's weight and her waist      size.  Once that value has been determined you will show that this value is                         independent of the scale of the two variables.

 

Weights and Waist Sizes

weight(lbs): 110  143  120  127  143  111  137  154  123  104 140

waist (ins):    22    29    27   26     27   24    28    28    26    25   23

 

 

Enter the data into the worksheet and name the two variables.

 

Get a scatter diagram of the bivariate data set.  The variable 'WEIGHT' should be

on the x-axis and 'WAIST' on the y-axis.

 

            Choose:  Chart Wizard > XY(Scatter) > 1st picture > Next

            Enter:      Data Range:  select cells > Next

 

We can edit the scatter plot (since all the points are in a corner), and rescale the axes to reflect the data range. 

Right click on the X-axis

Select: Format Axes

Click on the Scale tab:  Change the minimum and maximum values for X.

                                      Minimum:  100

                                      Maximum:  160

                                      Major unit:  20

                                      Minor unit:  5

Also make appropriate changes to the y-axis scale

 

 

Let’s also generate descriptive statistics for each of these variables:

 

            Choose:   Tools > Data Analysis > Descriptive Statistics > OK

            Enter:       Input Range:  select cells

                             Output Range:  select cell

            Choose:    Summary statistics > OK             

 

* the output shown below is edited for clarity

Weight

 

waist (ins)

 

 

 

 

 

Mean

128.3636

 

Mean

25.90909

Median

127

 

Median

26

Mode

143

 

Mode

27

Standard Deviation

16.21279

 

Standard Deviation

2.21154

Range

50

 

Range

7

Minimum

104

 

Minimum

22

Maximum

154

 

Maximum

29

Count

11

 

Count

11

 

 

Calculate the correlation coefficient, r.

                Choose:  Insert function, fx > Statistical > Correl > OK

            Enter:     Array 1:  x data range

                           Array 2:  y data range > OK

 

*

correlation

0.603436716

 

 


QUESTIONS:

 

1.         Would you say that the variables were positively or negatively correlated?  Is there a strong or weak correlation?

 

2.         If you were to add an equal amount of weight to each woman (assume no change in waist size), would the value of r, the correlation coefficient, change?  Test your conjecture by adding 25 lbs. to each woman's weight and recalculate r.  The necessary commands are:

                       

Activate cell C2 and type “= A2 + 25”, then drag right corner down to perform the same calculation on all of column A.  Redo the correlation using column C for Array 1.

 

3.         If you were to change the scale of the variables: weight to kg and waist size to meters, would the value of r change?  Test your conjecture by multiplying 'WEIGHT' by 0.453 and 'WAIST' by .0254  and recalculate r.  How will the scatter diagram change when you change the scales?

 

4.         The last observation in your data set was for a model known for her especially thin figure.  If you eliminated it from the data set, how much would r change?  Would you say that the statistic, r, is sensitive to extreme observations?  Explain.

 

 

 

INTERPRETATION OF THE CORRELATION COEFFICIENT

 

In this next section, we will be examining some scatter diagrams of computer-generated data to gain a more thorough understanding of just what the value of the correlation coefficient means. For each pair of variables, you will calculate r and look at the corresponding scatter diagram.

 

Enter the values from 0 to 50 for your first variable and name your variable “x”.

Enter 0 in A1, enter 1 in A2, then right click on lower right corner and drag to

A52

            Choose:  Fill series

 

In cell B1 enter the name Random, then activate cell B2, and continue with:

Enter:   =rand( )

Click and drag:  lower right corner of B2 cell to row 52

                               

 

Get a scatter diagram of the two variables and calculate r.

 

correlation

0.047592

 

When comparing your output to that presented here, remember you are working            with random data and there will be variation in results.

 


 


Next, generate a set of y values which has no random component:

 

Activate cell C2, type =2+A2*.5, click and drag lower right corner of cell

       down through row 52.

 

correlation

1

 

Generate a set of y values that have a small random component and repeat above          procedure.

 

fill column D with  = 2 + 0.5 * A2 + B2

 

Generate a set of y values that are negatively correlated, and repeat above procedure.

 

fill column E with  = 2 - 0.5 * A2 + B2

     

Generate a set of y values that have a large random component and repeat previous procedure.

                                               

fill column F with  = 5 + 0.5 * A2 + 2 * B2


Generate a set of y values that are non-linearly related to x.

 

fill column G with  = SQRT(0.1*A2)

 

 

correlation

0.973675

 

 

 

 

 

Generate a second set of y values which are related but not linearly related to x

and repeat previous procedure.

fill column H with  = 9 - (A2 - 25)**2

 

correlation

-1.25106E-17

 

 

 


QUESTIONS: 

1.   Using the results from above, what type of relationship can you determine between the correlation coefficient and the scatterplot?  What type of pattern do     you see in the scatter diagram when r is close to zero?  When r is close to one?  What is the pattern like when r is negative?

 

2.   Does r being close to zero imply that the two variables are unrelated?  Check column H versus column A before answering this question.

 

 


LINEAR REGRESSION

 

Example: Consider the two variables: a person’s current age and the expected number of remaining years of living independently.

 

Age, x

65

67

69

71

73

75

77

79

81

82

Years remaining, y

16.5

15.1

13.7

12.4

11.2

10.1

9.0

8.4

7.1

6.4

 

 

Get a feeling for whether years remaining and the person’s age are correlated by doing a scatter plot and calculating correlation.

 

correlation

-0.99561

 

 

 

                The least squares line can be added to the plot, along with its equation and the value of r2.  Right click on one of the data points shown in the scatter plot.  A drop-down menu will appear. 

            Select:  Add Trendline

            Select:  Type : Linear

            Select:  Options,

             then check Display equation of chart and Display R-squared on chart > OK.

 

 

 

 

               

 

What is the expected years remaining for a person who is 70 years old?  To answer this question we’ll use the Forecast command from the insert function.

 

            Choose:   Insert function, fx

            Select:     Or select a category : Statistical

            Select:      FORECAST > OK

 

            Fill in the dialog box as shown:

 

             

 

 

The worksheet shows the results.

 

 

If we want to obtain the values of slope and intercept without using the Chart Wizard, we can use LINEST(y range, x range).  Activate two horizontally adjacent cells on the worksheet.  Type =LINEST(B2:B11, A2:A11) in the formula bar and press Ctrl+Shift+Enter to generate the values of both slope and intercept.

            *

slope

intercept

-0.559696919

52.40757151

 

 

 

 

 

 

 

 

 

 


Here we will illustrate the default output generated by the Regression command for Exercise 3.61.  Notice that a great deal of information is generated, but at this point we would need only the coefficients highlighted in red.

 

 

 

 

Scrolling down would show the rest of the residual output.

 

 

 

ASSIGNMENT: Do Exercises 3.22, 3.41, 3.42, 3.61, and 3.86 in your text.