Pearson's r
INTERVAL DATA: Association & Inference
Pearson's r: r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^{2} - (\sum X)^{2}][N\sum Y^{2} - (\sum Y)^{2}]}}
Where: X = one variable
Y = other variable
N = number of cases in the sample
Assumptions: linear relationship; homoscedasticity
Example:
Person | Height X | Income Y |
A | 10 | 10 |
B | 8 | 9 |
C | 6 | 7 |
D | 3 | 2 |
Q: What is the association between height and income?
A: First check assumptions by making a scatter diagram.
The scatter diagram shows a linear relationship, and the assumption
of homoscedasticity is also treated as met.
Person | Height X | X^{2} | Income Y | Y^{2} | XY |
A | 10 | 100 | 10 | 100 | 100 |
B | 8 | 64 | 9 | 81 | 72 |
C | 6 | 36 | 7 | 49 | 42 |
D | 3 | 9 | 2 | 4 | 6 |
N=4 | 27 | 209 | 28 | 234 | 220 |
r = \frac{4(220) - 27(28)}{\sqrt{[4(209) - (27)^{2}][4(234) - (28)^{2}]}} = \frac{124}{\sqrt{[107][152]}} = .97
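The hand calculation above can be checked with a short script. This is a minimal sketch of the raw-score formula, not a library routine; the function name `pearson_r` and the variable names are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the raw-score (computational) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)               # sum of X squared
    sum_y2 = sum(y * y for y in ys)               # sum of Y squared
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of XY
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

height = [10, 8, 6, 3]   # X column from the example
income = [10, 9, 7, 2]   # Y column from the example
r = pearson_r(height, income)
print(round(r, 2))  # 0.97
```

The intermediate sums match the bottom row of the table (27, 209, 28, 234, 220), so the numerator is 124 and the denominator is the square root of 107 times 152.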
Interpretation: Use the scale: "There is a ______ association between (variable 1) and (variable 2)."
For r^{2}: Convert to a percent and include in the statement, "_____% of the variance in (variable 1) can be explained by (variable 2)," or vice versa.
For 1 - r^{2}: Convert to a percent and include in the statement, "_____% of the variance in (variable 1) cannot be explained by (variable 2)," or vice versa.
Interpretation of the example above:
Large positive relationship between height and income.
94% of variance in height can be explained by income.
6% of variance in height cannot be explained by income.
94% reduction in error when predicting height from income.
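The percentages in this interpretation come straight from r^{2} and 1 - r^{2}. A small sketch of that arithmetic follows; the helper name `variance_shares` is hypothetical, and rounding to whole percents is an assumption that matches the figures quoted above.

```python
def variance_shares(r):
    """Return (explained %, unexplained %) for a correlation r."""
    r_squared = r ** 2
    explained = round(r_squared * 100)   # r^2 converted to a whole percent
    return explained, 100 - explained    # 1 - r^2 as the remainder

explained, unexplained = variance_shares(0.97)
print(f"{explained}% of the variance can be explained; {unexplained}% cannot.")
```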
B. Test of significance for r and partial r
Test of significance for r:
1. Compute r and |r|.
2. d.f. = N - 2
3. Look up the p-value of |r| in the table in the appendix.
Assumptions: 1) linear relationship
2) homoscedasticity
3) normal distribution of both variables in the whole population.
In this class, assume a normal distribution, because checking
for normality is beyond the scope of this course.
Example:
Person | Height X | Income Y |
A | 10 | 10 |
B | 8 | 9 |
C | 6 | 7 |
D | 3 | 2 |
Q: Can this association be generalized to the whole population?
A: r = .97; d.f. = N - 2 = 4 - 2 = 2; |r| = .97
4. Check to see if this is a one-tailed or a two-tailed test.
In this question there is no directional hypothesis, so do the
two-tailed test. The table looks something like this:
Two-tailed test
d.f. | p = .05 | p = .01 |
1 | : | : |
2 | .950 | .990 |
3 | : | : |
(If |r| is smaller than the p = .05 value, then p > .05.)
r = .97 is larger than .950 but smaller than .990, so p < .05.
So yes, one can generalize this association to the whole population.
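The lookup in steps 2 through 4 can be sketched in code. Only the d.f. = 2 row quoted in the table excerpt is included here, so the dictionary is deliberately incomplete; the names `CRITICAL_R_TWO_TAILED` and `significance_level` are assumptions for illustration, and a real table would come from the appendix.

```python
# Critical values of |r| for a two-tailed test, copied from the table
# excerpt above. Only the d.f. = 2 row is shown there.
CRITICAL_R_TWO_TAILED = {2: {0.05: 0.950, 0.01: 0.990}}

def significance_level(r, n, table=CRITICAL_R_TWO_TAILED):
    """Return the smallest tabled p-level that |r| reaches, or None if p > .05."""
    df = n - 2                       # step 2: degrees of freedom
    row = table[df]                  # step 3: find the row for this d.f.
    reached = [p for p, critical in row.items() if abs(r) >= critical]
    return min(reached) if reached else None

print(significance_level(0.97, 4))   # 0.05
```

A return value of 0.05 means the association is significant at the .05 level but not at the .01 level; `None` would mean it cannot be generalized.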
C. Multivariate Association: Partial r
Partial r (r_{12.3}) = \frac{r_{12} - (r_{13})(r_{23})}{\sqrt{[1 - (r_{13})^{2}][1 - (r_{23})^{2}]}}
Where:
r_{12} = Pearson's r for variables 1 & 2.
r_{13} = Pearson's r for variables 1 & 3.
r_{23} = Pearson's r for variables 2 & 3.
Assumptions: linear relationship; homoscedasticity
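The partial r formula translates directly into a few lines of code. The sketch below assumes the three pairwise correlations have already been computed; the function name `partial_r` is hypothetical, and the inputs are the rounded correlations from the worked example in this section.

```python
import math

def partial_r(r12, r13, r23):
    """Partial correlation of variables 1 and 2 with variable 3 held constant."""
    numerator = r12 - r13 * r23
    denominator = math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    return numerator / denominator

# Rounded pairwise correlations taken from the worked example.
print(round(partial_r(-0.85, 0.95, -0.94), 2))  # 0.4
```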
Example:
Person | Education #1 | Income #2 | Age #3 |
A | 10 | 1 | 5 |
B | 5 | 3 | 4 |
C | 4 | 4 | 3 |
D | 2 | 5 | 2 |
E | 1 | 10 | 1 |
Q: What is the association between education and income with age held constant?
A: Variable #3 will always be the one held constant. Call education variable #1 and income variable #2. Make scatter diagrams to check if the assumptions are met. (In this example it is a stretch to imagine linear relationships, but proceed as if they were linear.)
Recall: r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^{2} - (\sum X)^{2}][N\sum Y^{2} - (\sum Y)^{2}]}}
Education (#1) Income (#2)
Person | Education #1 | X^{2} | Income #2 | Y^{2} | XY |
A | 10 | 100 | 1 | 1 | 10 |
B | 5 | 25 | 3 | 9 | 15 |
C | 4 | 16 | 4 | 16 | 16 |
D | 2 | 4 | 5 | 25 | 10 |
E | 1 | 1 | 10 | 100 | 10 |
N=5 | 22 | 146 | 23 | 151 | 61 |
r_{12} = \frac{5(61) - 22(23)}{\sqrt{[5(146) - (22)^{2}][5(151) - (23)^{2}]}} = -.85
Education (#1) Age (#3)
Person | Education #1 | X^{2} | Age #3 | Y^{2} | XY |
A | 10 | 100 | 5 | 25 | 50 |
B | 5 | 25 | 4 | 16 | 20 |
C | 4 | 16 | 3 | 9 | 12 |
D | 2 | 4 | 2 | 4 | 4 |
E | 1 | 1 | 1 | 1 | 1 |
N=5 | 22 | 146 | 15 | 55 | 87 |
r_{13} = \frac{5(87) - 22(15)}{\sqrt{[5(146) - (22)^{2}][5(55) - (15)^{2}]}} = .95
Income (#2) Age (#3)
Person | Income #2 | X^{2} | Age #3 | Y^{2} | XY |
A | 1 | 1 | 5 | 25 | 5 |
B | 3 | 9 | 4 | 16 | 12 |
C | 4 | 16 | 3 | 9 | 12 |
D | 5 | 25 | 2 | 4 | 10 |
E | 10 | 100 | 1 | 1 | 10 |
N=5 | 23 | 151 | 15 | 55 | 49 |
r_{23} = \frac{5(49) - 23(15)}{\sqrt{[5(151) - (23)^{2}][5(55) - (15)^{2}]}} = -.94
r_{12.3} = \frac{(-.85) - (+.95)(-.94)}{\sqrt{[1 - (+.95)^{2}][1 - (-.94)^{2}]}}
= \frac{(-.85) - (-.893)}{\sqrt{[1 - (+.9025)][1 - (+.8836)]}}
= \frac{+.043}{\sqrt{[.0975][.1164]}}
= \frac{.043}{\sqrt{.011349}}
= \frac{.043}{.1065} = +.40
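The whole calculation can also be run from the raw scores. The sketch below uses the Education, Income, and Age columns from the three computation tables; the function names are assumptions for illustration. It reports the result two ways because the hand calculation rounds each pairwise r to two decimals before combining them, and that rounding noticeably affects a numerator as small as .043.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the raw-score formula."""
    n = len(xs)
    num = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    den = math.sqrt((n * sum(x * x for x in xs) - sum(xs) ** 2)
                    * (n * sum(y * y for y in ys) - sum(ys) ** 2))
    return num / den

def partial_r(r12, r13, r23):
    """Partial r for variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

education = [10, 5, 4, 2, 1]   # variable 1
income = [1, 3, 4, 5, 10]      # variable 2
age = [5, 4, 3, 2, 1]          # variable 3

r12 = pearson_r(education, income)
r13 = pearson_r(education, age)
r23 = pearson_r(income, age)

# Same procedure as the hand calculation: round each r first.
rounded = partial_r(round(r12, 2), round(r13, 2), round(r23, 2))
# Full precision carried through to the end.
unrounded = partial_r(r12, r13, r23)

print(round(r12, 2), round(r13, 2), round(r23, 2))   # -0.85 0.95 -0.94
print(round(rounded, 2), round(unrounded, 2))        # 0.4 0.35
```

With two-decimal inputs the script reproduces +.40; carrying full precision gives roughly .35, so the rounded answer is best read as approximate.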
Interpretation: Same as for r, but add the third variable, which is held constant.
Example: r_{12.3} = .40, r_{12.3}^{2} = (.40)^{2} = .16, 1 - r_{12.3}^{2} = 1 - .16 = .84
Moderate association between education and income with age held constant.
16% of variance in education can be explained by income (or vice versa) with age held constant.
84% of variance in education cannot be explained by income (or vice versa) with age held constant.
16% reduction in error when predicting education from income (or vice versa) with age held constant.