Correlation Coefficient (ρyx)
We quantify the direction and strength of the linear relationship between two continuous random variables, Y and X, by estimating their true correlation coefficient, ρyx.
When ryx is non-zero, does that indicate that the true correlation, ρyx, is also non-zero?
Not necessarily, because of sampling variability. We must perform a hypothesis test that accounts for the
sampling variability in order to make a rigorous inference about whether ρyx truly is different from zero.
The following hypothesis testing procedure can be used:
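One standard procedure for this question, assumed here because the steps are not listed above, is the t-test of H0: ρyx = 0, with statistic t = r·sqrt(n-2)/sqrt(1-r²) on n-2 degrees of freedom. The sketch below computes r and t on a small made-up data set; the data and the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corr_t_statistic(r, n):
    """t statistic for H0: rho = 0; compare to a t distribution with n-2 df.
    Undefined when r is exactly +1 or -1 (division by zero)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
t = corr_t_statistic(r, len(x))
print(f"r = {r:.4f}, t = {t:.2f} on {len(x) - 2} df")
```

The p-value would come from the t distribution with n-2 degrees of freedom, which needs a statistics package or a table and is left out of this sketch.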
Simple linear regression (SLR) is a statistical method which allows us to _______ the association between
a continuous variable of interest (the ______ variable) that is thought to be associated with another variable (the ______) in linear fashion.
Simple linear regression (SLR) is a statistical method which allows us to quantify the association between
a continuous variable of interest (the “dependent” variable) that is thought to be associated with another
variable (the “independent variable”) in linear fashion.
ANOVA test statistic
Test statistic: F = MSR/MSE = (SSR/1) / (SSE/(n-2)), where F ~ F(1, n-2) under H0
negative correlation
Y decreases as X increases
The _______ is a point estimator of the true correlation between the two random variables
X and Y.
ryx
True or False: The independent variable may be continuous or
categorical in simple linear regression.
True
The linear nature of the relationship can usually be confirmed by examining a scatterplot of
the dependent and independent variables.
True or False: if 𝛽̂0 and 𝛽̂1, the least squares point estimates, are not equal to 0, that means that
the true (i.e., population) slope and y-intercept are not equal to 0.
False
After obtaining the point estimates, we
should perform hypothesis tests (or compute confidence intervals) to determine whether the true slope and
y-intercept are significantly different from 0. In order to do this, we first need to estimate σ², the true (or
population) variance of Y.
ANOVA null and alternative hypotheses
H0: No linear association between X and Y (i.e. regression is not significant)
HA: There is a linear association between X and Y (i.e. regression is significant)
Overall F-test for SLR:
H0: β1 = 0 vs. HA: β1 ≠ 0, with test statistic F = MSR/MSE = (SSR/1) / (SSE/(n-2)), where F ~ F(1, n-2) under H0
It can be proven mathematically that the two t-test statistics (one for the correlation, one for the
slope) are equal to each other, and that squaring them yields the F-test statistic value. In simple
linear regression, the p-values for all three tests are identical. The latter fact should make sense
on an intuitive level, because in each case the hypotheses are actually the same:
H0: there is no linear association between X and Y vs. HA: there is a linear association between X and Y.
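The equality can be checked numerically. The sketch below computes the slope t statistic, the correlation t statistic, and the overall F statistic from the usual sums of squares; the data set and function name are hypothetical.

```python
import math

def slr_test_statistics(x, y):
    """Return (t_slope, t_corr, f_stat) for a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    ssy = sum((b - mean_y) ** 2 for b in y)          # total sum of squares
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    b1 = sxy / sxx                                   # least squares slope
    ssr = b1 * sxy                                   # regression sum of squares
    sse = ssy - ssr                                  # error sum of squares
    mse = sse / (n - 2)

    f_stat = (ssr / 1) / mse                         # overall F, df = (1, n-2)
    t_slope = b1 / math.sqrt(mse / sxx)              # slope estimate / its std. error
    r = sxy / math.sqrt(sxx * ssy)
    t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t_slope, t_corr, f_stat

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

t_slope, t_corr, f_stat = slr_test_statistics(x, y)
print(t_slope, t_corr, t_slope ** 2, f_stat)
```

Up to floating-point rounding, the first two printed values agree with each other, and the last two agree with each other.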
The magnitude of the correlation coefficient (ρyx) indicates the ______ of the association between X and Y.
Strength of association
True or False: Two variables can be
significantly and strongly correlated without having a cause-and-effect relationship.
True. For example, there is
a strong, statistically significant and positive correlation between weight and height; however, eating
more and gaining weight will not necessarily cause you to grow taller! Here’s another example: there is
likely to be a positive association between a mother's age and her child's height; however, it would not
make sense to infer that an increase in the mother's age directly causes the child's height to increase. Any
increase in the child's height would be a reflection of the physiological changes in the child, not of the
increasing age of the mother.
What are the three things we do by performing SLR?
- estimate the best-fitting straight line that is thought to relate Y and X;
- perform hypothesis tests to determine whether or not the relationship is statistically significant;
- make predictions about the average value of Y for values of X that are of interest.
The distance from the line to the ith data point, 𝑌𝑖 - 𝑌̂𝑖, is called the ______ or the ______
for the ith observation
ith residual or estimate of error
Since the method of least squares involves minimizing the sum of the squared
distances between the line and the plotted data values, we can also say that the method involves
minimizing the sum of the squared residuals or, equivalently, minimizing the sum of squared errors
(“SSE”).
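A minimal sketch of the idea, using the closed-form least squares estimates (slope = Sxy/Sxx, intercept = mean of Y minus slope times mean of X) on made-up data. The function names and data are hypothetical.

```python
def fit_least_squares(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared residuals (Y_i - Yhat_i)^2 for the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_least_squares(x, y)
best = sse(x, y, b0, b1)

# Any other line gives a larger SSE than the least squares line.
print(best, sse(x, y, b0 + 0.1, b1), sse(x, y, b0, b1 - 0.05))
```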
What is a good strategy for SLR analysis?
i. State the SLR model.
ii. Perform a descriptive statistical analysis for Y and X.
iii. Plot Y vs X. Inspect plot for linear pattern.
iv. Calculate and characterize r. Interpret.
v. Fit the SLR model.
vi. Check for validity of SLR assumptions.
vii. Perform the overall F-test. If not significant, report this finding and stop working with the
current model; do not interpret the slope estimate (failing to reject the null means that
you have not been able to show that the slope is anything other than zero).
If significant, proceed with additional inferences/interpretations (below).
viii. Present estimate of the model. Determine a confidence interval for the slope, and present
and interpret it.
ix. Proceed with additional model inferences, if appropriate and wanted: test on Y-intercept,
prediction intervals, confidence intervals on Y.
Can you measure a non-linear association with the correlation coefficient?
It measures the linear association between the two variables; it may not describe the association well
if the variables are associated non-linearly.
The first hypothesis test that is performed in a simple linear regression
is a test for the significance of the
model slope.
H0: β1 = 0, where β1 = true slope of the straight line that is assumed to relate X and Y.
HA: β1 ≠ 0
Let’s state the simple linear regression model for the sodium intake-blood pressure example:
Researchers believe that there is a
linear association between average daily sodium
intake (variable X, in g) and blood pressure
(variable Y, in mmHg). Blood pressures and
sodium intakes are recorded for a random sample
of 12 patients. From Fig. 1.2 it is evident that, as
sodium intake increases, blood pressure
increases, in fairly linear fashion. Estimate the
true correlation between blood pressure and
sodium intake.
Y = β0 + β1X + ε, where
Y = blood pressure, in mmHg
X = average daily sodium intake, in g
β0 = true (population) y-intercept
β1 = true (population) slope
ε = random error term, assumed ~ N(0, σ²) (more on σ², the variance of ε, later)
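To make the role of the error term concrete, the sketch below simulates Y values from a model of this form. The parameter values, sodium intakes, and function name are all hypothetical and are not the estimates from the example.

```python
import random

def simulate_slr(x_values, beta0, beta1, sigma, seed=None):
    """Draw Y = beta0 + beta1*X + error, with error ~ N(0, sigma^2), for each X."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]

# Hypothetical values, chosen only for illustration.
sodium_g = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
bp = simulate_slr(sodium_g, beta0=100.0, beta1=5.0, sigma=8.0, seed=1)

for x, y in zip(sodium_g, bp):
    print(f"X = {x:4.1f} g   Y = {y:6.1f} mmHg")
```

With sigma set to 0 every point falls exactly on the line; larger sigma scatters the points further from it.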
What is the SAS code for performing SLR?
/* Fit the SLR model y = b0 + b1*x; replace dataset, y, and x with your own names */
PROC REG DATA=dataset;
   MODEL y = x;
RUN;
QUIT;
“R-square” or “Coefficient of Determination” or just R2
Rsquared = SSR/_____
SSR/SSY = proportion of the total variation (SSY) that was explained by the regression.
Higher R2 values mean that more of the variation in Y is explained by its relationship with X (people
often say that large R2 means that the model “fits” the data well).
Two notes:
• In SLR, R2 actually turns out to be the square of the sample correlation coefficient, r. Verify this for
the sodium intake-blood pressure example!
• Be careful--R2 can be artificially inflated in multiple linear regression models (models with more than
1 predictor) simply by adding lots of predictors to the model, even when the predictors are not
strongly related to Y. In multiple linear regression models, we report an ‘adjusted R2’, which includes
a penalty term that shrinks the R2 value if predictors are included which do not have a good linear
relationship with Y. (More on this later).
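The first note can be checked numerically. The sketch below computes R2 as SSR/SSY from the fitted values and compares it with the square of r. It uses made-up data rather than the sodium intake-blood pressure data, and the function name is hypothetical.

```python
import math

def r_squared_and_r(x, y):
    """Return (R2, r): R2 = SSR/SSY from the fitted line, r = sample correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    ssy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * a for a in x]
    ssr = sum((f - mean_y) ** 2 for f in fitted)     # variation explained by the line

    return ssr / ssy, sxy / math.sqrt(sxx * ssy)

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2, r = r_squared_and_r(x, y)
print(r2, r ** 2)
```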
True or False: A correlation coefficient of 1 means the relationship is a perfect straight line
True, the data points in the scatterplot fall perfectly on a straight line
Interpret the conclusion when a test of the significance of the slope rejects the null, and when it does not
If the null is not rejected, this means that we have not found a statistically significant linear
relationship between X and Y (at the chosen significance level). If this is the conclusion, our work with this
SLR model ends. Note that there may be a relationship of some other kind (e.g. a non-linear relationship)
between X and Y, so failing to reject the null does not imply that there is no relationship at all.
If the null is rejected, this means that there is a significant linear relationship between X and Y, and we
continue our SLR work (the estimate of the slope should be interpreted and further statistical inferences,
such as predictions, can be made if necessary --see sections 1.3.10 and 1.3.11, below).
What is the general premise of the Method of Least Squares?
the best line is the one for which the distances from the sample data values to the
line are minimized.
Interpret the model AND the estimate of the model
https://dochub.com/ryansandford/P0B76b3K6NX2YejKn2y1Gg/screen-shot-2022-02-26-at-10-15-43-pm-png
𝑌̂i = 𝛽̂0 + 𝛽̂1Xi = -198.03 + 52.46Xi
Yi = estimated mean blood pressure for when
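As a worked illustration, the fitted line above can be evaluated at a chosen sodium intake. The intercept and slope are taken from the estimate above. The function name and the intake of 6 g are hypothetical, and the observed range of X is not given here, so treat the result only as an example of the arithmetic.

```python
# Estimates from the fitted blood pressure model: Yhat = -198.03 + 52.46 * X
B0_HAT = -198.03   # estimated y-intercept (mmHg)
B1_HAT = 52.46     # estimated slope (mmHg per g of daily sodium)

def predicted_mean_bp(sodium_g):
    """Estimated mean blood pressure (mmHg) at a given average daily sodium intake (g)."""
    return B0_HAT + B1_HAT * sodium_g

# Hypothetical intake value, for illustration only.
at_6g = predicted_mean_bp(6.0)
print(f"Estimated mean blood pressure at 6 g/day: {at_6g:.2f} mmHg")

# The slope is the change in estimated mean blood pressure per 1 g increase in intake.
print(f"Change per additional gram: {predicted_mean_bp(7.0) - at_6g:.2f} mmHg")
```

Predictions of this kind are meaningful only for X values within the range observed in the sample.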
Conclude from this:
Model Estimate: 𝑌̂ =−1.88+0.24𝑋 (R2 = 0.74)
Std. Error: (0.53) (0.05)
There is a strong and statistically significant (r=0.86, 95% CI 0.54-0.96) linear association
between chick age and weight. For each 1 day increase in age, weight increases, on average, by 0.24 lbs (95%
confidence interval 0.13 lbs to 0.34 lbs). From Figure 2 it is apparent that a modified model, one which takes
the observed curvature into account, and for which the dependent variable is normally distributed, would be
more appropriate.