Correlation Analysis
Ryx
Simple Linear Regression
Slope
ANOVA
100

Correlation Coefficient (ρyx)

We quantify the direction and strength of the linear relationship between two continuous random variables, Y and X, by estimating their true correlation coefficient, ρyx.

100

When ryx is non-zero, does that indicate that the true correlation, ρyx, is also non-zero?

Not necessarily, because of sampling variability. We must perform a hypothesis test that accounts for the
sampling variability in order to make a rigorous inference about whether ρyx truly is different from zero.
The following hypothesis testing procedure can be used:
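
One common form of such a test, assumed here because the card does not spell the procedure out, is the t-test of H0: ρyx = 0 vs. HA: ρyx ≠ 0, with t = ryx·√(n−2)/√(1−ryx²) compared against a t distribution with n−2 degrees of freedom. A minimal Python sketch follows; the data values and function names are hypothetical and chosen only for illustration.

```python
import math

def sample_correlation(x, y):
    """Pearson sample correlation coefficient r_yx."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2); n - 2 degrees of freedom under H0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = sample_correlation(x, y)
t = correlation_t_statistic(r, len(x))
print(f"r = {r:.3f}, t = {t:.3f}, df = {len(x) - 2}")
```

With only five points, r = 0.8 gives t of about 2.31, which falls short of the two-sided 5% critical value of about 3.18 for 3 degrees of freedom. A sizeable sample correlation can therefore still fail to reach significance.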

100

Simple linear regression (SLR) is a statistical method which allows us to _______ the association between
a continuous variable of interest (the ______ variable) that is thought to be associated with another variable (the ______) in linear fashion.

Simple linear regression (SLR) is a statistical method which allows us to quantify the association between
a continuous variable of interest (the “dependent” variable) that is thought to be associated with another
variable (the “independent variable”) in linear fashion.

100

ANOVA test statistic

Test Statistic:  F = MSR/MSE = (SSR/1) / (SSE/(n-2)), where F ~ F(1, n-2)
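
As a quick illustration of the arithmetic, the sketch below computes F from the two sums of squares. The function name and the input values are hypothetical, not taken from a course example.

```python
def f_statistic(ssr, sse, n):
    """Overall F statistic for SLR: MSR / MSE, with 1 and n - 2 degrees of freedom."""
    msr = ssr / 1          # regression mean square (1 df in SLR)
    mse = sse / (n - 2)    # error mean square (n - 2 df)
    return msr / mse

# Hypothetical sums of squares, for illustration only.
f_value = f_statistic(ssr=6.4, sse=3.6, n=5)
print(round(f_value, 3))
```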

200

negative correlation

Y decreases as X increases

200

The _______ is a point estimator of the true correlation between the two random variables
X and Y.

ryx

200

True or False: The independent variable may be continuous or
categorical in simple linear regression.

True


The linear nature of the relationship can usually be confirmed by examining a scatterplot of
the dependent and independent variables.

200

True or False: if 𝛽̂0 and 𝛽̂1, the least squares point estimates, are not equal to 0, that means that
the true (i.e., population) slope and y-intercept are not equal to 0.

False


After obtaining the point estimates, we
should perform hypothesis tests (or compute confidence intervals) to determine whether the true slope and
y-intercept are significantly different from 0. In order to do this, we first need to estimate σ², the true (or
population) variance of Y.

200

ANOVA null and alternative hypotheses

H0: No linear association between X and Y (i.e. regression is not significant)
HA: There is a linear association between X and Y (i.e. regression is significant)
 


Overall F-test for SLR:
H0: β1 = 0 vs. HA: β1 ≠ 0, with test statistic F = MSR/MSE = (SSR/1) / (SSE/(n-2)), where F ~ F(1, n-2)

It can be proven mathematically that the t-test statistics (for the correlation and for the slope) are equal to each other, and that
squaring them yields the F-test statistic value. The p-values for all three tests are identical in
simple linear regression. This should make sense on an intuitive level, because in each
case the hypotheses are actually the same: H0: there is no linear association between X and Y vs.
HA: there is a linear association between X and Y.
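
The equivalence can be checked numerically. The sketch below uses a small hypothetical data set and the standard SLR formulas: slope t = b1 / √(MSE/Sxx) and correlation t = r·√(n−2)/√(1−r²). The variable names are illustrative.

```python
import math

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b1 = sxy / sxx            # least squares slope estimate
ssr = b1 * sxy            # regression sum of squares
sse = syy - ssr           # error sum of squares
mse = sse / (n - 2)       # error mean square

t_slope = b1 / math.sqrt(mse / sxx)                     # t-test on the slope
r = sxy / math.sqrt(sxx * syy)                          # sample correlation
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t-test on the correlation
f_value = (ssr / 1) / mse                               # overall F-test

print(f"t (slope) = {t_slope:.4f}, t (correlation) = {t_corr:.4f}")
print(f"t^2 = {t_slope ** 2:.4f}, F = {f_value:.4f}")
```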

300

The magnitude of the correlation coefficient (ρyx) indicates the ______ of the association between X and Y.

Strength of association

300

True or False: Two variables can be
significantly and strongly correlated without having a cause-and-effect relationship. 

True. For example, there is
a strong, statistically significant and positive correlation between weight and height; however, eating
more and gaining weight will not necessarily cause you to grow taller! Here’s another example: there is
likely to be a positive association between a mother's age and her child's height; however, it would not
make sense to infer that an increase in the mother's age directly causes the child's height to increase. Any
increase in the child's height would be a reflection of the physiological changes in the child, not of the
increasing age of the mother.

300

What are the three things we do by performing SLR?

- estimate the best-fitting straight line that is thought to relate Y and X;
- perform hypothesis tests to determine whether or not the relationship is statistically significant;
- make predictions about the average value of Y for values of X that are of interest.
 

300

The distance from the line to the ith data point, 𝑌𝑖 - 𝑌̂𝑖, is called the ______ or the ______
for the ith observation.

The ith residual or estimate of error.

Since the method of least squares involves minimizing the sum of the squared
distances between the line and the plotted data values, we can also say that the method involves
minimizing the sum of the squared residuals or, equivalently, minimizing the sum of squared errors
(“SSE”).
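
A minimal sketch of this idea, using the standard closed-form least squares estimates (slope = Sxy/Sxx, intercept = Ȳ − slope·X̄) on hypothetical data. It computes the residuals and shows that nudging the fitted line increases the SSE.

```python
def least_squares_line(x, y):
    """Return (b0, b1), the least squares intercept and slope estimates."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_errors(x, y, b0, b1):
    """SSE: sum of squared residuals (Y_i minus fitted value) for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

b0, b1 = least_squares_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse_fit = sum_squared_errors(x, y, b0, b1)
sse_shifted = sum_squared_errors(x, y, b0 + 0.1, b1)   # any other line does worse

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print("residuals:", [round(e, 2) for e in residuals])
print(f"SSE at least squares line = {sse_fit:.2f}, SSE after shifting intercept = {sse_shifted:.2f}")
```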

300

What is a good strategy for SLR analysis?

i. State the SLR model.
ii. Perform a descriptive statistical analysis for Y and X.
iii. Plot Y vs X.  Inspect plot for linear pattern.
iv. Calculate and characterize r.  Interpret.  
v. Fit the SLR model.  
vi. Check for validity of SLR assumptions.
vii. Perform the overall F-test. If it is not significant, report this finding and stop working with the
current model; do not interpret the slope estimate (failing to reject
the null means that you have not been able to show that the slope is anything other than
zero). If it is significant, proceed with additional inferences and interpretations (below).
viii. Present estimate of the model.  Determine a confidence interval for the slope, and present
and interpret it.
ix. Proceed with additional model inferences, if appropriate and wanted: test on Y-intercept,
prediction intervals, confidence intervals on Y.

400

Can you measure a non-linear association with the correlation coefficient?

Not in general. The correlation coefficient measures the linear association between the two variables; it may not describe the association well
if the variables are associated non-linearly.
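
A small sketch of why this matters: in the hypothetical data below, Y is completely determined by X (Y = X²), yet the sample correlation is zero because the relationship is not linear.

```python
import math

def pearson_r(x, y):
    """Pearson sample correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect quadratic relationship, symmetric about x = 0.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```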

400

The first hypothesis test that is performed in a simple linear regression

is a test for the significance of the model slope.

H0: β1 = 0, where β1 = true slope of the straight line that is assumed to relate X and Y.
HA: β1 ≠ 0

400

Let’s state the simple linear regression model for the sodium intake-blood pressure example:
Researchers believe that there is a linear association between average daily sodium intake (variable X, in g) and blood pressure (variable Y, in mm Hg). Blood pressures and sodium intakes are recorded for a random sample of 12 patients. From Fig. 1.2 it is evident that, as sodium intake increases, blood pressure increases, in fairly linear fashion. Estimate the true correlation between blood pressure and sodium intake.

Y = β0 + β1X + ε, where

Y = blood pressure, in mm Hg
X = average daily sodium intake, in g
β0 = true (population) y-intercept
β1 = true (population) slope
ε = random error term, assumed ~ N(0, σ²) (more on σ², the variance of ε, later)
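
To make the roles of the terms concrete, the sketch below simulates observations from a model of this form. The parameter values and X range are hypothetical, picked only for illustration; they are not the true or estimated values for the sodium intake-blood pressure example.

```python
import random

# Hypothetical parameter values, chosen only to illustrate the model's structure.
BETA0 = 100.0   # assumed true y-intercept
BETA1 = 5.0     # assumed true slope
SIGMA = 3.0     # assumed standard deviation of the random error term

def mean_response(x):
    """Deterministic part of the model: intercept + slope * x."""
    return BETA0 + BETA1 * x

def simulate_y(x, rng):
    """One draw of Y = intercept + slope * x + error, with error ~ N(0, SIGMA^2)."""
    return mean_response(x) + rng.gauss(0.0, SIGMA)

rng = random.Random(0)                             # fixed seed so the run is repeatable
xs = [rng.uniform(2.0, 6.0) for _ in range(12)]    # 12 hypothetical X values
ys = [simulate_y(x, rng) for x in xs]

for x, y in zip(xs, ys):
    print(f"x = {x:.2f}, mean of Y = {mean_response(x):.2f}, simulated Y = {y:.2f}")
```

Each simulated Y scatters around the straight line by a random amount, which is the behavior the error term describes.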

400

What is the SAS code for performing SLR?

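/* Fit a simple linear regression; "dataset", "y", and "x" are placeholder names */
PROC REG DATA=dataset;
    MODEL y=x;    /* dependent variable = independent variable */
RUN;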

400

“R-square” or “Coefficient of Determination” or just R²
R² = SSR/_____

 

SSR/SSY = proportion of the total variation (SSY) that was explained by the regression.
Higher R² values mean that more of the variation in Y is explained by its relationship with X (people
often say that a large R² means that the model “fits” the data well).
Two notes:
• In SLR, R² actually turns out to be the square of the sample correlation coefficient, r. Verify this for
the sodium intake-blood pressure example!
• Be careful: R² can be artificially inflated in multiple linear regression models (models with more than
1 predictor) simply by adding lots of predictors to the model, even when the predictors are not
strongly related to Y. In multiple linear regression models, we report an ‘adjusted R²’, which includes
a penalty term that shrinks the R² value if predictors are included which do not have a good linear
relationship with Y. (More on this later.)
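
The first note can be checked numerically. The sketch below uses a small hypothetical data set (the sodium intake-blood pressure data are not reproduced on this card), computes SSR/SSY from the fitted line, and compares it with the square of the sample correlation.

```python
import math

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ssy = sum((yi - mean_y) ** 2 for yi in y)      # total variation in Y

b1 = sxy / sxx                                 # least squares slope
b0 = mean_y - b1 * mean_x                      # least squares intercept
fitted = [b0 + b1 * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
ssr = ssy - sse                                          # explained variation
r_squared = ssr / ssy

r = sxy / math.sqrt(sxx * ssy)                 # sample correlation coefficient
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```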
 

500

True or False: A correlation coefficient of 1 means the relationship is a perfect straight line

True. The data points in the scatterplot fall perfectly on a straight line.

500

Interpret the conclusion when a test of the significance of the slope rejects the null, and when it fails to reject the null.

If the null is not rejected, this means that we have not found a statistically significant linear
relationship between X and Y (at the α significance level). If this is the conclusion, our work with this
SLR model ends. Note that there may be a relationship of some other kind (e.g. a non-linear relationship)
between X and Y, so failing to reject the null does not imply that there is no relationship at all.

If the null is rejected, this means that there is a significant linear relationship between X and Y, and we
continue our SLR work (the estimate of the slope should be interpreted and further statistical inferences,
such as predictions, can be made if necessary; see sections 1.3.10 and 1.3.11, below).
 

500

What is the general premise of the Method of Least Squares?

The best line is the one for which the sum of the squared distances from the sample data values to the
line is minimized.

500


𝑌̂𝑖 = 𝛽̂0 + 𝛽̂1𝑋𝑖 = -198.03 + 52.46𝑋𝑖


𝑌̂𝑖 = estimated mean blood pressure (in mm Hg) when average daily sodium intake is 𝑋𝑖 (in g)
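
A fitted line like this can be used to estimate the mean blood pressure at a sodium intake of interest. The sketch below plugs a hypothetical intake of 6 g into the estimates shown on the card. The function name and the 6 g value are illustrative, and it is an assumption that 6 g lies within the range of the observed data.

```python
B0_HAT = -198.03   # estimated y-intercept from the card
B1_HAT = 52.46     # estimated slope from the card (mm Hg per g of sodium)

def predicted_mean_bp(sodium_g):
    """Estimated mean blood pressure (mm Hg) at a given average daily sodium intake (g)."""
    return B0_HAT + B1_HAT * sodium_g

# Hypothetical intake value, for illustration only.
print(round(predicted_mean_bp(6.0), 2))
```

Under these estimates, each additional gram of daily sodium raises the estimated mean blood pressure by 52.46 mm Hg.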

500

Conclude from this: 

Model Estimate: 𝑌̂ = −1.88 + 0.24𝑋 (R² = 0.74)
Std. Errors: (0.53) for the intercept, (0.05) for the slope

There is a strong and statistically significant (r=0.86, 95% CI 0.54-0.96) linear association
between chick age and weight. For each 1-day increase in age, weight increases, on average, by 0.24 lbs (95%
confidence interval 0.13 lbs to 0.34 lbs). From Figure 2 it is apparent that a modified model, one which takes
the observed curvature into account, and for which the dependent variable is normally distributed, would be
more appropriate.
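
Two quick consistency checks on the reported numbers, as a sketch: the t statistic for the slope is the estimate divided by its standard error, and in SLR the reported R² should match the square of r. All values are taken from the card; the variable names are illustrative.

```python
slope = 0.24                 # slope estimate from the card
se_slope = 0.05              # standard error of the slope from the card
r = 0.86                     # reported sample correlation
r_squared_reported = 0.74    # reported R^2

t_slope = slope / se_slope       # t statistic for H0: true slope = 0
r_squared_from_r = r ** 2        # in SLR, R^2 equals r^2

print(f"t for slope = {t_slope:.2f}")
print(f"r^2 = {r_squared_from_r:.4f} vs reported R^2 = {r_squared_reported}")
```

The sample size is not given on the card, so the degrees of freedom (n − 2) and the p-value cannot be filled in here.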