A correlation exists between two random variables—a predictor variable (or
explanatory or independent variable) and a response variable (or dependent
variable)—if the value of the response variable changes in a consistent
manner whenever the value of the predictor variable changes. If the
relationship between the variables is linear, it is referred to as linear
correlation.
It is possible for correlation between random variables to be nonlinear.
However, in this course, we will only deal with linear relationships, and all
techniques described in Chapter 10 make that assumption.
There are two types of linear correlation:
− Positive: The response variable tends to increase when the predictor
variable increases.
− Negative: The two variables tend to change in opposite directions,
with the response variable decreasing as the predictor variable
increases.
Scatter Plots
A scatter plot (or scatter diagram) is a plot of ordered pairs of predictor and
response variable values plotted on an xy-plane. The predictor variable
value is the x-coordinate and the response value is the y-coordinate.
A scatter plot is typically the first tool a statistician will use to investigate
the potential existence of correlation. It will not only provide a visual clue
as to the type of correlation present, if any, but also the strength of that
correlation. The stronger the relationship between the two variables, the
closer the scatter plot pattern will be to a straight line.
To construct a scatter plot using StatCrunch, first load the data set, then do
the following:
- Select Graph > Scatter Plot.
- Use the drop-down menus to identify the x (predictor) and y (response)
variables.
- Click “Compute!”
Example 1: The Handspan-Height data set consists of observed height in
inches (the response variable) and handspan in centimeters (the predictor
variable) of 167 individuals. Build a scatter plot and assess whether the
variables appear to be correlated. If so, what type of correlation is present?
The points in the scatter plot trend upward from lower left to upper right,
suggesting the two variables are positively linearly correlated.
Example 2: The IQ-Cranial data set contains the paired measurements of
cranial circumference in centimeters (predictor) and Stanford-Binet IQ
scores (response) for 20 randomly selected individuals. Build a scatter plot
and assess whether the variables appear to be correlated. If so, what type of
correlation is present?
There appears to be no obvious pattern in the
scatter plot, suggesting the two variables are
not correlated.
Measures of Correlation
The correlation coefficient r (sometimes called Pearson’s r) is a sample statistic
that measures not only the strength of the linear correlation between two
variables of interest, but also the type (positive or negative). It is defined on
the interval −1 ≤ r ≤ 1, with negative values corresponding to negative
correlation and positive values to positive correlation. The closer r is to ±1,
the stronger the linear relationship between the two variables.
The coefficient of determination, the square of the correlation coefficient, is
denoted by r². This sample statistic measures the proportion of the
variation in the response variable that can be explained by the variation of
the associated predictor variable. It is defined on the interval 0 ≤ r² ≤ 1.
The closer r² is to 1, the stronger the linear relationship between the variables.
The coefficient of determination will be our primary measure of linear
association.
To calculate the above two coefficients in StatCrunch:
- Select Stat > Summary Stats > Correlation.
- Choose the appropriate data columns; click the label for the explanatory
variable first to put it on the x-axis.
- Click “Compute!”.
The displayed result is r. This method will not produce r²; to calculate it,
you will need to square r by hand. (We’ll learn later in this lesson how to
obtain r² directly from StatCrunch output.)
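For readers working outside StatCrunch, the same calculation can be sketched in Python. This is a minimal illustration, not StatCrunch's implementation: the function name pearson_r and the five data pairs are invented for the example, and the formula used is the standard sample correlation, r = Sxy / sqrt(Sxx · Syy), which these notes do not state explicitly.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2  # coefficient of determination: square r "by hand"
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

For this made-up data the sketch gives r of about 0.775 and r² of 0.600. The function does not guard against a column whose values are all identical, in which case the denominator is zero.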
Example 3: Calculate r and r² for the two variables in the Handspan-Height
data set. Are the results consistent with those of Example 1?
StatCrunch returns a correlation coefficient of r = 0.740. Squaring r gives
the coefficient of determination r² = 0.547. This suggests a moderate level
of positive correlation between the two variables and is consistent with the
scatter plot from Example 1.
Example 4: Calculate r and r² for the two variables in the IQ-Cranial data
set. Are the results consistent with those of Example 2?
StatCrunch returns a correlation coefficient of r = 0.138. Squaring r gives
the coefficient of determination r² = 0.019. This suggests a weak to
nonexistent level of correlation between the two variables and is consistent
with the scatter plot from Example 2.
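The squaring step in Examples 3 and 4 is simple enough to check directly. The sketch below squares the rounded correlation coefficients reported in those examples; the variable names are illustrative. Squaring the rounded value 0.740 gives 0.5476, slightly above the reported 0.547. A plausible explanation (an assumption, since the notes do not say) is that the reported figure comes from squaring the unrounded r.

```python
# Rounded correlation coefficients as reported in Examples 3 and 4.
r_handspan_height = 0.740
r_iq_cranial = 0.138

for label, r in [("Handspan-Height", r_handspan_height),
                 ("IQ-Cranial", r_iq_cranial)]:
    r_squared = r ** 2  # coefficient of determination from the rounded r
    print(f"{label}: r = {r:.3f}, r^2 = {r_squared:.4f}")
```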
Common Errors When Using Correlation
− Calculating linear correlation measures for a data set where the
relationship between the variables is nonlinear.
− Manipulating the value of r or r² by removing influential data
values.
− Assuming that the existence of correlation between the predictor and
response variables implies a cause-and-effect relationship between
those variables.
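The first error in the list can be illustrated with a small made-up data set in which y is determined exactly by x, yet the linear correlation is zero because the relationship is quadratic. The data and the helper name pearson_r are hypothetical, and the formula is the standard sample correlation rather than anything specific to StatCrunch.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: y is an exact (but nonlinear) function of x.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")  # a linear measure sees no relationship here
```

Here r comes out as zero even though knowing x tells you y exactly, which is why a scatter plot should be inspected before trusting r or r².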
Example 5: Suppose a data analysis shows a positive correlation between
the number of stork breeding pairs in a certain geographic region and the
region’s birth rate:
Adapted from Matthews, “Storks Deliver Babies (p = 0.008),”
Teaching Statistics, Vol 22 No 2 (June 2000).
Does it necessarily follow that the increase in the number of stork breeding
pairs causes an increase in the birth rate?
No, because correlation does not imply causation. In other words, the mere
existence of a relationship between the variables does not mean that changes
to the values of one cause changes to the values of the second.
Example 6: Suppose someone shows you the following graphic, saying that
the information in it is “proof” that organic foods “cause” autism. Does that
claim make sense? Why or why not?
The claim does not make sense. The presence of correlation does not “prove”
anything. Also, this confusing graph does not clarify that the number of
spectrum diagnoses is the predictor variable, rather than the response
variable as the person making the claim has assumed. Since organic foods are
often recommended as part of a spectrum patient’s diet, the real (and more
believable) conclusion is that an increase in spectrum diagnoses may
contribute to an increase in organic food consumption.
Note also that the assumption that spectrum diagnoses and organic food
consumption are the only variables that affect each other is a gross
oversimplification. It is quite likely that each variable’s values are influenced
by a variety of factors not considered in this study.
Simple Linear Regression
If two random variables are linearly correlated, their relationship can be
exploited to estimate the population mean of the response variable for a
given value of the predictor variable. This is done using a simple linear
regression model:
μ_y = β₀ + β₁x
where x is the predictor variable, y is the response variable, μ_y is the
population mean of y (the average response) at the given value of x, and
β₀ and β₁ are the regression coefficients.
The above model, called the population regression model, is the “true” model,
which in nearly all cases we cannot obtain directly. Instead, we need to
approximate it using an estimated regression (or least-squares) model:
ŷ = b₀ + b₁x
where x is a fixed value of the predictor variable, b₀ and b₁ are estimates of
β₀ and β₁, and ŷ is an estimate of μ_y for the given value of x.
For each ordered pair (x, y) in the data set, the residual is the difference
between y and ŷ, the point estimate of μ_y produced by substituting x into
the estimated regression equation ŷ = b₀ + b₁x. The values of b₀ and b₁ are
chosen such that the sum of the squares of the residuals is the smallest
value possible. This is referred to as the least-squares property.
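The least-squares property can be checked numerically. The sketch below uses the usual closed-form estimates, b₁ = Sxy / Sxx and b₀ = ȳ − b₁x̄ (standard textbook formulas, not stated in these notes), on invented data, and then confirms that nudging either coefficient increases the sum of squared residuals. The names fit_line and sse are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared residuals (y - y-hat) for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
best = sse(xs, ys, b0, b1)
print(f"y-hat = {b0:.2f} + {b1:.2f} x, SSE = {best:.2f}")

# Any other line should have a larger sum of squared residuals.
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert sse(xs, ys, b0 + db0, b1 + db1) > best
```

For this made-up data the fitted line is ŷ = 2.20 + 0.60x with a sum of squared residuals of 2.40, and each perturbed line does worse.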
Interpreting the Regression Coefficients
Slope: The slope of the population regression line, β₁, represents the change
in μ_y for each unit increase in x. When the predictor and response variables
are not correlated, then β₁ = 0 and the population “regression” line will be
the horizontal line y = μ_y (represented by the estimated model ŷ = ȳ,
where ȳ is the sample mean of the response variable).
Intercept: The intercept of the population regression line, β₀, represents the
value of μ_y when x = 0. If x = 0 isn’t defined (that is, when the predictor
variable cannot equal zero), then β₀ has no practical interpretation.
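The horizontal-line case can be seen from the standard least-squares formulas for the estimated coefficients. These notes do not state them, so treat the following as a sketch under assumed notation: s_x and s_y are the sample standard deviations of the predictor and response, and x̄ and ȳ are their sample means.

```latex
b_1 = r \,\frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}
\quad\Longrightarrow\quad
r = 0 \;\Rightarrow\; b_1 = 0,\quad b_0 = \bar{y},\quad \hat{y} = \bar{y}.
```

In a real sample r is rarely exactly zero, so the estimated slope for uncorrelated variables is typically small rather than exactly zero.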
Constructing a Simple Linear Regression (SLR) Model
To construct an SLR model in StatCrunch, use the following procedure:
- Select Stat > Regression > Simple Linear.
- Choose the appropriate data columns: the explanatory column for the
x-values and the response column for the y-values. Note: these are not
interchangeable, so read the problem carefully to determine which variable is
which.
- Click “Compute!”.
By default, the output window will contain two pages: the numerical
results and a graph of the regression model with a scatter plot of the data.
(The number of pages will increase as we ask StatCrunch to make further
calculations.) Use the arrows in the bottom right corner of the window to
toggle between pages.
The regression equation will be the fourth line down from the top. The
correlation coefficient r and the coefficient of determination r² (shown on
the display as “R-sq”) will be on the sixth and seventh lines, respectively.
Round the estimated regression coefficients to one more decimal place than the
predictor variable values in the original data set.
Example 7: Construct an SLR model using the Handspan-Height data set.
Interpret the coefficients of the estimated regression line in the context of
the problem.
Model: height = 35.53 + 1.56 ⋅ handspan.
Interpretation of slope: For each centimeter increase in handspan, height
increases by an average of 1.56 inches.
Interpretation of intercept: Since the data set contains no row where
handspan = 0, the intercept has no practical interpretation.
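The fitted model from Example 7 can be used to estimate mean height at a given handspan. In the sketch below, the coefficients are the ones reported in Example 7; the function name predict_height and the handspan values of 20 cm and 21 cm are made up for illustration, and whether they fall inside the range observed in the data set is not checked here.

```python
B0 = 35.53  # estimated intercept from Example 7 (inches)
B1 = 1.56   # estimated slope from Example 7 (inches per centimeter)

def predict_height(handspan_cm):
    """Estimated mean height in inches for a handspan in centimeters."""
    return B0 + B1 * handspan_cm

h20 = predict_height(20)
h21 = predict_height(21)
print(f"handspan 20 cm: {h20:.2f} in")
print(f"handspan 21 cm: {h21:.2f} in")
print(f"difference:     {h21 - h20:.2f} in")  # one extra cm adds the slope
```

The one-centimeter difference reproduces the slope of 1.56 inches, which is exactly what the slope interpretation says. Estimates for handspans far outside the observed data should be treated with caution.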
Example 8: Construct an SLR model using the IQ-Cranial data set. Interpret
the coefficients of the estimated regression line in the context of the
problem.
Model: IQ = 45.05 + 1.00 ⋅ circumference.
Interpretation of slope: For each centimeter increase in cranial
circumference, IQ score increases by an average of 1.00 points.
Interpretation of intercept: Since the data set contains no row where
circumference = 0, the intercept has no practical interpretation.
Sample Solution