A correlation exists between two random variables—a predictor variable (or
explanatory or independent variable) and a response variable (or dependent
variable)—if the value of the response variable changes in a consistent
manner whenever the value of the predictor variable changes. If the
relationship between the variables is linear, it is referred to as linear
correlation.
It is possible for correlation between random variables to be nonlinear.
However, in this course, we will only deal with linear relationships, and all
techniques described in Chapter 10 make that assumption.
There are two types of linear correlation:
− Positive: The response variable tends to increase when the predictor
variable increases.
− Negative: The two variables tend to change in opposite directions,
with the response variable decreasing as the predictor variable
increases.
Scatter Plots
A scatter plot (or scatter diagram) is a plot of ordered pairs of predictor and
response variable values plotted on an xy-plane. The predictor variable
value is the x-coordinate and the response value is the y-coordinate.
A scatter plot is typically the first tool a statistician will use to investigate
the potential existence of correlation. It will not only provide a visual clue
as to the type of correlation present, if any, but also the strength of that
correlation. The stronger the relationship between the two variables, the
closer the scatter plot pattern will be to a straight line.
To construct a scatter plot using StatCrunch, first load the data set, then do
the following:

  1. Select Graph → Scatter Plot.
  2. Use the drop-down menus to identify the x (predictor) and y
    (response) variables.
  3. Click “Compute!”
    Example 1: The Handspan-Height data set consists of observed height in
    inches (the response variable) and handspan in centimeters (the predictor
    variable) of 167 individuals. Build a scatter plot and assess whether the
    variables appear to be correlated. If so, what type of correlation is present?
    The points in the scatter plot trend upward from lower left to upper right,
    suggesting that the two variables are positively linearly correlated.
    Example 2: The IQ-Cranial data set contains the paired measurements of
    cranial circumference in centimeters (predictor) and Stanford-Binet IQ
    scores (response) for 20 randomly selected individuals. Build a scatter plot
    and assess whether the variables appear to be correlated. If so, what type of
    correlation is present?
    There appears to be no obvious pattern in the
    scatter plot, suggesting the two variables are
    not correlated.
    Measures of Correlation
    The correlation coefficient r (sometimes called Pearson’s r) is a sample statistic
    that measures not only the strength of the linear correlation between two
    variables of interest, but also the type (positive or negative). It is defined on
    the interval −1 ≤ r ≤ 1, with negative values corresponding to negative
    correlation and positive values to positive correlation. The closer r is to ±1,
    the stronger the linear relationship between the two variables; values near 0
    indicate little or no linear relationship.
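The definition above can be made concrete with a short sketch. It uses the standard textbook formula for Pearson's r, the sum of cross-deviations divided by the square root of the product of the two sums of squared deviations. This is an illustration only: the function name `pearson_r` and the five data pairs are invented here and are not taken from the course data sets or from StatCrunch.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Sums of cross-deviations and squared deviations about the sample means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical values, invented for illustration (not the Handspan-Height data).
handspan_cm = [18.0, 19.5, 20.0, 21.5, 23.0]
height_in = [63.0, 66.0, 65.0, 69.0, 72.0]

r = pearson_r(handspan_cm, height_in)
# Reversing the responses pairs large x with small y, so the sign of r flips.
r_reversed = pearson_r(handspan_cm, height_in[::-1])
print(f"r = {r:.3f}, r with reversed responses = {r_reversed:.3f}")
```

The sign of the result gives the type of correlation and its distance from zero gives the strength, which matches the interpretation of r described above.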
    The coefficient of determination, the square of the correlation coefficient, is
    denoted by r². This sample statistic measures the proportion of the
    variation in the response variable that can be explained by the variation of
    the associated predictor variable. It is defined on the interval 0 ≤ r² ≤ 1.
    The closer r² is to 1, the stronger the relationship between the variables.
    The coefficient of determination will be our primary measure of linear
    association.
    To calculate the above two coefficients in StatCrunch:
  1. Select Stat → Summary Stats → Correlation.
  2. Choose the appropriate data columns; click the label for the explanatory
    variable first to put it on the x-axis.
  3. Click “Compute!”.
    The displayed result is r. This method will not produce r²; to calculate it,
    you will need to square r by hand. (We’ll learn later in this lesson how to
    obtain r² directly from StatCrunch output.)
    Example 3: Calculate r and r² for the two variables in the Handspan-Height
    data set. Are the results consistent with those of Example 1?
    StatCrunch returns a correlation coefficient of r = 0.740. Squaring r gives
    the coefficient of determination r² = 0.547. This suggests a moderate level
    of positive correlation between the two variables and is consistent with the
    scatter plot from Example 1.
    Example 4: Calculate r and r² for the two variables in the IQ-Cranial data
    set. Are the results consistent with those of Example 2?
    StatCrunch returns a correlation coefficient of r = 0.138. Squaring r gives
    the coefficient of determination r² = 0.019. This suggests a weak to
    nonexistent level of correlation between the two variables and is consistent
    with the scatter plot from Example 2.
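The "square r by hand" step in Examples 3 and 4 can be checked with a few lines of arithmetic. The sketch below squares the r values exactly as reported (rounded to three decimals). Squaring the rounded Handspan-Height value gives 0.5476 rather than the reported 0.547. One plausible explanation, which is an assumption and not something the source states, is that the reported figure was obtained from the unrounded r.

```python
# r values as reported (rounded to three decimals) in Examples 3 and 4.
reported_r = {"Handspan-Height": 0.740, "IQ-Cranial": 0.138}

# The coefficient of determination is the square of the correlation coefficient.
r_squared = {name: r ** 2 for name, r in reported_r.items()}

for name, value in r_squared.items():
    print(f"{name}: r^2 = {value:.4f}")
```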
    Common Errors When Using Correlation
    − Calculating linear correlation measures for a data set where the
    variables’ relationship is nonlinear.
    − Manipulating the value of r or r² by removing influential data
    values.
    − Assuming that the existence of correlation between the predictor and
    response variables implies a cause-and-effect relationship between
    those variables.
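The first error in the list above can be illustrated with a small sketch. The data are invented: the response is exactly the square of the predictor, so the two variables are perfectly related, yet the linear correlation coefficient comes out as zero because the relationship is not linear. The helper `pearson_r` is a hypothetical name and implements the standard textbook formula for r.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: y is completely determined by x, but the pattern is a parabola.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
```

A value of r near zero therefore rules out only a linear relationship. A scatter plot of these points would show the curved pattern immediately, which is one reason to plot the data before computing r.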
    Example 5: Suppose a data analysis shows a positive correlation between
    the number of stork breeding pairs in a certain geographic region and the
    region’s birth rate:
    Adapted from Matthews, “Storks Deliver Babies (p = 0.008),”
    Teaching Statistics, Vol 22 No 2 (June 2000).
    Does it necessarily follow that the increase in the number of stork breeding
    pairs causes an increase in the birth rate?
    No, because correlation does not imply causation. In other words, the mere
    existence of a relationship between the variables does not mean that changes
    to the values of one cause changes to the values of the second.
    Example 6: Suppose someone shows you the following graphic, saying that
    the information in it is “proof” that organic foods “cause” autism. Does that
    claim make sense? Why or why not?
    The claim does not make sense. The presence of correlation does not “prove”
    anything. Also, this confusing graph does not clarify that the number of
    spectrum diagnoses is the predictor variable, rather than the response
    variable as the person making the claim has assumed. Since organic foods are
    often recommended as part of a spectrum patient’s diet, the real (and more
    believable) conclusion is that an increase in spectrum diagnoses may
    contribute to an increase in organic food consumption.
    Note also that the assumption that spectrum diagnoses and organic food
    consumption are the only variables that affect each other is a gross
    oversimplification. It is quite likely that each variable’s values are influenced
    by a variety of factors not considered in this study.
    Simple Linear Regression
    If two random variables are linearly correlated, their relationship can be
    exploited to estimate the population mean of the response variable for a
    given value of the predictor variable. This is done using a simple linear
    regression model:
    μ_y = β₀ + β₁x
    where x is the predictor variable, y is the response variable, μ_y is the
    population mean of y for a given value of x (the average response among all
    members of the population with that value of x), and β₀ and β₁ are the
    regression coefficients.
    The above model, called the population regression model, is the “true” model,
    which in nearly all cases we cannot obtain directly. Instead, we need to
    approximate it using an estimated regression (or least-squares) model:
    ŷ = b₀ + b₁x
    where x is a fixed value of the predictor variable, b₀ and b₁ are estimates of
    β₀ and β₁, and ŷ is an estimate of μ_y for the given value of x.
    For each ordered pair (x, y) in the data set, the residual is the difference
    between y and ŷ (that is, y − ŷ), where ŷ is the point estimate of μ_y produced
    by substituting x into the estimated regression equation ŷ = b₀ + b₁x. The
    values of b₀ and b₁ are chosen such that the sum of the squares of the
    residuals is the smallest value possible. This is referred to as the
    least-squares property.
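The least-squares property can be sketched numerically. The code below uses the standard closed-form solution (slope equal to the sum of cross-deviations divided by the sum of squared x-deviations, intercept equal to ȳ minus slope times x̄) and then checks that moving either coefficient away from those values makes the sum of squared residuals larger. The function names and the five data pairs are hypothetical, and this is not a description of how StatCrunch computes its output.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of (y - y-hat)^2 over all ordered pairs for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
xs = [18.0, 19.5, 20.0, 21.5, 23.0]
ys = [63.0, 66.0, 65.0, 69.0, 72.0]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse_best = sum_squared_residuals(xs, ys, b0, b1)

# Any other line gives a larger sum of squared residuals.
sse_shifted_intercept = sum_squared_residuals(xs, ys, b0 + 0.5, b1)
sse_shifted_slope = sum_squared_residuals(xs, ys, b0, b1 + 0.1)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse_best:.3f}")
print(f"SSE with shifted intercept = {sse_shifted_intercept:.3f}")
print(f"SSE with shifted slope = {sse_shifted_slope:.3f}")
```

Two shifted lines do not prove minimality on their own. They simply show the property in action for this data set.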
    Interpreting the Regression Coefficients
    Slope: The slope of the population regression line, β₁, represents the change
    in μ_y for each unit increase in x. When the predictor and response variables
    are not correlated, then β₁ = 0 and the population “regression” line will be
    the horizontal line y = μ_y (represented by the estimated model ŷ = ȳ,
    where ȳ is the sample mean of the response variable).
    Intercept: The intercept of the population regression line, β₀, represents the
    value of μ_y when x = 0. If x = 0 isn’t defined (that is, when the predictor
    variable cannot equal zero), then β₀ has no practical interpretation.
    Constructing a Simple Linear Regression (SLR) Model
    To construct an SLR model, use the following procedure:
  1. Select Stat → Regression → Simple Linear.
  2. Choose the appropriate data columns: the explanatory column for the
    x-values and the response column for the y-values. Note: these are not
    interchangeable, so read the problem carefully to determine which variable is
    which.
  3. Click “Compute!”.
    By default, the output window will contain two pages: the numerical
    results and a graph of the regression model with a scatter plot of the data.
    (The number of pages will increase as we ask StatCrunch to make further
    calculations.) Use the arrows in the bottom right corner of the window to
    toggle between pages.
    The regression equation will be the fourth line down from the top. The
    correlation coefficient r and the coefficient of determination r² (shown on
    the display as “R-sq”) will be on the sixth and seventh lines, respectively.
    Round the estimated regression coefficients to one more decimal place than the
    predictor variable values in the original data set.
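The "R-sq" value in the regression output can be cross-checked against the earlier description of r² as the proportion of variation in the response explained by the predictor. For a simple linear regression with an intercept, squaring r gives the same number as 1 minus the ratio of the residual sum of squares to the total sum of squares about ȳ. The sketch below confirms this on invented data. The variable names are hypothetical, and the least-squares coefficients are computed with the standard closed-form formulas.

```python
import math

# Hypothetical paired data, invented for illustration only.
xs = [18.0, 19.5, 20.0, 21.5, 23.0]
ys = [63.0, 66.0, 65.0, 69.0, 72.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y about its mean

# Route 1: square the correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2

# Route 2: fit the least-squares line, then compare unexplained to total variation.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
explained_share = 1 - sse / s_yy

print(f"r^2 = {r_squared:.4f}, 1 - SSE/SST = {explained_share:.4f}")
```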
    Example 7: Construct an SLR model using the Handspan-Height data set.
    Interpret the coefficients of the estimated regression line in the context of
    the problem.
    Model: Height = 35.53 + 1.56 ⋅ Handspan.
    Interpretation of slope: For each centimeter increase in handspan, height
    increases by an average of 1.56 inches.
    Interpretation of intercept: Since the data set contains no row where
    Handspan = 0, the intercept has no practical interpretation.
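As a brief illustration of how the estimated model from Example 7 is used, the sketch below plugs two handspans into the reported equation. The inputs of 20 cm and 21 cm are hypothetical values chosen for illustration, and the sketch assumes they fall within the range of handspans observed in the data set, since the model should not be relied on outside that range.

```python
# Coefficients as reported in Example 7 for the Handspan-Height data.
B0_INCHES = 35.53          # intercept (no practical interpretation here)
B1_INCHES_PER_CM = 1.56    # slope: change in estimated mean height per centimeter

def estimated_mean_height(handspan_cm):
    """Point estimate of mean height (inches) for a given handspan (cm)."""
    return B0_INCHES + B1_INCHES_PER_CM * handspan_cm

# Hypothetical inputs, assumed to lie inside the observed range of handspans.
at_20 = estimated_mean_height(20.0)
at_21 = estimated_mean_height(21.0)

print(f"20 cm: {at_20:.2f} in, 21 cm: {at_21:.2f} in, difference: {at_21 - at_20:.2f} in")
```

The difference between the two estimates equals the slope, which is what the interpretation "for each centimeter increase in handspan" means.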
    Example 8: Construct an SLR model using the IQ-Cranial data set. Interpret
    the coefficients of the estimated regression line in the context of the
    problem.
    Model: IQ = 45.05 + 1.00 ⋅ Circumference.
    Interpretation of slope: For each centimeter increase in cranial
    circumference, IQ score increases by an average of 1.00 points.
    Interpretation of intercept: Since the data set contains no row where
    Circumference = 0, the intercept has no practical interpretation.
