Use the dataset that you have been using for the previous projects.
Use all of your independent variables, your response variable, and the lm() function to build a multiple linear regression model.
Print the model with the summary() function. The output will be similar to the bottom of page 141.
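As a rough sketch of these first steps, the code below fits a full model and prints its summary. The built-in iris data and the choice of Sepal.Length as the response are stand-ins assumed only for illustration; substitute your own data frame and response variable.

```r
# Stand-in data (an assumption): replace iris with your own data frame.
dat <- iris

# "Sepal.Length ~ ." regresses the response on every other column in dat.
full_model <- lm(Sepal.Length ~ ., data = dat)

# Coefficient table, residual standard error, R-squared, and overall F-test.
summary(full_model)
```

The factor column Species is expanded into indicator variables automatically, so it contributes two coefficients rather than one.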
Use the pairs() function to look at the scatterplots of the interval/ratio variables. Color your points by the value of a nominal/ordinal variable.
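One possible way to draw the scatterplot matrix is sketched here. It again assumes the built-in iris data as a stand-in, with Species playing the role of the nominal variable; your own column names will differ.

```r
# Stand-in data (an assumption): replace iris with your own data frame.
dat <- iris

# Keep only the numeric (interval/ratio) columns for the scatterplot matrix.
num_vars <- dat[sapply(dat, is.numeric)]

# Color each point by the level of the nominal variable.
pairs(num_vars, col = as.integer(dat$Species) + 1, pch = 19)

# Pairwise correlations give a numeric companion to the plots.
round(cor(num_vars), 2)
```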
When independent variables are strongly correlated with one another, a standard regression model tends to produce unstable coefficient estimates with inflated standard errors, which makes the individual coefficients hard to trust. For this project, you will remove independent variables until the model is trustworthy.
Use the summary() output and the scatterplots to decide whether a variable should be removed, then remove it.
Repeat the process of
• build model
• check summary() and scatterplots
• remove variable
until you believe all variables in the model should stay in the model.
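The loop above might look something like the following sketch. The iris data and the decision to drop Petal.Width are purely hypothetical; which variable to remove at each step depends on your own summary() output and scatterplots.

```r
# Stand-in data (an assumption): replace iris with your own data frame.
dat <- iris

# Step 1: build the model and inspect it.
model_1 <- lm(Sepal.Length ~ ., data = dat)
summary(model_1)

# Step 2: remove one variable (Petal.Width is a hypothetical choice here)
# and refit. update() keeps the rest of the formula unchanged.
model_2 <- update(model_1, . ~ . - Petal.Width)
summary(model_2)

# Repeat update() and summary() until every remaining variable should stay.
```

Using update() rather than retyping the formula keeps a record of exactly which variable was dropped at each step.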
Use par(mfrow = c(2,2)) and the plot() function to look at diagnostic plots of the reduced model (similar to the plots on page 129).
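A minimal sketch of the diagnostic plots follows. The reduced model shown is a hypothetical example fitted to the built-in iris data, not a recommended final model.

```r
# Hypothetical reduced model on stand-in data; use your own final model.
reduced_model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species,
                    data = iris)

# Arrange the plotting device as a 2 x 2 grid, saving the old settings.
old_par <- par(mfrow = c(2, 2))

# plot() on an lm object draws four diagnostic plots by default.
plot(reduced_model)

# Restore the previous plotting layout.
par(old_par)
```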
Grading criteria:
• full model
• summary()
• pairs()
• remove independent variables, print reduced model each time (60% of grade)
• plot() final model
Create a simple linear regression model with one of your numeric independent variables and your response variable.
Build the scatterplot of the response variable by the independent variable, and the scatterplot of the residuals by the independent variable (similar to figure 3.3, page 50). Include the line of best fit on the first scatterplot. Also, plot the residuals by the response variable. Do the scatterplots indicate any problems with the model?
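The three scatterplots could be built along these lines. The iris data, with Sepal.Length as the response and Petal.Length as the independent variable, is an assumed stand-in for your own variables.

```r
# Stand-in data (an assumption): replace with your own data and variables.
dat <- iris
slr <- lm(Sepal.Length ~ Petal.Length, data = dat)
res <- resid(slr)

# Response by independent variable, with the line of best fit.
plot(Sepal.Length ~ Petal.Length, data = dat, main = "y ~ x")
abline(slr)

# Residuals by independent variable, with a reference line at zero.
plot(dat$Petal.Length, res, xlab = "x", ylab = "Residuals",
     main = "Residuals ~ x")
abline(h = 0, lty = 2)

# Residuals by response variable.
plot(dat$Sepal.Length, res, xlab = "y", ylab = "Residuals",
     main = "Residuals ~ y")
abline(h = 0, lty = 2)
```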
Use hist() to plot a histogram of the residuals. Do the residuals appear to be normally distributed?
Use qqnorm() and qqline() to plot a QQ-normal plot with the QQ-line of the residuals. Do the residuals appear to be normally distributed?
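A short sketch of both normality checks, again using a stand-in model on the built-in iris data (an assumption for illustration only):

```r
# Stand-in model (an assumption): use your own simple linear regression.
slr <- lm(Sepal.Length ~ Petal.Length, data = iris)
res <- resid(slr)

# Histogram of the residuals: look for a roughly symmetric bell shape.
hist(res, main = "Histogram of residuals", xlab = "Residuals")

# QQ-normal plot: points close to the line suggest approximate normality.
qqnorm(res)
qqline(res)
```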
Use par(mfrow = c(2,2)) and plot() on your linear model to build a plot similar to figure 3.14 on page 70.
Record which data points are labeled in the subplots, then print those observations. Investigate each of these points and decide which ones are legitimate data points and which ones are erroneous and polluting your dataset.
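Once the labeled points are recorded, printing them might look like the sketch below. Here the candidate rows are found programmatically from the largest standardized residuals and Cook's distances, which is an assumed cross-check and may not match every label on your plots; the iris model is a stand-in.

```r
# Stand-in model (an assumption): use your own simple linear regression.
dat <- iris
slr <- lm(Sepal.Length ~ Petal.Length, data = dat)

# Three largest absolute standardized residuals and Cook's distances.
top_resid <- order(abs(rstandard(slr)), decreasing = TRUE)[1:3]
top_cook <- order(cooks.distance(slr), decreasing = TRUE)[1:3]
flagged <- sort(unique(c(top_resid, top_cook)))

# Print the flagged observations next to their diagnostic values.
cbind(dat[flagged, ],
      std_resid = rstandard(slr)[flagged],
      cooks_d = cooks.distance(slr)[flagged],
      leverage = hatvalues(slr)[flagged])
```

If the labels you read off your own plots differ, set flagged to those row numbers directly, for example flagged <- c(12, 47), where the numbers are whatever your plots show.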
Use car::powerTransform() to find power transformations for
• y - min(y) + 1, and
• x - min(x) + 1.
Transform the data and call the new data y_new and x_new. Build four scatterplots.
• y ~ x
• y_new ~ x
• y ~ x_new
• y_new ~ x_new
Which of these models appears to be the best fit? Build the corresponding linear model.
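The transformation steps could be sketched as follows, assuming the car package is installed. The iris variables are stand-ins, applying the estimated power with car::bcPower() is one reasonable reading of "transform the data", and the final model choice shown is only a placeholder for whichever plot looks most linear to you.

```r
# Stand-in variables (an assumption): replace with your own y and x.
y <- iris$Sepal.Length
x <- iris$Petal.Length

# Shift both variables so the minimum is 1 (Box-Cox needs positive values).
y_shift <- y - min(y) + 1
x_shift <- x - min(x) + 1

# Estimated power (lambda) for each variable.
lambda_y <- as.numeric(coef(car::powerTransform(y_shift)))
lambda_x <- as.numeric(coef(car::powerTransform(x_shift)))

# Apply the Box-Cox power transformation with the estimated lambdas.
y_new <- car::bcPower(y_shift, lambda_y)
x_new <- car::bcPower(x_shift, lambda_x)

# Four scatterplots on one page for comparison.
old_par <- par(mfrow = c(2, 2))
plot(x, y, main = "y ~ x")
plot(x, y_new, main = "y_new ~ x")
plot(x_new, y, main = "y ~ x_new")
plot(x_new, y_new, main = "y_new ~ x_new")
par(old_par)

# Placeholder choice: fit whichever combination looks most linear.
chosen_model <- lm(y_new ~ x_new)
summary(chosen_model)
```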
Grading criteria:
• Simple linear regression model (no transformations)
• Scatterplot y ~ x
• Scatterplot residuals ~ x
• Scatterplot residuals ~ y
• hist(residuals)
• QQ-norm plot
• Linear model four plots
• Leverage data
• Four post-transformation scatterplots
• New linear model
Sample Solution