You will explore some of the techniques you have learned thus far by examining data on housing prices in the Seattle area in 2015. The data have been placed on Wattle. While there are a number of variables available, for this assignment you will only consider the following:

- id: an id number for the house. Note: some houses have been sold more than once.
- price: the price that the house was sold at, in USD
- bedrooms: the number of bedrooms in the house
- bathrooms: the number of bathrooms in the house
- sqft.living: square footage of total living space

(a) Conduct an exploratory data analysis, where price is the response (y) and the variables which may affect price are: bedrooms, bathrooms, and sqft.living. In doing your analysis, make sure to identify any unusual points and discuss why they are unusual.

(b) Is there a statistically significant correlation between price and sqft.living? Use the cor.test() function to conduct a suitable hypothesis test. Clearly specify the hypotheses you are testing, and present and interpret the results.
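
As a rough illustration of how cor.test() is called, the sketch below tests H0: rho = 0 against H1: rho != 0 using Pearson's correlation. The data frame `houses` and every value in it are simulated stand-ins (an assumption made so the example runs on its own); they are not the Wattle data.

```r
# Simulated stand-in data (hypothetical values, NOT the assignment data)
set.seed(1)
houses <- data.frame(sqft.living = runif(200, 500, 5000))
houses$price <- 100000 + 250 * houses$sqft.living + rnorm(200, sd = 150000)

# Two-sided test of H0: rho = 0 versus H1: rho != 0 (Pearson correlation)
result <- cor.test(houses$price, houses$sqft.living,
                   alternative = "two.sided", method = "pearson")

print(result$estimate)   # sample correlation r
print(result$statistic)  # t statistic on n - 2 degrees of freedom
print(result$p.value)    # p-value for the two-sided test
print(result$conf.int)   # 95% confidence interval for rho
```

The same call applies to the real data once it has been read into a data frame with these column names.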

(c) Experiment with applying natural log transformations (to the base e, which is the default for the log() function in R) and square root transformations to one or both of price and sqft.living, and repeat the analysis in parts (a) and (b). Do NOT show all of your results; just pick whichever one you think is the best choice of scale for the two variables, and show and discuss the results for your chosen combination.
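
One possible way to organise the experimentation is sketched below: it computes the sample correlation for every combination of identity, log, and square root scales. The data are simulated (hypothetical values, not the Wattle data), and the correlation alone is only one input to the choice; scatterplots on each scale should carry at least as much weight.

```r
# Simulated stand-in data (hypothetical values, NOT the assignment data)
set.seed(2)
houses <- data.frame(sqft.living = exp(rnorm(200, mean = 7.5, sd = 0.4)))
houses$price <- exp(6 + 0.9 * log(houses$sqft.living) + rnorm(200, sd = 0.3))

# Candidate scales; log() is the natural log by default
scales <- list(identity = identity, log = log, sqrt = sqrt)

# Rows: transformation of sqft.living; columns: transformation of price
cors <- sapply(scales, function(fy) {
  sapply(scales, function(fx) cor(fy(houses$price), fx(houses$sqft.living)))
})
print(round(cors, 3))

# Scatterplot on one candidate scale, to check linearity and spread by eye
plot(log(houses$sqft.living), log(houses$price),
     xlab = "log(sqft.living)", ylab = "log(price)")
```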

(d) Fit a simple linear regression (SLR) model with your chosen transformation of price as the response variable and your chosen transformation of sqft.living as the predictor. Construct a plot of the residuals against the fitted values, a normal Q-Q plot of the residuals, a bar plot of the leverages for each observation, and a bar plot of Cook’s distances for each observation. Use these plots (and other means) to comment on the model assumptions and on any unusual data points.
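
The four plots requested in part (d) can be produced with base R along the following lines. This is a minimal sketch on simulated data (hypothetical values, not the Wattle data), and the log-log scale used here is an assumption for illustration only, not a recommended answer to part (c).

```r
# Simulated stand-in data (hypothetical values, NOT the assignment data)
set.seed(3)
houses <- data.frame(sqft.living = exp(rnorm(200, mean = 7.5, sd = 0.4)))
houses$price <- exp(6 + 0.9 * log(houses$sqft.living) + rnorm(200, sd = 0.3))

# SLR on an assumed log-log scale
fit <- lm(log(price) ~ log(sqft.living), data = houses)

par(mfrow = c(2, 2))  # 2 x 2 grid of plots

# Residuals against fitted values
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals
qqnorm(residuals(fit))
qqline(residuals(fit))

# Bar plots of leverages and Cook's distances, one bar per observation
barplot(hatvalues(fit), main = "Leverage",
        xlab = "Observation", ylab = "Leverage")
barplot(cooks.distance(fit), main = "Cook's distance",
        xlab = "Observation", ylab = "Cook's distance")
```

In an SLR the leverages sum to 2 (the number of estimated coefficients), which gives a quick check that hatvalues() has been applied to the intended model.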

(e) Produce the ANOVA (Analysis of Variance) table for the SLR model in part (d) and interpret the results of the F test. What is the coefficient of determination for this model, and how should you interpret this summary measure?
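
The sketch below shows one way to obtain the ANOVA table and to confirm that the coefficient of determination equals the regression sum of squares divided by the total sum of squares. The data are simulated and the log-log model is an assumed example (hypothetical, not the Wattle data or a prescribed choice of scale).

```r
# Simulated stand-in data (hypothetical values, NOT the assignment data)
set.seed(4)
houses <- data.frame(sqft.living = exp(rnorm(200, mean = 7.5, sd = 0.4)))
houses$price <- exp(6 + 0.9 * log(houses$sqft.living) + rnorm(200, sd = 0.3))

fit <- lm(log(price) ~ log(sqft.living), data = houses)

# ANOVA table: first row is the regression, second row is the residuals
tab <- anova(fit)
print(tab)

# R^2 as reported by summary(), and computed by hand as SSR / SST
r2 <- summary(fit)$r.squared
r2_manual <- tab[["Sum Sq"]][1] / sum(tab[["Sum Sq"]])
print(c(r2 = r2, r2_manual = r2_manual))
```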

(f) What are the estimated coefficients of the SLR model in part (d) and the standard errors associated with these coefficients? Interpret the values of these estimated coefficients and perform t-tests to test whether or not these coefficients differ significantly from zero. What do you conclude as a result of these t-tests?
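
As an illustration of where these quantities come from, the following sketch extracts the coefficient table from summary() and recomputes the t statistics and two-sided p-values by hand. The data are simulated and the log-log model is an assumed example (hypothetical, not the Wattle data).

```r
# Simulated stand-in data (hypothetical values, NOT the assignment data)
set.seed(5)
houses <- data.frame(sqft.living = exp(rnorm(200, mean = 7.5, sd = 0.4)))
houses$price <- exp(6 + 0.9 * log(houses$sqft.living) + rnorm(200, sd = 0.3))

fit <- lm(log(price) ~ log(sqft.living), data = houses)

# Columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
print(coefs)

# t statistic for H0: beta_j = 0 is estimate / standard error,
# referred to a t distribution on n - 2 degrees of freedom
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)
print(cbind(t_manual, p_manual))
```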

(g) Consider two other simple linear regressions: one where x = bedrooms and one where x = bathrooms. Use the same transformation for the response as you did in part (d) [if you decided to use one]. Interpret these two models. How do these models compare to the one in part (d)?

(h) Construct the following covariate in R which examines the number of bathrooms and bedrooms per square foot of living space:

$$x_i = \frac{\text{sqft.living}_i}{\text{bedrooms}_i + \text{bathrooms}_i + 1}$$

Fit an SLR using x. Use the same transformation for the response as you did in part (d) [if you decided to use one]. Interpret the model. How does this model compare to the ones in parts (d) and (g)?
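
The covariate in part (h) can be constructed as a new column and then used as the predictor, roughly as sketched below. The data are simulated (hypothetical values, not the Wattle data), and log(price) is used as the response purely as an assumed example of a transformation carried over from part (d).

```r
# Simulated stand-in data (hypothetical values, NOT the assignment data)
set.seed(6)
n <- 200
houses <- data.frame(
  bedrooms  = sample(1:6, n, replace = TRUE),
  bathrooms = sample(c(1, 1.5, 2, 2.5, 3), n, replace = TRUE)
)
houses$sqft.living <- 400 * (houses$bedrooms + houses$bathrooms) +
  runif(n, 200, 1200)
houses$price <- exp(11 + 0.0004 * houses$sqft.living + rnorm(n, sd = 0.3))

# x_i = sqft.living_i / (bedrooms_i + bathrooms_i + 1)
houses$x <- houses$sqft.living / (houses$bedrooms + houses$bathrooms + 1)

# SLR with the new covariate; the response scale is an assumption
fit_x <- lm(log(price) ~ x, data = houses)
print(summary(fit_x)$coefficients)
print(summary(fit_x)$r.squared)
```

Comparing summary(fit_x)$r.squared and the residual plots with those from the earlier models gives one basis for the comparison asked for in this part.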