Data-analytical procedures

  1. Many data-analytical procedures share one version or another of the same underlying conceptual model:

data = predictable component + unpredictable component; or
data = signal + noise; or
data = smooth component + irregular component; or
data = common variation + unique variation.

For a) regression analysis, b) analysis of variance, c) nonparametric regression, and d) principal components or factor analysis, describe the particular version of that common conceptual model that applies, and why that conceptual model makes sense given the goals of the analysis. Note that you shouldn’t necessarily try to match the four analyses with one or another of the different versions of the conceptual model; those are just examples, and there could be other ways of describing the basic model that might work better in each case. It could also be the case that two analyses share the same model. It might be useful to make a table first, before writing; if so, the whole answer should fit on a page. (2 pts.)
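As a concrete illustration of "data = predictable component + unpredictable component", the following minimal sketch fits a least-squares line to a small made-up data set and splits each observation into a fitted value and a residual. The data values and the helper name `fit_line` are hypothetical, chosen only for illustration; this shows one version of the conceptual model and is not a model answer.

```python
# Hypothetical toy data: y is roughly linear in x, plus irregular scatter.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]


def fit_line(x, y):
    """Ordinary least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(x, y)

# Predictable component: the value the fitted line gives for each x.
fitted = [intercept + slope * xi for xi in x]

# Unpredictable component: whatever the line leaves unexplained.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# The decomposition is exact: data = fitted + residual for every observation.
recombined = [fi + ri for fi, ri in zip(fitted, residuals)]
```

With an intercept in the model, the residuals sum to zero (up to rounding), which is one way of seeing that the line has absorbed the systematic part of the data.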

  2. Describe the general context in which multiple linear regression analysis is applicable. What is it used for? Are there any assumptions that underlie its use? How is it implemented in practice? (And don’t just say, “by using the lm() function…”!) (2 pts.)
  3. Many of the statistics we’ve seen, like the coefficient of variation, correlation coefficient, t-test, and others, are fractions, with a term in the numerator and a term in the denominator. For as many statistics as you can think of, characterize in words what the quantities in the numerator and denominator represent, and then comment on why that particular arrangement (something in the numerator, something in the denominator) makes sense in general. Again, an efficient way of answering this question might be a table and a sentence or two. (2 pts.)
  4. Suppose you are in charge of the data-analysis component of a project that generates one or more data sets (or data frames) that include the following kinds of variables (i.e. columns):

some kind of text identification label (like the abbreviations of the weather station names in the Oregon climate data set);
locational information (e.g. latitude and longitude, or x and y);
one or more response variables (i.e. variables that you would like to “explain” or predict);
several candidate predictor variables;
one or more factor (or group membership) variables (like the Reach variable in the Summit Cr. data set) that identify which group a particular observation comes from or is assigned to.
Describe an overall strategy for making sense of this data set. What kind of plots or visualizations might you apply (and why)? What kind of analyses (and why)? (3 pts.)
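For the question above about statistics that take the form of fractions, a small sketch may help fix ideas. It computes two such statistics, the coefficient of variation and a one-sample t statistic, keeping numerator and denominator as separate named quantities. The sample values and the hypothesized mean `mu0` are made up for illustration, and the one-sample form of the t statistic is an assumption about which t-test is meant.

```python
import math
import statistics

# Hypothetical measurements and hypothesized population mean.
sample = [9.8, 10.4, 10.1, 9.6, 10.6, 10.3, 9.9, 10.5]
mu0 = 10.0

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)

# Coefficient of variation: spread (numerator) relative to level (denominator).
cv = sd / mean

# One-sample t statistic: observed difference from mu0 (numerator)
# relative to the standard error of the mean (denominator).
difference = mean - mu0
std_error = sd / math.sqrt(n)
t_stat = difference / std_error
```

In both cases the denominator supplies a scale against which the numerator is judged, so the resulting ratio is unitless and comparable across data sets.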

Sample Solution