Explore the data in the RStudio data viewer, and type ?gss_cat into the console to see a description of the variables in this data set.
1a)
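Use this code to calculate the fraction of respondents in this survey who are divorced:
library(tidyverse)  # loads dplyr (pipes, summarize) and forcats (which provides gss_cat)

gss_cat %>%
  summarize(divorced = sum(marital == "Divorced"),
            N = n()) %>%
  mutate(p = divorced / N)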
What do you get?
1b)
Suppose that, instead of knowing this “population” parameter, we only had funding to collect smaller virtual samples.
In particular, suppose that we are interested in what would happen if we could only sample 50 people at a time. We will simulate the sampling distribution first!
To do so, run the following code:
library(infer)  # assumption: rep_sample_n() is taken from the infer package

n50_1000rep <- gss_cat %>%
  rep_sample_n(size = 50, reps = 1000)

p_hat_n50_1000rep <- n50_1000rep %>%
  group_by(replicate) %>%
  summarize(divorce_count = sum(marital == "Divorced"),
            n = n()) %>%
  mutate(p_hat = divorce_count / n)
Explain what this code does and interpret the output.
1c)
Create and interpret this graph:
ggplot(p_hat_n50_1000rep, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02, color = "black", fill = "aquamarine3", boundary = 0) +
  labs(x = "Sample proportion of divorced respondents",
       title = "Sampling distribution of p-hat based on n = 50")
Why is it not perfectly symmetrical?
1d)
Create your own version of the analysis in 1b-1c for a sample size of 300, repeated 1000 times. Plot the resulting histogram. How does your histogram differ from the graph in part 1c? Carefully explain the intuition.
1e)
Create your own version of the analysis in 1b-1c for a sample size of 100, repeated 10,000 times. Plot the resulting histogram. How does your histogram differ from the graph in part 1c? Carefully explain the intuition.
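As background for the comparisons in 1d and 1e, here is a minimal base-R sketch of how the spread of a sample proportion depends on the sample size. It relies on the textbook standard error formula sqrt(p(1 - p) / n) and checks it against binomial draws, which only approximate sampling without replacement from gss_cat. The value p = 0.20 is a hypothetical placeholder, not the answer to 1a, and all variable names are illustrative.

```r
set.seed(1)

p     <- 0.20             # hypothetical population proportion (not the answer to 1a)
sizes <- c(50, 100, 300)  # sample sizes used in 1b-1e

# Theoretical standard error of p-hat for each sample size
theory_se <- sqrt(p * (1 - p) / sizes)

# Simulated standard error: 1000 binomial samples per size, converted to proportions
sim_se <- sapply(sizes, function(n) sd(rbinom(1000, size = n, prob = p) / n))

data.frame(n = sizes,
           theory_se = round(theory_se, 4),
           sim_se    = round(sim_se, 4))
```

The number of replicates (1000 versus 10,000) does not enter the standard error formula. It only affects how smoothly the histogram traces out the sampling distribution.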
2)
The following table is from the 2004 paper by Persico, Postlewaite, and Silverman, “The Effect of Adolescent Experience on Labor Market Outcomes: The Case of Height.”
The columns in this table are from separate regressions. The dependent variable is log wage for all regressions. The regressions only include data on male, full-time workers.
Each row indicates an independent variable. In each cell of the table, the top number indicates the regression coefficient for that independent variable for that column’s regression. The number in parentheses is the standard error associated with each estimate. For example, the estimated effect of siblings in model (2) is -0.033 with a standard error of 0.0084.
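To illustrate how a cell can be read when the outcome is log wage, the sketch below converts the siblings coefficient quoted above into a percent change in wages. It assumes the usual log-level interpretation (a coefficient b corresponds to roughly 100 × b percent per one-unit change in the regressor). The variable names and the three-sibling comparison are hypothetical.

```r
# Siblings coefficient from model (2), as quoted in the text
b_siblings <- -0.033

# Rule-of-thumb percent change in wage per additional sibling
approx_pct <- 100 * b_siblings

# Exact percent change implied by a log-wage coefficient
exact_pct <- 100 * (exp(b_siblings) - 1)

# A k-unit difference in the regressor scales the log-wage difference by k
k <- 3                        # hypothetical: compare men who differ by 3 siblings
log_diff_k <- k * b_siblings

round(c(approx_pct = approx_pct, exact_pct = exact_pct, log_diff_k = log_diff_k), 3)
```

The same scaling logic applies to any regressor, provided the difference is expressed in the units the table uses for that variable.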
Answer the questions that follow based on this table.
2a) Using model (1), what is the predicted wage difference between a man who is 5.5 feet tall and a man who is 6.5 feet tall? (Remember, there are 12 inches in a foot!)
2b) Using the back-of-the-envelope formula that we used in class, calculate a 95 percent confidence interval for the adult height variable in model (1).
2c) In words, explain the intuition of the confidence interval that you found in (2b).
2d) If you instead calculated (using R) the 90 percent confidence interval, would it be wider or narrower than what you found in (2b)? What about a 99 percent confidence interval?
2e) Model (3) is the same as model (1) except that it controls for youth height. What is the omitted variable bias (OVB) in the adult height coefficient caused by failing to control for youth height?
2f) Explain intuitively why youth height is an important omitted variable in regression (1).
2g) Using the back-of-the-envelope formula that we used in class, calculate a 95 percent confidence interval for the adult height variable in model (3). How does it differ from your result in (2b)?
2h) A natural null hypothesis for the adult height coefficient is that it equals zero: that adult height has no effect on wages. Intuitively explain what the null sampling distribution is in this case. (It might help to draw a picture!)
2i) Using the back-of-the-envelope formula that we used in class, test whether the adult height coefficient is statistically significantly different from 0 in model (1). Do the same for the adult height coefficient in model (3). How do your answers differ? Explain the intuition!
2j) From (2i), what do we know about the size of the p-value associated with the adult height coefficient in model (1)? In model (3)? (Just focus on a 95 percent confidence level for now.)
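The back-of-the-envelope calculations asked for in 2b, 2g, and 2i can be sketched using the siblings coefficient from model (2) as a stand-in. This assumes the class formula is the common "estimate ± 2 standard errors" rule; if your class used 1.96 or another multiplier, substitute it. The helper name rough_inference is hypothetical.

```r
# Hypothetical helper: rough 95% CI and significance check for one coefficient.
# Assumes the rule of thumb "estimate +/- 2 * SE"; pass multiplier = 1.96 if preferred.
rough_inference <- function(estimate, se, multiplier = 2) {
  lower  <- estimate - multiplier * se
  upper  <- estimate + multiplier * se
  t_stat <- estimate / se                   # distance from zero, in standard errors
  list(
    ci          = c(lower = lower, upper = upper),
    t_stat      = t_stat,
    significant = abs(t_stat) > multiplier  # same as zero lying outside the interval
  )
}

# Siblings coefficient and standard error from model (2), as quoted in the text
res <- rough_inference(estimate = -0.033, se = 0.0084)
res$ci
res$t_stat
res$significant
```

Checking whether zero falls inside the interval and checking whether the estimate is more than about two standard errors from zero are the same test, which is the link between 2b and 2i.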