1.1.1 Instructions
This assignment is due Monday, July 26th at 11:59pm. You are given six slip days thoughout
the quarter which can extend the deadline by one day. See the syllabus for more details. With
the exception of using slip days, late work will not be accepted unless you have made special
arrangements with your instructor.
Important: The otter tests don’t usually tell you that your answer is correct. More often, they
help catch careless mistakes. It’s up to you to ensure that your answer is correct. If you’re not
sure, ask someone (not for the answer, but for some guidance about your approach).
[1]: # please don’t change this cell, but do make sure to run it
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np
import otter
grader = otter.Notebook()
1.2 1. Polling
Four candidates are running for President of Dataland. A polling company surveys 1000 people
selected uniformly at random from among voters in Dataland, and it asks each one who they are
planning on voting for. After compiling the results, the polling company releases the following
proportions from their sample:
Candidate Proportion
Candidate C 0.55
Candidate T 0.32
Candidate J 0.08
1
Candidate Proportion
Candidate S 0.03
Undecided 0.02
These proportions represent a uniform random sample of the population of Dataland. We will
attempt to estimate the corresponding population parameters – the proportions of each kind of
voter in the entire population. We will use confidence intervals to compute a range of values that
reflects the uncertainty of our estimate.
The table votes contains the results of the survey. Candidates are represented by their initials.
Undecided voters are denoted by U.
[2]: #: run this cell to display the results of the survey — don’t change this cell!
votes = bpd.DataFrame().assign(vote=np.array([‘C’]550 + [‘T’]320 + [‘J’]80 +␣ ,→[‘S’]30 + [‘U’]*20))
votes = votes.sample(votes.shape[0],replace=False)
num_votes = votes.shape[0]
votes
Below, we have given you code that will use bootstrapped samples to compute estimates of the
true proportion of voters who are planning on voting for Candidate T.
[3]: #: run the bootstrap
def proportions_in_resamples():
statistics = np.array([])
for i in np.arange(1000):
bootstrap = votes.sample(num_votes, replace = True)
sample_statistic = np.count_nonzero(bootstrap.get(‘vote’) == ‘T’)/
,→num_votes
statistics = np.append(statistics, sample_statistic)
return statistics
boot_proportions = proportions_in_resamples()
bpd.DataFrame().assign(Estimated_Proportion=boot_proportions).
,→plot(kind=’hist’,bins=np.arange(0.1,0.5,0.01))
Question 1.1. Using the array boot_proportions, compute an approximate 95% confidence
interval for the true proportion of voters planning on voting for candidate C. (Compute the lower
and upper ends of the interval, named votes_lower_bound and votes_upper_bound, respectively.)
[4]: votes_lower_bound = …
votes_upper_bound = …
: print the confidence interval
print(“Bootstrapped 95% confidence interval for the proportion of T voters in␣
,→the population: [{:f}, {:f}]”.format(votes_lower_bound, votes_upper_bound))
2
Question 1.2. The survey results seem to indicate that Candidate C is beating Candidate T
among voters. We would like to use confidence intervals to determine a range of likely values for
her true lead. Candidate C’s lead over Candidate T is:
(Candidate C’s proportion of the vote) − (Candidate T’s proportion of the vote).
Use the bootstrap with 1000 resamples to compute an approximate distribution for Candidate C’s
lead over Candidate T, and store your bootstrap estimates in an array called boot_leads. Plot a
histogram of the resulting samples.
Hint: Use the code for proportions_in_resamples given to you above as a starting point.
[10]: def leads_in_resamples():
…
Question 1.3. Compute an approximate 97% confidence interval for the difference in proportions.
[16]: diff_lower_bound = …
diff_upper_bound = …
: print the confidence interval
print(“Bootstrapped 97% confidence interval for Candidate C’s true lead over␣
,→Candidate T: [{:f}, {:f}]”.format(diff_lower_bound, diff_upper_bound))
The staff computed the following 95% confidence interval for the proportion of Candidate C voters:
[.52, .58]
(Your answer might have been slightly different, but that doesn’t mean it was wrong since the data
was randomly sampled.)
Question 1.4. Can we say that 95% of the population lies in the range [.52, .58]? Assign your
choice to variable q1_4.
- Yes
- No
[22]: q1_4 = …
q1_4
3
Question 1.5. Can we say that the true proportion of the population that will vote for Candidate
C is a random quantity with approximately a 95% chance of falling between 0.52 and 0.58? Assign
your choice to variable q1_5. - No
- Yes
[26]: q1_5 = …
q1_5
Question 1.6. Suppose we produced 20,000 new samples (each one a uniform random sample
of 1,000 voters) and created a 97% confidence interval from each one. Roughly how many of
those 20,000 intervals do you expect will actually contain the true proportion of the population?
Assign your answer to the variable how_many below. It should be the number of intervals, not the
proportion or percentage.
[30]: how_many = …
how_many
Question 1.7.
The staff also created 80%, 90%, and 99% confidence intervals from one sample (shown below),
but we forgot to label which confidence interval represented which percentages! Match the interval
to the percent of confidence the interval represents and assign your choices in variables q1_7_80,
q1_7_90, and q1_7_99, each for likely 80%, 90%, and 99% confidence intervals respectively.
Tip: Draw out the confidence intervals on a piece a paper to help you visualize them better. - [.538, .563]
- [.516, .584]
- [.53, .57]
[33]: q1_7_80 = …
q1_7_90 = …
q1_7_99 = …
q1_7_80, q1_7_90, q1_7_99
1.3 2. Hardest Writing Course
Suppose it’s application season and you’re a current high school senior looking to apply to the
prestigious UCSD for data science. Also, suppose you dislike writing and want to strategically
analyze all the UCSD college writing courses, to figure out colleges to avoid applying to and
colleges where you have the best shot at getting a decent grade. Luckily, UCSD has data on its
4
CAPES website about writing courses (except for Muir’s writing course due to unknown reasons).
Each row corresponds to a particular quarter and course, and the data includes the name of the
course, the average study hours per week for the quarter, and the average grade for the quarter (on
a GPA scale). Now it’s time to analyze and figure out whether the writing course rumors are true
(or people just like complaining).
[43]: # Run this cell to read data; don’t change it
writing = bpd.read_csv(“data/writing_courses_ucsd.csv”, index_col = 0)
writing.iloc[:5]
Question 2.1a. The first thing to do before jumping into analysis is to figure out the mean study
hours and mean grade for each course. Create a table called course_means that has index as
course and columns consist of Study Hrs/wk and grades. Study Hrs/wk and grades contain the
means of Study Hrs/wk and grades respectively.
[44]: course_means = …
course_means
Question 2.1b. You may have noticed that the mean grades for some courses is nan. This means
that some grades are missing for these courses (missing values are represented by nan). Drop all
the rows in the writing table that contain missing values and assign the new table to the variable
writing_fixed. After this, create a table called course_means_fixed with no nan values in the
grades column.
Hint: np.isnan() or np.isfinite() might be useful.
[48]: writing_fixed = …
course_means_fixed = …
display(writing_fixed, course_means_fixed)
Question 2.1c. It’s hard to judge whether a course is hard just based on study hours per week
or grades. Therefore, we will calculate a “difficulty score” that captures the difficulty of the class.
This metric is positively related to study hours per week and negatively related to grades. We will
calculate this score using the formula:
10 ×
√
Study Hrs/week
grades2
For instance, if a course has study hours per week of 4 and an average grade of 3.0, then its
difficulty will be 10 ×
√
4
3
2 = 20/9.
Add a new column named “difficulty” to writing_fixed which contains the calculated difficulty
score for each course.
5
[52]: writing_fixed = …
display(writing_fixed, writing_fixed.groupby(‘course’).mean())
Question 2.2. Revelle’s writing course HUM seems to have pretty high difficulty score. Produce 1,000 bootstrapped estimates for the average difficulty of HUM. Store the estimates in the
hum_averages array. Use this hum_averages array to plot a histogram of the estimated averages.
The label on the x-axis should be “Estimated Difficulty for HUM”.
Use the hum_averages array to calculate an approximate 95% confidence interval for the true
average difficulty. Assign the the corresponding bounds to lower_bound and upper_bound. Do
NOT round the bounds.
[55]: hum_averages = …
…
lower_bound = …
upper_bound = …
lower_bound, upper_bound
Question 2.3. You want to create a similar histogram for each of the other courses, and also
calculate the corresponding confidence intervals. Repeating the process above 4 times would be
time-consuming. Create a function called ci_and_hist, which takes in a course name as its input,
plots the histogram for 1,000 bootstrapped estimates for the average difficulty and returns a str
describing the approximate 95% confidence interval for the course (see the example below).
For example, ci_and_hist(‘HUM’) should plot the same histogram in Question 2 and return ‘The
95% confidence interval for HUM is [2.85, 2.93]’, where the 2.85 and 2.93 were calculated by
rounding lower_bound and upper_bound to two decimal places.
Note: For the returned string, make sure you follow the format above and remember to change
the course name and the confidence interval for different courses. For the histogram, the label on
the x-axis should also change accordingly to the courses.
[61]: def ci_and_hist(course_name):
…
[63]: #: try it out
ci_and_hist(‘WCWP’)
Question 2.4. Your friend claims that Marshall’s writing course DOC is actually not as hard as
everyone says. She says that because our CAPE data is only a sample of the full population of
course offerings, the actual average difficulty for DOC could be 2.25. Run the cell below to use the
ci_and_hist function you defined above to calculate an approximate 95% confidence interval for
6
the average difficulty in DOC. Can you reject her hypothesis using this confidence interval? Assign
your answer to variable q2_4. - Yes, the confidence interval includes 3.3
- No, the confidence interval includes 3.3
- Yes, the confidence interval doesn’t include 3.3
- No, the confidence interval doesn’t include 3.3
[65]: q2_4 = …
q2_4
Question 2.5a. Now that you’ve looked at the average difficulty for different courses, but you
believe that study time does not matter as long as you achieve a good grade. This time, you’ll
test whether each individual course has the same average grade as that of all the writing courses
combined.
First, produce 1,000 bootstrapped estimates for the average grade of all the writing courses combined. Use these estimates to produce an approximate 99% confidence interval for the true average grade. Round the bounds of the confidence interval to 2 decimal places and save them into
grade_lower_bound and grade_upper_bound.
grade_lower_bound = …
grade_upper_bound = …
grade_lower_bound, grade_upper_bound
Question 2.5b. Compare the average grade for each individual writing course to the average
grade of all writing courses combined. Your final answer should be a 5 element array named
grade_hypotheses.
In the order of [CAT, DOC, HUM, MMW, WCWP], the corresponding element in the array
grade_hypotheses should be -1 if the course’s average grade is significantly lower than that of
all the writing courses combined, 0 if you cannot reject the hypothesis that the course has the same
average grade as that of all the courses combined, and 1 if the course’s average grade is significantly
higher than that of all the courses combined. You may want to use the course_means_fixed table
you created in Question 1b.
Note: It’s okay to hard code your answer for this question.
[74]: grade_hypotheses = …
grade_hypotheses
7
1.4 3. Testing the Central Limit Theorem
The Central Limit Theorem tells us that the probability distribution of the sum or average of a
large random sample drawn with replacement will be roughly normal, regardless of the distribution
of the population from which the sample is drawn.
That’s a pretty big claim, but the theorem doesn’t stop there. It further states that the standard
deviation of this normal distribution is given by
sd of the original distribution
√
sample size
In other words, suppose we start with any distribution that has standard deviation σ, take a sample
of size n (where n is a large number) from that distribution with replacement, and compute the
mean of that sample. If we repeat this procedure many times, then those sample means will have
a normal distribution with standard deviation √σ
n
.
That’s an even bigger claim than the first one! The proof of the theorem is beyond the scope of
this class, but in this exercise, we will be exploring some data to see the CLT in action.
Question 3.1. The CLT only applies when sample sizes are “sufficiently large.” This isn’t a very
precise statement. Is 10 large? How about 50? The truth is that it depends both on the original
population distribution and just how “normal” you want the result to look. Let’s use a simulation
to get a feel for how the distribution of the sample mean changes as sample size goes up.
Consider a coin flip. If we say Heads is 1 and Tails is 0, then there’s a 50% chance of getting a
1 and a 50% chance of getting a 0, which is definitely not a normal distribution. The average of
several coin tosses is equal to the proportion of heads in those coin tosses, so the CLT should apply
if we compute the sample proportion of heads many times.
Write a function called simulate_sample_n that takes in a sample size n. It should return an
array that contains 5000 sample proportions of heads, each from n coin flips.
[78]: def simulate_sample_n(n):
…
simulate_sample_n(5)
8
The code below will use the function you just defined to plot the empirical distribution of the
sample mean for several different sample sizes. The x- and y-scales are kept the same to facilitate
comparisons.
[80]: #: run this cell to visualize
bins = np.arange(-0.01,1.05,0.02)
for sample_size in np.array([2, 5, 10, 20, 50, 100, 200, 400]):
bpd.DataFrame().assign(**{‘Sample_Size:{}’.format(sample_size) :␣
,→simulate_sample_n(sample_size)}).plot(kind = ‘hist’, bins=bins)
You can see that even the means of samples of 10 items follow a roughly bell-shaped distribution.
A sample of 50 items looks quite bell-shaped.
9
Now we will test the second claim of the CLT: That the SD of the sample mean is the SD of the
original distribution, divided by the square root of the sample size.
We have imported flight delay data and computed the standard deviation of delay time (in minutes):
[81]: #: run this cell, but don’t change it under penalty of law!
united = bpd.read_csv(‘data/united_summer2015.csv’)
united_std = np.std(united.get(‘Delay’))
united_std
Question 3.2. Write a function called predict_sd. It takes a sample size n (a number) as its
argument. It returns the predicted standard deviation of the sample mean for samples of size n
from the flight delays, according to the CLT.
[82]: def predict_sd(n):
…
predict_sd(10)
Question 3.3. Write a function called empirical_sd that takes a sample size n as its argument.
The function should simulate 1000 samples of size n from the flight delays dataset, and it should
return the standard deviation of the means of those 1000 samples.
Hint: This function will be similar to the simulate_sample_n function you wrote earlier.
[87]: def empirical_sd(n):
…
empirical_sd(10)
The cell below will plot the predicted and empirical SDs for the delay data for various sample sizes.
It may take a few moments to run.
[92]: #: run this cell to visualize
sd_table = bpd.DataFrame().assign(Sample_Size = np.arange(1, 101, 10))
predicted = sd_table.get(‘Sample_Size’).apply(predict_sd)
empirical = sd_table.get(‘Sample_Size’).apply(empirical_sd)
sd_table = sd_table.assign(Predicted_SD = predicted, Empirical_SD = empirical)
sd_table.plot(kind=’scatter’,x=’Sample_Size’, y=’Empirical_SD’,label =␣
,→’Empirical_SD’)
sd_table.plot(kind=’scatter’,x=’Sample_Size’, y=’Predicted_SD’,label =␣
,→’Predicted_SD’)
1.5 4. Polling and the Normal Distribution
Michelle is a statistical consultant, and she works for a group that supports Proposition 68 (which
would mandate labeling of all horizontal or vertical axes), called Yes on 68. They want to know
10
how many Californians will vote for the proposition.
Michelle polls a uniform random sample of all California voters, and she finds that 285 of the 500
sampled voters will vote in favor of the proposition.
[93]: #: run this cell, but don’t change it!
sample = bpd.DataFrame().assign(
Vote =np.array([“Yes”, “No”]),
Count= np.array([285, 215]))
sample_size = sample.get(“Count”).sum()
sample_proportions = sample.assign(
Proportion=sample.get(“Count”) / sample_size)
sample_proportions
She uses 1,000 bootstrap resamples to compute a confidence interval for the proportion of all
California voters who will vote Yes. Run the next cell to see the empirical distribution of Yes
proportions in the 10,000 resamples.
[94]: #: run this cell, but don’t change it!
resample_yes_proportions = np.array([])
for i in np.arange(1000):
resample = np.random.multinomial(sample_size,sample_proportions.
,→get(“Proportion”))/sample_size
resample_yes_proportions = np.append(resample_yes_proportions, resample[0])
bpd.DataFrame().assign(Resample_Yes_proportion = resample_yes_proportions).
,→plot(kind = ‘hist’,bins=np.arange(.2, .8, .01))
11
In a population whose members are 0 and 1, there is a simple formula for the standard deviation
of that population:
standard deviation =
√
(proportion of 0s) × (proportion of 1s)
(Figuring out this formula, starting from the definition of the standard deviation, is a fun exercise
for those who enjoy algebra – and who doesn’t?)
Question 4.1. Without accessing the data in resample_yes_proportions in any way, and
instead using only the Central Limit Theorem and the numbers of Yes and No voters in our sample
of 500, compute a number approximate_sd that’s the predicted standard deviation of the array
resample_yes_proportions according to the Central Limit Theorem. Since you don’t know the
true proportions of 0s and 1s in the population, use the proportions in the sample instead (since
they’re probably similar).
[95]: approximate_sd = …
approximate_sd
Question 4.2. Compute the standard deviation of the array resample_yes_proportions to verify
that your answer to question 2 is approximately right.
[98]: exact_sd = …
exact_sd
Question 4.3. Still without accessing resample_yes_proportions in any way, compute an
approximate 95% confidence interval for the proportion of Yes voters in California. The cell below
draws your interval as a red bar below the histogram of resample_yes_proportions; use that to
verify that your answer looks right.
Hint: Before, we’ve used percentile on the bootstrap distribution to find the bounds for the
confidence interval. Now, we’re not allowed to use the bootstrap distribution – but we don’t need
it! We know (from the Central Limit Theorem) that the distribution of the sample mean is Normal
with a certain standard deviation. We also know that 95% of the area of the normal distribution
falls within a certain number of standard deviations from the mean.
[102]: lower_limit = …
upper_limit = …
lower_limit, upper_limit
12
[107]: #: print the confidence interval
print(‘lower:’, lower_limit, ‘upper:’, upper_limit)
[108]: #: run this cell to plot your confidence interval
bpd.DataFrame().assign(Resample_Yes_proportion = resample_yes_proportions).
,→plot(bins=np.arange(.2, .8, .01),kind = ‘hist’).plot(np.array([lower_limit,␣
,→upper_limit]), np.array([0,0]), c=’r’, lw=10);
Your confidence interval should overlap the number 0.55. That means we can’t be very sure whether
Proposition 68 is winning, even though the sample Yes proportion is a bit above 0.5.
The Yes on 68 campaign really needs to know whether they’re winning. To have more confidence in
the result of the poll, the decide to redo it with a larger sample. They’d be happy if the standard
deviation of the sample mean were only 0.005. They ask Michelle to run a new poll with a sample
size that’s large enough to achieve that. (Polling is expensive, so the sample also shouldn’t be
bigger than necessary.)
Instead of making the conservative assumption that the population standard deviation is 0.5 (coding
Yes voters as 1 and No voters as 0), she decides to assume that it’s equal to the standard deviation
of the sample,
√
(Yes proportion in the sample) × (No proportion in the sample).
Under that assumption, Michelle computes the smallest sample size necessary in order to be confident that the standard deviation of the sample mean is only 0.005.
Question 4.4. What sample size did she find? Assign your answer to the variable sample_size.
Remember the sample size needs to be an integer.
[109]: sigma = …
sample_size = …
sample_size = …
sample_size
We know that
sample means SD =
population SD
√
sample size
,
so
sample size =
(
population SD
sample means SD)2
√(
285
500 ) ( 215
500 )
0.005
2
2 Finish Line
Congratulations! You are done with homework 7.
To submit your assignment:
13
- Select Kernel -> Restart & Run All to ensure that you have executed all cells, including
the test cells. - Read through the notebook to make sure everything is fine and all tests passed.
- Run the cell below to run all tests, and make sure that they all pass.
- Download your notebook using File -> Download as -> Notebook (.ipynb), then upload
your notebook to Gradescope.
14
Sample Solution