1.1.1 Instructions

This assignment is due Monday, July 26th at 11:59pm. You are given six slip days throughout

the quarter, each of which can extend the deadline by one day. See the syllabus for more details. With

the exception of using slip days, late work will not be accepted unless you have made special

arrangements with your instructor.

Important: The otter tests don’t usually tell you that your answer is correct. More often, they

help catch careless mistakes. It’s up to you to ensure that your answer is correct. If you’re not

sure, ask someone (not for the answer, but for some guidance about your approach).

[1]: # please don’t change this cell, but do make sure to run it

import babypandas as bpd

import matplotlib.pyplot as plt

import numpy as np

import otter

grader = otter.Notebook()

1.2 1. Polling

Four candidates are running for President of Dataland. A polling company surveys 1000 people

selected uniformly at random from among voters in Dataland, and it asks each one who they are

planning on voting for. After compiling the results, the polling company releases the following

proportions from their sample:

Candidate      Proportion
Candidate C    0.55
Candidate T    0.32
Candidate J    0.08
Candidate S    0.03
Undecided      0.02

These proportions represent a uniform random sample of the population of Dataland. We will

attempt to estimate the corresponding population parameters – the proportions of each kind of

voter in the entire population. We will use confidence intervals to compute a range of values that

reflects the uncertainty of our estimate.
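As a rough illustration of the idea, the sketch below builds a percentile-based bootstrap interval with plain NumPy. The 0/1 sample, the 30% support level, and the names `sample`, `boot`, `lower`, and `upper` are assumptions made up for this example; it does not use the survey data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample: 1 = supports some option, 0 = does not (30% support).
sample = np.array([1] * 300 + [0] * 700)

# Resample with replacement and record the proportion of 1s each time.
boot = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the bootstrap distribution gives the interval.
lower = np.percentile(boot, 2.5)
upper = np.percentile(boot, 97.5)
print(lower, upper)
```

The interval will differ slightly from run to run if the seed is changed, because the resamples are random.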

The table votes contains the results of the survey. Candidates are represented by their initials.

Undecided voters are denoted by U.

[2]: #: run this cell to display the results of the survey — don’t change this cell!

votes = bpd.DataFrame().assign(vote=np.array(['C']*550 + ['T']*320 + ['J']*80 + ['S']*30 + ['U']*20))

votes = votes.sample(votes.shape[0],replace=False)

num_votes = votes.shape[0]

votes

Below, we have given you code that will use bootstrapped samples to compute estimates of the

true proportion of voters who are planning on voting for Candidate T.

[3]: #: run the bootstrap

def proportions_in_resamples():
    statistics = np.array([])
    for i in np.arange(1000):
        bootstrap = votes.sample(num_votes, replace = True)
        sample_statistic = np.count_nonzero(bootstrap.get('vote') == 'T') / num_votes
        statistics = np.append(statistics, sample_statistic)
    return statistics

boot_proportions = proportions_in_resamples()

bpd.DataFrame().assign(Estimated_Proportion=boot_proportions).plot(kind='hist', bins=np.arange(0.1,0.5,0.01))

Question 1.1. Using the array boot_proportions, compute an approximate 95% confidence

interval for the true proportion of voters planning on voting for Candidate T. (Compute the lower

and upper ends of the interval, named votes_lower_bound and votes_upper_bound, respectively.)

[4]: votes_lower_bound = …

votes_upper_bound = …

print("Bootstrapped 95% confidence interval for the proportion of T voters in the population: [{:f}, {:f}]".format(votes_lower_bound, votes_upper_bound))

Question 1.2. The survey results seem to indicate that Candidate C is beating Candidate T

among voters. We would like to use confidence intervals to determine a range of likely values for

her true lead. Candidate C’s lead over Candidate T is:

(Candidate C’s proportion of the vote) − (Candidate T’s proportion of the vote).
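As a small worked example of this definition, the sketch below computes the lead in the observed sample only, using a plain NumPy array that mirrors the published counts. The names `observed` and `sample_lead` are illustrative and not part of the assignment.

```python
import numpy as np

# Plain NumPy copy of the published survey counts (1000 voters in total).
observed = np.array(['C'] * 550 + ['T'] * 320 + ['J'] * 80 + ['S'] * 30 + ['U'] * 20)

# Lead = (proportion voting C) - (proportion voting T).
c_count = np.count_nonzero(observed == 'C')
t_count = np.count_nonzero(observed == 'T')
sample_lead = (c_count - t_count) / observed.size
print(sample_lead)
```

This is a single number for the one sample we have; it says nothing yet about how much the lead would vary from sample to sample.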

Use the bootstrap with 1000 resamples to compute an approximate distribution for Candidate C’s

lead over Candidate T, and store your bootstrap estimates in an array called boot_leads. Plot a

histogram of the resulting samples.

Hint: Use the code for proportions_in_resamples given to you above as a starting point.

[10]: def leads_in_resamples():

…

Question 1.3. Compute an approximate 97% confidence interval for the difference in proportions.

[16]: diff_lower_bound = …

diff_upper_bound = …

print("Bootstrapped 97% confidence interval for Candidate C's true lead over Candidate T: [{:f}, {:f}]".format(diff_lower_bound, diff_upper_bound))

The staff computed the following 95% confidence interval for the proportion of Candidate C voters:

[.52, .58]

(Your answer might have been slightly different, but that doesn’t mean it was wrong since the data

was randomly sampled.)

Question 1.4. Can we say that 95% of the population lies in the range [.52, .58]? Assign your

choice to variable q1_4.

- Yes
- No

[22]: q1_4 = …

q1_4

Question 1.5. Can we say that the true proportion of the population that will vote for Candidate

C is a random quantity with approximately a 95% chance of falling between 0.52 and 0.58? Assign

your choice to variable q1_5.

- No
- Yes

[26]: q1_5 = …

q1_5

Question 1.6. Suppose we produced 20,000 new samples (each one a uniform random sample

of 1,000 voters) and created a 97% confidence interval from each one. Roughly how many of

those 20,000 intervals do you expect will actually contain the true proportion of the population?

Assign your answer to the variable how_many below. It should be the number of intervals, not the

proportion or percentage.

[30]: how_many = …

how_many

Question 1.7.

The staff also created 80%, 90%, and 99% confidence intervals from one sample (shown below),

but we forgot to label which confidence interval represented which percentages! Match the interval

to the percent of confidence the interval represents and assign your choices to the variables q1_7_80,

q1_7_90, and q1_7_99, for the 80%, 90%, and 99% confidence intervals respectively.

Tip: Draw out the confidence intervals on a piece of paper to help you visualize them better.

- [.538, .563]
- [.516, .584]
- [.53, .57]

[33]: q1_7_80 = …

q1_7_90 = …

q1_7_99 = …

q1_7_80, q1_7_90, q1_7_99

1.3 2. Hardest Writing Course

Suppose it’s application season and you’re a current high school senior looking to apply to the

prestigious UCSD for data science. Also, suppose you dislike writing and want to strategically

analyze all the UCSD college writing courses, to figure out colleges to avoid applying to and

colleges where you have the best shot at getting a decent grade. Luckily, UCSD has data on its

CAPES website about writing courses (except for Muir’s writing course due to unknown reasons).

Each row corresponds to a particular quarter and course, and the data includes the name of the

course, the average study hours per week for the quarter, and the average grade for the quarter (on

a GPA scale). Now it’s time to analyze and figure out whether the writing course rumors are true

(or people just like complaining).

[43]: # Run this cell to read data; don’t change it

writing = bpd.read_csv("data/writing_courses_ucsd.csv", index_col = 0)

writing.iloc[:5]

Question 2.1a. The first thing to do before jumping into analysis is to figure out the mean study

hours and mean grade for each course. Create a table called course_means, indexed by

course, with two columns, Study Hrs/wk and grades, that contain the mean of Study Hrs/wk

and the mean of grades for each course, respectively.

[44]: course_means = …

course_means

Question 2.1b. You may have noticed that the mean grade for some courses is nan. This means

that some grades are missing for these courses (missing values are represented by nan). Drop all

the rows in the writing table that contain missing values and assign the new table to the variable

writing_fixed. After this, create a table called course_means_fixed with no nan values in the

grades column.

Hint: np.isnan() or np.isfinite() might be useful.
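To show what the hinted function does, here is a minimal sketch on a made-up NumPy array. The values and the names `grades`, `missing`, and `kept` are assumptions for this example only; it does not touch the writing table.

```python
import numpy as np

# Hypothetical grades with two missing entries.
grades = np.array([3.2, np.nan, 2.9, np.nan, 3.5])

# np.isnan returns True wherever a value is missing.
missing = np.isnan(grades)

# A single nan makes the mean nan, which is why means can come out as nan.
print(grades.mean())

# Keeping only the non-missing entries gives an ordinary mean again.
kept = grades[~missing]
print(kept.mean())
```

np.isfinite works the other way around: it is True for ordinary numbers and False for nan, so `grades[np.isfinite(grades)]` keeps the same entries here.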

[48]: writing_fixed = …

course_means_fixed = …

display(writing_fixed, course_means_fixed)

Question 2.1c. It’s hard to judge whether a course is hard just based on study hours per week

or grades. Therefore, we will calculate a “difficulty score” that captures the difficulty of the class.

This metric is positively related to study hours per week and negatively related to grades. We will

calculate this score using the formula:

$$10 \times \frac{\sqrt{\text{Study Hrs/week}}}{\text{grades}^2}$$

For instance, if a course has study hours per week of 4 and an average grade of 3.0, then its

difficulty will be $10 \times \frac{\sqrt{4}}{3^2} = 20/9$.
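The worked example above can be checked numerically. This sketch uses plain NumPy; the second pair of values (9 hours, grade 2.0) and the names `hours`, `grades`, and `scores` are made up for illustration.

```python
import numpy as np

# The example from the text: 4 study hours per week, average grade 3.0.
difficulty = 10 * np.sqrt(4) / 3.0 ** 2
print(difficulty)  # 20/9, roughly 2.22

# The same arithmetic applied element-wise to hypothetical arrays.
hours = np.array([4.0, 9.0])
grades = np.array([3.0, 2.0])
scores = 10 * np.sqrt(hours) / grades ** 2
print(scores)
```

Because NumPy arithmetic is element-wise, the same expression works on whole columns of values at once.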

Add a new column named “difficulty” to writing_fixed which contains the calculated difficulty

score for each course.

[52]: writing_fixed = …

display(writing_fixed, writing_fixed.groupby('course').mean())

Question 2.2. Revelle’s writing course HUM seems to have a pretty high difficulty score. Produce 1,000 bootstrapped estimates for the average difficulty of HUM. Store the estimates in the

hum_averages array. Use this hum_averages array to plot a histogram of the estimated averages.

The label on the x-axis should be “Estimated Difficulty for HUM”.

Use the hum_averages array to calculate an approximate 95% confidence interval for the true

average difficulty. Assign the corresponding bounds to lower_bound and upper_bound. Do

NOT round the bounds.

[55]: hum_averages = …

…

lower_bound = …

upper_bound = …

lower_bound, upper_bound

Question 2.3. You want to create a similar histogram for each of the other courses, and also

calculate the corresponding confidence intervals. Repeating the process above 4 times would be

time-consuming. Create a function called ci_and_hist, which takes in a course name as its input,

plots the histogram for 1,000 bootstrapped estimates for the average difficulty and returns a str

describing the approximate 95% confidence interval for the course (see the example below).

For example, ci_and_hist('HUM') should plot the same histogram as in Question 2.2 and return 'The

95% confidence interval for HUM is [2.85, 2.93]', where the 2.85 and 2.93 were calculated by

rounding lower_bound and upper_bound to two decimal places.

Note: For the returned string, make sure you follow the format above and remember to change

the course name and the confidence interval for different courses. For the histogram, the label on

the x-axis should also change according to the course.

[61]: def ci_and_hist(course_name):

…

[63]: #: try it out

ci_and_hist('WCWP')

Question 2.4. Your friend claims that Marshall’s writing course DOC is actually not as hard as

everyone says. She says that because our CAPE data is only a sample of the full population of

course offerings, the actual average difficulty for DOC could be 2.25. Run the cell below to use the

ci_and_hist function you defined above to calculate an approximate 95% confidence interval for

the average difficulty in DOC. Can you reject her hypothesis using this confidence interval? Assign

your answer to variable q2_4.

- Yes, the confidence interval includes 2.25
- No, the confidence interval includes 2.25
- Yes, the confidence interval doesn’t include 2.25
- No, the confidence interval doesn’t include 2.25

[65]: q2_4 = …

q2_4

Question 2.5a. You’ve now looked at the average difficulty for different courses, but you

believe that study time does not matter as long as you achieve a good grade. This time, you’ll

test whether each individual course has the same average grade as that of all the writing courses

combined.

First, produce 1,000 bootstrapped estimates for the average grade of all the writing courses combined. Use these estimates to produce an approximate 99% confidence interval for the true average grade. Round the bounds of the confidence interval to 2 decimal places and save them into

grade_lower_bound and grade_upper_bound.

grade_lower_bound = …

grade_upper_bound = …

grade_lower_bound, grade_upper_bound

Question 2.5b. Compare the average grade for each individual writing course to the average

grade of all writing courses combined. Your final answer should be a 5-element array named

grade_hypotheses.

In the order of [CAT, DOC, HUM, MMW, WCWP], the corresponding element in the array

grade_hypotheses should be -1 if the course’s average grade is significantly lower than that of

all the writing courses combined, 0 if you cannot reject the hypothesis that the course has the same

average grade as that of all the courses combined, and 1 if the course’s average grade is significantly

higher than that of all the courses combined. You may want to use the course_means_fixed table

you created in Question 2.1b.

Note: It’s okay to hard code your answer for this question.

[74]: grade_hypotheses = …

grade_hypotheses

1.4 3. Testing the Central Limit Theorem

The Central Limit Theorem tells us that the probability distribution of the sum or average of a

large random sample drawn with replacement will be roughly normal, regardless of the distribution

of the population from which the sample is drawn.

That’s a pretty big claim, but the theorem doesn’t stop there. It further states that the standard

deviation of this normal distribution is given by

$$\frac{\text{sd of the original distribution}}{\sqrt{\text{sample size}}}$$

In other words, suppose we start with any distribution that has standard deviation σ, take a sample

of size n (where n is a large number) from that distribution with replacement, and compute the

mean of that sample. If we repeat this procedure many times, then those sample means will have

a normal distribution with standard deviation $\sigma / \sqrt{n}$.

That’s an even bigger claim than the first one! The proof of the theorem is beyond the scope of

this class, but in this exercise, we will be exploring some data to see the CLT in action.
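Before turning to real data, the SD claim can be sanity-checked on a made-up population. The sketch below assumes a skewed (exponential) population of 100,000 values and a sample size of 100; the names `population`, `sample_means`, `predicted`, and `observed` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population, far from bell-shaped.
population = rng.exponential(scale=2.0, size=100_000)
sigma = np.std(population)

# Draw many samples of size n with replacement and record each sample mean.
n = 100
sample_means = np.array([
    rng.choice(population, size=n, replace=True).mean()
    for _ in range(2000)
])

predicted = sigma / np.sqrt(n)     # what the CLT says the SD should be
observed = np.std(sample_means)    # what the simulation actually gives
print(predicted, observed)
```

The two printed numbers should be close but not identical, since the observed value is itself based on a finite number of simulated samples.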

Question 3.1. The CLT only applies when sample sizes are “sufficiently large.” This isn’t a very

precise statement. Is 10 large? How about 50? The truth is that it depends both on the original

population distribution and just how “normal” you want the result to look. Let’s use a simulation

to get a feel for how the distribution of the sample mean changes as sample size goes up.

Consider a coin flip. If we say Heads is 1 and Tails is 0, then there’s a 50% chance of getting a

1 and a 50% chance of getting a 0, which is definitely not a normal distribution. The average of

several coin tosses is equal to the proportion of heads in those coin tosses, so the CLT should apply

if we compute the sample proportion of heads many times.
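The claim that the average of 0/1 coin flips equals the proportion of heads can be checked directly. This is a minimal sketch with 20 simulated flips; the names `flips` and `proportion_heads` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# 20 fair coin flips: 1 = heads, 0 = tails.
flips = rng.integers(0, 2, size=20)

# Counting the heads and dividing by the number of flips ...
proportion_heads = np.count_nonzero(flips) / flips.size

# ... gives exactly the same number as averaging the 0s and 1s.
print(proportion_heads, flips.mean())
```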

Write a function called simulate_sample_n that takes in a sample size n. It should return an

array that contains 5000 sample proportions of heads, each from n coin flips.

[78]: def simulate_sample_n(n):

…

simulate_sample_n(5)

The code below will use the function you just defined to plot the empirical distribution of the

sample mean for several different sample sizes. The x- and y-scales are kept the same to facilitate

comparisons.

[80]: #: run this cell to visualize

bins = np.arange(-0.01,1.05,0.02)

for sample_size in np.array([2, 5, 10, 20, 50, 100, 200, 400]):
    bpd.DataFrame().assign(**{'Sample_Size:{}'.format(sample_size) : simulate_sample_n(sample_size)}).plot(kind = 'hist', bins=bins)

You can see that even the means of samples of 10 items follow a roughly bell-shaped distribution.

A sample of 50 items looks quite bell-shaped.

Now we will test the second claim of the CLT: That the SD of the sample mean is the SD of the

original distribution, divided by the square root of the sample size.

We have imported flight delay data and computed the standard deviation of delay time (in minutes):

[81]: #: run this cell, but don’t change it under penalty of law!

united = bpd.read_csv('data/united_summer2015.csv')

united_std = np.std(united.get('Delay'))

united_std

Question 3.2. Write a function called predict_sd. It takes a sample size n (a number) as its

argument. It returns the predicted standard deviation of the sample mean for samples of size n

from the flight delays, according to the CLT.

[82]: def predict_sd(n):

…

predict_sd(10)

Question 3.3. Write a function called empirical_sd that takes a sample size n as its argument.

The function should simulate 1000 samples of size n from the flight delays dataset, and it should

return the standard deviation of the means of those 1000 samples.

Hint: This function will be similar to the simulate_sample_n function you wrote earlier.

[87]: def empirical_sd(n):

…

empirical_sd(10)

The cell below will plot the predicted and empirical SDs for the delay data for various sample sizes.

It may take a few moments to run.

[92]: #: run this cell to visualize

sd_table = bpd.DataFrame().assign(Sample_Size = np.arange(1, 101, 10))

predicted = sd_table.get('Sample_Size').apply(predict_sd)

empirical = sd_table.get('Sample_Size').apply(empirical_sd)

sd_table = sd_table.assign(Predicted_SD = predicted, Empirical_SD = empirical)

sd_table.plot(kind='scatter', x='Sample_Size', y='Empirical_SD', label = 'Empirical_SD')

sd_table.plot(kind='scatter', x='Sample_Size', y='Predicted_SD', label = 'Predicted_SD')

1.5 4. Polling and the Normal Distribution

Michelle is a statistical consultant, and she works for a group that supports Proposition 68 (which

would mandate labeling of all horizontal or vertical axes), called Yes on 68. They want to know

how many Californians will vote for the proposition.

Michelle polls a uniform random sample of all California voters, and she finds that 285 of the 500

sampled voters will vote in favor of the proposition.

[93]: #: run this cell, but don’t change it!

sample = bpd.DataFrame().assign(
    Vote = np.array(["Yes", "No"]),
    Count = np.array([285, 215]))

sample_size = sample.get("Count").sum()

sample_proportions = sample.assign(
    Proportion = sample.get("Count") / sample_size)

sample_proportions

She uses 1,000 bootstrap resamples to compute a confidence interval for the proportion of all

California voters who will vote Yes. Run the next cell to see the empirical distribution of Yes

proportions in the 1,000 resamples.

[94]: #: run this cell, but don’t change it!

resample_yes_proportions = np.array([])

for i in np.arange(1000):
    resample = np.random.multinomial(sample_size, sample_proportions.get("Proportion")) / sample_size
    resample_yes_proportions = np.append(resample_yes_proportions, resample[0])

bpd.DataFrame().assign(Resample_Yes_proportion = resample_yes_proportions).plot(kind = 'hist', bins=np.arange(.2, .8, .01))
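The cell above resamples with a multinomial draw instead of resampling rows. Drawing 500 voters with replacement from the sample and counting each category has the same distribution as one multinomial draw with the sample proportions. The sketch below shows a single such draw; the names `counts` and `yes_proportion` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# One resample of 500 voters, summarized as [Yes count, No count].
# The probabilities are the sample proportions 285/500 and 215/500.
counts = rng.multinomial(500, [285 / 500, 215 / 500])

# Dividing by the sample size turns the Yes count into a proportion.
yes_proportion = counts[0] / 500
print(counts, yes_proportion)
```

The two counts always add up to 500, so the Yes proportion alone describes the whole resample.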

In a population whose members are 0 and 1, there is a simple formula for the standard deviation

of that population:

$$\text{standard deviation} = \sqrt{(\text{proportion of 0s}) \times (\text{proportion of 1s})}$$

(Figuring out this formula, starting from the definition of the standard deviation, is a fun exercise

for those who enjoy algebra – and who doesn’t?)
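For readers who would rather check the formula numerically than derive it, here is a minimal sketch on a made-up population of thirty 1s and seventy 0s. The names `population`, `formula_sd`, and `direct_sd` are assumptions for the example.

```python
import numpy as np

# Hypothetical 0/1 population: 30% ones, 70% zeros.
population = np.array([1] * 30 + [0] * 70)

# SD from the shortcut formula.
p_ones = population.mean()
p_zeros = 1 - p_ones
formula_sd = np.sqrt(p_zeros * p_ones)

# SD computed directly from the definition.
direct_sd = np.std(population)

print(formula_sd, direct_sd)
```

Both values agree up to floating-point rounding, which is what the formula predicts for any mix of 0s and 1s.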

Question 4.1. Without accessing the data in resample_yes_proportions in any way, and

instead using only the Central Limit Theorem and the numbers of Yes and No voters in our sample

of 500, compute a number approximate_sd that’s the predicted standard deviation of the array

resample_yes_proportions according to the Central Limit Theorem. Since you don’t know the

true proportions of 0s and 1s in the population, use the proportions in the sample instead (since

they’re probably similar).

[95]: approximate_sd = …

approximate_sd

Question 4.2. Compute the standard deviation of the array resample_yes_proportions to verify

that your answer to Question 4.1 is approximately right.

[98]: exact_sd = …

exact_sd

Question 4.3. Still without accessing resample_yes_proportions in any way, compute an

approximate 95% confidence interval for the proportion of Yes voters in California. The cell below

draws your interval as a red bar below the histogram of resample_yes_proportions; use that to

verify that your answer looks right.

Hint: Before, we’ve used percentile on the bootstrap distribution to find the bounds for the

confidence interval. Now, we’re not allowed to use the bootstrap distribution – but we don’t need

it! We know (from the Central Limit Theorem) that the distribution of the sample mean is Normal

with a certain standard deviation. We also know that 95% of the area of the normal distribution

falls within a certain number of standard deviations from the mean.

[102]: lower_limit = …

upper_limit = …

lower_limit, upper_limit

[107]: #: print the confidence interval

print('lower:', lower_limit, 'upper:', upper_limit)

[108]: #: run this cell to plot your confidence interval

bpd.DataFrame().assign(Resample_Yes_proportion = resample_yes_proportions).plot(bins=np.arange(.2, .8, .01), kind = 'hist').plot(np.array([lower_limit, upper_limit]), np.array([0,0]), c='r', lw=10);

Your confidence interval should overlap the number 0.55. That means we can’t be very sure whether

Proposition 68 is winning, even though the sample Yes proportion is a bit above 0.5.

The Yes on 68 campaign really needs to know whether they’re winning. To have more confidence in

the result of the poll, they decide to redo it with a larger sample. They’d be happy if the standard

deviation of the sample mean were only 0.005. They ask Michelle to run a new poll with a sample

size that’s large enough to achieve that. (Polling is expensive, so the sample also shouldn’t be

bigger than necessary.)

Instead of making the conservative assumption that the population standard deviation is 0.5 (coding

Yes voters as 1 and No voters as 0), she decides to assume that it’s equal to the standard deviation

of the sample,

$$\sqrt{(\text{Yes proportion in the sample}) \times (\text{No proportion in the sample})}.$$

Under that assumption, Michelle computes the smallest sample size necessary in order to be confident that the standard deviation of the sample mean is only 0.005.

Question 4.4. What sample size did she find? Assign your answer to the variable sample_size.

Remember the sample size needs to be an integer.

[109]: sigma = …

sample_size = …

sample_size = …

sample_size

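We know that

$$\text{sample means SD} = \frac{\text{population SD}}{\sqrt{\text{sample size}}},$$

so

$$\text{sample size} = \left(\frac{\text{population SD}}{0.005}\right)^2 = \left(\frac{\sqrt{\left(\frac{285}{500}\right)\left(\frac{215}{500}\right)}}{0.005}\right)^2.$$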
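The expression above can be evaluated exactly. The sketch below uses Python's `fractions` module rather than floating point so that rounding up to a whole number of voters is not affected by rounding error; the names `yes`, `no`, `target_sd`, `required`, and `n` are illustrative.

```python
from fractions import Fraction
import math

# Sample proportions and the target SD of the sample mean, as exact fractions.
yes = Fraction(285, 500)
no = Fraction(215, 500)
target_sd = Fraction(5, 1000)  # 0.005

# (population SD / target SD)^2, where population SD^2 = yes * no.
required = yes * no / target_sd ** 2

# A sample size must be a whole number, so round up.
n = math.ceil(required)
print(required, n)
```

Rounding up rather than to the nearest integer guarantees that the SD of the sample mean does not exceed the target.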

2 Finish Line

Congratulations! You are done with homework 7.

To submit your assignment:

- Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
- Read through the notebook to make sure everything is fine and all tests passed.
- Run the cell below to run all tests, and make sure that they all pass.
- Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.
