Confidence Intervals and Sample Size

1.1.1 Instructions
This assignment is due Monday, July 26th at 11:59pm. You are given six slip days throughout
the quarter, each of which can extend the deadline by one day. See the syllabus for more details. With
the exception of using slip days, late work will not be accepted unless you have made special
arrangements with your instructor.
Important: The otter tests don’t usually tell you that your answer is correct. More often, they
help catch careless mistakes. It’s up to you to ensure that your answer is correct. If you’re not
sure, ask someone (not for the answer, but for some guidance about your approach).
[1]: # please don’t change this cell, but do make sure to run it
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np
import otter
grader = otter.Notebook()
1.2 1. Polling
Four candidates are running for President of Dataland. A polling company surveys 1000 people
selected uniformly at random from among voters in Dataland, and it asks each one who they are
planning on voting for. After compiling the results, the polling company releases the following
proportions from their sample:
Candidate    Proportion
Candidate C  0.55
Candidate T  0.32
Candidate J  0.08
Candidate S  0.03
Undecided    0.02
These proportions represent a uniform random sample of the population of Dataland. We will
attempt to estimate the corresponding population parameters – the proportions of each kind of
voter in the entire population. We will use confidence intervals to compute a range of values that
reflects the uncertainty of our estimate.
The table votes contains the results of the survey. Candidates are represented by their initials.
Undecided voters are denoted by U.
[2]: #: run this cell to display the results of the survey — don’t change this cell!
votes = bpd.DataFrame().assign(vote=np.array(['C']*550 + ['T']*320 + ['J']*80 + ['S']*30 + ['U']*20))
votes = votes.sample(votes.shape[0],replace=False)
num_votes = votes.shape[0]
votes
Below, we have given you code that will use bootstrapped samples to compute estimates of the
true proportion of voters who are planning on voting for Candidate T.
[3]: #: run the bootstrap
def proportions_in_resamples():
    statistics = np.array([])
    for i in np.arange(1000):
        bootstrap = votes.sample(num_votes, replace = True)
        sample_statistic = np.count_nonzero(bootstrap.get('vote') == 'T') / num_votes
        statistics = np.append(statistics, sample_statistic)
    return statistics

boot_proportions = proportions_in_resamples()
bpd.DataFrame().assign(Estimated_Proportion=boot_proportions).plot(kind='hist',bins=np.arange(0.1,0.5,0.01))
Question 1.1. Using the array boot_proportions, compute an approximate 95% confidence
interval for the true proportion of voters planning on voting for Candidate T. (Compute the lower
and upper ends of the interval, named votes_lower_bound and votes_upper_bound, respectively.)
[4]: votes_lower_bound = …
votes_upper_bound = …

#: print the confidence interval

print("Bootstrapped 95% confidence interval for the proportion of T voters in the population: [{:f}, {:f}]".format(votes_lower_bound, votes_upper_bound))
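The percentile method behind this kind of interval can be sketched in plain Python. This is a minimal illustration, not the staff solution: the sample of 200 zeros and ones, the helper names bootstrap_proportions and percentile_interval, and the seed are all assumptions made for the example. It uses only the standard library so that it runs without the course packages; in the notebook, np.percentile would play the role of the percentile helper.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical sample: 1 = supports the candidate, 0 = does not.
sample = [1] * 64 + [0] * 136  # 32% support among 200 people

def bootstrap_proportions(data, repetitions=1000):
    """Resample with replacement and record the proportion of 1s each time."""
    n = len(data)
    return [sum(random.choices(data, k=n)) / n for _ in range(repetitions)]

def percentile_interval(values, confidence=0.95):
    """Middle `confidence` share of the values, cut at symmetric percentiles."""
    # quantiles(n=1000) returns the 0.1%, 0.2%, ..., 99.9% cut points.
    cuts = statistics.quantiles(values, n=1000, method="inclusive")
    tail = round((1 - confidence) / 2 * 1000)  # e.g. 25 for a 95% interval
    return cuts[tail - 1], cuts[-tail]

boot = bootstrap_proportions(sample)
lower, upper = percentile_interval(boot)
print(lower, upper)
```

For a 95% interval the helper reads off the 2.5th and 97.5th percentiles, so 2.5% of the bootstrap estimates fall below the interval and 2.5% fall above it.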
Question 1.2. The survey results seem to indicate that Candidate C is beating Candidate T
among voters. We would like to use confidence intervals to determine a range of likely values for
her true lead. Candidate C’s lead over Candidate T is:
(Candidate C’s proportion of the vote) − (Candidate T’s proportion of the vote).
Use the bootstrap with 1000 resamples to compute an approximate distribution for Candidate C’s
lead over Candidate T, and store your bootstrap estimates in an array called boot_leads. Plot a
histogram of the resulting samples.
Hint: Use the code for proportions_in_resamples given to you above as a starting point.
[10]: def leads_in_resamples():

Question 1.3. Compute an approximate 97% confidence interval for the difference in proportions.
[16]: diff_lower_bound = …
diff_upper_bound = …

#: print the confidence interval

print("Bootstrapped 97% confidence interval for Candidate C's true lead over Candidate T: [{:f}, {:f}]".format(diff_lower_bound, diff_upper_bound))
The staff computed the following 95% confidence interval for the proportion of Candidate C voters:
[.52, .58]
(Your answer might have been slightly different, but that doesn’t mean it was wrong since the data
was randomly sampled.)
Question 1.4. Can we say that 95% of the population lies in the range [.52, .58]? Assign your
choice to variable q1_4.

  1. Yes
  2. No
    [22]: q1_4 = …
    q1_4
    Question 1.5. Can we say that the true proportion of the population that will vote for Candidate
    C is a random quantity with approximately a 95% chance of falling between 0.52 and 0.58? Assign
    your choice to variable q1_5.
  1. No
  2. Yes
    [26]: q1_5 = …
    q1_5
    Question 1.6. Suppose we produced 20,000 new samples (each one a uniform random sample
    of 1,000 voters) and created a 97% confidence interval from each one. Roughly how many of
    those 20,000 intervals do you expect will actually contain the true proportion of the population?
    Assign your answer to the variable how_many below. It should be the number of intervals, not the
    proportion or percentage.
    [30]: how_many = …
    how_many
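What a confidence level promises can be checked with a small simulation. The sketch below is an illustration under stated assumptions: it invents a true proportion TRUE_P, draws fewer samples than the question describes to stay fast, and builds each interval from the normal approximation rather than from a bootstrap. The share of intervals that capture TRUE_P should land close to the confidence level.

```python
import math
import random
from statistics import NormalDist

random.seed(1)

TRUE_P = 0.55        # hypothetical population proportion
SAMPLE_SIZE = 1000
NUM_SAMPLES = 2000   # fewer than 20,000 to keep the sketch fast
CONFIDENCE = 0.97

# Number of SDs that captures the middle 97% of a normal curve (about 2.17).
z = NormalDist().inv_cdf(1 - (1 - CONFIDENCE) / 2)

covered = 0
for _ in range(NUM_SAMPLES):
    # Draw one sample and compute its proportion of "yes" voters.
    yes = sum(random.random() < TRUE_P for _ in range(SAMPLE_SIZE))
    p_hat = yes / SAMPLE_SIZE
    sd = math.sqrt(p_hat * (1 - p_hat) / SAMPLE_SIZE)
    if p_hat - z * sd <= TRUE_P <= p_hat + z * sd:
        covered += 1

coverage = covered / NUM_SAMPLES
print(coverage)
```

The randomness lives in the intervals, not in TRUE_P: the true proportion stays fixed while each new sample produces a different interval.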
    Question 1.7.
    The staff also created 80%, 90%, and 99% confidence intervals from one sample (shown below),
    but we forgot to label which interval corresponds to which confidence level! Match each interval
    to its confidence level and assign your choices to the variables q1_7_80, q1_7_90, and q1_7_99,
    for the likely 80%, 90%, and 99% confidence intervals respectively.
    Tip: Draw out the confidence intervals on a piece of paper to help you visualize them better.
  1. [.538, .563]
  2. [.516, .584]
  3. [.53, .57]
    [33]: q1_7_80 = …
    q1_7_90 = …
    q1_7_99 = …
    q1_7_80, q1_7_90, q1_7_99
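How interval width depends on confidence level can be seen by cutting several percentile intervals from the same set of estimates. The values below are a stand-in for bootstrap estimates (hypothetical numbers drawn from a normal curve), and the helper name interval is an assumption for this sketch.

```python
import random
import statistics

random.seed(2)

# Stand-in for an array of 5,000 bootstrap estimates (hypothetical values).
estimates = [random.gauss(0.40, 0.02) for _ in range(5000)]
cuts = statistics.quantiles(estimates, n=200, method="inclusive")  # 0.5% steps

def interval(confidence_pct):
    """Percentile interval covering the middle `confidence_pct` percent."""
    tail_steps = round((100 - confidence_pct) / 2 / 0.5)  # tail size in 0.5% steps
    return cuts[tail_steps - 1], cuts[-tail_steps]

widths = {}
for pct in (80, 90, 99):
    low, high = interval(pct)
    widths[pct] = high - low
    print(pct, round(low, 3), round(high, 3))
```

Each interval contains the previous one, because capturing a larger share of the estimates requires reaching further into both tails.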
    1.3 2. Hardest Writing Course
    Suppose it’s application season and you’re a current high school senior looking to apply to the
    prestigious UCSD for data science. Also, suppose you dislike writing and want to strategically
    analyze all the UCSD college writing courses, to figure out colleges to avoid applying to and
    colleges where you have the best shot at getting a decent grade. Luckily, UCSD has data on its
    CAPES website about writing courses (except for Muir’s writing course due to unknown reasons).
    Each row corresponds to a particular quarter and course, and the data includes the name of the
    course, the average study hours per week for the quarter, and the average grade for the quarter (on
    a GPA scale). Now it’s time to analyze and figure out whether the writing course rumors are true
    (or people just like complaining).
    [43]: # Run this cell to read data; don’t change it
    writing = bpd.read_csv("data/writing_courses_ucsd.csv", index_col = 0)
    writing.iloc[:5]
    Question 2.1a. The first thing to do before jumping into analysis is to figure out the mean study
    hours and mean grade for each course. Create a table called course_means, indexed by course,
    with columns Study Hrs/wk and grades that contain the mean study hours per week and the mean
    grade for each course.
    [44]: course_means = …
    course_means
    Question 2.1b. You may have noticed that the mean grades for some courses are nan. This means
    that some grades are missing for these courses (missing values are represented by nan). Drop all
    the rows in the writing table that contain missing values and assign the new table to the variable
    writing_fixed. After this, create a table called course_means_fixed with no nan values in the
    grades column.
    Hint: np.isnan() or np.isfinite() might be useful.
    [48]: writing_fixed = …
    course_means_fixed = …
    display(writing_fixed, course_means_fixed)
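Why a single missing grade turns a whole course mean into nan, and why dropping incomplete rows fixes it, can be illustrated without any table library. The rows and course names below are hypothetical, and the sketch only mirrors the idea; in the notebook the same filtering would be done on the writing table itself.

```python
import math
from collections import defaultdict

# Hypothetical rows: (course, study hours per week, average grade).
rows = [
    ("AAA", 6.0, 3.2),
    ("AAA", 7.0, float("nan")),  # grade missing for this quarter
    ("BBB", 5.0, 3.5),
    ("BBB", 4.0, 3.1),
]

# Any arithmetic that touches nan produces nan, so the naive mean is nan.
naive_mean = sum(row[2] for row in rows if row[0] == "AAA") / 2
print(math.isnan(naive_mean))

# Keep only rows whose grade is a real number.
complete_rows = [row for row in rows if not math.isnan(row[2])]

# Collect the grades for each course, then average them.
grades_by_course = defaultdict(list)
for course, _hours, grade in complete_rows:
    grades_by_course[course].append(grade)

mean_grade = {c: sum(g) / len(g) for c, g in grades_by_course.items()}
print(mean_grade)
```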
    Question 2.1c. It’s hard to judge whether a course is hard just based on study hours per week
    or grades. Therefore, we will calculate a “difficulty score” that captures the difficulty of the class.
    This metric is positively related to study hours per week and negatively related to grades. We will
    calculate this score using the formula:
    $$\text{difficulty} = 10 \times \frac{\sqrt{\text{Study Hrs/wk}}}{\text{grades}^2}$$
    For instance, if a course has study hours per week of 4 and an average grade of 3.0, then its
    difficulty will be $10 \times \frac{\sqrt{4}}{3^2} = 20/9$.
    Add a new column named “difficulty” to writing_fixed which contains the calculated difficulty
    score for each course.
    [52]: writing_fixed = …
    display(writing_fixed, writing_fixed.groupby('course').mean())
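The difficulty score can be checked on single numbers before applying it to a whole column. This sketch assumes the formula is 10 times the square root of the study hours, divided by the grade squared, which is the reading that reproduces the 20/9 worked example. The function name difficulty is hypothetical.

```python
import math

def difficulty(study_hours, grade):
    """Difficulty score: 10 * sqrt(study hours per week) / grade squared."""
    return 10 * math.sqrt(study_hours) / grade ** 2

# The worked example: 4 study hours per week and an average grade of 3.0.
print(difficulty(4, 3.0))

# More study hours raise the score; higher grades lower it.
print(difficulty(9, 3.0), difficulty(4, 4.0))
```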
    Question 2.2. Revelle’s writing course HUM seems to have a pretty high difficulty score. Produce
    1,000 bootstrapped estimates for the average difficulty of HUM. Store the estimates in the
    hum_averages array. Use this hum_averages array to plot a histogram of the estimated averages.
    The label on the x-axis should be “Estimated Difficulty for HUM”.
    Use the hum_averages array to calculate an approximate 95% confidence interval for the true
    average difficulty. Assign the corresponding bounds to lower_bound and upper_bound. Do
    NOT round the bounds.
    [55]: hum_averages = …

    lower_bound = …
    upper_bound = …
    lower_bound, upper_bound
    Question 2.3. You want to create a similar histogram for each of the other courses, and also
    calculate the corresponding confidence intervals. Repeating the process above 4 times would be
    time-consuming. Create a function called ci_and_hist, which takes in a course name as its input,
    plots the histogram for 1,000 bootstrapped estimates for the average difficulty and returns a str
    describing the approximate 95% confidence interval for the course (see the example below).
    For example, ci_and_hist('HUM') should plot the same histogram as in Question 2.2 and return 'The
    95% confidence interval for HUM is [2.85, 2.93]', where the 2.85 and 2.93 were calculated by
    rounding lower_bound and upper_bound to two decimal places.
    Note: For the returned string, make sure you follow the format above and remember to change
    the course name and the confidence interval for different courses. The label on the x-axis of
    the histogram should also change to match the course.
    [61]: def ci_and_hist(course_name):

    [63]: #: try it out
    ci_and_hist('WCWP')
    Question 2.4. Your friend claims that Marshall’s writing course DOC is actually not as hard as
    everyone says. She says that because our CAPE data is only a sample of the full population of
    course offerings, the actual average difficulty for DOC could be 2.25. Run the cell below to use the
    ci_and_hist function you defined above to calculate an approximate 95% confidence interval for
    the average difficulty in DOC. Can you reject her hypothesis using this confidence interval? Assign
    your answer to variable q2_4.
  1. Yes, the confidence interval includes 3.3
  2. No, the confidence interval includes 3.3
  3. Yes, the confidence interval doesn’t include 3.3
  4. No, the confidence interval doesn’t include 3.3
    [65]: q2_4 = …
    q2_4
    Question 2.5a. You’ve now looked at the average difficulty for different courses, but you
    believe that study time does not matter as long as you achieve a good grade. This time, you’ll
    test whether each individual course has the same average grade as that of all the writing courses
    combined.
    First, produce 1,000 bootstrapped estimates for the average grade of all the writing courses combined. Use these estimates to produce an approximate 99% confidence interval for the true average grade. Round the bounds of the confidence interval to 2 decimal places and save them into
    grade_lower_bound and grade_upper_bound.
    grade_lower_bound = …
    grade_upper_bound = …
    grade_lower_bound, grade_upper_bound
    Question 2.5b. Compare the average grade for each individual writing course to the average
    grade of all writing courses combined. Your final answer should be a 5 element array named
    grade_hypotheses.
    In the order of [CAT, DOC, HUM, MMW, WCWP], the corresponding element in the array
    grade_hypotheses should be -1 if the course’s average grade is significantly lower than that of
    all the writing courses combined, 0 if you cannot reject the hypothesis that the course has the same
    average grade as that of all the courses combined, and 1 if the course’s average grade is significantly
    higher than that of all the courses combined. You may want to use the course_means_fixed table
    you created in Question 2.1b.
    Note: It’s okay to hard code your answer for this question.
    [74]: grade_hypotheses = …
    grade_hypotheses
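The -1 / 0 / 1 coding described above amounts to checking where a course mean sits relative to the confidence interval. Here is a minimal sketch of that rule; the function name compare_to_interval and all the numbers are hypothetical and are not taken from the CAPES data.

```python
def compare_to_interval(course_mean, lower, upper):
    """Return -1, 0, or 1 for a mean below, inside, or above [lower, upper]."""
    if course_mean < lower:
        return -1   # significantly lower than the overall average
    if course_mean > upper:
        return 1    # significantly higher than the overall average
    return 0        # inside the interval: cannot reject "same average"

# Hypothetical interval [3.20, 3.30] and three hypothetical course means.
print(compare_to_interval(3.05, 3.20, 3.30))
print(compare_to_interval(3.25, 3.20, 3.30))
print(compare_to_interval(3.41, 3.20, 3.30))
```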
    1.4 3. Testing the Central Limit Theorem
    The Central Limit Theorem tells us that the probability distribution of the sum or average of a
    large random sample drawn with replacement will be roughly normal, regardless of the distribution
    of the population from which the sample is drawn.
    That’s a pretty big claim, but the theorem doesn’t stop there. It further states that the standard
    deviation of this normal distribution is given by
    $$\frac{\text{sd of the original distribution}}{\sqrt{\text{sample size}}}$$
    In other words, suppose we start with any distribution that has standard deviation σ, take a sample
    of size n (where n is a large number) from that distribution with replacement, and compute the
    mean of that sample. If we repeat this procedure many times, then those sample means will have
    a normal distribution with standard deviation $\sigma / \sqrt{n}$.
    That’s an even bigger claim than the first one! The proof of the theorem is beyond the scope of
    this class, but in this exercise, we will be exploring some data to see the CLT in action.
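Both claims can be previewed on a toy population before turning to real data. The sketch below uses a fair six-sided die as the population, a sample size of 50, and 4,000 repetitions; these are arbitrary choices for illustration. It compares the SD of the simulated sample means with the SD the theorem predicts.

```python
import math
import random
import statistics

random.seed(3)

population = [1, 2, 3, 4, 5, 6]        # a fair die: flat, not bell-shaped
sigma = statistics.pstdev(population)  # SD of the original distribution
n = 50                                 # sample size

# Draw many samples with replacement and record each sample mean.
sample_means = [
    statistics.fmean(random.choices(population, k=n)) for _ in range(4000)
]

predicted_sd = sigma / math.sqrt(n)             # what the CLT predicts
observed_sd = statistics.pstdev(sample_means)   # what the simulation shows
print(predicted_sd, observed_sd)
```

The two numbers should agree closely but not exactly, since the observed SD comes from a finite number of simulated samples.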
    Question 3.1. The CLT only applies when sample sizes are “sufficiently large.” This isn’t a very
    precise statement. Is 10 large? How about 50? The truth is that it depends both on the original
    population distribution and just how “normal” you want the result to look. Let’s use a simulation
    to get a feel for how the distribution of the sample mean changes as sample size goes up.
    Consider a coin flip. If we say Heads is 1 and Tails is 0, then there’s a 50% chance of getting a
    1 and a 50% chance of getting a 0, which is definitely not a normal distribution. The average of
    several coin tosses is equal to the proportion of heads in those coin tosses, so the CLT should apply
    if we compute the sample proportion of heads many times.
    Write a function called simulate_sample_n that takes in a sample size n. It should return an
    array that contains 5000 sample proportions of heads, each from n coin flips.
    [78]: def simulate_sample_n(n):

    simulate_sample_n(5)
    The code below will use the function you just defined to plot the empirical distribution of the
    sample mean for several different sample sizes. The x- and y-scales are kept the same to facilitate
    comparisons.
    [80]: #: run this cell to visualize
    bins = np.arange(-0.01,1.05,0.02)
    for sample_size in np.array([2, 5, 10, 20, 50, 100, 200, 400]):
        bpd.DataFrame().assign(**{'Sample_Size:{}'.format(sample_size) : simulate_sample_n(sample_size)}).plot(kind = 'hist', bins=bins)
    You can see that even the means of samples of 10 items follow a roughly bell-shaped distribution.
    A sample of 50 items looks quite bell-shaped.
    Now we will test the second claim of the CLT: That the SD of the sample mean is the SD of the
    original distribution, divided by the square root of the sample size.
    We have imported flight delay data and computed the standard deviation of delay time (in minutes):
    [81]: #: run this cell, but don’t change it under penalty of law!
    united = bpd.read_csv('data/united_summer2015.csv')
    united_std = np.std(united.get('Delay'))
    united_std
    Question 3.2. Write a function called predict_sd. It takes a sample size n (a number) as its
    argument. It returns the predicted standard deviation of the sample mean for samples of size n
    from the flight delays, according to the CLT.
    [82]: def predict_sd(n):

    predict_sd(10)
    Question 3.3. Write a function called empirical_sd that takes a sample size n as its argument.
    The function should simulate 1000 samples of size n from the flight delays dataset, and it should
    return the standard deviation of the means of those 1000 samples.
    Hint: This function will be similar to the simulate_sample_n function you wrote earlier.
    [87]: def empirical_sd(n):

    empirical_sd(10)
    The cell below will plot the predicted and empirical SDs for the delay data for various sample sizes.
    It may take a few moments to run.
    [92]: #: run this cell to visualize
    sd_table = bpd.DataFrame().assign(Sample_Size = np.arange(1, 101, 10))
    predicted = sd_table.get('Sample_Size').apply(predict_sd)
    empirical = sd_table.get('Sample_Size').apply(empirical_sd)
    sd_table = sd_table.assign(Predicted_SD = predicted, Empirical_SD = empirical)
    sd_table.plot(kind='scatter',x='Sample_Size', y='Empirical_SD',label = 'Empirical_SD')
    sd_table.plot(kind='scatter',x='Sample_Size', y='Predicted_SD',label = 'Predicted_SD')
    1.5 4. Polling and the Normal Distribution
    Michelle is a statistical consultant, and she works for a group that supports Proposition 68 (which
    would mandate labeling of all horizontal or vertical axes), called Yes on 68. They want to know
    how many Californians will vote for the proposition.
    Michelle polls a uniform random sample of all California voters, and she finds that 285 of the 500
    sampled voters will vote in favor of the proposition.
    [93]: #: run this cell, but don’t change it!
    sample = bpd.DataFrame().assign(
        Vote = np.array(["Yes", "No"]),
        Count = np.array([285, 215]))
    sample_size = sample.get("Count").sum()
    sample_proportions = sample.assign(
        Proportion=sample.get("Count") / sample_size)
    sample_proportions
    She uses 1,000 bootstrap resamples to compute a confidence interval for the proportion of all
    California voters who will vote Yes. Run the next cell to see the empirical distribution of Yes
    proportions in the 1,000 resamples.
    [94]: #: run this cell, but don’t change it!
    resample_yes_proportions = np.array([])
    for i in np.arange(1000):
        resample = np.random.multinomial(sample_size, sample_proportions.get("Proportion")) / sample_size
        resample_yes_proportions = np.append(resample_yes_proportions, resample[0])
    bpd.DataFrame().assign(Resample_Yes_proportion = resample_yes_proportions).plot(kind = 'hist',bins=np.arange(.2, .8, .01))
    In a population whose members are 0 and 1, there is a simple formula for the standard deviation
    of that population:
    $$\text{standard deviation} = \sqrt{(\text{proportion of 0s}) \times (\text{proportion of 1s})}$$
    (Figuring out this formula, starting from the definition of the standard deviation, is a fun exercise
    for those who enjoy algebra – and who doesn’t?)
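If you would rather check the formula numerically than derive it, compare the square root of (proportion of 0s) × (proportion of 1s) with the standard deviation computed directly from the definition. The population below, 30 ones and 70 zeros, is a hypothetical example.

```python
import math
import statistics

# Hypothetical 0/1 population: 30% ones, 70% zeros.
population = [1] * 30 + [0] * 70

p_ones = sum(population) / len(population)
p_zeros = 1 - p_ones

sd_from_formula = math.sqrt(p_zeros * p_ones)
sd_from_definition = statistics.pstdev(population)  # root mean squared deviation
print(sd_from_formula, sd_from_definition)
```

The formula gives its largest value, 0.5, when the population is split evenly between 0s and 1s.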
    Question 4.1. Without accessing the data in resample_yes_proportions in any way, and
    instead using only the Central Limit Theorem and the numbers of Yes and No voters in our sample
    of 500, compute a number approximate_sd that’s the predicted standard deviation of the array
    resample_yes_proportions according to the Central Limit Theorem. Since you don’t know the
    true proportions of 0s and 1s in the population, use the proportions in the sample instead (since
    they’re probably similar).
    [95]: approximate_sd = …
    approximate_sd
    Question 4.2. Compute the standard deviation of the array resample_yes_proportions to verify
    that your answer to Question 4.1 is approximately right.
    [98]: exact_sd = …
    exact_sd
    Question 4.3. Still without accessing resample_yes_proportions in any way, compute an
    approximate 95% confidence interval for the proportion of Yes voters in California. The cell below
    draws your interval as a red bar below the histogram of resample_yes_proportions; use that to
    verify that your answer looks right.
    Hint: Before, we’ve used percentile on the bootstrap distribution to find the bounds for the
    confidence interval. Now, we’re not allowed to use the bootstrap distribution – but we don’t need
    it! We know (from the Central Limit Theorem) that the distribution of the sample mean is Normal
    with a certain standard deviation. We also know that 95% of the area of the normal distribution
    falls within a certain number of standard deviations from the mean.
    [102]: lower_limit = …
    upper_limit = …
    lower_limit, upper_limit
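The reasoning in the hint can be sketched as follows. This is one possible approach, not the official solution: it takes the 285 Yes and 215 No counts from the sample, estimates the SD of the sample proportion with the CLT, and uses the standard library's NormalDist to find how many SDs cover the middle 95% of a normal curve. The variable names are assumptions for this example.

```python
import math
from statistics import NormalDist

yes, no = 285, 215            # counts from the sample of 500 voters
n = yes + no
p_hat = yes / n               # sample proportion of Yes voters

# CLT: SD of the sample proportion is (population SD) / sqrt(n),
# with the 0/1 population SD estimated from the sample proportions.
sd_of_proportion = math.sqrt(p_hat * (1 - p_hat)) / math.sqrt(n)

# About 1.96 SDs on either side of the mean cover 95% of a normal curve.
z = NormalDist().inv_cdf(0.975)
lower = p_hat - z * sd_of_proportion
upper = p_hat + z * sd_of_proportion
print(lower, upper)
```

Rounding 1.96 up to 2 SDs gives a slightly wider interval and is a common shortcut.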
    [107]: #: print the confidence interval
    print('lower:', lower_limit, 'upper:', upper_limit)
    [108]: #: run this cell to plot your confidence interval
    bpd.DataFrame().assign(Resample_Yes_proportion = resample_yes_proportions).plot(bins=np.arange(.2, .8, .01),kind = 'hist').plot(np.array([lower_limit, upper_limit]), np.array([0,0]), c='r', lw=10);
    Your confidence interval should overlap the number 0.55. That means we can’t be very sure whether
    Proposition 68 is winning, even though the sample Yes proportion is a bit above 0.5.
    The Yes on 68 campaign really needs to know whether they’re winning. To have more confidence in
    the result of the poll, they decide to redo it with a larger sample. They’d be happy if the standard
    deviation of the sample mean were only 0.005. They ask Michelle to run a new poll with a sample
    size that’s large enough to achieve that. (Polling is expensive, so the sample also shouldn’t be
    bigger than necessary.)
    Instead of making the conservative assumption that the population standard deviation is 0.5 (coding
    Yes voters as 1 and No voters as 0), she decides to assume that it’s equal to the standard deviation
    of the sample,
    $$\sqrt{(\text{Yes proportion in the sample}) \times (\text{No proportion in the sample})}.$$
    Under that assumption, Michelle computes the smallest sample size necessary in order to be confident that the standard deviation of the sample mean is only 0.005.
    Question 4.4. What sample size did she find? Assign your answer to the variable sample_size.
    Remember the sample size needs to be an integer.
    [109]: sigma = …
    sample_size = …
    sample_size = …
    sample_size
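    We know that
    $$\text{sample means SD} = \frac{\text{population SD}}{\sqrt{\text{sample size}}},$$
    so
    $$\text{sample size} = \left(\frac{\text{population SD}}{\text{sample means SD}}\right)^2 = \left(\frac{\sqrt{\left(\frac{285}{500}\right)\left(\frac{215}{500}\right)}}{0.005}\right)^2$$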
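The expression above can be evaluated as a quick check. The sketch below uses exact fractions instead of floating-point numbers so that rounding up to a whole number of voters is not thrown off by tiny rounding errors. That choice and the variable names are made for this illustration.

```python
import math
from fractions import Fraction

p_yes = Fraction(285, 500)
p_no = Fraction(215, 500)
target_sd = Fraction(5, 1000)   # desired SD of the sample mean, 0.005

# (population SD / target SD)^2, where population SD^2 = p_yes * p_no.
exact_size = p_yes * p_no / target_sd ** 2
sample_size = math.ceil(exact_size)   # round up to a whole number of voters
print(sample_size)  # 9804
```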

2 Finish Line
Congratulations! You are done with homework 7.
To submit your assignment:

  1. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including
    the test cells.
  2. Read through the notebook to make sure everything is fine and all tests passed.
  3. Run the cell below to run all tests, and make sure that they all pass.
  4. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload
    your notebook to Gradescope.
