Statistical Literacy

Sampling Proportions
Complete the following tasks digitally on this document, or physically on a separate sheet of paper. You can turn in the .docx, or you can take a picture and turn it in as a .pdf. To earn full credit, you must show support, clearly communicate your problem-solving process, and clearly mark your answers. If you cannot open the .docx because you have a Mac, do this assignment in Google Docs.
The purpose of this workout is to help you gain a better understanding of how research is done out there in the real world. In particular, this workout will help you grow in the ideas of sampling, sampling variability, sampling distribution, and estimating a population proportion from sample data.

Let’s Get to Know the Data
Here is the data that we are going to use for this homework (see link below). This data is real life data from the famous Child Health and Development Studies done during 1961 and 1962. Cool!
https://docs.google.com/spreadsheets/d/1O_jDKXK0tKFr_CWmx32mosASUHZ_Hu_5ESOvHO-4yLg/edit?usp=sharing
The unit of observation in this dataset is a pregnancy, and our sample size is 224 random pregnancies.

There were many variables measured on each pregnancy, but we are only going to play with seven of them. Let’s learn about them below:
• subject_id: this is the identification number of a particular pregnancy. We will use these numbers to do our simple random sampling.
• gestation: this is how long the baby was inside of mom as measured in days.
• wt: this is how much baby weighed at birth as measured in ounces.
• age: this is how old mom was at birth of baby rounded to the nearest year.
• grad_college: this is a yes/no question. If the mom graduated college then they said yes. If the mom didn’t graduate college then they said no. Think of this as being highly educated or not highly educated.
• non_smoker: this is a yes/no question. If the mom is a non-smoker then they said yes. If the mom is a smoker then they said no. This variable can be confusing, so take a moment to make sense of it: yes means the mom did not smoke. Also, remember, this data is from the early 60s and smoking was quite popular then.
• high_ses: this is a yes/no question. If the mom’s income is high enough for her to be considered high socio-economic status, then they said yes. Think of this as being rich or not rich. Yes means rich, no means not rich.

Let’s Play With the Data
Let’s pretend we are interested in the following question: what percent of moms were non-smokers back in the 60s?
An answer could be 40% of moms were non-smokers, or 60% of moms were non-smokers.
Before computer tools like Sheets, we would have to count by hand how many yes/no answers we see among the 224 participants in this study.
To make this less tedious, researchers take a sample instead of using the whole dataset or population of people (studying the whole population is often called a census).
To gain a better understanding of how many non-smokers there were, we might do the same study twice, because more data and more studies lead to a more comprehensive picture of what is happening.
To help you feel the above, we will sample twenty pregnancies.
We will each do two studies on this data, then share our results and use the combined data to make our picture more comprehensive.
Getting your Sample of Twenty from the 224
Use the following random number generator (see link a bit below) to do a simple random sample of twenty pregnancies. It is ok to use the same pregnancy twice or more in this simulation.
Format your webpage to look like the following. Notice the 20, 2, 225, and 20 in my boxes. This means we are going to pick 20 pregnancies from their subject_id’s which are between 2 and 225, and we want to look at all 20 numbers. See screenshot below.

https://www.random.org/integers/
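If you are curious how a random number generator like this could be mimicked in code, here is a minimal Python sketch. It is not what random.org does internally; the function name `draw_subject_ids` and the `seed` parameter are made up for illustration. It draws 20 whole numbers between 2 and 225, and repeats are allowed, just like in our simulation.

```python
import random

def draw_subject_ids(n=20, low=2, high=225, seed=None):
    """Draw n subject_ids between low and high (both included), repeats allowed."""
    rng = random.Random(seed)  # seed is optional; set it to get the same draw again
    # randint includes both endpoints, so 2 and 225 can both be picked
    return [rng.randint(low, high) for _ in range(n)]

sample_ids = draw_subject_ids()
print(sample_ids)
```

Each run prints a different list of 20 ids unless you pass a seed.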

Data table
Organize your two studies in the data table below:

Sample | Subject_ids | Number of non-smokers | Proportion of non-smokers
Study Example 1 | 154 42 3 156 55 149 98 63 165 171 184 20 35 46 | 5 | 5/20 = 0.25
Study Example 2 | 3 165 85 51 207 12 2 119 173 147 160 155 45 82 | 4 | 4/20 = 0.20
Study 1 | | |
Study 2 | | |
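The last two columns of the table are just a count and a division. As a rough sketch, assuming you have typed the non_smoker answers for your sampled pregnancies into a list, the arithmetic could look like this in Python. The function name `summarize_study` and the example list are made up for illustration; the list is built only to match the 5 out of 20 from Study Example 1.

```python
def summarize_study(answers):
    """Count the 'yes' answers and turn the count into a proportion."""
    count = sum(1 for a in answers if a.strip().lower() == "yes")
    return count, count / len(answers)

# Hypothetical answers built to match Study Example 1 (5 yes out of 20)
example_answers = ["yes"] * 5 + ["no"] * 15
count, proportion = summarize_study(example_answers)
print(count, proportion)  # 5 0.25
```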

Sharing Data
How can we get more accurate in our study of non-smokers? We can do more studies and average the results!!! But this is tedious if we have to do it all alone, so let’s share our data.
Also, hopefully you are starting to see the power of technology and appreciate it more, because we only had a sample size of 20 in our study, and we did two studies. Could you imagine a more realistic situation of 1,000 pregnancies in our study? (The original data set has about 1,200 pregnancies in it, but I only gave us 224 of them to make the spreadsheet easier to read.)
Share your "Number of non-smokers" column in the following Google Form.
https://docs.google.com/forms/d/e/1FAIpQLSeU16McoHXO8vhUB25TKkCGQIzSVeHNfzoWSSIsSue272ubDQ/viewform?usp=sf_link
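To get a feel for why doing many small studies and averaging them helps, here is a small simulation sketch in Python. Everything in it is an assumption for illustration: the population below is made up (224 answers, half yes and half no) and is NOT the real spreadsheet, and the function name `simulate_studies` is hypothetical. The point is only to watch individual studies bounce around while their average settles near the population value.

```python
import random
from statistics import mean

def simulate_studies(population, n_studies=100, sample_size=20, seed=None):
    """Run many small studies and return the proportion of 'yes' in each one."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_studies):
        # choices() samples with replacement, so a pregnancy can repeat
        sample = rng.choices(population, k=sample_size)
        proportions.append(sample.count("yes") / sample_size)
    return proportions

# MADE-UP population: 224 answers, half yes and half no (not the real data)
population = ["yes"] * 112 + ["no"] * 112
props = simulate_studies(population, seed=0)
print("lowest study:", min(props))
print("highest study:", max(props))
print("average of all studies:", mean(props))
```

Single studies of 20 can land fairly far from 0.5, but the average of 100 of them should sit close to it.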

Questions to Answer for Homework 1

Based on your data, what percent of moms in the 60s were non-smokers?
How confident are you in your answer? Like, do you think your answer above is the true actual percent of moms who were non-smokers in the 60s? Explain.

The following Qs are based on our shared data which can be found here:
https://docs.google.com/spreadsheets/d/18dCd2XWzXMOhoIRK_UgJ1-1mZNmpQfbTwMFsES_xaUQ/edit?usp=sharing

Screenshot or copy/paste the histogram of our shared data below
Mark on the graph where your two studies are.
The most extreme study said up to what percent of moms were non-smokers?
The most extreme study said as low as what percent of moms were non-smokers?
Most studies (the middle 95% of them) suggest that the percent of moms who were non-smokers is between what and what?
Did each study give us the same number? Explain what is happening.
How do you know which study is correct? Explain.
Our shared data suggests that the true percent of moms who were non-smokers is what, and with what margin of error?
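If you would rather check your reading of the histogram with code, the sketch below shows one way the shared column could be summarized in Python. It rests on a few assumptions: the function name `summarize_shared` is made up, the "middle 95%" is found by trimming about 2.5% of studies from each end, and the margin of error is taken as half the width of that middle range, which is one simple convention and may differ from the one used in class. The two counts in the usage lines are the two example studies from the data table; swap in the full shared column.

```python
from statistics import mean

def summarize_shared(counts, sample_size=20):
    """Summarize many studies, each reported as a count of non-smokers."""
    props = sorted(c / sample_size for c in counts)
    n = len(props)
    cut = int(0.025 * n)          # studies to trim from each end (0 if n < 40)
    middle = props[cut:n - cut]   # roughly the middle 95% of studies
    low, high = middle[0], middle[-1]
    return {
        "lowest": props[0],
        "highest": props[-1],
        "middle_95": (low, high),
        "average": mean(props),
        "margin_of_error": (high - low) / 2,  # assumed convention: half the width
    }

# The two example studies from the data table; replace with the shared column
summary = summarize_shared([5, 4])
for name, value in summary.items():
    print(name, value)
```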
