HW 2 - Foundations + the Practice of Inference
This homework is due Sunday, Sep 23 at 11:59pm.
Packages
Tips
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be reminders in this assignment for you to Render your document. The last thing you want to do is work an entire assignment before realizing you have an error somewhere that makes it so you can’t compile your document. Render after each completed question.
Note: Do not let R output answer the question for you unless the question specifically tells you to do so.
Note: You cannot earn more than 100% on this assignment.
Exercises
Exercise 1: Office budget
Millions of records maintained by the office of budget in a particular state indicate that the amount of time elapsed between the submission of travel vouchers and the final reimbursement of funds has approximately a normal distribution with mean 36 days and standard deviation of 4 days.
- In words, define the random variable from the context above.
Solution
Let X be the elapsed time between submission and reimbursement.
- In proper notation, write out how the random variable is distributed.
Solution
\[ X \sim N(36, 4) \]
- Here is a visualization of your population distribution for this question. Suppose you want to know the probability that your random variable takes on a value larger than 55. Write out this probability statement using proper notation.
Solution
P(X > 55)
- What is the probability that your random value takes on a value larger than 55? Justify your answer. Note, you do not have to perform any calculations to answer this question.
Solution
The probability that X > 55 is very small (< 0.0001). Probability is calculated as area under the curve, and 55 is nearly 5 standard deviations above the mean (\((55 - 36)/4 = 4.75\)), so the area under the distribution to the right of 55 is almost, but not exactly, 0.
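As a quick check of the reasoning above, the tail area can be computed in R with `pnorm()`. This is only a sketch for verification, since the exercise asks for a justification rather than a calculation. The variable names `mu`, `sigma`, and `cutoff` are illustrative and not part of the assignment.

```r
# Population parameters from the problem statement
mu <- 36      # mean reimbursement time, in days
sigma <- 4    # standard deviation, in days
cutoff <- 55  # value of interest

# How many standard deviations above the mean is the cutoff?
z <- (cutoff - mu) / sigma

# Upper-tail area P(X > 55) under a normal with mean 36 and sd 4
p_upper <- pnorm(cutoff, mean = mu, sd = sigma, lower.tail = FALSE)

z
p_upper
```

The printed tail area should be positive but far below 0.0001, which matches the "almost 0, but not 0" argument.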
- What is the probability that your random value takes on a value that is equal to 40? Justify your answer. Note, you do not have to perform any calculations to answer this question.
Solution
The probability that X = 40 is 0. Probability is area under the curve, and the area above a single point is 0. Put another way, a continuous random variable can take on infinitely many values within any range, so the probability of any one specific value is 0.
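One way to formalize this, writing \(f\) for the normal density of X (a symbol introduced here only for illustration), is to note that an interval of zero width has zero area:

\[ P(X = 40) = \int_{40}^{40} f(x)\,dx = 0 \]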
Exercise 2: Office Budget
- Suppose that you are a researcher, and take a random sample of 100 office records and you calculate your sample statistic. What is the proper notation for your statistic?
Solution
\[\bar{x}\]
- Knowing the population distribution from Exercise 1, write out how the sample statistic is distributed (sampling distribution of the mean), in proper notation. Include 3-5 sentences justifying the center, spread, and shape of the distribution.
Solution
\[ \bar{x} \sim N(36, \frac{4}{\sqrt{100}} ) \]
The sampling distribution of \(\bar{x}\) is normal because the population distribution is normal. It has the same center as the population (36). However, the spread of the sampling distribution is smaller: because the population standard deviation is known, the standard error is \(\frac{\sigma}{\sqrt{n}} = \frac{4}{\sqrt{100}} = 0.4\).
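The center and spread claimed above can be checked with a small simulation. The sketch below assumes the population is exactly normal with mean 36 and standard deviation 4. The seed and the number of repetitions are arbitrary choices made here, not values from the assignment.

```r
set.seed(511)  # arbitrary seed, used only for reproducibility

mu <- 36      # population mean
sigma <- 4    # population standard deviation
n <- 100      # sample size
reps <- 5000  # number of simulated samples (arbitrary)

# Draw `reps` samples of size n and record each sample mean
sample_means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

# Compare the simulated center and spread with the theoretical values
c(simulated_center = mean(sample_means), theoretical_center = mu)
c(simulated_spread = sd(sample_means), theoretical_spread = sigma / sqrt(n))
```

The simulated center should land close to 36 and the simulated spread close to 0.4, with small differences due to simulation noise.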
Exercise 3: Skittles
Skittles are a brand of chewy, fruit-flavored candies that come in many colors and flavors. The original skittles colors consisted of Red, Orange, Yellow, Green, and Purple. The company claims that they put the same amount of each color of skittles in their bags of candy.
You, as a Skittles enthusiast, question if this is really true. More specifically, you suspect that they are putting more Purple Skittles in their bags of candy than they claim to be.
With this information, set up your null and alternative hypothesis. Write out each in both words and proper notation.
\(H_o\):
\(H_a\):
Solution
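\(H_o\): \(\pi = 0.2\). The true proportion of purple Skittles is 0.2.
\(H_a\): \(\pi > 0.2\). The true proportion of purple Skittles is greater than 0.2.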
- Suppose you sneak into the factory, and take a random sample of 205 skittles. You found 47 purple skittles, and 158 skittles that were not purple. Below, in proper notation, write out your summary statistic.
Solution
\(\hat{p} = \frac{47}{205}\)
- Is the sampling distribution going to be normal, under the assumption of the null hypothesis? Justify your answer.
Solution
Yes. The independence condition is reasonable because we took a random sample, and our sample size is less than 10% of the entire population of Skittles. Under the assumption of the null hypothesis, we would expect 41 successes (purple Skittles) and 164 failures (not purple Skittles). Both of these values are larger than 10. Thus, we would expect the sampling distribution to be normal.
\(0.2 \times 205 = 41 > 10\) and \(0.8 \times 205 = 164 > 10\)
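The success-failure check above can also be written as a few lines of R. This is a minimal sketch; the names `p0`, `expected_successes`, and `expected_failures` are illustrative.

```r
n <- 205   # sample size
p0 <- 0.2  # null proportion of purple Skittles

expected_successes <- n * p0       # expected purple under the null
expected_failures <- n * (1 - p0)  # expected not purple under the null

# Show both expected counts, then check that each is at least 10
c(expected_successes, expected_failures)
all(c(expected_successes, expected_failures) >= 10)
```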
- Where is the sampling distribution going to be centered? Why does this make sense?
Solution
The sampling distribution is going to be centered at our null value of 0.2. This makes sense because we are assuming that the true proportion of purple skittles is 20%.
- Now, calculate your standardized test statistic that you will use to test your hypothesis above. Show your work!
Solution
\(Z = \frac{\frac{47}{205} - 0.2}{\sqrt{\frac{0.2 \times 0.8}{205}}} \approx 1.05\)
- Finally, calculate your p-value. At \(\alpha = 0.05\), write an appropriate decision and conclusion in the context of the problem.
Solution
pnorm(1.05, 0, 1, lower.tail = FALSE)
[1] 0.1468591
Based on a p-value of 0.147, which is larger than \(\alpha = 0.05\), we fail to reject the null hypothesis, and have weak evidence to conclude that the true proportion of purple Skittles is greater than 20%.
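For reference, the whole test can be reproduced from the raw counts in one place. The sketch below works from the unrounded sample proportion, so its output can differ slightly from arithmetic done by hand with rounded inputs. The variable names are illustrative.

```r
x <- 47    # purple Skittles observed
n <- 205   # total Skittles sampled
p0 <- 0.2  # null proportion

p_hat <- x / n                      # sample proportion
se_null <- sqrt(p0 * (1 - p0) / n)  # standard error under the null
z <- (p_hat - p0) / se_null         # standardized test statistic

# One-sided p-value for H_a: pi > 0.2
p_value <- pnorm(z, mean = 0, sd = 1, lower.tail = FALSE)

round(c(p_hat = p_hat, se_null = se_null, z = z, p_value = p_value), 3)

# Decision at alpha = 0.05: TRUE would mean "reject the null"
p_value < 0.05
```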
Exercise 4 (Simulation)
Suppose that instead of using theory based techniques, we wanted to use simulation based techniques to test if there are more purple skittles than the company claims.
- Before carrying out a simulation study, would you expect the results from your simulated hypothesis test to be the same or different than when using theory based techniques?
Solution
Simulation based techniques have more relaxed assumptions than theory based techniques. In this case, the independence and sample size assumptions for the theory based test are also met (expected successes and failures are both larger than 10), so we should expect to see similar results from the two approaches.
- Below is the simulated null distribution for the hypothesis test in Exercise 3.
library(tidyverse)
library(infer)

set.seed(12345)

skittles_data <- tibble(
  correct_guess = c(rep("Purple", 47), rep("Not", 158))
)

null_dist <- skittles_data |>
  specify(response = correct_guess, success = "Purple") |>
  hypothesize(null = "point", p = 0.2) |> # fill in the blank
  generate(reps = 10000, type = "draw") |> # fill in the blank
  calculate(stat = "prop") # fill in the blank

visualize(null_dist)
There are 10,000 observations on this histogram. In as much detail as possible, explain how one observation (dot) on this plot was created.
Solution
We would set up a spinner with 20% of it being purple, and 80% of it being not purple. We could spin this 205 times, and take the new simulated proportion of purple over the total of 205. This new proportion would be one dot on the plot.
OR
We have 205 cards, where 41 are purple and 164 are not purple (20% and 80%). Then, we randomly sample with replacement 205 times, and calculate the new simulated sample proportion. This new simulated sample proportion is one dot on the plot.
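The spinner and card descriptions above can be sketched in base R, without the infer package. This is an illustration of the idea under the null value of 0.2, not the code that produced the histogram. The names `one_dot`, `null_props`, and `p_value_sim` are made up for this example, and base R draws will not match infer's draws even with the same seed.

```r
set.seed(12345)  # seed reused for convenience; results differ from infer's

n <- 205       # Skittles per simulated sample
p0 <- 0.2      # null proportion of purple
reps <- 10000  # number of dots on the plot

# One dot: draw 205 Skittles from a 20% purple / 80% not purple population,
# then compute the proportion of purple in that simulated sample
one_sample <- sample(c("Purple", "Not"), size = n, replace = TRUE,
                     prob = c(p0, 1 - p0))
one_dot <- mean(one_sample == "Purple")

# Repeat the same process 10,000 times to build a null distribution
null_props <- rbinom(reps, size = n, prob = p0) / n

# Simulated p-value: share of dots at least as large as the observed 47/205
p_value_sim <- mean(null_props >= 47 / 205)

one_dot
p_value_sim
```

Each element of `null_props` plays the role of one dot, and the simulated p-value can be compared with the theory based p-value from Exercise 3.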
Exercise 5: Class Question
Suppose that the Board of Education came out with an assessment question, and they claimed that 50% of all students who attempt this question get the answer correct. Wanting to test this, I assigned the question to my class, and observed 68 students get the answer correct, and 83 students get the answer incorrect.
Below, you can see the associated visualization and p-value for this prompt.
# A tibble: 1 × 1
p_value
<dbl>
1 0.182
- Interpret the p-value in the context of the problem. Note, the interpretation of a p-value is different than writing a decision or a conclusion.
Solution
Assuming that the true proportion of students who get the question correct is 50%, the probability of observing 45% or fewer of students get the question correct, or 55% or more of students get the question correct, is roughly 18.2%.
Note, all percentages can be in decimals
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Log in with your school credentials.
- Click on your STA 511 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
- Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:
- linking all pages appropriately on Gradescope
- putting your name in the YAML at the top of the document
- Pipes %>%, |> and ggplot layers + should be followed by a new line
- You should be consistent with stylistic choices, e.g. %>% vs |>
Grading
- Exercise 1: 13 points
- Exercise 2: 10 points
- Exercise 3: 22 points
- Exercise 4: 6 points
- Exercise 5: 6 points
- Workflow + formatting: 5 points
- Total: 62 points