Exam 1 - Take-home

ST511

By submitting this exam, I hereby state that I have not communicated with or gained information in any way from my classmates during this exam, and that all work is my own.

Any potential violation of the NC State policy on academic integrity will be reported to the Office of Student Conduct & Community Standards. All work on this exam must be your own.

  1. This exam is due Tuesday, October 15th, at 11:59PM. Late work is not accepted for the take-home.
  2. You are allowed to ask content clarification questions via Slack. Any question that would give you an advantage in answering an exam question will not be answered.
  3. This is an individual exam. You may not use your classmates or AI to help you answer the following questions. You may use all other class resources.
  4. Show your work. This includes any and all code used to answer each question. Partial credit cannot be earned if no work is shown.
  5. Round all answers to 3 digits (ex. 2.34245 = 2.342)
  6. Only use the functions we have learned in class unless the question specifies otherwise. If you have a question about whether a function is allowed, please send me an email.
  7. If you are having trouble recreating a plot exactly, you still may earn partial credit if you recreate the plot partially.

Do not forget to render after each question! You do not want to have rendering issues close to the due date.

Reminder: Unless you are recreating a plot exactly, all plots should have appropriate labels and not have any redundant information.

Reminder: The phrase “In a single pipeline” means writing your code as one continuous chain, with no stops or breaks. An example of a single pipeline is:

penguins |> 
  filter(species == "Gentoo") |>
  summarize(center = mean(bill_length_mm))

A break in your pipeline would look like:

Gentoo_data <- penguins |> 
  filter(species == "Gentoo")
  
Gentoo_data |>
  summarize(center = mean(bill_length_mm))

Reminder: LaTeX is not required for the exam.

Reminder: Use help files! You can access help files for functions by typing ?function.name into your Console.
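
For example, running the line below in your Console opens the help page for the mean() function; replace mean with any other function name to see that function's help page.

# open the help page for mean() in the Help pane
?mean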

Good luck!


Exam Start

Packages

You will need the following packages for the exam.
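
Based on the packages referenced in Question 1, the packages code chunk presumably contains the following library() calls.

library(tidyverse)
library(tidymodels)
library(palmerpenguins)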

Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'dplyr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4          ✔ readr     2.1.5     
✔ forcats   1.0.0          ✔ stringr   1.5.1     
✔ ggplot2   3.5.1          ✔ tibble    3.2.1     
✔ lubridate 1.9.3          ✔ tidyr     1.3.1     
✔ purrr     1.0.2.9000     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'tidymodels' was built under R version 4.3.3
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.5      ✔ rsample      1.2.1 
✔ dials        1.2.1      ✔ tune         1.2.0 
✔ infer        1.0.7      ✔ workflows    1.1.4 
✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
✔ recipes      1.0.10     
Warning: package 'dials' was built under R version 4.3.3
Warning: package 'scales' was built under R version 4.3.3
Warning: package 'infer' was built under R version 4.3.3
Warning: package 'modeldata' was built under R version 4.3.3
Warning: package 'parsnip' was built under R version 4.3.3
Warning: package 'recipes' was built under R version 4.3.3
Warning: package 'rsample' was built under R version 4.3.3
Warning: package 'tune' was built under R version 4.3.3
Warning: package 'workflows' was built under R version 4.3.3
Warning: package 'workflowsets' was built under R version 4.3.3
Warning: package 'yardstick' was built under R version 4.3.3
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.

Attaching package: 'palmerpenguins'

The following object is masked from 'package:modeldata':

    penguins


Now is a good time to save and render

Question 1: Package questions

  1. When you render your exam, you will notice a lot of additional warnings underneath the library(tidyverse), library(tidymodels), and library(palmerpenguins) code:
-- Warning...

-- Attaching core tidyverse packages ...

-- Conflicts ...

etc...

Hide the warning text (not the code itself) using the appropriate code chunk argument for the packages code chunk. You do not need to include any text to answer this question (just hide all the extra text). Hint: You can find a list of these arguments here, under “Options available for customizing output include:” towards the top of the page. (A generic sketch of chunk-option syntax is shown after this item.)
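
For reference, chunk options in Quarto are written at the top of a code chunk, each on its own line beginning with #|. The options below (label and echo) only illustrate the syntax; they are not necessarily the option you need to hide the warnings.

#| label: packages
#| echo: true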

  2. Describe, in detail, the difference between the library() function and the install.packages() function.

  3. Suppose someone you know forgets to load their packages with library(), and instead just tries to run the following code.

penguins |>
  summarize(mean_peng = mean(bill_length_mm))

What will the code produce? Why?


Now is a good time to save and render

Question 2: Penguin recreation

We are going to use the very familiar penguin data for question 2.

  1. In a single pipeline, calculate the median bill length for each species of penguin. Your answer should be a 3x2 tibble.

  2. In a single pipeline, calculate the min, max, and IQR of bill length for each species of penguin. Your answer should be a 3x4 table. Hint: Check the help file for the function summarize() for Useful functions.

  3. In a single pipeline, recreate the plot below.

Hint: This plot uses theme_minimal() and scale_color_viridis_d(option = "D")

Note: Your dots will not look exactly the same as the graph below. They will be similar.


Now is a good time to save and render

Question 3: Flights simulation

Data

Read in the airplanes_data.csv file for the following questions.

airplanes <- read_csv("https://st511-01.github.io/data/airplanes_data.csv")
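
If you want to confirm that the data read in as expected, one option (not required for the exam) is to take a quick look with glimpse():

# quick optional check of the variables and their types
airplanes |>
  glimpse()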

For question 3, we will be using the following variables:

flight_distance - distance in miles

customer_type - categorized as either disloyal_customer or loyal_customer

A disloyal customer is a customer who stops flying with the airline. A loyal customer is someone who continues flying with the airline.

In as much detail as possible, DESCRIBE the simulation scheme for

  1. conducting a hypothesis test to evaluate whether the average flight distance is different between the two types of customers.
airplanes |>
  group_by(customer_type) |>
  summarize(count = n())
# A tibble: 2 × 2
  customer_type     count
  <chr>             <int>
1 disloyal_customer    86
2 loyal_customer      406
  2. creating a confidence interval for estimating the difference in average flight distance between the two types of customers.

Be as precise as possible, add context where you can, and use information from the actual data (e.g., sample sizes) in your answer. Use the order of subtraction loyal - disloyal.


Now is a good time to save and render

Question 4: Flights confidence interval

  1. In this question, you are going to create a simulation-based 90% confidence interval for the scenario described in question 3. Do you expect the sampling distribution of your difference in means to be roughly normal? Justify your answer. Add to the existing code to create any appropriate plots that would help you justify your claim. You can make more than one plot for each group if you would like (start a new pipe the same way you see below). For this question, you can assume that one person’s flight experience is independent of another’s, regardless of group.

Pipeline with only loyal customer group

airplanes |>
  filter(customer_type == "loyal_customer") |>
  # insert code here

Pipeline with only disloyal customer group

airplanes |>
  filter(customer_type == "disloyal_customer") |>
  # insert code here
  2. Regardless of your answer above, add to the existing code below by replacing the ... with the appropriate information to create your simulated sampling distribution, as described in question 3. You are saving your simulated difference in means into a data set called boot_df.

Fill in the … for set.seed with your NC State ID number (ex. 001138280). Report a glimpse() of your new boot_df to answer this question. (A generic sketch of how an infer pipeline fits together, using different data, is shown after the code template below.)

set.seed(...)

boot_df <- airplanes |>
  specify(response = ..., explanatory = ...) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "...", order = c("loyal_customer", "disloyal_customer"))
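
For reference, the sketch below shows how a complete infer bootstrap pipeline fits together. It uses the penguins data and a single sample mean rather than this question’s difference in means, so it is illustrative only and is not the code to fill into the template above.

# generic example: bootstrap distribution of one sample mean (not this question's statistic)
penguins |>
  specify(response = bill_length_mm) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")
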
  3. Plot a histogram of boot_df. Where is this distribution centered? Why does this make sense?

  4. Now, create a 90% confidence interval using your boot_df data frame by filling in the ... below. Then, interpret this interval in the context of the problem.

boot_df |>
  summarize(lower = quantile(stat, ...),
            upper = quantile(stat, ...))
  5. Discuss one benefit and one disadvantage of making an 80% confidence interval instead of a 90% confidence interval.


Now is a good time to save and render

Question 5: Testing flights

In this question, we are tasked with conducting a theory-based hypothesis test to see if there is a relationship between a customer’s type of travel and their overall satisfaction with the flight. That is, does the type of travel impact satisfaction? We are investigating whether customers who are traveling for personal reasons are satisfied more often than those who are traveling for business.

For this question, we are going to use the following variables.

satisfaction - was the customer satisfied or dissatisfied with their experience

type_of_travel - was the travel for business or personal

Below, carry out the necessary steps to conduct a theory-based hypothesis test (z-test). Show all your work. You may assume \(\alpha\) = 0.05. At the end of your hypothesis test, write your decision and conclusion in the context of the problem.

\(H_0\):

\(H_a\):

Question 6: ChickWeights

For this question, we are going to use a chick weight data set, which contains data from an experiment on the effect of diet on the early growth of chicks. Please read in the data below.

Data

The data key can be seen below:

time - the week in which the measurement was taken (0, 2, 4, 6, 8)

chick - chick ID number (1, 2, 21, 22)

diet - type of diet (1, 2)

weight - body weight of the chick (gm)

The data can be read in below. Note that we can use as.factor() to make sure R is treating a variable as a category.

chickweight <- read_csv("https://st511-01.github.io/data/chickenweight.csv") |>
  mutate(diet = as.factor(diet),
         chick = as.factor(chick),
         time = as.factor(time))
Rows: 20 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): time, chick, diet, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. Report the mean and standard deviation of weight for each measurement time. This should be a 5 x 3 tibble. Name the columns mean_weight and spread.

As researchers, it is critical that you are able to learn and implement new computing skills. In the next two questions, you are going to build on your existing R and statistics skills.

  2. The chicken weights are measured in grams. Suppose that, instead of grams, you want to have the weight measured in ounces. We are going to learn a new function that allows us to create new variables: mutate(). The help file for mutate() can be found here. mutate() is in the dplyr family, and has arguments similar to functions we have used commonly in class, such as summarize(name = action).

One gram is 0.035274 ounces. In one pipeline, create a new variable called mean_weight_o or weight_o (whichever is more appropriate) that represents the number of ounces a chick weighs, and calculate the mean weight and standard deviation of your new mean_weight_o/weight_o variable for each combination of time and diet. This should be a 10 x 4 tibble. (A generic sketch of mutate() on a different data set is shown after this item.)
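
The sketch below is only a generic illustration of how mutate() creates a new column from an existing one; it uses the penguins data and a different unit conversion, not the chick weights.

# generic example: create a new column by converting millimeters to centimeters
penguins |>
  mutate(bill_length_cm = bill_length_mm / 10)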

  3. Please recreate the graph below.

As a reminder, a list of geoms in the ggplot2 package can be found here.

Hint: Go to the reference above and look for the geom that connects observations with a line.

Hint: To recreate this plot, we need to specify group = variable.name inside the aes() function to tell R which observations we want to connect to each other.

Hint: This plot uses scale_color_viridis_d()

  4. Why do we use scale_color_viridis_d() instead of scale_color_viridis_c() in the plot above?

  5. Why is it important to use scale_color_viridis functions when we create plots?


Now is a good time to save and render

Note

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:

  • linking all pages appropriately on Gradescope
  • putting your name in the YAML at the top of the document
  • Pipes %>%, |> and ggplot layers + should be followed by a new line
  • Proper spacing before and after pipes and layers
  • You should be consistent with stylistic choices, e.g. %>% vs |>

Grading

  • Package questions: 10 points
  • Penguins recreation: 16 points
  • Flights simulation: 12 points
  • Flights confidence interval: 30 points
  • Testing flights: 30 points
  • ChickWeights: 27 points
  • Workflow + formatting: 5 points
  • Total: 130 points