HW 5 - Modeling
This homework is due Tuesday, Nov 19th at 11:59pm.
Packages
Hint:
For this homework, I ask you to save values as R objects. Remember that saving an object does not print its value. When you are asked to save a value and report it, you also need to include the name of the saved object on its own line in your code chunk.
a <- 3 # store a value as 3
a # output the value that a is
[1] 3
Data
For this homework, we are interested in the impact of smoking during pregnancy.
We will use the dataset births14, included in the openintro package loaded at the beginning of the assignment. This is a random sample of 1,000 mothers from a data set released in 2014 by the state of North Carolina about the relation between habits and practices of expectant mothers and the birth of their children. Please look at help(births14) for the variables contained in the dataset.
We are going to create a version of the births14 data set that drops NA values for the variable habit. We are saving this as a new data set called births14_habitgiven. This will be the data set we use moving forward.
births14_habitgiven <- births14 |>
  drop_na(habit) # drops rows where habit is NA; NAs in other columns are kept
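To see what dropping NA values for a single column does, here is a minimal sketch on a made-up data frame. The object names toy and toy_habitgiven and the values in them are hypothetical and are not part of births14; the sketch assumes the tidyr and tibble packages are installed.

```r
library(tidyr)   # provides drop_na()
library(tibble)  # provides tibble()

# Hypothetical toy data: NA appears in both columns, on different rows
toy <- tibble(
  habit  = c("smoker", NA, "nonsmoker", "nonsmoker"),
  weight = c(6.5, 7.1, NA, 8.0)
)

toy_habitgiven <- toy |>
  drop_na(habit) # only the row with a missing habit is removed

nrow(toy_habitgiven)              # 3 rows remain
sum(is.na(toy_habitgiven$weight)) # 1: the NA in weight is still there
```

Calling drop_na() with no column names would instead remove every row that has an NA in any column, which is usually more than you want here.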
Exercise 1 - Smoking during pregnancy
a. A researcher is interested in assessing the relationship between babies’ weights and mothers’ ages. Suppose that weight is the outcome of interest and mother’s age is the explanatory variable. Fit a linear model to investigate this relationship. Provide the summary output below.
Hint: Pull up the help file for births14 to view a variable key.
b. In 1-3 sentences, explain how the regression line that models these data is fit, i.e., what criterion R uses to determine the regression line.
c. Interpret the intercept in the context of the problem. Is the intercept meaningful in this context? Why or why not?
d. Interpret the slope in the context of the problem.
Data
For the next two exercises, we will be using the lego_sample dataset, which includes data about Lego sets on sale on Amazon.
lego_sample <- read_csv("data/lego_sample.csv")
We are going to use the following three variables:
- amazon_price: the price for the Lego set listed on Amazon;
- pieces: the number of Lego pieces in the set;
- theme: the category the set belongs to.
Exercise 2 - LEGO
Let’s examine the relation between the Amazon Price and the number of pieces in a Lego set.
a. Compute and report the correlation r of the two variables amazon_price and pieces. Save this value as r. Next, interpret the correlation coefficient in the context of the problem.
Hint: use the definition of a correlation coefficient.
b. Compute and report the standard deviations of the variables amazon_price and pieces, which we will denote by \(s_{price}\) and \(s_{pieces}\), respectively. Save these calculations as s_price and s_pieces.
Hint: Do each calculation in a separate pipeline so you can save each result under its own name.
c. Now fit a linear model where the number of pieces is a predictor of the Amazon price. Provide the summary output and write the estimated model with proper notation.
d. From your output in part c, report the coefficient of determination for this linear model. Next, show that it is equal to the squared correlation coefficient (\(r^2\)).
Hint: Use your saved value of r to show that r^2 equals the reported coefficient of determination.
e. Now, interpret the coefficient of determination in the context of the problem.
f. If we denote by \(\widehat{\beta}_{pieces}\) the slope you estimated in part c, show that this relationship holds: \[ \widehat{\beta}_{pieces} = \frac{s_{price}}{s_{pieces}} r. \]
Use the slope coefficient from part c and the saved values of r, s_price, and s_pieces to show that this relationship holds.
(Hint: compute the number on the right-hand side and show that it equals the estimated slope. This relationship can be found in your Chapter 7 prepare material.)
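The two identities in this exercise hold for any simple linear regression, so they can be checked on other data first. The sketch below uses R's built-in mtcars data purely as a stand-in (it is not the Lego data), and it fits the model with base R's lm(), which is an assumption and may differ from the modeling functions used in class. The names s_y, s_x, slope, and r_squared are illustrative.

```r
# Built-in mtcars data, used here only as a stand-in for the Lego data
r   <- cor(mtcars$mpg, mtcars$wt) # correlation between response and predictor
s_y <- sd(mtcars$mpg)             # standard deviation of the response
s_x <- sd(mtcars$wt)              # standard deviation of the predictor

# Simple linear regression of mpg on wt
fit <- lm(mpg ~ wt, data = mtcars)

slope     <- unname(coef(fit)["wt"]) # estimated slope
r_squared <- summary(fit)$r.squared  # coefficient of determination

# Both comparisons return TRUE, up to floating-point tolerance
all.equal(slope, (s_y / s_x) * r)
all.equal(r_squared, r^2)
```

all.equal() is used rather than == because the two sides are computed along different numerical paths and can differ in the last few decimal places.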
Exercise 3
Let us look at how the relation between the Amazon Price and the number of pieces in a Lego set varies according to the theme of the Lego set.
a. Make a scatter plot of the Amazon price (y) against the number of pieces (x), coloring the points according to the theme of the Lego set (z). Add a fitted linear line for each theme. Is there evidence that the relationship between price and number of pieces is different by theme? Justify your answer.
b. Regardless of your answer to part a, let’s fit an additive model named price_fit_add. Report the summary output and interpret the coefficient associated with pieces in the context of the problem.
Note: You do not need to write out the estimated model for this question.
c. Now, write out the estimated model for the themeFriends level (sets in the Friends theme). Simplify your equation as much as possible.
For this model, use R code to predict the mean Amazon price of Friends Lego sets that have 50 pieces.
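As a rough illustration of predicting from an additive model, the sketch below uses R's built-in iris data as a stand-in (it is not the Lego data) and base R's lm() and predict(), which is an assumption and may differ from the functions used in class. The names fit_add, new_flower, pred, and by_hand are illustrative.

```r
# Additive model: one common slope, plus a separate intercept shift per species
fit_add <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# New observation: the grouping variable must use the same levels as the data
new_flower <- data.frame(
  Petal.Length = 4,
  Species = factor("versicolor", levels = levels(iris$Species))
)

# Predicted mean response for this combination of predictors
pred <- predict(fit_add, newdata = new_flower)

# The same prediction by hand, from the simplified equation for one group:
# (intercept + group shift) + slope * x
b <- coef(fit_add)
by_hand <- (b["(Intercept)"] + b["Speciesversicolor"]) + b["Petal.Length"] * 4

all.equal(unname(pred), unname(by_hand)) # TRUE
```

Writing the prediction by hand mirrors simplifying the estimated equation: for a single group, the indicator terms collapse into one adjusted intercept.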
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Log in with your school credentials.
- Click on your STA 511 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. Every page of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
- Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.
The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:
- linking all pages appropriately on Gradescope
- putting your name in the YAML at the top of the document
- Pipes %>%, |> and ggplot layers + should be followed by a new line
- You should be consistent with stylistic choices, e.g. %>% vs |>
Grading for HW-5
- Exercise-1: 15 points
- Exercise-2: 30 points
- Exercise-3: 25 points
- Workflow + formatting: 5 points
- Total: 75 points