HW 6 - Interaction Models: Solutions

Homework

Packages

library(tidyverse)
library(tidymodels)
library(gapminder)

Data

For the first, we will be using the lego_sample dataset, which includes data about Lego sets on sale on Amazon.

lego_sample <- read_csv("data/lego_sample.csv")

We are going to use the following three variables:

amazon_price: the price for the Lego set listed on Amazon;
pieces: the number of Lego pieces in the set;
theme: the category the set belongs to.

Exercise-1: Interaction model output

a. Fit an interaction model with theme and number of pieces as your explanatory variables to investigate Amazon price. Name this model price_fit_int. Report your model output below.

price_fit_int <- lm(amazon_price ~ pieces*theme, data = lego_sample)

summary(price_fit_int)


Call:
lm(formula = amazon_price ~ pieces * theme, data = lego_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.683  -9.851  -3.566   4.961  90.741 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          9.437468   6.528891   1.445    0.153    
pieces               0.130469   0.018315   7.124 7.94e-10 ***
themeDUPLO®          0.117102   9.117357   0.013    0.990    
themeFriends        -6.034602   9.546438  -0.632    0.529    
pieces:themeDUPLO®   0.583630   0.140334   4.159 9.05e-05 ***
pieces:themeFriends -0.004644   0.027071  -0.172    0.864    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.81 on 69 degrees of freedom
Multiple R-squared:  0.6358,    Adjusted R-squared:  0.6094 
F-statistic: 24.09 on 5 and 69 DF,  p-value: 6.12e-14

b. Write out the entire estimated model in proper notation.

\(\hat{price} = 9.43 + .13*pieces + .117*DUPLO - 6.03*Friends + .584*pieces*DUPLO - 0.005*pieces*Friends\)

c. Write out the estimated model for the DUPLO theme in proper notation.

\(\hat{price} = 9.547 + .714*pieces\)

d. Write out the estimated model for the City theme in proper notation.

\(\hat{price} = 9.43 + .13*pieces\)

e. Without any calculations, justify if the coefficient of determination for the price_fit_int model will be larger, smaller, or not enough information to know, than the coefficient of determination for an additive model with theme and pieces as explanatory variables, and amazon price as the response.

The \(R^2\) will be higher for the interaction model vs the additive model, because the interaction model has the same variables as the additive model, but has more coefficients (is more complicated).

f. Without any calculations, justify if the coefficient of determination for the price_fit_int model will be larger, smaller, or not enough information to know, than the coefficient of determination for an additive model with theme and pieces as explanatory variables, and pages of the manual as the response.

We don’t know if \(R^2\) will be higher or lower. The response variable is different.

Exercise-2: Should we fit a lm?

Data

Gapminder is a “fact tank” that uses publicly available world data to produce data visualizations and teaching resources on global development. We will use an excerpt of their data to explore relationships among world health metrics across countries and regions between the years 1952 and 2007. The data set is called gapminder, from the gapminder package. A table of variables can be found below.

We are going to use the data just from the year 2007 using the following code.

gapminder_2007 <- gapminder |>
  filter(year == 2007) |>
  droplevels()

Use the gapminder_2007 data set to answer the follow questions.

Hint: pull up the help file for gapminder for a variable name key.

We are interested in learning more about life expectancy (y), and we’ll start with exploring the relationship between life expectancy and GDP (x). Create a scatter plot of gdpPercap vs. lifeExp.

gapminder_2007 |>
  ggplot(
    aes(x = gdpPercap, y = lifeExp)
  ) +
  geom_point() +
  labs(x = "GDP per cap",
       y = "Like expectancy", 
       title = "Scatterplot of gapminder data")

While working with a fellow student, they claim that they don’t think it’s appropriate to fit a linear model to explore this relationship. In 1-2 sentences, explain why your fellow student is correct.

There is not a linear relationship between life expectancy and GDP. Therfore, the student is correct, and this relationship should not be studied using a linear model.

Because of the concern your fellow student raised, we are going to take the natural log of gdpPercap, and use this as our new response variable (lnDGP) using the following code below.

gapminder_2007_log <- gapminder_2007 |>
  mutate(lnGDP = log(gdpPercap))

For the remainder of this question, we are going to use the gapminder_2007_log data set.

Create a scatter plot of lnGDP vs. lifeExp. Then, make a comment on if you think it is appropriate to fit a linear model to analyze this relationship.

gapminder_2007_log |>
  ggplot(
    aes(x = lnGDP, y = lifeExp)
  ) +
  geom_point() +
  labs(x = "ln of GDP per cap",
       y = "Like expectancy", 
       title = "Scatterplot of gapminder data")

It is more appropriate to model the the ln of GDP per cap by life expectancy with a linear model, as there appears to be some evidence of a linear relationship between the two variables (although still a slight curve).

Exercise-3: Model selection

Another student comes to you and suggests that they believe continent could also be a useful variable at explaining the variability in lnGDP. In this exercise, we are going to investigate which model would be the most appropriate to model lnGDP.

Using the gapminder_2007_log data set, fit the three models below, and report their summary output.

SLR model with lifeExp as the explanatory varaible

Additive model with lifeExp and continent

Interaction model with lifeExp and continent

slr <- lm(lifeExp ~ lnGDP, data = gapminder_2007_log)
summary(slr)


Call:
lm(formula = lifeExp ~ lnGDP, data = gapminder_2007_log)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.947  -2.661   1.215   4.469  13.115 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.9496     3.8577   1.283    0.202    
lnGDP         7.2028     0.4423  16.283   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.122 on 140 degrees of freedom
Multiple R-squared:  0.6544,    Adjusted R-squared:  0.652 
F-statistic: 265.2 on 1 and 140 DF,  p-value: < 2.2e-16

add <- lm(lifeExp ~ lnGDP + continent, data = gapminder_2007_log)
summary(add)


Call:
lm(formula = lifeExp ~ lnGDP + continent, data = gapminder_2007_log)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.4917  -2.3146  -0.0432   2.5498  14.8818 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        20.1376     4.0332   4.993 1.79e-06 ***
lnGDP               4.6308     0.5274   8.780 6.14e-15 ***
continentAmericas  11.6942     1.6546   7.068 7.46e-11 ***
continentAsia      10.1144     1.4761   6.852 2.31e-10 ***
continentEurope    11.2682     1.8936   5.951 2.14e-08 ***
continentOceania   12.9293     4.5211   2.860  0.00491 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.929 on 136 degrees of freedom
Multiple R-squared:  0.7674,    Adjusted R-squared:  0.7588 
F-statistic: 89.72 on 5 and 136 DF,  p-value: < 2.2e-16

int <- lm(lifeExp ~ lnGDP * continent, data = gapminder_2007_log)
summary(int)


Call:
lm(formula = lifeExp ~ lnGDP * continent, data = gapminder_2007_log)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.1483  -2.4726  -0.0517   2.2124  15.4213 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              22.90682    6.22285   3.681 0.000337 ***
lnGDP                     4.26088    0.82373   5.173 8.36e-07 ***
continentAmericas        10.15826   15.68051   0.648 0.518222    
continentAsia             2.74329    9.80627   0.280 0.780108    
continentEurope          12.53213   19.92268   0.629 0.530411    
continentOceania         23.89890  279.39987   0.086 0.931964    
lnGDP:continentAmericas   0.23319    1.79057   0.130 0.896581    
lnGDP:continentAsia       0.89638    1.18998   0.753 0.452625    
lnGDP:continentEurope    -0.03398    2.06360  -0.016 0.986886    
lnGDP:continentOceania   -0.96521   27.15408  -0.036 0.971698    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.004 on 132 degrees of freedom
Multiple R-squared:  0.7685,    Adjusted R-squared:  0.7527 
F-statistic: 48.68 on 9 and 132 DF,  p-value: < 2.2e-16

Now, using adjusted r-squared, determine what model (of the three models fit) is the “best” model to model life expectancy.

summary(slr)$adj.r.squared

[1] 0.6519808

summary(add)$adj.r.squared

[1] 0.7588141

summary(int)$adj.r.squared

[1] 0.7526912

Based on the adjusted-r-squared value, it appears that the additive model with ln GDP and continent is the most appropriate.

c). Based on your answer to part b, select the most appropriate interpretation of your model below…

Our model suggests that the relationship between ln of GDP per cap and life expectancy does not depend on continent.
Our model suggests that the relationship between ln of GDP per cap and life expectancy does depend on continent.

Exercise-4

I hope you enjoyed this question :). Communicating results is extremely important, and we can use power tools to engage an audience when telling a story.

Communication is a critical yet often overlooked part of research. When we engage with our audience and capture their interest, we can ultimately better communicate what we are trying to share.

Please watch the following video: Hans Rosling: 200 years in 4 minutes.

Then, write a paragraph (4-5 sentences) addressing the following:

What did you enjoy about the presentation of data? What did you find interesting

Were there any aspects of the presentation that were hard to follow? If so, what?

What are your general takeaways from this presentation?

What are your general takeaways from how this presentation was given?