Additive + Interaction Models: Solutions

Load packages and data

Today

By the end of today you will…

  • be able to build, fit and interpret linear models with \(>1\) predictor
  • think critically about r-squared as a model selection tool
  • use adjusted r-squared to compare models

Interaction model

First, let’s visualize

penguins |>
  ggplot(
    aes(x = bill_length_mm, y = flipper_length_mm, color = species)
  ) + 
  geom_point() + 
  geom_smooth(method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Now, let’s fit an interaction model to get the estimates associated with the plot above. Name this model_int. Provide the summary output.

model_int <- lm(flipper_length_mm ~ bill_length_mm*species, data = penguins)

summary(model_int)

Call:
lm(formula = flipper_length_mm ~ bill_length_mm * species, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.0561  -3.3195  -0.1806   3.5830  16.4397 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     158.9244     6.9039  23.020  < 2e-16 ***
bill_length_mm                    0.7999     0.1776   4.505 9.17e-06 ***
speciesChinstrap                -12.2886    12.4595  -0.986   0.3247    
speciesGentoo                    -7.8284    10.6428  -0.736   0.4625    
bill_length_mm:speciesChinstrap   0.2073     0.2765   0.750   0.4538    
bill_length_mm:speciesGentoo      0.5913     0.2459   2.405   0.0167 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.792 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.8328,    Adjusted R-squared:  0.8303 
F-statistic: 334.8 on 5 and 336 DF,  p-value: < 2.2e-16

And just like before, we can make predictions!

predict(model_int, data.frame(bill_length_mm = 40,  species = "Adelie"))
       1 
190.9204 

Now, simplify our model for just the Chinstrap penguins, showing how both the slope and intercept change from the baseline in this interaction model.

See demo on doc-cam: take notes here

Model selection

Let’s talk more about r-squared

As a reminder, \(R^2\) is defined as the amount of variability in our response (y) that is explained by our model with x explanatory variable(s)

Demo

set.seed(12345)
random_peng <- rbeta(nrow(penguins), 20, 10) # make 100 random numbers

penguins_fake <- cbind(penguins, random_peng) # add that variable to the data

####### models + r.squared value

slr_model <- lm(flipper_length_mm ~ bill_length_mm, data = penguins_fake)
summary(slr_model)$r.squared # slr model
[1] 0.430574
add_model <- lm(flipper_length_mm ~ bill_length_mm + species, data = penguins_fake)
summary(add_model)$r.squared # additive model
[1] 0.8298693
summary(model_int)$r.squared # interaction model
[1] 0.8328305
nonsense_model <- lm(flipper_length_mm ~ bill_length_mm*species*random_peng, data = penguins_fake )
summary(nonsense_model)$r.squared #interaction model with a literal random variable
[1] 0.8441509

Can we use r-squared for model selection? Why or why not?

We can not use R^2 for model selection. This value always increases as your model gets more complicated, even if you add variables that do not make sense

Adjusted r.squared

summary(slr_model)$adj.r.squared # slr model
[1] 0.4288992
summary(add_model)$adj.r.squared # additive model
[1] 0.8283592
summary(model_int)$adj.r.squared # interaction model
[1] 0.8303428
summary(nonsense_model)$adj.r.squared # nonsense model
[1] 0.8389559

Takeaway: Our interaction model with bill length and species provides us with the best overall model fit, after accounting for the number of parameters in the model.