Logistic Regression: Solutions

Warning: package 'ggridges' was built under R version 4.4.2

Learning goals

By the end of today, you will…

  • use logistic regression to fit a model for a binary response variable
  • fit a logistic regression model in R
  • think about using a logistic regression model for classification

To illustrate logistic regression, we will build a spam filter from email data. Today’s data represent incoming emails in David Diez’s (one of the authors of OpenIntro textbooks) Gmail account for the first three months of 2012 . All personally identifiable information has been removed.

email <- read_csv("https://st511-01.github.io/data/email.csv") |>
  mutate(spam = factor(spam))
Rows: 3890 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): winner, number
dbl  (18): spam, to_multiple, from, cc, sent_email, image, attach, dollar, i...
dttm  (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(email)
Rows: 3,890
Columns: 21
$ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cc           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, …
$ sent_email   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, …
$ time         <dttm> 2012-01-01 06:16:41, 2012-01-01 07:03:59, 2012-01-01 16:…
$ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ attach       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ dollar       <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 0, 0, …
$ winner       <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ inherit      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ password     <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ num_char     <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.091, 4.837, 7.421…
$ line_breaks  <dbl> 202, 202, 192, 255, 29, 25, 193, 237, 69, 68, 25, 79, 191…
$ format       <dbl> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
$ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, …
$ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 10, 4, 10, 20, 0…
$ number       <chr> "big", "small", "small", "small", "none", "none", "big", …

The variables we’ll use in this analysis are

  • spam: 1 if the email is spam, 0 otherwise
  • exclaim_mess: The number of exclamation points in the email message

We want to use the number of exclamation points in an email to predict whether or not it is spam.

– Response: spam

– Explanatory: exclaim_mess

Exploratory Data Analysis

Let’s start by taking a look at our data. Create an density plot to investigate the relationship between spam and exclaim_mess. What does this plot suggest?

email |> 
  ggplot(
    aes(x = exclaim_mess, fill = spam)
  ) + 
  geom_density()

Let’s try a linear model

Suppose we try using a linear model to describe the relationship between the number of exclamation points and whether an email is spam. Visualize the linear model that models spam by exclamation marks below.

email |>
  ggplot() +
  geom_jitter(
    aes(x = exclaim_mess, y = spam, color = spam)
  ) +
  geom_smooth(
    aes(x = exclaim_mess , y = as.numeric(spam)), method = "lm", se = F)
`geom_smooth()` using formula = 'y ~ x'

  • Is the linear model a good fit for the data? Why or why not?

Because our response is categorical

Back to the Slides to Learn About Logistic Regression

Logistic regression Practice in R

Logistic regression takes in a number of explanatory variables and outputs the log-odds of “success” (an outcome of 1) in a binary response variable. The log-odds are then used to predict the probability of “success”.

Note: The model will always assume 0 to be failure and 1 to be a success, and the outputted predicted probabilities will always be the probabilities of success

The logistic regression model using the number of exclamation points as an explanatory variable is

\(\widehat{\ln\Big(\frac{p}{1-p}\Big)} = \hat{\beta_0} + \hat{\beta_1} \times exclaim\_mess\)

The probability an email is spam is

\(\hat{p} = \frac{\exp\{\hat{\beta_0} + \hat{\beta_1} \times exclaim\_mess\}}{1 + \exp\{\hat{\beta_0} + \hat{\beta_1} \times exclaim\_mess\}}\)

Exercise 1

  • Let’s fit the logistic regression model using the number of exclamation points to predict the probability an email is spam.

Things to note: We are no longer doing linear regression (we are doing logistic regression); We are not fitting a linear model (we are fitting a generalized linear model); we need to specify family = binomial in the lm function.

Name this model spam_model

spam_model <- glm(spam ~ exclaim_mess, data = email, family = binomial)

summary(spam_model)

Call:
glm(formula = spam ~ exclaim_mess, family = binomial, data = email)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.91139    0.06404 -29.846  < 2e-16 ***
exclaim_mess -0.16836    0.02398  -7.021 2.21e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2417.5  on 3889  degrees of freedom
Residual deviance: 2318.5  on 3888  degrees of freedom
AIC: 2322.5

Number of Fisher Scoring iterations: 7

Note: The solutions are in the next r-code chunk

spam_model <- glm(spam ~ exclaim_mess, data = email, family = binomial)

summary(spam_model)

Call:
glm(formula = spam ~ exclaim_mess, family = binomial, data = email)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.91139    0.06404 -29.846  < 2e-16 ***
exclaim_mess -0.16836    0.02398  -7.021 2.21e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2417.5  on 3889  degrees of freedom
Residual deviance: 2318.5  on 3888  degrees of freedom
AIC: 2322.5

Number of Fisher Scoring iterations: 7
  • Now, compare your summary output to the estimated model below.

\(\widehat{\ln\Big(\frac{p}{1-p}\Big)} = -1.9114 - 0.1684 \times exclaim\_mess\)

Exercise 2

What is the probability the email is spam if it contains 10 exclamation points?

Use R as a calculator to calculate the predicted probability (do not use predict)

First, calculate the log odds by plugging in 10.

-1.91 - .168*10
[1] -3.59
exp(-3.59) / (1 + exp(-3.59))
[1] 0.02685712

Now, exponentiate! Hint: You can use exp in R

exp(-3.59) / (1 + exp(-3.59))
[1] 0.02685712

We will pick up here on Wednesday!

We can use the predict function in R to produce the probability as well.

predict(spam_model, data.frame(exclaim_mess = 10), type="response")
         1 
0.02672713 

Note: type = response exponentiates for you, and gives you the actual probability instead of the log odds!

Exercise 3

We have the probability an email is spam, but ultimately we want to use the probability to classify an email as spam or not spam. Therefore, we need to set a decision-making threshold, such that an email is classified as spam if the predicted probability is greater than the threshold and not spam otherwise.

Suppose you are a data scientist working on a spam filter. You must determine how high the predicted probability must be before you think it would be reasonable to call it spam and put it in the junk folder (which the user is unlikely to check).

What are some tradeoffs you would consider as you set the decision-making threshold? Discuss with your neighbor.

Add Response

spam_model_plot <- logistic_reg() |>
  set_engine("glm") |>
  fit(spam ~ exclaim_mess , data = email, family = binomial)

email_pred <- email |>
  mutate(pred_prob = predict(spam_model_plot$fit, type = "response"))

ggplot(data = email_pred) + 
  geom_point(aes(x = exclaim_mess, y = as.numeric(spam) -1, 
                        color = spam)) + 
  geom_line(aes(x = exclaim_mess, y = pred_prob)) + 
  labs(x = "Number of exclamation points", 
       y = "Predicted probability an email is spam", 
       color = "Is email spam?"
       )