Formatting + Summary Statistics

Lecture 3

Dr. Elijah Meyer

NC State University
ST 511 - Fall 2024

2024-08-26

Checklist

– Are you on Slack?

If replying to a comment, please click reply instead of making a separate post

– Did you download today’s activities? (there are two)

– Can you make a PDF?

    Run tinytex::install_tinytex() in your Console and restart R

– Are you keeping up with the prepare material?

– Quiz Wednesday afternoon (Due Sunday on Moodle)

– HW-1 Thursday (Due following Thursday on Gradescope)

Announcements

Having trouble accessing the WorkBench?

WorkBench can only have 150 users (not active users)

Terry from IT: “They replied yes to a 200 license but I changed my mind and asked for an unlimited so we don’t have monitor the users as much”

Announcements

Having internet issues? Me too…

If you are having trouble with the internet during class, take handwritten notes as you follow along with the activity, then do the activity outside of class

Questions / Comments / Concerns

Warm up

What is a function?

What is an R package?

Warm up

Function - A function is a set of statements organized together to perform a specific task. It’s an action that we can call to perform.

Example

glimpse

library(tidyverse)

glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
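
As a rough illustration of the definition above, here is a minimal sketch of writing and calling a function of your own. The name range_width is made up for this example and is not part of any package.

```r
# Hypothetical helper (not from any package): the distance between the
# largest and smallest value in a numeric vector.
range_width <- function(x) {
  max(x) - min(x)
}

# Calling the function performs its task on the mpg column of mtcars
range_width(mtcars$mpg)
```

Just like glimpse(), the function is an action we call by name, with the input placed inside the parentheses.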

Warm up

R packages are extensions to the base R statistical programming language

Warm up

What do we call the following? Can we write text in it? If so, how?

Warm up

Shortcut for Windows: ctrl + alt + I

Shortcut for Mac: cmd + option + I

Warm up

What’s wrong?

Warm up

Make sure that your code chunks have the three tick marks left aligned, or your code chunk won’t close.

Formatting in Quarto

Goals

– Section headers

– Bold / Italicize

– Insert tables

– Output control

Summary statistics

Question

How do we summarize quantitative variables? How do we summarize categorical variables?

Summary statistics

\[ mean =\bar{x} = \frac{\sum_{i=1}^{n}{x_i}}{n} \]

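Median = The middle number when the observations are sorted from smallest to largest (with an even number of observations, the average of the two middle numbers)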

There is no widely accepted standard notation for the median

\[ proportion = \hat{p} = \frac{success}{total} \]
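
A minimal sketch of these three summaries using the built-in mtcars data. Treating a manual transmission (am equal to 1) as the "success" is an assumption made only for this example.

```r
# Quantitative variable: miles per gallon
x <- mtcars$mpg
n <- length(x)

mean_by_hand <- sum(x) / n   # the formula for x-bar written out
mean_builtin <- mean(x)      # the built-in equivalent

median_mpg <- median(x)      # middle value of the sorted observations

# Categorical variable: am (0 = automatic, 1 = manual).
# Assumption for this sketch: a manual transmission counts as a "success".
p_hat <- sum(mtcars$am == 1) / nrow(mtcars)

mean_by_hand
median_mpg
p_hat
```

The mean and median summarize the quantitative variable, while the proportion summarizes the categorical one.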

Other statistics

– Standard deviation

– IQR

Standard deviation

Numerical summary of how spread out your observations are from the center (mean)

\[ sd = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} \]
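
To connect the formula to R, the sketch below computes the standard deviation of mpg step by step and compares it with the built-in sd(). The variable names are chosen only for illustration.

```r
x <- mtcars$mpg
n <- length(x)

# Distance of each observation from the mean
deviations <- x - mean(x)

# Square the deviations, add them up, divide by n - 1, take the square root
sd_by_hand <- sqrt(sum(deviations^2) / (n - 1))

# Built-in function; it also divides by n - 1
sd_builtin <- sd(x)

all.equal(sd_by_hand, sd_builtin)
```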

IQR

Measures the spread of your data.

Q3 - Q1

75th percentile - 25th percentile
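
A short sketch of the same subtraction in R, again on the mpg column of mtcars. Both quantile() and IQR() use R's default quantile method here, so the two results should agree.

```r
x <- mtcars$mpg

q1 <- quantile(x, 0.25)   # 25th percentile
q3 <- quantile(x, 0.75)   # 75th percentile

# unname() drops the "75%" label that quantile() attaches to its result
iqr_by_hand <- unname(q3 - q1)

# Built-in function for the same quantity
iqr_builtin <- IQR(x)

all.equal(iqr_by_hand, iqr_builtin)
```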

We can calculate all of these (and more) in R

The tidyverse pipe

|> is called a pipe operator (this is R's native pipe, available since R 4.1; the tidyverse's %>% works similarly)

This is used to emphasize a sequence of coding actions

“and then”
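
As a small sketch of the "and then" idea, the two versions below do the same work: one nests the function calls, the other pipes from one step to the next. It assumes R 4.1 or later, where |> is available.

```r
# Nested: read from the inside out
nested <- round(mean(mtcars$mpg), 1)

# Piped: take mpg, and then take the mean, and then round to 1 decimal
piped <- mtcars$mpg |>
  mean() |>
  round(1)

identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is easier to follow as the sequence grows.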

The tidyverse pipe

Compare

library(tidyverse)

glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

vs

library(tidyverse)

mtcars |>
  glimpse()
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

New functions

group_by()

summarise()

n()

mean(); median(); sd() …etc.
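
The functions listed above can be combined in one pipeline. The sketch below is one possible example, not the in-class activity: it groups mtcars by number of cylinders and summarizes mpg within each group. The column names count, mean_mpg, median_mpg, and sd_mpg are chosen only for illustration.

```r
library(tidyverse)

mpg_by_cyl <- mtcars |>
  group_by(cyl) |>              # one group per number of cylinders
  summarise(
    count = n(),                # number of cars in each group
    mean_mpg = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg = sd(mpg)
  )

mpg_by_cyl
```

The result has one row per group, with one column for each summary statistic requested inside summarise().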

Practice

In summary

– We use the pipe operator when we are writing a sequence of actions

– group_by() groups our data and allows us to create summary statistics on the grouped data

– summarise() allows us to calculate summary statistics!

Plots

What types of plots can we make?

Golden Rule: We let the type of variable(s) dictate the appropriate plot

  • Quantitative

  • Categorical

Pick a plot

What plot is appropriate to graph each of the following scenarios?

– One quantitative variable

– One quantitative variable; one categorical variable

– Two quantitative variables

– One categorical variable

– Two categorical variables

– Scatter plot

– Histogram

– Bar plot

– Segmented bar plot

– Box plot

Scatter plot

Two quantitative variables

data |>
  ggplot(aes(x = x_var, y = y_var)) + # x_var, y_var: placeholders for your two quantitative variables
  geom_point()

Histogram

One quantitative variable

data |>
  ggplot(aes(x = x_var)) + # x_var: placeholder for your quantitative variable
  geom_histogram()

Bar plot

One categorical variable

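data |>
  ggplot(aes(x = cat_var)) + # cat_var: placeholder for your categorical variable
  geom_bar() # or geom_col() if the counts are already a column (it also needs a y aesthetic)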

Segmented bar plot

Two categorical variables

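data |>
  ggplot(aes(x = cat_var1, fill = cat_var2)) + # placeholders: the second categorical variable fills the segments
  geom_bar()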

Box plot

One quantitative variable; one categorical variable

data |>
  ggplot(aes(x = cat_var, y = quant_var)) + # placeholders for your categorical and quantitative variables
  geom_boxplot()
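
To make the five templates concrete, here is a sketch that builds each plot type from mtcars. The variable choices (wt, mpg, cyl, am), the object names, and the converted data frame mtcars_cat are assumptions made for this example only.

```r
library(tidyverse)

# cyl and am are stored as numbers in mtcars; treat them as categorical
mtcars_cat <- mtcars |>
  mutate(cyl = factor(cyl), am = factor(am))

# Two quantitative variables: scatter plot
p_scatter <- mtcars_cat |>
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

# One quantitative variable: histogram (bins = 10 is an arbitrary choice)
p_hist <- mtcars_cat |>
  ggplot(aes(x = mpg)) +
  geom_histogram(bins = 10)

# One categorical variable: bar plot
p_bar <- mtcars_cat |>
  ggplot(aes(x = cyl)) +
  geom_bar()

# Two categorical variables: segmented bar plot (second variable as fill)
p_segmented <- mtcars_cat |>
  ggplot(aes(x = cyl, fill = am)) +
  geom_bar()

# One quantitative and one categorical variable: box plot
p_box <- mtcars_cat |>
  ggplot(aes(x = cyl, y = mpg)) +
  geom_boxplot()

# Print any of the objects to draw it
p_scatter
```

Notice that the bar plot and the segmented bar plot use the same geom; the only difference is the extra fill aesthetic for the second categorical variable.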