HW 4: Recreating R Output + Chi-Square

Homework-Solutions

Packages

library(tidyverse)
library(MASS)

survey <- na.omit(survey)

Data

For this homework, we are going to be using the survey data set from the MASS package. This data frame contains the responses of 237 Statistics I students at the University of Adelaide to a number of questions. You can find the variable names and descriptions by pulling up the help file for survey.

Exercise 1

In this exercise, we are going to investigate the relationship between how students express their height and their age. In this survey, students were asked to report their height, and the researcher recorded whether they expressed it in imperial (feet/inches) or metric (centimeters/meters) units. The researcher wanted to know whether mean age differs between students who express their height in imperial units and those who use metric units.

Justify, in as much detail as possible, why it is appropriate to use difference in means methodology to analyze these data.

Answer We are using difference in means because we are working with a categorical explanatory variable with 2 levels (feet/inches vs centimeters/meters), and a quantitative response variable.

Set up the null AND alternative hypothesis, using proper notation, for the difference in means test described above. Note, you only need to write this out in proper notation (not words).

\(H_0: \mu_\text{f/i} = \mu_\text{c/m}\)

\(H_a: \mu_\text{f/i} \neq \mu_\text{c/m}\)

Suppose that the assumptions to conduct a theory based test were met. The researchers used the function t.test to calculate the appropriate p-value. The results can be seen below.


    Welch Two Sample t-test

data:  Age by M.I
t = 1.2452, df = 65.081, p-value = 0.2175
alternative hypothesis: true difference in means between group Imperial and group Metric is not equal to 0
95 percent confidence interval:
 -0.9451082  4.0753596
sample estimates:
mean in group Imperial   mean in group Metric 
              21.45836               19.89324
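The call that produced this output is not shown. A minimal sketch, assuming it used the formula interface on the NA-omitted survey data (the `Age by M.I` line in the output suggests this; the object names `survey_complete` and `welch` are ours):

```r
library(MASS)  # provides the survey data set

# Assumption: same preprocessing as in the Packages section (drop rows with any NA)
survey_complete <- na.omit(survey)

# Welch two-sample t-test; t.test() defaults to var.equal = FALSE
welch <- t.test(Age ~ M.I, data = survey_complete)

welch$statistic  # t statistic
welch$parameter  # Welch degrees of freedom
welch$p.value    # two-sided p-value
```

Storing the result lets us pull out the individual pieces that the later parts ask us to recreate by hand.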

Question: Using the survey data set, recreate the sample means (it’s okay if they are rounded) for group Imperial and group Metric. Report these means in a 2x2 table by adding to the existing pipeline that starts by filtering out all NA values for the expressed height variable.

survey |>
  filter(!is.na(M.I)) |>
  group_by(M.I) |>
  summarize(means = mean(Age))

# A tibble: 2 × 2
  M.I      means
  <fct>    <dbl>
1 Imperial  21.5
2 Metric    19.9

Now, report the standard deviation and sample size for each group by adding to the code below.

survey |>
  filter(!is.na(M.I)) |>
  group_by(M.I) |>
  summarize(sds = sd(Age), 
            count = n())

# A tibble: 2 × 3
  M.I        sds count
  <fct>    <dbl> <int>
1 Imperial  9.25    58
2 Metric    3.37   110

Use the information from parts c and d to recreate the t-statistic that was calculated using t.test. Show your work.

\(t = \frac{(21.46 - 19.89) - 0}{\sqrt{\frac{9.25^2}{58} + \frac{3.37^2}{110}}} \approx 1.25\)
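The same arithmetic can be checked in R. This is a sketch built from the summary statistics reported above; the variable names are ours, the means are taken from the t.test output, and the standard deviations are rounded, so the result only matches the reported t to about two decimals.

```r
# Summary statistics reported above (standard deviations rounded to two decimals)
mean_imp <- 21.45836; sd_imp <- 9.25; n_imp <- 58
mean_met <- 19.89324; sd_met <- 3.37; n_met <- 110

# Standard error of the difference in means (unpooled, as in the Welch test)
se_diff <- sqrt(sd_imp^2 / n_imp + sd_met^2 / n_met)

# Test statistic under H0: difference in means = 0
t_stat <- (mean_imp - mean_met - 0) / se_diff
round(t_stat, 2)
```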

Using the pt() function, recreate the p-value calculated in the t.test function. Hint: Think about your alternative hypothesis.

pt(1.2452, df = 65.081, lower.tail = FALSE) * 2

[1] 0.2175231

The book used in 511 suggests estimating the degrees of freedom as \(\min(n_1 - 1, n_2 - 1)\). Is our estimate of the degrees of freedom the same as what t.test calculates? Justify your answer by reporting the degrees of freedom estimated by the book’s formula.

Answer No. The degrees of freedom the book has us calculate is more conservative: it is the minimum of \(n_1 - 1\) and \(n_2 - 1\). Based on this, our degrees of freedom would be \(\min(58 - 1, 110 - 1) = 57\), compared with the 65.081 that t.test reports.
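For comparison, t.test with its default settings uses the Welch-Satterthwaite approximation for the degrees of freedom. Below is a sketch of both calculations from the rounded summary statistics above. The variable names are ours, and because the standard deviations are rounded, the Welch value only approximately matches 65.081.

```r
# Group standard deviations and sample sizes reported above
sd_imp <- 9.25; n_imp <- 58
sd_met <- 3.37; n_met <- 110

# Estimated variance of each sample mean
v_imp <- sd_imp^2 / n_imp
v_met <- sd_met^2 / n_met

# Welch-Satterthwaite approximation
df_welch <- (v_imp + v_met)^2 /
  (v_imp^2 / (n_imp - 1) + v_met^2 / (n_met - 1))

# Conservative rule from the book
df_book <- min(n_imp - 1, n_met - 1)

c(welch = df_welch, book = df_book)
```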

Without doing the calculation, justify if the p-value would increase, decrease, or stay the same, if we calculated it using the degrees of freedom from part g.

Answer Our p-value, using a t-distribution with 57 degrees of freedom, would be larger than the p-value calculated from a t-distribution with ~65 degrees of freedom. A t-distribution with fewer degrees of freedom has heavier tails, so there is more area beyond the same t-statistic.
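Although the question asks for a justification without calculation, the claim is easy to confirm afterwards. A quick sketch using the t-statistic from the output above (the names `p_welch` and `p_book` are ours):

```r
t_stat <- 1.2452

# Two-sided p-values: double the upper-tail area beyond the t-statistic
p_welch <- 2 * pt(t_stat, df = 65.081, lower.tail = FALSE)
p_book  <- 2 * pt(t_stat, df = 57, lower.tail = FALSE)

c(welch = p_welch, book = p_book)

# TRUE if the conservative degrees of freedom give the larger p-value
p_book > p_welch
```

The difference is small here because 57 and 65 degrees of freedom give very similar t-distributions.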

Exercise 2

In this exercise, we are again going to use the survey data set. We are going to test if a student’s smoking status is independent of how often they exercise. You can find the variable names and descriptions by pulling up the help file for survey.

Make an observed table (tibble output) that counts the frequency of every exercise by smoking combination. Your answer should be an 11 x 3 tibble. Add to the existing pipeline that filters out missing values for smoking when reporting your tibble.

survey |>
  filter(!is.na(Smoke)) |>
  group_by(Smoke, Exer) |>
  summarise(count = n())

`summarise()` has grouped output by 'Smoke'. You can override using the
`.groups` argument.

# A tibble: 11 × 3
# Groups:   Smoke [4]
   Smoke Exer  count
   <fct> <fct> <int>
 1 Heavy Freq      4
 2 Heavy Some      3
 3 Never Freq     65
 4 Never None     12
 5 Never Some     57
 6 Occas Freq      8
 7 Occas None      1
 8 Occas Some      4
 9 Regul Freq      8
10 Regul None      1
11 Regul Some      5

Below is a table of expected counts. These are the counts that we would expect to see if exercise and smoking status were independent.

           survey$Smoke
survey$Exer     Heavy    Never    Occas    Regul
       Freq 3.5416667 67.79762 6.577381 7.083333
       None 0.5833333 11.16667 1.083333 1.166667
       Some 2.8750000 55.03571 5.339286 5.750000

Question: Using the values from a, recreate the expected count of those who smoke heavily and exercise frequently. Show your work.

\(\text{row total for Freq} = 4 + 65 + 8 + 8 = 85\)

\(\text{column total for Heavy} = 4 + 3 + 0 = 7\)

\(\text{sample size} = 168\)

\(\text{expected count for Heavy and Freq} = \frac{85 \times 7}{168} \approx 3.54\)
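The whole expected table can be rebuilt the same way. This is a sketch that types in the observed counts from part a by hand; the object names are ours, and the Heavy/None cell is entered as 0 because that combination has no row in the tibble.

```r
# Observed counts from part a (rows = Exer, columns = Smoke)
observed <- matrix(
  c(4, 65, 8, 8,
    0, 12, 1, 1,
    3, 57, 4, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(Exer = c("Freq", "None", "Some"),
                  Smoke = c("Heavy", "Never", "Occas", "Regul"))
)

# Expected count = (row total * column total) / sample size, for every cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# The cell worked out by hand above
expected["Freq", "Heavy"]
```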

Are you justified in using theory-based procedures to carry out a chi-square test of independence? Justify your answer.

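Answer No. Several expected counts are smaller than 5 (for example, 0.58 for Heavy/None and 1.08 for Occas/None), so the chi-square approximation may not be reliable.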
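To see how many cells are affected, we can check the expected-count table directly. This is a sketch that copies the expected counts from the table above by hand; the object names are ours.

```r
# Expected counts copied from the table above (rows = Exer, columns = Smoke)
expected <- matrix(
  c(3.5416667, 67.79762, 6.577381, 7.083333,
    0.5833333, 11.16667, 1.083333, 1.166667,
    2.8750000, 55.03571, 5.339286, 5.750000),
  nrow = 3, byrow = TRUE,
  dimnames = list(Exer = c("Freq", "None", "Some"),
                  Smoke = c("Heavy", "Never", "Occas", "Regul"))
)

# Flag cells below the usual "expected count of at least 5" guideline
small_cells <- expected < 5

sum(small_cells)                    # number of cells below 5
which(small_cells, arr.ind = TRUE)  # their row and column positions
```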

Regardless of your answer, the results from a chi-square test of independence can be seen below…


    Pearson's Chi-squared test

data:  survey$Exer and survey$Smoke
X-squared = 1.7861, df = 6, p-value = 0.9383

Question: Please explain how df = 6 was calculated when carrying out this test.

Answer We calculated df = 6 by taking the number of levels of the exercise variable - 1 (3-1), multiplied by the number of levels for the smoking variable - 1 (4-1). 2 * 3 = 6.

Using the results from the output in part d, report the test statistic, p-value, and distribution the p-value was calculated from. Be specific.

Answer Our test statistic is 1.7861 with an associated p-value of 0.9383 from a chi-square distribution with 6 degrees of freedom.
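These three numbers can also be recreated by hand. A sketch, assuming the observed counts from part a are typed in manually (object names are ours; Heavy/None is 0 because it has no row in the tibble):

```r
# Observed counts from part a (rows = Exer, columns = Smoke)
observed <- matrix(
  c(4, 65, 8, 8,
    0, 12, 1, 1,
    3, 57, 4, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(Exer = c("Freq", "None", "Some"),
                  Smoke = c("Heavy", "Never", "Occas", "Regul"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df_chi <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail area of the chi-square distribution
p_value <- pchisq(chi_sq, df = df_chi, lower.tail = FALSE)

c(statistic = chi_sq, df = df_chi, p.value = p_value)
```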