HW 1 - Summary Stats + Data visualization

Homework

Rubric

Important

This homework is due Sunday, Sep 8 at 11:59pm.

Logistics

Packages

Tips

Remember that developing a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be reminders in this assignment for you to Render your document. The last thing you want is to work through an entire assignment before realizing that an error somewhere prevents your document from rendering. Render after each completed question.

Note: Do not let R output answer the question for you unless the question specifically tells you to do so.

Note: You cannot earn more than 100% on this assignment.

Exercises

Data 1: Duke Forest houses

Note

Use the duke_forest dataset in the openintro package for Exercises 1 and 2.

For the following two exercises you will work with data on houses that were sold in the Duke Forest neighborhood of Durham, NC in November 2020. The duke_forest dataset comes from the openintro package. You can see a list of the variables on the package website or by running ?duke_forest in your console.

Exercise 1

We will first explore the duke_forest dataset by calculating a variety of summary statistics. Calculate the summary statistic associated with each scenario below. For this question, your code and its output are enough to earn full credit.

  1. Create a data frame that displays the mean house price across all categories of bedroom.

  2. Create a data frame that displays the minimum and maximum lot area, in acres. Name your columns min_lot and max_lot.

  3. Create a data frame that gives the number of homes for each combination of cooling system AND number of bathrooms. You will receive a bonus point if you use the functions arrange() and desc() to put your data frame in descending order of count. Name your count column n_count.
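
The three kinds of summaries above can be sketched on a different dataset. The example below is only an illustration of the dplyr functions involved; it uses R's built-in mtcars data instead of duke_forest, and the object and column names (other than n_count) are made up for the example.

```r
library(dplyr)

# Mean of one variable for each category of another variable
mean_mpg_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))

# Minimum and maximum of one variable, with explicitly named columns
wt_range <- mtcars |>
  summarize(
    min_wt = min(wt),
    max_wt = max(wt)
  )

# Number of rows for each combination of two variables,
# sorted from the largest count to the smallest
combo_counts <- mtcars |>
  count(cyl, gear, name = "n_count") |>
  arrange(desc(n_count))

mean_mpg_by_cyl
wt_range
combo_counts
```

If a column contains missing values, mean(), min(), and max() return NA unless you add the argument na.rm = TRUE.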

Exercise 2

Usually, we expect that within any market, larger houses will have higher prices. We can also expect a relationship between the age of a house and its price. However, in some markets newer houses might be more expensive, while in other markets antique houses will have ‘more character’ than newer ones and have higher prices. In this question, we will explore the relationships among the age, size, and price of houses.

Your family friend asks: “In Duke Forest, do houses that are bigger and more expensive tend to be newer than smaller and cheaper ones?”

Once again, data visualization skills to the rescue!

  • Create a scatter plot to explore the relationship between price and area, and also display information about year_built (that is, condition on year_built as your z variable).
  • Use size = 3 within the appropriate geom function used to make a scatter plot to make your points bigger.
  • Layer on geom_smooth() with the argument se = FALSE to add a smooth curve fit to the data, and color the points by year_built.
  • Include informative title, axis, and legend labels.
  • Discuss each of the following claims (1-2 sentences per claim). Use elements you observe in your plot as evidence for or against each claim.
    • Claim 1: Larger houses are priced higher.
    • Claim 2: Bigger and more expensive houses tend to be newer than smaller and cheaper ones.
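
As a rough sketch of how these plot layers fit together, the example below builds a similar plot from the built-in mtcars data instead of duke_forest. The variable choices and label text are illustrative assumptions, not the expected answer.

```r
library(ggplot2)

# Scatter plot of two numeric variables, with a third numeric
# variable shown through the color of the points
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = hp), size = 3) + # size = 3 makes the points bigger
  geom_smooth(se = FALSE) +               # smooth curve without the error band
  labs(
    title = "Fuel efficiency versus weight",
    subtitle = "Points colored by horsepower",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Horsepower"
  )

p
```

Mapping color inside geom_point() applies it to the points only, so the smooth curve is drawn once for all of the data.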

Data 2: BRFSS

Note

Use this dataset for Exercises 3 through 5.

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

Source: cdc.gov/brfss

In the following exercises we will work with data from the 2020 BRFSS survey. The data originally come from here, though we will work with a random sample of responses and a small number of variables from the data provided. These have already been sampled for you and the dataset you’ll use is called brfss.csv.

library(tidyverse) # loads readr, which provides read_csv()
brfss <- read_csv("https://st511-01.github.io/data/brfss.csv")

Exercise 3

  • How many rows are in the brfss dataset? What does each row represent?
  • How many columns are in the brfss dataset?
  • Include the code and resulting output used to support your answer.

Write a sentence along with your code to answer each question.
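
A few functions that report the size of a data frame are sketched below, using the built-in mtcars data as a stand-in for brfss. The object names n_rows and n_cols are made up for the example.

```r
library(dplyr)

# Number of rows and number of columns
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)

n_rows
n_cols

# glimpse() prints both dimensions along with each column's type
glimpse(mtcars)
```

The output alone does not say what a row represents; that part of the answer comes from knowing how the data were collected.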

Exercise 4

Do people who smoke more tend to have worse health conditions?

  • Use a segmented bar chart to visualize the relationship between smoking (smoke_freq) and general health (general_health). Put smoke_freq on the x-axis.
    • Below is sample code for releveling general_health. Here we first convert general_health to a factor (how R stores categorical data) and then order the levels from Excellent to Poor. The same is done to smoke_freq, with the ordering being from Not at all to Every day.
  • You will add to the existing pipeline (code) to make the segmented bar chart.
brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good", "Good", "Fair", "Poor"),
    smoke_freq = as.factor(smoke_freq),
    smoke_freq = fct_relevel(smoke_freq, "Not at all", "Some days", "Every day")
  ) # add a pipe here to start creating your bar chart
  • Include informative title, axis, and legend labels.
  • Comment on the motivating question based on evidence from the visualization: Do people who smoke more tend to have worse health conditions?
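
One common way to draw a segmented bar chart in ggplot2 is geom_bar() with position = "fill", which scales every bar to the same height so that the segments show proportions. The sketch below assumes that reading of "segmented" and uses the built-in mtcars data, with illustrative variable choices and labels, instead of brfss.

```r
library(dplyr)
library(ggplot2)

p <- mtcars |>
  mutate(
    cyl = as.factor(cyl),   # categorical variable for the x-axis
    gear = as.factor(gear)  # categorical variable for the bar segments
  ) |>
  ggplot(aes(x = cyl, fill = gear)) +
  geom_bar(position = "fill") + # each bar sums to 1
  labs(
    title = "Number of gears by number of cylinders",
    x = "Number of cylinders",
    y = "Proportion of cars",
    fill = "Number of gears"
  )

p
```

Because every bar has the same height, the proportions can be compared across groups even when the groups have very different sizes.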

Exercise 5

How are sleep and general health associated?

  • Create a visualization displaying the relationship between sleep and general_health.
  • Include informative title and axis labels.
  • Modify your plot to use a different theme than the default.
  • Comment on the motivating question based on evidence from the visualization: How are sleep and general health associated?
brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Poor", "Fair", "Good", 
                                 "Very good",  "Excellent")
  ) # add a pipe here to start creating your chart
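
Side-by-side boxplots are one option, among several, for showing a numeric variable across the levels of a categorical one. The sketch below uses the built-in mtcars data with illustrative labels instead of brfss, and it swaps the default theme for theme_minimal().

```r
library(dplyr)
library(ggplot2)

p <- mtcars |>
  mutate(cyl = as.factor(cyl)) |> # treat the grouping variable as categorical
  ggplot(aes(x = cyl, y = mpg)) +
  geom_boxplot() +
  labs(
    title = "Fuel efficiency by number of cylinders",
    x = "Number of cylinders",
    y = "Miles per gallon"
  ) +
  theme_minimal() # any complete theme other than the default works here

p
```

Other complete themes that ship with ggplot2 include theme_bw(), theme_classic(), and theme_light().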

Submission

  • Go to http://www.gradescope.com and click “Log in” in the top right corner.
  • Log in with your school credentials.
  • Click on your STA 511 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
  • Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.

Grading

  • Exercise 1: 6 points
  • Exercise 2: 10 points
  • Exercise 3: 4 points
  • Exercise 4: 10 points
  • Exercise 5: 9 points
  • Workflow + formatting: 5 points
  • Total: 44 points

Note

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:

  • linking all pages appropriately on Gradescope
  • putting your name in the YAML at the top of the document
  • following each pipe (%>% or |>) and each ggplot layer operator (+) with a new line
  • being consistent with stylistic choices, e.g., %>% vs. |>