Summarizing and Tidying Data

Grouping and Summarizing

Tidy Data Concepts

Identifying Tidyr Functions

Using Tidyr Functions

100

What does the summarize() function do?

It reduces many rows down to single summary statistics (e.g., means)
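A minimal sketch of that answer, assuming the dplyr package is installed and using the built-in iris data; the column name mean_sepal_length is made up for illustration:

```r
library(dplyr)

# iris ships with R: 150 rows, one per flower
result <- iris %>%
  summarize(mean_sepal_length = mean(Sepal.Length))

# All 150 rows collapse into a single row holding one summary statistic
result
```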

100

Name one of the two reasons we talked about why tidy data is a good thing.

1. It matches the way we think about data, which makes it easier to read.

2. It makes data easier to work with, particularly with functions from the dplyr and ggplot2 packages.

100

Which tidyr function reshapes long data to a wider format?

pivot_wider()

100

Say I had the output below, assigned to an object called my_object. What code could turn the column "attribute_dimension" into two columns: "attribute" and "dimension"?

Species | attribute_dimension | measurement

setosa Sepal.Length 5.1
setosa Sepal.Width 3.5
... ... ...

my_object %>%
  separate(
    col = attribute_dimension,
    into = c("attribute", "dimension")
  )

200

What does the group_by() function do?

It tells R that any function that follows should be run separately on each unique value of the group_by variables.
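As a rough illustration, assuming dplyr is installed, the sketch below groups iris by Species so that the summary functions that follow run once per species; the names n_flowers and mean_petal_length are invented for this example:

```r
library(dplyr)

by_species <- iris %>%
  group_by(Species) %>%
  summarize(
    n_flowers = n(),                        # rows counted within each species
    mean_petal_length = mean(Petal.Length)  # mean computed within each species
  )

# One row per unique value of Species
by_species
```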

200

What common situation clearly gives away that data are messy?

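When the variable described by the column name(s) does not match the variable stored in the column's cells (for example, when the column names are themselves values of another variable).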
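One way to picture this, as a sketch with entirely hypothetical data (the object messy and the columns country and cases are invented here): the column names 2019 and 2020 are really values of a year variable, while the cells hold case counts.

```r
library(dplyr)
library(tidyr)

# Hypothetical messy data: the column names `2019` and `2020` are values of "year"
messy <- tibble(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22)
)

# Move the year values out of the column names and into their own column
tidy <- messy %>%
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "cases"
  )

tidy
```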

200

Which tidyr function reshapes wide data to a longer format?

pivot_longer()

200

Say you have my_object (output below), and you want to combine the phone_number_1 and phone_number_2 columns into a single column, with no character separating the cell values. What code could you use?

participant | phone_number_1 | phone_number_2
1 | (555) 634- | 8465
2 | (555) 592- | 1985
3 | (555) 487- | 7896

my_object %>%
  unite(
    col = "phone_number",
    phone_number_1:phone_number_2,
    sep = ""
  )

300

What makes the summarize function different from mutate()?

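summarize() creates a new tibble containing only the columns you create (plus any group_by columns), collapsed to one row per group. mutate() keeps every existing row and column and adds the new columns to them.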
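A small side-by-side sketch of the difference, assuming dplyr is installed; the column name mean_width is made up for illustration:

```r
library(dplyr)

# summarize(): keeps only the grouping column and the new column
summarized <- iris %>%
  group_by(Species) %>%
  summarize(mean_width = mean(Sepal.Width))

# mutate(): keeps all original rows and columns, and adds the new column
mutated <- iris %>%
  group_by(Species) %>%
  mutate(mean_width = mean(Sepal.Width)) %>%
  ungroup()

dim(summarized)  # 3 rows, 2 columns
dim(mutated)     # 150 rows, 6 columns
```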

300

What is the difference between long and wide data formats?

Wide data has more data spread out into columns, whereas long data has fewer columns but more data spread out into rows.

300

Which tidyr function breaks a column into 2 new columns, based on a character value?

separate()

300

Sketch out, on a piece of paper, what the output of the following code would look like:

iris %>%
  pivot_longer(
    cols = -Species,
    names_to = "flower_attribute",
    values_to = "dimension"
  )

Species | flower_attribute | dimension

setosa Sepal.Length 5.1
setosa Sepal.Width 3.5
... ... ...

400

What makes the summarize function similar to mutate()?

summarize() creates new columns with the argument column_name = column_content.

400

What are the three principles of tidy data?

* Each variable must have its own column.

* Each observation must have its own row.

* Each value must have its own cell.

400

Which tidyr function takes 2 columns and combines them into a single column, where the values of each column are separated by a character value?

unite()

400

What code would reshape the format of the object my_sportsball (output below) to a longer format with the columns: Player | Variable | Season_Average ?

Player | Points | Assists | Rebounds
A 14.7 3.27 7.9
B 10.1 3.65 12.3
C 20.1 2.98 17.3

my_sportsball %>%
  pivot_longer(
    cols = -Player,
    names_to = "Variable",
    values_to = "Season_Average"
  )

500

What output should I get if I were to run the following code:

iris %>%
  group_by(Species) %>%
  summarize(smallest_value = min(Sepal.Width))

A tibble with three rows (one per Species) and two columns: Species, which indicates the flower species, and smallest_value, which shows the smallest Sepal.Width for each Species.

500

Long-formatted data guarantees tidy data. Why or why not?

Why not: tidy data only needs to be long if there are multiple observations of the same variable per unit, for example the same person measured on their height three different times. But if, say, a person was measured on height, bicep width, and number of visits to the gym this week, these are each different variables. Therefore, they should get their own columns, meaning this dataset should stay in a wide format.

500

Which tidyr function breaks the values of a column into multiple rows, based on a character value?

separate_rows()
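A minimal sketch of separate_rows() in action, assuming dplyr and tidyr are installed; the object pets and its columns are hypothetical data invented for this example:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one cell holds several comma-separated values
pets <- tibble(
  owner = c("Ana", "Ben"),
  pet = c("cat,dog", "fish")
)

# Split the pet column on "," so each value gets its own row
long_pets <- pets %>%
  separate_rows(pet, sep = ",")

long_pets
```

The owner value is repeated on each new row, so Ana appears twice and Ben once.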

500

What code would reshape the format of the object my_sportsball (output below) to a wider format with the columns: Player | Points | Assists | Rebounds ?

Player | Variable | Season_Average
A Points 14.7
A Assists 3.27
A Rebounds 7.94

my_sportsball %>%
  pivot_wider(
    names_from = Variable,
    values_from = Season_Average
  )
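
The two my_sportsball questions are inverses of each other. The sketch below, assuming dplyr and tidyr are installed, rebuilds the wide table from the 400-point question by hand and checks that going longer and then wider returns the same shape; the intermediate names long_version and wide_again are invented here:

```r
library(dplyr)
library(tidyr)

# The wide table from the 400-point question, typed in by hand
my_sportsball <- tibble(
  Player = c("A", "B", "C"),
  Points = c(14.7, 10.1, 20.1),
  Assists = c(3.27, 3.65, 2.98),
  Rebounds = c(7.9, 12.3, 17.3)
)

# Wide to long: one row per Player-Variable pair
long_version <- my_sportsball %>%
  pivot_longer(
    cols = -Player,
    names_to = "Variable",
    values_to = "Season_Average"
  )

# Long back to wide: one column per Variable again
wide_again <- long_version %>%
  pivot_wider(
    names_from = Variable,
    values_from = Season_Average
  )

wide_again
```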