What does the summarize() function do?
It reduces many rows down to single summary statistics (e.g., means)
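As a quick illustration of that answer, here is a minimal sketch using the built-in iris data. The object name iris_summary and the column names mean_sepal_length and n_flowers are made up for this example.

```r
library(dplyr)

# Collapse all 150 rows of iris into a single row of summary statistics.
# iris_summary, mean_sepal_length, and n_flowers are illustrative names.
iris_summary <- iris %>%
  summarize(
    mean_sepal_length = mean(Sepal.Length),  # one mean for the whole column
    n_flowers = n()                          # number of rows that were collapsed
  )

iris_summary
```

The result has one row and only the two columns named inside summarize().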
Name one of the two reasons we talked about why tidy data is a good thing.
1. It matches the way we think about data, which makes it easier to read.
2. It makes data easier to work with, particularly with functions from the dplyr and ggplot2 packages.
Which tidyr function reshapes long data to a wider format?
pivot_wider()
Say I had the output below, assigned to an object called my_object. What code could turn the column "attribute_dimension" into two columns: "attribute" and "dimension"?
Species | attribute_dimension | measurement
setosa Sepal.Length 5.1
setosa Sepal.Width 3.5
... ... ...
my_object %>%
separate(
col = attribute_dimension,
into = c("attribute", "dimension")
)
What does the group_by() function do?
It tells R that the dplyr functions that follow (e.g., summarize() or mutate()) should be run separately for each unique combination of values in the group_by() variables.
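One way to see the grouping at work is to follow group_by() with mutate() rather than summarize(). This is only a sketch on the built-in iris data; iris_with_means and species_mean are illustrative names.

```r
library(dplyr)

# Because of group_by(), mean() is computed once per Species, not once overall.
# iris_with_means and species_mean are illustrative names.
iris_with_means <- iris %>%
  group_by(Species) %>%
  mutate(species_mean = mean(Sepal.Length)) %>%
  ungroup()  # drop the grouping so later steps run on the whole table again

# All 150 rows are kept; each row carries the mean of its own Species.
distinct(iris_with_means, Species, species_mean)
```

The final line should show three rows, one per Species, each with its own mean.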
What common situation clearly gives away that data are messy?
When the variable named in the column header(s) does not match the variable stored in the column's cells.
Which tidyr function reshapes wide data to a longer format?
pivot_longer()
Say you have my_object (output below), and you want to combine the phone_number_1 and phone_number_2 columns into a single column, with no character separating the cell values. What code could you use?
participant | phone_number_1 | phone_number_2
1 (555) 634- 8465
2 (555) 592- 1985
3 (555) 487- 7896
my_object %>%
unite(
col = "phone_number",
phone_number_1:phone_number_2,
sep = ""
)
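What makes the summarize() function different from mutate()?
summarize() creates a new tibble containing only the columns you create (plus any grouping variables), with the rows collapsed to one per group. mutate() keeps every existing row and column and adds the new columns alongside them.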
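The difference is easy to check by comparing the dimensions of the two results. A minimal sketch on the built-in iris data, where with_mutate, with_summarize, and max_width are illustrative names:

```r
library(dplyr)

# Same calculation, two different verbs.
# with_mutate, with_summarize, and max_width are illustrative names.
with_mutate <- iris %>%
  mutate(max_width = max(Sepal.Width))

with_summarize <- iris %>%
  summarize(max_width = max(Sepal.Width))

dim(with_mutate)     # 150 rows, 6 columns: the original 5 plus max_width
dim(with_summarize)  # 1 row, 1 column: only max_width
```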
What is the difference between long and wide data formats?
Wide data has more data spread out into columns, whereas long data has fewer columns but more data spread out into rows.
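To make the contrast concrete, here is a small sketch with made-up data: two people whose height was recorded in two years. The data values and the names heights_wide, heights_long, measure, and height_cm are all hypothetical.

```r
library(dplyr)
library(tidyr)

# Wide: one row per person, one column per measurement occasion.
# All values and names here are hypothetical.
heights_wide <- data.frame(
  person = c("A", "B"),
  height_2022 = c(170, 165),
  height_2023 = c(171, 166)
)

# Long: the same four heights, stacked into rows instead of columns.
heights_long <- heights_wide %>%
  pivot_longer(
    cols = -person,
    names_to = "measure",
    values_to = "height_cm"
  )

dim(heights_wide)  # 2 rows, 3 columns
dim(heights_long)  # 4 rows, 3 columns
```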
Which tidyr function breaks a column into 2 new columns, based on a character value?
separate()
Sketch out, on a piece of paper, what the output of the following code would look like:
iris %>%
pivot_longer(
cols = -Species,
names_to = "flower_attribute",
values_to = "dimension"
)
Species | flower_attribute | dimension
setosa Sepal.Length 5.1
setosa Sepal.Width 3.5
... ... ...
What makes the summarize() function similar to mutate()?
summarize() creates new columns with the argument column_name = column_content.
What are the three principles of tidy data?
* Each variable must have its own column.
* Each observation must have its own row.
* Each value must have its own cell.
Which tidyr function takes 2 columns and combines them into a single column, where the values of each column are separated by a character value?
unite()
What code would reshape the format of the object my_sportsball (output below) to a longer format with the columns: Player | Variable | Season_Average ?
Player | Points | Assists | Rebounds
A 14.7 3.27 7.9
B 10.1 3.65 12.3
C 20.1 2.98 17.3
my_sportsball %>%
pivot_longer(
cols = -Player,
names_to = "Variable",
values_to = "Season_Average"
)
What output should I get if I were to run the following code:
iris %>%
group_by(Species) %>%
summarize(smallest_value = min(Sepal.Width))
A tibble with three rows (one per Species) and two columns: Species, which indicates the flower species, and smallest_value, which shows the smallest Sepal.Width for each Species.
Long-formatted data guarantees tidy data. Why or why not?
Why not: tidy data only needs to be long if there are multiple observations of the same variable per unit, for example the same person measured on their height three different times. But if, say, a person was measured on height, bicep width, and number of visits to the gym this week, these are each different variables. Therefore, they should each get their own column, meaning this dataset should stay in a wide format.
Which tidyr function breaks the values of a column into multiple rows, based on a character value?
separate_rows()
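For reference, a minimal sketch of separate_rows() on made-up data, where one cell holds several comma-separated values. The data and the names orders and orders_long are hypothetical.

```r
library(dplyr)
library(tidyr)

# Customer A's cell holds two items separated by a comma.
# The data and object names are hypothetical.
orders <- data.frame(
  customer = c("A", "B"),
  items = c("apple,pear", "kiwi")
)

# Split the items column on "," so that each item gets its own row.
orders_long <- orders %>%
  separate_rows(items, sep = ",")

orders_long
```

Customer A ends up with two rows (apple, pear) and customer B with one (kiwi).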
What code would reshape the format of the object my_sportsball (output below) to a wider format with the columns: Player | Points | Assists | Rebounds ?
Player | Variable | Season_Average
A Points 14.7
A Assists 3.27
A Rebounds 7.94
my_sportsball %>%
pivot_wider(
names_from = Variable,
values_from = Season_Average
)