class: center, middle, inverse, title-slide .title[ # Wrangling and Pivoting Data ] .subtitle[ ## IBS 519 - Week 4 - TA session ] .author[ ### Ashlyn Johnson ] .date[ ### 9/14/22 ] --- ## Agenda ### First Half of Class - Review of **`dplyr`** package - **Tidy data** and **pivoting** data frames ### Second Half of Class - Homework Questions - Additional practice (if time allows) --- class: inverse, center, middle # `dplyr` ### a grammar of data manipulation providing a consistent set of verbs ### last week, we learned... --- .pull-left[ ### `filter()` ] .pull-right[ subset a data frame for **rows** that satisfy your conditions ] -- .pull-left[ ### `mutate()` ] .pull-right[ adds new variables ] -- .pull-left[ ### `select()` ] .pull-right[ subset a dataframe for desired **columns** ] -- .pull-left[ ### `arrange()` ] .pull-right[ **orders** the rows of a data frame by the values of selected columns; default is ascending order, but can be paired with `desc()` to obtain descending order ] -- .pull-left[ ### `summarise()` ] .pull-right[ creates a new data frame with one column for each grouping variable and one column for each summary statistic ] --- class: center, middle, inverse ### No one programmer can remember everything! it is OK to google questions read help pages `?function()` ask for help R studio even makes [cheatsheets]("https://www.rstudio.com/resources/cheatsheets/") for their packages (use them!! I do!) --- ### Let's practice some data wrangling! To manipulate data, we'll need the tidyverse! Mainly `dplyr`, but I like to load everything in just in case. ```r library(tidyverse) ``` ``` ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ── ## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4 ## ✔ tibble 3.1.7 ✔ dplyr 1.0.9 ## ✔ tidyr 1.2.0 ✔ stringr 1.4.1 ## ✔ readr 2.1.2 ✔ forcats 0.5.1 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ``` --- ### We also need data to wrangle Let's borrow a dataset from the Tidy Tuesday repository. .center[ <img src="data:image/png;base64,#images/tidy_tuesday_screengrab.png" width="60%" /> ] --- We'll use a dataset from the American Kennel Club that includes various information about different dog breeds. ```r breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv') ``` -- ```r glimpse(breed_traits) ``` ``` ## Rows: 195 ## Columns: 17 ## $ Breed <chr> "Retrievers (Labrador)", "French Bulldogs… ## $ `Affectionate With Family` <dbl> 5, 5, 5, 5, 4, 5, 3, 5, 5, 5, 5, 3, 5, 4,… ## $ `Good With Young Children` <dbl> 5, 5, 5, 5, 3, 5, 5, 3, 5, 3, 3, 5, 5, 5,… ## $ `Good With Other Dogs` <dbl> 5, 4, 3, 5, 3, 3, 5, 3, 4, 4, 4, 3, 3, 3,… ## $ `Shedding Level` <dbl> 4, 3, 4, 4, 3, 1, 3, 3, 3, 2, 4, 3, 1, 2,… ## $ `Coat Grooming Frequency` <dbl> 2, 1, 2, 2, 3, 4, 2, 1, 2, 2, 2, 2, 5, 2,… ## $ `Drooling Level` <dbl> 2, 3, 2, 2, 3, 1, 1, 3, 2, 2, 1, 1, 1, 3,… ## $ `Coat Type` <chr> "Double", "Smooth", "Double", "Double", "… ## $ `Coat Length` <chr> "Short", "Short", "Medium", "Medium", "Sh… ## $ `Openness To Strangers` <dbl> 5, 5, 3, 5, 4, 5, 3, 3, 4, 4, 4, 3, 5, 4,… ## $ `Playfulness Level` <dbl> 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4,… ## $ `Watchdog/Protective Nature` <dbl> 3, 3, 5, 3, 3, 5, 2, 5, 4, 4, 5, 3, 5, 4,… ## $ `Adaptability Level` <dbl> 5, 5, 5, 5, 3, 4, 4, 4, 4, 4, 4, 3, 5, 3,… ## $ `Trainability Level` <dbl> 5, 4, 5, 5, 4, 5, 3, 5, 5, 4, 4, 5, 4, 4,… ## $ `Energy Level` <dbl> 5, 3, 5, 3, 3, 4, 4, 3, 5, 3, 4, 5, 4, 4,… ## $ `Barking Level` <dbl> 3, 1, 3, 1, 2, 4, 4, 1, 3, 5, 4, 3, 4, 3,… ## $ `Mental Stimulation Needs` <dbl> 4, 3, 5, 4, 3, 5, 4, 5, 5, 3, 4, 5, 4, 4,… ``` ??? But what are these traits exactly? ---
--- ###My dog Marvel is some sort of Chihuahua mix. Can I use his traits to narrow down what other breeds may contribute to his genetic make up? .pull-left[ <img src="data:image/png;base64,#images/marvel_at_rest.JPEG" width="70%" /> ] -- .pull-right[ ### Notable Traits: - Coat Length = Short - Trainability >= 4 - Energy >= 3 - Barking = 5 ] --- Which breeds have short hair, a trainability level greater than or equal to 3, an energy level greater than or equal to 3, and a barking level of 5? Let's use `filter()` for this. ```r marvels_traits <- breed_traits %>% filter(`Coat Length` == 'Short') %>% filter(`Trainability Level` >= 3) %>% filter(`Energy Level` >= 3) %>% filter(`Barking Level` == 5) ``` --
--- Since I know that Marvel is good with other dogs and affectionate with family, why don't I sort the dataframe by those qualities to see which dog breeds rise to the top? Let's use `arrange()` and `desc()` for this. ```r marvels_traits_2 <- marvels_traits %>% arrange(desc(`Good With Other Dogs`), desc(`Affectionate With Family`)) %>% select(Breed, `Good With Other Dogs`, `Affectionate With Family`) ``` --
--- In our dataset, there are two variables that deal with different aspects of coat. What if we wanted just one? We can use `mutate()` to create a new column. ```r breed_traits_2 <- breed_traits %>% mutate(Coat = paste(`Coat Type`, `Coat Length`, sep = "_")) %>% select(Breed, Coat) ``` --
--- .pull-left[ Finally, what if we wanted to know the mean trainability level of dogs with different types of coats? For this, we can use `group_by()` and `summarize()`. ```r breed_traits %>% mutate(Coat = paste(`Coat Type`, `Coat Length`, sep = "_")) %>% group_by(Coat) %>% summarize(mean_trainability = mean(`Trainability Level`, na.rm = TRUE)) ``` ] -- .pull-right[ ``` ## # A tibble: 19 × 2 ## Coat mean_trainability ## <chr> <dbl> ## 1 Corded_Long 4 ## 2 Corded_Medium 4 ## 3 Curly_Long 5 ## 4 Curly_Medium 3.8 ## 5 Double_Long 3.88 ## 6 Double_Medium 4.18 ## 7 Double_Short 3.82 ## 8 Hairless_Short 4.33 ## 9 Plott Hounds_Plott Hounds 0 ## 10 Rough_Long 3 ## 11 Rough_Medium 4 ## 12 Silky_Long 3 ## 13 Silky_Medium 4 ## 14 Smooth_Medium 4.4 ## 15 Smooth_Short 3.75 ## 16 Wavy_Long 4 ## 17 Wavy_Medium 3.4 ## 18 Wiry_Medium 3.89 ## 19 Wiry_Short 3.45 ``` ] --- class: inverse, center, middle # Tidy data and pivoting data frames --- # Tidy data *"Tidy datasets are all alike, but every dataset is messy in its own way."* - **Hadley Wickham** Tidy data have 3 rules: 1. Each variable must have its own column. 2. Each observation must have it's own row. 3. Each value must have it's own cell. --- ### What does it mean for data to not be tidy? ```r messy_data <- data.frame(independent_replicate = c(1:3), control_1 = rnorm(3, mean = 5, sd = 1), control_2 = rnorm(3, mean = 5, sd = 1), control_3 = rnorm(3, mean = 5, sd = 1), treatment_1 = rnorm(3, mean = 7, sd = 1), treatment_2 = rnorm(3, mean = 7, sd = 1), treatment_3 = rnorm(3, mean = 7, sd = 1)) messy_data ``` ``` ## independent_replicate control_1 control_2 control_3 treatment_1 treatment_2 ## 1 1 5.807225 6.097076 3.756127 5.733711 6.115535 ## 2 2 3.946806 4.946443 5.456057 7.694489 7.861499 ## 3 3 4.084722 5.516974 4.908542 7.828350 8.072564 ## treatment_3 ## 1 5.969955 ## 2 6.220383 ## 3 7.672997 ``` -- In this format (also known as wide format): - Difficult to compute summary statistics for each treatment group - The columns are not a variable - Multiple observations per row --- ### How do we make this data tidy? ### `pivot_longer()` The `pivot_longer()` function can be used to transform data from wide to long format. .pull-left[ ```r pivot_longer(data = , cols = , names_to = , values_to = ) ``` ```r tidier_data <- messy_data %>% pivot_longer(cols = control_1:treatment_3, names_to = "treatment_group", values_to = "response") head(tidier_data) ``` ] -- .pull-right[ ``` ## # A tibble: 6 × 3 ## independent_replicate treatment_group response ## <int> <chr> <dbl> ## 1 1 control_1 5.81 ## 2 1 control_2 6.10 ## 3 1 control_3 3.76 ## 4 1 treatment_1 5.73 ## 5 1 treatment_2 6.12 ## 6 1 treatment_3 5.97 ``` ] --- Our data frame is certainly longer now. Let's remove the underscores from our treatment_group column and calculate some quick summary statistics using `mutate()` and `summarise()` ```r tidier_data %>% mutate("treatment_group" = str_remove_all(.$treatment_group, "_\\d")) ``` ``` ## # A tibble: 18 × 3 ## independent_replicate treatment_group response ## <int> <chr> <dbl> ## 1 1 control 5.81 ## 2 1 control 6.10 ## 3 1 control 3.76 ## 4 1 treatment 5.73 ## 5 1 treatment 6.12 ## 6 1 treatment 5.97 ## 7 2 control 3.95 ## 8 2 control 4.95 ## 9 2 control 5.46 ## 10 2 treatment 7.69 ## 11 2 treatment 7.86 ## 12 2 treatment 6.22 ## 13 3 control 4.08 ## 14 3 control 5.52 ## 15 3 control 4.91 ## 16 3 treatment 7.83 ## 17 3 treatment 8.07 ## 18 3 treatment 7.67 ``` --- ```r tidy_data_summarized_technicalrep <- tidier_data %>% mutate("treatment_group" = str_remove_all(.$treatment_group, "_\\d"))%>% group_by(independent_replicate, treatment_group) %>% summarise(response = mean(response)) ``` ``` ## `summarise()` has grouped output by 'independent_replicate'. You can override ## using the `.groups` argument. ``` ```r knitr::kable(tidy_data_summarized_technicalrep) ``` | independent_replicate|treatment_group | response| |---------------------:|:---------------|--------:| | 1|control | 5.220143| | 1|treatment | 5.939734| | 2|control | 4.783102| | 2|treatment | 7.258790| | 3|control | 4.836746| | 3|treatment | 7.857970| --- class: inverse, center, middle # Ask questions and practice! --- ### Ask questions about the homework if you have them! ### In R studio cloud, open the project for this session. Your assignment is to pick at least 3 of the traits in the AKC dog breeds dataset that you would use to pick out your ideal dog. Determine what your conditions would be (i.e. `Affectionate With Family` > 3, `Shedding Level` == 1, etc.) and use `filter()` to identify which dog breeds meet those conditions. ### If any of you discover any particularly cute dog breeds, please feel free to post pictures in the slack under the #random channel.