class: center, middle, inverse, title-slide

.title[
# Working with strings in R
]
.subtitle[
## IBS 519 - Week 12 - TA session
]
.author[
### Ashlyn Johnson
]
.date[
### 11/9/22
]

---

# Agenda

## First Half of Class

- What are strings?
- `paste()` and `paste0()`
- regular expressions and `grep` functions
- `stringr`

## Second Half of Class

- Homework Questions

---
class: inverse, center, middle

# What are strings?

---

# Strings are ...

--

Any values/characters between **single (')** or **double (")** quotation marks

--

```r
my_first_string <- 'Hello world'; my_first_string # semicolon separates the two individual expressions
```

```
## [1] "Hello world"
```

--

Double quotes are preferred, according to the [tidyverse style guide](https://style.tidyverse.org/syntax.html?q=quo#character-vectors), unless the string itself contains double quotes.

```r
my_second_string <- "Goodbye"; my_second_string
```

```
## [1] "Goodbye"
```

```r
my_third_string <- 'She said "Howdy"'; my_third_string
```

```
## [1] "She said \"Howdy\""
```

???

Notice that strings print with double quotes regardless of which quotes you typed. The tidyverse style guide recommends double quotes unless the string itself contains double quotes. When a string contains double quotes, the console prints each one preceded by a backslash, which is called an escape. You can save strings to an object like we just did with `my_first_string`, `my_second_string`, and `my_third_string`.

---

You can combine multiple strings or string objects into a vector using `c()`.

```r
my_string_vector <- c(my_first_string, my_second_string, my_third_string); my_string_vector
```

```
## [1] "Hello world"          "Goodbye"              "She said \"Howdy\""
```

--

This vector of strings can also be referred to as a **character** vector.

```r
class(my_string_vector)
```

```
## [1] "character"
```

--

**Character** columns in a dataframe are made up of **strings**.

---

# Why do we need to concern ourselves with strings?
- While we often deal with numbers, data can be in the form of strings
- Gene/Protein IDs in 'omic studies
- Raw files for some of these 'omic studies can come in the FASTA format, which is text based
- Qualitative analyses
- Categorical variables
- Cleaning up column and row names

???

Now, we're going to go over some common functions and methods that we can use with strings, some of which you have likely seen before.

---

# Horror Movies Dataset from the Tidy Tuesday Repository

- Information about horror movies dating back to the 1950s
- Extracted from [The Movie Database](https://www.themoviedb.org/) by [Tanya Shapiro](https://github.com/tashapiro)

```r
library(tidytuesdayR)
library(tidyverse)

tuesdata <- tt_load(2022, week = 44)
```

```
## 
## Downloading file 1 of 1: `horror_movies.csv`
```

```r
horror_movies <- tuesdata$horror_movies
```

---

```r
glimpse(horror_movies)
```

```
## Rows: 32,540
## Columns: 20
## $ id                <dbl> 760161, 760741, 882598, 756999, 772450, 1014226, 717…
## $ original_title    <chr> "Orphan: First Kill", "Beast", "Smile", "The Black P…
## $ title             <chr> "Orphan: First Kill", "Beast", "Smile", "The Black P…
## $ original_language <chr> "en", "en", "en", "en", "es", "es", "en", "en", "en"…
## $ overview          <chr> "After escaping from an Estonian psychiatric facilit…
## $ tagline           <chr> "There's always been something wrong with Esther.", …
## $ release_date      <date> 2022-07-27, 2022-08-11, 2022-09-23, 2022-06-22, 202…
## $ poster_path       <chr> "/pHkKbIRoCe7zIFvqan9LFSaQAde.jpg", "/xIGr7UHsKf0URW…
## $ popularity        <dbl> 5088.584, 2172.338, 1863.628, 1071.398, 1020.995, 93…
## $ vote_count        <dbl> 902, 584, 114, 2736, 83, 1, 125, 1684, 73, 1035, 637…
## $ vote_average      <dbl> 6.9, 7.1, 6.8, 7.9, 7.0, 1.0, 5.8, 7.0, 6.5, 6.8, 7.…
## $ budget            <dbl> 0, 0, 17000000, 18800000, 0, 0, 20000000, 68000000, …
## $ revenue           <dbl> 9572765, 56000000, 45000000, 161000000, 0, 0, 289259…
## $ runtime           <dbl> 99, 93, 115, 103, 0, 0, 88, 130, 90, 106, 98, 89, 97…
## $ status            <chr> "Released", "Released", "Released", "Released", "Rel…
## $ adult             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ backdrop_path     <chr> "/5GA3vV1aWWHTSDO5eno8V5zDo8r.jpg", "/2k9tBql5GYH328…
## $ genre_names       <chr> "Horror, Thriller", "Adventure, Drama, Horror", "Hor…
## $ collection        <dbl> 760193, NA, NA, NA, NA, NA, 94899, NA, NA, 950289, N…
## $ collection_name   <chr> "Orphan Collection", NA, NA, NA, NA, NA, "Jeepers Cr…
```

---
class: inverse, middle, center

# `paste` and `paste0`

---

# `paste` and `paste0`, workhorses for strings

- used to concatenate strings
- `paste` automatically inserts a space between each item while `paste0` does not
- `paste` lets you change the separator (`sep`) from a space to something else

```r
paste(my_first_string, ',', my_second_string, 'world') # adding in 'world' to make the final string more poetic
```

```
## [1] "Hello world , Goodbye world"
```

```r
paste(my_first_string, my_second_string, sep = '_') # example with different separator argument
```

```
## [1] "Hello world_Goodbye"
```

```r
paste0(my_first_string, ', ', my_second_string, ' world')
```

```
## [1] "Hello world, Goodbye world"
```

---

# Creating vectors with paste functions

We often need to create a vector of IDs

```r
paste('sample', 1:30, sep = '_')
```

```
##  [1] "sample_1"  "sample_2"  "sample_3"  "sample_4"  "sample_5"  "sample_6" 
##  [7] "sample_7"  "sample_8"  "sample_9"  "sample_10" "sample_11" "sample_12"
## [13] "sample_13" "sample_14" "sample_15" "sample_16" "sample_17" "sample_18"
## [19] "sample_19" "sample_20" "sample_21" "sample_22" "sample_23" "sample_24"
## [25] "sample_25" "sample_26" "sample_27" "sample_28" "sample_29" "sample_30"
```

--

```r
paste0('sample', '_', 1:30)
```

```
##  [1] "sample_1"  "sample_2"  "sample_3"  "sample_4"  "sample_5"  "sample_6" 
##  [7] "sample_7"  "sample_8"  "sample_9"  "sample_10" "sample_11" "sample_12"
## [13] "sample_13" "sample_14" "sample_15" "sample_16" "sample_17" "sample_18"
## [19] "sample_19" "sample_20" "sample_21" "sample_22" "sample_23" "sample_24"
## [25] "sample_25" "sample_26" "sample_27" "sample_28" "sample_29" "sample_30"
```

???

We often need to create a vector of IDs. I know that we've had to do that for our class. We can use the paste functions to easily do that.

---

# We can combine `mutate` with `paste` to create new columns

```r
# inside mutate(), columns can be referred to by name
horror_movies1 <- horror_movies %>% 
  mutate(title_tagline = paste0(title, ' - ', tagline))

horror_movies1$title_tagline %>% head()
```

```
## [1] "Orphan: First Kill - There's always been something wrong with Esther."
## [2] "Beast - Fight for family."
## [3] "Smile - Once you see it, it’s too late."
## [4] "The Black Phone - Never talk to strangers."
## [5] "Presences - NA"
## [6] "Sonríe - NA"
```

---
class: center, middle, inverse

# Regular expressions and grep

---

.pull-left[
# Regular expressions

- sequence of characters that specifies a search pattern in text
- useful for matching complex/hyper-specific patterns
- also referred to as regex or regexp
- many 'cheatsheets' online & websites that can help you build a regular expression
- RStudio addin, [regexplain](https://github.com/gadenbuie/regexplain), can also help you build regular expressions
]

.pull-right[
<iframe src="https://www.petefreitag.com/cheatsheets/regex/" width="100%" height="600px" data-external="1"></iframe>
]

---

# regexr.com

<iframe src="https://regexr.com/" width="100%" height="500px" data-external="1"></iframe>

---

# grep, grepl, gsub

- `grep()`: searches for matches of a regular expression and returns the **indices** of matches
- `grepl()`: searches for matches of a regular expression and returns a **logical** vector
- `gsub()`: searches for matches of a regular expression and **replaces** them with a string

???
`grep` stands for "global regular expression print".

---

# grep

```r
scream_index <- grep(pattern = 'scream', x = horror_movies$title, ignore.case = TRUE)
head(scream_index) # indices of rows that have 'scream' in the title
```

```
## [1]   27  190  289  362  437 1728
```

--

We can use these indices to subset the dataframe to just the movies with 'scream' in the title.

```r
horror_movies[scream_index, ] %>% head()
```

```
## # A tibble: 6 × 20
##       id original_title title origi…¹ overv…² tagline release_…³ poste…⁴ popul…⁵
##    <dbl> <chr>          <chr> <chr>   <chr>   <chr>   <date>     <chr>     <dbl>
## 1 646385 Scream         Scre… en      "Twent… It's a… 2022-01-12 /1m3W6…   291.
## 2   4232 Scream         Scre… en      "A kil… Someon… 1996-12-20 /3O3kl…    67.8
## 3   4233 Scream 2       Scre… en      "Two y… Someon… 1997-12-12 /rcI1e…    51.0
## 4  41446 Scream 4       Scre… en      "Sidne… New de… 2011-04-11 /g0YcS…    42.3
## 5   4234 Scream 3       Scre… en      "As bo… Someon… 2000-02-03 /qpH8T…    37.5
## 6  14853 Screamers: Th… Scre… en      "A gro… The Pe… 2009-02-17 /xOOA1…    11.0
## # … with 11 more variables: vote_count <dbl>, vote_average <dbl>, budget <dbl>,
## #   revenue <dbl>, runtime <dbl>, status <chr>, adult <lgl>,
## #   backdrop_path <chr>, genre_names <chr>, collection <dbl>,
## #   collection_name <chr>, and abbreviated variable names ¹original_language,
## #   ²overview, ³release_date, ⁴poster_path, ⁵popularity
```

---

# grepl

```r
scream_logical <- grepl(pattern = 'scream', x = horror_movies$title, ignore.case = TRUE)
head(scream_logical)
```

```
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
```

--

We can combine this code with `mutate` to create a new column called `scream`, which we can then use to `filter` the dataframe.
```r
# inside mutate(), refer to the column by name; filter() keeps rows where scream is TRUE
horror_movies %>% 
  mutate(scream = grepl(pattern = 'scream', x = title, ignore.case = TRUE)) %>% 
  filter(scream) %>% 
  head(n = 3)
```

```
## # A tibble: 3 × 21
##       id original_title title origi…¹ overv…² tagline release_…³ poste…⁴ popul…⁵
##    <dbl> <chr>          <chr> <chr>   <chr>   <chr>   <date>     <chr>     <dbl>
## 1 646385 Scream         Scre… en      Twenty… It's a… 2022-01-12 /1m3W6…   291.
## 2   4232 Scream         Scre… en      A kill… Someon… 1996-12-20 /3O3kl…    67.8
## 3   4233 Scream 2       Scre… en      Two ye… Someon… 1997-12-12 /rcI1e…    51.0
## # … with 12 more variables: vote_count <dbl>, vote_average <dbl>, budget <dbl>,
## #   revenue <dbl>, runtime <dbl>, status <chr>, adult <lgl>,
## #   backdrop_path <chr>, genre_names <chr>, collection <dbl>,
## #   collection_name <chr>, scream <lgl>, and abbreviated variable names
## #   ¹original_language, ²overview, ³release_date, ⁴poster_path, ⁵popularity
```

---

# gsub

Perhaps we don't like scary or loud movies and want to make this dataframe a little less frightening. We could replace every 'scream' in the titles with 'whisper'. Let's use `gsub` and `mutate` to make a friendlier `friendly_title` column.

--

```r
horror_movies_friendly <- horror_movies %>% 
  mutate(friendly_title = gsub(pattern = 'scream', replacement = 'whisper', x = title, ignore.case = TRUE))

head(horror_movies_friendly$friendly_title)
```

```
## [1] "Orphan: First Kill" "Beast"              "Smile"             
## [4] "The Black Phone"    "Presences"          "Sonríe"
```

---

# gsub

The new `friendly_title` column is almost identical to the `title` column, which is why we don't see any `whisper` titles in the first few rows. We have to search our dataframe to find where the replacements occurred.
```r
horror_movies_friendly %>% 
* filter(grepl(pattern = 'whisper', x = friendly_title)) %>% 
  select(title, friendly_title) %>% 
  head(n = 8)
```

```
## # A tibble: 8 × 2
##   title                  friendly_title         
##   <chr>                  <chr>                  
## 1 Scream                 whisper                
## 2 Scream                 whisper                
## 3 Scream 2               whisper 2              
## 4 Scream 4               whisper 4              
## 5 Scream 3               whisper 3              
## 6 Screamers: The Hunting whisperers: The Hunting
## 7 Scream of the Banshee  whisper of the Banshee 
## 8 Screamers              whisperers             
```

---
class: center, middle, inverse

# stringr

---

# stringr: the Tidyverse solution to strings

- built on top of `stringi` and contains named functions for some of the most common string manipulations
- stringr functions make use of regular expressions to match strings and carry out various processes
- some stringr functions that I use often:
  - `str_glue()` - combines strings with objects/variables in your environment
  - `str_replace_all()` - replaces pattern matches with a string of your choice
  - `str_view()` - shows matches to your regular expression
- And many more!

---

# [stringr cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/strings.pdf)

<embed src="strings_cheatsheet.pdf" width="100%" height="500px" type="application/pdf" />

---

# stringr examples: `str_glue`

Combining the contents of the `title` column in `horror_movies` with a string, similar to `paste()`.
```r
str_glue('My favorite movie is {horror_movies$title}') %>% head()
```

```
## My favorite movie is Orphan: First Kill
## My favorite movie is Beast
## My favorite movie is Smile
## My favorite movie is The Black Phone
## My favorite movie is Presences
## My favorite movie is Sonríe
```

---

# stringr examples: `str_replace_all`

Replacing the word 'Horror' with 'Scary', similar to `gsub()`

```r
head(horror_movies$genre_names)
```

```
## [1] "Horror, Thriller"          "Adventure, Drama, Horror" 
## [3] "Horror, Mystery, Thriller" "Horror, Thriller"         
## [5] "Horror"                    "Horror, Thriller"
```

```r
str_replace_all(string = horror_movies$genre_names, pattern = 'Horror', replacement = 'Scary') %>% head()
```

```
## [1] "Scary, Thriller"          "Adventure, Drama, Scary" 
## [3] "Scary, Mystery, Thriller" "Scary, Thriller"         
## [5] "Scary"                    "Scary, Thriller"
```

---

# stringr examples: `str_view()`

Visualizes pattern matches. For example, I am looking for movies whose titles start (left) or end (right) with a digit.

.pull-left[
```r
str_view(string = horror_movies$title, pattern = '^\\d', match = TRUE)
```
]

.pull-right[
```r
str_view(string = horror_movies$title, pattern = '\\d$', match = TRUE)
```
]

---
class: middle, center, inverse

# Questions?

---

# Resources used for this presentation

- https://r4ds.hadley.nz/
- https://bookdown.org/rdpeng/rprogdatascience/
- https://stringr.tidyverse.org/
- https://regexr.com/
- https://www.petefreitag.com/cheatsheets/regex/
- https://github.com/tashapiro/horror-movies
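
---

# Appendix: the `grep` family on a toy vector

The `grep()`, `grepl()`, and `gsub()` calls from the earlier slides can be tried without downloading the dataset. This is a minimal sketch in base R, assuming a small made-up vector called `toy_titles` (hypothetical titles, not rows from `horror_movies`). The last two lines apply the same `^\\d` and `\\d$` patterns that were passed to `str_view()`.

```r
# Hypothetical titles, made up for illustration (not rows from horror_movies)
toy_titles <- c("Scream", "The Quiet Hour", "Screamers", "7 Nights", "Room 13")

# grep() returns the positions of the matching elements
scream_idx <- grep(pattern = "scream", x = toy_titles, ignore.case = TRUE)
scream_idx
# [1] 1 3

# grepl() returns one TRUE/FALSE per element
scream_lgl <- grepl(pattern = "scream", x = toy_titles, ignore.case = TRUE)

# gsub() replaces every match and leaves non-matching elements unchanged
friendly <- gsub(pattern = "scream", replacement = "whisper", x = toy_titles, ignore.case = TRUE)
friendly[scream_lgl]
# [1] "whisper"    "whisperers"

# ^ anchors the start of the string, $ anchors the end, \\d matches a digit
starts_with_digit <- toy_titles[grepl(pattern = "^\\d", x = toy_titles)]
ends_with_digit   <- toy_titles[grepl(pattern = "\\d$", x = toy_titles)]
```

Because `gsub()` inserts the replacement exactly as written, "Scream" becomes lowercase "whisper", which matches the `friendly_title` output shown earlier.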