2018/10/05

webscraping / rvest / robotstxt / polite / memoise / ratelimitr

It’s October, time for spooky Twitter names! If you’re on this social media platform, you might have noticed some of your friends switching their names to something spooky and punny. Last year I was “Maelstrom Salmon”, which I find scary but is arguably not that funny. Anyhow, what if you want to switch your name but have no inspiration? In this post, we shall explore R’s abilities to help us with that with the help of webscraping, phonetic spelling and string distance algorithms, and the magic of randomness!

My strategy for spooky name generation will be to replace each part of a name (e.g. first and last names) with the phonetically closest Halloween-related words. Therefore, the technical challenges here were to find a list of Halloween-related words, and to measure phonetical string distance.

Getting spooky words

The otherwise very useful package rcorpora doesn’t have a “spooky” category, so I decided to scrape this webpage of Halloween words for kids. It was one of the first search results, and being for kids it couldn’t be offensive. Like the last two times I webscraped, I used the very cool polite package!

There were two less useful sections on the page, with French and Spanish words, that I very inelegantly removed after looking up their div ID in the page source.

# introduce myself
session <- polite::bow("https://holidappy.com/holidays/halloween-words-and-halloween-vocabulary-lists",
                       user_agent = "Maëlle Salmon https://masalmon.eu/")

# scrape
page <- polite::scrape(session)

## No encoding supplied: defaulting to UTF-8.

# remove the two sections I don't want
French <- xml2::xml_find_all(page,
                            xpath = "//div[@id='mod_37589248']")

xml2::xml_remove(French)

Spanish <- xml2::xml_find_all(page,
                              xpath = "//div[@id='mod_37589143']")

xml2::xml_remove(Spanish)

# extract the words
words <- xml2::xml_find_all(page,
                            xpath = "//td[contains(@class, 'tableCell')]")

words <- xml2::xml_text(words)
head(words)

## [1] "All Hallows' Eve\n\t" "lantern\n\t"          "prank\n\t"           
## [4] "costume\n\t"          "make-believe\n\t"     "sweets\n\t"

I cleaned the words a bit. Some of them were not words so I split them. I hesitated between this solution and keeping only one-word phrases… I’m still not sure it was the best decision!

words <- stringr::str_remove_all(words, "\\\n\\\t")
words <- trimws(words)
words <- tokenizers::tokenize_words(words)
words <- unlist(words[lengths(words) > 0])
words <- words[!stringr::str_detect(words, "[0-9]")]
head(words)

## [1] "all"     "hallows" "eve"     "lantern" "prank"   "costume"

I obtained 258 words.

Matching names and spooky words

Once I had this vector of Halloween-related words, all I needed was a way to compute a phonetical distance between each of them and a name. It turned out more complicated than I thought! The soundex method of the stringdist package returned 1 or 0 only, so I set out to search another package supporting phonetical comparison. I found the phonics package that has a paper in JOSS, that implements more algorithms translating strings to phonetic “codes”. I ended up using the phonics::nysiis() function, corresponding to the New York State Identification and Intelligence System phonetic algorithm. That package also has an algorithm suitable for German, via phonics::cologne(), and one for French, via phonics::statcan(), but I haven’t explored that further.

phonics::nysiis("pain")

## [1] "PAN"

phonics::nysiis("pane")

## [1] "PAN"

phonics::nysiis("train")

## [1] "TRAN"

I was at a loss as to how best compare output from phonics::nysiis() and very hackily used… stringdist::stringdist() with a default method! Below is the function finding the closest match for any name part.

spookify_word <- function(word, seed = 42, words){

  phon <- phonics::nysiis(words)
  dist <- stringdist::stringdist(phonics::nysiis(word),
                                 phon)
  set.seed(seed)
  sample(words[dist == min(dist)], 1)
}
spookify_word("Jane", words = words)

## [1] "bones"

So spookify_word() samples a word among the closest matches… that aren’t necessarily very close to the original name. But that’s fine!

Mixing it all in a chauldron function

I finally created a function taking a whole name, say “Maëlle Salmon” or “Ada Colau Ballano”, tokenizes it into words thanks to the tokenizers package, gets the closest match for each part and then collapses everything together, capitalizing first letters thanks to snakecase… before announcing the result with the cool cowsay package! I made sure to only use animals that are associated with Halloween.

spookify <- function(name, seed,
                     words){
  name_parts <- unlist(tokenizers::tokenize_words(name))
  spook_names <- purrr::map_chr(name_parts,
                                spookify_word,
                                seed, words)
  spook_name <- paste(spook_names, collapse = " ")
  spook_name <- snakecase::to_upper_camel_case(spook_name,
                                               sep_out =  " ")
  
  set.seed(seed)
  animal <- sample(c("owl", "spider",
                     "pumpkin", "ghost",
                     "bat"), 1)

  cowsay::say(what = paste0("Byebye ", name, "\n",
                           "You are now ", spook_name),
              by = animal)

}

And because I didn’t want to make fun of anyone else than me, I created 7 fake names using the charlatan package and ran spookify on them!

set.seed(42)
names <- charlatan::ch_name(n = 7)
purrr::walk2(names, 1:length(names),
             spookify, words = words)

## Colors cannot be applied in this environment :( Try using a terminal or RStudio.

## 
## 
##  ----- 
## Byebye Tyrik Graham PhD
## You are now Trick Grinning Giant 
##  ------ 
##     \   
##      \
##               |
##               |
##               |
##              __
##           | /  \ |
##          \_\\  //_/
##           .'/()\'.
##            \\  //  [nosig]
## 

## Colors cannot be applied in this environment :( Try using a terminal or RStudio.

## 
##  ----- 
## Byebye Jeremy Rippin
## You are now Horrify Rotten 
##  ------ 
##     \   
##      \  
##       \
##        /\___/\
##        {o}{o}|
##        \ v  /|
##        |    \ \
##         \___/_/       [ab] 
##           | |

## Colors cannot be applied in this environment :( Try using a terminal or RStudio.

## 
##  ----- 
## Byebye Margarett Purdy
## You are now Boogers Party 
##  ------ 
##     \   
##      \  
##       \
##        /\___/\
##        {o}{o}|
##        \ v  /|
##        |    \ \
##         \___/_/       [ab] 
##           | |

## Colors cannot be applied in this environment :( Try using a terminal or RStudio.

## 
## 
##  ----- 
## Byebye Miss Tilla Funk DDS
## You are now Ooze Vil Fangs Dead 
##  ------ 
##     \   
##      \
##                   ___
##                ___)__|_
##           .-*'          '*-,
##          /      /|   |\     \
##         ;      /_|   |_\     ;
##         ;   |\           /|  ;
##         ;   | ''--...--'' |  ;
##          \  ''---.....--''  /
##           ''*-.,_______,.-*'  [nosig]
## 

## Colors cannot be applied in this environment :( Try using a terminal or RStudio.

## 
## 
##  ----- 
## Byebye Wilton Carroll IV
## You are now Cartoon Cruella Eve 
##  ------ 
##     \   
##      \
##               |
##               |
##               |
##              __
##           | /  \ |
##          \_\\  //_/
##           .'/()\'.
##            \\  //  [nosig]
## 

## Colors cannot be applied in this environment :( Try using a terminal or RStudio.

## 
## 
##  ----- 
## Byebye Tevin Conn
## You are now Coffin Bones 
##  ------ 
##     \   
##      \
##      .-.
##     (o o)
##     | O \
##      \   \
##       `~~~' [nosig]
## 

## Colors cannot be applied in this environment :( Try using a terminal or RStudio.

## 
## 
##  ------------- 
## Byebye Dr. Ballard Langworth III
## You are now Dish Blood Cemetery Dish 
##  -------------- 
##               \   
##                \  
##                 \
##         __.--'\     \.__./     /'--.__
##     _.-'       '.__.'    '.__.'       '-._
##   .'                                      '.
##  /                                          \
## |                                            |
## |                                            |
##  \         .---.              .---.         /
##   '._    .'     '.''.    .''.'     '.    _.'
##      '-./            \  /           \.-'
##                       ''mrf

Note that I reported the message I got from cowsay. It’s not too bad, cowsay is not designed for use in R Markdown I think.

I found the results not fantastic, but not too bad either! I especially liked “Boogers Party”!

Mucus, anyone?

In conclusion, in this post I showed how to build a corpus of Halloween-related words by responsibly webscraping, and how to more or less hackily compare strings based on their pronunciation. I obviously tried the function on my own name (without the accent, useless in this case) and wasn’t too happy…

spookify("Maelle Salmon", 42, words)

## Colors cannot be applied in this environment :( Try using a terminal or RStudio.

## 
## 
##  ------------- 
## Byebye Maelle Salmon
## You are now Mucus Bulging 
##  -------------- 
##               \   
##                \  
##                 \
##         __.--'\     \.__./     /'--.__
##     _.-'       '.__.'    '.__.'       '-._
##   .'                                      '.
##  /                                          \
## |                                            |
## |                                            |
##  \         .---.              .---.         /
##   '._    .'     '.''.    .''.'     '.    _.'
##      '-./            \  /           \.-'
##                       ''mrf

But then, at least, it is scary.

Spookify: Halloween Name Generation in R

Getting spooky words

Matching names and spooky words

Mixing it all in a chauldron function

Mucus, anyone?