2018/08/25

webscraping / rvest / robotstxt / polite / memoise / ratelimitr / taxize / rredlist

What can a kaka, a kakapo, an European rabbit and a grey heron have in common? Well, they might co-habit in the bookshelf of an R user, since they’re all animals on the covers of popular R books: “R Packages”, “R for Data Science”, “Text mining with R” and “Efficient R programming”, respectively. Their publisher, O’Reilly, has now based its brand on covers featuring beautiful gravures of animals.

Recently, while wondering what the name of R for Data Science bird was again (I thought it was a kea!), I was thrilled to find the whole O’Reilly menagerie, i.e. a list of books and corresponding animals! The website also features a link to “A short history of the O’Reilly animals” that was an amazing read. In it was noted that “The animals are in trouble.”, with a few examples of endangered species. It inspired me to actually try and assess the conservation status of O’Reilly animals using responsible webscraping, taxonomic name resolving and IUCN Redlist API querying…

Scraping the menagerie: an utter delight!

I had a great time webscraping the menagerie, not only thanks to my now reasonable experience doing such things, but also thanks to

my using the wonderful polite package for webscraping, that makes me feel so good about myself. Read more about this package in my previous post.
the webpage having really good structured html with specific classes.

The menagerie is divided into pages of 20 books, so I mapped over all possible offsets up to the number of animals indicated on the website, 1227.

library("magrittr")
home_url <- "https://www.oreilly.com/animals.csp"

session <- polite::bow(home_url,
                       user_agent = "Maëlle Salmon https://masalmon.eu/")

get_twenty <- function(offset, session){
  # offset parameter to get all books 20 by 20
  params <- glue::glue("?x-o={offset}")
  
  # scraping with content parameter
  # cf https://github.com/dmi3kno/polite/issues/6
  # https://www.oreilly.com/animals.csp?x-o=720 was problematic
  # (German characters)
  page <- polite::scrape(session, params = params, 
                         content = "text/html;charset=iso-8859-1")
  # get all animal rows
  rows <- rvest::xml_nodes(page,
                           xpath = "//div[@class='animal-row']")
  # extract book titles
  rows %>%
    rvest::xml_nodes(xpath = "a[@class='book']") %>%
    rvest::xml_nodes(xpath = "h1[@class='book-title']") %>%
    rvest::html_text() -> book_titles
  
  rows %>%
    rvest::xml_nodes(xpath = "h2[@class='animal-name']") %>%
    rvest::html_text() -> animal_names
  
  tibble::tibble(book = book_titles,
                 animal = animal_names)
  
}

no_animals <- 1227 # by hand!

offsets <- (0:floor(no_animals/20))*20

purrr::map_df(offsets, get_twenty, session = session) %>%
  readr::write_csv("oreilly_animals.csv")

I got 1134 rows, each corresponding to a book, with animals potentially repeated.

animals

## # A tibble: 1,134 x 2
##    book                            animal                                 
##    <chr>                           <chr>                                  
##  1 Mobile Design and Development   12-Wired Bird of Paradise              
##  2 Windows PowerShell for Develop~ 3-Banded Armadillo                     
##  3 Jakarta Commons Cookbook        Aardvark                               
##  4 Clojure Cookbook                Aardwolf                               
##  5 Ubuntu: Up and Running          Addax, aka Screwhorn Antelope          
##  6 Social eCommerce                Adjutant (Storks)                      
##  7 BioBuilder                      Aegina Citrea, narcomedusae, jellyfish 
##  8 JRuby Cookbook                  African Civet                          
##  9 C# 5.0 Pocket Reference         African Crowned Crane aka Grey Crowned~
## 10 Programming C# 5.0              African Crowned Crane aka Grey Crowned~
## # ... with 1,124 more rows

In the short history of animals, Edie Freedman mentions having discovered “that there were intriguing correspondences between specific technologies and specific animals”. This made me curious about my last name, Salmon!

animals %>%
  dplyr::filter(stringr::str_detect(animal, "[Ss]almon")) %>%
  knitr::kable()

book	animal
Values, Units, and Colors	Salmon
CSS Text	Salmon
CSS Fonts	Salmon
Selectors, Specificity, and the Cascade	Salmon
Transitions and Animations in CSS	Salmon2

I have no idea what trait of salmons make them good at design, other than my not sharing that trait with them. When my friend Adrien and I wrote a (non O’Reilly) book years ago, we selected a frog for the cover based on its being pretty, which is much less cool than O’Reilly branding!

From animals common names to scientific names?

Now, you’ll have noticed the names of animals are written in English. My ultimate goal being the querying of IUCN Red List API, and this API only accepting scientific names (contrary to the website of the same organization), I needed to resolve the common names to scientific names. This is a hard problem! My strategy here was:

Cleaning the names a bit to remove the parts after “aka” for instance.

clean <- function(animal){
  semiclean <- animal %>%
    stringr::str_remove_all("aka.*") %>%
    stringr::str_remove_all("\\,.*") %>%
    stringr::str_remove_all("\\(.*") 
  
  if(semiclean == "12-Wired Bird of Paradise"){
    semiclean <- "Twelve-Wired Bird of Paradise"
  }
  
  if(semiclean == "3-Banded Armadillo"){
    semiclean <- "Three-Banded Armadillo"
  }
  stringr::str_remove_all(semiclean, "[0-9]")
}


animals <- dplyr::mutate(animals, animal_clean = purrr::map_chr(animal, clean))

Using the rOpenSci taxize package that has a handy comm2sci function. This function works for anyone, but it’s better to request a key for the database used, EOL by default (see e.g. taxize::use_eol() for more info).
Not being too optimistic since the databases taxize queries cannot do wonders, no matter how good they are.

Note that for each species, the first scientific name returned is selected, because there’s no other criterion to go by. That’s how I’ll end up with a Salmon catfish for Salmon, too bad.

animal_names <- unique(animals$animal_clean)

# scientific names
good_comm2sci <- memoise::memoise(taxize::comm2sci)

get_name <- function(common_name){
  sci_names <- good_comm2sci(common_name)
  
  # don't get the name of who defined the species
  sci_name <- stringr::word(sci_names[[1]][1], start=1, end = 2)
  
  tibble::tibble(common_name = common_name,
                 sci_name = sci_name)
}
scientific_names <- purrr::map_df(animal_names, get_name)

animals <- dplyr::left_join(animals,
                            scientific_names,
                            by = c("animal_clean" = "common_name"))

I got names for 694 books, out of 1134, getting 555 animals. It’s not bad, but this number also needs to be treated with caution. See for instance:

animals %>%
  dplyr::filter(stringr::str_detect(animal, "Galapagos")) %>%
  knitr::kable()

book	animal	animal_clean	sci_name
PHP Cookbook	Galapagos Land Iguana	Galapagos Land Iguana	Conolophus marthae
Upgrading to PHP 5	Galapagos Tortoise	Galapagos Tortoise	Chelonoidis nigra

I noticed the iguana while perusing my results, and a quick internet search taught me that there are three species of terrestrial iguanas in the Galapagos, the most common one, and the one probably present on the book cover, being Conolophus subcristatus, not Conolophus marthae! I’ve noticed a few other mistakes, so I’ll need to handle the results with care. I now wish the menagerie had a bit more Latin in it!

Querying the IUCN Red List

Indeed, scientific names of species are the key to a wealth of data! Traits data, taxonomic information… and conservation status thanks to the IUCN Red List, an impressive assessment of species at the global scale. One can programmatically query it using the rOpenSci rredlist package! That’s what I did, adding a waiting time of 2 seconds between API calls, as recommended by the IUCN folks. Note that I have an API key because I asked for it, see more info by typing rredlist::rl_use_iucn() after installing rredlist, and be patient since it can last a few days before one gets one.

slow_rl_search <- ratelimitr::limit_rate(rredlist::rl_search,
                                         rate = ratelimitr::rate(1, 2))

get_status <- function(sci_name){
  message(sci_name)
  results <- slow_rl_search(sci_name)$result
  
  if(!is.null(results)){
    results$sci_name <- sci_name
  }
  
  results
}

animals <- dplyr::filter(animals, !is.na(sci_name))
purrr::map_df(unique(animals$sci_name), get_status) %>%
  readr::write_csv("oreilly_animals_status.csv")

status <- readr::read_csv("oreilly_animals_status.csv")

animals <- readr::read_csv("oreilly_animals_scientific.csv")

status <- dplyr::filter(status, !is.na(category))
animals <- dplyr::left_join(animals, status, by = "sci_name")

str(animals)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1134 obs. of  32 variables:
##  $ book              : chr  "Mobile Design and Development" "Windows PowerShell for Developers" "Jakarta Commons Cookbook" "Clojure Cookbook" ...
##  $ animal            : chr  "12-Wired Bird of Paradise" "3-Banded Armadillo" "Aardvark" "Aardwolf" ...
##  $ animal_clean      : chr  "Twelve-Wired Bird of Paradise" "Three-Banded Armadillo" "Aardvark" "Aardwolf" ...
##  $ sci_name          : chr  "Seleucidis melanoleuca" "Tolypeutes tricinctus" "Cucumis humifructus" "Proteles cristata" ...
##  $ taxonid           : int  NA 21975 NA 18372 NA 22697721 NA 41589 22692046 22692046 ...
##  $ scientific_name   : chr  NA "Tolypeutes tricinctus" NA "Proteles cristata" ...
##  $ kingdom           : chr  NA "ANIMALIA" NA "ANIMALIA" ...
##  $ phylum            : chr  NA "CHORDATA" NA "CHORDATA" ...
##  $ class             : chr  NA "MAMMALIA" NA "MAMMALIA" ...
##  $ order             : chr  NA "CINGULATA" NA "CARNIVORA" ...
##  $ family            : chr  NA "CHLAMYPHORIDAE" NA "HYAENIDAE" ...
##  $ genus             : chr  NA "Tolypeutes" NA "Proteles" ...
##  $ main_common_name  : chr  NA "Brazilian Three-banded Armadillo" NA "Aardwolf" ...
##  $ authority         : chr  NA "(Linnaeus, 1758)" NA "(Sparrman, 1783)" ...
##  $ published_year    : int  NA 2014 NA 2015 NA 2016 NA 2015 2016 2016 ...
##  $ category          : chr  NA "VU" NA "LC" ...
##  $ criteria          : chr  NA "A2cd" NA NA ...
##  $ marine_system     : logi  NA FALSE NA FALSE NA FALSE ...
##  $ freshwater_system : logi  NA FALSE NA FALSE NA TRUE ...
##  $ terrestrial_system: logi  NA TRUE NA TRUE NA TRUE ...
##  $ assessor          : chr  NA "Miranda, F., Moraes-Barros, N., Superina, M. & Abba, A.M." NA "Green, D.S." ...
##  $ reviewer          : chr  NA "Loughry, J." NA "Dloniak, S.M.D. & Holekamp, E." ...
##  $ aoo_km2           : chr  NA NA NA NA ...
##  $ eoo_km2           : chr  NA "937000" NA NA ...
##  $ elevation_upper   : int  NA NA NA 2000 NA 550 NA 2500 NA NA ...
##  $ elevation_lower   : int  NA NA NA 0 NA 0 NA 0 0 0 ...
##  $ depth_upper       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ depth_lower       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ errata_flag       : logi  NA NA NA NA NA NA ...
##  $ errata_reason     : chr  NA NA NA NA ...
##  $ amended_flag      : logi  NA NA NA NA NA NA ...
##  $ amended_reason    : chr  NA NA NA NA ...

There are 1134 books, 499 with a conservation status from the IUCN Red List, although this includes “DD” meaning “Data Deficient”. I am hesitant to actually show the proportion of species in each category for those for which I got data for, because the resolution of common names to scientific names isn’t certain… Take the following table with a pinch of salt!

dplyr::count(animals, category) %>%
  knitr::kable()

category	n
CR	14
DD	7
EN	42
EW	1
EX	6
LC	348
LR/cd	1
LR/lc	5
LR/nt	4
NT	23
VU	48
NA	635

See the following page for more precise information about categories. LC is least concern. Let’s have a look at the extinct species.

animals %>%
  dplyr::filter(category == "EX") %>%
  dplyr::select(book, animal, sci_name) %>%
  knitr::kable()

book	animal	sci_name
Java Data Objects	Bilby, Rabbit-eared Bandicoot (Macrotis lagotis)	Macrotis leucura
Building and Testing with Gradle	Bush Wren	Xenicus longipes
Designing Mobile Payment Experiences	Crested Pigeon	Microgoura meeki
SSH, The Secure Shell: The Definitive Guide	Land Snail	Amastra crassilabrum
Java NIO	Pigfooted Bandicoot	Chaeropus ecaudatus
Java I/O	White Rabbit	Macrotis leucura

I searched for the covers and names and could assess that in that table, there are 4 false positives due to the ambiguity of common names! Only the Bush wren and the Pigfooted Bandicoot got scientific names corresponding to what they look like, and are extinct, which is quite sad.

Now, to reverse-engineer what Edie Freedman wrote in the short history of O’Reilly animals, “Many of the animals that appear on our covers are critically endangered—the tarsier from Learning the vi & Vim Editors, the lorises from sed & awk, the Hawksbill turtle from Getting Started with CouchDB, the tiger from Running Mac OS X Tiger, and the African elephant on Hadoop: The Definitive Guide, just to name a few.”, let’s look at what we got for them.

animals %>%
  dplyr::filter(book %in%
                  c("Hadoop: The Definitive Guide",
                    "Learning the vi and Vim Editors",
                    "sed & awk",
                    "Getting Started with CouchDB",
                    "Running Mac OS X Tiger")) %>%
  dplyr::select(book, animal, sci_name, category) %>%
  knitr::kable()

book	animal	sci_name	category
Hadoop: The Definitive Guide	African Elephant, young	Elephantulus rozeti	LC
Getting Started with CouchDB	Hawksbill Turtle	Eretmochelys imbricata	CR
sed & awk	Slender Loris “Awk”	NA	NA
Running Mac OS X Tiger	Sumatran Tiger	Parantica tityoides	LR/nt
Learning the vi and Vim Editors	Tarsier, full-body, standing on hind feet, b/w engraving	Tarsius pelengensis	EN

Again, our name resolution wasn’t very good!

The elephant should be Loxondota africana, vulnerable species
The turtle is right.
For the loris we should have gotten this species that’s endangered.
The Sumatran tiger, Panthera tigris ssp. sumatrae , is critically endangered
There are several Tarsier species, I’m not sure which one is the right one.

So all in all, we got some truth but also some wrong names and hence wrong conservation statuses!

Conclusion: hoping for a menagerie of scientific names

In this post, I exemplified responsible webscraping with the use of the polite package to get a table of all animals on O’Reilly book covers from the dedicated menagerie. I tried resolving the common names to scientific names using taxize::comm2sci, which was only partly successful. I got conservation status for the scientific names using the rredlist package, programmatic interface to the IUCN Red List. The results would be better if O’Reilly published scientific names of animals, but nonetheless this workflow helped me identify two extinct species, the Bush wren of Building and Testing with Gradle and the Pigfooted Bandicoot of Java NIO. I can’t but hope the list of such now sad book covers won’t grow any longer…

O'Reilly animals in trouble? Conservation status of book covers

Scraping the menagerie: an utter delight!

From animals common names to scientific names?

Querying the IUCN Red List

Conclusion: hoping for a menagerie of scientific names