2018/07/31

webscraping / rvest / robotstxt / polite / memoise / ratelimitr

Webscraping ALLSTAT

Life on the edge of easy responsible webscraping

As underlined in The Ethical Scraper’s principles by James Densmore (that I found in the slides of Hanjo Odendaal’s useR! 2018 webscraping tutorial), a first step is to research whether you can get the data without webscraping (I found no API). Note that an ideal case would be to have received emails since the list’s creation in 1998, and to have stored them, so that one could use this local copy. Or one could email the list maintainers of course!

Then, there are some good practices to keep in mind whilst webscraping, also underlined in the principles list, and most importantly all encompassed in the awesome new package polite by Dmytro Perepolkin! I knew the packages wrapped by Dmytro, but when webscraping I tended to use only robotstxt, e.g. here (while using the others in API packages). Now, with very little effort, polite will make you

Seek permissions via robotstxt,
Introduce yourself with a user agent, by default the package’s name and URL,
Take it slowly thanks to ratelimitr,
Never ask twice thanks to memoise.
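
To make the last two points more concrete, here is a minimal base R sketch of what "taking it slowly" and "never asking twice" amount to. This is not how polite does it internally (it relies on ratelimitr and memoise); fake_fetch() and make_polite() are hypothetical names made up for this illustration, and the delay is shortened so that the example runs quickly.

```r
# Hypothetical stand-in for a web request; a real scraper would call
# something like httr::GET() or polite::scrape() here.
fake_fetch <- function(path) {
  paste("content of", path)
}

# Wrap a fetching function so that it waits between real calls
# and stores results it has already obtained.
make_polite <- function(fetch, delay = 5) {
  cache <- new.env(parent = emptyenv())  # path -> stored result
  last_call <- NULL                      # time of the latest real call

  function(path) {
    # Never ask twice: return the stored result if there is one.
    if (exists(path, envir = cache, inherits = FALSE)) {
      return(get(path, envir = cache))
    }
    # Take it slowly: wait until `delay` seconds have passed
    # since the previous real call.
    if (!is.null(last_call)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_call, units = "secs"))
      if (elapsed < delay) Sys.sleep(delay - elapsed)
    }
    result <- fetch(path)
    last_call <<- Sys.time()
    assign(path, result, envir = cache)
    result
  }
}

polite_fetch <- make_polite(fake_fetch, delay = 0.2)
first <- polite_fetch("?A0=ALLSTAT")   # real call
second <- polite_fetch("?A0=ALLSTAT")  # served from the cache, no waiting
```

The point of the sketch is only that both behaviours live in the wrapper, so the scraping code itself does not have to think about them.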

The polite package is brand-new and still in development, which in general means you might want to stay away from it for a while, but I was eager to try it and pleased that it worked very well! Being an early adopter also means I saw my issues promptly closed with some solution/new code by Dmytro!

I started work by bowing which in polite’s workflow means both creating a session (with user-agent, delay between calls) and checking that the path is allowed.

home_url <- "https://www.jiscmail.ac.uk/cgi-bin/webadmin"

session <- polite::bow(home_url,
                       user_agent = "Maëlle Salmon https://masalmon.eu/")

And then it was time to scrape and parse…

Actual webscraping a.k.a solving XPath puzzles

When confronted with a page I’d like to extract info from, I try to identify how I can write XPath to get the elements I want, which means looking at the page source. I used to transform whole pages to text before using regexp on them, which was clearly suboptimal; thanks to Eric Persson for making me switch to XPath.
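
As a small illustration of the XPath approach, here is a sketch using xml2 (the package rvest builds on) on a made-up HTML fragment. The markup is a simplified assumption, not the actual source of the archive pages.

```r
library("xml2")

# Hypothetical, simplified markup: one table row per email.
page <- read_html('<table>
  <tr><td><a href="?A2=1">JOB: Statistician</a></td><td>Jane Doe</td></tr>
  <tr><td><a href="?A2=2">Re: a question</a></td><td>John Smith</td></tr>
</table>')

# Select every link that is a direct child of a table cell...
links <- xml_find_all(page, "//td/a")

# ...and get its text and its target without any regexp on the raw page.
subjects <- xml_text(links)
hrefs <- xml_attr(links, "href")
```

Compared with regexp on the page text, the selection follows the structure of the document, so it keeps working when the text inside the cells changes.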

My strategy here was to get the subject, date, sender and size of each email from archive pages. Here is such an archive page. Nowadays there is one archive page per month, but there used to be one per year, so I got the list of archive pages from this general page.

library("magrittr")

polite::scrape(session, params = "?A0=ALLSTAT") %>%
  rvest::xml_nodes("li") %>%
  rvest::xml_nodes("a") %>%
  rvest::html_attr("href") %>%
  purrr::keep(function(x) stringr::str_detect(x, "\\/cgi-bin\\/webadmin\\?A1\\=")) %>%
  stringr::str_remove("\\/cgi\\-bin\\/webadmin\\?A1\\=ind") %>%
  stringr::str_remove("\\&L\\=ALLSTAT") -> date_strings

This is not very elegant but this got me only the “1807” and such I needed for the rest of the scraping. polite::scrape is a wrapper to both httr::GET and httr::content and does the rate limiting (by default 1 call every 5 seconds, delay parameter of polite::bow) and memoising. I actually ended up only scraping email metadata from 2007 onwards because it took ages to parse the 2006 page. It made me, according to a search via the website, miss these data scientist job openings: one in 2000 and one in 2004.
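
To show what the filtering and stripping steps do without hitting the website, here is a base R sketch on made-up links. The hrefs are assumptions that follow the pattern used in the code above, with a two-character string standing for one of the old yearly pages.

```r
# Hypothetical links, following the pattern matched in the pipeline above.
hrefs <- c("/cgi-bin/webadmin?A1=ind1807&L=ALLSTAT",
           "/cgi-bin/webadmin?A1=ind1806&L=ALLSTAT",
           "/cgi-bin/webadmin?A1=ind06&L=ALLSTAT",
           "/cgi-bin/webadmin?A0=ALLSTAT")

# Keep only links to archive pages...
is_archive <- grepl("/cgi-bin/webadmin\\?A1=", hrefs)

# ...and keep only the digits between "ind" and "&L=ALLSTAT".
date_strings <- sub(".*\\?A1=ind([0-9]+)&L=ALLSTAT.*", "\\1", hrefs[is_archive])

# Monthly pages give four characters, yearly pages only two.
nchar(date_strings)
```

A single capture group does here what the two str_remove calls do in the pipeline, which might be a bit easier to read.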

date_strings <- date_strings[stringr::str_length(date_strings) != 2]
# or
date_strings <- purrr::discard(date_strings, function(x) stringr::str_length(x) == 2)

I created a function getting the metadata out of each archive page. The trickiest points here were:

That the rows of the archive table could have two classes, which is the way alternate coloring was obtained. I therefore used | in XPath '//tr[@class="normalgroup"]|//tr[@class="emphasizedgroup"]'.
That there was no different class/formatting for subject, date, sender, so I got all of them at once, and then used the modulo operator, %%, to assign them to the right vector.
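
Here is how the modulo trick works on a toy vector. The values are made up (the real date format may differ); what matters is that sender, date and size repeat in that order.

```r
# Hypothetical values, interleaved as sender, date, size, sender, date, size...
big_mess <- c("Jane Doe", "Mon, 2 Jul 2018 10:00:00 +0100", "25 lines",
              "John Smith", "Tue, 3 Jul 2018 11:30:00 +0100", "40 lines")

# Position within each group of three: 1, 2, 0, 1, 2, 0
position <- seq_along(big_mess) %% 3

senders <- big_mess[position == 1]
dates <- big_mess[position == 2]
# The third element of each group has remainder 0, not 3.
sizes <- as.numeric(sub(" lines", "", big_mess[position == 0]))
```

This only works if every row has exactly three such cells, so a row with a missing cell would silently shift everything after it.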

get_emails_meta_by_date <- function(date_string, session){
  message(date_string)
  params <- glue::glue("?A1=ind{date_string}&L=ALLSTAT&F=&S=&O=T&H=0&D=0&T=0")
  
  everything <- try(polite::scrape(session, params = params),
                    silent = TRUE)
  
  # at the time of writing one couldn't pass encoding to scrape
  # but now one can https://github.com/dmi3kno/polite/issues/6#issuecomment-409268730
  if(is(everything, "try-error")){
    everything <- httr::GET(paste0(home_url,
                                   params)) %>%
      httr::content(encoding = "latin1")
  }
  
  everything <- everything %>%
    # there are two classes that correspond
    # to the table having two colours of rows!
    rvest::xml_nodes(xpath = '//tr[@class="normalgroup"]|//tr[@class="emphasizedgroup"]') %>%
    rvest::xml_nodes("span")
  
  everything %>%
    rvest::xml_nodes(xpath = "//td") %>%
    rvest::xml_nodes("span") %>%
    rvest::xml_nodes("a") %>%
    rvest::html_text() -> subjects
  
  everything %>%
    rvest::xml_nodes(xpath = "//td[@nowrap]") %>%
    rvest::xml_nodes(xpath = "p[@class='archive']") %>%
    rvest::html_text() -> big_mess
  
  senders <- big_mess[seq_along(big_mess) %% 3 == 1]
  senders <- stringr::str_remove(senders, " \\<\\[log in to unmask\\]\\>")
  
  dates <- big_mess[seq_along(big_mess) %% 3 == 2]
  dates <- lubridate::dmy_hms(dates, tz = "UTC")
  
  sizes <- big_mess[seq_along(big_mess) %% 3 == 0]
  sizes <- stringr::str_remove(sizes, " lines")
  sizes <- as.numeric(sizes)
  
  tibble::tibble(subject = subjects,
                 sender = senders,
                 date = dates,
                 size = sizes) %>%
    readr::write_csv(glue::glue("data/emails_meta{date_string}.csv"))
  
}

I chose to save the metadata of each archive page in its own csv in order to make my workflow less breakable. I could have used purrr::map_df but then it’d be harder to re-start, and it was hard on memory apparently.
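
One way to push the "less breakable" idea a bit further would be to skip pages whose csv already exists, so that re-running the loop after a failure only does the missing work. Below is a base R sketch of that idea, not what I actually ran: save_page() is a hypothetical helper and the metadata is a placeholder instead of a scraped page.

```r
# Fresh temporary folder for the demo.
out_dir <- tempfile("emails_meta_demo")
dir.create(out_dir)

# Hypothetical helper: writes one csv per date string, unless it is already there.
# Returns TRUE if the page was processed, FALSE if it was skipped.
save_page <- function(date_string, out_dir) {
  path <- file.path(out_dir, paste0("emails_meta", date_string, ".csv"))
  if (file.exists(path)) {
    return(invisible(FALSE))
  }
  # Placeholder for the metadata one would scrape and parse here.
  meta <- data.frame(subject = paste("subject from", date_string),
                     stringsAsFactors = FALSE)
  write.csv(meta, path, row.names = FALSE)
  invisible(TRUE)
}

# First run: both pages are processed.
first_run <- vapply(c("1806", "1807"), save_page, logical(1), out_dir = out_dir)
# Second run: only the new page is processed.
second_run <- vapply(c("1806", "1807", "1808"), save_page, logical(1), out_dir = out_dir)
```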

fs::dir_create("data")
purrr::walk(date_strings,
              get_emails_meta_by_date,
              session = session)

Analyzing ALLSTAT jobs

Filtering jobs

ALLSTAT encourages you to use keywords in emails’ subjects, so many job openings contain some variant of “job”, and that’s the sample on which I shall work.

library("magrittr")

fs::dir_ls("../../static/data/allstat") %>%
  purrr::map_df(readr::read_csv) -> emails

jobs <- dplyr::filter(emails, 
                      stringr::str_detect(subject,
                                          "[Jj][Oo][Bb]"))

Out of 32761 emails I got 9245 job openings.
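
For the record, here is what the pattern used for filtering matches, on a few made-up subjects and in base R. It is equivalent to a case-insensitive search for "job", so it also catches "Jobs" and would catch any longer word containing these three letters.

```r
# Hypothetical subjects.
subjects <- c("JOB: Statistician",
              "Jobs: Data Scientists",
              "Re: survival analysis query",
              "job - PhD position")

# Character classes spell out both cases of each letter...
with_classes <- grepl("[Jj][Oo][Bb]", subjects)

# ...which gives the same result as ignoring case altogether.
ignoring_case <- grepl("job", subjects, ignore.case = TRUE)
```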

I created two dummy variables to indicate the presence of data scientist or statistician in the subject. With the definition below, the “statistician” category might contain “biostatisticians” which is fine by me.

jobs <- dplyr::mutate(jobs,
                      data_scientist = stringr::str_detect(subject, 
                                                           "[Dd]ata [Ss]cientist"),
                      statistician = stringr::str_detect(subject, 
                                                          "[Ss]tatistician"))

176 subjects contain the word “data scientist”, 2546 the word “statistician”, and 20 both.

dplyr::filter(jobs, data_scientist, statistician) %>%
  dplyr::select(subject, sender, date) %>%
  knitr::kable()

subject	sender	date
JOB: Statisticians/Data Scientists with Unilever	Murray, Peter	2008-02-12 08:47:56
Jobs: Senior Data Scientists/Statisticians	Cox, Trevor	2008-03-12 16:24:48
20 new job ads for data scientists, statisticians	Vincent Granville	2013-06-16 00:18:27
23 new jobs for statisticians, data scientists	Vincent Granville	2013-07-27 20:05:48
19 new job ads for statisticians and data scientists	Vincent Granville	2013-10-12 23:21:23
Job Data scientist / Statistician	Andrea Schirru	2014-07-22 16:38:59
JOB: Computational Statistician / Data Scientist	David Hastie	2014-07-16 11:36:34
Job Openings: Statisticians and Data Scientists at Open Analytics (Belgium)	Tobias Verbeke	2015-06-30 09:53:21
JOBS x 2: Data Scientist/Medical Statistician at University of Manchester	Matthew Sperrin	2016-07-28 10:42:52
JOB: Data scientist - Statistician (KTP associate) @ University of Essex	Aris Perperoglou	2016-09-12 10:33:32
JOB: Principal Statistician/ Data Scientist- Pharma - Perm - Centralised Monitoring- Data Safety - Global Search-UK	Sabrina Andresen	2016-10-17 16:05:12
JOB: Research Associate - Statistician / Data Scientist, University of Manchester (Centre for Musculoskeletal Research)	Jamie Sergeant	2016-11-01 10:21:07
JOB - Statistician / Data Scientist at Cefas, UK	David Maxwell	2017-08-04 16:15:25
Job: Data Scientist/Statistician at Essex University	Aris Perperoglou	2017-08-21 13:43:43
Job: Statistician/Data Scientist	Roisin McCarthy	2017-11-06 16:49:18
Job Openings: Statisticians and Data Scientists at Open Analytics (Belgium)	Tobias Verbeke	2017-12-05 20:48:39
JOB | StatsJobs - Data Scientist/Statistician, Zurich Insurance	James Phillips	2018-01-30 08:59:49
JOB | Junior Medical statistician / Real-world data scientist - Centre of Excellence for Retrospective Studies, IQVIA (London)	Venerus, Alessandra	2018-03-05 10:33:09
JOB | Senior Medical statistician / Real-world data scientist - Centre of Excellence for Retrospective Studies, IQVIA (London)	Venerus, Alessandra	2018-03-05 10:32:51
Job: Catalyst Project Research Data Scientist / Statistician	Aris Perperoglou	2018-06-25 14:17:18

Are the job titles synonymous for the organizations using slashes? I am especially puzzled by “Senior Medical statistician / Real-world data scientist”! I filtered them out and created a category variable.

jobs <- dplyr::filter(jobs,
                      !(data_scientist&statistician))
jobs <- dplyr::mutate(jobs,
                      category = dplyr::case_when(data_scientist ~ "data scientist",
                                                  statistician ~ "statistician",
                                                  TRUE ~ "other"),
                      category = factor(category,
                                        levels = c("statistician",
                                                   "data scientist",
                                                   "other"),
                                        ordered = TRUE))

jobs <- dplyr::mutate(jobs,
                      year = lubridate::year(date),
                      year = as.factor(year))

Here are some examples of positions for each:

head(jobs$subject[jobs$category=="statistician"])

## [1] "FW: JOB: Senior/Lead Statistician - GlaxoSmithKline"                   
## [2] "JOB - Clinical Trial Statisticians and SAS Programmers"                
## [3] "JOB - Medical statistician in Salvador, Brazil (2-3 years)"            
## [4] "JOB - Medical Statistician, Oxford"                                    
## [5] "JOB OPPORTUNITY - 39408 - Entry-Level Statisticians - GSK"             
## [6] "JOB: Biostatistician - Contract/Consultancy Opportunity - Switzer land"

head(jobs$subject[jobs$category=="data scientist"])

## [1] "JOB: Data Scientist, Unilever"                                 
## [2] "JOB: Data Scientist, Unilever"                                 
## [3] "JOB: Data Scientist / Machine Learning (BELFAST)"              
## [4] "JOB: Data Scientists in Belfast"                               
## [5] "JOBS: Data Scientists"                                         
## [6] "[Job] Hacker or Botnet Developer, Data Scientist (Telecommute)"

head(jobs$subject[jobs$category=="other"])

## [1] "Job - Data manager Dept of Haematology, Imperial College, London"                  
## [2] "Re: JOB - Health Economic Modelling and Value Demonstrations"                      
## [3] "JOB - Lectureship / readership"                                                    
## [4] "JOB - Mathematical Modeller at the Health Protection Agency, Centre for Infections"
## [5] "JOB - PhD position"                                                                
## [6] "JOB - SAS Senior Statistical Programmer, Oxford"

Are data scientists on the rise?

library("ggplot2")

ggplot(jobs) +
  geom_bar(aes(year, fill = category)) +
  viridis::scale_fill_viridis(discrete = TRUE) +
  # theme_ipsum() is a complete theme, so the legend tweak has to come after it
  hrbrthemes::theme_ipsum(base_size = 14) +
  theme(legend.position = "bottom") +
  xlab("Year (2018 not complete yet)") +
  ylab("Number of job openings") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("ALLSTAT mailing list 2007-2018")

I like this bar plot that shows how the total number of job openings fluctuates, but it’s hard to see differences in proportions.

ggplot(jobs) +
  geom_bar(aes(year, fill = category),
           position = "fill") +
  viridis::scale_fill_viridis(discrete = TRUE) +
  # theme_ipsum() is a complete theme, so the legend tweak has to come after it
  hrbrthemes::theme_ipsum(base_size = 14) +
  theme(legend.position = "bottom") +
  xlab("Year (2018 not complete yet)") +
  ylab("Proportion of job openings")  +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("ALLSTAT mailing list 2007-2018")

According to this plot, although there seem to be more and more data scientist jobs advertised on ALLSTAT… Statisticians don’t need to get worried just yet.
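
The shares behind a filled bar plot can also be computed as a table, which makes it easier to read actual numbers. Here is a base R sketch on a tiny made-up data set (the toy data frame is an assumption, not the scraped data).

```r
# Hypothetical job openings: one row per opening.
toy <- data.frame(
  year = c(2016, 2016, 2016, 2017, 2017, 2017, 2017),
  category = c("statistician", "statistician", "other",
               "statistician", "data scientist", "other", "other")
)

# Counts per year and category...
counts <- table(toy$year, toy$category)

# ...turned into shares within each year (each row sums to 1).
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```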

Who offers data scientists’ jobs?

dplyr::count(jobs, category,
                sender) %>%
  dplyr::group_by(category) %>%
  dplyr::arrange(category, - n) %>%
  dplyr::filter(sender %in% sender[1:5])

## # A tibble: 15 x 3
## # Groups:   category [3]
##    category       sender                     n
##    <ord>          <chr>                  <int>
##  1 statistician   James Phillips           106
##  2 statistician   Sabrina Andresen          82
##  3 statistician   James Miller              45
##  4 statistician   Angela Smythe             40
##  5 statistician   Helena Newman-Mitchell    37
##  6 data scientist James Phillips            86
##  7 data scientist Sportradar HR              5
##  8 data scientist Jason Howlin               4
##  9 data scientist Deborah Gee                3
## 10 data scientist Christos Mitas             2
## 11 other          James Phillips           671
## 12 other          Angela Smythe            223
## 13 other          Helena Newman-Mitchell   103
## 14 other          Jason Howlin              91
## 15 other          James Miller              85

Seeing James Phillips’ name so often made me have a look at their emails: this person sends emails on behalf of a website called StatsJobs.com! We can also assume that other super-senders actually work for job aggregators of some sort.

What are the openings about?

To make a more thorough description of the different categories, one would need to get the email bodies, which I decided against for this post. I simply used the subjects, and compared word usage between the “data scientist” and “statistician” categories as in this chapter of the Tidy text mining book by Julia Silge and David Robinson.

library("tidytext")
data("stop_words")
words <- dplyr::filter(jobs, category != "other") %>%
  unnest_tokens(word, subject, token = "words") %>%
  dplyr::filter(!word %in% stop_words$word,
                !word %in% c("job", "statistician", 
                             "jobs", "statisticians",
                             "data", "scientist",
                             "scientists",
                             "datascientistjobs"))

word_ratios <- words %>%
  dplyr::count(word, category) %>%
  dplyr::group_by(word) %>%
  dplyr::filter(sum(n) >= 10) %>%
  dplyr::ungroup() %>%
  tidyr::spread(category, n, fill = 0) %>%
  dplyr::mutate_if(is.numeric, dplyr::funs((. + 1) / sum(. + 1))) %>%
  dplyr::mutate(logratio = log(`data scientist` / statistician)) %>%
  dplyr::arrange(desc(logratio))
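
To see what the log ratio measures, here is the same computation in base R on three made-up words. The counts are assumptions chosen for the example; the + 1 avoids dividing by zero or taking the log of zero when a word never appears in one category.

```r
# Hypothetical word counts per category.
counts <- data.frame(
  word = c("clinical", "machine", "senior"),
  data_scientist = c(0, 9, 5),
  statistician = c(19, 0, 10)
)

# Share of each word within a category, with 1 added to every count.
smooth_share <- function(n) (n + 1) / sum(n + 1)

# Positive: relatively more frequent for data scientists;
# negative: relatively more frequent for statisticians.
counts$logratio <- log(smooth_share(counts$data_scientist) /
                         smooth_share(counts$statistician))

counts[order(-counts$logratio), ]
```

A word used equally often, relative to the size of each category, gets a log ratio close to zero, which is why "senior" ends up in the middle here.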

word_ratios %>%
  dplyr::group_by(logratio < 0) %>%
  dplyr::top_n(15, abs(logratio)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (data scientist / statistician)") +
  scale_fill_manual(name = "", 
                    values = c("#21908CFF", "#440154FF"),
                    labels = c("data scientist", "statistician")) +
  hrbrthemes::theme_ipsum(base_size = 11)+
  ggtitle("ALLSTAT mailing list 2007-2018")

What does this figure show? On the left are words more frequent in statistician job openings, on the right words more frequent in data scientist job openings. There might be geographical clusters for each category, although I don’t really believe that: is the data scientist vs. statistician debate a Cambridgeshire vs. Oxford battle? I was surprised that no word related to academia made an appearance because I thought “university” or an equivalent word would be more representative of statisticians. I am less surprised, though, by words such as “trials”, “medical”, “clinical” being more prevalent in the “statistician” category. The word “fixed” as in “fixed-term” contract is more prevalent for data scientist job openings, which doesn’t sound too cool?

Conclusion

In this post, I had the pleasure to use the brand-new polite package that made responsible webscraping very smooth! Armed with that tool, I extracted metadata of job openings posted on the ALLSTAT mailing list. I made no statistical analyses, but the main take-away is that statisticians aren’t in danger of extinction just yet.

ALLSTATisticians in decline? A polite look at ALLSTAT email Archives