robotstxt · Maëlle Salmon's personal website

Spookify: Halloween Name Generation in R

2018/10/05

webscraping / rvest / robotstxt / polite / memoise / ratelimitr

It’s October, time for spooky Twitter names! If you’re on this social media platform, you might have noticed some of your friends switching their names to something spooky and punny. Last year I was “Maelstrom Salmon”, which I find scary but is arguably not that funny. Anyhow, what if you want to switch your name but have no inspiration? In this post, we shall explore R’s abilities to help us with that, using webscraping, phonetic spelling and string distance algorithms, and the magic of randomness!

O'Reilly animals in trouble? Conservation status of book covers

2018/08/25

webscraping / rvest / robotstxt / polite / memoise / ratelimitr / taxize / rredlist

What can a kaka, a kakapo, a European rabbit and a grey heron have in common? Well, they might co-habit in the bookshelf of an R user, since they’re all animals on the covers of popular R books: “R Packages”, “R for Data Science”, “Text mining with R” and “Efficient R programming”, respectively. Their publisher, O’Reilly, has now based its brand on covers featuring beautiful engravings of animals.

Recently, while wondering what the name of the R for Data Science bird was again (I thought it was a kea!), I was thrilled to find the whole O’Reilly menagerie, i.e. a list of books and corresponding animals! The website also features a link to “A short history of the O’Reilly animals”, which was an amazing read. In it was noted that “The animals are in trouble.”, with a few examples of endangered species. It inspired me to actually try and assess the conservation status of O’Reilly animals using responsible webscraping, taxonomic name resolving and IUCN Red List API querying…

ALLSTATisticians in decline? A polite look at ALLSTAT email Archives

2018/07/31

webscraping / rvest / robotstxt / polite / memoise / ratelimitr

I was until recently subscribed to an email list, ALLSTAT, “A UK-based worldwide e-mail broadcast system for the statistical community, operated by ICSE for HEA Statistics.” created in 1998. That’s how I saw the ad for my previous job in Barcelona! Now, I dislike emails more and more so I unsubscribed, but I’d still check out the archives any time I need a job, since many messages are related to openings. Nowadays, I probably identify more as a research software engineer or data scientist than a statistician… which made me wonder, when did ALLSTAT start featuring data scientist jobs? How does their frequency compare to that of statistician jobs?

In this post, I’ll webscrape and analyse meta-data of ALLSTAT emails. It’ll also be the occasion for me to take the wonderful new polite package, which helps with respectful webscraping, for a ride!

Really not a fish? Scraping my mathematical family tree

2018/06/18

webscraping / rvest / robotstxt / PhD / DiagrammeR / igraph / GraphViz

It’s been nearly two years since I defended my PhD thesis! On top of allowing me to call myself doctor, having a PhD in statistics gives me the honour of featuring in the data of the Mathematics Genealogy Project. Today, I decided to webscrape my mathematical ancestors.

Bad Stock Photos of My Job? Data Science on Pexels

2018/05/06

rcorpora / robotstxt / tidytext / NLP / webscraping / rvest / hrbrthemes

I couldn’t miss the fun Twitter hashtag #BadStockPhotosOfMyJob thanks to a tweet by Julia Silge and another one by Colin Fay. The latter inspired me to actually go and look for what makes a data science photo… What characterizes “data science” stock photos?

You beautiful, naïve, sophisticated newborn series

2017/11/05

robotstxt / parks-and-recreation / webscraping / subtools / snowballC / tm / wordcloud / rvest / hrbrthemes

My husband and I recently started watching the wonderful series “Parks and Recreation”, which was recommended to me by my fellow R-Lady Jennifer Thompson in this very convincing thread. The series was even endorsed by other R-Ladies. Jennifer told me the first two seasons are not as good as the following ones, but that it was worth it to make it through them. We actually started enjoying the humor and characters right away!

Then, this week while watching the show, one of the characters did a very basic text analysis that made me feel like imitating him for a blog post. My husband told me it was very Leslie of me to plan something while doing something else, which made me very proud. I tested my idea on other Leslie fans, and they seemed to think it was a great idea… and that this post should be the beginning of a series of R-Ladies blog posts about Parks and Recreation!

In this blog post in two short parts, I’ll therefore inaugurate this series. What an honor!

The Guardian Experience: heavy or light topics?

2017/10/02

rcorpora / robotstxt / tidytext / NLP / rvest / webscraping / The Guardian / hrbrthemes

I’ve recently been binge-reading The Guardian Experience columns. I’m a big fan of The Guardian life and style section regulars: the blind dates to which I dedicated a blog post, Oliver Burkeman’s This column will change your life, etc. Experience is another regular that I enjoy a lot. In each column, someone recounts something remarkable that happened to them. It can really be anything.

I was thinking of maybe scraping the titles to get a sense of the most common topics. The final push was my husband telling me about this article by Gabriella Paiella about the best Guardian Experience columns. She wrote “the “Experience” column does often touch on heavier topics”. Can one find out the most prevalent “weight” of Experience columns by scraping all their titles?