webscraping · Maëlle Salmon's personal website

Spookify: Halloween Name Generation in R

2018/10/05

webscraping / rvest / robotstxt / polite / memoise / ratelimitr

It’s October, time for spooky Twitter names! If you’re on this social media platform, you might have noticed some of your friends switching their names to something spooky and punny. Last year I was “Maelstrom Salmon”, which I find scary but is arguably not that funny. Anyhow, what if you want to switch your name but have no inspiration? In this post, we shall explore how R can help us with that, using webscraping, phonetic spelling and string distance algorithms, and the magic of randomness!

O'Reilly animals in trouble? Conservation status of book covers

2018/08/25

webscraping / rvest / robotstxt / polite / memoise / ratelimitr / taxize / rredlist

What do a kaka, a kakapo, a European rabbit and a grey heron have in common? Well, they might cohabit on the bookshelf of an R user, since they’re all animals on the covers of popular R books: “R Packages”, “R for Data Science”, “Text mining with R” and “Efficient R programming”, respectively. Their publisher, O’Reilly, has based its brand on covers featuring beautiful engravings of animals.

Recently, while wondering what the name of the R for Data Science bird was again (I thought it was a kea!), I was thrilled to find the whole O’Reilly menagerie, i.e. a list of books and corresponding animals! The website also features a link to “A short history of the O’Reilly animals”, which was an amazing read. It noted that “The animals are in trouble.”, with a few examples of endangered species. It inspired me to actually try and assess the conservation status of O’Reilly animals using responsible webscraping, taxonomic name resolving and IUCN Red List API querying…

ALLSTATisticians in decline? A polite look at ALLSTAT email Archives

2018/07/31

webscraping / rvest / robotstxt / polite / memoise / ratelimitr

I was until recently subscribed to an email list, ALLSTAT, “A UK-based worldwide e-mail broadcast system for the statistical community, operated by ICSE for HEA Statistics.” created in 1998. That’s how I saw the ad for my previous job in Barcelona! Now, I dislike emails more and more so I unsubscribed, but I’d still check out the archives any time I need a job, since many messages are related to openings. Nowadays, I probably identify more as a research software engineer or data scientist than a statistician… which made me wonder, when did ALLSTAT start featuring data scientist jobs? How does their frequency compare to that of statistician jobs?

In this post, I’ll webscrape and analyse metadata of ALLSTAT emails. It’ll also be the occasion for me to take the wonderful new polite package, which helps with respectful webscraping, for a ride!

Really not a fish? Scraping my mathematical family tree

2018/06/18

webscraping / rvest / robotstxt / PhD / DiagrammeR / igraph / GraphViz

It’s been nearly two years since I defended my PhD thesis! On top of allowing me to call myself doctor, having a PhD in statistics gives me the honour of featuring in the data of the Mathematics Genealogy Project. Today, I decided to webscrape my mathematical ancestors.

Bad Stock Photos of My Job? Data Science on Pexels

2018/05/06

rcorpora / robotstxt / tidytext / NLP / webscraping / rvest / hrbrthemes

I couldn’t miss the fun Twitter hashtag #BadStockPhotosOfMyJob thanks to a tweet by Julia Silge and another one by Colin Fay. The latter inspired me to actually go and look for what makes a data science photo… What characterizes “data science” stock photos?

Rainbowing a set of pictures

2018/01/07

magick / RSelenium / rvest / Selenium / webscraping / xkcd / xkcdcolors / collage

I’ve now done a few collages from R using magick: the faces of #rstats Twitter, We R-Ladies with Lucy D’Agostino McGowan, and a holiday card for R-Ladies. The faces of #rstats Twitter and holiday card collages were arranged at random, while the We R-Ladies one was a mosaic forming the R-Ladies logo. I got the idea to up my collage skills by trying to learn how to arrange pics by their main colour, like a rainbow. The verb rainbow doesn’t exist, and “rainbowing” doesn’t mean ordering by colour, but I didn’t let this stop me.

It was the occasion to grab some useful knowledge about colours, not useless for someone who did not even know about Pantone’s Colors of the Year a few weeks ago…

This post has nothing to do with Kesha’s new album. However, you can listen to it while reading since it’s so good, but maybe switch to something older from her when I use “$”.

Cheer up, Black Metal Cats! Bubblegum Puppies

2018/01/04

magick / rtweet / Twitter / puppies / NLP / tidytext / cleanNLP / praise / rvest / webscraping / Selenium / RSelenium

Do you know the Black Metal Cats Twitter account? As explained in this great introduction, it “combines kitties with heavy metal lyrics”. I know the account because I follow Scott Chamberlain who retweets them a lot, which I enjoy as far as one can enjoy such a dark mood. Speaking of which, I decided to try and transform Black Metal Cat tweets into something more positive… The Bubblegum Puppies were born!

You beautiful, naïve, sophisticated newborn series

2017/11/05

robotstxt / parks-and-recreation / webscraping / subtools / snowballC / tm / wordcloud / rvest / hrbrthemes

My husband and I recently started watching the wonderful series “Parks and recreation” which was recommended to me by my fellow R-Lady Jennifer Thompson in this very convincing thread. The series was even endorsed by other R-Ladies. Jennifer told me the first two seasons are not as good as the following ones, but that it was worth making it through them. We actually started enjoying the humor and characters right away!

Then, this week while watching the show, one of the characters did a very basic text analysis that made me feel like imitating him for a blog post – my husband told me it was very Leslie of me to plan something while doing something else, which made me very proud. I tested my idea on other Leslie fans, and they seemed to think it was a great idea… and that this post should be the beginning of a series of R-Ladies blog posts about Parks and recreation!

In this two-short-part blog post, I’ll therefore inaugurate this series, what an honor!

R-Ladies global tour

2017/10/06

magick / gif / rvest / webscraping / traveling salesperson problem / TSP / mixed integer linear programming / great circles / geosphere / kitsch plot

It was recently brought to my attention by Hannah Frick that there are now sooo many R-Ladies chapters around the world! R-Ladies is a world-wide organization to promote gender diversity in the R community, and I’m very grateful to be part of this community through which I met so many awesome ladies! Since we’re all connected, it has now happened quite a few times that R-Ladies gave talks at chapters outside of their hometowns. An R-Lady from Taiwan giving a talk in Madrid while on a trip in Europe and another one doing the same in Lisbon, an R-Lady from San Francisco presenting at the London and Barcelona chapters thanks to a conference on the continent, an R-Lady from Uruguay sharing her experience with the New York City and San Francisco chapters… It’s like rock star tours!

Therefore we R-Ladies often joke about doing an exhaustive global tour. Hannah made me think about this tour again… If someone were to really visit all of the chapters, what would be the shortest itinerary? And could we do a cool gif with the results? These are the problems we solve here.

The Guardian Experience: heavy or light topics?

2017/10/02

rcorpora / robotstxt / tidytext / NLP / rvest / webscraping / The Guardian / hrbrthemes

I’ve recently been binge-reading The Guardian Experience columns. I’m a big fan of The Guardian life and style section regulars: the blind dates to which I dedicated a blog post, Oliver Burkeman’s This column will change your life, etc. Experience is another regular that I enjoy a lot. In each column, someone tells of something remarkable that happened to them. It can really be anything.

I was thinking of maybe scraping the titles and getting a sense of the most common topics. The final push was my husband telling me about this article by Gabriella Paiella about the best Guardian Experience columns. She wrote “the “Experience” column does often touch on heavier topics”. Can one find out the most prevalent “weight” of Experience columns by scraping all their titles?

The music of Les Mills Body Pump, with Spotify data

2017/10/01

rvest / webscraping / Body Pump / music

I am a runner but also a Body Pump enthusiast. Body Pump is a group fitness class of the Les Mills company, in which you train different muscle groups using a weighted bar – whose total weight you modulate with plates in order to adapt it to your fitness level and to the muscle group. Like R, Body Pump was created in New Zealand, what a wonderful country! Every three months, a new class is released, with new songs and choreographies. What doesn’t change is the muscle group trained in each of the 10 songs of each class.

I’ve thought of analysing Body Pump data for a long time now but could never find what I was looking for, which was a dataset of the number of “reps” per song, e.g. how many squats you do in each squats song. Then I realized I could also play with other data, like a quite comprehensive list of songs used in releases. I decided to combine this information with information about the style of the corresponding artist on Spotify. Here is what I came up with!

Radio edit: an improved scraping of and look at Radio Swiss classic program

2017/04/30

rvest / webscraping / quantreg / fuzzyjoin / hrbrthemes / music / Radio Swiss Classic

Last week I published a post about scraping the Radio Swiss Classic program. After that, Bob Rudis wrote an extremely useful post improving my code a lot and teaching me cool stuff. I don’t know why I forgot to add pauses between requests… Really bad behaviour! I will use his code today for re-scraping the data.

Why re-scrape the data? I mentioned broken links in my post. In fact, each time I hit a broken page, the Radio Swiss Classic webmaster received an email. That person received a lot of emails because of me. They fixed the bug behind these broken pages and contacted me because someone had turned me in (I feel super famous or spied on now), very kindly mentioning they had fixed all pages, and not holding any grudge against me. So let’s scrape everything again!

A classical analysis (Radio Swiss classic program)

2017/04/23

rvest / webscraping / music / Radio Swiss Classic / hrbrthemes

I am not a classical music expert at all, but I happen to have friends who are, and am even married to someone who plays the cello (and the ukulele!). I appreciate listening to such music from time to time, in particular Baroque music. A friend made me discover Radio Swiss classic, an online radio playing classical music all day and all night long, with a quite nice variety, and very little speaking between pieces, with no ads (thank you, funders of the radio!). Besides, the voices telling me which piece has just been played are really soothing, so Radio Swiss classic is a good one in my opinion.

Today, instead of anxiously waiting for the results of the French presidential elections, I decided to download the radio’s program from the last few years and have a quick look at it, since after all, the website says that the radio aims at relaxing people.

Hundreds of Guardian blind dates

2017/03/07

rcorpora / tidytext / NLP / rvest / webscraping / The Guardian / hrbrthemes

One of my more or less guilty pleasures is reading The Guardian blind date each week. I think I started doing this when living in Cambridge, England for five months. I would buy i every weekday and The Guardian week-end every week-end. I wasn’t even dating at the time I discovered The Guardian blind dates but I’ve always liked their format.

I get so much into each date report that seeing both participants say they want to meet again makes me ridiculously happy. I like wondering how matches were made, but today I just want to look into the contents of post-date interviews.

Extracting notable deaths from Wikipedia

2017/02/12

fuzzyjoin / webscraping / Wikipedia / rvest

I like Wikipedia. My husband likes it even more: he included it in his PhD thesis acknowledgements! I appreciate the efforts made to share knowledge, and also the apparently random stuff you can find on the website. In particular, I’ve been intrigued by the monthly lists of notable deaths such as this one. Who are the people (or dogs, yes, dogs) whose life was deemed notable enough to be listed there? Also, using the numbers of such deaths, can I judge whether 2016 was really worse than previous years? The first step in answering these questions was to scrape the data. I’ll describe the process in this post. In another post I’ll have a look at my study population and in a third post I’ll analyse the time series of death counts.

The Rt of naming your blog

2017/01/29

rvest / webscraping / R-bloggers

In this post, I’m sharing a brand-new analysis! The reason for this is my blog being added to R-bloggers by Tal Galili after I filled in this form. R-bloggers is a collection of blogs about R, whose new posts get added to the website via the magic of RSS feeds. R-bloggers even has a Twitter account. As a reader of R-bloggers you get exposed to many different analyses and ideas, and as an R-blogger you reach a wider audience, so really it’s a useful website. Tal does a great job maintaining R-bloggers and understandably likes seeing R-bloggers mentioning the website on their blog. I already do so in the About section and in one article, but I’ve consistently failed to do it in the last two posts because I got too caught up in the article at hand to think about anything else. So I figured the best way not to forget to thank Tal for his work was to do an analysis about R-bloggers! Genius, I know. I’ve scraped the full list of contributing blogs and had a look at their names and addresses.