hrbrthemes · Maëlle Salmon's personal website

Bad Stock Photos of My Job? Data Science on Pexels

2018/05/06

rcorpora / robotstxt / tidytext / NLP / webscraping / rvest / hrbrthemes

I couldn’t miss the fun Twitter hashtag #BadStockPhotosOfMyJob thanks to a tweet by Julia Silge and another one by Colin Fay. The latter inspired me to actually go and look for what makes a data science photo… What characterizes “data science” stock photos?

Lintr Bot, lintr's Hester egg

2018/03/30

package-development / lintr / GitHub-API-V4 / GitHub-API / GitHub / ghrecipes / ghql / jq / jqr / anytime / styler / hrbrthemes

Remember my blog post about automatic tools for improving R packages? One of these tools is Jim Hester’s lintr, a package that performs static code analysis. In my experience it mostly helps identify overly long lines of code and missing spaces, although it’s a bit more involved than that. In any case, lintr helps you maintain good code style, and as mentioned in that now old post of mine, you can add a lintr unit test to your package, which will ensure you don’t get lazy over time.

Now say your package has a lintr unit test and lives on GitHub. What happens if someone makes a pull request and writes looong code lines? Continuous integration builds will fail, but that’s not all… The contributor will get to know Lintr Bot, lintr’s Hester (Easter) egg!

Are #python users more likely to get into Slytherin?

2018/01/01

Harry Potter / Hogwarts / rtweet / Twitter / hrbrthemes

This post requires some familiarity with the Harry Potter books but I’m committed to making this blog friendly to everyone, even Muggles/Nomajes.

Have you seen Mark Sellors’ blog post series about writing command line utilities in R? It’s a great one but I was a bit puzzled by his using randomness to assign houses in his sorting hat example (he added a new method based on name digest-ing in the meantime).

This is a really cool #rstats tutorial... but since when does the sorting hat assign a random Hogwarts house?! 🧙‍♀️🎩🎲➡️🏠🤔 https://t.co/Ff8CHR6jb9
— Maëlle Salmon 🐟 (@ma_salmon) December 19, 2017

This prompted a reply by David Hood who later came up with R code to assign you to a Hogwarts house based on your Twitter activity!

It should be possible to assign House on the basis of Twitter analysis (among R using tweeters). Quatitatively:
Original posts - opinionated - Gryffindor
Replies - social - Slytherin
posts links out of Twitter - homework - Ravenclaw
Retweets- keeping it all working - Hufflepuff
— David Hood (@Thoughtfulnz) December 19, 2017

I was thrilled to see David Hood’s sorting hat GitHub repo and thought it’d be the perfect occasion to answer that fascinating question: are #python users more likely to get into Slytherin than #rstats users?

Another note: I do not care about any Python vs. R fights except for Quidditch games, so go away trolls.

Where have you been? Getting my Github activity

2017/12/21

GitHub / Google BigQuery / bigrquery / hrbrthemes / CV

On my pretty and up-to-date CV, one of the first things one sees is my GitHub username, linking to my GitHub profile. What does a potential employer look at there? Hopefully not my uninformative commit messages… My imitating a red Ampelmann, my being part of several organizations, my pinned repositories described with emojis… But how would they know where and how I’ve mostly been active without too much effort?

A considerable part of my GitHub work happens in organizations: I’m a co-editor for rOpenSci’s onboarding of packages, I contribute content to the R Weekly newsletter, etc. Although my profile shows the organizations I belong to, one would need to dig into them for a while before seeing how much or how little I’ve done. Which is fine most of the time but less so when trying to profile myself for jobs, right? Let’s try and fetch some GitHub data to create a custom profile.

Note: yep I’m looking for a job and ResearchGate’s suggestions are not helpful! Do you need an enthusiastic remote data scientist or research software engineer for your team? I’m available up to 24 hours a week! I care a lot about science, health, open source and community. Ideally I’d like to keep working in something close to public research but we can talk!

You beautiful, naïve, sophisticated newborn series

2017/11/05

robotstxt / parks-and-recreation / webscraping / subtools / snowballC / tm / wordcloud / rvest / hrbrthemes

My husband and I recently started watching the wonderful series “Parks and Recreation”, which was recommended to me by my fellow R-Lady Jennifer Thompson in this very convincing thread. The series was even endorsed by other R-Ladies. Jennifer told me the first two seasons are not as good as the following ones, but that it was worth it to make it through them. We actually started enjoying the humor and characters right away!

Then, this week while watching the show, one of the characters did a very basic text analysis that made me feel like imitating him for a blog post. My husband told me it was very Leslie of me to plan something while doing something else, which made me very proud. I tested my idea on other Leslie fans, and they seemed to think it was a great idea… and that this post should be the beginning of a series of R-Ladies blog posts about Parks and Recreation!

In this two-short-part blog post, I’ll therefore inaugurate this series, what an honor!

The Guardian Experience: heavy or light topics?

2017/10/02

rcorpora / robotstxt / tidytext / NLP / rvest / webscraping / The Guardian / hrbrthemes

I’ve recently been binge-reading The Guardian Experience columns. I’m a big fan of The Guardian life and style section regulars: the blind dates to which I dedicated a blog post, Oliver Burkeman’s This column will change your life, etc. Experience is another regular that I enjoy a lot. In each column, someone tells of something remarkable that happened to them. It can really be anything.

I had been thinking of scraping the titles to get a sense of the most common topics. The final push was my husband’s telling me about this article by Gabriella Paiella about the best Guardian Experience columns. She wrote that “the ‘Experience’ column does often touch on heavier topics”. Can one find out the most prevalent “weight” of Experience columns by scraping all their titles?

Who is talking about the French Open?

2017/06/10

rtweet / Twitter / NLP / cld2 / cld3 / language-detection / hrbrthemes

I don’t think rOpenSci’s Jeroen Ooms can ever top the coolness of his magick package, but I have to admit the other things he’s developed are not bad at all. He’s recently been working on interfaces to Google’s Compact Language Detectors 2 and 3 (the latter being more experimental). I saw this cool use case and started thinking about other possible applications of the packages.

I was very sad when I realized it was too late to try and download tweets about the Eurovision Song Contest, but then I remembered there’s this famous tennis tournament going on right now, about which people probably tweet in various languages. I don’t follow the French Open myself, but it seemed interesting to find out which languages were the most prevalent, whether the results from the cld2 and cld3 packages are similar to each other, and whether they’re similar to the language detection results from Twitter itself.

Radio edit: an improved scraping of and look at Radio Swiss classic program

2017/04/30

rvest / webscraping / quantreg / fuzzyjoin / hrbrthemes / music / Radio Swiss Classic

Last week I published a post about scraping the Radio Swiss Classic program. After that, Bob Rudis wrote an extremely useful post improving my code a lot and teaching me cool stuff. I don’t know why I forgot to add pauses between requests… Really bad behaviour! I will use his code today to re-scrape the data.

Why re-scrape the data? I mentioned broken links in my post. In fact, each time I hit a broken page, the Radio Swiss Classic webmaster received an email. That person received a lot of emails because of me. They fixed the bug that caused these broken pages and contacted me because someone had turned me in (I feel super famous or spied on now), very kindly mentioning they had fixed all pages, and not holding any grudge against me. So let’s scrape everything again!

A classical analysis (Radio Swiss classic program)

2017/04/23

rvest / webscraping / music / Radio Swiss Classic / hrbrthemes

I am not a classical music expert at all, but I happen to have friends who are, and am even married to someone who plays the cello (and the ukulele!). I appreciate listening to such music from time to time, in particular Baroque music. A friend introduced me to Radio Swiss Classic, an online radio playing classical music all day and all night long, with quite a nice variety, very little speaking between pieces, and no ads (thank you, funders of the radio!). Besides, the voices telling me which piece has just been played are really soothing, so Radio Swiss Classic is a good one in my opinion.

Today, instead of anxiously waiting for the results of the French presidential elections, I decided to download the radio’s program for the last few years and have a quick look at it, since after all, the website says that the radio aims at relaxing people.

Mapping waxwings annual migration without Twitter

2017/04/08

gif / emojifont / magick / scrubr / spocc / hrbrthemes

Recently a reader left a comment on this blog mentioning his cool blog post in which he mapped the spread of a migratory bird using Twitter. His data source was the Waxwings UK account which reports sightings of Bohemian waxwings in the UK. I decided to try reproducing and extending his work using the rOpenSci spocc package that interfaces different sources of species occurrence data.

Hundreds of Guardian blind dates

2017/03/07

rcorpora / tidytext / NLP / rvest / webscraping / The Guardian / hrbrthemes

One of my more or less guilty pleasures is reading The Guardian blind date each week. I think I started doing this when living in Cambridge, England for five months. I would buy i every weekday and The Guardian week-end every week-end. I wasn’t even dating at the time I discovered The Guardian blind dates but I’ve always liked their format.

I get so much into each date report that seeing both participants say they want to meet again makes me ridiculously happy. I like wondering how matches were made, but today I just want to look into the contents of post-date interviews.