I’ll admit I didn’t really know who Bill Nye was before yesterday. His name sounds a bit like Bill Nighy’s, that’s all I knew. But well, science is all around, and quite often scientists on Twitter start interesting campaigns. Remember the #actuallylivingscientists hashtag, whose animals I dedicated a blog post to? This time, the Twitter campaign is the #BillMeetScienceTwitter hashtag, with which scientists introduce themselves to the famous science TV host Bill Nye. Here is a nice article about the movement.
Since I like surfing on Twitter trends, I decided to download a few of these tweets and to use my own R interface to the Monkeylearn machine learning API, monkeylearn
(part of the rOpenSci project!), to classify the tweets in the hope of finding the most represented science fields. So, which science is all around?
Getting the tweets
It might sound a bit like trolling by now, but if you wanna get Twitter data, I recommend using rtweet
because it’s a good package and because it’s going to replace twitteR
which you might know from other blogs.
I only keep tweets in English, and among those only original ones, i.e. not retweets.
library("rtweet")
billmeet <- search_tweets(q = "#BillMeetScienceTwitter", n = 18000, type = "recent")
billmeet <- unique(billmeet)
billmeet <- dplyr::filter(billmeet, lang == "en")
billmeet <- dplyr::filter(billmeet, is_retweet == FALSE)
I’ve ended up with 2491 tweets.
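As a quick sanity check, here is the kind of summary I’d look at; a minimal sketch, and the exact count and dates will of course depend on when you run the search (created_at is the tweet timestamp column returned by rtweet).
# how many tweets are left, and over which period they were posted
nrow(billmeet)
range(billmeet$created_at)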
Classifying the tweets
I’ve chosen to use this taxonomy classifier, which classifies text according to generic topics and had quite a few stars on the Monkeylearn website. I don’t think it was trained on tweets, and well, it wasn’t trained to classify science topics in particular, which is not optimal, but it had the merit of being readily available. I’ve still not started training my own algorithms, and anyway, if I did I’d start by creating a very crucial algorithm for determining animal fluffiness on pictures, not text mining stuff. This was a bit off topic, let’s go back to science Twitter!
When I decided to use my own package, I had forgotten it takes charge of cutting the request vector into groups of 20 tweets, since the API only accepts 20 texts at a time. I thought I’d have to do that splitting myself, but no: since I did it once in the code of the package, I’ll never need to write that code ever again. Great feeling! Look at how easy the code is after cleaning up the tweets a bit! One just needs to wait a bit before getting all the results (I show a small sketch of the batching after the output below).
output <- monkeylearn::monkeylearn_classify(request = billmeet$text,
                                            classifier_id = "cl_5icAVzKR")
str(output)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4466 obs. of 4 variables:
## $ category_id: int 64638 64640 64686 64696 64686 64687 64689 64692 64648 64600 ...
## $ probability: num 0.207 0.739 0.292 0.784 0.521 0.565 0.796 0.453 0.301 0.605 ...
## $ label : chr "Computers & Internet" "Internet" "Humanities" "Religion & Spirituality" ...
## $ text_md5 : chr "f7b28f45ea379b4ca6f34284ce0dc4b7" "f7b28f45ea379b4ca6f34284ce0dc4b7" "b95429d83df2cabb9cd701a562444f0b" "b95429d83df2cabb9cd701a562444f0b" ...
## - attr(*, "headers")=Classes 'tbl_df', 'tbl' and 'data.frame': 0 obs. of 0 variables
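By the way, here is a minimal sketch of the sort of batching the package does internally (this is an illustration, not the package’s actual code): cut the text vector into chunks of at most 20 texts before sending them to the API.
# illustration only: split the texts into batches of at most 20,
# which is roughly what monkeylearn_classify() does behind the scenes
texts <- billmeet$text
batches <- split(texts, ceiling(seq_along(texts) / 20))
length(batches)       # number of API calls needed
lengths(batches)[1:3] # each batch holds at most 20 texts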
In the output, the package creator decided not to put the whole text corresponding to each line but its digested form, computed with the MD5 algorithm. So to join the output to the tweets again, I first have to digest the tweets, which I do by simply copying the code from the package. After all, I wrote it. Maybe it was the only time I successfully used vapply
in my whole life.
billmeet <- dplyr::mutate(billmeet, text_md5 = vapply(X = text,
                                                      FUN = digest::digest,
                                                      FUN.VALUE = character(1),
                                                      USE.NAMES = FALSE,
                                                      algo = "md5"))
billmeet <- dplyr::select(billmeet, text, text_md5)
output <- dplyr::left_join(output, billmeet, by = "text_md5")
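To get a small sample to eyeball, one can peek at a few rows of the joined table, for instance like this (a quick sketch using the columns created above):
# show a handful of classified tweets with their label and probability
output %>%
  dplyr::select(label, probability, text) %>%
  head(n = 5) %>%
  knitr::kable()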
Looking at this small sample, some things make sense, others make less sense, either because the classification isn’t good or because the tweet looks like spam. Since my own field isn’t text analysis, I’ll consider myself happy with these results, but I’d of course be happy to read any better version of it.
As in my #first7jobs post, I’ll make a very arbitrary decision and keep only the labels to which a probability higher than 0.5 was attributed.
output <- dplyr::filter(output, probability > 0.5)
This covers 0.45 of the original tweet sample. I can only hope it’s a representative subset.
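For the record, that proportion is just the share of distinct tweets that kept at least one label above the threshold, which can be computed like this (a sketch, assuming billmeet still holds the 2491 tweets from earlier):
# proportion of the original tweets still represented after filtering
dplyr::n_distinct(output$text_md5) / nrow(billmeet)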
How many labels do I have by tweet?
dplyr::group_by(output, text_md5) %>%
  dplyr::summarise(nlabels = n()) %>%
  dplyr::group_by(nlabels) %>%
  dplyr::summarise(n_tweets = n()) %>%
  knitr::kable()
nlabels | n_tweets |
---|---|
1 | 1415 |
Perfect, only one label per tweet.
Looking at the results
I know I suck at finding good section titles… At least I like the title of the post, which is a reference to the song Bill Nighy, not Bill Nye, sings in Love Actually. My husband assumed that science Twitter would have more biomedical stuff. Now, even if my results were to support this, note that it could just as well be because it’s easier to classify biomedical tweets.
I’ll first show a few examples of tweets for given labels.
dplyr::filter(output, label == "Chemistry") %>%
head(n = 5) %>%
knitr::kable()
category_id | probability | label | text_md5 | text |
---|---|---|---|---|
64701 | 0.530 | Chemistry | e82fc920b07ea9d08850928218529ca9 | Hi @billnye I started off running BLAST for other ppl but now I have all the money I make them do my DNA extractions #BillMeetScienceTwitter |
64701 | 0.656 | Chemistry | d21ce4386512aae5458565fc2e36b686 | .@uw’s biochemistry dept - home to Nobel Laureate Eddy Fischer & ZymoGenetics co founder Earl Davie… https://t.co/0nsZW3b3xu |
64701 | 0.552 | Chemistry | 1d5be9d1e169dfbe2453b6cbe07a4b34 | Yo @BillNye - I’m a chemist who plays w lasers & builds |
64701 | 0.730 | Chemistry | 1b6a25fcb66deebf35246d7eeea34b1f | Meow @BillNye! I’m Zee and I study quantum physics and working on a Nobel prize. #BillMeetScienceTwitter https://t.co/oxAZO5Y6kI |
64701 | 0.873 | Chemistry | 701d8c53e3494961ee7f7146b28b9c8c | Hi @BillNye, I’m a organic chemist studying how molecules form materials like the liquid crystal shown below.… https://t.co/QNG2hSG8Fw |
dplyr::filter(output, label == "Aquatic Mammals") %>%
head(n = 5) %>%
knitr::kable()
category_id | probability | label | text_md5 | text |
---|---|---|---|---|
64609 | 0.515 | Aquatic Mammals | f070a05b09d2ccc85b4b1650139b6cd0 | Hi Bill, I am Anusuya. I am a palaeo-biologist working at the University of Cape Town. @BillNye #BillMeetScienceTwitter |
64609 | 0.807 | Aquatic Mammals | bb06d18a1580c28c255e14e15a176a0f | Hi @BillNye! I worked with people at APL to show that California blue whales are nearly recovered #BillMeetScienceTwitter |
64609 | 0.748 | Aquatic Mammals | 1ca07aad8bc1abe54836df8dd1ff1a9d | Hi @BillNye! I’m researching marine ecological indicators to improve Arctic marine monitoring and management… https://t.co/pJv8Om4IeI |
64609 | 0.568 | Aquatic Mammals | a140320fcf948701cfc9e7b01309ef8b | More like as opposed to vaginitis in dolphins or chimpanzees or sharks #BillMeetScienceTwitter https://t.co/gFCQIASty1 |
64609 | 0.520 | Aquatic Mammals | 06d1e8423a7d928ea31fd6db3c5fee05 | Hi @BillNye I study visual function in ppl born w/o largest connection between brain hemispheres #callosalagenesis… https://t.co/WSz8xsP38R |
dplyr::filter(output, label == "Internet") %>%
head(n = 5) %>%
knitr::kable()
category_id | probability | label | text_md5 | text |
---|---|---|---|---|
64640 | 0.739 | Internet | f7b28f45ea379b4ca6f34284ce0dc4b7 | @BillNye #AskBillNye @BillNye join me @AllendaleCFD. More details at https://t.co/nJPwWARSsa #BillMeetScienceTwitter |
64640 | 0.725 | Internet | b2b7843dc9fcd9cd959c828beb72182d | @120Stat you could also use #actuallivingscientist #womeninSTEM or #BillMeetScienceTwitter to spread the word about your survey as well |
64640 | 0.542 | Internet | a357e1216c5e366d7f9130c7124df316 | Thank you so much for the retweet, @BillNye! ☺️ I’m excited for our next generation of science-lovers!… https://t.co/B3iz3KVCOQ |
64640 | 0.839 | Internet | 61712f61e877f3873b69fed01486d073 | @ParkerMolloy Hi @BillNye, Im an elem school admin who wants 2 bring in STEM/STEAM initiatives 2 get my students EX… https://t.co/VMLO3WKVRv |
64640 | 0.924 | Internet | 4c7f961acfa2cdd17c9af655c2e81684 | I just filled my twitter-feed with brilliance. #BIllMeetScienceTwitter |
Based on that, and on the huge number of internet-labelled tweets, I decided to remove those.
library("ggplot2")
library("viridis")
label_counts <- output %>%
  dplyr::filter(label != "Internet") %>%
  dplyr::group_by(label) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n))
label_counts <- label_counts %>%
  dplyr::mutate(label = ifelse(n < 5, "others", label)) %>%
  dplyr::group_by(label) %>%
  dplyr::summarize(n = sum(n)) %>%
  dplyr::arrange(desc(n))
label_counts <- dplyr::mutate(label_counts,
                              label = factor(label,
                                             ordered = TRUE,
                                             levels = unique(label)))
ggplot(label_counts) +
  geom_bar(aes(label, n, fill = label), stat = "identity") +
  scale_fill_viridis(discrete = TRUE, option = "plasma") +
  theme(axis.text.x = element_text(angle = 90,
                                   hjust = 1,
                                   vjust = 1),
        text = element_text(size = 25),
        legend.position = "none")
In the end, I’m always skeptical when looking at the results of such classifiers, and at the quality of my sample to begin with. But then, I doubt there ever was a hashtag that was used only to answer the question, without spamming or commenting on it (which is what I’m doing). That said, the results do seem to support my husband’s hypothesis about biomedical stuff.
I’m pretty sure Bill Nye won’t have had the time to read all the tweets, but I think he should save them, or at least all the ones he can get via the Twitter API thanks to e.g. rtweet
, in order to be able to look through them the next time he needs an expert. And in the random sample of tweets he’s read, let’s hope he was exposed to a great diversity of science topics (and of scientists), although, hey, the health and life related stuff is the most interesting of course. Just kidding. I liked reading tweets about various scientists, science rocks! And these last words would be labelled with "performing arts", a perfect way to end this post.