I was fascinated by this xkcd comic about where to live based on your temperature preferences. I also thought it’d be fun to try to make a similar one from my R session! Since I’m no meteorologist and was a bit unsure of how to define winter and summer, and of their relevance in countries like, say, India which has monsoon, I decided to focus on a single country located in one hemisphere only and big enough to offer some variety… the USA! So, dear American readers, where should you live based on your temperature preferences?
Defining data sources
Weather data
The data for the original xkcd graph comes from weatherbase. I changed sources because 1) I was not patient enough to wait for weatherbase to send me a custom dataset which I imagine is what xkcd author did and 2) I’m the creator of a cool package accessing airport weather data for the whole word including the US! My package is called “riem” like “R Iowa Environmental Mesonet” (the source of the data, a fantastic website) and “we laugh” in Catalan (at the time I wrote the package I was living in Barcelona and taking 4 hours of Catalan classes a week!). It’s a simple but good package which underwent peer-review at rOpenSci onboarding, thank you Brooke!
I based my graph on data from the last winter and the last summer. I reckon that one should average over more years, but nothing important is at stake here, right?
Cities sample
My package has a function for downloading weather data for a given airport based on its ID. For instance Los Angeles airport is LAX. At that point I just needed a list of cities in the US with their airport code. Indeed with my package you can get all airport weather networks, one per US state, and all the stations within that network… this is a lot of airports with no way to automatically determine how big they are! And to get the city name since it’d be so hard to parse the airport name I’d have resorted to geocoding with e.g. this package. A bit complicated for a simple fun graph!
So I went to the Wikipedia page of the busiest airports in the US and ended up getting this dataset from the US Department of Transportation (such a weird open data format by the way… I basically copy-pasted the first two columns in a spreadsheet!). This is not perfect, getting a list of major cities in every state would be more fair but hey reader I want you to live in a really big city so that I might know where it is.
Ok so let’s get the data
us_airports <- readr::read_csv("data/2017-11-16_wheretoliveus_airports.csv")
knitr::kable(us_airports[1:10,])
Name | Code |
---|---|
Atlanta, GA (Hartsfield-Jackson Atlanta International) | ATL |
Los Angeles, CA (Los Angeles International) | LAX |
Chicago, IL (Chicago O’Hare International) | ORD |
Dallas/Fort Worth, TX (Dallas/Fort Worth International) | DFW |
New York, NY (John F. Kennedy International) | JFK |
Denver, CO (Denver International) | DEN |
San Francisco, CA (San Francisco International) | SFO |
Charlotte, NC (Charlotte Douglas International) | CLT |
Las Vegas, NV (McCarran International) | LAS |
Phoenix, AZ (Phoenix Sky Harbor International) | PHX |
I first opened my table of 50 airports. Then, I made the call to the Iowa Environment Mesonet.
summer_weather <- purrr::map_df(us_airports$Code, riem::riem_measures,
date_start = "2017-06-01",
date_end = "2017-08-31")
winter_weather <- purrr::map_df(us_airports$Code, riem::riem_measures,
date_start = "2016-12-01",
date_end = "2017-02-28")
We then remove the lines that will be useless for further computations (The xkcd graph uses average temperature in the winter and average Humidex in the summer which uses both temperature and dew point).
summer_weather <- dplyr::filter(summer_weather,
!is.na(tmpf), !is.na(dwpf))
winter_weather <- dplyr::filter(winter_weather,
!is.na(tmpf))
I quickly checked that there was “nearly no missing data” so I didn’t remove any station nor day but if I were doing this analysis for something serious I’d do more checks including the time difference between measures for instance. Note that I end up with 48 airports only, there was no measure for Honolulu, HI (Honolulu International), San Juan, PR (Luis Munoz Marin International). Too bad!
Calculating the two weather values
I started by converting all temperatures to Celsius degrees because although I like you, American readers, Fahrenheit degrees do not mean anything to me which is problematic for checking results plausibility for instance. I was lazy and just used Brooke Anderson’s (yep the same Brooke who reviewed riem
for rOpenSci) weathermetrics
package.
summer_weather <- dplyr::mutate(summer_weather,
tmpc = weathermetrics::convert_temperature(tmpf,
old_metric = "f", new_metric = "c"),
dwpc = weathermetrics::convert_temperature(dwpf,
old_metric = "f", new_metric = "c"))
winter_weather <- dplyr::mutate(winter_weather,
tmpc = weathermetrics::convert_temperature(tmpf,
old_metric = "f", new_metric = "c"))
Summer values
Summer values are Humidex values. The author explained Humidex “combines heat and dew point”. Once again I went to Wikipedia and found the formula. I’m not in a mood to fiddle with formula writing on the blog so please go there if you want to see it. There’s also a package with a function returning the Humidex. I was feeling adventurous and therefore wrote my own function and checked the results with the numbers from Wikipedia. It was first wrong because I wrote a “+” instead of a “-”… checking one’s code is crucial.
calculate_humidex <- function(temp, dewpoint){
temp + 0.5555*(6.11 *exp(5417.7530*(1/273.16 - 1/(273.15 + dewpoint))) - 10)
}
calculate_humidex(30, 15)
## [1] 33.969
calculate_humidex(30, 25)
## [1] 42.33841
And then calculating the summer Humidex values by station was quite straightforward…
library("magrittr")
summer_weather <- dplyr::mutate(summer_weather,
humidex = calculate_humidex(tmpc, dwpc))
summer_values <- summer_weather %>%
dplyr::group_by(station) %>%
dplyr::summarise(summer_humidex = mean(humidex, na.rm = TRUE))
… as it was for winter temperatures.
winter_values <- winter_weather %>%
dplyr::group_by(station) %>%
dplyr::summarise(winter_tmpc = mean(tmpc, na.rm = TRUE))
Prepare data for plotting
I first joined winter and summer values
climates <- dplyr::left_join(winter_values, summer_values,
by = "station")
Then I re-added city airport names.
climates <- dplyr::left_join(climates, us_airports,
by = c("station" = "Code"))
I only kept the city name.
head(climates$Name)
## [1] "Atlanta, GA (Hartsfield-Jackson Atlanta International)"
## [2] "Austin, TX (Austin - Bergstrom International)"
## [3] "Nashville, TN (Nashville International)"
## [4] "Boston, MA (Logan International)"
## [5] "Baltimore, MD (Baltimore/Washington International Thurgood Marshall)"
## [6] "Cleveland, OH (Cleveland-Hopkins International)"
climates <- dplyr::mutate(climates,
city = stringr::str_replace(Name, " \\(.*", ""))
head(climates$city)
## [1] "Atlanta, GA" "Austin, TX" "Nashville, TN" "Boston, MA"
## [5] "Baltimore, MD" "Cleveland, OH"
Plotting!
When imitating an xkcd graph, one should use the xkcd
package! I had already done that in this post.
library("xkcd")
library("ggplot2")
library("extrafont")
library("ggrepel")
xrange <- range(climates$summer_humidex)
yrange <- range(climates$winter_tmpc)
set.seed(42)
ggplot(climates,
aes(summer_humidex, winter_tmpc)) +
geom_point() +
geom_text_repel(aes(label = city),
family = "xkcd",
max.iter = 50000)+
ggtitle("Where to live based on your temperature preferences",
subtitle = "Data from airports weather stations, 2016-2017") +
xlab("Summer heat and humidity via Humidex")+
ylab("Winter temperature in Celsius degrees") +
xkcdaxis(xrange = xrange,
yrange = yrange)+
theme_xkcd() +
theme(text = element_text(size = 16, family = "xkcd"))
So, which city seems attractive to you based on this plot? Or would you use different weather measures, e.g. do you care about rain?