I’ll be giving a remote keynote talk at the Rencontres R (the French R conference) on July 12th, all in French. This blog post is a written version of my presentation, but in English. For once I decided not to talk about package development, but rather about workflows and how to structure and run an analysis.1
Many thanks to Christophe Dervieux for useful feedback on this post! Merci beaucoup !
This post was featured on the R Weekly podcast by Eric Nantz.
Why discuss the making of R projects? Because any improvement in your workflow might improve the experience of anyone trying to run or audit an analysis later, including yourself. It’s also nice that there are always elements to refine, although that also means one might be procrastinating instead of making actual progress, so beware!
Now why should I be the one discussing the making of R projects? While I am not often in charge of analyses these days, I follow R news quite closely, so I thought I might be able to deliver some useful tips to the audience.
Some “basics”
To start with I’d like to mention some rules or principles that are quite crucial. My main source of information for them is Jenny Bryan.
Like many, I identified with Sharla Gelfand’s quote2, as reported by Kara Woo on Twitter: “Everything I know is from Jenny Bryan”.
There’s even a logo created by Martin Monkman for all of us agreeing with that statement! It features the letters EIKIFJB for “Everything I know is from Jenny Bryan” as well as a laptop on fire.
Why a laptop on fire?!
The only two things that make @JennyBryan 😤😠🤯. Instead use projects + here::here() #rstats pic.twitter.com/GwxnHePL4n
— Hadley Wickham (@hadleywickham) December 11, 2017
That’s a reference to a talk Jenny Bryan gave years ago, in which she said she’d set your computer on fire if you used setwd("C:\Users\jenny\path\that\only\I\have") or rm(list = ls()) at the beginning of your scripts. Both go against the notion of a “project-oriented workflow”. I do not need to repeat the wisdom of her blog post in mine, so I will only give cliff notes & some remarks:
- It’s handy to create your project with usethis::create_project(). The RStudio IDE and projects work well together, cf. IDE support for projects and RStudio projects (although you could use the principles of a project-oriented workflow outside of RStudio!).
- When reading in data etc., use paths relative to the root of your project, possibly using the here package by Kirill Müller (a small sketch follows this list).
- Restart R often, and do not e.g. load packages in your .Rprofile; see usethis::use_blank_slate().
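As a minimal sketch of what that looks like in practice (assuming a data/ folder at the project root; the file name is made up):

# Paths built with here::here() are relative to the project root,
# so this script works wherever the project folder is moved or cloned.
library(here)

raw <- read.csv(here("data", "raw_data.csv"))
summary(raw)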
Really, go and read her post if you haven’t yet!
In my talk I’ll show pictures of the Kei-tora Gardening Contest, where participants create beautiful gardens on little trucks, which I find is a good image of the independence a project should have (you can move your truck; you should be able to move your project too).
The Kei-tora Gardening Contest is an annual event put on by the Japan Federation of Landscape Contractors to see who can build the best looking garden on the back of a Kei truck. Here are some of the winners from the past 12 years. https://t.co/IpkRiALn7a pic.twitter.com/vep4K0VZt6
— Doctor Popular (@DocPop) June 10, 2021
As the first part of my talk really is a collection of awesome content by Jenny Bryan, I will also show her rules for naming files. What Shakespeare said about roses isn’t true when programming. 😅
Then I’ll “introduce” version control. I’ll first underline that you want to have a back-up of your stuff somewhere no matter how you do it (so even if you do not use git, think of backing things up with a good system!). I’ll briefly mention some basic git commands, and most importantly, how to run them! Here are my preferences:
- Using usethis (usethis::use_git(), usethis::use_github(), etc.) and gert (gert::git_push()) so that I don’t need to leave R (see the sketch after this list).
- Using the RStudio git pane to not be too far from the R console.
- Using the terminal for commands copy-pasted from somewhere on the web (and for the ones I now know by heart!).
- Using a git interface like GitKraken (or the editor VS Code, which has a good git integration) for more complicated stuff… which I haven’t really ever fully explored, actually.
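As a sketch of what the stay-in-R option looks like (the commit message is made up):

usethis::use_git()                    # initialise a git repo in the current project
usethis::use_github()                 # create a GitHub repo and set it as remote
gert::git_commit_all("Add analysis")  # stage and commit all changed files
gert::git_push()                      # push to the default remote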
I’ll share links to my favorite git resources that I send to anyone who asks (or doesn’t ask):
- Excuse Me, Do You Have a Moment to Talk About Version Control?, Jenny Bryan.
- Happy Git and GitHub for the useR, Jenny Bryan, the STAT 545 TAs, Jim Hester.
Great observation: if you can write #rstats, you can use Git.
Great advice: when or before things get funky, practice tricky bits and disaster recovery in a bogus repo. https://t.co/XVwHP9K4av
— Jenny Bryan (@JennyBryan) March 4, 2018
And in French there’s a great blog post by ThinkR: Travailler avec Git via RStudio et versionner son code.
How to protect your project from external changes
Here’s a maximum credible accident: you write some pretty and handy code munging your data using package::my_favorite_function(). Now you go and update that package and realize my_favorite_function() is gone! It was apparently removed for good reasons, but now your script is broken!
To prevent that, you need to encapsulate your project. You can track and restore the package dependencies of your project by using the renv package by Kevin Ushey. The renv package is the successor of the packrat package by the same author.
Using renv is actually quite easy:
- In a new project you run renv::init();
- After that you install and remove packages as you normally would (renv is smart and will copy files from your local non-project library to be faster). Regularly run renv::snapshot().3 Metadata about packages (where they come from) is stored in the renv.lock file that you’d put under version control;
- Anyone getting the project runs renv::restore() to get the exact same project library as you (the whole round trip is sketched below).
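In code, the whole round trip boils down to a few calls (a sketch; dplyr stands in for any package you’d install):

renv::init()               # create a project library and a first renv.lock
install.packages("dplyr")  # install packages as you normally would
renv::snapshot()           # record package versions in renv.lock
# ...later, on a collaborator's machine, after getting the project:
renv::restore()            # reinstall the exact versions from renv.lock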
Also worthy of a mention is capsule by Miles McBain, which is “an inversion of renv for low effort reproducible R package libraries”.
Now if you want to go further and also freeze the operating system used, etc., you could check out Docker.
What structure for your project?
What’s in your project? Probably something like:
- Data or the code to get them from a database or a remote resource;
- Some code munging and analysing them;
- Some output that could be a graph, a report etc.
Now how should you structure your project? It’s important to use a structure that’s consistent across your (team’s) projects, and that can be created automatically.
While I have never used the ProjectTemplate package by Kenton White, I really like the blog post Love for ProjectTemplate by Hilary Parker, as it underlines advantages that should be requirements for any tool that helps create an analysis.
- “Routine is your friend.”
- “It’s easier to start somewhere and then customize, rather than start from the ground up.”
- “Reproducibility should be as easy as possible.”
- “Finding things should also be as easy as possible.”
Now some people find all these advantages by structuring their analyses as R packages. Creating an R package to share code and data you use throughout projects is not subject to debate: it’s great! You can even build Shiny apps as packages with golem. Creating your analysis as a package, with dependencies in DESCRIPTION, functions in R/, and the analysis itself in e.g. a vignette, is subject to debate.
The advantages are that you can re-use or refresh your package development skills, and foremost that you can re-use tools made for package development (like devtools and usethis). There’s a paper presenting and promoting the approach, where such packages are called research compendia: Packaging Data Analytical Work Reproducibly Using R (and Friends), Ben Marwick, Carl Boettiger & Lincoln Mullen (2018), The American Statistician, 72:1, 80-88, DOI: 10.1080/00031305.2017.1375986
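As an illustration, scaffolding such an analysis-as-a-package could look like this with usethis (a sketch; the package name and vignette title are made up):

usethis::create_package("myanalysis")  # DESCRIPTION, R/, etc.
usethis::use_vignette("analysis")      # the analysis itself goes in a vignette
usethis::use_package("dplyr")          # declare dependencies in DESCRIPTION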
There are specific tools for building and using research compendia:
- rrtools by Ben Marwick. “The goal of rrtools is to provide instructions, templates, and functions for making a basic compendium suitable for writing a reproducible journal article or report with R.”
- holepunch by Karthik Ram. “holepunch will read the contents of your R project on GitHub, create a DESCRIPTION file with all dependencies, write a Dockerfile, add a badge to your README, and build a Docker image. Once these 4 steps are complete, any reader can click the badge and within minutes, be dropped into a free, live, RStudio server. Here they can run your scripts and notebooks and see how everything works.” (🤫 holepunch works without the compendium structure as well.)
- The R-universe project at rOpenSci, built by Jeroen Ooms. As it automatically builds vignettes, it could be used to publish your packaged analysis online.
Now it’s good to know not everyone loves the idea of projects as R packages. Miles McBain wrote a blog post “Project as an R package: An okay idea”.
I found this quote quite interesting:
“My response to advocates of project as a package is: You’re wasting precious time making the wrong packages. Instead of shoehorning your work into the package development domain, with all the loss of fidelity that entails, why aren’t you packaging tools that create the smooth {devtools}/{usethis} style experience for your own domain?”
In my talk my own advice is to use whatever structure you, and your team if you have one, prefer, and to choose a structure that can be created automatically. You could be the one creating the package that creates projects, as Miles said, although he also mentioned the risk of “bitrot” for tools maintained and used by few people.
How to run your project?
How do you go from resources and scripts to the analysis output? If your project “only” contains one or a few R Markdown document(s), maybe you can simply use the knit button. Now you might be dealing with some challenges warranting the use of dedicated tools. I’ll briefly present two.
Note that not all tools separate the structuring of a project from the running of it, i.e. you could be using a workflow package that’s opinionated about both.
Optimize pipeline with targets
The targets package by Will Landau, reviewed as part of rOpenSci software peer review, helps optimize pipelines by recognizing dependencies between steps (e.g. if you change the raw data you need to re-run everything, but if you change only the model fit you only need to re-run the final plot) and only running those that are needed at the moment. To make a project a targets project you need a script called _targets.R where you load packages, source e.g. functions from R/, and define targets. Now what are targets? Looking at a _targets.R adapted from the targets manual,
library(targets)
# create_plot() and other custom functions are sourced from R/
source("R/functions.R")
# Packages needed by the targets themselves (read_csv(), filter(), biglm())
tar_option_set(packages = c("biglm", "dplyr", "readr"))

list(
  # Raw data file. Notice the format argument is used.
  tar_target(
    raw_data_file,
    "data/raw_data.csv",
    format = "file"
  ),
  # Read the raw data
  tar_target(
    raw_data,
    read_csv(raw_data_file, col_types = cols())
  ),
  # Munge the data
  tar_target(
    data,
    raw_data %>%
      filter(!is.na(Ozone))
  ),
  # Plot and model the munged data
  tar_target(hist, create_plot(data)),
  tar_target(fit, biglm(Ozone ~ Wind + Temp, data))
)
the targets are defined in a list. Each of them uses the tar_target() function, has a name, and code that creates it, or, in the case of the raw data file, the path to the file. To build the project you run targets::tar_make() (and to destroy everything if you need to, targets::tar_destroy()). To see the network of dependencies between targets you can run targets::tar_glimpse() and other functions that inspect the pipeline.
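A typical session with this pipeline might then look like the following sketch, tar_read() being the function that loads a built target from the store:

targets::tar_make()      # run only the outdated parts of the pipeline
targets::tar_read(hist)  # load the built histogram target
targets::tar_glimpse()   # display the dependency graph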
There’s a whole ecosystem of packages around targets, the Targetopia: e.g. the tarchetypes package defines target-making functions, among them ones that let you define targets that need to be re-run after a certain time.
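For instance, if I read the tarchetypes docs correctly, such a time-sensitive target is defined with tarchetypes::tar_age() inside the _targets.R list (the target name and the get_remote_data() helper are made up):

tarchetypes::tar_age(
  remote_data,
  get_remote_data(),  # hypothetical function sourced from R/
  age = as.difftime(1, units = "weeks")  # re-run if last built over a week ago
)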
To get started with targets, I’d recommend:
- Reading the manual;
- Watching the talk Reproducible Computation at Scale in R with {targets} (Will Landau at the RUG of Lille);
- Starting with a small project (my current non-expertise level 😅).
To follow evolutions of targets as it keeps getting better, you can:
- Release-watch the targets GitHub repo;
- Follow Will Landau on Twitter;
- Subscribe to the rOpenSci newsletter.
Track versions of an analysis with orderly
Imagine you want to keep track of the different versions of an analysis and everything that went into it, and to run analyses comparing versions. The orderly package, maintained by Rich FitzJohn4, offers an infrastructure for that kind of workflow.
With orderly you have repos, and in repos you have reports/tasks, or only one report/task. Here’s an example with an orderly repo with one report. From an RStudio project I ran orderly::orderly_init("blop"), which created the repo in a new folder “blop”, and then orderly::orderly_new("example", "blop"), after which I modified files using the orderly introduction vignette.
In the blop/ folder there’s a general orderly configuration that I haven’t needed to touch, orderly_config.yml. There’s also a src/ folder corresponding to the source of my example report.
blop
├── orderly_config.yml
└── src
└── example
├── orderly.yml
└── script.R
That example report consists of two files, so it’s a very small one. You could put anything in there, as orderly is not opinionated about that! It is opinionated about being told about packages, resources, scripts, and artefacts in the blop/src/example/orderly.yml configuration.
In the example the configuration lists one script and two artefacts:
script: script.R

artefacts:
  - staticgraph:
      description: A graph of things
      filenames: mygraph.png
  - data:
      description: Data that went into the plot
      filenames: mydata.csv
The script creates those two artefacts:
# Create the data artefact
dat <- data.frame(x = 1:10, y = runif(10))
write.csv(dat, "mydata.csv", row.names = FALSE)

# Create the graph artefact
png("mygraph.png")
plot(dat)
dev.off()
Now how do you run the project?
- You can use the development mode when developing a report, to have things at your current working directory.
- You can build a draft version of the report with id <- orderly::orderly_run("example", root = "blop"), after which your whole analysis, input and output, appears in blop/draft/example/some-id-that-contains-the-date-and-a-hash.
- If you like it you can commit it with orderly::orderly_commit(id, root = "blop"), which moves the whole analysis folder from blop/draft/example/some-id-that-contains-the-date-and-a-hash to blop/archive/example/some-id-that-contains-the-date-and-a-hash (the two calls are recapped after this list).
Note that as the draft and archive folders can be gigantic, you are expected to back them up with some system, but not to track them with git, as git does not behave well with very large files. You can still use git to track the other files in the orderly repo.
To get started with orderly you can read the orderly website, which I found very clear, and you should start with a small project… which again is my current non-expertise level.
To follow orderly news, as it’s actively used and developed, you can:
- Follow Rich FitzJohn on Twitter.
Other tools for building analyses
The targets and orderly packages are not the only ones helping run analyses: there are also workflowr, cabinets, etc. You might even want to build your own!
Conclusion
In this post/talk I have discussed several aspects of drawing a project:
- Some “basics” that are not all easy (all are habits to acquire, and some are trickier to learn than others);
- Encapsulating your project by tracking its dependencies with e.g. renv;
- Structuring your project in a way that suits your team’s wishes, is consistent over time, and can be automated;
- Using tools for building outputs that answer your needs (optimizing a pipeline? tracking versions of an analysis project?).
All in all my tips would be to read everything Jenny Bryan writes 😁, and to not be afraid to change tools over time, as new cool tools will appear and as your needs and experience change. I’d be interested to hear any thoughts in the comments below!
Further resources
These items might be relevant for you:
- Course “Reproducible Research Data and Project Management in R” by Anna Krystalli.
- Guide “Organisation d’un projet collaboratif de publication PROPRE” by Sébastien Rochette.
- Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, et al. (2017) Good enough practices in scientific computing. PLOS Computational Biology 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510
- The Turing Way, an open source community-driven guide to reproducible, ethical, inclusive and collaborative data science.
1. The Little Prince might have asked “Please draw me a sheep”, not a project, but I liked tweaking that quote for a title as one will often end up putting R projects in boxes (folders, maybe packages). ↩︎
2. Note that Sharla Gelfand themselves is a great source of good ideas! See e.g. their talk Don’t repeat yourself, talk to yourself! Repeated reporting in the R universe from the RStudio::conf 2020. ↩︎
3. Unless you go for automatic snapshots. ↩︎
4. Many thanks to Rich for answering some questions of mine! ↩︎