One of my favorite tools is XPath, the query language for exploring XML and HTML trees. In this post, I will highlight a few use cases of this “angle-bracket crunching tool” and hope to convince you that it’s an awesome thing to know about and play with.
Many thanks to Christophe Dervieux for useful feedback on this post! Mille mercis !
Brief intro to XPath in R
Say I have some XML,
my_xml <- xml2::read_xml("<wrapper><thing>blop</thing></wrapper>")
I’m using xml2, by Hadley Wickham, Jim Hester and Jeroen Ooms. This package is recommended over the XML package by e.g. the rOpenSci dev guide.
With XPath I can query the “thing” element:
xml2::xml_find_all(my_xml, ".//thing")
#> {xml_nodeset (1)}
#> [1] <thing>blop</thing>
I can extract its content via xml2::xml_text():
xml2::xml_find_all(my_xml, ".//thing") |>
xml2::xml_text()
#> [1] "blop"
I could also replace the element.
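For instance, here is a minimal sketch of changing its text in place, using xml2’s xml_find_first() and the xml_text() setter (the new value “plop” is made up for the example):
thing <- xml2::xml_find_first(my_xml, ".//thing")
# this modifies my_xml in place
xml2::xml_text(thing) <- "plop"
my_xml
#> {xml_document}
#> <wrapper>
#> [1] <thing>plop</thing>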
Now, that was an especially simple XPath query. XPath’s strength is to allow you to really take advantage of the structure of the XML or HTML tree. You can extract nodes based on their attributes, on their parents, on their siblings, etc.
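Here is a small taste of that, with a toy XML made up for this example:
other_xml <- xml2::read_xml(
  "<wrapper><thing id='a'>blop</thing><thing id='b'>blip</thing></wrapper>"
)
# elements named thing whose id attribute is "b"
xml2::xml_find_all(other_xml, ".//thing[@id='b']")
#> {xml_nodeset (1)}
#> [1] <thing id="b">blip</thing>
# the second thing among its siblings
xml2::xml_find_all(other_xml, ".//thing[2]")
#> {xml_nodeset (1)}
#> [1] <thing id="b">blip</thing>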
Where to learn XPath? Two good websites to get started are
A primary skill to learn is the name of things, e.g. nodes, attributes, which will help you type better keywords into search engines when trying to figure out a query. 😉
Note that if you are handling HTML, you might enjoy selectr by Simon Potter that creates XPath filters based on CSS selectors.
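For instance, selectr::css_to_xpath() shows you the XPath behind a CSS selector (the exact string may differ between selectr versions):
selectr::css_to_xpath("p > a")
#> [1] "descendant-or-self::p/a"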
Knowing XPath, or even knowing it exists, is really empowering. In the rest of this post, I’ll highlight cases where this is useful.
When life gives you XML or HTML
Web scraping
In the early days of this blog I liked extracting data from websites, which I did with regular expressions. Now I know better and would wrangle HTML as HTML. Goodbye, stringr::str_detect(), hello, xml2::xml_find_all().
A package that’s especially useful for web scraping is rvest by Hadley Wickham. rvest builds upon selectr, and will write XPath for you.
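As a small, made-up illustration, here is the same element queried once with a CSS selector and once with XPath:
page <- rvest::read_html("<p>Blop, <a href='https://example.com'>a link</a></p>")
rvest::html_elements(page, css = "p a")
#> {xml_nodeset (1)}
#> [1] <a href="https://example.com">a link</a>
rvest::html_elements(page, xpath = ".//p/a")
#> {xml_nodeset (1)}
#> [1] <a href="https://example.com">a link</a>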
pkgdown
If you use pkgdown to produce the documentation website for your package, please know that part of its magic comes from various “HTML tweaks” that are powered by XPath, see for instance “tweak-reference.R”.
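The idea, in a minimal sketch that is not pkgdown’s actual code, is to find nodes via XPath and then edit them in place:
html <- xml2::read_html("<body><h1>Usage</h1><p>blop</p></body>")
heading <- xml2::xml_find_first(html, ".//h1")
# demote the heading, in place
xml2::xml_name(heading) <- "h2"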
When life gives you something else…
You can still turn it into XML and handle it as such, with XPath!
Markdown manipulation with commonmark, tinkr
The commonmark package transforms Markdown to XML. This can be extremely handy to get data on R Markdown or Markdown files.
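For instance, here is a minimal sketch extracting all link destinations from a Markdown string (commonmark’s XML has a default namespace, which xml2 maps to the d1 prefix):
md <- "# Hello\n\nSome *text* and a [link](https://example.com)."
commonmark::markdown_xml(md) |>
  xml2::read_xml() |>
  xml2::xml_find_all(".//d1:link") |>
  xml2::xml_attr("destination")
#> [1] "https://example.com"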
Now, say you want to modify the Markdown file as XML and then get a Markdown file back. That is also possible, with the tinkr package, started by yours truly and now maintained by Zhian Kamvar. The conversion back to Markdown relies on xslt by Jeroen Ooms, a package that can apply XSL stylesheets.
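A minimal sketch of the workflow, assuming a local “file.md” (the paths are placeholders):
notes <- tinkr::yarn$new("file.md")
# the Markdown now lives in notes$body as XML, queryable with XPath
xml2::xml_find_all(notes$body, ".//d1:code_block")
# ...tweak the nodes..., then write Markdown back
notes$write("file-modified.md")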
Code tree manipulation with xmlparsedata
Imagine you’re writing a domain-specific language where you let users write something like
str_detect(str_to_lower(itemTitle), 'wikidata')
that you want to somehow translate to:
REGEX(LCASE(?itemTitle),"wikidata")
Yes, that’s a real use case, from the glitter package (SPARQL DSL) maintained by Lise Vaudor.
The way we translate the code is to transform it into an XML tree via Gábor Csárdi’s xmlparsedata, and then apply different tweaks based on XPath.
parse(
  text = "str_detect(str_to_lower(itemTitle), 'wikidata')",
  keep.source = TRUE
) |>
  xmlparsedata::xml_parse_data(pretty = TRUE) |>
  xml2::read_xml() |>
  as.character() |>
  cat()
#> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
#> <exprlist>
#> <expr line1="1" col1="1" line2="1" col2="47" start="49" end="95">
#> <expr line1="1" col1="1" line2="1" col2="10" start="49" end="58">
#> <SYMBOL_FUNCTION_CALL line1="1" col1="1" line2="1" col2="10" start="49" end="58">str_detect</SYMBOL_FUNCTION_CALL>
#> </expr>
#> <OP-LEFT-PAREN line1="1" col1="11" line2="1" col2="11" start="59" end="59">(</OP-LEFT-PAREN>
#> <expr line1="1" col1="12" line2="1" col2="34" start="60" end="82">
#> <expr line1="1" col1="12" line2="1" col2="23" start="60" end="71">
#> <SYMBOL_FUNCTION_CALL line1="1" col1="12" line2="1" col2="23" start="60" end="71">str_to_lower</SYMBOL_FUNCTION_CALL>
#> </expr>
#> <OP-LEFT-PAREN line1="1" col1="24" line2="1" col2="24" start="72" end="72">(</OP-LEFT-PAREN>
#> <expr line1="1" col1="25" line2="1" col2="33" start="73" end="81">
#> <SYMBOL line1="1" col1="25" line2="1" col2="33" start="73" end="81">itemTitle</SYMBOL>
#> </expr>
#> <OP-RIGHT-PAREN line1="1" col1="34" line2="1" col2="34" start="82" end="82">)</OP-RIGHT-PAREN>
#> </expr>
#> <OP-COMMA line1="1" col1="35" line2="1" col2="35" start="83" end="83">,</OP-COMMA>
#> <expr line1="1" col1="37" line2="1" col2="46" start="85" end="94">
#> <STR_CONST line1="1" col1="37" line2="1" col2="46" start="85" end="94">'wikidata'</STR_CONST>
#> </expr>
#> <OP-RIGHT-PAREN line1="1" col1="47" line2="1" col2="47" start="95" end="95">)</OP-RIGHT-PAREN>
#> </expr>
#> </exprlist>
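For instance, a single XPath query is enough to list all the function calls in that tree (a minimal sketch, not glitter’s actual internals):
xml <- parse(
  text = "str_detect(str_to_lower(itemTitle), 'wikidata')",
  keep.source = TRUE
) |>
  xmlparsedata::xml_parse_data(pretty = TRUE) |>
  xml2::read_xml()

xml2::xml_find_all(xml, "//SYMBOL_FUNCTION_CALL") |>
  xml2::xml_text()
#> [1] "str_detect"   "str_to_lower"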
To me, having an XML tree at hand makes it easier to think of, and work with, an “abstract syntax tree”.
XPath for all the things
A tool that I haven’t used, but that sounds intriguing, is rpath by Gabriel Becker, an R package implementing XPath-like functionality for querying R objects.
Data documentation with EML
No matter what format your data is in, you can create its metadata with the EML package maintained by Carl Boettiger, which produces XML metadata following the Ecological Metadata Language. Sure, you might prefer using dataspice maintained by Bryce Mecum (and get JSON).
When you are creating XML or HTML
If the goal of your code or package is to produce XML or HTML, knowing XPath will help you write unit tests (that you might want to complement with snapshot unit tests).
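For instance, here is a minimal sketch of such a test, with a hypothetical make_card() function that is supposed to return HTML:
testthat::test_that("make_card() creates a title heading", {
  html <- xml2::read_html(make_card(title = "Hello"))
  title <- xml2::xml_find_first(html, ".//h2[@class='card-title']")
  testthat::expect_equal(xml2::xml_text(title), "Hello")
})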
Conclusion
In this post I’ve explained why I find XPath, XML and HTML so useful. The applications are endless, and not limited to the examples from this post: web scraping, HTML tweaks, Markdown manipulation, code tree manipulation…