Clojure: manipulating HTML and XML with zippers

17 Nov 2019

Clojure ships with clojure.zip, a powerful namespace for navigating and editing hierarchical data such as HTML/XML. It uses the concept of a functional zipper (see Gérard Huet's Functional Pearl: The Zipper) to make manipulating hierarchical data structures simple and efficient. This article covers how to use zippers to manipulate HTML/XML in Clojure.

A quick overview of zippers

Let's start by requiring clojure.xml for parsing XML and clojure.zip for the zipper functions.

(ns manipulating-html-and-xml-example.core
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]))

We have an XML string.

(def xml-string
  "<item>
    <title>Foo</title>
   </item>")

We parse it and get a representation as a nested map.

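(defn parse-xml-string [s]
  ;; xml/parse treats a string argument as a URI, so wrap the string's bytes (as UTF-8) in a stream.
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))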

(parse-xml-string xml-string)

=>
{:tag :item,
 :attrs nil,
 :content [{:tag :title, :attrs nil, :content ["Foo"]}]}

If we convert the nested map representation into a zipper, we get a two-element vector: the first element is our original data and the second element is nil.

(-> (parse-xml-string xml-string)
    (zip/xml-zip))

=>
[{:tag :item,
  :attrs nil,
  :content [{:tag :title, :attrs nil, :content ["Foo"]}]}
 nil]

Calling zip/next takes us to the next location in the zipper. The first element in the vector is the current node. The second element in the vector contains a map that represents a path with the following keys:

:l a list of sibling nodes to the left of the current node.
:pnodes a vector of the ancestor nodes, from the root down to the parent.
:ppath the path of the parent node.
:r a list of sibling nodes to the right of the current node.

:ppath is nil for now because the parent node is the root of the tree.

(-> (parse-xml-string xml-string)
    zip/xml-zip
    zip/next)

=>
[{:tag :title, :attrs nil, :content ["Foo"]}
 {:l [],
  :pnodes
  [{:tag :item,
    :attrs nil,
    :content [{:tag :title, :attrs nil, :content ["Foo"]}]}],
  :ppath nil,
  :r nil}]

After calling zip/next again, :ppath contains the path of the parent node.

(-> (parse-xml-string xml-string)
    zip/xml-zip
    zip/next
    zip/next)

=>
["Foo"
 {:l [],
  :pnodes
  [{:tag :item,
    :attrs nil,
    :content [{:tag :title, :attrs nil, :content ["Foo"]}]}
   {:tag :title, :attrs nil, :content ["Foo"]}],
  :ppath
  {:l [],
   :pnodes
   [{:tag :item,
     :attrs nil,
     :content [{:tag :title, :attrs nil, :content ["Foo"]}]}],
   :ppath nil,
   :r nil},
  :r nil}]
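
The path map does not have to be read by hand. As a small sketch (the :link sibling below is made up for illustration and is not part of the original example), the accessor functions zip/lefts, zip/rights and zip/path in clojure.zip expose the same information that is stored under :l, :r and :pnodes:

```clojure
(require '[clojure.zip :as zip])

;; Hypothetical item with two children, so there is a left sibling to inspect.
(def item
  {:tag :item
   :attrs nil
   :content [{:tag :title :attrs nil :content ["Foo"]}
             {:tag :link :attrs nil :content ["Bar"]}]})

;; Move to the first child (:title), then to its right sibling (:link).
(def link-loc
  (-> (zip/xml-zip item)
      zip/down
      zip/right))

(map :tag (zip/lefts link-loc)) ; => (:title), the siblings stored under :l
(zip/rights link-loc)           ; => nil, nothing is stored under :r
(map :tag (zip/path link-loc))  ; => (:item), the ancestors stored under :pnodes
```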

In summary, a zipper location is a two-element vector consisting of a node and a path. What makes zippers so compelling is that clojure.zip comes with a collection of functions for performing common operations on them, like navigation and editing (we've already seen zip/xml-zip and zip/next). Zippers also let us iterate over a tree rather than recurse, which has practical applications (like avoiding stack overflow errors on deeply nested trees).
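
To make the editing side concrete, here is a minimal sketch that changes the title text of the small item from the overview. It builds the parsed map by hand instead of calling the parser (an assumption made to keep the snippet self-contained) and uses zip/down, zip/replace and zip/root from clojure.zip:

```clojure
(require '[clojure.zip :as zip])

;; The parsed representation of the example XML, written out by hand.
(def item
  {:tag :item
   :attrs nil
   :content [{:tag :title :attrs nil :content ["Foo"]}]})

(def edited
  (-> (zip/xml-zip item)
      zip/down            ; move to the :title element
      zip/down            ; move to the "Foo" text node
      (zip/replace "Bar") ; swap the node at this location
      zip/root))          ; zip back up and return the rebuilt root node

edited
;; => {:tag :item, :attrs nil, :content [{:tag :title, :attrs nil, :content ["Bar"]}]}
```

The original item is left untouched: the edit produces a new tree rather than mutating the old one.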

Putting zippers to work

In this example we will scrape an RSS feed, generate some HTML and then inject it into an existing HTML page replacing part of the original content.

We will use xml/parse to parse the RSS feed of this blog.

(def xml-feed (xml/parse "https://andersmurphy.com/feed.xml"))

We are interested in the item tag but can't quite remember the structure of feed.xml. We could look at the feed.xml file to work out how deep in the hierarchy the items are, but destructuring deeply nested data can be quite cumbersome. Instead, we can use a zipper to perform a depth-first traversal of the entire document, visiting every node, and then filter for the tags we care about.

(->> (zip/xml-zip xml-feed)
     (iterate zip/next)
     (take-while (complement zip/end?))
     (map zip/node)
     (filter (fn [node] (and (associative? node)
                             (= (:tag node) :item)))))

First, the XML is turned into a zipper with zip/xml-zip. We then generate a sequence of all the locations in the zipper with (iterate zip/next) and (take-while (complement zip/end?)). zip/next goes to the next location from the current location, and zip/end? returns true when we are at the end of our depth-first walk. We convert that sequence of locations into nodes with (map zip/node) and then keep only the nodes with the item tag, which gives us a list of items.

=>
({:tag :item,
  :attrs nil,
  :content
  [{:tag :title,
    :attrs nil,
    :content ["Advantages of an Android free zone"]}
   {:tag :pubDate,
    :attrs nil,
    :content ["Thu, 27 Aug 2015 00:00:00 GMT"]}
   {:tag :link,
    :attrs nil,
    :content
    ["https://andersmurphy.com/2015/08/27/advantages-of-an-android-free-zone.html"]}
   {:tag :guid,
    :attrs {:isPermaLink "true"},
    :content
    ["https://andersmurphy.com/2015/08/27/advantages-of-an-android-free-zone.html"]}]}
    ...)
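
The visiting order is easier to see on a tiny tree. The following sketch uses a hand-written tree and a helper named visited (both hypothetical, introduced only for illustration) with the same iterate / take-while pipeline:

```clojure
(require '[clojure.zip :as zip])

;; Hypothetical three-element tree: <a><b>x</b><c/></a>
(def tree
  {:tag :a
   :attrs nil
   :content [{:tag :b :attrs nil :content ["x"]}
             {:tag :c :attrs nil :content nil}]})

(defn visited
  "Returns the nodes of an xml-zip tree in the order zip/next visits them,
  showing elements by their tag and text nodes as themselves."
  [root]
  (->> (zip/xml-zip root)
       (iterate zip/next)
       (take-while (complement zip/end?))
       (map zip/node)
       (map #(if (map? %) (:tag %) %))))

(visited tree)
;; => (:a :b "x" :c)
```

Each parent is visited before its children and each child before its right sibling, which is why filtering the flat sequence finds items at any depth.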

We then transform these items into a hiccup HTML representation.

;; items is the sequence of :item nodes returned by the traversal above.
(->> items
     (map :content)
     (map (fn [[{[title] :content}
                {[date] :content}
                {[link] :content}]]
            [:div
             [:h1 title]
             [:p date]
             [:a {:href link} link]])))

=>
([:div
  [:h1 "Advantages of an Android free zone"]
  [:p "Thu, 27 Aug 2015 00:00:00 GMT"]
  [:a
   {:href
    "https://andersmurphy.com/2015/08/27/advantages-of-an-android-free-zone.html"}
   "https://andersmurphy.com/2015/08/27/advantages-of-an-android-free-zone.html"]]
   ...)
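
The final pipeline later in this article calls a function named xml-feed->hiccup whose definition is not shown. One plausible sketch, assuming it simply wraps the two steps above, could look like this. The helper names feed-items and item->hiccup, and the sample feed, are hypothetical:

```clojure
(require '[clojure.zip :as zip])

(defn feed-items
  "Returns every :item element found by a depth-first walk of the parsed feed."
  [feed]
  (->> (zip/xml-zip feed)
       (iterate zip/next)
       (take-while (complement zip/end?))
       (map zip/node)
       (filter (fn [node] (and (associative? node)
                               (= (:tag node) :item))))))

(defn item->hiccup
  "Assumes the item's first three children are title, pubDate and link, in that order."
  [{[{[title] :content} {[date] :content} {[link] :content}] :content}]
  [:div
   [:h1 title]
   [:p date]
   [:a {:href link} link]])

(defn xml-feed->hiccup [feed]
  (map item->hiccup (feed-items feed)))

;; Hypothetical, hand-written feed used only to exercise the sketch.
(def sample-feed
  {:tag :rss
   :attrs nil
   :content [{:tag :channel
              :attrs nil
              :content [{:tag :title :attrs nil :content ["Blog"]}
                        {:tag :item
                         :attrs nil
                         :content [{:tag :title :attrs nil :content ["Post"]}
                                   {:tag :pubDate :attrs nil :content ["Thu, 27 Aug 2015 00:00:00 GMT"]}
                                   {:tag :link :attrs nil :content ["https://example.com/post.html"]}]}]}]})

(xml-feed->hiccup sample-feed)
```

Note that the positional destructuring depends on the order of the child elements, so a feed that orders them differently would need a lookup by :tag instead.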

We want to inject this HTML list into an existing HTML page. So we need to get an existing HTML page and then write a function to select the node we want.

(def html-page (slurp "https://andersmurphy.com/"))

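(defn zip-select-first
  "Returns the first location, in depth-first order, whose node contains an
  associative element (e.g. a hiccup attribute map) for which
  (pred (tag element)) is truthy, or nil if no location matches."
  [loc tag pred]
  (when-not (zip/end? loc)
    (if (some
         (every-pred associative?
                     #(some-> % tag pred))
         (zip/node loc))
      loc
      (recur (zip/next loc) tag pred))))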

zip-select-first does a depth-first traversal of a zipper and returns the first location whose node contains an associative element (for hiccup, the attribute map) with a value under the given key that satisfies the predicate. Despite the parameter name, tag here is an attribute key such as :class. every-pred is a handy higher-order function that returns a function that returns true if a value satisfies all its predicates. some-> is like -> except that it short-circuits if a function returns nil.
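
To try zip-select-first without fetching a page, the sketch below runs it on a small hand-written hiccup tree. simple-hiccup-zip is a hypothetical, simplified stand-in for a hiccup zipper (it is not the hickory implementation) and the page data is made up:

```clojure
(require '[clojure.zip :as zip])

;; Same function as above, repeated so the snippet runs on its own.
(defn zip-select-first [loc tag pred]
  (when-not (zip/end? loc)
    (if (some
         (every-pred associative?
                     #(some-> % tag pred))
         (zip/node loc))
      loc
      (recur (zip/next loc) tag pred))))

(defn simple-hiccup-zip
  "Hypothetical minimal hiccup zipper: vectors are branches and their children
  are everything after the tag and the optional attribute map."
  [root]
  (zip/zipper vector?
              (fn [[_ & more]]
                (seq (if (map? (first more)) (rest more) more)))
              (fn [[tag & more] children]
                (into (if (map? (first more)) [tag (first more)] [tag])
                      children))
              root))

;; Made-up page used only for this example.
(def page
  [:html
   [:body
    [:div {:class "content container"} "old"]
    [:p "footer"]]])

(def updated
  (-> (simple-hiccup-zip page)
      (zip-select-first :class #(= % "content container"))
      (zip/replace [:div {:class "content container"} "new"])
      zip/root))

updated
;; => [:html [:body [:div {:class "content container"} "new"] [:p "footer"]]]
```

If nothing matches, zip-select-first returns nil, so a real pipeline may want to check for that before calling zip/replace.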

For the last part of this pipeline we need to add some more dependencies: hiccup.core for writing HTML, hickory.core for reading HTML, and hickory.zip for creating zippers for HTML.

(ns manipulating-html-and-xml-example.core
  (:require [hiccup.core :as hiccup]
            [hickory.core :as hick]
            [clojure.xml :as xml]
            [hickory.zip :as hick-zip]
            [clojure.zip :as zip]))

Putting it all together: we read the XML feed, filter the items we care about, convert them to hiccup, find the first :div element whose :class attribute equals "content container" and replace it with our own :div element. Here xml-feed->hiccup stands for the filtering and hiccup steps shown earlier, wrapped in a function. Finally, we rebuild the whole tree with our changes applied by calling zip/root, convert the hiccup to HTML and write it to a file.

(defn build-page []
  (let [content (xml-feed->hiccup xml-feed)]
    (spit "page.html"
          (-> html-page
              hick/parse
              hick/as-hiccup
              hick-zip/hiccup-zip
              (zip-select-first :class #(= % "content container"))
              (zip/replace [:div {:class "content container"} content])
              zip/root
              hiccup/html))))

This concludes this guide to manipulating HTML/XML in Clojure. The full example project can be found here.