Clojure: crawling Hacker News with re-seq

Clojure has this nifty function called re-seq that returns a lazy sequence of successive matches of a pattern in a string. This is really useful for turning any string into a list of data. Let's use it to crawl Hacker News!
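
Before pointing it at a web page, here is a tiny sketch of what re-seq returns on plain strings. The input strings and the names words and pairs are made up for illustration. Without capture groups each match is a string; with groups each match is a vector of the whole match followed by the groups.

```clojure
;; No capture groups: each match comes back as a plain string.
(def words (re-seq #"\w+" "slurp, re-seq, take"))
;; words => ("slurp" "re" "seq" "take")

;; With capture groups: each match is a vector of
;; [whole-match group-1 group-2 ...].
(def pairs (re-seq #"(\w+)=(\d+)" "a=1 b=22 c=x"))
;; pairs => (["a=1" "a" "1"] ["b=22" "b" "22"])
```

That vector shape is why the functions in this post destructure each match as [_ link title].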

(defn get-first-8-posts-from-HN []
  (->> (slurp "")
       (re-seq #"<td class=\"title\"><a href=\"(.*?)\" class=\"storylink\">(.*?)</a>")
       (map (fn [[_ link title]] title))
       (take 8)))

slurp gets the first page of Hacker News. re-seq matches the regex against the page, pulling out the link and title of each post, and we keep the titles of the first 8.


("UCSF Launches Translational Psychedelic Research (TrPR) Program"
 "Ethereum London Mainnet Announcement"
 "Show HN: Make spaced-repetition flashcards"
 "Show HN: A low power 1U Raspberry Pi cluster server for inexpensive
 "Last Mile Redis"
 "Windows Print Spooler Elevation of Privilege Vulnerability
 "Full Throttle"
 "Google Drive bans distribution of “misleading content”")

Amazing, we get the top 8 posts of Hacker News!
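
To try the parsing step without hitting the network, here is a minimal sketch that runs the same regex over a hard-coded string. The markup in sample-html is hypothetical (written to fit the regex, not copied from Hacker News), and the names sample-html, story-pattern and parse-posts are mine.

```clojure
;; Hypothetical markup in the shape the regex expects; not real Hacker News output.
(def sample-html
  (str "<td class=\"title\"><a href=\"https://example.com/a\" class=\"storylink\">First post</a></td>"
       "<td class=\"title\"><a href=\"https://example.com/b\" class=\"storylink\">Second post</a></td>"))

;; The same pattern used in the post: group 1 is the link, group 2 the title.
(def story-pattern
  #"<td class=\"title\"><a href=\"(.*?)\" class=\"storylink\">(.*?)</a>")

(defn parse-posts
  "Turns a string of HTML into a lazy seq of {:title ... :link ...} maps."
  [html]
  (->> (re-seq story-pattern html)
       (map (fn [[_ link title]] {:title title :link link}))))

(parse-posts sample-html)
;; => ({:title "First post", :link "https://example.com/a"}
;;     {:title "Second post", :link "https://example.com/b"})
```

Keeping the parsing in its own function means the regex can be checked against a saved page before any requests are made. If the site's markup changes, the regex simply stops matching and you get an empty sequence back.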

We can easily extend this function to find posts on Rust in the first five pages of Hacker News.

(defn get-posts-about-rust-from-HN []
  (->> (map #(do
               (Thread/sleep 50)
               (slurp (str "" %)))
            (range 1 6))
       (apply str)
       (re-seq #"<td class=\"title\"><a href=\"(.*?)\" class=\"storylink\">(.*?)</a>")
       (map (fn [[_ link title]] {:title title :link link}))
       (filter (fn [{:keys [title]}]
                 (re-find #"Rust" title)))))

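We request the first 5 pages (pausing 50 ms before each request), join them into one string, pull out every link and title, and then keep only the posts whose title contains "Rust". Simple.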


({:title "Is Rust Used Safely by Software Developers?",
  :link ""})
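
The filter works because re-find returns nil when there is no match, and nil is falsey. One thing to watch: #"Rust" is a substring match, so it also catches words like "Rusty". Here is a small sketch of a stricter variant using word boundaries. The sample data is made up apart from the one title above, the links are placeholders, and posts-matching is a hypothetical helper.

```clojure
;; Illustrative data only: the links are placeholders and the second
;; title is invented to show a false positive.
(def sample-posts
  [{:title "Is Rust Used Safely by Software Developers?" :link "https://example.com/1"}
   {:title "Rusty nails and tetanus" :link "https://example.com/2"}
   {:title "Last Mile Redis" :link "https://example.com/3"}])

(defn posts-matching
  "Keeps the posts whose :title matches pattern somewhere."
  [pattern posts]
  (filter (fn [{:keys [title]}] (re-find pattern title)) posts))

(map :title (posts-matching #"Rust" sample-posts))
;; => ("Is Rust Used Safely by Software Developers?" "Rusty nails and tetanus")

;; \b is a word boundary, so "Rusty" no longer matches.
(map :title (posts-matching #"\bRust\b" sample-posts))
;; => ("Is Rust Used Safely by Software Developers?")
```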

We can even open the first post in our default web browser.

(require 'clojure.java.browse)
(->> (get-posts-about-rust-from-HN) first :link clojure.java.browse/browse-url)

Opens post in default browser.

How's that for automation!

In this post we've seen how to use slurp and re-seq to write a quick and simple web crawler.