Clojure: crawling Hacker News with re-seq
Clojure has this nifty function called re-seq that returns a lazy sequence of successive matches of a pattern in a string. This is really useful for turning any string into a list of data. Let's use it to crawl Hacker News!
(defn get-first-8-posts-from-HN []
  (->> (slurp "https://news.ycombinator.com/news?p=1")
       (re-seq #"<td class=\"title\"><a href=\"(.*?)\" class=\"storylink\">(.*?)</a>")
       (map (fn [[_ link title]] title))
       (take 8)))
slurp gets the first page of Hacker News as a string. re-seq matches the regex against it, capturing the link and title of each post, and map keeps just the title.
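To see what re-seq hands to the destructuring in map, here is a small offline sketch. The sample-html string and its example.com URLs are made up for illustration; only the regex is taken from the function above.

```clojure
;; Hypothetical sample markup, shaped like the rows the regex expects.
(def sample-html
  (str "<td class=\"title\"><a href=\"https://example.com/a\" class=\"storylink\">First post</a>"
       "<td class=\"title\"><a href=\"https://example.com/b\" class=\"storylink\">Second post</a>"))

(def story-re
  #"<td class=\"title\"><a href=\"(.*?)\" class=\"storylink\">(.*?)</a>")

;; With capture groups, each match is a vector: [whole-match group-1 group-2].
(def matches (re-seq story-re sample-html))

;; Destructuring skips the whole match with _ and names the two groups.
(def titles (map (fn [[_ link title]] title) matches))
;; titles => ("First post" "Second post")
```

Each element of matches is a vector, which is why the function passed to map can name link and title directly in its argument list.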
(get-first-8-posts-from-HN)
=>
("UCSF Launches Translational Psychedelic Research (TrPR) Program"
 "Ethereum London Mainnet Announcement"
 "Show HN: Make spaced-repetition flashcards"
 "Show HN: A low power 1U Raspberry Pi cluster server for inexpensive colocation"
 "Last Mile Redis"
 "Windows Print Spooler Elevation of Privilege Vulnerability (CVE-2021-34481)"
 "Full Throttle"
 "Google Drive bans distribution of “misleading content”")
Amazing, we get the top 8 posts on Hacker News!
We can easily extend this function to find posts on Rust in the first five pages of Hacker News.
(defn get-posts-about-rust-from-HN []
  (->> (map #(do
               (Thread/sleep 50)
               (slurp (str "https://news.ycombinator.com/news?p=" %)))
            (range 1 6))
       (apply str)
       (re-seq #"<td class=\"title\"><a href=\"(.*?)\" class=\"storylink\">(.*?)</a>")
       (map (fn [[_ link title]] {:title title :link link}))
       (filter (fn [{:keys [title]}]
                 (re-find #"Rust" title)))))
We request the first five pages, pausing 50 ms before each request, join them into one string, and then filter the matches by title. Simple.
(get-posts-about-rust-from-HN)
=>
({:title "Is Rust Used Safely by Software Developers?",
  :link "https://arxiv.org/abs/2007.00752"})
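Because the parsing steps are pure functions of a string, they can be tried without touching the network. The sketch below pulls them into a hypothetical parse-posts helper and runs the same kind of filter on made-up sample markup. The helper names, the sample titles, and the example.com links are assumptions for illustration, not part of the original function.

```clojure
(defn parse-posts
  "Hypothetical helper: turns HN-style HTML into a seq of {:title :link} maps."
  [html]
  (->> html
       (re-seq #"<td class=\"title\"><a href=\"(.*?)\" class=\"storylink\">(.*?)</a>")
       (map (fn [[_ link title]] {:title title :link link}))))

(defn posts-about
  "Hypothetical helper: keeps the posts whose title contains a match for re."
  [re posts]
  (filter (fn [{:keys [title]}] (re-find re title)) posts))

;; Made-up sample markup standing in for the concatenated pages.
(def sample-pages
  (str "<td class=\"title\"><a href=\"https://example.com/rust\" class=\"storylink\">Learning Rust</a>"
       "<td class=\"title\"><a href=\"https://example.com/go\" class=\"storylink\">Learning Go</a>"))

(def rust-posts (posts-about #"Rust" (parse-posts sample-pages)))
;; rust-posts => ({:title "Learning Rust", :link "https://example.com/rust"})
```

Note that re-find does a case-sensitive substring search, so a title containing "Rustic" would pass the filter while one containing only "rust" would not.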
We can even open the first post in our default web browser.
(require 'clojure.java.browse) ; harmless if the namespace is already loaded

(->> (get-posts-about-rust-from-HN)
     first
     :link
     clojure.java.browse/browse-url)
=>
Opens post in default browser.
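If no post matches, first returns nil and browse-url ends up being called with nil, which is unlikely to do anything useful. A minimal sketch of a guard, assuming a hypothetical open-first-link! helper that accepts the opener as an optional argument so it can be exercised without launching a browser:

```clojure
(require '[clojure.java.browse :as browse])

(defn open-first-link!
  "Hypothetical helper: opens the :link of the first post, if there is one.
  Returns the link that was opened, or nil when there was nothing to open."
  ([posts] (open-first-link! posts browse/browse-url))
  ([posts open-fn]
   (when-let [link (:link (first posts))]
     (open-fn link)
     link)))

;; Usage with the crawler from this post:
;; (open-first-link! (get-posts-about-rust-from-HN))
```

Called with an empty list, the helper returns nil and never invokes the opener.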
How's that for automation!
In this post we've seen how to use slurp and re-seq to write a quick and simple web crawler.