Clojure: sorting a sequence based on another sequence

25 May 2019

Sometimes web responses contain an unordered sequence of items along with a separate manifest that specifies the ordering. This article will cover how you can write a function to sort the list of items by the manifest order as well as using Clojure Spec to generate test data to verify its output.

Sorting by manifest order

Let's start by writing some test data.

(def manifest
  {:item-order ["x234" "d543" "g090" "z001" "a362"]})

(def items
  [{:id "z001" :type "food"}
   {:id "g090" :type "drink"}
   {:id "a362" :type "food"}
   {:id "x234" :type "drink"}
   {:id "d543" :type "food"}])

Now that we have some data we can use the sort-by function and the .indexOf method to make the order of the items match the order in the manifest.

(sort-by (fn [{:keys [id]}]
           (.indexOf (manifest :item-order) id))
         items)

=> ({:id "x234" :type "drink"}
    {:id "d543" :type "food"}
    {:id "g090" :type "drink"}
    {:id "z001" :type "food"}
    {:id "a362" :type "food"})

This seems to work. However, .indexOf gets called for every item in the list of items. Performance might suffer for large amounts of data. Let's write some performance tests and see whether we are right.

Generating data

First we need to write a function that generates a large number of unique string ids. This is so that we can generate the same set of ids for our manifest and items. The shuffle function makes sure every call to this function returns the result in a random order.

(defn large-shuffled-vec-of-ids []
  (->> (range 1000)
       (map (partial str "xid"))
       shuffle
       vec))

We can then define specs with s/def. We provide a custom generator for the item-order and the item ids with s/with-gen. We do this to ensure that both the manifest and the items have the same set of ids. A single item set means the generator will always return the same result.

(require '[clojure.spec.alpha :as s])
(require '[clojure.spec.gen.alpha :as gen])

(s/def ::id string?)
(s/def ::item-order
  (s/with-gen
    (s/coll-of ::id :distinct true)
    #(s/gen #{(large-shuffled-vec-of-ids)})))
(s/def ::manifest (s/keys :req-un [::item-order]))
(s/def ::type #{"food" "drink"})
(s/def ::item (s/keys :req-un [::id ::type]))

s/exercise allows us to generate as many items as we have ids. We concat the result because s/exercise returns pairs. We then map and merge to overwrite the random ids with the ids from the large-shuffled-vec-of-ids function.

(s/def ::items
  (s/with-gen
    (s/coll-of ::item :distinct true)
    #(s/gen #{(let [ids (large-shuffled-vec-of-ids)]
                (map (fn [item id] (merge item {:id id}))
                     (apply concat (s/exercise ::item (count ids)))
                     ids))})))

We add consistent-ids? to the spec with s/and to ensure that the items and manifest share the same set of ids. This isn't strictly necessary as the large-shuffled-vec-of-ids function should guarantee this, but it's always good to capture intent in a spec as they also serve as valuable documentation.

(defn consistent-ids? [{:keys [manifest items]}]
  (= (set (manifest :item-order))
     (set (map :id items))))

(s/def ::items-with-manifest
  (s/and
   (s/keys :req-un [::manifest ::items])
   consistent-ids?))

With the ::items-with-manifest spec finished we can now generate a large amount of test data for our performance test.

(gen/generate (s/gen ::items-with-manifest))

=> {:manifest
    {:item-order
     ["xid313"
      "xid544"
      "xid846"
      "xid351"
      "xid67"
      ...
    }}

Warning! The output is quite large and will flood your repl.

Performance test

Now that we can generate test data, we test the performance of the initial implementation with the time function. It's worth noting the use of do to avoid flooding the repl with output results.

(def large-manifest (gen/generate (s/gen ::items-with-manifest)))

(defn index-of-sort [{:keys [items manifest]}]
  (sort-by (fn [{:keys [id]}]
             (.indexOf (manifest :item-order) id))
           items))

(time (do (index-of-sort large-manifest) nil))

=> "Elapsed time: 561.437815 msecs"

As suspected, the function is very slow for large input. This is because each call of .indexOf has O(n) complexity. We can avoid this cost by building a map of values to index. This can be done with (iterate inc 0) this generates a sequence of numbers starting from 0 which we then zipmap to the id values.

(defn hash-map-index-sort [{:keys [items manifest]}]
  (let [value->index (-> (manifest :item-order)
                         (zipmap (iterate inc 0)))]
    (sort-by (fn [{:keys [id]}]
               (value->index id))
             items)))

(time (do (hash-map-index-sort large-manifest) nil))

=> "Elapsed time: 2.564776 msecs"

The results from time show this implementation is much faster than the initial implementation.

Validation and generative testing

We can also spec the hash-map-index-sort function with s/fdef and use the spec to create a generative test. The :args key defines the input to the function. The :ret key defines the output data. Finally, the :fn key defines the relation between input and output that we want to validate; in this case we want to check that the order of the output items is the same as the order of ids in the input manifest.

(require '[clojure.spec.test.alpha :as stest])

(s/fdef hash-map-index-sort
  :args (s/cat :m ::items-with-manifest)
  :ret  ::items
  :fn #(= (-> % :args :m :manifest :item-order)
          (->> % :ret (map :id))))

(-> (stest/check `hash-map-index-sort)
    first
    :clojure.spec.test.check/ret)

=> {:result true,
    :pass? true,
    :num-tests 1000,
    :time-elapsed-ms 14614,
    :seed 1558893244714}

This concludes this exploration of how to order items by a manifest as well as some use cases for Clojure Spec.