By Renzo Borgatti

In this article we will explore some concrete examples of the many uses and intricacies of the memoize function from the Clojure standard library.

 

 

Save 37% off Clojure Standard Library with code fccborgatti.

memoize is a function in the Clojure standard library that adds caching capabilities to an existing function, using the invocation arguments as the cache key. When the wrapped function is invoked again with the same list of arguments, the result is returned immediately from the cache without any additional computation. The effects of memoize are readily visible if we print a message from the wrapped function:

 
 
 
(defn- f* [a b]
  (println (format "Cache miss for [%s %s]" a b))
  (+ a b))
 
(def f (memoize f*))
 
(f 1 2)
;; Cache miss for [1 2]
;; 3
 
(f 1 2)
;; 3
 
(f 1 3)
;; Cache miss for [1 3]
;; 4
 
 
 

Only the first invocation with a given pair of arguments prints the message, confirming that the wrapped function f* isn’t invoked again. Conventionally, when the wrapped function isn’t meant to be used directly, a star * character is added at the end of its name, and the “memoized” version is named without one.

“Memoize” invocation contract

The formal specification of the memoize function signature is as follows:

  • Input: “f” needs to be a function and is mandatory: (fn? f) should return true. You can still provide memoize with a non-invokable object, but the resulting function would be useless and throw an exception when called.
  • Notable exceptions: ClassCastException if “f” can’t be invoked.
  • Output: a new function of a variable number of arguments. memoize eventually delegates the call to the wrapped function, so the number of arguments passed to the generated function needs to be compatible.
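The contract can be checked quickly at the REPL. This is a minimal sketch (the var name broken is illustrative):

```clojure
;; fn? tells us up front that a String is not a function.
(fn? "not-a-function")
;; => false

;; memoize accepts the non-invokable object without complaint...
(def broken (memoize "not-a-function"))

;; ...but invoking the result throws ClassCastException,
;; because a String can't be cast to clojure.lang.IFn.
(try
  (broken 1)
  (catch ClassCastException e :not-invokable))
;; => :not-invokable
```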

Examples

memoize works well for non-trivial computations that accept and return values with a relatively small memory footprint. The following example illustrates this point. The Levenshtein distance[1] is a simple metric that measures how different two strings are. It can be used, for example, to suggest corrections for the most common spelling mistakes. The Levenshtein distance is straightforward to implement, but it becomes computationally intensive for longer strings (above 10 characters). We can use memoize to avoid computing the distance of the same pair of strings repeatedly. The input (the string arguments) and the output (a small integer) are relatively small, so we can cache a large number of them without exhausting memory (assuming the list of words with which the function is invoked is a finite number we can estimate).

 

To feed our example we’re going to use the dictionary available on Unix systems at “/usr/share/dict/words”, which is a plain text list of words. If we were asked to implement an auto-correction service, it could work as follows:

  1. The user inputs a misspelled word
  2. The system checks the distance of the word against the words in the dictionary
  3. The results are returned in order of increasing distance

Although in our example we’re implementing just the essentials, concentrating mainly on the application of memoize, we’re going to pre-compute a dictionary of words starting with the initials of the misspelled word – a technique to further speed up the Levenshtein distance calculation:

 
 
(defn levenshtein* [[c1 & rest1 :as str1]               ; ❶
                    [c2 & rest2 :as str2]]
  (let [len1 (count str1)
        len2 (count str2)]
    (cond (zero? len1) len2
          (zero? len2) len1
          :else
          (min (inc (levenshtein* rest1 str2))
               (inc (levenshtein* str1 rest2))
               (+ (if (= c1 c2) 0 1) (levenshtein* rest1 rest2))))))
 
(def levenshtein (memoize levenshtein*))                ; ❷
 
(defn to-words [txt init]                               ; ❸
  (->> txt
       slurp
       clojure.string/split-lines
       (filter #(.startsWith % init))
       (remove #(> (count %) 8))
       doall))
 
(defn best [misp dict]                                  ; ❹
  (->> dict
       (map #(-> [% (levenshtein misp %)]))
       (sort-by last)
       (take 3)))
 
(defn dict [init]
  (to-words "/usr/share/dict/words" init))
 
(def dict-ac (dict "ac"))                               ; ❺
 
(time (best "achive" dict-ac))
;; "Elapsed time: 4671.226198 msecs"                    ; ❻
;; (["achieve" 1] ["achime" 1] ["active" 1])
 
(time (best "achive" dict-ac))
;; "Elapsed time: 0.854094 msecs"                       ; ❼
;; (["achieve" 1] ["achime" 1] ["active" 1])
 
 

 

❶  The Levenshtein algorithm presented here is a variation of those available on-line. The important aspect to remember is that it grows roughly as O(n*m), where n and m are the lengths of the strings, which is O(n^2) in the worst case.

❷  This def builds the wrapping function through memoize, conveniently called levenshtein without the final * that is reserved for the non-memoized version.

❸  to-words is a helper function that prepares the dictionary filtered by the initial string. to-words is part of the “static” or “learning” phase of the algorithm, because we can prepare the words off-line and store them for later use.

❹  The best function is responsible for applying the memoized levenshtein function to the words in the dictionary. It sorts the results with sort-by and returns the lowest distances.

❺  This def defines a filtered dictionary of words starting with “ac”, so it isn’t computed multiple times. It also prevents the time function from including the time needed to read and process the file.

❻  The first invocation to search the best matches for the misspelled word returns in almost five seconds.

❼    The second invocation returns much faster.

The memoized version of the Levenshtein distance function stores each new pair of strings as a key and the returned distance as a value in an internal map. Each time the function is invoked with the same arguments, the return value is looked up in the map instead of being recomputed, and the elapsed times above confirm the theory. The example also shows the way the memoized Levenshtein distance is “trained” before use. The application could pre-compute the set of dictionaries by initials (like the indexing happening inside a database). This technique contributes to the speed increase seen in our Levenshtein implementation, but let’s also consider other algorithms that out-perform Levenshtein[2].

What’s in a name: memoize?

One reason why storing arguments and return values is called “memoization” instead of “caching” is that memoization is more specific: it implies two features normally present in functional languages, pure functions and higher-order functions.

Pure functions

The wrapped function needs to be referentially transparent. If there’s a way for the output to be influenced by factors other than function arguments, then cached results could be different depending on the context. The cache would need to be aware of the context and use it as part of the key. Memoization becomes straightforward in functional languages supporting referential transparency.
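A minimal sketch of the problem, assuming a hypothetical price calculation that reads mutable state from an atom (all names here are illustrative):

```clojure
(def tax-rate (atom 0.20))                       ;; external, mutable state

(defn price-with-tax* [price]
  (* price (+ 1 @tax-rate)))                     ;; output also depends on the atom

(def price-with-tax (memoize price-with-tax*))

(price-with-tax 100)
;; => 120.0

(reset! tax-rate 0.25)                           ;; the context changes...

(price-with-tax 100)
;; => 120.0, stale: the cache key is the arguments and ignores the new rate
```

The memoized version keeps answering with the old rate, because nothing in the arguments reflects the changed context.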

Higher-order functions

“Higher order” refers to the property of a function to be treated as a value. As such, the function can be stored, passed to other functions, or returned. Not all languages offer higher-order functions, although the feature is now more common. Describing this kind of caching as “memoization” implies that a function can be transparently decorated with caching capabilities. “Transparently”, in this context, means that the original wrapped function remains untouched.

See Also

These are other functions in the Clojure standard library with a similar or related use to memoize.

  • lazy-seq creates a “thunk” (wrapper function around a value) that evaluates its content on the first access and returns a cached version on following calls. When the thunks are joined together in a sequence they form a lazy sequence. Lazy sequences are comparable to a cache where the order and value of the keys are predetermined. An “evaluate once” semantic on collections could be achieved with lazy-seq. Because all Clojure sequences are lazy, you might already be using a “cached data structure” without knowing it.
  • atom creates a Clojure Atom, one of the possible Clojure reference types. Atoms are used by memoize to store intermediate results. Use an atom directly if the memoize implementation is too restrictive for the kind of caching you need to implement. You could, for example, use something other than a Clojure hash-map to store items, such as a mutable Java map with soft references[3]. Keep in mind that there are already libraries like core.cache (https://github.com/clojure/core.cache) that provide common caching strategies, if that is what you’re looking for.
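The “evaluate once” semantics of lazy-seq can be verified with a side effect, much like the cache-miss message used earlier. The names below are illustrative:

```clojure
(def evals (atom 0))                     ;; counts how many thunks are forced

(defn counting-seq [n]
  (lazy-seq
    (swap! evals inc)                    ;; side effect marks each evaluation
    (cons n (counting-seq (inc n)))))

(def xs (counting-seq 0))

(doall (take 3 xs))
@evals
;; => 3, one evaluation per realized element

(doall (take 3 xs))
@evals
;; => 3, still: realized elements are served from the cached values
```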

Memoize Performance and Implementation Details

O(1) steps (function generation)

O(n log n) steps (generated function), n number of unique keys

O(n) space (generated function), n number of unique keys

The main aspect to consider about memoize is that it stores cached items indefinitely. Constant accumulation of new cached values will eventually exhaust memory. memoize users should pay attention to this fact when designing their solution, more specifically to the expected distribution of keys in the cache. memoize shouldn’t be used in long-running services when the number of argument permutations is potentially infinite or difficult to predict.
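As a sketch of one possible mitigation, here is a crude variant that simply empties the cache once it exceeds a fixed number of keys. This naive eviction is only for illustration; a real solution would use a proper policy such as the LRU or TTL caches provided by core.cache:

```clojure
(defn memoize-bounded [f limit]
  (let [mem (atom {})]
    (fn [& args]
      (if-let [e (find @mem args)]
        (val e)
        (let [ret (apply f args)]
          (swap! mem (fn [m]
                       ;; naive eviction: drop everything once over the limit
                       (assoc (if (>= (count m) limit) {} m) args ret)))
          ret)))))

(def plus (memoize-bounded + 1000))
(plus 1 2)
;; => 3, cached as long as fewer than 1000 keys are stored
```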

We can gather some statistics about the key distribution with a few changes to the original memoize function. The following memoize2 contains additional atoms to collect cache hits, misses and the total number of calls at run-time.

 
 
(defn memoize2 [f]
  (let [mem (atom {})                                         ; ❶
        hits (atom 0)
        miss (atom 0)
        calls (atom 0)]
    (fn [& args]
      (if (identical? :done (first args))                     ; ❷
        (let [count-chars (reduce + (map count (flatten (keys @mem))))]
          {:calls @calls
           :hits @hits
           :misses @miss
           :count-chars count-chars
           :bytes (* (int (/ (+ (* count-chars 2) 45) 8)) 8)}) ; ❸
        (do (swap! calls inc)                                 ; ❹
            (if-let [e (find @mem args)]
              (do (swap! hits inc) (val e))
              (let [ret (apply f args)
                    _ (swap! miss inc)]
                (swap! mem assoc args ret)
                ret)))))))
 
 

❶  Along with the actual cache, additional counters are added to the initial let block.

❷  :done is a sentinel value that can be used to extract statistics at run-time.

❸  This is an estimate of the amount of memory necessary to store the keys, given the number of chars[4].

❹  Additional swap! operations are performed to update the counters.

By accessing the additional stats at run-time, we can estimate the key-space size and the memory footprint. For example, if we run the previous Levenshtein example replacing memoize with memoize2, we can extract the following results:

 
 
(def levenshtein (memoize2 levenshtein*))
 
(best "achive" dict-ac)
;; (["achieve" 1] ["achime" 1] ["active" 1])
 
(levenshtein :done)
;; {:calls 400, :hits 0, :misses 400, :count-chars 5168, :bytes 10376}
 
(best "achive" dict-ac)
;; (["achieve" 1] ["achime" 1] ["active" 1])
 
(levenshtein :done)
;; {:calls 800, :hits 400, :misses 400, :count-chars 5168, :bytes 10376}
 
 

As you can see, the first time the best function is invoked it generates 400 misses, while the second time it results in all hits. We can also estimate the memory taken by the strings stored in the cache, which is around 10KB.

The second aspect to consider when using memoize is the additional hash-map assoc operation and atom swap! that are performed for each new key combination presented as input. The hash-map adds O(log n) steps for each new key, and the atom could underperform under heavy thread contention. Depending on the application requirements, memoize could be built on top of a transient data structure to avoid the performance penalty of filling the cache.

Another option to consider, when possible, is “warming the cache”: the application is still not serving live traffic, but the cache could be populated artificially with the most common keys.
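A sketch of what warming could look like for the Levenshtein example above; warm-cache! and the word lists are hypothetical names introduced here, not part of memoize:

```clojure
(defn warm-cache!
  "Invoke memoized-f for every pair, so later live calls are cache hits."
  [memoized-f misspellings dictionary]
  (doseq [misp misspellings
          word dictionary]
    (memoized-f misp word)))

;; before serving traffic, e.g.:
;; (warm-cache! levenshtein ["achive" "recieve"] dict-ac)
```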

Now that you know some of the ins and outs of using memoize, perhaps you’re interested in learning more about the Clojure standard library. If so, download the free first chapter of Clojure Standard Library and see this Slideshare presentation for more details.

 


[1] The Wikipedia article contains a good introduction to the Levenshtein Distance algorithm: https://en.wikipedia.org/wiki/Levenshtein_distance

[2] See the list of metrics available on Wikipedia: https://en.wikipedia.org/wiki/String_metric

[3] There are several examples of the use of SoftReference for caching in Java. This is a good starting point: http://www2.sys-con.com/itsg/virtualcd/java/archives/0507/shields/index.html

[4] A good enough formula to estimate the amount of memory necessary to store strings in Java is: http://www.javamex.com/tutorials/memory/string_memory_usage.shtml