I’m working on an Om Next app with a Pedestal back­end. Figwheel is a great way to do Clojure­Script devel­op­ment, and the Reloaded workflow is really nice for working on Clojure servers. However, reloading Clojure code in the JVM still has corner cases which are diffi­cult to solve. When you run into one, it’s not always obvious what’s causing the prob­lem. Stuff just starts behaving weirdly. And I ran into just such a bug the other day.

Background

Here’s a descrip­tion of what I was working on when I found the issue and how it mani­fested itself. If you’re just inter­ested in the bug and the solu­tion, feel free to skip ahead.

Requests and responses between the Om Next client and Pedestal server are Transit-encoded EDN. When the client sends an update request (called a mutate request in Om parlance) to create a new object, it includes a temporary id for the object. The client also uses this temporary id to reference the item in the app itself until it knows its permanent id. The server response includes a map from each temporary id to the permanent id of the corresponding object. For example, a request to create a new todo object with the :item/text value "I’m a new item!" might look like this:

[(todo/new-item {:item/id #om/id["2e486bfc-aacb-4736-8aa2-155411274e84"],
                 :item/text "I’m a new item!"})]

Here’s a break­down of what that means:

  • [...] You can send multiple requests at a time to the remote and they’re grouped in a vector. This request has a single mutate request, which is a list (...).
  • todo/new-item is the type of mutation. It tells the server how to interpret the rest of the request.
  • {:item/id #om/id[...], :item/text "..."} is the data associated with the todo/new-item request. This is a map with two keys, :item/id and :item/text.
  • #om/id["2e486bfc-aacb-4736-8aa2-155411274e84"] is the temporary id that the client associated with the new object. The special #om/id[...] syntax is called a tagged literal. It is read as the literal representation of an om.tempid.TempId instance [1] whose value is the UUID in the enclosed string. In Clojure [2], the print-method multimethod has been extended for TempId to write the literal with that syntax.

    Simi­larly, this same method is extended for java.util.Date to print a tagged literal like #inst "1985-04-12T23:20:50.52Z".

  • "I’m a new item!" is the text value associated with :item/text.
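The breakdown above can be tried at a JVM repl with nothing but clojure.edn. This is a minimal sketch, not Om's own reader: the TempId record here is a hypothetical stand-in for om.tempid.TempId, and the reader function registered for om/id is an assumption made for illustration.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical stand-in for om.tempid.TempId, used only in this sketch.
(defrecord TempId [id])

(def request-edn
  (str "[(todo/new-item {:item/id #om/id[\"2e486bfc-aacb-4736-8aa2-155411274e84\"],"
       " :item/text \"I'm a new item!\"})]"))

;; The EDN reader hands the form following the tag (here a one-element
;; vector) to the function registered under the tag's symbol.
(def request
  (edn/read-string {:readers {'om/id (fn [[id]] (->TempId id))}}
                   request-edn))

;; The request is a vector holding one list: (mutation-symbol params-map).
(let [[mutation params] (first request)]
  (println "mutation:" mutation)
  (println "text:    " (:item/text params))
  (println "temp id: " (:id (:item/id params))))
```

Nothing is evaluated when the request is read: todo/new-item comes back as a plain symbol, which is what lets the server decide how to dispatch on it.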

The server parses this mutate request and perhaps writes it to a Post­greSQL data­base. For exam­ple, we might have a ta­ble like

CREATE TABLE todo.items (
  item_id SERIAL PRIMARY KEY,
  item_text TEXT NOT NULL
);

And in response to the request, the server issues a SQL state­ment like

INSERT INTO todo.items (item_text) VALUES (?) RETURNING item_id

The data­base server returns 852154481843896390 for the item_id, and our Om remote server will return a response to the client:

{todo/new-item
 {:tempids
  {#om/id["2e486bfc-aacb-4736-8aa2-155411274e84"] 852154481843896390}}}

The response is a map ({...}), not a vector ([...]) like the request. This map has a single key, todo/new-item, which corresponds to the type of the mutation. The value associated with todo/new-item is also a map with a single key, :tempids, whose value is yet another map, associating the temporary id the client assigned in the request with the permanent id the database returned.
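As a rough sketch of that shape, the response can be built and queried as ordinary Clojure data. The function name new-item-response is hypothetical, and clojure.core's tagged-literal stands in for a real om.tempid.TempId value so the example needs no Om dependency.

```clojure
;; Stand-in for the client's temporary id; a real server would receive an
;; om.tempid.TempId here instead of a generic tagged literal.
(def tempid
  (tagged-literal 'om/id ["2e486bfc-aacb-4736-8aa2-155411274e84"]))

(defn new-item-response
  "Builds the response map for a todo/new-item mutation (hypothetical helper)."
  [tempid item-id]
  {'todo/new-item {:tempids {tempid item-id}}})

(def response (new-item-response tempid 852154481843896390))

;; The client can look up the permanent id by the temporary id it sent.
(println (get-in response ['todo/new-item :tempids tempid]))
```

Because the temporary id is used as a map key, whatever value represents it needs sensible equality and hashing, which records and tagged literals both provide.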

Both the request and response data that I’ve shown above are EDN. The actual payloads are Transit encoded. Transit plays a similar role to JSON in providing a way to transfer data between appli­ca­tions. Indeed, Transit can be trans­ferred as JSON (and also MessagePack). It includes more types than JSON, such as 64-bit inte­gers, bytes, points in time, URIs, sets, lists, and maps with composite keys. Transit also includes a way to extend the meaning of the encoded data. This is what allows both the front end and the back end to under­stand that #om/id, an Om-spe­cific exten­sion to Tran­sit, is to be read as an om.tempid.TempId value.

When Tran­sit-en­coded in JSON, the above response should be

["^ ","~$todo/new-item",
 ["^ ","~:tempids",
  ["~#cmap",
   [["~#om/id","2e486bfc-aacb-4736-8aa2-155411274e84"],
     "~i852154481843896390"]]]]

It’s just a nested JavaScript array with a bunch of string values. The Transit reader in the client knows how to convert this back into an EDN value.

The Bug

I’ve been using the Untan­gled client library which provides some useful conven­tions for writing Om apps. Most of the remote server exam­ples use Ring. The Untan­gled server uses HTTP Kit, which is largely compat­ible with Ring.

I’m using Pedestal, and there aren’t a lot of examples out there for using Pedestal with Om. The Om library includes an Om Transit writer which knows how to write om.tempid.TempId values. I just needed to figure out how to wire the Om Transit writer into the interceptor chain. I came across a gist by Andre R that provided an example. I plugged it into my Pedestal app and it worked, most of the time.

Oh, my. I initially noticed some­thing was wrong because the client app wasn’t always updating as it should from the server responses. I logged the requests and responses to the JavaScript console and saw that some­times the responses from the server didn’t include the #om/id tag, and that these corre­lated with the instances when the app wasn’t updat­ing. Here’s an example of what that looked like:

["^ ","~$todo/new-item",
 ["^ ","~:tempids",
  ["~#cmap",
   [["^ ","~:id","2e486bfc-aacb-4736-8aa2-155411274e84"],
    "~i852154481843896390"]]]]

The "~#om/id" value is missing. In its place are the two elements "^ " and "~:id".
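The difference between the two payloads can be reproduced with transit-clj directly. The sketch below assumes com.cognitect/transit-clj is on the classpath; the TempId record and the write-json helper are hypothetical stand-ins, not Om's actual writer. It writes the same response twice, once with a write handler registered for the record class and once without.

```clojure
(require '[cognitect.transit :as transit])
(import '[java.io ByteArrayOutputStream])

;; Hypothetical stand-in for om.tempid.TempId.
(defrecord TempId [id])

;; Write handler that tags TempId values as "om/id" and uses the id as the rep.
(def tempid-write-handler
  (transit/write-handler "om/id" (fn [tempid] (:id tempid))))

(defn write-json
  "Encodes value as Transit JSON using the given class->handler map."
  [value handlers]
  (let [out    (ByteArrayOutputStream.)
        writer (transit/writer out :json {:handlers handlers})]
    (transit/write writer value)
    (.toString out "UTF-8")))

(def response
  {'todo/new-item
   {:tempids {(->TempId "2e486bfc-aacb-4736-8aa2-155411274e84")
              852154481843896390}}})

;; With the handler, the key should be written with the ~#om/id tag.
(println (write-json response {TempId tempid-write-handler}))

;; Without it, the record falls back to the generic map encoding.
(println (write-json response {}))
```

The second output has the same plain-map shape as the broken response, which suggests the writer was not finding a handler for the value it was given after a (reset).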

But I also saw times when the responses did include the #om/id tag. And it seemed like it was breaking when I was updating code that had nothing to do with the API responses. That said, it seemed to only happen after I updated code and ran reset. In my notes I wrote:

This has some­thing to do with the Reloaded work­flow. On (go), works fine. On (reset), it’s messed up, no longer handling #om/ids

At this point I was at a loss. I knew that there were corner cases where code reloading would cause odd bugs, which is one of the motivations Stuart Sierra had for writing the Component library and suggesting guidelines to avoid these situations. Where had I run afoul of these guidelines? The app is too big and I’m still too new to the libraries to know for sure which areas of the code I can discount, especially as the error seemed to pop up regardless of which sections of the code I was updating.

Building a test case

Time to find a minimal test case. I started with a fresh Pedestal service app. I added a single inter­ceptor to do the Om Transit encod­ing. As there was no indi­ca­tion in the main project that there was any problem with the server inter­preting or processing client requests, I just hard-­coded a response value so I could easily test at the command line with curl. Prior to being encoded, the responses looked just fine in the logs. And why test through the browser if I didn’t have to?

And guess what? No luck. Every­thing worked fine. I couldn’t get the server to fail. So I branched my project app and imple­mented a similar hard-­coded handler, and confirmed I could still see the error. I started ripping out code, trying to work down to the point where the error went away. Rip, (reset), curl, repeat. I ran lein clean, thinking maybe there was some stale code that was causing the issue. And at one point it started working with my command line requests! I couldn’t get it to fail!

I started up Figwheel to confirm it worked from the browser. Yes! No error! I updated the server code to remove some logging I was using for debug­ging. Reload via (reset), test. And now it’s broken? What is going on? What had I done? Certainly removing logging lines shouldn’t break the Transit writer! But I had also reloaded the code. Could that be it?

I thought back to Stuart Halloway’s Debug­ging with the Scien­tific Method talk at Clojure/­conj in 2015 and remem­bered he said some­thing about writing down every­thing you were doing. So at this point I wrote down the steps that repli­cated the bug.

  1. terminal 1 lein clean
  2. emacs cider-restart (restarts repl)
  3. repl (go)
  4. terminal 2 curl localhost:8081/om (still works)
  5. repl (reset)
  6. terminal 1 rlwrap lein run -m clojure.main script/figwheel.clj
  7. repl (reset)
  8. terminal 2 curl localhost:8081/om (still works)
  9. editor whitespace change in server.clj
  10. repl (reset)
  11. terminal 2 curl localhost:8081/om (still works)
  12. editor whitespace change in system.clj
  13. repl (reset)
  14. terminal 2 curl localhost:8081/om (still works)
  15. emacs cider-restart
  16. repl (go)
  17. terminal 2 curl localhost:8081/om (still works)
  18. repl (reset)
  19. terminal 2 curl localhost:8081/om (borked! yay!)

Wow. Nineteen laborious, time-consuming steps in three different windows. I ran through them twice to confirm that these steps replicated the bug, and I was relieved that they did. Finally I had a way to at least reproduce it.

But which steps were neces­sary? I even­tu­ally worked out a set of 8 steps that reli­ably demon­strated the bug.

  1. Make sure Figwheel and server repls aren’t running
  2. terminal 1 lein clean
  3. terminal 1 rlwrap lein run -m clojure.main script/figwheel.clj
  4. emacs cider-jack-in (starts repl)
  5. repl (go)
  6. terminal 2 curl localhost:8081/om
  7. repl (reset)
  8. terminal 2 curl localhost:8081/om

And this explains why I couldn’t get my minimal test case project to fail. I didn’t have a client app, so I wasn’t running Figwheel. When I added a bare-bones Om client and ran Figwheel, the test case project failed just as consis­tently with the same steps.

What to do? I knew that the issue was with the Reloaded workflow, and I didn’t need to use that. I could use the lein run-dev script included in the Pedestal service template, which also picks up changes when I load buffers from CIDER into the repl. So this wasn’t a blocker to continued development. It was just blocking development using my preferred workflow. But I did want to use Reloaded.

I polished the code in the test case project to the clear­est, smallest test case I could think of. I knew at this point I was going to have to ask for help and wanted to make it as easy as possible for someone to examine and under­stand what was going on. I pushed it to GitHub, including in the README as much infor­ma­tion as I could to explain the issue and how to repli­cate the bug.

Now it was time to reach out to the commu­nity, which would be the mailing lists or Clojurians Slack. But which mailing lists? Which chan­nel? Pedestal? Om? Figwheel? Compo­nent? Clojure? Clojure­Script? I certainly didn’t want to spam them all.

I decided to try the #clo­jure Slack chan­nel. I typed up a concise descrip­tion of the issue and pasted it into the message window. Robert Stuttaford (@robert-s­tuttaford) responded within a minute and shared that he had encoun­tered a similar issue, and not surpris­ingly, it entails conflicts between the Figwheel and the Reloaded work­flow compi­la­tion methods.

First a bit of background on Clojure file types. Clojure files which target only the JVM use the .clj extension. ClojureScript files, targeting JavaScript, use the .cljs extension. Clojure also has a portable file type with the extension .cljc. Portable Clojure files can target multiple platforms. So when targeting the JVM, you can use .clj and .cljc files. When targeting JavaScript, you can use .cljs and .cljc files.
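Portable files typically handle platform differences with reader conditionals, which pick a form based on the platform doing the reading. A small sketch of that mechanism, runnable at a JVM repl (the string below is a made-up snippet, not taken from any of the libraries mentioned here):

```clojure
;; Source text as it might appear inside a .cljc file.
(def portable-source "#?(:clj :jvm :cljs :javascript)")

;; Reader conditionals are only allowed in .cljc files by default, so when
;; reading from a string they have to be enabled explicitly.
(def platform (read-string {:read-cond :allow} portable-source))

(println platform) ;; prints :jvm when read by Clojure on the JVM
```

This is why the same .cljc file is a valid candidate for loading on both the server and the client.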

When working on the server, starting the repl compiles the code required by the server. When reloading code using (reset), clojure.tools.namespace.repl/refresh reloads all Clojure files suit­able for the JVM that are on the class­path, not just those required by the server.

When Figwheel compiles the front end code, it copies required .cljs and .cljc files into the resource direc­tory. The resource direc­tory is included on the class­path, so clojure.tools.namespace.repl/refresh, used during the Reloaded work­flow, picks up any .cljc files Figwheel has copied there, and these can conflict with the server code already loaded into the JVM.
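To see which portable files a directory scan could pick up, something like the following sketch can be run at the repl. The function name portable-files is hypothetical and the resources/public/js path is only an example, taken from the Figwheel output directory in this project; it uses just clojure.java.io and clojure.string.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn portable-files
  "Returns the sorted paths of all .cljc files under dir (hypothetical helper)."
  [dir]
  (->> (file-seq (io/file dir))
       (filter #(.isFile %))
       (map #(.getPath %))
       (filter #(str/ends-with? % ".cljc"))
       sort))

;; Example call; returns an empty seq if the directory does not exist.
(doseq [path (portable-files "resources/public/js")]
  (println path))
```

Any path printed here is a file that a classpath-wide refresh is able to load into the JVM alongside the server code.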

I decided to look into ways of updating the class­path used by repl/refresh. Knowing that the code was written by Stuart Sierra, I would be surprised if there wasn’t a way. Looking at the source code for repl/refresh, I saw in the documentation

The directories to be scanned are controlled by 'set-refresh-dirs'; defaults to all directories on the Java classpath.

Yes! After some messing around at the repl, I came up with the following func­tion which returns all of the direc­to­ries repl/refresh would look at by default, sans the resource directory:

(ns user
  (:require ;; ...
            [com.stuartsierra.component :as component]
            [clojure.tools.namespace.repl :as repl]
            [clojure.java.classpath :as cp]
            [clojure.java.io :refer [resource]])
  (:import [java.io File]))
;; ...
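(defn refresh-dirs
  "Returns the directories `repl/refresh` should scan, excluding the
  resources directory that Figwheel copies .cljc files into."
  ([] (refresh-dirs repl/refresh-dirs))
  ([dirs]
   (let [;; Parent directory of the `public` resource, i.e. the resources root.
         resource-path (-> "public" resource .getPath File. .getParent)
         exclusions #{resource-path}
         ;; Fall back to the classpath directories when no refresh dirs are set.
         ds (or (seq dirs) (cp/classpath-directories))]
     (remove #(contains? exclusions (.getPath %)) ds))))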

Hard-­coding the resource-path like that feels a bit hacky. What if there are multiple resource paths that happen to include public? Should I factor out the exclusions set so that can be defined else­where? However, I can tackle those issues if and when I encounter them. Right now this is a prag­matic solu­tion, and it’s not completely ugly. There’s a single place I need to update if I need to change which paths are included. And the dev/server/user.clj file is gener­ally a per-pro­ject file anyway.
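If the exclusions ever do need to live elsewhere, one possible refactoring is to pull the filtering into a pure function that takes the exclusion set as an argument. This is only a sketch of that idea, with a hypothetical name (remove-excluded), and it is not what the project currently does:

```clojure
(import '[java.io File])

(defn remove-excluded
  "Removes from dirs every directory whose path is in exclusions.
  Both arguments may contain strings or java.io.File instances."
  [exclusions dirs]
  (let [path     (fn [d] (.getPath (File. (str d))))
        excluded (set (map path exclusions))]
    (remove #(contains? excluded (path %)) dirs)))

;; Example: keep src and dev, drop resources.
(println (map str (remove-excluded #{"resources"}
                                   [(File. "src") (File. "dev") (File. "resources")])))
```

A function like this can be tested without a classpath or a running system, since it only compares paths.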

I updated reset to call repl/set-refresh-dirs with the direc­to­ries returned by refresh-dirs.

(defn reset
  "Destroys, initializes, and starts the current development system"
  []
  (stop)
  (apply repl/set-refresh-dirs (refresh-dirs))
  (repl/refresh :after 'user/go))

I don’t know whether refresh-dirs needs to be a func­tion or whether I need to call repl/set-refresh-dirs on every reset call, but it’s not a perfor­mance bottle­neck and it works.

Note that this doesn’t work with clojure.tools.namespace 0.2.11, the stable release as of this writing, because repl/refresh-dirs is marked private there. I’m using 0.3.0-alpha3, where it’s public.

I have a general solution, but I was still curious which namespace was causing the issue. Here is the list of .cljc files Figwheel was copying into resources:

> find ./resources/public/js -name "*.cljc"
./resources/public/js/cljs/stacktrace.cljc
./resources/public/js/om/next/impl/parser.cljc
./resources/public/js/om/next/protocols.cljc
./resources/public/js/om/tempid.cljc
./resources/public/js/om/transit.cljc
./resources/public/js/om/util.cljc

The culprit is om/transit.cljc. I ran Figwheel, deleted om/transit.cljc, and then ran (reset) and confirmed that the code worked as expected. This makes sense, given the affected code uses the om.transit namespace. As expected, deleting a different file, such as om/tempid.cljc, did not fix the bug. However, deleting files from resources isn’t a very conve­nient or robust solution.


  1. In Clojure (on the JVM), it’s a record. In ClojureScript, it’s a type. Records and types are very similar in Clojure the language. I’m not sure why the implementations are different on the JVM and JavaScript.↩︎
  2. In ClojureScript, rather than use a multimethod, the TempId type implements the IPrintWithWriter protocol.↩︎