Figwheel, Reloaded

I’m working on an Om Next app with a Pedestal backend. Figwheel is a great way to do ClojureScript development, and the Reloaded workflow is really nice for working on Clojure servers. However, reloading Clojure code in the JVM still has corner cases which are difficult to solve. When you run into one, it’s not always obvious what’s causing the problem. Stuff just starts behaving weirdly. And I ran into just such a bug the other day.

Background

Here’s a description of what I was working on when I found the issue and how it manifested itself. If you’re just interested in the bug and the solution, feel free to skip ahead.

Requests and responses between the Om Next client and Pedestal server are Transit-encoded EDN. When the client sends an update request (called a mutate request in Om parlance) to create a new object, it includes a temporary id for the object. The client also uses this temporary id to reference the item in the app itself until it knows its permanent id. The server response includes a map of any temporary ids and the permanent ids associated with the objects. For example, a request to create a new todo object with an :item/text value of I’m a new item! might look like this:

[(todo/new-item {:item/id #om/id["2e486bfc-aacb-4736-8aa2-155411274e84"],
                 :item/text "I’m a new item!"})]

Here’s a breakdown of what that means:

[...] You can send multiple requests at a time to the remote and they’re grouped in a vector. This request has a single mutate request, which is a list (...).
todo/new-item is the type of mutation. It tells the server how to interpret the rest of the request.
{:item/id #om/id[...], :item/text "..."} is the data associated with the todo/new-item request. This is a map with two keys, :item/id and :item/text.
#om/id["2e486bfc-aacb-4736-8aa2-155411274e84"] is the temporary id that the client associated with the new object. The special #om/id[...] syntax is called a tagged literal. It is read as the literal representation of an om.tempid.TempId instance¹ whose value is the UUID in the enclosed string. In Clojure,² the print-method multimethod has been extended for TempId to write the literal with that syntax.

Similarly, this same method is extended for java.util.Date to print a tagged literal like #inst "1985-04-12T23:20:50.52Z".
"I’m a new item!" is the text value associated with :item/text.

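The tagged-literal round trip described above can be sketched without Om at all. This is a minimal illustration, not Om's actual source: the TempId record, the example/id tag, and the read-tempid function are hypothetical stand-ins for om.tempid.TempId and #om/id.

```clojure
(ns example.tempid
  (:require [clojure.edn :as edn]))

;; Hypothetical stand-in for om.tempid.TempId.
(defrecord TempId [id])

;; Extend the printer so a TempId is written as a tagged literal.
(defmethod print-method TempId [t ^java.io.Writer w]
  (.write w (str "#example/id[\"" (:id t) "\"]")))

;; Reader function: it receives the form that follows the tag,
;; which here is a one-element vector holding the UUID string.
(defn read-tempid [[id]]
  (->TempId id))

(def parsed
  (edn/read-string {:readers {'example/id read-tempid}}
                   "#example/id[\"2e486bfc-aacb-4736-8aa2-155411274e84\"]"))

;; Prints the same tagged literal that was read in.
(println (pr-str parsed))
```

Reading and printing are configured separately: the reader needs a function registered for the tag, and the printer needs a print-method for the type.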
The server parses this mutate request and perhaps writes it to a PostgreSQL database. For example, we might have a table like

CREATE TABLE todo.items (
  item_id BIGSERIAL PRIMARY KEY,
  item_text TEXT NOT NULL
);

And in response to the request, the server issues a SQL statement like

INSERT INTO todo.items (item_text) VALUES (?) RETURNING item_id

The database server returns 852154481843896390 for the item_id, and our Om remote server will return a response to the client:

{todo/new-item
 {:tempids
  {#om/id["2e486bfc-aacb-4736-8aa2-155411274e84"] 852154481843896390}}}

The response is a map ({...}), not a vector ([...]) like the request. This map has a single key, todo/new-item, which corresponds to the type of the mutation. The value associated with todo/new-item is also a map with a single key, :tempids, whose value is yet another map, associating the temporary id the client assigned in the request with the permanent id the database assigned.

Both the request and response data that I’ve shown above are EDN. The actual payloads are Transit encoded. Transit plays a similar role to JSON in providing a way to transfer data between applications. Indeed, Transit can be transferred as JSON (and also MessagePack). It includes more types than JSON, such as 64-bit integers, bytes, points in time, URIs, sets, lists, and maps with composite keys. Transit also includes a way to extend the meaning of the encoded data. This is what allows both the front end and the back end to understand that #om/id, an Om-specific extension to Transit, is to be read as an om.tempid.TempId value.

When Transit-encoded in JSON, the above response should be

["^ ","~$todo/new-item",
 ["^ ","~:tempids",
  ["~#cmap",
   [["~#om/id","2e486bfc-aacb-4736-8aa2-155411274e84"],
     "~i852154481843896390"]]]]

It’s just a nested JavaScript array with a bunch of string values. The Transit reader in the client knows how to convert this back into an EDN value.
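
To make the extension mechanism concrete, here is a minimal sketch of custom Transit handlers using the transit-clj library (cognitect.transit). It is not Om's own writer: the TempId record, the example/id tag, and the write-json and read-json helpers are hypothetical names chosen for illustration, and the sketch assumes transit-clj is on the classpath.

```clojure
(ns example.transit-sketch
  (:require [cognitect.transit :as transit])
  (:import [java.io ByteArrayInputStream ByteArrayOutputStream]))

;; Hypothetical stand-in for om.tempid.TempId.
(defrecord TempId [id])

;; Write handlers are keyed by class: a tag function and a representation function.
(def write-handlers
  {TempId (transit/write-handler (fn [_] "example/id")
                                 (fn [t] (:id t)))})

;; Read handlers are keyed by tag and rebuild the value from its representation.
(def read-handlers
  {"example/id" (transit/read-handler (fn [rep] (->TempId rep)))})

(defn write-json
  "Encodes x as Transit JSON using the given write handlers."
  [x handlers]
  (let [out (ByteArrayOutputStream.)]
    (transit/write (transit/writer out :json {:handlers handlers}) x)
    (.toString out "UTF-8")))

(defn read-json
  "Decodes a Transit JSON string using the given read handlers."
  [s handlers]
  (let [in (ByteArrayInputStream. (.getBytes ^String s "UTF-8"))]
    (transit/read (transit/reader in :json {:handlers handlers}))))

;; Round trip: the value read back equals the value written.
(def round-tripped
  (read-json (write-json (->TempId "2e486bfc-aacb-4736-8aa2-155411274e84")
                         write-handlers)
             read-handlers))
```

Because write handlers are looked up by class, a value only gets its tag when the writer was built with a handler that matches that value's class.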

The Bug

I’ve been using the Untangled client library which provides some useful conventions for writing Om apps. Most of the remote server examples use Ring. The Untangled server uses HTTP Kit, which is largely compatible with Ring.

I’m using Pedestal, and there aren’t a lot of examples out there for using Pedestal with Om. The Om library includes an Om Transit writer which knows how to write om.tempid.TempId values. I just needed to figure out how to wire the Om Transit writer into the interceptor chain. I came across a gist by Andre R that provided an example. I plugged it into my Pedestal app and it worked. Most of the time.

Oh, my. I initially noticed something was wrong because the client app wasn’t always updating as it should from the server responses. I logged the requests and responses to the JavaScript console and saw that sometimes the responses from the server didn’t include the #om/id tag, and that these correlated with the instances when the app wasn’t updating. Here’s an example of what that looked like:

["^ ","~$todo/new-item",
 ["^ ","~:tempids",
  ["~#cmap",
   [["^ ","~:id","2e486bfc-aacb-4736-8aa2-155411274e84"],
    "~i852154481843896390"]]]]

The "~#om/id" value is missing. In its place are the two elements "^ " and "~:id", which is how Transit encodes a plain map with an :id key.

But I also saw times when the responses did include the #om/id tag. And it seemed like it was breaking when I was updating code that had nothing to do with the API responses. That said, it seemed to only happen after I updated code and ran reset. In my notes I wrote:

This has something to do with the Reloaded workflow. On (go), works fine. On (reset), it’s messed up, no longer handling #om/ids

At this point I was at a loss. I knew that there were corner cases where code reloading would cause odd bugs, which was one of the motivations for Stuart Sierra to write the Component library and to suggest guidelines for avoiding these situations. Where had I run afoul of these guidelines? The app is too big and I’m still too new to the libraries to know for sure which areas of the code I can discount, especially as the error seemed to pop up regardless of which sections of the code I was updating.

Building a test case

Time to find a minimal test case. I started with a fresh Pedestal service app. I added a single interceptor to do the Om Transit encoding. As there was no indication in the main project that there was any problem with the server interpreting or processing client requests, I just hard-coded a response value so I could easily test at the command line with curl. Prior to being encoded, the responses looked just fine in the logs. And why test through the browser if I didn’t have to?

And guess what? No luck. Everything worked fine. I couldn’t get the server to fail. So I branched my project app and implemented a similar hard-coded handler, and confirmed I could still see the error. I started ripping out code, trying to work down to the point where the error went away. Rip, (reset), curl, repeat. I ran lein clean, thinking maybe there was some stale code that was causing the issue. And at one point it started working with my command line requests! I couldn’t get it to fail!

I started up Figwheel to confirm it worked from the browser. Yes! No error! I updated the server code to remove some logging I was using for debugging. Reload via (reset), test. And now it’s broken? What is going on? What had I done? Certainly removing logging lines shouldn’t break the Transit writer! But I had also reloaded the code. Could that be it?

I thought back to Stuart Halloway’s Debugging with the Scientific Method talk at Clojure/conj in 2015 and remembered he said something about writing down everything you were doing. So at this point I wrote down the steps that replicated the bug.

terminal 1 lein clean
emacs cider-restart (restarts repl)
repl (go)
terminal 2 curl localhost:8081/om (still works)
repl (reset)
terminal 1 rlwrap lein run -m clojure.main script/figwheel.clj
repl (reset)
terminal 2 curl localhost:8081/om (still works)
editor whitespace change in server.clj
repl (reset)
terminal 2 curl localhost:8081/om (still works)
editor whitespace change in system.clj
repl (reset)
terminal 2 curl localhost:8081/om (still works)
emacs cider-restart
repl (go)
terminal 2 curl localhost:8081/om (still works)
repl (reset)
terminal 2 curl localhost:8081/om (borked! yay!)

Wow. Nineteen laborious, time-consuming steps in three different windows. I did it twice to confirm that these steps replicated the bug, and I was relieved that they did. Finally I had a way to at least reproduce it.

But which steps were necessary? I eventually worked out a set of 8 steps that reliably demonstrated the bug.

Make sure Figwheel and server repls aren’t running
terminal 1 lein clean
terminal 1 rlwrap lein run -m clojure.main script/figwheel.clj
emacs cider-jack-in (starts repl)
repl (go)
terminal 2 curl localhost:8081/om
repl (reset)
terminal 2 curl localhost:8081/om

And this explains why I couldn’t get my minimal test case project to fail. I didn’t have a client app, so I wasn’t running Figwheel. When I added a bare-bones Om client and ran Figwheel, the test case project failed just as consistently with the same steps.

What to do? I knew that the issue was with the Reloaded workflow, and I didn’t need to use that. I could use the lein run-dev script included in the Pedestal service template, which also picks up changes when I load buffers from CIDER into the repl. So this wasn’t a blocker to continued development. It was just blocking development using my preferred workflow. But I did want to use Reloaded.

I polished the code in the test case project to the clearest, smallest test case I could think of. I knew at this point I was going to have to ask for help and wanted to make it as easy as possible for someone to examine and understand what was going on. I pushed it to GitHub, including in the README as much information as I could to explain the issue and how to replicate the bug.

Now it was time to reach out to the community, which would be the mailing lists or Clojurians Slack. But which mailing lists? Which channel? Pedestal? Om? Figwheel? Component? Clojure? ClojureScript? I certainly didn’t want to spam them all.

I decided to try the #clojure Slack channel. I typed up a concise description of the issue and pasted it into the message window. Robert Stuttaford (@robert-stuttaford) responded within a minute and shared that he had encountered a similar issue, and not surprisingly, it entails conflicts between the Figwheel and the Reloaded workflow compilation methods.

First a bit of background on Clojure file types. Clojure files which target only the JVM use the .clj extension. ClojureScript files, targeting JavaScript, use the .cljs extension. Clojure also has a portable Clojure file type with the extension .cljc. Portable Clojure files can target multiple platforms. So when targeting the JVM, you can use .clj and .cljc files. When targeting JavaScript, you can use .cljs and .cljc files.

When working on the server, starting the repl compiles the code required by the server. When reloading code using (reset), clojure.tools.namespace.repl/refresh reloads all Clojure files suitable for the JVM that are on the classpath, not just those required by the server.

When Figwheel compiles the front end code, it copies required .cljs and .cljc files into the resource directory. The resource directory is included on the classpath, so clojure.tools.namespace.repl/refresh, used during the Reloaded workflow, picks up any .cljc files Figwheel has copied there, and these can conflict with the server code already loaded into the JVM.
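
To check which portable files a classpath-wide refresh could see, a small diagnostic along these lines may help. It is a sketch using only the standard library. It reads the java.class.path system property, which is an assumption about how the JVM was launched and may not match exactly what tools.namespace scans. The names classpath-dirs and cljc-files are made up for this example.

```clojure
(ns example.classpath-scan
  (:require [clojure.java.io :as io]
            [clojure.string :as str])
  (:import [java.io File]
           [java.util.regex Pattern]))

(defn classpath-dirs
  "Directories (not jars) listed in the java.class.path system property."
  []
  (->> (str/split (System/getProperty "java.class.path")
                  (re-pattern (Pattern/quote File/pathSeparator)))
       (map io/file)
       (filter (fn [^File f] (.isDirectory f)))))

(defn cljc-files
  "All .cljc files found under dir, searched recursively."
  [^File dir]
  (->> (file-seq dir)
       (filter (fn [^File f]
                 (and (.isFile f)
                      (str/ends-with? (.getName f) ".cljc"))))))

;; Print every portable Clojure file sitting in a classpath directory.
(doseq [dir (classpath-dirs)
        f   (cljc-files dir)]
  (println (.getPath ^File f)))
```

Run from the server repl after Figwheel has compiled the client, any path printed under the resource directory is a candidate for the kind of conflict described above.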

I decided to look into ways of updating the classpath used by repl/refresh. Knowing that the code was written by Stuart Sierra, I would be surprised if there wasn’t a way. Looking at the source code for repl/refresh, I saw in the documentation

The directories to be scanned are controlled by ‘set-refresh-dirs’; defaults to all directories on the Java classpath.

Yes! After some messing around at the repl, I came up with the following function which returns all of the directories repl/refresh would look at by default, sans the resource directory:

(ns user
  (:require ;; ...
            [com.stuartsierra.component :as component]
            [clojure.tools.namespace.repl :as repl]
            [clojure.java.classpath :as cp]
            [clojure.java.io :refer [resource]])
  (:import [java.io File]))
;; ...
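(defn refresh-dirs
  "Remove `resource` path from refresh-dirs"
  ;; With no arguments, start from the dirs currently set in tools.namespace.
  ([] (refresh-dirs repl/refresh-dirs))
  ([dirs]
   (let [;; Parent of the classpath resource "public", i.e. the resource directory.
         resource-path (-> "public" resource .getPath File. .getParent)
         exclusions #{resource-path}
         ;; Fall back to every classpath directory when none have been set.
         ds (or (seq dirs) (cp/classpath-directories))]
     ;; Assumes each entry is a java.io.File, as classpath-directories returns.
     (remove #(contains? exclusions (.getPath %)) ds))))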

Hard-coding the resource-path like that feels a bit hacky. What if there are multiple resource paths that happen to include public? Should I factor out the exclusions set so that it can be defined elsewhere? However, I can tackle those issues if and when I encounter them. Right now this is a pragmatic solution, and it’s not completely ugly. There’s a single place I need to update if I need to change which paths are included. And the dev/server/user.clj file is generally a per-project file anyway.

I updated reset to call repl/set-refresh-dirs with the directories returned by refresh-dirs.

(defn reset
  "Destroys, initializes, and starts the current development system"
  []
  (stop)
  (apply repl/set-refresh-dirs (refresh-dirs))
  (repl/refresh :after 'user/go))

I don’t know whether refresh-dirs needs to be a function or whether I need to call repl/set-refresh-dirs on every reset call, but it’s not a performance bottleneck and it works.

Note that this doesn’t work with clojure.tools.namespace 0.2.11, the stable release as of this writing, because repl/refresh-dirs is marked private there. I’m using 0.3.0-alpha3, where it’s now public.
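
An alternative that avoids reading repl/refresh-dirs is to list the directories to scan rather than excluding the ones to skip. This is only a sketch and not what I ended up using. The directory names "src" and "dev" are assumptions about the project layout, and the paths are resolved relative to the directory the JVM was started in.

```clojure
(ns user
  (:require [clojure.tools.namespace.repl :as repl]))

;; Assumed layout: server code lives in src/ and dev helpers in dev/.
;; Anything Figwheel copies into the resource directory is never scanned.
(repl/set-refresh-dirs "src" "dev")
```

The trade-off is that every new source directory has to be added to this list by hand, whereas the exclusion approach picks up new classpath directories automatically.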

I have a general solution, but I was still curious which namespace was causing the issue. Here is the list of .cljc files Figwheel was copying into resources:

> find ./resources/public/js -name "*.cljc"
./resources/public/js/cljs/stacktrace.cljc
./resources/public/js/om/next/impl/parser.cljc
./resources/public/js/om/next/protocols.cljc
./resources/public/js/om/tempid.cljc
./resources/public/js/om/transit.cljc
./resources/public/js/om/util.cljc

The culprit is om/transit.cljc. I ran Figwheel, deleted om/transit.cljc, and then ran (reset) and confirmed that the code worked as expected. This makes sense, given the affected code uses the om.transit namespace. As expected, deleting a different file, such as om/tempid.cljc, did not fix the bug. However, deleting files from resources isn’t a very convenient or robust solution.

¹ In Clojure (on the JVM), it’s a record. In ClojureScript, it’s a type. Records and types are very similar in Clojure the language. I’m not sure why the implementations are different on the JVM and JavaScript.
² In ClojureScript, rather than use a multimethod, the TempId type implements the IPrintWithWriter protocol.