This post also appeared on the Genius Engineering blog.
As part of our recently announced deal with Apple Music, you can now view Genius lyrics for your favorite music within the Apple Music app.
We deliver our lyrics to Apple via a nightly export of newline-delimited JSON objects. With millions of songs in our catalog, these dumps can easily get as big as 5 GB. It’s not quite “big data”, but it’s also not something you can easily open in vim.
Our first iteration of the code that generated these nightly exports was slow and failure-prone. So we recently did a ground-up rewrite focused on speed and reliability, which yielded significant improvements on both axes (stay tuned for a future blog post on that subject). But other than spot-checking with small data sets, how could we make sure that the new export process wasn’t introducing regressions? We decided to run both export processes concurrently and compare the generated exports from each method to make sure the new version was a comprehensive replacement.
What’s the best way to compare these two 5 GB files? A good first check is whether the new and old exports have the same number of lines; we can do this on the command line by dividing `wc -l` (line count) of the old export by `wc -l` of the new export using `bc`. If you haven’t seen `bc` before, don’t worry: I hadn’t either! It’s a tool to do simple arithmetic in the console.
OK, great! The old export has 99.9% of the line count of the new export, meaning the new version actually has more lines than the old export, so we’re off to a good start.
Next, we can use `diff` to get the percentage of lines that are different between the new and old export. We’ll use the `--suppress-common-lines` flag so that we can pipe the output from `diff` directly to `wc` to get a count of total lines that differ between the two exports.
OOPS! Our diff is showing 100% of the lines differing. Either we seriously screwed up with this new export or our diff methodology is flawed.
Let’s take a look at how these objects are structured (payload slightly modified for simplicity):
Fairly standard newline-delimited JSON. Let’s look at the new export:
Yikes, it appears that not only does the new export methodology not order songs in the same way, it doesn’t have the same order of keys within each JSON object. This means that even if the actual JSON content of the files was 100% the same, it would look 100% different with our naive `diff`.
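To make the failure mode concrete, here is a small sketch with two hypothetical exports that hold identical data but differ in record order and key order. Only `genius_id` and `producers` are field names taken from this post; `title`, the values, and the file names are made up.

```shell
workdir=$(mktemp -d)
# Hypothetical old export: two records, keys in one order.
printf '%s\n' \
  '{"genius_id":1,"title":"Song A","producers":["P1"]}' \
  '{"genius_id":2,"title":"Song B","producers":["P2"]}' > "$workdir/old_export.json"
# Same data, but the records and the keys inside them appear in a different order.
printf '%s\n' \
  '{"title":"Song B","producers":["P2"],"genius_id":2}' \
  '{"producers":["P1"],"genius_id":1,"title":"Song A"}' > "$workdir/new_export.json"

# A line-based diff compares raw text, so no line matches its counterpart.
differing=$({ diff -y --suppress-common-lines \
  "$workdir/old_export.json" "$workdir/new_export.json" || true; } | wc -l)
total=$(wc -l < "$workdir/new_export.json")
echo "$differing of $total lines reported as different"
```

Every line is flagged even though the two files describe exactly the same songs.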
My first thought was to write a Ruby script to parse and compare the two exports, but after spending a little time coding something up I had a program that was starting to get fairly complicated, didn’t work correctly, and was too slow (my first cut took well over an hour). Then I thought: is this one of those situations where a simple series of shell commands can replace a complex purpose-built script?
Enter `jq`, a powerful command-line tool for processing JSON objects. Note: `jq` is not related to jQuery, but its name does make googling for examples a little tricky! Up until this point I had mostly used `jq` for pretty-printing JSON, a feature it is quite good at. For example, you can do:
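As an offline illustration of pretty-printing, the sketch below pipes a small hand-written JSON string through the identity filter. The payload is a hypothetical stand-in, not the real CDNJS response.

```shell
# Hypothetical stand-in for an API response (not real CDNJS data).
response='{"name":"jquery","latest":"https://example.com/jquery.min.js"}'

# The identity filter `.` passes its input through unchanged;
# jq pretty-prints its output by default.
pretty=$(printf '%s' "$response" | jq '.')
echo "$pretty"
```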
And see a nice pretty-printed version of the CDNJS response for jQuery.
`jq` also allows you to dig out specific fields from some JSON. For example, going back to our exports, to get the list of ids from each export:
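A sketch of pulling out the ids, assuming each record carries a top-level `genius_id` key (the sample records and file name are hypothetical):

```shell
workdir=$(mktemp -d)
printf '%s\n' '{"genius_id":42,"title":"Song A"}' '{"genius_id":7,"title":"Song B"}' \
  > "$workdir/old_export.json"

# jq runs the filter once per JSON object in the input stream,
# so this prints one id per record.
ids=$(jq '.genius_id' "$workdir/old_export.json")
echo "$ids"
```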
That’s pretty much all I had used `jq` for before looking through these exports. But it turns out that `jq` is incredibly powerful as a tool for processing JSON (check out the `jq` cookbook to see some of the neat things that are possible). You can run entire programs, or “filters” as `jq` calls them (“filter” because it takes an input and produces an output), to iterate over, modify, and transform JSON objects.
How can we use it to solve the problem at hand, diffing these two large JSON files?
Well, first we need to sort these files so that tools like `diff` can easily compare them. But we can’t just use `sort`; we need to sort them by the value of the `genius_id`s in their payload.
It turns out this is quite easy with `jq`. To sort the exports by `genius_id` we can run:
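A sketch of such a command is shown here, reconstructed from the options and the expression discussed in this section rather than copied from the original pipeline. The file names and sample records are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical export with records out of order and keys in inconsistent order.
printf '%s\n' \
  '{"title":"Song B","genius_id":2}' \
  '{"genius_id":1,"title":"Song A"}' > "$workdir/new_export.json"

# Slurp all records into one array, sort it by id, then emit one object per line.
jq -c -s -M -S 'sort_by(.genius_id)[]' "$workdir/new_export.json" \
  > "$workdir/new_export_sorted.json"
cat "$workdir/new_export_sorted.json"
```

Because `--slurp` holds the whole file in memory, this approach assumes the machine has enough RAM for the full export.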
Running through these options:
- `-c` / `--compact-output` makes sure the JSON objects remain compact and not pretty-printed
- `-s` / `--slurp` reads each object into an in-memory array instead of processing one object at a time, which we need in order to sort the file
- `-M` / `--monochrome-output` prevents the JSON from being colorized in the terminal
- `-S` / `--sort-keys` makes sure that each JSON object’s keys are sorted, ensuring that the order of keys within each object payload is consistent between exports when we compare them
And, of course, the `jq` expression to sort the file itself is quite terse! It’s just `sort_by(.genius_id)`, which sorts the slurped-in array by id, and then there’s a little `[]` on the end which basically splays the sorted array back out into newline-delimited JSON.
This takes a little while, but once it’s done we’ve got two sorted files ready to be compared!
But wait, not so fast. There are still a few keys in our export, specifically `featured_artists` and `producers`, that are arrays of string values, and it’s not guaranteed that each export will generate those in the same order.
Not to worry: `jq` has a solution to that problem too! We want to sort each of those keys in the output as well, which we can do by complicating our expression just a little more:
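The extended expression might look like the sketch below, which is assembled from the pieces described in this section. It assumes every record has both `featured_artists` and `producers` as arrays; `sort` raises an error on a missing (null) value. The sample data and file names are hypothetical.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  '{"genius_id":2,"featured_artists":["Zed","Amy"],"producers":["Q","B"]}' \
  '{"genius_id":1,"featured_artists":[],"producers":["Solo"]}' \
  > "$workdir/new_export.json"

# Sort both string arrays inside every record, then sort the records by id
# and emit them one per line.
jq -c -s -M -S \
  'map(.featured_artists |= sort | .producers |= sort) | sort_by(.genius_id)[]' \
  "$workdir/new_export.json" > "$workdir/new_export_normalized.json"
cat "$workdir/new_export_normalized.json"
```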
So now the expression is a little more tricky. Let’s break it down.
- `map` does what you expect and maps over each object, much as `sort_by` operates on each object.
- Within that `map` operation we’re first calling `.featured_artists |= sort`, which uses the `|=` update operator to do an in-place alphabetic sort on the `featured_artists` array. This is a bit confusing, but all it’s doing is running the value of `featured_artists` through the `sort` “filter”, then assigning that sorted value back to the `featured_artists` key of the object, and passing on the entire object that `featured_artists` key is in. It would be equivalent to `map(.featured_artists = (.featured_artists | sort))`. If you don’t know what that `|` does, don’t worry, read on!
- Next up we use the `|` operator to pipe the previous step to our next step, which just sorts the `producers` array exactly as we did the `featured_artists` array. The pipe operator works exactly like the unix-style pipe on the command line, so we’re essentially sorting the `featured_artists` array, returning the full object it resides in, and then running that same operation for `producers` on the result.
- And then we just feed that object with its sorted arrays into our sort operator from before using another pipe.
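The equivalence between the update operator and the explicit assignment can be checked on a single hypothetical record:

```shell
# Hypothetical record with an unsorted array.
record='{"genius_id":1,"featured_artists":["Zed","Amy"]}'

# Update form: run the current value through `sort` and store the result back.
with_update=$(printf '%s' "$record" | jq -c '.featured_artists |= sort')

# Explicit form: compute the sorted array, then assign it to the same key.
with_assign=$(printf '%s' "$record" \
  | jq -c '.featured_artists = (.featured_artists | sort)')

echo "$with_update"
echo "$with_assign"
```

Both forms return the whole object, not just the sorted array, which is what allows further steps to be chained with `|`.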
And voila! We’ve got two normalized 5 GB JSON blobs. All that’s left is to feed them back into our `diff` operation just like before to see how similar they are:
So after all that normalizing we find that only 0.2% of the lines differ between the exports! That’s an incredible start for a complete rewrite of a fairly complicated export process. Plus this whole thing takes about 10 minutes to generate each normalized file on my MacBook Pro and then less than a minute to compare them, already much faster than my naive Ruby script.
The final step was looking through specific differing examples to figure out why the logic produced slightly different export outputs, but getting into the details of that is application logic and not what this post is about.
Hopefully now you’ll reach for `jq` the next time you want to manipulate JSON files on the command line, or at least if you want to pretty-print an API response.
One thing that bugged me about this solution was the explicit sorting of each key. What if we later added more arrays, or if we had deeply nested objects? Since we were just comparing two specific export results with an unchanging schema over the course of a couple of weeks, this didn’t really matter, but it was bugging me so I poked around looking for a more generic way of normalizing JSON objects.
If you check out the `jq` FAQ you’ll find that there was a function called `walk` introduced as a built-in after 1.5, which allows you to deeply iterate through JSON objects and modify them. It wasn’t in the version I was using, but it was simple enough to copy it into my program, which it turns out made the code much simpler:
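A sketch of what a `walk`-based normalizer could look like follows. The `walk` definition is the commonly published one for `jq` 1.5; defining it locally also works on newer versions, where it simply shadows the built-in. The nested `credits` object, the sample data, and the file name are hypothetical.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  '{"genius_id":2,"credits":{"producers":["Q","B"]},"featured_artists":["Zed","Amy"]}' \
  '{"genius_id":1,"credits":{"producers":["Solo"]},"featured_artists":[]}' \
  > "$workdir/new_export.json"

program='
# Apply f to every value, bottom-up: children first, then the container itself.
def walk(f):
  . as $in
  | if type == "object" then
      reduce keys[] as $key ({}; . + {($key): ($in[$key] | walk(f))}) | f
    elif type == "array" then map(walk(f)) | f
    else f
    end;

# Sort every array at any depth inside each record, then order records by id.
map(walk(if type == "array" then sort else . end)) | sort_by(.genius_id)[]
'

normalized=$(jq -c -s -M -S "$program" "$workdir/new_export.json")
echo "$normalized"
```

Note that this sorts every array it finds, so it only suits data where array order carries no meaning.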
It turned out that this also made it significantly slower to normalize each file, so I ended up just using the more verbose and brittle version, but the `walk` version is a lot cleaner!
Also, you might be curious how you can run the above file. You can run `jq` program files using the `-f` option, so:
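A sketch of running a program from a file follows. The file name `normalize.jq`, its contents, and the sample export are assumptions chosen for illustration.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  '{"genius_id":2,"featured_artists":["Zed","Amy"],"producers":["Q","B"]}' \
  '{"genius_id":1,"featured_artists":[],"producers":["Solo"]}' \
  > "$workdir/new_export.json"

# Write the jq program to its own file; jq treats `#` lines as comments.
cat > "$workdir/normalize.jq" <<'EOF'
# Sort the array-valued keys, then order records by id and emit one per line.
map(.featured_artists |= sort | .producers |= sort) | sort_by(.genius_id)[]
EOF

# -f reads the filter from the file instead of from the command line.
normalized=$(jq -c -s -M -S -f "$workdir/normalize.jq" "$workdir/new_export.json")
echo "$normalized"
```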