As part of our recently announced deal with Apple Music, you can now view Genius lyrics for your favorite music within the Apple Music app.
We deliver our lyrics to Apple via a nightly export of newline-delimited JSON objects. With millions of songs in our catalog, these dumps can easily get as big as 5 GB. It’s not quite “big data”, but it’s also not something you can easily open in vim.
Our first iteration of the code that generated these nightly exports was slow and failure-prone. So, we recently did a ground-up rewrite focused on speed and reliability, which yielded significant improvements on both axes; stay tuned for a future blog post on that subject. But other than spot-checking with small data sets, how could we make sure that the new export process wasn’t introducing regressions? We decided to run both export processes concurrently and compare the generated exports from each method to make sure the new version was a comprehensive replacement.
What’s the best way to compare these two 5 GB files? A good first check is whether the new and old exports have the same number of lines; we can do this on the command line by dividing the wc -l (line count) of the old export by the wc -l of the new export using bc. If you haven’t seen bc before, don’t worry: I hadn’t either! It’s a tool for doing simple arithmetic in the console.
```sh
# (filenames here stand in for the real export files)
$ echo "$(wc -l < export_old.json) / $(wc -l < export_new.json)" | bc -l
```
Ok, great! The old export has 99.9% of the line count of the new export, meaning the new version actually has slightly more lines than the old one, so we’re off to a good start.
Next, we can use diff to get the percentage of lines that are different between the new and old exports. We’ll use the --side-by-side and --suppress-common-lines flags so that we can pipe the output from diff directly to wc to get a count of the total lines that differ between the two exports.
```sh
$ diff --side-by-side --suppress-common-lines export_old.json export_new.json | wc -l
```
OOPS! Our diff shows 100% of the lines differing… either we seriously screwed up with this new export or our diff methodology is flawed.
Let’s take a look at how these objects are structured (payload slightly modified for simplicity):
```json
{"genius_id": 1, "title": "Song A", "featured_artists": ["Artist X"], "producers": ["Producer Y"]}
{"genius_id": 2, "title": "Song B", "featured_artists": [], "producers": []}
{"genius_id": 3, "title": "Song C", "featured_artists": [], "producers": []}
```
Fairly standard newline-delimited JSON. Let’s look at the new export:
```json
{"title": "Song C", "producers": [], "featured_artists": [], "genius_id": 3}
{"title": "Song A", "producers": ["Producer Y"], "featured_artists": ["Artist X"], "genius_id": 1}
{"title": "Song B", "producers": [], "featured_artists": [], "genius_id": 2}
```
Yikes: not only does the new export methodology not order songs in the same way, it doesn’t have the same order of keys within each JSON object. This means that even if the actual JSON content of the files were 100% the same, it would look 100% different with our naive diff strategy.
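To see why key order alone wrecks a textual diff, here’s a quick Ruby sketch (the payloads are made up for illustration):

```ruby
require 'json'

# Two lines with identical JSON content, but different key order.
a = '{"genius_id": 1, "title": "Song A"}'
b = '{"title": "Song A", "genius_id": 1}'

a == b                          # => false: a line-based diff sees them as different
JSON.parse(a) == JSON.parse(b)  # => true: the parsed content is identical
```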
jq
My first thought was to write a Ruby script to parse and compare the two exports, but after spending a little time coding something up I had a program that was getting fairly complicated, didn’t work correctly, and was too slow; my first cut took well over an hour. Then I thought: is this one of those situations where a simple series of shell commands can replace a complex purpose-built script?
Enter jq, a powerful command-line tool for processing JSON objects. Note: jq is not related to jQuery, but its name does make googling for examples a little tricky! Up until this point I had mostly used jq for pretty-printing JSON, a feature it is quite good at. For example, you can do:
```sh
$ curl https://api.cdnjs.com/libraries/jquery | jq '.'
```
And see a nice pretty-printed version of the CDNJS response for jQuery.
jq also allows you to dig out specific fields from some JSON. For example, going back to our exports, to get the list of ids from each export:
```sh
$ jq '.genius_id' export_old.json
1
2
3
$ jq '.genius_id' export_new.json
3
1
2
```
That’s pretty much all I had used jq for before looking through these exports. But it turns out that jq is incredibly powerful as a tool for processing JSON (check out the jq cookbook to see some of the neat things that are possible). You can run entire programs, or “filters” as jq calls them (“filter” because it takes an input and produces an output), to iterate over, modify, and transform JSON objects.
How can we use it to solve the problem at hand, diffing these two large JSON files?
Well, first we need to sort these files so that tools like diff can easily compare them. But we can’t just use sort; we need to sort them by the value of the genius_ids in their payload.

It turns out this is quite easy with jq. To sort the exports by genius_id we can run:
```sh
$ jq -cMS --slurp 'sort_by(.genius_id)[]' export_old.json > export_old_sorted.json
$ jq -cMS --slurp 'sort_by(.genius_id)[]' export_new.json > export_new_sorted.json
```
Running through these options:

- -c / --compact-output makes sure the JSON objects remain compact and not pretty-printed
- -s / --slurp reads each object into an in-memory array instead of processing one object at a time, which we need in order to sort the file
- -M / --monochrome-output prevents the JSON from being colorized in the terminal
- -S / --sort-keys makes sure that each JSON object’s keys are sorted, ensuring that the order of keys within each object payload is consistent between exports when we compare them

And, of course, the jq expression to sort the file itself is quite terse! It’s just sort_by(.genius_id), which sorts the slurped-in array by id, and then there’s a little [] on the end which basically splays the sorted array back out into newline-delimited JSON.
This takes a little while, but once it’s done we’ve got two sorted files ready to be compared!
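Incidentally, the heart of this normalization is easy to sketch in Ruby, the language of my abandoned first attempt. The lines below are made-up stand-ins for export data, and a real version would stream the file rather than hard-code it:

```ruby
require 'json'

lines = [
  '{"title": "Song B", "genius_id": 2}',
  '{"genius_id": 1, "title": "Song A"}'
]

# "Slurp" every line into memory, sort the objects by genius_id, then
# re-serialize each one compactly with sorted keys -- the same effect
# as jq -cMS --slurp 'sort_by(.genius_id)[]'.
normalized = lines
  .map { |line| JSON.parse(line) }
  .sort_by { |song| song["genius_id"] }
  .map { |song| JSON.generate(song.sort.to_h) }

puts normalized
# {"genius_id":1,"title":"Song A"}
# {"genius_id":2,"title":"Song B"}
```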
But wait… not so fast. There are still a few keys in our export, specifically featured_artists and producers, that are arrays of string values, and it’s not guaranteed that each export will generate those in the same order.

Not to worry: jq has a solution to that problem too! We want to sort each of those keys in the output as well, which we can do by complicating our expression just a little more:
```sh
$ jq -cMS --slurp 'sort_by(.genius_id) | map(.featured_artists |= sort) | map(.producers |= sort) | .[]' export_old.json > export_old_sorted.json
```
So now the expression is a little more tricky. Let’s break it down.

- map does what you expect and maps over each object, much as sort_by operates on each object.
- Inside the map operation we’re first calling .featured_artists |= sort, which uses the |= update operator to do an in-place alphabetic sort on the featured_artists array. This is a bit confusing, but all it’s doing is running the value of featured_artists through a sort “filter”, sorting it, then assigning that sorted value back to the featured_artists key of the object, and passing on the entire object that featured_artists key is in. It would be equivalent to map(.featured_artists = (.featured_artists | sort)). If you don’t know what that | does, don’t worry… read on!
- Next we use the | operator to pipe the previous step to our next step, which just sorts the producers array exactly as we did the featured_artists array. The pipe operator works exactly like the unix-style pipe on the command line, so we’re essentially sorting the featured_artists array, returning the full object it resides in, and then running that same operation for producers on the result.

And voila! We’ve got two normalized 5 GB JSON blobs; all that’s left is to feed them back into our diff operation just like before to see how similar they are:
```sh
$ diff --side-by-side --suppress-common-lines export_old_sorted.json export_new_sorted.json | wc -l
$ echo "$(diff --side-by-side --suppress-common-lines export_old_sorted.json export_new_sorted.json | wc -l) / $(wc -l < export_new_sorted.json)" | bc -l
```
So after all that normalizing, we find that only 0.2% of the lines differ between the exports! That’s an incredible start for a complete rewrite of a fairly complicated export process. Plus, the whole thing takes about 10 minutes to generate each normalized file on my MacBook Pro and then less than a minute to compare them, already much faster than my naive Ruby script.
The final step was looking through specific differing examples to figure out why the logic produced slightly different export outputs, but getting into the details of that is application logic and not what this post is about.
Hopefully now you’ll reach for jq the next time you want to manipulate JSON files on the command line… or at least when you want to pretty-print an API response.
One thing that bugged me about this solution was the explicit sorting of each key. What if we later added more arrays, or had deeply nested objects? Since we were just comparing two specific export results with an unchanging schema over the course of a couple of weeks, this didn’t really matter, but it kept bugging me, so I poked around looking for a more generic way of normalizing JSON objects.
If you check out the jq FAQ you’ll find that there’s a function called walk, introduced as a built-in after 1.5, which allows you to deeply iterate through JSON objects and modify them. It wasn’t in the version I was using, but it was simple enough to copy it into my program, which it turns out made the code much simpler:
```
def walk(f):
  . as $in
  | if type == "object" then
      reduce keys[] as $key
        ({}; . + { ($key): ($in[$key] | walk(f)) }) | f
    elif type == "array" then
      map(walk(f)) | f
    else
      f
    end;

sort_by(.genius_id)
| map(walk(if type == "array" then sort else . end))
| .[]
```
It turned out that this also made it significantly slower to normalize each file, so I ended up just using the more verbose and brittle version, but the walk version is a lot cleaner!
Also, you might be curious how you can run the above file… you can run jq program files using the -f option, so:
```sh
$ jq -cMS --slurp -f normalize.jq export_old.json > export_old_sorted.json
```
Developers love comments! Everyone who writes a comment thinks that they’re making a Pareto improvement to the codebase, when in fact it’s quite the opposite. Comments are especially dangerous because there are many situations where it seems like a comment will help, but beware the siren’s call. I hate reading articles that make abstract arguments, so enough bloviating; let’s check out some examples. Here are some concrete uses of comments that I’ve seen a lot, and how they can be easily avoided.
Read “Beware the Siren Song of Comments” by Andrew Warner on News Genius.

The most obvious problem was that I was still using the default Octopress theme. It has a lot of nice qualities: it’s easy to navigate around, easy to read, and it’s responsive! Unfortunately, using the default theme meant that my site also looked exactly like everyone else’s.
Now, Octopress is also great because it’s extremely easy for anyone to make a theme. In fact, a bunch of people have already done exactly that. Looking at the list of themes, though, I realized that it was difficult to tell which ones were “the good ones.” Normally when I have a huge list of products that I want to comb through, I’m on a website where I can easily sort by some metadata about the product (e.g. Amazon). My preferred sort is always by popularity: I basically trust the wisdom of the crowd. On Amazon, for example, I’m much more interested in the product with the most reviews than I am in the product with the best average review. Unfortunately, GitHub tables have no such convenient sorting options!
Luckily, I’m a programmer, and, wanting to procrastinate more, I decided that I wanted to write a quick script to sort projects by number of stars. As it turns out, it’s pretty simple to use Nokogiri and Octokit to get the information I want:
This script simply scrapes the list of themes and looks up each theme’s repository on GitHub to fetch its star count, and voila: we have an Amazon-like sort-by-popularity situation (check out the results in the gist comments).
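The real script used Nokogiri and Octokit, but the core of the idea fits in a couple of lines. Here’s a stdlib-only sketch of the extraction step (the HTML snippet and repo name are made up); the star lookup would then go through the GitHub API:

```ruby
# Pull github.com/owner/repo pairs out of the theme list's HTML.
html = '<a href="https://github.com/octothemes/classic">Classic</a>'
repos = html.scan(%r{github\.com/([\w.-]+/[\w.-]+)}).flatten
repos # => ["octothemes/classic"]
```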
After checking out the popular themes, I decided, contrary to my usual shopping strategy, that I wasn’t in love with any of them. I was looking for something simple, single-column, and easy to read. I ended up settling on whiterspace, which, even though it only had 45 stars, was exactly what I was looking for.
So, while I didn’t end up choosing the most popular theme, it was still useful to be able to look at a mapping of themes to popularity. In the end, whiterspace got one more star, and I got a cleaner, more distinct-looking blog. Oh, and in doing all of this work, I ended up with a somewhat interesting topic to blog about (I hope!), accomplishing my original goal in a somewhat roundabout way. Win win win!
Ideally, git blame would give you all the context you need to determine why some code was written. But the reality is that no team is perfectly disciplined, and sometimes you’re going to run across commits with cryptic or ambiguous messages (“bugfix,” anyone?).
When you’ve got an Active Record collection like Post.all or User.first.posts and you want to know its size, you’ve got 3 choices: size, length, and count. At first glance, it seems like these might do the same thing, right? Not so! There are some key differences between them.
TL;DR - use size, it usually “Does the Right Thing”
First off, some background: both Post.all and User.first.posts are instances of ActiveRecord::Relation, a very sneaky and powerful class which manages lazy loading of records from the database (full disclosure: User.first.posts is actually an instance of ActiveRecord::Associations::CollectionProxy, but the difference between the two isn’t really relevant to this article). It makes a best effort to avoid loading, filtering, and ordering records until the last possible minute, when you actually ask for something concrete. It’s that lazy loading which allows you to write code like Post.where(featured: true).order(created_at: :desc).paginate(page: 1), which will generate only one query for the first page of posts. If you want to get the size of a Relation, there are 3 different ways to ask for it:
The simplest of the three methods, length is simply delegated to to_a on the collection; in other words, calling length is equivalent to calling Post.all.to_a.length. It will query for ALL records, initialize Ruby objects for all of them, and then get the size of the array. Probably not what you want if you just want to display the count of the posts on your blog!
count does a SQL count(*) query for the count of the records in the database. You probably want to use this method if you only ever need the count of the records in the association for whatever you’re doing. In the example above, just displaying a count on the page is a perfect use case for count.
Size makes a best-effort attempt to “Do The Right Thing” based on the current state of the collection. Here is the actual source for size, from ActiveRecord::Relation:
```ruby
# Returns size of the records.
def size
  loaded? ? @records.length : count
end
```
Great comment, by the way; I never would have known what size did without it.
Basically, size is a heuristic switch between length and count. If the collection is loaded, it just gets the length of the loaded array; otherwise it will hit the database with a query. As pointed out in a much more informative comment (which is, for some reason, in the CollectionProxy object instead), you’ll end up with an extra query if you call size and then actually need the elements of the collection later.
In a lot of cases the differences are completely irrelevant but, for my money, size is the best of the 3 options. It does the best job of not leaking details about what’s going on under the hood in terms of lazy loading in Active Record.
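The heuristic is easy to see with a toy stand-in for a relation (the class and names here are mine, not Active Record’s):

```ruby
# A toy lazy collection: size uses the already-loaded array when we have it,
# and falls back to a (pretend) COUNT query when we don't.
class ToyRelation
  def initialize(records)
    @records = records
    @loaded = false
  end

  def load
    @loaded = true
    self
  end

  def loaded?
    @loaded
  end

  def count
    @records.size # imagine a SELECT COUNT(*) here
  end

  def size
    loaded? ? @records.length : count
  end
end

posts = ToyRelation.new([:a, :b, :c])
posts.size      # => 3, via the pretend COUNT query
posts.load.size # => 3, via the in-memory array
```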
```ruby
class StaticController < ApplicationController
  def home
    if something_about_the_url?
      render :blog_home
    else
      render :home
    end
  end
end
```
The whole point of the router is to handle stuff about the URL! Instead, move whatever the logic inside something_about_the_url? does upstream to the router layer. For example, say you want to display a different home page for www.mysite.com and blog.mysite.com. This can be accomplished very easily using the router:
```ruby
constraints(subdomain: 'blog') do
  root to: 'blog#home', as: :blog_root
end

root to: 'static#home'
```
Note that, in this specific case, you must have the subdomain route above the root route; otherwise the router will match the route to static#home before it gets to the subdomain constraint. Remember that the router checks routes in order. All set!
I have this experience frequently; usually I can figure out what’s going on, but sometimes it can be quite tricky to track down the source of extra queries. Whenever I want to figure out where a method is getting called from, one easy and lazy solution is to add a debugger statement in that code. But where the heck do I add a debugger for sql statements?
It turns out that Active Record has a fairly unified choke point for query execution on a per-model basis: #find_by_sql. So, now that we know the method, what’s the best way to add the debugger statement? Well, we could just open up the gem code, but then we have to restart our console, and we run the risk of forgetting to remove the statement or otherwise screwing up the gem code in some way that’s difficult to track down. We could monkey patch the method, but even that sounds onerous, especially if we want the method to be usable again without hitting that debugger statement later in our session.
Enter a relatively short addition to your .irbrc or .pryrc! Simply add the following method:
```ruby
def add_debugger(klass, method)
  klass.send(:alias_method, "#{method}_without_debugger", method)

  klass.send(:define_method, method) do |*args, &block|
    if defined?(Pry)
      binding.pry
    else
      debugger
    end

    send("#{method}_without_debugger", *args, &block)
  end
end
```
Note that my debugger preference is pry, if it’s available. You can of course adjust the above code per your preference. Now we can simply run:
```ruby
add_debugger MyModel.singleton_class, :find_by_sql
```
Running through our problem code again, you should find yourself in the debugger for any queries on MyModel. Once in the debugger, simply inspect caller to figure out what pesky bit of code is generating all of the extra queries.
This is great, but it would be incomplete if we had to restart the server in order to remove the debugger statement! The following snippet should do the trick:
```ruby
def remove_debugger(klass, method)
  klass.send(:alias_method, method, "#{method}_without_debugger")
  klass.send(:remove_method, "#{method}_without_debugger")
end
```
Just run remove_debugger MyModel.singleton_class, :find_by_sql, and you’re back to regular development.
Now that you’ve got this method, adding debugging statements to your own code or 3rd party code is a breeze!
Check out my .railsrc for more little one-off development helper methods.
Is there an easier way to do this with pry? Is there a gem that just does this and makes my silly code obsolete? Let me know in the comments!
Setting up this blog was as simple as running git clone git://github.com/imathis/octopress.git a-warner.github.com, then bundle, rake setup_github_pages, rake generate, and rake deploy! Creating this post was just a rake new_post["My first blog post"] away.
The writing process is extremely simple: just run rake preview until it looks right, and then rake deploy after committing your changes.
Not that I should be surprised, but using Octopress is really a breeze, and I highly recommend it to anybody looking to crank out a quick blog with minimal setup and maintenance.