As part of our recently announced deal with Apple Music, you can now view Genius lyrics for your favorite music within the Apple Music app.
We deliver our lyrics to Apple via a nightly export of newline-delimited JSON objects. With millions of songs in our catalog, these dumps can easily get as big as 5 GB. It’s not quite “big data”, but it’s also not something you can easily open in vim.
Our first iteration of the code that generated these nightly exports was slow and failure-prone. So, we recently did a ground-up rewrite focused on speed and reliability, which yielded significant improvements on both axes; stay tuned for a future blog post on that subject. But other than spot-checking with small data sets, how could we make sure that the new export process wasn’t introducing regressions? We decided to run both export processes concurrently and compare the generated exports from each method to make sure the new version was a comprehensive replacement.
What’s the best way to compare these two 5 GB files? A good first check is whether the new and old exports have the same number of lines; we can do this on the command line by dividing the wc -l (line count) of the old export by the wc -l of the new export using bc. If you haven’t seen bc before, don’t worry: I hadn’t either! It’s a tool for doing simple arithmetic in the console.
```sh
# (filenames here stand in for the real export files)
$ echo "$(wc -l < export_old.json) / $(wc -l < export_new.json)" | bc -l
```
Ok, great! The old export has 99.9% of the line count of the new export, meaning the new version actually has slightly more lines than the old one, so we’re off to a good start.
Next, we can use diff to get the percentage of lines that are different between the new and old exports. We’ll use the --side-by-side and --suppress-common-lines flags so that we can pipe the output from diff directly to wc to get a count of the total lines that differ between the two exports.
```sh
$ diff --side-by-side --suppress-common-lines export_old.json export_new.json | wc -l
```
OOPS! Our diff shows 100% of the lines differing… either we seriously screwed up with this new export or our diff methodology is flawed.
Let’s take a look at how these objects are structured (payload slightly modified for simplicity):
```json
{"genius_id": 1, "title": "Song A", "featured_artists": ["Artist X"], "producers": ["Producer Y"]}
{"genius_id": 2, "title": "Song B", "featured_artists": [], "producers": []}
{"genius_id": 3, "title": "Song C", "featured_artists": [], "producers": []}
```
Fairly standard newline-delimited JSON. Let’s look at the new export:
```json
{"title": "Song C", "producers": [], "featured_artists": [], "genius_id": 3}
{"title": "Song A", "producers": ["Producer Y"], "featured_artists": ["Artist X"], "genius_id": 1}
{"title": "Song B", "producers": [], "featured_artists": [], "genius_id": 2}
```
Yikes: not only does the new export methodology not order songs in the same way, it doesn’t have the same order of keys within each JSON object. This means that even if the actual JSON content of the files were 100% the same, it would look 100% different with our naive diff strategy.
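To see why key order alone wrecks a textual diff, here’s a quick Ruby sketch (the payloads are made up for illustration):

```ruby
require 'json'

# Two lines with identical JSON content, but different key order.
a = '{"genius_id": 1, "title": "Song A"}'
b = '{"title": "Song A", "genius_id": 1}'

a == b                          # => false: a line-based diff sees them as different
JSON.parse(a) == JSON.parse(b)  # => true: the parsed content is identical
```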
jq
My first thought was to write a Ruby script to parse and compare the two exports, but after spending a little time coding something up I had a program that was getting fairly complicated, didn’t work correctly, and was too slow; my first cut took well over an hour. Then I thought: is this one of those situations where a simple series of shell commands can replace a complex purpose-built script?
Enter jq, a powerful command-line tool for processing JSON objects. Note: jq is not related to jQuery, but its name does make googling for examples a little tricky! Up until this point I had mostly used jq for pretty-printing JSON, a feature it is quite good at. For example, you can do:
```sh
$ curl https://api.cdnjs.com/libraries/jquery | jq '.'
```
And see a nice pretty-printed version of the CDNJS response for jQuery.
jq also allows you to dig out specific fields from some JSON. For example, going back to our exports, to get the list of ids from each export:
```sh
$ jq '.genius_id' export_old.json
1
2
3
$ jq '.genius_id' export_new.json
3
1
2
```
That’s pretty much all I had used jq for before looking through these exports. But it turns out that jq is incredibly powerful as a tool for processing JSON (check out the jq cookbook to see some of the neat things that are possible). You can run entire programs, or “filters” as jq calls them (“filter” because it takes an input and produces an output), to iterate over, modify, and transform JSON objects.
How can we use it to solve the problem at hand, diffing these two large JSON files?
Well, first we need to sort these files so that tools like diff can easily compare them. But we can’t just use sort; we need to sort them by the value of the genius_ids in their payload.

It turns out this is quite easy with jq. To sort the exports by genius_id we can run:
```sh
$ jq -cMS --slurp 'sort_by(.genius_id)[]' export_old.json > export_old_sorted.json
$ jq -cMS --slurp 'sort_by(.genius_id)[]' export_new.json > export_new_sorted.json
```
Running through these options:

- -c / --compact-output makes sure the JSON objects remain compact and not pretty-printed
- -s / --slurp reads each object into an in-memory array instead of processing one object at a time, which we need in order to sort the file
- -M / --monochrome-output prevents the JSON from being colorized in the terminal
- -S / --sort-keys makes sure that each JSON object’s keys are sorted, ensuring that the order of keys within each object payload is consistent between exports when we compare them

And, of course, the jq expression to sort the file itself is quite terse! It’s just sort_by(.genius_id), which sorts the slurped-in array by id, and then there’s a little [] on the end which basically splays the sorted array back out into newline-delimited JSON.
This takes a little while, but once it’s done we’ve got two sorted files ready to be compared!
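Incidentally, the heart of this normalization is easy to sketch in Ruby, the language of my abandoned first attempt. The lines below are made-up stand-ins for export data, and a real version would stream the file rather than hard-code it:

```ruby
require 'json'

lines = [
  '{"title": "Song B", "genius_id": 2}',
  '{"genius_id": 1, "title": "Song A"}'
]

# "Slurp" every line into memory, sort the objects by genius_id, then
# re-serialize each one compactly with sorted keys -- the same effect
# as jq -cMS --slurp 'sort_by(.genius_id)[]'.
normalized = lines
  .map { |line| JSON.parse(line) }
  .sort_by { |song| song["genius_id"] }
  .map { |song| JSON.generate(song.sort.to_h) }

puts normalized
# {"genius_id":1,"title":"Song A"}
# {"genius_id":2,"title":"Song B"}
```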
But wait… not so fast. There are still a few keys in our export, specifically featured_artists and producers, that are arrays of string values, and it’s not guaranteed that each export will generate those in the same order.

Not to worry: jq has a solution to that problem too! We want to sort each of those keys in the output as well, which we can do by complicating our expression just a little more:
```sh
$ jq -cMS --slurp 'sort_by(.genius_id) | map(.featured_artists |= sort) | map(.producers |= sort) | .[]' export_old.json > export_old_sorted.json
```
So now the expression is a little more tricky. Let’s break it down.

- map does what you expect and maps over each object, much as sort_by operates on each object.
- Inside the map operation we’re first calling .featured_artists |= sort, which uses the |= update operator to do an in-place alphabetic sort on the featured_artists array. This is a bit confusing, but all it’s doing is running the value of featured_artists through a sort “filter”, sorting it, then assigning that sorted value back to the featured_artists key of the object, and passing on the entire object that featured_artists key is in. It would be equivalent to map(.featured_artists = (.featured_artists | sort)). If you don’t know what that | does, don’t worry… read on!
- Next we use the | operator to pipe the previous step to our next step, which just sorts the producers array exactly as we did the featured_artists array. The pipe operator works exactly like the unix-style pipe on the command line, so we’re essentially sorting the featured_artists array, returning the full object it resides in, and then running that same operation for producers on the result.

And voila! We’ve got two normalized 5 GB JSON blobs; all that’s left is to feed them back into our diff operation just like before to see how similar they are:
```sh
$ diff --side-by-side --suppress-common-lines export_old_sorted.json export_new_sorted.json | wc -l
$ echo "$(diff --side-by-side --suppress-common-lines export_old_sorted.json export_new_sorted.json | wc -l) / $(wc -l < export_new_sorted.json)" | bc -l
```
So after all that normalizing, we find that only 0.2% of the lines differ between the exports! That’s an incredible start for a complete rewrite of a fairly complicated export process. Plus, the whole thing takes about 10 minutes to generate each normalized file on my MacBook Pro and then less than a minute to compare them, already much faster than my naive Ruby script.
The final step was looking through specific differing examples to figure out why the logic produced slightly different export outputs, but getting into the details of that is application logic and not what this post is about.
Hopefully now you’ll reach for jq the next time you want to manipulate JSON files on the command line… or at least when you want to pretty-print an API response.
One thing that bugged me about this solution was the explicit sorting of each key. What if we later added more arrays, or had deeply nested objects? Since we were just comparing two specific export results with an unchanging schema over the course of a couple of weeks, this didn’t really matter, but it kept bugging me, so I poked around looking for a more generic way of normalizing JSON objects.
If you check out the jq FAQ you’ll find that there’s a function called walk, introduced as a built-in after 1.5, which allows you to deeply iterate through JSON objects and modify them. It wasn’t in the version I was using, but it was simple enough to copy it into my program, which it turns out made the code much simpler:
```
def walk(f):
  . as $in
  | if type == "object" then
      reduce keys[] as $key
        ({}; . + { ($key): ($in[$key] | walk(f)) }) | f
    elif type == "array" then
      map(walk(f)) | f
    else
      f
    end;

sort_by(.genius_id)
| map(walk(if type == "array" then sort else . end))
| .[]
```
It turned out that this also made it significantly slower to normalize each file, so I ended up just using the more verbose and brittle version, but the walk version is a lot cleaner!
Also, you might be curious how you can run the above file… you can run jq program files using the -f option, so:
```sh
$ jq -cMS --slurp -f normalize.jq export_old.json > export_old_sorted.json
```
Developers love comments! Everyone who writes a comment thinks that they’re making a Pareto improvement to the codebase, when in fact it’s quite the opposite. Comments are especially dangerous because there are many situations where it seems like a comment will help, but beware the siren’s call. I hate reading articles that make abstract arguments, so enough bloviating; let’s check out some examples. Here are some concrete uses of comments that I’ve seen a lot, and how they can be easily avoided.
Read “Beware the Siren Song of Comments” by Andrew Warner on News Genius.

The most obvious problem was that I was still using the default Octopress theme. It has a lot of nice qualities: it’s easy to navigate around, easy to read, and it’s responsive! Unfortunately, using the default theme meant that my site also looked exactly like everyone else’s.
Now, Octopress is also great because it’s extremely easy for anyone to make a theme. In fact, a bunch of people have already done exactly that. Looking at the list of themes, though, I realized that it was difficult to tell which ones were “the good ones.” Normally when I have a huge list of products that I want to comb through, I’m on a website where I can easily sort by some metadata about the product (e.g. Amazon). My preferred sort is always by popularity: I basically trust the wisdom of the crowd. On Amazon, for example, I’m much more interested in the product with the most reviews than I am in the product with the best average review. Unfortunately, GitHub tables have no such convenient sorting options!
Luckily, I’m a programmer, and, wanting to procrastinate more, I decided that I wanted to write a quick script to sort projects by number of stars. As it turns out, it’s pretty simple to use Nokogiri and Octokit to get the information I want:
This script simply scrapes the list of themes and looks up each theme’s repository on GitHub to fetch its star count, and voila: we have an Amazon-like sort-by-popularity situation (check out the results in the gist comments).
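The real script used Nokogiri and Octokit, but the core of the idea fits in a couple of lines. Here’s a stdlib-only sketch of the extraction step (the HTML snippet and repo name are made up); the star lookup would then go through the GitHub API:

```ruby
# Pull github.com/owner/repo pairs out of the theme list's HTML.
html = '<a href="https://github.com/octothemes/classic">Classic</a>'
repos = html.scan(%r{github\.com/([\w.-]+/[\w.-]+)}).flatten
repos # => ["octothemes/classic"]
```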
After checking out the popular themes, I decided, contrary to my usual shopping strategy, that I wasn’t in love with any of them. I was looking for something simple, single-column, and easy to read. I ended up settling on whiterspace, which, even though it only had 45 stars, was exactly what I was looking for.
So, while I didn’t end up choosing the most popular theme, it was still useful to be able to look at a mapping of themes to popularity. In the end, whiterspace got one more star, and I got a cleaner, more distinct-looking blog. Oh, and in doing all of this work, I ended up with a somewhat interesting topic to blog about (I hope!), accomplishing my original goal in a somewhat roundabout way. Win win win!
Ideally, git blame would give you all the context you need to determine why some code was written. But the reality is that no team is perfectly disciplined, and sometimes you’re going to run across commits with cryptic or ambiguous messages (“bugfix,” anyone?).
When you’ve got an Active Record collection like Post.all or User.first.posts and you want to know its size, you’ve got 3 choices: size, length, and count. At first glance, it seems like these might do the same thing, right? Not so! There are some key differences between them.
TL;DR - use size, it usually “Does the Right Thing”
First off, some background: both Post.all and User.first.posts are instances of ActiveRecord::Relation, a very sneaky and powerful class which manages lazy loading of records from the database (full disclosure: User.first.posts is actually an instance of ActiveRecord::Associations::CollectionProxy, but the difference between the two isn’t really relevant to this article). It makes a best effort to avoid loading, filtering, and ordering records until the last possible minute, when you actually ask for something concrete. It’s that lazy loading which allows you to write code like Post.where(featured: true).order(created_at: :desc).paginate(page: 1), which will generate only one query for the first page of posts. If you want to get the size of a Relation, there are 3 different ways to ask for it:
The simplest of the three methods, length is simply delegated to to_a on the collection; in other words, calling length is equivalent to calling Post.all.to_a.length. It will query for ALL records, initialize Ruby objects for all of them, and then get the size of the array. Probably not what you want if you just want to display the count of the posts on your blog!
count does a SQL count(*) query for the count of the records in the database. You probably want to use this method if you only ever need the count of the records in the association for whatever you’re doing. In the example above, just displaying a count on the page is a perfect use case for count.
Size makes a best-effort attempt to “Do The Right Thing” based on the current state of the collection. Here is the actual source for size, from ActiveRecord::Relation:
```ruby
# Returns size of the records.
def size
  loaded? ? @records.length : count
end
```
Great comment, by the way; I never would have known what size did without it.
Basically, size is a heuristic switch between length and count. If the collection is loaded, it just gets the length of the loaded array; otherwise it will hit the database with a query. As pointed out in a much more informative comment (which is, for some reason, in the CollectionProxy object instead), you’ll end up with an extra query if you call size and then actually need the elements of the collection later.
In a lot of cases the differences are completely irrelevant but, for my money, size is the best of the 3 options. It does the best job of not leaking details about what’s going on under the hood in terms of lazy loading in Active Record.
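The heuristic is easy to see with a toy stand-in for a relation (the class and names here are mine, not Active Record’s):

```ruby
# A toy lazy collection: size uses the already-loaded array when we have it,
# and falls back to a (pretend) COUNT query when we don't.
class ToyRelation
  def initialize(records)
    @records = records
    @loaded = false
  end

  def load
    @loaded = true
    self
  end

  def loaded?
    @loaded
  end

  def count
    @records.size # imagine a SELECT COUNT(*) here
  end

  def size
    loaded? ? @records.length : count
  end
end

posts = ToyRelation.new([:a, :b, :c])
posts.size      # => 3, via the pretend COUNT query
posts.load.size # => 3, via the in-memory array
```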
```ruby
class StaticController < ApplicationController
  def home
    if something_about_the_url?
      render :blog_home
    else
      render :home
    end
  end
end
```
The whole point of the router is to handle stuff about the URL! Instead, move whatever the logic inside something_about_the_url? does upstream to the router layer. For example, say you want to display a different home page for www.mysite.com and blog.mysite.com. This can be accomplished very easily using the router:
```ruby
constraints(subdomain: 'blog') do
  root to: 'blog#home', as: :blog_root
end

root to: 'static#home'
```
Note that, in this specific case, you must have the subdomain route above the root route; otherwise the router will match the route to static#home before it gets to the subdomain constraint. Remember that the router checks routes in order. All set!
I have this experience frequently; usually I can figure out what’s going on, but sometimes it can be quite tricky to track down the source of extra queries. Whenever I want to figure out where a method is getting called from, one easy and lazy solution is to add a debugger statement in that code. But where the heck do I add a debugger for sql statements?
It turns out that Active Record has a fairly unified choke point for query execution on a per-model basis: #find_by_sql. So, now that we know the method, what’s the best way to add the debugger statement? Well, we could just open up the gem code, but then we have to restart our console, and we run the risk of forgetting to remove the statement or otherwise screwing up the gem code in some way that’s difficult to track down. We could monkey patch the method, but even that sounds onerous, especially if we want the method to be usable again without hitting that debugger statement later in our session.
Enter a relatively short addition to your .irbrc or .pryrc! Simply add the following method:
```ruby
def add_debugger(klass, method)
  klass.send(:alias_method, "#{method}_without_debugger", method)

  klass.send(:define_method, method) do |*args, &block|
    if defined?(Pry)
      binding.pry
    else
      debugger
    end

    send("#{method}_without_debugger", *args, &block)
  end
end
```
Note that my debugger preference is pry, if it’s available. You can of course adjust the above code per your preference. Now we can simply run:
```ruby
add_debugger MyModel.singleton_class, :find_by_sql
```
Running through our problem code again, you should find yourself in the debugger for any queries on MyModel. Once in the debugger, simply inspect caller to figure out what pesky bit of code is generating all of the extra queries.
This is great, but it would be incomplete if we had to restart the server in order to remove the debugger statement! The following snippet should do the trick:
```ruby
def remove_debugger(klass, method)
  klass.send(:alias_method, method, "#{method}_without_debugger")
  klass.send(:remove_method, "#{method}_without_debugger")
end
```
Just run remove_debugger MyModel.singleton_class, :find_by_sql, and you’re back to regular development.
Now that you’ve got this method, adding debugging statements to your own code or 3rd party code is a breeze!
Check out my .railsrc for more little one-off development helper methods.
Is there an easier way to do this with pry? Is there a gem that just does this and makes my silly code obsolete? Let me know in the comments!
Setting up this blog was as simple as running git clone git://github.com/imathis/octopress.git a-warner.github.com, then bundle, rake setup_github_pages, rake generate, and rake deploy! Creating this post was just a rake new_post["My first blog post"] away.
The writing process is extremely simple: just run rake preview until it looks right, and then rake deploy after committing your changes.
Not that I should be surprised, but using Octopress is really a breeze, and I highly recommend it to anybody looking to crank out a quick blog with minimal setup and maintenance.