Fast, asynchronous analytics with MongoDB

We needed to do simple analytics on OpenGovernment, but not of the Google Analytics variety. We needed each object in the system to have view count aggregates that we could show in real time on its page, and we needed to be able to pull top-ten lists and similar rollups.

Paul Dix’s talk at Web 2.0 Expo last year got me thinking about the idea of using ‘the right data store for the job’, which might mean a key-value store, a relational database, or perhaps a stone tablet. So when we started looking at doing basic view counts for objects in OpenGovernment (bills, people, etc.), my instinct was to avoid storing them in PostgreSQL; it seemed overwrought. And because analytics functionality is largely separate from the main application, it seemed reasonable to split it off onto some other data store.

So we started using MongoDB and have been very pleased with it. Rather than record every hit as a separate document, we make at most one new document per page per hour and store a running total for that hour. Here’s the Page model:

class Page
  include MongoMapper::Document
  many :page_views

  key :url, String, :indexed => true
  # Polymorphic pointer back to the Postgres row (a bill, a person, etc.)
  key :countable_id, Integer, :required => true, :indexed => true
  key :countable_type, String, :required => true, :indexed => true

  ...
end

and PageView:

class PageView
  include MongoMapper::Document

  key :page_id, Object, :required => true
  belongs_to :page

  key :hour, Time, :indexed => true                     # start of the hourly bucket
  key :total, Integer, :required => true, :default => 0 # number of views in that hour

  # Denormalized for easy aggregation by object type and by state subdomain
  key :countable_type, String, :indexed => true, :required => true
  key :subdomain, String, :indexed => true, :required => true

  before_validation :set_countable_type

  protected

  # Copy countable_type down from the parent Page so per-type aggregation
  # doesn't need an extra lookup.
  def set_countable_type
    self[:countable_type] = Page.find(page_id).countable_type
  end
end

We could do away with the Page model entirely if we didn’t want to store URLs. In PageView, we’re replicating countable_type to make aggregation by object type easier, and the subdomain field makes it easier to show “top ten bills in California” vs. some other state. We had to write one map/reduce job to aggregate page views for a given page over an arbitrary time frame. It took ten lines of code, nine more than the equivalent Postgres query, but hey! We can do sharding! OK, we won’t need that for a while…
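For reference, here’s a minimal sketch of that kind of map/reduce job, written against the old Ruby driver’s collection.map_reduce. The method name and exact options are illustrative, not our actual code:

# Aggregate hourly totals for one page over a time window.
class PageView
  # total_views_between is a hypothetical name, not the real job.
  def self.total_views_between(page_id, start_time, end_time)
    map    = 'function() { emit(this.page_id, this.total); }'
    reduce = 'function(key, values) {
                var sum = 0;
                values.forEach(function(v) { sum += v; });
                return sum;
              }'
    result = collection.map_reduce(map, reduce,
      :query => { :page_id => page_id,
                  :hour    => { '$gte' => start_time, '$lt' => end_time } },
      :out   => { :inline => 1 },
      :raw   => true)
    doc = result['results'].first
    doc ? doc['value'].to_i : 0
  end
end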

With the Mongo models in place, we now needed to mix analytics into one of our Postgres models. So we made a lib/trackable.rb that adds object.views and Class.most_viewed; a rough sketch follows. The trickiest bit came when we needed to make the join between, say, Mongo’s most-viewed bill_ids and the actual bill rows in Postgres: given a list of 20 bill_ids from Mongo, ordered by most viewed, how would we get that exact list out of Postgres without making 20 separate SQL calls?
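Here is roughly the shape of that mixin. This is a sketch, not the actual lib/trackable.rb: the Ruby-side looping below stands in for the map/reduce the real code would use, and the method bodies are assumptions.

module Trackable
  def self.included(base)
    base.extend(ClassMethods)
  end

  # Total recorded views for this record, summed across its hourly buckets.
  def views
    page_ids = Page.all(:countable_id => id,
                        :countable_type => self.class.name).map(&:id)
    PageView.all(:page_id => { '$in' => page_ids }).map(&:total).sum
  end

  module ClassMethods
    # Top +limit+ records by total views: rank ids from Mongo, then pull
    # the rows from Postgres in that order via find_in_explicit_order
    # (defined below).
    def most_viewed(limit = 10)
      totals = Hash.new(0)
      Page.all(:countable_type => name).each do |page|
        PageView.all(:page_id => page.id).each do |pv|
          totals[page.countable_id] += pv.total
        end
      end
      ids = totals.sort_by { |_id, total| -total }.first(limit).map(&:first)
      find_in_explicit_order(ids)
    end
  end
end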

In MySQL you can get rows back in an explicit order with the FIELD() function, but not in Postgres 8.4. Our best option was, unfortunately, ORDER BY CASE WHEN bill_id = 5 THEN 1 WHEN bill_id = 4 THEN 2 WHEN bill_id = 99 THEN 3 .... Ugly, but it is plenty fast. So we, ahem, monkey patched ActiveRecord::Base with a find_in_explicit_order method that takes an array of ids and returns a scope. If you can think of a better way to do this in Postgres, I’d love to hear it.
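The patch looks roughly like this; a sketch against Rails 3-era ActiveRecord, so your exact scope methods may differ:

# Build the ORDER BY CASE clause from an id array and return a scope
# that yields rows in exactly that order.
class ActiveRecord::Base
  def self.find_in_explicit_order(ids)
    return where('1 = 0') if ids.empty?  # empty scope for an empty list
    cases = ids.each_with_index.map do |id, i|
      "WHEN #{table_name}.#{primary_key} = #{id.to_i} THEN #{i}"
    end
    where(primary_key => ids).order("CASE #{cases.join(' ')} END")
  end
end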

The controller side has two components: a piece of JavaScript that runs on document.ready() and injects an <img> tag into the document, and a Rack app that simply responds to /tracking.gif and marks the hit in MongoDB.
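A minimal sketch of what that Rack endpoint can look like, assuming the models above. The gif bytes are the standard 1x1 transparent pixel, and the hit-recording details (hour bucketing, upserting straight through the driver) are illustrative:

# Serve a 1x1 transparent gif and upsert the hourly PageView bucket.
class TrackingApp
  GIF = ['47494638396101000100800000ffffff00000021f9' \
         '0401000000002c00000000010001000002024401003b'].pack('H*')

  def call(env)
    req = Rack::Request.new(env)
    record_hit(req) if req.path == '/tracking.gif'
    [200, { 'Content-Type' => 'image/gif',
            'Cache-Control' => 'private, no-cache' }, [GIF]]
  end

  private

  def record_hit(req)
    page = Page.first(:url => req.referer) or return
    hour = Time.at(Time.now.to_i / 3600 * 3600).utc  # floor to the hour
    # $inc the hourly total, creating the document if needed; this skips
    # MongoMapper validations on purpose, so $set the denormalized fields.
    PageView.collection.update(
      { :page_id => page.id, :hour => hour },
      { '$inc' => { :total => 1 },
        '$set' => { :countable_type => page.countable_type,
                    :subdomain      => req.host.split('.').first } },
      :upsert => true)
  end
end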

And that’s it: we mark hits asynchronously, we store them in as little space as we can, and we return aggregates very quickly when needed. I think this will scale up quite nicely. If you’re curious, you can browse the full source on our GitHub. Oh, and of course we kept crawlers away from the pixel with a robots.txt rule along these lines:
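User-agent: *
Disallow: /tracking.gif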

PS. This is the kind of thing that yearns to be put into its own gem. If you’re a Rails developer and want to volunteer, this would be an awesome project because I think it has a lot of potential uses beyond our app. We’re always hanging out in #opengovernment on FreeNode IRC, so come on by.

One Response to Fast, asynchronous analytics with MongoDB

  1. Hootan Nikbakht says:

    Have you guys done any performance benchmarks on MongoDB before adopting it? In my application I am collecting raw data points at up to 100 per second, and so far I have had to purge MySQL to make this happen. I am now exploring MongoDB and am in the process of doing a basic performance comparison between MySQL and MongoDB. Any input on performance would be a big plus, and I’ll try to find you on the IRC. Peace n love.
