We needed to do simple analytics on OpenGovernment, but not of the Google Analytics variety. We needed each object in the system to have view count aggregates that we could show in real time on the page, and we needed to be able to pull top ten lists and stuff.
Paul Dix at Web 2.0 Expo last year got me thinking about the idea of using ‘the right data store for the job’, which might mean a key-value store, a relational db, or perhaps a stone tablet. So when we started looking at doing basic view counts for objects in OpenGovernment (bills, people, etc.), my instinct was to avoid using PostgreSQL to store them. It seemed overwrought. And because analytics functionality is largely separate from the main application, it seemed reasonable to be able to split it off and go with some other data store.
So we started using MongoDB and have been very pleased with it. Rather than record every hit as a separate document, we make a maximum of one new document per hour and we store a total for that hour. Here’s the Page model:
```ruby
class Page
  include MongoMapper::Document

  many :page_views

  key :url, String, :indexed => true
  key :countable_id, Integer, :required => true, :indexed => true
  key :countable_type, String, :required => true, :indexed => true
  ...
end
```
And the PageView model:

```ruby
class PageView
  include MongoMapper::Document

  key :page_id, Object, :required => true
  belongs_to :page

  key :hour, Time, :indexed => true
  key :total, Integer, :required => true, :default => 0 # number of views
  key :countable_type, String, :indexed => true, :required => true
  key :subdomain, String, :indexed => true, :required => true

  before_validation :set_countable_type

  protected

  def set_countable_type
    self[:countable_type] = Page.find(page_id).countable_type
  end
end
```
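The one-document-per-hour scheme boils down to truncating each hit's timestamp to the top of the hour and using that as the `:hour` key. Here's a minimal sketch in plain Ruby — the `hour_bucket` helper name is ours, for illustration, not from the app:

```ruby
# Truncate a timestamp to the top of its hour (UTC). Each hit is then
# recorded against the PageView document for that hour bucket, typically
# with an upsert that increments :total.
def hour_bucket(time)
  Time.at(time.to_i - (time.to_i % 3600)).utc
end

hit = Time.utc(2011, 2, 3, 14, 37, 22)
hour_bucket(hit) # => 2011-02-03 14:00:00 UTC
```

Every hit in the same hour maps to the same bucket, so the collection grows with time, not with traffic.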
We could do away with the Page model entirely if we didn't want to store URLs. In PageView, we're replicating `countable_type` to make aggregation by object type easier, and the `subdomain` field makes it easier to show "top ten bills in California" vs. some other state. We had to write one map/reduce job to aggregate page views for a given page over an arbitrary time frame. That took 10 lines of code, which was 9 lines more than the equivalent Postgres query, but hey! We can do sharding! OK, we won't need that for a while…
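The aggregation itself is conceptually simple. The real job runs inside MongoDB as JavaScript map/reduce, but as an illustration, here is the same logic in plain Ruby — each hash stands in for one hourly PageView document:

```ruby
# Sum hourly view totals over an arbitrary time frame [from, to).
# Plain-Ruby illustration of what the map/reduce job computes.
def views_in_range(page_views, from, to)
  page_views
    .select { |pv| pv[:hour] >= from && pv[:hour] < to }
    .map    { |pv| pv[:total] }
    .inject(0, :+)
end

hours = [
  { :hour => Time.utc(2011, 2, 3, 13), :total => 7 },
  { :hour => Time.utc(2011, 2, 3, 14), :total => 5 },
  { :hour => Time.utc(2011, 2, 4, 9),  :total => 2 },
]
views_in_range(hours, Time.utc(2011, 2, 3), Time.utc(2011, 2, 4)) # => 12
```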
With the Mongo models in place, we now needed to mix analytics into one of our Postgres models. So we made a `lib/trackable.rb` that adds `Class.most_viewed`. The trickiest bit came when we needed to make the join between, say, Mongo's most-viewed `bill_id`s and the actual bill rows in Postgres. Given a list of 20 `bill_id`s from Mongo that are ordered by most viewed, how would we get that exact list out of Postgres without making 20 separate SQL calls?
In MySQL you can get rows back in an explicit order by using `FIELD()`. But not in Postgres 8.4. Our best option was, unfortunately, `ORDER BY CASE WHEN bill_id = 5 THEN 1 WHEN bill_id = 4 THEN 2 WHEN bill_id = 99 THEN 3 ...`. Ugly, but it's plenty fast. So we, ahem, monkey-patched `ActiveRecord::Base` with a `find_in_explicit_order` method that takes an array and returns a scope. If you can think of a better way to do this in Postgres, I'd love to hear it.
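The heart of such a method is just building that CASE expression from the ordered id list. A sketch — names are ours, not necessarily the app's exact code, and the `to_i` keeps untrusted ids out of the SQL:

```ruby
# Build a Postgres ORDER BY expression that returns rows in the exact
# order of the given ids.
def explicit_order_clause(column, ids)
  whens = ids.each_with_index.map do |id, i|
    "WHEN #{column} = #{id.to_i} THEN #{i + 1}"
  end
  "CASE #{whens.join(' ')} END"
end

explicit_order_clause('bill_id', [5, 4, 99])
# => "CASE WHEN bill_id = 5 THEN 1 WHEN bill_id = 4 THEN 2 WHEN bill_id = 99 THEN 3 END"
```

A `find_in_explicit_order` built on this would then do something like fetch all rows matching the ids in one query and order by that clause.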
The last piece is the hit tracking itself: a snippet of JavaScript that runs on `document.ready()` and injects an `<img>` tag in the document, and a Rack app that simply responds to `/tracking.gif` and marks the hit in MongoDB.
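Since a Rack app is just an object that responds to `call`, the tracking endpoint can be sketched in a few lines. This is illustrative, not the app's exact code; the hit-recording is stubbed out where the real app writes to MongoDB:

```ruby
require 'base64'

# A 1x1 transparent GIF, the classic tracking pixel.
ONE_PIXEL_GIF = Base64.decode64(
  'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAICRAEAOw=='
)

# Minimal sketch of the tracking endpoint as a bare Rack app.
TrackingApp = lambda do |env|
  if env['PATH_INFO'] == '/tracking.gif'
    # Real app: parse the referrer/params here and upsert the hourly
    # PageView document in MongoDB.
    [200, { 'Content-Type' => 'image/gif' }, [ONE_PIXEL_GIF]]
  else
    [404, { 'Content-Type' => 'text/plain' }, ['Not Found']]
  end
end
```

Because it never touches the Rails stack, the endpoint stays cheap even under heavy traffic.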
Oh, and we put `/tracking.gif` into our robots.txt, of course. And that's it. We mark hits asynchronously, we store them in as little space as we can, and we return aggregates very quickly when needed. I think this will scale up quite nicely. If you're curious, you can browse the full source on our GitHub.
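(The robots.txt entry keeps well-behaved crawlers from inflating the counts; it's just:

```
User-agent: *
Disallow: /tracking.gif
```

though nothing stops an impolite bot, of course.)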
PS. This is the kind of thing that yearns to be put into its own gem. If you’re a Rails developer and want to volunteer, this would be an awesome project because I think it has a lot of potential uses beyond our app. We’re always hanging out in #opengovernment on FreeNode IRC, so come on by.