I've recently started using MongoDB at work. It's great so far - very fast, good APIs, flexible. However, there's been one big problem: there's no documentation on how to write queries. I mean, there's a manual and it tells you about all the operators you can use, but there's nowhere you can go to read about the correct approach to writing different kinds of queries. So from trial and error, I'm going to contribute a piece which could be part of such a guide.

How to count unique [viewers|actions|users|etc] with MongoDB

First, if you don't care about breaking down the data, there is an extremely simple way to get the number of uniques (aka "distincts"):

db.events.distinct("uid", {type: "pageview"}).length;

This has the downside of loading every single pageview into memory, as far as I can tell. [mstearn from #mongodb on freenode adds that distinct() is limited to returning 4MB, making it unsuitable for large data sets] Since a call to distinct() returns an Array instead of a Cursor, you can't just call count(). You can, however, extend it to get all the pageviews in a set of predefined categories pretty easily:

['search', 'frontpage', 'checkout'].map(function(category){
   return [category, db.events.distinct("uid", {type: "pageview", category: category}).length];
});

OK, but say you have a lot of pageviews and you don't want to load the entire list into memory. The next attempt I came up with used the more exotic mapReduce looked like this:

db.events.mapReduce(
  function(){
   emit([this.category, this.uid].join('-'), true);
  },
  function(key, values){
    //values is irrelevant, just a list of true for each time it's appeared
    var keys = key.split('-');
    return keys.slice(0, keys.length - 1).join('-');
  },
  {query: {type: "pageview"}}
).group(
 ["value"],
 {},
 {uniques: 0},
 function(obj, acc){
   acc.uniques += 1;
 }
)

Downsides: it's very ugly, and I suspect it's doing a lot of useless work. I just want to count the number of uniques for a given set of restrictions...I should not be doing all this random string manipulation. Upside: it gets all categories, not just your predefined set. It will scale up nicely once MongoDB gets sharding working right. [mstearn from #mongodb informs me that group has the same limitations as distinct and thus this doesn't scale up. Sad.]

I thought I could do better, and reading http://www.slideshare.net/gabriele.lana/couchdb-vs-mongodb-2982288 I had the inspiration to do *two* mapReduce() calls:

db.events.mapReduce(
   function(){
        emit(
         [this.category, this.uid],
         {category: this.category, uid: this.uid}
        );
   },
   function(key, values){
      return values[0];
   },
   {query: {type: "pageview"}}
).map_reduce(
  function(){
     emit(this.category, 1);
  },
  function(key, values){
     return Array.sum(values);
  }
)

Upsides: this is the most elegant solution, from one point of view. And it scales up nicely, with no limits. Downsides: It's a two phase map reduce and thus kind of slow. This is the current version of the code I have checked in.

Let me know if you have any insight on the best way to do these queries!