Optimize Prime

To content | To menu | To search

Friday, February 12 2010

Counting Uniques With MongoDB

I've recently started using MongoDB at work. It's great so far - very fast, good APIs, flexible. However, there's been one big problem: there's no documentation on how to write queries. I mean, there's a manual and it tells you about all the operators you can use, but there's nowhere you can go to read about the correct approach to writing different kinds of queries. So from trial and error, I'm going to contribute a piece which could be part of such a guide.

How to count unique [viewers|actions|users|etc] with MongoDB

First, if you don't care about breaking down the data, there is an extremely simple way to get the number of uniques (aka "distincts"):

db.events.distinct("uid", {type: "pageview"}).length;

This has the downside of loading every single pageview into memory, as far as I can tell. [mstearn from #mongodb on freenode adds that distinct() is limited to returning 4MB, making it unsuitable for large data sets] Since a call to distinct() returns an Array instead of a Cursor, you can't just call count(). You can, however, extend it to get all the pageviews in a set of predefined categories pretty easily:

['search', 'frontpage', 'checkout'].map(function(category){
   return [category, db.events.distinct("uid", {type: "pageview", category: category}).length];
});

OK, but say you have a lot of pageviews and you don't want to load the entire list into memory. The next attempt I came up with used the more exotic mapReduce looked like this:

db.events.mapReduce(
  function(){
   emit([this.category, this.uid].join('-'), true);
  },
  function(key, values){
    //values is irrelevant, just a list of true for each time it's appeared
    var keys = key.split('-');
    return keys.slice(0, keys.length - 1).join('-');
  },
  {query: {type: "pageview"}}
).group(
 ["value"],
 {},
 {uniques: 0},
 function(obj, acc){
   acc.uniques += 1;
 }
)

Downsides: it's very ugly, and I suspect it's doing a lot of useless work. I just want to count the number of uniques for a given set of restrictions...I should not be doing all this random string manipulation. Upside: it gets all categories, not just your predefined set. It will scale up nicely once MongoDB gets sharding working right. [mstearn from #mongodb informs me that group has the same limitations as distinct and thus this doesn't scale up. Sad.]

I thought I could do better, and reading http://www.slideshare.net/gabriele.lana/couchdb-vs-mongodb-2982288 I had the inspiration to do *two* mapReduce() calls:

db.events.mapReduce(
   function(){
        emit(
         [this.category, this.uid],
         {category: this.category, uid: this.uid}
        );
   },
   function(key, values){
      return values[0];
   },
   {query: {type: "pageview"}}
).map_reduce(
  function(){
     emit(this.category, 1);
  },
  function(key, values){
     return Array.sum(values);
  }
)

Upsides: this is the most elegant solution, from one point of view. And it scales up nicely, with no limits. Downsides: It's a two phase map reduce and thus kind of slow. This is the current version of the code I have checked in.

Let me know if you have any insight on the best way to do these queries!

Friday, October 10 2008

Don't compile Hadoop more than once: Shutting down. Incompatible buildVersion.

I've been following along setting up my 2 node cluster by the guide at http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) and I love it. Great guide.

It even anticipates the first problem I had (with the incompatible namespaceIDs), which was awesome. Now I have another problem:

Starting up DFS works fine now, but when I try to bin/start-mapred.sh the TaskTracker fails to start on the slave (it starts fine on the master). On the slave node task trackerlogs I see the error message "Shutting down. Incompatible buildVersion." The version it claims to start with on the slave tasktracker, the master node's tasktracker, and the master's jobtracker is 0.18.2-dev everywhere.

I see "Report from tracker_micro2:localhost/127.0.0.1:46035: Shutting down. Incompatible buildVersion" in the JobTracker log as well.

What could be causing this? I'm totally stumped.

(10 minutes of feeling stupid follow)

Just found another bit of info:

~/hadoop-0.18.1$ bin/hadoop version Hadoop 0.18.2-dev Subversion -r Compiled by jtv on Thu Oct 9 20:45:34 PDT 2008

~/hadoop-0.18.1$ bin/hadoop version Hadoop 0.18.2-dev Subversion -r Compiled by jtv on Thu Oct 9 20:40:07 PDT 2008

I now have a hypothesis that the different "compiled on" dates are messing it up.

(5 minutes of trying to remember the exact syntax for how to turn a directory into a tarball follow)

That was indeed the issue: having downloaded the same hadoop source onto both boxes and compiled it twice, the tasktracker on one refused to talk to the jobtracker on the other. Taring up the compiled results from one and copying them to the other made everything work.

Saturday, August 23 2008

Shortest parser

We have a programming problem we give out at Justin.tv: write a parser to convert arithmetic expressions from infix to prefix (should deal with parens and arbitrary variable names). I wrote a version when I first came up with the problem, but I've lost it. So I thought I'd try writing it again and see how short I could get it.

# parse simple (no parens) arithmetic expressions => prefix expressions
def parse_simple(expr, opers=%w(+ - * /))
  return expr if opers.empty? 
  return parse_simple(expr, opers[1..-1]) unless expr.include?(opers.first)
  "(#{opers.first} " + 
  expr.split(opers.first).map {|sub_expr| parse_simple(sub_expr, opers[1..-1]).strip}.join(" ") +
   ")"
end

# remove parens, then parse simple
def parse(expr, exprs = [])
  while expr.gsub!(/\(([^\(\)]+)\)/) {"expr:#{(exprs << parse_simple($1)).size - 1}"}; end
  expr = parse_simple(expr)
  while expr.gsub!(/expr:(\d+)/) {exprs[$1.to_i]}; end
  expr
end

I'm fairly satisfied with it; you could remove some lines through stupid ruby tricks, but I don't see any more concise way to do it. I'd love to see your stab at it though - how short can you get it? Language of choice.

Monday, March 3 2008

Don't use Pound for load balancing

We were using Pound for load balancing at Justin.tv until today. It was consistently using about 20% CPU, and during spikes would use up to 80% CPU. Under extremely high load, it would occasionally freak out and break.

We just switched to Ngnix, and load immediately dropped to around 3% CPU. Our pages feel a little snappier, although that might be my imagination. Not only is the config format easier to understand and better documented, but it offers a full webserver's complement of functionality. We haven't hit any spikes yet, but given the current performance I suspect it will cream pound.

In short: Pound is out-dated. Nginx is a good replacement, although there are many, many, many other options I haven't tried.

Nginx vs. Pound

Edit: More data now available!

Week view on Pound vs. Ngnix

Saturday, March 1 2008

The next tinyurl

Tinyurl seems like a ridiculous idea for a website. The service is so simple that any competent hacker I know could write the central feature in under half an hour. It's not even original, in the sense that hashing long keys to short keys is one of the oldest tricks in computer science.

Yet writing it as a web service was original: Tinyurl is an Alexa top-1000 site. That's more traffic and usage than 90% of the startups I've met have (or will probably ever have). Not bad for what couldn't have been more than an afternoon's work. I don't believe it's the only simple utility that should exist, and yet currently doesn't. It opens up the tantalizing possibility that the right little hack could be used by millions of people.

The best idea in this vein I've had so far applies the tinyurl concept to the page itself. Most pages on the internet consist of a tiny patch of content, surrounded by acres of ancillary and mostly unnecessary wrapping. Given a url, the utility returns an embed for the actual content portion; the way this works on YouTube, Alexa, Flickr, or Justin.tv(1) should be obvious. I can imagine using this kind of a utility in several contexts in Justin.tv, and obviously it would quickly become a favorite with linkjackers everywhere (2).

Unfortunately, unlike tinyurl, this project would probably take a week to do right, and I don't really have the time right now. So I doubt I'll get around to writing this, but I hope someone else does. Let me know how it goes, so I can claim credit for all your hard work.

1. Sorry for the plug.

2. As tinyurl is the friend of the shock-site trickster.

Wednesday, August 22 2007

Insecure By Default

I never really paid much attention to the administration of SSH on my boxes. I figured debian probably set it up to be essentially secure by default.

While the algorithm is perfectly secure, there's a big problem. I was poking around my system logs after reading an article about someone else's box being compromised, and discovered multiple ip addresses trying to root passwords. Now, we have root login disabled on our boxes. But the fact that someone could just sit there guessing disturbed me greatly. What if someone actually wanted to break into our boxes? It's not like our usernames are highly obfuscated.

Turns out there's an easy solution:

$ sudo apt-get install denyhosts

Denyhosts blocks (temporarily) anyone who makes a sufficient number of failed login attempts. Why this behavior is not default, I am very unclear. Just whitelist your own ip to make sure you don't lock yourself out:

$ nano WORK_DIR/allowed-hosts

(look up the WORK_DIR in denyhosts.conf)

When I installed denyhosts it instantly locked out 4 people who were *currently* trying to crack our network. If you haven't considered installing it yourself, perhaps you should.

Monday, April 9 2007

Google Ownz Me

Apparently I've been using my gmail account too much:

Sector 6

I'm not using anything like GmailFS...I'm not even close to filling up my gmail account. I wonder what I did to trigger this?

If someone at Google reads this - FIX MY EMAIL.

Tuesday, March 6 2007

Oh the things that you'll see

I've wound up hacking a lot of Flash recently, which is a real PITA if you're used to real development environments. For one thing, there's no way to get debug output from a flash widget once it's been embedded in a real page. Actually it turns out that there is: the Flash debug player. After you install it, all trace output from any embedded flash player is written to a log file. Throw tail -f on the file, and you can watch the trace messages from your flash in real time. The cool thing is, you can also watch the debug messages from everyone else. There are a few interesting ones, but the best I've found so far is the YouTube embedded player. The tastiest bits:
START PLAYING :http://www.youtube.com/get_video?video_id=_-XoafyJ9K4&t=OEgsToPDskIMqdmC3eeaF1meusSwSKjs
playing.. the movie
status code is:NetStream.Play.Start
we got meta fuck yeah!
time is:127.494
and
result xml:0
node is:0 length:1
status code is:NetStream.Buffer.Flush
status code is:NetStream.Buffer.Empty
showing the goddamn play button
I'm always glad to see how much I have in common with the engineers at other companies.

Thursday, January 4 2007

Food Riffs

Justin and I were discussing baking lemon bars the other day, when the same thought hit us both simultaneously: why are we so limited, to consider only the Lemon variety?

Lemon: OLYMPUS DIGITAL CAMERA Lime: OLYMPUS DIGITAL CAMERA Orange: OLYMPUS DIGITAL CAMERA

They're still cooling. I'll update with the success, flavor-wise, tomorrow. Later on the agenda: Grapefruit!

Tuesday, December 19 2006

What's Wealth?

From The Consumerist:

A recent United Nations study on personal wealth found that having just $2,200 per adult puts a household in the top 50% of the world's >richest. However, thanks to large amounts of consumer debt, "many people in high-income countries have negative net worth and - somewhat >paradoxically - are among the poorest people in the world in terms of household wealth."

What does it truly mean to be wealthy? The American child of middle class parents with student loans and credit cards is much more secure in his access to all the accouterments of wealth for the duration of their entire life than a debt-free child of Ugandan refugees. That security is the true measure of wealth. Merely counting up the quantifiable assets (or debts) held is ridiculously simplistic. First, there are personal intangible assets (the American's college degree), familial assets (the American's parents probably have positive networth), and most importantly societal assets (the security that men won't come and raze your house). And I'd be wealthier renting an apartment, in hock to my neck to the credit card companies in America than living completely debt-free in Uganda.

Monday, December 4 2006

Splay Tree

There was no convenient Ruby implementation of Splay Trees. Now there is; this is pretty much a straight port of the Java version available here. I wrote up a quick set of tests as well.

Thursday, November 30 2006

This Blog Is Boring

I think that's objectively true. What I'm really worried about is that this reflects my current personality, working for a startup: boring. More generally, I've been concerned that everyone in startups winds up, to an outsider, fundamentally boring. The startups in Crystal Towers (we need a collective name - suggestions) were discussing this a few days ago. Someone noted that when they went home they had no idea what to talk about with people, if they couldn't discuss work. Luckily I didn't have that problem so much, but only because there were other people pulling the conversation forward. If I had to generate conversation topics on my own, I'm not sure I'd be able to do it for long without work.

Even this post is about startups and their effects. Which is a disease that I've noticed even great writers I know who are involved in startups. And I'm not even a particularly good writer, let alone great, so it's a fairly grim outlook for me. My only hope is to retire to a monastery in the mountains.

Tuesday, November 7 2006

Somehow it seems I always wind up rolling my own...

I spent a few days looking for a customizable real-time chat component to use on our new project. There were plenty of excellent components (mostly built in flash) that you couldn't customize at all. Since we need a bunch of features that those chat clients don't offer, they were only tempting dead ends. In the end, I decided to roll my own. In the search process I ran across Juggernaut, a Rails plugin for persistent connections. Building basic chat off of Juggernaut has required adding a couple feature enhancements (a system of handlers for connect and disconnect events), but overall it's been a solid base. My new chat project is called Zinzani; most of the functionality is now in place, although the default template is still very ugly.

I'll try to make http://zinzani.rubyforge.org/ a functional demo soon.

Thursday, October 26 2006

That's very liberal of you

It used to be that an act of generosity or mercy would be called "christian" by many. I've always thought that was poor word choice, because those actions aren't really in particularly Christian. "Frank led me to Jesus by his great love of Christ. That was very Christian of him": good word choice. "Frank let me borrow his lawnmower. That was very Christian of him": poor word choice. It's an important word choice as well, because it implies that non-Christians lack those traits and belief that outsiders lack generic positive traits is a symptom of unwarranted discrimination and prejudice.

Luckily, nowadays you don't hear that phrasing very often. It sounds slightly archaic. But I recently overheard something here in San Francisco that disturbed me. Someone said "He helped me set up the event; he's a real liberal." Liberal here is being used in the same sense as Christian; a place holder for "good". This is disturbing because it implies that the members of this culture actually believe that conservatives lack personal, positive traits. That's simply not true, and it's foolish to think otherwise. That kind of attitude is exactly what we should be fighting: people are people, good and bad, no matter what their religion or politics are. Neither Christians nor liberals have any monopoly on kindness.

Sunday, October 22 2006

Adventures in Devices

Steps required to get Philips SPC900NC webcam to work under Ubuntu (Dapper-Drake) Linux:

wget http://www.saillard.org/linux/pwc/snapshots/pwc-v4l2-20061020-042701.tar.bz2

  1. A big thank you to Saillard for putting in all the work for these drivers!

tar xjf pwc-v4l2-20061020-042701.tar.bz2 cd pwc-v4l2-20061020-042701

  1. if you try to run make now, it will fail because there's no /build or /source directory in /lib/module/`uname -r`
  2. so you need to get the kernel headers and softlink them in

sudo apt-get install linux-kernel-headers sudo ln -s /usr/src/linux-headers-2.6.15-27-386/ /lib/modules/2.6.15-27-386/build

  1. now make will work

make sudo make install modprobe pwc

  1. plug in webcam and test:

sudo apt-get install camorama camorama

I don't even want to talk about how many dead ends I had to go down to figure that out. At least I sort of understand how linux device support works now...

Wednesday, October 18 2006

Blacklist Script for Reddit

I've seen a lot of people complaining about the prevalence of political articles on reddit. I myself have thought, "I would rather not see any more stories about the LATEST OUTRAGE BUSH DEMOCRATS OMG". So I wrote a simple greasemonkey script that checks for the presence of various words in article titles. If it finds one of the words, it calls removeSiteDom to remove it.

If you like politics but never want to see "10 worthless CSS tricks!" again, feel free to edit the word list.

(download the script at http://blog.emmettshear.com/public/nopolitics.user.js)

Monday, October 16 2006

The World: Very Very Small

We opened our mailbox at our new place, found a magazine sent to the previous resident. Upstairs in our apartment where the Xobnis are crashing with us, and Matt was browsing through it. He finds none other than Reddit mentioned in an article. Tight circle...

Saturday, October 14 2006

Scratchtop

I threw together scratchtop.com in frustration with all the current ways to write and share simple documents on the web. Why are you making me register? Why do I have to click 10 times to start writing my first document? Why do I have to click edit and save? Why do I have to click at all? Feel free to use scratch top, but be warned that it's still very much development software.

As far as I know, scratch top is the simplest useful web application ever written. Anyone know anything simpler?

Arrival in San Francisco

Haven't posted in a few weeks. Was busy driving across country. Trip was fine, although Mormons stole Justin's sunglasses in Salt Lake City. Time to get to work.

Monday, September 18 2006

Lessons from eBaying Kiko

We at Kiko Software Incorporated have recently had an experience which is fairly unique in the crowded software world: we successfully sold a full software product on eBay. This hasn't been done too many times before (just one, as far as I know), but it's a pretty good way to exit a startup. There are a few potential problems areas though, and I'd like to share what we learned from the process.

Most importantly, specify the contract fully up front. Because we failed to do this, we wound up negotiating that contract after the sale. As anyone who's ever paid a lawyer to negotiate anything knows, that's more expensive than you'd like. Luckily Tucows was very reasonable and the negotiations went fine; if we'd been less lucky it could have been a real problem. Different sets of terms are worth very different amounts. In our deal, we very specifically did not offer a long term support contract with the software, even though that could have potentially increased the final price quite a bit.

If you think you have two potential sets of terms that would appeal to different buyers and you're willing to entertain either, consider selling an option bundled with the software. For example, if we'd been willing to offer a long term support contract for a reasonable price, we could have included that.

Pay careful attention to eBay's terms of service though. From what I can tell, certain kinds of options might be against the rules. Kiko's auction was pulled on the 7th day (out of 10) for having 2 links to the kiko.com (one to our main page, another to the API documentation). Apparently that's one over the limit, and an extremely vigilant community member killed our auction for it. We relisted it again as a 3 day auction and it doesn't seem any long term harm was done, but it was very nerve wracking.

Relisting it for 3 days was actually a mistake. We should have taken the removal as a blessing in disguise and relisted it for 10. Another week and a half would have given allowed more companies to bid. Corporations are slow. Ten days isn't enough time for most of them to both find out about the auction and go through their internal procedures. So a ten day auction is probably too short. If you can find another auction site besides eBay that allows longer auctions, you might consider using them.

We did do at least one thing right when we started our auction at $50,000 instead of $1. It's easy to see why it's a good idea if you consider that the real bids are all going to come in the last ten minutes of the auction. Earlier bids are essentially meaningless noise, and a low starting price can only induce more of those. The disadvantage of a low starting bid is you could be forced to sell for less than your minimum. Why take the chance?

Overall, the experience was extremely positive for us. Even with the mistakes we made the process was very fast, the asset sold for more than we expected, and our baby found a good home with Tucows. If you find your business with an large, valuable asset, consider eBaying it. It may sound crazy, but it seems to work.

- page 1 of 2