Working with MongoDB

28 Jun 2012

I've spent the last six months working on a Facebook app. It's written in Python using Django and uWSGI, but the interesting part is that MongoDB is its primary and only database. That wasn't my choice, but using some cutting-edge tech seemed attractive, so I signed up. This is a collection of small random notes on the features, pitfalls and useful techniques of MongoDB.

Optimizing

Since this application was aimed at a Facebook audience, the first requirement for my work was to optimize as much as possible, and because the database is at the root of a web application, I started looking for optimization techniques. Here is what I found for MongoDB:

  1. Always profile. This should be the first item on every optimizer's list. You have to know what to optimize, so profile first, and by "profiling" I mean "profile everything": the database, server-side code, client-side code, etc.

  2. Create indices and use hints. That was the first thing I expected to find. An index is created with the ensureIndex command. Hints tell the query optimizer to use the right index for a particular query; they are set via the hint command.

  3. Limit results. The less you select, the faster it works. Results can be limited with the limit command.

  4. Select only relevant fields. Same idea: pass a document of field names as the second parameter to the find command.

  5. Rewrite some queries as map/reduce. Map/reduce is easy to parallelize, so if your server has more than one core or CPU (I believe it does) or you have several DB servers, rewriting some "heavy" queries this way can improve their performance, often impressively.

  6. Use modifier operators. I'm talking about $inc, $set, $push and others. They are always faster than a retrieve-update-save cycle.
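Put together in the mongo shell, these tips look roughly like this (the users collection and its fields are made up purely for illustration):

    // Hypothetical "users" collection; names are illustrative only.

    // 2. Create an index and hint the optimizer to use it.
    db.users.ensureIndex({fb_id: 1});
    db.users.find({fb_id: 123456}).hint({fb_id: 1});

    // 3 & 4. Limit results and select only the relevant fields
    // (the second argument to find() is the field map).
    db.users.find({active: true}, {name: 1, fb_id: 1}).limit(20);

    // 6. A modifier operator: one atomic update instead of
    // a retrieve-update-save cycle.
    db.users.update({fb_id: 123456}, {$inc: {visits: 1}});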

Now let's take a closer look at a couple of things in Mongo: indices and map/reduce.

Indices and keys

Indices in Mongo work like the ones in SQL databases, and unique and primary keys are achieved in Mongo through indices. If you're a pedantic programmer who always sets PKs and UKs for the proper fields in your DB, you can simply skip this paragraph. For everybody else: always set a unique key for unique fields! There is no other way to avoid duplicates, because MongoDB has no transactions and therefore no transactional consistency, only atomicity at the level of a single operation. That means the freshness of your data is guaranteed only for the duration of the current DB command.

For example, imagine you haven't set a unique key and instead coded some badass check in your app. It counts the documents whose unique field matches yours; if the count is more than 0 it throws an error, otherwise it creates a new entry. As I said before, freshness is guaranteed only for the current command: if the count was 0 at check time, that doesn't mean there were no similar entries by the time you added the new one. When your app is under heavy load, you'll get tons of duplicate entries in the database.
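The reliable fix is a unique index. In the mongo shell it's a one-liner (the collection and field names here are illustrative):

    // Hypothetical "users" collection; "email" stands in for your unique field.
    db.users.ensureIndex({email: 1}, {unique: true});

    db.users.insert({email: 'foo@example.com'});  // ok
    db.users.insert({email: 'foo@example.com'});  // fails: duplicate key error

Now the database itself rejects the duplicate, no matter how many app processes race to insert it.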

How map/reduce really works

I assume you've already heard about the map/reduce model. It combines two functions borrowed from functional programming: map and reduce. MongoDB's variant of M/R adds a third, optional function: finalize. Here is how these things work:

  1. All the data you've selected goes through the map function first; map is called on each entry. At this step you group the data by some key.

  2. The groups of data from the previous step are then passed to the reduce function, which should perform some magic and return a reduced value for each group.

    Here comes the important part: reduce is not called when there is nothing to reduce. Let me explain with an example.

    Imagine you have a collection of cars. There are five entries in it:

    {'firm': 'Porsche', 'model': '911 Boxster'},
    {'firm': 'Porsche', 'model': 'Carrera GT'},
    {'firm': 'BMW', 'model': 'M3'},
    {'firm': 'BMW', 'model': 'X6'},
    {'firm': 'Audi', 'model': 'Q5'}
    

    Now you want to select count of models for each firm and add 10 to this number. You expect these results:

    ('Porsche', 12),
    ('BMW', 12),
    ('Audi', 11)
    

    You write map function:

    function map() {
        emit(this.firm, 1);
    }
    

    It groups your data like this:

    ('Porsche', 1),
    ('Porsche', 1),
    ('BMW', 1),
    ('BMW', 1),
    ('Audi', 1)
    

    And some internal Mongo voodoo groups them again:

    ('Porsche', [1, 1]),
    ('BMW', [1, 1]),
    ('Audi', 1)
    

    And pass to reduce function:

    function reduce(key, vals) {
        var count = 0;
        vals.forEach(function(e) {
            count += e;
        });
        return count + 10;
    }
    

    And you get these:

    ('Porsche', 12),
    ('BMW', 12),
    ('Audi', 1)
    

    Notice the difference?

    Now you must be thinking "what's going on?" Look closer at the mapped results: Porsche and BMW have arrays of 1s, while Audi has only a single 1. Because there is only one value in its group, Mongo considers it already reduced and doesn't call the reduce function on it.

  3. "Well, what should I do if I want correct results?" you ask. You should user finalize function. That's where your data passed after reducing. After rewriting reduce and finalize you'll get expected results:

    function reduce(key, vals) {
        var count = 0;
        vals.forEach(function(e) {
            count += e;
        });
        return count;
    }
    
    function finalize(key, val) {
        return val + 10;
    }
    
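The whole pipeline above can be sketched in plain JavaScript. This is only a simulation of Mongo's behavior (mapReduceSim is a made-up helper, not driver code), but it mimics the one detail that matters here: reduce is skipped for groups with a single value, while finalize runs for every key.

```javascript
// Simulation of MongoDB's map/reduce semantics (not real driver code).
function mapReduceSim(docs, map, reduce, finalize) {
    // 1. Map phase: call map on each document, collecting
    //    emitted values per key.
    var groups = {};
    docs.forEach(function (doc) {
        map.call(doc, function emit(key, value) {
            (groups[key] = groups[key] || []).push(value);
        });
    });

    // 2. Reduce phase: like Mongo, only call reduce when a key
    //    has more than one value to combine.
    var results = {};
    Object.keys(groups).forEach(function (key) {
        var vals = groups[key];
        results[key] = vals.length > 1 ? reduce(key, vals) : vals[0];
    });

    // 3. Finalize phase: always called once per key.
    Object.keys(results).forEach(function (key) {
        results[key] = finalize(key, results[key]);
    });
    return results;
}

var cars = [
    {firm: 'Porsche', model: '911 Boxster'},
    {firm: 'Porsche', model: 'Carrera GT'},
    {firm: 'BMW', model: 'M3'},
    {firm: 'BMW', model: 'X6'},
    {firm: 'Audi', model: 'Q5'}
];

// In Mongo, emit is a global inside map; here it's passed as an argument.
function map(emit) { emit(this.firm, 1); }

function reduce(key, vals) {
    var count = 0;
    vals.forEach(function (e) { count += e; });
    return count;
}

function finalize(key, val) { return val + 10; }

var result = mapReduceSim(cars, map, reduce, finalize);
// result: { Porsche: 12, BMW: 12, Audi: 11 }
```

In the mongo shell the real call would be db.cars.mapReduce(map, reduce, {finalize: finalize, out: {inline: 1}}), with map using the global emit.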

Conclusion

MongoDB is a great tool for its own purposes. I would recommend it for fast prototyping of apps, for logging and for caching. It's easy, fast and quite reliable. Try it sometime!