First, an admission: the comparison I did was neither thorough nor comprehensive, and provided only a slim reason to go one way or the other.
I was looking at a proposal for some work. It specified that the system would have 300 million records, and gave some information about the likely structure of those records. Looking at that number and the likely structure, I had two questions:
1. Will we have to worry about how PostgreSQL goes with those sorts of numbers, spending time sharding and distributing and generally making life harder than it is with a single postgres instance?
2. Would MongoDB make our lives better?
The structure of the data is up to 30,000 buckets, with up to 10,000 records in each bucket. The obvious relational thing is to have a buckets table and a records table, where the records table has a bucket_id and some other data. In this case, I used two decimal columns for the “some other data”, assuming that adding more columns would change the constant factors but not the basic performance characteristics. From what I know of the actual application, there’ll be another layer of indirection – something will have a bucket_id, and we’ll go from the something to the bucket to the records. Most operations will be either “append to bucket” or “get contents of bucket”, with the latter sometimes involving additional sorting/filtering.
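The relational arrangement can be sketched as DDL. This is my guess at a minimal version, not the actual schema used; the table and column names (`buckets`, `records`, `value_a`, `value_b`) and the index name are invented for illustration:

```ruby
# A hypothetical sketch of the schema described above. Names are my
# assumptions, not taken from the original experiment.
schema_sql = <<~SQL
  CREATE TABLE buckets (
    id serial PRIMARY KEY
  );

  CREATE TABLE records (
    id        bigserial PRIMARY KEY,
    bucket_id integer NOT NULL REFERENCES buckets(id),
    value_a   decimal,  -- the "some other data"
    value_b   decimal
  );

  -- The index that makes "get contents of bucket" a cheap lookup
  -- rather than a scan of 300 million rows:
  CREATE INDEX records_bucket_id_idx ON records (bucket_id);
SQL

puts schema_sql
```

The bucket_id index is the important part: both "append to bucket" and "get contents of bucket" key off that one column.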
My main interest is in looking ahead to a possible future where we are under deadline, and we have a performance problem, and we need the datastore to just do those append and get-contents operations as quickly as possible. If only one of them can be quick, I want it to be getting the contents of a bucket, because we can potentially queue appends.
So, I knocked up a little database schema for postgres, put appropriate indexes on it, and instructed the computer to invent 300,000,000 records, 10,000 for each of 30,000 buckets. An hour and a half later, I had them. I ran two tests, one for adding records to a bucket, and one for getting the contents of a bucket. They took 6 and 7 seconds of wall-clock time respectively (running the former 10,000 times and the latter 100 times).
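The shape of the two tests was roughly as follows. This is a reconstruction, not the original code: the SQL and the `execute` stand-in are my assumptions (in the real run, `execute` would be something like `ActiveRecord::Base.connection.execute` against a live postgres with the 300M rows loaded):

```ruby
require "benchmark"

# Stand-in for a real connection's execute method, so this sketch runs
# without a database. It just returns the SQL it was given.
execute = ->(sql) { sql }

# Hypothetical statements for the two operations under test.
append_sql = ->(bucket_id) {
  "INSERT INTO records (bucket_id, value_a, value_b) " \
  "VALUES (#{bucket_id}, random(), random())"
}
contents_sql = ->(bucket_id) { "SELECT * FROM records WHERE bucket_id = #{bucket_id}" }

# Append-to-bucket, run 10,000 times.
append_time = Benchmark.realtime do
  10_000.times { |i| execute.call(append_sql.call(i % 30_000)) }
end

# Get-contents-of-bucket, run 100 times.
contents_time = Benchmark.realtime do
  100.times { |i| execute.call(contents_sql.call(i % 30_000)) }
end

puts format("append: %.2fs, get-contents: %.2fs", append_time, contents_time)
```

Against the stand-in the timings are of course meaningless; the point is only the loop structure and the iteration counts.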
Then I produced a similar-ish thing in mongo. I created a collection, and to that collection I added an entry for each bucket. Each entry contained an index integer and a list of records. Trying approximately the same operations, append-to-bucket and retrieve-bucket, yielded 10 and 24 wall-clock seconds respectively.
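The document arrangement, one document per bucket with an embedded array of records, looks roughly like this. The field names (`bucket`, `records`, `a`, `b`) are my guesses, and the commented driver calls use the modern `mongo` gem's API, which may differ from whatever driver version was actually used:

```ruby
# Hypothetical sketch of the mongo arrangement: one document per bucket,
# holding the bucket's index and a growing array of records.
record = { "a" => rand, "b" => rand }

# Append-to-bucket as a $push onto the embedded array.
append_op = {
  filter: { "bucket" => 42 },
  update: { "$push" => { "records" => record } },
}

# With a live connection, the two operations would be roughly:
#   client[:buckets].update_one(append_op[:filter], append_op[:update])
#   client[:buckets].find("bucket" => 42).first   # get-contents
puts append_op.inspect
```

Note that with 10,000 records per bucket, each document ends up carrying the whole 10,000-element array, which is relevant to the performance discussion below.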
Mongo and Postgres both coped fine with the volume of data. I suspect mongo’s relatively slow performance is a result of the document size, and I’m curious whether arranging the data into smaller documents would work better. It’s all a bit irrelevant, though, because we have lots of experience building rails apps with postgres, and it looks like that’ll work fine at this scale, at least as far as the database is concerned.
Hey, caveats! I compared the mongo driver to ActiveRecord::Base.connection.execute; I didn’t compare mongo mapper to activerecord. The mongo document arrangement I chose may well have been suboptimal.
So, what did I learn from the experiment?
1. We can recommend postgres and expect not to get burned by that choice.
2. While mongo is shiny and interesting, it’s not obviously better for this project than postgres.