MongoDB Aggregation slows down server

Currently we are using MongoDB 2.6.3 at VersionEye as primary database. Almost 3 years ago I picked MongoDB because of this reasons:

  • Easy setup
  • Easy replication
  • Very fast
  • Very good support from the Ruby Community
  • Schemaless

So far it did a good job in the past years. Unfortunately we are facing some issues with it now. Currently it’s running on a 3 node replica set.

At VersionEye we had a very cool reference feature. That showed you on each page how many references a software package has. Mean how many other software packages are using the selected one as a dependency.

Screen Shot 2014-07-16 at 20.57.27

And you could even click on it and see the packages.

Screen Shot 2014-07-16 at 20.58.19

This feature was implemented with MongoDB Aggregation Framework. Which is a simple version of Map & Reduce. In MongoDB we have a collection “dependencies” with more than 8 Million entries. This collection describes the dependency relationship between software packages. To get all references to a software package we run this aggregation code.

deps = Dependency.collection.aggregate(
  { '$match' => { :language => language, :dep_prod_key => prod_key } },
  { '$group' => { :_id => '$prod_key' } },
  { '$skip' => skip },
  { '$limit' => per_page }
)

At first we match all the dependencies which link to the selected software package and then we group them by prod_key, because in the result list we want to have each prod_key only once. In SQL that would be a “distinct prod_key”.

So far so good. At the time we launched that feature we had something like 4 or 5 Million entries in the collection and the aggregation worked fast enough for the web app. But right now with 8 Million entries the aggregation queries take quiet some time. Sometimes several minutes. Far to slow to be part of a HTTP request – response roundtrip. And it slowed down each node in the replica set. The nodes have been running permanently on ~ 60% CPU.

Oh. And yes. There are indexes for the collection πŸ˜‰ These are the indexes for the collection.

index({ language: 1, prod_key: 1, prod_version: 1, name: 1, version: 1, dep_prod_key: 1}, { name: "very_long_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1, scope: 1 }, { name: "prod_key_lang_ver_scope_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1 }, { name: "prod_key_lang_ver_index", background: true })
index({ language: 1, prod_key: 1 }, { name: "language_prod_key_index" , background: true })
index({ language: 1, dep_prod_key: 1 }, { name: "language_dep_prod_key_index" , background: true })
index({ dep_prod_key: 1 }, { name: "dep_prod_key_index" , background: true })
index({ prod_key: 1 }, { name: "prod_key_index" , background: true })

Sometimes it was so slow that the whole web app was not reacting and I had to restart the MongoDB nodes.

Finally I turned off the reference feature. And look what happened to the replica set.

Screen Shot 2014-07-16 at 20.06.54

The load went down to ~ 5% CPU. Wow. VersionEye is fast again πŸ™‚

Now we need another solution for the reference feature. Calculating the references for 400K software libraries in the background would be very intensive. I would like to avoid that.

My first thought was to implement that feature with ElasticSearch, with their facet feature. That would make sense because we use ES already for the search. I wrote a mapping for the dependency collection and started to index the dependencies. That was this morning German time, 12 hours ago. The indexing process is still running :-/

Another solution could be Neo4J.

If you have any suggestions, please let me know. Any input is welcome.

12 thoughts on “MongoDB Aggregation slows down server

  1. As I found out on stackoverflow the group function in an Aggregation can not take advantage of the indexes. That means it must read the whole collection. That would explain why it’s so slow now.

  2. The SO question you link isn’t related because aggregation framework doesn’t have anything to do with MapReduce. I’m not quite certain why you need to use agg – wouldn’t you get the same result from db.collection.distinct(“prod_key”, {language:lang, dep_prod_key:key}) ?

    You can see if a query is using an index by running it in the shell with explain () attached to find (easiest to do in the shell)

  3. Here is another suggestion, as with 8 million total records, you should be able to look up all dependencies for all of them and “cache” them in less than a few minutes! Try this aggregation:

    db.collection.aggregation([ { $sort: { language: 1, dep_prod_key: 1 } },
    { $group : { _id: { lang:”$language”, prodkey:”$dep_prod_key”}, dependencies:{$addToSet:”$prod_key”} } }, {$out : “dependencies”});

    Now querying with db.dependencies.find({_id: { lang: , prodkey: }}) will instantaneously return you the dependencies for that package.

    1. Sven: Which limitation are you referring to? None of the things mentioned in that lesson (limitations of aggregation framework) apply to anything that I suggestion – $out will send results to a new collection so you don’t have to worry about how big the result is. If you are replying to top level post, I don’t see anything applicable to them either. Maybe you can expand on which limitation you were thinking of for this use case.

  4. Instead of aggregating the same data over and over, wouldn’t it be better to add it to your documents?
    Every package’s document could have a array field “referencing” with objectids of documents that reference the package. You could $push new references every time a new package referencing the current package is added.

    1. In the mean while it works without the aggregation function πŸ˜‰ I know what you mean but its not only the pure reference number which has to be updated every day. It’s also the list of IDs to the referencing components. If all that would be part of the “product” document, some documents would be huge and slow down the loading of detail pages. That’s the reason why I outsourced it into a separate collection.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s