Another lesson learned with MongoDB

MongoDB is the primary database at VersionEye. Currently VersionEye is crawling more than 600K open source projects on a daily basis. Some of the crawlers are implemented in Java, others in Ruby. You can follow a library at VersionEye, and as soon as the next version comes out you get an email notification. Today I got this email from VersionEye.

[Screenshot: email notification from VersionEye, 2015-06-25]

As you can see, the version information is missing for the Java libraries. The email template was not touched in the last couple of days. The crawlers for Maven repositories are implemented in Java 🙂 and they get updated more frequently. So the error must be somewhere in the Java crawlers.

The version object is an embedded document in the product object. Every time a crawler finds a new version, it adds a new version object to the corresponding product object. The code for that looks like this.

BasicDBObject productMatch = new BasicDBObject();
productMatch.put(Product.LANGUAGE, language);
productMatch.put(Product.PROD_KEY, prodKey);

BasicDBObject versionObj = version.getDBObject();
versionObj.put(Version.VERSION, version.getVersion());

BasicDBObject versionsUpdate = new BasicDBObject();
versionsUpdate.put("$push", new BasicDBObject(Version.VERSIONS, versionObj));

getCollection().update(productMatch, versionsUpdate); // execute the $push

So far so good. In the next lines the product object is updated with the current time.

DBObject verUpdate = getDBObjectByKey(language, prodKey);
verUpdate.put(Product.UPDATED_AT, new Date());
getCollection().update(productMatch, verUpdate);

And of course there is a unit test for this code, and it is always green. But on production, sometimes the new version just disappears. Not always! Just sometimes. At first I thought I had found a bug in MongoDB, but this only happened with the Java crawlers, never with the Ruby crawlers. So the root of all evil had to be the implementation. I needed a whole day to figure it out!

On production, MongoDB is running as a Replica Set on multiple hosts, and 2 days ago I changed the read preference of the MongoDB driver to "secondary". That means read operations are distributed across ALL nodes in the Replica Set. And this is what happened.

The first code snippet always runs through and adds the new version to the product. But then the 2nd code snippet reloads the product object from the database and executes an update.

DBObject verUpdate = getDBObjectByKey(language, prodKey);
verUpdate.put(Product.UPDATED_AT, new Date());
getCollection().update(productMatch, verUpdate);

If the changes have not yet been replicated across the whole Replica Set and the read operation goes to a node which doesn't have the new version yet, the product object is loaded without the new version. On this object the "updated_at" field is set and the object is stored back to the database. But the "update" method of the Java driver doesn't update only the changed field, it replaces the whole document. And so it ends up storing the object without the new version.
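The race is easier to see in a plain-Ruby sketch of the same sequence, using two in-memory hashes as stand-ins for the primary and a lagging secondary (package name and versions are made up, no MongoDB involved):

```ruby
# Two copies of the same product document: the primary, and a secondary
# that lags behind replication.
primary   = { "prod_key" => "junit", "versions" => ["4.11"] }
secondary = { "prod_key" => "junit", "versions" => ["4.11"] }

# Step 1: the crawler $push-es a new version. The write hits the primary,
# but has not been replicated to the secondary yet.
primary["versions"] << "4.12"

# Step 2: with read preference "secondary" the follow-up read may return
# the stale document, and a whole-document update replaces the primary.
stale = Marshal.load(Marshal.dump(secondary))  # deep copy, like a driver read
stale["updated_at"] = Time.now
primary = stale                                # full-document replace
raise unless primary["versions"] == ["4.11"]   # the new version is gone!

# With a $set-style update only the named field is touched, so the
# pushed version survives even if the read was stale.
primary = { "prod_key" => "junit", "versions" => ["4.11", "4.12"] }
primary.merge!("updated_at" => Time.now)       # like { "$set" => { ... } }
raise unless primary["versions"] == ["4.11", "4.12"]
```

The sketch only mirrors the semantics, but it shows why the bug appears intermittently: it depends on which node serves the read.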

There are different solutions to this. First of all, I could change the read preference back to "primary" again. But there is a better solution: MongoDB offers a way to update only single properties of a document. That works like this.

DBObject newValues = getDBObjectByKey(language, prodKey);
newValues.put(Product.UPDATED_AT, new Date());
BasicDBObject set = new BasicDBObject("$set", newValues);
getCollection().update(productMatch, set);

The big difference is the third line, the "$set" operator: it tells the Java driver to update only the given properties instead of replacing the whole document. One day of headache for a one-liner! I hope I can save somebody else a day of headache with this blog post.

MongoDB lessons learned

MongoDB is currently the primary database at VersionEye. In the last couple of weeks we had some performance and scaling issues, which unfortunately caused some downtime. Here are the lessons from the last 3 weeks.


The Ruby code at VersionEye uses the Mongoid driver to access MongoDB. All in all, Mongoid is a great piece of open source software, with a very active community that offers great support.

In our case Mongoid somehow didn't close the opened connections. With each HTTP request a new connection to MongoDB is created. Once the HTTP response is generated, the connection can be closed. Unfortunately that didn't happen automatically. So the open connections piled up on the MongoDB Replica Set and the application became slower and slower over time. After a restart of the Replica Set the game started from 0 again and the application was fast again. At least for a couple of hours, until the open connections summed up into the hundreds again.

For now that's fixed with this filter in the ApplicationController.

  after_filter :disconnect_from_mongo

  def disconnect_from_mongo
    Mongoid.default_session.disconnect
  rescue => e
    p e.message
    Rails.logger.error e.message
    Rails.logger.error e.backtrace.join "\n"
  end
Still not sure if this is a bug in Mongoid or a misconfiguration on our side.

MongoDB Aggregation Framework

We have a cool feature at VersionEye which shows the references of a software package. These are the references of the Rails framework, for example.

[Screenshot: references of the Rails framework on VersionEye]

This feature shows you which other software libraries are using the selected software library as a dependency. Usually many references are a good sign of quality software.

In the beginning this feature was implemented with the Aggregation Framework of MongoDB and it was fast enough. This is the aggregation code snippet we used for this feature.

deps = Dependency.collection.aggregate(
  { '$match' => { :language => language, :dep_prod_key => prod_key } },
  { '$group' => { :_id => '$prod_key' } },
  { '$skip' => skip },
  { '$limit' => per_page }
)

At the time this was implemented we had fewer than 4 million dependency records in the collection. Over time the collection grew. Right now there are more than 9 million records in the collection and the aggregation code snippet above is just terribly slow. And it slows down everything else, too. If multiple HTTP requests trigger this code, the whole database gets super slow! I already wrote a blog post about that here.

One thing I learned is that, at least in our case, the Aggregation Framework didn't take advantage of the indexes. The same is true for the Map & Reduce feature in MongoDB. Originally Map & Reduce was created to crunch data in parallel, super fast. On MongoDB, Map & Reduce runs on a single thread, without indexes :-/

Wrong Indexes

Instead of calculating the references in real time with MongoDB's Aggregation Framework, we wanted to pre-calculate the references with a simple query. This one:

prod_keys = Dependency.where(:language => product.language, :dep_prod_key => product.prod_key).distinct(:prod_key)

The advantage of this distinct query over the Aggregation Framework is that it can take advantage of indexes. And for exactly that query there is an index!

index({ language: 1, dep_prod_key: 1 }, { name: "language_dep_prod_key_index" , background: true })

On localhost the query was running quite fast. Still too slow for real time, but fast enough to pre-calculate all values overnight. On production it was running super slow! Each query needed 17 seconds. At that rate, calculating the references for all 400K software libraries in our database would take 78 days.
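That estimate is simple arithmetic; a quick sanity check:

```ruby
# Back-of-the-envelope check: 400K libraries, one 17-second query each.
queries         = 400_000
seconds_each    = 17
seconds_per_day = 86_400.0
days = queries * seconds_each / seconds_per_day
raise unless days.floor == 78  # roughly 78 days
```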

Finally Asya gave the right hint. She recommended double-checking the query in the mongo console with ".explain()", to see which index is used. And indeed, MongoDB was using the wrong index on production! Only God and the core committers know why. For me that's a bug!

This is what happens if you run a couple of distinct queries which use the wrong index.

[Screenshot: server load during the distinct queries with the wrong index, 2014-07-19]

I deleted 5 indexes on the collection until MongoDB had no other choice than to use the right index! And now it's running fast enough. Finally!


Here are the conclusions for working with MongoDB:

  • Regularly check the logs on the MongoDB Replica Set to spot odd behavior.
  • Close open connections.
  • Avoid the Aggregation Framework if you can do the same with a simple query.
  • Ensure that MongoDB is using the right index for your query.

So far so good.

MongoDB Aggregation slows down server

Currently we are using MongoDB 2.6.3 as the primary database at VersionEye. Almost 3 years ago I picked MongoDB for these reasons:

  • Easy setup
  • Easy replication
  • Very fast
  • Very good support from the Ruby Community
  • Schemaless

It has done a good job over the past years. Unfortunately we are facing some issues with it now. Currently it's running as a 3-node replica set.

At VersionEye we had a very cool reference feature. It showed you on each page how many references a software package has, meaning how many other software packages are using the selected one as a dependency.

[Screenshot: reference counter on a package page]

And you could even click on it and see the packages.

[Screenshot: list of referencing packages]

This feature was implemented with MongoDB's Aggregation Framework, which is a simple version of Map & Reduce. In MongoDB we have a collection "dependencies" with more than 8 million entries. This collection describes the dependency relationships between software packages. To get all references to a software package we run this aggregation code.

deps = Dependency.collection.aggregate(
  { '$match' => { :language => language, :dep_prod_key => prod_key } },
  { '$group' => { :_id => '$prod_key' } },
  { '$skip' => skip },
  { '$limit' => per_page }
)

At first we match all the dependencies which link to the selected software package, and then we group them by prod_key, because in the result list we want each prod_key only once. In SQL that would be a "distinct prod_key".
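Just for illustration, the semantics of those stages can be mimicked in plain Ruby on a small in-memory list of dependency documents (field names as in the collection, all values made up):

```ruby
# Hypothetical dependency documents: each says "prod_key depends on dep_prod_key".
deps = [
  { language: "Ruby", prod_key: "rails_admin", dep_prod_key: "rails" },
  { language: "Ruby", prod_key: "devise",      dep_prod_key: "rails" },
  { language: "Ruby", prod_key: "rails_admin", dep_prod_key: "rails" }, # other version
  { language: "Ruby", prod_key: "nokogiri",    dep_prod_key: "libxml" },
]
skip, per_page = 0, 30

refs = deps
  .select { |d| d[:language] == "Ruby" && d[:dep_prod_key] == "rails" } # $match
  .map    { |d| d[:prod_key] }                                          # group key
  .uniq                                                                 # $group
  .drop(skip)                                                           # $skip
  .take(per_page)                                                       # $limit

raise unless refs == ["rails_admin", "devise"]
```

The real pipeline of course runs inside the database; the sketch only mirrors what each stage contributes.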

So far so good. At the time we launched that feature we had something like 4 or 5 million entries in the collection and the aggregation was fast enough for the web app. But right now, with 8 million entries, the aggregation queries take quite some time. Sometimes several minutes. Far too slow to be part of an HTTP request/response roundtrip. And it slowed down each node in the replica set. The nodes were permanently running at ~ 60% CPU.

Oh, and yes, there are indexes 😉 These are the indexes on the collection.

index({ language: 1, prod_key: 1, prod_version: 1, name: 1, version: 1, dep_prod_key: 1}, { name: "very_long_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1, scope: 1 }, { name: "prod_key_lang_ver_scope_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1 }, { name: "prod_key_lang_ver_index", background: true })
index({ language: 1, prod_key: 1 }, { name: "language_prod_key_index" , background: true })
index({ language: 1, dep_prod_key: 1 }, { name: "language_dep_prod_key_index" , background: true })
index({ dep_prod_key: 1 }, { name: "dep_prod_key_index" , background: true })
index({ prod_key: 1 }, { name: "prod_key_index" , background: true })

Sometimes it was so slow that the whole web app stopped responding and I had to restart the MongoDB nodes.

Finally I turned off the reference feature. And look what happened to the replica set.

[Screenshot: replica set CPU load dropping after the feature was turned off]

The load went down to ~ 5% CPU. Wow. VersionEye is fast again 🙂

Now we need another solution for the reference feature. Pre-calculating the references for all 400K software libraries in the background would be very resource-intensive. I would like to avoid that.

My first thought was to implement that feature with ElasticSearch and its facet feature. That would make sense because we already use ES for the search. I wrote a mapping for the dependency collection and started to index the dependencies. That was this morning German time, 12 hours ago. The indexing process is still running :-/

Another solution could be Neo4J.

If you have any suggestions, please let me know. Any input is welcome.

Sorry to MongoDB

About 2 weeks ago I updated VersionEye from MongoDB 2.4.9 to 2.6.3. At first I updated everything on localhost and executed a couple of test suites. Then I tested manually. Everything was green, so I did the upgrade on production. After that we faced some downtime. The application would run smoothly for a couple of hours and then suddenly become incredibly slow, until it didn't react anymore. At that point I had to reboot the Mongo cluster or the app servers. And that happened a couple of times.

I was reading logs for hours and searching for answers on Stack Overflow. And of course I was blaming MongoDB on Twitter. Well, I'm sorry for that. It was my fault!

Here is what happened. VersionEye is running on a 3-node replica set. Somehow, late at night, I made a copy & paste mistake. In the mongoid.yml all MongoDB hosts have to be listed, but 2 of the 3 entries pointed to the same host. Somehow that worked out for a couple of hours. But when one of the hosts became unavailable, the Mongoid driver tried to find a new master, and because the third host was not listed at all, everything blocked.
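For illustration, a mongoid.yml for a 3-node replica set should list three distinct hosts. A sketch in the Mongoid 3/4 format (hostnames are made up; the exact layout depends on the Mongoid version):

```yaml
production:
  sessions:
    default:
      database: versioneye
      hosts:
        - mongo1.example.com:27017
        - mongo2.example.com:27017
        - mongo3.example.com:27017   # in my case this was a duplicate of mongo1
```

With a duplicate entry the driver only knows about 2 of the 3 replica set members, so a failover to the unlisted node leaves it unable to find a new master.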

I’m sorry for that!


PHP Master – MongoDB Indexing


A proper understanding of indexing is important since it can improve the performance and throughput of your application. This series on MongoDB indexing concludes with a look at a few small but important concepts, like indexing on sub-documents and embedded fields, covered queries, and index …

