Currently we are using MongoDB 2.6.3 at VersionEye as our primary database. Almost 3 years ago I picked MongoDB for these reasons:
- Easy setup
- Easy replication
- Very fast
- Very good support from the Ruby Community
- Schemaless
So far it has done a good job over the past years. Unfortunately, we are now facing some issues with it. Currently it’s running on a 3-node replica set.
At VersionEye we had a very cool reference feature. It showed you on each page how many references a software package has, meaning how many other software packages are using the selected one as a dependency.

And you could even click on it and see the packages.

This feature was implemented with the MongoDB Aggregation Framework, which is a simpler version of Map & Reduce. In MongoDB we have a collection “dependencies” with more than 8 million entries. This collection describes the dependency relationships between software packages. To get all references to a software package we run this aggregation code:
deps = Dependency.collection.aggregate(
  { '$match' => { :language => language, :dep_prod_key => prod_key } }, # all dependencies pointing to the selected package
  { '$group' => { :_id => '$prod_key' } },                              # one entry per referencing package
  { '$skip'  => skip },                                                 # pagination offset
  { '$limit' => per_page }                                              # page size
)
First we match all the dependencies which link to the selected software package, then we group them by prod_key, because we want each prod_key to appear only once in the result list. In SQL that would be a “distinct prod_key”.
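For comparison, the same distinct list could also be fetched with a plain Mongoid distinct query. A sketch, assuming the Dependency model from above; note that MongoDB’s distinct command has no server-side skip/limit, so the pagination would have to happen in Ruby:

# Sketch: the “distinct prod_key” as a Mongoid query.
# distinct returns the complete value list, so we paginate in memory.
prod_keys = Dependency.where(language: language, dep_prod_key: prod_key)
                      .distinct(:prod_key)
page = prod_keys.drop(skip).take(per_page)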
So far so good. At the time we launched the feature we had something like 4 or 5 million entries in the collection and the aggregation was fast enough for the web app. But now, with 8 million entries, the aggregation queries take quite some time. Sometimes several minutes. Far too slow to be part of an HTTP request-response roundtrip. And it slowed down every node in the replica set. The nodes were permanently running at ~60% CPU.
Oh, and yes, there are indexes on the collection 😉 Here they are:
index({ language: 1, prod_key: 1, prod_version: 1, name: 1, version: 1, dep_prod_key: 1 }, { name: "very_long_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1, scope: 1 }, { name: "prod_key_lang_ver_scope_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1 }, { name: "prod_key_lang_ver_index", background: true })
index({ language: 1, prod_key: 1 }, { name: "language_prod_key_index", background: true })
index({ language: 1, dep_prod_key: 1 }, { name: "language_dep_prod_key_index", background: true })
index({ dep_prod_key: 1 }, { name: "dep_prod_key_index", background: true })
index({ prod_key: 1 }, { name: "prod_key_index", background: true })
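To see which of these indexes the aggregation actually picks, MongoDB 2.6 accepts an explain flag on the aggregate command. A sketch, assuming Mongoid with Moped; the output should show the $match stage scanning language_dep_prod_key_index:

# Sketch: run the aggregation with explain to inspect index usage.
result = Mongoid.default_session.command(
  aggregate: 'dependencies',
  pipeline:  [
    { '$match' => { 'language' => language, 'dep_prod_key' => prod_key } },
    { '$group' => { '_id' => '$prod_key' } }
  ],
  explain: true
)
puts result.inspect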
Sometimes it was so slow that the whole web app stopped responding and I had to restart the MongoDB nodes.
Finally I turned off the reference feature. And look what happened to the replica set.

The load went down to ~5% CPU. Wow. VersionEye is fast again 🙂
Now we need another solution for the reference feature. Calculating the references for 400K software libraries in the background would be very resource-intensive. I would like to avoid that.
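For illustration, this is roughly what such a batch job would look like: a $group pass over the whole 8-million-entry collection, writing the counts into a cache collection via the $out stage that came with MongoDB 2.6 (just a sketch; the reference_counts collection name is made up):

# Sketch: precompute distinct reference counts for all packages in one pass.
Dependency.collection.aggregate(
  # one row per (referenced package, referencing package) pair
  { '$group' => { '_id' => { 'language'     => '$language',
                             'dep_prod_key' => '$dep_prod_key',
                             'prod_key'     => '$prod_key' } } },
  # count the distinct referencing packages per referenced package
  { '$group' => { '_id'   => { 'language'     => '$_id.language',
                               'dep_prod_key' => '$_id.dep_prod_key' },
                  'count' => { '$sum' => 1 } } },
  # $out writes the result straight into the cache collection
  { '$out' => 'reference_counts' }
)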
My first thought was to implement the feature with ElasticSearch, using its facets. That would make sense because we already use ES for search. I wrote a mapping for the dependencies collection and started to index the dependencies. That was this morning German time, 12 hours ago. The indexing process is still running.
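For the record, the ES query I have in mind would filter on dep_prod_key and run a terms facet over prod_key. A sketch of the request body as a Ruby hash, assuming the mapping mirrors the dependencies collection (facets are the ES 1.x API):

# Sketch: count distinct referencing packages with a terms facet.
es_query = {
  query: {
    filtered: {
      filter: { term: { dep_prod_key: prod_key } }
    }
  },
  size: 0, # we only need the facet counts, not the hits
  facets: {
    references: {
      terms: { field: 'prod_key', size: per_page }
    }
  }
}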
Another solution could be Neo4J.
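In a graph database every package would be a node and every dependency a relationship, so the reference query becomes a simple pattern match. A sketch in Cypher, sent via the neography gem; the Package label and the DEPENDS_ON relationship type are made up for illustration:

require 'neography'

# Sketch: the reference query as a Cypher pattern match.
neo = Neography::Rest.new
cypher = <<-CYPHER
  MATCH (p:Package)-[:DEPENDS_ON]->(dep:Package {prod_key: {prod_key}})
  RETURN DISTINCT p.prod_key
  SKIP {skip} LIMIT {limit}
CYPHER
refs = neo.execute_query(cypher, prod_key: prod_key, skip: skip, limit: per_page)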
If you have any suggestions, please let me know. Any input is welcome.