Currently we are using MongoDB 2.6.3 as the primary database at VersionEye. Almost 3 years ago I picked MongoDB for these reasons:
- Easy setup
- Easy replication
- Very fast
- Very good support from the Ruby Community
- Schemaless
It has done a good job over the past years. Unfortunately, we are facing some issues with it now. Currently it's running on a 3-node replica set.
At VersionEye we had a very cool reference feature that showed you on each page how many references a software package has, meaning how many other software packages are using the selected one as a dependency. You could even click on it and see the packages.
This feature was implemented with the MongoDB Aggregation Framework, which is a simpler version of Map & Reduce. In MongoDB we have a collection "dependencies" with more than 8 million entries, which describes the dependency relationships between software packages. To get all references to a software package we run this aggregation code:
deps = Dependency.collection.aggregate(
  { '$match' => { :language => language, :dep_prod_key => prod_key } },
  { '$group' => { :_id => '$prod_key' } },
  { '$skip'  => skip },
  { '$limit' => per_page }
)
First we match all the dependencies which link to the selected software package, and then we group them by prod_key, because in the result list we want each prod_key only once. In SQL that would be a "DISTINCT prod_key".
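For experimenting outside of Ruby, the same pipeline can be run directly in the mongo shell. This is just a sketch; the "Ruby"/"rails" values and the skip/limit numbers are placeholders:

// same pipeline as the Ruby code above, in the mongo shell
db.dependencies.aggregate([
  { $match: { language: "Ruby", dep_prod_key: "rails" } },  // placeholder values
  { $group: { _id: "$prod_key" } },
  { $skip: 0 },
  { $limit: 30 }
])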
So far so good. At the time we launched that feature we had something like 4 or 5 million entries in the collection and the aggregation worked fast enough for the web app. But right now, with 8 million entries, the aggregation queries take quite some time. Sometimes several minutes. Far too slow to be part of an HTTP request-response round trip. And it slowed down each node in the replica set. The nodes were permanently running at ~60% CPU.
Oh, and yes, there are indexes on the collection :) These are the indexes:
index({ language: 1, prod_key: 1, prod_version: 1, name: 1, version: 1, dep_prod_key: 1 }, { name: "very_long_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1, scope: 1 }, { name: "prod_key_lang_ver_scope_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1 }, { name: "prod_key_lang_ver_index", background: true })
index({ language: 1, prod_key: 1 }, { name: "language_prod_key_index", background: true })
index({ language: 1, dep_prod_key: 1 }, { name: "language_dep_prod_key_index", background: true })
index({ dep_prod_key: 1 }, { name: "dep_prod_key_index", background: true })
index({ prod_key: 1 }, { name: "prod_key_index", background: true })
Sometimes it was so slow that the whole web app stopped responding and I had to restart the MongoDB nodes.
Finally I turned off the reference feature. And look what happened to the replica set.
The load went down to ~5% CPU. Wow. VersionEye is fast again :)
Now we need another solution for the reference feature. Calculating the references for 400K software libraries in the background would be very resource-intensive. I would like to avoid that.
My first thought was to implement the feature with ElasticSearch and its facet feature. That would make sense because we already use ES for search. I wrote a mapping for the dependency collection and started to index the dependencies. That was this morning German time, 12 hours ago. The indexing process is still running.
Another solution could be Neo4J.
If you have any suggestions, please let me know. Any input is welcome.
As I found out on Stack Overflow, the $group stage in an aggregation cannot take advantage of indexes. That means it must read the whole collection, which would explain why it's so slow now.
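On MongoDB 2.6 you can check this yourself by passing the explain option to aggregate(). A sketch with placeholder values:

// only the $match stage can be backed by an index;
// $group processes whatever the previous stages emit
db.dependencies.aggregate(
  [ { $match: { language: "Ruby", dep_prod_key: "rails" } },  // placeholder values
    { $group: { _id: "$prod_key" } } ],
  { explain: true }
)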
This link helps to understand it: http://stackoverflow.com/questions/3947889/mongodb-terrible-mapreduce-performance/3951871#3951871
Is it possible to get a dummy data set of the same size? I would like to try it with http://www.arangodb.org
Sure. I will prepare some dummy data of the same size: 8 million records.
The SO question you link isn't related, because the aggregation framework doesn't have anything to do with MapReduce. I'm not quite certain why you need to use aggregation at all – wouldn't you get the same result from db.collection.distinct("prod_key", {language: lang, dep_prod_key: key})?
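A minimal sketch of that suggestion in the mongo shell, using the collection and field names from the post ("Ruby"/"rails" are placeholder values):

// returns an array of distinct prod_key values that depend on the given package
db.dependencies.distinct("prod_key", { language: "Ruby", dep_prod_key: "rails" })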
You can see whether a query is using an index by running it with explain() attached to the find() (easiest to do in the shell).
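For example (a sketch with placeholder values):

// on MongoDB 2.6 the "cursor" field of the explain output shows
// "BtreeCursor <index name>" if an index is used, "BasicCursor" otherwise
db.dependencies.find({ language: "Ruby", dep_prod_key: "rails" }).explain()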
Here is another suggestion: with 8 million total records, you should be able to look up the dependencies for all of them and "cache" them in less than a few minutes! Try this aggregation:
db.collection.aggregate([
  { $sort: { language: 1, dep_prod_key: 1 } },
  { $group: { _id: { lang: "$language", prodkey: "$dep_prod_key" }, dependencies: { $addToSet: "$prod_key" } } },
  { $out: "dependencies" }
])
Now querying with db.dependencies.find({ _id: { lang: <language>, prodkey: <dep_prod_key> } }) will instantly return the dependencies for that package.
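(One caveat worth flagging: $out replaces the target collection, so in practice the results should probably go to a collection with a different name than the source "dependencies" collection from the post, e.g. a new "references" collection, to avoid overwriting the original data.)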
Are you using a replica set? You may run into this: https://www.youtube.com/watch?v=U_gRSxEq3c0
Sven: Which limitation are you referring to? None of the things mentioned in that lesson (limitations of the aggregation framework) apply to anything that I suggested – $out sends the results to a new collection, so you don't have to worry about how big the result is. If you are replying to the top-level post, I don't see anything applicable there either. Maybe you can expand on which limitation you were thinking of for this use case.
Are you using a sharded system? You may run into this limitation of the aggregation framework: https://www.youtube.com/watch?v=U_gRSxEq3c0&feature=youtu.be&t=1m20s
Instead of aggregating the same data over and over, wouldn’t it be better to add it to your documents?
Every package’s document could have an array field "referencing" with the ObjectIds of documents that reference the package. You could $push a new reference every time a new package referencing the current package is added.
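A rough sketch of that idea in the mongo shell; the "products" collection name and the ObjectId value are assumptions for illustration, not from the post:

// when a new package that depends on "rails" is saved,
// push its id onto the referenced package's "referencing" array
var refId = ObjectId("507f1f77bcf86cd799439011");  // id of the new referencing package (placeholder)
db.products.update(
  { language: "Ruby", prod_key: "rails" },         // "products" is a hypothetical collection name
  { $push: { referencing: refId } }
)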
In the meantime it works without the aggregation function :) I know what you mean, but it's not only the pure reference count which has to be updated every day. It's also the list of IDs of the referencing components. If all that were part of the "product" document, some documents would be huge and would slow down the loading of the detail pages. That's the reason why I moved it out into a separate collection.