MongoDB Aggregation slows down server

Currently we are using MongoDB 2.6.3 as the primary database at VersionEye. Almost 3 years ago I picked MongoDB for these reasons:

  • Easy setup
  • Easy replication
  • Very fast
  • Very good support from the Ruby Community
  • Schemaless

It has done a good job over the past years. Unfortunately, we are facing some issues with it now. Currently it's running on a 3-node replica set.

At VersionEye we had a very cool reference feature. It showed you on each package page how many references a software package has, meaning how many other software packages are using the selected one as a dependency.

[Screenshot: reference count displayed on a package page]

And you could even click on it and see the packages.

[Screenshot: the list of referencing packages]

This feature was implemented with the MongoDB Aggregation Framework, which is a simple version of Map & Reduce. In MongoDB we have a collection "dependencies" with more than 8 million entries, which describes the dependency relationships between software packages. To get all references to a software package we run this aggregation code:

deps = Dependency.collection.aggregate(
  # match all dependency rows that point to the selected package
  { '$match' => { :language => language, :dep_prod_key => prod_key } },
  # group by prod_key so that each referencing package appears only once
  { '$group' => { :_id => '$prod_key' } },
  # paginate the grouped result
  { '$skip'  => skip },
  { '$limit' => per_page }
)

First we match all the dependencies which link to the selected software package, and then we group them by prod_key, because we want each prod_key to appear only once in the result list. In SQL that would be a "DISTINCT prod_key".
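Conceptually, the $group stage here is nothing more than a distinct. For illustration, the same lookup could also be expressed with Mongoid's distinct method, with the pagination done in Ruby afterwards. A minimal sketch, not what runs in production:

# Distinct prod_keys of all packages that depend on the selected one.
# distinct returns a plain Ruby array, so pagination happens in memory
# here - acceptable for moderate result sizes.
keys = Dependency.where(:language => language, :dep_prod_key => prod_key).distinct(:prod_key)
page = keys.drop(skip).take(per_page)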

So far so good. At the time we launched that feature we had something like 4 or 5 million entries in the collection, and the aggregation worked fast enough for the web app. But right now, with 8 million entries, the aggregation queries take quite some time. Sometimes several minutes. Far too slow to be part of an HTTP request-response roundtrip. And it slowed down each node in the replica set. The nodes have been running permanently at ~60% CPU.

Oh, and yes, there are indexes on the collection 😉 These are the indexes:

index({ language: 1, prod_key: 1, prod_version: 1, name: 1, version: 1, dep_prod_key: 1}, { name: "very_long_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1, scope: 1 }, { name: "prod_key_lang_ver_scope_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1 }, { name: "prod_key_lang_ver_index", background: true })
index({ language: 1, prod_key: 1 }, { name: "language_prod_key_index" , background: true })
index({ language: 1, dep_prod_key: 1 }, { name: "language_dep_prod_key_index" , background: true })
index({ dep_prod_key: 1 }, { name: "dep_prod_key_index" , background: true })
index({ prod_key: 1 }, { name: "prod_key_index" , background: true })
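In theory the $match stage should be covered by the language_dep_prod_key_index. MongoDB 2.6 can explain an aggregation pipeline, which is one way to verify that. A rough sketch, assuming Mongoid 3/4, where Mongoid.default_session returns the underlying Moped session:

# Ask MongoDB to explain the pipeline instead of executing it,
# to see which index the $match stage actually uses.
explanation = Mongoid.default_session.command(
  :aggregate => 'dependencies',
  :pipeline  => [
    { '$match' => { :language => language, :dep_prod_key => prod_key } },
    { '$group' => { :_id => '$prod_key' } }
  ],
  :explain   => true
)
puts explanation['stages'].inspect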

Sometimes it was so slow that the whole web app stopped responding and I had to restart the MongoDB nodes.

Finally I turned off the reference feature. And look what happened to the replica set.

[Screenshot: replica set CPU load after turning off the reference feature]

The load went down to ~5% CPU. Wow. VersionEye is fast again 🙂

Now we need another solution for the reference feature. Calculating the references for 400K software libraries in the background would be very intensive. I would like to avoid that.
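Just to illustrate why: such a background job would have to iterate over all packages and fire one distinct query per package against the 8 million dependency rows. Roughly like this, where Product and reference_count are hypothetical names for this sketch:

# Hypothetical nightly job: walk over all ~400K packages and store a
# precomputed reference count on each one. That is ~400K distinct
# queries against an 8M-entry collection - exactly the load I'd like to avoid.
Product.all.each do |product|
  count = Dependency.where(:language => product.language, :dep_prod_key => product.prod_key).distinct(:prod_key).size
  product.update_attribute(:reference_count, count)
end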

My first thought was to implement that feature with ElasticSearch and its facet feature. That would make sense because we already use ES for the search. I wrote a mapping for the dependency collection and started to index the dependencies. That was this morning German time, 12 hours ago. The indexing process is still running :-/
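The plan would be to let a terms facet do the grouping on the ES side. A minimal sketch of such a query with the elasticsearch-ruby gem, assuming a "dependencies" index whose documents carry language, prod_key and dep_prod_key fields (the field names are assumptions, not the final mapping):

require 'elasticsearch'

client = Elasticsearch::Client.new

# Filter the dependencies that point to the selected package and let a
# terms facet collect the distinct prod_keys of the referencing packages.
response = client.search(
  index: 'dependencies',
  body: {
    query: {
      bool: {
        must: [
          { term: { language: language } },
          { term: { dep_prod_key: prod_key } }
        ]
      }
    },
    facets: {
      references: { terms: { field: 'prod_key', size: per_page } }
    },
    size: 0
  }
)
references = response['facets']['references']['terms']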

Another solution could be Neo4J.

If you have any suggestions, please let me know. Any input is welcome.

Geek2Geek NoSQL

Last Friday we met again for the Geek2Geek event in Berlin. The theme this time was NoSQL. 40 people RSVPed on MeetUp.com and 42 actually joined the event. Not bad for a Friday night 🙂

Jan Lehnardt gave the first talk, about CouchDB, a database that uses JSON for documents, JavaScript for Map & Reduce queries, and regular HTTP for its API.

Jan is one of the core committers of the CouchDB project and knows the database inside out. He gave us a very entertaining overview of the features of CouchDB 🙂

Stefan Plantikow gave the 2nd talk, about Neo4J, a graph database written in Java and Scala. Stefan is one of the core committers of the project.

He is a brave guy, because he did a live demo. Thank God the internet was working and Maven could download all dependencies without any errors 🙂

Stefan showed us the new features in version 2.0 of Neo4J, and he even talked about new, not yet publicly presented features in version 2.1.

The 3rd talk was given by Martin Schoener, about ArangoDB, a multi-purpose, multi-model, non-relational, open source database. Martin is the chief architect of the ArangoDB project.

Besides ArangoDB, he also talked about other database systems and showed us a very interesting landscape of DBs, which triggered quite some active discussion.

The event was a lot of fun with all these great people. After the talks we moved to a pizza shop around the corner, where we stayed until midnight and talked for a couple of hours about databases and languages.

Sponsors

Pizza and drinks were sponsored by 2 awesome companies. Many thanks for that!

Locafox

Locafox is a young startup from Berlin working on a real-time online/offline solution for retail. It's a young team in the middle of Berlin, and they are currently hiring. Check out the open positions here: http://locafox.de/jobs

TempoDB

TempoDB is a young database startup from Chicago. It is purpose-built to store & analyze time series data from sensors, smart meters, servers & more. If you want to join the team, send an email to jobs@tempo-db.com.

Many thanks again to our great sponsors for the pizza & beer!