MongoDB Aggregation slows down server

Currently we are using MongoDB 2.6.3 at VersionEye as our primary database. Almost 3 years ago I picked MongoDB for these reasons:

  • Easy setup
  • Easy replication
  • Very fast
  • Very good support from the Ruby Community
  • Schemaless

It did a good job in the past years. Unfortunately we are facing some issues with it now. Currently it’s running on a 3 node replica set.

At VersionEye we had a very cool reference feature. It showed you on each page how many references a software package has, meaning how many other software packages are using the selected one as a dependency.

[Screenshot: the reference counter on a package page]

And you could even click on it and see the packages.

[Screenshot: the list of referencing packages]

This feature was implemented with the MongoDB Aggregation Framework, which is a simpler alternative to Map & Reduce. In MongoDB we have a collection “dependencies” with more than 8 Million entries. This collection describes the dependency relationships between software packages. To get all references to a software package we run this aggregation code:

deps = Dependency.collection.aggregate(
  { '$match' => { :language => language, :dep_prod_key => prod_key } }, # all dependencies pointing to the package
  { '$group' => { :_id => '$prod_key' } },                              # each referencing prod_key only once
  { '$skip'  => skip },                                                 # pagination
  { '$limit' => per_page }
)

At first we match all the dependencies which link to the selected software package and then we group them by prod_key, because in the result list we want to have each prod_key only once. In SQL that would be a “distinct prod_key”.
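
For comparison, this is roughly what the same query would look like in SQL (table and column names are just illustrative):

SELECT DISTINCT prod_key
FROM dependencies
WHERE language = ? AND dep_prod_key = ?
LIMIT ? OFFSET ?;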

So far so good. At the time we launched that feature we had something like 4 or 5 Million entries in the collection and the aggregation worked fast enough for the web app. But right now, with 8 Million entries, the aggregation queries take quite some time. Sometimes several minutes. Far too slow to be part of an HTTP request - response roundtrip. And it slowed down each node in the replica set. The nodes were running permanently at ~ 60% CPU.

Oh. And yes. There are indexes on the collection 😉 These are the indexes:

index({ language: 1, prod_key: 1, prod_version: 1, name: 1, version: 1, dep_prod_key: 1}, { name: "very_long_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1, scope: 1 }, { name: "prod_key_lang_ver_scope_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1 }, { name: "prod_key_lang_ver_index", background: true })
index({ language: 1, prod_key: 1 }, { name: "language_prod_key_index" , background: true })
index({ language: 1, dep_prod_key: 1 }, { name: "language_dep_prod_key_index" , background: true })
index({ dep_prod_key: 1 }, { name: "dep_prod_key_index" , background: true })
index({ prod_key: 1 }, { name: "prod_key_index" , background: true })

Sometimes it was so slow that the whole web app was not reacting and I had to restart the MongoDB nodes.

Finally I turned off the reference feature. And look what happened to the replica set.

[Screenshot: CPU load of the replica set after turning off the feature]

The load went down to ~ 5% CPU. Wow. VersionEye is fast again 🙂

Now we need another solution for the reference feature. Calculating the references for 400K software libraries in the background would be very intensive. I would like to avoid that.

My first thought was to implement that feature with ElasticSearch and its facet feature. That would make sense because we already use ES for the search. I wrote a mapping for the dependency collection and started to index the dependencies. That was this morning German time, 12 hours ago. The indexing process is still running :-/
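
Just to give an idea of where this is heading: with a terms facet the distinct referencing prod_keys could be fetched in one query. This is only a sketch, assuming an index “dependencies” with the same fields as the MongoDB collection, written with the elasticsearch Ruby gem; none of this is in production yet:

require 'elasticsearch'

client = Elasticsearch::Client.new
result = client.search index: 'dependencies', body: {
  'query' => {
    'filtered' => {
      'filter' => {
        'and' => [
          { 'term' => { 'language' => language } },      # same filters as the $match step
          { 'term' => { 'dep_prod_key' => prod_key } }
        ]
      }
    }
  },
  'facets' => {
    # each referencing prod_key only once, like the $group step
    'references' => { 'terms' => { 'field' => 'prod_key', 'size' => per_page } }
  }
}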

Another solution could be Neo4J.

If you have any suggestions, please let me know. Any input is welcome.

Sorry to MongoDB

About 2 weeks ago I updated VersionEye from MongoDB 2.4.9 to 2.6.3. At first I updated everything on localhost and executed a couple of test suites. Then I tested manually. Everything was green. So I did the upgrade on production. After that we faced some downtimes. The application was running smoothly for a couple of hours and then suddenly it became incredibly slow until it didn’t react anymore. At that point I had to reboot the Mongo cluster or the app servers. And that happened a couple of times.

I was reading logs for hours and searching for answers on StackOverflow. And of course I was blaming MongoDB on Twitter. Well. I’m sorry for that. It was my fault!

Here is what happened. VersionEye is running on a 3 node replica set. Somehow, late at night, I made a copy & paste fuck up. In the mongoid.yml all MongoDB hosts have to be listed. 2 of the 3 entries pointed to the same host. Somehow that worked out for a couple of hours. But if 1 of the hosts became unavailable, the Mongoid driver tried to find a new master, and because 1 host was not listed, everything blocked.
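
For the record, a correct replica set configuration in mongoid.yml lists every member exactly once. A sketch (hostnames are placeholders, the exact layout depends on your Mongoid version):

production:
  sessions:
    default:
      database: versioneye
      hosts:
        - mongo1.example.com:27017   # every replica set member listed exactly once
        - mongo2.example.com:27017
        - mongo3.example.com:27017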

I’m sorry for that!


Intro to Ansible

Ansible is a tool for managing your IT Infrastructure.

If you have only 1 single server to manage, you probably log in via SSH and execute a couple of shell commands. If you have 2 servers with the same setup you lose a lot of time if you do everything by hand. 2 servers are already a reason to think about automation.

How do you handle the setup for 10, 100 or even 1000 servers? Assume you have to install Ruby 2.1.1 and Oracle Java 7 on 100 Linux servers. And by the way, both packages are not available via “apt-get install”. Good luck doing it manually 😀

That’s what I asked myself at the time I moved away from Heroku. I took a look at Chef and Puppet. But honestly, I couldn’t get warm with either of them. Both are very complex and, for my purpose, totally over-engineered. Finally a friend of mine recommended Ansible.


I had never heard of it and I was skeptical in the beginning. But after I finished watching these videos, I was convinced! It’s simple and it works as expected!

Key Features

Here are some facts:

  • Ansible doesn’t need a master server!
  • You don’t need to install anything on your servers for Ansible!
  • Ansible works via SSH.
  • Just tell Ansible the IP addresses of your servers and run the script!
  • With Ansible your IT infrastructure is just another code repository.
  • Configuration in YAML files

Sounds like magic? No, it’s not. It’s Python 😉 Ansible is implemented in Python and works via the SSH protocol. If you have configured passwordless login on your servers with public keys, then Ansible only needs the IP addresses of the servers.
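
If you haven’t set up passwordless login yet, ssh-copy-id installs your public key on a server. For example (user and IP address are just placeholders):

$ ssh-copy-id ubuntu@192.168.0.30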

Installation

You don’t need to install Ansible on your servers! Only on your workstation. There are different ways to install it. If you are a Python developer anyway, I assume you already have pip, the Python package manager, installed. In that case you can install Ansible like this:

$ sudo pip install ansible

On Mac OS X you can install it via the package manager brew.

$ brew update
$ brew install ansible

And there are many more ways to install it. Read here: http://docs.ansible.com/intro_installation.html.
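
A quick way to check that the installation worked:

$ ansible --version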

Getting started

With Ansible your IT infrastructure is just another code repository. You can keep everything in one directory and put it under version control with git.

Let’s create a new directory for Ansible.

$ mkdir infrastructure
$ cd infrastructure

Ansible has to know where your servers are and how they are grouped together. We keep that information in a hosts file in the root of the “infrastructure” directory. Here is an example:

[dev_servers]
192.168.0.30
192.168.0.31

[www_servers]
192.168.0.33

As you can see there are 3 servers defined. 2 of them are in the “dev_servers” group and one of them is in the “www_servers” group. You can define as many groups with as many IP addresses as you want. We will use the group names in the playbook to assign roles to them.
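
By the way, even without a playbook you can already check that Ansible reaches all your servers, with the ping module (assuming the “ubuntu” user from the playbook below can log in via SSH):

$ ansible all -i hosts -m ping -u ubuntu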

Playbooks

A playbook assigns roles (software) to server groups; it defines which roles should be installed on which server group. Playbooks are stored in YAML files. Let’s create a site.yml file in the root of the “infrastructure” directory, with this content:

---
- hosts: dev_servers
  user: ubuntu
  sudo: true
  roles:
  - java
  - memcached

- hosts: www_servers
  user: ubuntu
  sudo: true
  roles:
  - nginx

In the above example we defined that on all dev_servers the 2 roles “java” and “memcached” should be installed, and on all web servers (www) the role “nginx” should be installed. The “hosts” entries in site.yml have to match the group names from the hosts file.

Besides that, we define for each host group the user (“ubuntu”) that should be used to log in to the servers via SSH. And we define that “sudo” should be used for each command. As you can see there is no password defined. I assume that you can log in to your servers without a password, because of key-based auth. If you use AWS, that is the default anyway.

Roles

All right. We defined the IP addresses in the hosts file and we assigned roles to the servers in the site.yml playbook. Now we have to create the roles. A role describes exactly what to install and how to install it. A role can be defined in a single YAML file, but also as a directory with subdirectories. I prefer a directory per role. Let’s create the first role:

$ mkdir java
$ cd java

A role can contain “tasks”, “files”, “vars” and “handlers” as subdirectories, but it needs at least the “tasks” directory.

$ mkdir tasks 
$ cd tasks

Each of these subdirectories has to contain a main.yml file. This is the main file for that part of the role. And this is how it looks for the java role:

---
- name: update debian packages
  apt: update_cache=true

- name: install Java JDK
  apt: name=openjdk-7-jdk state=present

Ansible is organized in modules. Currently there are more than 100 modules out there. The “apt” module, for example, wraps “apt-get” on Debian machines. In the above example you can see that the task directives are always 2 lines. The first line is the name of the task. This is what you will see on the command line when you start Ansible. The 2nd line is always a module with attributes. For example:

apt: name=openjdk-7-jdk state=present

This is the “apt” module and we basically tell it here that the Debian package “openjdk-7-jdk” has to be installed on the server. You can find the full documentation of the apt module here.

Let’s create another role for memcached.

$ cd ../..
$ mkdir memcached
$ cd memcached
$ mkdir tasks
$ cd tasks

And add a main.yml file with this content:

---
- name: update debian packages
  apt: update_cache=true

- name: install memcached
  apt: name=memcached state=present

Easy right? Now let’s create a more complex role.

$ cd ../..
$ mkdir nginx
$ cd nginx
$ mkdir tasks
$ mkdir files
$ mkdir handlers

Beside the tasks, this role also has files and handlers. In the files directory you can put files which should be copied to the server. This is especially useful for configuration files. In this case we put the nginx.conf into the files directory. The main.yml file in the tasks directory looks like this:

---
- name: update debian packages
  apt: update_cache=true

- name: install NGinx
  apt: name=nginx state=present

- name: copy nginx.conf to the server
  copy: src=nginx.conf dest=/etc/nginx/nginx.conf
  notify: restart nginx

The first task updates the Debian apt cache. That is similar to “apt-get update”. The 2nd task installs Nginx. And the 3rd task copies nginx.conf from the “files” subdirectory to “/etc/nginx/nginx.conf” on the server.

And the last line notifies the handler “restart nginx”. Handlers are defined in the “handlers” subdirectory and are usually used to restart a service. The main.yml in the “handlers” subdirectory looks like this:

---
- name: restart nginx
  service: name=nginx state=restarted

It uses the “service” module to restart the web server Nginx. That is necessary because we installed a new nginx.conf configuration file.
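
At this point the “infrastructure” directory should look roughly like this:

infrastructure/
├── hosts
├── site.yml
├── java/
│   └── tasks/
│       └── main.yml
├── memcached/
│   └── tasks/
│       └── main.yml
└── nginx/
    ├── tasks/
    │   └── main.yml
    ├── files/
    │   └── nginx.conf
    └── handlers/
        └── main.yml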

RUN

All right. We defined 3 roles and 3 servers. Let’s set up our infrastructure. Execute this command in the root of the “infrastructure” directory:

$ ansible-playbook -i hosts site.yml

The “ansible-playbook” command takes a playbook file as parameter and executes it, in this case the “site.yml” file. In addition to the playbook file we let the command know where our servers are with “-i hosts”.

Conclusion

This was a very simple example as an intro, but you can easily do much more complex things. For example manipulating values in existing configuration files on servers, checking out private git repositories or executing shell commands with the shell module. Ansible is very powerful and you can do amazing things with it!
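
To give you an idea, here is a sketch of such tasks. The config file, repository and command are just placeholders; “lineinfile”, “git” and “shell” are real Ansible modules:

---
- name: raise the memcached memory limit to 256 MB
  lineinfile: dest=/etc/memcached.conf regexp='^-m ' line='-m 256'

- name: check out a private git repository
  git: repo=git@github.com:example/private-repo.git dest=/opt/private-repo

- name: execute a shell command
  shell: ls -la /opt > /tmp/opt_listing.txt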

I’m using Ansible to manage the whole infrastructure for VersionEye. Currently I have 36 roles and 15 playbooks defined for VersionEye. I can set up the whole infrastructure with 1 single command! Or just parts of it. I even use Ansible for deployments. Deploying the VersionEye crawlers into the Amazon Cloud is 1 single command for me. And I even rebuilt the Capistrano deployment process for Rails apps with Ansible.

Let me know if you find this tutorial helpful or you have additional questions. Either here in the comments or on Twitter.

The Heartbleed Bug

If you don’t live under a rock you have probably already heard about the Heartbleed bug in OpenSSL. This bug is so critical for the security of the internet that it even got its own domain, logo and marketing campaign.


Here you can test if your server is affected: http://filippo.io/Heartbleed/

Unfortunately VersionEye was affected as well. We don’t have any reason to believe that we have been compromised! But we exchanged all secrets anyway and revoked all tokens from GitHub and Bitbucket.

What does that mean for you? If you signed up at VersionEye with your GitHub or Bitbucket account you have to grant VersionEye access to your GitHub/Bitbucket account again. Just use one of the social media login buttons on this page:

https://www.versioneye.com/signin

If you are currently signed in at VersionEye you can re-connect your GitHub/Bitbucket account here:

https://www.versioneye.com/settings/connect

If you signed up with your email address please use the “password reset” function, because we reset all passwords in our database to some random values.

https://www.versioneye.com/iforgotmypassword

I’m really sorry for this inconvenience. But better safe than sorry!

You can believe me that my heart was bleeding when I clicked the “Revoke all user tokens” button at GitHub :-/

ThoughtWorks Technology Radar – Latest Trends

ThoughtWorks regularly publishes a report about the latest technology trends. The ThoughtWorks Technology Radar covers the latest trends in

  • Techniques
  • Tools
  • Platforms
  • Languages & Frameworks

The Radar items appear in one of these rings:

  • Adopt
  • Trial
  • Assess
  • Hold

The current Technology Radar has 94 items. It’s important to understand that this is not a complete list of all available technologies. The input for the ThoughtWorks Technology Radar comes from the 2500 employees of ThoughtWorks. The final cut is made by the ThoughtWorks Technology Advisory Board, which has only 20 members, including Martin Fowler, Rebecca Parsons, Erik Doernenburg and 17 other very smart ThoughtWorkers. They have to master the difficult process of deciding which technology appears on the Radar and in which ring. From a couple hundred technologies they have to cut it down to ~ 100.

You can download the full report as PDF.

Events

ThoughtWorks organizes events around the Technology Radar, where they present and discuss their decisions with the audience. You can check out the upcoming events here. I attended the Technology Radar event in Hamburg, which was a lot of fun. The following section is a report of my personal experience in Hamburg.

Technology Radar Hamburg

Last week I was invited to the launch party of Graylog2 version 0.20.1, the market leader for searchable logs. By accident I noticed on Twitter that ThoughtWorks was organizing a Technology Radar event in Hamburg just the day after the Graylog2 launch party. I decided it was worth staying 1 day longer in Hamburg and joining the ThoughtWorks event. It made particular sense for me because I’m working on VersionEye and I have to deal with all kinds of technologies every day.

The event started at 7pm at the Lindner Hotel. Erik Doernenburg explained the decision-making process and presented the highlights from the different parts of the Radar.

[Photo: the Technology Radar event in Hamburg]

Here are the highlights from the Technology Radar. At least from my point of view 🙂

Datensparsamkeit

Unbelievable but true. A German buzzword is on the Radar, in the Techniques quadrant as ASSESS 🙂 Basically it means that you should be careful with the amount of data you are collecting / storing. It is privacy versus BIG DATA. Many companies store, for example, the full IP address of their users in the logs, so that they can analyze where they come from. But for that kind of analysis the first 3 octets of the IP are enough. And if you don’t store the complete IP address you have much less trouble with the German Datenschützer (data protection officials) 😉
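
In code, this kind of Datensparsamkeit can be as simple as cutting off the last octet before anything gets logged. A minimal Ruby sketch (the method name is made up):

# Keep only the first 3 octets of an IPv4 address before logging it.
def anonymize_ip(ip)
  ip.split('.')[0..2].join('.') + '.0'
end

anonymize_ip('93.184.216.34') # => "93.184.216.0"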

JavaScript

JavaScript topics are distributed all over the Technology Radar. Capturing client-side JavaScript errors, for example, is in TECHNIQUES / ADOPT. HTML5 storage instead of cookies is in TECHNIQUES / TRIAL. Dependency management for JavaScript is in TOOLS / ADOPT. Grunt.js & PhantomJS are in TOOLS / TRIAL.
JavaScript is a big thing. The tools, libraries and frameworks are getting more sophisticated. Frontend JavaScript development is becoming more and more equal to backend development. There are JS MVC frameworks that run completely in the browser, combined with JS templating engines. And there are multiple package managers out there to manage the dependencies for frontend development. Exciting times for JS developers.

HOLD

The HOLD section is maybe the most interesting section on the Radar.
JSF is on HOLD. I worked 3 years with JSF and I’m wondering why it took so long to put it on HOLD. JSF is by default not web friendly. By default you have problems with bookmarking, the browser back button and multiple tabs, and it eats memory like a black hole because you have to put everything in session scope. Good luck with that if you have to build something for the Internet.
The lesson is clear. If you want to build a web app you should simply learn HTTP, HTML, CSS and JavaScript. Don’t abstract it away. That doesn’t work!
TFS is on HOLD too. Team Foundation Server from Microsoft. It’s such a good product that Microsoft gives it away for free if you buy anything else. But it’s still not fun to use, and it is far behind the open source alternative Git.
Heavyweight test tools are on HOLD as well. How do you know if you are using a heavyweight test tool? There is a simple test for that. If you can get certified for it, it’s heavy! If you need a workshop to learn it, it’s heavy! If you have to pay a lot of money for it, it’s heavy!

There are many very interesting points on the Tech Radar. I hope I whetted your appetite for the full report 🙂

Let me know what you think about the Technology Radar.

Geek2Geek NoSQL

Last Friday we met again for the Geek2Geek event in Berlin. The theme this time was NoSQL. 40 people RSVPed on MeetUp.com and 42 actually joined the event. Not bad for a Friday night 🙂

[Photo: the Geek2Geek NoSQL crowd in Berlin]

Jan Lehnardt gave the first talk, about CouchDB, a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for its API.

[Photo: Jan Lehnardt talking about CouchDB]

Jan is one of the core committers of the CouchDB project and knows the database inside out. He gave us a very entertaining overview of the features of CouchDB 🙂


Stefan Plantikow gave the 2nd talk, about Neo4J, a graph database written in Java and Scala. Stefan is one of the core committers of the project.

[Photo: Stefan Plantikow talking about Neo4J]

He is a brave guy, because he did a live demo. Thank God the internet was working and Maven could download all dependencies without any errors 🙂


Stefan showed us the new features in version 2.0 of Neo4J and he even talked about new and not yet presented features in version 2.1.

The 3rd talk was by Martin Schoener, about ArangoDB, a multi-purpose, multi-model, non-relational, open source database. Martin is the chief architect of the ArangoDB project.

[Photo: Martin Schoener talking about ArangoDB]

Beside ArangoDB he also talked about other database systems and showed us a very interesting landscape of DBs, which caused quite some active discussions.


The event was a lot of fun with all these great people. After the talks we moved to a pizza shop around the corner, where we stayed until midnight. We talked for a couple of hours about databases and languages.


Sponsors

Pizza and drinks were sponsored by 2 awesome companies. Many thanks for that.

Locafox

Locafox is a young startup from Berlin. They are working on a real-time online/offline solution for retail. It’s a young team in the middle of Berlin and they are currently hiring. Check out the open positions here: http://locafox.de/jobs


TempoDB

TempoDB is a young database startup from Chicago. TempoDB is purpose-built to store & analyze time series data from sensors, smart meters, servers & more. If you want to join the team, send an email to jobs@tempo-db.com.


Many thanks again to our great sponsors for the pizza & beer.

Bower Integration

We finished our Bower integration! Yeah 🙂 So many of you requested this integration and now it’s done. Now I will get a bit fewer emails per week 🙂


Through the Bower integration we now have an additional 5K+ JavaScript and 500+ CSS libraries on VersionEye that you can follow. Check out the JavaScript page to see the most followed libraries. Currently jquery-mobile, jquery and backbone are the most followed JavaScript libraries.

Thanks to Bower we can now show dependencies on the JavaScript pages, just like for Ruby, PHP, NodeJS and other languages. Beside the dependencies we show you some Bower code snippets and a download link.

[Screenshot: dependencies on a JavaScript package page]

And since we have the dependencies now, we can display dependency badges for JavaScript libraries too 🙂

[Badges: up-to-date / out-of-date / none]

I’m looking forward to seeing them in the READMEs on GitHub 🙂

Beside the follow feature, VersionEye can actively monitor your bower.json file on GitHub and notify you about outdated dependencies. In the login area just navigate to the repository and turn on the switch beside the package file.

[Screenshot: project files with monitoring switches]

Just turn it on and you will receive an email notification as soon as there are updates available for your bower.json file. Stay passive & up-to-date 😉
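
In case you are wondering what exactly gets monitored: a minimal bower.json looks roughly like this (name and dependencies are just examples):

{
  "name": "my-frontend-app",
  "version": "0.1.0",
  "dependencies": {
    "jquery": "~2.1.0",
    "backbone": "~1.1.2"
  }
}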

[Screenshot: the monitoring switch turned on]

The Bower integration is kind of new. If you see anything that doesn’t seem to be OK, just contact me on Twitter or submit a message via the contact form at VersionEye.

VersionEye + BitBucket

Happy new year to everybody! We are starting 2014 with some awesome new features.

We are happy to announce that we finished our Bitbucket integration. This enables you to log into VersionEye using your Bitbucket account, as well as to monitor the dependencies in your Bitbucket projects.

Connecting your Bitbucket account

If you are new to VersionEye you can now sign up with your existing Bitbucket account. Simply click on the “Login with Bitbucket” button on the right side.

[Screenshot: sign up with Bitbucket]

If you already have a VersionEye user account and want to connect it to your existing Bitbucket account, simply go to “Connect” in the “Settings” area.

[Screenshot: the Connect page in the Settings area]

Click on “Connect with Bitbucket” to connect your Bitbucket account to VersionEye.

Import and monitor your Bitbucket projects

Now the projects area features the section “Bitbucket Repositories”:

[Screenshot: Bitbucket repositories in the projects area]

It works exactly like our GitHub integration. Simply click “Show project files” on a repository and turn on the switch next to the project file VersionEye should monitor for you.

[Screenshot: the monitoring switch next to a project file]

The integration is totally seamless. Just set it up once and you will get email notifications about outdated dependencies in your projects. It has never been easier to stay up-to-date with your open source dependencies.

The Bitbucket integration is pretty new. If you find anything odd, please contact us and give us feedback.

Updated GitHub SinglePage App

Because more than 60% of all users at VersionEye are using the “Login with GitHub” button, the GitHub integration is very important to us. That’s why we improved our GitHub SinglePage App again. Here are the most important updates.

No Polling

The last version of the app polled updates from the server every 30 seconds, and that caused some strange behaviour. Especially if it updated the view while you were clicking around or typing something into the search field. We removed the polling completely.

Async Fetching

The very first time you start the GitHub SinglePage App it will fetch all your repositories from GitHub and cache them. This fetching process now happens asynchronously. It is non-blocking, and during the fetching process you can see the progress: how many of your repositories have already been fetched.

Reimport manually

Because we removed the polling completely, we added 2 “Reimport” links. In the upper right corner you can click the “Reimport all repositories” link, which will clear the cache and reimport all your current repositories from GitHub.

Next to each repository you now have a link “Update the repository”, which will reimport ALL branches of just that repository.

[Screenshot: the new Reimport links]

Let us know how you like our updates.

Which packages use my library?

On VersionEye every software library has its own page where we show some important numbers related to that library.

[Screenshot: package page of the Ruby gem twitter-oauth]

For example on the right side you can see the numbers for the current version of the Ruby gem twitter-oauth. As you can see version 0.4.94 of twitter-oauth was released 6 months ago and the average time between releases for that package is 81 days.

The references number shows how many other software libraries are referencing / using the current library. As you can see there are 20 other libraries that use twitter-oauth. Up until now VersionEye didn’t show you more than that, but now this number is a link!

When you click on the references number you’ll see which packages use twitter-oauth.

[Screenshot: the packages that use twitter-oauth]

This is one of our newest features and it works with every software library on VersionEye we have dependency information for.

Please try it for yourself and let us know what you think!