VersionEye Enterprise Update

Another update for all the Enterprise clients is out.

rails_app 2.1.5

This Docker container ships the new web app for VersionEye Enterprise. It includes many small improvements. Here are the two biggest ones:

  • References are turned on again.
  • Bug fix for too many open database connections.

There was a bug which didn't close open database connections. The connections added up and the web app became slower and slower. This bug is fixed in this release. If you update this container, please also stop and start the MongoDB Docker container beforehand. That will close all open connections on MongoDB.
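
Here is a minimal sketch of that order on the Docker host; the container names "mongodb" and "rails_app" are just examples, adjust them to your setup:

docker restart mongodb     # restarting MongoDB closes all leaked connections
docker restart rails_app   # then bring the updated web app container back up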

Intro to Firegento

Firegento e.V. is an association around the PHP e-commerce framework Magento, dedicated to open source, the community around it, and supporting the training and education of developers. It started with a group of developers from different companies who wanted to create open source projects under a neutral name. Firegento regularly organizes Magento-centred hackathons in Europe and supports a good set of Magento-related projects.
I want to tell you about the two of them which are now related to VersionEye.

The Magento Composer Installer

Every bigger PHP project seems to have its own installer to integrate Composer packages into the current project, so we thought we should write one for Magento, too. That was nearly 2 years ago at a hackathon. Today it is a standard tool many Magento developers know, and it's even part of the regular Magento trainings (at least that's what people told me). It's also currently the 3rd most referenced PHP project on VersionEye, and even the Magento project itself uses it as the base for its own Composer installer.

With the increased interest in Composer there also grew a need to have access to the whole world of Magento modules, which led to the next project I want to introduce.

The Firegento Repository

It uses the Composer Satis project as a base and was initially created to offer easy access to a few Magento modules hosted on GitHub. At one of the hackathons two members of the community felt the need to expand this and created a converter for all the free Magento modules which are published on the Magento marketplace. Thanks to both projects, Firegento plays a major role today in spreading Composer usage among Magento developers.

Useful Links related to this article:

  • Firegento page: http://firegento.com/
  • Packagist page: http://packages.firegento.com/
  • Repository on GitHub: https://github.com/magento-hackathon/composer-repository
  • Installer on GitHub: https://github.com/magento-hackathon/magento-composer-installer

Daniel Fahlke

This article was written by Daniel Fahlke.

I am Daniel Fahlke, also known on the Internet as Flyingmana, and I work as a Magento developer at http://www.melovely.de/. I am also an active member of the open source community, and in my free time I am currently mostly active in Magento and Composer related projects. Other things I participate in are PHP mentoring, various other PHP projects, and some regular user groups and conferences.

MongoDB lessons learned

MongoDB is currently the primary database for VersionEye. In the last couple of weeks we had some performance and scaling issues. Unfortunately that caused some downtimes. Here are the lessons learned from the last 3 weeks.

Mongoid

The Ruby code at VersionEye uses the Mongoid driver to access MongoDB. All in all, Mongoid is a great piece of open source software. There is a very active community which offers great support.

In our case Mongoid somehow didn't close the opened connections. With each HTTP request a new connection to MongoDB is created. Once the HTTP response is generated, the connection can be closed. Unfortunately that didn't happen automatically. So the open connections added up on the MongoDB replica set and the application became slower and slower over time. After a restart of the replica set the count started at 0 again and the application was fast again. At least for a couple of hours, until the open connections added up into the hundreds again.

For now that's fixed with this filter in the ApplicationController:

  after_filter :disconnect_from_mongo

  def disconnect_from_mongo
    Mongoid.default_session.disconnect
  rescue => e
    p e.message
    Rails.logger.error e.message
    Rails.logger.error e.backtrace.join "\n"
  end

Still not sure if this is a bug in Mongoid or a misconfiguration on our side.

MongoDB Aggregation Framework

We have a cool feature at VersionEye which shows the references for software packages. These are the references for the Rails framework, for example:

[Screenshot: references of the Rails framework on VersionEye]

This feature shows you which other software libraries are using the selected software library as a dependency. Usually, many references are a good sign of software quality.

In the beginning this feature was implemented with the Aggregation Framework of MongoDB and it was fast enough. This is the aggregation code snippet we used for this feature:

deps = Dependency.collection.aggregate(
  { '$match' => { :language => language, :dep_prod_key => prod_key } },
  { '$group' => { :_id => '$prod_key' } },
  { '$skip' => skip },
  { '$limit' => per_page }
)

At the time this was implemented we had fewer than 4 million dependency records in the collection. Over time the collection kept growing. Right now there are more than 9 million records in the collection and the aggregation snippet above is just terribly slow. And it slows down everything else, too. If multiple HTTP requests trigger this code, the whole database gets super slow! I already wrote a blog post about that here.

One thing I learned is that, at least in our case, the Aggregation Framework didn't take advantage of the indexes. The same is true for the Map & Reduce feature in MongoDB. Originally Map & Reduce was created to crunch data in parallel, super fast. On MongoDB, Map & Reduce runs on a single thread, without indexes :-/

Wrong Indexes

Instead of calculating the references in real time with MongoDB's Aggregation Framework, we wanted to pre-calculate the references with a simple query. This one:

prod_keys = Dependency.where(:language => product.language, :dep_prod_key => product.prod_key).distinct(:prod_key)

The advantage of this distinct query over the Aggregation Framework is that it can take advantage of indexes. And there is an index exactly for that query!

index({ language: 1, dep_prod_key: 1 }, { name: "language_dep_prod_key_index" , background: true })

On localhost the query was running quite fast. Still too slow for real time, but fast enough to pre-calculate all values over night. On production it was running super slow! Each query took 17 seconds. Calculating the references for all 400K software libraries in our database would have taken 78 days.

Finally Asya gave the right hint and recommended double-checking the query in the mongo console with ".explain()" to see which indexes are used. And indeed, MongoDB was using the wrong index on production! Only God and the core committers know why. For me that's a bug!
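
For illustration, this is roughly what such a check looks like in the mongo console. The collection name is the one from above; the query values are just examples:

db.dependencies.find({ language: "Ruby", dep_prod_key: "rails" }).explain()
// the "cursor" field in the output shows which index was picked
// hint() can force a specific index by name if needed:
db.dependencies.find({ language: "Ruby", dep_prod_key: "rails" }).hint("language_dep_prod_key_index").explain()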

This is what happens if you run a couple of distinct queries which use the wrong index.

[Screenshot: MongoDB load caused by distinct queries using the wrong index]

I deleted 5 indexes on the collection until MongoDB had no other choice than to use the damn right index! And now it's running fast enough. Finally!
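
The nightly pre-calculation itself can then be as simple as a loop over all products, running the distinct query from above. Here is a minimal sketch; the Product model and its fields are as used above, and the reference_count attribute is just a hypothetical place to store the result:

Product.all.each do |product|
  prod_keys = Dependency.where(:language => product.language, :dep_prod_key => product.prod_key).distinct(:prod_key)
  # store the pre-calculated value; the attribute name is made up for this example
  product.update_attribute(:reference_count, prod_keys.count)
end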

Conclusion

Here are the conclusions for working with MongoDB:

  • Check the logs on the MongoDB replica set regularly to spot odd things.
  • Close open connections.
  • Avoid the Aggregation Framework if you can do the same with a simple query.
  • Ensure that MongoDB is using the right index for your query.

So far so good.

MongoDB Aggregation slows down server

Currently we are using MongoDB 2.6.3 at VersionEye as our primary database. Almost 3 years ago I picked MongoDB for these reasons:

  • Easy setup
  • Easy replication
  • Very fast
  • Very good support from the Ruby Community
  • Schemaless

It has done a good job over the past years. Unfortunately we are facing some issues with it now. Currently it's running as a 3-node replica set.

At VersionEye we had a very cool reference feature. It showed you on each package page how many references a software package has, meaning how many other software packages are using the selected one as a dependency.

[Screenshot: reference counter on a VersionEye package page]

And you could even click on it and see the packages.

[Screenshot: list of packages referencing the selected package]

This feature was implemented with the MongoDB Aggregation Framework, which is a simpler version of Map & Reduce. In MongoDB we have a collection "dependencies" with more than 8 million entries. This collection describes the dependency relationships between software packages. To get all references to a software package we run this aggregation code:

deps = Dependency.collection.aggregate(
  { '$match' => { :language => language, :dep_prod_key => prod_key } },
  { '$group' => { :_id => '$prod_key' } },
  { '$skip' => skip },
  { '$limit' => per_page }
)

At first we match all the dependencies which link to the selected software package and then we group them by prod_key, because in the result list we want to have each prod_key only once. In SQL that would be a “distinct prod_key”.

So far so good. At the time we launched that feature we had something like 4 or 5 million entries in the collection and the aggregation worked fast enough for the web app. But right now, with 8 million entries, the aggregation queries take quite some time. Sometimes several minutes. Far too slow to be part of an HTTP request-response roundtrip. And it slowed down each node in the replica set. The nodes were running permanently at ~60% CPU.

Oh, and yes, there are indexes for the collection 😉 These are the indexes for the collection:

index({ language: 1, prod_key: 1, prod_version: 1, name: 1, version: 1, dep_prod_key: 1}, { name: "very_long_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1, scope: 1 }, { name: "prod_key_lang_ver_scope_index", background: true })
index({ language: 1, prod_key: 1, prod_version: 1 }, { name: "prod_key_lang_ver_index", background: true })
index({ language: 1, prod_key: 1 }, { name: "language_prod_key_index" , background: true })
index({ language: 1, dep_prod_key: 1 }, { name: "language_dep_prod_key_index" , background: true })
index({ dep_prod_key: 1 }, { name: "dep_prod_key_index" , background: true })
index({ prod_key: 1 }, { name: "prod_key_index" , background: true })

Sometimes it was so slow that the whole web app was not reacting and I had to restart the MongoDB nodes.

Finally I turned off the reference feature. And look what happened to the replica set.

[Screenshot: replica set CPU load dropping after the reference feature was turned off]

The load went down to ~ 5% CPU. Wow. VersionEye is fast again 🙂

Now we need another solution for the reference feature. Calculating the references for 400K software libraries in the background would be very intensive. I would like to avoid that.

My first thought was to implement that feature with ElasticSearch and its facet feature. That would make sense because we already use ES for the search. I wrote a mapping for the dependency collection and started to index the dependencies. That was this morning German time, 12 hours ago. The indexing process is still running :-/

Another solution could be Neo4J.

If you have any suggestions, please let me know. Any input is welcome.

Sorry to MongoDB

Roughly 2 weeks ago I updated VersionEye from MongoDB 2.4.9 to 2.6.3. At first I updated everything on localhost and executed a couple of test suites. Then I tested manually. Everything was green. So I did the upgrade on production. After that we faced some downtimes. The application was running smoothly for a couple of hours and then suddenly became incredibly slow until it didn't react anymore. At that point I had to reboot the Mongo cluster or the app servers. And that happened a couple of times.

I was reading logs for hours and searching for answers on StackOverflow. And of course I was blaming MongoDB on Twitter. Well, I'm sorry for that. It was my fault!

Here is what happened. VersionEye is running on a 3-node replica set. Somehow, late at night, I made a copy & paste fuck-up. In the mongoid.yml all MongoDB hosts have to be listed. 2 of the 3 entries pointed to the same host. Somehow that worked out for a couple of hours. But when 1 of the hosts was not available anymore, the Mongoid driver tried to find a new master, and because one member was not listed at all, everything blocked.
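
For reference, the relevant part of a mongoid.yml for a 3-node replica set looks roughly like this (Mongoid 3/4 session format; the host and database names here are just placeholders). Every member of the replica set has to be listed exactly once:

production:
  sessions:
    default:
      database: versioneye_db        # placeholder name
      hosts:
        - mongo-node-1.example.com:27017
        - mongo-node-2.example.com:27017
        - mongo-node-3.example.com:27017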

I’m sorry for that!

Intro to Ansible

Ansible is a tool for managing your IT Infrastructure.

If you have only 1 single server to manage, you probably log in via SSH and execute a couple of shell commands. If you have 2 servers with the same setup, you lose a lot of time if you do everything by hand. 2 servers are already a reason to think about automation.

How do you handle the setup for 10, 100 or even 1000 servers? Assume you have to install Ruby 2.1.1 and Oracle Java 7 on 100 Linux servers. And by the way, neither package is available via "apt-get install". Good luck doing it manually 😀

That's what I asked myself at the time I moved away from Heroku. I took a look at Chef and Puppet, but honestly I couldn't warm to either of them. Both are very complex and for my purpose totally over-engineered. A friend of mine finally recommended Ansible.

I had never heard of it and I was skeptical in the beginning. But after I finished watching these videos, I was convinced! It's simple and it works as expected!

Key Features

Here are some facts:

  • Ansible doesn’t need a master server!
  • You don’t need to install anything on your servers for Ansible!
  • Ansible works via SSH.
  • Just tell Ansible the IP addresses of your servers and run the script!
  • With Ansible your IT infrastructure is just another code repository.
  • Configuration in YAML files.

Sounds like magic? No, it's not. It's Python 😉 Ansible is implemented in Python and works via the SSH protocol. If you have configured password-less login on your servers with public keys, then Ansible only needs the IP addresses of the servers.
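
If password-less login is not set up yet, copying your public key to each server is usually all it takes (the user and IP address are just examples):

$ ssh-copy-id ubuntu@192.168.0.30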

Installation

You don't need to install Ansible on your servers, only on your workstation. There are different ways to install it. If you are a Python developer anyway, I assume you already have pip, Python's package manager, installed. In that case you can install it like this:

sudo pip install ansible

On Mac OS X you can install it via the package manager brew:

$ brew update
$ brew install ansible

And there are many more ways to install it. Read more here: http://docs.ansible.com/intro_installation.html.

Getting started

With Ansible your IT infrastructure is just another code repository. You can keep everything in one directory and put it under version control with git.

Let's create a new directory for Ansible:

$ mkdir infrastructure
$ cd infrastructure

Ansible has to know where your servers are and how they are grouped together. We will keep that information in the hosts file in the root of the "infrastructure" directory. Here is an example:

[dev_servers]
192.168.0.30
192.168.0.31

[www_servers]
192.168.0.33

As you can see, there are 3 servers defined. 2 of them are in the "dev_servers" group and one of them is in the "www_servers" group. You can define as many groups with as many IP addresses as you want. We will use the group names in the playbook to assign roles to them.
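
At this point you can already check that Ansible reaches all servers from the hosts file, assuming the password-less SSH login mentioned above is in place:

$ ansible all -i hosts -m ping -u ubuntu   # every host should answer with "pong"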

Playbooks

A playbook assigns roles (software) to server groups. It defines which roles should be installed on which server groups. Playbooks are stored in YAML files. Let's create a site.yml file in the root of the "infrastructure" directory with this content:

---
- hosts: dev_servers
  user: ubuntu
  sudo: true
  roles:
  - java
  - memcached

- hosts: www_servers
  user: ubuntu
  sudo: true
  roles:
  - nginx

In the above example we defined that on all dev_servers the 2 roles "java" and "memcached" should be installed. And on all web servers (www) the role "nginx" should be installed. The "hosts" entries in site.yml have to match the group names from the hosts file.

In addition we define for each host group the user (ubuntu) which should be used to log in to the servers via SSH. And we define that "sudo" should be used in front of each command. As you can see there is no password defined. I assume that you can log in to your servers without a password, because of key-based authentication. If you use AWS, that is the default anyway.

Roles

All right. We defined the IP addresses in the hosts file and we assigned roles to the servers in the site.yml playbook. Now we have to create the roles. A role describes exactly what to install and how to install it. A role can be defined in a single YAML file or in a directory with subdirectories. I prefer a directory per role. Let's create the first role:

$ mkdir java
$ cd java

A role can contain "tasks", "files", "vars" and "handlers" as subdirectories, but it needs at least the "tasks" directory.

$ mkdir tasks 
$ cd tasks

Each of these subdirectories has to have a main.yml file. This is the main file for the role. And this is how it looks for the java role:

---
- name: update debian packages
  apt: update_cache=true

- name: install Java JDK
  apt: name=openjdk-7-jdk state=present

Ansible is organized in modules. Currently there are more than 100 modules available. The "apt" module, for example, wraps "apt-get" on Debian machines. In the above example you can see that the task directives are always 2 lines. The first line is the name of the task. This is what you will see on the command line when you run Ansible. The 2nd line is always a module with attributes. For example:

apt: name=openjdk-7-jdk state=present

This is the "apt" module and we basically tell it that the Debian package "openjdk-7-jdk" has to be installed on the server. You can find the full documentation of the apt module here.
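
On the server this is roughly equivalent to the following command, except that the module first checks whether the package is already installed:

$ sudo apt-get install openjdk-7-jdk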

Let’s create another role for memcached.

$ mkdir memcached
$ cd memcached
$ mkdir tasks
$ cd tasks

And add a main.yml file with this content:

---
- name: update debian packages
  apt: update_cache=true

- name: install memcached
  apt: name=memcached state=present

Easy, right? Now let's create a more complex role:

$ mkdir nginx
$ cd nginx
$ mkdir tasks
$ mkdir files
$ mkdir handlers

Besides the tasks, this role also has files and handlers. In the files directory you can put files which should be copied to the server. This is especially useful for configuration files. In this case we put the nginx.conf into the files directory. The main.yml file in the tasks directory looks like this:

---
- name: update debian packages
  apt: update_cache=true

- name: install NGinx
  apt: name=nginx state=present

- name: copy nginx.conf to the server
  copy: src=nginx.conf dest=/etc/nginx/nginx.conf
  notify: restart nginx

The first task updates the Debian apt cache. That is similar to "apt-get update". The 2nd task installs nginx. And the 3rd task copies nginx.conf from the "files" subdirectory to "/etc/nginx/nginx.conf" on the server.

And the last line notifies the handler “restart nginx”. Handlers are defined in the “handlers” subdirectory and are usually used to restart a service. The main.yml in the “handlers” subdirectory looks like this:

---
- name: restart nginx
  service: name=nginx state=restarted

It uses the "service" module to restart the nginx web server. That is necessary because we installed a new nginx.conf configuration file.

RUN

All right. We have defined 3 roles and 3 servers now. Let's set up our infrastructure. Execute this command in the root of the infrastructure directory:

ansible-playbook -i hosts site.yml

The "ansible-playbook" command takes a playbook file as a parameter and executes it, in this case the "site.yml" file. In addition to the playbook file we let the command know where our servers are with "-i hosts".
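
You don't have to run the whole playbook every time. For example, ansible-playbook also supports a dry run and limiting the run to a single group from the hosts file:

$ ansible-playbook -i hosts site.yml --check              # dry run, only report what would change
$ ansible-playbook -i hosts site.yml --limit www_servers  # only run against the www servers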

Conclusion

This was a very simple example as an intro, but you can easily do much more complex things. For example manipulating values in existing configuration files on servers, checking out private git repositories, or executing shell commands with the shell module. Ansible is very powerful and you can do amazing things with it!

I'm using Ansible to manage the whole infrastructure for VersionEye. Currently I have 36 roles and 15 playbooks defined for VersionEye. I can set up the whole infrastructure with 1 single command! Or just parts of it. I even use Ansible for deployments. Deploying the VersionEye crawlers into the Amazon cloud is 1 single command for me. And I even rebuilt the Capistrano deployment process for Rails apps with Ansible.

Let me know if you find this tutorial helpful or if you have additional questions, either here in the comments or on Twitter.

The Heartbleed Bug

If you haven't been living under a rock, you have probably already heard about the Heartbleed bug in OpenSSL. This bug is so critical for the security of the internet that it even got its own domain, logo and marketing campaign.

Here you can test if your server is affected: http://filippo.io/Heartbleed/

Unfortunately VersionEye was affected as well. We don't have any reason to believe that we have been compromised! But we exchanged all secrets anyway and revoked all tokens from GitHub and Bitbucket.

What does that mean for you? If you signed up at VersionEye with your GitHub or Bitbucket account, you have to grant VersionEye access to your GitHub/Bitbucket account again. Just use one of the social media login buttons on this page:

https://www.versioneye.com/signin

If you are currently signed in at VersionEye you can re-connect your GitHub/Bitbucket account here:

https://www.versioneye.com/settings/connect

If you signed up with your email address please use the “password reset” function, because we reset all passwords in our database to some random values.

https://www.versioneye.com/iforgotmypassword

I'm really sorry for the inconvenience. But better safe than sorry!

You can believe me that my heart was bleeding when I clicked the "Revoke all user tokens" button on GitHub :-/