Round about 2 weeks ago I updated VersionEye from MongoDB 2.4.9 to 2.6.3. At first I updated everything on localhost and executed a couple tests suites. Then I tested manually. Everything was green. So I did the upgrade on production. After that we faced some down times. The application was running smoothly for a couple hours and then suddenly it become incredible slow until it didn’t react anymore. At that time I had to reboot the Mongo cluster or the app servers. And that happened a couples times.
I was reading logs for hours and searching for answers on StackOverflow. And of course I was blaming MongoDB on Twitter. Well. I’m sorry for that. It was my fault!
Here is what happened. VersionEye is running on a 3 node replica set. Somehow, late at night, I was doing a copy & paste fuck up. In the mongoid.yml all MongoDB hosts has to be listed. 2 of the 3 entries pointed to the same host. Somehow that worked out for a couples hours. But If 1 of the hosts was not available anymore the MongoID driver tried to find a new master and if it wasn’t available (because 1 host was not listed) everything blocked.
I’m sorry for that!