As those of you who used PullReview yesterday know, we experienced severe load problems. The situation is returning to normal, albeit slowly, as the review jobs still have a large backlog to work through.

First of all, we apologize for the inconvenience caused.

We hope you saw our message announcing this inside PullReview itself, but we wanted to give you more details about what happened, what we did to resolve the situation, and what we plan to do to avoid it in the future.

The context

PullReview is made of two main parts: the Rails app that you use to configure your account and look at your reviews, and the jobs that review your code asynchronously.

What happened

The influx of new users yesterday produced a much larger number of jobs, and those jobs consumed more and more CPU and memory as the number of users and repositories kept increasing.

Until last night, the web app and the jobs were sitting on the same machine, which means that even though the analysis processes run as separate Resque jobs, they did impact the performance of the web app. We were aware of this risk, but until then the impact on the user experience had been reasonable.

This was obviously no longer the case yesterday, as the application responded more and more slowly.

What we did

Around 14:30 UTC, it was becoming clear that small improvements and tweaks would not be enough to recover. We therefore started to move the jobs to a separate machine to restore the availability of the web interface. To speed up the move, around 21:00 UTC we stopped all jobs from running (letting them accumulate in the queue), and the web application was impacted for short periods. The job migration was completed around 23:30 UTC, and we restored most of the services at that time.
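For the curious, here is a toy sketch of what "stopping the jobs while they accumulate in the queue" means. It is not our actual code and does not use Resque; the `ReviewQueue` class and its method names are made up for illustration. It only shows the idea that new jobs are still accepted while the workers are paused, and are processed in order once the workers resume.

```ruby
# Toy model (hypothetical, not Resque): a queue that keeps accepting jobs
# while workers are paused, then drains in FIFO order once they resume.
class ReviewQueue
  attr_reader :pending, :processed

  def initialize
    @pending = []     # jobs waiting to run
    @processed = []   # jobs that have already run
    @paused = false
  end

  def pause!
    @paused = true
  end

  def resume!
    @paused = false
  end

  # Enqueueing always succeeds, even while the workers are paused.
  def enqueue(job)
    @pending << job
  end

  # Run up to `limit` jobs, oldest first. Does nothing while paused.
  # Returns the number of jobs that were run.
  def work(limit)
    return 0 if @paused

    batch = @pending.shift(limit)
    @processed.concat(batch)
    batch.size
  end
end

queue = ReviewQueue.new
queue.pause!                                  # workers stopped for the migration
5.times { |i| queue.enqueue("review-#{i}") }  # reviews keep arriving
queue.work(2)                                 # paused, so nothing runs
queue.resume!                                 # workers back on the new machine
queue.work(2)                                 # the backlog starts to drain
```

Nothing is lost while the workers are paused, but the backlog has to be worked through afterwards, which is why reviews are arriving slowly.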

Both the web interface and the jobs have been running correctly since then, but the queue is still long, so your reviews may arrive slowly. We'll tell you when the queue is back to its normal size.

Next steps

This is not the end of the work on our side, but with the service restored we can now take some time to analyse the situation properly and start improving it (there are several actions we can take to reduce the load of the jobs).

For those of you who want to give PullReview a try, you'll find the service running, although you may have to wait a bit longer to get your reviews.

Again, our apologies for the impact, and be assured that we are working hard to get everything back to normal.

Christophe & Martin
