I thought that some people might be interested to know a bit more about our recent move and the background events that led us to the point where a move was required. This two-hour outage was probably a year in the making, and by the end, it was long overdue. That said, the actual event went quite smoothly, with only a few hiccups – I will explain in detail later.
Early in 2009, the site started to suffer stability issues and had a few outages. Our hosting provider suffered several power failures during the year, each of which brought the site down for hours. Attempts were made to remedy the stability problems by acquiring new servers and giving the system a bit of a tune-up. This worked to a degree, but did not solve the root of the problem: the site was getting too big for the model it had been built in, and the hardware was aging. A new home had to be found.
The worst thing about aging hardware is the unpredictable nature of its failures. For long periods of time, the servers would be flawless, diligently beavering away, and we’d start working on some new feature – and then all of a sudden we would get reboots, poor performance or lockups (sometimes at midnight!). It would then be all hands on deck trying to right the ship, diagnosing the fault and preparing for the next time. Frustratingly, it was often unclear what the root cause of the problem was, so we would be left with a lingering feeling that the gremlins could return at any time.
We took the opportunity to do some re-architecting of the system, and I’ll elaborate on that in a subsequent post, but the core of the site was what we moved yesterday. There was about one month of preparation in the lead-up to the big day, most of which was handled by the Engine Yard team. This was one of the benefits of the new arrangement: we went from a team of developers dabbling in system administration to having a team of system administrators alongside us.
So, how did the move actually go? Well, pretty well, but not without a few hiccups. The most noticeable was that some images didn’t show up immediately, which stumped us initially. It turned out that in the past a request had been made to our Content Delivery Network (CDN) provider to find us by direct IP address. So of course, when we turned on the new site and switched the domain over, the CDN servers were still knocking on the door of our old home. Since we had turned the old site off, they received no answer, hence the broken images. Once we got the CDN change in place we had to grin and bear it while it slowly propagated through their network of global servers over the next hour – you may have seen some funny behaviour with images for a time. Nonetheless, I classify it as a minor issue and was happy with the way the servers held the traffic (which hit the site with full impact almost as soon as we came back online).
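For anyone planning a similar move, the kind of check that would have caught this ahead of time is easy to sketch. The snippet below is only an illustration, not something we ran: the `origins` mapping, its keys and the addresses in it are made up, and a real CDN provider will expose its origin settings in its own way. The idea is simply to flag any origin that is pinned to a raw IP address instead of a hostname, since those are the ones that will not follow a domain switch.

```python
import ipaddress


def is_ip_literal(origin: str) -> bool:
    """Return True if the origin is a bare IPv4/IPv6 address, not a hostname."""
    try:
        # Strip optional brackets so "[2001:db8::1]" is treated as IPv6.
        ipaddress.ip_address(origin.strip("[]"))
        return True
    except ValueError:
        return False


def pinned_origins(origins: dict[str, str]) -> list[str]:
    """List the names of CDN settings whose origin is pinned to an IP address."""
    return sorted(name for name, origin in origins.items() if is_ip_literal(origin))


# Hypothetical settings, for illustration only (documentation-range addresses).
origins = {
    "images": "203.0.113.10",      # pinned to an IP: will not follow a DNS change
    "static": "www.example.com",   # hostname: follows the domain when it moves
}

print(pinned_origins(origins))
```

Anything this reports would need to be changed with the CDN provider before the old servers are switched off, allowing time for the change to propagate.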
One side effect of the move is that the servers are now a different distance from every user. So users in Australia will notice a bit more lag, while those in the US and Europe will find the site snappier. Unfortunately, we cannot make everyone’s experience the same.