Harmony in the Cloud
This post highlights how awesome cloud computing really is, and the impact it can have on your business depending on whether you get it right or wrong. I’ll cover both with real examples from my CPU graphs.
Getting it wrong, or not quite perfect
I’m sure we all know that getting it wrong can hurt your business. If you get it not-quite-perfect, your users may not even notice until traffic exceeds your expectations and the cracks start to appear.
As you can see from this graph, the CPU usage wasn’t ideal, but the site ran perfectly fine for a long time ... until around the start of this month, when it stopped doing so.
So what happened? Well, a lot of things. First, we started to implement some aggressive marketing strategies that paid off more than we expected, so I added more servers to dull the load, which is where you see the usage drop initially. Then we came up with a really good idea for some content to add to the site. I set about making it, uploaded it a few days later, and we started to promote the crap out of it. That’s when the cracks started to appear.
Now, it must be said that I run my site lean … very lean. It costs me $538.43 a month to serve over 20 million dynamic page views, plus an ungodly number of AJAX and asset requests. So narrowing these problems down is necessary not only to keep my users happy and my response times low, but also to keep my costs low.
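To put that figure in perspective, here is a quick back-of-the-envelope calculation using only the two numbers quoted above. Since the page-view count is "over 20 million", the real cost per view is a little lower than what this prints, and AJAX and asset requests are not counted at all.

```python
# Figures quoted in the post. The page-view count is a lower bound
# ("over 20 million"), so the true per-view cost is slightly lower.
monthly_cost_usd = 538.43
monthly_page_views = 20_000_000

cost_per_view = monthly_cost_usd / monthly_page_views
cost_per_thousand = cost_per_view * 1000

print(f"${cost_per_view:.7f} per dynamic page view")
print(f"${cost_per_thousand:.4f} per 1,000 dynamic page views")
```

That works out to roughly 2.7 cents per thousand dynamic page views, which is why every extra server matters at this scale.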
First, I didn’t do the math properly and we were taking on more requests than we could handle, so I threw some more servers at it. For 0.03c an hour, who cares, right? Wrong. That wasn’t the issue. After realising this, I sat down to figure out what the real issues were. As you can see, the servers were spending time in iowait and softirq, so that’s what I needed to solve.
This is usually due to network traffic or disk issues. My servers run on SSDs, so they weren’t waiting on disk, which I confirmed with iostat. The SQL network was taking a beating, so I narrowed down the causes for that and plugged those holes with some front-end caching. This changed my network graphs from mimicking an oscilloscope measuring heavy metal to measuring the purr of a kitten or a well-oiled tractor. But the problem was still there ...
The problem with the iowait times ended up being a random error with the database. It was doing something odd, and to this day I’m not entirely sure what, because I only discovered it by fluke and it hasn’t come back since. The SQL queries were taking ages to execute, despite the disk I/O, CPU and network being fine. A quick restart of the server made all the iowait times disappear.
So, the problem turned out to be random, but it did highlight some flaws which I was able to patch.
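If you want to watch iowait and softirq yourself without a graphing tool, the numbers come from the aggregate `cpu` line in `/proc/stat` on Linux. The following is a minimal sketch under that assumption, not the tooling behind the graphs in this post. The function names (`parse_cpu_line`, `cpu_percentages`, `sample`) are made up for illustration.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of
# /proc/stat on Linux. Values are cumulative ticks since boot.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu  ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Pad with zeros in case a kernel reports fewer columns.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Share of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as handle:
        return handle.readline()


def sample(interval=1.0):
    """Take two readings `interval` seconds apart and return percentages."""
    before = parse_cpu_line(read_cpu_line())
    time.sleep(interval)
    after = parse_cpu_line(read_cpu_line())
    return cpu_percentages(before, after)
```

Calling `sample()` on a Linux box returns a dict whose `"iowait"` and `"softirq"` entries are the percentages to keep an eye on. Two readings are needed because the counters only ever grow, so a single reading says nothing about what is happening right now.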
Getting it right
After this happened I decided to get a little more serious about the server scaling. Originally I had the day broken up into 3 parts, which I considered to be off-peak, heavy load and peak load. At the start of each period I would scale the number of servers up or down to suit that load. I changed this dramatically.
I busted out Python and broke the day up into 24 one-hour sections, then I opened analytics and looked at the traffic graph for a few days. I defined a minimum and a maximum number of servers that I want to run for off-peak and peak times, based on my average usage plus some extra allowance for burst loads from social media linking and media campaigns. Then I went through hour by hour and assigned a number to each hour, representing the number of web processors that would be online to handle the traffic.
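A minimal sketch of what such an hourly table might look like in Python. The post doesn’t give the actual server counts, so `MIN_SERVERS`, `MAX_SERVERS` and every value in `HOURLY_SERVERS` are made-up illustrations, and the helper names are hypothetical too.

```python
from datetime import datetime

# Hypothetical bounds; the real numbers would come from your own analytics.
MIN_SERVERS = 2
MAX_SERVERS = 10

# One entry per hour of the day (index 0 covers 00:00-00:59).
# Illustrative values only.
HOURLY_SERVERS = [
    2, 2, 2, 2, 2, 2,        # 00-05: off-peak
    4, 6, 8, 8, 8, 8,        # 06-11: ramping up
    10, 10, 10, 10, 8, 8,    # 12-17: peak
    8, 6, 6, 4, 4, 2,        # 18-23: winding down
]


def validate_schedule(schedule, minimum, maximum):
    """Check there is one value per hour and each stays within the bounds."""
    if len(schedule) != 24:
        raise ValueError("schedule needs exactly 24 hourly entries")
    for hour, count in enumerate(schedule):
        if not minimum <= count <= maximum:
            raise ValueError(f"hour {hour}: {count} servers is out of bounds")


def servers_for(moment, schedule=HOURLY_SERVERS):
    """Number of web servers that should be online at the given datetime."""
    return schedule[moment.hour]


validate_schedule(HOURLY_SERVERS, MIN_SERVERS, MAX_SERVERS)
```

A scheduled job could call `servers_for(datetime.now())` once an hour and ask the cloud provider for that many instances. How that request is made depends on the provider, so it is left out here.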
The idea is that, rather than abruptly jumping from X to Y and then back to X at fixed times, I gently add or remove servers in increments of 2, which should better distribute the stress and give a smoother ride.
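The stepping itself is simple enough to sketch. This is one way to do it under stated assumptions, not necessarily how my scripts do it: `ramp_steps` is a hypothetical helper that lists the intermediate server counts between the current size and the target, moving at most 2 at a time.

```python
def ramp_steps(current, target, step=2):
    """List the server counts to pass through on the way to `target`.

    Moves at most `step` servers per adjustment, in either direction,
    and never overshoots the target.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    counts = []
    while current != target:
        if current < target:
            current = min(current + step, target)
        else:
            current = max(current - step, target)
        counts.append(current)
    return counts


print(ramp_steps(2, 8))   # scaling up from 2 to 8 servers
print(ramp_steps(8, 3))   # scaling down from 8 to 3 servers
```

How far apart in time the adjustments happen is a separate decision. Spacing them a few minutes apart gives the load balancer time to spread traffic across the new machines before the next pair arrives.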
As you can see from the graph above, this is a typical day for me now. No iowait, very little softirq and a fairly flat, consistent CPU range. If you didn’t know better, you’d have no idea that the cluster this server is part of handles 18-30x as much traffic during peak load as it does in off-peak time, or that every 3-4 hours it receives surges of traffic from promoted links on our Facebook page.
This is harmony in the cloud!