Diagnosing a sick website getting 500,000 to 1 million views a day

So today I had a customer that had some woes. I don’t even think they were aware they were getting 504’s, but I had to come up with some novel ways to:

A) show them where the failure happened
B) show them the pages that failed to load (i.e. got a 504 gateway timeout)
C) show them the number of requests, and how it changed between the day of the outage and a ‘regular normal’ day
D) show them the specific types of pages which were failing, to give a better idea of where the failure was

In this case a lot of the failures were .html pages, so it could be that a cache was being triggered too much, or it could be that their application was really inefficient. In many cases the failures were catalog search requests, which no doubt would scratch the db pretty nastily if the database or the query wasn’t designed or refactored with scalability in mind.
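For point D, a rough way to see which type of page is failing is to bucket the 50x responses by file extension. This is only a sketch: the function name `failures_by_type` is made up for this post, and it assumes the usual Apache common/combined log layout, where the request path is field 7.

```shell
# Hypothetical helper: count 50x responses per file extension.
# $1 = day pattern such as "26/Mar", $2 = path to the access log.
# Assumes the request path is field 7 (Apache common/combined format).
failures_by_type() {
    grep "$1" "$2" | grep 'HTTP/1.1" 50' | awk '{
        path = $7
        sub(/\?.*/, "", path)              # drop the query string
        n = split(path, seg, "/")
        last = seg[n]                      # final path segment
        if (last ~ /\.[A-Za-z0-9]+$/) {
            sub(/.*\./, ".", last)         # keep only the extension
            kind = last
        } else {
            kind = "(no extension)"        # e.g. /catalogsearch/result/
        }
        count[kind]++
    }
    END { for (k in count) printf "%d %s\n", count[k], k }' | sort -rn
}
```

Run it as `failures_by_type '26/Mar' /var/log/httpd/access_log`. Search-style URLs with no file name end up in the "(no extension)" bucket, which is a crude but quick way to separate them from the .html pages.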

With all that in mind I explained to the customer that even the most worrisome (or woesome) of applications and frameworks, and even the most grisly of expensive MySQL queries, can be combatted simply by a more adaptable or advanced cache mechanism. Anyway, with all of that out of the way, I said to them it’s important to understand the nature of the problem with the application, since in their case they were getting a load average of over 600.

I don’t have their solution. I have the solution to showing them the problem. Enter the sysad, blazing armour, etc etc. Well, that’s the way it’s _supposed_ to happen!

cat /var/log/httpd/access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | wc -l

cat /var/log/httpd/access_log | grep '27/Mar' | grep 'HTTP/1.1" 50' | wc -l

So we can see 504’s weren’t an issue the day before, but how many requests did the site get on each day, comparatively?

[root@anon-WEB1 httpd]# cat access_log | grep '26/Mar' | wc -l
[root@anon-WEB1 httpd]# cat access_log | grep '25/Mar' | wc -l
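If you would rather not run one grep per day, the same comparison can be sketched as a single pass over the log. The function name `day_summary` is made up for this post, and it assumes the standard Apache common/combined layout (timestamp in field 4, status code in field 9). Note that it checks the status field directly, so unlike the greps above it also counts requests that were not HTTP/1.1.

```shell
# Hypothetical helper: total requests and 50x responses per day.
# $1 = path to the access log.
# Assumes field 4 looks like "[26/Mar/yyyy:hh:mm:ss" and field 9 is the status.
day_summary() {
    awk '{
        split($4, t, ":")            # t[1] is "[26/Mar/yyyy"
        day = substr(t[1], 2)        # drop the leading "["
        total[day]++
        if ($9 ~ /^50[0-9]$/) errors[day]++
    }
    END {
        # columns: day, total requests, 50x responses
        for (d in total)
            printf "%s %d %d\n", d, total[d], errors[d]
    }' "$1" | sort
}
```

Run it as `day_summary /var/log/httpd/access_log`. The sort is plain text order, so days line up sensibly within one month but not across months. Dividing the third column by the second gives a rough failure rate per day.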

The box received 25% more traffic, but going by the figures in the SAR, CPU load had gone up 1500% beyond what the 32 cores on their server could do. Crazy. It must be because requests are getting queued, or rather ‘building up’: so many requests are reaching Apache and then hitting MySQL that MySQL forms a bottleneck. It might need more memory or, at this scale, a larger or smaller (probably larger) packet size for the request, which can significantly speed up how fast the memory bucket fills and empties and how fast the request queue drains.

Allowing the queue to build up is going to be a disaster, because it means not just slow queries get a 504 gateway timeout, but also normal requests to regular HTML pages (or even cached pages), since at that stage the CPU is completely overwhelmed.
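One way to see that build-up in the log itself is to count the 50x responses per hour on the day of the outage. Again, this is just a sketch: `errors_per_hour` is a made-up name, and it assumes the hour is the second colon-separated piece of field 4, as in the standard Apache timestamp.

```shell
# Hypothetical helper: 50x responses per hour for one day.
# $1 = day pattern such as "26/Mar", $2 = path to the access log.
# Assumes field 4 looks like "[26/Mar/yyyy:hh:mm:ss".
errors_per_hour() {
    grep "$1" "$2" | grep 'HTTP/1.1" 50' | awk '{
        split($4, t, ":")            # t[2] is the hour, e.g. "10"
        hits[t[2]]++
    }
    END { for (h in hits) printf "%s:00 %d\n", h, hits[h] }' | sort
}
```

Run it as `errors_per_hour '26/Mar' /var/log/httpd/access_log`. Hours with no failures are simply missing from the output, so a sudden run of busy hours stands out.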

I wrote a script. To find the majority of the 504’s for 26 Mar you can use this piece:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk '{print $7}'

To generate a unique list of the pages which failed, for your developer/team, you can run:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk '{print $7}' | sort | uniq
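A small extension of the same pipeline, if the developers want to know which pages failed most often rather than just which ones failed: count each URL and sort by the count. The wrapper name `top_failures` is made up for this post. Everything inside it is the same grep/awk as above plus `uniq -c` and `sort -rn`.

```shell
# Hypothetical helper: rank failing URLs by how often they returned a 50x.
# $1 = day pattern such as "26/Mar", $2 = path to the access log.
top_failures() {
    grep "$1" "$2" | grep 'HTTP/1.1" 50' | awk '{print $7}' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'       # tidy up the padding uniq -c adds
}
```

Run it as `top_failures '26/Mar' /var/log/httpd/access_log | head -20` to get the twenty worst offenders.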

To be honest, somewhere in the simplicity of this post is a stroke of inspiration (if not ingenuity). It’s also kind of hacky and crap, but it does work and it is effective for doing the job.

And that is what counts.