Diagnosing a sick website getting 500,000 to 1 million views a day

So today I had a customer that had some woes. I don’t even think they were aware they were getting 504’s but I had to come up with some novel ways to

A) show them where teh failure happened
B) Show them the failed pages that failed to load (i.e. get a 504 gateway timeout)
C) show them the number of requests and how they changed based on the day of the outage, and a ‘regular normal’ day.
D) show them specific type of pages which are failing to give better idea of where the failure was

In this case a lot of the failures were .html pages, so it could be a cache was being triggered too much, it could be that their application was really inefficient, or in many cases, were catalog search requests which no doubt would scratch the db pretty nastily if the database or the query wasn’t refactored or designed with scalability in mind.

With all that in mind I explained to the customer, even the most worrysome (or woesome) of applications and frameworks, and even the most grizzly of expensive MySQL queries can be combatted, simply by a more adaptable or advanced cache mechanism. Anyway, all of that out of the way, I said to them it’s important to understand the nature of the problem with the application, since in their case were getting a load average of over 600.

I don’t have their solution,. I have the solution to showing them the problem. Enter the sysad, blazing armour, etc etc. Well, thats the way it’s _supposed_ to happen !

cat /var/log/httpd/access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | wc -l
26081

cat /var/log/httpd/access_log | grep '27/Mar' | grep 'HTTP/1.1" 50' | wc -l
2

So we can see 504’s the day before wasn’t an issue, but how many requests did the site get for each day comparatively?

[root@anon-WEB1 httpd]# cat access_log | grep '26/Mar' | wc -l
437598
[root@anon-WEB1 httpd]# cat access_log | grep '25/Mar' | wc -l
339445

The box received 25% more traffic, but even based from the figures in the SAR, cpuload had gone up 1500% beyond what the 32 cores on their server could do. Crazy. It must be because requests are getting queued or rather ‘building up’, and there are so many requests reaching apache, hitting the request for mysql, that either mysql formed a bottleneck and might need more memory, or, at this scale, a larger or smaller (probably larger) sized packet for the request, this can speed up significantly how fast the memory bucket fills and empties, and request queue gets killed. Allowing it to build up is going to be a disaster, because it will mean not just slow queries get a 504 gateway timeout, but also normal requests to regular html pages too (or even cached pages), since at that stage the cpu is completely overwhelmed.

I wrote a script,

to find a majority of the 504’s for the 26 Mar you can use this piece:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk {'print $7'}

to generate a unique list for your developer/team of pages which failed you can run:

cat access_log | grep '26/Mar' | grep 'HTTP/1.1" 50' | awk {'print $7'} | sort | uniq

To be honest. In the simplicity of this post somewhere, is a stroke of inspiration (if not ingenuity). Also it’s kind of hacky and crap, but, it does work and it is effective for doing the job.

AND that is What counts.

Rapid Troubleshooting mail not arriving at its destination

Today I had a customer whose mail was not arriving at it’s destination. IN my customers case, they knew that it was arriving at the destination but going into the spam folder. This is likely due to blacklisting, however some ISP or clients dont know it’s reaching the spam folder at the other end, and since the most common cause of this happening is due to a missing SPF (Sending policy framework record), MX record or DKIM record it is possibly to rapidly check the DNS of each using dig, if the sender domain is known.

To check for the IP Blacklistings on a mailserver use it’s ip in one of the many spam a checker;

https://mxtoolbox.com/SuperTool.aspx?action=blacklist

In other cases it might be being caused by email bounces, due to the PTR, MX or DKIM records, and not even getting into the inbox, you can see that on the sending mail server using a simple grep command;

cat /var/log/maillog | grep -i status=bounced

You probably want to save the file as well

cat /var/log/maillog | grep -i status=bounced > bouncedemail.txt 

If you wanted to know which domains bounced email, if you’ve ensured all sending domains are correctly configured via DNS..

[root@api ~]# cat /var/log/maillog | grep -i status=bounced | awk '{print $7}'
to=<[email protected]>,
to=<[email protected]>,

You could use sed to extract which domains failed..

Then you could use whois against the domains to reach out to the email contact with some automation that explains that ‘weve checked DKIM, MX and SPF and all are configured correctly and believe this is an error on your behalf, etc, blah blah’..

Such a thing would be a good idea to implement for some large email providers, and I’m sure you could automate the DNS checking as well. You could likely automate the whole thing, just by watching the logs and DNS records, and some intelligent grep and awking.

Not bad.

Enabling MySQL Slow Query Logs

In the case your seeing very long pageload times, and have checked your application. It always pays to check the way in which the database performs when interacting with the application, especially if they are on either same or seperate server, as these significantly affect the way that your application will run.


mysql> SET GLOBAL slow_query_log = 'ON' ;
Query OK, 0 rows affected (0.01 sec)

mysql> SET GLOBAL slow_query_log_file = '/slow_query_logs/slow_query_logs.txt';
Query OK, 0 rows affected (0.00 sec)

mysql> SET GLOBAL long_query_time = 5;
Query OK, 0 rows affected (0.00 sec)

Less Ghetto Log Parser for Website Hitcount/Downtime Analysis

Yesterday I created a proof of concept script, which basically goes off and identifies the hitcounts of a website, and can give a technician within a short duration of time (minutes instead of hours) exactly where hitcounts are coming from and where.

This is kind of a tradeoff, between a script that is automated, and one that is flexible.

The end goal is to provide a hitcount vs memory commit metric value. A NEW TYPE OF METRIC! HURRAH! (This is needed by the industry IMO).

And also would be nice to generate graphing and mean, average, and ranges, etc. So can provide output like ‘stat’ tool. Here is how I have progress

#!/bin/bash
#
# Author: 	Adam Bull, Cirrus Infrastructure, Rackspace LTD
# Date: 	March 20 2017
# Use:		This script automates the analysis of webserver logs hitcounts and
# 		provides a breakdown to indicate whether outages are caused by website visits
#		In correlation to memory and load avg figures


# Settings

# What logfile to get stats for
logfile="/var/log/httpd/google.com-access.log"

# What year month and day are we scanning for minute/hour hits
year=2017
month=Mar
day=9

echo "Total HITS: MARCH"
grep "/$month/$year" "$logfile" | wc -l;

# Hours
for i in 0{1..9} {10..24};

do echo "      > 9th March 2017, hits this $i hour";
grep "$day/$month/$year:$i" "$logfile" | wc -l;

        # break down the minutes in a nested visual way thats AWsome

# Minutes
for j in 0{1..9} {10..60};
do echo "                  >>hits at $i:$j";
grep "$day/$month/$year:$i:$j" "$logfile" | wc -l;
done

done

Thing is, after I wrote this, I wasn’t really happy, so I refactored it a bit more;

#!/bin/bash
#
# Author: 	Adam Bull, Cirrus Infrastructure, Rackspace LTD
# Date: 	March 20 2017
# Use:		This script automates the analysis of webserver logs hitcounts and
# 		provides a breakdown to indicate whether outages are caused by website visits
#		In correlation to memory and load avg figures


# Settings

# What logfile to get stats for
logfile="/var/log/httpd/someweb.biz-access.log"

# What year month and day are we scanning for minute/hour hits
year=2017
month=Mar
day=9

echo "Total HITS: $month"
grep "/$month/$year" "$logfile" | wc -l;

# Hours
for i in 0{1..9} {10..24};

do
hitsperhour=$(grep "$day/$month/$year:$i" "$logfile" | wc -l;);
echo "    > $day $month $year, hits this $ith hour: $hitsperhour"

        # break down the minutes in a nested visual way thats AWsome

# Minutes
for j in 0{1..9} {10..59};
do
hitsperminute=$(grep "$day/$month/$year:$i:$j" "$logfile" | wc -l);
echo "                  >>hits at $i:$j  $hitsperminute";
done

done

Now it’s pretty leet.. well, simple. but functional. Here is what the output of the more nicely refined script; I’m really satisfied with the tabulation.

[root@822616-db1 automation]# ./list-visits.sh
Total HITS: Mar
6019301
    > 9 Mar 2017, hits this  hour: 28793
                  >>hits at 01:01  416
                  >>hits at 01:02  380
                  >>hits at 01:03  417
                  >>hits at 01:04  408
                  >>hits at 01:05  385
^C

Ghetto but simple Log Parser for testing website performance

So… I got fedup with constantly writing my own stuff for basic things. I’m going to turn this into something more spectacular that accepts commandline input, and also, allows you to define which days, and months, ranges, and stuff like that.

It’s a no-frills-ghetto log parser.

#!/bin/bash

echo "Total HITS: MARCH"
grep "/Mar/2017" /var/log/httpd/somewebsite.com-access_log | wc -l;

for i in 0{1..9} {10..24};

do echo "      > 9th March 2017, hits this $i hour";
grep "09/Mar/2017:$i" /var/log/httpd/somesite.com-access_log | wc -l;

        # break down the minutes in a nested visual way thats AWsome
for j in 0{1..9} {10..60};
do echo "                  >>hits at $i:$j";
grep "09/Mar/2017:$i:$j" /var/log/httpd/somesite.com-access_log | wc -l;
done

done

It’s not perfect, it’s just a proof of concept, really.

PHP5 newrelic agent not collecting data

So today I had a newrelic customer who was having issues after installing the newrelic php plugin. He couldn’t understand why it wasn’t collecting data. For it to collecting data you need to make sure newrelic-daemon process is running by using ps auxfwww | grep newrelic-daemon.

We check the process of the daemon is running

[root@rtd-production-1 ~]# ps -ef | grep newrelic-daemon
root     26007 18914  0 09:59 pts/0    00:00:00 grep newrelic-daemon

We check the status of the daemon process

[root@rtd-production-1 ~]# service newrelic-daemon status
newrelic-daemon is stopped...

Copy basic NewRelic configuration template to correct location

[root@rtd-production-1 ~]# cp /etc/newrelic/newrelic.cfg.template /etc/newrelic/newrelic.cfg

Start the daemon

[root@rtd-production-1 ~]# service newrelic-daemon start
Starting newrelic-daemon:                                  [  OK  ]

Site keeps on going down because of spiders

So a Rackspace customer was consistently having an issue with their site going down, even after the number of workers were increased. It looked like in this customers case they were being hit really hard by yahoo slurp, google bot, a href bot, and many many others.

So I checked the hour the customer was affected, and found that over that hour just yahoo slurp and google bot accounted for 415 of the requests. This made up like 25% of all the requests to the site so it was certainly a possibility the max workers were being reached due to spikes in traffic from bots, in parallel with potential spikes in usual visitors.

[root@www logs]#  grep '01/Mar/2017:10:' access_log | egrep -i 'www.google.com/bot.html|http://help.yahoo.com/help/us/ysearch/slurp' |  wc -l
415

It wasn’t a complete theory, but was the best with all the available information I had, since everything else had been checked. The only thing that remains is the number of retransmits for that machine. All in all it was a victory, and this was so awesome, I’m now thinking of making a tool that will do this in more automated way.

I don’t know if this is the best way to find google bot and yahoo bot spiders, but it seems like a good method to start.

Count number of IP’s over a given time period in Apache Log

So, a customer had an outage, and wasn’t sure what caused it. It looked like some IP’s were hammering the site, so I wrote this quite one liner just to sort the IP’s numerically, so that uniq -c can count the duplicate requests, this way we can count exactly how many times a given IP makes a request in any given minute or hour:

Any given minute

# grep '24/Feb/2017:10:03' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any given hour

# grep '24/Feb/2017:10:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given day

# grep '24/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given Month

# grep '/Feb/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any Given Year

# grep '/2017:' /var/www/html/website.com/access.log | awk '{print $1}' | sort -k2nr | uniq -c

Any given year might cause dupes though, and I’m sure there is a better way of doing that which is more specific

Checking Webserver Logs and generating hit counts

So, I’ve been meaning to do this for a while. I’m sure your all familiar with this crappy oneliner where you’ll check the hits using wc -l against a webserver log for a given hour and use for i in seq or similar to get the last 10 minutes data, or the last 59 minutes. But getting hours and days is a bit harder. You need some additional nesting, and, it isn’t difficult for you to do at all!

for i in 0{1..9}; do echo "16/Jan/2017:06:$i"; grep "16/Jan/2017:06:$i" /var/log/httpd/soemsite.com-access_log | wc -l ; done

I improved this one, quite significantly by adding an additional for j, for the hour, and adding an additional 0 in front of {1..9}, this properly is matching the Apache2 log format and allows me to increment through all the hours of the day. to present ;-D All that is missing is some error checking when the last date in the file is, im thinking a tail and grep for the timecode from the log should be sufficient.

Here is the proud oneliner I made for this!

for j in 0{1..9}; do for i in 0{1..9} {10..59}; do echo "16/Jan/2017:$j:$i"; grep "16/Jan/2017:06:$i" /var/log/httpd/website.com-access_log | wc -l ; done; done

A Unique Situation for grep (finding the files with content matching a specific pattern Linux)

This article explains how to find all the files that have a specific text or pattern within them, this is the article you’ve been looking for!

So today, I was dealing with a customers server where he had tried to configure BASIC AUTH. I’d found the httpd.conf file for the specific site, but I couldn’t see which file had basic auth setup as wrong. To save me looking through hundreds of configurations (and also to save YOU from looking through hundreds of configuration files) for this specific pattern. Why not use grep to recursively search files for the pattern, and why not use -n to give the filename and line number of files which have text in that match this pattern.

I really enjoyed this oneliner, and been meaning to work to put something like this together, because this kind of issue comes up a lot, and this can save a lot of time!

 grep -rnw '/' -e "PermitRootLogin"

# OUTPUT looks like

/usr/share/vim/vim74/syntax/sshdconfig.vim:157:syn keyword sshdconfigKeyword PermitRootLogin
/usr/share/doc/openssh-5.3p1/README.platform:37:instead the PermitRootLogin setting in sshd_config is used.

The above searches recursively all files in the root filesystem ‘/’ looking for PermitRootLogin.

I wanted to find which .htaccess file was responsible so I ran;

# grep -rnw '/' -e "/path/to/.htpasswd'

# OUTPUT looks like
/var/www/vhosts/somesite.com/.htaccess:14:AuthUserFile /path/to/.htpasswd