Testing Rackspace Cloud-server Service-net Connectivity and creating an alarm

So, the last few weeks my colleagues and myself have been noticing that there has been a couple of issues with the cloud-servers servicenet interface. Unfortunately for some customers utilizing dbaas instances this means that their cloud-server will be unable to communicate, often, with their database backend.

The solution is a custom monitoring script that my colleague Marcin has kindly put together for another customer of his own.

The python script that goes on the server:

Create file:

vi /usr/lib/rackspace-monitoring-agent/plugins/servicenet.sh

Paste into file:

#!/bin/bash
#
ping="/usr/bin/ping -W 1 -c 1 -I eth1 -q"

if [ -z $1 ];then
   echo -e "status CRITICAL\nmetric ping_check uint32 1"
   exit 1
else
   $ping $1 &>/dev/null
   if [ "$?" -eq 0 ]; then
      echo -e "status OK\nmetric ping_check uint32 0"
      exit 0
   else
      echo -e "status CRITICAL\nmetric ping_check uint32 1"
      exit 1
   fi
fi

Create an alarm that utilizes the below metric

if (metric["ping_check"] == 1) {
    return new AlarmStatus(CRITICAL, 'what?');
}
if (metric["ping_check"] == 0) {
    return new AlarmStatus(OK, 'eee?');
}

Of course for this to work the primary requirement is a Rackspace Cloud-server and an installation of Rackspace Cloud Monitoring installed on the server already.

Thanks again Marcin, for this golden nugget.

Grabbing network activity from server without network utility

So, is it possible to look at a network interfaces activity without bwm-ng, iptraf, or other tools? Yes.

while true do
RX1=`cat /sys/class/net/${INTERFACE}/statistics/rx_bytes`
TX1=`cat /sys/class/net/${INTERFACE}/statistics/tx_bytes`
DOWN=$(($RX1-$RX2))
UP=$(($TX1-$TX2))
DOWN_Bits=$(($DOWN * 8 ))
UP_Bits=$(($UP * 8 ))
DOWNmbps=$(( $DOWN_Bits >> 20 ))
UPmbps=$(($UP_Bits >> 20 ))
echo -e "RX:${DOWN}\tTX:${UP} B/s | RX:${DOWNmbps}\tTX:${UPmbps} Mb/s"
RX2=$RX1; TX2=$TX1
sleep 1; done

I found this little gem yesterday, but couldn’t understand why they had not used clear. I guess they wanted to log activity or something… still this was a really nice find. I can’t remember where I found it yesterday but googling part of it should lead you to the original source 😀

Using TCP to ping test your Cloud server connectivity

So, you have probably heard that there are a variety of reasons why you shouldn’t use ICMP to test your service is operating normally. Mainly because of the way that ICMP is handled by routers. If you really want a representative view of the way that TCP packets, such as HTTP and HTTPS are performing in terms of packet loss (that is to say packets which do not arrive at their destination) , then hping is your friend.

You might be pinging a cloud-server that is not replying. You might think it’s down. But what if the firewall is simply dropping ICMP echo requests coming in on that port? Indeed.

Enter hping.

# hping -S -p 80 google.com
HPING google.com (eth0 74.125.136.102): S set, 40 headers + 0 data bytes
len=46 ip=74.125.136.102 ttl=46 id=23970 sport=80 flags=SA seq=0 win=42780 rtt=13.8 ms
len=46 ip=74.125.136.102 ttl=47 id=37443 sport=80 flags=SA seq=1 win=42780 rtt=12.6 ms
len=46 ip=74.125.136.102 ttl=47 id=43654 sport=80 flags=SA seq=2 win=42780 rtt=12.0 ms
len=46 ip=74.125.136.102 ttl=47 id=37877 sport=80 flags=SA seq=3 win=42780 rtt=11.4 ms
len=46 ip=74.125.136.102 ttl=47 id=62433 sport=80 flags=SA seq=4 win=42780 rtt=13.3 ms
^C
--- google.com hping statistic ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 11.4/12.6/13.8 ms

In this case I tested with google.com. I’m actually surprised that more people don’t use hping, because, hping is awesome. It also makes quite a decent port scanner, were it not for the fact that the machine I tried to test that feature with buffer overflowed 😉 It’s a nice way to test a firewalled box, but more than that, it’s a more reliable test in my opinion.

Using Nova and Supernova to manage Firewall IP access lists, automation & more

So, a customer today reached out to us asking if Rackspace provided the entire infrastructure IP address ranges in use on cloud. The answer is, no. However, that doesn’t mean that making your firewall rules, or autoscale automation need to be painful.

In fact, Rackspace Cloud utilizes Openstack which fully supports API calls which will easily be able to provide this detail in just a few simple short steps. To do this you require nova to be installed, this is really relatively easy to install, and instructions for installing it can be found here;

https://support.rackspace.com/how-to/installing-python-novaclient-on-linux-and-mac-os/

Once you have installed nova, it’s simply a case of making sure you set these 4 lines correctly in your .bash_profile

OS_USERNAME=mycloudusernamegoeshere
OS_TENANT_NAME=yourrackspaceaccountnumbergoeshereusuallysomethinglike1010101010
OS_AUTH_SYSTEM=rackspace
OS_PASSWORD=apikeygoeshere
OS_AUTH_URL=https://identity.api.rackspacecloud.com/v2.0/
OS_REGION_NAME=LON
OS_NO_CACHE=1
export OS_USERNAME OS_TENANT_NAME OS_AUTH_SYSTEM OS_PASSWORD OS_AUTH_URL OS_REGION_NAME OS_NO_CACHE

OS_USERNAME is your mycloud login username (normally the primary user).
OS_TENANT_NAME is your Customer ID, it’s the number that appears in the URL of your control panel link, see below picture for illustration

Screen Shot 2016-08-10 at 2.45.05 PM

OS_PASSWORD is a bit misleading, this is actually where your apikey goes , but I think it’s possible to authenticate using your control panel password too, don’t do that for security reasons.

OS_REGION_NAME is pretty self explanatory, this is simply the region that you would like to list cloud-server IP’s in or rather, the region that you wish to perform NOVA API calls.

Making the API call using the nova API wrapper

# supernova lon list --tenant 100010101 --fields accessIPv4,name
[SUPERNOVA] Running nova against lon...
+--------------------------------------+-----------------+-----------+
| ID                                   | accessIPv4      | Name      |
+--------------------------------------+-----------------+-----------+
| 7e5a7f99-60ae-4c28-b2b8              | 1.1.1.1  |  xapp      |
| 94747603-812d-4594-850b              | 1.1.1.1  |   rabbit2   |
| d5b318aa-0fa2-4269-ae00              | 1.1.1.1  |   elastic5  |
| 6c1d8d33-ae5e-44be-b9f0              | 1.1.1.1  | | elastic6  |
| 9f79a7dc-fd19-4f8f-9c26              |1.1.1.1   | | elastic3  |
| 05b1c52b-6ced-4db0-8af2              | 11.1.1.1 | | elastic1  |
| c8302366-f2f9-4c36-8f7a              | 1.1.1.1  | | app5      |
| b159cd07-8e68-49bc-83ee              | 1.1.1.1  | | app6      |
| f1f31eef-97c6-4c68-b01a              | 1.1.1.1  | | ruby1     |
| 64b7f0fd-8f2f-4d5f-8f89              | 1.1.1.1  | | build3    |
| e320c051-b5cf-473a-9f96              | 1.1.1.1  |   mysql2    |
| 4fddd022-59a8-4502-bf6e              | 1.1.1.1  | | mysql1    |
| c9ad6951-f5f9-4351-b31d              | 1.1.1.1  | | worker2   |
+--------------------------------------+-----------------+-----------+

This is pretty useful for managing autoscale permissions if you need to make sure your corporate network can be connected to from your cloud-servers when new cloud-servers with new IP are built out. considerations like this are really important when putting together a solution. The nice thing is the tools are really quite simple and flexible. If I wanted I could have pulled out detail for servicenet instead. I hope this helps make some folks lives a bit easier and works to demystify API to others that haven’t had the opportunity to use it.

You are probably wondering though, what field names can I use? a nova show will reveal this against one of your server UUID’s

# supernova lon show someuuidgoeshere
+-------------------------------------+------------------------------------------------------------------+
| Property                            | Value                                                            |
+-------------------------------------+------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                           |
| OS-EXT-SRV-ATTR:host                | censored                                                   |
| OS-EXT-SRV-ATTR:hypervisor_hostname | censored                                                 |
| OS-EXT-SRV-ATTR:instance_name       | instance-734834278-sdfdsfds-                   |
| OS-EXT-STS:power_state              | 1                                                                |
| OS-EXT-STS:task_state               | -                                                                |
| OS-EXT-STS:vm_state                 | active                                                           |
| censorednet network                 | censored                                                     |
| accessIPv4                          | censored                                                 |
| accessIPv6                          | censored                      |
| created                             | 2015-12-11T14:12:08Z                                             |
| flavor                              | 15 GB I/O v1 (io1-15)                                            |
| hostId                              | 860...         |
| id                                  | 9f79a7dc-fd19-4f8f-9c26-72a335ed2be8                             |
| image                               | Debian 8 (Jessie) (PVHVM) (cf16c435-7bed-4dc3-b76e-57b09987866d) |
| metadata                            | {"build_config": "", "rax_service_level_automation": "Complete"} |
| name                                | elastic3                                                         |
| private network                     |                                                 |
| progress                            | 100                                                              |
| public network                      |          |
| status                              | ACTIVE                                                           |
| tenant_id                           |                                                    |
| updated                             | 2016-02-27T09:30:20Z                                             |
| user_id                             |                             |
+-------------------------------------+------------------------------------------------------------------+

I censored some of the fields.. but you can see all of the column names, so if you wanted to see metadata and progress only, with the server uuid and server name.



nova list --fields name, metadata, progress

This could be pretty handy for detecting when a process has finished building, or detecting once automation has completed. The possibilities with API are quite endless. API is certainly the future, and, there is no reason why, in the future, people won't be building and deploying websites thru API only, and some sophisticated UI wrapper like NOVA.

Admittedly, this is very far away, but that should be what the future technology will be made of, stuff like LAMBDA, serverless architecture, will be the future.

Resizing a SD or USB card partition for Retropie Raspberry Pi Arcade on Linux

So, I had a friend who had recently bought his Raspberry Pi 3 and wanted to run retropie on it like I have been with my arcade cabinet.

13682010_1735545403352472_192137128_o

The problem was the sandisk 64GB disk he had bought had some few less sectors on the disk, which meant my image was just a few bytes too big. What a bummer!

13599828_10153775400798481_3462834024739541253_n

So I used this great tool by sirlagz to fix that.

#!/bin/bash
# Automatic Image file resizer
# Written by SirLagz
strImgFile=$1


if [[ ! $(whoami) =~ "root" ]]; then
echo ""
echo "**********************************"
echo "*** This should be run as root ***"
echo "**********************************"
echo ""
exit
fi

if [[ -z $1 ]]; then
echo "Usage: ./autosizer.sh "
exit
fi

if [[ ! -e $1 || ! $(file $1) =~ "x86" ]]; then
echo "Error : Not an image file, or file doesn't exist"
exit
fi

partinfo=`parted -m $1 unit B print`
partnumber=`echo "$partinfo" | grep ext4 | awk -F: ' { print $1 } '`
partstart=`echo "$partinfo" | grep ext4 | awk -F: ' { print substr($2,0,length($2)-1) } '`
loopback=`losetup -f --show -o $partstart $1`
e2fsck -f $loopback
minsize=`resize2fs -P $loopback | awk -F': ' ' { print $2 } '`
minsize=`echo $minsize+1000 | bc`
resize2fs -p $loopback $minsize
sleep 1
losetup -d $loopback
partnewsize=`echo "$minsize * 4096" | bc`
newpartend=`echo "$partstart + $partnewsize" | bc`
part1=`parted $1 rm 2`
part2=`parted $1 unit B mkpart primary $partstart $newpartend`
endresult=`parted -m $1 unit B print free | tail -1 | awk -F: ' { print substr($2,0,length($2)-1) } '`
truncate -s $endresult $1

It was a nice solution to my friends problem… the only problem now is the working image for my pi is not working with audio for him, and for some reason when he comes out of hte game and goes back to emulation station he loses the joystick input controller. That is kind of bizarre.

Does anyone know what could cause those secondary issues? I’m a bit stumped on this one.

Preventing /etc/resolv.conf reset on startup/boot

Today, a customer approached us after a Host Server Down complaining that, although the server is up again their website and application were down & not working. Even though the server was online and functioning correctly.

The customer discovered that the source of the issue was that there /etc/resolv.conf is blank, this means that they will not be able to resolve DNS A/PTR/CNAME record hostnames into a resolved IP. This is called hostname to IP resolution. Its means that if /etc/resolv.conf is blank and the customer uses hostnames in their calls, such a failure will break the connectivity due to failure to resolve to IP to communicate on the TCP stack.

There is actually a very simple way to prevent the /etc/resolv.conf file from being changed. But first, it’s important to understand why /etc/resolv.conf is being reset.

On All Rackspace cloud-servers there is a process called nova-agent, and when the server starts up, the /etc/resolv.conf file will be reset along with the networking configuration. This happens each time your server is restarted and is used to set new networking details, specifically if you take an image and build server on a new ip address or if your server is live-migrated to a new host, it makes sure on the next reboot it comes up with correct networking detail transparently. However this can cause some issues, such as in this case with the /etc/resolv.conf file. Fortunately there are some novel ways of preventing your /etc/resolv.conf being modified after you have added the correct nameservers you desire to it.

You can use the chattr immutable file setting to stop processes from modifying it after you have made the changes to your resolv.conf that are desired;

Set Immutable File

chattr +i /etc/resolv.conf 

Un-set Immutable File

chattr -i /etc/resolv.conf 

This /etc/resolv.conf issue is a common problem, however using immutable file flag and chattr should prevent it from being changed ever again.

Simple way to perform a body check on a website

So, I was testing with curl today and I know that it’s possible to direct to /dev/null to suppress the page. But that’s not very handy if you are checking whether html page loads, so I came up with some better body checks to use.

A Basic body check using wc -l to count the lines of the site

 time curl https://www.google.com/ > 1; echo "non zero indicates server up and served content of n lines"; cat 1 | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  167k    0  167k    0     0  79771      0 --:--:--  0:00:02 --:--:-- 79756

real	0m2.162s
user	0m0.042s
sys	0m0.126s
non zero indicates server up and served content of n lines
2134

A body check for Google analytics

$ time curl https://www.groundworkjobs.com/ > 1; echo "Checking for google analytics html elements string"; cat 1 | grep "www.google-analytics.com/analytics.js"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  167k    0  167k    0     0  76143      0 --:--:--  0:00:02 --:--:-- 76152

real	0m2.265s
user	0m0.042s
sys	0m0.133s
Checking for google analytics html elements string
				})(window,document,'script','//www.google-analytics.com/analytics.js','ga');

Such commands might be useful when troubleshooting a cluster for instance, where one server shows more up to date versions, (different number of lines). There’s probably better way to do this with ls and awk and use the html filesize, since number of lines wouldn’t be so accurate.

Check Filesize from request

$ time curl https://www.groundworkjobs.com/ > 1; var=$(ls -al 1 | awk '{print $5}') ; echo "Page size is: $var kB"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  167k    0  167k    0     0  79467      0 --:--:--  0:00:02 --:--:-- 79461

real	0m2.170s
user	0m0.048s
sys	0m0.111s
Page size is: 171876 kB

Pretty simple.. but you could take the oneliner even further… populate a variable called $var with the filesize using ls and awk , and then use an if statement to check that var is not 0, indicating the page is answering positively, or alternatively not answering at all.

Check Filesize and populate a variable with the filesize, then validate variable

$ time curl https://www.groundworkjobs.com/ > 1; var=$(ls -al 1 | awk '{print $5}') ; echo "Page size is: $var kB"; if [ "$var" -gt 0 ] ; then echo "The filesize was greater than 0, which indicates box is up but may be giving an error page"; fi
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  167k    0  167k    0     0  78915      0 --:--:--  0:00:02 --:--:-- 78950

real	0m2.185s
user	0m0.041s
sys	0m0.132s
Page size is: 171876 kB
The filesize was greater than 0, which indicates box is up but may be giving an error page

The second exercise is not particularly useful or practical as a means of testing, since if the site was timing out the script would take ages to reply and make the whole test pointless, but as a learning exercise being able to assemble one liners on the fly like this is an enjoyable, rewarding and useful investment of time and effort. Understanding such things are the fundamentals of automating tasks. In this case with output filtering, variable creation, and subsequent validation logic. It’s a simple test, but the concept is exactly the same for any advanced automation procedure too.

Discovering siteurl variable for WordPress

So I recently read a little piece by one of my colleagues about this. It comes up fairly frequently so it’s worth mentioning. It’s possible to determine the address that wordpress is using as the siteurl by directly querying the database or looking for the value in the sql dump.

Database changed
mysql> SELECT option_name,option_value FROM wp_options WHERE option_name='siteurl';
+-------------+---------------------------------------+
| option_name | option_value                          |
+-------------+---------------------------------------+
| siteurl     | http://mywordpresssite.com/ |
+-------------+---------------------------------------+
1 row in set (0.00 sec)

It turns out there is a second ‘home’ page variable in the database:

mysql> SELECT option_name,option_value FROM wp_options WHERE option_name='home';
+-------------+---------------------------------------+
| option_name | option_value                          |
+-------------+---------------------------------------+
| home        | http://mywordpresssite.com/test |
+-------------+---------------------------------------+
1 row in set (0.00 sec)

I’m not 100% on the difference between ‘siteurl’ and ‘home’, but guessing the siteurl is the tld definition of the domain, and home is the default landing page for requests to that TLD. As I understand it anyway, I am sure someone will correct me if this isn’t completely correct.

Analyse the process ID’s and check their legitimacy

 for pid in $(ps aux | grep -v '\[' | grep -v grep | awk '{print $2}' | grep -v PID); do SERVICE=$(ps aux | grep -v grep | grep " $pid " | awk '{print $11}' | egrep -v 'nimbus|delloma' | tr -d '-' | tr -d ':'); [ "X$SERVICE" != "X" ] && ls -lh /proc/$pid | grep ' exe ' | tr -d '-' | grep -v $SERVICE >/dev/null 2>&1 && echo "$pid should be $SERVICE but it is actually $(ls -lh /proc/$pid | grep ' exe ' | awk '{print $11}')"; done

Check the netstat detail

netstat -np | awk '{print $7}' | awk -F/ '{count[$2]++}END{for(j in count) print count[j],j}' | sort -nr

Full process list and commands

ps auxfwww