Checking if your Cloud Server VM run out of memory after it has happened

Posted on 23rd August 2016 by azio

Thanks to my colleague Jan, for this oneliner, it is possible to check if your cloud-server runs out of memory historically;

find /var/log -maxdepth 1 -type f -mtime -3 -exec zgrep -i -E "oom|killed" {} \;

Uploading and Mounting ISO for Windows Cloud-server

Posted on 22nd August 2016 by azio

So a customer came with the question today on the process of uploading an ISO to a cloud-server. It’s a relatively easy process. RDP allows you to use resource sharing, that effectively mounts on the server you RDP to, your remote filesystem of your choice. This is the manner in which you can upload the file easily.

Once you have the ISO on the server you simply need to mount the disk with a software to present it as an ordinary CD/DVD disk. I personally use Elaborate Bytes’ Virtual Clone Drive.

Copying Resources between client and server using RDP resource sharing:
https://support.microsoft.com/en-us/kb/313292

Mounting CD/DVD ISO/CUE/BIN files in Windows
https://www.elby.ch/en/products/vcd.html

Yeah, I know it’s simple to us, but other people might find this useful. Exact same question came up today 😀

Using Sar to tell a story

Posted on 12th August 2016 by azio

So, a customer is experiencing slowness/sluggishness in their app. You know there is not issue with the hypervisor from instinct, but instinct isn’t enough. Using tools like xentop, sar, bwm-ng are critical parts of live and historical troubleshooting.

Sar can tell you a story, if you can ask the storyteller the write questions, or even better, pick up the book and read it properly. You’ll understand what the plot, scenario, situation and exactly how to proceed with troubleshooting by paying attention to these data and knowing which things to check under certain circumstances.

This article doesn’t go in depth to that, but it gives you a good reference of a variety of tests, the most important being, cpu usage, io usage, network usage, and load averages.

CPU Usage of all processors

# Grab details live
sar -u 1 3

# Use historical binary sar file
# sa10 means '10th day' of current month.
sar -u -f /var/log/sa/sa10

CPU Usage of a particular Processor

sar -P ALL 1 1

‘-P 1’ means check only the 2nd Core. (Core numbers start from 0).

sar -P 1 1 5

The above command displays real time CPU usage for core number 1, every 1 second for 5 times.

Observing Changes in Memory over time

sar -r 1 3

The above command provides memory stats every 1 second for a total of 3 times.

Observing Swap usage over time

sar -S 1 5

The above command reports swap statistics every 1 seconds, a total 3 times.

Overall I/O activity

sar -b 1 3

The above command checks every 1 seconds, 3 times.

Individual Block Device I/O Activities

This is a useful check for LUN , block devices and other specific mounts

sar -d 1 1 
sar -p d

DEV – indicates block device, i.e. sda, sda1, sdb1 etc.

Total Number processors created a second / Context switches

sar -w 1 3

Run Queue and Load Average

sar -q 1 3

This reports the run queue size and load average of last 1 minute, 5 minutes, and 15 minutes. “1 3” reports for every 1 seconds a total of 3 times.

Report Network Statistics

sar -n KEYWORD

KEYWORDS Available;

DEV – Displays network devices vital statistics for eth0, eth1, etc.,
EDEV – Display network device failure statistics
NFS – Displays NFS client activities
NFSD – Displays NFS server activities
SOCK – Displays sockets in use for IPv4
IP – Displays IPv4 network traffic
EIP – Displays IPv4 network errors
ICMP – Displays ICMPv4 network traffic
EICMP – Displays ICMPv4 network errors
TCP – Displays TCPv4 network traffic
ETCP – Displays TCPv4 network errors
UDP – Displays UDPv4 network traffic
SOCK6, IP6, EIP6, ICMP6, UDP6 are for IPv6
ALL – This displays all of the above information. The output will be very long.

sar -n DEV 1 1

Specify Start Time

sar -q -f /var/log/sa/sa11 -s 11:00:00

sar -q -f /var/log/sa/sa11 -s 11:00:00 | head -n 10

Grabbing network activity from server without network utility

Posted on 11th August 2016 by azio

So, is it possible to look at a network interfaces activity without bwm-ng, iptraf, or other tools? Yes.

while true do
RX1=`cat /sys/class/net/${INTERFACE}/statistics/rx_bytes`
TX1=`cat /sys/class/net/${INTERFACE}/statistics/tx_bytes`
DOWN=$(($RX1-$RX2))
UP=$(($TX1-$TX2))
DOWN_Bits=$(($DOWN * 8 ))
UP_Bits=$(($UP * 8 ))
DOWNmbps=$(( $DOWN_Bits >> 20 ))
UPmbps=$(($UP_Bits >> 20 ))
echo -e "RX:${DOWN}\tTX:${UP} B/s | RX:${DOWNmbps}\tTX:${UPmbps} Mb/s"
RX2=$RX1; TX2=$TX1
sleep 1; done

I found this little gem yesterday, but couldn’t understand why they had not used clear. I guess they wanted to log activity or something… still this was a really nice find. I can’t remember where I found it yesterday but googling part of it should lead you to the original source 😀

Using TCP to ping test your Cloud server connectivity

Posted on 10th August 2016 by azio

So, you have probably heard that there are a variety of reasons why you shouldn’t use ICMP to test your service is operating normally. Mainly because of the way that ICMP is handled by routers. If you really want a representative view of the way that TCP packets, such as HTTP and HTTPS are performing in terms of packet loss (that is to say packets which do not arrive at their destination) , then hping is your friend.

You might be pinging a cloud-server that is not replying. You might think it’s down. But what if the firewall is simply dropping ICMP echo requests coming in on that port? Indeed.

Enter hping.

# hping -S -p 80 google.com
HPING google.com (eth0 74.125.136.102): S set, 40 headers + 0 data bytes
len=46 ip=74.125.136.102 ttl=46 id=23970 sport=80 flags=SA seq=0 win=42780 rtt=13.8 ms
len=46 ip=74.125.136.102 ttl=47 id=37443 sport=80 flags=SA seq=1 win=42780 rtt=12.6 ms
len=46 ip=74.125.136.102 ttl=47 id=43654 sport=80 flags=SA seq=2 win=42780 rtt=12.0 ms
len=46 ip=74.125.136.102 ttl=47 id=37877 sport=80 flags=SA seq=3 win=42780 rtt=11.4 ms
len=46 ip=74.125.136.102 ttl=47 id=62433 sport=80 flags=SA seq=4 win=42780 rtt=13.3 ms
^C
--- google.com hping statistic ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 11.4/12.6/13.8 ms

In this case I tested with google.com. I’m actually surprised that more people don’t use hping, because, hping is awesome. It also makes quite a decent port scanner, were it not for the fact that the machine I tried to test that feature with buffer overflowed 😉 It’s a nice way to test a firewalled box, but more than that, it’s a more reliable test in my opinion.

Using Nova and Supernova to manage Firewall IP access lists, automation & more

Posted on 10th August 2016 by azio

So, a customer today reached out to us asking if Rackspace provided the entire infrastructure IP address ranges in use on cloud. The answer is, no. However, that doesn’t mean that making your firewall rules, or autoscale automation need to be painful.

In fact, Rackspace Cloud utilizes Openstack which fully supports API calls which will easily be able to provide this detail in just a few simple short steps. To do this you require nova to be installed, this is really relatively easy to install, and instructions for installing it can be found here;

https://support.rackspace.com/how-to/installing-python-novaclient-on-linux-and-mac-os/

Once you have installed nova, it’s simply a case of making sure you set these 4 lines correctly in your .bash_profile

OS_USERNAME=mycloudusernamegoeshere
OS_TENANT_NAME=yourrackspaceaccountnumbergoeshereusuallysomethinglike1010101010
OS_AUTH_SYSTEM=rackspace
OS_PASSWORD=apikeygoeshere
OS_AUTH_URL=https://identity.api.rackspacecloud.com/v2.0/
OS_REGION_NAME=LON
OS_NO_CACHE=1
export OS_USERNAME OS_TENANT_NAME OS_AUTH_SYSTEM OS_PASSWORD OS_AUTH_URL OS_REGION_NAME OS_NO_CACHE

OS_USERNAME is your mycloud login username (normally the primary user).
OS_TENANT_NAME is your Customer ID, it’s the number that appears in the URL of your control panel link, see below picture for illustration

OS_PASSWORD is a bit misleading, this is actually where your apikey goes , but I think it’s possible to authenticate using your control panel password too, don’t do that for security reasons.

OS_REGION_NAME is pretty self explanatory, this is simply the region that you would like to list cloud-server IP’s in or rather, the region that you wish to perform NOVA API calls.

Making the API call using the nova API wrapper

# supernova lon list --tenant 100010101 --fields accessIPv4,name
[SUPERNOVA] Running nova against lon...
+--------------------------------------+-----------------+-----------+
| ID                                   | accessIPv4      | Name      |
+--------------------------------------+-----------------+-----------+
| 7e5a7f99-60ae-4c28-b2b8              | 1.1.1.1  |  xapp      |
| 94747603-812d-4594-850b              | 1.1.1.1  |   rabbit2   |
| d5b318aa-0fa2-4269-ae00              | 1.1.1.1  |   elastic5  |
| 6c1d8d33-ae5e-44be-b9f0              | 1.1.1.1  | | elastic6  |
| 9f79a7dc-fd19-4f8f-9c26              |1.1.1.1   | | elastic3  |
| 05b1c52b-6ced-4db0-8af2              | 11.1.1.1 | | elastic1  |
| c8302366-f2f9-4c36-8f7a              | 1.1.1.1  | | app5      |
| b159cd07-8e68-49bc-83ee              | 1.1.1.1  | | app6      |
| f1f31eef-97c6-4c68-b01a              | 1.1.1.1  | | ruby1     |
| 64b7f0fd-8f2f-4d5f-8f89              | 1.1.1.1  | | build3    |
| e320c051-b5cf-473a-9f96              | 1.1.1.1  |   mysql2    |
| 4fddd022-59a8-4502-bf6e              | 1.1.1.1  | | mysql1    |
| c9ad6951-f5f9-4351-b31d              | 1.1.1.1  | | worker2   |
+--------------------------------------+-----------------+-----------+

This is pretty useful for managing autoscale permissions if you need to make sure your corporate network can be connected to from your cloud-servers when new cloud-servers with new IP are built out. considerations like this are really important when putting together a solution. The nice thing is the tools are really quite simple and flexible. If I wanted I could have pulled out detail for servicenet instead. I hope this helps make some folks lives a bit easier and works to demystify API to others that haven’t had the opportunity to use it.

You are probably wondering though, what field names can I use? a nova show will reveal this against one of your server UUID’s

# supernova lon show someuuidgoeshere +-------------------------------------+------------------------------------------------------------------+ | Property | Value | +-------------------------------------+------------------------------------------------------------------+ | OS-DCF:diskConfig | MANUAL | | OS-EXT-SRV-ATTR:host | censored | | OS-EXT-SRV-ATTR:hypervisor_hostname | censored | | OS-EXT-SRV-ATTR:instance_name | instance-734834278-sdfdsfds- | | OS-EXT-STS:power_state | 1 | | OS-EXT-STS:task_state | - | | OS-EXT-STS:vm_state | active | | censorednet network | censored | | accessIPv4 | censored | | accessIPv6 | censored | | created | 2015-12-11T14:12:08Z | | flavor | 15 GB I/O v1 (io1-15) | | hostId | 860... | | id | 9f79a7dc-fd19-4f8f-9c26-72a335ed2be8 | | image | Debian 8 (Jessie) (PVHVM) (cf16c435-7bed-4dc3-b76e-57b09987866d) | | metadata | {"build_config": "", "rax_service_level_automation": "Complete"} | | name | elastic3 | | private network | | | progress | 100 | | public network | | | status | ACTIVE | | tenant_id | | | updated | 2016-02-27T09:30:20Z | | user_id | | +-------------------------------------+------------------------------------------------------------------+ I censored some of the fields.. but you can see all of the column names, so if you wanted to see metadata and progress only, with the server uuid and server name.

nova list --fields name, metadata, progress

This could be pretty handy for detecting when a process has finished building, or detecting once automation has completed. The possibilities with API are quite endless. API is certainly the future, and, there is no reason why, in the future, people won't be building and deploying websites thru API only, and some sophisticated UI wrapper like NOVA.

Admittedly, this is very far away, but that should be what the future technology will be made of, stuff like LAMBDA, serverless architecture, will be the future.

Adding some excludes for Lsyncd

Posted on 8th August 2016 by azio

A customer was having some issues with their syncing, as was shown by their inotify

Error: Terminating since out of inotify watches.
Consider increasing /proc/sys/fs/inotify/max_user_watches

Fix was quite simple, to remove other folders from sync that aren’t necessary.

Adding this line to the /etc/lsyncd.conf

excludeFrom="/etc/lsyncd-excludes.txt",

And creating the ‘excludes’ file for LsyncD, i.e. what folders you want to ignore, in this case we wanted to ignore old httpdocs.OLD backup.

# cat /etc/lsyncd-excludes.txt
somewebsite.com/httpdocs.OLD/

A shockingly simple fix.

Please note that the path in lsyncd-excludes.txt is determined by the path in lsyncd. (do not give full path, give relative path inside the parent). It was a simple fix.

Resizing a SD or USB card partition for Retropie Raspberry Pi Arcade on Linux

Posted on 5th August 2016 by azio

So, I had a friend who had recently bought his Raspberry Pi 3 and wanted to run retropie on it like I have been with my arcade cabinet.

The problem was the sandisk 64GB disk he had bought had some few less sectors on the disk, which meant my image was just a few bytes too big. What a bummer!

So I used this great tool by sirlagz to fix that.

#!/bin/bash
# Automatic Image file resizer
# Written by SirLagz
strImgFile=$1


if [[ ! $(whoami) =~ "root" ]]; then
echo ""
echo "**********************************"
echo "*** This should be run as root ***"
echo "**********************************"
echo ""
exit
fi

if [[ -z $1 ]]; then
echo "Usage: ./autosizer.sh "
exit
fi

if [[ ! -e $1 || ! $(file $1) =~ "x86" ]]; then
echo "Error : Not an image file, or file doesn't exist"
exit
fi

partinfo=`parted -m $1 unit B print`
partnumber=`echo "$partinfo" | grep ext4 | awk -F: ' { print $1 } '`
partstart=`echo "$partinfo" | grep ext4 | awk -F: ' { print substr($2,0,length($2)-1) } '`
loopback=`losetup -f --show -o $partstart $1`
e2fsck -f $loopback
minsize=`resize2fs -P $loopback | awk -F': ' ' { print $2 } '`
minsize=`echo $minsize+1000 | bc`
resize2fs -p $loopback $minsize
sleep 1
losetup -d $loopback
partnewsize=`echo "$minsize * 4096" | bc`
newpartend=`echo "$partstart + $partnewsize" | bc`
part1=`parted $1 rm 2`
part2=`parted $1 unit B mkpart primary $partstart $newpartend`
endresult=`parted -m $1 unit B print free | tail -1 | awk -F: ' { print substr($2,0,length($2)-1) } '`
truncate -s $endresult $1

It was a nice solution to my friends problem… the only problem now is the working image for my pi is not working with audio for him, and for some reason when he comes out of hte game and goes back to emulation station he loses the joystick input controller. That is kind of bizarre.

Does anyone know what could cause those secondary issues? I’m a bit stumped on this one.

Preventing /etc/resolv.conf reset on startup/boot

Posted on 1st August 2016 by azio

Today, a customer approached us after a Host Server Down complaining that, although the server is up again their website and application were down & not working. Even though the server was online and functioning correctly.

The customer discovered that the source of the issue was that there /etc/resolv.conf is blank, this means that they will not be able to resolve DNS A/PTR/CNAME record hostnames into a resolved IP. This is called hostname to IP resolution. Its means that if /etc/resolv.conf is blank and the customer uses hostnames in their calls, such a failure will break the connectivity due to failure to resolve to IP to communicate on the TCP stack.

There is actually a very simple way to prevent the /etc/resolv.conf file from being changed. But first, it’s important to understand why /etc/resolv.conf is being reset.

On All Rackspace cloud-servers there is a process called nova-agent, and when the server starts up, the /etc/resolv.conf file will be reset along with the networking configuration. This happens each time your server is restarted and is used to set new networking details, specifically if you take an image and build server on a new ip address or if your server is live-migrated to a new host, it makes sure on the next reboot it comes up with correct networking detail transparently. However this can cause some issues, such as in this case with the /etc/resolv.conf file. Fortunately there are some novel ways of preventing your /etc/resolv.conf being modified after you have added the correct nameservers you desire to it.

You can use the chattr immutable file setting to stop processes from modifying it after you have made the changes to your resolv.conf that are desired;

Set Immutable File

chattr +i /etc/resolv.conf

Un-set Immutable File

chattr -i /etc/resolv.conf

This /etc/resolv.conf issue is a common problem, however using immutable file flag and chattr should prevent it from being changed ever again.

Simple way to perform a body check on a website

Posted on 20th July 2016 by azio

So, I was testing with curl today and I know that it’s possible to direct to /dev/null to suppress the page. But that’s not very handy if you are checking whether html page loads, so I came up with some better body checks to use.

A Basic body check using wc -l to count the lines of the site

 time curl https://www.google.com/ > 1; echo "non zero indicates server up and served content of n lines"; cat 1 | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  167k    0  167k    0     0  79771      0 --:--:--  0:00:02 --:--:-- 79756

real	0m2.162s
user	0m0.042s
sys	0m0.126s
non zero indicates server up and served content of n lines
2134

A body check for Google analytics

$ time curl https://www.groundworkjobs.com/ > 1; echo "Checking for google analytics html elements string"; cat 1 | grep "www.google-analytics.com/analytics.js"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  167k    0  167k    0     0  76143      0 --:--:--  0:00:02 --:--:-- 76152

real	0m2.265s
user	0m0.042s
sys	0m0.133s
Checking for google analytics html elements string
				})(window,document,'script','//www.google-analytics.com/analytics.js','ga');

Such commands might be useful when troubleshooting a cluster for instance, where one server shows more up to date versions, (different number of lines). There’s probably better way to do this with ls and awk and use the html filesize, since number of lines wouldn’t be so accurate.

Check Filesize from request

$ time curl https://www.groundworkjobs.com/ > 1; var=$(ls -al 1 | awk '{print $5}') ; echo "Page size is: $var kB"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  167k    0  167k    0     0  79467      0 --:--:--  0:00:02 --:--:-- 79461

real	0m2.170s
user	0m0.048s
sys	0m0.111s
Page size is: 171876 kB

Pretty simple.. but you could take the oneliner even further… populate a variable called $var with the filesize using ls and awk , and then use an if statement to check that var is not 0, indicating the page is answering positively, or alternatively not answering at all.

Check Filesize and populate a variable with the filesize, then validate variable

$ time curl https://www.groundworkjobs.com/ > 1; var=$(ls -al 1 | awk '{print $5}') ; echo "Page size is: $var kB"; if [ "$var" -gt 0 ] ; then echo "The filesize was greater than 0, which indicates box is up but may be giving an error page"; fi
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  167k    0  167k    0     0  78915      0 --:--:--  0:00:02 --:--:-- 78950

real	0m2.185s
user	0m0.041s
sys	0m0.132s
Page size is: 171876 kB
The filesize was greater than 0, which indicates box is up but may be giving an error page

The second exercise is not particularly useful or practical as a means of testing, since if the site was timing out the script would take ages to reply and make the whole test pointless, but as a learning exercise being able to assemble one liners on the fly like this is an enjoyable, rewarding and useful investment of time and effort. Understanding such things are the fundamentals of automating tasks. In this case with output filtering, variable creation, and subsequent validation logic. It’s a simple test, but the concept is exactly the same for any advanced automation procedure too.

haxed.me.uk

a place of infrastructure delights