Troubleshooting Networking, initial steps

At work we see a lot of different issues from day to day, and one that comes up every now and then is networking, specifically Rackspace Cloud Networks (the underlying project is Neutron). This is the Rackspace implementation of isolated networks. Amazon has VPC (Virtual Private Cloud), and the concepts are quite similar.

In this case one of our customer's web machines wasn't able to ping other machines. The first thing I did was ask the customer to ping the other machine from their web machine, and to ping the web machine from the other machine.

The customer was reporting problems on the isolated network (i.e. not the public or private interfaces), so not eth0 or eth1 but the eth2 interface. Here is what my tcpdump on the hypervisor looked like.

$ tcpdump -e -n -i vif{domainid}{network}

10:28:30.542146 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype ARP (0x0806), length 42: Request who-has 192.168.66.19 tell 192.168.66.3, length 28
10:28:30.542486 bc:76:4e:08:43:86 > bc:76:4e:09:2a:69, ethertype ARP (0x0806), length 42: Reply 192.168.66.19 is-at bc:76:4e:08:43:86, length 28
10:28:30.571805 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 6, length 64
10:28:31.579785 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 7, length 64
10:28:32.587837 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 8, length 64

As we can see, 192.168.66.19 is being pinged by 192.168.66.3 but there is no echo reply. If there were a reply, we would see a line going the other way, something like:

192.168.66.19 > 192.168.66.3: ICMP echo reply, id 29516, seq 7, length 64
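In a longer capture it is easier to let a script find the unanswered pings than to read every line. The following is a minimal sketch in Python, assuming the text format shown above; the function name `unanswered` and the regular expression are my own, not part of tcpdump. It pairs each echo request with a reply that carries the same id and seq and travels in the opposite direction.

```python
import re

# Matches the ICMP part of a tcpdump text line, e.g.
# "192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 6, length 64"
ICMP_RE = re.compile(
    r"(?P<src>\d+(?:\.\d+){3}) > (?P<dst>\d+(?:\.\d+){3}): "
    r"ICMP echo (?P<kind>request|reply), id (?P<id>\d+), seq (?P<seq>\d+)"
)

def unanswered(lines):
    """Return (src, dst, id, seq) for every echo request without a matching reply."""
    pending = {}  # dict keeps insertion order, so results stay in capture order
    for line in lines:
        m = ICMP_RE.search(line)
        if m is None:
            continue  # ARP and any other traffic is ignored
        ident, seq = int(m["id"]), int(m["seq"])
        if m["kind"] == "request":
            pending[(m["src"], m["dst"], ident, seq)] = None
        else:
            # A reply travels the other way, so swap src and dst to find its request.
            pending.pop((m["dst"], m["src"], ident, seq), None)
    return list(pending)

# The three ICMP lines from the capture above.
capture = [
    "10:28:30.571805 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 6, length 64",
    "10:28:31.579785 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 7, length 64",
    "10:28:32.587837 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 8, length 64",
]

for item in unanswered(capture):
    print(item)
```

For the capture above, all three requests (seq 6, 7 and 8) are reported, since none of them has a reply.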

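192.168.66.3 sends an ARP request asking which MAC address owns 192.168.66.19. (In this capture the request goes straight to bc:76:4e:08:43:86 rather than to the broadcast address ff:ff:ff:ff:ff:ff, which typically means .3 already had a cached entry and was just re-checking it.) The request is answered and the hardware MAC address is given, '192.168.66.19 is-at bc:76:4e:08:43:86', but 192.168.66.19 still isn't sending an ICMP echo reply.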
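The ARP lines can be read mechanically as well. This is a rough Python sketch, assuming the `tcpdump -e` text format shown above; `arp_table` and `is_broadcast_frame` are names I made up for illustration. The first collects the IP to MAC mappings announced in ARP replies, the second checks whether a frame was sent to the Ethernet broadcast address.

```python
import re

MAC = r"[0-9a-f]{2}(?::[0-9a-f]{2}){5}"
ARP_REPLY_RE = re.compile(r"Reply (?P<ip>\d+(?:\.\d+){3}) is-at (?P<mac>" + MAC + ")")
BROADCAST = "ff:ff:ff:ff:ff:ff"

def arp_table(lines):
    """Map each IP to the MAC announced in ARP replies (the last reply wins)."""
    table = {}
    for line in lines:
        m = ARP_REPLY_RE.search(line)
        if m:
            table[m["ip"]] = m["mac"]
    return table

def is_broadcast_frame(line):
    """True if the destination MAC (the second MAC on the line) is broadcast."""
    macs = re.findall(MAC, line)
    return len(macs) >= 2 and macs[1] == BROADCAST

# The two ARP lines from the capture above.
arp_lines = [
    "10:28:30.542146 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype ARP (0x0806), length 42: Request who-has 192.168.66.19 tell 192.168.66.3, length 28",
    "10:28:30.542486 bc:76:4e:08:43:86 > bc:76:4e:09:2a:69, ethertype ARP (0x0806), length 42: Reply 192.168.66.19 is-at bc:76:4e:08:43:86, length 28",
]

print(arp_table(arp_lines))
print(is_broadcast_frame(arp_lines[0]))
```

If the MAC in the table matches the destination MAC on the later ICMP frames, the sender is at least addressing the right interface, which is what the capture shows here.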

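From the ARP exchange we can see that 192.168.66.3 knows which MAC address to send frames for .19 to. Since .3 resolves .19 directly over ARP, both machines appear to be on the same layer 2 network, so no router should be involved: the frames only cross the (virtual) switch, which keeps a table of which MAC address lives behind which port, while each host keeps its own ARP table of which IP maps to which MAC.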
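Whether traffic between two machines needs a router at all depends on whether both addresses fall in the same subnet; when they do, the sender ARPs for the destination directly, as in the capture above. Here is a small Python sketch of that check. The /24 prefix length is an assumption on my part, since the capture does not tell us the netmask, and `same_subnet` is just an illustrative helper.

```python
import ipaddress

def same_subnet(a, b, prefix_len=24):
    """True if address b lies in the network that a belongs to.

    prefix_len=24 is an assumed default; use the real netmask of the interface.
    """
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(b) in network

print(same_subnet("192.168.66.3", "192.168.66.19"))
```

With a /24 both machines land in 192.168.66.0/24, so the sender would not hand the packet to a gateway.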

In this case something was going wrong. For some reason 192.168.66.19 wasn't replying to the pings from 192.168.66.3, even though .3 had the correct hardware MAC address.

However the weird thing is, the problem suddenly went away again!

11:09:26.735818 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 2453, length 64
11:09:27.197715 bc:76:4e:08:43:86 > bc:76:4e:09:2a:69, ethertype IPv4 (0x0800), length 98: 192.168.66.19 > 192.168.66.3: ICMP echo request, id 53772, seq 232, length 64
11:09:27.198315 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo reply, id 53772, seq 232, length 64
11:09:27.743907 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 2454, length 64
11:09:28.198486 bc:76:4e:08:43:86 > bc:76:4e:09:2a:69, ethertype IPv4 (0x0800), length 98: 192.168.66.19 > 192.168.66.3: ICMP echo request, id 53772, seq 233, length 64
11:09:28.201819 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo reply, id 53772, seq 233, length 64
11:09:28.751893 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 2455, length 64
11:09:29.203245 bc:76:4e:08:43:86 > bc:76:4e:09:2a:69, ethertype IPv4 (0x0800), length 98: 192.168.66.19 > 192.168.66.3: ICMP echo request, id 53772, seq 234, length 64
11:09:29.203737 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo reply, id 53772, seq 234, length 64
11:09:29.759691 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo request, id 29516, seq 2456, length 64
11:09:30.203991 bc:76:4e:08:43:86 > bc:76:4e:09:2a:69, ethertype IPv4 (0x0800), length 98: 192.168.66.19 > 192.168.66.3: ICMP echo request, id 53772, seq 235, length 64
11:09:30.204516 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo reply, id 53772, seq 235, length 64

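All of a sudden echo replies were showing up: 192.168.66.19 was pinging 192.168.66.3 (id 53772) and 192.168.66.3 was answering every request. Worth noting is that the requests from 192.168.66.3 (id 29516) still have no matching reply in this capture.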

192.168.66.19 pings 192.168.66.3

11:09:27.197715 bc:76:4e:08:43:86 > bc:76:4e:09:2a:69, ethertype IPv4 (0x0800), length 98: 192.168.66.19 > 192.168.66.3: ICMP echo request, id 53772, seq 232, length 64

192.168.66.3 responds back to 192.168.66.19

11:09:27.198315 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo reply, id 53772, seq 232, length 64
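Direction is easy to misread in lines like these: the address on the left of the '>' is always the sender. As a sanity check, here is a short Python sketch that turns a line into a plain sentence. It assumes the `tcpdump -e` text format shown above, and `describe` is a hypothetical helper name, not a tcpdump feature.

```python
import re

MAC = r"[0-9a-f]{2}(?::[0-9a-f]{2}){5}"
IP = r"\d+(?:\.\d+){3}"
FRAME_RE = re.compile(
    r"(?P<src_mac>" + MAC + r") > (?P<dst_mac>" + MAC + r"), "
    r".*?: (?P<src_ip>" + IP + r") > (?P<dst_ip>" + IP + r"): "
    r"ICMP echo (?P<kind>request|reply)"
)

def describe(line):
    """Summarise who sent an ICMP echo frame to whom, or None for other lines."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    verb = "pings" if m["kind"] == "request" else "replies to"
    return f'{m["src_ip"]} ({m["src_mac"]}) {verb} {m["dst_ip"]} ({m["dst_mac"]})'

# The request and reply with id 53772, seq 232 from the capture above.
lines = [
    "11:09:27.197715 bc:76:4e:08:43:86 > bc:76:4e:09:2a:69, ethertype IPv4 (0x0800), length 98: 192.168.66.19 > 192.168.66.3: ICMP echo request, id 53772, seq 232, length 64",
    "11:09:27.198315 bc:76:4e:09:2a:69 > bc:76:4e:08:43:86, ethertype IPv4 (0x0800), length 98: 192.168.66.3 > 192.168.66.19: ICMP echo reply, id 53772, seq 232, length 64",
]

for line in lines:
    print(describe(line))
```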

The ultimate question is WHY. I don't know why, but I have shown you how to see WHAT and WHERE, which is the most pertinent way to begin reaching a why ;D