Monday, 20 April 2009

EtherChannel: Issue with backup virtual adapter

Introduction:
EtherChannel in Network Interface Backup (NIB) mode currently behaves oddly when the backup adapter is a virtual adapter. PMR 68839.300.624 clarified that, by design, the backup virtual adapter receives traffic even while it is in backup mode. This introduces an additional problem: although the backup virtual adapter receives traffic, replies are not sent back over it. Replies leave through the primary channel instead, which creates an asymmetric traffic path on the physical network and results in flooding.

Impacted:
- All AIX versions to date
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61
- POWER5(+) firmware <= SF240_320

Details:


Assume that the ARP tables on client LPAR A and server LPAR B are empty, as well as the MAC table on the Ethernet switch. Client LPAR A wants to send data to server LPAR B.

LPAR A:
ent1 (Virtual Ethernet) - MAC address 22:f1:30:00:70:06
en1 - IP address 10.226.32.145

LPAR B:
ent3 (EtherChannel in NIB mode (Active/Passive)) - MAC address 00:14:5e:c6:46:80
en3 - IP address 10.226.32.139
Primary Channel: ent2 (Physical Ethernet)
Backup Channel: ent1 (Virtual Ethernet)

VIO:
ent3 (Shared Ethernet Adapter) - MAC address 00:14:5e:48:2c:7a
Physical Ethernet: ent2 - MAC address 00:14:5e:48:2c:7a
Virtual Ethernet: ent1 - MAC address 22:f1:30:00:30:06

Source IP address: 10.226.32.145
Destination IP address: 10.226.32.139

Source MAC address: 22:f1:30:00:70:06
Destination MAC address: unknown

Since client LPAR A does not know the MAC address of server LPAR B, it broadcasts an ARP request (Who has 10.226.32.139, tell 10.226.32.145) on the internal Layer 2 PHYP switch. Even though the EtherChannel on server LPAR B is in Primary Channel mode, the PHYP delivers this broadcast to the backup Virtual Ethernet adapter of the EtherChannel and also to the SEA for bridging. As a result, the MAC table on the physical switch is updated with the MAC address of client LPAR A, located on physical port X. Server LPAR B forms a unicast reply but sends it via the Primary Channel to the Ethernet switch. The Ethernet switch receives the unicast reply on port Y and links the source MAC address of server LPAR B to port Y in its MAC table. Since the destination MAC address of the frame has a valid MAC table entry on the physical switch, the frame is delivered to port X and is ultimately received by client LPAR A through the SEA. Client LPAR A then updates its ARP table with the MAC address of server LPAR B.

Client LPAR A can now start communicating with server LPAR B, since it knows the destination MAC address. The PHYP delivers these packets via the backup Virtual Ethernet adapter of the EtherChannel, so they never pass through the physical switch and never refresh its MAC table entry for client LPAR A. After the TTL of that MAC table entry expires, flooding is observed on the physical switch: the switch acts as a simple repeater for all traffic from server LPAR B to client LPAR A, sending it to all trunk ports and access ports defined in the same VLAN. Of course, the frames are also forwarded to port X (it is in the same VLAN) and are ultimately received by client LPAR A through the SEA.

When client LPAR A sends data in jumbo frames to server LPAR B, approximately 2 Mbit/s of TCP ACK flooding was observed. It gets much worse when the direction is reversed and server LPAR B sends data to client LPAR A: all data is flooded on the switch, and only the TCP ACKs are delivered via the backup Virtual Ethernet adapter.

According to IBM, this is working as designed and a DCR was created to address this issue.

Resolution:
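- Reduce the ARP table TTL on the LPARs (arpt_killc network tunable) so that it is shorter than the MAC table TTL on the physical switch OR
- Increase the MAC table TTL on the physical switch so that it is longer than the ARP table TTL on the LPARs OR
- Replace the Virtual Ethernet adapter with a Physical Ethernet adapter for the EtherChannel backup channel.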
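
The first two options rely on the same relationship: in the scenario above, the switch only relearns the MAC address of client LPAR A when that LPAR broadcasts a new ARP request, so the ARP entry should expire before the switch ages out the MAC entry. The sketch below illustrates that comparison and is not a tuning recommendation. It assumes arpt_killc is expressed in minutes (20 is the documented AIX default, but verify on your system) and that the MAC aging time of the switch is known in seconds. The function name flood_window and the 300-second aging value are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many seconds per ARP cycle the switch may
# flood, given the LPAR ARP timeout (minutes) and the switch MAC aging (seconds).
# Assumption: the LPAR sends nothing else through the SEA that would refresh
# its MAC table entry on the physical switch.
flood_window() {
    arp_minutes=$1
    mac_aging_seconds=$2
    arp_seconds=$((arp_minutes * 60))
    if [ "$arp_seconds" -lt "$mac_aging_seconds" ]; then
        echo 0
    else
        echo $((arp_seconds - mac_aging_seconds))
    fi
}

# On AIX the current ARP timeout can be read with:    no -o arpt_killc
# and changed (current and reboot value) with:        no -p -o arpt_killc=<minutes>
flood_window 20 300   # 20 min ARP TTL against an assumed 300 s MAC aging time
flood_window 4 300    # 4 min ARP TTL against the same aging time
```

A non-zero result means the MAC entry can expire while the ARP entry is still valid, which is the window in which flooding was observed.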

Wednesday, 1 April 2009

Quorum active or not?

AIX 5.3 TL7 introduced concurrent quorum changes on a volume group. Prior to that level, a quorum change only became active after a varyoff/varyon operation on that specific volume group. This also means that, whenever the ODM value has been changed, there is no easy way to know whether quorum is currently active, since lsvg displays the values of the ODM attributes, not the real-time values.
Fortunately, there is a way to figure out whether quorum is active or not. This involves debugging the running kernel using kdb. The procedure is as follows:

- Determine the major number of the volume group in /dev and convert it to hexadecimal. For example, rootvg always has a major number of 10 (hexadecimal A), and its logical volumes have sequential minor numbers starting at 1.

# ls -al /dev/rootvg
crw-rw---- 1 root system 10, 0 Apr 24 2008 /dev/rootvg
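
The conversion in this step can be scripted. The following is a minimal sketch, assuming the ls output layout shown above, where the major number is the fifth field followed by a comma. The function names major_of and major_to_hex are made up for illustration and are not part of AIX.

```shell
#!/bin/sh
# Hypothetical helpers; the names are not part of AIX.

# Read one line of `ls -l` output for a character device on stdin and
# print its major number (fifth field, trailing comma removed).
major_of() {
    awk '{ sub(/,$/, "", $5); print $5 }'
}

# Print a decimal major number in the hexadecimal form passed to devsw in kdb.
major_to_hex() {
    printf '0x%X\n' "$1"
}

# Example using the line from the listing above.
# On a live system this would be:  ls -l /dev/rootvg | major_of
line='crw-rw---- 1 root system 10, 0 Apr 24 2008 /dev/rootvg'
major_to_hex "$(printf '%s\n' "$line" | major_of)"
```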


- List the device switch table entry for the volume group, based on the hexadecimal major number, and note the effective address of the volgrp structure in memory (dsdptr).

# echo 'devsw 0xA' | kdb
The specified kernel file is a 64-bit kernel
Preserving 1402949 bytes of symbol table
First symbol __mulh
START END
0000000000001000 0000000003DDF050 start+000FD8
F00000002FF47600 F00000002FFDC920 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F100070F00000000 F100070F10000000 pvproc+000000
F100070F10000000 F100070F18000000 pvthread+000000
PFT:
PVT:
id....................0002
raddr.....000000000A000000 eaddr.....F200800040000000
size..............00080000 align.............00001000
valid..1 ros....0 fixlmb.1 seg....0 wimg...2
(0)> devsw 0xA
Slot address F1000100101AA500
MAJOR: 00A
   open: 04165624
   close: 04164EC8
   read: 04164738
   write: 04164638
   ioctl: 04162960
   strategy: 04180E9C
   ttys: 00000000
   select: .nodev (00196AE4)
   config: 041588F8
   print: .nodev (00196AE4)
   dump: 04181E68
   mpx: .nodev (00196AE4)
   revoke: .nodev (00196AE4)
   dsdptr: F100010032BA2000
   selptr: 00000000
   opts: 0000012A DEV_DEFINED DEV_MPSAFE DEV_EXTBUF
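
Instead of reading the address off the screen, the dsdptr value can be filtered out of the kdb output. This is a sketch only, assuming the "dsdptr: <address>" layout shown in the output above. The function name extract_dsdptr is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read devsw output from kdb on stdin and print the
# address that follows the "dsdptr:" label.
extract_dsdptr() {
    awk '$1 == "dsdptr:" { print $2; exit }'
}

# On a live system (not run here):  echo 'devsw 0xA' | kdb | extract_dsdptr
# Example using three lines copied from the output above:
sample='   dump: 04181E68
   dsdptr: F100010032BA2000
   selptr: 00000000'
printf '%s\n' "$sample" | extract_dsdptr
```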


- Determine the flags attribute of the volgrp structure. The last (least significant) bit indicates the quorum state (1 -> quorum disabled).

# echo 'volgrp F100010032BA2000' | kdb | grep flags | awk '{print $4}'
00000001
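
The flags value is hexadecimal, so only its lowest bit has to be tested. A minimal sketch of that check follows, relying solely on the rule stated above (last bit set means quorum disabled). The meaning of the other bits is not covered here, and the function name quorum_state is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: interpret the hexadecimal flags word of the volgrp
# structure. Bit 0 set -> quorum disabled, bit 0 clear -> quorum enabled.
quorum_state() {
    flags=$1
    if [ $(( 0x$flags & 1 )) -eq 1 ]; then
        echo "quorum disabled"
    else
        echo "quorum enabled"
    fi
}

quorum_state 00000001   # the value from the example above
quorum_state 00000000
```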