As I indicated in the "all clear" email I sent to everyone earlier, I'm forwarding all my notes from this morning (below) so anybody who wants to read through the whole saga can get on the same page, ask the right questions, etc.
Can you please provide NETS with a current picture (diagram, documentation, whatever) of roy and its processing nodes? Having better documentation on this would have helped everyone work through this. I think I get it now (at least conceptually), but not having to re-learn it each time would be good. We all spent a lot of time just trying to come up to speed on what we were trying to solve this morning, I think.
Long answer: here is my summary of what I believe the sequence of events was:
At ~07:50am this morning NETS staff bumped several fiber pairs feeding hosts on module 8 of ml-mr-c1-gs which includes these devices:
---
Port  Name                 Status     Vlan  Duplex  Speed  Type
----- -------------------- ---------- ----- ------- ------ ------------
8/1   ML-29/DSG-R6:SALO    connected  12    full    1000   1000BaseSX
8/2   ML-29/DSG-R6:ABBY    connected  12    full    1000   1000BaseSX
8/3   ML-29/DSG-R6:SUAVE   connected  12    full    1000   1000BaseSX
8/4   ML-29/DSG-R6:BILLER  connected  12    full    1000   1000BaseSX
8/5   DENISE               connected  12    full    1000   1000BaseSX
8/6   FISET                connected  12    full    1000   1000BaseSX
---
Here is the email describing what I knew as of ~10am this morning, when Greg stopped by my office:
Greg: After you stopped by this morning, I did some more checking into this. The only thing I found was this on netman/HPOV:

Fri Oct 15 07:50:15 roy.ucar.edu mlra.ucar.edu reports a new physical address for roy.ucar.edu, changing from 0x00005A9C8C58 to 0x00005A9C9BB8
Fri Oct 15 07:55:21 roy.ucar.edu mlra.ucar.edu reports a new physical address for roy.ucar.edu, changing from 0x00005A9C9BB8 to 0x00005A9CEB38

It seems those MAC addresses correspond to these devices:

0x00005A9C8C58  ml-mr-c1-gs 8/2  12  1000 full  ML-29/DSG-R6:ABBY  00005a9c8c58  184.108.40.206  roy
0x00005A9C9BB8  ml-mr-c1-gs 8/6  12  1000 full  FISET              00005a9c9bb8  220.127.116.11  fiset
0x00005A9CEB38  ml-mr-c1-gs 8/5  12  1000 full  DENISE             00005a9ceb38  18.104.22.168   denis

Those devices are all on the same blade (module 8 on ml-mr-c1-gs), so it is possible that there was some issue with that card this morning, though I see no syslog entries (or anything else from the switch) about it. Despite the lack of hard evidence, that scenario seems likely, as there were also "IF lan0 down" entries in netman/HPOV about other devices on that module 8 at the same time, including salo and abby.
Apparently, we also have a work request to split those hosts out, which might have helped in this scenario. In any case, we will obviously keep a closer eye on that module 8 now and follow up on the timetable for getting those hosts moved so they do not all share ports on that (now suspect) module. I suspect this will be addressed at the latest when NETS installs our new ml-mr-c4-gs switch in our new ML-29 rack location.
That network bump (in item 1 above) set off a chain of events that caused the wrong physical (MAC) address for roy to be associated with its "processing" IP address. Greg can explain this more clearly, but I believe basically the roy processing nodes have a flag set so they do not advertise their MAC address when they receive ARP requests (that's how they coordinate getting roy's traffic). That's what Greg is referring to in his email below.
Here's why (from an email from Greg):
Sure enough, the "hidden" flag got unset on all the nodes. This flag is set at boot time and is normally never touched after that. This is the same method that has been used on roy since the beginning, and on the mdir/mscan cluster for over two years now, and we've never seen this happen before. Presumably this is a result of reloading the driver this morning when the elbow bump occurred, but the only evidence I have for that is the time coincidence. Everything was working up to then, and afterwards it was not. But there are no log entries indicating this happened other than the ones indicating that the connection went down and up within 5 seconds, twice, a minute apart, just after 0745 this morning. I am assuming this is when the hidden flag got unset.
The hidden flag not being set would result in the four processing nodes (denis, fiset, biller, sauve) responding to ARP requests for the address of roy. This would explain how the bogus ARP entries got into the router's ARP cache, and why the packets were being routed to a place other than where it looked like they should be.
As of now, the hidden flag has been reset on all the nodes, so clearing the bogus ARP entries out of the router should fix things. I will install a cron job that simply resets this flag every minute, so that if something like this happens again, the flag will be reset almost immediately. We obviously cannot prevent an elbow bump from dropping connections, but we can prevent the fallout from it that happened today.
Here is what was wrong (and what I was seeing on mlra):
mlra#show ip arp | inc 12.34
Internet 22.214.171.124   0  0000.5a9c.8c58 ARPA Vlan12
mlra#show ip arp | inc 12.39
Internet 126.96.36.199    2  0000.5a9c.9bb8 ARPA Vlan12
mlra#show ip arp | inc 12.33
Internet 188.8.131.52     2  0000.5a9c.9bb8 ARPA Vlan12
mlra#show ip arp | inc 12.38
Internet 184.108.40.206   0  0000.5a9c.eb38 ARPA Vlan12

33 (roy) and 39 (fiset) matched: wrong. Then 33 (roy) and 36 (sauve) matched--also wrong. What Greg was looking for was for 33 (roy) and 34 (abby) to match (with roy's MAC)...but instead roy had abby's.
mlra#show arp | inc 8c88
Internet 220.127.116.11   7  0000.5a9c.8c88 ARPA Vlan12
Internet 18.104.22.168    3  0000.5a9c.8c88 ARPA Vlan12

12.33 and 12.34 should have had matching MAC addresses according to Greg's config...THIS IS WHAT WE WANTED (and finally got):
mlra#show arp | inc Vlan12 [only relevant output included]:
...
Internet 22.214.171.124   0  0000.5a9c.8c58 ARPA Vlan12
Internet 126.96.36.199    0  0000.5a9c.8c58 ARPA Vlan12

12.33 and 12.34 match MAC addrs (using roy's)...now it works (for mlra; we need to do a similar fix on mlrb).
While Greg was setting the hidden flag on the host in question, then bumping the interface, I was clearing the ARP entries, and then re-pinging the devices. Eventually, once we got our timing right, the ARP cache got populated with the configuration Greg expected (i.e., the matches noted above).
roy=188.8.131.52

Note: since we noticed the static route entries on flra/b and mlra/b were wrong, I corrected those as well; that should be:

Lightning=215.151 and 215.152
private address space=192.168.150.0
ip route 192.168.150.0 255.255.255.0 184.108.40.206

Greg's original email on this:
From: Greg Woods
Sent: Friday, October 15, 2004 9:45 AM
To: email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Subject: GigE outage this morning
Notice the subject doesn't say "roy outage", because there was never a problem with roy. However, a number of users had their connections through roy to the supers dropped this morning just after 0745. The cause of this appears to be a brief outage in the GigE connections. Both GigE interfaces on all six nodes went down, then back up about 5 seconds later. The same thing happened again about a minute later. The chance of all six nodes developing the same problem at exactly the same time, and on both interfaces to boot, is almost nil, so it is virtually certain that this was a network issue. A work request has been filed with NETS to investigate.
There is no indication that this problem has occurred before or since these two brief incidents this morning.
On Fri, 2004-10-15 at 14:24, Jeff Custard wrote:
Can you please provide NETS with a current picture (diagram, documentation, whatever) of roy and its processing nodes?
The best we have is at http://www.scd.ucar.edu/dsg/roy.html
That took three full working days to produce and I don't claim it's complete.
One can also look at the Linux Virtual Server web site, where the method used to farm connections out is documented. But I can guarantee you that trying to figure out what was wrong from documentation such as that would have taken days. I already know how it works, and I was involved in the diagnostic process today. For certain classes of problems there is no substitute for the knowledge in our heads, and in my opinion this is one of those.
This is a brief description of how LVS works. LVS is used on roy, and has also been used for more than two years on the mail cluster, without ever causing this kind of problem, and today's problem only lasted about 4 hours. Please keep that in mind when proposing remedies or the development of documentation.
When a packet comes in for a service (SSH on roy, SMTP on mdir) that is under the control of LVS, the kernel will first determine if this is part of an established connection. If it isn't, a new connection will be created. New connections are farmed out to the work nodes in a way that attempts to equalize the number of active connections to each work node.
Once it is determined which connection a packet is part of, it is then sent back out on the LAN, with the same destination IP address it came in with, but with the MAC address of the receiving work node. In order for this to work, the receiving node has to recognize that IP address as one of its own. This is accomplished by creating a loopback alias device (lo:0) with that IP address. The problem with doing this is that if a system recognizes one of its own IP addresses in an ARP request, it will normally respond to that ARP. But for stateful farmed out connections via the control node to work, only the control node should answer ARPs for the service address. This is known in the LVS documentation as "the ARP problem", and it is solved by using a trick known as "the hidden patch". What this does is create some kernel flags in the /proc filesystem that, when set, will cause the kernel NOT to answer any ARP requests for IP addresses on that device.
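As a rough sketch, the per-work-node setup Greg describes could look like the following on a 2.4-era Linux kernel with the LVS hidden patch applied. The VIP value and exact commands here are placeholders/assumptions, not roy's actual configuration:

```shell
# Work-node boot-time setup for LVS direct routing (run as root).
# VIP is a placeholder standing in for roy's service address.
VIP=10.0.0.33

# Bring up a loopback alias (lo:0) holding the service address.
# The host netmask (/32) keeps the node from routing the whole
# subnet through the loopback.
ifconfig lo:0 "$VIP" netmask 255.255.255.255 up

# "Hidden patch" flags: with these set, the kernel will NOT answer
# ARP requests for addresses configured on hidden interfaces (lo),
# so only the control node answers ARPs for the service address.
echo 1 > /proc/sys/net/ipv4/conf/all/hidden
echo 1 > /proc/sys/net/ipv4/conf/lo/hidden
```

This is the boot-time configuration whose loss (the hidden flags getting unset) caused the morning's mis-routing.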
At this point, I will bet that even some of the network engineers don't quite grok how this all works even if they understood everything I said above. It's not immediately obvious how this all fits together. Maybe a whiteboard meeting at some point would be helpful.
So with the background established, here's what's supposed to happen. A packet comes in for the service address (roy.ucar.edu, 220.127.116.11). The router will ARP for this address, and get a response only from the main control node (because all the work nodes are programmed not to respond to ARPs for this address), and will thus forward the packet to the control node. The control node will determine which work node should get this packet, and send it back out on the LAN, but with the MAC address of the receiving work node. The work node will accept the packet (because it does have an interface corresponding to that address), and everything is hunky dory. Return packets are sent directly from the work node to the client, and do not have to go through the control node.
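For reference, the control-node side of a setup like this is configured with LVS's ipvsadm tool. A minimal sketch follows; the VIP, node names, and scheduler choice are illustrative assumptions, not the actual roy configuration:

```shell
# Control-node side of an LVS direct-routing setup (run as root).
# VIP stands in for roy's service address.
VIP=10.0.0.33

# Define the virtual SSH service; "-s lc" picks a least-connections
# scheduler, which farms new connections out to the least-busy node.
ipvsadm -A -t "$VIP":22 -s lc

# Register the work nodes as real servers; "-g" selects direct
# routing ("gatewaying"): the packet is re-sent on the LAN with the
# work node's MAC address, as described above.
for node in denis fiset biller sauve; do
    ipvsadm -a -t "$VIP":22 -r "$node" -g
done
```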
What went wrong this morning was that, somehow, as a result of the momentary outage of the GigE devices, the hidden flag for the loopback devices got unset. I do not understand the mechanism of how that happened. I presume that it has something to do with how the GigE device driver works. There are messages in the logs indicating that the hardware was reinitialized when the link came back online, which is probably when the hidden flag got unset.
Once the hidden flag is unset, things break seriously, because the next ARP sent by the router could be (and apparently was) answered by one of the work nodes instead of the control node, and that resulted in all inbound packets to the control node being mis-routed. That effectively breaks all the inbound connections, and also breaks anything else that is attempting to use IP routing via the control node, which includes sparky and lightning. This also explains why no problems were seen on the IBM systems, because they use a different node than the control node for packet forwarding.
To help prevent a similar situation from recurring, I have entered a cron job on all four work nodes that resets the hidden flag every minute. We could still have this problem if the router happened to make an ARP request in the brief interval of time between a fiber connection outage and the next run of the every-minute cron job, but there is no way to do this with any finer granularity than once a minute.
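A sketch of what such a cron job could look like (the script path is hypothetical, and the flag locations assume the 2.4 hidden patch described earlier):

```shell
# /usr/local/sbin/reset-hidden (hypothetical path): re-assert the
# hidden flags so a work node stops answering ARPs for the service
# address again if a driver reload has cleared them.
echo 1 > /proc/sys/net/ipv4/conf/all/hidden
echo 1 > /proc/sys/net/ipv4/conf/lo/hidden

# Root crontab entry on each work node, running it every minute
# (cron's finest granularity):
# * * * * * /usr/local/sbin/reset-hidden
```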
Please read the above-referenced URL before asking further questions.