
When Good NICs Do Bad Things: A Blast of IPv6 Multicast Listener Discovery Queries

Andrew Gallo

This is the write-up of a recent event we experienced on our network.  It is a combination of a journal of symptoms, the troubleshooting steps taken, and a brief overview of the environment and platforms involved.  This isn’t a forensic analysis of the cause or of different behaviors in various environments.  Rather, it’s meant to be a heads-up in case you see something similar in your environment.

Because this is an outage report, I’ll start with the good stuff: an explanation of what happened.  Some steps taken during the investigation and some details of our environment have been left out.  My purpose is to share our experiences, the symptoms, and the commands used to identify the problem so that others faced with this situation can identify and isolate it more quickly.

Certain Intel NICs, when the host machine goes to sleep, will send out excessive amounts of IPv6 multicast listener report traffic.

In our case, these were Dell 9020s with Intel I217-LM NICs.

Through a combination of details about our environment, this caused a large-scale outage.  Initial troubleshooting did not lead to the cause, as will be detailed below.

Initial Observations

Our monitoring systems were reporting major outages throughout our network.  We saw lots of messages indicating ports going down and up.  Just about all the ports on the Cisco VSS (2×6509), over 250 ports, were going up and down.  There was no indication of why, just link up/down:

%LINEPROTO-SW1_SP-5-UPDOWN: Line protocol on Interface GigabitEthernet2/3/11, changed state to down
%LINEPROTO-SW1_SP-5-UPDOWN: Line protocol on Interface Port-channel253, changed state to down
%LINK-3-UPDOWN: Interface Port-channel12, changed state to up
%LINK-SW1_SP-3-UPDOWN: Interface Port-channel253, changed state to down
%LINEPROTO-SW1_SP-5-UPDOWN: Line protocol on Interface GigabitEthernet1/1/12, changed state to up

 Accessing switches on the other side of these links did not provide any information as to why this was happening.

Going through the basic troubleshooting steps didn’t show any useful information.  The usual suspects of spanning tree, broadcast storms, and high processor utilization were all missing.  The lack of high processor utilization turned out to be platform dependent and may be different on your network.

We thought we had stabilized the network, but were proven wrong.  I’ll leave out the details of what we did to get to that point.  We opened a TAC case and dove deeper into the box.  This is when we found the cause.

First off, my usual step of checking switch health via show proc cpu was misleading.  Our VSS is built on Supervisor 720s, which have separate route processor (RP) and switch processor (SP) components.  Our route processor was fine, but the switch processor was pegged at 100%.  This was determined by running a remote command on the switch processor:

remote command switch show proc cpu sort

CPU utilization for five seconds: 99%/81%; one minute: 99%; five minutes: 99%
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
 103       15684    121518        129 100.00% 67.37% 66.97%   0 Heartbeat Proces
 578     4819716    247249      19493  11.43%  8.85%  8.88%   0 LTL MGR

What this tells us:

  • Interrupt usage (81%) is very high – this is bad
  • Heartbeat process is 100% – this is bad
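
In this output format, the first number (99%) is total CPU and the second (81%) is time spent at interrupt level, which is generally the CPU switching punted traffic in software.  If you need to sift through this kind of output from many devices, a few lines of Python can flag the bad ones.  The snippet below is only an illustrative sketch, not something we used during the event:

import re

# Illustrative sketch: pull the total and interrupt-level CPU percentages
# out of the first line of "show proc cpu" output.  The sample line is the
# one captured during the event.
line = "CPU utilization for five seconds: 99%/81%; one minute: 99%; five minutes: 99%"

m = re.search(r"five seconds:\s*(\d+)%/(\d+)%", line)
if m:
    total, interrupt = map(int, m.groups())
    print(f"5-second CPU: {total}% total, {interrupt}% at interrupt level")
    # A high interrupt percentage means the CPU is busy handling punted
    # traffic rather than running its own processes.
    if interrupt >= 50:
        print("warning: heavy software switching / punted traffic")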

The switch was so busy that it was unable to manage its line cards, which explains why the ports were going up and down.  Analysis of the syslog after the fact revealed a couple of interesting messages:

%PFREDUN-SW1_SP-7-KPA_WARN: RF KPA messages have not been heard for 27 seconds
%MLSM-6-LC_SCP_FAILURE: NMP encountered internal communication failure for
%ICC-SW1_SP-5-WATERMARK: 1055 pkts for class EARL_L2-DRV are waiting to be processed

What this tells us:

  • Keepalive messages haven’t been received.  Best I can tell, this indicates that the standby processor hasn’t responded to keepalives (or it had responded, but the active SP couldn’t process the responses).
  • The SP was unable to communicate with/update the CEF tables on the line cards; this caused traffic to be software switched, pouring gasoline on the fire
  • Inter-card communication: There are heartbeat packets between the SP and line cards that are queued and waiting to be processed.
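
If you archive syslog centrally, these control-plane health messages are easy to scan for after the fact.  The sketch below is illustrative only; the file name is an assumption, and the patterns are taken from the messages shown above (other platforms will log different mnemonics):

import re

# Illustrative sketch: scan an archived syslog file for the control-plane
# health messages described above.  "syslog.txt" is an assumed file name.
PATTERNS = [
    r"KPA_WARN",         # RF keepalives not heard
    r"LC_SCP_FAILURE",   # SP <-> line card communication failure
    r"ICC-.*WATERMARK",  # inter-card packets backing up in a queue
]

with open("syslog.txt") as log:
    for line in log:
        if any(re.search(p, line) for p in PATTERNS):
            print(line.rstrip())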

To determine what was causing the high SP CPU, we ran debug netdr capture rx.  This captures packets destined for the CPU.  In our case it was run on the SP because that was the subsystem having the problem.  The results can be viewed with show netdr captured-packets.  A partial output:

A total of 4096 packets have been captured
The capture buffer wrapped 0 times
Total capture capacity: 4096 packets

------- dump of incoming inband packet -------
interface NULL, routine mistral_process_rx_packet_inlin, timestamp 10:29:55.297
dbus info: src_vlan 0x373(883), src_indx 0x1070(4208), len 0x5A(90)
  bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x5802(22530)
  2E820400 03730000 10700000 5A080000 0C000060 07000004 00000000 5802E3D8
mistral hdr: req_token 0x0(0), src_index 0x1070(4208), rx_offset 0x76(118)
  requeue 0, obl_pkt 0, vlan 0x373(883)
destmac 33.33.00.00.00.01, srcmac C8.1F.66.A8.EA.87, protocol 86DD
protocol ipv6: version 6, flow 1610612736, payload 32, nexthdr 0, hoplt 1
class 0, src FE80::CA1F:66FF:FEA8:EA87, dst FF02::1

------- dump of incoming inband packet -------
interface NULL, routine mistral_process_rx_packet_inlin, timestamp 10:29:55.297
dbus info: src_vlan 0x373(883), src_indx 0x1070(4208), len 0x5A(90)
  bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x5802(22530)
  36820400 03730000 10700000 5A080000 0C000020 07000004 00000000 58027BC5
mistral hdr: req_token 0x0(0), src_index 0x1070(4208), rx_offset 0x76(118)
  requeue 0, obl_pkt 0, vlan 0x373(883)
destmac 33.33.00.00.00.01, srcmac C8.1F.66.A8.73.29, protocol 86DD
protocol ipv6: version 6, flow 1610612736, payload 32, nexthdr 0, hoplt 1
class 0, src FE80::CA1F:66FF:FEA8:7329, dst FF02::1

After parsing through the file, we determined a handful of machines were generating an inordinate amount of IPv6 multicast listener report traffic.  The key things from this output:

  • protocol 86DD – IPv6

  • destination IPv6 – FF02::1 (all nodes multicast)

  • srcmac C8.1F.66.A8.73.29 – offending machine

  • next-header – 0 (hop-by-hop option)

  • hoplt 1 – Hop Limit of 1

Furthermore, it took just 0.092 seconds to collect the 4096 packets, which works out to roughly 44,000 packets per second arriving at the SP.  There were 8 MAC addresses that stood out as clearly generating all this traffic.  We estimate that this group of machines was generating about 40,000 packets per second, all of which had to be handled in software.  Simply too much.  The SP couldn’t handle the load and was unable to manage its own line cards, causing several hundred ports to flap.
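
The parsing itself is simple: tally the srcmac field across the capture and see who dominates.  The sketch below is illustrative rather than the exact method we used, and the file name is an assumption (save the show netdr captured-packets output to it first):

import re
from collections import Counter

# Illustrative sketch: count how many captured packets each source MAC
# accounts for in a saved copy of "show netdr captured-packets" output.
# "netdr.txt" is an assumed file name.
srcmacs = Counter()
with open("netdr.txt") as dump:
    for line in dump:
        m = re.search(r"srcmac ([0-9A-Fa-f]{2}(?:\.[0-9A-Fa-f]{2}){5})", line)
        if m:
            srcmacs[m.group(1)] += 1

for mac, count in srcmacs.most_common(10):
    print(f"{mac}  {count} packets")

# For scale: 4096 packets in 0.092 seconds is roughly
# 4096 / 0.092 = ~44,500 packets per second hitting the SP.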

To quickly stabilize things, we deleted the VLAN hosting these machines.  Processor utilization dropped to normal levels.

To summarize

  • Direct Cause Analysis:

    • SP CPU so high that the switch was unable to maintain internal communications between itself and the line cards, causing all ports to flap.

  • Contributing Causes:

    • Large, flat layer-2 domain
    • Platform architecture with separate route and switch processors, which misled initial troubleshooting
  • Root Cause Analysis:

    • Bad NIC driver from Intel causing machines in certain sleep states to generate inordinate amounts of IPv6 Multicast Listener Report traffic.

  • Remediation:

    • Deleted the VLAN hosting these machines, thereby preventing the traffic from reaching the SP.  This is neither a scalable nor a permanent solution.

There are some indications that not having an SVI (VLAN interface) for this network (which is routed by a firewall) contributed either to the problem or to its isolation.  This remains unclear.  Would the route processor have taken the brunt of the high processor load if there had been an SVI?  Might this have been easier to troubleshoot in that topology?

It didn’t matter that the VSS didn’t have IPv6 enabled, which is of great concern.

Due to resource constraints, we are unable to do an in-depth analysis of how different platforms, topologies, etc. would have behaved in a similar situation.  Please feel free to comment and share any experiences.

Other organizations have seen this problem and, to varying degrees, had network disruptions because of it.  This article has some additional details about Dell machines seeing this issue, along with a pcap file.  Intel is aware of the issue and there are fixes (at the driver level, at the BIOS level, and by disabling IPv6).  Make sure your drivers are up to date, and as much as I encourage the adoption of IPv6, if you aren’t using it, disable it on your end stations.
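
Finally, if you want to check whether hosts on your own network are exhibiting this behavior, a few seconds of capture on a mirrored port, counted per source MAC, is usually enough to spot an offender.  The sketch below uses Python and scapy; it is illustrative only (we did not use it during the event, and the interface name is an assumption):

from collections import Counter
from scapy.all import sniff, Ether, IPv6

per_mac = Counter()

def tally(pkt):
    # Count link-scope (ff02::/16) IPv6 multicast per source MAC.
    if Ether in pkt and IPv6 in pkt and pkt[IPv6].dst.startswith("ff02:"):
        per_mac[pkt[Ether].src] += 1

# Capture for 10 seconds on the mirror/SPAN interface ("eth0" is an assumption).
sniff(iface="eth0", filter="ip6", prn=tally, store=False, timeout=10)

for mac, count in per_mac.most_common(10):
    print(f"{mac}  {count} pkts  (~{count / 10:.0f} pps)")

Anything showing thousands of packets per second of MLD-style traffic from a single MAC is worth a closer look.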