Discussion:
igb becomes sender only on Intel 82576 [kawela] (X8DTH-6F motherboard)
k***@public.gmane.org
2011-09-14 04:35:11 UTC
We have a system with a SuperMicro X8DTH-6F dual socket motherboard with dual Xeon X5680 processors. The X8DTH-6F motherboard has dual GigE NIC powered by the Intel 82576 (kawela) chipset [attached by igb driver]. We've got S10u9 installed:

$ uname -a
SunOS r10.atl.efoldering.com 5.10 Generic_144489-06 i86pc i386 i86pc

The symptoms are very similar to what was reported here:
http://opensolaris.org/jive/thread.jspa?threadID=124411

Since the chipset is completely different in this case (igb instead of e1000g) I'm posting to a new thread.

The system is under a constant incoming load of between 40 and 240 Mbit/sec. At random times (usually about once per day, sometimes twice, sometimes not at all), the system disappears off the network (it can't ping other systems, and other systems can't ping it) for anywhere from 60 seconds to 30 minutes, after which it mysteriously appears on the network again, all without intervention.

We actually have both igb0 and igb1 interfaces wired with separate IPs, and neither IP is pingable when the system is not visible on the network. We collected some packet count stats with kstat (5 minutes between measurements) and we do see ipackets64 and opackets64 increasing, but the IP receive counters are barely moving (518 packets received at the driver, but an increase of only 1 in IP inDelivers). Also, when the system isn't visible on the network, running 'arp -an' freezes -- dtrace on the running arp shows that it freezes during the open syscall for /dev/arp -- and when the system decides to appear on the network again, all of the pending 'arp -an' processes suddenly continue and print their output like normal. This makes me think there is somehow a deadlock in the ARP layer that eventually times out.

Any ideas on what might be going on, or what dtrace commands we might try running to diagnose this further?

To help diagnose the situation, since it occurs at random times, we wrote the following script, do-ping-network.sh, run from cron:

##########################################################################
#!/bin/bash

LOG="/tmp/do-ping-network.log.txt"
SEP="------------------------------------------------------------------------------------------------------------"

echo "=========================================================================" >> "$LOG"
date >> "$LOG"
ping -s 192.168.3.1 1024 4 >> "$LOG" 2>&1
echo "$SEP" >> "$LOG"
date >> "$LOG"
kstat -m igb 1 2 >> "$LOG"
echo "$SEP" >> "$LOG"
date >> "$LOG"
kstat -m ip -n ip 1 2 >> "$LOG"
echo "$SEP" >> "$LOG"
date >> "$LOG"
#arp -an >> "$LOG" 2>&1
/root/do-trace-arp.sh >> "$LOG" 2>&1
##########################################################################
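One practical hazard with the script above: while the network is wedged, each cron run leaves another 'arp -an' (or trace script) stuck in open(/dev/arp), and the pending copies pile up until the network recovers. Solaris 10 has no timeout(1) utility, so here is a minimal background-kill wrapper sketch (the helper name run_with_limit is my own invention, not part of the original script) that bounds how long any one invocation can hang:

```shell
#!/bin/bash
# Sketch: run a command, but kill it if it runs longer than a limit, so a
# hung 'arp -an' doesn't leave one stuck process behind per cron run.
run_with_limit() {
    local limit="$1"; shift
    "$@" &                                   # run the command in the background
    local pid=$!
    ( sleep "$limit" && kill "$pid" ) >/dev/null 2>&1 &  # delayed killer
    local killer=$!
    wait "$pid"                              # command's status, or 143 if killed
    local status=$?
    kill "$killer" 2>/dev/null               # cancel the killer if we finished early
    return "$status"
}

# In do-ping-network.sh, the arp/trace step would become, e.g.:
# run_with_limit 60 /root/do-trace-arp.sh >> "$LOG" 2>&1
```

If the command is killed at the limit, wait returns 128+SIGTERM = 143, which the wrapper passes through so the caller can tell a timeout from a normal exit.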

This script is run every 5 minutes from cron, so we have a record of when the network stops working; when it does, ping reports 100% packet loss. Note that we run kstat to collect stats right after ping. Today we had a 10-minute outage. Here is the relevant kstat output from the first run in which ping reported 100% packet loss:

=================================================================
Tue Sep 13 19:10:13 EDT 2011

module: igb instance: 0
name: igb0 class: net
ierrors 2
ipackets64 13442703584
norcvbuf 285
obytes64 5522625154592
opackets64 13191458263
rbytes64 81689304694198

module: igb instance: 0
name: mac class: net
ipackets64 13442703584
obytes64 5522625154592
opackets64 13191458263
rbytes64 81689304694198

module: igb instance: 0
name: statistics class: net
XOFFs_recvd 34915
XOFFs_xmitd 97389
XONs_recvd 34915
XONs_xmitd 177
recv_missed_packets 224854


module: ip instance: 0
name: ip class: mib2
inDelivers 61165291672
inDiscards 721
inReceives 61163929129
outRequests 32516565932
=================================================================

And here's the kstat output 5 minutes later (the system is still not visible on the network):

=================================================================
module: igb instance: 0
name: igb0 class: net
ierrors 2
ipackets64 13442704102
norcvbuf 285
obytes64 5522625158872
opackets64 13191458267
rbytes64 81689304728036

module: igb instance: 0
name: mac class: net
ipackets64 13442704102
obytes64 5522625158872
opackets64 13191458267
rbytes64 81689304728036

module: igb instance: 0
name: statistics class: net
XOFFs_recvd 34915
XOFFs_xmitd 97389
XONs_recvd 34915
XONs_xmitd 177
recv_missed_packets 225200


module: ip instance: 0
name: ip class: mib2
inDelivers 61165291763
inDiscards 721
inReceives 61163929129
outRequests 32516565936
=================================================================

I elided uninteresting counter values -- all error counters are 0 or very close to it; I can provide full output if there is interest. Also note that I'm only including igb0 stats -- the igb1 link is up but carries no traffic. igb0 receives a fair amount of TCP traffic all the time.

Some observations:
* ipackets64 increased by 518 packets (presumably from existing TCP connections that continued to try to send data).
* opackets64 increased by 4 packets (the 4 pings that were sent out by the cron job -- note that the ping packet size is set to 1024, so that explains why the average packet size sent was 1070 [IP+ICMP headers + ping payload])
* recv_missed_packets increased by 346 packets (I'm not sure what recv_missed_packets means for the igb driver).
* ip stats inDelivers increased by 91.
* ip stats inReceives did not change.
* ip stats outRequests increased by 4 (the 4 pings sent by the cron job)
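To avoid doing these subtractions by hand each time, a small awk sketch can diff consecutive counter readings; the here-document below carries the igb0 values from the two snapshots above as sample input:

```shell
#!/bin/bash
# Sketch: print the delta for each counter seen twice in the input.
# Input format: "<counter-name> <value>", first snapshot then second.
awk '{
    if ($1 in first)
        printf "%s +%d\n", $1, $2 - first[$1]
    else
        first[$1] = $2
}' <<'EOF'
ipackets64 13442703584
opackets64 13191458263
obytes64 5522625154592
ipackets64 13442704102
opackets64 13191458267
obytes64 5522625158872
EOF
```

Against the sample values it prints ipackets64 +518, opackets64 +4, and obytes64 +4280; 4280 bytes over 4 pings is the 1070-byte average packet size noted above.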

Using dtrace we can see that when arp runs while the network is unresponsive, the arp process runs fine until it tries the open syscall on /dev/arp, and that's where it stays stuck until the network finally responds again.

Also, this may or may not be relevant, but when the network is down we can log in on the console, and ifconfig -a is responsive and does not block. Logging in on the console sometimes seems to clear the problem almost immediately. I wouldn't believe it if I hadn't seen it with my own eyes, since I can't see how the two events could possibly be connected; and we have also seen cases where we logged in on the console and the connectivity problems persisted for several more minutes. Forcing the interface down and back up with ifconfig also seems to clear the problem immediately.
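Since an ifconfig down/up clears the problem, an automated mitigation is tempting while the root cause is chased. This is only a sketch (the gateway address is the one our cron job pings; the PING/IFCONFIG indirection exists purely so the logic can be exercised off-box), and it obviously papers over the bug rather than fixing it:

```shell
#!/bin/bash
# Sketch of a watchdog: if the gateway stops answering pings, bounce the
# interface, mirroring the manual ifconfig down/up that clears the hang.
GW="${GW:-192.168.3.1}"          # gateway to probe (assumption for this network)
IF="${IF:-igb0}"                 # interface to bounce
PING="${PING:-ping}"             # overridable so the logic can be tested
IFCONFIG="${IFCONFIG:-ifconfig}"

check_and_bounce() {
    # Solaris ping syntax: ping <host> <timeout-seconds>
    if ! $PING "$GW" 2 >/dev/null 2>&1; then
        echo "$(date): $GW unreachable, bouncing $IF"
        $IFCONFIG "$IF" down
        sleep 1
        $IFCONFIG "$IF" up
    fi
}

# From cron, e.g. every minute:
# check_and_bounce >> /var/log/net-watchdog.log 2>&1
```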

Any ideas or suggestions will be appreciated.

Thanks,
Kevin
--
This message posted from opensolaris.org
John Cui
2011-09-14 07:54:43 UTC
Hi Kevin,
This symptom looks similar to a known Intel CPU issue:
Erratum AAK76
EOI Transaction May Not be Sent if Software Enters Core C6 During An Interrupt Service Routine
==================================================================
Problem: If core C6 is entered after the start of an interrupt service routine but before a write to the APIC EOI (End Of Interrupt) register, and the core is woken by an event other than a fixed interrupt source, the core may drop the EOI transaction the next time the APIC EOI register is written, and further interrupts from the same or lower priority level will be blocked.

Please try the workaround:
Add the line

set idle_cpu_no_deep_c = 1

to /etc/system and reboot.

Cheers,
--
This message posted from opensolaris.org
k***@public.gmane.org
2011-09-15 09:13:34 UTC
John,

Many thanks for your reply!

I've added the line to /etc/system but I haven't rebooted yet. According to http://wesunsolve.net/bugid/id/6958068 another workaround is to add

cpu_deep_idle disable

to /etc/power.conf and then run pmconfig. The workaround doesn't mention having to reboot after running pmconfig, and the pmconfig man page indicates that changes take effect immediately. Is this sufficient to change the cpu_deep_idle setting, or do I actually need to reboot? (This system is sensitive to downtime from rebooting, so I'm hoping to avoid the reboot if it's not necessary.)
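For my own notes, here are the two workarounds side by side, both taken from the erratum discussion and the bug report above:

```
# /etc/system (requires a reboot to take effect):
set idle_cpu_no_deep_c = 1

# /etc/power.conf (applied by running pmconfig):
cpu_deep_idle disable
```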

Also, before making the above change, I created a new dtrace script that traces the processing of the open(/dev/arp) system call in the arp process, and had it run from cron every 5 minutes. The issue occurred again after I put this tracing in place (but before the power-setting change above); the system was offline for a couple of minutes. I've attached the resulting dtrace output, as well as the dtrace scripts. The most relevant piece of the output is:

0 -> fop_open 10723 0 ms 1 us 19014758217431532
0 -> spec_open 10723 0 ms 1 us 19014758217432883
0 -> secpolicy_spec_open 10723 0 ms 1 us 19014758217434464
0 ...
0 <- secpolicy_spec_open 10723 0 ms 1 us 19014758217461622
0 -> spec_lockcsp 10723 0 ms 1 us 19014758217463617
0 <- spec_lockcsp 10723 0 ms 1 us 19014758217464847
0 -> stropen 10723 0 ms 1 us 19014758217466486
0 -> allocq 10723 0 ms 1 us 19014758217468424
0 -> kmem_cache_alloc 10723 0 ms 0 us 19014758217469378
0 <- kmem_cache_alloc 10723 0 ms 1 us 19014758217470382
0 <- allocq 10723 0 ms 1 us 19014758217471735
0 -> shalloc 10723 0 ms 1 us 19014758217473117
0 -> kmem_cache_alloc 10723 0 ms 1 us 19014758217474227
0 <- kmem_cache_alloc 10723 0 ms 1 us 19014758217475314
0 <- shalloc 10723 0 ms 1 us 19014758217477226
0 -> setq 10723 0 ms 1 us 19014758217478676
0 <- setq 10723 0 ms 1 us 19014758217480417
0 -> set_qend 10723 0 ms 1 us 19014758217482116
0 <- set_qend 10723 0 ms 1 us 19014758217483702
0 -> qattach 10723 0 ms 1 us 19014758217485231
0 -> allocq 10723 0 ms 0 us 19014758217486207
0 -> kmem_cache_alloc 10723 0 ms 0 us 19014758217486886
0 <- kmem_cache_alloc 10723 0 ms 0 us 19014758217487662
0 <- allocq 10723 0 ms 0 us 19014758217488579
0 -> setq 10723 0 ms 1 us 19014758217489594
0 <- setq 10723 0 ms 0 us 19014758217490432
0 -> entersq 10723 0 ms 1 us 19014758217492028
0 <- entersq 10723 0 ms 1 us 19014758217493900
0 -> ip_open 10723 0 ms 1 us 19014758217495654
0 ...
0 <- ip_open 10723 0 ms 1 us 19014758217563000
0 -> leavesq 10723 0 ms 1 us 19014758217564404
0 <- leavesq 10723 0 ms 1 us 19014758217565782
0 <- qattach 10723 0 ms 1 us 19014758217567311
0 -> zoneid_to_netstackid 10723 0 ms 1 us 19014758217568959
0 -> netstack_find_shared_zoneid 10723 0 ms 1 us 19014758217570399
0 <- netstack_find_shared_zoneid 10723 0 ms 1 us 19014758217571794
0 <- zoneid_to_netstackid 10723 0 ms 1 us 19014758217573281
0 -> netstack_find_by_stackid 10723 0 ms 1 us 19014758217574665
0 -> netstack_hold 10723 0 ms 0 us 19014758217575610
0 <- netstack_hold 10723 0 ms 0 us 19014758217576309
0 <- netstack_find_by_stackid 10723 0 ms 1 us 19014758217577415
0 -> push_mod 10723 0 ms 1 us 19014758217579097
0 -> fmodsw_find 10723 0 ms 1 us 19014758217580829
0 <- fmodsw_find 10723 0 ms 1 us 19014758217582630
0 -> qattach 10723 0 ms 0 us 19014758217583590
0 -> allocq 10723 0 ms 0 us 19014758217584330
0 ...
0 <- allocq 10723 0 ms 1 us 19014758217586850
0 -> setq 10723 0 ms 0 us 19014758217587650
0 <- setq 10723 0 ms 1 us 19014758217588671
0 -> entersq 10723 0 ms 0 us 19014758217589425
0 -> cv_wait 10723 0 ms 1 us 19014758217590660
0 -> thread_lock 10723 0 ms 1 us 19014758217591880
0 <- thread_lock 10723 0 ms 1 us 19014758217593066
0 -> cv_block 10723 0 ms 1 us 19014758217594205
(this is where it effectively waits for the CV to become available, which only becomes available several minutes later when the network starts working again)

As far as I can tell, stropen is used to construct the ARP stream device, and this eventually calls push_mod, which calls qattach. qattach then calls entersq, which calls cv_wait, which is where we block until network connectivity resumes (the thread actually context switches several times while it's waiting, but effectively it's blocked on the cv). I'm going to modify my dtrace script to try to get more information about this particular cv in case it happens again.

I'm crossing my fingers that the power-setting change fixes this for good! If not, I hope to have more detailed data to report. Since the issue usually occurs about once every 24 hours, I'll report back either way.

Thanks,
Kevin
--
This message posted from opensolaris.org