k***@public.gmane.org
2011-09-14 04:35:11 UTC
We have a system with a SuperMicro X8DTH-6F dual socket motherboard with dual Xeon X5680 processors. The X8DTH-6F motherboard has dual GigE NIC powered by the Intel 82576 (kawela) chipset [attached by igb driver]. We've got S10u9 installed:
$ uname -a
SunOS r10.atl.efoldering.com 5.10 Generic_144489-06 i86pc i386 i86pc
The symptoms are very similar to what was reported here:
http://opensolaris.org/jive/thread.jspa?threadID=124411
Since the chipset is completely different in this case (igb instead of e1000g) I'm posting to a new thread.
The system is under a constant incoming load between 40 Mbit/sec to 240 Mbit/sec. At random times (usually about once per day, sometimes twice per day, sometimes not at all), the system disappears off the network (it can't ping other systems, and other systems can't ping it) for between 60 seconds to up to 30 minutes, after which it will mysteriously appear on the network again all without intervention.
We actually have both igb0 and igb1 interfaces wired with separate IPs, and both IPs are not pingable when the system is not visible on the network. We collected some packet count stats with kstat and we do see ipackets64 and opackets64 increasing (5 minutes between measurements), but IP receive counters are not increasing much (518 packets received, but only an increase of 1 in the IP inDelivers). Also when the system isn't visible on the network, running 'arp -an' freezes -- using dtrace on running arp shows that it freezes during the syscall to open for the file /dev/arp -- when the system decides to appear on the network again all of the pending 'arp -an' processes suddenly continue and print their output like normal. This makes me think there is somehow a deadlock in the ARP layer tha
t is eventually timing out.
Any ideas on what might be going on, or what dtrace commands we might try running to diagnose this further?
To help diagnose the situation, since it occurs at random times, we wrote a cron job script do-ping-network.sh as follows:
##########################################################################
#!/bin/bash
LOG="/tmp/do-ping-network.log.txt"
echo "========================================================================= " >> "$LOG"
date >> "$LOG"
ping -s 192.168.3.1 1024 4 >> "$LOG" 2>&1
echo "------------------------------------------------------------------------------------------------------------" >> "$LOG"
date >> "$LOG"
kstat -m igb 1 2 >> "$LOG"
echo "------------------------------------------------------------------------------------------------------------" >> "$LOG"
date >> "$LOG"
kstat -m ip -n ip 1 2 >> "$LOG"
echo "------------------------------------------------------------------------------------------------------------" >> "$LOG"
date >> "$LOG"
#arp -an >> "$LOG" 2>&1
/root/do-trace-arp.sh >> "$LOG" 2>&1
##########################################################################
This script is run every 5 minutes from cron, so we have a record of when the network stops working. When the network stops working ping reports 100% packet loss. Note that we run kstat to collect some stats after ping. Today we had an outage for 10 minutes. Here is the relevant kstat output from the first time after ping reported 100% packet loss:
=================================================================
Tue Sep 13 19:10:13 EDT 2011
module: igb instance: 0
name: igb0 class: net
ierrors 2
ipackets64 13442703584
norcvbuf 285
obytes64 5522625154592
opackets64 13191458263
rbytes64 81689304694198
module: igb instance: 0
name: mac class: net
ipackets64 13442703584
obytes64 5522625154592
opackets64 13191458263
rbytes64 81689304694198
module: igb instance: 0
name: statistics class: net
XOFFs_recvd 34915
XOFFs_xmitd 97389
XONs_recvd 34915
XONs_xmitd 177
recv_missed_packets 224854
module: ip instance: 0
name: ip class: mib2
inDelivers 61165291672
inDiscards 721
inReceives 61163929129
outRequests 32516565932
=================================================================
And here's the kstat output 5 minutes later (the system is still not visible on the network):
=================================================================
module: igb instance: 0
name: igb0 class: net
ierrors 2
ipackets64 13442704102
norcvbuf 285
obytes64 5522625158872
opackets64 13191458267
rbytes64 81689304728036
module: igb instance: 0
name: mac class: net
ipackets64 13442704102
obytes64 5522625158872
opackets64 13191458267
rbytes64 81689304728036
module: igb instance: 0
name: statistics class: net
XOFFs_recvd 34915
XOFFs_xmitd 97389
XONs_recvd 34915
XONs_xmitd 177
recv_missed_packets 225200
module: ip instance: 0
name: ip class: mib2
inDelivers 61165291763
inDiscards 721
inReceives 61163929129
outRequests 32516565936
=================================================================
I elided uninteresting counter values -- all error counters are 0 or very close to 0, I can provide full output if there is interest. Also note that I'm only including igb0 stats -- igb1 link is active but there is no traffic flowing to it. igb0 has a fair amount of TCP traffic being received all the time.
Some observations:
* ipackets64 increased by 518 packets (presumably from existing TCP connections that continued to try to send data).
* opackets64 increased by 4 packets (the 4 pings that were sent out by the cron job -- note that the ping packet size is set to 1024, so that explains why the average packet size sent was 1070 [IP+ICMP headers + ping payload])
* recv_missed_packets increased by 346 packets (not sure what recv_missed_packets means for igb driver?)
* ip stats inDelivers increased by 91.
* ip stats inReceives did not change.
* ip stats outRequests increased by 4 (the 4 pings sent by the cron job)
Using dtrace we can see that when arp runs while the network is not responsive, the arp process runs fine until is tries the open syscall to open /dev/arp and that's where it gets stuck until the network finally responds again.
Also, this may or may not be relevant, but when the network is down we can login on the console and ifconfig -a is responsive and does not block. Also, logging in on the console sometimes seems to almost immediately clear the problem. I wouldn't believe it if I didn't see it with my own eyes, since I can't see how these two events could possibly be connected. However, we have seen cases where we logged in on the console and the network connectivity problems persisted for several more minutes. We have also tried forcing the interface to go down and come back up using ifconfig and that seems to clear the problem immediately as well.
Any ideas or suggestions will be appreciated.
Thanks,
Kevin
$ uname -a
SunOS r10.atl.efoldering.com 5.10 Generic_144489-06 i86pc i386 i86pc
The symptoms are very similar to what was reported here:
http://opensolaris.org/jive/thread.jspa?threadID=124411
Since the chipset is completely different in this case (igb instead of e1000g) I'm posting to a new thread.
The system is under a constant incoming load between 40 Mbit/sec to 240 Mbit/sec. At random times (usually about once per day, sometimes twice per day, sometimes not at all), the system disappears off the network (it can't ping other systems, and other systems can't ping it) for between 60 seconds to up to 30 minutes, after which it will mysteriously appear on the network again all without intervention.
We actually have both igb0 and igb1 interfaces wired with separate IPs, and both IPs are not pingable when the system is not visible on the network. We collected some packet count stats with kstat and we do see ipackets64 and opackets64 increasing (5 minutes between measurements), but IP receive counters are not increasing much (518 packets received, but only an increase of 1 in the IP inDelivers). Also when the system isn't visible on the network, running 'arp -an' freezes -- using dtrace on running arp shows that it freezes during the syscall to open for the file /dev/arp -- when the system decides to appear on the network again all of the pending 'arp -an' processes suddenly continue and print their output like normal. This makes me think there is somehow a deadlock in the ARP layer tha
t is eventually timing out.
Any ideas on what might be going on, or what dtrace commands we might try running to diagnose this further?
To help diagnose the situation, since it occurs at random times, we wrote a cron job script do-ping-network.sh as follows:
##########################################################################
#!/bin/bash
LOG="/tmp/do-ping-network.log.txt"
echo "========================================================================= " >> "$LOG"
date >> "$LOG"
ping -s 192.168.3.1 1024 4 >> "$LOG" 2>&1
echo "------------------------------------------------------------------------------------------------------------" >> "$LOG"
date >> "$LOG"
kstat -m igb 1 2 >> "$LOG"
echo "------------------------------------------------------------------------------------------------------------" >> "$LOG"
date >> "$LOG"
kstat -m ip -n ip 1 2 >> "$LOG"
echo "------------------------------------------------------------------------------------------------------------" >> "$LOG"
date >> "$LOG"
#arp -an >> "$LOG" 2>&1
/root/do-trace-arp.sh >> "$LOG" 2>&1
##########################################################################
This script is run every 5 minutes from cron, so we have a record of when the network stops working. When the network stops working ping reports 100% packet loss. Note that we run kstat to collect some stats after ping. Today we had an outage for 10 minutes. Here is the relevant kstat output from the first time after ping reported 100% packet loss:
=================================================================
Tue Sep 13 19:10:13 EDT 2011
module: igb instance: 0
name: igb0 class: net
ierrors 2
ipackets64 13442703584
norcvbuf 285
obytes64 5522625154592
opackets64 13191458263
rbytes64 81689304694198
module: igb instance: 0
name: mac class: net
ipackets64 13442703584
obytes64 5522625154592
opackets64 13191458263
rbytes64 81689304694198
module: igb instance: 0
name: statistics class: net
XOFFs_recvd 34915
XOFFs_xmitd 97389
XONs_recvd 34915
XONs_xmitd 177
recv_missed_packets 224854
module: ip instance: 0
name: ip class: mib2
inDelivers 61165291672
inDiscards 721
inReceives 61163929129
outRequests 32516565932
=================================================================
And here's the kstat output 5 minutes later (the system is still not visible on the network):
=================================================================
module: igb instance: 0
name: igb0 class: net
ierrors 2
ipackets64 13442704102
norcvbuf 285
obytes64 5522625158872
opackets64 13191458267
rbytes64 81689304728036
module: igb instance: 0
name: mac class: net
ipackets64 13442704102
obytes64 5522625158872
opackets64 13191458267
rbytes64 81689304728036
module: igb instance: 0
name: statistics class: net
XOFFs_recvd 34915
XOFFs_xmitd 97389
XONs_recvd 34915
XONs_xmitd 177
recv_missed_packets 225200
module: ip instance: 0
name: ip class: mib2
inDelivers 61165291763
inDiscards 721
inReceives 61163929129
outRequests 32516565936
=================================================================
I elided uninteresting counter values -- all error counters are 0 or very close to 0, I can provide full output if there is interest. Also note that I'm only including igb0 stats -- igb1 link is active but there is no traffic flowing to it. igb0 has a fair amount of TCP traffic being received all the time.
Some observations:
* ipackets64 increased by 518 packets (presumably from existing TCP connections that continued to try to send data).
* opackets64 increased by 4 packets (the 4 pings that were sent out by the cron job -- note that the ping packet size is set to 1024, so that explains why the average packet size sent was 1070 [IP+ICMP headers + ping payload])
* recv_missed_packets increased by 346 packets (not sure what recv_missed_packets means for igb driver?)
* ip stats inDelivers increased by 91.
* ip stats inReceives did not change.
* ip stats outRequests increased by 4 (the 4 pings sent by the cron job)
Using dtrace we can see that when arp runs while the network is not responsive, the arp process runs fine until is tries the open syscall to open /dev/arp and that's where it gets stuck until the network finally responds again.
Also, this may or may not be relevant, but when the network is down we can login on the console and ifconfig -a is responsive and does not block. Also, logging in on the console sometimes seems to almost immediately clear the problem. I wouldn't believe it if I didn't see it with my own eyes, since I can't see how these two events could possibly be connected. However, we have seen cases where we logged in on the console and the network connectivity problems persisted for several more minutes. We have also tried forcing the interface to go down and come back up using ifconfig and that seems to clear the problem immediately as well.
Any ideas or suggestions will be appreciated.
Thanks,
Kevin
--
This message posted from opensolaris.org
This message posted from opensolaris.org