Discussion:
scalability and memory consumption of the DCE cache
Nils Goroll
2012-02-06 17:36:47 UTC
Hi,

I am analyzing issues on some s11 (151.0.1.12) production systems used as
Varnish web caches, handling a high number of concurrent connections from a
large number of TCP clients. Client-side statistics/benchmarks intermittently
show long connect times, e.g. a maximum connect time of 2.7 seconds reported by
ApacheBench, compared to the usual 0.6 seconds (which was already far too high,
see below).

I would very much appreciate it if I could discuss my current understanding
with someone.

* The issue is related to the DCE cache

Using

dtrace -n 'profile-1001 { @s[stack(),ustack()] = count();} END {trunc(@s,5)}'

I noticed that the top stacks were:

CPU ID FUNCTION:NAME
2 2 :END


ip`ip_snmp_get_mib2_ip_dce+0xbe
ip`ip_snmp_get+0x267
ip`snmpcom_req+0x293
ip`ip_wput_nondata+0xd5
unix`putnext+0x21e
unix`putnext+0x21e
unix`putnext+0x21e
unix`putnext+0x21e
genunix`strput+0x19d
genunix`strputmsg+0x2a0
genunix`msgio32+0x202
genunix`putmsg32+0x78
unix`_sys_sysenter_post_swapgs+0x149

0xfeef25d5
0x8053f39
0x80537d2
0x805300d
1067

ip`ip_snmp_get_mib2_ip_dce+0xbe
ip`ip_snmp_get+0x267
ip`snmpcom_req+0x293
ip`ip_wput_nondata+0xd5
unix`putnext+0x21e
unix`putnext+0x21e
unix`putnext+0x21e
unix`putnext+0x21e
genunix`strput+0x19d
genunix`strputmsg+0x2a0
genunix`msgio32+0x202
genunix`putmsg32+0x78
unix`_sys_sysenter_post_swapgs+0x149

0xfeef25d5
0x8053f39
0x80537d2
0x805300d
1072

ip`ip_snmp_get_mib2_ip_dce+0xbe
ip`ip_snmp_get+0x267
ip`snmpcom_req+0x293
ip`ip_wput_nondata+0xd5
unix`putnext+0x21e
unix`putnext+0x21e
unix`putnext+0x21e
unix`putnext+0x21e
genunix`strput+0x19d
genunix`strputmsg+0x2a0
genunix`msgio32+0x202
genunix`putmsg32+0x78
unix`_sys_sysenter_post_swapgs+0x149

0xfeef25d5
0x8053f39
0x80537d2
0x805300d
1073

ip`ip_snmp_get_mib2_ip_dce+0xbe
ip`ip_snmp_get+0x267
ip`snmpcom_req+0x293
ip`ip_wput_nondata+0xd5
unix`putnext+0x21e
unix`putnext+0x21e
unix`putnext+0x21e
unix`putnext+0x21e
genunix`strput+0x19d
genunix`strputmsg+0x2a0
genunix`msgio32+0x202
genunix`putmsg32+0x78
unix`_sys_sysenter_post_swapgs+0x149

libc.so.1`__putmsg+0x15
netstat`mibget+0x79
netstat`main+0x736
netstat`_start+0x7d
1089

unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`cpu_acpi_idle+0x8d
unix`cpu_idle_adaptive+0x19
unix`idle+0x112
unix`thread_start+0x8

84497

Trying to understand the source of the ip_snmp_get_mib2_ip_dce calls, I
identified a statistics script which called "netstat -an" every 10 seconds.
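
A quick way to confirm which processes trigger these MIB dumps (assuming the
fbt provider can attach to ip_snmp_get_mib2_ip_dce, which the stacks above
suggest) would be something like:

dtrace -n 'fbt::ip_snmp_get_mib2_ip_dce:entry { @[execname, pid, ppid] = count(); }'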

The situation improved when I stopped this script, but dtrace profiling still
showed dce lookups as four of the top five stacks:

CPU ID FUNCTION:NAME
3 2 :END


ip`dce_lookup_and_add_v4+0x85
ip`ip_set_destination_v4+0x1cf
ip`ip_attr_connect+0xc6
ip`conn_connect+0xeb
ip`tcp_set_destination+0x66
ip`tcp_input_listener+0x7be
ip`squeue_enter+0x440
ip`ip_fanout_v4+0x48d
ip`ire_recv_local_v4+0x1ce
ip`ill_input_short_v4+0x6aa
ip`ip_input+0x23b
mac`mac_rx_soft_ring_drain+0xdf
mac`mac_soft_ring_worker+0x111
unix`thread_start+0x8

272

ip`dce_lookup_and_add_v4+0x85
ip`ip_set_destination_v4+0x1cf
ip`ip_attr_connect+0xc6
ip`conn_connect+0xeb
ip`tcp_set_destination+0x66
ip`tcp_connect_ipv4+0x135
ip`tcp_do_connect+0x20f
ip`tcp_connect+0x9f
sockfs`so_connect+0xf8
sockfs`socket_connect+0x47
sockfs`connect+0xec
unix`sys_syscall+0x17a

libc.so.1`__so_connect+0xa
libsocket.so.1`connect+0x10
libvarnish.so.1.0.0`TCP_connect+0x6c
varnishd`VBE_TryConnect+0x15a
varnishd`bes_conn_try+0x11f
varnishd`VBE_GetVbe+0x3bc
varnishd`vdi_vslp_try+0x53
varnishd`vdi_vslp_getfd+0x464
varnishd`VBE_GetFd+0x161
varnishd`Fetch+0x346
varnishd`cnt_fetch+0x1da
varnishd`CNT_Session+0x5fe
varnishd`wrk_do_cnt_sess+0x143
varnishd`wrk_thread+0x49f
libc.so.1`_thrp_setup+0xbc
libc.so.1`_lwp_start
704

ip`dce_lookup_and_add_v4+0x85
ip`ip_set_destination_v4+0x1cf
ip`ip_attr_connect+0xc6
ip`conn_connect+0xeb
ip`tcp_set_destination+0x66
ip`tcp_connect_ipv4+0x135
ip`tcp_do_connect+0x20f
ip`tcp_connect+0x9f
sockfs`so_connect+0xf8
sockfs`socket_connect+0x47
sockfs`connect+0xec
unix`sys_syscall+0x17a

libc.so.1`__so_connect+0xa
libsocket.so.1`connect+0x10
libvarnish.so.1.0.0`TCP_connect+0x6c
varnishd`vbp_connect+0x4a
varnishd`vbp_poke+0x21a
varnishd`vbp_wrk_poll_backend+0x27f
libc.so.1`_thrp_setup+0xbc
libc.so.1`_lwp_start
881

ip`dce_lookup_and_add_v4+0x85
ip`ip_set_destination_v4+0x1cf
ip`ip_attr_connect+0xc6
ip`conn_connect+0xeb
ip`tcp_set_destination+0x66
ip`tcp_input_listener+0x7be
ip`squeue_enter+0x440
ip`ip_fanout_v4+0x48d
ip`ire_recv_local_v4+0x1ce
ip`ill_input_short_v4+0x6aa
ip`ip_input+0x23b
mac`mac_rx_soft_ring_process+0x17a
mac`mac_rx_srs_proto_fanout+0x46f
mac`mac_rx_srs_drain+0x26e
mac`mac_rx_srs_process+0x180
mac`mac_rx_common+0x9b
mac`mac_rx+0xac
mac`mac_rx_ring+0x4c
igb`igb_intr_rx+0x73
unix`av_dispatch_autovect+0x7c

907

unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`cpu_acpi_idle+0x8d
unix`cpu_idle_adaptive+0x19
unix`idle+0x112
unix`thread_start+0x8

104598

* IMHO, the above suggests that the issue is most likely related to the
reader/writer lock on each dce_cache hash bucket.
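
To quantify the cost of a single lookup, something along these lines should
work (untested as written, and assuming fbt probes exist for
dce_lookup_and_add_v4, which the stacks suggest); most of the time should be
spent walking the bucket chain under the bucket's rwlock:

dtrace -n '
fbt::dce_lookup_and_add_v4:entry { self->ts = timestamp; }
fbt::dce_lookup_and_add_v4:return /self->ts/ {
        @["dce_lookup_and_add_v4 latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}'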

* The DCE cache is getting really large

Output of "netstat -dn | wc -l":

server1 8,083,266
server2 12,875,356
server3 12,881,157
server4 8,615,560
server5 12,832,229
server6 12,975,798
server7 9,179,529

and the corresponding kmem cache (via mdb's ::kmem_cache) on one of these
machines:

::kmem_cache ! grep dce
ffffff0572525448 dce_cache 0000 000000 152 12864956

If I got the numbers right, this means that the ~12.8m entries above cost well
over a gigabyte of RAM.
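
A back-of-the-envelope check, going by the 152-byte kmem buffer size (slab
overhead not included):

echo 'scale=2; 12864956 * 152 / 1024^3' | bc

which comes out at roughly 1.8 GiB for the dce_cache alone.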

* Entries get really old, e.g.:

Destination Cache Entries: IPv4
Address PMTU Age Flags
-------------------- ------ ----- -----
...
92.122.206.23 0 5713503 U
110.164.2.55 0 5713521 U
46.115.34.128 0 5713543 U
80.239.230.166 0 5713564 U

So the oldest entries shown here are about 66 days old (5,713,564 seconds),
which matches the uptime of the machine.

Unfortunately, I don't have good historical data on performance metrics at the
moment, but manually inspecting vmstat I noticed sys time drop from about
9-10% to about 1-2% after setting

ndd -set /dev/ip ip_dce_reclaim_fraction 1

and putting memory pressure on the machine (see below for details), which
momentarily reduced the dce cache to fewer than 1000 entries.

I have also compared connection times by running a simple ApacheBench test.

# before

server6# netstat -nd | wc -l
12995387

otherserver> ab -c 100 -n 10000 http://server6/keepalive.html

...

Concurrency Level: 100
Time taken for tests: 64.812 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 2400240 bytes
HTML transferred: 30003 bytes
Requests per second: 154.29 [#/sec] (mean)
Time per request: 648.124 [ms] (mean)
Time per request: 6.481 [ms] (mean, across all concurrent requests)
Transfer rate: 36.17 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        8  271  400.3    222    5771
Processing:    89  374  581.6    291   16682
Waiting:       89  373  581.6    291   16682
Total:        269  644  705.1    523   16957

Percentage of the requests served within a certain time (ms)
50% 523
66% 581
75% 614
80% 630
90% 738
95% 1218
98% 1600
99% 5902
100% 16957 (longest request)

# purging the cache

server6# netstat -nd | wc -l
12995387
server6# ndd -set /dev/ip ip_dce_reclaim_fraction 1
server6# nice -19 dd if=/dev/zero of=/dev/null bs=$((1024 * 1024 * 1024)) &
         nice -19 dd if=/dev/zero of=/dev/null bs=$((1024 * 1024 * 1024)) &
         nice -19 dd if=/dev/zero of=/dev/null bs=$((1024 * 1024 * 1024)) &
         nice -19 dd if=/dev/zero of=/dev/null bs=$((1024 * 1024 * 1024)) &
         sleep 30 ; pkill dd
[1] 24991
[2] 24992
[3] 24993
[4] 24994
# ...
server6# netstat -nd | wc -l
107

otherserver> ab -c 100 -n 10000 http://server6/keepalive.html

Concurrency Level: 100
Time taken for tests: 0.451 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 2406960 bytes
HTML transferred: 30087 bytes
Requests per second: 22182.34 [#/sec] (mean)
Time per request: 4.508 [ms] (mean)
Time per request: 0.045 [ms] (mean, across all concurrent requests)
Transfer rate: 5214.06 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    2    1.3      1      12
Processing:     1    3    2.0      2      20
Waiting:        0    1    1.5      1      12
Total:          2    4    2.8      4      24

Percentage of the requests served within a certain time (ms)
50% 4
66% 4
75% 4
80% 4
90% 4
95% 12
98% 14
99% 17
100% 24 (longest request)
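
As a crude operational stopgap until the cache behaviour itself is improved,
the reclaim trick above could be wrapped into something along these lines
(the threshold is an arbitrary example value):

#!/bin/bash
# purge the DCE cache once it has grown beyond a threshold, using the
# reclaim-fraction setting plus artificial memory pressure as above
THRESHOLD=100000
if [ $(netstat -dn | wc -l) -gt $THRESHOLD ]; then
        # make the dce reclaim drop (almost) everything once it runs
        ndd -set /dev/ip ip_dce_reclaim_fraction 1
        # create memory pressure so the reclaim code actually gets called;
        # note that the pkill below kills any dd running on the box
        for i in 1 2 3 4; do
                nice -19 dd if=/dev/zero of=/dev/null bs=$((1024 * 1024 * 1024)) &
        done
        sleep 30
        pkill dd
fi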



My questions:

If the current code is still similar to what it was when the gates were closed
(and it looks like it is), my understanding is that:

- the dce_cache is accessed via a hash table with only 256 buckets.

in dce_stack_init, ipst->ips_dce_hashsize is set to 256, and I don't see a
way to tune this size

- so with a large number of dce cache entries, every access is inefficient (the
list behind each hash bucket has to be walked linearly; with ~12.8M entries over
256 buckets that is chains of roughly 50K entries, i.e. about 25K entries
traversed for an average lookup)

- the dce_cache is never shrunk unless there is memory pressure

If my understanding is correct, then I believe this should be optimized:

- I don't see why dces should be kept around forever (until there is enough
memory pressure for dcb_reclaim to be called)

Wouldn't an LRU (U for updated) cache make more sense here?

- Isn't PMTU discovery the main purpose of the DCE? If so, wouldn't it make
sense to only keep entries for ip_pathmtu_interval in the first place?

PSARC/2009/331 refers to RFC4861, which states

    Any Least Recently Used (LRU)-based policy that only reclaims entries
    that have not been used in some time (e.g., ten minutes or more) should
    be adequate for garbage-collecting unused entries.

So the default ip_pathmtu_interval of 20 minutes would seem an acceptable
upper limit.

- If for some reason a larger cache is required or intended, the number of
hash buckets should be tunable to allow for efficient access, in particular
if the data structure mandates a reader/writer lock per bucket

- In general, a lockless data structure would look appealing (see the netstat
example)


Thanks, Nils
Nils Goroll
2012-02-06 18:29:02 UTC
Hi,

I was looking for a way to purge the cache and have come up with this to trigger
the reclamation code:

#!/bin/bash
# make the dce reclaim drop (almost) everything once it runs
ndd -set /dev/ip ip_dce_reclaim_fraction 1
{
# save the current value of lotsfree, then raise lotsfree to 4 * freemem so
# the system believes it is short on memory and the reclaim code
# (dcb_reclaim) gets called; restore the old value afterwards
echo '%/l/lotsfree >oldlotsfree'
echo 'lotsfree/Z ((%/l/freemem)*4)'
# leave the fake memory pressure in place for a moment
sleep 2
echo 'lotsfree/Z <oldlotsfree'
} | exec mdb -kw
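
Whether it worked can be checked the same way as before; running

netstat -dn | wc -l

before and after should again show the count drop from millions of entries to
a few hundred.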

Any better ideas?

Thanks, Nils
Kacheong Poon
2012-02-07 09:54:37 UTC
I think your analysis is correct, and I have opened a bug to track this
issue. Thanks for your info.
--
K. Poon.