
Connections stuck in CLOSE_WAIT state with h2

Janusz Dziemidowicz
Connections stuck in CLOSE_WAIT state with h2
May 24, 2018 02:00PM
Recently I've moved several servers from haproxy 1.7.x to 1.8.x. I have
a setup with nghttpx handling h2 (haproxy connects to nghttpx via a unix
socket; nghttpx handles h2 and connects back to haproxy with plain
http/1.1, also through a unix socket).

After the upgrade I wanted to switch to native h2 supported by
haproxy. Unfortunately, it seems that over time haproxy is
accumulating sockets in CLOSE_WAIT state. Currently, after 12h I have
5k connections in this state. All of them have non-zero Recv-Q and
zero Send-Q. netstat -ntpa shows something like this:

tcp 1 0 IP:443 IP:28032 CLOSE_WAIT 115495/haproxy
tcp 35 0 IP:443 IP:49531 CLOSE_WAIT 115495/haproxy
tcp 507 0 IP:443 IP:31938 CLOSE_WAIT 115495/haproxy
tcp 134 0 IP:443 IP:49672 CLOSE_WAIT 115495/haproxy
tcp 732 0 IP:443 IP:3180 CLOSE_WAIT 115494/haproxy
tcp 746 0 IP:443 IP:39731 CLOSE_WAIT 115494/haproxy
tcp 35 0 IP:443 IP:62986 CLOSE_WAIT 115495/haproxy
tcp 585 0 IP:443 IP:51318 CLOSE_WAIT 115493/haproxy
tcp 100 0 IP:443 IP:60449 CLOSE_WAIT 115493/haproxy
tcp 35 0 IP:443 IP:1274 CLOSE_WAIT 115494/haproxy
...
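
For the record, a one-liner along these lines is enough to watch that
count (the column positions assume the usual netstat -ntp layout):

netstat -ntp | awk '$6 == "CLOSE_WAIT" && $7 ~ /haproxy/' | wc -l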

Those are all frontend connections. Reloading haproxy removes those
connections, but only after hard-stop-after kicks in and old processes
are killed. Disabling native h2 support and switching back to nghttpx
makes the problem disappear.

This kinda seems like the socket was closed on the writing side, but
the client has already sent something and everything is stuck. I was
not able to reproduce the problem by myself. Any ideas how to debug
this further?

haproxy -vv (Debian package rebuilt on stretch with USE_TFO):
HA-Proxy version 1.8.9-1~tsg9+1 2018/05/21
Copyright 2000-2018 Willy Tarreau <[email protected]>

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -g -O2 -fdebug-prefix-map=/root/haproxy-1.8.9=.
-fstack-protector-strong -Wformat -Werror=format-security -Wdate-time
-D_FORTIFY_SOURCE=2
OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1
USE_LUA=1 USE_SYSTEMD=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_TFO=1 USE_NS=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.1.0f 25 May 2017
Running on OpenSSL version : OpenSSL 1.1.0f 25 May 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.3
Built with transparent proxy support using: IP_TRANSPARENT
IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

config

global
log /dev/log daemon warning
log-send-hostname
chroot /var/lib/haproxy
maxconn 65536
user haproxy
group haproxy
daemon
nbproc 4
stats socket /var/run/haproxy/stats.socket user haproxy mode 0640 level user process 1
stats socket /var/run/haproxy/stats-1.socket user haproxy mode 0640 level admin process 1
stats socket /var/run/haproxy/stats-2.socket user haproxy mode 0640 level admin process 2
stats socket /var/run/haproxy/stats-3.socket user haproxy mode 0640 level admin process 3
stats socket /var/run/haproxy/stats-4.socket user haproxy mode 0640 level admin process 4
ssl-default-bind-ciphers
ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES128-SHA:ECDHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA
ssl-default-bind-options ssl-min-ver TLSv1.0
tune.ssl.cachesize 200000
tune.ssl.lifetime 24h
hard-stop-after 2h

unix-bind prefix /var/lib/haproxy/ mode 600 user haproxy group haproxy


defaults http
option dontlognull
option dontlog-normal
option redispatch
option tcp-smart-connect
option httplog

timeout client 60s
timeout connect 10s
timeout server 60s
timeout tunnel 10m
timeout client-fin 30s
timeout http-keep-alive 30s
timeout http-request 30s

log global
retries 3
backlog 16384
maxconn 65536

mode http
errorfile 403 /etc/haproxy/403.html

backend php
option httpchk GET /ping.php
balance roundrobin
cookie SOME_COOKIE insert indirect httponly
server hostname-1 IP:81 check inter 5000 rise 2 fall 5 weight 100 cookie 1
# more servers

frontend http
bind IP:80 transparent
bind IP:443 tfo transparent ssl alpn h2,http/1.1 curves X25519:P-256 tls-ticket-keys FILE crt FILE

http-request set-header X-Forwarded-For %ci unless LOCALHOST
http-request set-header X-Forwarded-Proto https unless { dst_port 80 }
http-request set-header X-Forwarded-Proto http if { dst_port 80 }

default_backend php

--
Janusz Dziemidowicz
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
May 24, 2018 10:30PM
Hi Janusz,

On Thu, May 24, 2018 at 01:49:52PM +0200, Janusz Dziemidowicz wrote:
> Recently I've moved several servers from haproxy 1.7.x to 1.8.x I have
> a setup with nghttpx handling h2 (haproxy connects to nghttpx via unix
> socket which handles h2 and connects back to haproxy with plain
> http/1.1 also through unix socket).
>
> After the upgrade I wanted to switch to native h2 supported by
> haproxy. Unfortunately, it seems that over time haproxy is
> accumulating sockets in CLOSE_WAIT state. Currently, after 12h I have
> 5k connections in this state. All of them have non-zero Recv-Q and
> zero Send-Q. netstat -ntpa shows something like this:
>
> tcp 1 0 IP:443 IP:28032 CLOSE_WAIT 115495/haproxy
> tcp 35 0 IP:443 IP:49531 CLOSE_WAIT 115495/haproxy
> tcp 507 0 IP:443 IP:31938 CLOSE_WAIT 115495/haproxy
> tcp 134 0 IP:443 IP:49672 CLOSE_WAIT 115495/haproxy
> tcp 732 0 IP:443 IP:3180 CLOSE_WAIT 115494/haproxy
> tcp 746 0 IP:443 IP:39731 CLOSE_WAIT 115494/haproxy
> tcp 35 0 IP:443 IP:62986 CLOSE_WAIT 115495/haproxy
> tcp 585 0 IP:443 IP:51318 CLOSE_WAIT 115493/haproxy
> tcp 100 0 IP:443 IP:60449 CLOSE_WAIT 115493/haproxy
> tcp 35 0 IP:443 IP:1274 CLOSE_WAIT 115494/haproxy
> ..

I've never managed to see this happen yet. Even haproxy.org uses H2, and
I've just checked on the server: zero CLOSE_WAIT. What is strange is that
they all have pending data; it means the clients sent some data and then
closed. It could correspond to a timeout where the client finally closed
after not receiving a response.

> Those are all frontend connections. Reloading haproxy removes those
> connections, but only after hard-stop-after kicks in and old processes
> are killed. Disabling native h2 support and switching back to nghttpx
> makes the problem disappear.

OK.

> This kinda seems like the socket was closed on the writing side, but
> the client has already sent something and everything is stuck. I was
> not able to reproduce the problem by myself. Any ideas how to debug
> this further?

For now not much comes to my mind. I'd be interested in seeing the
output of "show fd" issued on the stats socket of such a process (it
can be large, be careful).
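
With the per-process admin sockets from your global section, something
like this should do it (socat is only an example client here, and the
socket path is the one from your config):

echo "show fd" | socat - /var/run/haproxy/stats-1.socket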

> haproxy -vv (Debian package rebuilt on stretch with USE_TFO):

Interesting, and I'm seeing "tfo" on your bind line. We don't have it
on haproxy.org. Could you please re-test without it, just in case ?
Maybe you're receiving SYN+data+FIN that are not properly handled.

> HA-Proxy version 1.8.9-1~tsg9+1 2018/05/21

Is 1.8.9 the first version you tested, or just the first one you saw
the issue on? Did you notice the issue on another 1.8 version? If it
turned out to be a regression it could be easier to spot, in fact.

Your config is very clean and shows nothing suspicious at all. So,
knowing whether tfo changes anything would be a good first step.
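
Concretely that just means dropping the keyword from your bind line, e.g.:

bind IP:443 transparent ssl alpn h2,http/1.1 curves X25519:P-256 tls-ticket-keys FILE crt FILE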

Thanks!
Willy
Janusz Dziemidowicz
Re: Connections stuck in CLOSE_WAIT state with h2
May 24, 2018 11:30PM
2018-05-24 22:26 GMT+02:00 Willy Tarreau <[email protected]>:
>> This kinda seems like the socket was closed on the writing side, but
>> the client has already sent something and everything is stuck. I was
>> not able to reproduce the problem by myself. Any ideas how to debug
>> this further?
>
> For now not much comes to my mind. I'd be interested in seeing the
> output of "show fd" issued on the stats socket of such a process (it
> can be large, be careful).

Will do tomorrow. Forgot to mention: apart from this issue, everything
seems to work fine and no user reports any problem. Obviously it consumes
more and more memory, so I can only enable h2 for an hour or two at a time
to avoid problems.

>> haproxy -vv (Debian package rebuilt on stretch with USE_TFO):
>
> Interesting, and I'm seeing "tfo" on your bind line. We don't have it
> on haproxy.org. Could you please re-test without it, just in case ?
> Maybe you're receiving SYN+data+FIN that are not properly handled.

I've spent some time tweaking several settings already. I believe I've
checked without tfo and there was no difference. Will repeat that
tomorrow to be sure.

>> HA-Proxy version 1.8.9-1~tsg9+1 2018/05/21
>
> Is 1.8.9 the first version you tested or is it the first one you saw
> the issue on, or did you notice the issue on another 1.8 version ? If
> it turned out to be a regression it could be easier to spot in fact.
>
> Your config is very clean and shows nothing suspicious at all. Thus at
> first knowing if tfo changes anything would be a good start.

I've seen this issue also in 1.8.8, which was the first version I've
used after 1.7.x. My actual config is a bit more complicated (multiple
processes per socket, some stats, etc.), but I've been stripping it
down and down and what I've attached is still producing this issue for
me.

Anyway, I'll do another round of experiments (without tfo) tomorrow.

--
Janusz Dziemidowicz
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
May 24, 2018 11:30PM
On Thu, May 24, 2018 at 11:20:13PM +0200, Janusz Dziemidowicz wrote:
> 2018-05-24 22:26 GMT+02:00 Willy Tarreau <[email protected]>:
> >> This kinda seems like the socket was closed on the writing side, but
> >> the client has already sent something and everything is stuck. I was
> >> not able to reproduce the problem by myself. Any ideas how to debug
> >> this further?
> >
> > For now not much comes to my mind. I'd be interested in seeing the
> > output of "show fd" issued on the stats socket of such a process (it
> > can be large, be careful).
>
> Will do tomorrow. Forgot to mention, apart from this issue everything
> seems to work fine. No user reports any problem. Obviously it consumes
> more and more memory. So I can enable h2 for an hour or two to avoid
> problems.

Great, thank you.

> I've seen this issue also in 1.8.8, which was the first version I've
> used after 1.7.x.

That's pretty useful. I'm realizing that haproxy.org isn't totally
up to date (1.8.5+), so I'll upgrade it to see. If it starts to fail,
that will indicate a regression between 1.8.5 and 1.8.8 (quite a few
h2 fixes in between).
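
Purely as an illustration, if it does turn out to be a regression,
narrowing it down is mostly a matter of bisecting a local 1.8 checkout
and rebuilding at each step, something like:

git bisect start v1.8.8 v1.8.5   # known-bad first, then known-good
# build, run with h2 enabled, watch for CLOSE_WAIT buildup, then:
git bisect good                  # or: git bisect bad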

> My actual config is a bit more complicated (multiple
> processes per socket, some stats, etc.), but I've been stripping it
> down and down and what I've attached is still producing this issue for
> me.

OK that's a good job, thanks for doing this.

> Anyway, I'll do another round of experiments (without tfo) tomorrow.

Much appreciated, thank you.

Willy
Janusz Dziemidowicz
Re: Connections stuck in CLOSE_WAIT state with h2
May 25, 2018 03:50PM
2018-05-24 23:26 GMT+02:00 Willy Tarreau <[email protected]>:
>> Anyway, I'll do another round of experiments (without tfo) tomorrow.
>
> Much appreciated, thank you.

I've removed tfo, curves and tls-ticket-keys from the bind line; no
change, CLOSE_WAIT connections are still leaking.
Attached is 'show fd' output from one process. It's been running for 2
hours and about 200 connections are stuck in CLOSE_WAIT.
This is running on my "full" config, with a couple of backends (simple
selection based on the Host header). Reducing this server to the minimal
config from above takes a lot more time, but the problem always seems to
be the same.

--
Janusz Dziemidowicz
Attachments:
fd (249.5 KB)
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
May 26, 2018 09:00AM
On Fri, May 25, 2018 at 03:43:16PM +0200, Janusz Dziemidowicz wrote:
> 2018-05-24 23:26 GMT+02:00 Willy Tarreau <[email protected]>:
> >> Anyway, I'll do another round of experiments (without tfo) tomorrow.
> >
> > Much appreciated, thank you.
>
> I've removed tfo, curves and tls-ticket-keys from bind line, no
> change, CLOSE_WAIT is still leaking.
> Attached is 'show fd' output from one process. It's been running for 2
> hours and about 200 connections are stuck in CLOSE_WAIT.
> This is running on my "full" config, with couple of backends (simple
> selection based on Host header). Reducing this server to the minimal
> config from above takes a lot more time. But the problem seems to be
> always the same.

Thank you. I'll study your capture, it will take some time. I may get
back to you for more info once I have a better idea of it.

Willy
Milan Petruželka
Re: Connections stuck in CLOSE_WAIT state with h2
May 31, 2018 11:00AM
I've got the same problem described by Janusz Dziemidowicz. After enabling
h2 on Haproxy 1.8.8 it slowly accumulates frontend connections. Reload or
restart cuts them down, but they start to grow again. Upgrade to Haproxy
1.8.9 did not help. See a 40-day graph here -
https://pasteboard.co/HnHlKdg.png

Milan Petruzelka

################################################################################################################
../bin/haproxy -vv

HA-Proxy version 1.8.9-83616ec 2018/05/18
Copyright 2000-2018 Willy Tarreau <[email protected]>

Build options :
TARGET = linux2628
CPU = native
CC = gcc
CFLAGS = -O2 -march=native -g -fno-strict-aliasing
-Wdeclaration-after-statement -fwrapv -fno-strict-overflow -Wno-unused-label
OPTIONS = USE_LIBCRYPT=1 USE_ZLIB=1 USE_VSYSCALL=1 USE_OPENSSL=1
USE_SYSTEMD=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace


################################################################################################################
netstat -aptn|grep haproxy|cut -c 69-80|sort|uniq -c|sort -n

4 LISTEN
20 ESTABLISHED
102 CLOSE_WAIT


################################################################################################################
netstat -apton|grep CLOS|grep hapro|head

tcp 7292 0 MY_IP:443 CLIENT_IP:53575 CLOSE_WAIT
16694/haproxy off (0.00/0/0)
tcp 211 0 MY_IP:443 CLIENT_IP:50930 CLOSE_WAIT
16694/haproxy off (0.00/0/0)
tcp 723 0 MY_IP:443 CLIENT_IP:4873 CLOSE_WAIT
16694/haproxy off (0.00/0/0)
tcp 242 0 MY_IP:443 CLIENT_IP:50831 CLOSE_WAIT
5205/haproxy off (0.00/0/0)
tcp 1 0 MY_IP:443 CLIENT_IP:18309 CLOSE_WAIT
5205/haproxy off (0.00/0/0)
tcp 200 0 MY_IP:443 CLIENT_IP:64678 CLOSE_WAIT
16694/haproxy off (0.00/0/0)
tcp 242 0 MY_IP:443 CLIENT_IP:50085 CLOSE_WAIT
16694/haproxy off (0.00/0/0)
tcp 242 0 MY_IP:443 CLIENT_IP:64613 CLOSE_WAIT
16694/haproxy off (0.00/0/0)
tcp 347 0 MY_IP:443 CLIENT_IP:60401 CLOSE_WAIT
5205/haproxy off (0.00/0/0)
tcp 753 0 MY_IP:443 CLIENT_IP:64146 CLOSE_WAIT
16694/haproxy off (0.00/0/0)

################################################################################################################
> show fd
4 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x1eb4bf0
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=GLOBAL
5 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x1eb6df0
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=fe-http
6 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x1eb7300
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=fe-http
8 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x1ed1b10
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=fe-service
9 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x1ed2020
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=fe-service
10 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlc] cache=0 owner=0x754758
iocb=0x48cdf0(unknown) tmask=0xffffffffffffffff umask=0xfffffffffffffffe
11 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlc] cache=0 owner=0x4e7200
iocb=0x4e7200(thread_sync_io_handler) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe
13 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f05c30
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f055a0
14 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20ea060
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22a1580
15 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1ff94d0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20fa360
16 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b7a90
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1fc58f0
17 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2012970
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22a7480
18 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b7e30
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x210b610
19 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1efa590
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f02610
20 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b83a0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x22a2800
21 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20a0f10
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x216ecf0
22 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f02a40
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20d03b0
23 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b8e80
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22d2ff0
24 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f8f2c0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1ee8030
25 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f88c90
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1fa1e20
26 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b8cb0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f2d260
27 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20d0b40
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f04870
28 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20a3670
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f02790
29 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21639c0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f21dd0
30 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x217cac0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22cd210
31 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2163540
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x209b2e0
32 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x209b750
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20a02b0
33 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20ce280
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x2321d60
34 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2162d40
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1fbe580
36 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f05a70
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20c0d80
37 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20e43a0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f18200
38 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20a0120
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20ea4e0
39 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2082a90
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22a1e00
40 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b8740
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22a5ca0
41 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x215a730
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x236f300
42 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x210b440
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x2326220
43 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20d8110
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1faa770
44 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x216f020
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x20e7850
45 : st=0x24(R:Pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20816b0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x1 cflg=0x00201300
fe=GLOBAL mux=PASS mux_ctx=0x1fa6600
46 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2011390
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x1f2bf70
49 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f8de50
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x2327030
50 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f8aea0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x230db80

################################################################################################################
haproxy.cfg

global
daemon
user haproxy
group haproxy
stats socket /some_path/cmd.sock mode 600 level admin
stats timeout 1m
maxconn 100000
log 127.0.0.1:1234 local0

defaults
log global
option httplog
mode http
option http-keep-alive
option forwardfor
timeout connect 5s
retries 3
option redispatch
timeout queue 30s              ## how long to wait in backend / server queue
timeout client 31s             ## inactivity timeout during request reading
timeout server 120s            ## inactivity timeout during response reading
timeout tunnel 1h              ## long lived connections (WebSocket / TCP)
timeout http-request 5s        ## max. request duration - headers without post data
timeout http-keep-alive 500ms  ## timeout when waiting for next request in keepalive mode

frontend fe-http
maxconn 10000
bind *:80
bind *:443 ssl no-sslv3 alpn h2,http/1.1 crt /some_path/ai1.pem crt /some_path/default.pem
capture cookie X-Sid len 45
capture request header Host len 30
capture request header X-Forwarded-For len 50
capture response header X-Rid len 30
capture response header X-Cache len 20
default_backend be-nginx01

backend be-nginx01
default-server inter 5s rise 2 fall 3
option httpchk HEAD /
server nginx01 unix@/some_path/listen-auto.sock check maxconn 30
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
May 31, 2018 09:20PM
Hi Milan,

On Thu, May 31, 2018 at 10:54:11AM +0200, Milan Petruzelka wrote:
> I've got the same problem described by Janusz Dziemidowicz. After enabling
> h2 on Haproxy 1.8.8 it slowly accumulates frontend connections. Reload or
> restart cuts them down, but they start to grow again. Upgrade to Haproxy
> 1.8.9 did not help. See a 40-day graph here -
> https://pasteboard.co/HnHlKdg.png

Thank you. It's pretty constant, so it doesn't really seem load-dependent
(unless your load is reasonably stable, of course). I'm seeing that the
number of entries in "show fd" is not that high; did you truncate it, or
is that all you had? I'm asking because it suddenly makes me think that
we could be leaking sockets that are not correctly closed (for whatever
reason). That would explain why they're not visible in "show fd" but
remain in CLOSE_WAIT. It's absolutely not what I was expecting last time
and I could have overlooked this possibility.

Thanks,
Willy
Milan Petruželka
Re: Connections stuck in CLOSE_WAIT state with h2
June 01, 2018 12:00PM
Hi Willy

On Thu, 31 May 2018 at 21:07, Willy Tarreau <[email protected]> wrote:

> Thank you. It's pretty constant, so it doesn't really seem load-dependant
> (unless your load is reasonably stable of course).


Here is a 3-day sample from the "grow" period -
https://pasteboard.co/HnQu44t.png. The overall trend is steady growth at a
somewhat varying speed. There are also short periods when the number of
frontend connections decreases; these periods seem to match times when the
request rate drops below 5000/hour.


> I'm seeing that the number of entries in "show fd" is not that much, did
> you truncate it or
> is it all you had ? I'm asking because suddenly that makes me think that
> it could be possible that we're leaking sockets that are not correctly
> closed (for whatever reason). That would explain why they're not visible
> in "show fd" but remain in CLOSE_WAIT. It's absolutely not what I was
> expecting last time and I could have overlooked this possibility.
>

"show fd" list was not truncated. It's really shorter than the number of
CLOSE_WAIT netstat entries. The extra netstat CLOSE_WAIT connections
invisible in "show fd" may come from a socket leak. See below for a
current sample - 111 CLOSE_WAIT vs. 64 in "show fd".

Why are there so many "show fd" sockets with cflg=0x80203300? They are the
ones which keep growing. A diff between my first "show fd" from yesterday
and a new one from today clearly shows that a lot of them did not change
and new ones appeared.

cat show-fd-2018-06-01.txt | cut -c128-142 | grep cflg | sort | uniq -c |
sort -n
1 cflg=0x00202306
1 cflg=0x00241300
10 cflg=0x80203306
45 cflg=0x80203300 !!!!!

wc -l *
42 show-fd-2018-05-31.txt
64 show-fd-2018-06-01.txt

diff -u *
--- show-fd-2018-05-31.txt 2018-06-01 10:58:21.354207477 +0200
+++ show-fd-2018-06-01.txt 2018-06-01 10:58:50.886461366 +0200
@@ -12,7 +12,7 @@
17 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x2012970 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x22a7480
18 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x21b7e30 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x210b610
19 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1efa590 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x1f02610
- 20 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x21b83a0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x22a2800
+ 20 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1fe8a90 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x2311bd0
21 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20a0f10 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x216ecf0
22 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f02a40 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x20d03b0
23 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x21b8e80 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x22d2ff0
@@ -25,18 +25,40 @@
30 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x217cac0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x22cd210
31 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x2163540 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x209b2e0
32 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x209b750 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x20a02b0
- 33 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20ce280 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x2321d60
+ 33 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20c07c0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x2328430
34 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x2162d40 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x1fbe580
+ 35 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f914d0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x1ed2da0
36 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f05a70 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x20c0d80
37 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20e43a0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x1f18200
38 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20a0120 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x20ea4e0
39 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x2082a90 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x22a1e00
40 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x21b8740 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x22a5ca0
- 41 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x215a730 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x236f300
- 42 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x210b440 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x2326220
+ 41 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1fa7b10 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x230db80
+ 42 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x2152460 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x1f2bf70
43 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20d8110 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x1faa770
- 44 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x216f020 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x20e7850
- 45 : st=0x24(R:Pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20816b0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x1
cflg=0x00201300 fe=GLOBAL mux=PASS mux_ctx=0x1fa6600
- 46 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x2011390 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x1f2bf70
+ 44 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1fbe0a0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x22e52a0
+ 45 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f23ad0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x22a2800
+ 46 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20ccdd0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x236d720
+ 47 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x2083130 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x236eb00
+ 48 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f90e00 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x2397bd0
49 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f8de50 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x2327030
- 50 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f8aea0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x230db80
+ 50 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x217a360 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x236d8b0
+ 51 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x215a730 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x238bb70
+ 52 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x209fb60 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x20e7850
+ 53 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20a1230 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x25a9470
+ 54 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f90730 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x1ff82d0
+ 55 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x24d1f10 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x230e930
+ 56 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x21647d0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x22e8b60
+ 57 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f8c310 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x1ff1b00
+ 58 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f90060 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x2326220
+ 59 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1ff80d0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x24df3e0
+ 60 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x21b81d0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x236f300
+ 61 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x20f3230 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=PASS mux_ctx=0x2163e90
+ 62 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x2158150 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203300 fe=fe-http mux=H2 mux_ctx=0x23004c0
+ 63 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x21122f0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=PASS mux_ctx=0x1f88ba0
+ 64 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x21b8ae0 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x254fd10
+ 65 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x210b440 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x00202306 sv=nginx01/be-nginx01 mux=PASS mux_ctx=0x1fddaf0
+ 66 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [Nlc] cache=0
owner=0x21b9050 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x1f92230
+ 68 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x21b8910 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x2321d60
+ 69 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x2109130 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x2543fc0
+ 71 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0
owner=0x1f8d780 iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0
cflg=0x80203306 fe=fe-http mux=H2 mux_ctx=0x236fdd0

Thanks, Milan

########################################################################
date ; netstat -aptn|grep haproxy|cut -c 69-80|sort|uniq -c|sort -n
Fri Jun 1 10:02:07 CEST 2018
4 LISTEN
22 ESTABLISHED
111 CLOSE_WAIT

########################################################################
date ; echo "show fd" | socat - $HASOCK | wc -l
Fri Jun 1 10:02:07 CEST 2018
64

########################################################################
date ; echo "show fd" | socat - $HASOCK

Fri Jun 1 10:02:07 CEST 2018
4 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x1eb4bf0
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=GLOBAL
5 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x1eb6df0
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=fe-http
6 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x1eb7300
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=fe-http
8 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x1ed1b10
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=fe-service
9 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x1ed2020
iocb=0x4be950(listener_accept) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe l.st=RDY fe=fe-service
10 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlc] cache=0 owner=0x754758
iocb=0x48cdf0(unknown) tmask=0xffffffffffffffff umask=0xfffffffffffffffe
11 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlc] cache=0 owner=0x4e7200
iocb=0x4e7200(thread_sync_io_handler) tmask=0xffffffffffffffff
umask=0xfffffffffffffffe
13 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f05c30
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f055a0
14 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20ea060
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22a1580
15 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1ff94d0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20fa360
16 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b7a90
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1fc58f0
17 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2012970
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22a7480
18 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b7e30
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x210b610
19 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1efa590
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f02610
20 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1fe8a90
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x2311bd0
21 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20a0f10
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x216ecf0
22 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f02a40
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20d03b0
23 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b8e80
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22d2ff0
24 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f8f2c0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1ee8030
25 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f88c90
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1fa1e20
26 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b8cb0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f2d260
27 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20d0b40
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f04870
28 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20a3670
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f02790
29 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21639c0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f21dd0
30 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x217cac0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22cd210
31 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2163540
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x209b2e0
32 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x209b750
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20a02b0
33 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20c07c0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x2328430
34 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2162d40
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1fbe580
35 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f914d0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1ed2da0
36 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f05a70
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20c0d80
37 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20e43a0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f18200
38 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20a0120
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20ea4e0
39 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2082a90
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22a1e00
40 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b8740
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22a5ca0
41 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1fa7b10
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x230db80
42 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2152460
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1f2bf70
43 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20d8110
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1faa770
44 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1fbe0a0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22e52a0
45 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f23ad0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22a2800
46 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20ccdd0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x236d720
47 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2083130
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x236eb00
48 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f90e00
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x2397bd0
49 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f8de50
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x2327030
50 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x217a360
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x236d8b0
51 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x215a730
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x238bb70
52 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x209fb60
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x20e7850
53 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20a1230
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x25a9470
54 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f90730
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1ff82d0
55 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x24d1f10
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x230e930
56 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21647d0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x22e8b60
57 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f8c310
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1ff1b00
58 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f90060
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x2326220
59 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1ff80d0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x24df3e0
60 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b81d0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x236f300
61 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x20f3230
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=PASS mux_ctx=0x2163e90
62 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2158150
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x23004c0
63 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21122f0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=PASS mux_ctx=0x1f88ba0
64 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b8ae0
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x254fd10
65 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x210b440
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00202306
sv=nginx01/be-nginx01 mux=PASS mux_ctx=0x1fddaf0
66 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [Nlc] cache=0 owner=0x21b9050
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300
fe=GLOBAL mux=PASS mux_ctx=0x1f92230
68 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x21b8910
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x2321d60
69 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x2109130
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x2543fc0
71 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1f8d780
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203306
fe=fe-http mux=H2 mux_ctx=0x236fdd0
Milan Petruželka
Re: Connections stuck in CLOSE_WAIT state with h2
June 06, 2018 11:20AM
Hi Willy,

I've tracked one of the connections hanging in CLOSE_WAIT state with
tcpdump overnight. It started at 17:19 like this:

"Packet No.","Time in
seconds","Source","Destination","Protocol","Length","Info"
"1","0.000000","ip_client","ip_haproxy_server","TCP","62","64311 >
443 [SYN] Seq=0 Win=8192 Len=0 MSS=1460 SACK_PERM=1"
"2","0.001049","ip_haproxy_server","ip_client","TCP","62","443 >
64311 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1"
"3","0.127239","ip_client","ip_haproxy_server","TCP","54","64311 >
443 [ACK] Seq=1 Ack=1 Win=64240 Len=0"
"4","0.127344","ip_client","ip_haproxy_server","TLSv1.2","571","Client
Hello"
"5","0.130304","ip_haproxy_server","ip_client","TLSv1.2","2974","Server
Hello, Certificate"
"6","0.130336","ip_haproxy_server","ip_client","TLSv1.2","310","Server
Key Exchange, Server Hello Done"

After some 13 seconds the client sent its last data, which the haproxy
server acknowledged.


"319","13.781347","ip_client","ip_haproxy_server","TLSv1.2","96","Application
Data"
"320","13.781365","ip_haproxy_server","ip_client","TCP","54","443 >
64311 [ACK] Seq=240156 Ack=3689 Win=36448 Len=0"

Then the client sent a FIN packet, and the server acknowledged it again

"321","16.292016","ip_client","ip_haproxy_server","TCP","54","64311 >
443 [FIN, ACK] Seq=3689 Ack=240156 Win=64240 Len=0"
"322","16.329574","ip_haproxy_server","ip_client","TCP","54","443 >
64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0"

From then on, the client sent only TCP keepalives every 45s, which the
server always acknowledged.

"323","61.443121","ip_client","ip_haproxy_server","TCP","55","[TCP
Keep-Alive] 64311 > 443 [ACK] Seq=3689 Ack=240156 Win=64240 Len=1"
"324","61.443216","ip_haproxy_server","ip_client","TCP","66","[TCP
Keep-Alive ACK] 443 > 64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0
SLE=3689 SRE=3690"
"325","106.528926","ip_client","ip_haproxy_server","TCP","55","[TCP
Keep-Alive] 64311 > 443 [ACK] Seq=3689 Ack=240156 Win=64240 Len=1"
"326","106.529117","ip_haproxy_server","ip_client","TCP","66","[TCP
Keep-Alive ACK] 443 > 64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0
SLE=3689 SRE=3690"
...

After some 4.5 hours (at 21:51) the client sent its last keepalive, which
the server acknowledged. There were no more packets after that.

"1043","16284.644240","ip_client","ip_haproxy_server","TCP","55","[TCP
Keep-Alive] 64311 > 443 [ACK] Seq=3689 Ack=240156 Win=64240 Len=1"
"1044","16284.644354","ip_haproxy_server","ip_client","TCP","66","[TCP
Keep-Alive ACK] 443 > 64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0
SLE=3689 SRE=3690"
"1045","16329.797223","ip_client","ip_haproxy_server","TCP","55","[TCP
Keep-Alive] 64311 > 443 [ACK] Seq=3689 Ack=240156 Win=64240 Len=1"
"1046","16329.797274","ip_haproxy_server","ip_client","TCP","66","[TCP
Keep-Alive ACK] 443 > 64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0
SLE=3689 SRE=3690"

The next day, in the morning at 10:40, I can still see the hanging
connection on the server:

netstat -aptn|grep 64311
tcp 430 0 ip_haproxy_server:443 ip_client:64311
CLOSE_WAIT 916/haproxy

lsof|grep 64311
haproxy 916 haproxy 40u IPv4 106204553
0t0 TCP ip_haproxy_server:https->ip_client:64311 (CLOSE_WAIT)

echo "show fd" | socat - $HASOCK | grep "40 :"
40 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1648d80
iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x15e9460
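
In case anyone wants to reproduce this kind of single-connection capture,
a plain tcpdump filter on the client address is enough (interface and
addresses are placeholders):

tcpdump -ni eth0 -w close-wait.pcap 'host ip_client and port 443'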

I hope this can help in tracking the problem down.

Best regards,
Milan
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
June 06, 2018 11:30AM
Hi Milan,

On Wed, Jun 06, 2018 at 11:09:19AM +0200, Milan Petruzelka wrote:
> Hi Willy,
>
> I've tracked one of connections hanging in CLOSE_WAIT state with tcpdump
> over last night. It started at 17:19 like this:
>
> "Packet No.","Time in
> seconds","Source","Destination","Protocol","Length","Info"
> "1","0.000000","ip_client","ip_haproxy_server","TCP","62","64311 >
> 443 [SYN] Seq=0 Win=8192 Len=0 MSS=1460 SACK_PERM=1"
> "2","0.001049","ip_haproxy_server","ip_client","TCP","62","443 >
> 64311 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1"
> "3","0.127239","ip_client","ip_haproxy_server","TCP","54","64311 >
> 443 [ACK] Seq=1 Ack=1 Win=64240 Len=0"
> "4","0.127344","ip_client","ip_haproxy_server","TLSv1.2","571","Client
> Hello"
> "5","0.130304","ip_haproxy_server","ip_client","TLSv1.2","2974","Server
> Hello, Certificate"
> "6","0.130336","ip_haproxy_server","ip_client","TLSv1.2","310","Server
> Key Exchange, Server Hello Done"
>
> After some 13 seconds client sent it's last data, which haproxy server
> acknowledged.
>
>
> "319","13.781347","ip_client","ip_haproxy_server","TLSv1.2","96","Application
> Data"
> "320","13.781365","ip_haproxy_server","ip_client","TCP","54","443 >
> 64311 [ACK] Seq=240156 Ack=3689 Win=36448 Len=0"
>
> Then client sent FIN packet, server acknowledged it again
>
> "321","16.292016","ip_client","ip_haproxy_server","TCP","54","64311 >
> 443 [FIN, ACK] Seq=3689 Ack=240156 Win=64240 Len=0"
> "322","16.329574","ip_haproxy_server","ip_client","TCP","54","443 >
> 64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0"
>
> From then client sent only TCP keepalive every 45s, which server always
> ackonwledged.
>
> "323","61.443121","ip_client","ip_haproxy_server","TCP","55","[TCP
> Keep-Alive] 64311 > 443 [ACK] Seq=3689 Ack=240156 Win=64240 Len=1"
> "324","61.443216","ip_haproxy_server","ip_client","TCP","66","[TCP
> Keep-Alive ACK] 443 > 64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0
> SLE=3689 SRE=3690"
> "325","106.528926","ip_client","ip_haproxy_server","TCP","55","[TCP
> Keep-Alive] 64311 > 443 [ACK] Seq=3689 Ack=240156 Win=64240 Len=1"
> "326","106.529117","ip_haproxy_server","ip_client","TCP","66","[TCP
> Keep-Alive ACK] 443 > 64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0
> SLE=3689 SRE=3690"
> ...
>
> After some 4.5 hours (at 21:51) client sent last keepalive which server
> acknowledged. There were no more packets after that.
>
> "1043","16284.644240","ip_client","ip_haproxy_server","TCP","55","[TCP
> Keep-Alive] 64311 > 443 [ACK] Seq=3689 Ack=240156 Win=64240 Len=1"
> "1044","16284.644354","ip_haproxy_server","ip_client","TCP","66","[TCP
> Keep-Alive ACK] 443 > 64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0
> SLE=3689 SRE=3690"
> "1045","16329.797223","ip_client","ip_haproxy_server","TCP","55","[TCP
> Keep-Alive] 64311 > 443 [ACK] Seq=3689 Ack=240156 Win=64240 Len=1"
> "1046","16329.797274","ip_haproxy_server","ip_client","TCP","66","[TCP
> Keep-Alive ACK] 443 > 64311 [ACK] Seq=240156 Ack=3690 Win=36448 Len=0
> SLE=3689 SRE=3690"
>
> Next day in the morning at 10:40 I can still see the hanging connection on
> the server:
>
> netstat -aptn|grep 64311
> tcp 430 0 ip_haproxy_server:443 ip_client:64311
> CLOSE_WAIT 916/haproxy
>
> lsof|grep 64311
> haproxy 916 haproxy 40u IPv4 106204553
> 0t0 TCP ip_haproxy_server:https->ip_client:64311 (CLOSE_WAIT)
>
> echo "show fd" | socat - $HASOCK | grep "40 :"
> 40 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x1648d80
> iocb=0x4d2c80(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
> fe=fe-http mux=H2 mux_ctx=0x15e9460
>
> I hope this can help in tracking the problem down.

Sure, it's extremely useful. It looks like a shutdown is ignored when
waiting for the H2 preface. The FD is not being polled for reading, so
I tend to conclude that either it was disabled for a reason I still do
not know, or it used to be enabled, the shutdown was received then the
polling was disabled. But that doesn't appear in the connection flags.
So it seems that the transition between the end of handshake and the
start of parsing could be at fault. Maybe we refrain from entering the
H2 state machine because of the early shutdown. I'm going to have a look
in that direction.

Thank you!
Willy
Milan Petruželka
Re: Connections stuck in CLOSE_WAIT state with h2
June 08, 2018 01:30PM
On Wed, 6 Jun 2018 at 11:20, Willy Tarreau <[email protected]> wrote:

> Hi Milan,
>
> On Wed, Jun 06, 2018 at 11:09:19AM +0200, Milan Petruzelka wrote:
> > Hi Willy,
> >
> > I've tracked one of connections hanging in CLOSE_WAIT state with tcpdump
> > over last night. It started at 17:19 like this:
> >
> > "Packet No.","Time in
> seconds","Source","Destination","Protocol","Length","Info"
> > "1","0.000000","ip_client","ip_haproxy_server","TCP","62","64311
> > 443 [SYN] Seq=0 Win=8192 Len=0 MSS=1460 SACK_PERM=1"
> > "2","0.001049","ip_haproxy_server","ip_client","TCP","62","443
> > 64311 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1"
>
> Sure, it's extremely useful. It looks like a shutdown is ignored when
> waiting for the H2 preface. The FD is not being polled for reading, so
> I tend to conclude that either it was disabled for a reason I still do
> not know, or it used to be enabled, the shutdown was received then the
> polling was disabled. But that doesn't appear in the connection flags.
> So it seems that the transition between the end of handshake and the
> start of parsing could be at fault. Maybe we refrain from entering the
> H2 state machine because of the early shutdown. I'm going to have a look
> in that direction.


Hi Willy,

Here is some additional info about the hanging CLOSE_WAIT connection
described in my last message - screenshot from our log management -
https://pasteboard.co/HoVCR0b.jpg and text variant:

No.  Time                         hap.ts.accept                hap.client.port  hap.term.state  hap.rq.method  hap.rsp.status  hap.rsp.size
1    June 5th 2018, 17:19:22.913  June 5th 2018, 17:19:22.435  64311            ----            GET            200             17,733
2    June 5th 2018, 17:19:23.814  June 5th 2018, 17:19:23.806  64311            ----            GET            200             4,671
3    June 5th 2018, 17:19:23.814  June 5th 2018, 17:19:23.806  64311            ----            GET            200             4,481
4    June 5th 2018, 17:19:23.820  June 5th 2018, 17:19:23.808  64311            ----            GET            200             4,613
5    June 5th 2018, 17:19:23.821  June 5th 2018, 17:19:23.808  64311            ----            GET            200             4,584
6    June 5th 2018, 17:19:24.967  June 5th 2018, 17:19:24.961  64311            ----            GET            200             2,786
7    June 5th 2018, 17:19:28.945  June 5th 2018, 17:19:25.049  64311            ----            GET            200             78,325
8    June 5th 2018, 17:19:28.945  June 5th 2018, 17:19:25.049  64311            ----            GET            200             46,830
9    June 5th 2018, 17:19:31.617  June 5th 2018, 17:19:24.707  64311            CL--            GET            200             13,086
10   June 5th 2018, 17:19:31.618  June 5th 2018, 17:19:25.049  64311            CD--            GET            200             49,162
11   June 5th 2018, 17:19:31.618  June 5th 2018, 17:19:25.049  64311            CD--            GET            200             40,328
12   June 5th 2018, 17:19:31.618  June 5th 2018, 17:19:25.066  64311            CL--            POST           200             3,198

As you can see, the client first made 8 successful requests over the http/2
connection through Haproxy to our backend server. The remaining 4 requests
finished with termination states CL-- or CD--, probably because of the FIN
packet received from the client.

C : the TCP session was unexpectedly aborted by the client
D : the session was in the DATA phase.
L : the proxy was still transmitting LAST data to the client while the
server had already finished. This one is very rare as it can only happen
when the client dies while receiving the last packets.

It looks like Haproxy knows about the client disconnection in the muxed
requests inside the frontend http/2 connection. It just seems not to propagate
this knowledge to the frontend connection itself, leaving it hanging in the
CLOSE_WAIT state. Sorry for not sending this piece of the log last time, I
didn't expect to see any of those CL-- / CD-- requests.

I'll switch http/2 off on the site because I'll be AFK for 2 weeks. After
that I'll be ready to provide any additional info to hunt this one down. Or
to test the patched version if available.

Thank you, best regards
Milan
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
June 12, 2018 10:40AM
Hi Milan,

On Fri, Jun 08, 2018 at 01:26:41PM +0200, Milan Petruzelka wrote:
> On Wed, 6 Jun 2018 at 11:20, Willy Tarreau <[email protected]> wrote:
>
> > Hi Milan,
> >
> > On Wed, Jun 06, 2018 at 11:09:19AM +0200, Milan Petruzelka wrote:
> > > Hi Willy,
> > >
> > > I've tracked one of connections hanging in CLOSE_WAIT state with tcpdump
> > > over last night. It started at 17:19 like this:
> > >
> > > "Packet No.","Time in
> > seconds","Source","Destination","Protocol","Length","Info"
> > > "1","0.000000","ip_client","ip_haproxy_server","TCP","62","64311
> > > 443 [SYN] Seq=0 Win=8192 Len=0 MSS=1460 SACK_PERM=1"
> > > "2","0.001049","ip_haproxy_server","ip_client","TCP","62","443
> > > 64311 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1"
> >
> > Sure, it's extremely useful. It looks like a shutdown is ignored when
> > waiting for the H2 preface. The FD is not being polled for reading, so
> > I tend to conclude that either it was disabled for a reason I still do
> > not know, or it used to be enabled, the shutdown was received then the
> > polling was disabled. But that doesn't appear in the connection flags.
> > So it seems that the transition between the end of handshake and the
> > start of parsing could be at fault. Maybe we refrain from entering the
> > H2 state machine because of the early shutdown. I'm going to have a look
> > in that direction.
>
>
> Hi Willy,
>
> Here is some additional info about the hanging CLOSE_WAIT connection
> described in my last message - screenshot from our log management -
> https://pasteboard.co/HoVCR0b.jpg and text variant:
>
> No. Time hap.ts.accept hap.client.port hap.term.state
> hap.rq.method
> hap.rsp.status hap.rsp.size
> 1 June 5th 2018, 17:19:22.913 June 5th 2018, 17:19:22.435 64311 ---- GET 200
> 17,733
> 2 June 5th 2018, 17:19:23.814 June 5th 2018, 17:19:23.806 64311 ---- GET 200
> 4,671
> 3 June 5th 2018, 17:19:23.814 June 5th 2018, 17:19:23.806 64311 ---- GET 200
> 4,481
> 4 June 5th 2018, 17:19:23.820 June 5th 2018, 17:19:23.808 64311 ---- GET 200
> 4,613
> 5 June 5th 2018, 17:19:23.821 June 5th 2018, 17:19:23.808 64311 ---- GET 200
> 4,584
> 6 June 5th 2018, 17:19:24.967 June 5th 2018, 17:19:24.961 64311 ---- GET 200
> 2,786
> 7 June 5th 2018, 17:19:28.945 June 5th 2018, 17:19:25.049 64311 ---- GET 200
> 78,325
> 8 June 5th 2018, 17:19:28.945 June 5th 2018, 17:19:25.049 64311 ---- GET 200
> 46,830
> 9 June 5th 2018, 17:19:31.617 June 5th 2018, 17:19:24.707 64311 CL-- GET 200
> 13,086
> 10 June 5th 2018, 17:19:31.618 June 5th 2018, 17:19:25.049 64311 CD-- GET
> 200 49,162
> 11 June 5th 2018, 17:19:31.618 June 5th 2018, 17:19:25.049 64311 CD-- GET
> 200 40,328
> 12 June 5th 2018, 17:19:31.618 June 5th 2018, 17:19:25.066 64311 CL-- POST
> 200 3,198
>
> As you can see, the client first made 8 successful requests over the http/2
> connection through Haproxy to our backend server. The remaining 4 requests
> finished with termination states CL-- or CD--, probably because of the FIN
> packet received from the client

Or because the client aborted with an RST_STREAM, which is equally
possible (eg: the user pressed STOP).

> C : the TCP session was unexpectedly aborted by the client
> D : the session was in the DATA phase.
> L : the proxy was still transmitting LAST data to the client while the
> server had already finished. This one is very rare as it can only happen
> when the client dies while receiving the last packets.
>
> It looks like Haproxy knows about client disconnection in muxed requests
> inside frontend http/2 connection. It just seems not to propagate this
> knowledge to the frontend connection, leaving it hanging in CLOSE_WAIT
> state.

It is indeed a possibility. I'm currently studying the code in the hope of
finding a clue there. I've been unsuccessful at provoking the issue so far.

> Sorry for not sending this piece of the log last time, I didn't expect
> to see any of those CL-- / CD-- requests.

Hey, are you kidding? You don't have to be sorry, you've sent one of
the most detailed bug reports we've had in a while, and I really
appreciate the amount of work it takes on your side to dig through
this and blur all these fields while still providing us with as much
information as possible! No, thanks *a lot* for your traces, really!

> I'll switch http/2 off on the site because I'll be AFK for 2 weeks. After
> that I'll be ready to provide any additional info to hunt this one down. Or
> to test the patched version if available.

I'll continue to work on this. I really don't like this strange behaviour.

Thanks!
Willy
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
June 13, 2018 02:50PM
Hi Milan, hi Janusz,

thanks to your respective traces, I may have come up with a possible
scenario explaining the CLOSE_WAIT you're facing. Could you please
try the attached patch ?

Thanks,
Willy

From fc527303f1391849ae5880b304486e328010b5ff Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Wed, 13 Jun 2018 14:24:56 +0200
Subject: BUG/MEDIUM: h2: make sure the last stream closes the connection
after a timeout

If a timeout strikes on the connection side with some active streams,
there is a corner case which can sometimes cause the following sequence
to happen :

- There are active streams but there are data in the mux buffer
(eg: a client suddenly disconnected during a download with pending
requests). The timeout is active.

- The timeout strikes, h2_timeout_task() is called, kills the task and
doesn't close the connection since there are streams left ; The
connection is marked in H2_CS_ERROR ;

- the streams are woken up and closed ;

- when the last stream closes, calling h2_detach(), it sees the
tree list is empty, but there is no condition allowing the
connection to be closed (mbuf->o > 0), thus it does nothing ;

- since the task is dead, there's no more hope to clear this
situation later

For now we can take care of this by adding a test for the presence of
H2_CS_ERROR and !task, implying the timeout task triggered already
and will not be able to handle this again.

Over the long term it seems like a more reliable test should be made,
so that it is possible to know whether or not someone is still able to
close this connection.

A big thanks to Janusz Dziemidowicz and Milan Petruzelka for providing
many details helping in figuring this bug.
---
src/mux_h2.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/src/mux_h2.c b/src/mux_h2.c
index 57b1722..bc33ce2 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -2469,6 +2469,7 @@ static void h2_detach(struct conn_stream *cs)
*/
if (eb_is_empty(&h2c->streams_by_id) && /* don't close if streams exist */
((h2c->conn->flags & CO_FL_ERROR) || /* errors close immediately */
+ (h2c->st0 >= H2_CS_ERROR && !h2c->task) || /* a timeout stroke earlier */
(h2c->flags & H2_CF_GOAWAY_FAILED) ||
(!h2c->mbuf->o && /* mux buffer empty, also process clean events below */
(conn_xprt_read0_pending(h2c->conn) ||
--
1.7.12.1
Janusz Dziemidowicz
Re: Connections stuck in CLOSE_WAIT state with h2
June 13, 2018 07:20PM
2018-06-13 14:42 GMT+02:00 Willy Tarreau <[email protected]>:
> Hi Milan, hi Janusz,
>
> thanks to your respective traces, I may have come up with a possible
> scenario explaining the CLOSE_WAIT you're facing. Could you please
> try the attached patch ?

Unfortunately there is no change for me. CLOSE_WAIT sockets still
accumulate if I switch native h2 on. Milan should probably double
check this though.
https://pasteboard.co/HpJj72H.png

I'll try to move some low-traffic site to a separate instance tomorrow;
maybe I'll be able to capture some traffic too.

--
Janusz Dziemidowicz
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
June 13, 2018 07:20PM
On Wed, Jun 13, 2018 at 07:06:58PM +0200, Janusz Dziemidowicz wrote:
> 2018-06-13 14:42 GMT+02:00 Willy Tarreau <[email protected]>:
> > Hi Milan, hi Janusz,
> >
> > thanks to your respective traces, I may have come up with a possible
> > scenario explaining the CLOSE_WAIT you're facing. Could you please
> > try the attached patch ?
>
> Unfortunately there is no change for me. CLOSE_WAIT sockets still
> accumulate if I switch native h2 on. Milan should probably double
> check this though.
> https://pasteboard.co/HpJj72H.png

:-(

With still the same perfectly straight line really making me think of either
a periodic activity which I'm unable to guess or model, or something related
to our timeouts.

> I'll try move some low traffic site to a separate instance tomorrow,
> maybe I'll be able to capture some traffic too.

Unfortunately with H2 that will not help much, there's the TLS layer
under it that makes it a real pain. TLS is designed to avoid observability
and it does it well :-/

I've suspected a received shutdown at the TLS layer, which I was not
able to model at all. Tools are missing at this point. I even tried
to pass the traffic through haproxy in TCP mode to help but I couldn't
reproduce the problem.

It could possibly help if you can look for the affected client's IP:port
in your logs to see if they are perfectly normal or if you notice they
have something in common (eg: always the exact same requests, or they
never made a request from the affected connections, etc).

I won't merge the current patch for now. At minimum it's incomplete,
and there is always a risk that it breaks something else in some equally
difficult-to-detect way.

Thanks,
Willy
Janusz Dziemidowicz
Re: Connections stuck in CLOSE_WAIT state with h2
June 13, 2018 08:10PM
2018-06-13 19:14 GMT+02:00 Willy Tarreau <[email protected]>:
> On Wed, Jun 13, 2018 at 07:06:58PM +0200, Janusz Dziemidowicz wrote:
>> 2018-06-13 14:42 GMT+02:00 Willy Tarreau <[email protected]>:
>> > Hi Milan, hi Janusz,
>> >
>> > thanks to your respective traces, I may have come up with a possible
>> > scenario explaining the CLOSE_WAIT you're facing. Could you please
>> > try the attached patch ?
>>
>> Unfortunately there is no change for me. CLOSE_WAIT sockets still
>> accumulate if I switch native h2 on. Milan should probably double
>> check this though.
>> https://pasteboard.co/HpJj72H.png
>
> :-(
>
> With still the same perfectly straight line really making me think of either
> a periodic activity which I'm unable to guess or model, or something related
> to our timeouts.

It is not exactly straight. While it looks like this for a short test,
when I did this earlier for a much longer period of time, it was
slowing down during the night, when I have less traffic.

>> I'll try move some low traffic site to a separate instance tomorrow,
>> maybe I'll be able to capture some traffic too.
>
> Unfortunately with H2 that will not help much, there's the TLS layer
> under it that makes it a real pain. TLS is designed to avoid observability
> and it does it well :-/
>
> I've suspected a received shutdown at the TLS layer, which I was not
> able to model at all. Tools are missing at this point. I even tried
> to pass the traffic through haproxy in TCP mode to help but I couldn't
> reproduce the problem.

When I disable native h2 in haproxy I switch back to tcp mode going
through nghttpx. The traffic is obviously the same, yet there is no
problem.

> It could possibly help if you can look for the affected client's IP:port
> in your logs to see if they are perfectly normal or if you notice they
> have something in common (eg: always the exact same requests, or they
> never made a request from the affected connections, etc).

I'm aware of the problems :) However, if I can get some traffic dumps,
knowing my application I might be able to reproduce this, which would
be a huge win. I've already tried some experiments with various tools
with no luck unfortunately.

> I won't merge the current patch for now. At minima it's incomplete,
> and there is always a risk that it breaks something else in such a
> difficult to detect way.

Sure, no problem :)

--
Janusz Dziemidowicz
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
June 14, 2018 11:20AM
On Wed, Jun 13, 2018 at 08:01:30PM +0200, Janusz Dziemidowicz wrote:
> 2018-06-13 19:14 GMT+02:00 Willy Tarreau <[email protected]>:
> > On Wed, Jun 13, 2018 at 07:06:58PM +0200, Janusz Dziemidowicz wrote:
> >> 2018-06-13 14:42 GMT+02:00 Willy Tarreau <[email protected]>:
> >> > Hi Milan, hi Janusz,
> >> >
> >> > thanks to your respective traces, I may have come up with a possible
> >> > scenario explaining the CLOSE_WAIT you're facing. Could you please
> >> > try the attached patch ?
> >>
> >> Unfortunately there is no change for me. CLOSE_WAIT sockets still
> >> accumulate if I switch native h2 on. Milan should probably double
> >> check this though.
> >> https://pasteboard.co/HpJj72H.png
> >
> > :-(
> >
> > With still the same perfectly straight line really making me think of either
> > a periodic activity which I'm unable to guess or model, or something related
> > to our timeouts.
>
> It is not exactly straight. While it looks like this for short test,
> when I did this earlier, for much longer period of time, it was
> slowing down during night, when I have less traffic.

OK but there's definitely something very stable in this.

> >> I'll try move some low traffic site to a separate instance tomorrow,
> >> maybe I'll be able to capture some traffic too.
> >
> > Unfortunately with H2 that will not help much, there's the TLS layer
> > under it that makes it a real pain. TLS is designed to avoid observability
> > and it does it well :-/
> >
> > I've suspected a received shutdown at the TLS layer, which I was not
> > able to model at all. Tools are missing at this point. I even tried
> > to pass the traffic through haproxy in TCP mode to help but I couldn't
> > reproduce the problem.
>
> When I disable native h2 in haproxy I switch back to tcp mode going
> through nghttpx. The traffic is obviously the same, yet there is no
> problem.

I'm not surprised since it doesn't use our H2 engine anymore :-) In my
case the purpose was to try to abuse haproxy's TCP proxy to modify the
behaviour before it reaches its H2 engine.

> > It could possibly help if you can look for the affected client's IP:port
> > in your logs to see if they are perfectly normal or if you notice they
> > have something in common (eg: always the exact same requests, or they
> > never made a request from the affected connections, etc).
>
> I'm aware of the problems :) However, if I can get some traffic dumps,
> knowing my application I might be able to reproduce this, which would
> be a huge win. I've already tried some experiments with various tools
> with no luck unfortunately.

Yep, it's really not easy, and probably once we find it I'll be ashamed,
saying "I thought this code was not merged"... By the way, yesterday I
found another suspect, but I'm undecided on it; the current architecture
of the H2 mux complicates the code analysis. If you want to give it a
try on top of the previous one, I'd appreciate it, even if it doesn't change
anything. Please find it attached.

Thanks,
willy
From f055296f7598ba84b08e78f8309ffd7fa0c9522b Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Wed, 13 Jun 2018 09:19:29 +0200
Subject: WIP: h2: try to address possible causes for the close_wait issues

It is uncertain whether certain errors could prevent pending outgoing
data from being emitted, and from releasing attached streams. Indeed,
for h2_release() to be called, the mux buf must be empty or an error
must have been met. A clean shutdown will not constitute an error and
it's likely that refraining from sending may prevent the buffer from
flushing. Thus maybe we can end up with data forever in the buffer.

The timeout task should normally take care of this though. It's worth
noting that if there's no more stream and the output buffer is empty
on wake(), the task's timeout is eternity.

This fix should be backported to 1.8.
---
src/mux_h2.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/src/mux_h2.c b/src/mux_h2.c
index bc33ce2..2f2384d 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -2271,14 +2271,6 @@ static int h2_wake(struct connection *conn)
h2_release(conn);
return -1;
}
- else {
- /* some streams still there, we need to signal them all and
- * wait for their departure.
- */
- __conn_xprt_stop_recv(conn);
- __conn_xprt_stop_send(conn);
- return 0;
- }
}

if (!h2c->dbuf->i)
@@ -2294,6 +2286,7 @@ static int h2_wake(struct connection *conn)

/* adjust output polling */
if (!(conn->flags & CO_FL_SOCK_WR_SH) &&
+ h2c->st0 != H2_CS_ERROR2 && !(h2c->flags & H2_CF_GOAWAY_FAILED) &&
(h2c->st0 == H2_CS_ERROR ||
h2c->mbuf->o ||
(h2c->mws > 0 && !LIST_ISEMPTY(&h2c->fctl_list)) ||
--
1.7.12.1
Janusz Dziemidowicz
Re: Connections stuck in CLOSE_WAIT state with h2
June 14, 2018 11:40AM
2018-06-14 11:14 GMT+02:00 Willy Tarreau <[email protected]>:
> Yep it's really not easy and probably once we find it I'll be ashamed
> saying "I thought this code was not merged"... By the way yesterday I
> found another suspect but I'm undecided on it ; the current architecture
> of the H2 mux complicates the code analysis. If you want to give it a
> try on top of previous one, I'd appreciate it, even if it doesn't change
> anything. Please find it attached.

Will try.

I've found one more clue. I've added various graphs to my monitoring.
Also, I've been segregating various kinds of traffic into different
haproxy backends. Yesterday's test shows this:
https://pasteboard.co/HpPK2Ml6.png

This backend (sns) is used exclusively for static files that are
"large" (from 10KB up to over a megabyte) compared to my usual traffic
(various API calls mostly). Those 5xx errors are not from the backend
servers, "show stat":
sns,kr-8,0,0,5,108,,186655,191829829,19744924356,,0,,0,0,0,0,UP,100,1,0,0,0,15377,0,,1,4,1,,186655,,2,7,,147,L7OK,200,1,0,184295,550,0,0,0,,,,,8895,0,,,,,0,OK,,0,12,4,1474826,,,,Layer7
check passed,,2,5,6,,,,10.7.1.8:81,,http,,,,,,,,
sns,kr-10,0,0,2,105,,186654,191649821,19977086644,,0,,0,0,0,0,UP,100,1,0,0,0,15377,0,,1,4,2,,186654,,2,7,,148,L7OK,200,0,0,184275,551,0,0,0,,,,,8823,0,,,,,0,OK,,0,21,4,1473385,,,,Layer7
check passed,,2,5,6,,,,10.7.1.10:81,,http,,,,,,,,
sns,BACKEND,0,0,8,213,6554,383553,391967657,39722011000,0,0,,0,0,0,0,UP,200,2,0,,0,15377,0,,1,4,0,,373309,,1,14,,332,,,,0,368563,1101,0,1873,12008,,,,383545,27962,0,0,0,0,0,0,,,0,18,5,1763433,,,,,,,,,,,,,,http,roundrobin,,,,,,,

--
Janusz Dziemidowicz
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
June 14, 2018 11:50AM
On Thu, Jun 14, 2018 at 11:29:39AM +0200, Janusz Dziemidowicz wrote:
> 2018-06-14 11:14 GMT+02:00 Willy Tarreau <[email protected]>:
> > Yep it's really not easy and probably once we find it I'll be ashamed
> > saying "I thought this code was not merged"... By the way yesterday I
> > found another suspect but I'm undecided on it ; the current architecture
> > of the H2 mux complicates the code analysis. If you want to give it a
> > try on top of previous one, I'd appreciate it, even if it doesn't change
> > anything. Please find it attached.
>
> Will try.
>
> I've found one more clue. I've added various graphs to my monitoring.
> Also I've been segregating various traffic kinds into different
> haproxy backends. Yesterday test shows this:
> https://pasteboard.co/HpPK2Ml6.png
>
> This backend (sns) is used exclusively for static files that are
> "large" (from 10KB up to over a megabyte) compared to my usual traffic
> (various API calls mostly). Those 5xx errors are not from the backend
> servers, "show stat":
> sns,kr-8,0,0,5,108,,186655,191829829,19744924356,,0,,0,0,0,0,UP,100,1,0,0,0,15377,0,,1,4,1,,186655,,2,7,,147,L7OK,200,1,0,184295,550,0,0,0,,,,,8895,0,,,,,0,OK,,0,12,4,1474826,,,,Layer7
> check passed,,2,5,6,,,,10.7.1.8:81,,http,,,,,,,,
> sns,kr-10,0,0,2,105,,186654,191649821,19977086644,,0,,0,0,0,0,UP,100,1,0,0,0,15377,0,,1,4,2,,186654,,2,7,,148,L7OK,200,0,0,184275,551,0,0,0,,,,,8823,0,,,,,0,OK,,0,21,4,1473385,,,,Layer7
> check passed,,2,5,6,,,,10.7.1.10:81,,http,,,,,,,,
> sns,BACKEND,0,0,8,213,6554,383553,391967657,39722011000,0,0,,0,0,0,0,UP,200,2,0,,0,15377,0,,1,4,0,,373309,,1,14,,332,,,,0,368563,1101,0,1873,12008,,,,383545,27962,0,0,0,0,0,0,,,0,18,5,1763433,,,,,,,,,,,,,,http,roundrobin,,,,,,,

Oh this is very interesting indeed! So haproxy detected 6500 5xx errors
on this backend that were not attributed to any of these servers. I'm
really not seeing many situations where this can happen, I'll have a
look at this in the code. A common case could be 503s emitted when the
requests die in the queue but you have no maxconn thus no queue. Maybe
we return some 500 from time to time, though I'll have to figure out
why!

Thank you!
Willy
Janusz Dziemidowicz
Re: Connections stuck in CLOSE_WAIT state with h2
June 14, 2018 02:00PM
2018-06-14 11:46 GMT+02:00 Willy Tarreau <[email protected]>:
>> Will try.

I've tried the second patch, together with the first one; no change at all.

However, I was finally able to catch it on my laptop. I still can't
easily reproduce this, but at least that's something. A little
background: my company makes online games, and the one I am testing with
is a browser Flash game. As it starts, it makes various API calls
and loads game resources, graphics/music, etc. So I disabled the
browser cache and tried closing the browser tab with the game as it was
loading. After a couple of tries I reached the following state:
tcp6 1190 0 SERVER_IP:443 MY_IP:54514 ESTABLISHED 538049/haproxy

This is with the browser tab already closed. The browser (latest Chrome)
probably keeps the connection alive, but haproxy should close it after
a while. Well, that didn't happen; after a good 30 minutes the
connection was still ESTABLISHED. My timeouts are at the beginning of
this thread; my understanding is that this connection should be killed
after "timeout client", which is 60s.
After that I closed the browser completely. The connection moved to the
CLOSE_WAIT state in question:
tcp6 1191 0 SERVER_IP:443 MY_IP:54514 CLOSE_WAIT 538049/haproxy

haproxy logs (I have dontlog-normal enabled): https://pastebin.com/sUsa6jNQ

--
Janusz Dziemidowicz
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
June 14, 2018 07:00PM
On Thu, Jun 14, 2018 at 01:51:20PM +0200, Janusz Dziemidowicz wrote:
> 2018-06-14 11:46 GMT+02:00 Willy Tarreau <[email protected]>:
> >> Will try.
>
> I've tried the second patch, together with the first one; no change at all.
>
> However, I was able to catch it on my laptop finally. I still can't
> easily reproduce this, but at least that's something. Little
> background, my company makes online games, the one I am testing with
> is a web browser flash game. As it starts, it makes various API calls
> and loads game resources, graphics/music, etc. So I've disabled
> browser cache and tried closing browser tab with the game as it was
> loading. After a couple of tries I've achieved following state:
> tcp6 1190 0 SERVER_IP:443 MY_IP:54514 ESTABLISHED 538049/haproxy
>
> This is with browser tab already closed. Browser (latest Chrome)
> probably keeps the connection alive, but haproxy should close it after
> a while. Well, that didn't happen, after good 30 minutes the
> connection is still ESTABLISHED. My timeouts are at the beginning of
> this thread, my understanding is that this connection should be killed
> after "timeout client" which is 60s.
> After that I've closed the browser completely. Connection moved to the
> CLOSE_WAIT state in question:
> tcp6 1191 0 SERVER_IP:443 MY_IP:54514 CLOSE_WAIT 538049/haproxy
>
> haproxy logs (I have dontlognormal enabled): https://pastebin.com/sUsa6jNQ

Thank you! I've just found a bug which I suspect could be related. By trying
to exploit it I managed to reproduce the problem once, and after the fix I
couldn't anymore. That's not enough to draw a conclusion, but I suspect I'm
on the right track.

I found that the case where some extra data are pending after a
chunked-encoded response is not properly handled by the H2 encoder:
the data are properly deleted but are not reported as being part of what
was sent. This can cause the upper layer to believe that nothing was sent
and to continue to wait. When trying to send responses containing garbage
after the final chunk, I ended up in the same situation you saw, with an H2
connection still present in "show fd" and the timeout not getting rid of
it, apparently because there's permanently a stream attached to it. I also
remember we faced a similar situation in early 1.8 with extra data
after content-length not being properly trimmed. It could very well be
similar here. It's unclear to me why the stream timeout doesn't trigger
(probably because the stream is considered completed, which would be the
root cause of the problem), but these data definitely need to be reported
as deleted.

If you'd like to run a test, I'm attaching the patch.

Cheers,
Willy
Janusz Dziemidowicz
Re: Connections stuck in CLOSE_WAIT state with h2
June 14, 2018 07:30PM
2018-06-14 18:56 GMT+02:00 Willy Tarreau <[email protected]>:

> If you'd like to run a test, I'm attaching the patch.

Sure, but you forgot to attach it :)

--
Janusz Dziemidowicz
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
June 14, 2018 08:00PM
On Thu, Jun 14, 2018 at 07:22:34PM +0200, Janusz Dziemidowicz wrote:
> 2018-06-14 18:56 GMT+02:00 Willy Tarreau <[email protected]>:
>
> > If you'd like to run a test, I'm attaching the patch.
>
> Sure, but you forgot to attach it :)

Ah, that's because I'm stupid :-)

Here it comes this time.

Willy
diff --git a/src/mux_h2.c b/src/mux_h2.c
index 5f1da0d..4c3e2dd 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -3106,6 +3109,7 @@ static int h2s_frt_make_resp_headers(struct h2s *h2s, struct buffer *buf)
*/
if (es_now) {
// trim any possibly pending data (eg: inconsistent content-length)
+ ret += buf->o;
bo_del(buf, buf->o);

h1m->state = HTTP_MSG_DONE;
@@ -3359,6 +3363,7 @@ static int h2s_frt_make_resp_data(struct h2s *h2s, struct buffer *buf)

if (!(h1m->flags & H1_MF_CHNK)) {
// trim any possibly pending data (eg: inconsistent content-length)
+ total += buf->o;
bo_del(buf, buf->o);

h1m->state = HTTP_MSG_DONE;
@@ -3413,6 +3418,7 @@ static int h2_snd_buf(struct conn_stream *cs, struct buffer *buf, int flags)
bo_del(buf, count);

// trim any possibly pending data (eg: extra CR-LF, ...)
+ total += buf->o;
bo_del(buf, buf->o);

h2s->res.state = HTTP_MSG_DONE;
Janusz Dziemidowicz
Re: Connections stuck in CLOSE_WAIT state with h2
June 15, 2018 11:30AM
2018-06-14 19:49 GMT+02:00 Willy Tarreau <[email protected]>:
> On Thu, Jun 14, 2018 at 07:22:34PM +0200, Janusz Dziemidowicz wrote:
>> 2018-06-14 18:56 GMT+02:00 Willy Tarreau <[email protected]>:
>>
>> > If you'd like to run a test, I'm attaching the patch.
>>
>> Sure, but you forgot to attach it :)
>
> Ah, that's because I'm stupid :-)
>
> Here it comes this time.

I've tried with all three patches, still no luck. I had to revert
native h2 shortly after because I started getting ERR_SPDY_PROTOCOL_ERROR
in Chrome. The error was always on a POST request.

--
Janusz Dziemidowicz
Willy Tarreau
Re: Connections stuck in CLOSE_WAIT state with h2
June 15, 2018 11:30AM
On Fri, Jun 15, 2018 at 11:19:04AM +0200, Janusz Dziemidowicz wrote:
> 2018-06-14 19:49 GMT+02:00 Willy Tarreau <[email protected]>:
> > On Thu, Jun 14, 2018 at 07:22:34PM +0200, Janusz Dziemidowicz wrote:
> >> 2018-06-14 18:56 GMT+02:00 Willy Tarreau <[email protected]>:
> >>
> >> > If you'd like to run a test, I'm attaching the patch.
> >>
> >> Sure, but you forgot to attach it :)
> >
> > Ah, that's because I'm stupid :-)
> >
> > Here it comes this time.
>
> I've tried with all three patches, still no luck. I had to revert
> native h2 shortly because I've started getting ERR_SPDY_PROTOCOL_ERROR
> in Chrome. The error was always on POST request.

Too bad, have to dig again then :-/.
Thank you Janusz!

Willy
Janusz Dziemidowicz
Re: Connections stuck in CLOSE_WAIT state with h2
June 15, 2018 11:50AM
2018-06-15 11:21 GMT+02:00 Willy Tarreau <[email protected]>:
>> I've tried with all three patches, still no luck. I had to revert
>> native h2 shortly because I've started getting ERR_SPDY_PROTOCOL_ERROR
>> in Chrome. The error was always on POST request.
>
> Too bad, have to dig again then :-/.
> Thank you Janusz!

That's a bit weird, but I've reverted back to clean 1.8.9 and I still
get ERR_SPDY_PROTOCOL_ERROR, so this seems like an unrelated problem.
Chrome net-internals shows only:

t= 10312 [st= 4042] HTTP2_SESSION_RECV_RST_STREAM
--> error_code = "5 (STREAM_CLOSED)"
--> stream_id = 129

However I'm pretty sure I was doing exactly the same yesterday and had
no such problem.

Anyway, I'm reverting back to clean 1.8.9 and h2 handled by nghttpx.
I'd prefer not to do any more tests before Monday ;)

--
Janusz Dziemidowicz