Welcome! Log In Create A New Profile

Advanced

Problems with haproxy 1.7.3 on FreeBSD 11.0-p8

Posted by Matthias Fechner 
Matthias Fechner
Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 06, 2017 09:40AM
Dear all,


are problem with haproxy 1.7.3 on FreeBSD 11.0-p8 known?

I have the problem that I got a lot of timeout for all websites that are
behind haproxy.

Haproxy does terminate the SSL connection and forwards to nginx. Before
haproxy I have a sslh running.


Downgrading to version 1.7.2 fixed the problem.


Here my config (I removed some fqdns and username and password):
global
maxconn 2048
user haproxy
group haproxy
daemon
tune.ssl.default-dh-param 2048

# logging
ulimit-n 65536
#log /var/run/log local0 info
log /var/run/log local0 err

# Configure chipers to not use, see
https://mozilla.github.io/server-side-tls/ssl-config-generator/
ssl-default-bind-ciphers
ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS
ssl-default-bind-options no-sslv3 no-tls-tickets
ssl-default-server-ciphers
ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS
ssl-default-server-options no-sslv3 no-tls-tickets

defaults
mode http
log global
option httplog
#option dontlog-normal
timeout connect 80s
timeout client 80s
timeout server 80s
#timeout check 1s
#timeout http-keep-alive 1s
#timeout http-request 400s # slowloris protection
option forwardfor
option http-server-close
default-server inter 3s fall 2 rise 2 slowstart 60s
compression algo gzip
compression type text/html text/plain text/css

frontend www-http
bind *:80
redirect scheme https code 301 if !{ ssl_fc }
reqadd X-Forwarded-Proto:\ http
default_backend nginx-backend

frontend www-https
mode tcp
bind 192.168.0.251:8443 ssl crt
/usr/local/etc/haproxy/certs/ alpn h2,http/1.1
bind 192.168.200.6:8443 ssl crt
/usr/local/etc/haproxy/certs/ alpn h2,http/1.1
bind localhost:443 ssl crt /usr/local/etc/haproxy/certs/
alpn h2,http/1.1
bind 127.0.0.1:443 ssl crt /usr/local/etc/haproxy/certs/
alpn h2,http/1.1

acl use_nginx ssl_fc_sni -i fqdn1 fqdn2
acl http2 ssl_fc_alpn -i h2
use_backend nginx-http2-backend if http2
use_backend nginx-http-backend if use_nginx

default_backend nginx-http-backend

backend nginx-backend
server www-1 127.0.0.1:8082 check send-proxy

backend nginx-http2-backend
mode tcp
server www-1 127.0.0.1:8083 check send-proxy

backend nginx-http-backend
mode tcp
server www-1 127.0.0.1:8082 check send-proxy

frontend haproxy-stats
bind 192.168.0.251:9001
mode http
stats enable
stats hide-version
stats realm Haproxy\ Statistics
stats uri /haproxy_stats
stats auth <user>:<passwordhash>

Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Georg Faerber
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 06, 2017 09:50AM
Anonymous User
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 06, 2017 10:00AM
Hi,


it would be cool if somebody could open a PR at

https://bugs.freebsd.org/

I personally don't use FreeBSD 11 for any of my HAProxy-installations
(yet), so I'm not really affected (yet) - but thanks for the heads-up.




Regards,
Rainer
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 06, 2017 10:10AM
Hi Georg,

Am 06.03.2017 um 09:43 schrieb Georg Faerber:
> I'm not running FreeBSD myself, but have a look at [1]: In the
> follow-ups to this thread there are two more people reporting problems.
>
> [1] https:[email protected]/msg25093.html

no, this cannot be the problem, because this error reported in [1] is
related to haproxy version 1.7.2.

My problem is related to 1.7.3. The problem was introduced by a change
for 1.7.3. as 1.7.2 is running fine.


Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 06, 2017 10:10AM
Dear Rainer,

Am 06.03.2017 um 09:52 schrieb rainer@ultra-secure.de:
> it would be cool if somebody could open a PR at
>
> https://bugs.freebsd.org/
>
> I personally don't use FreeBSD 11 for any of my HAProxy-installations
> (yet), so I'm not really affected (yet) - but thanks for the heads-up.

I opened a bug report here:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=217576


I have only one server already upgraded to FreeBSD 11. The 10.3
installation are running fine with haproxy 1.7.3.

Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Anonymous User
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 06, 2017 10:20AM
Am 2017-03-06 10:05, schrieb Matthias Fechner:
> Dear Rainer,
>
> I opened a bug report here:
>
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=217576
>
>
> I have only one server already upgraded to FreeBSD 11. The 10.3
> installation are running fine with haproxy 1.7.3.



Thanks!
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 06, 2017 10:50AM
On Mon, Mar 06, 2017 at 09:59:21AM +0100, Matthias Fechner wrote:
> Hi Georg,
>
> Am 06.03.2017 um 09:43 schrieb Georg Faerber:
> > I'm not running FreeBSD myself, but have a look at [1]: In the
> > follow-ups to this thread there are two more people reporting problems.
> >
> > [1] https:[email protected]/msg25093.html
>
> no, this cannot be the problem, because this error reported in [1] is
> related to haproxy version 1.7.2.

Since we don't know what causes the issue above, we could end up discovering
that there's a behaviour change that depends on the workload.

> My problem is related to 1.7.3. The problem was introduced by a change
> for 1.7.3. as 1.7.2 is running fine.

Could you retry 1.7.3 by reverting the attached patch ? I don't see
why it would cause any trouble but that's the only likely candidate
I'm seeing between 1.7.2 and 1.7.3. If it fixes it, it may indicate
an issue with our implementation of the kqueue poller, possibly
explaining the issues other people have reported with fbsd 11 vs 10.

Also any hint you can provide to help people reproduce it would be welcome!

Thanks,
Willy
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 06, 2017 11:20AM
with the attachment now (thanks Dmitry)

On Mon, Mar 06, 2017 at 10:44:56AM +0100, Willy Tarreau wrote:
> On Mon, Mar 06, 2017 at 09:59:21AM +0100, Matthias Fechner wrote:
> > Hi Georg,
> >
> > Am 06.03.2017 um 09:43 schrieb Georg Faerber:
> > > I'm not running FreeBSD myself, but have a look at [1]: In the
> > > follow-ups to this thread there are two more people reporting problems.
> > >
> > > [1] https:[email protected]/msg25093.html
> >
> > no, this cannot be the problem, because this error reported in [1] is
> > related to haproxy version 1.7.2.
>
> Since we don't know what causes the issue above, we could end up discovering
> that there's a behaviour change that depends on the workload.
>
> > My problem is related to 1.7.3. The problem was introduced by a change
> > for 1.7.3. as 1.7.2 is running fine.
>
> Could you retry 1.7.3 by reverting the attached patch ? I don't see
> why it would cause any trouble but that's the only likely candidate
> I'm seeing between 1.7.2 and 1.7.3. If it fixes it, it may indicate
> an issue with our implementation of the kqueue poller, possibly
> explaining the issues other people have reported with fbsd 11 vs 10.
>
> Also any hint you can provide to help people reproduce it would be welcome!
>
> Thanks,
> Willy
From cd4c5a3ecf5e77fb4734c423c914f7280199c763 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Wed, 25 Jan 2017 14:12:22 +0100
Subject: BUG/MEDIUM: tcp: don't poll for write when connect() succeeds
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4

While testing a tcp_fastopen related change, it appeared that in the rare
case where connect() can immediately succeed, we still subscribe to write
notifications on the socket, causing the conn_fd_handler() to immediately
be called and a second call to connect() to be attempted to double-check
the connection.

In fact this issue had already been met with unix sockets (which often
respond immediately) and partially addressed but incorrect so another
patch will follow. But for TCP nothing was done.

The fix consists in removing the WAIT_L4_CONN flag if connect() succeeds
and to subscribe for writes only if some handshakes or L4_CONN are still
needed. In addition in order not to fail raw TCP health checks, we have
to continue to enable polling for data when nothing is scheduled for
leaving and the connection is already established, otherwise the caller
will never be notified.

This fix should be backported to 1.7 and 1.6.
(cherry picked from commit 819efbf4b532d718abeb5e5aa6b2521ed725fe17)
---
src/proto_tcp.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/src/proto_tcp.c b/src/proto_tcp.c
index f6d8ca1..c04f276 100644
--- a/src/proto_tcp.c
+++ b/src/proto_tcp.c
@@ -474,10 +474,16 @@ int tcp_connect_server(struct connection *conn, int data, int delack)
if (global.tune.server_rcvbuf)
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &global.tune.server_rcvbuf, sizeof(global.tune.server_rcvbuf));

- if ((connect(fd, (struct sockaddr *)&conn->addr.to, get_addr_len(&conn->addr.to)) == -1) &&
- (errno != EINPROGRESS) && (errno != EALREADY) && (errno != EISCONN)) {
-
- if (errno == EAGAIN || errno == EADDRINUSE || errno == EADDRNOTAVAIL) {
+ if (connect(fd, (struct sockaddr *)&conn->addr.to, get_addr_len(&conn->addr.to)) == -1) {
+ if (errno == EINPROGRESS || errno == EALREADY) {
+ /* common case, let's wait for connect status */
+ conn->flags |= CO_FL_WAIT_L4_CONN;
+ }
+ else if (errno == EISCONN) {
+ /* should normally not happen but if so, indicates that it's OK */
+ conn->flags &= ~CO_FL_WAIT_L4_CONN;
+ }
+ else if (errno == EAGAIN || errno == EADDRINUSE || errno == EADDRNOTAVAIL) {
char *msg;
if (errno == EAGAIN || errno == EADDRNOTAVAIL) {
msg = "no free ports";
@@ -514,6 +520,10 @@ int tcp_connect_server(struct connection *conn, int data, int delack)
return SF_ERR_SRVCL;
}
}
+ else {
+ /* connect() == 0, this is great! */
+ conn->flags &= ~CO_FL_WAIT_L4_CONN;
+ }

conn->flags |= CO_FL_ADDR_TO_SET;

@@ -523,7 +533,6 @@ int tcp_connect_server(struct connection *conn, int data, int delack)

conn_ctrl_init(conn); /* registers the FD */
fdtab[fd].linger_risk = 1; /* close hard if needed */
- conn_sock_want_send(conn); /* for connect status */

if (conn_xprt_init(conn) < 0) {
conn_force_close(conn);
@@ -531,6 +540,17 @@ int tcp_connect_server(struct connection *conn, int data, int delack)
return SF_ERR_RESOURCE;
}

+ if (conn->flags & (CO_FL_HANDSHAKE | CO_FL_WAIT_L4_CONN)) {
+ conn_sock_want_send(conn); /* for connect status, proxy protocol or SSL */
+ }
+ else {
+ /* If there's no more handshake, we need to notify the data
+ * layer when the connection is already OK otherwise we'll have
+ * no other opportunity to do it later (eg: health checks).
+ */
+ data = 1;
+ }
+
if (data)
conn_data_want_send(conn); /* prepare to send data if any */

--
1.7.12.1
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 06, 2017 11:30PM
Dear Willy and Dmitry,

Am 06.03.2017 um 11:16 schrieb Willy Tarreau:
> with the attachment now (thanks Dmitry)

hm, I'm not able to apply the patch:
git apply --ignore-space-change --ignore-whitespace
0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch

But I get:
0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch:41:
trailing whitespace.
if (connect(fd, (struct sockaddr *)&conn->addr.to,
get_addr_len(&conn->addr.to)) == -1) {
0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch:42:
trailing whitespace.
if (errno == EINPROGRESS || errno == EALREADY) {
0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch:43:
trailing whitespace.
/* common case, let's wait for connect status */
0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch:44:
trailing whitespace.
conn->flags |= CO_FL_WAIT_L4_CONN;
0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch:45:
trailing whitespace.
}
error: patch failed: src/proto_tcp.c:474
error: src/proto_tcp.c: patch does not apply

It is a 1.7.3 version (I just did a make extract in the freebsd port and
tried to apply the patch).

Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 07, 2017 12:40AM
On Mon, Mar 06, 2017 at 11:19:18PM +0100, Matthias Fechner wrote:
> Dear Willy and Dmitry,
>
> Am 06.03.2017 um 11:16 schrieb Willy Tarreau:
> > with the attachment now (thanks Dmitry)
>
> hm, I'm not able to apply the patch:
> git apply --ignore-space-change --ignore-whitespace
> 0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch

Sorry, when I said "revert" I meant typically like this :

patch -Rp1 < 0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch

I've just tested here on 1.7.3 and it does apply correctly.

With git apply you'll have to pass -R as well.
Sorry for not being clear the first time.

Willy
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 07, 2017 08:40AM
Thanks Willy,

Am 07.03.2017 um 00:32 schrieb Willy Tarreau:
> Sorry, when I said "revert" I meant typically like this :
>
> patch -Rp1 < 0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch
>
> I've just tested here on 1.7.3 and it does apply correctly.
>
> With git apply you'll have to pass -R as well.
> Sorry for not being clear the first time.

I recompiled haproxy 1.7.3 with the patch reverted and will test it now.

Please give me 9 hours, to give you a real feedback, but so far the
monitoring system is quiet.

Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 07, 2017 10:20AM
Am 2017-03-07 00:32, schrieb Willy Tarreau:
> patch -Rp1 <
> 0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch
>
> I've just tested here on 1.7.3 and it does apply correctly.
>
> With git apply you'll have to pass -R as well.
> Sorry for not being clear the first time.

so, shortly after I started to 1.7.3 with the patch reverted I got
timeouts again.
If I do a netstat -an I see a lot of:
tcp4 0 0 127.0.0.1.443 127.0.0.1.47010
TIME_WAIT
tcp4 0 0 127.0.0.1.443 127.0.0.1.46961
CLOSE_WAIT
tcp4 0 0 127.0.0.1.46961 127.0.0.1.443
FIN_WAIT_2

Can I do anything to help you to track the problem to the source?

--
Gruß
Matthias
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 07, 2017 10:20AM
On Tue, Mar 07, 2017 at 08:30:43AM +0100, Matthias Fechner wrote:
> Thanks Willy,
>
> Am 07.03.2017 um 00:32 schrieb Willy Tarreau:
> > Sorry, when I said "revert" I meant typically like this :
> >
> > patch -Rp1 < 0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch
> >
> > I've just tested here on 1.7.3 and it does apply correctly.
> >
> > With git apply you'll have to pass -R as well.
> > Sorry for not being clear the first time.
>
> I recompiled haproxy 1.7.3 with the patch reverted and will test it now.
>
> Please give me 9 hours, to give you a real feedback, but so far the
> monitoring system is quiet.

Thanks. In the mean time, I'm interested in trying to figure the code
paths we follow. Could please tell me :
- if you're using send-proxy on your server lines ?
- if you're using ssl on your server lines ?
- if you're using SSL health checks ?
- if you're only using HTTP or only using another TCP protocol, or
a combination of both ?

It is possible that your workload triggers a bug that was hidden by the
one fixed above. This bug may even affect only a specific poller.

Thanks,
Willy
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 07, 2017 10:30AM
On Tue, Mar 07, 2017 at 10:14:42AM +0100, Matthias Fechner wrote:
> Am 2017-03-07 00:32, schrieb Willy Tarreau:
> > patch -Rp1 <
> > 0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch
> >
> > I've just tested here on 1.7.3 and it does apply correctly.
> >
> > With git apply you'll have to pass -R as well.
> > Sorry for not being clear the first time.
>
> so, shortly after I started to 1.7.3 with the patch reverted I got timeouts
> again.

Wow we cross-posted.

> If I do a netstat -an I see a lot of:
> tcp4 0 0 127.0.0.1.443 127.0.0.1.47010 TIME_WAIT
> tcp4 0 0 127.0.0.1.443 127.0.0.1.46961 CLOSE_WAIT
> tcp4 0 0 127.0.0.1.46961 127.0.0.1.443 FIN_WAIT_2

Is :443 your haproxy or the server after haproxy ? If it's the server behind
haproxy it means this one didn't respond to a close. If it's haproxy it means
it's stuck doing something else (very bad).

> Can I do anything to help you to track the problem to the source?

Yes, please test with "nokqueue" in the global section, or start haproxy with
"-dk". It will switch to poll() and will tell us if there's a bug in the kqueue
poller. Please be aware that your CPU usage will increase a bit.

Thanks,
Willy
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 07, 2017 01:30PM
Am 2017-03-07 10:19, schrieb Willy Tarreau:
>> If I do a netstat -an I see a lot of:
>> tcp4 0 0 127.0.0.1.443 127.0.0.1.47010
>> TIME_WAIT
>> tcp4 0 0 127.0.0.1.443 127.0.0.1.46961
>> CLOSE_WAIT
>> tcp4 0 0 127.0.0.1.46961 127.0.0.1.443
>> FIN_WAIT_2
>
> Is :443 your haproxy or the server after haproxy ? If it's the server
> behind
> haproxy it means this one didn't respond to a close. If it's haproxy it
> means
> it's stuck doing something else (very bad).

could you please check my first post, there is the complete haproxy
config included, here a short summary:
sslh is listening on 192.168.0.251:443 and 192.168.1.2:443 and
192.168.200.6:443 is acting as transparent proxy forwarding all requests
to 192.168.0.251:8443 or 192.168.200.6:8443
External connections reaching the server on
192.168.0.251/192.168.1.2/192.168.200.6 depending on which network
interface they came in.
haproxy (frontend) is listening on localhost:443 and 192.168.0.251:8443,
192.168.200.6:8443 (in tcp mode, as I use http2 to connect to backend
nginx).

So in this case I would say that haproxy is hanging.
As this is a test system, I can do more tests if required.

> Yes, please test with "nokqueue" in the global section, or start
> haproxy with
> "-dk". It will switch to poll() and will tell us if there's a bug in
> the kqueue
> poller. Please be aware that your CPU usage will increase a bit.

I tested it again.
haproxy 1.7.3 with patch reverted -> timeout
haproxy 1.7.3 with patch reverted and kqueue disabled -> timeout
ps shows: /usr/local/sbin/haproxy -dk -q -f /usr/local/etc/haproxy.conf
-p /var/run/haproxy.pid
haproxy 1.7.2 -> no timeouts

I cannot says that in sslh is maybe a bug that is now triggered by a
change inside haproxy.
The homepage of sslh is here: http://www.rutschle.net/tech/sslh.shtml

I use sslh to get ssh/https/openvpn to listen the same port (443). sslh
just looks at the first bytes of the connection and acts as a
transparent proxy to forward the connection to haproxy (https), ssh or
openvpn.

Version 1.7.2 shows:
HA-Proxy version 1.7.2 2017/01/13
Copyright 2000-2017 Willy Tarreau <[email protected]>

Build options :
TARGET = freebsd
CPU = generic
CC = cc
CFLAGS = -O2 -pipe -fstack-protector -fno-strict-aliasing
-DFREEBSD_PORTS
OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_CPU_AFFINITY=1
USE_OPENSSL=1 USE_STATIC_PCRE=1 USE_PCRE_JIT=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents =
200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2j 26 Sep 2016
Running on OpenSSL version : OpenSSL 1.0.2k 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Built without Lua support
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY

Available polling systems :
kqueue : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
[SPOE] spoe
[TRACE] trace
[COMP] compression

Version 1.7.3 shows:
HA-Proxy version 1.7.3 2017/02/28
Copyright 2000-2017 Willy Tarreau <[email protected]>

Build options :
TARGET = freebsd
CPU = generic
CC = cc
CFLAGS = -O2 -pipe -fstack-protector -fno-strict-aliasing
-DFREEBSD_PORTS
OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_CPU_AFFINITY=1
USE_OPENSSL=1 USE_STATIC_PCRE=1 USE_PCRE_JIT=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents =
200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2k 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.40 2017-01-11
Running on PCRE version : 8.40 2017-01-11
PCRE library supports JIT : yes
Built without Lua support
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY

Available polling systems :
kqueue : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
[SPOE] spoe
[TRACE] trace
[COMP] compression


--
Gruß
Matthias
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 07, 2017 01:40PM
Am 2017-03-07 10:15, schrieb Willy Tarreau:
> Thanks. In the mean time, I'm interested in trying to figure the code
> paths we follow. Could please tell me :
> - if you're using send-proxy on your server lines ?

yes, all 2 backends have (but see first post, for full config:
backend nginx-http2-backend
mode tcp
server www-1 127.0.0.1:8083 check send-proxy

backend nginx-http-backend
mode tcp
server www-1 127.0.0.1:8082 check send-proxy

> - if you're using ssl on your server lines ?

ssl is terminated in the frontend, no ssl connection to the backend

> - if you're using SSL health checks ?

could you please see my first post, I'm not sure what you mean by this
question, but I would says no, as the backend is not using ssl.

> - if you're only using HTTP or only using another TCP protocol, or
> a combination of both ?

please see here also my first post for more details.
haproxy is working completely in tcp mode.
nginx is listening on port 8082 with http1 mode
nginx is listening on port 8083 with http2 mode
haproxy forwards the connection with acl:
acl http2 ssl_fc_alpn -i h2
use_backend nginx-http2-backend if http2
default_backend nginx-http-backend


--
Gruß
Matthias
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 07, 2017 05:30PM
Hi Matthias,

On Tue, Mar 07, 2017 at 01:21:39PM +0100, Matthias Fechner wrote:
> could you please check my first post, there is the complete haproxy config
> included,

Oh I'm sorry, I simply cannot catch up with so many e-mails, at ~1500/day
that's about 10k/wk and I can't follow any single thread anymore :-(

> here a short summary:
> sslh is listening on 192.168.0.251:443 and 192.168.1.2:443 and
> 192.168.200.6:443 is acting as transparent proxy forwarding all requests to
> 192.168.0.251:8443 or 192.168.200.6:8443
> External connections reaching the server on
> 192.168.0.251/192.168.1.2/192.168.200.6 depending on which network interface
> they came in.
> haproxy (frontend) is listening on localhost:443 and 192.168.0.251:8443,
> 192.168.200.6:8443 (in tcp mode, as I use http2 to connect to backend
> nginx).
>
> So in this case I would say that haproxy is hanging.

Yes and then it's irrelevant to the connect() thing that was recently changed,
we have to restart the thinking from scratch.

> > Yes, please test with "nokqueue" in the global section, or start haproxy
> > with
> > "-dk". It will switch to poll() and will tell us if there's a bug in the
> > kqueue
> > poller. Please be aware that your CPU usage will increase a bit.
>
> I tested it again.
> haproxy 1.7.3 with patch reverted -> timeout
> haproxy 1.7.3 with patch reverted and kqueue disabled -> timeout
> ps shows: /usr/local/sbin/haproxy -dk -q -f /usr/local/etc/haproxy.conf -p
> /var/run/haproxy.pid
> haproxy 1.7.2 -> no timeouts
>
> I cannot says that in sslh is maybe a bug that is now triggered by a change
> inside haproxy.
> The homepage of sslh is here: http://www.rutschle.net/tech/sslh.shtml
>
> I use sslh to get ssh/https/openvpn to listen the same port (443). sslh just
> looks at the first bytes of the connection and acts as a transparent proxy
> to forward the connection to haproxy (https), ssh or openvpn.

OK, I didn't initially understand there was this in front, and was not even
aware of this component since most of us do that directly from within haproxy
(and have the proxy protocol added for free to forwarded connections).
There's very little chance it has an effect causing haproxy to enter a
CLOSE_WAIT, but we never know obviously.

> Version 1.7.2 shows:
> HA-Proxy version 1.7.2 2017/01/13
(...)

So they're pretty much identical except the version. Are you interested in
trying to do a bisection between 1.7.2 and 1.7.3 to find the culprit commit ?
There are only 20 patches so it should take about 5 attempts so depending
on the time it takes for the problem to appear it may be faster than
speculating on each individual patch. If you're interested, the procedure
is the following :

- from a git tree containing haroxy 1.7, you start bisection between 1.7.2
and 1.7.3 this way :

$ git bisect start v1.7.3 v1.7.2

- it will cut the history in the middle and will checkout this state.

- then you build and run the resulting executable

- if it fails, you type :

$ git bisect bad

and if it works, you type :

$ git bisect good

- this will result in git only focusing on half of the remaining history
either before or after the test point and to checkout the next commit so
that you can build and test again. If you end up on a patch that
definitely cannot be the culprit (eg: doc) you can even skip it :

$ git bisect skip

- once you're done or if you failed, simply type "git bisect reset".

- it's often nice to note the last known good and last known bad commits
especially if it takes a bit of time because if you happen to do
something else with your sources, you don't want to screw up your
bisection point.

At the end git will tell you "this is the first bad commit". This
way we'll be sure what to look at. There are a few candidates :

- disable of close on redirects by default (though you're in pure TCP
most of the time)
- failure to properly handle polling on connect()
- side effect of the fix on the analysers (possibly uncovering another
bug that used to stay hidden)
- SSL build fixes.

I'm still wondering why you're the only one facing this for now and I
suspect it's unrelated to the fact that you're on FreeBSD 11.

Thanks,
Willy
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 11, 2017 01:20PM
Hi Willy,

Am 07.03.2017 um 17:26 schrieb Willy Tarreau:
>
> So they're pretty much identical except the version. Are you interested in
> trying to do a bisection between 1.7.2 and 1.7.3 to find the culprit commit ?
> There are only 20 patches so it should take about 5 attempts so depending
> on the time it takes for the problem to appear it may be faster than
> speculating on each individual patch. If you're interested, the procedure
> is the following :
>
> - from a git tree containing haroxy 1.7, you start bisection between 1.7.2
> and 1.7.3 this way :
>
> $ git bisect start v1.7.3 v1.7.2

here is the result:
[email protected] /u/p/n/h/w/haproxy-1.7.3> git bisect log
# bad: [9cb532a34ae190b350cdeb8bbbae25d524b10949] [RELEASE] Released
version 1.7.3
# good: [ddb646ee9182df570017ddf280873a1360a28898] [RELEASE] Released
version 1.7.2
git bisect start 'v1.7.3' 'v1.7.2'
# bad: [eaf96d7a0849b2883e98459f52489d555b6b013c] BUG/MAJOR: dns:
restart sockets after fork()
git bisect bad eaf96d7a0849b2883e98459f52489d555b6b013c
# bad: [cd4c5a3ecf5e77fb4734c423c914f7280199c763] BUG/MEDIUM: tcp: don't
poll for write when connect() succeeds
git bisect bad cd4c5a3ecf5e77fb4734c423c914f7280199c763
# good: [9a84d5dd772f3198da499534ac2f8d7694be2b23] BUILD: ssl: fix build
on OpenSSL 1.0.0
git bisect good 9a84d5dd772f3198da499534ac2f8d7694be2b23

Hope that helps you.

> I'm still wondering why you're the only one facing this for now and I
> suspect it's unrelated to the fact that you're on FreeBSD 11.
>

I face similar problems on FreeBSD 10.3-p11 but it is much harder to
find them.
On FreeBSD 10.3 if I try to access the main page:
https://domain/
it works without problem, but if I try to access:
https://domain/node/130/edit?destination=admin/content
It takes a very long time to get a response (but I get one).
I rolled back to haproxy 1.7.2 here too, and it solves the problem.

Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 11, 2017 01:50PM
Hi Matthias,

On Sat, Mar 11, 2017 at 01:16:19PM +0100, Matthias Fechner wrote:
> Hi Willy,
>
> Am 07.03.2017 um 17:26 schrieb Willy Tarreau:
> >
> > So they're pretty much identical except the version. Are you interested in
> > trying to do a bisection between 1.7.2 and 1.7.3 to find the culprit commit ?
> > There are only 20 patches so it should take about 5 attempts so depending
> > on the time it takes for the problem to appear it may be faster than
> > speculating on each individual patch. If you're interested, the procedure
> > is the following :
> >
> > - from a git tree containing haroxy 1.7, you start bisection between 1.7.2
> > and 1.7.3 this way :
> >
> > $ git bisect start v1.7.3 v1.7.2
>
> here is the result:
> [email protected] /u/p/n/h/w/haproxy-1.7.3> git bisect log
> # bad: [9cb532a34ae190b350cdeb8bbbae25d524b10949] [RELEASE] Released
> version 1.7.3
> # good: [ddb646ee9182df570017ddf280873a1360a28898] [RELEASE] Released
> version 1.7.2
> git bisect start 'v1.7.3' 'v1.7.2'
> # bad: [eaf96d7a0849b2883e98459f52489d555b6b013c] BUG/MAJOR: dns:
> restart sockets after fork()
> git bisect bad eaf96d7a0849b2883e98459f52489d555b6b013c
> # bad: [cd4c5a3ecf5e77fb4734c423c914f7280199c763] BUG/MEDIUM: tcp: don't
> poll for write when connect() succeeds
> git bisect bad cd4c5a3ecf5e77fb4734c423c914f7280199c763
> # good: [9a84d5dd772f3198da499534ac2f8d7694be2b23] BUILD: ssl: fix build
> on OpenSSL 1.0.0
> git bisect good 9a84d5dd772f3198da499534ac2f8d7694be2b23

I don't understand, the bisection didn't end ?

Otherwise I'm inclined to think that the regression comes from
"BUG/MEDIUM: tcp: don't poll for write when connect() succeeds", which
it the one I proposed you to revert and which didn't change anything.

Did "git bisect" end up telling you "This is the first bad commit" ?

> > I'm still wondering why you're the only one facing this for now and I
> > suspect it's unrelated to the fact that you're on FreeBSD 11.
> >
>
> I face similar problems on FreeBSD 10.3-p11 but it is much harder to
> find them.
> On FreeBSD 10.3 if I try to access the main page:
> https://domain/
> it works without problem, but if I try to access:
> https://domain/node/130/edit?destination=admin/content
> It takes a very long time to get a response (but I get one).
> I rolled back to haproxy 1.7.2 here too, and it solves the problem.

OK thank you.

Willy
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 11, 2017 02:00PM
Am 11.03.2017 um 13:45 schrieb Willy Tarreau:
> I don't understand, the bisection didn't end ?
>
> Otherwise I'm inclined to think that the regression comes from
> "BUG/MEDIUM: tcp: don't poll for write when connect() succeeds", which
> it the one I proposed you to revert and which didn't change anything.
>
> Did "git bisect" end up telling you "This is the first bad commit" ?

I had a typo, here the result (typed git bisect goog instead of good):

cd4c5a3ecf5e77fb4734c423c914f7280199c763 is the first bad commit
commit cd4c5a3ecf5e77fb4734c423c914f7280199c763
Author: Willy Tarreau <[email protected]>
Date: Wed Jan 25 14:12:22 2017 +0100

BUG/MEDIUM: tcp: don't poll for write when connect() succeeds

While testing a tcp_fastopen related change, it appeared that in the
rare
case where connect() can immediately succeed, we still subscribe to
write
notifications on the socket, causing the conn_fd_handler() to
immediately
be called and a second call to connect() to be attempted to double-check
the connection.

In fact this issue had already been met with unix sockets (which often
respond immediately) and partially addressed but incorrect so another
patch will follow. But for TCP nothing was done.

The fix consists in removing the WAIT_L4_CONN flag if connect() succeeds
and to subscribe for writes only if some handshakes or L4_CONN are still
needed. In addition in order not to fail raw TCP health checks, we have
to continue to enable polling for data when nothing is scheduled for
leaving and the connection is already established, otherwise the caller
will never be notified.

This fix should be backported to 1.7 and 1.6.
(cherry picked from commit 819efbf4b532d718abeb5e5aa6b2521ed725fe17)

:040000 040000 438a30304ac086b5cf0b0670253ff12efefeac82
6f2d45215d45de5669d3789eac247df856117e69 M src

Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 11, 2017 02:20PM
On Sat, Mar 11, 2017 at 01:54:57PM +0100, Matthias Fechner wrote:
> Am 11.03.2017 um 13:45 schrieb Willy Tarreau:
> > I don't understand, the bisection didn't end ?
> >
> > Otherwise I'm inclined to think that the regression comes from
> > "BUG/MEDIUM: tcp: don't poll for write when connect() succeeds", which
> > it the one I proposed you to revert and which didn't change anything.
> >
> > Did "git bisect" end up telling you "This is the first bad commit" ?
>
> I had a typo, here the result (typed git bisect goog instead of good):
>
> cd4c5a3ecf5e77fb4734c423c914f7280199c763 is the first bad commit
> commit cd4c5a3ecf5e77fb4734c423c914f7280199c763
> Author: Willy Tarreau <[email protected]>
> Date: Wed Jan 25 14:12:22 2017 +0100
>
> BUG/MEDIUM: tcp: don't poll for write when connect() succeeds

(...)

OK so this is the one that I initially suspected and that after you
reverted, didn't fix the issue for you.

Are you sure you didn't have a problem when you reverted it ? (eg:
failed to restart the process or something like this).

At least now we're certain it's this one so we can try to imagine
any possible corner case or any remaining bug this one could have
uncovered.

Thanks!
Willy
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 12, 2017 09:40AM
Hi Willy,

Am 11.03.2017 um 14:13 schrieb Willy Tarreau:
>
> OK so this is the one that I initially suspected and that after you
> reverted, didn't fix the issue for you.
>
> Are you sure you didn't have a problem when you reverted it ? (eg:
> failed to restart the process or something like this).
>
> At least now we're certain it's this one so we can try to imagine
> any possible corner case or any remaining bug this one could have
> uncovered.

I tested it again with the patch and I get the result:
patch -Rp1
</home/idefix/0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch
...
|---
| src/proto_tcp.c | 30 +++++++++++++++++++++++++-----
| 1 file changed, 25 insertions(+), 5 deletions(-)
|
|diff --git a/src/proto_tcp.c b/src/proto_tcp.c
|index f6d8ca1..c04f276 100644
|--- a/src/proto_tcp.c
|+++ b/src/proto_tcp.c
--------------------------
Patching file src/proto_tcp.c using Plan A...
Hunk #1 failed at 474.
Hunk #2 failed at 514.
Hunk #3 failed at 523.
Hunk #4 failed at 531.
4 out of 4 hunks failed--saving rejects to src/proto_tcp.c.rej
Hmm... Ignoring the trailing garbage.
done


What I understand from the output is that the revert got rejected in
src/proto_tcp.c.
So that explains why the test failed.

I checked the port again and there is one patch applied to haproxy, but
it is a different file, so it should not cause the patch to fail, but
maybe can cause other problems.
--- src/hlua_fcn.c.orig 2016-12-17 13:58:44.786067000 +0300
+++ src/hlua_fcn.c 2016-12-17 13:59:17.551256000 +0300
@@ -39,6 +39,12 @@ static int class_listener_ref;

#define STATS_LEN (MAX((int)ST_F_TOTAL_FIELDS, (int)INF_TOTAL_FIELDS))

+#if defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__)
+#define s6_addr8 __u6_addr.__u6_addr8
+#define s6_addr16 __u6_addr.__u6_addr16
+#define s6_addr32 __u6_addr.__u6_addr32
+#endif
+
static struct field stats[STATS_LEN];

int hlua_checkboolean(lua_State *L, int index)



Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 12, 2017 10:00AM
Hi Matthias,

On Sun, Mar 12, 2017 at 09:34:31AM +0100, Matthias Fechner wrote:
> Hi Willy,
>
> Am 11.03.2017 um 14:13 schrieb Willy Tarreau:
> >
> > OK so this is the one that I initially suspected and that after you
> > reverted, didn't fix the issue for you.
> >
> > Are you sure you didn't have a problem when you reverted it ? (eg:
> > failed to restart the process or something like this).
> >
> > At least now we're certain it's this one so we can try to imagine
> > any possible corner case or any remaining bug this one could have
> > uncovered.
>
> I tested it again with the patch and I get the result:
> patch -Rp1
> </home/idefix/0001-BUG-MEDIUM-tcp-don-t-poll-for-write-when-connect-suc.patch
> ...
> |---
> | src/proto_tcp.c | 30 +++++++++++++++++++++++++-----
> | 1 file changed, 25 insertions(+), 5 deletions(-)
> |
> |diff --git a/src/proto_tcp.c b/src/proto_tcp.c
> |index f6d8ca1..c04f276 100644
> |--- a/src/proto_tcp.c
> |+++ b/src/proto_tcp.c
> --------------------------
> Patching file src/proto_tcp.c using Plan A...
> Hunk #1 failed at 474.
> Hunk #2 failed at 514.
> Hunk #3 failed at 523.
> Hunk #4 failed at 531.
> 4 out of 4 hunks failed--saving rejects to src/proto_tcp.c.rej
> Hmm... Ignoring the trailing garbage.
> done
>
>
> What I understand from the output is that the revert got rejected in
> src/proto_tcp.c.
> So that explains why the test failed.

Yep, that totally makes sense. Thanks for checking. Please find in
attachment one which does properly apply here with -Rp1 (at least it
will allow you to fix your production for the time it takes to find
the root cause of this issue). If it doesn't apply it may mean that
you have some leftovers of a previous patch or that your tree has
other local patches, which could also explain the problem.

> I checked the port again and there is one patch applied to haproxy, but
> it is a different file, so it should not cause the patch to fail, but
> maybe can cause other problems.
> --- src/hlua_fcn.c.orig 2016-12-17 13:58:44.786067000 +0300
> +++ src/hlua_fcn.c 2016-12-17 13:59:17.551256000 +0300
> @@ -39,6 +39,12 @@ static int class_listener_ref;
>
> #define STATS_LEN (MAX((int)ST_F_TOTAL_FIELDS, (int)INF_TOTAL_FIELDS))
>
> +#if defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__)
> +#define s6_addr8 __u6_addr.__u6_addr8
> +#define s6_addr16 __u6_addr.__u6_addr16
> +#define s6_addr32 __u6_addr.__u6_addr32
> +#endif
> +

No, it's harmless, it's just a build fix.

Best regards,
Willy
From cd4c5a3ecf5e77fb4734c423c914f7280199c763 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Wed, 25 Jan 2017 14:12:22 +0100
Subject: BUG/MEDIUM: tcp: don't poll for write when connect() succeeds
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4

While testing a tcp_fastopen related change, it appeared that in the rare
case where connect() can immediately succeed, we still subscribe to write
notifications on the socket, causing the conn_fd_handler() to immediately
be called and a second call to connect() to be attempted to double-check
the connection.

In fact this issue had already been met with unix sockets (which often
respond immediately) and partially addressed but incorrect so another
patch will follow. But for TCP nothing was done.

The fix consists in removing the WAIT_L4_CONN flag if connect() succeeds
and to subscribe for writes only if some handshakes or L4_CONN are still
needed. In addition in order not to fail raw TCP health checks, we have
to continue to enable polling for data when nothing is scheduled for
leaving and the connection is already established, otherwise the caller
will never be notified.

This fix should be backported to 1.7 and 1.6.
(cherry picked from commit 819efbf4b532d718abeb5e5aa6b2521ed725fe17)
---
src/proto_tcp.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/src/proto_tcp.c b/src/proto_tcp.c
index f6d8ca1..c04f276 100644
--- a/src/proto_tcp.c
+++ b/src/proto_tcp.c
@@ -474,10 +474,16 @@ int tcp_connect_server(struct connection *conn, int data, int delack)
if (global.tune.server_rcvbuf)
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &global.tune.server_rcvbuf, sizeof(global.tune.server_rcvbuf));

- if ((connect(fd, (struct sockaddr *)&conn->addr.to, get_addr_len(&conn->addr.to)) == -1) &&
- (errno != EINPROGRESS) && (errno != EALREADY) && (errno != EISCONN)) {
-
- if (errno == EAGAIN || errno == EADDRINUSE || errno == EADDRNOTAVAIL) {
+ if (connect(fd, (struct sockaddr *)&conn->addr.to, get_addr_len(&conn->addr.to)) == -1) {
+ if (errno == EINPROGRESS || errno == EALREADY) {
+ /* common case, let's wait for connect status */
+ conn->flags |= CO_FL_WAIT_L4_CONN;
+ }
+ else if (errno == EISCONN) {
+ /* should normally not happen but if so, indicates that it's OK */
+ conn->flags &= ~CO_FL_WAIT_L4_CONN;
+ }
+ else if (errno == EAGAIN || errno == EADDRINUSE || errno == EADDRNOTAVAIL) {
char *msg;
if (errno == EAGAIN || errno == EADDRNOTAVAIL) {
msg = "no free ports";
@@ -514,6 +520,10 @@ int tcp_connect_server(struct connection *conn, int data, int delack)
return SF_ERR_SRVCL;
}
}
+ else {
+ /* connect() == 0, this is great! */
+ conn->flags &= ~CO_FL_WAIT_L4_CONN;
+ }

conn->flags |= CO_FL_ADDR_TO_SET;

@@ -523,7 +533,6 @@ int tcp_connect_server(struct connection *conn, int data, int delack)

conn_ctrl_init(conn); /* registers the FD */
fdtab[fd].linger_risk = 1; /* close hard if needed */
- conn_sock_want_send(conn); /* for connect status */

if (conn_xprt_init(conn) < 0) {
conn_force_close(conn);
@@ -531,6 +540,17 @@ int tcp_connect_server(struct connection *conn, int data, int delack)
return SF_ERR_RESOURCE;
}

+ if (conn->flags & (CO_FL_HANDSHAKE | CO_FL_WAIT_L4_CONN)) {
+ conn_sock_want_send(conn); /* for connect status, proxy protocol or SSL */
+ }
+ else {
+ /* If there's no more handshake, we need to notify the data
+ * layer when the connection is already OK otherwise we'll have
+ * no other opportunity to do it later (eg: health checks).
+ */
+ data = 1;
+ }
+
if (data)
conn_data_want_send(conn); /* prepare to send data if any */

--
2.8.0.rc2.1.gbe9624a
Dmitry Sivachenko
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 12, 2017 10:00AM
> On 12 Mar 2017, at 11:34, Matthias Fechner <[email protected]> wrote:
>
>
> I checked the port again and there is one patch applied to haproxy, but
> it is a different file, so it should not cause the patch to fail, but
> maybe can cause other problems.
> --- src/hlua_fcn.c.orig 2016-12-17 13:58:44.786067000 +0300
> +++ src/hlua_fcn.c 2016-12-17 13:59:17.551256000 +0300
> @@ -39,6 +39,12 @@ static int class_listener_ref;
>
> #define STATS_LEN (MAX((int)ST_F_TOTAL_FIELDS, (int)INF_TOTAL_FIELDS))
>
> +#if defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__)
> +#define s6_addr8 __u6_addr.__u6_addr8
> +#define s6_addr16 __u6_addr.__u6_addr16
> +#define s6_addr32 __u6_addr.__u6_addr32
> +#endif
> +
> static struct field stats[STATS_LEN];
>
> int hlua_checkboolean(lua_State *L, int index)
>
>


I removed this patch from ports.
It was needed for previous version to compile and should not cause any problems.
Now it became obsoleted.
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 12, 2017 10:50AM
Hi Willy, Hi Dmitry,

Am 12.03.2017 um 09:48 schrieb Willy Tarreau:
> Yep, that totally makes sense. Thanks for checking. Please find in
> attachment one which does properly apply here with -Rp1 (at least it
> will allow you to fix your production for the time it takes to find
> the root cause of this issue). If it doesn't apply it may mean that
> you have some leftovers of a previous patch or that your tree has
> other local patches, which could also explain the problem.

maybe the patch command from FreeBSD is different, I do not get your
patch applying correctly.
I always start from a clean build dir version 1.7.3.

But Dmitry already created an updated version of the port with the patch
included:
https://svnweb.freebsd.org/ports/head/net/haproxy/files/patch-src-proto_tcp.c?view=markup&pathrev=435979

This patch applies correctly and I have now a test env running with the
patched version.
My monitoring system is already testing it and it looks promising, no
timeouts so far.
I also did some other tests that caused timeout before and everything
looks fine.

I will monitor it closely the next day and will report immediately if I
get a timeout, but I think it is fixed now.

Thanks you all for your support!


Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 12, 2017 03:00PM
On Sun, Mar 12, 2017 at 10:44:09AM +0100, Matthias Fechner wrote:
> maybe the patch command from FreeBSD is different, I do not get your
> patch applying correctly.
> I always start from a clean build dir version 1.7.3.

Very strange. That's good to know at least.

> But Dmitry already created an updated version of the port with the patch
> included:
> https://svnweb.freebsd.org/ports/head/net/haproxy/files/patch-src-proto_tcp.c?view=markup&pathrev=435979
>
> This patch applies correctly and I have now a test env running with the
> patched version.
> My monitoring system is already testing it and it looks promising, no
> timeouts so far.
>
> I also did some other tests that caused timeout before and everything
> looks fine.

OK great.

> I will monitor it closely the next day and will report immediately if I
> get a timeout, but I think it is fixed now.
>
> Thanks you all for your support!

You're welcome. It's possible that I'll ask you to test a patch or two
if I find anything suspicious, given that for now you're the first one
to observe this issue.

Thanks!
Willy
Matthias Fechner
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 12, 2017 05:50PM
Hi Willy,

Am 12.03.2017 um 14:51 schrieb Willy Tarreau:
> You're welcome. It's possible that I'll ask you to test a patch or two
> if I find anything suspicious, given that for now you're the first one
> to observe this issue.

sure, but next week I will not be able to test anything, so take your time.

I do not really understand why no one else has faced the problem.
Maybe no one has update yet (the new version went one week ago into
FreeBSD ports and here only to HEAD and not the quarterly branch).



Gruß
Matthias

--

"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the universe trying to
produce bigger and better idiots. So far, the universe is winning." --
Rich Cook
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 14, 2017 05:40PM
Hi Matthias,

I do have some good news. I could reproduce your issue on a FreeBSD
machine (thanks Olivier!). This issue isn't FreeBSD specific, it's a
bug in haproxy. It just happens that FreeBSD *sometimes* manages to
get connect() to immediately succeed over the loopback, that you're
indeed connecting to a backend server on the loopack, that you're
using send-proxy and that your front sslh sometimes doesn't deliver
a request immediately (probably due to client's pre-connect feature).

In my tests I saw about 1 connection out of 19000 being immediately
accepted but didn't see any problem. So I cheated by disabling non-
blocking during the connect(), resulting in 100% of the connections
being non-blocking :-)

There's a race between the removal of the CO_FL_WAIT_L4_CONN flag on
the connection, the sending of the proxy protocol line and the enabling
of polling for sending pending data.

If I send some data early or force "data=1" in tcp_connect_server(), it
works, just like if I don't remove CO_FL_WAIT_L4_CONN or if I don't use
send-proxy.

All this to say that the issue is now under investigation even if not
yet fully understood. I already suspect I'll discover something scary
down the pipe, I hope it will be reasonably fixable :-)

Thanks for all the elements you provided, that was a key in being able
to go that close to the bug. I'll keep you updated.

Cheers,
Willy
Willy Tarreau
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 14, 2017 10:20PM
Matthias,

I could finally track the problem down to a 5-year old bug in the
connection handler. It already used to affect Unix sockets but it
requires so rare a set of options and even then its occurrence rate
is so low that probably nobody noticed it yet.

I'm attaching the patch to be applied on top of 1.7.3 which fixes it,
it will be merged into next version.

Dmitry, you may prefer to take this one than to revert the previous
one from your ports, especially considering that a few connect()
immediately succeed over the loopback on FreeBSD and that it was
absolutely needed to trigger the bug (as well as the previously fixed
one, which had less impact).

Or you may prefer to wait for 1.7.4. It's not planned yet given that
there are other fixes in the wild waiting for some feedback though.

Thanks guys for the detailed feedback, it's now time to turn the page
and switch to less difficult ones!

Willy
From afbf56b951967e8fa4d509e423fdcb11c27d40e2 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Tue, 14 Mar 2017 20:19:29 +0100
Subject: BUG/MAJOR: connection: update CO_FL_CONNECTED before calling the
data layer
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4

Matthias Fechner reported a regression in 1.7.3 brought by the backport
of commit 819efbf ("BUG/MEDIUM: tcp: don't poll for write when connect()
succeeds"), causing some connections to fail to establish once in a while.
While this commit itself was a fix for a bad sequencing of connection
events, it in fact unveiled a much deeper bug going back to the connection
rework era in v1.5-dev12 : 8f8c92f ("MAJOR: connection: add a new
CO_FL_CONNECTED flag").

It's worth noting that in a lab reproducing a similar environment as
Matthias' about only 1 every 19000 connections exhibit this behaviour,
making the issue not so easy to observe. A trick to make the problem
more observable consists in disabling non-blocking mode on the socket
before calling connect() and re-enabling it later, so that connect()
always succeeds. Then it becomes 100% reproducible.

The problem is that this CO_FL_CONNECTED flag is tested after deciding to
call the data layer (typically the stream interface but might be a health
check as well), and that the decision to call the data layer relies on a
change of one of the flags covered by the CO_FL_CONN_STATE set, which is
made of CO_FL_CONNECTED among others.

Before the fix above, this bug couldn't appear with TCP but it could
appear with Unix sockets. Indeed, connect() was always considered
blocking so the CO_FL_WAIT_L4_CONN connection flag was always set, and
polling for write events was always enabled. This used to guarantee that
the conn_fd_handler() could detect a change among the CO_FL_CONN_STATE
flags.

Now with the fix above, if a connect() immediately succeeds for non-ssl
connection with send-proxy enabled, and no data in the buffer (thus TCP
mode only), the CO_FL_WAIT_L4_CONN flag is not set, the lack of data in
the buffer doesn't enable polling flags for the data layer, the
CO_FL_CONNECTED flag is not set due to send-proxy still being pending,
and once send-proxy is done, its completion doesn't cause the data layer
to be woken up due to the fact that CO_FL_CONNECT is still not present
and that the CO_FL_SEND_PROXY flag is not watched in CO_FL_CONN_STATE.

Then no progress is made when data are received from the client (and
attempted to be forwarded), because a CF_WRITE_NULL (or CF_WRITE_PARTIAL)
flag is needed for the stream-interface state to turn from SI_ST_CON to
SI_ST_EST, allowing ->chk_snd() to be called when new data arrive. And
the only way to set this flag is to call the data layer of course.

After the connect timeout, the connection gets killed and if in the mean
time some data have accumulated in the buffer, the retry will succeed.

This patch fixes this situation by simply placing the update of
CO_FL_CONNECTED where it should have been, before the check for a flag
change needed to wake up the data layer and not after.

This fix must be backported to 1.7, 1.6 and 1.5. Versions not having
the patch above are still affected for unix sockets.

Special thanks to Matthias Fechner who provided a very detailed bug
report with a bisection designating the faulty patch, and to Olivier
Houchard for providing full access to a pretty similar environment where
the issue could first be reproduced.
(cherry picked from commit 7bf3fa3c23f6a1b7ed1212783507ac50f7e27544)
---
src/connection.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/src/connection.c b/src/connection.c
index 26fc5f6..1e4c9aa 100644
--- a/src/connection.c
+++ b/src/connection.c
@@ -131,6 +131,13 @@ void conn_fd_handler(int fd)
}

leave:
+ /* Verify if the connection just established. The CO_FL_CONNECTED flag
+ * being included in CO_FL_CONN_STATE, its change will be noticed by
+ * the next block and be used to wake up the data layer.
+ */
+ if (unlikely(!(conn->flags & (CO_FL_WAIT_L4_CONN | CO_FL_WAIT_L6_CONN | CO_FL_CONNECTED))))
+ conn->flags |= CO_FL_CONNECTED;
+
/* The wake callback may be used to process a critical error and abort the
* connection. If so, we don't want to go further as the connection will
* have been released and the FD destroyed.
@@ -140,10 +147,6 @@ void conn_fd_handler(int fd)
conn->data->wake(conn) < 0)
return;

- /* Last check, verify if the connection just established */
- if (unlikely(!(conn->flags & (CO_FL_WAIT_L4_CONN | CO_FL_WAIT_L6_CONN | CO_FL_CONNECTED))))
- conn->flags |= CO_FL_CONNECTED;
-
/* remove the events before leaving */
fdtab[fd].ev &= FD_POLL_STICKY;

--
1.7.12.1
Dmitry Sivachenko
Re: Problems with haproxy 1.7.3 on FreeBSD 11.0-p8
March 14, 2017 10:50PM
> On 15 Mar 2017, at 00:17, Willy Tarreau <[email protected]> wrote:
>
> Matthias,
>
> I could finally track the problem down to a 5-year old bug in the
> connection handler. It already used to affect Unix sockets but it
> requires so rare a set of options and even then its occurrence rate
> is so low that probably nobody noticed it yet.
>
> I'm attaching the patch to be applied on top of 1.7.3 which fixes it,
> it will be merged into next version.
>
> Dmitry, you may prefer to take this one than to revert the previous
> one from your ports, especially considering that a few connect()
> immediately succeed over the loopback on FreeBSD and that it was
> absolutely needed to trigger the bug (as well as the previously fixed
> one, which had less impact).
>

Thanks!

I committed your patch to FreeBSD ports.
Sorry, only registered users may post in this forum.

Click here to login