HaProxy Hang

David King
HaProxy Hang
March 03, 2017 03:10PM
Hi All

Hoping someone will be able to help; we're running a bit of an interesting
setup.

We have 3 HAProxy nodes running FreeBSD 11.0. Each host runs 4 jails, each
running haproxy, but only one of the jails is under any real load.

We use CARP to balance between the hosts and jails, which seems to be
working fine.

About once every 2-3 months, all the haproxy instances hang: the process
keeps running, but doesn't accept any more connections, and the monitoring
socket is unresponsive. It doesn't produce any errors in the logs.

These hangs all happen within a couple of seconds of each other, across
all jails on all hosts, taking down our frontend network; a restart of the
haproxy service fixes it.

We use Chef for config management, and all the run times are splayed, so
all the haproxy instances have different uptimes.

Does anyone have a good idea of what could cause this?

Thanks!!



haproxy -vv

HA-Proxy version 1.7.2 2017/01/13
Copyright 2000-2017 Willy Tarreau <[email protected]>

Build options :
  TARGET  = freebsd
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -pipe -fstack-protector -fno-strict-aliasing -DFREEBSD_PORTS
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_CPU_AFFINITY=1 USE_OPENSSL=1
USE_LUA=1 USE_STATIC_PCRE=1 USE_PCRE_JIT=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.2k-freebsd 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2j-freebsd 26 Sep 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.40 2017-01-11
Running on PCRE version : 8.40 2017-01-11
PCRE library supports JIT : yes
Built with Lua version : Lua 5.3.3
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY

Available polling systems :
     kqueue : pref=300, test result OK
       poll : pref=200, test result OK
     select : pref=150, test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
    [SPOE] spoe
    [TRACE] trace
    [COMP] compression
Dmitry Sivachenko
Re: HaProxy Hang
March 03, 2017 03:20PM
> On 03 Mar 2017, at 17:07, David King <[email protected]> wrote:
>
> Hi All
>
> Hoping someone will be able to help, we're running a bit of an interesting setup
>
> we have 3 HAProxy nodes running freebsd 11.0 , each host runs 4 jails, each running haproxy, but only one of the jails is under any real load
>


If my memory does not fail me, this is the third report of a haproxy hang on FreeBSD, and all these reports are about FreeBSD-11.

I wonder if anyone experiences this issue with FreeBSD-10?

I am running a rather heavily loaded haproxy cluster on FreeBSD-10 (version 1.6.9 to be specific) and have never experienced any hangs (knock on wood).
David King
Re: HaProxy Hang
March 03, 2017 05:40PM
Thanks for the response!
That's interesting; I don't suppose you have the details of the other issues?

Thanks
Dave

On 3 March 2017 at 14:15, Dmitry Sivachenko <[email protected]> wrote:

>
> > On 03 Mar 2017, at 17:07, David King <[email protected]>
> wrote:
> >
> > Hi All
> >
> > Hoping someone will be able to help, we're running a bit of an
> interesting setup
> >
> > we have 3 HAProxy nodes running freebsd 11.0 , each host runs 4 jails,
> each running haproxy, but only one of the jails is under any real load
> >
>
>
> If my memory does not fail me this is third report on haproxy hang on
> FreeBSD and all these reports are about FreeBSD-11.
>
> I wonder if any one experiences this issue with FreeBSD-10?
>
> I am running rather heavy loaded haproxy cluster on FreeBSD-10 (version
> 1.6.9 to be specific) and never experienced any hungs (knock the wood).
>
Dmitry Sivachenko
Re: HaProxy Hang
March 03, 2017 06:00PM
> On 03 Mar 2017, at 19:36, David King <[email protected]> wrote:
>
> Thanks for the response!
> Thats interesting, i don't suppose you have the details of the other issues?


First report is
https://www.mail-archive.com/[email protected]/msg25060.html
Second one
https://www.mail-archive.com/[email protected]/msg25067.html

(in the same thread)



>
> Thanks
> Dave
>
> On 3 March 2017 at 14:15, Dmitry Sivachenko <[email protected]> wrote:
>
> > On 03 Mar 2017, at 17:07, David King <[email protected]> wrote:
> >
> > Hi All
> >
> > Hoping someone will be able to help, we're running a bit of an interesting setup
> >
> > we have 3 HAProxy nodes running freebsd 11.0 , each host runs 4 jails, each running haproxy, but only one of the jails is under any real load
> >
>
>
> If my memory does not fail me this is third report on haproxy hang on FreeBSD and all these reports are about FreeBSD-11.
>
> I wonder if any one experiences this issue with FreeBSD-10?
>
> I am running rather heavy loaded haproxy cluster on FreeBSD-10 (version 1.6.9 to be specific) and never experienced any hungs (knock the wood).
>
>
Rainer Duffner
Re: HaProxy Hang
March 03, 2017 11:40PM
> Am 03.03.2017 um 15:07 schrieb David King <[email protected]>:
>
> Hi All
>
> Hoping someone will be able to help, we're running a bit of an interesting setup
>
> we have 3 HAProxy nodes running freebsd 11.0 , each host runs 4 jails, each running haproxy, but only one of the jails is under any real load
>
>


Do you use ZFS?


We have an internal software (some sort of monitoring agent) that also hangs in jails, from time to time.

The guy who wrote it found out it’s because of mmap (I don’t know the specifics).

The processes end up as unkillable in „D“ state and we need to reboot the hosts to fix it.

As the purpose of the hosts is not to run the agent, we usually let it hang and restart when it’s convenient.


The systems are FreeBSD 10.3, though (running nginx and varnish in different jails).
Willy Tarreau
Re: HaProxy Hang
March 06, 2017 07:40AM
On Fri, Mar 03, 2017 at 07:54:46PM +0300, Dmitry Sivachenko wrote:
>
> > On 03 Mar 2017, at 19:36, David King <[email protected]> wrote:
> >
> > Thanks for the response!
> > Thats interesting, i don't suppose you have the details of the other issues?
>
>
> First report is
> https://www.mail-archive.com/[email protected]/msg25060.html
> Second one
> https://www.mail-archive.com/[email protected]/msg25067.html

Thanks for the links Dmitry.

That's indeed really odd. If all hang at the same time, timing or uptime
looks like a good candidate. There's not much which is really specific
to FreeBSD in haproxy. However, the kqueue poller is only used there
(and on OpenBSD), and uses timing for the timeout. Thus it sounds likely
that there could be an issue there, either in haproxy or FreeBSD.

A hang every 2-3 months makes me think about the 49.7 days it takes for
a millisecond counter to wrap. These bugs are hard to troubleshoot. We
used to have such an issue a long time ago in linux 2.4 when the timer
was set to 100 Hz, it required 497 days to know whether the bug was
solved or not (obviously it now is).
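The wrap arithmetic is easy to verify: a 32-bit counter ticking at 1 kHz holds 2^32 ms, and at 100 Hz (the old Linux 2.4 case) the same counter lasts ten times longer.

```shell
# 2^32 ms on a 1 kHz tick: the 49.7-day wrap mentioned above.
awk 'BEGIN { printf "1 kHz wrap: %.1f days\n", 2^32 / 1000 / 86400 }'
# 2^32 ticks at 100 Hz: the 497-day Linux 2.4 case.
awk 'BEGIN { printf "100 Hz wrap: %.1f days\n", 2^32 / 100 / 86400 }'
```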

I've just compared ev_epoll.c and ev_kqueue.c in case I could spot
anything obvious, but from what I'm seeing they're pretty much similar,
so I don't see what could cause this bug there. And since it apparently
works fine on FreeBSD 10, at best one of our bugs could only be
triggering a system bug, if it exists.

David, if your workload permits it, you can disable kqueue and haproxy
will automatically fall back to poll. For this you can simply put
"nokqueue" in the global section. poll() doesn't scale as well as
kqueue(), it's cheaper on low connection counts but it will use more
CPU above ~1000 concurrent connections.
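For reference, the suggestion amounts to something like this in the config (the comment about other settings is a placeholder, not part of any real config):

```
global
    nokqueue    # disable the kqueue poller; haproxy falls back to poll()
    # ... rest of the existing global section (log, maxconn, ...) unchanged
```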

Regards,
Willy
Mark S
Re: HaProxy Hang
March 06, 2017 09:00PM
On Mon, 06 Mar 2017 01:35:19 -0500, Willy Tarreau <[email protected]> wrote:

> On Fri, Mar 03, 2017 at 07:54:46PM +0300, Dmitry Sivachenko wrote:
>>
>> > On 03 Mar 2017, at 19:36, David King <[email protected]>
>> wrote:
>> >
>> > Thanks for the response!
>> > Thats interesting, i don't suppose you have the details of the other
>> issues?
>>
>>
>> First report is
>> https://www.mail-archive.com/[email protected]/msg25060.html
>> Second one
>> https://www.mail-archive.com/[email protected]/msg25067.html
>
> Thanks for the links Dmitry.
>
> That's indeed really odd. If all hang at the same time, timing or uptime
> looks like a good candidate. There's not much which is really specific
> to FreeBSD in haproxy. However, the kqueue poller is only used there
> (and on OpenBSD), and uses timing for the timeout. Thus it sounds likely
> that there could be an issue there, either in haproxy or FreeBSD.
>
> A hang every 2-3 months makes me think about the 49.7 days it takes for
> a millisecond counter to wrap. These bugs are hard to troubleshoot. We
> used to have such an issue a long time ago in linux 2.4 when the timer
> was set to 100 Hz, it required 497 days to know whether the bug was
> solved or not (obviously it now is).
>
> I've just compared ev_epoll.c and ev_kqueue.c in case I could spot
> anything obvious but from what I'm seeing they're pretty much similar
> so I don't see what there could cause this bug. And since it apparently
> works fine on FreeBSD 10, at best one of our bugs could only trigger a
> system bug if it exists.
>
> David, if your workload permits it, you can disable kqueue and haproxy
> will automatically fall back to poll. For this you can simply put
> "nokqueue" in the global section. poll() doesn't scale as well as
> kqueue(), it's cheaper on low connection counts but it will use more
> CPU above ~1000 concurrent connections.
>
> Regards,
> Willy
>

Hi Willy,

As for the timing issue, I can add a few related data points to the
discussion. In short, system uptime does not seem to be a commonality in
my situation.

1) I had this issue affect 6 servers, spread across 5 data centers (only 2
servers are in the same facility.) All servers stopped processing
requests at roughly the same moment, certainly within the same minute.
All servers running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally
against OpenSSL-1.0.2k

2) System uptime was not at all similar across these servers, although
chances are most servers' HAProxy process start times would be similar.
The servers with the highest system uptime were at about 27 days at the
time of the incident, while the shortest were under a day or two.

3) HAProxy configurations are similar, but not exactly consistent between
servers - different IPs on the frontend, different ACLs and backends.

4) The only synchronized application common to all of these servers is
OpenNTPd.

5) I have since upgraded to HAProxy 1.7.3 with the same build process; the
full version output is below, and I will of course report any observed issues.

haproxy -vv
HA-Proxy version 1.7.3 2017/02/28
Copyright 2000-2017 Willy Tarreau <[email protected]>

Build options :
TARGET = freebsd
CPU = generic
CC = clang
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
OPTIONS = USE_OPENSSL=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built without compression support (neither USE_ZLIB nor USE_SLZ are set)
Compression algorithms supported : identity("identity")
Built with OpenSSL version : OpenSSL 1.0.2k 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2k 26 Jan 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY

Available polling systems :
kqueue : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
[SPOE] spoe
[TRACE] trace
[COMP] compression

Cheers,
-=Mark
Willy Tarreau
Re: HaProxy Hang
March 06, 2017 09:10PM
Hi Mark,

On Mon, Mar 06, 2017 at 02:49:28PM -0500, Mark S wrote:
> As for the timing issue, I can add to the discussion with a few related data
> points. In short, system uptime does not seem to be a commonality to my
> situation.

thanks!

> 1) I had this issue affect 6 servers, spread across 5 data centers (only 2
> servers are in the same facility.) All servers stopped processing requests
> at roughly the same moment, certainly within the same minute. All servers
> running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally against
> OpenSSL-1.0.2k

OK.

> 2) System uptime was not at all similar across these servers, although
> chances are most servers HAProxy process start time would be similar. The
> servers with the highest system uptime were at about 27 days at the time of
> the incident, while the shortest were under a day or two.

OK, so that means haproxy could have hung within a day or two; then your
case is much more common than one of the other reports. If your front LB
is fair between the 6 servers, that could be related to a total number of
requests or connections or something like this.

> 3) HAProxy configurations are similar, but not exactly consistent between
> servers - different IPs on the frontend, different ACLs and backends.

OK.

> 4) The only synchronized application common to all of these servers is
> OpenNTPd.

Is there any risk that ntpd causes time jumps into the future or the
past for whatever reason? Maybe there's something with kqueue and
time jumps in recent versions?

> 5) I have since upgraded to HAProxy-1.7.3, same build process: the full
> version output is below - and will of course report any observed issues.
>
> haproxy -vv
> HA-Proxy version 1.7.3 2017/02/28
(...)

Everything there looks pretty standard. If it dies again it could be good
to try with "nokqueue" in the global section (or start haproxy with -dk)
to disable kqueue and switch to poll. It will eat a bit more CPU, so don't
do this on all nodes at once.

I'm thinking about other things:
- if you're doing a lot of SSL we could imagine an issue with random
generation using /dev/random instead of /dev/urandom. I've met this
issue a long time ago on some apache servers where all the entropy
was progressively consumed until it was not possible anymore to get
a connection.

- it could be useful to run "netstat -an" on a dead node before killing
haproxy and archive this for later analysis. It may reveal that all
file descriptors were used by close_wait connections (indicating a
close bug in haproxy) or something like this. If instead you see a
lot of FIN_WAIT1 or FIN_WAIT2 it may indicate an issue with some
external firewall or pf blocking some final traffic and leading to
socket space exhaustion.

If you have the same issue that was reported with kevent() being called
in loops and returning an error, you may definitely see tons of close_wait
and it will indicate an issue with this poller, though I have no idea
which one, especially since it doesn't change often and *seems* to work
with previous versions.
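The "netstat before killing haproxy" advice is easy to automate; a sketch, with paths and the `ss` fallback as assumptions to adjust per host/jail:

```shell
#!/bin/sh
# Archive socket state from a hung node before restarting haproxy.
TS=$(date -u +%Y%m%dT%H%M%SZ)
OUT="${CAPTURE_DIR:-/var/tmp}/haproxy-hang-$TS"
mkdir -p "$OUT"
# Piles of CLOSE_WAIT would point at a close bug in haproxy; piles of
# FIN_WAIT_1/FIN_WAIT_2 at a firewall (pf) eating the final segments.
{ netstat -an 2>/dev/null || ss -tan 2>/dev/null; } > "$OUT/netstat.txt"
ps auxww > "$OUT/ps.txt"
echo "captured to $OUT"
```

The capture runs before any restart so the evidence survives for later analysis.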

Best regards,
Willy
Mark S
Re: HaProxy Hang
March 06, 2017 09:40PM
On Mon, 06 Mar 2017 15:02:43 -0500, Willy Tarreau <[email protected]> wrote:

> OK so that means that haproxy could have hung in a day or two, then your
> case is much more common than one of the other reports. If your fdront LB
> is fair between the 6 servers, that could be related to a total number of
> requests or connections or something like this.

Another relevant point is that these servers are tied together using
upstream, GeoIP-based DNS load balancing. So the request rate across
servers varies quite a bit depending on the location. This would make a
synchronized failure based on total requests less likely.

> I'm thinking about other things :
> - if you're doing a lot of SSL we could imagine an issue with random
> generation using /dev/random instead of /dev/urandom. I've met this
> issue a long time ago on some apache servers where all the entropy
> was progressively consumed until it was not possible anymore to get
> a connection.

I'll set up a script to capture the netstat and other info prior to
reloading, should this issue re-occur.

As for SSL, yes, we do a fair bit of SSL (about 30% of total request
count); HAProxy does the TLS termination and then hands off via TCP
proxy.

Best,
-=Mark S.
Jerry Scharf
Re: HaProxy Hang
March 06, 2017 09:50PM
Willy,

per your comment on /dev/random exhaustion: I think running haveged on
servers doing crypto work is, or should be, best practice.

jerry
On 3/6/17 12:02 PM, Willy Tarreau wrote:
> Hi Mark,
>
> On Mon, Mar 06, 2017 at 02:49:28PM -0500, Mark S wrote:
>> As for the timing issue, I can add to the discussion with a few related data
>> points. In short, system uptime does not seem to be a commonality to my
>> situation.
> thanks!
>
>> 1) I had this issue affect 6 servers, spread across 5 data centers (only 2
>> servers are in the same facility.) All servers stopped processing requests
>> at roughly the same moment, certainly within the same minute. All servers
>> running FreeBSD 11.0-RELEASE-p2 with HAProxy compiled locally against
>> OpenSSL-1.0.2k
> OK.
>
>> 2) System uptime was not at all similar across these servers, although
>> chances are most servers HAProxy process start time would be similar. The
>> servers with the highest system uptime were at about 27 days at the time of
>> the incident, while the shortest were under a day or two.
> OK so that means that haproxy could have hung in a day or two, then your
> case is much more common than one of the other reports. If your fdront LB
> is fair between the 6 servers, that could be related to a total number of
> requests or connections or something like this.
>
>> 3) HAProxy configurations are similar, but not exactly consistent between
>> servers - different IPs on the frontend, different ACLs and backends.
> OK.
>
>> 4) The only synchronized application common to all of these servers is
>> OpenNTPd.
> Is there any risk that the ntpd causes time jumps in the future or in
> the past for whatever reasons ? Maybe there's something with kqueue and
> time jumps in recent versions ?
>
>> 5) I have since upgraded to HAProxy-1.7.3, same build process: the full
>> version output is below - and will of course report any observed issues.
>>
>> haproxy -vv
>> HA-Proxy version 1.7.3 2017/02/28
> (...)
>
> Everything there looks pretty standard. If it dies again it could be good
> to try with "nokqueue" in the global section (or start haproxy with -dk)
> to disable kqueue and switch to poll. It will eat a bit more CPU, so don't
> do this on all nodes at once.
>
> I'm thinking about other things :
> - if you're doing a lot of SSL we could imagine an issue with random
> generation using /dev/random instead of /dev/urandom. I've met this
> issue a long time ago on some apache servers where all the entropy
> was progressively consumed until it was not possible anymore to get
> a connection.
>
> - it could be useful to run "netstat -an" on a dead node before killing
> haproxy and archive this for later analysis. It may reveal that all
> file descriptors were used by close_wait connections (indicating a
> close bug in haproxy) or something like this. If instead you see a
> lot of FIN_WAIT1 or FIN_WAIT2 it may indicate an issue with some
> external firewall or pf blocking some final traffic and leading to
> socket space exhaustion.
>
> If you have the same issue that was reported with kevent() being called
> in loops and returning an error, you may definitely see tons of close_wait
> and it will indicate an issue with this poller, though I have no idea
> which one, especially since it doesn't change often and *seems* to work
> with previous versions.
>
> Best regards,
> Willy
>

--
Soundhound Devops
"What could possibly go wrong?"
David King
Re: HaProxy Hang
March 13, 2017 12:40PM
Hi All

Apologies for the delay in response; I've been out of the country for the
last week.

Mark, my gut feeling is that it is network-related in some way, so I
thought we could compare the networking setup of our systems.

You mentioned you see the hang across geo locations, so I assume there
isn't layer 2 connectivity between all of the hosts? Is there any back-end
connectivity between the haproxy hosts?

Ours are all layer 2 but fairly complex. We have 6 connected NICs which
are bonded into 3 LACP groups. On top of the LACP we have a number of VLAN
interfaces; we also have a couple of normal IP aliases and a number of
CARP IPs on top of that.

One commonality is NTP, as they all sync from our own upstream NTP
services, but having looked through the logs, there isn't a recent NTP
update when the hang occurs and I can't see any time jump.

Other things which are set up on the hosts:
- local rsyslog which sends logs to a centralised host
- crons every minute for each jail (4 jails) to monitor the health of the
haproxy service
- crons every minute for each jail (4 jails) to gather stats from haproxy
using the haproxy stats frontend
- pf on the host
- Chef runs every 30 mins, and these times are splayed

Does anything match up in these which could cause these issues?
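The "no time jump in the logs" observation could be corroborated with a tiny watchdog that samples the wall clock; a sketch (interval and threshold are arbitrary, and a real deployment would loop forever and log via syslog rather than stopping after a few iterations):

```shell
#!/bin/sh
# Flag any interval where wall-clock time moves backwards or leaps forward.
prev=$(date +%s)
checked=0
for i in 1 2 3; do            # bounded here for demonstration
    sleep 1
    now=$(date +%s)
    delta=$((now - prev))
    if [ "$delta" -lt 0 ] || [ "$delta" -gt 5 ]; then
        echo "time jump of ${delta}s detected at $(date -u)"
    fi
    prev=$now
    checked=$((checked + 1))
done
echo "checked $checked intervals"
```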

Thanks

Dave



On 6 March 2017 at 20:28, Mark S <[email protected]> wrote:

> On Mon, 06 Mar 2017 15:02:43 -0500, Willy Tarreau <[email protected]> wrote:
>
> OK so that means that haproxy could have hung in a day or two, then your
>> case is much more common than one of the other reports. If your fdront LB
>> is fair between the 6 servers, that could be related to a total number of
>> requests or connections or something like this.
>>
>
> Another relevant point is that these servers are tied together using
> upstream, GeoIP-based DNS load balancing. So the request rate across
> servers varies quite a bit depending on the location. This would make a
> synchronized failure based on total requests less likely.
>
> I'm thinking about other things :
>> - if you're doing a lot of SSL we could imagine an issue with random
>> generation using /dev/random instead of /dev/urandom. I've met this
>> issue a long time ago on some apache servers where all the entropy
>> was progressively consumed until it was not possible anymore to get
>> a connection.
>>
>
> I'll set up a script to capture the netstat and other info prior to
> reloading should this issue re-occur.
>
> As for SSL, yes, we do a fair bit of SSL ( about 30% of total request
> count ) and HAProxy does the TLS termination and then hands off via TCP
> proxy.
>
> Best,
> -=Mark S.
>
Dave Cottlehuber
Re: HaProxy Hang
April 03, 2017 06:50PM
On Mon, 13 Mar 2017, at 13:31, David King wrote:
> Hi All
>
> Apologies for the delay in response, i've been out of the country for the
> last week
>
> Mark, my gut feeling is that is network related in someway, so thought we
> could compare the networking setup of our systems
>
> You mentioned you see the hang across geo locations, so i assume there
> isn't layer 2 connectivity between all of the hosts? is there any back
> end
> connectivity between the haproxy hosts?

Following up on this, some interesting points but nothing useful.

- Mark & I see the hang at almost exactly the same time on the same day:
2017-02-27T14:36Z give or take a minute either way

- I see the hang in 3 different regions using 2 different hosting
providers on both clustered and non-clustered services, but all on
FreeBSD 11.0R amd64. There is some dependency between these systems but
nothing unusual (logging backends, reverse proxied services etc).

- our servers don't have a specific workload that would allow them all
to run out of some internal resource at the same time, as their reboot
and patch cycles are reasonably different - typically a few days elapse
between first patches and last reboots unless it's deemed high risk

- our networking setup is not complex but typical FreeBSD:
- LACP bonded Gbit igb(4) NICs
- CARP failover for both ipv4 & ipv6 addresses
- either direct to haproxy for http & TLS traffic, or via spiped to
decrypt intra-server traffic
- haproxy directs traffic into jailed services
- our overall load and throughput is low but consistent
- pf firewall
- rsyslog for logging, along with riemann and graphite for metrics
- all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
- haproxy 1.6.10 + libressl at the time

As I'm not one for conspiracy theories or weird coincidences, somebody
port scanning the internet with an Unexpectedly Evil Packet Combo seems
the most plausible explanation. I cannot find an alternative that would
fit the scenario of 3 different organisations with geographically
distributed equipment and unconnected services reporting an unusual
interruption on the same day and almost the same time.

Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and latest
libressl and seen no recurrence, just like the last 8+ months or so
since first deploying haproxy on FreeBSD instead of debian & nginx.

If the issue recurs I plan to run a small cyclic traffic capture with
tcpdump and wait for a repeat; see
https://superuser.com/questions/286062/practical-tcpdump-examples
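The cyclic capture idea maps onto tcpdump's -G (rotate the file every N seconds) and -W (cap the number of files kept) flags; a sketch, with interface, snap length, filter, and sizes all assumptions, shown in dry-run form since a real capture needs root:

```shell
#!/bin/sh
# Rolling 24h window of frontend traffic: 24 hourly files, reused as a ring.
IFACE="${IFACE:-igb0}"
set -- tcpdump -i "$IFACE" -n -s 128 \
    -G 3600 -W 24 -w /var/tmp/ring-%Y%m%d%H%M.pcap \
    port 80 or port 443
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"   # default to printing; set DRY_RUN=0 to capture
else
    exec "$@"
fi
```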

Let me know if I can help or clarify further.

A+
Dave
David King
Re: HaProxy Hang
April 05, 2017 12:40AM
Hi Dave

Thanks for the info. Interestingly, we had the crash at exactly the same
time, so we are 3 for 3 on that.

The setups sound very similar, but given we all saw the issue at the same
time, it really points to something more global.

We are using NTP from our firewalls, which in turn get it from our ISP, so
I doubt that is the cause; it could be external port scanning, as you
suggest, or maybe a leap second of some sort?

Willy, any thoughts on the time coincidence?

Thanks

Dave





On 3 April 2017 at 17:45, Dave Cottlehuber <[email protected]> wrote:

> On Mon, 13 Mar 2017, at 13:31, David King wrote:
> > Hi All
> >
> > Apologies for the delay in response, i've been out of the country for the
> > last week
> >
> > Mark, my gut feeling is that is network related in someway, so thought we
> > could compare the networking setup of our systems
> >
> > You mentioned you see the hang across geo locations, so i assume there
> > isn't layer 2 connectivity between all of the hosts? is there any back
> > end
> > connectivity between the haproxy hosts?
>
> Following up on this, some interesting points but nothing useful.
>
> - Mark & I see the hang at almost exactly the same time on the same day:
> 2017-02-27T14:36Z give or take a minute either way
>
> - I see the hang in 3 different regions using 2 different hosting
> providers on both clustered and non-clustered services, but all on
> FreeBSD 11.0R amd64. There is some dependency between these systems but
> nothing unusual (logging backends, reverse proxied services etc).
>
> - our servers don't have a specific workload that would allow them all
> to run out of some internal resource at the same time, as their reboot
> and patch cycles are reasonably different - typically a few days elapse
> between first patches and last reboots unless its deemed high risk
>
> - our networking setup is not complex but typical FreeBSD:
> - LACP bonded Gbit igb(4) NICs
> - CARP failover for both ipv4 & ipv6 addresses
> - either direct to haproxy for http & TLS traffic, or via spiped to
> decrypt intra-server traffic
> - haproxy directs traffic into jailed services
> - our overall load and throughput is low but consistent
> - pf firewall
> - rsyslog for logging, along with riemann and graphite for metrics
> - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
> - haproxy 1.6.10 + libressl at the time
>
> As I'm not one for conspiracy theories or weird coincidences, somebody
> port scanning the internet with an Unexpectedly Evil Packet Combo seems
> the most plausible explanation. I cannot find an alternative that would
> fit the scenario of 3 different organisations with geographically
> distributed equipment and unconnected services reporting an unusual
> interruption on the same day and almost the same time.
>
> Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and latest
> libressl and seen no recurrence, just like the last 8+ months or so
> since first deploying haproxy on FreeBSD instead of debian & nginx.
>
> If the issue recurs I plan to run a small cyclic traffic capture with
> tcpdump and wait for a re-repeat, see
> https://superuser.com/questions/286062/practical-tcpdump-examples
>
> Let me know if I can help or clarify further.
>
> A+
> Dave
>
Lukas Tribus
Re: HaProxy Hang
April 05, 2017 01:40AM
Hello,


Am 05.04.2017 um 00:27 schrieb David King:
> Hi Dave
>
> Thanks for the info, So interestingly we had the crash at exactly the
> same time, so we are 3 for 3 on that
>
> The setups sounds very similar, but given we all saw issue at the same
> time, it really points to something more global.
>
> We are using NTP from our firewalls, which in turn get it from our
> ISP, so i doubt that is the cause, so it could be external port
> scanning which is the cause as you suggest. or maybe a leap second of
> some sort?
>
> Willy any thoughts on the time co-incidence?

Can we be absolutely positive that those hangs are not directly or
indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5,
for example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD
11.0-p8"?
There may be multiple and different symptoms of those bugs, so even if
the descriptions in those threads don't match your case 100%, it may
still be caused by the same underlying bug.

A confirmation that those hangs are still happening in v1.7.5 would be
crucial.

The time coincidence is intriguing, but I would not spend too much time
on that. Collecting actual traces (like strace or its FreeBSD
equivalent) and capture dumps is more likely to achieve progress, imo.


Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish
y'all a good night,
lukas
Dave Cottlehuber
Re: HaProxy Hang
April 05, 2017 02:10AM
On Wed, 5 Apr 2017, at 01:34, Lukas Tribus wrote:
> Hello,
>
>
> Am 05.04.2017 um 00:27 schrieb David King:
> > Hi Dave
> >
> > Thanks for the info, So interestingly we had the crash at exactly the
> > same time, so we are 3 for 3 on that
> >
> > The setups sounds very similar, but given we all saw issue at the same
> > time, it really points to something more global.
> >
> > We are using NTP from our firewalls, which in turn get it from our
> > ISP, so i doubt that is the cause, so it could be external port
> > scanning which is the cause as you suggest. or maybe a leap second of
> > some sort?
> >
> > Willy any thoughts on the time co-incidence?
>
> Can we be absolutely positive that those hangs are not directly or
> indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5,
> for example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD
> 11.0-p8"?
>
> There maybe multiple and different symptoms of those bugs, so even if
> the descriptions in those threads don't match your case 100%, it may
> still caused by the same underlying bug.

I'll update from 1.7.3 to 1.7.5 with those goodies tomorrow and see how
that goes.

A+
Dave
Willy Tarreau
Re: HaProxy Hang
April 05, 2017 08:10AM
Hi all,

On Wed, Apr 05, 2017 at 01:34:20AM +0200, Lukas Tribus wrote:
> Can we be absolutely positive that those hangs are not directly or
> indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5, for
> example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD 11.0-p8"?

I don't believe in this at all unfortunately. The issues that were faced
on FreeBSD in earlier versions were related to connect() occasionally
succeeding synchronously and haproxy did not handle this case cleanly
(it initially used to poll then validate the connect() a second time,
and fixing this broke the rest).

> There may be multiple different symptoms of those bugs, so even if the
> descriptions in those threads don't match your case 100%, it may still
> be caused by the same underlying bug.
>
> A confirmation that those hangs are still happening in v1.7.5 would be
> crucial.

I'm pretty sure they will still happen.

> The time co-incidence is intriguing, but I would not spend too much time
> with that. Collecting actual traces (like strace or its freebsd equivalent)
> and capture dumps is more likely to achieve progress, imo.

In fact I do think there's an operating system issue here (and those who
know me also know that I'm not one who tries to hide haproxy bugs). What
I suspect is that there's a problem when time wraps. A 1 kHz scheduler
wraps every 49.7 days. With clocks synchronized over NTP, all of them
wrap exactly at the same time. If the issue is there, it may happen
again on Tue Apr 18, 9:38 (13 days from now).

It could have been haproxy's time wrapping and causing the issue, so I
modified it to add an offset and make the time wrap 5s after startup,
and couldn't trigger the problem on a FreeBSD system, even after
multiple attempts. And the time of first crash reported above doesn't
match any wrapping pattern (0x58b43950). Also, reporters indicated
that the issue appeared after migrating to FreeBSD 11 and no such
issue was ever reported on earlier versions.

Also Dave reported this, which is totally abnormal:

kqueue(0,0,0....) = 22 (EINVAL)

and the fact that the system panicked, which cannot be an haproxy issue.

Another point: Dave reported a loss of network connectivity at the
same moment when it last happened. Dave, could this be related to
other nodes running FreeBSD as well and rebooting, or any
such thing?

I think that at this point we should discuss with some FreeBSD
maintainers and see what can be done to track this problem down, even
if it means adding some debugging code in the kqueue loop to help
troubleshoot this, or using it differently if we're doing something
wrong.

Given that Mark indicated that reloading the process fixed the problem
(except he had to manually kill the previous one), one possible workaround
might be to detect the EINVAL, and try to reinitialize kqueue or switch
to poll() if this happens (and emit loud warnings in the logs).

> Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish y'all
> a good night,

There's still a faint possibility of a widespread attack but while I
can easily imagine some such devices sending a "packet of death"
exploiting a bug in an OS, I don't believe it would make kqueue()
return EINVAL in haproxy.

Cheers,
Willy
David King
Re: HaProxy Hang
April 05, 2017 11:20AM
I'm going to stick with version 1.7.2 till then, so we should have a
comparison.

If we think we may have a hang at Tue Apr 18, 9:38, is there any specific
logging we should set up on a server at that time? Is it worth setting at
least one server to have nokqueue set at that time?

Thanks

David

On 5 April 2017 at 07:00, Willy Tarreau <[email protected]> wrote:

> [full quote of Willy's message above trimmed]
Willy Tarreau
Re: HaProxy Hang
April 05, 2017 11:30AM
On Wed, Apr 05, 2017 at 10:10:49AM +0100, David King wrote:
> I'm going to keep with version 1.7.2 till then, so we should have a
> comparison

OK as you like :-)

> If we think we may have a hang at Tue Apr 18, 9:38, is there any specific
> logging we should set up on a server at that time?

Maybe detailed truss output if it happens, to get all arguments and a few
things like this. Unfortunately for now I don't see an easy way to reset
the kqueue fd and reinitialize all events from scratch (though it's possible,
just requires quite some code and will come with some bugs).

> is it worth setting at
> least one server to have nokqueue set at that time?

Well, possibly. If you have multiple servers and all of them die at the same
time, that could avoid a complete outage. And maybe nothing will happen; it
was a pure guess from me, but given that these more or less match issues
we've had a long time ago with looping timers, I would not be surprised if
it happens this way.
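For reference, disabling the kqueue poller is a one-line change in the global section (the "nokqueue" keyword, documented in the configuration manual); haproxy then falls back to the next available poller such as poll():

```
global
    nokqueue
```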

Willy
Mark S
Re: HaProxy Hang
April 06, 2017 04:00PM
On Mon, 03 Apr 2017 12:45:57 -0400, Dave Cottlehuber <[email protected]>
wrote:

> On Mon, 13 Mar 2017, at 13:31, David King wrote:
>> Hi All
>>
>> Apologies for the delay in response, I've been out of the country for the
>> last week.
>>
>> Mark, my gut feeling is that it's network related in some way, so I thought
>> we could compare the networking setup of our systems.
>>
>> You mentioned you see the hang across geo locations, so I assume there
>> isn't layer 2 connectivity between all of the hosts? Is there any back-end
>> connectivity between the haproxy hosts?
>
> Following up on this, some interesting points but nothing useful.
>
> - Mark & I see the hang at almost exactly the same time on the same day:
> 2017-02-27T14:36Z give or take a minute either way
>
> - I see the hang in 3 different regions using 2 different hosting
> providers on both clustered and non-clustered services, but all on
> FreeBSD 11.0R amd64. There is some dependency between these systems but
> nothing unusual (logging backends, reverse proxied services etc).
>
> - our servers don't have a specific workload that would allow them all
> to run out of some internal resource at the same time, as their reboot
> and patch cycles are reasonably different - typically a few days elapse
> between first patches and last reboots unless it's deemed high risk
>
> - our networking setup is not complex but typical FreeBSD:
> - LACP bonded Gbit igb(4) NICs
> - CARP failover for both ipv4 & ipv6 addresses
> - either direct to haproxy for http & TLS traffic, or via spiped to
> decrypt intra-server traffic
> - haproxy directs traffic into jailed services
> - our overall load and throughput is low but consistent
> - pf firewall
> - rsyslog for logging, along with riemann and graphite for metrics
> - all our db traffic (couchdb, kyoto tycoon) and rabbitmq go via haproxy
> - haproxy 1.6.10 + libressl at the time
>
> As I'm not one for conspiracy theories or weird coincidences, somebody
> port scanning the internet with an Unexpectedly Evil Packet Combo seems
> the most plausible explanation. I cannot find an alternative that would
> fit the scenario of 3 different organisations with geographically
> distributed equipment and unconnected services reporting an unusual
> interruption on the same day and almost the same time.
>
> Since then, I've moved to FreeBSD 11.0p8, haproxy 1.7.3 and latest
> libressl and seen no recurrence, just like the last 8+ months or so
> since first deploying haproxy on FreeBSD instead of debian & nginx.
>
> If the issue recurs I plan to run a small cyclic traffic capture with
> tcpdump and wait for a repeat, see
> https://superuser.com/questions/286062/practical-tcpdump-examples
>
> Let me know if I can help or clarify further.
>
> A+
> Dave

Hi Dave,

Thanks for keeping this thread going. As for the initial report with all
servers hanging, I too run NTP (actually OpenNTPd), and these only speak
to in-house stratum-2 servers.

As a follow-up to my initial report, I upgraded to 1.7.3 shortly
thereafter.

I've had one recurrence of this "hang", but this time it did not affect
all of my servers; instead, it affected only 2 (the busier ones). If the
theory about some timing event (leap second, counter wrapping, etc.) is
correct, perhaps it only affects processes actually accepting or handling
a connection in a particular state at the time.

I have not yet upgraded beyond 1.7.3.

Best,
-=Mark
David King
Re: HaProxy Hang
April 18, 2017 11:40AM
Hi All

Just to confirm Willy's theory: we had the hang at exactly the time
specified this morning.

Sadly, due to a bank holiday yesterday in the UK, we didn't set up the truss
and monitoring before the hang occurred.

Was the hang seen by everyone?

Thanks

Dave

On 6 April 2017 at 14:56, Mark S <[email protected]> wrote:

> [full quote of Mark's message above trimmed]
Willy Tarreau
Re: HaProxy Hang
April 18, 2017 11:50AM
Hi David,

On Tue, Apr 18, 2017 at 10:33:40AM +0100, David King wrote:
> Hi All
>
> Just like to confirm Willy's theory, we had the hang at exactly the time
> specified this morning.

I could recycle myself in a new church of which I would be the prophet...
well maybe it already exists, we have thousands of adepts after all :-)

More seriously, I think it will be useful to report a bug to the FreeBSD
project. There are quite a number of elements, possibly nothing that makes
it obvious where the problem could be, but a number of hypotheses can
be ruled out already, I think. It's possible that some FreeBSD devs will
ask us to monitor a few things, capture some syscall returns, or try
some workarounds, and this might require some development. So in short, the
earlier the better if we want to be ready for the next occurrence.

Cheers,
Willy
David King
Re: HaProxy Hang
June 07, 2017 10:50AM
Just to close the loop on this: last night was the time at which we were
expecting the next hang. None of the servers where we updated haproxy to
the patched version hung. The test servers running the
older version hung as expected.

Thanks so much to everyone who fixed the issue!

On 18 April 2017 at 10:45, Willy Tarreau <[email protected]> wrote:

> [full quote of Willy's message above trimmed]
Willy Tarreau
Re: HaProxy Hang
June 07, 2017 11:30AM
Hi David,

On Wed, Jun 07, 2017 at 09:42:58AM +0100, David King wrote:
> Just to close the loop on this, last night was the time at which we were
> expecting the next hang. All of the servers we updated haproxy to the
> patched versions did not hang. The test servers which were running the
> older version hung as expected
>
> Thanks so much to everyone who fixed the issue!

Feedback much appreciated, thank you! We need to issue 1.7.6 soon with
this fix, but other troubling issues under investigation have delayed
it a bit.

Cheers,
Willy
Dave Cottlehuber
Re: HaProxy Hang
June 07, 2017 01:30PM
On Wed, 7 Jun 2017, at 10:42, David King wrote:
> Just to close the loop on this, last night was the time at which we were
> expecting the next hang. All of the servers we updated haproxy to the
> patched versions did not hang. The test servers which were running the
> older version hung as expected
>
> Thanks so much to everyone who fixed the issue!

Same here, although as we patched everything we had no issues at all :D
Thanks a lot!

A+
Dave