Welcome! Log In Create A New Profile

Advanced

[BUG] 100% cpu on each threads

Posted by Emmanuel Hocdet 
Emmanuel Hocdet
[BUG] 100% cpu on each threads
January 12, 2018 11:20AM
Hi,

with 1.8.3 + threads (with mworker)
I notice a 100% cpu per thread ( epool_wait + gettimeofday in loop)
Syndrome appears regularly on start/reload.

My configuration include one bind line with ssl on tcp mode.

It's a know issue?

++
Manu
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 12, 2018 11:50AM
Hi Manu,

On Fri, Jan 12, 2018 at 11:14:57AM +0100, Emmanuel Hocdet wrote:
>
> Hi,
>
> with 1.8.3 + threads (with mworker)
> I notice a 100% cpu per thread ( epool_wait + gettimeofday in loop)
> Syndrome appears regularly on start/reload.

We got a similar report yesterday affecting 1.5 to 1.8 caused by
a client aborting during a server redispatch. I don't know if it
could be related at all to what you're seeing, but would you care
to try the latest 1.8 git to see if it fixes it, since it contains
the fix ?

> My configuration include one bind line with ssl on tcp mode.

There was TCP mode there as well :-/

Willy
Emmanuel Hocdet
Re: [BUG] 100% cpu on each threads
January 12, 2018 12:10PM
Hi Willy

> Le 12 janv. 2018 à 11:38, Willy Tarreau <[email protected]> a écrit :
>
> Hi Manu,
>
> On Fri, Jan 12, 2018 at 11:14:57AM +0100, Emmanuel Hocdet wrote:
>>
>> Hi,
>>
>> with 1.8.3 + threads (with mworker)
>> I notice a 100% cpu per thread ( epool_wait + gettimeofday in loop)
>> Syndrome appears regularly on start/reload.
>
> We got a similar report yesterday affecting 1.5 to 1.8 caused by
> a client aborting during a server redispatch. I don't know if it
> could be related at all to what you're seeing, but would you care
> to try the latest 1.8 git to see if it fixes it, since it contains
> the fix ?
>

same syndrome with latest 1.8.

When syndrome appear, i see such line on syslog:
(for one or all servers)

Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file descriptor", check duration: 2018ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

Manu
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 12, 2018 01:10PM
On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
> When syndrome appear, i see such line on syslog:
> (for one or all servers)
>
> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file
> descriptor", check duration: 2018ms. 0 active and 1 backup servers left.
> Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

Wow, that's scary! This means we have a problem with server-side connections
and I really have no idea what it's about at the moment :-(

Willy
Aleksandar Lazic
Re[2]: [BUG] 100% cpu on each threads
January 12, 2018 03:30PM
------ Originalnachricht ------
Von: "Willy Tarreau" <[email protected]>
An: "Emmanuel Hocdet" <[email protected]>
Cc: "haproxy" <[email protected]>
Gesendet: 12.01.2018 13:04:02
Betreff: Re: [BUG] 100% cpu on each threads

>On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
>>When syndrome appear, i see such line on syslog:
>>(for one or all servers)
>>
>>Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad
>>file
>>descriptor", check duration: 2018ms. 0 active and 1 backup servers
>>left.
>>Running on backup. 0 sessions active, 0 requeued, 0 remaining in
>>queue.
>
>Wow, that's scary! This means we have a problem with server-side
>connections
>and I really have no idea what it's about at the moment :-(
@Emmanuel: Wild guess, is this a meltdown/spectre patched server and
since the patch you have seen this errors?

>Willy
>
Aleks
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 12, 2018 03:30PM
On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
> When syndrome appear, i see such line on syslog:
> (for one or all servers)
>
> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file descriptor", check duration: 2018ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

So I tried a bit but found no way to reproduce this. I'll need some
more info like the type of health-checks, probably the "server" line
settings, stuff like this. Does it appear quickly or does it take a
long time ? Also, does it recover from this on subsequent checks or
does it stay stuck in this state ?

Willy
Emmanuel Hocdet
Re: [BUG] 100% cpu on each threads
January 12, 2018 04:00PM
> Le 12 janv. 2018 à 15:24, Willy Tarreau <[email protected]> a écrit :
>
> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
>> When syndrome appear, i see such line on syslog:
>> (for one or all servers)
>>
>> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file descriptor", check duration: 2018ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
>

or new one:
Jan 12 13:25:13 webacc1 haproxy_ssl[31002]: Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "General socket error (Bad file descriptor)", check duration: 0ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.


> So I tried a bit but found no way to reproduce this. I'll need some
> more info like the type of health-checks, probably the "server" line
> settings, stuff like this. Does it appear quickly or does it take a
> long time ? Also, does it recover from this on subsequent checks or
> does it stay stuck in this state ?

yep, conf include.
issue no seen without check (but without traffic)

Manu

global
user haproxy
group haproxy
daemon

# for master-worker (-W)
stats socket /var/run/haproxy_ssl.sock expose-fd listeners
nbthread 8

log /dev/log daemon warning
log /dev/log local0

tune.ssl.cachesize 200000
tune.ssl.lifetime 5m

ssl-default-bind-options no-sslv3
ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES128-SHA:ECDHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA:AES256-SHA


defaults
log global
log-tag "haproxy_ssl"
option dontlognull
maxconn 40000
timeout connect 500ms
source 0.0.0.0

timeout client 207s
retries 3
timeout server 207s

listen tls

mode tcp
bind 127.0.0.1:463,X.Y.Z.B:463 accept-proxy ssl tls-ticket-keys /var/lib/haproxy/ssl/tls_keys.cfg strict-sni crt-list /var/lib/haproxy/ssl/crtlist.cfg

log-format 'resumed:%[ssl_fc_is_resumed] cipher:%sslc tlsv:%sslv'

balance roundrobin
option allbackups
fullconn 30000

server L7_1 127.0.0.1:483 check send-proxy

server L7_2 X.Y.Z.C:483 check send-proxy backup
Cyril Bonté
Re: [BUG] 100% cpu on each threads
January 12, 2018 04:00PM
Hi all,

----- Mail original -----
> De: "Willy Tarreau" <[email protected]>
> À: "Emmanuel Hocdet" <[email protected]>
> Cc: "haproxy" <[email protected]>
> Envoyé: Vendredi 12 Janvier 2018 15:24:54
> Objet: Re: [BUG] 100% cpu on each threads
>
> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
> > When syndrome appear, i see such line on syslog:
> > (for one or all servers)
> >
> > Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info:
> > "Bad file descriptor", check duration: 2018ms. 0 active and 1
> > backup servers left. Running on backup. 0 sessions active, 0
> > requeued, 0 remaining in queue.
>
> So I tried a bit but found no way to reproduce this. I'll need some
> more info like the type of health-checks, probably the "server" line
> settings, stuff like this. Does it appear quickly or does it take a
> long time ? Also, does it recover from this on subsequent checks or
> does it stay stuck in this state ?

Im' not sure you saw Samuel Reed's mail.
He reported a similar issue some hours ago (High load average under 1.8 with multiple draining processes). It would be interesting to find a common configuration to reproduce the issue, so I add him to the thread.

Cyril
Emmanuel Hocdet
Re: [BUG] 100% cpu on each threads
January 12, 2018 04:00PM
> Le 12 janv. 2018 à 15:23, Aleksandar Lazic <[email protected]> a écrit :
>
>
> ------ Originalnachricht ------
> Von: "Willy Tarreau" <[email protected]>
> An: "Emmanuel Hocdet" <[email protected]>
> Cc: "haproxy" <[email protected]>
> Gesendet: 12.01.2018 13:04:02
> Betreff: Re: [BUG] 100% cpu on each threads
>
>> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
>>> When syndrome appear, i see such line on syslog:
>>> (for one or all servers)
>>>
>>> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file
>>> descriptor", check duration: 2018ms. 0 active and 1 backup servers left.
>>> Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
>>
>> Wow, that's scary! This means we have a problem with server-side connections
>> and I really have no idea what it's about at the moment :-(
> @Emmanuel: Wild guess, is this a meltdown/spectre patched server and since the patch you have seen this errors?
>

no,
nothing has changed on serveurs (and are with different linux kernel), only haproxy versions from 1.8-dev … and see
the issue with 1.8.3 (and i don’t know if issue is on 1.8.2)
Sorry, only registered users may post in this forum.

Click here to login