[BUG] 100% cpu on each threads

Posted by Emmanuel Hocdet 
Emmanuel Hocdet
[BUG] 100% cpu on each threads
January 12, 2018 11:20AM
Hi,

with 1.8.3 + threads (with mworker)
I notice 100% CPU per thread (epoll_wait + gettimeofday in a loop).
The symptom appears regularly on start/reload.
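For reference, the spin is easy to see from outside with e.g. (pid is a placeholder):

top -H -p <worker pid>       # every thread pinned at 100%
strace -f -p <worker pid>    # shows the epoll_wait()/gettimeofday() loop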

My configuration includes one bind line with ssl in tcp mode.

Is it a known issue?

++
Manu
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 12, 2018 11:50AM
Hi Manu,

On Fri, Jan 12, 2018 at 11:14:57AM +0100, Emmanuel Hocdet wrote:
>
> Hi,
>
> with 1.8.3 + threads (with mworker)
> I notice a 100% cpu per thread ( epool_wait + gettimeofday in loop)
> Syndrome appears regularly on start/reload.

We got a similar report yesterday affecting 1.5 to 1.8 caused by
a client aborting during a server redispatch. I don't know if it
could be related at all to what you're seeing, but would you care
to try the latest 1.8 git to see if it fixes it, since it contains
the fix ?
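If it helps, building the current 1.8 tree looks roughly like this (only a sketch; adjust the make target and options to match your current build):

git clone http://git.haproxy.org/git/haproxy-1.8.git
cd haproxy-1.8
make TARGET=linux2628 USE_OPENSSL=1 USE_THREAD=1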

> My configuration include one bind line with ssl on tcp mode.

There was TCP mode there as well :-/

Willy
Emmanuel Hocdet
Re: [BUG] 100% cpu on each threads
January 12, 2018 12:10PM
Hi Willy

> On 12 Jan 2018, at 11:38, Willy Tarreau <[email protected]> wrote:
>
> Hi Manu,
>
> On Fri, Jan 12, 2018 at 11:14:57AM +0100, Emmanuel Hocdet wrote:
>>
>> Hi,
>>
>> with 1.8.3 + threads (with mworker)
>> I notice a 100% cpu per thread ( epool_wait + gettimeofday in loop)
>> Syndrome appears regularly on start/reload.
>
> We got a similar report yesterday affecting 1.5 to 1.8 caused by
> a client aborting during a server redispatch. I don't know if it
> could be related at all to what you're seeing, but would you care
> to try the latest 1.8 git to see if it fixes it, since it contains
> the fix ?
>

Same symptom with the latest 1.8.

When the symptom appears, I see a line like this in syslog
(for one or all servers):

Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file descriptor", check duration: 2018ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

Manu
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 12, 2018 01:10PM
On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
> When syndrome appear, i see such line on syslog:
> (for one or all servers)
>
> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file
> descriptor", check duration: 2018ms. 0 active and 1 backup servers left.
> Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

Wow, that's scary! This means we have a problem with server-side connections
and I really have no idea what it's about at the moment :-(

Willy
Aleksandar Lazic
Re[2]: [BUG] 100% cpu on each threads
January 12, 2018 03:30PM
------ Original Message ------
From: "Willy Tarreau" <[email protected]>
To: "Emmanuel Hocdet" <[email protected]>
Cc: "haproxy" <[email protected]>
Sent: 12.01.2018 13:04:02
Subject: Re: [BUG] 100% cpu on each threads

>On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
>>When syndrome appear, i see such line on syslog:
>>(for one or all servers)
>>
>>Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad
>>file
>>descriptor", check duration: 2018ms. 0 active and 1 backup servers
>>left.
>>Running on backup. 0 sessions active, 0 requeued, 0 remaining in
>>queue.
>
>Wow, that's scary! This means we have a problem with server-side
>connections
>and I really have no idea what it's about at the moment :-(
@Emmanuel: Wild guess: is this a meltdown/spectre-patched server, and have
you seen these errors since the patch?

>Willy
>
Aleks
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 12, 2018 03:30PM
On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
> When syndrome appear, i see such line on syslog:
> (for one or all servers)
>
> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file descriptor", check duration: 2018ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

So I tried a bit but found no way to reproduce this. I'll need some
more info like the type of health-checks, probably the "server" line
settings, stuff like this. Does it appear quickly or does it take a
long time ? Also, does it recover from this on subsequent checks or
does it stay stuck in this state ?

Willy
Emmanuel Hocdet
Re: [BUG] 100% cpu on each threads
January 12, 2018 04:00PM
> On 12 Jan 2018, at 15:24, Willy Tarreau <[email protected]> wrote:
>
> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
>> When syndrome appear, i see such line on syslog:
>> (for one or all servers)
>>
>> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file descriptor", check duration: 2018ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
>

or new one:
Jan 12 13:25:13 webacc1 haproxy_ssl[31002]: Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "General socket error (Bad file descriptor)", check duration: 0ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.


> So I tried a bit but found no way to reproduce this. I'll need some
> more info like the type of health-checks, probably the "server" line
> settings, stuff like this. Does it appear quickly or does it take a
> long time ? Also, does it recover from this on subsequent checks or
> does it stay stuck in this state ?

Yep, conf included below.
The issue is not seen without checks (but that was also without traffic).

Manu

global
user haproxy
group haproxy
daemon

# for master-worker (-W)
stats socket /var/run/haproxy_ssl.sock expose-fd listeners
nbthread 8

log /dev/log daemon warning
log /dev/log local0

tune.ssl.cachesize 200000
tune.ssl.lifetime 5m

ssl-default-bind-options no-sslv3
ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES128-SHA:ECDHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA:AES256-SHA


defaults
log global
log-tag "haproxy_ssl"
option dontlognull
maxconn 40000
timeout connect 500ms
source 0.0.0.0

timeout client 207s
retries 3
timeout server 207s

listen tls

mode tcp
bind 127.0.0.1:463,X.Y.Z.B:463 accept-proxy ssl tls-ticket-keys /var/lib/haproxy/ssl/tls_keys.cfg strict-sni crt-list /var/lib/haproxy/ssl/crtlist.cfg

log-format 'resumed:%[ssl_fc_is_resumed] cipher:%sslc tlsv:%sslv'

balance roundrobin
option allbackups
fullconn 30000

server L7_1 127.0.0.1:483 check send-proxy

server L7_2 X.Y.Z.C:483 check send-proxy backup
Cyril Bonté
Re: [BUG] 100% cpu on each threads
January 12, 2018 04:00PM
Hi all,

----- Original Message -----
> From: "Willy Tarreau" <[email protected]>
> To: "Emmanuel Hocdet" <[email protected]>
> Cc: "haproxy" <[email protected]>
> Sent: Friday, 12 January 2018 15:24:54
> Subject: Re: [BUG] 100% cpu on each threads
>
> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
> > When syndrome appear, i see such line on syslog:
> > (for one or all servers)
> >
> > Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info:
> > "Bad file descriptor", check duration: 2018ms. 0 active and 1
> > backup servers left. Running on backup. 0 sessions active, 0
> > requeued, 0 remaining in queue.
>
> So I tried a bit but found no way to reproduce this. I'll need some
> more info like the type of health-checks, probably the "server" line
> settings, stuff like this. Does it appear quickly or does it take a
> long time ? Also, does it recover from this on subsequent checks or
> does it stay stuck in this state ?

I'm not sure you saw Samuel Reed's mail.
He reported a similar issue a few hours ago ("High load average under 1.8 with multiple draining processes"). It would be interesting to find a common configuration to reproduce the issue, so I've added him to the thread.

Cyril
Emmanuel Hocdet
Re: [BUG] 100% cpu on each threads
January 12, 2018 04:00PM
> On 12 Jan 2018, at 15:23, Aleksandar Lazic <[email protected]> wrote:
>
>
> ------ Original Message ------
> From: "Willy Tarreau" <[email protected]>
> To: "Emmanuel Hocdet" <[email protected]>
> Cc: "haproxy" <[email protected]>
> Sent: 12.01.2018 13:04:02
> Subject: Re: [BUG] 100% cpu on each threads
>
>> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
>>> When syndrome appear, i see such line on syslog:
>>> (for one or all servers)
>>>
>>> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file
>>> descriptor", check duration: 2018ms. 0 active and 1 backup servers left.
>>> Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
>>
>> Wow, that's scary! This means we have a problem with server-side connections
>> and I really have no idea what it's about at the moment :-(
> @Emmanuel: Wild guess, is this a meltdown/spectre patched server and since the patch you have seen this errors?
>

No,
nothing has changed on the servers (and they run different Linux kernels), only the haproxy version, starting from 1.8-dev … and I see
the issue with 1.8.3 (I don't know whether the issue is also in 1.8.2)
Marc Fournier
Re: [BUG] 100% cpu on each threads
January 22, 2018 03:30PM
Cyril Bonté <[email protected]> writes:

Hello,

> Im' not sure you saw Samuel Reed's mail.
> He reported a similar issue some hours ago (High load average under
> 1.8 with multiple draining processes). It would be interesting to find
> a common configuration to reproduce the issue, so I add him to the thread.

I've been observing the same error messages Emmanuel reports, using
1.8.3 and nbthread. I tried to simplify & anonymize my configuration so
I could share a version which reproduces the problem, but didn't
succeed: the problem disappears at some point in the process, and I'm
unable to figure out exactly which change makes the difference :-(

So here are all the observations I gathered, hoping this will help move
a step further:

- disabling "nbthread", as well as setting "nbthread 1", makes the
problem go away.

- when exposed to client requests, I only observed high CPU load on one
instance out of the three I have, which receded after a restart of
haproxy. When working in isolation (no client requests), I never
noticed high CPU load.

- the more complex the config gets, the easier it is to reproduce the
issue. By "complex" I mean: more frontends, backends and servers
defined, conditionally routing traffic to each other based on ACLs,
SSL enabled, dns resolver enabled and used in server statements (a
placeholder sketch of that part follows after this list), various healthchecks on servers.

- at some point when simplifying the config, the problem becomes
transient, then eventually stops happening. But there doesn't seem to
be exactly one configuration keyword which triggers the issue.

- I also noticed a few log messages go missing from time to time. Not
sure about this though, it could also be a problem further downstream
in my logging pipeline.

- I've seen the problem happen on systems both with and without the
spectre/meltdown kernel patches.
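For illustration only, the resolvers part of the setup has this general shape (names, addresses and timers below are placeholders, not my real config):

resolvers internal-dns
    nameserver dns1 10.10.0.2:53
    resolve_retries 3
    timeout retry   1s
    hold valid      10s

backend backends
    server private-0 private-0.backends.internal:8080 check resolvers internal-dns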

Last but not least, by continuously reloading haproxy (SIGUSR2) and
running strace against it until the problem occurred, I was able to get
this sequence of events (slightly redacted, with a couple of comments
in-line), which seems to show some incorrect actions on file descriptors
between concurrent threads (a rough sketch of the reload loop is at the
end of this mail):

# thread 12167 opens an UDP socket to the DNS server defined in the resolvers
# section of my config, and starts sending queries:

[pid 12167] socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 16
[pid 12167] connect(16, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.10.0.2")}, 16) = 0
[pid 12167] fcntl(16, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[...]
[pid 12167] sendto(16, "\37\350\1\0\0\1\0\0\0\0\0\1\tprivate-0\10backends\4"..., 78, 0, NULL, 0) = 78
[pid 12167] sendto(16, "\265\327\1\0\0\1\0\0\0\0\0\1\tprivate-1\10backends\4"..., 78, 0, NULL, 0) = 78
[...]
[pid 12167] sendto(16, "\341\21\1\0\0\1\0\0\0\0\0\1\nprivate-23\10backends"...., 79, 0, NULL, 0) = 79
[pid 12167] sendto(16, "\\\223\1\0\0\1\0\0\0\0\0\1\nprivate-24\10backends"...., 79, 0, NULL, 0) = 79

# thread 12166 gets created, and closes an fd it didn't create, which
# happens to be the socket opened to the DNS server:

strace: Process 12166 attached
[pid 12167] sendto(16, "\316\352\1\0\0\1\0\0\0\0\0\1\nprivate-25\10backends"..., 79, 0, NULL, 0 <unfinished ...>
[pid 12166] set_robust_list(0x7ff9bc9aa9e0, 24 <unfinished ...>
[pid 12167] <... sendto resumed> ) = 79
[pid 12166] <... set_robust_list resumed> ) = 0
[pid 12166] gettimeofday({1516289044, 684014}, NULL) = 0
[pid 12167] sendto(16, "\37\367\1\0\0\1\0\0\0\0\0\1\nprivate-26\10backends"...., 79, 0, NULL, 0 <unfinished ...>
[pid 12166] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
[pid 12167] <... sendto resumed> ) = 79
[pid 12167] sendto(16, "\224\10\1\0\0\1\0\0\0\0\0\1\nprivate-27\10backends"...., 79, 0, NULL, 0) = 79
[pid 12166] <... mmap resumed> ) = 0x7ff9ac000000
[pid 12167] sendto(16, "\25 \1\0\0\1\0\0\0\0\0\1\nprivate-28\10backends"..., 79, 0, NULL, 0) = 79
[pid 12166] munmap(0x7ff9b0000000, 67108864) = 0
[pid 12167] sendto(16, "\275\n\1\0\0\1\0\0\0\0\0\1\nprivate-29\10backends"...., 79, 0, NULL, 0 <unfinished ...>
[pid 12166] mprotect(0x7ff9ac000000, 135168, PROT_READ|PROT_WRITE <unfinished ...>
[pid 12167] <... sendto resumed> ) = 79
[pid 12166] <... mprotect resumed> ) = 0
[pid 12167] sendto(16, "\312\360\1\0\0\1\0\0\0\0\0\1\nprivate-30\10backends"..., 79, 0, NULL, 0 <unfinished ...>
[pid 12166] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 12167] <... sendto resumed> ) = 79
[pid 12167] sendto(16, "\247e\1\0\0\1\0\0\0\0\0\1\nprivate-31\10backends"...., 79, 0, NULL, 0) = 79
[pid 12166] <... mmap resumed> ) = 0x7ff9baa65000
[pid 12167] sendto(16, "_k\1\0\0\1\0\0\0\0\0\1\tprivate-0\nbackoffic"..., 80, 0, NULL, 0 <unfinished ...>
[pid 12166] close(16 <unfinished ...>

# from now on, thread 12167 gets "Bad file descriptor" back when sending
# DNS queries:

[pid 12167] <... sendto resumed> ) = 80
[pid 12167] sendto(16, "\355\25\1\0\0\1\0\0\0\0\0\1\tprivate-1\nbackoffic"...., 80, 0, NULL, 0 <unfinished ...>
[pid 12166] <... close resumed> ) = 0
[pid 12167] <... sendto resumed> ) = -1 EBADF (Bad file descriptor)
[pid 12167] sendto(16, "?\217\1\0\0\1\0\0\0\0\0\1\tprivate-2\nbackoffic"..., 80, 0, NULL, 0 <unfinished ...>
[pid 12166] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
[pid 12167] <... sendto resumed> ) = -1 EBADF (Bad file descriptor)
[pid 12166] <... fcntl resumed> ) = 0

# meanwhile thread 12166 opens a TCP socket to a server declared in my config
# using its IP address, I assume to send a healthcheck:

[pid 12164] <... sched_setaffinity resumed> ) = 0
[pid 12164] sched_setaffinity(58, 128, [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 558d19fa65d0, 558d19eacee0, 7ffdfc793f70, 0, 0, 7ff9c82dc2b1, 7ffdfc793fb8, 7ffdfc793f78, 700000000, 558d19eaab30, 0, dd0973282a659c8a, 558d19eacee0, 7ffdfc793f70, ...] <unfinished ...>
[pid 12166] <... socket resumed> ) = 16
[pid 12166] fcntl(16, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
[pid 12164] <... sched_setaffinity resumed> ) = 0
[pid 12166] <... fcntl resumed> ) = 0
[pid 12166] setsockopt(16, SOL_TCP, TCP_NODELAY, [1], 4 <unfinished ...>
[pid 12164] sched_setaffinity(59, 128, [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 558d19fa65d0, 558d19eacee0, 7ffdfc793f70, 0, 0, 7ff9c82dc2b1, 7ffdfc793fb8, 7ffdfc793f78, 700000000, 558d19eaab30, 0, dd0973282a659c8a, 558d19eacee0, 7ffdfc793f70, ...] <unfinished ...>
[pid 12166] <... setsockopt resumed> ) = 0
[pid 12164] <... sched_setaffinity resumed> ) = -1 EINVAL (Invalid argument)
[pid 12166] setsockopt(16, SOL_TCP, TCP_QUICKACK, [0], 4 <unfinished ...>
[pid 12164] sched_setaffinity(60, 128, [8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 558d19fa65d0, 558d19eacee0, 7ffdfc793f70, 0, 0, 7ff9c82dc2b1, 7ffdfc793fb8, 7ffdfc793f78, 700000000, 558d19eaab30, 0, dd0973282a659c8a, 558d19eacee0, 7ffdfc793f70, ...]) = -1 EINVAL (Invalid argument)
[pid 12166] <... setsockopt resumed> ) = 0
[pid 12164] gettimeofday( <unfinished ...>
[pid 12166] connect(16, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

# ... then thread 12165 gets created and immediately closes 12166's socket:

strace: Process 12165 attached
[pid 12165] set_robust_list(0x7ff9bd1ab9e0, 24 <unfinished ...>
[pid 12164] <... gettimeofday resumed> {1516289044, 692533}, NULL) = 0
[pid 12165] <... set_robust_list resumed> ) = 0
[pid 12165] gettimeofday({1516289044, 692598}, NULL) = 0
[pid 12164] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 12165] mmap(0x7ff9b0000000, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff9b0000000
[pid 12165] mprotect(0x7ff9b0000000, 135168, PROT_READ|PROT_WRITE) = 0
[pid 12165] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff9ba2c3000
[pid 12165] close(16) = 0

[pid 12165] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[pid 12165] gettimeofday({1516289044, 693546}, NULL) = 0
[pid 12165] gettimeofday({1516289044, 693652}, NULL) = 0
[pid 12165] epoll_wait(3, [], 200, 0) = 0
[pid 12165] gettimeofday({1516289044, 693836}, NULL) = 0
[pid 12166] gettimeofday( <unfinished ...>

# ... 12164 (which seems to be the father or manager thread) tries to do
# the same thing, and fails:

[pid 12164] <... mmap resumed> ) = 0x7ff9b9b21000
[pid 12164] close(16) = -1 EBADF (Bad file descriptor)
[pid 12164] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[pid 12164] gettimeofday({1516289044, 696511}, NULL) = 0
[pid 12164] gettimeofday({1516289044, 696557}, NULL) = 0
[pid 12164] epoll_wait(3, [], 200, 0) = 0
[pid 12164] gettimeofday({1516289044, 696623}, NULL) = 0
[pid 12166] <... gettimeofday resumed> {1516289044, 700259}, NULL) = 0
[pid 12166] epoll_wait(3, [], 200, 0) = 0
[pid 12166] gettimeofday({1516289044, 700345}, NULL) = 0

# and when 12166 wants to use this socket, it obviously gets an error back too:

[pid 12166] getsockname(16, 0x7ff9ac012368, 0x7ff9bc9a7c0c) = -1 EBADF (Bad file descriptor)
[pid 12164] getpid() = 57
[pid 12164] getpid() = 57

# shortly after, log messages get sent out, indicating servers are considered
# out of order:

[pid 12164] sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"}, msg_iov(8)=[{"<145>Jan 18 15:24:04 ", 21}, {"haproxy", 10}, {"[", 1}, {"57", 2}, {"]: ", 3}, {"", 0}, {"Server blue/blue0 is DOWN, reason: Socket error, check duration: 0ms. 31 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.", 159}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 214
[pid 12164] getpid() = 57
[pid 12164] sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 18 15:24:04 ", 21}, {"haproxy", 10}, {"[", 1}, {"57", 2}, {"]: ", 3}, {"", 0}, {"Server blue/blue0 was DOWN and now enters maintenance.", 54}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 109
[pid 12164] getpid() = 57
[...]

In this specific situation, HAProxy is not able to recover by itself. I
assume this is due to how maintenance mode with DNS resolution works. But I've
also seen a case (when no DNS was involved) where the backend server
was seen offline, then online again, even though the actual service behind
HAProxy was up and running the whole time.
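For completeness, the reload/strace loop was of roughly this shape (the pid-file path is a placeholder for whatever your setup uses):

strace -f -tt -o /tmp/haproxy.trace -p "$(cat /var/run/haproxy.pid)" &

while true; do
    kill -USR2 "$(cat /var/run/haproxy.pid)"
    sleep 5
done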

Thanks !
Marc
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 22, 2018 06:00PM
Hi Marc,

On Mon, Jan 22, 2018 at 03:18:20PM +0100, Marc Fournier wrote:
> Cyril Bonté <[email protected]> writes:
>
> Hello,
>
> > Im' not sure you saw Samuel Reed's mail.
> > He reported a similar issue some hours ago (High load average under
> > 1.8 with multiple draining processes). It would be interesting to find
> > a common configuration to reproduce the issue, so I add him to the thread.
>
> I've been observing the same error messages Emmanuel reports, using
> 1.8.3 and nbthread. I tried to simplify & anonymize my configuration so
> I could share a version which reproduces the problem, but didn't
> succeed: the problem disappears at some point in the process, and I'm
> unable figure out exactly which change makes the difference :-(

We've done some work over the weekend to address an issue related to
FDs and threads: in short, it's impossible to let a thread sleep when
there's some activity on another one because they share the same epoll_fd.
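To illustrate the point outside of haproxy, here is a minimal standalone sketch (not haproxy code): two threads block in epoll_wait() on the same epoll fd, and a single ready fd typically wakes both of them, so neither can be left sleeping while the other has work to do.

/* build: cc -pthread -o epshare epshare.c */
#include <sys/epoll.h>
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

static int epfd;

static void *waiter(void *arg)
{
    struct epoll_event ev;
    /* level-triggered: the fd stays "ready" until drained, so every
     * thread sleeping on this shared epoll fd tends to be woken up */
    int n = epoll_wait(epfd, &ev, 1, 2000);
    printf("thread %ld: epoll_wait() = %d (fd %d)\n",
           (long)(intptr_t)arg, n, n > 0 ? ev.data.fd : -1);
    return NULL;
}

int main(void)
{
    int pfd[2];
    struct epoll_event ev = { .events = EPOLLIN };
    pthread_t t1, t2;

    pipe(pfd);
    epfd = epoll_create1(0);
    ev.data.fd = pfd[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev);

    pthread_create(&t1, NULL, waiter, (void *)1);
    pthread_create(&t2, NULL, waiter, (void *)2);
    sleep(1);                 /* let both threads reach epoll_wait() */
    write(pfd[1], "x", 1);    /* one event on one fd... */
    pthread_join(t1, NULL);   /* ...but both waiters usually report it */
    pthread_join(t2, NULL);
    return 0;
}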

We've sent Samuel a copy of patches to test (I'm attaching them here in
case you're interested to try as well, to add on top of latest 1.8, though
1.8.3 will be OK). Since you seem to be able to reproduce the bug on a
full config, you may be tempted to try them.

> - when exposed to client requests, I only observed high CPU load on one
> instance out of the three I have, which receeded after a restart of
> haproxy. When working in isolation (no client requests), I never
> noticed high CPU load.

So this could indicate an uncaught error on a specific fd. A "show fd"
on the CLI may give some useful information about this. And the patches
above also add "show activity", to run twice one second apart, and which
will indicate the various causes of wakeups.
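For the record, with a stats socket configured, these can be collected like this (the socket path is whatever your config uses):

echo "show fd"       | socat stdio unix-connect:/var/run/haproxy.sock
echo "show activity" | socat stdio unix-connect:/var/run/haproxy.sock
sleep 1
echo "show activity" | socat stdio unix-connect:/var/run/haproxy.sock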

> - the more complex the config gets, the easiest it is to reproduce the
> issue. By "complex" I mean: more frontends, backend and servers
> defined, conditionnally routing traffic to each other based on ACLs,
> SSL enabled, dns resolver enabled and used in server statements,
> various healthchecks on servers.

This could match the same root cause we've been working on, but it may
also more easily trigger a bug in one such area and cause the problem
to reveal itself.

> - at some point when simplyfing the config, the problem becomes
> transient, then eventually stops happening. But there doesn't seem to
> be exacly one configuration keyword which triggers the issue.

Which may possibly rule out the single isolated bug theory and fuel the
FD one a bit more.

> - I also noticed a few log messages go missing from time to time. Not
> sure about this though, it could also be a problem further downstream
> in my logging pipeline.

OK.

> - I've seen the problem happen on systems both with and without the
> spectre/meltdown kernel patches.

Good point, you're right to indicate it since we've had a doubt at some
point about the coincidence of the update (though it can come from various
causes).

> Last but not least, by continuously reloading haproxy (SIGUSR2) and
> running strace against it until the problem occured, I was able to get
> this sequence of events (slightly redacted, with a couple of comments
> in-line), which seem to show some incorrect actions on file descriptiors
> between concurrent threads:
>
> # thread 12167 opens an UDP socket to the DNS server defined in the resolvers
> # section of my config, and starts sending queries:
>
> [pid 12167] socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 16
> [pid 12167] connect(16, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.10.0.2")}, 16) = 0
> [pid 12167] fcntl(16, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> [...]
> [pid 12167] sendto(16, "\37\350\1\0\0\1\0\0\0\0\0\1\tprivate-0\10backends\4"..., 78, 0, NULL, 0) = 78
> [pid 12167] sendto(16, "\265\327\1\0\0\1\0\0\0\0\0\1\tprivate-1\10backends\4"..., 78, 0, NULL, 0) = 78
> [...]
> [pid 12167] sendto(16, "\341\21\1\0\0\1\0\0\0\0\0\1\nprivate-23\10backends"..., 79, 0, NULL, 0) = 79
> [pid 12167] sendto(16, "\\\223\1\0\0\1\0\0\0\0\0\1\nprivate-24\10backends"..., 79, 0, NULL, 0) = 79
>
> # thread 12166 gets created, and closes an fd it didn't create, which
> # happens to be the socket opened to the DNS server:
>
> strace: Process 12166 attached
> [pid 12167] sendto(16, "\316\352\1\0\0\1\0\0\0\0\0\1\nprivate-25\10backends"..., 79, 0, NULL, 0 <unfinished ...>
> [pid 12166] set_robust_list(0x7ff9bc9aa9e0, 24 <unfinished ...>
> [pid 12167] <... sendto resumed> ) = 79
> [pid 12166] <... set_robust_list resumed> ) = 0
> [pid 12166] gettimeofday({1516289044, 684014}, NULL) = 0
> [pid 12167] sendto(16, "\37\367\1\0\0\1\0\0\0\0\0\1\nprivate-26\10backends"..., 79, 0, NULL, 0 <unfinished ...>
> [pid 12166] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
> [pid 12167] <... sendto resumed> ) = 79
> [pid 12167] sendto(16, "\224\10\1\0\0\1\0\0\0\0\0\1\nprivate-27\10backends"..., 79, 0, NULL, 0) = 79
> [pid 12166] <... mmap resumed> ) = 0x7ff9ac000000
> [pid 12167] sendto(16, "\25 \1\0\0\1\0\0\0\0\0\1\nprivate-28\10backends"..., 79, 0, NULL, 0) = 79
> [pid 12166] munmap(0x7ff9b0000000, 67108864) = 0
> [pid 12167] sendto(16, "\275\n\1\0\0\1\0\0\0\0\0\1\nprivate-29\10backends"..., 79, 0, NULL, 0 <unfinished ...>
> [pid 12166] mprotect(0x7ff9ac000000, 135168, PROT_READ|PROT_WRITE <unfinished ...>
> [pid 12167] <... sendto resumed> ) = 79
> [pid 12166] <... mprotect resumed> ) = 0
> [pid 12167] sendto(16, "\312\360\1\0\0\1\0\0\0\0\0\1\nprivate-30\10backends"..., 79, 0, NULL, 0 <unfinished ...>
> [pid 12166] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> [pid 12167] <... sendto resumed> ) = 79
> [pid 12167] sendto(16, "\247e\1\0\0\1\0\0\0\0\0\1\nprivate-31\10backends"..., 79, 0, NULL, 0) = 79
> [pid 12166] <... mmap resumed> ) = 0x7ff9baa65000
> [pid 12167] sendto(16, "_k\1\0\0\1\0\0\0\0\0\1\tprivate-0\nbackoffic"..., 80, 0, NULL, 0 <unfinished ...>
> [pid 12166] close(16 <unfinished ...>

Wow, that's scary! Are you certain it's not the parent process and that
it's really a thread? You may see a clone() syscall in 12166 giving birth
to 12167. If they are different processes, that can be fine, but ...
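A quick way to check from the existing trace is to grep for clone(): with glibc, a new thread is created with flags containing CLONE_THREAD|CLONE_VM|CLONE_FILES, while a fork()ed process shows something like flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD (the exact flag list may differ, and the trace file name below is just a placeholder):

grep 'clone(' /tmp/haproxy.trace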

> # from now on, thread 12167 gets "Bad file descriptor" back when sending
> # DNS queries:
>
> [pid 12167] <... sendto resumed> ) = 80
> [pid 12167] sendto(16, "\355\25\1\0\0\1\0\0\0\0\0\1\tprivate-1\nbackoffic"..., 80, 0, NULL, 0 <unfinished ...>
> [pid 12166] <... close resumed> ) = 0
> [pid 12167] <... sendto resumed> ) = -1 EBADF (Bad file descriptor)
> [pid 12167] sendto(16, "?\217\1\0\0\1\0\0\0\0\0\1\tprivate-2\nbackoffic"..., 80, 0, NULL, 0 <unfinished ...>
> [pid 12166] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
> [pid 12167] <... sendto resumed> ) = -1 EBADF (Bad file descriptor)
> [pid 12166] <... fcntl resumed> ) = 0

Here it's getting obvious that it was a shared file descriptor :-(

> # meanwhile thread 12166 opens a TCP socket to a server declared in my config
> # using it's IP address, I assume to send a healthcheck:
>
> [pid 12164] <... sched_setaffinity resumed> ) = 0
> [pid 12164] sched_setaffinity(58, 128, [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 558d19fa65d0, 558d19eacee0, 7ffdfc793f70, 0, 0, 7ff9c82dc2b1, 7ffdfc793fb8, 7ffdfc793f78, 700000000, 558d19eaab30, 0, dd0973282a659c8a, 558d19eacee0, 7ffdfc793f70, ...] <unfinished ...>

BTW, it's very interesting to see that 12164 has still not finished
starting up. Maybe we're racing somewhere during the startup.

> [pid 12166] <... socket resumed> ) = 16
> [pid 12166] fcntl(16, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
> [pid 12164] <... sched_setaffinity resumed> ) = 0
> [pid 12166] <... fcntl resumed> ) = 0
> [pid 12166] setsockopt(16, SOL_TCP, TCP_NODELAY, [1], 4 <unfinished ...>
> [pid 12164] sched_setaffinity(59, 128, [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 558d19fa65d0, 558d19eacee0, 7ffdfc793f70, 0, 0, 7ff9c82dc2b1, 7ffdfc793fb8, 7ffdfc793f78, 700000000, 558d19eaab30, 0, dd0973282a659c8a, 558d19eacee0, 7ffdfc793f70, ...] <unfinished ...>
> [pid 12166] <... setsockopt resumed> ) = 0
> [pid 12164] <... sched_setaffinity resumed> ) = -1 EINVAL (Invalid argument)
> [pid 12166] setsockopt(16, SOL_TCP, TCP_QUICKACK, [0], 4 <unfinished ...>
> [pid 12164] sched_setaffinity(60, 128, [8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 558d19fa65d0, 558d19eacee0, 7ffdfc793f70, 0, 0, 7ff9c82dc2b1, 7ffdfc793fb8, 7ffdfc793f78, 700000000, 558d19eaab30, 0, dd0973282a659c8a, 558d19eacee0, 7ffdfc793f70, ...]) = -1 EINVAL (Invalid argument)
> [pid 12166] <... setsockopt resumed> ) = 0
> [pid 12164] gettimeofday( <unfinished ...>
> [pid 12166] connect(16, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
>
> # ... then thread 12165 gets created and immediatly closes 12166's socket:
>
> strace: Process 12165 attached
> [pid 12165] set_robust_list(0x7ff9bd1ab9e0, 24 <unfinished ...>
> [pid 12164] <... gettimeofday resumed> {1516289044, 692533}, NULL) = 0
> [pid 12165] <... set_robust_list resumed> ) = 0
> [pid 12165] gettimeofday({1516289044, 692598}, NULL) = 0
> [pid 12164] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> [pid 12165] mmap(0x7ff9b0000000, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff9b0000000
> [pid 12165] mprotect(0x7ff9b0000000, 135168, PROT_READ|PROT_WRITE) = 0
> [pid 12165] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff9ba2c3000
> [pid 12165] close(16) = 0

So that's basically the same problem as the other one above. We need to
figure out which close() calls are present after threads get created.

> [pid 12165] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> [pid 12165] gettimeofday({1516289044, 693546}, NULL) = 0
> [pid 12165] gettimeofday({1516289044, 693652}, NULL) = 0
> [pid 12165] epoll_wait(3, [], 200, 0) = 0
> [pid 12165] gettimeofday({1516289044, 693836}, NULL) = 0
> [pid 12166] gettimeofday( <unfinished ...>
>
> # ... 12164 (which seems to be the father or manager thread) tries to do
> # the same thing, and fails:
>
> [pid 12164] <... mmap resumed> ) = 0x7ff9b9b21000
> [pid 12164] close(16) = -1 EBADF (Bad file descriptor)
> [pid 12164] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> [pid 12164] gettimeofday({1516289044, 696511}, NULL) = 0
> [pid 12164] gettimeofday({1516289044, 696557}, NULL) = 0
> [pid 12164] epoll_wait(3, [], 200, 0) = 0
> [pid 12164] gettimeofday({1516289044, 696623}, NULL) = 0
> [pid 12166] <... gettimeofday resumed> {1516289044, 700259}, NULL) = 0
> [pid 12166] epoll_wait(3, [], 200, 0) = 0
> [pid 12166] gettimeofday({1516289044, 700345}, NULL) = 0
>
> # and when 12166 wants to use this socket, it obviously gets an error back too:
>
> [pid 12166] getsockname(16, 0x7ff9ac012368, 0x7ff9bc9a7c0c) = -1 EBADF (Bad file descriptor)

getsockname() tends to indicate that fd #16 was the result of an accept().

> [pid 12164] getpid() = 57
> [pid 12164] getpid() = 57

Pids are collected for logging. Strange that the libc didn't cache
the getpid() result by the way, usually it does. Thinking about it,
getpid()==57 is a bit surprising, maybe it's not really a getpid() in
fact.

> # shortly after, log messages get sent out, indicating servers are considered
> # out of order:
>
> [pid 12164] sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"}, msg_iov(8)=[{"<145>Jan 18 15:24:04 ", 21}, {"haproxy", 10}, {"[", 1}, {"57", 2}, {"]: ", 3}, {"", 0}, {"Server blue/blue0 is DOWN, reason: Socket error, check duration: 0ms. 31 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.", 159}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 214
> [pid 12164] getpid() = 57
> [pid 12164] sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 18 15:24:04 ", 21}, {"haproxy", 10}, {"[", 1}, {"57", 2}, {"]: ", 3}, {"", 0}, {"Server blue/blue0 was DOWN and now enters maintenance.", 54}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 109
> [pid 12164] getpid() = 57
> [...]
>
> In this specific situation, HAProxy is not able to recover by itself. I
> assume due to how maintenance mode with DNS resolution works. But I've
> also seen a case (when no DNS was involved) where the backend server
> was seen offline, then online again, despite the actual service behind
> HAProxy was up and running the whole time.

So if one thread decides to close another one's fd, anything can happen.
Some improperly caught events causing the CPU to spin, stuff like this.
I'm going to look at where we could have a close() call during the startup
before creating threads. The only ones I'm aware of are in the parent process
in the master-worker case.

At least the "show fd" and "show activity" (coming with the attached
patches) will significantly help.

Thanks!
Willy
From ac490a109e8132e25ca02362f0230d05c3d9160a Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Sat, 20 Jan 2018 18:12:15 +0100
Subject: MINOR: global/threads: move cpu_map at the end of the global struct

The "thread" part is 32kB long, better move it at the end of the
structure since it's only used during initialization, to keep the
rest grouped together.

(cherry picked from commit e862cb70dffb96387e05e2908c3cd6a642d2e36f)
Signed-off-by: Willy Tarreau <[email protected]>
---
include/types/global.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/types/global.h b/include/types/global.h
index 1f33207..6f3fedc 100644
--- a/include/types/global.h
+++ b/include/types/global.h
@@ -163,14 +163,14 @@ struct global {
mode_t mode; /* 0 to leave unchanged */
} ux;
} unix_bind;
+ struct proxy *stats_fe; /* the frontend holding the stats settings */
+ struct vars vars; /* list of variables for the process scope. */
#ifdef USE_CPU_AFFINITY
struct {
unsigned long proc[LONGBITS]; /* list of CPU masks for the 32/64 first processes */
unsigned long thread[LONGBITS][LONGBITS]; /* list of CPU masks for the 32/64 first threads per process */
} cpu_map;
#endif
- struct proxy *stats_fe; /* the frontend holding the stats settings */
- struct vars vars; /* list of variables for the process scope. */
};

extern struct global global;
--
2.8.0.rc2.1.gbe9624a

From c0ff5283c4837cef08f699280c706cf4b5a2e684 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Sat, 20 Jan 2018 18:19:22 +0100
Subject: MINOR: threads: add a MAX_THREADS define instead of LONGBITS

This one allows not to inflate some structures when threads are
disabled. Now struct global is 1.4 kB instead of 33 kB.

(cherry picked from commit 614683ee346cdc4589806863f01777f1fee33d18)
Signed-off-by: Willy Tarreau <[email protected]>
---
include/common/hathreads.h | 4 ++++
include/types/global.h | 2 +-
src/cfgparse.c | 14 +++++++-------
src/haproxy.c | 2 +-
src/listener.c | 2 +-
5 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/include/common/hathreads.h b/include/common/hathreads.h
index 5f0b969..f819536 100644
--- a/include/common/hathreads.h
+++ b/include/common/hathreads.h
@@ -30,6 +30,8 @@ extern THREAD_LOCAL unsigned long tid_bit; /* The bit corresponding to the threa

#ifndef USE_THREAD

+#define MAX_THREADS 1
+
#define __decl_hathreads(decl)

#define HA_ATOMIC_CAS(val, old, new) ({((*val) == (*old)) ? (*(val) = (new) , 1) : (*(old) = *(val), 0);})
@@ -95,6 +97,8 @@ extern THREAD_LOCAL unsigned long tid_bit; /* The bit corresponding to the threa
#include <pthread.h>
#include <import/plock.h>

+#define MAX_THREADS LONGBITS
+
#define __decl_hathreads(decl) decl

/* TODO: thread: For now, we rely on GCC builtins but it could be a good idea to
diff --git a/include/types/global.h b/include/types/global.h
index 6f3fedc..5c5cf73 100644
--- a/include/types/global.h
+++ b/include/types/global.h
@@ -168,7 +168,7 @@ struct global {
#ifdef USE_CPU_AFFINITY
struct {
unsigned long proc[LONGBITS]; /* list of CPU masks for the 32/64 first processes */
- unsigned long thread[LONGBITS][LONGBITS]; /* list of CPU masks for the 32/64 first threads per process */
+ unsigned long thread[LONGBITS][MAX_THREADS]; /* list of CPU masks for the 32/64 first threads per process */
} cpu_map;
#endif
};
diff --git a/src/cfgparse.c b/src/cfgparse.c
index efd56c4..567f3ac 100644
--- a/src/cfgparse.c
+++ b/src/cfgparse.c
@@ -1155,12 +1155,6 @@ int cfg_parse_global(const char *file, int linenum, char **args, int kwm)
goto out;
}
global.nbthread = atol(args[1]);
- if (global.nbthread < 1 || global.nbthread > LONGBITS) {
- ha_alert("parsing [%s:%d] : '%s' must be between 1 and %d (was %d).\n",
- file, linenum, args[0], LONGBITS, global.nbthread);
- err_code |= ERR_ALERT | ERR_FATAL;
- goto out;
- }
#ifndef USE_THREAD
if (global.nbthread > 1) {
ha_alert("HAProxy is not compiled with threads support, please check build options for USE_THREAD.\n");
@@ -1169,6 +1163,12 @@ int cfg_parse_global(const char *file, int linenum, char **args, int kwm)
goto out;
}
#endif
+ if (global.nbthread < 1 || global.nbthread > MAX_THREADS) {
+ ha_alert("parsing [%s:%d] : '%s' must be between 1 and %d (was %d).\n",
+ file, linenum, args[0], MAX_THREADS, global.nbthread);
+ err_code |= ERR_ALERT | ERR_FATAL;
+ goto out;
+ }
}
else if (!strcmp(args[0], "maxconn")) {
if (alertif_too_many_args(1, file, linenum, args, &err_code))
@@ -1781,7 +1781,7 @@ int cfg_parse_global(const char *file, int linenum, char **args, int kwm)
}

/* Mapping at the thread level */
- for (j = 0; j < LONGBITS; j++) {
+ for (j = 0; j < MAX_THREADS; j++) {
/* Np mapping for this thread */
if (!(thread & (1UL << j)))
continue;
diff --git a/src/haproxy.c b/src/haproxy.c
index 20b18f8..a8d0fad 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -2986,7 +2986,7 @@ int main(int argc, char **argv)
if (global.cpu_map.proc[relative_pid-1])
global.cpu_map.thread[relative_pid-1] &= global.cpu_map.proc[relative_pid-1];

- if (i < LONGBITS && /* only the first 32/64 threads may be pinned */
+ if (i < MAX_THREADS && /* only the first 32/64 threads may be pinned */
global.cpu_map.thread[relative_pid-1]) {/* only do this if the thread has a THREAD map */
#if defined(__FreeBSD__) || defined(__NetBSD__)
cpuset_t cpuset;
diff --git a/src/listener.c b/src/listener.c
index dfd36cc..fd70726 100644
--- a/src/listener.c
+++ b/src/listener.c
@@ -963,7 +963,7 @@ static int bind_parse_process(char **args, int cur_arg, struct proxy *px, struct

conf->bind_proc |= proc;
if (thread) {
- for (i = 0; i < LONGBITS; i++)
+ for (i = 0; i < MAX_THREADS; i++)
if (!proc || (proc & (1UL << i)))
conf->bind_thread |= thread;
}
--
2.8.0.rc2.1.gbe9624a

From 03db02f11fa03fc8bc705ec6e4931f84378e4de8 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Sat, 20 Jan 2018 19:30:13 +0100
Subject: MINOR: global: add some global activity counters to help debugging

A number of counters have been added at special places helping better
understanding certain bug reports. These counters are maintained per
thread and are shown using "show activity" on the CLI. The output is
sent as a single write(), which currently produces up to about 7 kB
of data for 64 threads. If more counters are added, it may be
necessary to write into multiple buffers, or to reset the counters.

(cherry picked from commit 5f19d4a01d8432bc930c04795df98725b87755cc)
Signed-off-by: Willy Tarreau <[email protected]>
---
doc/management.txt | 11 +++++++++++
include/types/global.h | 27 +++++++++++++++++++++++++++
src/cli.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
src/connection.c | 4 +++-
src/ev_epoll.c | 17 ++++++++++++++---
src/ev_kqueue.c | 15 +++++++++++++--
src/ev_poll.c | 28 +++++++++++++++++++++-------
src/ev_select.c | 18 +++++++++++++-----
src/fd.c | 12 +++++++++---
src/haproxy.c | 21 ++++++++++++++++-----
src/stream.c | 2 ++
src/task.c | 7 ++++++-
12 files changed, 183 insertions(+), 27 deletions(-)

diff --git a/doc/management.txt b/doc/management.txt
index 280ab17..6e5ebd3 100644
--- a/doc/management.txt
+++ b/doc/management.txt
@@ -1861,6 +1861,17 @@ show fd [<fd>]
that the output format may evolve over time so this output must not be parsed
by tools designed to be durable.

+show activity
+ Reports some counters about special events that will help developers and more
+ generally people who know haproxy well enough to narrow down the causes of
+ reports of abnormal behaviours. A typical example would be a properly running
+ process never sleeping and eating 100% of the CPU. The output fields will be
+ made of one line per metric, and per-thread counters on the same line. These
+ counters are 32-bit and will wrap during the process' life, which is not a
+ problem since calls to this command will typically be performed twice. The
+ fields are purposely not documented so that their exact meaning is verified
+ in the code where the counters are fed.
+
show info [typed|json]
Dump info about haproxy status on current process. If "typed" is passed as an
optional argument, field numbers, names and types are emitted as well so that
diff --git a/include/types/global.h b/include/types/global.h
index 5c5cf73..bd7761c 100644
--- a/include/types/global.h
+++ b/include/types/global.h
@@ -173,7 +173,34 @@ struct global {
#endif
};

+/* per-thread activity reports. It's important that it's aligned on cache lines
+ * because some elements will be updated very often. Most counters are OK on
+ * 32-bit since this will be used during debugging sessions for troubleshooting
+ * in iterative mode.
+ */
+struct activity {
+ unsigned int loops; // complete loops in run_poll_loop()
+ unsigned int wake_cache; // active fd_cache prevented poll() from sleeping
+ unsigned int wake_tasks; // active tasks prevented poll() from sleeping
+ unsigned int wake_applets; // active applets prevented poll() from sleeping
+ unsigned int wake_signal; // pending signal prevented poll() from sleeping
+ unsigned int poll_exp; // number of times poll() sees an expired timeout (includes wake_*)
+ unsigned int poll_drop; // poller dropped a dead FD from the update list
+ unsigned int poll_dead; // poller woke up with a dead FD
+ unsigned int poll_skip; // poller skipped another thread's FD
+ unsigned int fd_skip; // fd cache skipped another thread's FD
+ unsigned int fd_lock; // fd cache skipped a locked FD
+ unsigned int fd_del; // fd cache detected a deleted FD
+ unsigned int conn_dead; // conn_fd_handler woke up on an FD indicating a dead connection
+ unsigned int stream; // calls to process_stream()
+ unsigned int empty_rq; // calls to process_runnable_tasks() with nothing for the thread
+ unsigned int long_rq; // process_runnable_tasks() left with tasks in the run queue
+ char __pad[0]; // unused except to check remaining room
+ char __end[0] __attribute__((aligned(64))); // align size to 64.
+};
+
extern struct global global;
+extern struct activity activity[MAX_THREADS];
extern int pid; /* current process id */
extern int relative_pid; /* process id starting at 1 */
extern unsigned long pid_bit; /* bit corresponding to the process id */
diff --git a/src/cli.c b/src/cli.c
index 149ecd7..e2e7c34 100644
--- a/src/cli.c
+++ b/src/cli.c
@@ -855,6 +855,53 @@ static int cli_io_handler_show_fd(struct appctx *appctx)
return 1;
}

+/* This function dumps some activity counters used by developers and support to
+ * rule out some hypothesis during bug reports. It returns 0 if the output
+ * buffer is full and it needs to be called again, otherwise non-zero. It dumps
+ * everything at once in the buffer and is not designed to do it in multiple
+ * passes.
+ */
+static int cli_io_handler_show_activity(struct appctx *appctx)
+{
+ struct stream_interface *si = appctx->owner;
+ int thr;
+
+ if (unlikely(si_ic(si)->flags & (CF_WRITE_ERROR|CF_SHUTW)))
+ return 1;
+
+ chunk_reset(&trash);
+
+ chunk_appendf(&trash, "thread_id: %u", tid);
+ chunk_appendf(&trash, "\ndate_now: %lu.%06lu", (long)now.tv_sec, (long)now.tv_usec);
+ chunk_appendf(&trash, "\nloops:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].loops);
+ chunk_appendf(&trash, "\nwake_cache:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].wake_cache);
+ chunk_appendf(&trash, "\nwake_tasks:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].wake_tasks);
+ chunk_appendf(&trash, "\nwake_applets:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].wake_applets);
+ chunk_appendf(&trash, "\nwake_signal:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].wake_signal);
+ chunk_appendf(&trash, "\npoll_exp:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].poll_exp);
+ chunk_appendf(&trash, "\npoll_drop:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].poll_drop);
+ chunk_appendf(&trash, "\npoll_dead:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].poll_dead);
+ chunk_appendf(&trash, "\npoll_skip:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].poll_skip);
+ chunk_appendf(&trash, "\nfd_skip:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].fd_skip);
+ chunk_appendf(&trash, "\nfd_lock:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].fd_lock);
+ chunk_appendf(&trash, "\nfd_del:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].fd_del);
+ chunk_appendf(&trash, "\nconn_dead:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].conn_dead);
+ chunk_appendf(&trash, "\nstream:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].stream);
+ chunk_appendf(&trash, "\nempty_rq:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].empty_rq);
+ chunk_appendf(&trash, "\nlong_rq:"); for (thr = 0; thr < global.nbthread; thr++) chunk_appendf(&trash, " %u", activity[thr].long_rq);
+
+ chunk_appendf(&trash, "\n");
+
+ if (ci_putchk(si_ic(si), &trash) == -1) {
+ chunk_reset(&trash);
+ chunk_printf(&trash, "[output too large, cannot dump]\n");
+ si_applet_cant_put(si);
+ }
+
+ /* dump complete */
+ return 1;
+}
+
/*
* CLI IO handler for `show cli sockets`.
* Uses ctx.cli.p0 to store the restart pointer.
@@ -1428,6 +1475,7 @@ static struct cli_kw_list cli_kws = {{ },{
{ { "show", "env", NULL }, "show env [var] : dump environment variables known to the process", cli_parse_show_env, cli_io_handler_show_env, NULL },
{ { "show", "cli", "sockets", NULL }, "show cli sockets : dump list of cli sockets", cli_parse_default, cli_io_handler_show_cli_sock, NULL },
{ { "show", "fd", NULL }, "show fd [num] : dump list of file descriptors in use", cli_parse_show_fd, cli_io_handler_show_fd, NULL },
+ { { "show", "activity", NULL }, "show activity : show per-thread activity stats (for support/developers)", cli_parse_default, cli_io_handler_show_activity, NULL },
{ { "_getsocks", NULL }, NULL, _getsocks, NULL },
{{},}
}};
diff --git a/src/connection.c b/src/connection.c
index 0f8acb0..48d7a64 100644
--- a/src/connection.c
+++ b/src/connection.c
@@ -63,8 +63,10 @@ void conn_fd_handler(int fd)
struct connection *conn = fdtab[fd].owner;
unsigned int flags;

- if (unlikely(!conn))
+ if (unlikely(!conn)) {
+ activity[tid].conn_dead++;
return;
+ }

conn_refresh_polling_flags(conn);
conn->flags |= CO_FL_WILL_UPDATE;
diff --git a/src/ev_epoll.c b/src/ev_epoll.c
index 602a243..679dfee 100644
--- a/src/ev_epoll.c
+++ b/src/ev_epoll.c
@@ -68,8 +68,10 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
for (updt_idx = 0; updt_idx < fd_nbupdt; updt_idx++) {
fd = fd_updt[updt_idx];

- if (!fdtab[fd].owner)
+ if (!fdtab[fd].owner) {
+ activity[tid].poll_drop++;
continue;
+ }

HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
fdtab[fd].updated = 0;
@@ -114,8 +116,10 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
/* compute the epoll_wait() timeout */
if (!exp)
wait_time = MAX_DELAY_MS;
- else if (tick_is_expired(exp, now_ms))
+ else if (tick_is_expired(exp, now_ms)) {
+ activity[tid].poll_exp++;
wait_time = 0;
+ }
else {
wait_time = TICKS_TO_MS(tick_remain(now_ms, exp)) + 1;
if (wait_time > MAX_DELAY_MS)
@@ -136,8 +140,15 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
unsigned int e = epoll_events[count].events;
fd = epoll_events[count].data.fd;

- if (!fdtab[fd].owner || !(fdtab[fd].thread_mask & tid_bit))
+ if (!fdtab[fd].owner) {
+ activity[tid].poll_dead++;
+ continue;
+ }
+
+ if (!(fdtab[fd].thread_mask & tid_bit)) {
+ activity[tid].poll_skip++;
continue;
+ }

/* it looks complicated but gcc can optimize it away when constants
* have same values... In fact it depends on gcc :-(
diff --git a/src/ev_kqueue.c b/src/ev_kqueue.c
index b42ee3d..69d51b6 100644
--- a/src/ev_kqueue.c
+++ b/src/ev_kqueue.c
@@ -47,8 +47,10 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
for (updt_idx = 0; updt_idx < fd_nbupdt; updt_idx++) {
fd = fd_updt[updt_idx];

- if (!fdtab[fd].owner)
+ if (!fdtab[fd].owner) {
+ activity[tid].poll_drop++;
continue;
+ }

HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
fdtab[fd].updated = 0;
@@ -106,6 +108,8 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
timeout.tv_sec = (delta_ms / 1000);
timeout.tv_nsec = (delta_ms % 1000) * 1000000;
}
+ else
+ activity[tid].poll_exp++;

fd = MIN(maxfd, global.tune.maxpollevents);
gettimeofday(&before_poll, NULL);
@@ -122,8 +126,15 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
unsigned int n = 0;
fd = kev[count].ident;

- if (!fdtab[fd].owner || !(fdtab[fd].thread_mask & tid_bit))
+ if (!fdtab[fd].owner) {
+ activity[tid].poll_dead++;
continue;
+ }
+
+ if (!(fdtab[fd].thread_mask & tid_bit)) {
+ activity[tid].poll_skip++;
+ continue;
+ }

if (kev[count].filter == EVFILT_READ) {
if (kev[count].data)
diff --git a/src/ev_poll.c b/src/ev_poll.c
index 610509b..efd56ee 100644
--- a/src/ev_poll.c
+++ b/src/ev_poll.c
@@ -73,8 +73,10 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
for (updt_idx = 0; updt_idx < fd_nbupdt; updt_idx++) {
fd = fd_updt[updt_idx];

- if (!fdtab[fd].owner)
+ if (!fdtab[fd].owner) {
+ activity[tid].poll_drop++;
continue;
+ }

HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
fdtab[fd].updated = 0;
@@ -111,13 +113,21 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
continue;

for (count = 0, fd = fds * 8*sizeof(**fd_evts); count < 8*sizeof(**fd_evts) && fd < maxfd; count++, fd++) {
-
- if (!fdtab[fd].owner || !(fdtab[fd].thread_mask & tid_bit))
- continue;
-
sr = (rn >> count) & 1;
sw = (wn >> count) & 1;
if ((sr|sw)) {
+ if (!fdtab[fd].owner) {
+ /* should normally not happen here except
+ * due to rare thread concurrency
+ */
+ continue;
+ }
+
+ if (!(fdtab[fd].thread_mask & tid_bit)) {
+ activity[tid].poll_skip++;
+ continue;
+ }
+
poll_events[nbfd].fd = fd;
poll_events[nbfd].events = (sr ? (POLLIN | POLLRDHUP) : 0) | (sw ? POLLOUT : 0);
nbfd++;
@@ -128,8 +138,10 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
/* now let's wait for events */
if (!exp)
wait_time = MAX_DELAY_MS;
- else if (tick_is_expired(exp, now_ms))
+ else if (tick_is_expired(exp, now_ms)) {
+ activity[tid].poll_exp++;
wait_time = 0;
+ }
else {
wait_time = TICKS_TO_MS(tick_remain(now_ms, exp)) + 1;
if (wait_time > MAX_DELAY_MS)
@@ -152,8 +164,10 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
/* ok, we found one active fd */
status--;

- if (!fdtab[fd].owner)
+ if (!fdtab[fd].owner) {
+ activity[tid].poll_dead++;
continue;
+ }

/* it looks complicated but gcc can optimize it away when constants
* have same values... In fact it depends on gcc :-(
diff --git a/src/ev_select.c b/src/ev_select.c
index b2b4e50..52c4454 100644
--- a/src/ev_select.c
+++ b/src/ev_select.c
@@ -55,8 +55,10 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
for (updt_idx = 0; updt_idx < fd_nbupdt; updt_idx++) {
fd = fd_updt[updt_idx];

- if (!fdtab[fd].owner)
+ if (!fdtab[fd].owner) {
+ activity[tid].poll_drop++;
continue;
+ }

HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
fdtab[fd].updated = 0;
@@ -117,6 +119,8 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
delta.tv_sec = (delta_ms / 1000);
delta.tv_usec = (delta_ms % 1000) * 1000;
}
+ else
+ activity[tid].poll_exp++;

gettimeofday(&before_poll, NULL);
status = select(maxfd,
@@ -138,11 +142,15 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
for (count = BITS_PER_INT, fd = fds * BITS_PER_INT; count && fd < maxfd; count--, fd++) {
unsigned int n = 0;

- /* if we specify read first, the accepts and zero reads will be
- * seen first. Moreover, system buffers will be flushed faster.
- */
- if (!fdtab[fd].owner || !(fdtab[fd].thread_mask & tid_bit))
+ if (!fdtab[fd].owner) {
+ activity[tid].poll_dead++;
+ continue;
+ }
+
+ if (!(fdtab[fd].thread_mask & tid_bit)) {
+ activity[tid].poll_skip++;
continue;
+ }

if (FD_ISSET(fd, tmp_evts[DIR_RD]))
n |= FD_POLL_IN;
diff --git a/src/fd.c b/src/fd.c
index 9fb09ab..148b4d2 100644
--- a/src/fd.c
+++ b/src/fd.c
@@ -243,10 +243,14 @@ void fd_process_cached_events()
for (entry = 0; entry < fd_cache_num; ) {
fd = fd_cache[entry];

- if (!(fdtab[fd].thread_mask & tid_bit))
+ if (!(fdtab[fd].thread_mask & tid_bit)) {
+ activity[tid].fd_skip++;
goto next;
- if (HA_SPIN_TRYLOCK(FD_LOCK, &fdtab[fd].lock))
+ }
+ if (HA_SPIN_TRYLOCK(FD_LOCK, &fdtab[fd].lock)) {
+ activity[tid].fd_lock++;
goto next;
+ }

HA_RWLOCK_RDUNLOCK(FDCACHE_LOCK, &fdcache_lock);

@@ -272,8 +276,10 @@ void fd_process_cached_events()
/* If the fd was removed from the cache, it has been
* replaced by the next one that we don't want to skip !
*/
- if (entry < fd_cache_num && fd_cache[entry] != fd)
+ if (entry < fd_cache_num && fd_cache[entry] != fd) {
+ activity[tid].fd_del++;
continue;
+ }
next:
entry++;
}
diff --git a/src/haproxy.c b/src/haproxy.c
index a8d0fad..952733e 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -165,6 +165,8 @@ struct global global = {
/* others NULL OK */
};

+struct activity activity[MAX_THREADS] __attribute__((aligned(64))) = { };
+
/*********************************************************************/

int stopping; /* non zero means stopping in progress */
@@ -2371,7 +2373,7 @@ static void sync_poll_loop()
/* Runs the polling loop */
static void run_poll_loop()
{
- int next;
+ int next, exp;

tv_update_date(0,1);
while (1) {
@@ -2389,18 +2391,27 @@ static void run_poll_loop()
break;

/* expire immediately if events are pending */
- if (fd_cache_num || (active_tasks_mask & tid_bit) || signal_queue_len || (active_applets_mask & tid_bit))
- next = now_ms;
+ exp = now_ms;
+ if (fd_cache_num)
+ activity[tid].wake_cache++;
+ else if (active_tasks_mask & tid_bit)
+ activity[tid].wake_tasks++;
+ else if (active_applets_mask & tid_bit)
+ activity[tid].wake_applets++;
+ else if (signal_queue_len)
+ activity[tid].wake_signal++;
+ else
+ exp = next;

/* The poller will ensure it returns around <next> */
- cur_poller.poll(&cur_poller, next);
+ cur_poller.poll(&cur_poller, exp);
fd_process_cached_events();
applet_run_active();


/* Synchronize all polling loops */
sync_poll_loop();
-
+ activity[tid].loops++;
}
}

diff --git a/src/stream.c b/src/stream.c
index 60d3eff..ebe41be 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -1627,6 +1627,8 @@ struct task *process_stream(struct task *t)
struct channel *req, *res;
struct stream_interface *si_f, *si_b;

+ activity[tid].stream++;
+
req = &s->req;
res = &s->res;

diff --git a/src/task.c b/src/task.c
index 053376c..fd9acf6 100644
--- a/src/task.c
+++ b/src/task.c
@@ -196,8 +196,10 @@ void process_runnable_tasks()
max_processed = 200;
if (unlikely(global.nbthread <= 1)) {
/* when no lock is needed, this loop is much faster */
- if (!(active_tasks_mask & tid_bit))
+ if (!(active_tasks_mask & tid_bit)) {
+ activity[tid].empty_rq++;
return;
+ }

active_tasks_mask &= ~tid_bit;
rq_next = eb32sc_lookup_ge(&rqueue, rqueue_ticks - TIMER_LOOK_BACK, tid_bit);
@@ -245,6 +247,7 @@ void process_runnable_tasks()
max_processed--;
if (max_processed <= 0) {
active_tasks_mask |= tid_bit;
+ activity[tid].long_rq++;
break;
}
}
@@ -254,6 +257,7 @@ void process_runnable_tasks()
HA_SPIN_LOCK(TASK_RQ_LOCK, &rq_lock);
if (!(active_tasks_mask & tid_bit)) {
HA_SPIN_UNLOCK(TASK_RQ_LOCK, &rq_lock);
+ activity[tid].empty_rq++;
return;
}

@@ -335,6 +339,7 @@ void process_runnable_tasks()
HA_SPIN_LOCK(TASK_RQ_LOCK, &rq_lock);
if (max_processed <= 0) {
active_tasks_mask |= tid_bit;
+ activity[tid].long_rq++;
break;
}
}
--
2.8.0.rc2.1.gbe9624a

From 2c9bdf3c08dc94cce969fe65fd987798c83def03 Mon Sep 17 00:00:00 2001
From: Christopher Faulet <[email protected]>
Date: Mon, 15 Jan 2018 11:57:03 +0100
Subject: MINOR: threads/fd: Use a bitfield to know if there are FDs for a
thread in the FD cache

A bitfield has been added to know if there are some FDs processable by a
specific thread in the FD cache. When a FD is inserted in the FD cache, the bits
corresponding to its thread_mask are set. On each thread, the bitfield is
updated when the FD cache is processed. If there is no FD processed, the thread
is removed from the bitfield by unsetting its tid_bit.

Note that this bitfield is updated but not checked in
fd_process_cached_events. So, when this function is called, the FD cache is
always processed.

[wt: should be backported to 1.8 as it will help fix a design limitation]

(cherry picked from commit 699edb43c50bf7e3590f2994a67a3d70ddcb8f79)
Signed-off-by: Willy Tarreau <[email protected]>
---
include/proto/fd.h | 2 ++
src/fd.c | 7 ++++---
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/proto/fd.h b/include/proto/fd.h
index 8cc191f..44370e7 100644
--- a/include/proto/fd.h
+++ b/include/proto/fd.h
@@ -35,6 +35,7 @@

extern unsigned int *fd_cache; // FD events cache
extern int fd_cache_num; // number of events in the cache
+extern unsigned long fd_cache_mask; // Mask of threads with events in the cache

extern THREAD_LOCAL int *fd_updt; // FD updates list
extern THREAD_LOCAL int fd_nbupdt; // number of updates in the list
@@ -115,6 +116,7 @@ static inline void fd_alloc_cache_entry(const int fd)
if (fdtab[fd].cache)
goto end;
fd_cache_num++;
+ fd_cache_mask |= fdtab[fd].thread_mask;
fdtab[fd].cache = fd_cache_num;
fd_cache[fd_cache_num-1] = fd;
end:
diff --git a/src/fd.c b/src/fd.c
index 148b4d2..8411bcf 100644
--- a/src/fd.c
+++ b/src/fd.c
@@ -170,6 +170,7 @@ int nbpollers = 0;

unsigned int *fd_cache = NULL; // FD events cache
int fd_cache_num = 0; // number of events in the cache
+unsigned long fd_cache_mask = 0; // Mask of threads with events in the cache

THREAD_LOCAL int *fd_updt = NULL; // FD updates list
THREAD_LOCAL int fd_nbupdt = 0; // number of updates in the list
@@ -236,10 +237,8 @@ void fd_process_cached_events()
{
int fd, entry, e;

- if (!fd_cache_num)
- return;
-
HA_RWLOCK_RDLOCK(FDCACHE_LOCK, &fdcache_lock);
+ fd_cache_mask &= ~tid_bit;
for (entry = 0; entry < fd_cache_num; ) {
fd = fd_cache[entry];

@@ -247,6 +246,8 @@ void fd_process_cached_events()
activity[tid].fd_skip++;
goto next;
}
+
+ fd_cache_mask |= tid_bit;
if (HA_SPIN_TRYLOCK(FD_LOCK, &fdtab[fd].lock)) {
activity[tid].fd_lock++;
goto next;
--
2.8.0.rc2.1.gbe9624a
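
The principle of the new fd_cache_mask can be shown in a few lines of illustrative C (invented names, single-threaded demo; the real code updates the mask while holding the FD cache lock): one shared bitfield where bit N means "thread N may have cached FD events".

#include <stdio.h>

#define NB_THREADS 4

static unsigned long cache_mask;              /* one bit per thread */

static void add_cached_event(unsigned long thread_mask)
{
	cache_mask |= thread_mask;            /* every eligible thread gets its bit */
}

static int thread_has_cached_events(int tid)
{
	return !!(cache_mask & (1UL << tid));
}

static void thread_processed_cache(int tid, int still_pending)
{
	cache_mask &= ~(1UL << tid);          /* assume nothing is left for us */
	if (still_pending)
		cache_mask |= (1UL << tid);   /* re-arm if we had to skip some FDs */
}

int main(void)
{
	add_cached_event((1UL << 1) | (1UL << 2));    /* event for threads 1 and 2 */

	for (int tid = 0; tid < NB_THREADS; tid++)
		printf("thread %d pending: %d\n", tid, thread_has_cached_events(tid));

	thread_processed_cache(1, 0);
	printf("thread 1 pending after processing: %d\n", thread_has_cached_events(1));
	return 0;
}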

From f31f53d2d840e3842b98b9a2dbc9ecec5ddffaa1 Mon Sep 17 00:00:00 2001
From: Christopher Faulet <[email protected]>
Date: Mon, 15 Jan 2018 12:16:34 +0100
Subject: BUG/MEDIUM: threads/polling: Use fd_cache_mask instead of
fd_cache_num

fd_cache_num is the number of FDs in the FD cache. It is a global variable. So
it is underoptimized because we may be led to consider there are waiting FDs
for the current thread in the FD cache while in fact all FDs are assigned to the
other threads. So, in such cases, the polling loop will be evaluated many more
times than necessary.

Instead, we now check if the thread id is set in the bitfield fd_cache_mask.

[wt: it's not exactly a bug, rather a design limitation of the thread
which was not addressed in time for the 1.8 release. It can appear more
often than we initially predicted, when more threads are running than
the number of assigned CPU cores, or when certain threads spend
milliseconds computing crypto keys while other threads spin on
epoll_wait(0)=0]

This patch should be backported to 1.8.

(cherry picked from commit 207049e8db9cab66d00376e924500b5689637e74)
Signed-off-by: Willy Tarreau <[email protected]>
---
src/haproxy.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/haproxy.c b/src/haproxy.c
index 952733e..45ff0fc 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -2392,7 +2392,7 @@ static void run_poll_loop()

/* expire immediately if events are pending */
exp = now_ms;
- if (fd_cache_num)
+ if (fd_cache_mask & tid_bit)
activity[tid].wake_cache++;
else if (active_tasks_mask & tid_bit)
activity[tid].wake_tasks++;
--
2.8.0.rc2.1.gbe9624a
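
The effect of the fix is easiest to see as a tiny decision function: a thread only asks the poller for a zero timeout when its own bit is set, so threads with nothing cached can really sleep. A hedged sketch with invented names, not the actual run_poll_loop() code:

#include <stdio.h>

/* returns the timeout to hand to the poller */
static int compute_poll_timeout(unsigned long fd_cache_mask, unsigned long tid_bit,
                                int now_ms, int next_timer_ms)
{
	if (fd_cache_mask & tid_bit)
		return now_ms;          /* cached work for us: poll(0), don't sleep */
	return next_timer_ms;           /* nothing for us: sleep until the next timer */
}

int main(void)
{
	unsigned long fd_cache_mask = 1UL << 1;       /* only thread 1 has cached events */

	printf("thread 0 timeout: %d\n", compute_poll_timeout(fd_cache_mask, 1UL << 0, 0, 50));
	printf("thread 1 timeout: %d\n", compute_poll_timeout(fd_cache_mask, 1UL << 1, 0, 50));
	return 0;
}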

From 412dec96e5f0c664d17ac49bb806af5e24a6766e Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Sat, 20 Jan 2018 23:53:50 +0100
Subject: BUG/MEDIUM: fd: maintain a per-thread update mask

Since the fd update tables are per-thread, we need to have a bit per
thread to indicate whether an update exists, otherwise this can lead
to lost update events every time multiple threads want to update the
same FD. In practice *for now*, it only happens at start time when
listeners are enabled and ask for polling after facing their first
EAGAIN. But since the pollers are still shared, a lost event is still
recovered by a neighbor thread. This will not reliably work anymore
with per-thread pollers, where it has been observed a few times that a
single-threaded listener would not always accept incoming connections
on startup.

It's worth noting that during this code review it appeared that the
"new" flag in the fdtab isn't used anymore.

This fix should be backported to 1.8.

(cherry picked from commit ba7dbcb134c3864d2bf212e6381867f09d3dd0a6)
Signed-off-by: Willy Tarreau <[email protected]>
---
include/proto/fd.h | 6 +++---
include/types/fd.h | 2 +-
src/cli.c | 5 ++---
src/ev_epoll.c | 2 +-
src/ev_kqueue.c | 2 +-
src/ev_poll.c | 2 +-
src/ev_select.c | 2 +-
src/fd.c | 2 +-
src/stream.c | 4 ++--
9 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/proto/fd.h b/include/proto/fd.h
index 44370e7..d6b591d 100644
--- a/include/proto/fd.h
+++ b/include/proto/fd.h
@@ -99,10 +99,10 @@ void fd_process_cached_events();
*/
static inline void updt_fd_polling(const int fd)
{
- if (fdtab[fd].updated)
+ if (fdtab[fd].update_mask & tid_bit)
/* already scheduled for update */
return;
- fdtab[fd].updated = 1;
+ fdtab[fd].update_mask |= tid_bit;
fd_updt[fd_nbupdt++] = fd;
}

@@ -400,7 +400,7 @@ static inline void fd_insert(int fd, unsigned long thread_mask)
HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
fdtab[fd].ev = 0;
fdtab[fd].new = 1;
- fdtab[fd].updated = 0;
+ fdtab[fd].update_mask &= ~tid_bit;
fdtab[fd].linger_risk = 0;
fdtab[fd].cloned = 0;
fdtab[fd].cache = 0;
diff --git a/include/types/fd.h b/include/types/fd.h
index 032bab9..54192e4 100644
--- a/include/types/fd.h
+++ b/include/types/fd.h
@@ -94,13 +94,13 @@ enum fd_states {
struct fdtab {
__decl_hathreads(HA_SPINLOCK_T lock);
unsigned long thread_mask; /* mask of thread IDs authorized to process the task */
+ unsigned long update_mask; /* mask of thread IDs having an update for fd */
void (*iocb)(int fd); /* I/O handler */
void *owner; /* the connection or listener associated with this fd, NULL if closed */
unsigned int cache; /* position+1 in the FD cache. 0=not in cache. */
unsigned char state; /* FD state for read and write directions (2*3 bits) */
unsigned char ev; /* event seen in return of poll() : FD_POLL_* */
unsigned char new:1; /* 1 if this fd has just been created */
- unsigned char updated:1; /* 1 if this fd is already in the update list */
unsigned char linger_risk:1; /* 1 if we must kill lingering before closing */
unsigned char cloned:1; /* 1 if a cloned socket, requires EPOLL_CTL_DEL on close */
};
diff --git a/src/cli.c b/src/cli.c
index e2e7c34..a1a8323 100644
--- a/src/cli.c
+++ b/src/cli.c
@@ -794,7 +794,7 @@ static int cli_io_handler_show_fd(struct appctx *appctx)
li = fdt.owner;

chunk_printf(&trash,
- " %5d : st=0x%02x(R:%c%c%c W:%c%c%c) ev=0x%02x(%c%c%c%c%c) [%c%c%c%c] cache=%u owner=%p iocb=%p(%s) tmask=0x%lx",
+ " %5d : st=0x%02x(R:%c%c%c W:%c%c%c) ev=0x%02x(%c%c%c%c%c) [%c%c%c] cache=%u owner=%p iocb=%p(%s) tmask=0x%lx umask=0x%lx",
fd,
fdt.state,
(fdt.state & FD_EV_POLLED_R) ? 'P' : 'p',
@@ -810,7 +810,6 @@ static int cli_io_handler_show_fd(struct appctx *appctx)
(fdt.ev & FD_POLL_PRI) ? 'P' : 'p',
(fdt.ev & FD_POLL_IN) ? 'I' : 'i',
fdt.new ? 'N' : 'n',
- fdt.updated ? 'U' : 'u',
fdt.linger_risk ? 'L' : 'l',
fdt.cloned ? 'C' : 'c',
fdt.cache,
@@ -820,7 +819,7 @@ static int cli_io_handler_show_fd(struct appctx *appctx)
(fdt.iocb == dgram_fd_handler) ? "dgram_fd_handler" :
(fdt.iocb == listener_accept) ? "listener_accept" :
"unknown",
- fdt.thread_mask);
+ fdt.thread_mask, fdt.update_mask);

if (fdt.iocb == conn_fd_handler) {
chunk_appendf(&trash, " cflg=0x%08x", conn_flags);
diff --git a/src/ev_epoll.c b/src/ev_epoll.c
index 679dfee..f37455f 100644
--- a/src/ev_epoll.c
+++ b/src/ev_epoll.c
@@ -74,7 +74,7 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
}

HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
- fdtab[fd].updated = 0;
+ fdtab[fd].update_mask &= ~tid_bit;
fdtab[fd].new = 0;

eo = fdtab[fd].state;
diff --git a/src/ev_kqueue.c b/src/ev_kqueue.c
index 69d51b6..20fa290 100644
--- a/src/ev_kqueue.c
+++ b/src/ev_kqueue.c
@@ -53,7 +53,7 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
}

HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
- fdtab[fd].updated = 0;
+ fdtab[fd].update_mask &= ~tid_bit;
fdtab[fd].new = 0;

eo = fdtab[fd].state;
diff --git a/src/ev_poll.c b/src/ev_poll.c
index efd56ee..f24bf69 100644
--- a/src/ev_poll.c
+++ b/src/ev_poll.c
@@ -79,7 +79,7 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
}

HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
- fdtab[fd].updated = 0;
+ fdtab[fd].update_mask &= ~tid_bit;
fdtab[fd].new = 0;

eo = fdtab[fd].state;
diff --git a/src/ev_select.c b/src/ev_select.c
index 52c4454..19b1380 100644
--- a/src/ev_select.c
+++ b/src/ev_select.c
@@ -61,7 +61,7 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
}

HA_SPIN_LOCK(FD_LOCK, &fdtab[fd].lock);
- fdtab[fd].updated = 0;
+ fdtab[fd].update_mask &= ~tid_bit;
fdtab[fd].new = 0;

eo = fdtab[fd].state;
diff --git a/src/fd.c b/src/fd.c
index 8411bcf..112806b 100644
--- a/src/fd.c
+++ b/src/fd.c
@@ -199,7 +199,7 @@ static void fd_dodelete(int fd, int do_close)
port_range_release_port(fdinfo[fd].port_range, fdinfo[fd].local_port);
fdinfo[fd].port_range = NULL;
fdtab[fd].owner = NULL;
- fdtab[fd].updated = 0;
+ fdtab[fd].update_mask &= ~tid_bit;
fdtab[fd].new = 0;
fdtab[fd].thread_mask = 0;
if (do_close)
diff --git a/src/stream.c b/src/stream.c
index ebe41be..92f9c0a 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -2906,7 +2906,7 @@ static int stats_dump_full_strm_to_buffer(struct stream_interface *si, struct st
conn->handle.fd,
conn->handle.fd >= 0 ? fdtab[conn->handle.fd].state : 0,
conn->handle.fd >= 0 ? fdtab[conn->handle.fd].cache : 0,
- conn->handle.fd >= 0 ? fdtab[conn->handle.fd].updated : 0,
+ conn->handle.fd >= 0 ? !!(fdtab[conn->handle.fd].update_mask & tid_bit) : 0,
conn->handle.fd >= 0 ? fdtab[conn->handle.fd].thread_mask: 0);
}
else if ((tmpctx = objt_appctx(strm->si[0].end)) != NULL) {
@@ -2939,7 +2939,7 @@ static int stats_dump_full_strm_to_buffer(struct stream_interface *si, struct st
conn->handle.fd,
conn->handle.fd >= 0 ? fdtab[conn->handle.fd].state : 0,
conn->handle.fd >= 0 ? fdtab[conn->handle.fd].cache : 0,
- conn->handle.fd >= 0 ? fdtab[conn->handle.fd].updated : 0,
+ conn->handle.fd >= 0 ? !!(fdtab[conn->handle.fd].update_mask & tid_bit) : 0,
conn->handle.fd >= 0 ? fdtab[conn->handle.fd].thread_mask: 0);
}
else if ((tmpctx = objt_appctx(strm->si[1].end)) != NULL) {
--
2.8.0.rc2.1.gbe9624a
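
For illustration, a stripped-down version of the per-thread update mask (hypothetical names, no locking shown) looks like this; the point is that the "already scheduled" test is now per thread instead of a single shared flag, so one thread queuing an update no longer hides the need for another thread to do the same:

#include <stdio.h>

#define NB_FDS 4

struct my_fd {
	unsigned long update_mask;      /* one bit per thread with a pending update */
};

static struct my_fd my_fdtab[NB_FDS];
static __thread unsigned long tid_bit;
static __thread int fd_updt[NB_FDS];    /* per-thread update list */
static __thread int fd_nbupdt;

static void updt_fd_polling(int fd)
{
	if (my_fdtab[fd].update_mask & tid_bit)
		return;                         /* already queued by *this* thread */
	my_fdtab[fd].update_mask |= tid_bit;
	fd_updt[fd_nbupdt++] = fd;
}

int main(void)
{
	tid_bit = 1UL << 0;                     /* pretend we are thread 0 */
	updt_fd_polling(3);
	updt_fd_polling(3);                     /* second call is a no-op */
	printf("updates queued: %d, mask of fd 3: 0x%lx\n",
	       fd_nbupdt, my_fdtab[3].update_mask);
	return 0;
}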

From 7caae107b4416ce74102a1398bf68bf93bddf65b Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Wed, 17 Jan 2018 18:44:46 +0100
Subject: MINOR: fd: add a bitmask to indicate that an FD is known by the
poller

Some pollers like epoll() need to know if the fd is already known or
not in order to compute the operation to perform (add, mod, del). For
now this is performed based on the difference between the previous FD
state and the new state but this will not be usable anymore once threads
become responsible for their own polling.

Here we come with a different approach : a bitmask is stored with the
fd to indicate which pollers already know it, and the pollers will be
able to simply perform the add/mod/del operations based on this bit
combined with the new state.

This patch only adds the bitmask declaration and initialization; it
is not used yet.

(cherry picked from commit 2582acf2b0793240afb2464f1839876432cf66ce)
Signed-off-by: Willy Tarreau <[email protected]>
---
include/proto/fd.h | 3 +++
include/types/fd.h | 1 +
src/fd.c | 4 +++-
3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/proto/fd.h b/include/proto/fd.h
index d6b591d..cc559ac 100644
--- a/include/proto/fd.h
+++ b/include/proto/fd.h
@@ -405,6 +405,9 @@ static inline void fd_insert(int fd, unsigned long thread_mask)
fdtab[fd].cloned = 0;
fdtab[fd].cache = 0;
fdtab[fd].thread_mask = thread_mask;
+ /* note: do not reset polled_mask here as it indicates which poller
+ * still knows this FD from a possible previous round.
+ */
HA_SPIN_UNLOCK(FD_LOCK, &fdtab[fd].lock);

HA_SPIN_LOCK(FDTAB_LOCK, &fdtab_lock);
diff --git a/include/types/fd.h b/include/types/fd.h
index 54192e4..9f2c5fe 100644
--- a/include/types/fd.h
+++ b/include/types/fd.h
@@ -94,6 +94,7 @@ enum fd_states {
struct fdtab {
__decl_hathreads(HA_SPINLOCK_T lock);
unsigned long thread_mask; /* mask of thread IDs authorized to process the task */
+ unsigned long polled_mask; /* mask of thread IDs currently polling this fd */
unsigned long update_mask; /* mask of thread IDs having an update for fd */
void (*iocb)(int fd); /* I/O handler */
void *owner; /* the connection or listener associated with this fd, NULL if closed */
diff --git a/src/fd.c b/src/fd.c
index 112806b..b64130e 100644
--- a/src/fd.c
+++ b/src/fd.c
@@ -202,8 +202,10 @@ static void fd_dodelete(int fd, int do_close)
fdtab[fd].update_mask &= ~tid_bit;
fdtab[fd].new = 0;
fdtab[fd].thread_mask = 0;
- if (do_close)
+ if (do_close) {
+ fdtab[fd].polled_mask = 0;
close(fd);
+ }
HA_SPIN_UNLOCK(FD_LOCK, &fdtab[fd].lock);

HA_SPIN_LOCK(FDTAB_LOCK, &fdtab_lock);
--
2.8.0.rc2.1.gbe9624a
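
Once the polled_mask bit exists, choosing the epoll operation becomes a pure function of "did this thread already register the fd" and "does it want it polled now". A small sketch under that assumption (Linux headers only for the EPOLL_CTL_* constants; this is not the actual update loop):

#include <stdio.h>
#include <sys/epoll.h>

/* returns the epoll_ctl opcode to use, or -1 if nothing to do */
static int pick_epoll_op(unsigned long polled_mask, unsigned long tid_bit,
                         int wants_polling)
{
	if (polled_mask & tid_bit)
		return wants_polling ? EPOLL_CTL_MOD : EPOLL_CTL_DEL;
	return wants_polling ? EPOLL_CTL_ADD : -1;
}

int main(void)
{
	unsigned long tid_bit = 1UL << 2;

	printf("known + wanted   -> %d (MOD=%d)\n", pick_epoll_op(tid_bit, tid_bit, 1), EPOLL_CTL_MOD);
	printf("known + unwanted -> %d (DEL=%d)\n", pick_epoll_op(tid_bit, tid_bit, 0), EPOLL_CTL_DEL);
	printf("new + wanted     -> %d (ADD=%d)\n", pick_epoll_op(0, tid_bit, 1), EPOLL_CTL_ADD);
	printf("new + unwanted   -> %d (skip)\n",   pick_epoll_op(0, tid_bit, 0));
	return 0;
}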

From a9d6f8c266f47507bd43904c7eb52b8265fda6ea Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Thu, 18 Jan 2018 19:16:02 +0100
Subject: BUG/MEDIUM: epoll/threads: use one epoll_fd per thread

There currently is a problem regarding epoll(). While select() and poll()
compute their polling state on the fly upon each call, epoll() keeps a
shared state between all threads via the epoll_fd. The problem is that
once an fd is registered on *any* thread, all other threads receive
events for that FD as well. It is clearly visible when binding a listener
to a single thread like in the configuration below where all 4 threads
will work, 3 of them simply spinning to skip the event :

global
nbthread 4

frontend foo
bind :1234 process 1/1

The worst case happens when some slow operations are in progress on a
busy thread, preventing it from processing its task and causing the
other ones to wake up without being able to do anything with this event.
Typically computing a large TLS key will delay processing of next
events on the same thread while others will still wake up.

All this simply shows that the poller must remain thread-specific, with
its own events and its own ability to sleep when it doesn't have anything
to do.

This patch does exactly this. For this, it proceeds like this :

- have one epoll_fd per thread instead of one per process
- initialize these epoll_fd when threads are created.
- mark all known FDs as updated so that the next invocation of
_do_poll() recomputes their polling status (including a possible
removal of undesired polling from the original FD) ;
- use each fd's polled_mask to maintain an accurate status of
the current polling activity for this FD.
- when scanning updates, only focus on events whose new polling
status differs from the existing one
- during updates, always verify the thread_mask to resist migration
- on __fd_clo(), for cloned FDs (typically listeners inherited
from the parent during a graceful shutdown), run epoll_ctl(DEL)
on all epoll_fd. This is the reason why epoll_fd is stored in a
shared array and not in a thread_local storage. Note: maybe this
can be moved to an update instead.

Interestingly, this shows that we don't need the FD's old state anymore
and that we only use it to convert it to the new state based on stable
information. It appears clearly that the FD code can be further improved
by computing the final state directly when manipulating it.

With this change, the config above goes from 22000 cps at 380% CPU to
43000 cps at 100% CPU : not only the 3 unused threads are not activated,
but they do not disturb the activity anymore.

The output of "show activity" before and after the patch on a 4-thread
config where a first listener on thread 2 forwards over SSL to threads
3 & 4 shows a much smaller number of undesired events (thread 1
doesn't wake up anymore, poll_skip remains zero, fd_skip stays low) :

// before: 400% CPU, 7700 cps, 13 seconds
loops: 11380717 65879 5733468 5728129
wake_cache: 0 63986 317547 314174
wake_tasks: 0 0 0 0
wake_applets: 0 0 0 0
wake_signal: 0 0 0 0
poll_exp: 0 63986 317547 314174
poll_drop: 1 0 49981 48893
poll_dead: 65514 0 31334 31934
poll_skip: 46293690 34071 22867786 22858208
fd_skip: 66068135 174157 33732685 33825727
fd_lock: 0 2 2809 2905
fd_del: 0 494361 80890 79464
conn_dead: 0 0 0 0
stream: 0 407747 50526 49474
empty_rq: 11380718 1914 5683023 5678715
long_rq: 0 0 0 0

// after: 200% cpu, 9450 cps, 11 seconds
loops: 17 66147 1001631 450968
wake_cache: 0 66119 865139 321227
wake_tasks: 0 0 0 0
wake_applets: 0 0 0 0
wake_signal: 0 0 0 0
poll_exp: 0 66119 865139 321227
poll_drop: 6 5 38279 60768
poll_dead: 0 0 0 0
poll_skip: 0 0 0 0
fd_skip: 54 172661 4411407 2008198
fd_lock: 0 0 10890 5394
fd_del: 0 492829 58965 105091
conn_dead: 0 0 0 0
stream: 0 406223 38663 61338
empty_rq: 18 40 962999 390549
long_rq: 0 0 0 0

This patch presents a few risks but fixes a real problem with threads,
and as such it must experience more tests and should be backported to
1.8. It depends on previous patch ("MINOR: fd: add a bitmask to indicate
that an FD is known by the poller").

(cherry picked from commit a206ce8e337fb7dda6559337c1f525ba9cff6482)
Signed-off-by: Willy Tarreau <[email protected]>
---
src/ev_epoll.c | 104 +++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 71 insertions(+), 33 deletions(-)

diff --git a/src/ev_epoll.c b/src/ev_epoll.c
index f37455f..635b8a5 100644
--- a/src/ev_epoll.c
+++ b/src/ev_epoll.c
@@ -29,7 +29,7 @@

/* private data */
static THREAD_LOCAL struct epoll_event *epoll_events = NULL;
-static int epoll_fd;
+static int epoll_fd[MAX_THREADS]; // per-thread epoll_fd

/* This structure may be used for any purpose. Warning! do not use it in
* recursive functions !
@@ -49,8 +49,14 @@ static THREAD_LOCAL struct epoll_event ev;
*/
REGPRM1 static void __fd_clo(int fd)
{
- if (unlikely(fdtab[fd].cloned))
- epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, &ev);
+ if (unlikely(fdtab[fd].cloned)) {
+ unsigned long m = fdtab[fd].thread_mask;
+ int i;
+
+ for (i = global.nbthread - 1; i >= 0; i--)
+ if (m & (1UL << i))
+ epoll_ctl(epoll_fd[i], EPOLL_CTL_DEL, fd, &ev);
+ }
}

/*
@@ -82,34 +88,36 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
fdtab[fd].state = en;
HA_SPIN_UNLOCK(FD_LOCK, &fdtab[fd].lock);

- if ((eo ^ en) & FD_EV_POLLED_RW) {
- /* poll status changed */
-
- if ((en & FD_EV_POLLED_RW) == 0) {
+ if (fdtab[fd].polled_mask & tid_bit) {
+ if (!(fdtab[fd].thread_mask & tid_bit) || !(en & FD_EV_POLLED_RW)) {
/* fd removed from poll list */
opcode = EPOLL_CTL_DEL;
- }
- else if ((eo & FD_EV_POLLED_RW) == 0) {
- /* new fd in the poll list */
- opcode = EPOLL_CTL_ADD;
+ HA_ATOMIC_AND(&fdtab[fd].polled_mask, ~tid_bit);
}
else {
/* fd status changed */
opcode = EPOLL_CTL_MOD;
}
+ }
+ else if ((fdtab[fd].thread_mask & tid_bit) && (en & FD_EV_POLLED_RW)) {
+ /* new fd in the poll list */
+ opcode = EPOLL_CTL_ADD;
+ HA_ATOMIC_OR(&fdtab[fd].polled_mask, tid_bit);
+ }
+ else {
+ continue;
+ }

- /* construct the epoll events based on new state */
- ev.events = 0;
- if (en & FD_EV_POLLED_R)
- ev.events |= EPOLLIN | EPOLLRDHUP;
-
- if (en & FD_EV_POLLED_W)
- ev.events |= EPOLLOUT;
+ /* construct the epoll events based on new state */
+ ev.events = 0;
+ if (en & FD_EV_POLLED_R)
+ ev.events |= EPOLLIN | EPOLLRDHUP;

- ev.data.fd = fd;
+ if (en & FD_EV_POLLED_W)
+ ev.events |= EPOLLOUT;

- epoll_ctl(epoll_fd, opcode, fd, &ev);
- }
+ ev.data.fd = fd;
+ epoll_ctl(epoll_fd[tid], opcode, fd, &ev);
}
fd_nbupdt = 0;

@@ -129,7 +137,7 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
/* now let's wait for polled events */

gettimeofday(&before_poll, NULL);
- status = epoll_wait(epoll_fd, epoll_events, global.tune.maxpollevents, wait_time);
+ status = epoll_wait(epoll_fd[tid], epoll_events, global.tune.maxpollevents, wait_time);
tv_update_date(wait_time, status);
measure_idle();

@@ -146,7 +154,10 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
}

if (!(fdtab[fd].thread_mask & tid_bit)) {
+ /* FD has been migrated */
activity[tid].poll_skip++;
+ epoll_ctl(epoll_fd[tid], EPOLL_CTL_DEL, fd, &ev);
+ HA_ATOMIC_AND(&fdtab[fd].polled_mask, ~tid_bit);
continue;
}

@@ -178,14 +189,38 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)

static int init_epoll_per_thread()
{
+ int fd;
+
epoll_events = calloc(1, sizeof(struct epoll_event) * global.tune.maxpollevents);
if (epoll_events == NULL)
- return 0;
+ goto fail_alloc;
+
+ if (tid) {
+ epoll_fd[tid] = epoll_create(global.maxsock + 1);
+ if (epoll_fd[tid] < 0)
+ goto fail_fd;
+ }
+
+ /* we may have to unregister some events initially registered on the
+ * original fd when it was alone, and/or to register events on the new
+ * fd for this thread. Let's just mark them as updated, the poller will
+ * do the rest.
+ */
+ for (fd = 0; fd < maxfd; fd++)
+ updt_fd_polling(fd);
+
return 1;
+ fail_fd:
+ free(epoll_events);
+ fail_alloc:
+ return 0;
}

static void deinit_epoll_per_thread()
{
+ if (tid)
+ close(epoll_fd[tid]);
+
free(epoll_events);
epoll_events = NULL;
}
@@ -199,8 +234,8 @@ REGPRM1 static int _do_init(struct poller *p)
{
p->private = NULL;

- epoll_fd = epoll_create(global.maxsock + 1);
- if (epoll_fd < 0)
+ epoll_fd[tid] = epoll_create(global.maxsock + 1);
+ if (epoll_fd[tid] < 0)
goto fail_fd;

hap_register_per_thread_init(init_epoll_per_thread);
@@ -219,9 +254,9 @@ REGPRM1 static int _do_init(struct poller *p)
*/
REGPRM1 static void _do_term(struct poller *p)
{
- if (epoll_fd >= 0) {
- close(epoll_fd);
- epoll_fd = -1;
+ if (epoll_fd[tid] >= 0) {
+ close(epoll_fd[tid]);
+ epoll_fd[tid] = -1;
}

p->private = NULL;
@@ -251,10 +286,10 @@ REGPRM1 static int _do_test(struct poller *p)
*/
REGPRM1 static int _do_fork(struct poller *p)
{
- if (epoll_fd >= 0)
- close(epoll_fd);
- epoll_fd = epoll_create(global.maxsock + 1);
- if (epoll_fd < 0)
+ if (epoll_fd[tid] >= 0)
+ close(epoll_fd[tid]);
+ epoll_fd[tid] = epoll_create(global.maxsock + 1);
+ if (epoll_fd[tid] < 0)
return 0;
return 1;
}
@@ -268,11 +303,14 @@ __attribute__((constructor))
static void _do_register(void)
{
struct poller *p;
+ int i;

if (nbpollers >= MAX_POLLERS)
return;

- epoll_fd = -1;
+ for (i = 0; i < MAX_THREADS; i++)
+ epoll_fd[i] = -1;
+
p = &pollers[nbpollers++];

p->name = "epoll";
--
2.8.0.rc2.1.gbe9624a
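
The core idea of the patch, one private epoll instance per thread, can be demonstrated outside HAProxy with a short program (Linux only, illustrative names, no error handling): each thread creates its own epoll fd and only registers the file descriptors it is responsible for, so a wakeup on one thread no longer wakes the others.

#include <pthread.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define NB_THREADS 2

static int epoll_fd[NB_THREADS];        /* per-thread poller, as in the patch */

static void *poll_loop(void *arg)
{
	int tid = (int)(long)arg;
	struct epoll_event ev, out;

	/* each thread polls a private eventfd, standing in for its listeners */
	int efd = eventfd(1, 0);        /* start readable so the wait returns */

	ev.events = EPOLLIN;
	ev.data.fd = efd;
	epoll_ctl(epoll_fd[tid], EPOLL_CTL_ADD, efd, &ev);

	int n = epoll_wait(epoll_fd[tid], &out, 1, 100);
	printf("thread %d: epoll_wait on its own fd returned %d event(s)\n", tid, n);

	close(efd);
	return NULL;
}

int main(void)
{
	pthread_t th[NB_THREADS];

	for (long t = 0; t < NB_THREADS; t++) {
		epoll_fd[t] = epoll_create(1024);
		pthread_create(&th[t], NULL, poll_loop, (void *)t);
	}
	for (int t = 0; t < NB_THREADS; t++) {
		pthread_join(th[t], NULL);
		close(epoll_fd[t]);
	}
	return 0;
}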

From e432f5d0cd8dd46b31f169c22c8cb3e433f0c30c Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Fri, 19 Jan 2018 08:56:14 +0100
Subject: BUG/MEDIUM: kqueue/threads: use one kqueue_fd per thread

This is the same principle as the previous patch (BUG/MEDIUM:
epoll/threads: use one epoll_fd per thread) except that this time it's
for kqueue. We don't want one thread to wake up because of activity on
another thread that the first one is not interested in.

After some testing period, this should be backported to 1.8.

(cherry picked from commit f98fa477e159864d55f7ab863649523e46ee9461)
Signed-off-by: Willy Tarreau <[email protected]>
---
src/ev_kqueue.c | 93 +++++++++++++++++++++++++++++++++++----------------------
1 file changed, 58 insertions(+), 35 deletions(-)

diff --git a/src/ev_kqueue.c b/src/ev_kqueue.c
index 20fa290..532d49d 100644
--- a/src/ev_kqueue.c
+++ b/src/ev_kqueue.c
@@ -29,7 +29,7 @@


/* private data */
-static int kqueue_fd;
+static int kqueue_fd[MAX_THREADS]; // per-thread kqueue_fd
static THREAD_LOCAL struct kevent *kev = NULL;

/*
@@ -61,35 +61,34 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)
fdtab[fd].state = en;
HA_SPIN_UNLOCK(FD_LOCK, &fdtab[fd].lock);

- if ((eo ^ en) & FD_EV_POLLED_RW) {
- /* poll status changed */
- if ((eo ^ en) & FD_EV_POLLED_R) {
- /* read poll status changed */
- if (en & FD_EV_POLLED_R) {
- EV_SET(&kev[changes], fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
- changes++;
- }
- else {
- EV_SET(&kev[changes], fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
- changes++;
- }
+ if (!(fdtab[fd].thread_mask & tid_bit) || !(en & FD_EV_POLLED_RW)) {
+ if (!(fdtab[fd].polled_mask & tid_bit)) {
+ /* fd was not watched, it's still not */
+ continue;
}
+ /* fd totally removed from poll list */
+ EV_SET(&kev[changes++], fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
+ EV_SET(&kev[changes++], fd, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);
+ HA_ATOMIC_AND(&fdtab[fd].polled_mask, ~tid_bit);
+ }
+ else {
+ /* OK fd has to be monitored, it was either added or changed */

- if ((eo ^ en) & FD_EV_POLLED_W) {
- /* write poll status changed */
- if (en & FD_EV_POLLED_W) {
- EV_SET(&kev[changes], fd, EVFILT_WRITE, EV_ADD, 0, 0, NULL);
- changes++;
- }
- else {
- EV_SET(&kev[changes], fd, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);
- changes++;
- }
- }
+ if (en & FD_EV_POLLED_R)
+ EV_SET(&kev[changes++], fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
+ else if (fdtab[fd].polled_mask & tid_bit)
+ EV_SET(&kev[changes++], fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
+
+ if (en & FD_EV_POLLED_W)
+ EV_SET(&kev[changes++], fd, EVFILT_WRITE, EV_ADD, 0, 0, NULL);
+ else if (fdtab[fd].polled_mask & tid_bit)
+ EV_SET(&kev[changes++], fd, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);
+
+ HA_ATOMIC_OR(&fdtab[fd].polled_mask, tid_bit);
}
}
if (changes)
- kevent(kqueue_fd, kev, changes, NULL, 0, NULL);
+ kevent(kqueue_fd[tid], kev, changes, NULL, 0, NULL);
fd_nbupdt = 0;

delta_ms = 0;
@@ -113,7 +112,7 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)

fd = MIN(maxfd, global.tune.maxpollevents);
gettimeofday(&before_poll, NULL);
- status = kevent(kqueue_fd, // int kq
+ status = kevent(kqueue_fd[tid], // int kq
NULL, // const struct kevent *changelist
0, // int nchanges
kev, // struct kevent *eventlist
@@ -155,11 +154,32 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)

static int init_kqueue_per_thread()
{
+ int fd;
+
/* we can have up to two events per fd (*/
kev = calloc(1, sizeof(struct kevent) * 2 * global.maxsock);
if (kev == NULL)
- return 0;
+ goto fail_alloc;
+
+ if (tid) {
+ kqueue_fd[tid] = kqueue();
+ if (kqueue_fd[tid] < 0)
+ goto fail_fd;
+ }
+
+ /* we may have to unregister some events initially registered on the
+ * original fd when it was alone, and/or to register events on the new
+ * fd for this thread. Let's just mark them as updated, the poller will
+ * do the rest.
+ */
+ for (fd = 0; fd < maxfd; fd++)
+ updt_fd_polling(fd);
+
return 1;
+ fail_fd:
+ free(kev);
+ fail_alloc:
+ return 0;
}

static void deinit_kqueue_per_thread()
@@ -177,8 +197,8 @@ REGPRM1 static int _do_init(struct poller *p)
{
p->private = NULL;

- kqueue_fd = kqueue();
- if (kqueue_fd < 0)
+ kqueue_fd[tid] = kqueue();
+ if (kqueue_fd[tid] < 0)
goto fail_fd;

hap_register_per_thread_init(init_kqueue_per_thread);
@@ -196,9 +216,9 @@ REGPRM1 static int _do_init(struct poller *p)
*/
REGPRM1 static void _do_term(struct poller *p)
{
- if (kqueue_fd >= 0) {
- close(kqueue_fd);
- kqueue_fd = -1;
+ if (kqueue_fd[tid] >= 0) {
+ close(kqueue_fd[tid]);
+ kqueue_fd[tid] = -1;
}

p->private = NULL;
@@ -227,8 +247,8 @@ REGPRM1 static int _do_test(struct poller *p)
*/
REGPRM1 static int _do_fork(struct poller *p)
{
- kqueue_fd = kqueue();
- if (kqueue_fd < 0)
+ kqueue_fd[tid] = kqueue();
+ if (kqueue_fd[tid] < 0)
return 0;
return 1;
}
@@ -242,11 +262,14 @@ __attribute__((constructor))
static void _do_register(void)
{
struct poller *p;
+ int i;

if (nbpollers >= MAX_POLLERS)
return;

- kqueue_fd = -1;
+ for (i = 0; i < MAX_THREADS; i++)
+ kqueue_fd[i] = -1;
+
p = &pollers[nbpollers++];

p->name = "kqueue";
--
2.8.0.rc2.1.gbe9624a
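
The kqueue equivalent follows the same pattern; a minimal standalone example (BSD/macOS, illustrative only) of the register-then-poll cycle on a private queue, which in HAProxy becomes kqueue_fd[tid]:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int kq = kqueue();                      /* one such queue per thread */
	int fds[2];
	struct kevent change, result;
	struct timespec ts = { 0, 0 };

	pipe(fds);
	write(fds[1], "x", 1);                  /* make the read end readable */

	EV_SET(&change, fds[0], EVFILT_READ, EV_ADD, 0, 0, NULL);
	kevent(kq, &change, 1, NULL, 0, NULL);          /* register */

	int n = kevent(kq, NULL, 0, &result, 1, &ts);   /* poll without blocking */
	printf("kevent returned %d event(s) on this queue\n", n);

	close(fds[0]); close(fds[1]); close(kq);
	return 0;
}
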
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 22, 2018 06:20PM
On Mon, Jan 22, 2018 at 05:47:55PM +0100, Willy Tarreau wrote:
> > strace: Process 12166 attached
> > [pid 12166] set_robust_list(0x7ff9bc9aa9e0, 24 <unfinished ...>
> > [pid 12166] <... set_robust_list resumed> ) = 0
> > [pid 12166] gettimeofday({1516289044, 684014}, NULL) = 0
> > [pid 12166] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
> > [pid 12166] <... mmap resumed> ) = 0x7ff9ac000000
> > [pid 12166] munmap(0x7ff9b0000000, 67108864) = 0
> > [pid 12166] mprotect(0x7ff9ac000000, 135168, PROT_READ|PROT_WRITE <unfinished ...>
> > [pid 12166] <... mprotect resumed> ) = 0
> > [pid 12166] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> > [pid 12166] <... mmap resumed> ) = 0x7ff9baa65000
> > [pid 12166] close(16 <unfinished ...>
> > [pid 12166] <... close resumed> ) = 0
> > [pid 12166] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
> > [pid 12166] <... fcntl resumed> ) = 0
>
> Here it's getting obvious that it was a shared file descriptor :-(

So I have a suspect here :

- run_thread_poll_loop() runs after the threads are created
- first thing it does is to close the master-worker pipe FD :

(...)
if (global.mode & MODE_MWORKER)
mworker_pipe_register(mworker_pipe);
(...)

void mworker_pipe_register(int pipefd[2])
{
close(mworker_pipe[1]); /* close the write end of the master pipe in the children */
fcntl(mworker_pipe[0], F_SETFL, O_NONBLOCK);
(...)
}

Looks familiar with the trace above ?

So I guess your config works in master-worker mode, am I right ?

Note that I'm bothered with the call to protocol_enable_all() as
well in this function since it will start the proxies multiple times
in a possibly unsafe mode. That may explain a lot of things suddenly!

I think the attached patch works around it, but I'd like your
confirmation before cleaning it up.

Thanks,
Willy

diff --git a/src/haproxy.c b/src/haproxy.c
index 20b18f8..66639fc 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -2339,7 +2339,11 @@ void mworker_pipe_handler(int fd)

void mworker_pipe_register(int pipefd[2])
{
+ if (mworker_pipe[1] < 0)
+ return;
+
close(mworker_pipe[1]); /* close the write end of the master pipe in the children */
+ mworker_pipe[1] = -1;

fcntl(mworker_pipe[0], F_SETFL, O_NONBLOCK);
fdtab[mworker_pipe[0]].owner = mworker_pipe;
@@ -2408,6 +2412,7 @@ static void *run_thread_poll_loop(void *data)
{
struct per_thread_init_fct *ptif;
struct per_thread_deinit_fct *ptdf;
+ static __maybe_unused HA_SPINLOCK_T start_lock;

tid = *((unsigned int *)data);
tid_bit = (1UL << tid);
@@ -2420,10 +2425,12 @@ static void *run_thread_poll_loop(void *data)
}
}

+ HA_SPIN_LOCK(LISTENER_LOCK, &start_lock);
if (global.mode & MODE_MWORKER)
mworker_pipe_register(mworker_pipe);

protocol_enable_all();
+ HA_SPIN_UNLOCK(LISTENER_LOCK, &start_lock);
THREAD_SYNC_ENABLE();
run_poll_loop();
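
The shape of the workaround is generic: when every thread runs the same startup path, one-shot actions such as closing the master pipe's write end must be made idempotent and serialized. A minimal sketch of that pattern (hypothetical names, not the HAProxy code itself):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int worker_pipe[2] = { -1, -1 };
static pthread_mutex_t start_lock = PTHREAD_MUTEX_INITIALIZER;

static void register_worker_pipe(void)
{
	pthread_mutex_lock(&start_lock);
	if (worker_pipe[1] >= 0) {              /* only the first thread does it */
		close(worker_pipe[1]);          /* close the write end in the worker */
		worker_pipe[1] = -1;            /* make further calls harmless */
	}
	pthread_mutex_unlock(&start_lock);
}

static void *thread_start(void *arg)
{
	(void)arg;
	register_worker_pipe();                 /* safe to call from every thread */
	return NULL;
}

int main(void)
{
	pthread_t th[4];

	if (pipe(worker_pipe) != 0)
		return 1;
	for (int t = 0; t < 4; t++)
		pthread_create(&th[t], NULL, thread_start, NULL);
	for (int t = 0; t < 4; t++)
		pthread_join(th[t], NULL);

	printf("write end now: %d (closed exactly once)\n", worker_pipe[1]);
	close(worker_pipe[0]);
	return 0;
}
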
Marc Fournier
Re: [BUG] 100% cpu on each threads
January 23, 2018 05:10PM
Hello,

Thanks for your answer Willy !

On Mon, Jan 22, 2018 at 05:47:55PM +0100, Willy Tarreau wrote:
> Hi Marc,
>
> On Mon, Jan 22, 2018 at 03:18:20PM +0100, Marc Fournier wrote:
> > Cyril Bonté <[email protected]> writes:
> >
> > Hello,
> >
> > > Im' not sure you saw Samuel Reed's mail.
> > > He reported a similar issue some hours ago (High load average under
> > > 1.8 with multiple draining processes). It would be interesting to find
> > > a common configuration to reproduce the issue, so I add him to the thread.
> >
> > I've been observing the same error messages Emmanuel reports, using
> > 1.8.3 and nbthread. I tried to simplify & anonymize my configuration so
> > I could share a version which reproduces the problem, but didn't
> > succeed: the problem disappears at some point in the process, and I'm
> > unable to figure out exactly which change makes the difference :-(
>
> We've done some work over the week-end to address an issue related to
> FDs and threads : in short, it's impossible to let a thread sleep when
> there's some activity on another one because they share the same epoll_fd.
>
> We've sent Samuel a copy of patches to test (I'm attaching them here in
> case you're interested to try as well, to add on top of latest 1.8, though
> 1.8.3 will be OK). Since you seem to be able to reproduce the bug on a
> full config, you may be tempted to try them.

Ok, thanks ! I built 1.8.3 with only this patchset added to it, but the
issue still occurred. strace shows a similar sequence of actions, with slight
differences. Example:

6629 12:34:38.312873 set_robust_list(0x7f4a21c449e0, 24 <unfinished ...>
6434 12:34:38.312900 munmap(0x7f4a22446000, 8003584 <unfinished ...>
6629 12:34:38.312920 <... set_robust_list resumed> ) = 0
6629 12:34:38.312941 gettimeofday( <unfinished ...>
6630 12:34:38.312965 <... epoll_create resumed> ) = 19
6434 12:34:38.312999 <... munmap resumed> ) = 0
6629 12:34:38.313020 <... gettimeofday resumed> {1516707278, 313013}, NULL) = 0
6629 12:34:38.313055 mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
6630 12:34:38.313095 close(16 <unfinished ...>
6629 12:34:38.313118 <... mmap resumed> ) = 0x7f4a1385e000
6630 12:34:38.313172 <... close resumed> ) = 0
6434 12:34:38.313196 munmap(0x7f4a22be8000, 32006144 <unfinished ...>
6629 12:34:38.313220 munmap(0x7f4a1385e000, 8003584) = 0
6434 12:34:38.313265 <... munmap resumed> ) = 0
6629 12:34:38.313287 munmap(0x7f4a18000000, 59105280 <unfinished ...>
6434 12:34:38.313308 munmap(0x7f4a24a6e000, 112009216 <unfinished ...>
6629 12:34:38.313330 <... munmap resumed> ) = 0
6434 12:34:38.313839 <... munmap resumed> ) = 0
6629 12:34:38.313870 mprotect(0x7f4a14000000, 135168, PROT_READ|PROT_WRITE) = 0
6630 12:34:38.313911 fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
6629 12:34:38.313959 mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
6630 12:34:38.313985 <... fcntl resumed> ) = 0
6630 12:34:38.314010 gettimeofday({1516707278, 314022}, NULL) = 0
6630 12:34:38.314070 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP <unfinished ...>
6629 12:34:38.314096 <... mmap resumed> ) = 0x7f4a1b0bc000
6629 12:34:38.314126 epoll_create(2000128 <unfinished ...>
6630 12:34:38.314158 <... socket resumed> ) = 16
6629 12:34:38.314182 <... epoll_create resumed> ) = 20
6629 12:34:38.314203 close(16 <unfinished ...>
6630 12:34:38.314226 fcntl(16, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
6629 12:34:38.314256 <... close resumed> ) = 0
6630 12:34:38.314278 <... fcntl resumed> ) = -1 EBADF (Bad file descriptor)
6630 12:34:38.314303 close(16 <unfinished ...>
6629 12:34:38.314324 fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
6630 12:34:38.314344 <... close resumed> ) = -1 EBADF (Bad file descriptor)
6629 12:34:38.314365 <... fcntl resumed> ) = 0


But HAProxy still isn't able to recover from the mayhem it created, in this
case.

> > - when exposed to client requests, I only observed high CPU load on one
> > instance out of the three I have, which receded after a restart of
> > haproxy. When working in isolation (no client requests), I never
> > noticed high CPU load.
>
> So this could indicate an uncaught error on a specific fd. A "show fd"
> on the CLI may give some useful information about this. And the patches
> above also add "show activity", to run twice one second apart, and which
> will indicate the various causes of wakeups.

Attached to this email, you have the output of five sequential
"show fd;show activity" with a one second sleep between each. The first one
was taken just before sending SIGUSR2 to HAProxy.

> > # thread 12166 gets created, and closes an fd it didn't create, which
> > # happens to be the socket opened to the DNS server:
> >
> > strace: Process 12166 attached
> > [pid 12167] sendto(16, "\316\352\1\0\0\1\0\0\0\0\0\1\nprivate-25\10backends"..., 79, 0, NULL, 0 <unfinished ...>
> > [pid 12166] set_robust_list(0x7ff9bc9aa9e0, 24 <unfinished ...>
> > [pid 12167] <... sendto resumed> ) = 79
> > [pid 12166] <... set_robust_list resumed> ) = 0
> > [pid 12166] gettimeofday({1516289044, 684014}, NULL) = 0
> > [pid 12167] sendto(16, "\37\367\1\0\0\1\0\0\0\0\0\1\nprivate-26\10backends"..., 79, 0, NULL, 0 <unfinished ...>
> > [pid 12166] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
> > [pid 12167] <... sendto resumed> ) = 79
> > [pid 12167] sendto(16, "\224\10\1\0\0\1\0\0\0\0\0\1\nprivate-27\10backends"..., 79, 0, NULL, 0) = 79
> > [pid 12166] <... mmap resumed> ) = 0x7ff9ac000000
> > [pid 12167] sendto(16, "\25 \1\0\0\1\0\0\0\0\0\1\nprivate-28\10backends"..., 79, 0, NULL, 0) = 79
> > [pid 12166] munmap(0x7ff9b0000000, 67108864) = 0
> > [pid 12167] sendto(16, "\275\n\1\0\0\1\0\0\0\0\0\1\nprivate-29\10backends"..., 79, 0, NULL, 0 <unfinished ...>
> > [pid 12166] mprotect(0x7ff9ac000000, 135168, PROT_READ|PROT_WRITE <unfinished ...>
> > [pid 12167] <... sendto resumed> ) = 79
> > [pid 12166] <... mprotect resumed> ) = 0
> > [pid 12167] sendto(16, "\312\360\1\0\0\1\0\0\0\0\0\1\nprivate-30\10backends"..., 79, 0, NULL, 0 <unfinished ...>
> > [pid 12166] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> > [pid 12167] <... sendto resumed> ) = 79
> > [pid 12167] sendto(16, "\247e\1\0\0\1\0\0\0\0\0\1\nprivate-31\10backends"..., 79, 0, NULL, 0) = 79
> > [pid 12166] <... mmap resumed> ) = 0x7ff9baa65000
> > [pid 12167] sendto(16, "_k\1\0\0\1\0\0\0\0\0\1\tprivate-0\nbackoffic"..., 80, 0, NULL, 0 <unfinished ...>
> > [pid 12166] close(16 <unfinished ...>
>
> Wow, that's scary! Are you certain it's not the parent process and that
> it's really a thread ? you may see a clone() syscall in 12166 giving birth
> to 12167. If they are different processes, that can be fine, but ...

Yes, the options to strace I used there don't print out the pid of the
master process, only those of sub-processes (and threads). This is how
things look like when reload occurs after receiving SIGUSR2:

kill(48, SIGUSR1) = 0
getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
getpid() = 1
write(2, "[WARNING] 017/152404 (1) : [hapr"..., 172) = 172
getpid() = 1
write(-1, "1\n", 2) = -1 EBADF (Bad file descriptor)
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff9ca196c50) = 57
strace: Process 12164 attached
[pid 12164] set_robust_list(0x7ff9ca196c60, 24 <unfinished ...>
[pid 11084] close(4) = 0
[pid 12164] <... set_robust_list resumed> ) = 0
[pid 11084] close(5) = 0
[pid 11084] close(7) = 0
[pid 12164] open("/dev/null", O_RDWR <unfinished ...>
[pid 11084] close(8) = 0
[pid 12164] <... open resumed> ) = 17
[pid 11084] close(9) = 0
[pid 12164] getgroups(0, <unfinished ...>


> > [pid 12164] getpid() = 57
> > [pid 12164] getpid() = 57
>
> Pids are collected for logging. Strange that the libc didn't cache
> the getpid()'s result by the way, usually it does. Thinking about it,
> getpid()==57 is a bit surprising, maybe it's not really a getpid in
> fact.

Sorry I didn't mention this in my last email: HAProxy is running in a
docker container, but I'm running strace from the host. The pid namespace
isolation is probably what causes this pid number mismatch.

I don't have any explanation for the repeated calls to getpid() though.

I'll reply to your other email in a moment.

Thanks !
Marc

4 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x55e0a118fa80 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=GLOBAL
5 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x55e0a1192780 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=https_redirection
7 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x55e0a1194840 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=entrypoint
8 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x55e0a1198d80 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=app_internal
9 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x55e0a119bb50 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=sorry_internal
10 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x55e0a119c340 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_bypass_varnish
11 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x55e0a11f1db0 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_blue
12 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x55e0a121ab50 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_green
13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x55e0a1565e80 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=stats
14 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x55e0a1567f50 iocb=0x55e0a0910890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=healthcheck
15 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlc] cache=0 owner=0x55e0a0baca38 iocb=0x55e0a08dfed0(unknown) tmask=0xffffffffffffffff umask=0x0
16 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [Nlc] cache=0 owner=0x7f0310013130 iocb=0x55e0a0923d40(conn_fd_handler) tmask=0x4 umask=0x8 cflg=0x00241300 fe=GLOBAL
21 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x7f03180124c0 iocb=0x55e0a0923d40(conn_fd_handler) tmask=0x8 umask=0x0 cflg=0x80201306 fe=healthcheck

thread_id: 2
date_now: 1516707278.042397
loops: 1879 39538 40174 24107064
wake_cache: 84 164 203 996
wake_tasks: 9 9 10 1767
wake_applets: 2 37569 38160 6
wake_signal: 0 0 0 0
poll_exp: 95 37742 38373 2769
poll_drop: 7 8 8 20
poll_dead: 0 0 0 0
poll_skip: 0 0 0 0
fd_skip: 28 59 85 1705
fd_lock: 0 2 2 13
fd_del: 14 24 23 11
conn_dead: 0 0 0 0
stream: 35 37654 38244 24
empty_rq: 80 130 177 24105239
long_rq: 0 0 0 0

4 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x5587cd0a7ad0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=GLOBAL
5 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0aa7d0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=https_redirection
7 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0ac890 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=entrypoint
8 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b0dd0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=app_internal
9 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b3ba0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=sorry_internal
10 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b4390 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_bypass_varnish
11 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd109e30 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_blue
12 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd132ba0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_green
13 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd47ded0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=stats
14 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd47ffa0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=healthcheck
15 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlc] cache=0 owner=0x5587cb83aa38 iocb=0x5587cb56ded0(unknown) tmask=0xffffffffffffffff umask=0x0
16 : st=0x25(R:PrA W:pRa) ev=0x04(heOpi) [nlc] cache=0 owner=0x5587cd4977d0 iocb=0x5587cb5ce2b0(dgram_fd_handler) tmask=0xffffffffffffffff umask=0x0
21 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [Nlc] cache=0 owner=0x7f4a1c0122f0 iocb=0x5587cb5b1d40(conn_fd_handler) tmask=0x4 umask=0x0 cflg=0x00241300 fe=GLOBAL

thread_id: 2
date_now: 1516707279.023984
loops: 33 5526 28 29
wake_cache: 2 1 2 2
wake_tasks: 2 28 0 0
wake_applets: 0 0 0 0
wake_signal: 0 0 0 0
poll_exp: 4 29 2 2
poll_drop: 6 6 6 6
poll_dead: 0 0 0 0
poll_skip: 0 0 0 0
fd_skip: 1 1 0 1
fd_lock: 0 0 0 0
fd_del: 0 0 1 0
conn_dead: 0 0 0 0
stream: 0 0 1 0
empty_rq: 3 5497 0 2
long_rq: 0 0 0 0

4 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x5587cd0a7ad0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=GLOBAL
5 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0aa7d0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=https_redirection
7 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0ac890 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=entrypoint
8 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b0dd0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=app_internal
9 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b3ba0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=sorry_internal
10 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b4390 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_bypass_varnish
11 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd109e30 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_blue
12 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd132ba0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_green
13 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd47ded0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=stats
14 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x5587cd47ffa0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=healthcheck
15 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlc] cache=0 owner=0x5587cb83aa38 iocb=0x5587cb56ded0(unknown) tmask=0xffffffffffffffff umask=0x0
16 : st=0x25(R:PrA W:pRa) ev=0x04(heOpi) [nlc] cache=0 owner=0x5587cd4977d0 iocb=0x5587cb5ce2b0(dgram_fd_handler) tmask=0xffffffffffffffff umask=0x0
21 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x7f4a1c0122f0 iocb=0x5587cb5b1d40(conn_fd_handler) tmask=0x8 umask=0x0 cflg=0x00201306 fe=GLOBAL

thread_id: 3
date_now: 1516707280.047117
loops: 83 13949 83 83
wake_cache: 7 4 10 9
wake_tasks: 2 69 0 1
wake_applets: 0 0 1 1
wake_signal: 0 0 0 0
poll_exp: 9 73 11 11
poll_drop: 6 6 6 6
poll_dead: 0 0 0 0
poll_skip: 0 0 0 0
fd_skip: 1 3 6 7
fd_lock: 0 0 0 0
fd_del: 2 0 1 1
conn_dead: 0 0 0 0
stream: 4 0 2 1
empty_rq: 7 13878 12 13
long_rq: 0 0 0 0

4 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x5587cd0a7ad0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=GLOBAL
5 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0aa7d0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=https_redirection
7 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0ac890 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=entrypoint
8 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b0dd0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=app_internal
9 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b3ba0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=sorry_internal
10 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b4390 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_bypass_varnish
11 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd109e30 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_blue
12 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd132ba0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_green
13 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd47ded0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=stats
14 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x5587cd47ffa0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=healthcheck
15 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlc] cache=0 owner=0x5587cb83aa38 iocb=0x5587cb56ded0(unknown) tmask=0xffffffffffffffff umask=0x0
16 : st=0x25(R:PrA W:pRa) ev=0x04(heOpi) [nlc] cache=0 owner=0x5587cd4977d0 iocb=0x5587cb5ce2b0(dgram_fd_handler) tmask=0xffffffffffffffff umask=0x0
21 : st=0x26(R:PRa W:pRa) ev=0x10(Heopi) [nlc] cache=0 owner=0x7f4a1c0122f0 iocb=0x5587cb5b1d40(conn_fd_handler) tmask=0x8 umask=0x8 cflg=0x00241300 fe=GLOBAL

thread_id: 3
date_now: 1516707281.058964
loops: 103 24168 133 137
wake_cache: 11 7 13 14
wake_tasks: 3 113 0 1
wake_applets: 0 0 1 2
wake_signal: 0 0 0 0
poll_exp: 14 120 14 17
poll_drop: 6 6 6 7
poll_dead: 0 0 0 0
poll_skip: 0 0 0 0
fd_skip: 4 3 6 10
fd_lock: 0 0 0 0
fd_del: 2 0 2 2
conn_dead: 0 0 0 0
stream: 4 0 4 3
empty_rq: 12 24051 15 20
long_rq: 0 0 0 0

4 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x5587cd0a7ad0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=GLOBAL
5 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0aa7d0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=https_redirection
7 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0ac890 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=entrypoint
8 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b0dd0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=app_internal
9 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b3ba0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=sorry_internal
10 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd0b4390 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_bypass_varnish
11 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd109e30 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_blue
12 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd132ba0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=backdoor_green
13 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlC] cache=0 owner=0x5587cd47ded0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=stats
14 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [nlC] cache=0 owner=0x5587cd47ffa0 iocb=0x5587cb59e890(listener_accept) tmask=0xffffffffffffffff umask=0x0 l.st=RDY fe=healthcheck
15 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [nlc] cache=0 owner=0x5587cb83aa38 iocb=0x5587cb56ded0(unknown) tmask=0xffffffffffffffff umask=0x0
16 : st=0x25(R:PrA W:pRa) ev=0x04(heOpi) [nlc] cache=0 owner=0x5587cd4977d0 iocb=0x5587cb5ce2b0(dgram_fd_handler) tmask=0xffffffffffffffff umask=0x0
21 : st=0x26(R:PRa W:pRa) ev=0x10(Heopi) [nlc] cache=0 owner=0x7f4a1c0122f0 iocb=0x5587cb5b1d40(conn_fd_handler) tmask=0x8 umask=0x8 cflg=0x00241300 fe=GLOBAL

thread_id: 3
date_now: 1516707282.078082
loops: 117 34766 177 186
wake_cache: 12 7 16 20
wake_tasks: 3 153 0 1
wake_applets: 0 0 1 3
wake_signal: 0 0 0 0
poll_exp: 15 160 17 24
poll_drop: 6 6 6 8
poll_dead: 0 0 0 0
poll_skip: 0 0 0 0
fd_skip: 5 6 8 10
fd_lock: 0 0 0 0
fd_del: 2 0 2 5
conn_dead: 0 0 0 0
stream: 4 0 4 9
empty_rq: 14 34609 20 25
long_rq: 0 0 0 0
Marc Fournier
Re: [BUG] 100% cpu on each threads
January 23, 2018 05:20PM
On Mon, Jan 22, 2018 at 06:16:12PM +0100, Willy Tarreau wrote:
> On Mon, Jan 22, 2018 at 05:47:55PM +0100, Willy Tarreau wrote:
> > > strace: Process 12166 attached
> > > [pid 12166] set_robust_list(0x7ff9bc9aa9e0, 24 <unfinished ...>
> > > [pid 12166] <... set_robust_list resumed> ) = 0
> > > [pid 12166] gettimeofday({1516289044, 684014}, NULL) = 0
> > > [pid 12166] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
> > > [pid 12166] <... mmap resumed> ) = 0x7ff9ac000000
> > > [pid 12166] munmap(0x7ff9b0000000, 67108864) = 0
> > > [pid 12166] mprotect(0x7ff9ac000000, 135168, PROT_READ|PROT_WRITE <unfinished ...>
> > > [pid 12166] <... mprotect resumed> ) = 0
> > > [pid 12166] mmap(NULL, 8003584, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> > > [pid 12166] <... mmap resumed> ) = 0x7ff9baa65000
> > > [pid 12166] close(16 <unfinished ...>
> > > [pid 12166] <... close resumed> ) = 0
> > > [pid 12166] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
> > > [pid 12166] <... fcntl resumed> ) = 0
> >
> > Here it's getting obvious that it was a shared file descriptor :-(
>
> So I have a suspect here :
>
> - run_thread_poll_loop() runs after the threads are created
> - first thing it does is to close the master-worker pipe FD :
>
> (...)
> if (global.mode & MODE_MWORKER)
> mworker_pipe_register(mworker_pipe);
> (...)
>
> void mworker_pipe_register(int pipefd[2])
> {
> close(mworker_pipe[1]); /* close the write end of the master pipe in the children */
> fcntl(mworker_pipe[0], F_SETFL, O_NONBLOCK);
> (...)
> }
>
> Looks familiar with the trace above ?
>
> So I guess your config works in master-worker mode, am I right ?

Exactly, I'm running HAProxy with the "-W -db" flags.

> Note that I'm bothered with the call to protocol_enable_all() as
> well in this function since it will start the proxies multiple times
> in a possibly unsafe mode. That may explain a lot of things suddenly!
>
> I think the attached patch works around it, but I'd like your
> confirmation before cleaning it up.

I applied this single patch on top of 1.8.3, and indeed this seems much
better ! The servers are eventually in an UP state after reloading.

"grep -c EBADF" on yesterday's strace logfile returned around 900 matches.
18 matches with the patchset from your other email, and only 1 with this
patch:

write(-1, "1\n", 2) = -1 EBADF (Bad file descriptor)


There still are some "Socket error" in the logs though, and I noticed some
servers (not all) go DOWN for a couple of seconds just after reload because
of this, before coming up again 4 seconds later (I use "inter 2000 rise 2").
But at least the system recovers properly from this situation and seems to
stay stable afterwards.

I can't see any obvious close() in the strace log which would be causing
trouble (un)fortunately. I'll send the whole log to you privately, so you
can have a look.


26356 11:50:48.780107 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 23 10:50:48 ", 21}, {"haproxy", 17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"toto_blue/toto_blue0 changed its IP from to 10.10.4.45 by haproxy-dns/dns-server.", 82}, {"\n", 1}], msg_controllen=0, msg_
flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 127
26356 11:50:48.780244 getpid() = 10
26356 11:50:48.780307 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 23 10:50:48 ", 21}, {"haproxy", 17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue0 administratively READY thanks to valid DNS answer.", 78}, {"\n", 1}], msg_controllen=0, msg_flag
s=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 123
26356 11:50:48.780626 getpid() = 10

[...]

26358 11:50:48.818887 recvfrom(16, "L\324\201\203\0\1\0\0\0\1\0\1\tprivate-1\nbackoffice\4stag\16infrastructure\4toto\5tld\0\0\34\0\1\300&\0\6\0\1\0\0\0\27\0I\7ns-1155\tawsdns-16\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34 \0\0\3\204\0\22u\0\0\1Q\200\0\0)\20\0\0\0\0\0\0\0", 513, 0, NULL, NULL) = 165
26358 11:50:48.818981 recvfrom(16, "?M\201\203\0\1\0\0\0\1\0\1\tprivate-2\nbackoffice\4stag\16infrastructure\4toto\5tld\0\0\34\0\1\300&\0\6\0\1\0\0\0008\0I\7ns-1155\tawsdns-16\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34 \0\0\3\204\0\22u\0\0\1Q\200\0\0)\20\0\0\0\0\0\0\0", 513, 0, NULL, NULL) = 165
26358 11:50:48.819064 recvfrom(16, "Ls\201\203\0\1\0\0\0\1\0\1\tprivate-3\nbackoffice\4stag\16infrastructure\4toto\5tld\0\0\34\0\1\300&\0\6\0\1\0\0\0\27\0I\7ns-1155\tawsdns-16\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34 \0\0\3\204\0\22u\0\0\1Q\200\0\0)\20\0\0\0\0\0\0\0", 513, 0, NULL, NULL) = 165
26358 11:50:48.819146 recvfrom(16, 0x7f05f1433c10, 513, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
26356 11:50:48.820330 getpid() = 10
26356 11:50:48.820410 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"}, msg_iov(8)=[{"<145>Jan 23 10:50:48 ", 21}, {"haproxy", 17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue0 is DOWN, reason: Socket error, check duration: 0ms. 31 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.", 169}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 214
26356 11:50:48.820591 getpid() = 10
26356 11:50:48.820662 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"}, msg_iov(8)=[{"<145>Jan 23 10:50:48 ", 21}, {"haproxy", 17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue1 is DOWN, reason: Socket error, check duration: 0ms. 30 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.", 169}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 214
26356 11:50:48.820802 getpid() = 10

[...]

26357 11:50:52.704807 sendmsg(19, {msg_name(110)={sa_family=AF_LOCAL, sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 23 10:50:52 ", 21}, {"haproxy", 17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue0 is UP, reason: Layer7 check passed, code: 200, info: \"OK\", check duration: 16ms. 1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.", 185}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 230


Thanks again for your help!

Marc
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 23, 2018 07:00PM
Hi Marc,

On Tue, Jan 23, 2018 at 05:07:26PM +0100, Marc Fournier wrote:
> > Note that I'm bothered with the call to protocol_enable_all() as
> > well in this function since it will start the proxies multiple times
> > in a possibly unsafe mode. That may explain a lot of things suddenly!
> >
> > I think the attached patch works around it, but I'd like your
> > confirmation before cleaning it up.
>
> I applied this single patch on top of 1.8.3, and indeed this seems much
> better! The servers are eventually in an UP state after reloading.

Thanks very much for your useful feedback.

> "grep -c EBADF" on yesterday's strace logfile returned around 900 matches.

Definitely not good :-/

> 18 matches with the patchset from your other email, and only 1 with this
> patch:
>
> write(-1, "1\n", 2) = -1 EBADF (Bad file descriptor)

Seems to me we've left an "fddebug()" call somewhere, though I can't
find it.

> There are still some "Socket error" entries in the logs though, and I noticed
> some servers (not all) go DOWN for a couple of seconds just after reload
> because of this, before coming back up 4 seconds later (I use "inter 2000 rise 2").
> But at least the system recovers properly from this situation and seems to
> stay stable afterwards.

Then it could be related to something else, I prefer that we analyse
issues one at a time.

> I can't see any obvious close() in the strace log which would be causing
> trouble (un)fortunately. I'll send the whole log to you privately, so you
> can have a look.

Thank you, I've received it, I'll check.

I'm going to polish up the patch you tested. Now I see how to reproduce
the random behaviour (it's always easier to try to break a program when
you have the patch to fix it).

Thanks!
Willy
Willy Tarreau
Re: [BUG] 100% cpu on each threads
January 23, 2018 07:40PM
On Tue, Jan 23, 2018 at 06:49:03PM +0100, Willy Tarreau wrote:
> > write(-1, "1\n", 2) = -1 EBADF (Bad file descriptor)
>
> Seems to me we've left an "fddebug()" call somewhere, though I can't
> find it.

Found it, it's the pidfile. In your docker, haproxy is pid 1 :-)
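
As a standalone illustration (not HAProxy's actual pidfile code; the -1
descriptor is simply the assumption that matches the trace), writing the pid
of a PID-1 process to an invalid descriptor produces exactly that line:

/* Illustration only, not HAProxy's code: if the pidfile descriptor held
 * by the process is -1 (never opened, or already closed), writing the
 * pid of a container's PID-1 process reproduces the trace above:
 *     write(-1, "1\n", 2) = -1 EBADF (Bad file descriptor)
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int pidfile_fd = -1;   /* assumed: no valid pidfile fd in this process */
    char buf[16];
    int len = snprintf(buf, sizeof(buf), "%d\n", (int)getpid()); /* "1\n" when PID 1 */

    if (write(pidfile_fd, buf, len) < 0)
        printf("write: %s\n", strerror(errno));   /* prints: Bad file descriptor */
    return 0;
}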

> I'm going to polish up the patch you tested. Now I see how to reproduce
> the random behaviour (it's always easier to try to break a program when
> you have the patch to fix it).

OK, I've pushed all pending fixes into the 1.8 branch, so you can use
the nightly snapshot tomorrow morning if you want, or clone from the
1.8 git now as you prefer.

Thanks,
Willy