High load average under 1.8 with multiple draining processes

Posted by Samuel Reed 
We've recently upgraded to HAProxy 1.8.3, which we run with `nbthread 4`
(we used to run nbproc 4 with older releases). This has generally been
good, especially for stick tables & stats.

We terminate SSL and proxy a large number of long-running TCP
connections (websockets). When configuration changes (usually a server
going up or down), a reload occurs. For many hours, there may be 2-5
active HAProxy instances as the old ones drain. We use hard-stop-after
to keep it reasonable.
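
For context, the relevant bits of our configuration look roughly like
this (values are illustrative, not our exact ones):

global
    nbthread 4            # was "nbproc 4" before 1.8
    hard-stop-after 30m   # illustrative value; bounds how long old processes may drain

defaults
    timeout tunnel 1h     # illustrative; our websocket connections are long-lived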

I have noticed that the load average gets *very* high while these
processes are still present. Our web tier usually sits at a load average
of 5 with 16 cores. Across the board, load averages go up whenever stale
HAProxy instances are active. I saw a load average as high as 34 on one
machine that had 5 instances running, at 100% CPU, most of it sys. Even
with just 2 instances, the load average is double what it is with 1.
Terminating the old process immediately brings the load back down.

Is there a regression in the 1.8 series with SO_REUSEPORT and nbthread
(we didn't see this before with nbproc) or somewhere we should start
looking? We make (relatively) heavy use of stick tables for DDoS
protection purposes and terminate SSL, but aside from that our
configuration is pretty vanilla. Nothing changed from 1.7 to 1.8 except
changing nbproc to nbthread.
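
To give a sense of what that looks like, our rules are in the spirit of
the following (anonymized and simplified, not our real config; names and
paths are made up):

frontend fe_wss
    bind :443 ssl crt /etc/haproxy/certs/site.pem    # hypothetical path
    stick-table type ip size 1m expire 10m store conn_cur,conn_rate(10s)
    tcp-request connection track-sc0 src
    tcp-request connection reject if { sc_conn_rate(0) gt 100 }
    default_backend be_app                           # hypothetical backend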

Thanks!
Hi Samuel,

On Thu, Jan 11, 2018 at 08:29:15PM -0600, Samuel Reed wrote:
> Is there a regression in the 1.8 series with SO_REUSEPORT and nbthread
> (we didn't see this before with nbproc) or somewhere we should start
> looking?

In fact no, nbthread is simply new so it's not a regression but we're
starting to see some side effects. One possibility I can easily imagine
is that in most places we use spinlocks, because the locks are very
short-lived and very small, so they have to be cheap. One limitation of
spinlocks is that you must never run more threads than you have cores,
so that a thread is never scheduled out while holding a lock, leaving
the others to spin for nothing for a whole timeslice.
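
To illustrate the effect, here is a toy sketch (not HAProxy's locking
code) of why oversubscribing cores with spinning threads hurts:

/* Toy example only: a bare test-and-set spinlock. If the thread holding
 * the lock gets scheduled out (because there are more runnable threads
 * than cores), every other thread reaching spin_lock() burns its whole
 * timeslice spinning for nothing. */
#include <stdatomic.h>
#include <pthread.h>

static atomic_flag lk = ATOMIC_FLAG_INIT;
static long counter;

static void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&lk, memory_order_acquire))
        ; /* spin; pure waste if the lock holder was preempted */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lk, memory_order_release);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        spin_lock();
        counter++;          /* stands in for a very short critical section */
        spin_unlock();
    }
    return NULL;
}

int main(void)
{
    pthread_t th[8];        /* run with more threads than cores to see it */
    for (int i = 0; i < 8; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(th[i], NULL);
    return 0;
}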

The reload makes an interesting case because if you use cpu-map to bind
your threads to CPU cores, then during the soft-stop period the two
processes have to coexist on the same cores, and a thread of one process
regularly disturbs a thread of the other by stealing its CPU.
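
For instance, with a binding such as this (values purely illustrative):

global
    nbproc 1
    nbthread 4
    cpu-map 1 0-3    # pin process 1 (and therefore its 4 threads) to cores 0-3

during a reload both the old and the new process carry the same mapping,
so twice as many threads end up fighting for those same 4 cores.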

I can't say I'm seeing any easy solution to this in the short term, that's
something we have to add to the list of things to improve in the future.
Maybe something as simple as starting with SCHED_FIFO to prevent threads
from being preempted outside of the poll loop, and dropping it upon reload
could help a lot, but that's just speculation.
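
As a sketch of that idea only (nothing like this exists in HAProxy
today, and the function name is made up):

#include <sched.h>
#include <string.h>

/* Switch the calling thread to SCHED_FIFO so it is not preempted in the
 * middle of the poll loop, and drop back to SCHED_OTHER once the process
 * has been told to stop. Requires root or CAP_SYS_NICE, and a spinning
 * SCHED_FIFO thread can monopolize a core, hence "just speculation". */
static int set_realtime(int enable)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = enable ? 1 : 0;
    /* pid 0 means "the calling thread" on Linux */
    return sched_setscheduler(0, enable ? SCHED_FIFO : SCHED_OTHER, &sp);
}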

We'll have to continue to think about this I guess. It may be possible
that if your old processes last very long you'd continue to get a better
experience using nbproc than nbthread :-/

Willy
Thanks for your quick answer, Willy.

That's a shame to hear but makes sense. We'll try out some ideas for
reducing contention. We don't use cpu-map with nbthread; I considered it
best to let the kernel take care of this, especially since there are
some other processes on that box. I don't really want to fall back to
nbproc but we may have to, at least until we get the number of reloads down.


On 1/12/18 8:55 AM, Willy Tarreau wrote:
> Hi Samuel,
>
> On Thu, Jan 11, 2018 at 08:29:15PM -0600, Samuel Reed wrote:
>> Is there a regression in the 1.8 series with SO_REUSEPORT and nbthread
>> (we didn't see this before with nbproc) or somewhere we should start
>> looking?
> In fact no, nbthread is simply new so it's not a regression but we're
> starting to see some side effects. One possibility I can easily imagine
> is that at most places we're using spinlocks because the locks are very
> short-lived and very small, so they have to be cheap. One limit of
> spinlocks is that it's mandatory that you don't have more threads than
> cores so that your threads are never scheduled out with a lock held, to
> let another one spin for nothing for a timeslice.
>
> The reload makes an interesting case because if you use cpumap to bind
> your threads to CPU cores, during the soft-stop period, they do have to
> coexist on the same cores and a thread of one process disturbs the thread
> of another process by stealing its CPU regularly.
>
> I can't say I'm seeing any easy solution to this in the short term, that's
> something we have to add to the list of things to improve in the future.
> Maybe something as simple as starting with SCHED_FIFO to prevent threads
> from being preempted outside of the poll loop, and dropping it upon reload
> could help a lot, but that's just speculation.
>
> We'll have to continue to think about this I guess. It may be possible
> that if your old processes last very long you'd continue to get a better
> experience using nbproc than nbthread :-/
>
> Willy
On Fri, Jan 12, 2018 at 09:28:54AM -0600, Samuel Reed wrote:
> Thanks for your quick answer, Willy.
>
> That's a shame to hear but makes sense. We'll try out some ideas for
> reducing contention. We don't use cpu-map with nbthread; I considered it
> best to let the kernel take care of this, especially since there are
> some other processes on that box.

So that definitely explains why 5 instances start to give you a high load
with 4 threads on 16 cores. Note, do you happen to see some processes
running at 100% CPU (or in fact 400%, since you have 4 threads)? It's
possible that some remaining bugs cause older processes and their
threads to spin a bit too much.

If you're interested, when this happens you could run "strace -cp $pid"
for a few seconds, it will report the syscall count over that period. A
typical rule of thumb is that if you see more epoll_wait() than recvfrom()
or read(), there's an issue somewhere in the code.
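
For example (assuming $pid holds the PID of the suspect process; add -f
if you want to follow all of its threads):

timeout 10 strace -cp "$pid"

then compare the epoll_wait line against recvfrom/read in the summary.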

> I don't really want to fall back to
> nbproc but we may have to, at least until we get the number of reloads down.

It's possible, but let's see if there's a way to improve the situation a
bit by gathering some elements first ;-)

Willy
To accelerate the process, I've increased the number of threads from 4
to 8 on a 16-core machine. Ran strace for about 5s on each.

Single process (8 threads):

$ strace -cp 16807
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.36    0.330172          21     15479           epoll_wait
 13.59    0.062861           4     14477         1 write
 10.58    0.048955           4     11518        10 recvfrom
  4.44    0.020544          38       537       244 read
  0.02    0.000094           1       135           sendto
  0.01    0.000051           3        16           epoll_ctl
  0.00    0.000000           0        24           close
  0.00    0.000000           0         9           socket
  0.00    0.000000           0         9         9 connect
  0.00    0.000000           0        25           sendmsg
  0.00    0.000000           0         6           shutdown
  0.00    0.000000           0        24           setsockopt
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0         9           fcntl
  0.00    0.000000           0         8         7 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.462677                 42279       271 total


Two processes (2x8 threads):

## Draining process

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.65    0.544758          30     18359           epoll_wait
 28.69    0.321283          14     23540           write
 22.60    0.253049          19     13338           recvfrom
  0.04    0.000474           1       786       374 read
  0.03    0.000287           2       187           sendto
  0.00    0.000000           0         2           close
  0.00    0.000000           0         1           sendmsg
  0.00    0.000000           0         1           shutdown
  0.00    0.000000           0         1           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    1.119851                 56215       374 total

## "New" process

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.87    1.588239          11    149253           epoll_wait
  3.84    0.064985          30      2140        31 recvfrom
  1.77    0.029905          13      2388           write
  0.34    0.005737          10       589       130 read
  0.12    0.002018          38        53           close
  0.06    0.000960           8       114         2 sendto
  0.00    0.000031           1        25           shutdown
  0.00    0.000019           0       102           sendmsg
  0.00    0.000019           0        58           epoll_ctl
  0.00    0.000015           0        31           fcntl
  0.00    0.000000           0        31           socket
  0.00    0.000000           0        31        31 connect
  0.00    0.000000           0        94           setsockopt
  0.00    0.000000           0         8           getsockopt
  0.00    0.000000           0        47        29 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    1.691928                154964       223 total


It does indeed appear the new process is contending with the old, even
with just 16 threads on a 16-core box. A third process (oversubscribed
to 24 threads on 16 cores):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.21    0.950863          14     69926           epoll_wait
  2.32    0.022677         208       109           write
  0.42    0.004106          48        85        14 recvfrom
  0.04    0.000439          17        26           close
  0.00    0.000022           1        34           epoll_ctl
  0.00    0.000000           0       136        33 read
  0.00    0.000000           0         1           brk
  0.00    0.000000           0        14           socket
  0.00    0.000000           0        14        14 connect
  0.00    0.000000           0        15         1 sendto
  0.00    0.000000           0        11           sendmsg
  0.00    0.000000           0        13         1 shutdown
  0.00    0.000000           0        50           setsockopt
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0        14           fcntl
  0.00    0.000000           0        34        22 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.978107                 70485        85 total


During this time, each of the three processes was running at roughly
250-350% CPU.


On 1/12/18 9:34 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 09:28:54AM -0600, Samuel Reed wrote:
>> Thanks for your quick answer, Willy.
>>
>> That's a shame to hear but makes sense. We'll try out some ideas for
>> reducing contention. We don't use cpu-map with nbthread; I considered it
>> best to let the kernel take care of this, especially since there are
>> some other processes on that box.
> So that definitely explains why 5 instances start to give you a high load
> with 4 threads on 16 cores. Note, do you happen to see some processes
> running at 100% CPU (or in fact 400% since you have 4 threads) ? It would
> be possible that some remaining bugs would cause older processes and their
> threads to spin a bit too much.
>
> If you're interested, when this happens you could run "strace -cp $pid"
> for a few seconds, it will report the syscall count over that period. A
> typical rule of thumb is that if you see more epoll_wait() than recvfrom()
> or read(), there's an issue somewhere in the code.
>
>> I don't really want to fall back to
>> nbproc but we may have to, at least until we get the number of reloads down.
> It's possible, but let's see if there's a way to improve the situation a
> bit by gathering some elements first ;-)
>
> Willy
On Fri, Jan 12, 2018 at 09:50:58AM -0600, Samuel Reed wrote:
> To accelerate the process, I've increased the number of threads from 4
> to 8 on a 16-core machine. Ran strace for about 5s on each.
>
> Single process (8 threads):
>
> $ strace -cp 16807
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  71.36    0.330172          21     15479           epoll_wait
>  13.59    0.062861           4     14477         1 write
>  10.58    0.048955           4     11518        10 recvfrom
>   4.44    0.020544          38       537       244 read

This one is OK and shows that quite some time is in fact spent waiting
for I/O events.

> Two processes (2x8 threads):
>
> ## Draining process
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  48.65    0.544758          30     18359           epoll_wait
>  28.69    0.321283          14     23540           write
>  22.60    0.253049          19     13338           recvfrom
>   0.04    0.000474           1       786       374 read
>   0.03    0.000287           2       187           sendto

This one as well.

> ## "New" process
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  93.87    1.588239          11    149253           epoll_wait
>   3.84    0.064985          30      2140        31 recvfrom
>   1.77    0.029905          13      2388           write
>   0.34    0.005737          10       589       130 read
>   0.12    0.002018          38        53           close
>   0.06    0.000960           8       114         2 sendto

This one is very interesting! So the epoll_wait to other syscalls ratio
went from roughly 1/2 to 30/1. I'm pretty sure that a regular strace would
show you a large number of epoll_wait(0)=0 indicating we're missing some
events. I seem to remember that sometimes there are situations where a
thread may be notified by epoll() about an fd it cannot take care of but
I don't remember in which case, I'll have to discuss with Christopher.

But at least now we have an explanation and it's not directly related to
thread contention but more likely with the mapping of FDs to threads, so
we may have opportunities to improve the situation here.
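
Conceptually, the per-thread poll loop does something like this (greatly
simplified, not the actual poller code; the names are illustrative):

#include <sys/epoll.h>

#define MAX_EVENTS 200

/* Each fd is owned by a set of threads (a bitmask). When epoll reports
 * an fd that this thread does not own, the event is skipped and the
 * thread goes straight back to epoll_wait(), which is how you end up
 * with a huge ratio of epoll_wait() calls to recvfrom()/read() work. */
void poll_loop(int epfd, const unsigned long *fd_thread_mask,
               unsigned long tid_bit, int timeout_ms,
               void (*handle_io)(int fd))
{
    struct epoll_event ev[MAX_EVENTS];

    for (;;) {
        int n = epoll_wait(epfd, ev, MAX_EVENTS, timeout_ms);

        for (int i = 0; i < n; i++) {
            int fd = ev[i].data.fd;

            if (!(fd_thread_mask[fd] & tid_bit))
                continue;   /* fd belongs to another thread: wasted wakeup */
            handle_io(fd);
        }
    }
}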

Thanks!
Willy
Excellent! Please let me know if there's any other output you'd like
from this machine.

Strace on that new process shows thousands of these types of syscalls,
which vary slightly,

epoll_wait(3, {{EPOLLIN, {u32=206, u64=206}}}, 200, 239) = 1

and these:

epoll_wait(3, {}, 200, 0)               = 0

There is also something of a cascade (each repeats about 10-20x before
the next):

epoll_wait(3, {{EPOLLIN, {u32=47, u64=47}}}, 200, 71) = 1
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
u64=656}}}, 200, 65) = 2
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
u64=656}}, {EPOLLIN, {u32=227, u64=227}}}, 200, 0) = 3
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785,
u64=785}}}, 200, 65) = 4
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}},
{EPOLLIN, {u32=639, u64=639}}}, 200, 64) = 5

I've seen it go as deep as 15. The trace is absolutely dominated by these.


On 1/12/18 10:01 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 09:50:58AM -0600, Samuel Reed wrote:
>> To accelerate the process, I've increased the number of threads from 4
>> to 8 on a 16-core machine. Ran strace for about 5s on each.
>>
>> Single process (8 threads):
>>
>> $ strace -cp 16807
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  71.36    0.330172          21     15479           epoll_wait
>>  13.59    0.062861           4     14477         1 write
>>  10.58    0.048955           4     11518        10 recvfrom
>>   4.44    0.020544          38       537       244 read
> This one is OK and shows that quite some time is in fact spent waiting
> for I/O events.
>
>> Two processes (2x8 threads):
>>
>> ## Draining process
>>
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  48.65    0.544758          30     18359           epoll_wait
>>  28.69    0.321283          14     23540           write
>>  22.60    0.253049          19     13338           recvfrom
>>   0.04    0.000474           1       786       374 read
>>   0.03    0.000287           2       187           sendto
> This one as well.
>
>> ## "New" process
>>
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  93.87    1.588239          11    149253           epoll_wait
>>   3.84    0.064985          30      2140        31 recvfrom
>>   1.77    0.029905          13      2388           write
>>   0.34    0.005737          10       589       130 read
>>   0.12    0.002018          38        53           close
>>   0.06    0.000960           8       114         2 sendto
> This one is very interesting! So the epoll_wait to other syscalls ratio
> went from roughly 1/2 to 30/1. I'm pretty sure that a regular strace would
> show you a large number of epoll_wait(0)=0 indicating we're missing some
> events. I seem to remember that sometimes there are situations where a
> thread may be notified by epoll() about an fd it cannot take care of but
> I don't remember in which case, I'll have to discuss with Christopher.
>
> But at least now we have an explanation and it's not directly related to
> thread contention but more likely with the mapping of FDs to threads, so
> we may have opportunities to improve the situation here.
>
> Thanks!
> Willy
On Fri, Jan 12, 2018 at 10:13:55AM -0600, Samuel Reed wrote:
> Excellent! Please let me know if there's any other output you'd like
> from this machine.
>
> Strace on that new process shows thousands of these types of syscalls,
> which vary slightly,
>
> epoll_wait(3, {{EPOLLIN, {u32=206, u64=206}}}, 200, 239) = 1

If the u32 value almost doesn't vary, that's an uncaught event. We got
a report of this that we just fixed yesterday; it started to appear
after the system was upgraded with the Meltdown fixes. That seems
unrelated, but reverting made the problem disappear.

> and these:
>
> epoll_wait(3, {}, 200, 0)               = 0

This one used to appear in yesterday's report though it could be caused
by other bugs as well. That's the one I predicted.

> There is also something of a cascade (each repeats about 10-20x before
> the next):
>
> epoll_wait(3, {{EPOLLIN, {u32=47, u64=47}}}, 200, 71) = 1
> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
> u64=656}}}, 200, 65) = 2
> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
> u64=656}}, {EPOLLIN, {u32=227, u64=227}}}, 200, 0) = 3
> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
> u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785,
> u64=785}}}, 200, 65) = 4
> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
> u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}},
> {EPOLLIN, {u32=639, u64=639}}}, 200, 64) = 5
>
> I've seen it go as deep as 15. The trace is absolutely dominated by these.

OK, that's very interesting. Just in case, please update to the latest
1.8-git to see if it makes this issue disappear.

Thanks,
Willy
On 1.8-git, similar results on the new process:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.75    0.265450          15     17805           epoll_wait
  4.85    0.013730          49       283           write
  1.40    0.003960          15       266        12 recvfrom
  0.01    0.000018           0        42        12 read
  0.00    0.000000           0        28           close
  0.00    0.000000           0        12           socket
  0.00    0.000000           0        12        12 connect
  0.00    0.000000           0        19         1 sendto
  0.00    0.000000           0        12           sendmsg
  0.00    0.000000           0         6           shutdown
  0.00    0.000000           0        35           setsockopt
  0.00    0.000000           0         7           getsockopt
  0.00    0.000000           0        12           fcntl
  0.00    0.000000           0        13           epoll_ctl
  0.00    0.000000           0         2         2 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.283158                 18554        39 total

Cursory look through the strace output looks the same, with the same
three types as in the last email, including the cascade.


On 1/12/18 10:23 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 10:13:55AM -0600, Samuel Reed wrote:
>> Excellent! Please let me know if there's any other output you'd like
>> from this machine.
>>
>> Strace on that new process shows thousands of these types of syscalls,
>> which vary slightly,
>>
>> epoll_wait(3, {{EPOLLIN, {u32=206, u64=206}}}, 200, 239) = 1
> If the u32 value almost doesn't vary, that's an uncaught event. We've
> got a report for this that we've just fixed yesterday which started to
> appear after the system was upgraded with Meltdown fixes. That seems
> unrelated but reverting made the problem disappear.
>
>> and these:
>>
>> epoll_wait(3, {}, 200, 0)               = 0
> This one used to appear in yesterday's report though it could be caused
> by other bugs as well. That's the one I predicted.
>
>> There is also something of a cascade (each repeats about 10-20x before
>> the next):
>>
>> epoll_wait(3, {{EPOLLIN, {u32=47, u64=47}}}, 200, 71) = 1
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
>> u64=656}}}, 200, 65) = 2
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
>> u64=656}}, {EPOLLIN, {u32=227, u64=227}}}, 200, 0) = 3
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
>> u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785,
>> u64=785}}}, 200, 65) = 4
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
>> u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}},
>> {EPOLLIN, {u32=639, u64=639}}}, 200, 64) = 5
>>
>> I've seen it go as deep as 15. The trace is absolutely dominated by these.
> OK that's very interesting. Just in doubt, please update to latest
> 1.8-git to see if it makes this issue disappear.
>
> Thanks,
> Willy
On Fri, Jan 12, 2018 at 11:06:32AM -0600, Samuel Reed wrote:
> On 1.8-git, similar results on the new process:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  93.75    0.265450          15     17805           epoll_wait
>   4.85    0.013730          49       283           write
>   1.40    0.003960          15       266        12 recvfrom
>   0.01    0.000018           0        42        12 read
>   0.00    0.000000           0        28           close
>   0.00    0.000000           0        12           socket
>   0.00    0.000000           0        12        12 connect
>   0.00    0.000000           0        19         1 sendto
>   0.00    0.000000           0        12           sendmsg
>   0.00    0.000000           0         6           shutdown
>   0.00    0.000000           0        35           setsockopt
>   0.00    0.000000           0         7           getsockopt
>   0.00    0.000000           0        12           fcntl
>   0.00    0.000000           0        13           epoll_ctl
>   0.00    0.000000           0         2         2 accept4
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.283158                 18554        39 total
>
> Cursory look through the strace output looks the same, with the same
> three types as in the last email, including the cascade.

OK thank you for testing. On Monday we'll study this with Christopher.

Have a nice week-end!
Willy
Christopher Faulet
Re: High load average under 1.8 with multiple draining processes
January 15, 2018 03:00PM
On 12/01/2018 at 18:51, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 11:06:32AM -0600, Samuel Reed wrote:
>> On 1.8-git, similar results on the new process:
>>
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  93.75    0.265450          15     17805           epoll_wait
>>   4.85    0.013730          49       283           write
>>   1.40    0.003960          15       266        12 recvfrom
>>   0.01    0.000018           0        42        12 read
>>   0.00    0.000000           0        28           close
>>   0.00    0.000000           0        12           socket
>>   0.00    0.000000           0        12        12 connect
>>   0.00    0.000000           0        19         1 sendto
>>   0.00    0.000000           0        12           sendmsg
>>   0.00    0.000000           0         6           shutdown
>>   0.00    0.000000           0        35           setsockopt
>>   0.00    0.000000           0         7           getsockopt
>>   0.00    0.000000           0        12           fcntl
>>   0.00    0.000000           0        13           epoll_ctl
>>   0.00    0.000000           0         2         2 accept4
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00    0.283158                 18554        39 total
>>
>> Cursory look through the strace output looks the same, with the same
>> three types as in the last email, including the cascade.
>
> OK thank you for testing. On Monday we'll study this with Christopher.
>

Hi Samuel,

Here are 2 patches that may solve your problem. The idea is to set the
poller timeout to 0 for a given thread only when some processing is
expected for that thread. This was already done for tasks and applets
using bitfields; now we do the same for the FDs.

For now, we don't know whether it addresses your problem, but it should
at least keep a thread from looping for nothing. Could you check whether
it works, please?
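
In rough form the idea is the following (a sketch only, not the actual
patch code; the names are illustrative):

/* A thread only forces a null poll timeout when one of the shared
 * "work pending" bitfields has a bit set for this very thread;
 * otherwise it sleeps normally in the poller until the next timer. */
static int thread_poll_timeout(unsigned long fd_cache_mask,
                               unsigned long tasks_mask,
                               unsigned long applets_mask,
                               unsigned long tid_bit,
                               int next_expiry_ms)
{
    if ((fd_cache_mask | tasks_mask | applets_mask) & tid_bit)
        return 0;              /* work is queued for this thread */
    return next_expiry_ms;     /* nothing pending: sleep until next timer */
}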

--
Christopher Faulet
Thank you for the patch and your quick attention to this issue. Below
are the results after a few reloads, with 8 threads on a 16-core
machine; both the draining and the new process have the patch applied.

New process:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.24    0.432348          16     26917           epoll_wait
  3.60    0.016158          16      1023         7 recvfrom
  0.12    0.000524          31        17           sendto
  0.04    0.000190           0      1126         3 write
  0.01    0.000036           1        70        23 read
  0.00    0.000000           0        21           close
  0.00    0.000000           0         7           socket
  0.00    0.000000           0         7         7 connect
  0.00    0.000000           0        13           sendmsg
  0.00    0.000000           0        17           setsockopt
  0.00    0.000000           0         7           fcntl
  0.00    0.000000           0         9           epoll_ctl
  0.00    0.000000           0         5         5 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.449256                 29239        45 total


Draining process:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.26    0.379045          16     23424           epoll_wait
 13.94    0.067539           7      9877         4 recvfrom
  7.80    0.037764           4     10471         6 write
  0.00    0.000007           0        29        10 read
  0.00    0.000000           0         9           close
  0.00    0.000000           0         5           sendto
  0.00    0.000000           0         3           shutdown
  0.00    0.000000           0        20           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.484355                 43838        20 total


I ran this a few times while both processes were live and the numbers
weren't significantly different. The new process still has a remarkably
high proportion of epoll_wait.


On 1/15/18 7:48 AM, Christopher Faulet wrote:
> On 12/01/2018 at 18:51, Willy Tarreau wrote:
>> On Fri, Jan 12, 2018 at 11:06:32AM -0600, Samuel Reed wrote:
>>> On 1.8-git, similar results on the new process:
>>>
>>> % time     seconds  usecs/call     calls    errors syscall
>>> ------ ----------- ----------- --------- --------- ----------------
>>>   93.75    0.265450          15     17805           epoll_wait
>>>    4.85    0.013730          49       283           write
>>>    1.40    0.003960          15       266        12 recvfrom
>>>    0.01    0.000018           0        42        12 read
>>>    0.00    0.000000           0        28           close
>>>    0.00    0.000000           0        12           socket
>>>    0.00    0.000000           0        12        12 connect
>>>    0.00    0.000000           0        19         1 sendto
>>>    0.00    0.000000           0        12           sendmsg
>>>    0.00    0.000000           0         6           shutdown
>>>    0.00    0.000000           0        35           setsockopt
>>>    0.00    0.000000           0         7           getsockopt
>>>    0.00    0.000000           0        12           fcntl
>>>    0.00    0.000000           0        13           epoll_ctl
>>>    0.00    0.000000           0         2         2 accept4
>>> ------ ----------- ----------- --------- --------- ----------------
>>> 100.00    0.283158                 18554        39 total
>>>
>>> Cursory look through the strace output looks the same, with the same
>>> three types as in the last email, including the cascade.
>>
>> OK thank you for testing. On Monday we'll study this with Christopher.
>>
>
> Hi Samuel,
>
> Here are 2 patches that may solve your problem. Idea is to set the
> poller timeout to 0 for a specific thread only when some processing
> are expected for this thread. The job was already done for tasks and
> applets using bitfields. Now, we do the same for the FDs.
>
> For now, we don't know if it aims your problem, but it should avoid a
> thread to loop for nothing. Could you check if it works please ?
>
On Mon, Jan 15, 2018 at 08:14:40AM -0600, Samuel Reed wrote:
> Thank you for the patch and your quick attention to this issue. Results
> after a few reloads, 8 threads on 16 core machine, both draining and new
> process have patches.
>
> New process:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  96.24    0.432348          16     26917           epoll_wait
>   3.60    0.016158          16      1023         7 recvfrom
>   0.12    0.000524          31        17           sendto
>   0.04    0.000190           0      1126         3 write
>   0.01    0.000036           1        70        23 read
>   0.00    0.000000           0        21           close
>   0.00    0.000000           0         7           socket
>   0.00    0.000000           0         7         7 connect
>   0.00    0.000000           0        13           sendmsg
>   0.00    0.000000           0        17           setsockopt
>   0.00    0.000000           0         7           fcntl
>   0.00    0.000000           0         9           epoll_ctl
>   0.00    0.000000           0         5         5 accept4
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.449256                 29239        45 total
>
>
> Draining process:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  78.26    0.379045          16     23424           epoll_wait
>  13.94    0.067539           7      9877         4 recvfrom
>   7.80    0.037764           4     10471         6 write
>   0.00    0.000007           0        29        10 read
>   0.00    0.000000           0         9           close
>   0.00    0.000000           0         5           sendto
>   0.00    0.000000           0         3           shutdown
>   0.00    0.000000           0        20           epoll_ctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.484355                 43838        20 total
>
>
> I ran this a few times while both processes were live and the numbers
> weren't significantly different. The new process still has a remarkably
> high proportion of epoll_wait.

Thank you Samuel for the test. It's a shame, but it may indicate
something completely different. Christopher, I'm at least willing to
integrate your fix to rule out this corner case in the future.

Among the possible differences between an old and a new process we can
enumerate very few things, for example the peers, which work differently
for new and old processes. Do you use peers in your config? It would
also be possible that we pass an fd corresponding to a more or less
closed listener or something like this. Do you reload with -x to pass
FDs across processes? Do you use master-worker? I'm just trying to rule
out a number of hypotheses. An anonymized version of your config will
definitely help here, I'm afraid.
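
(For reference, by "-x" I mean the 1.8 seamless-reload mechanism, which
looks roughly like this; the socket path here is only an example:

stats socket /var/run/haproxy.sock mode 600 level admin expose-fd listeners

plus a reload command along the lines of

haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid \
    -x /var/run/haproxy.sock -sf $(cat /run/haproxy.pid)

so that the new process retrieves the listening FDs from the old one
instead of re-binding them.)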

Thanks!
Willy
Christopher Faulet
Re: High load average under 1.8 with multiple draining processes
January 15, 2018 03:40PM
On 15/01/2018 at 15:14, Samuel Reed wrote:
> Thank you for the patch and your quick attention to this issue. Results
> after a few reloads, 8 threads on 16 core machine, both draining and new
> process have patches.
>
> New process:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  96.24    0.432348          16     26917           epoll_wait
>   3.60    0.016158          16      1023         7 recvfrom
>   0.12    0.000524          31        17           sendto
>   0.04    0.000190           0      1126         3 write
>   0.01    0.000036           1        70        23 read
>   0.00    0.000000           0        21           close
>   0.00    0.000000           0         7           socket
>   0.00    0.000000           0         7         7 connect
>   0.00    0.000000           0        13           sendmsg
>   0.00    0.000000           0        17           setsockopt
>   0.00    0.000000           0         7           fcntl
>   0.00    0.000000           0         9           epoll_ctl
>   0.00    0.000000           0         5         5 accept4
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.449256                 29239        45 total
>
>
> Draining process:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  78.26    0.379045          16     23424           epoll_wait
>  13.94    0.067539           7      9877         4 recvfrom
>   7.80    0.037764           4     10471         6 write
>   0.00    0.000007           0        29        10 read
>   0.00    0.000000           0         9           close
>   0.00    0.000000           0         5           sendto
>   0.00    0.000000           0         3           shutdown
>   0.00    0.000000           0        20           epoll_ctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.484355                 43838        20 total
>
>
> I ran this a few times while both processes were live and the numbers
> weren't significantly different. The new process still has a remarkably
> high proportion of epoll_wait.
>

Thanks. So the problem is still there. I just noticed that it is the new
process that is looping, not the old one. Could you provide the output
of the "show sess all" command on the CLI (and maybe "show fd" too)?

--
Christopher Faulet
On Mon, Jan 15, 2018 at 03:28:12PM +0100, Christopher Faulet wrote:
> Could you provide the output of "show
> sess all" command on the CLI (and maybe "show fd" too) ?

By the way, be careful, "show sess all" will be huge and will disclose
some potentially sensitive information (source/destination addresses,
etc), so if you don't feel at ease with sending this on the ML, please
at least consider feeding Christopher :-)

thanks,
Willy
Reload is done via the sysvinit wrapper, which executes:

$HAPROXY -f "$CONFIG" -p $PIDFILE -sf $(cat $PIDFILE) -D $EXTRAOPTS

No $EXTRAOPTS are specified. We don't use unix sockets and we don't use
master-worker. I'll send an anonymized haproxy config directly to you
two along with the results of "show sess" et al.

We don't use peers but if we fall back to nbproc we likely will.

More info to come shortly.

On 1/15/18 8:22 AM, Willy Tarreau wrote:
> On Mon, Jan 15, 2018 at 08:14:40AM -0600, Samuel Reed wrote:
>> Thank you for the patch and your quick attention to this issue. Results
>> after a few reloads, 8 threads on 16 core machine, both draining and new
>> process have patches.
>>
>> New process:
>>
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  96.24    0.432348          16     26917           epoll_wait
>>   3.60    0.016158          16      1023         7 recvfrom
>>   0.12    0.000524          31        17           sendto
>>   0.04    0.000190           0      1126         3 write
>>   0.01    0.000036           1        70        23 read
>>   0.00    0.000000           0        21           close
>>   0.00    0.000000           0         7           socket
>>   0.00    0.000000           0         7         7 connect
>>   0.00    0.000000           0        13           sendmsg
>>   0.00    0.000000           0        17           setsockopt
>>   0.00    0.000000           0         7           fcntl
>>   0.00    0.000000           0         9           epoll_ctl
>>   0.00    0.000000           0         5         5 accept4
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00    0.449256                 29239        45 total
>>
>>
>> Draining process:
>>
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  78.26    0.379045          16     23424           epoll_wait
>>  13.94    0.067539           7      9877         4 recvfrom
>>   7.80    0.037764           4     10471         6 write
>>   0.00    0.000007           0        29        10 read
>>   0.00    0.000000           0         9           close
>>   0.00    0.000000           0         5           sendto
>>   0.00    0.000000           0         3           shutdown
>>   0.00    0.000000           0        20           epoll_ctl
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00    0.484355                 43838        20 total
>>
>>
>> I ran this a few times while both processes were live and the numbers
>> weren't significantly different. The new process still has a remarkably
>> high proportion of epoll_wait.
> Thank you Samuel for the test. It's sad, but it may indicate something
> completely different. Christopher at least I'm willing to integrate your
> fix to rule out this corner case in the future.
>
> Among the differences possible between an old and a new process, we can
> enumerate very few things, for example the peers, which work differently
> for new and old processes. Do you use peers in your config ? It would also
> be possible that we pass an fd corresponding to a more or less closed
> listener or something like this. Do you reload with -x to pass FDs across
> processes ? Do you use master-worker ? Just trying to rule out a number of
> hypothesis. An anonymized version of your config will definitely help here
> I'm afraid.
>
> Thanks!
> Willy
On Mon, Jan 15, 2018 at 09:39:25AM -0600, Samuel Reed wrote:
> Reload is done via the sysvinit wrapper, which executes:
>
> $HAPROXY -f "$CONFIG" -p $PIDFILE -sf $(cat $PIDFILE) -D $EXTRAOPTS
>
> No $EXTRAOPTS are specified. We don't use unix sockets and we don't use
> master-worker. I'll send an anonymized haproxy config directly to you
> two along with the results of "show sess" et al.
>
> We don't use peers but if we fall back to nbproc we likely will.
>
> More info to come shortly.

OK thank you for this!
Willy
Samuel,

While running a few tests with Christopher's patch in order to integrate
it, I managed to find a case where I'm still seeing quite a number of
calls to epoll_wait(0)=0. Studying the patch, I found that there's a
corner case which it doesn't address, which is where an fd is processed
and removed from the cache. By changing the test a little bit, I managed
to reduce the number of empty calls to epoll_loop by about 1/3 on my test
case. But I've also just noticed that at too low loads I can now miss some
events so I'm not sharing it now. And I don't know if the remaining ones
wakeups are valid so I want to study all this compared to a single-threaded
test.

It's possible that sometimes we wake up too aggressively even maybe for
the tasts, though that's pure speculation.

We'll keep you updated on this.

Cheers,
Willy
Hey guys,

On 15 January 2018 at 20:49, Willy Tarreau <[email protected]> wrote:
> Samuel,
>
> While running a few tests with Christopher's patch in order to integrate
> it, I managed to find a case where I'm still seeing quite a number of
> calls to epoll_wait(0)=0. Studying the patch, I found that there's a
> corner case which it doesn't address, which is where an fd is processed
> and removed from the cache. By changing the test a little bit, I managed
> to reduce the number of empty calls to epoll_loop by about 1/3 on my test
> case. But I've also just noticed that at too low loads I can now miss some
> events so I'm not sharing it now. And I don't know if the remaining ones
> wakeups are valid so I want to study all this compared to a single-threaded
> test.
>
> It's possible that sometimes we wake up too aggressively even maybe for
> the tasts, though that's pure speculation.
>
> We'll keep you updated on this.


Also consider this report from discourse:
https://discourse.haproxy.org/t/haproxy1-8-3-dynamic-dns-resolvers-problem/1997

HAProxy 1.8.3 with nbthread > 1: while reloading, DNS resolution fails.
It's possible this was working in 1.8.2, but I'm waiting for confirmation.

Not sure if the issue here could be related to DNS resolution (and its FDs).



cheers,
lukas
Hi Lukas,

On Mon, Jan 15, 2018 at 11:30:49PM +0100, Lukas Tribus wrote:
> Also consider this report from discourse:
> https://discourse.haproxy.org/t/haproxy1-8-3-dynamic-dns-resolvers-problem/1997
>
> Haproxy 1.8.3 with nbthread > 1 and while reloading, DNS resolutions fails.
> Its possibile this was working in 1.8.2, but I'm waiting for confirmation.
>
> Not sure if the issue here could be related to DNS resolutions (and it's FDs).

At least in Samuel's case there's no resolver involved. The config is
pretty standard with nothing fancy: a bit of SSL and websocket, and
that's basically all. Neither -x nor master-worker is used during the
reload. The use of websockets made me think that some old connections
might not be properly closed over time and might start to be handled
incorrectly, but my understanding is that the problem happens very
quickly. Also, the problem is on the new process only.

But it's quite possible that the two problems are related.

Willy