100% CPU usage with 1.9-dev0-48d92ee 2018/07/30, but it keeps working (nbthread 1)

Posted by PiBa-NL 
Hi List,

I've got a weird issue, and I'm not sure where or how to continue digging
at the moment...

Using nbthread=1 nbproc=1, a few Lua scripts, SSL offloading / HTTP
traffic, and only a few connections (< 100)...

Sometimes haproxy starts using 100% CPU, after a few days (which makes it
hard to debug). Currently running version 'HA-Proxy version 1.9-dev0-48d92ee
2018/07/30'.
I've run some commands against the haproxy socket, like 'show activity';
as can be seen, there are lots of loops and tasks in just a second of time.

[2.4.3-RELEASE][[email protected]_3]/root:
/usr/local/pkg/haproxy/haproxy_socket.sh show activity
show activity thread_id: 0
date_now: 1533754729.799405
loops: 828928664
wake_cache: 845396
wake_tasks: 827400248
wake_signal: 0
poll_exp: 828245644
poll_drop: 17451
poll_dead: 0
poll_skip: 0
fd_skip: 0
fd_lock: 0
fd_del: 0
conn_dead: 0
stream: 101147
empty_rq: 1242050
long_rq: 0

[2.4.3-RELEASE][[email protected]_3]/root:
/usr/local/pkg/haproxy/haproxy_socket.sh show activity
show activity thread_id: 0
date_now: 1533754731.084664
loops: 829000230
wake_cache: 845398
wake_tasks: 827471812
wake_signal: 0
poll_exp: 828317210
poll_drop: 17452
poll_dead: 0
poll_skip: 0
fd_skip: 0
fd_lock: 0
fd_del: 0
conn_dead: 0
stream: 101149
empty_rq: 1242050
long_rq: 0

Other than that, I've tried to attach gdb and step through / log some
functions; it passes through.

With a gdb 'command' file like the one below I created a little log of the
function breakpoints hit:
set pagination off
set height 0
set logging on
delete
rbreak haproxy.c:.
rbreak session.c:.
rbreak hlua.c:.
rbreak task.c:.
commands 1-99999
cont
end
cont

Which got me a log with the following content. As you can see, it seems
to be looping over the same task multiple times, which might not even be
a problem: the task t=0x802545a60 expires and wakes up, expires and wakes
up, over and over:

Breakpoint 249, __task_queue (task=0x802545a60) at src/task.c:185
185    src/task.c: No such file or directory.

Breakpoint 253, wake_expired_tasks () at src/task.c:209
209    in src/task.c

Breakpoint 250, __task_wakeup (t=0x802545a60, root=0x8ced50
<rqueue_local>) at src/task.c:72
72    in src/task.c

Breakpoint 41, sync_poll_loop () at src/haproxy.c:2378
2378    src/haproxy.c: No such file or directory.

Breakpoint 252, process_runnable_tasks () at src/task.c:275
275    src/task.c: No such file or directory.

Breakpoint 51, session_expire_embryonic (t=0x802545a60,
context=0x8024483a0, state=513) at src/session.c:389
389    src/session.c: No such file or directory.

Breakpoint 249, __task_queue (task=0x802545a60) at src/task.c:185
185    src/task.c: No such file or directory.

Breakpoint 253, wake_expired_tasks () at src/task.c:209
209    in src/task.c

Breakpoint 250, __task_wakeup (t=0x802545a60, root=0x8ced50
<rqueue_local>) at src/task.c:72
72    in src/task.c

Breakpoint 41, sync_poll_loop () at src/haproxy.c:2378
2378    src/haproxy.c: No such file or directory.

Breakpoint 252, process_runnable_tasks () at src/task.c:275
275    src/task.c: No such file or directory.

Breakpoint 51, session_expire_embryonic (t=0x802545a60,
context=0x8024483a0, state=513) at src/session.c:389
389    src/session.c: No such file or directory.

[... the same six-breakpoint cycle repeats over and over for t=0x802545a60 ...]


haproxy -vv
HA-Proxy version 1.9-dev0-48d92ee 2018/07/30
Copyright 2000-2017 Willy Tarreau <[email protected]>

Build options :
  TARGET  = freebsd
  CPU     = generic
  CC      = cc
  CFLAGS  = -DDEBUG_THREAD -DDEBUG_MEMORY -pipe -g -fstack-protector
-fno-strict-aliasing -fno-strict-aliasing -Wdeclaration-after-statement
-fwrapv -fno-strict-overflow -Wno-address-of-packed-member
-Wno-null-dereference -Wno-unused-label -DFREEBSD_PORTS -DFREEBSD_PORTS
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_CPU_AFFINITY=1
USE_ACCEPT4=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_STATIC_PCRE=1
USE_PCRE_JIT=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE version : 8.40 2017-01-11
Running on PCRE version : 8.40 2017-01-11
PCRE library supports JIT : yes
Built with multi-threading support.
Encrypted password support via crypt(3): yes
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY
Built with Lua version : Lua 5.3.4
Built with OpenSSL version : OpenSSL 1.0.2k-freebsd  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2m-freebsd  2 Nov 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2

Available polling systems :
     kqueue : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
        [TRACE] trace
        [COMP] compression
        [SPOE] spoe

But I'm not sure what to do next. Today it happened again and I tried to
run the gdb log command with all haproxy source files, but then it
stopped working completely (or at least became so slow it didn't properly
respond anymore), so I had to abort and restart.

Is there any extra info I can gather next time? A different gdb command
script to try and run?

Should I try a newer version? (I did have it with a previous build from
a week earlier as well; not sure if it happened before that.)

Could it be because of the OpenSSL version mismatch? (Not sure how easy
it is for me to compile it against the 'correct' version; I never
seemed to have issues with that before.)

Hoping someone has an idea how to debug it further or differently, or
perhaps can create a patch that might provide extra information when it
occurs again.

Regards,

PiBa-NL (Pieter)
Hi List,

Anyone got an idea how to debug this further?
Currently it's running at 100% again; any pointers to debug the process
as it's running would be appreciated.

Or should I compile again from current master and 'hope' it doesn't return?

By the way, the truss output is as follows:
kevent(3,0x0,0,{ },200,{ 0.000000000 })          = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.000000000 })          = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.000000000 })          = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.000000000 })          = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.000000000 })          = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.000000000 })          = 0 (0x0)
kevent(3,0x0,0,{ },200,{ 0.000000000 })          = 0 (0x0)

Regards,
PiBa-NL (Pieter)

On 8-8-2018 at 22:49, PiBa-NL wrote:
> Hi List,
>
> Ive got a weird issue.. and im not sure where/how to continue digging
> at the moment...
>
> [...]
>
> Regards,
>
> PiBa-NL (Pieter)
>
Hi Pieter,

On Thu, Aug 16, 2018 at 12:24:04AM +0200, PiBa-NL wrote:
> Hi List,
>
> Anyone got a idea how to debug this further?
> Currently its running at 100% again, any pointers to debug the process as
> its running would be appreciated.
>
> b.t.w. truss output is as follows:
> kevent(3,0x0,0,{ },200,{ 0.000000000 })          = 0 (0x0)
> [...]
>
> Regards,
> PiBa-NL (Pieter)

I'm interested in figuring that one out.
From the look of it, it seems the scheduler thinks there's a task to be run,
and so won't let the poller sleep in kevent().
Can you:
- update to the latest master, even though I don't think any relevant fix
was applied since Jul 30;
- compile it with -O0, so that we can get meaningful information from gdb;
- when/if that happens again, get a core, and send it to me along with the
haproxy binary, assuming there's no confidential information in that core,
of course.

Thanks !

Olivier
Hi again,

On Thu, Aug 16, 2018 at 05:50:27PM +0200, Olivier Houchard wrote:
> Hi Pieter,
>
> I'm interested in figuring that one out.
> From the look at it, it seems the scheduler thinks there's a task to be run,
> and so won't let the poller sleep in kevent().
>
> [...]
>
> Thanks !
>

So after discussing on IRC, I'm pretty sure I figured it out. The two
attached patches should fix it.

Thanks a lot !

Olivier
From 90fc92f777772c6b47d88769bb73680702d7b8e6 Mon Sep 17 00:00:00 2001
From: Olivier Houchard <[email protected]>
Date: Thu, 16 Aug 2018 19:03:02 +0200
Subject: [PATCH 1/2] BUG/MEDIUM: tasks: Don't insert in the global rqueue if
nbthread == 1

Make sure we don't insert a task in the global run queue if nbthread == 1,
since, as an optimisation, we avoid reading from it when nbthread == 1.
---
src/task.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/task.c b/src/task.c
index de097baf7..e357bc169 100644
--- a/src/task.c
+++ b/src/task.c
@@ -395,7 +395,8 @@ void process_runnable_tasks()
state = HA_ATOMIC_AND(&t->state, ~TASK_RUNNING);
if (state)
#ifdef USE_THREAD
- __task_wakeup(t, (t->thread_mask == tid_bit) ?
+ __task_wakeup(t, (t->thread_mask == tid_bit ||
+ global.nbthread == 1) ?
&rqueue_local[tid] : &rqueue);
#else
__task_wakeup(t, &rqueue_local[tid]);
--
2.14.3

From 7640aa3de3c9ad00fe82cda4a50351e46fc0bf48 Mon Sep 17 00:00:00 2001
From: Olivier Houchard <[email protected]>
Date: Thu, 16 Aug 2018 19:03:50 +0200
Subject: [PATCH 2/2] BUG/MEDIUM: sessions: Don't use t->state.

In session_expire_embryonic(), don't use t->state, use the "state" argument
instead, as t->state has been cleaned before we're being called.
---
src/session.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/session.c b/src/session.c
index 1d66b739f..d7d8544c7 100644
--- a/src/session.c
+++ b/src/session.c
@@ -388,7 +388,7 @@ static struct task *session_expire_embryonic(struct task *t, void *context, unsi
{
struct session *sess = context;

- if (!(t->state & TASK_WOKEN_TIMER))
+ if (!(state & TASK_WOKEN_TIMER))
return t;

session_kill_embryonic(sess);
--
2.14.3
Both patches applied, thanks guys!

Olivier, I have a suggestion for this one :
On Thu, Aug 16, 2018 at 07:17:07PM +0200, Olivier Houchard wrote:
> From 90fc92f777772c6b47d88769bb73680702d7b8e6 Mon Sep 17 00:00:00 2001
> From: Olivier Houchard <[email protected]>
> Date: Thu, 16 Aug 2018 19:03:02 +0200
> Subject: [PATCH 1/2] BUG/MEDIUM: tasks: Don't insert in the global rqueue if
> nbthread == 1
>
> Make sure we don't insert a task in the global run queue if nbthread == 1,
> as, as an optimisation, we avoid reading from it if nbthread == 1.
> ---
> src/task.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/src/task.c b/src/task.c
> index de097baf7..e357bc169 100644
> --- a/src/task.c
> +++ b/src/task.c
> @@ -395,7 +395,8 @@ void process_runnable_tasks()
> state = HA_ATOMIC_AND(&t->state, ~TASK_RUNNING);
> if (state)
> #ifdef USE_THREAD
> - __task_wakeup(t, (t->thread_mask == tid_bit) ?
> + __task_wakeup(t, (t->thread_mask == tid_bit ||
> + global.nbthread == 1) ?
> &rqueue_local[tid] : &rqueue);
> #else

I'm pretty sure we'll get caught by this sort of stuff in the future
again. I think it would be safer to proceed like this :

__task_wakeup(t, ((t->thread_mask & all_threads_mask) == tid_bit)

It should always be valid regardless of the number of threads. The same
could be done in task_wakeup().

I suspect we may have a similar issue with the fd cache applied to listeners
here :

static inline void updt_fd_polling(const int fd)
{
	if (fdtab[fd].thread_mask == tid_bit) {
		unsigned int oldupdt;

		/* note: we don't have a test-and-set yet in hathreads */


In this case the thread_mask might be larger than all_threads_mask and
we might fail this test. Or we may see that when threads exit and the
threads mask changes.

Just my two cents, both patches were applied anyway.

Willy
On Thu, Aug 16, 2018 at 07:31:17PM +0200, Willy Tarreau wrote:
> Both patches applied, thanks guys!
>
> Olivier, I have a suggestion for this one :
> On Thu, Aug 16, 2018 at 07:17:07PM +0200, Olivier Houchard wrote:
> > [...]
>
> I'm pretty sure we'll get caught by this sort of stuff in the future
> again. I think it would be safer to proceed like this :
>
> __task_wakeup(t, ((t->thread_mask & all_threads_mask) == tid_bit)
>
> It should always be valid regardless of the number of threads. The same
> could be done in task_wakeup().
>

Sure, why not.

> I suspect we may have a similar issue with the fd cache applied to listeners
> here :
>
> static inline void updt_fd_polling(const int fd)
> {
> 	if (fdtab[fd].thread_mask == tid_bit) {
> 		unsigned int oldupdt;
>
> 		/* note: we don't have a test-and-set yet in hathreads */
>
>
> In this case the thread_mask might be larger than all_threads_mask and
> we might fail this test. Or we may see that when threads exit and the
> threads mask changes.
>
> Just my two cents, both patches were applied anyway.
>

That is true, this one is not a bug, but a pessimization: it uses the global
update_list, which is more costly than the local one.

Patches attached to do as suggested.

Regards,

Olivier

From c61f3f76f7e546aac14d7db33552dba11ce71e12 Mon Sep 17 00:00:00 2001
From: Olivier Houchard <[email protected]>
Date: Fri, 17 Aug 2018 13:36:08 +0200
Subject: [PATCH 1/2] MINOR: tasks: Don't special-case when nbthreads == 1

Instead of checking whether nbthread == 1, just AND the thread_mask with
all_threads_mask to know if we're supposed to add the task to the local or
the global runqueue.
---
src/task.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/task.c b/src/task.c
index e357bc169..ce5b4f907 100644
--- a/src/task.c
+++ b/src/task.c
@@ -395,8 +395,7 @@ void process_runnable_tasks()
state = HA_ATOMIC_AND(&t->state, ~TASK_RUNNING);
if (state)
#ifdef USE_THREAD
- __task_wakeup(t, (t->thread_mask == tid_bit ||
- global.nbthread == 1) ?
+ __task_wakeup(t, ((t->thread_mask & all_threads_mask) == tid_bit) ?
&rqueue_local[tid] : &rqueue);
#else
__task_wakeup(t, &rqueue_local[tid]);
--
2.14.3

From ea3b07fd2ed96b46ea107c9d862b5a1c27e728c2 Mon Sep 17 00:00:00 2001
From: Olivier Houchard <[email protected]>
Date: Fri, 17 Aug 2018 13:37:59 +0200
Subject: [PATCH 2/2] MINOR: fd cache: And the thread_mask with
all_threads_mask.

When we choose whether to insert a fd in the global or the local fd update
list, AND the thread_mask with all_threads_mask before checking if it equals
tid_bit; that way, if we run with nbthread==1, we will always use the local
list, which is cheaper than the global one.
---
include/proto/fd.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/proto/fd.h b/include/proto/fd.h
index a4cee3220..a3ec5e854 100644
--- a/include/proto/fd.h
+++ b/include/proto/fd.h
@@ -109,7 +109,7 @@ void fd_rm_from_fd_list(volatile struct fdlist *list, int fd, int off);
*/
static inline void updt_fd_polling(const int fd)
{
- if (fdtab[fd].thread_mask == tid_bit) {
+ if ((fdtab[fd].thread_mask & all_threads_mask) == tid_bit) {
unsigned int oldupdt;

/* note: we don't have a test-and-set yet in hathreads */
--
2.14.3
On Fri, Aug 17, 2018 at 01:41:54PM +0200, Olivier Houchard wrote:
> That is true, this one is not a bug, but a pessimization, by using the global
> update_list which is more costly than the local one.
>
> Patches attached to do as suggest.

Applied, thank you!
willy
Hi Olivier,

On 17-8-2018 at 14:51, Willy Tarreau wrote:
> On Fri, Aug 17, 2018 at 01:41:54PM +0200, Olivier Houchard wrote:
>> That is true, this one is not a bug, but a pessimization, by using the global
>> update_list which is more costly than the local one.
>>
>> Patches attached to do as suggest.
> Applied, thank you!
> willy

Just a little update :)

The '1.9-dev0-48d92ee 2018/07/30' build plus your initial 2 patches has
been running correctly, using as little CPU as can be expected for its
little workload, for 4+ days now. I think I can call it 'fix confirmed',
as you already knew ;) ; previously the issue would likely have returned
within this time period.

I'll keep it running for a few more days, and then switch back to
nbthread 3. Till next time ;)

Thanks again!

Best regards,

PiBa-NL (Pieter)
Hi Pieter,

On Mon, Aug 20, 2018 at 09:33:49PM +0200, PiBa-NL wrote:
> Hi Olivier,
>
> Just a little update :)
>
> The '1.9-dev0-48d92ee 2018/07/30'+your initial 2 patches, is running
> correctly using as little cpu as can be expected for its little workload for
> 4+ days now. I think i can call it 'fix confirmed', as you already knew ;) ,
> previously the issue would have likely returned in this time period..
>
> [...]
>

Thanks for the feedback :)
Let us know how it goes with 3 threads !

Regards,

Olivier