Welcome! Log In Create A New Profile

Advanced

dev1.9 2018/06/05 threads cpu 100% spin_lock v.s. thread_sync_barrier

Posted by PiBa-NL 
Hi List,

I've got no clue how i got into this state ;) and maybe there is nothing
wrong..(well i did resume a VM that was suspended for half a day..)

Still thought it might be worth reporting, or perhaps its solved already
as there are a few fixes for threads after the 6-6 snapshot that i build
with..
Sometimes all that some people need is half a idea to find a problem...
So maybe there is something that needs fixing??

Haproxy running with 3 threads at 300% cpu usage, .. some lua, almost no
traffic, in a vm that just resumed operation and is still going through
its passes to initialize its nic's and some stuff that noticed the clock
jumped on its back and its dhcp lease expired or something like that..
anyhow lots of things going on at that moment..

Below some of the details ive got about the threads, one is spinning,
the others seemingly waiting for spin_locks.?.

Like i wrote, not sure if its 'something' and i don't know yet if i can
reproduce it a second time.
If more info is needed, please let me know and ill try and provide it.
But at the moment its a 1 time occurrence, i think..
If there is nothing obvious wrong, for now maybe just ignore this mail.
Also ill update to latest snapshot 2018/06/08. Maybe i wont see it ever
again :).

haproxy -vv
HA-Proxy version 1.9-dev0-cc0a957 2018/06/05
Copyright 2000-2017 Willy Tarreau <[email protected]>

Build options :
  TARGET  = freebsd
  CPU     = generic
  CC      = cc
  CFLAGS  = -pipe -g -fstack-protector -fno-strict-aliasing
-fno-strict-aliasing -Wdeclaration-after-statement -fwrapv
-fno-strict-overflow -Wno-address-of-packed-member -Wno-null-dereference
-Wno-unused-label -DFREEBSD_PORTS
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_CPU_AFFINITY=1
USE_ACCEPT4=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_STATIC_PCRE=1
USE_PCRE_JIT=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE version : 8.40 2017-01-11
Running on PCRE version : 8.40 2017-01-11
PCRE library supports JIT : yes
Built with multi-threading support.
Encrypted password support via crypt(3): yes
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY
Built with Lua version : Lua 5.3.4
Built with OpenSSL version : OpenSSL 1.0.2k-freebsd  26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2o-freebsd  27 Mar 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2

Available polling systems :
     kqueue : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use kqueue.

Available filters :
        [TRACE] trace
        [COMP] compression
        [SPOE] spoe



(gdb) info threads
  Id   Target Id         Frame
* 1    LWP 100660 of process 56253 0x00000000005b0202 in
thread_sync_barrier (barrier=0x8bc690 <thread_enter_sync.barrier>) at
src/hathreads.c:109
  2    LWP 101036 of process 56253 0x000000000050874a in
process_chk_conn (t=0x8025187c0, context=0x802482610, state=33) at
src/checks.c:2112
  3    LWP 101037 of process 56253 0x000000000050b58e in
enqueue_one_email_alert (p=0x80253f400, s=0x8024dec00, q=0x802482600,
    msg=0x7fffdfdfc770 "Health check for server
Test-SNI_ipvANY/srv451-4 failed, reason: Layer4 connection problem,
info: \"General socket error (Network is unreachable)\", check duration:
0ms, status: 0/2 DOWN") at src/checks.c:3396
(gdb) next
110     in src/hathreads.c
(gdb) next
109     in src/hathreads.c
(gdb) next
110     in src/hathreads.c
(gdb) next
109     in src/hathreads.c
(gdb) next
110     in src/hathreads.c
(gdb) next
109     in src/hathreads.c
(gdb) next


Command name abbreviations are allowed if unambiguous.
(gdb) thread 1
[Switching to thread 1 (LWP 100660 of process 56253)]
#0  thread_sync_barrier (barrier=0x8bc690 <thread_enter_sync.barrier>)
at src/hathreads.c:109
109     src/hathreads.c: No such file or directory.
(gdb) bt full
#0  thread_sync_barrier (barrier=0x8bc690 <thread_enter_sync.barrier>)
at src/hathreads.c:109
        old = 7
#1  0x00000000005b0038 in thread_enter_sync () at src/hathreads.c:122
        barrier = 1
#2  0x000000000051737c in sync_poll_loop () at src/haproxy.c:2380
No locals.
#3  0x00000000005172ed in run_poll_loop () at src/haproxy.c:2432
        next = -357534104
        exp = -357534104
#4  0x0000000000514670 in run_thread_poll_loop (data=0x802491380) at
src/haproxy.c:2462
        start_lock = 0
        ptif = 0x8af8f8 <per_thread_init_list>
        ptdf = 0x7fffffffec80
#5  0x0000000000511199 in main (argc=10, argv=0x7fffffffec28) at
src/haproxy.c:3052
        tids = 0x802491380
        threads = 0x8024831a0
        i = 3
        err = 0
        retry = 200
        limit = {rlim_cur = 6116, rlim_max = 6116}
        errmsg =
"\000\354\377\377\377\177\000\000\200\354\377\377\377\177\000\000(\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\202$B\374|\211\275C\240g\213\000\000\000\000\000
\354\377\377\377\177\000\000\200\354\377\377\377\177\000\000(\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\300\353\377\377\377\177\000\000\302Z\340\001\b\000\000\000\001\000\000"
        pidfd = 6
(gdb) thread 2
[Switching to thread 2 (LWP 101036 of process 56253)]
#0  0x0000000000508718 in process_chk_conn (t=0x8025187c0,
context=0x802482610, state=33) at src/checks.c:2112
2112    src/checks.c: No such file or directory.
(gdb) bt full
#0  0x0000000000508718 in process_chk_conn (t=0x8025187c0,
context=0x802482610, state=33) at src/checks.c:2112
        __pl_l = 0x8024df5a0
        __pl_r = 4294967300
        check = 0x802482610
        s = 0x8024dec00
        cs = 0x8024f1790
        conn = 0x803091480
        rv = 8
        ret = 38896344
        expired = 0
#1  0x000000000050805e in process_chk (t=0x8025187c0,
context=0x802482610, state=33) at src/checks.c:2281
        check = 0x802482610
#2  0x0000000000503265 in process_email_alert (t=0x8025187c0,
context=0x802482610, state=33) at src/checks.c:3156
        check = 0x802482610
        q = 0x802482600
        alert = 0x7fffdfffde10
#3  0x000000000059b9bf in process_runnable_tasks () at src/task.c:351
        t = 0x8025187c0
        state = 33
        ctx = 0x802482610
        process = 0x503060 <process_email_alert>
        t = 0x8025182c0
        max_processed = 200
        average = 2
#4  0x000000000051717b in run_poll_loop () at src/haproxy.c:2403
        next = -357534145
        exp = -357534147
#5  0x0000000000514670 in run_thread_poll_loop (data=0x802491384) at
src/haproxy.c:2462
        start_lock = 0
        ptif = 0x8af8f8 <per_thread_init_list>
        ptdf = 0x800f037cc
#6  0x0000000800efec06 in ?? () from /lib/libthr.so.3
No symbol table info available.
#7  0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: Cannot access memory at address 0x7fffdfffe000
(gdb) thread 3
[Switching to thread 3 (LWP 101037 of process 56253)]
#0  0x000000000050b4a5 in enqueue_one_email_alert (p=0x80253f400,
s=0x8024dec00, q=0x802482600,
    msg=0x7fffdfdfc770 "Health check for server
Test-SNI_ipvANY/srv451-4 failed, reason: Layer4 connection problem,
info: \"General socket error (Network is unreachable)\", check duration:
0ms, status: 0/2 DOWN") at src/checks.c:3396
3396    in src/checks.c
(gdb) bt full
#0  0x000000000050b4a5 in enqueue_one_email_alert (p=0x80253f400,
s=0x8024dec00, q=0x802482600,
    msg=0x7fffdfdfc770 "Health check for server
Test-SNI_ipvANY/srv451-4 failed, reason: Layer4 connection problem,
info: \"General socket error (Network is unreachable)\", check duration:
0ms, status: 0/2 DOWN") at src/checks.c:3396
        __pl_l = 0x8024827d0
        __pl_r = 4294967300
        strs = {0x639ab8 "DATA\r\n", 0x0}
        strs = {0x639b0d "QUIT\r\n", 0x0}
        alert = 0x80308f0b0
        tcpcheck = 0x802f22780
        check = 0x802482610
#1  0x0000000000503692 in enqueue_email_alert (p=0x80253f400, s=0x8024dec00,
    msg=0x7fffdfdfc770 "Health check for server
Test-SNI_ipvANY/srv451-4 failed, reason: Layer4 connection problem,
info: \"General socket error (Network is unreachable)\", check duration:
0ms, status: 0/2 DOWN") at src/checks.c:3414
        i = 0
        mailer = 0x8024329c0
#2  0x00000000005035fa in send_email_alert (s=0x8024dec00, level=6,
format=0x6437bd "%s") at src/checks.c:3445
        argp = {{gp_offset = 32, fp_offset = 48, overflow_arg_area =
0x7fffdfdfcb90, reg_save_area = 0x7fffdfdfc680}}
        buf = "Health check for server Test-SNI_ipvANY/srv451-4 failed,
reason: Layer4 connection problem, info: \"General socket error (Network
is unreachable)\", check duration: 0ms, status: 0/2
DOWN\000*\006\002\b\000\000\000\000\332\000\003\b\000\000\000"...
        len = 184
        p = 0x80253f400
#3  0x00000000005000a8 in set_server_check_status (check=0x8024df048,
status=8,
    desc=0x803017880 "Health check for server Test-SNI_ipvANY/srv451-4
failed, reason: Layer4 connection problem, info: \"General socket error
(Network is unreachable)\", check duration: 0ms, status: 0/2 DOWN") at
src/checks.c:294
        s = 0x8024dec00
        prev_status = 7
        report = 0
#4  0x0000000000503f0e in chk_report_conn_err (check=0x8024df048,
errno_bck=51, expired=0) at src/checks.c:669
        cs = 0x8024f1760
        conn = 0x803091000
        err_msg = 0x803017880 "Health check for server
Test-SNI_ipvANY/srv451-4 failed, reason: Layer4 connection problem,
info: \"General socket error (Network is unreachable)\", check duration:
0ms, status: 0/2 DOWN"
        chk = 0x800907228
        step = 0
        comment = 0x0
#5  0x0000000000508ab4 in process_chk_conn (t=0x8027f1b40,
context=0x8024df048, state=5) at src/checks.c:2165
        check = 0x8024df048
        s = 0x8024dec00
        cs = 0x8024f1760
        conn = 0x803091000
        rv = 32767
        ret = 20480
        expired = 1
#6  0x000000000050805e in process_chk (t=0x8027f1b40,
context=0x8024df048, state=5) at src/checks.c:2281
        check = 0x8024df048
#7  0x000000000059b9bf in process_runnable_tasks () at src/task.c:351
        t = 0x8027f1b40
        state = 5
        ctx = 0x8024df048
        process = 0x508000 <process_chk>
        t = 0x8027f1a00
        max_processed = 200
        average = 3
#8  0x000000000051717b in run_poll_loop () at src/haproxy.c:2403
        next = -357534145
        exp = -357534147
#9  0x0000000000514670 in run_thread_poll_loop (data=0x802491388) at
src/haproxy.c:2462
        start_lock = 0
        ptif = 0x8af8f8 <per_thread_init_list>
---Type <return> to continue, or q <return> to quit---
        ptdf = 0x800f037cc
#10 0x0000000800efec06 in ?? () from /lib/libthr.so.3
No symbol table info available.
#11 0x0000000000000000 in ?? ()
No symbol table info available.
Backtrace stopped: Cannot access memory at address 0x7fffdfdfd000
Hi Pieter,

On Mon, Jun 11, 2018 at 10:48:25PM +0200, PiBa-NL wrote:
> Hi List,
>
> I've got no clue how i got into this state ;) and maybe there is nothing
> wrong..(well i did resume a VM that was suspended for half a day..)
>
> Still thought it might be worth reporting, or perhaps its solved already as
> there are a few fixes for threads after the 6-6 snapshot that i build with..
> Sometimes all that some people need is half a idea to find a problem... So
> maybe there is something that needs fixing??

This one is not known yet, to the best of my knowledge, or at least
not reported yet.

> (gdb) info threads
>   Id   Target Id         Frame
> * 1    LWP 100660 of process 56253 0x00000000005b0202 in thread_sync_barrier
> (barrier=0x8bc690 <thread_enter_sync.barrier>) at src/hathreads.c:109
>   2    LWP 101036 of process 56253 0x000000000050874a in process_chk_conn
> (t=0x8025187c0, context=0x802482610, state=33) at src/checks.c:2112
>   3    LWP 101037 of process 56253 0x000000000050b58e in
> enqueue_one_email_alert (p=0x80253f400, s=0x8024dec00, q=0x802482600,
>     msg=0x7fffdfdfc770 "Health check for server Test-SNI_ipvANY/srv451-4
> failed, reason: Layer4 connection problem, info: \"General socket error
> (Network is unreachable)\", check duration: 0ms, status: 0/2 DOWN") at
> src/checks.c:3396

This is quite odd. Either some quit a function without releasing the
server's lock nor the email_queue's lock (suspicious but possible),
or both just happen to be the same. And at first glance, seeing that
process_chk_conn() is called witha context of 0x802482610 which is
only 16 bytes above enqueue_one_email_alert's queue parameter, this
last possibility suddenly seems quite likely.

So my guess is that we take the server's lock in process_chk_conn(),
that we go down through this :
chk_report_conn_err()
-> set_server_check_status()
-> send_email_alert()
-> enqueue_email_alert()
-> enqueue_one_email_alert()

And here we spin on a lock which very likely is the same eventhough I
have no idea why. I suspect the way the queue or lock is retrieved
there is incorrect and explains the situation with a recursive lock,
but I'm afraid I don't really know since I don't know this part of
the code :-(

Note, it could also be that the queue's lock has never been initialized,
and makes the inner lock block there, with the server's lock held, causing
the second thread to spin and the 3rd one to wait for the barrier. I'll
have to ask for some help on this part. I'm pretty confident that 1.8 is
affected as well.

Thanks,
Willy
Hi Willy,

Op 12-6-2018 om 14:31 schreef Willy Tarreau:
> This one is not known yet, to the best of my knowledge, or at least
> not reported yet.
Okay :) I guess ill keep an eye on if it happens again.

Is there something i can do to find out more info if it happens again?
Or maybe before that build with more specific debug info so if it
happens again more info would be retrievable.?.
(My usual way of 'debugging' with the relatively easy to reproduce
issues is just cramming a lot of printf statements into the code until
something makes sense., but with a issue that only happens once in a
blue moon (sofar), that doesn't work very well...)

For the moment it hasn't happened again. i suspend/resume the vm it
happened on almost daily (that's running on my workstation witch is
shutdown overnight) the vm also has haproxy running on it and some other
stuff..

If it does happen again, would there be any other gdb information
helpful? Inspecting specific variables or creating a memory dump or
something.?  (please give a hint about the comment/option to call if
applicable/possible, i'm still a rookie with gdb..) p.s. i'm on FreeBSD
not sure if that matters for some of gdb's available options or something..

Would the complete configuration be helpfull? There is a lot of useless
stuff in there. because its my test/development test vm, and because of
lack of it happening again there is no good opportunity to shrink it
down to specific options/parts..

Thanks for your analysis, though i'm not sure what to do with the info
at the moment, i do hope you guys together can find a likely culprit
from the information given ;).

Regards,
PiBa-NL (Pieter)
Hi Pieter,

On Tue, Jun 12, 2018 at 08:09:08PM +0200, PiBa-NL wrote:
> Is there something i can do to find out more info if it happens again? Or
> maybe before that build with more specific debug info so if it happens again
> more info would be retrievable.?.
> (My usual way of 'debugging' with the relatively easy to reproduce issues is
> just cramming a lot of printf statements into the code until something makes
> sense., but with a issue that only happens once in a blue moon (sofar), that
> doesn't work very well...)

Building with -DDEBUG_THREAD and -DDEBUG_MEMORY can help to get a more
exploitable core file which will reveal where a given lock was already
taken. Be careful that it will slightly increase the CPU usage though.

> For the moment it hasn't happened again. i suspend/resume the vm it happened
> on almost daily (that's running on my workstation witch is shutdown
> overnight) the vm also has haproxy running on it and some other stuff..
>
> If it does happen again, would there be any other gdb information helpful?
> Inspecting specific variables or creating a memory dump or something.? 
> (please give a hint about the comment/option to call if applicable/possible,
> i'm still a rookie with gdb..) p.s. i'm on FreeBSD not sure if that matters
> for some of gdb's available options or something..

Creating a dump is often quite useful. You can do it with the command
"generate-core-file". A "bt full" and "info thread" will also be needed
as you did last time. I cannot guide you through the commands used to
debug the threads however because I don't know them, I think I remember
you need to print the contents of the lock itself (eg: "p foo->lock").

> Would the complete configuration be helpfull? There is a lot of useless
> stuff in there. because its my test/development test vm, and because of lack
> of it happening again there is no good opportunity to shrink it down to
> specific options/parts..

Well probably not, though if at least to keep it unmodified with the
unstripped executable and the upcoming core, that may be useful afterwards.

Thanks!
Willy
Sorry, only registered users may post in this forum.

Click here to login