Welcome! Log In Create A New Profile

Advanced

1.8.8 lua, xref_get_peer_and_lock hang / 100% cpu usage after restarting haproxy a few times

Posted by PiBa-NL 
Hi List,

Sometimes after a few 'restarts' of haproxy 1.8.8 (using -sf <pid>
parameter) one of the processes seems to get into a 'hanging' state
consuming 100% cpu..

In this configuration i'm using 'nbthread 1' not sure if this is related
to the corrupted task-tree from my other lua issue.?.
https://www.mail-archive.com/[email protected]/msg29801.html .?.

Also i'm using my new smtpmailqueue and serverhealthchecker lua scripts
(can be found on github.).. these probably 'contribute' to triggering
the condition.

Anything i can check / provide.?

(cant really share the config itself a.t.m. as its from our production
env, but it has like 15 backends with 1 server each, a little header
rewriting/insertion but nothing big..)

GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/haproxy...done.
Attaching to program: /usr/local/sbin/haproxy, process 68580
Reading symbols from /lib/libcrypt.so.5...(no debugging symbols
found)...done.
Reading symbols from /lib/libz.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done.
Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols
found)...done.
Reading symbols from /lib/libcrypto.so.8...(no debugging symbols
found)...done.
Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging
symbols found)...done.
Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done.
Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done.
Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols
found)...done.
[Switching to LWP 100340 of process 68580]
0x000000000044890b in xref_get_peer_and_lock (xref=0x80254b680) at
include/common/xref.h:34
34      include/common/xref.h: No such file or directory.
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) bt full
#0  xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:37
        local = 0x1
        remote = 0x7fffffffe210
#1  0x000000000044833d in hlua_socket_release (appctx=0x80254b600) at
src/hlua.c:1583
        peer = 0x0
#2  0x0000000000546519 in si_applet_release (si=0x802551dc8) at
include/proto/stream_interface.h:233
        appctx = 0x80254b600
#3  0x0000000000541a8d in stream_int_shutw_applet (si=0x802551dc8) at
src/stream_interface.c:1502
        ic = 0x802551b90
        oc = 0x802551bd0
#4  0x0000000000543809 in si_shutw (si=0x802551dc8) at
include/proto/stream_interface.h:320
No locals.
#5  0x0000000000543320 in stream_int_notify (si=0x802551dc8) at
src/stream_interface.c:465
        conn = 0x0
        ic = 0x802551b90
        oc = 0x802551bd0
#6  0x000000000044dc81 in hlua_socket_receive_yield (L=0x80243f2a8,
status=1, ctx=0) at src/hlua.c:1770
        socket = 0x8026a0028
        wanted = -1
        hlua = 0x8024d6780
        appctx = 0x80254b600
        len = 25
        nblk = 1
        blk1 = 0x8027da4d4 "501 5.1.3 Invalid address\r\nclass=\"tbl\"
width=\"100%\">\n<tr class=\"titre\"><th class=\"pxname\"
width=\"10%\"><a
name=\"SMTP-Mailroute_XXXX-25_EX01-125_tcp_ipv4\"></a><a class=px
href=\"#SMTP-Mailroute_XXXX-25_E"...
        len1 = 25
        blk2 = 0x100000005 <error: Cannot access memory at address
0x100000005>
        len2 = 3
        skip_at_end = 2
        oc = 0x802551bd0
        si = 0x802551dc8
        s = 0x802551b80
        peer = 0x80254b680
#7  0x0000000801880417 in ?? () from /usr/local/lib/liblua-5.3.so
---Type <return> to continue, or q <return> to quit---
No symbol table info available.
#8  0x000000080187f1a6 in ?? () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#9  0x0000000801880194 in lua_resume () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#10 0x00000000004370a9 in hlua_ctx_resume (lua=0x8024d6780,
yield_allowed=1) at src/hlua.c:1014
        ret = 39132616
        msg = 0x802647f00 ""
#11 0x00000000004406c7 in hlua_process_task (task=0x80242d460) at
src/hlua.c:5523
        hlua = 0x8024d6780
        status = HLUA_E_OK
#12 0x0000000000591177 in process_runnable_tasks () at src/task.c:231
        t = 0x80242d460
        i = 32767
        max_processed = 199
        rq_next = 0x0
        local_tasks = {0x0, 0x8add10 <applet_active_lock>, 0x0,
0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600,
0x100000000, 0x0, 0x0, 0x0,
          0x8add10 <applet_active_lock>, 0x0, 0x0, 0x7fffffffe758}
        local_tasks_count = 0
        final_tasks_count = 0
#13 0x0000000000510a5b in run_poll_loop () at src/haproxy.c:2399
        next = 588656154
        exp = 588656154
#14 0x000000000050e370 in run_thread_poll_loop (data=0x802446068) at
src/haproxy.c:2461
        start_lock = 0
        ptif = 0x8a0908 <per_thread_init_list>
        ptdf = 0x202
#15 0x000000000050aef2 in main (argc=10, argv=0x7fffffffec58) at
src/haproxy.c:3050
        tids = 0x802446068
        threads = 0x8024efd60
        i = 1
        err = 0
        retry = 200
        limit = {rlim_cur = 2285, rlim_max = 2285}
---Type <return> to continue, or q <return> to quit---
        errmsg =
"\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
        pidfd = 26
(gdb) info locals
tids = 0x802446068
threads = 0x8024efd60
i = 1
err = 0
retry = 200
limit = {rlim_cur = 2285, rlim_max = 2285}
errmsg =
"\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
pidfd = 26
Hi List, Thierry,

Actually this is not limited to restarts, and also happens with 1.9dev.
It now happens while haproxy was running for a while and no restart was
attempted while running/debugging in my NetBeans IDE..

Root cause imo is that hlua_socket_receive_yield and hlua_socket_release
both try and acquire the same lock.


For debugging purposes ive added some code in
hlua_socket_receive_yield(..) before the stream_int_notify:

    struct channel *ic2 = si_ic(si);
    struct channel *oc2 = si_oc(si);
    ha_warning("hlua_socket_receive_yield calling notify peer:%9x
si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state,
oc2->flags, ic2->flags);
    stream_int_notify(&s->si[0]);

And:
static void hlua_socket_release(struct appctx *appctx)
{
    struct xref *peer;
    if (appctx->ctx.hlua_cosocket.xref.peer > 1)
        ha_warning("hlua_socket_release peer: %9x %9x\n",
appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);
    else
        ha_warning("hlua_socket_release peer: %9x 0\n",
appctx->ctx.hlua_cosocket.xref);


And also added code in xref_get_peer_and_lock(..):
static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
{
    if (xref->peer == 1) {
        printf("  xref_get_peer_and_lock xref->peer == 1 \n");
    }


This produces the logging:

[WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify
peer:  2355590  si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024
[WARNING] 127/001127 (36579) : hlua_socket_release peer:         1 0
  xref_get_peer_and_lock xref->peer == 1

When xref_get_peer_and_lock is called with a parameter xref->peer value
of 1 then it looks like it keeps swapping 1 and 1 until it is not 1,
that never happens..

As for the oc2.flags it contains the CF_SHUTW_NOW.. of which im still
not 100% when that flag is exactly set to get a foolproof reproduction..
but it happens on pretty much a daily basis for me in production and in
test i can now usually trigger it after a few testruns with no actual
traffic passing along within the first minute of running (healthchecks
are performed on several backend, and a mail or 2 is send by the lua
code during this startup period..).. with the full production config..

Below the stacktrace that comes with it..

xref_get_peer_and_lock (xref=0x802355590) at
P:\Git\haproxy\include\common\xref.h:37
hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
si_applet_release (si=0x8023514c8) at
P:\Git\haproxy\include\proto\stream_interface.h:233
stream_int_shutw_applet (si=0x8023514c8) at
P:\Git\haproxy\src\stream_interface.c:1504
si_shutw (si=0x8023514c8) at
P:\Git\haproxy\include\proto\stream_interface.h:320
stream_int_notify (si=0x8023514c8) at
P:\Git\haproxy\src\stream_interface.c:465
hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at
P:\Git\haproxy\src\hlua.c:1789
?? () at null:
?? () at null:
lua_resume () at null:
hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at
P:\Git\haproxy\src\hlua.c:1022
hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
main (argc=4, argv=0x7fffffffea80) at P:\Git\haproxy\src\haproxy.c:3053

I don't yet have a any idea about the direction of a possible fix.. :(..
Issue is that probably the hlua_socket_release should happen, but it
doesnt know what socket / peer it should release at that point.. its in
the local peer variable of the hlua_socket_receive_yield funtion..
Should it be 'unlocked' before calling stream_int_notify??

Does anyone dare to take a stab at a creating a patch ? If so thanks in
advance ;)

Regards,
PiBa-NL (Pieter)


Op 3-5-2018 om 1:30 schreef PiBa-NL:
> Hi List,
>
> Sometimes after a few 'restarts' of haproxy 1.8.8 (using -sf <pid>
> parameter) one of the processes seems to get into a 'hanging' state
> consuming 100% cpu..
>
> In this configuration i'm using 'nbthread 1' not sure if this is
> related to the corrupted task-tree from my other lua issue.?.
> https://www.mail-archive.com/[email protected]/msg29801.html .?.
>
> Also i'm using my new smtpmailqueue and serverhealthchecker lua
> scripts (can be found on github.).. these probably 'contribute' to
> triggering the condition.
>
> Anything i can check / provide.?
>
> (cant really share the config itself a.t.m. as its from our production
> env, but it has like 15 backends with 1 server each, a little header
> rewriting/insertion but nothing big..)
>
> GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
> Copyright (C) 2017 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> http://gnu.org/licenses/gpl.html
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show
> copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-portbld-freebsd11.1".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> http://www.gnu.org/software/gdb/bugs/.
> Find the GDB manual and other documentation resources online at:
> http://www.gnu.org/software/gdb/documentation/.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/local/sbin/haproxy...done.
> Attaching to program: /usr/local/sbin/haproxy, process 68580
> Reading symbols from /lib/libcrypt.so.5...(no debugging symbols
> found)...done.
> Reading symbols from /lib/libz.so.6...(no debugging symbols
> found)...done.
> Reading symbols from /lib/libthr.so.3...(no debugging symbols
> found)...done.
> Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols
> found)...done.
> Reading symbols from /lib/libcrypto.so.8...(no debugging symbols
> found)...done.
> Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging
> symbols found)...done.
> Reading symbols from /lib/libm.so.5...(no debugging symbols
> found)...done.
> Reading symbols from /lib/libc.so.7...(no debugging symbols
> found)...done.
> Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols
> found)...done.
> [Switching to LWP 100340 of process 68580]
> 0x000000000044890b in xref_get_peer_and_lock (xref=0x80254b680) at
> include/common/xref.h:34
> 34      include/common/xref.h: No such file or directory.
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) next
> 45      in include/common/xref.h
> (gdb) next
> 46      in include/common/xref.h
> (gdb) next
> 34      in include/common/xref.h
> (gdb) next
> 37      in include/common/xref.h
> (gdb) bt full
> #0  xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:37
>         local = 0x1
>         remote = 0x7fffffffe210
> #1  0x000000000044833d in hlua_socket_release (appctx=0x80254b600) at
> src/hlua.c:1583
>         peer = 0x0
> #2  0x0000000000546519 in si_applet_release (si=0x802551dc8) at
> include/proto/stream_interface.h:233
>         appctx = 0x80254b600
> #3  0x0000000000541a8d in stream_int_shutw_applet (si=0x802551dc8) at
> src/stream_interface.c:1502
>         ic = 0x802551b90
>         oc = 0x802551bd0
> #4  0x0000000000543809 in si_shutw (si=0x802551dc8) at
> include/proto/stream_interface.h:320
> No locals.
> #5  0x0000000000543320 in stream_int_notify (si=0x802551dc8) at
> src/stream_interface.c:465
>         conn = 0x0
>         ic = 0x802551b90
>         oc = 0x802551bd0
> #6  0x000000000044dc81 in hlua_socket_receive_yield (L=0x80243f2a8,
> status=1, ctx=0) at src/hlua.c:1770
>         socket = 0x8026a0028
>         wanted = -1
>         hlua = 0x8024d6780
>         appctx = 0x80254b600
>         len = 25
>         nblk = 1
>         blk1 = 0x8027da4d4 "501 5.1.3 Invalid address\r\nclass=\"tbl\"
> width=\"100%\">\n<tr class=\"titre\"><th class=\"pxname\"
> width=\"10%\"><a
> name=\"SMTP-Mailroute_XXXX-25_EX01-125_tcp_ipv4\"></a><a class=px
> href=\"#SMTP-Mailroute_XXXX-25_E"...
>         len1 = 25
>         blk2 = 0x100000005 <error: Cannot access memory at address
> 0x100000005>
>         len2 = 3
>         skip_at_end = 2
>         oc = 0x802551bd0
>         si = 0x802551dc8
>         s = 0x802551b80
>         peer = 0x80254b680
> #7  0x0000000801880417 in ?? () from /usr/local/lib/liblua-5.3.so
> ---Type <return> to continue, or q <return> to quit---
> No symbol table info available.
> #8  0x000000080187f1a6 in ?? () from /usr/local/lib/liblua-5.3.so
> No symbol table info available.
> #9  0x0000000801880194 in lua_resume () from /usr/local/lib/liblua-5.3.so
> No symbol table info available.
> #10 0x00000000004370a9 in hlua_ctx_resume (lua=0x8024d6780,
> yield_allowed=1) at src/hlua.c:1014
>         ret = 39132616
>         msg = 0x802647f00 ""
> #11 0x00000000004406c7 in hlua_process_task (task=0x80242d460) at
> src/hlua.c:5523
>         hlua = 0x8024d6780
>         status = HLUA_E_OK
> #12 0x0000000000591177 in process_runnable_tasks () at src/task.c:231
>         t = 0x80242d460
>         i = 32767
>         max_processed = 199
>         rq_next = 0x0
>         local_tasks = {0x0, 0x8add10 <applet_active_lock>, 0x0,
> 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600,
> 0x100000000, 0x0, 0x0, 0x0,
>           0x8add10 <applet_active_lock>, 0x0, 0x0, 0x7fffffffe758}
>         local_tasks_count = 0
>         final_tasks_count = 0
> #13 0x0000000000510a5b in run_poll_loop () at src/haproxy.c:2399
>         next = 588656154
>         exp = 588656154
> #14 0x000000000050e370 in run_thread_poll_loop (data=0x802446068) at
> src/haproxy.c:2461
>         start_lock = 0
>         ptif = 0x8a0908 <per_thread_init_list>
>         ptdf = 0x202
> #15 0x000000000050aef2 in main (argc=10, argv=0x7fffffffec58) at
> src/haproxy.c:3050
>         tids = 0x802446068
>         threads = 0x8024efd60
>         i = 1
>         err = 0
>         retry = 200
>         limit = {rlim_cur = 2285, rlim_max = 2285}
> ---Type <return> to continue, or q <return> to quit---
>         errmsg =
> "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
>         pidfd = 26
> (gdb) info locals
> tids = 0x802446068
> threads = 0x8024efd60
> i = 1
> err = 0
> retry = 200
> limit = {rlim_cur = 2285, rlim_max = 2285}
> errmsg =
> "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
> pidfd = 26
>
> On 8 May 2018, at 00:33, PiBa-NL <[email protected]> wrote:
>
> Hi List, Thierry,
>
> Actually this is not limited to restarts, and also happens with 1.9dev. It now happens while haproxy was running for a while and no restart was attempted while running/debugging in my NetBeans IDE..
>
> Root cause imo is that hlua_socket_receive_yield and hlua_socket_release both try and acquire the same lock.
>
>
> For debugging purposes ive added some code in hlua_socket_receive_yield(..) before the stream_int_notify:
>
> struct channel *ic2 = si_ic(si);
> struct channel *oc2 = si_oc(si);
> ha_warning("hlua_socket_receive_yield calling notify peer:%9x si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state, oc2->flags, ic2->flags);
> stream_int_notify(&s->si[0]);
>
> And:
> static void hlua_socket_release(struct appctx *appctx)
> {
> struct xref *peer;
> if (appctx->ctx.hlua_cosocket.xref.peer > 1)
> ha_warning("hlua_socket_release peer: %9x %9x\n", appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);
> else
> ha_warning("hlua_socket_release peer: %9x 0\n", appctx->ctx.hlua_cosocket.xref);
>
>
> And also added code in xref_get_peer_and_lock(..):
> static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
> {
> if (xref->peer == 1) {
> printf(" xref_get_peer_and_lock xref->peer == 1 \n");
> }
>
>
> This produces the logging:
>
> [WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify peer: 2355590 si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024
> [WARNING] 127/001127 (36579) : hlua_socket_release peer: 1 0
> xref_get_peer_and_lock xref->peer == 1
>
> When xref_get_peer_and_lock is called with a parameter xref->peer value of 1 then it looks like it keeps swapping 1 and 1 until it is not 1, that never happens..
>
> As for the oc2.flags it contains the CF_SHUTW_NOW.. of which im still not 100% when that flag is exactly set to get a foolproof reproduction.. but it happens on pretty much a daily basis for me in production and in test i can now usually trigger it after a few testruns with no actual traffic passing along within the first minute of running (healthchecks are performed on several backend, and a mail or 2 is send by the lua code during this startup period..).. with the full production config..
>
> Below the stacktrace that comes with it..
>
> xref_get_peer_and_lock (xref=0x802355590) at P:\Git\haproxy\include\common\xref.h:37
> hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
> si_applet_release (si=0x8023514c8) at P:\Git\haproxy\include\proto\stream_interface.h:233
> stream_int_shutw_applet (si=0x8023514c8) at P:\Git\haproxy\src\stream_interface.c:1504
> si_shutw (si=0x8023514c8) at P:\Git\haproxy\include\proto\stream_interface.h:320
> stream_int_notify (si=0x8023514c8) at P:\Git\haproxy\src\stream_interface.c:465
> hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at P:\Git\haproxy\src\hlua.c:1789
> ?? () at null:
> ?? () at null:
> lua_resume () at null:
> hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at P:\Git\haproxy\src\hlua.c:1022
> hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
> process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
> run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
> run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
> main (argc=4, argv=0x7fffffffea80) at P:\Git\haproxy\src\haproxy.c:3053
>
> I don't yet have a any idea about the direction of a possible fix.. :(..
> Issue is that probably the hlua_socket_release should happen, but it doesnt know what socket / peer it should release at that point.. its in the local peer variable of the hlua_socket_receive_yield funtion.. Should it be 'unlocked' before calling stream_int_notify?


This part of the code is tricky.

Just a precision:

- I write the “socket” session an haproxy classic session which is created by
Lua code for performing network I/O.

- I write “Lua” session for a classic HAProxy session which executes “Lua”, and
this “Lua” code communicates with the “socket” session.

This lock have two goal: maintain the link between the Lua session and the “socket”
session in a coherent state and ensure the access of the “socket” buffer from the
Lua session. When the lock is acquired, the “socket” session can’t be cleaned.

I resume:

- “Lua” session is executed

- hlua_socket_receive_yield() acquire a lock

- hlua_socket_receive_yield() call stream_int_notify()
NOTE: the function hlua_socket_receive_yield() can’t close the “socket” session

- stream_int_notify() is executed from “Lua” session.

- stream_int_notify() try to close the “socket” session

- stream_int_notify() try to acquire the lock which is already acquired by the
caller function hlua_socket_receive_yield().

It seems a dead lock, but you observe a loop.

The end of the function “hlua_socket_receive_yield()” seems suspect. You’re right:
the lock can be released before the stream_int_notify(). Unfortunately, the
stream_int_notify() is applied to the variable “s” which is a “socket” session. And
this session can be cleared between the unlock and the stream_int_notify().

A first idea is moving the stream_int_notify() from direct function to a task. It is
not a simple job, and this need more brainstorm.

BR,
Thierry



> Does anyone dare to take a stab at a creating a patch ? If so thanks in advance ;)
>
> Regards,
> PiBa-NL (Pieter)
>
>
> Op 3-5-2018 om 1:30 schreef PiBa-NL:
>> Hi List,
>>
>> Sometimes after a few 'restarts' of haproxy 1.8.8 (using -sf <pid> parameter) one of the processes seems to get into a 'hanging' state consuming 100% cpu..
>>
>> In this configuration i'm using 'nbthread 1' not sure if this is related to the corrupted task-tree from my other lua issue.?. https://www.mail-archive.com/[email protected]/msg29801.html .?.
>>
>> Also i'm using my new smtpmailqueue and serverhealthchecker lua scripts (can be found on github.).. these probably 'contribute' to triggering the condition.
>>
>> Anything i can check / provide.?
>>
>> (cant really share the config itself a.t.m. as its from our production env, but it has like 15 backends with 1 server each, a little header rewriting/insertion but nothing big..)
>>
>> GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
>> Copyright (C) 2017 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-portbld-freebsd11.1".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> http://www.gnu.org/software/gdb/bugs/.
>> Find the GDB manual and other documentation resources online at:
>> http://www.gnu.org/software/gdb/documentation/.
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from /usr/local/sbin/haproxy...done.
>> Attaching to program: /usr/local/sbin/haproxy, process 68580
>> Reading symbols from /lib/libcrypt.so.5...(no debugging symbols found)...done.
>> Reading symbols from /lib/libz.so.6...(no debugging symbols found)...done.
>> Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done.
>> Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols found)...done.
>> Reading symbols from /lib/libcrypto.so.8...(no debugging symbols found)...done.
>> Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging symbols found)...done.
>> Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done.
>> Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done.
>> Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols found)...done.
>> [Switching to LWP 100340 of process 68580]
>> 0x000000000044890b in xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:34
>> 34 include/common/xref.h: No such file or directory.
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) next
>> 45 in include/common/xref.h
>> (gdb) next
>> 46 in include/common/xref.h
>> (gdb) next
>> 34 in include/common/xref.h
>> (gdb) next
>> 37 in include/common/xref.h
>> (gdb) bt full
>> #0 xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:37
>> local = 0x1
>> remote = 0x7fffffffe210
>> #1 0x000000000044833d in hlua_socket_release (appctx=0x80254b600) at src/hlua.c:1583
>> peer = 0x0
>> #2 0x0000000000546519 in si_applet_release (si=0x802551dc8) at include/proto/stream_interface.h:233
>> appctx = 0x80254b600
>> #3 0x0000000000541a8d in stream_int_shutw_applet (si=0x802551dc8) at src/stream_interface.c:1502
>> ic = 0x802551b90
>> oc = 0x802551bd0
>> #4 0x0000000000543809 in si_shutw (si=0x802551dc8) at include/proto/stream_interface.h:320
>> No locals.
>> #5 0x0000000000543320 in stream_int_notify (si=0x802551dc8) at src/stream_interface.c:465
>> conn = 0x0
>> ic = 0x802551b90
>> oc = 0x802551bd0
>> #6 0x000000000044dc81 in hlua_socket_receive_yield (L=0x80243f2a8, status=1, ctx=0) at src/hlua.c:1770
>> socket = 0x8026a0028
>> wanted = -1
>> hlua = 0x8024d6780
>> appctx = 0x80254b600
>> len = 25
>> nblk = 1
>> blk1 = 0x8027da4d4 "501 5.1.3 Invalid address\r\nclass=\"tbl\" width=\"100%\">\n<tr class=\"titre\"><th class=\"pxname\" width=\"10%\"><a name=\"SMTP-Mailroute_XXXX-25_EX01-125_tcp_ipv4\"></a><a class=px href=\"#SMTP-Mailroute_XXXX-25_E"...
>> len1 = 25
>> blk2 = 0x100000005 <error: Cannot access memory at address 0x100000005>
>> len2 = 3
>> skip_at_end = 2
>> oc = 0x802551bd0
>> si = 0x802551dc8
>> s = 0x802551b80
>> peer = 0x80254b680
>> #7 0x0000000801880417 in ?? () from /usr/local/lib/liblua-5.3.so
>> ---Type <return> to continue, or q <return> to quit---
>> No symbol table info available.
>> #8 0x000000080187f1a6 in ?? () from /usr/local/lib/liblua-5.3.so
>> No symbol table info available.
>> #9 0x0000000801880194 in lua_resume () from /usr/local/lib/liblua-5.3.so
>> No symbol table info available.
>> #10 0x00000000004370a9 in hlua_ctx_resume (lua=0x8024d6780, yield_allowed=1) at src/hlua.c:1014
>> ret = 39132616
>> msg = 0x802647f00 ""
>> #11 0x00000000004406c7 in hlua_process_task (task=0x80242d460) at src/hlua.c:5523
>> hlua = 0x8024d6780
>> status = HLUA_E_OK
>> #12 0x0000000000591177 in process_runnable_tasks () at src/task.c:231
>> t = 0x80242d460
>> i = 32767
>> max_processed = 199
>> rq_next = 0x0
>> local_tasks = {0x0, 0x8add10 <applet_active_lock>, 0x0, 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x100000000, 0x0, 0x0, 0x0,
>> 0x8add10 <applet_active_lock>, 0x0, 0x0, 0x7fffffffe758}
>> local_tasks_count = 0
>> final_tasks_count = 0
>> #13 0x0000000000510a5b in run_poll_loop () at src/haproxy.c:2399
>> next = 588656154
>> exp = 588656154
>> #14 0x000000000050e370 in run_thread_poll_loop (data=0x802446068) at src/haproxy.c:2461
>> start_lock = 0
>> ptif = 0x8a0908 <per_thread_init_list>
>> ptdf = 0x202
>> #15 0x000000000050aef2 in main (argc=10, argv=0x7fffffffec58) at src/haproxy.c:3050
>> tids = 0x802446068
>> threads = 0x8024efd60
>> i = 1
>> err = 0
>> retry = 200
>> limit = {rlim_cur = 2285, rlim_max = 2285}
>> ---Type <return> to continue, or q <return> to quit---
>> errmsg = "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
>> pidfd = 26
>> (gdb) info locals
>> tids = 0x802446068
>> threads = 0x8024efd60
>> i = 1
>> err = 0
>> retry = 200
>> limit = {rlim_cur = 2285, rlim_max = 2285}
>> errmsg = "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
>> pidfd = 26
>>
>
> On 9 May 2018, at 18:30, Thierry Fournier <[email protected]> wrote:
>
>
>
>> On 8 May 2018, at 00:33, PiBa-NL <[email protected]> wrote:
>>
>> Hi List, Thierry,
>>
>> Actually this is not limited to restarts, and also happens with 1.9dev. It now happens while haproxy was running for a while and no restart was attempted while running/debugging in my NetBeans IDE..
>>
>> Root cause imo is that hlua_socket_receive_yield and hlua_socket_release both try and acquire the same lock.
>>
>>
>> For debugging purposes ive added some code in hlua_socket_receive_yield(..) before the stream_int_notify:
>>
>> struct channel *ic2 = si_ic(si);
>> struct channel *oc2 = si_oc(si);
>> ha_warning("hlua_socket_receive_yield calling notify peer:%9x si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state, oc2->flags, ic2->flags);
>> stream_int_notify(&s->si[0]);
>>
>> And:
>> static void hlua_socket_release(struct appctx *appctx)
>> {
>> struct xref *peer;
>> if (appctx->ctx.hlua_cosocket.xref.peer > 1)
>> ha_warning("hlua_socket_release peer: %9x %9x\n", appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);
>> else
>> ha_warning("hlua_socket_release peer: %9x 0\n", appctx->ctx.hlua_cosocket.xref);
>>
>>
>> And also added code in xref_get_peer_and_lock(..):
>> static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
>> {
>> if (xref->peer == 1) {
>> printf(" xref_get_peer_and_lock xref->peer == 1 \n");
>> }
>>
>>
>> This produces the logging:
>>
>> [WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify peer: 2355590 si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024
>> [WARNING] 127/001127 (36579) : hlua_socket_release peer: 1 0
>> xref_get_peer_and_lock xref->peer == 1
>>
>> When xref_get_peer_and_lock is called with a parameter xref->peer value of 1 then it looks like it keeps swapping 1 and 1 until it is not 1, that never happens..
>>
>> As for the oc2.flags it contains the CF_SHUTW_NOW.. of which im still not 100% when that flag is exactly set to get a foolproof reproduction.. but it happens on pretty much a daily basis for me in production and in test i can now usually trigger it after a few testruns with no actual traffic passing along within the first minute of running (healthchecks are performed on several backend, and a mail or 2 is send by the lua code during this startup period..).. with the full production config..
>>
>> Below the stacktrace that comes with it..
>>
>> xref_get_peer_and_lock (xref=0x802355590) at P:\Git\haproxy\include\common\xref.h:37
>> hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
>> si_applet_release (si=0x8023514c8) at P:\Git\haproxy\include\proto\stream_interface.h:233
>> stream_int_shutw_applet (si=0x8023514c8) at P:\Git\haproxy\src\stream_interface.c:1504
>> si_shutw (si=0x8023514c8) at P:\Git\haproxy\include\proto\stream_interface.h:320
>> stream_int_notify (si=0x8023514c8) at P:\Git\haproxy\src\stream_interface.c:465
>> hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at P:\Git\haproxy\src\hlua.c:1789
>> ?? () at null:
>> ?? () at null:
>> lua_resume () at null:
>> hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at P:\Git\haproxy\src\hlua.c:1022
>> hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
>> process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
>> run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
>> run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
>> main (argc=4, argv=0x7fffffffea80) at P:\Git\haproxy\src\haproxy.c:3053
>>
>> I don't yet have a any idea about the direction of a possible fix.. :(..
>> Issue is that probably the hlua_socket_release should happen, but it doesnt know what socket / peer it should release at that point.. its in the local peer variable of the hlua_socket_receive_yield funtion.. Should it be 'unlocked' before calling stream_int_notify?
>
>
> This part of the code is tricky.
>
> Just a precision:
>
> - I write the “socket” session an haproxy classic session which is created by
> Lua code for performing network I/O.
>
> - I write “Lua” session for a classic HAProxy session which executes “Lua”, and
> this “Lua” code communicates with the “socket” session.
>
> This lock have two goal: maintain the link between the Lua session and the “socket”
> session in a coherent state and ensure the access of the “socket” buffer from the
> Lua session. When the lock is acquired, the “socket” session can’t be cleaned.
>
> I resume:
>
> - “Lua” session is executed
>
> - hlua_socket_receive_yield() acquire a lock
>
> - hlua_socket_receive_yield() call stream_int_notify()
> NOTE: the function hlua_socket_receive_yield() can’t close the “socket” session
>
> - stream_int_notify() is executed from “Lua” session.
>
> - stream_int_notify() try to close the “socket” session
>
> - stream_int_notify() try to acquire the lock which is already acquired by the
> caller function hlua_socket_receive_yield().
>
> It seems a dead lock, but you observe a loop.
>
> The end of the function “hlua_socket_receive_yield()” seems suspect. You’re right:
> the lock can be released before the stream_int_notify(). Unfortunately, the
> stream_int_notify() is applied to the variable “s” which is a “socket” session. And
> this session can be cleared between the unlock and the stream_int_notify().
>
> A first idea is moving the stream_int_notify() from direct function to a task. It is
> not a simple job, and this need more brainstorm.


PS: It is very complicated to reproduce the problem. I suppose that it appens
when the remote server close the connection just after the begin of the execution
of the Lua read() function. If the connection is closed before the Lua execution,
it is detected by HAProxy, and process is clean.

Maybe the problem can be reproduced adding a sleep() function in the c code of
the hlua_socket_receive_yield() function. Like this:

__LJMP static int hlua_socket_receive_yield(struct lua_State *L, int status, lua_KContext ctx)
{
// [...] function variables

fprintf(stderr, “Now, I sleep 5S, close the server connection in order to reproduce the bug.\n”);
sleep(5);

// [...] function main part
}

I try to reproduce it later.

Thierry


>
> BR,
> Thierry
>
>
>
>> Does anyone dare to take a stab at a creating a patch ? If so thanks in advance ;)
>>
>> Regards,
>> PiBa-NL (Pieter)
>>
>>
>> Op 3-5-2018 om 1:30 schreef PiBa-NL:
>>> Hi List,
>>>
>>> Sometimes after a few 'restarts' of haproxy 1.8.8 (using -sf <pid> parameter) one of the processes seems to get into a 'hanging' state consuming 100% cpu..
>>>
>>> In this configuration i'm using 'nbthread 1' not sure if this is related to the corrupted task-tree from my other lua issue.?. https://www.mail-archive.com/[email protected]/msg29801.html .?.
>>>
>>> Also i'm using my new smtpmailqueue and serverhealthchecker lua scripts (can be found on github.).. these probably 'contribute' to triggering the condition.
>>>
>>> Anything i can check / provide.?
>>>
>>> (cant really share the config itself a.t.m. as its from our production env, but it has like 15 backends with 1 server each, a little header rewriting/insertion but nothing big..)
>>>
>>> GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
>>> Copyright (C) 2017 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-portbld-freebsd11.1".
>>> Type "show configuration" for configuration details.
>>> For bug reporting instructions, please see:
>>> http://www.gnu.org/software/gdb/bugs/.
>>> Find the GDB manual and other documentation resources online at:
>>> http://www.gnu.org/software/gdb/documentation/.
>>> For help, type "help".
>>> Type "apropos word" to search for commands related to "word"...
>>> Reading symbols from /usr/local/sbin/haproxy...done.
>>> Attaching to program: /usr/local/sbin/haproxy, process 68580
>>> Reading symbols from /lib/libcrypt.so.5...(no debugging symbols found)...done.
>>> Reading symbols from /lib/libz.so.6...(no debugging symbols found)...done.
>>> Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done.
>>> Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols found)...done.
>>> Reading symbols from /lib/libcrypto.so.8...(no debugging symbols found)...done.
>>> Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging symbols found)...done.
>>> Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done.
>>> Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done.
>>> Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols found)...done.
>>> [Switching to LWP 100340 of process 68580]
>>> 0x000000000044890b in xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:34
>>> 34 include/common/xref.h: No such file or directory.
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) next
>>> 45 in include/common/xref.h
>>> (gdb) next
>>> 46 in include/common/xref.h
>>> (gdb) next
>>> 34 in include/common/xref.h
>>> (gdb) next
>>> 37 in include/common/xref.h
>>> (gdb) bt full
>>> #0 xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:37
>>> local = 0x1
>>> remote = 0x7fffffffe210
>>> #1 0x000000000044833d in hlua_socket_release (appctx=0x80254b600) at src/hlua.c:1583
>>> peer = 0x0
>>> #2 0x0000000000546519 in si_applet_release (si=0x802551dc8) at include/proto/stream_interface.h:233
>>> appctx = 0x80254b600
>>> #3 0x0000000000541a8d in stream_int_shutw_applet (si=0x802551dc8) at src/stream_interface.c:1502
>>> ic = 0x802551b90
>>> oc = 0x802551bd0
>>> #4 0x0000000000543809 in si_shutw (si=0x802551dc8) at include/proto/stream_interface.h:320
>>> No locals.
>>> #5 0x0000000000543320 in stream_int_notify (si=0x802551dc8) at src/stream_interface.c:465
>>> conn = 0x0
>>> ic = 0x802551b90
>>> oc = 0x802551bd0
>>> #6 0x000000000044dc81 in hlua_socket_receive_yield (L=0x80243f2a8, status=1, ctx=0) at src/hlua.c:1770
>>> socket = 0x8026a0028
>>> wanted = -1
>>> hlua = 0x8024d6780
>>> appctx = 0x80254b600
>>> len = 25
>>> nblk = 1
>>> blk1 = 0x8027da4d4 "501 5.1.3 Invalid address\r\nclass=\"tbl\" width=\"100%\">\n<tr class=\"titre\"><th class=\"pxname\" width=\"10%\"><a name=\"SMTP-Mailroute_XXXX-25_EX01-125_tcp_ipv4\"></a><a class=px href=\"#SMTP-Mailroute_XXXX-25_E"...
>>> len1 = 25
>>> blk2 = 0x100000005 <error: Cannot access memory at address 0x100000005>
>>> len2 = 3
>>> skip_at_end = 2
>>> oc = 0x802551bd0
>>> si = 0x802551dc8
>>> s = 0x802551b80
>>> peer = 0x80254b680
>>> #7 0x0000000801880417 in ?? () from /usr/local/lib/liblua-5.3.so
>>> ---Type <return> to continue, or q <return> to quit---
>>> No symbol table info available.
>>> #8 0x000000080187f1a6 in ?? () from /usr/local/lib/liblua-5.3.so
>>> No symbol table info available.
>>> #9 0x0000000801880194 in lua_resume () from /usr/local/lib/liblua-5.3.so
>>> No symbol table info available.
>>> #10 0x00000000004370a9 in hlua_ctx_resume (lua=0x8024d6780, yield_allowed=1) at src/hlua.c:1014
>>> ret = 39132616
>>> msg = 0x802647f00 ""
>>> #11 0x00000000004406c7 in hlua_process_task (task=0x80242d460) at src/hlua.c:5523
>>> hlua = 0x8024d6780
>>> status = HLUA_E_OK
>>> #12 0x0000000000591177 in process_runnable_tasks () at src/task.c:231
>>> t = 0x80242d460
>>> i = 32767
>>> max_processed = 199
>>> rq_next = 0x0
>>> local_tasks = {0x0, 0x8add10 <applet_active_lock>, 0x0, 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x100000000, 0x0, 0x0, 0x0,
>>> 0x8add10 <applet_active_lock>, 0x0, 0x0, 0x7fffffffe758}
>>> local_tasks_count = 0
>>> final_tasks_count = 0
>>> #13 0x0000000000510a5b in run_poll_loop () at src/haproxy.c:2399
>>> next = 588656154
>>> exp = 588656154
>>> #14 0x000000000050e370 in run_thread_poll_loop (data=0x802446068) at src/haproxy.c:2461
>>> start_lock = 0
>>> ptif = 0x8a0908 <per_thread_init_list>
>>> ptdf = 0x202
>>> #15 0x000000000050aef2 in main (argc=10, argv=0x7fffffffec58) at src/haproxy.c:3050
>>> tids = 0x802446068
>>> threads = 0x8024efd60
>>> i = 1
>>> err = 0
>>> retry = 200
>>> limit = {rlim_cur = 2285, rlim_max = 2285}
>>> ---Type <return> to continue, or q <return> to quit---
>>> errmsg = "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
>>> pidfd = 26
>>> (gdb) info locals
>>> tids = 0x802446068
>>> threads = 0x8024efd60
>>> i = 1
>>> err = 0
>>> retry = 200
>>> limit = {rlim_cur = 2285, rlim_max = 2285}
>>> errmsg = "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
>>> pidfd = 26
Hi Thierry,

Op 9-5-2018 om 18:30 schreef Thierry Fournier:
> It seems a dead lock, but you observe a loop.
Effectively it is a deadlock, it keeps looping over these few lines of
code below from xref.h
http://git.haproxy.org/?p=haproxy.git;a=blob_plain;f=include/common/xref.h;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b..
The XCHG just swaps the 2 values (both are '1') and continues on, then
the local==BUSY check is true it loops and swaps 1 and 1 again, and the
circle continues..

Thanks for looking into it :) Ill try and get 'simpler' reproduction
with some well placed sleep() as you suggest.
Regards,
PiBa-NL

http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b

31
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l31>;
        while (1) {
32
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l32>;

33
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l33>;
                /* Get the local pointer to the peer. */
34
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l34>;
                local = HA_ATOMIC_XCHG(&xref->peer, XREF_BUSY);
35
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l35>;

36
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l36>;
                /* If the local pointer is NULL, the peer no longer exists. */
37
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l37>;
                if (local == NULL) {
38
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l38>;
                        xref->peer = NULL;
39
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l39>;
                        return NULL;
40
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l40>;
                }
41
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l41>;

42
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l42>;
                /* If the local pointeru is BUSY, the peer try to acquire the
43
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l43>;
                 * lock. We retry the process.
44
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l44>;
                 */
45
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l45>;
                if (local == XREF_BUSY)
46
<http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b#l46>;
                        continue;
Hi Thierry,

Okay found a simple reproduction with tcploop with a 6 second delay in
there and a short sleep before calling kqueue.

../tcploop 81 L W N20 A R S:"response1\r\n" R P6000 S:"response2\r\n" R [
F K ]

     gettimeofday(&before_poll, NULL);
+    usleep(100);
     status = kevent(kqueue_fd[tid], // int kq

Together with the attached config the issue is reproduced every time the
/myapplet url is requested.

Output as below:
00000000:stats.clihdr[0007:ffffffff]: Accept-Language:
nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7
[info] 130/195936 (76770) : Wait for it..
[info] 130/195937 (76770) : Wait response 2..
  xref_get_peer_and_lock xref->peer == 1

Hope this helps to come up with a solution..

Thanks in advance,
PiBa-NL (Pieter)

Op 9-5-2018 om 19:47 schreef PiBa-NL:
> Hi Thierry,
>
> Op 9-5-2018 om 18:30 schreef Thierry Fournier:
>> It seems a dead lock, but you observe a loop.
> Effectively it is a deadlock, it keeps looping over these few lines of
> code below from xref.h
> http://git.haproxy.org/?p=haproxy.git;a=blob_plain;f=include/common/xref.h;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b..
> The XCHG just swaps the 2 values (both are '1') and continues on, then
> the local==BUSY check is true it loops and swaps 1 and 1 again, and
> the circle continues..
>
> Thanks for looking into it :) Ill try and get 'simpler' reproduction
> with some well placed sleep() as you suggest.
> Regards,
> PiBa-NL
>
> http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b
>

function myapplet(applet)

core.Info("Wait for it..")
core.sleep(1)

local result = ""
con = core.tcp()
con:settimeout(1)
con:connect("127.0.0.1",81)
con:send("Test1\r\n")
r = con:receive("*l")
result = result .. tostring(r)
con:send("Test\r\n")
core.Info("Wait response 2..")
r2 = con:receive("*l")
result = result .. tostring(r2)
core.Info("close..")
con:close()
core.Info("DONE")

response = "Finished"
applet:add_header("Server", "haproxy/webstats")
applet:add_header("Content-Type", "text/html")
applet:start_response()
applet:send(response)

end

core.register_service("myapplet", "http", myapplet)
global
lua-load /root/haproxytest/hang_timeout_close.lua

defaults
mode http
timeout connect 5s
timeout client 30s
timeout server 60s

frontend stats
bind *:80
stats enable
stats admin if TRUE
stats refresh 1s

acl myapplet path -m beg /myapplet
http-request use-service lua.myapplet if myapplet

include/common/xref.h | 4 ++++
src/ev_kqueue.c | 1 +
2 files changed, 5 insertions(+)

diff --git a/include/common/xref.h b/include/common/xref.h
index 6dfa7b6..e6905a1 100644
--- a/include/common/xref.h
+++ b/include/common/xref.h
@@ -25,6 +25,10 @@ static inline void xref_create(struct xref *xref_a, struct xref *xref_b)

static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
{
+ if (xref->peer == 1) {
+ printf(" xref_get_peer_and_lock xref->peer == 1 \n");
+ }
+
struct xref *local;
struct xref *remote;

diff --git a/src/ev_kqueue.c b/src/ev_kqueue.c
index bf7f666..732f20d 100644
--- a/src/ev_kqueue.c
+++ b/src/ev_kqueue.c
@@ -145,6 +145,7 @@ REGPRM2 static void _do_poll(struct poller *p, int exp)

fd = global.tune.maxpollevents;
gettimeofday(&before_poll, NULL);
+ usleep(100);
status = kevent(kqueue_fd[tid], // int kq
NULL, // const struct kevent *changelist
0, // int nchanges
Hi Pieter,

Could you test the attached patch ? It seems to fix the problem, but I
have some doubts about the reliability of the patch.

Thierry




On Fri, 11 May 2018 20:13:25 +0200
PiBa-NL <[email protected]> wrote:

> Hi Thierry,
>
> Okay found a simple reproduction with tcploop with a 6 second delay in
> there and a short sleep before calling kqueue.
>
> ./tcploop 81 L W N20 A R S:"response1\r\n" R P6000 S:"response2\r\n" R [
> F K ]
>
>      gettimeofday(&before_poll, NULL);
> +    usleep(100);
>      status = kevent(kqueue_fd[tid], // int kq
>
> Together with the attached config the issue is reproduced every time the
> /myapplet url is requested.
>
> Output as below:
> 00000000:stats.clihdr[0007:ffffffff]: Accept-Language:
> nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7
> [info] 130/195936 (76770) : Wait for it..
> [info] 130/195937 (76770) : Wait response 2..
>   xref_get_peer_and_lock xref->peer == 1
>
> Hope this helps to come up with a solution..
>
> Thanks in advance,
> PiBa-NL (Pieter)
>
> Op 9-5-2018 om 19:47 schreef PiBa-NL:
> > Hi Thierry,
> >
> > Op 9-5-2018 om 18:30 schreef Thierry Fournier:
> >> It seems a dead lock, but you observe a loop.
> > Effectively it is a deadlock, it keeps looping over these few lines of
> > code below from xref.h
> > http://git.haproxy.org/?p=haproxy.git;a=blob_plain;f=include/common/xref.h;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b..
> > The XCHG just swaps the 2 values (both are '1') and continues on, then
> > the local==BUSY check is true it loops and swaps 1 and 1 again, and
> > the circle continues..
> >
> > Thanks for looking into it :) Ill try and get 'simpler' reproduction
> > with some well placed sleep() as you suggest.
> > Regards,
> > PiBa-NL
> >
> > http://git.haproxy.org/?p=haproxy.git;a=blob;f=include/common/xref.h;h=6dfa7b62758dfaebe12d25f66aaa858dc873a060;hb=29d698040d6bb56b29c036aeba05f0d52d8ce94b
> >
>
Hi Thierry,

Op 18-5-2018 om 20:00 schreef Thierry FOURNIER:
> Hi Pieter,
>
> Could you test the attached patch ? It seems to fix the problem, but I
> have some doubts about the reliability of the patch.
>
> Thierry
The crash seems 'fixed' indeed.. but the lua scipt below now takes
5seconds instead of 150ms.

Regards,
PiBa-NL (Pieter)

con = core.tcp()
con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
request = [[GET / HTTP/1.0

]]
con:send(request)
res = con:receive("*a")
con:close()
On Fri, 18 May 2018 22:17:00 +0200
PiBa-NL <[email protected]> wrote:

> Hi Thierry,
>
> Op 18-5-2018 om 20:00 schreef Thierry FOURNIER:
> > Hi Pieter,
> >
> > Could you test the attached patch ? It seems to fix the problem, but I
> > have some doubts about the reliability of the patch.
> >
> > Thierry
> The crash seems 'fixed' indeed.. but the lua scipt below now takes
> 5seconds instead of 150ms.
>
> Regards,
> PiBa-NL (Pieter)
>
> con = core.tcp()
> con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
> request = [[GET / HTTP/1.0
>
> ]]
> con:send(request)
> res = con:receive("*a")
> con:close()

One bug can hide another bug :-) I catch both. Could you test ?

If the result is positive I join also the backport for 1.6 and 1.7

Thierry
Hi Thierry,

Op 25-5-2018 om 15:40 schreef Thierry FOURNIER:
> On Fri, 18 May 2018 22:17:00 +0200
> PiBa-NL <[email protected]> wrote:
>
>> Hi Thierry,
>>
>> Op 18-5-2018 om 20:00 schreef Thierry FOURNIER:
>>> Hi Pieter,
>>>
>>> Could you test the attached patch ? It seems to fix the problem, but I
>>> have some doubts about the reliability of the patch.
>>>
>>> Thierry
>> The crash seems 'fixed' indeed.. but the lua scipt below now takes
>> 5seconds instead of 150ms.
>>
>> Regards,
>> PiBa-NL (Pieter)
>>
>> con = core.tcp()
>> con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
>> request = [[GET / HTTP/1.0
>>
>> ]]
>> con:send(request)
>> res = con:receive("*a")
>> con:close()
> One bug can hide another bug :-) I catch both. Could you test ?
>
> If the result is positive I join also the backport for 1.6 and 1.7
>
> Thierry

Thanks, seems both the hang and the 2nd uncovered task schedule issue
are fixed now (google website response is received/processed fast
again). I've done some testing, and installed the updated/patched
version on my production box last night. At the moment it still works
properly. Activated my lua healthchecker and mailer tasks and enabled 3
threads again.. Lets see how it go's :), but as said for now it seems to
work alright.

Does the second issue you found and fixed clear the initial 'doubts
about the reliability' of the first one.? Or did you have a particular
possibly problematic scenario in mind that i could try and check for?

For the moment i think it is more 'reliable / stable' with the patches
than without. So in that regard i think they could be merged.

There seems to be a issue with tcp:settimeout() though. But ill put that
in a different mail-threat as im not sure its related and the issues
where this thread started are fixed.

Regards,

PiBa-NL (Pieter)
> On 26 May 2018, at 19:47, PiBa-NL <[email protected]> wrote:
>
> Hi Thierry,
>
> Op 25-5-2018 om 15:40 schreef Thierry FOURNIER:
>> On Fri, 18 May 2018 22:17:00 +0200
>> PiBa-NL <[email protected]> wrote:
>>
>>> Hi Thierry,
>>>
>>> Op 18-5-2018 om 20:00 schreef Thierry FOURNIER:
>>>> Hi Pieter,
>>>>
>>>> Could you test the attached patch ? It seems to fix the problem, but I
>>>> have some doubts about the reliability of the patch.
>>>>
>>>> Thierry
>>> The crash seems 'fixed' indeed.. but the lua scipt below now takes
>>> 5seconds instead of 150ms.
>>>
>>> Regards,
>>> PiBa-NL (Pieter)
>>>
>>> con = core.tcp()
>>> con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
>>> request = [[GET / HTTP/1.0
>>>
>>> ]]
>>> con:send(request)
>>> res = con:receive("*a")
>>> con:close()
>> One bug can hide another bug :-) I catch both. Could you test ?
>>
>> If the result is positive I join also the backport for 1.6 and 1.7
>>
>> Thierry
>
> Thanks, seems both the hang and the 2nd uncovered task schedule issue are fixed now (google website response is received/processed fast again). I've done some testing, and installed the updated/patched version on my production box last night. At the moment it still works properly. Activated my lua healthchecker and mailer tasks and enabled 3 threads again.. Lets see how it go's :), but as said for now it seems to work alright.
>
> Does the second issue you found and fixed clear the initial 'doubts about the reliability' of the first one.? Or did you have a particular possibly problematic scenario in mind that i could try and check for?


Yes. On the first version, I just tried something without
confidence, and this version I read all the socket code,
so I’m confident.


> For the moment i think it is more 'reliable / stable' with the patches than without. So in that regard i think they could be merged.
>
> There seems to be a issue with tcp:settimeout() though. But ill put that in a different mail-threat as im not sure its related and the issues where this thread started are fixed.


I saw this problem. I tried to understand it, but without success.
The timeout applied seems to be the hardcoded default timeout. I
check this later.

BR,
Thierry
On Sat, 26 May 2018 19:47:54 +0200
PiBa-NL <piba.nl.d[email protected]> wrote:

> Hi Thierry,
>
> Op 25-5-2018 om 15:40 schreef Thierry FOURNIER:
> > On Fri, 18 May 2018 22:17:00 +0200
> > PiBa-NL <[email protected]> wrote:
> >
> >> Hi Thierry,
> >>
> >> Op 18-5-2018 om 20:00 schreef Thierry FOURNIER:
> >>> Hi Pieter,
> >>>
> >>> Could you test the attached patch ? It seems to fix the problem, but I
> >>> have some doubts about the reliability of the patch.
> >>>
> >>> Thierry
> >> The crash seems 'fixed' indeed.. but the lua scipt below now takes
> >> 5seconds instead of 150ms.
> >>
> >> Regards,
> >> PiBa-NL (Pieter)
> >>
> >> con = core.tcp()
> >> con:connect_ssl("216.58.212.132",443) --google: 216.58.212.132
> >> request = [[GET / HTTP/1.0
> >>
> >> ]]
> >> con:send(request)
> >> res = con:receive("*a")
> >> con:close()
> > One bug can hide another bug :-) I catch both. Could you test ?
> >
> > If the result is positive I join also the backport for 1.6 and 1.7
> >
> > Thierry
>
> Thanks, seems both the hang and the 2nd uncovered task schedule issue
> are fixed now (google website response is received/processed fast
> again). I've done some testing, and installed the updated/patched
> version on my production box last night. At the moment it still works
> properly. Activated my lua healthchecker and mailer tasks and enabled 3
> threads again.. Lets see how it go's :), but as said for now it seems to
> work alright.
>
> Does the second issue you found and fixed clear the initial 'doubts
> about the reliability' of the first one.? Or did you have a particular
> possibly problematic scenario in mind that i could try and check for?
>
> For the moment i think it is more 'reliable / stable' with the patches
> than without. So in that regard i think they could be merged.
>
> There seems to be a issue with tcp:settimeout() though. But ill put that
> in a different mail-threat as im not sure its related and the issues
> where this thread started are fixed.


I tried a case never encountered before. When the data sent is bigger than
an haproxy buffer, the send function is absolutely bugged, so I submit
some additional patches.

BR,
Thierry



> Regards,
>
> PiBa-NL (Pieter)
>
Sorry, only registered users may post in this forum.

Click here to login