Welcome! Log In Create A New Profile

Advanced

Segfault with 1.8.0 build (RHEL5, old gcc).

Posted by Christopher Lane 
Christopher Lane
Segfault with 1.8.0 build (RHEL5, old gcc).
December 01, 2017 09:10PM
gist with backtrace, -vv output, and config file. Also strace.

https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec

It does all the listens and then right after the first epoll is done it has
this segv. all the local variables are "optimized out"

(We really want the hard-stop-after -- we get a leak of children with our
super frequent soft-reloads).

It's possible something is messed up in my compiling/linking environment,
but I don't see anything, and I thought maybe a .0 release had a chance of
issues on very old RHEL versions. We currently run 1.5 and have to
manually clean up the left over children (but are hoping to scale up to
nbproc 4 and use CPU pinning).

I'm debugging also, but hoped someone would have something off the top of
their head (apologies for not having read the list for a while before
posting).

--Chris
Olivier Houchard
Re: Segfault with 1.8.0 build (RHEL5, old gcc).
December 01, 2017 10:20PM
Hi Christopher,

On Fri, Dec 01, 2017 at 11:59:14AM -0800, Christopher Lane wrote:
> gist with backtrace, -vv output, and config file. Also strace.
>
> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
>
> It does all the listens and then right after the first epoll is done it has
> this segv. all the local variables are "optimized out"
>
> (We really want the hard-stop-after -- we get a leak of children with our
> super frequent soft-reloads).
>
> It's possible something is messed up in my compiling/linking environment,
> but I don't see anything, and I thought maybe a .0 release had a chance of
> issues on very old RHEL versions. We currently run 1.5 and have to
> manually clean up the left over children (but are hoping to scale up to
> nbproc 4 and use CPU pinning).
>
> I'm debugging also, but hoped someone would have something off the top of
> their head (apologies for not having read the list for a while before
> posting).
>

Thanks a lot for the very detailled report !
I think I know what's going on.
In your config file, you have :
server stage.qa.example.com:11302 stage.qa.example.com check inter 2m fastinter 2s fall 2
It should probably be :
server stage.qa.example.com stage.qa.example.com:11302 check inter 2m fastinter 2s fall 2

However this is of course no reason to segfault.
Can you please try the attached patch, and let me know if it fixes your issue ?

Thanks a lot !

Olivier
From 5236a1a4ac19cc27c6f06d328b2df0c4cdfe220c Mon Sep 17 00:00:00 2001
From: Olivier Houchard <[email protected]>
Date: Fri, 1 Dec 2017 22:04:05 +0100
Subject: [PATCH] MINOR: checks: Be sure we have a mux if we created a cs.

In connect_conn_chk(), there were one case we could return with a new
conn_stream created, but no mux attached. With no mux, cs_destroy() would
segfault. Fix that by setting the mux before we can fail.

This should be backported to 1.8.
---
src/checks.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/checks.c b/src/checks.c
index 63747201e..eaf84a225 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -1564,25 +1564,23 @@ static int connect_conn_chk(struct task *t)
conn->addr.to = s->addr;
}

+ proto = protocol_by_family(conn->addr.to.ss_family);
+
+ conn_prepare(conn, proto, check->xprt);
+ conn_install_mux(conn, &mux_pt_ops, cs);
+ cs_attach(cs, check, &check_conn_cb);
+ conn->target = &s->obj_type;
+
if ((conn->addr.to.ss_family == AF_INET) || (conn->addr.to.ss_family == AF_INET6)) {
int i = 0;

i = srv_check_healthcheck_port(check);
- if (i == 0) {
- cs->data = check;
+ if (i == 0)
return SF_ERR_CHK_PORT;
- }

set_host_port(&conn->addr.to, i);
}

- proto = protocol_by_family(conn->addr.to.ss_family);
-
- conn_prepare(conn, proto, check->xprt);
- conn_install_mux(conn, &mux_pt_ops, cs);
- cs_attach(cs, check, &check_conn_cb);
- conn->target = &s->obj_type;
-
/* no client address */
clear_addr(&conn->addr.from);

--
2.14.3
Lukas Tribus
Re: Segfault with 1.8.0 build (RHEL5, old gcc).
December 04, 2017 01:30PM
Hello Christopher,


2017-12-01 20:59 GMT+01:00 Christopher Lane <[email protected]>:
>
> gist with backtrace, -vv output, and config file. Also strace.
>
> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
>
> It does all the listens and then right after the first epoll is done it has this segv. all the local variables are "optimized out"
>
> (We really want the hard-stop-after -- we get a leak of children with our super frequent soft-reloads).

FYI, hard-stop-after has been backported to 1.7 stable and is in all
rebuilds starting with 1.7.4:

https://www.mail-archive.com/[email protected]/msg25494.html



Regards,
Lukas
Christopher Lane
Re: Segfault with 1.8.0 build (RHEL5, old gcc).
December 04, 2017 09:10PM
On Mon, Dec 4, 2017 at 4:22 AM Lukas Tribus <[email protected]> wrote:

>Hello Christopher,


>2017-12-01 20:59 GMT+01:00 Christopher Lane <[email protected]>:
>>
>> gist with backtrace, -vv output, and config file. Also strace.
>>
>> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
>>
>> It does all the listens and then right after the first epoll is done it
has this segv. all the local variables are "optimized out"
>>
>> (We really want the hard-stop-after -- we get a leak of children with
our super frequent soft-reloads).

>FYI, hard-stop-after has been backported to 1.7 stable and is in all
>rebuilds starting with 1.7.4:

> https://www.mail-archive.com/[email protected]/msg25494.html
https://www.mail-archive.com/[email protected]/msg25494.html



> Regards,
> Lukas

Olivier:

The patch worked beautifully to keep the 1.8.0 from crashing.

Lukas:

Thanks for the tip, we'll consider 1.7.9 then (settled on 1.8.0 by starting
out with reading the release notes for it; we are upgrading from 1.5.0).

--Chris
Christopher Lane
Re: Segfault with 1.8.0 build (RHEL5, old gcc).
December 07, 2017 02:40AM
On Mon, Dec 4, 2017 at 11:56 AM, Christopher Lane
<[email protected]> wrote:
>
>
>
> On Mon, Dec 4, 2017 at 4:22 AM Lukas Tribus <[email protected]> wrote:
>
>>Hello Christopher,
>
>
>>2017-12-01 20:59 GMT+01:00 Christopher Lane <[email protected]>:
>>>
>>> gist with backtrace, -vv output, and config file. Also strace.
>>>
>>> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
>>>
>>> It does all the listens and then right after the first epoll is done it
>>> has this segv. all the local variables are "optimized out"
>>>
>>> (We really want the hard-stop-after -- we get a leak of children with our
>>> super frequent soft-reloads).
>
>>FYI, hard-stop-after has been backported to 1.7 stable and is in all
>>rebuilds starting with 1.7.4:
>
>> https://www.mail-archive.com/[email protected]/msg25494.html
>
>
>
>> Regards,
>> Lukas
>
> Olivier:
>
> The patch worked beautifully to keep the 1.8.0 from crashing.
>
> Lukas:
>
> Thanks for the tip, we'll consider 1.7.9 then (settled on 1.8.0 by starting
> out with reading the release notes for it; we are upgrading from 1.5.0).
>
> --Chris

Unrelated to the prior contents of this thread, I found the root cause
for our issue (child leak).

The haproxy was being started from a Java process using Runtime.exec()
and the PIDs were jammed into 1 cell of argv. It killed the first
child but not the later ones, as atol("13233 13234 13235 13236")
returns 13233. We have fixed the Java code.

http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haproxy.c;h=eb5e65b40e7b8b2a4f8fb04b3552401e42fb0a89;hb=HEAD#l1421

I note that the -sf parsing code uses atol and has no error checking.
Would the project be interested in a patch to use strtol with error
checks? Could log a warning if unconsumed bytes in the arg (safer),
or fail to start (unsafe). I'm sure I'm not the only one with this
sort of bug, given how tricky shell escaping and so on is.

Thanks again, and we are enjoying 1.8.1 and much lower loadavgs.

--Chris
Olivier Houchard
Re: Segfault with 1.8.0 build (RHEL5, old gcc).
December 07, 2017 01:40PM
Hi Christopher,

On Wed, Dec 06, 2017 at 05:34:15PM -0800, Christopher Lane wrote:
> On Mon, Dec 4, 2017 at 11:56 AM, Christopher Lane
> <[email protected]> wrote:
> >
> >
> >
> > On Mon, Dec 4, 2017 at 4:22 AM Lukas Tribus <[email protected]> wrote:
> >
> >>Hello Christopher,
> >
> >
> >>2017-12-01 20:59 GMT+01:00 Christopher Lane <[email protected]>:
> >>>
> >>> gist with backtrace, -vv output, and config file. Also strace.
> >>>
> >>> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
> >>>
> >>> It does all the listens and then right after the first epoll is done it
> >>> has this segv. all the local variables are "optimized out"
> >>>
> >>> (We really want the hard-stop-after -- we get a leak of children with our
> >>> super frequent soft-reloads).
> >
> >>FYI, hard-stop-after has been backported to 1.7 stable and is in all
> >>rebuilds starting with 1.7.4:
> >
> >> https://www.mail-archive.com/[email protected]/msg25494.html
> >
> >
> >
> >> Regards,
> >> Lukas
> >
> > Olivier:
> >
> > The patch worked beautifully to keep the 1.8.0 from crashing.
> >
> > Lukas:
> >
> > Thanks for the tip, we'll consider 1.7.9 then (settled on 1.8.0 by starting
> > out with reading the release notes for it; we are upgrading from 1.5.0).
> >
> > --Chris
>
> Unrelated to the prior contents of this thread, I found the root cause
> for our issue (child leak).
>
> The haproxy was being started from a Java process using Runtime.exec()
> and the PIDs were jammed into 1 cell of argv. It killed the first
> child but not the later ones, as atol("13233 13234 13235 13236")
> returns 13233. We have fixed the Java code.
>
> http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haproxy.c;h=eb5e65b40e7b8b2a4f8fb04b3552401e42fb0a89;hb=HEAD#l1421
>
> I note that the -sf parsing code uses atol and has no error checking.
> Would the project be interested in a patch to use strtol with error
> checks? Could log a warning if unconsumed bytes in the arg (safer),
> or fail to start (unsafe). I'm sure I'm not the only one with this
> sort of bug, given how tricky shell escaping and so on is.

Yes, replacing ato* with something more reasonnable would certainly be
appreciated :)

Regards,

Olivier
Christopher Lane
Re: Segfault with 1.8.0 build (RHEL5, old gcc).
December 07, 2017 10:30PM
On Thu, Dec 7, 2017 at 4:36 AM, Olivier Houchard <[email protected]> wrote:
> Hi Christopher,
>
> On Wed, Dec 06, 2017 at 05:34:15PM -0800, Christopher Lane wrote:
>> On Mon, Dec 4, 2017 at 11:56 AM, Christopher Lane
>> <[email protected]> wrote:
>> >
>> >
>> >
>> > On Mon, Dec 4, 2017 at 4:22 AM Lukas Tribus <[email protected]> wrote:
>> >
>> >>Hello Christopher,
>> >
>> >
>> >>2017-12-01 20:59 GMT+01:00 Christopher Lane <[email protected]>:
>> >>>
>> >>> gist with backtrace, -vv output, and config file. Also strace.
>> >>>
>> >>> https://gist.github.com/jayalane/c6dbe7918aa9635b62c874d20f57dfec
>> >>>
>> >>> It does all the listens and then right after the first epoll is done it
>> >>> has this segv. all the local variables are "optimized out"
>> >>>
>> >>> (We really want the hard-stop-after -- we get a leak of children with our
>> >>> super frequent soft-reloads).
>> >
>> >>FYI, hard-stop-after has been backported to 1.7 stable and is in all
>> >>rebuilds starting with 1.7.4:
>> >
>> >> https://www.mail-archive.com/[email protected]/msg25494.html
>> >
>> >
>> >
>> >> Regards,
>> >> Lukas
>> >
>> > Olivier:
>> >
>> > The patch worked beautifully to keep the 1.8.0 from crashing.
>> >
>> > Lukas:
>> >
>> > Thanks for the tip, we'll consider 1.7.9 then (settled on 1.8.0 by starting
>> > out with reading the release notes for it; we are upgrading from 1.5.0).
>> >
>> > --Chris
>>
>> Unrelated to the prior contents of this thread, I found the root cause
>> for our issue (child leak).
>>
>> The haproxy was being started from a Java process using Runtime.exec()
>> and the PIDs were jammed into 1 cell of argv. It killed the first
>> child but not the later ones, as atol("13233 13234 13235 13236")
>> returns 13233. We have fixed the Java code.
>>
>> http://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haproxy.c;h=eb5e65b40e7b8b2a4f8fb04b3552401e42fb0a89;hb=HEAD#l1421
>>
>> I note that the -sf parsing code uses atol and has no error checking.
>> Would the project be interested in a patch to use strtol with error
>> checks? Could log a warning if unconsumed bytes in the arg (safer),
>> or fail to start (unsafe). I'm sure I'm not the only one with this
>> sort of bug, given how tricky shell escaping and so on is.
>
> Yes, replacing ato* with something more reasonnable would certainly be
> appreciated :)
>
> Regards,
>
> Olivier

Didn't really see a place to put in tests for it, but did build git
master and try it with various correct and incorrect args. Patch is
under subject:

[PATCH] CONTRIB: log: emit warning when -sf/-sd cannot parse argument


Thanks,


--Chris
Sorry, only registered users may post in this forum.

Click here to login