Welcome! Log In Create A New Profile

Advanced

Segfault on haproxy 1.7.10 with state file and slowstart

Posted by Raghu Udiyar 
Raghu Udiyar
Segfault on haproxy 1.7.10 with state file and slowstart
January 11, 2018 10:00AM
Hello,

Haproxy 1.7.10 segfaults when the srv_admin_state is set to
SRV_ADMF_CMAINT (0x04)
for a backend server, and that backend has the `slowstart` option set.

The following configuration reproduces it :

-----------------------------
# haproxy.cfg (replace <path-to-state-folder> below)

global
maxconn 30000
user haproxy
group haproxy
server-state-file /<path-to-state-folder>/servers.state

log-tag haproxy
nbproc 1
cpu-map 1 2
stats socket /run/haproxy.sock level admin
stats socket /run/haproxy_op.sock mode 666 level operator

defaults
mode http
option forwardfor

option dontlognull
option httplog
log 127.0.0.1 local1 debug

timeout connect 5s
timeout client 50s
timeout server 50s
timeout http-request 8s

load-server-state-from-file global
listen admin
bind *:9002
stats enable
stats auth haproxyadmin:xxxxxxx

frontend testserver
bind *:9000
option tcp-smart-accept
option splice-request
option splice-response
default_backend testservers

backend testservers
balance roundrobin
option tcp-smart-connect
option splice-request
option splice-response
timeout server 2s
timeout queue 2s
default-server maxconn 10 *slowstart 10s* weight 1
server testserver15 10.0.19.10:9003 check
server testserver16 10.0.19.12:9003 check

server testserver17 169.254.0.9:9003 disabled check
server testserver20 169.254.0.9:9003 disabled check


# servers.state file

1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state
srv_uweight srv_iweight srv_time_since_last_change srv_check_status
srv_check_result srv_check_health srv_check_state srv_agent_state
bk_f_forced_id srv_f_forced_id
4 testservers 1 testserver15 10.0.19.10 2 0 1 1 924 6 3 4 6 0 0 0
4 testservers 2 testserver16 10.0.19.12 2 0 1 1 924 6 3 4 6 0 0 0
4 testservers 3 testserver17 169.254.0.9 0 5 1 1 924 1 0 0 14 0 0 0
4 testservers 4 testserver20 10.0.19.17 0 *4* 1 1 454 6 3 4 6 0 0 0

--------------------

The state *4* above for testserver20 causes the segfault, and only occurs
when slowstart is set.

The configuration check can reproduce it ie: haproxy -c -f haproxy.cfg

The backtrace :

(gdb) bt
#0 task_schedule (when=-508447097, task=0x0) at include/proto/task.h:244
#1 srv_clr_admin_flag (mode=SRV_ADMF_FMAINT, s=0x1fb0fd0) at
src/server.c:626
#2 srv_adm_set_ready (s=0x1fb0fd0) at include/proto/server.h:231
#3 srv_update_state (params=0x7ffe4f15e7d0, version=1, srv=0x1fb0fd0) at
src/server.c:2289
#4 apply_server_state () at src/server.c:2664
#5 0x000000000044b60f in init (argc=<optimized out>, [email protected]=4,
argv=<optimized out>,
[email protected]=0x7ffe4f160d38) at src/haproxy.c:975
#6 0x00000000004491be in main (argc=4, argv=0x7ffe4f160d38) at
src/haproxy.c:1795


The way we use the state file is to have servers with `disabled` option in
the configuration; and during scaling update the backend address and mark
as active using the socket. The 169.254.0.9 address is a dummy address for
the disabled servers.

Can someone take a look? I couldn't find any related bugs fixed in 1.8.

Thanks
-- Raghu
Willy Tarreau
Re: Segfault on haproxy 1.7.10 with state file and slowstart
January 12, 2018 11:50AM
Hello Raghu,

On Thu, Jan 11, 2018 at 02:20:34PM +0530, Raghu Udiyar wrote:
> Hello,
>
> Haproxy 1.7.10 segfaults when the srv_admin_state is set to
> SRV_ADMF_CMAINT (0x04)
> for a backend server, and that backend has the `slowstart` option set.
>
> The following configuration reproduces it :
(...)

Thanks for all the details, they made it easy to reproduce.

From what I'm seeing, it's a fundamental design issue in the state
file handling in 1.7. It starts checks before they have been
initialized, and try to wake up a NULL task. In 1.8 due to the more
dynamic config, the initialization sequence has changed and checks
are initialized before parsing the state file, but I don't feel at
ease with doing in 1.7 since I don't know if some config elements
may remain non-updated.

So instead I've just protected against using the task wakeups during
the state file parsing, and they will be initialized later with the
appropriate parameters.

Could you please check the attached patch on top of 1.7.10 ?

Thanks,
Willy
diff --git a/src/server.c b/src/server.c
index 66e8f8a..e847723 100644
--- a/src/server.c
+++ b/src/server.c
@@ -318,8 +318,10 @@ void srv_set_running(struct server *s, const char *reason)
s->last_change = now.tv_sec;

s->state = SRV_ST_STARTING;
- if (s->slowstart > 0)
- task_schedule(s->warmup, tick_add(now_ms, MS_TO_TICKS(MAX(1000, s->slowstart / 20))));
+ if (s->slowstart > 0) {
+ if (s->warmup)
+ task_schedule(s->warmup, tick_add(now_ms, MS_TO_TICKS(MAX(1000, s->slowstart / 20))));
+ }
else
s->state = SRV_ST_RUNNING;

@@ -622,8 +624,10 @@ void srv_clr_admin_flag(struct server *s, enum srv_admin mode)
s->state = SRV_ST_STOPPING;
else {
s->state = SRV_ST_STARTING;
- if (s->slowstart > 0)
- task_schedule(s->warmup, tick_add(now_ms, MS_TO_TICKS(MAX(1000, s->slowstart / 20))));
+ if (s->slowstart > 0) {
+ if (s->warmup)
+ task_schedule(s->warmup, tick_add(now_ms, MS_TO_TICKS(MAX(1000, s->slowstart / 20))));
+ }
else
s->state = SRV_ST_RUNNING;
}
Raghu Udiyar
Re: Segfault on haproxy 1.7.10 with state file and slowstart
January 14, 2018 01:40PM
Hi Willy

Yes, this patch works, no more segfaults.

However, unrelated to this issue, while testing I noticed that the address
from the state file is not applied if the configuration does not use a
hostname.

In the example I provided, the configuration originally has the IP address
set to 169.254.0.9, which is then updated through the socket to say 10.0.19.17.
Expectation was that upon haproxy restart the address in the state fill
will override the address in the configuration, but this does not happen.

I see the relevant code in srv_init_addr :

if (srv->hostname)
return_code |= srv_iterate_initaddr(srv);

As per the server-template example in the dynamic scaling blog post, this
should have been supported. Am I missing some configuration?

Thanks!
-- Raghu


-- Raghu

On Fri, Jan 12, 2018 at 4:14 PM, Willy Tarreau <[email protected]> wrote:

> Hello Raghu,
>
> On Thu, Jan 11, 2018 at 02:20:34PM +0530, Raghu Udiyar wrote:
> > Hello,
> >
> > Haproxy 1.7.10 segfaults when the srv_admin_state is set to
> > SRV_ADMF_CMAINT (0x04)
> > for a backend server, and that backend has the `slowstart` option set.
> >
> > The following configuration reproduces it :
> (...)
>
> Thanks for all the details, they made it easy to reproduce.
>
> From what I'm seeing, it's a fundamental design issue in the state
> file handling in 1.7. It starts checks before they have been
> initialized, and try to wake up a NULL task. In 1.8 due to the more
> dynamic config, the initialization sequence has changed and checks
> are initialized before parsing the state file, but I don't feel at
> ease with doing in 1.7 since I don't know if some config elements
> may remain non-updated.
>
> So instead I've just protected against using the task wakeups during
> the state file parsing, and they will be initialized later with the
> appropriate parameters.
>
> Could you please check the attached patch on top of 1.7.10 ?
>
> Thanks,
> Willy
>
Willy Tarreau
Re: Segfault on haproxy 1.7.10 with state file and slowstart
January 14, 2018 08:50PM
Hi Raghu,

On Sun, Jan 14, 2018 at 05:59:16PM +0530, Raghu Udiyar wrote:
> Hi Willy
>
> Yes, this patch works, no more segfaults.
>
> However, unrelated to this issue, while testing I noticed that the address
> from the state file is not applied if the configuration does not use a
> hostname.
>
> In the example I provided, the configuration originally has the IP address
> set to 169.254.0.9, which is then updated through the socket to say 10.0.19.17.

I noticed it as well but I thought I was the one doing something wrong.
CCing Baptiste in case he remembers certain cases that have to be
addressed a specific way by the state file. I remember this stuff was
tested quite a bit so I suspect that it could depend on certain
conditions.

> Expectation was that upon haproxy restart the address in the state fill
> will override the address in the configuration, but this does not happen.
>
> I see the relevant code in srv_init_addr :
>
> if (srv->hostname)
> return_code |= srv_iterate_initaddr(srv);
>
> As per the server-template example in the dynamic scaling blog post, this
> should have been supported. Am I missing some configuration?

I can't really tell, let's wait for Baptiste's opinion on this.

Thanks,
Willy
Raghu Udiyar
Re: Segfault on haproxy 1.7.10 with state file and slowstart
February 24, 2018 03:30PM
Hi Baptiste

Were you able to look into this? A workaround is to use a dummy hostname,
and init-addr like so :

server testserver13 dummyhost:9003 check init-addr last,10.0.19.10
server testserver14 dummyhost:9003 check init-addr last,10.0.19.12

# disabled placeholder
server testserver15 dummyhost:9003 disabled check init-addr
last,169.254.0.9


This works as expected, but seems like a hack to me. And the blog post
should be updated to indicate this behaviour.


-- Raghu

On Mon, Jan 15, 2018 at 1:10 AM, Willy Tarreau <[email protected]> wrote:

> Hi Raghu,
>
> On Sun, Jan 14, 2018 at 05:59:16PM +0530, Raghu Udiyar wrote:
> > Hi Willy
> >
> > Yes, this patch works, no more segfaults.
> >
> > However, unrelated to this issue, while testing I noticed that the
> address
> > from the state file is not applied if the configuration does not use a
> > hostname.
> >
> > In the example I provided, the configuration originally has the IP
> address
> > set to 169.254.0.9, which is then updated through the socket to say
> 10.0.19.17.
>
> I noticed it as well but I thought I was the one doing something wrong.
> CCing Baptiste in case he remembers certain cases that have to be
> addressed a specific way by the state file. I remember this stuff was
> tested quite a bit so I suspect that it could depend on certain
> conditions.
>
> > Expectation was that upon haproxy restart the address in the state fill
> > will override the address in the configuration, but this does not happen.
> >
> > I see the relevant code in srv_init_addr :
> >
> > if (srv->hostname)
> > return_code |= srv_iterate_initaddr(srv);
> >
> > As per the server-template example in the dynamic scaling blog post, this
> > should have been supported. Am I missing some configuration?
>
> I can't really tell, let's wait for Baptiste's opinion on this.
>
> Thanks,
> Willy
>
Hi Willy, Baptiste

It looks like this was the intended behaviour as per the documentation. But
it looks to be simple to enable the init-addr behaviour for IP addresses as
well. Can you please review the attached patch, this is against devel.
Attachments:
open | download - 0001-MEDIUM-init-allow-init-addr-to-work-with-server-ip-a.patch (2.6 KB)
Hi Raghu,

On Wed, Mar 28, 2018 at 08:30:39PM +0000, Raghu Udiyar wrote:
> Hi Willy, Baptiste
>
> It looks like this was the intended behaviour as per the documentation. But
> it looks to be simple to enable the init-addr behaviour for IP addresses as
> well. Can you please review the attached patch, this is against devel.

Baptiste, I missed this one. Could you please take a look ?

Thanks,
Willy
Hi Willy, Baptiste

I have been running this in production without any issues. Can you please
review, verified that this will apply on current devel as well.

On Fri, 6 Apr 2018 at 22:36 Willy Tarreau <[email protected]> wrote:

> Hi Raghu,
>
> On Wed, Mar 28, 2018 at 08:30:39PM +0000, Raghu Udiyar wrote:
> > Hi Willy, Baptiste
> >
> > It looks like this was the intended behaviour as per the documentation.
> But
> > it looks to be simple to enable the init-addr behaviour for IP addresses
> as
> > well. Can you please review the attached patch, this is against devel.
>
> Baptiste, I missed this one. Could you please take a look ?
>
> Thanks,
> Willy
>
Hi Raghu,

On Thu, Jun 28, 2018 at 02:46:36PM +0530, Raghu Udiyar wrote:
> Hi Willy, Baptiste
>
> I have been running this in production without any issues. Can you please
> review, verified that this will apply on current devel as well.

Thanks, however while looking if we had already merged it or not, I just
found that the exact same patch was already merged then reverted in
september last year :

commit 19e8aa58f7c42e602a95b4ceb4b254c424aed11c
Author: Nenad Merdanovic <[email protected]>
Date: Tue Sep 5 15:32:47 2017 +0200

BUG/MINOR: server: Remove FQDN requirement for using init-addr and state file

Historically the DNS was the only way of updating the server IP dynamically
and the init-addr processing and state file load required the server to have
an FQDN defined. Given that we can now update the IP through the socket as
well and also can have different init-addr values (like IP and 'none') - this
requirement needs to be removed.

This patch should be backported to 1.7.

Then the revert and the explanation :

commit 3d609a755e228cb6e601311e9268ad1eeac964e3
Author: Willy Tarreau <[email protected]>
Date: Wed Sep 6 14:22:45 2017 +0200

Revert "BUG/MINOR: server: Remove FQDN requirement for using init-addr and state file"

This reverts commit 19e8aa58f7c42e602a95b4ceb4b254c424aed11c.

It causes some trouble reported by Manu :
listen tls
[...]
server bla 127.0.0.1:8080

[ALERT] 248/130258 (21960) : parsing [/etc/haproxy/test.cfg:53] : 'server bla' : no method found to resolve address '(null)'
[ALERT] 248/130258 (21960) : Failed to initialize server(s) addr.

According to Nenad :
"It's not a good way to fix the issue we were experiencing
before. It will need a bigger rewrite, because the logic in
srv_iterate_initaddr needs to be changed."

Could you please try the example above in the commit message with and without
your patch ? Maybe the problem is still there, or maybe the logic change
mentioned by Nenad has already been done and this is no longer a problem,
in which case we could take your patch this time.

Thanks,
Willy
Hi Willy, Baptiste

On Fri, 29 Jun 2018 at 08:12 Willy Tarreau <[email protected]> wrote:

> Hi Raghu,
> [...]
>
> [ALERT] 248/130258 (21960) : parsing [/etc/haproxy/test.cfg:53] :
> 'server bla' : no method found to resolve address '(null)'
> [ALERT] 248/130258 (21960) : Failed to initialize server(s) addr.
>
> According to Nenad :
> "It's not a good way to fix the issue we were experiencing
> before. It will need a bigger rewrite, because the logic in
> srv_iterate_initaddr needs to be changed."
>
> Could you please try the example above in the commit message with and
> without
> your patch ? Maybe the problem is still there, or maybe the logic change
> mentioned by Nenad has already been done and this is no longer a problem,
> in which case we could take your patch this time.
>

Yes this is a problem - I missed to test this case. I have a proposal.

The init-addr option only makes sense when fqdn is used. But last method is
useful when using IP addresses. This patch enables
the last method to work by default when IP addresses are used. Can you
please review?

Thanks
Raghu
Attachments:
open | download - 0001-MEDIUM-init-allow-init-addr-to-work-with-server-ip-a.patch (1.6 KB)
Willy Tarreau
Re: Segfault on haproxy 1.7.10 with state file and slowstart
August 16, 2018 08:00PM
Hi Baptiste,

could you please have a look at this one, I suspect Raghu's proposal
is reasonable, I just want to be sure we don't break DNS and friends.

Thanks,
Willy

On Mon, Aug 13, 2018 at 08:16:31PM +0530, Raghu Udiyar wrote:
> Hi Willy, Baptiste
>
> On Fri, 29 Jun 2018 at 08:12 Willy Tarreau <[email protected]> wrote:
>
> > Hi Raghu,
> > [...]
> >
> > [ALERT] 248/130258 (21960) : parsing [/etc/haproxy/test.cfg:53] :
> > 'server bla' : no method found to resolve address '(null)'
> > [ALERT] 248/130258 (21960) : Failed to initialize server(s) addr.
> >
> > According to Nenad :
> > "It's not a good way to fix the issue we were experiencing
> > before. It will need a bigger rewrite, because the logic in
> > srv_iterate_initaddr needs to be changed."
> >
> > Could you please try the example above in the commit message with and
> > without
> > your patch ? Maybe the problem is still there, or maybe the logic change
> > mentioned by Nenad has already been done and this is no longer a problem,
> > in which case we could take your patch this time.
> >
>
> Yes this is a problem - I missed to test this case. I have a proposal.
>
> The init-addr option only makes sense when fqdn is used. But last method is
> useful when using IP addresses. This patch enables
> the last method to work by default when IP addresses are used. Can you
> please review?
>
> Thanks
> Raghu
Sorry, only registered users may post in this forum.

Click here to login