remaining process after (seamless) reload

William Dauchy
remaining process after (seamless) reload
May 23, 2018 07:00PM
Hello,

I am trying to understand a possible issue we have regarding haproxy (seamless)
reloads.
I am using haproxy v1.8.9 with the following config (using nbthread):

global
log 127.0.0.1 local0 info
maxconn 262144
user haproxy
group haproxy
nbproc 1
daemon
stats socket /var/lib/haproxy/stats level admin mode 644 expose-fd listeners
stats timeout 2m
tune.bufsize 33792
ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
ssl-default-bind-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
ssl-default-server-ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
ssl-server-verify none
crt-base /etc/haproxy/tls/
nbthread 4
cpu-map auto:1/1-4 0-3

defaults
log global
mode http
retries 3
timeout connect 10s
timeout client 180s
timeout server 180s
timeout http-keep-alive 10s
timeout http-request 10s
timeout queue 1s
timeout check 5s
option httplog
option dontlognull
option redispatch
option prefer-last-server
option dontlog-normal
option http-keep-alive
option forwardfor except 127.0.0.0/8
balance roundrobin
maxconn 262134
http-reuse safe
default-server inter 5s fastinter 1s fall 3 slowstart 20s observe layer7 error-limit 1 on-error fail-check
http-check send-state
http-check expect status 200 string passing

listen stats
bind *:8080
stats enable
stats uri /haproxy_stats


frontend fe_main
# arbitrary split in two for http/https traffic
bind *:80 name http_1 process 1/1
bind *:80 name http_2 process 1/2
bind *:443 name https_3 ssl crt /etc/haproxy/tls/fe_main process 1/3
bind *:443 name https_4 ssl crt /etc/haproxy/tls/fe_main process 1/4
[...]

The rest of the config contains lots of ACLs/backends (> 1000).

We do frequent reloads (approximately every 10 seconds).
After a while, some processes remain alive and seem to never exit (I waited more than
24 hours). While stracing them, I can see that some of them are still handling traffic
and doing health checks. Others exit normally after the reload.
I was wondering how I can help debug this issue. I assume I won't
get any other information through the stats socket, since it concerns older
processes, but maybe I missed something.
Do you have any hint to help me understand what is going on?

Best regards,

--
William
William Dauchy
Re: remaining process after (seamless) reload
May 23, 2018 08:50PM
On Wed, May 23, 2018 at 06:49:09PM +0200, William Dauchy wrote:
> We do frequent reloads (approximatively every 10s).
> After a while some processes remains alive and seem to never exit (waited >24
> hours). While stracing them, some of them are still handling traffic and
> doing healthchecks. Some of them are exiting normally after the reload.
> I was wondering how I can help to debug this issue. I assume I won't
> have any other info through the stats socket, since it concerns older
> processes but maybe I missed something.
> Do you have any hint to help me understand what is going on?

More details which could help understand what is going on:

ps output:

root 15928 0.3 0.0 255216 185268 ? Ss May21 10:11 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 16988 16912 6340 28271 30590 30334 -x /var/lib/haproxy/stats
haproxy 6340 2.0 0.0 526172 225476 ? Ssl May22 35:03 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 6328 6315 -x /var/lib/haproxy/stats
haproxy 28271 1.8 0.0 528720 229508 ? Ssl May22 27:13 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 28258 28207 28232 6340 -x /var/lib/haproxy/stats
haproxy 30590 265 0.0 527268 225032 ? Rsl 04:35 2188:55 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 30578 28271 6340 -x /var/lib/haproxy/stats
haproxy 30334 197 0.0 526704 224544 ? Rsl 09:17 1065:59 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 30322 30295 27095 6340 28271 30590 -x /var/lib/haproxy/stats
haproxy 16912 1.7 0.0 527544 216552 ? Ssl 18:14 0:03 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 16899 28271 30590 30334 6340 -x /var/lib/haproxy/stats
haproxy 17001 2.2 0.0 528392 214656 ? Ssl 18:17 0:00 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 16988 16912 6340 28271 30590 30334 -x /var/lib/haproxy/stats


lsof output:

haproxy 6340 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 6340 6341 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 6340 6342 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 6340 6343 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 17020 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 17020 17021 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 17020 17022 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 17020 17023 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 28271 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 28271 28272 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 28271 28273 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
haproxy 28271 28274 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp

(So on unhealthy nodes, I find old processes which are still linked to
the socket.)

The provisioning side is also seeing data which should already have been
updated through the runtime API. I suspect I am getting stale
data when connecting to the unix socket, the latter still being attached
to an old process?
Indeed, if I run:
for i in {1..500}; do sudo echo "show info" | sudo socat stdio /var/lib/haproxy/stats | grep Pid; done

I get "Pid: 17001" most of the time, which is the last process
but I sometimes get: "Pid: 28271"(!) which is a > 24 hours old
process.

Is there something we are doing wrong?

Best,

--
William
William Dauchy
Re: remaining process after (seamless) reload
May 24, 2018 10:20AM
On Wed, May 23, 2018 at 08:45:04PM +0200, William Dauchy wrote:
> More details which could help understand what is going on:
>
> ps output:
>
> root 15928 0.3 0.0 255216 185268 ? Ss May21 10:11 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 16988 16912 6340 28271 30590 30334 -x /var/lib/haproxy/stats
> haproxy 6340 2.0 0.0 526172 225476 ? Ssl May22 35:03 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 6328 6315 -x /var/lib/haproxy/stats
> haproxy 28271 1.8 0.0 528720 229508 ? Ssl May22 27:13 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 28258 28207 28232 6340 -x /var/lib/haproxy/stats
> haproxy 30590 265 0.0 527268 225032 ? Rsl 04:35 2188:55 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 30578 28271 6340 -x /var/lib/haproxy/stats
> haproxy 30334 197 0.0 526704 224544 ? Rsl 09:17 1065:59 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 30322 30295 27095 6340 28271 30590 -x /var/lib/haproxy/stats
> haproxy 16912 1.7 0.0 527544 216552 ? Ssl 18:14 0:03 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 16899 28271 30590 30334 6340 -x /var/lib/haproxy/stats
> haproxy 17001 2.2 0.0 528392 214656 ? Ssl 18:17 0:00 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 16988 16912 6340 28271 30590 30334 -x /var/lib/haproxy/stats
>
>
> lsof output:
>
> haproxy 6340 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 6340 6341 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 6340 6342 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 6340 6343 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 17020 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 17020 17021 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 17020 17022 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 17020 17023 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 28271 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 28271 28272 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 28271 28273 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> haproxy 28271 28274 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
>
> (So on unhealthy nodes, I find old processes which are still linked to
> the socket.)
>
> The provisioning part is also seeing data which are supposed to be
> already updated through the runtime API. I suspect I am getting old
> data when connecting to the unix socket. The later being still attached
> to an old process?
> Indeed, if I try
> for i in {1..500}; do sudo echo "show info" | sudo socat stdio /var/lib/haproxy/stats | grep Pid; done
>
> I get "Pid: 17001" most of the time, which is the last process
> but I sometimes get: "Pid: 28271"(!) which is a > 24 hours old
> process.
>
> Is there something we are doing wrongly?

After some more testing, I don't see this issue using haproxy v1.8.8
(rolled back for more than 12 hours now). I hope I'm not speaking too soon.

--
William
William Lallemand
Re: remaining process after (seamless) reload
May 24, 2018 12:10PM
On Thu, May 24, 2018 at 10:07:23AM +0200, William Dauchy wrote:
> On Wed, May 23, 2018 at 08:45:04PM +0200, William Dauchy wrote:
> > More details which could help understand what is going on:
> >
> > ps output:
> >
> > root 15928 0.3 0.0 255216 185268 ? Ss May21 10:11 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 16988 16912 6340 28271 30590 30334 -x /var/lib/haproxy/stats
> > haproxy 6340 2.0 0.0 526172 225476 ? Ssl May22 35:03 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 6328 6315 -x /var/lib/haproxy/stats
> > haproxy 28271 1.8 0.0 528720 229508 ? Ssl May22 27:13 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 28258 28207 28232 6340 -x /var/lib/haproxy/stats
> > haproxy 30590 265 0.0 527268 225032 ? Rsl 04:35 2188:55 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 30578 28271 6340 -x /var/lib/haproxy/stats
> > haproxy 30334 197 0.0 526704 224544 ? Rsl 09:17 1065:59 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 30322 30295 27095 6340 28271 30590 -x /var/lib/haproxy/stats
> > haproxy 16912 1.7 0.0 527544 216552 ? Ssl 18:14 0:03 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 16899 28271 30590 30334 6340 -x /var/lib/haproxy/stats
> > haproxy 17001 2.2 0.0 528392 214656 ? Ssl 18:17 0:00 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 16988 16912 6340 28271 30590 30334 -x /var/lib/haproxy/stats
> >
> >
> > lsof output:
> >
> > haproxy 6340 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 6340 6341 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 6340 6342 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 6340 6343 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 17020 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 17020 17021 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 17020 17022 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 17020 17023 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 28271 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 28271 28272 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 28271 28273 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> > haproxy 28271 28274 haproxy 5u unix 0xffff883feec97000 0t0 679289634 /var/lib/haproxy/stats.15928.tmp
> >
> > (So on unhealthy nodes, I find old processes which are still linked to
> > the socket.)
> >
> > The provisioning part is also seeing data which are supposed to be
> > already updated through the runtime API. I suspect I am getting old
> > data when connecting to the unix socket. The later being still attached
> > to an old process?
> > Indeed, if I try
> > for i in {1..500}; do sudo echo "show info" | sudo socat stdio /var/lib/haproxy/stats | grep Pid; done
> >
> > I get "Pid: 17001" most of the time, which is the last process
> > but I sometimes get: "Pid: 28271"(!) which is a > 24 hours old
> > process.
> >
> > Is there something we are doing wrongly?
>
> After some more testing, I don't have this issue using haproxy v1.8.8
> (rollbacked for > 12 hours). I hope I don't speak too fast.
>

Hi,

I managed to reproduce something similar with the 1.8.8 version. It looks like
leaving a socat connected to the socket helps.

I'm looking into the code to see what's happening.

--
William Lallemand
William Dauchy
Re: remaining process after (seamless) reload
May 24, 2018 11:10PM
Hi William,

Thank you for your reply.

On Thu, May 24, 2018 at 12:01:38PM +0200, William Lallemand wrote:
> I managed to reproduce something similar with the 1.8.8 version. It looks like
> letting a socat connected to the socket helps.
>
> I'm looking into the code to see what's happening.

Indeed, after a few more hours, I got the same issue on v1.8.8. However, it seems
to be easier to reproduce on v1.8.9, but I might be wrong.
So now I would bet on either a thread issue or the bind with reuseport.
I'll try to run some more tests.

Best,
--
William
William Lallemand
Re: remaining process after (seamless) reload
May 28, 2018 05:30PM
On Thu, May 24, 2018 at 11:00:29PM +0200, William Dauchy wrote:
> On Thu, May 24, 2018 at 12:01:38PM +0200, William Lallemand wrote:
> > I managed to reproduce something similar with the 1.8.8 version. It looks like
> > letting a socat connected to the socket helps.
> >
> > I'm looking into the code to see what's happening.
>
> Indeed, after some more hours, I got the same issue on v1.8.8. However it seems to
> be easier to reproduce in v1.8.9, but I might be wrong.
> So now I bet on either thread issue, or bind with reuseport.
> I'll try to do some more tests.
>
> Best,

Hi,

I don't think I reproduced the same problem, so I have a few questions for you :-)

Are the problematic workers leaving when you reload a second time?

Did you try to kill -USR1 the worker? It should exit with "Former worker $PID
exited with code 0" on stderr.

If not, could you check the Sig* lines in /proc/$PID/status for this worker?

Do you know how much time haproxy takes to load its configuration, and do you
think you might have triggered a reload before it finished parsing and loading the
config? Type=notify in your systemd unit file should help in this case. If I
remember correctly, it checks that the service is 'ready' before trying to reload.
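
Concretely, those checks could look like this (a rough sketch; $PID stands for the
PID of a stuck worker):

kill -USR1 $PID                    # ask the worker to leave; the master should log "Former worker ... exited" on stderr
grep '^Sig' /proc/$PID/status      # pending/blocked/ignored/caught signal masks of that worker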

I suspect the SIGUSR1 signal is not received by the worker, but I'm not sure
whether it's the master that didn't send it or the worker that blocked it.

Thanks!

--
William Lallemand
William Dauchy
Re: remaining process after (seamless) reload
May 29, 2018 05:20PM
Hello William,

Sorry for the late answer.

> Are the problematical workers leaving when you reload a second time?

No, they seem to stay for a long time (forever?).

> Did you try to kill -USR1 the worker ? It should exits with "Former worker $PID
> exited with code 0" on stderr.
> If not, could you check the Sig* lines in /proc/$PID/status for this worker?

I will try. I need to put the setup back in shape, and maybe test
without the multiple bind lines.

> Do you know how much time take haproxy to load its configuration, and do you
> think you tried a reload before it finished to parse and load the config?
> Type=notify in your systemd unit file should help for this case. If I remember
> well it checks that the service is 'ready' before trying to reload.

We are using Type=notify. However, I cannot guarantee we do not trigger
a new reload before the previous one is done. Is there a way to check
the "ready" state you mentioned?
(We are talking about a reload every 10 seconds at most, though.)

> I suspect the SIGUSR1 signal is not received by the worker, but I'm not sure
> either if it's the master that didn't send it or if the worker blocked it.

good to know.

Best,
--
William
Dave Chiluk
Re: remaining process after (seamless) reload
May 29, 2018 06:00PM
We've battled the same issue with our haproxies. We root-caused it to slow
DNS lookup times while parsing the config: config parsing took so long that
we were attempting to reload again before the original reload had completed.
I'm still not sure why or where the signals to the old haproxy are getting
dropped, but we found that by installing a DNS cache on our haproxy nodes
we were able to greatly decrease the likelihood of creating zombie haproxy
instances.

We further improved on that by rearchitecting our micro-service
architecture to make use of the haproxy dynamic scaling APIs and
allocating dummy slots for future expansion, similar to
https://www.haproxy.com/blog/dynamic-scaling-for-microservices-with-runtime-api/.

Good luck, I hope that's the answer to your problem.
Dave.

On Tue, May 29, 2018 at 10:12 AM William Dauchy <[email protected]> wrote:

> Hello William,
>
> Sorry for the last answer.
>
> > Are the problematical workers leaving when you reload a second time?
>
> no, they seems to stay for a long time (forever?)
>
> > Did you try to kill -USR1 the worker ? It should exits with "Former
> worker $PID
> > exited with code 0" on stderr.
> > If not, could you check the Sig* lines in /proc/$PID/status for this
> worker?
>
> will try. I need to put the setup back in shape, and maybe test
> without multiple binding.
>
> > Do you know how much time take haproxy to load its configuration, and do
> you
> > think you tried a reload before it finished to parse and load the config?
> > Type=notify in your systemd unit file should help for this case. If I
> remember
> > well it checks that the service is 'ready' before trying to reload.
>
> We are using Type=notify. I however cannot guarantee we do not trigger
> a new reload, before the previous one is done. Is there a way to check
> the "ready" state you mentioned?
> (We are talking about a reload every 10 seconds maximum though)
>
> > I suspect the SIGUSR1 signal is not received by the worker, but I'm not
> sure
> > either if it's the master that didn't send it or if the worker blocked
> it.
>
> good to know.
>
> Best,
> --
> William
>
>
William Dauchy
Re: remaining process after (seamless) reload
May 29, 2018 08:40PM
Hello Dave,

On Tue, May 29, 2018 at 5:55 PM, Dave Chiluk <[email protected]> wrote:
> We've battled the same issue with our haproxys. We root caused it to slow
> dns lookup times while parsing the config was causing haproxy config parsing
> to take so long that we were attempting to reload again before the original
> reload had completed. I'm still not sure why or where the Signals are
> getting dropped to the old haproxy, but we found by installing a dns cache
> on our haproxy nodes we were able to greatly decrease the likelihood of
> creating zombie haproxy instances.

Interesting findings. However, I don't see where haproxy would need to do
DNS lookups in our case: the frontend side is host matching and the
backend side is IP only.
But I will have a closer look at this point. What I am missing for now
is how to know when haproxy is considered "ready", to prevent new
reloads.

> We further improved on that by rearchitecting our micro-service architecture
> to make use of the haproxy dynamic scaling apis, and allocating dummy slots
> for future expansion. Similar to
> https://www.haproxy.com/blog/dynamic-scaling-for-microservices-with-runtime-api/.

that's indeed what we are already doing for servers.

Best,
--
William
Tim Düsterhus
Re: remaining process after (seamless) reload
May 29, 2018 09:40PM
William,

On 29.05.2018 at 17:09, William Dauchy wrote:
> We are using Type=notify. I however cannot guarantee we do not trigger
> a new reload, before the previous one is done. Is there a way to check
> the "ready" state you mentioned?

Run systemctl status haproxy. It shows the status:

> [[email protected]/s/haproxy (maxrewrite-warn)]sudo systemctl status haproxy
> ● haproxy.service - HAProxy Load Balancer
> Loaded: loaded (/lib/systemd/system/haproxy.service; disabled; vendor preset: enabled)
> Active: reloading (reload) since Tue 2018-05-29 21:25:06 CEST; 22s ago

"Active: reloading (reload)" implies that the haproxy master process did
not signal readiness to SystemD yet. In my cause that is, because I
commented out the line signaling readiness. Readiness is signaled, once
the master process forked off the children, but before unblocking the
signals in the master process.

@William Lallemand Possibly the sd_notifyf should be moved below
mworker_unblock_signals in mworker_wait?

Once the reload has finished, it should show "Active: active (running)":

> [[email protected]/s/haproxy (maxrewrite-warn)]sudo systemctl status haproxy
> ● haproxy.service - HAProxy Load Balancer
> Loaded: loaded (/lib/systemd/system/haproxy.service; disabled; vendor preset: enabled)
> Active: active (running) since Tue 2018-05-29 21:27:58 CEST; 1s ago
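
If you would rather check this from a script than parse the human-readable
output, something like the following should work (assuming the unit is named
haproxy, as above):

systemctl show -p ActiveState haproxy   # prints ActiveState=reloading during a reload, ActiveState=active once it is done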

Best regards
Tim Düsterhus
William Dauchy
Re: remaining process after (seamless) reload
May 29, 2018 11:30PM
Hello Tim,

On Tue, May 29, 2018 at 9:33 PM, Tim Düsterhus <[email protected]> wrote:
> Run systemctl status haproxy. It shows the status:
>
>> [[email protected]/s/haproxy (maxrewrite-warn)]sudo systemctl status haproxy
>> ● haproxy.service - HAProxy Load Balancer
>> Loaded: loaded (/lib/systemd/system/haproxy.service; disabled; vendor preset: enabled)
>> Active: reloading (reload) since Tue 2018-05-29 21:25:06 CEST; 22s ago
>
> "Active: reloading (reload)" implies that the haproxy master process did
> not signal readiness to SystemD yet. In my cause that is, because I
> commented out the line signaling readiness. Readiness is signaled, once
> the master process forked off the children, but before unblocking the
> signals in the master process.

Thanks for the clarification. I was rather looking for a way to easily check
this from my provisioning system, e.g. a file to check instead of parsing
the output of systemctl status.

--
William
Willy Tarreau
Re: remaining process after (seamless) reload
May 30, 2018 05:30AM
On Tue, May 29, 2018 at 08:35:19PM +0200, William Dauchy wrote:
> I however don't see on which part haproxy would
> need to do dns lookup on our side. Front end side is host matching and
> backend side is IP only.

We studied the possibility that a reload happens at the exact moment
the config finishes being parsed, and thought about such possibilities
as well. We could imagine that you're using tons of certificates and
that they take a bit of time to be processed on startup. But despite
this, the startup sequence apparently *looks* safe.

> But I will have a closer look at this point. What I am missing for now
> is how to know when haproxy is considered as "ready" to prevent new
> reloads.

From what I understood, it's when the workers are forked; at this point
sd_notify is used. There's very likely a race somewhere there. We imagined
several hypotheses, like the reload signal being delivered before the workers
are started, etc., but William has managed to rule them all out so far.

All ideas and silly suggestions are welcome of course.

Cheers,
Willy
William Dauchy
Re: remaining process after (seamless) reload
May 30, 2018 05:00PM
Hello William L.,

I did some more testing:
I simplified my config, removing the multiple bind lines and the cpu-map.
The conclusion is that I have this issue when I activate the nbthread feature
(meaning no problem without it).

I tried to kill -USR1 the failing worker, but it remains.

Here are the Sig* from status file of one of the failing process:
SigQ: 0/192448
SigPnd: 0000000000000000
SigBlk: 0000000000000800
SigIgn: 0000000000001800
SigCgt: 0000000180300205

About the timing of the reload, it seems to take a few seconds most of the
time, so I *think* I am not reloading before the previous reload is done,
but I would appreciate a way to check this through a file
before sending the reload; do you have any hint?

Best,
--
William
William Lallemand
Re: remaining process after (seamless) reload
May 30, 2018 05:40PM
On Wed, May 30, 2018 at 04:47:31PM +0200, William Dauchy wrote:
> Hello William L.,
>

Hi William D. :-)

> I did some more testing:
> I simplified my config, removing the multi binding part and cpu-map.
> Conclusion is, I have this issue when I activate nbthread feature
> (meaning no probkem without).
>
> I tried to kill -USR1 the failing worker, but it remains.
>
> Here are the Sig* from status file of one of the failing process:
> SigQ: 0/192448
> SigPnd: 0000000000000000
> SigBlk: 0000000000000800
> SigIgn: 0000000000001800
> SigCgt: 0000000180300205
>

I can reproduce the same situation here, however I disabled the seamless
reload. When doing a -USR1 and stracing a remaining worker, I can see that
the signal is not blocked, and that it's still polling.

My guess is that something is preventing the worker from leaving. I tried to
gdb the threads but none of them seems to be in a deadlock. I have to investigate
more.
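
For anyone wanting to capture the same picture from a stuck worker, the per-thread
backtraces can be dumped non-interactively with something like this (a rough sketch;
$PID is the remaining worker's PID and gdb must be installed):

gdb -batch -ex 'thread apply all bt' -p $PID   # print a backtrace of every thread, then detach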

I'm not sure that's related at all to the timing of the reload, but I could be
wrong.

> About the timing of reload, it seems to take a few seconds most of the
> time, so I *think* I am not reloading before another is not yet done,
> but I would appreciate whether I can check this fact through a file
> before sending the reload; do you have any hint?

I think systemd does not try to reload when a reload is not finished yet
with Type=notify. You could grep for 'reloading' in the systemctl status haproxy
output to check that.
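
For example, something along these lines should tell you whether a reload is
still in progress (assuming the unit is named haproxy, as in your setup):

systemctl status haproxy | grep -q reloading && echo "reload still in progress"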

Unfortunately, the only way to know when the service is ready is through systemd,
but I plan to make the status available on the stats socket in the future.

--
William Lallemand
William Lallemand
Re: remaining process after (seamless) reload
May 30, 2018 07:50PM
Hi Tim,

On Tue, May 29, 2018 at 09:33:48PM +0200, Tim Düsterhus wrote:
>
> @William Lallemand Possibly the sd_notifyf should be moved below
> mworker_unblock_signals in mworker_wait?
>

This shouldn't happen with or without systemd. I can reproduce it without
using systemd; we should not rely on an external component to mitigate the
problem.

In my opinion your suggestion won't change anything, and I suspect that the
problem is not there, but probably in the worker.

--
William Lallemand
Tim Düsterhus
Re: remaining process after (seamless) reload
May 30, 2018 08:10PM
William,

On 30.05.2018 at 19:45, William Lallemand wrote:
>> @William Lallemand Possibly the sd_notifyf should be moved below
>> mworker_unblock_signals in mworker_wait?
>>
>
> This shouldn't happen with or without systemd. I can reproduce it without
> using systemd, we should not rely on an external component to mitigate the
> problem.

Yes, I agree. But I believe that readiness should be signaled as late as
possible to ensure that everything is set up correctly. So the
`sd_notifyf` should be moved down below the signal registration, no? Even
if it does not fix this issue, the current placement is technically incorrect.

The ideal case would be the workers signaling their status to the master
(which in turn signals it to systemd). Is this part of the current work on
master-worker that Willy announced in his 1.9 mail? Personally, I'd like to
see workers being able to send information, like the number of requests
handled, to the master so it can be displayed in systemd [1].
Best regards
Tim Düsterhus

[1]
https://www.freedesktop.org/software/systemd/man/sd_notify.html#STATUS=%E2%80%A6
William Dauchy
Re: remaining process after (seamless) reload
May 30, 2018 08:10PM
On Wed, May 30, 2018 at 5:29 PM, William Lallemand
<[email protected]> wrote:
> I can reproduce the same situation there, however I disabled the seamless
> reload. When doing a -USR1 & strace on an remaining worker, I can see that the
> the signal is not blocked, and that it's still polling

good news!

> Unfortunately the only way to know when the service is ready is with systemd,
> but I planned to make the status available on the stats socket in the future.

That's exactly what I was looking for!

Best,
--
William
William Lallemand
Re: remaining process after (seamless) reload
May 30, 2018 08:30PM
On Wed, May 30, 2018 at 07:57:03PM +0200, Tim Düsterhus wrote:
> William,
>
> Am 30.05.2018 um 19:45 schrieb William Lallemand:
> >> @William Lallemand Possibly the sd_notifyf should be moved below
> >> mworker_unblock_signals in mworker_wait?
> >>
> >
> > This shouldn't happen with or without systemd. I can reproduce it without
> > using systemd, we should not rely on an external component to mitigate the
> > problem.
>
> Yes, I agree. But I believe that readiness should be signaled as late as
> possible to ensure that everything is set-up correctly. So the
> `sd_notifyf` should be moved down below the signal registering, no? Even
> if it does not fix this issue it is technically incorrect.
>

Even there it's technically incorrect, in fact: readiness should be notified
once all the workers are ready, and that's not possible right now.

When the signals are blocked, they are kept in the signal queue in order
to be delivered once they are unblocked, so it does not change much whether
it's done before or after the unblock. Even if we send the reload signal just
before the unblocking, it will still work.

> The ideal case would be the workers signaling their status to the master
> (which in turn signals to SystemD). Is this part of your current work on
> master-worker Willy announced in his 1.9 mail?

Better than that: the whole idea started from the need for synchronous
reloading that doesn't rely on signals anymore.

The current problems with the signal method:
- kill -USR2 is asynchronous and can't return an error code
- sd_notify can't signal an error during a reload, which is a big problem
- we use haproxy -c as a hack to stop the reloading if the new config is wrong
- haproxy can fail even if the config was right during the check
- we load the configuration twice because of the -c, which is not convenient
for people having a heavy configuration to load.

> Personally I'd like to see that workers are able to send information like
> the number of requests handled to the master to be able to display them in
> SystemD [1]
>

I thought about it too, but people usually tend to rely on statistics. I
already have in mind how to do that kind of thing, but I can't say it will be
there; it will depend on how much time I spend on the other features :)

--
William Lallemand
William Lallemand
Re: remaining process after (seamless) reload
June 07, 2018 12:00PM
Hi guys,

Sorry for the late reply, I managed to reproduce and fix what seems to be the bug.
The signal management was not handled correctly with threads.

Could you try these patches and see if they fix the problem?

Thanks.

--
William Lallemand
Willy Tarreau
Re: remaining process after (seamless) reload
June 07, 2018 12:10PM
On Thu, Jun 07, 2018 at 11:50:45AM +0200, William Lallemand wrote:
> /* block signal delivery during processing */
> +#ifdef USE_THREAD
> + pthread_sigmask(SIG_SETMASK, &blocked_sig, &old_sig);
> +#else
> sigprocmask(SIG_SETMASK, &blocked_sig, &old_sig);
> +#endif

I think for the merge we'd rather put a wrapper into hathreads.h, like
"ha_sigmask()" which uses either pthread_sigmask() or sigprocmask().

That will remove ifdefs and lower the risk of reusing these unsafe
calls.

What do you think ?
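
As a rough illustration (just a sketch, not necessarily the final form), such a
wrapper in hathreads.h could be as small as:

/* pick the thread-safe variant when threads are enabled */
#ifdef USE_THREAD
#define ha_sigmask(how, set, oldset)  pthread_sigmask((how), (set), (oldset))
#else
#define ha_sigmask(how, set, oldset)  sigprocmask((how), (set), (oldset))
#endif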

thanks,
Willy
William Lallemand
Re: remaining process after (seamless) reload
June 08, 2018 10:20AM
On Thu, Jun 07, 2018 at 12:02:46PM +0200, Willy Tarreau wrote:
> On Thu, Jun 07, 2018 at 11:50:45AM +0200, William Lallemand wrote:
> > /* block signal delivery during processing */
> > +#ifdef USE_THREAD
> > + pthread_sigmask(SIG_SETMASK, &blocked_sig, &old_sig);
> > +#else
> > sigprocmask(SIG_SETMASK, &blocked_sig, &old_sig);
> > +#endif
>
> I think for the merge we'd rather put a wrapper into hathreads.h, like
> "ha_sigmask()" which uses either pthread_sigmask() or sigprocmask().
>
> That will remove ifdefs and lower the risk of reusing these unsafe
> calls.
>
> What do you think ?
>

Agreed, I'll rework the patches once they are tested.

--
William Lallemand
William Dauchy
Re: remaining process after (seamless) reload
June 08, 2018 02:20PM
Hello William L.,

On Thu, Jun 07, 2018 at 11:50:45AM +0200, William Lallemand wrote:
> Sorry for the late reply, I manage to reproduce and fix what seams to be the bug.
> The signal management was not handled correctly with threads.
> Could you try those patches and see if it fixes the problem?

Thanks a lot for the proposed patches!
I started a new test yesterday evening and it seems OK so far, so I
deployed them on two new machines this morning.
I hope I'm not speaking too soon, but it looks promising, because I was
previously able to reproduce the issue within a few hours at most.

Keep me up to date if you need a test on a re-worked version.

Thanks!


> From d695242fb260538bd8db323715d627c4a9deacc7 Mon Sep 17 00:00:00 2001
> From: William Lallemand <[email protected]>
> Date: Thu, 7 Jun 2018 09:46:01 +0200
> Subject: [PATCH 1/3] BUG/MEDIUM: threads: handle signal queue only in thread 0
>
> Signals were handled in all threads which caused some signals to be lost
> from time to time. To avoid complicated lock system (threads+signals),
> we prefer handling the signals in one thread avoiding concurrent access.
>
> The side effect of this bug was that some process were not leaving from
> time to time during a reload.
> ---
> src/haproxy.c | 21 ++++++++++++++++++---
> src/signal.c | 8 --------
> 2 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/src/haproxy.c b/src/haproxy.c
> index 4628d8296..e984ca573 100644
> --- a/src/haproxy.c
> +++ b/src/haproxy.c
> @@ -2398,8 +2398,10 @@ static void run_poll_loop()
> /* Process a few tasks */
> process_runnable_tasks();
>
> - /* check if we caught some signals and process them */
> - signal_process_queue();
> + /* check if we caught some signals and process them in the
> + first thread */
> + if (tid == 0)
> + signal_process_queue();
>
> /* Check if we can expire some tasks */
> next = wake_expired_tasks();
> @@ -2416,7 +2418,7 @@ static void run_poll_loop()
> activity[tid].wake_tasks++;
> else if (active_applets_mask & tid_bit)
> activity[tid].wake_applets++;
> - else if (signal_queue_len)
> + else if (signal_queue_len && tid == 0)
> activity[tid].wake_signal++;
> else
> exp = next;
> @@ -3006,6 +3008,7 @@ int main(int argc, char **argv)
> unsigned int *tids = calloc(global.nbthread, sizeof(unsigned int));
> pthread_t *threads = calloc(global.nbthread, sizeof(pthread_t));
> int i;
> + sigset_t blocked_sig, old_sig;
>
> THREAD_SYNC_INIT((1UL << global.nbthread) - 1);
>
> @@ -3013,6 +3016,15 @@ int main(int argc, char **argv)
> for (i = 0; i < global.nbthread; i++)
> tids = i;
>
> + /* ensure the signals will be blocked in every thread */
> + sigfillset(&blocked_sig);
> + sigdelset(&blocked_sig, SIGPROF);
> + sigdelset(&blocked_sig, SIGBUS);
> + sigdelset(&blocked_sig, SIGFPE);
> + sigdelset(&blocked_sig, SIGILL);
> + sigdelset(&blocked_sig, SIGSEGV);
> + pthread_sigmask(SIG_SETMASK, &blocked_sig, &old_sig);
> +
> /* Create nbthread-1 thread. The first thread is the current process */
> threads[0] = pthread_self();
> for (i = 1; i < global.nbthread; i++)
> @@ -3046,6 +3058,9 @@ int main(int argc, char **argv)
> }
> #endif /* !USE_CPU_AFFINITY */
>
> + /* when multithreading we need to let only the thread 0 handle the signals */
> + pthread_sigmask(SIG_SETMASK, &old_sig, NULL);
> +
> /* Finally, start the poll loop for the first thread */
> run_thread_poll_loop(&tids[0]);
>
> diff --git a/src/signal.c b/src/signal.c
> index a0975910b..f1f682188 100644
> --- a/src/signal.c
> +++ b/src/signal.c
> @@ -31,7 +31,6 @@ struct pool_head *pool_head_sig_handlers = NULL;
> sigset_t blocked_sig;
> int signal_pending = 0; /* non-zero if t least one signal remains unprocessed */
>
> -__decl_hathreads(HA_SPINLOCK_T signals_lock);
>
> /* Common signal handler, used by all signals. Received signals are queued.
> * Signal number zero has a specific status, as it cannot be delivered by the
> @@ -71,9 +70,6 @@ void __signal_process_queue()
> struct signal_descriptor *desc;
> sigset_t old_sig;
>
> - if (HA_SPIN_TRYLOCK(SIGNALS_LOCK, &signals_lock))
> - return;
> -
> /* block signal delivery during processing */
> sigprocmask(SIG_SETMASK, &blocked_sig, &old_sig);
>
> @@ -100,7 +96,6 @@ void __signal_process_queue()
>
> /* restore signal delivery */
> sigprocmask(SIG_SETMASK, &old_sig, NULL);
> - HA_SPIN_UNLOCK(SIGNALS_LOCK, &signals_lock);
> }
>
> /* perform minimal intializations, report 0 in case of error, 1 if OK. */
> @@ -112,8 +107,6 @@ int signal_init()
> memset(signal_queue, 0, sizeof(signal_queue));
> memset(signal_state, 0, sizeof(signal_state));
>
> - HA_SPIN_INIT(&signals_lock);
> -
> /* Ensure signals are not blocked. Some shells or service managers may
> * accidently block all of our signals unfortunately, causing lots of
> * zombie processes to remain in the background during reloads.
> @@ -148,7 +141,6 @@ void deinit_signals()
> pool_free(pool_head_sig_handlers, sh);
> }
> }
> - HA_SPIN_DESTROY(&signals_lock);
> }
>
> /* Register a function and an integer argument on a signal. A pointer to the
> --
> 2.16.1
>

> From 1501eddeb506897126d0d3d60a36ca780b24ffdf Mon Sep 17 00:00:00 2001
> From: William Lallemand <[email protected]>
> Date: Thu, 7 Jun 2018 09:49:04 +0200
> Subject: [PATCH 2/3] BUG/MINOR: don't ignore SIG{BUS,FPE,ILL,SEGV} during
> signal processing
>
> ---
> src/signal.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/src/signal.c b/src/signal.c
> index f1f682188..0dadd762c 100644
> --- a/src/signal.c
> +++ b/src/signal.c
> @@ -120,6 +120,14 @@ int signal_init()
>
> sigfillset(&blocked_sig);
> sigdelset(&blocked_sig, SIGPROF);
> + /* man sigprocmask: If SIGBUS, SIGFPE, SIGILL, or SIGSEGV are
> + generated while they are blocked, the result is undefined, unless
> + the signal was generated by kill(2),
> + sigqueue(3), or raise(3) */
> + sigdelset(&blocked_sig, SIGBUS);
> + sigdelset(&blocked_sig, SIGFPE);
> + sigdelset(&blocked_sig, SIGILL);
> + sigdelset(&blocked_sig, SIGSEGV);
> for (sig = 0; sig < MAX_SIGNAL; sig++)
> LIST_INIT(&signal_state[sig].handlers);
>
> --
> 2.16.1
>

> From 73423f636d906e7815589f80088084cb7ca589b1 Mon Sep 17 00:00:00 2001
> From: William Lallemand <[email protected]>
> Date: Thu, 7 Jun 2018 11:23:40 +0200
> Subject: [PATCH 3/3] BUG/MINOR: signals: sigprocmask is unspecified in a
> multithread process
>
> Replace sigprocmask by pthread_sigmask.
> ---
> src/signal.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/src/signal.c b/src/signal.c
> index 0dadd762c..767d85c48 100644
> --- a/src/signal.c
> +++ b/src/signal.c
> @@ -71,7 +71,11 @@ void __signal_process_queue()
> sigset_t old_sig;
>
> /* block signal delivery during processing */
> +#ifdef USE_THREAD
> + pthread_sigmask(SIG_SETMASK, &blocked_sig, &old_sig);
> +#else
> sigprocmask(SIG_SETMASK, &blocked_sig, &old_sig);
> +#endif
>
> /* It is important that we scan the queue forwards so that we can
> * catch any signal that would have been queued by another signal
> @@ -95,7 +99,11 @@ void __signal_process_queue()
> signal_queue_len = 0;
>
> /* restore signal delivery */
> +#ifdef USE_THREAD
> + pthread_sigmask(SIG_SETMASK, &old_sig, NULL);
> +#else
> sigprocmask(SIG_SETMASK, &old_sig, NULL);
> +#endif
> }
>
> /* perform minimal intializations, report 0 in case of error, 1 if OK. */
> --
> 2.16.1

--
William
William Lallemand
Re: remaining process after (seamless) reload
June 08, 2018 04:40PM
On Fri, Jun 08, 2018 at 02:10:44PM +0200, William Dauchy wrote:
> On Thu, Jun 07, 2018 at 11:50:45AM +0200, William Lallemand wrote:
> > Sorry for the late reply, I manage to reproduce and fix what seams to be the bug.
> > The signal management was not handled correctly with threads.
> > Could you try those patches and see if it fixes the problem?
>
> Thanks a lot for the proposed patches!
> I started a new test yesterday evening, and it seems ok so far, so I
> deployed on two new machines this morning.
> I hope I don't speak too fast but it looks promising because I was
> previously able to reproduce in a few hours maximum.
>

That's great news!

Here are the new patches. They shouldn't change anything in the fix; they only
change the sigprocmask calls to pthread_sigmask.

--
William Lallemand
Willy Tarreau
Re: remaining process after (seamless) reload
June 08, 2018 06:30PM
On Fri, Jun 08, 2018 at 04:31:30PM +0200, William Lallemand wrote:
> That's great news!
>
> Here's the new patches. It shouldn't change anything to the fix, it only
> changes the sigprocmask to pthread_sigmask.

OK, I can merge them right now if you want. At the very least it will
kill a whole class of potential bugs and help future troubleshooting.
Just let me know.

Thank you guys,
Willy
William Lallemand
Re: remaining process after (seamless) reload
June 08, 2018 06:30PM
On Fri, Jun 08, 2018 at 06:20:21PM +0200, Willy Tarreau wrote:
> On Fri, Jun 08, 2018 at 04:31:30PM +0200, William Lallemand wrote:
> > That's great news!
> >
> > Here's the new patches. It shouldn't change anything to the fix, it only
> > changes the sigprocmask to pthread_sigmask.
>
> OK, I can merge them right now if you want. At the very least it will
> kill a whole class of potential bugs and help future troubleshooting.
> Just let me know.
>
> Thank you guys,
> Willy

Ok for me.

--
William Lallemand
William Dauchy
Re: remaining process after (seamless) reload
June 08, 2018 06:30PM
On Fri, Jun 08, 2018 at 04:31:30PM +0200, William Lallemand wrote:
> That's great news!
>
> Here's the new patches. It shouldn't change anything to the fix, it only
> changes the sigprocmask to pthread_sigmask.

Thanks, I attached the backports for 1.8 and started a new test with
them.
Feel free to add a Tested-by tag :)

--
William
Attachments:
0001-BUG-MEDIUM-threads-handle-signal-queue-only-in-threa.patch (637 bytes)
0002-BUG-MINOR-don-t-ignore-SIG-BUS-FPE-ILL-SEGV-during-s.patch (646 bytes)
0003-BUG-MINOR-signals-ha_sigmask-macro-for-multithreadin.patch (956 bytes)
Willy Tarreau
Re: remaining process after (seamless) reload
June 08, 2018 06:30PM
On Fri, Jun 08, 2018 at 06:22:39PM +0200, William Dauchy wrote:
> On Fri, Jun 08, 2018 at 04:31:30PM +0200, William Lallemand wrote:
> > That's great news!
> >
> > Here's the new patches. It shouldn't change anything to the fix, it only
> > changes the sigprocmask to pthread_sigmask.
>
> thanks, I attached the backport for 1.8 and started a new test with
> them.
> Feel free to add Tested-by tag :)

Argh! 5 seconds too late, I've just pushed them now for mainline!

Thanks guys,
Willy
William Dauchy
Re: remaining process after (seamless) reload
June 12, 2018 04:10PM
Hello William L,

On Fri, Jun 08, 2018 at 04:31:30PM +0200, William Lallemand wrote:
> That's great news!
>
> Here's the new patches. It shouldn't change anything to the fix, it only
> changes the sigprocmask to pthread_sigmask.

In fact, I now have a different but similar issue.

root 18547 3.2 1.3 986660 898844 ? Ss Jun08 182:12 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 2063 1903 1763 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats
haproxy 14593 299 1.3 1251216 920480 ? Rsl Jun11 5882:01 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 14582 14463 -x /var/lib/haproxy/stats
haproxy 18290 299 1.4 1265028 935288 ? Ssl Jun11 3425:51 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 18281 18271 18261 14593 -x /var/lib/haproxy/stats
haproxy 29663 99.9 1.4 1258024 932796 ? Ssl Jun11 1063:08 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 29653 29644 18290 14593 -x /var/lib/haproxy/stats
haproxy 4203 99.9 1.4 1258804 933216 ? Ssl Jun11 1009:27 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 4194 4182 18290 29663 14593 -x /var/lib/haproxy/stats
haproxy 1445 25.9 1.4 1261680 929516 ? Ssl 13:51 0:42 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1436 29663 4203 18290 14593 -x /var/lib/haproxy/stats
haproxy 1763 18.9 1.4 1260500 931516 ? Ssl 13:52 0:15 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats
haproxy 1903 25.0 1.4 1261472 931064 ? Ssl 13:53 0:14 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1763 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats
haproxy 2063 52.5 1.4 1259568 927916 ? Ssl 13:53 0:19 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1903 1763 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats
haproxy 2602 62.0 1.4 1262220 928776 ? Rsl 13:54 0:02 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 2063 1903 1763 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats


# cat /proc/14593/status | grep Sig
SigQ: 0/257120
SigPnd: 0000000000000000
SigBlk: 0000000000000800
SigIgn: 0000000000001800
SigCgt: 0000000180300205

kill -USR1 14593 has no effect:

# strace -ffff -p 14593
strace: Process 14593 attached with 3 threads
strace: [ Process PID=14595 runs in x32 mode. ]
[pid 14593] --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=18547, si_uid=0} ---
[pid 14593] rt_sigaction(SIGUSR1, {0x558357660020, [USR1], SA_RESTORER|SA_RESTART, 0x7f0e87671270}, {0x558357660020, [USR1], SA_RESTORER|SA_RESTART, 0x7f0e87671270}, 8) = 0
[pid 14593] rt_sigreturn({mask=[USR2]}) = 7

However, the unix socket is held by the correct process:

# lsof | grep "haproxy/stats" ; ps auxwwf | grep haproxy
haproxy 2602 haproxy 5u unix 0xffff880f902e8000 0t0 3333061798 /var/lib/haproxy/stats.18547.tmp
haproxy 2602 2603 haproxy 5u unix 0xffff880f902e8000 0t0 3333061798 /var/lib/haproxy/stats.18547.tmp
haproxy 2602 2604 haproxy 5u unix 0xffff880f902e8000 0t0 3333061798 /var/lib/haproxy/stats.18547.tmp
haproxy 2602 2605 haproxy 5u unix 0xffff880f902e8000 0t0 3333061798 /var/lib/haproxy/stats.18547.tmp

So it means it does not cause any issue for the provisioner, which talks
to the correct process; however, there are still remaining processes.
Should I start a different thread for that issue?

It seems harder to reproduce; I got the issue roughly two days after pushing
the patched version back out.

Thanks,

--
William
William Lallemand
Re: remaining process after (seamless) reload
June 12, 2018 04:40PM
On Tue, Jun 12, 2018 at 04:00:25PM +0200, William Dauchy wrote:
> Hello William L,
>
> On Fri, Jun 08, 2018 at 04:31:30PM +0200, William Lallemand wrote:
> > That's great news!
> >
> > Here's the new patches. It shouldn't change anything to the fix, it only
> > changes the sigprocmask to pthread_sigmask.
>
> In fact, I now have a different but similar issue.
>

:(

> root 18547 3.2 1.3 986660 898844 ? Ss Jun08 182:12 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 2063 1903 1763 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats
> haproxy 14593 299 1.3 1251216 920480 ? Rsl Jun11 5882:01 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 14582 14463 -x /var/lib/haproxy/stats
> haproxy 18290 299 1.4 1265028 935288 ? Ssl Jun11 3425:51 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 18281 18271 18261 14593 -x /var/lib/haproxy/stats
> haproxy 29663 99.9 1.4 1258024 932796 ? Ssl Jun11 1063:08 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 29653 29644 18290 14593 -x /var/lib/haproxy/stats
> haproxy 4203 99.9 1.4 1258804 933216 ? Ssl Jun11 1009:27 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 4194 4182 18290 29663 14593 -x /var/lib/haproxy/stats
> haproxy 1445 25.9 1.4 1261680 929516 ? Ssl 13:51 0:42 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1436 29663 4203 18290 14593 -x /var/lib/haproxy/stats
> haproxy 1763 18.9 1.4 1260500 931516 ? Ssl 13:52 0:15 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats
> haproxy 1903 25.0 1.4 1261472 931064 ? Ssl 13:53 0:14 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1763 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats
> haproxy 2063 52.5 1.4 1259568 927916 ? Ssl 13:53 0:19 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 1903 1763 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats
> haproxy 2602 62.0 1.4 1262220 928776 ? Rsl 13:54 0:02 \_ /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -sf 2063 1903 1763 1445 14593 29663 4203 18290 -x /var/lib/haproxy/stats
>
>

Those processes are still using a lot of CPU...

> # cat /proc/14593/status | grep Sig
> SigQ: 0/257120
> SigPnd: 0000000000000000
> SigBlk: 0000000000000800
> SigIgn: 0000000000001800
> SigCgt: 0000000180300205
>
> kill -USR1 14593 has no effect:
>
> # strace -ffff -p 14593
> strace: Process 14593 attached with 3 threads


> strace: [ Process PID=14595 runs in x32 mode. ]

This part is particularly interesting: I suppose you are not running in x32, right?
I had this problem at some point but was never able to reproduce it...

We might find something interesting by looking further.

> [pid 14593] --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=18547, si_uid=0} ---
> [pid 14593] rt_sigaction(SIGUSR1, {0x558357660020, [USR1], SA_RESTORER|SA_RESTART, 0x7f0e87671270}, {0x558357660020, [USR1], SA_RESTORER|SA_RESTART, 0x7f0e87671270}, 8) = 0
> [pid 14593] rt_sigreturn({mask=[USR2]}) = 7


At least you managed to strace while the process was seen as an x32 one; that wasn't the case for me.

>
> however, the unix socket is on the correct process:
>
> # lsof | grep "haproxy/stats" ; ps auxwwf | grep haproxy
> haproxy 2602 haproxy 5u unix 0xffff880f902e8000 0t0 3333061798 /var/lib/haproxy/stats.18547.tmp
> haproxy 2602 2603 haproxy 5u unix 0xffff880f902e8000 0t0 3333061798 /var/lib/haproxy/stats.18547.tmp
> haproxy 2602 2604 haproxy 5u unix 0xffff880f902e8000 0t0 3333061798 /var/lib/haproxy/stats.18547.tmp
> haproxy 2602 2605 haproxy 5u unix 0xffff880f902e8000 0t0 3333061798 /var/lib/haproxy/stats.18547.tmp
>
> So it means, it does not cause any issue for the provisioner which talks
> to the correct process, however, they are remaining process.

Are they still delivering traffic?


> Should I start a different thread for that issue?
>

That's not necessary, thanks.

> it seems harder to reproduce, I got the issue ~2 days after pushing back.
>
> Thanks,
>

I'll try to reproduce this again...

--
William Lallemand
William Dauchy
Re: remaining process after (seamless) reload
June 12, 2018 05:00PM
On Tue, Jun 12, 2018 at 04:33:43PM +0200, William Lallemand wrote:
> Those processes are still using a lot of CPU...
> Are they still delivering traffic?

They don't seem to handle any traffic (at least I can't see any through strace),
but that's the main difference here: they're using lots of CPU.

> > strace: [ Process PID=14595 runs in x32 mode. ]
>
> This part is particularly interesting, I suppose you are not running in x32, right?
> I had this problem at some point but was never able to reproduce it...

Yup, I can't really understand where it is coming from. I have this
issue until I do a complete restart.

Thanks for your help,
--
William