Version 1.5.12, getting 502 when server check fails, but server is still working

Posted by Shawn Heisey 
I'm working on making my application capable of handling service
restarts on the back end with zero loss or interruption.  It runs on two
servers behind haproxy.

At application shutdown, I'm setting a flag that makes the healthcheck
fail, and then keeping the application running for thirty seconds in
order to finish up all requests that the server has already received.
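
For reference, the arrangement described here amounts to a check
configuration along these lines (a sketch only; the /health path,
addresses and timings are illustrative, not taken from the real config):

    backend be-app
        # the application flips /health to a failing response at shutdown,
        # then keeps serving for thirty seconds to drain in-flight requests
        option httpchk GET /health
        server planet    10.0.0.1:9000 check inter 2s fall 2 rise 2
        server hollywood 10.0.0.2:9000 check inter 2s fall 2 rise 2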

It seems that when haproxy's health check fails while a request is
underway, the machine making the request will be sent a 502 response
instead of the good response that the server WILL make. This is probably
a good thing for haproxy to do in general, but in this case, I know that
my application's shutdown hook *WILL* allow enough time for the request
to finish before it forcibly halts the application.

Is there a config option I can enable to allow in-flight responses to be
returned after check failure?  Would I need to upgrade beyond 1.5 to get
that working?  If the server doesn't send a good response within the
timeout limit, then I'm perfectly OK with haproxy returning an error.

Thanks,
Shawn
On Sun, 15 Apr 2018 at 20:56, Shawn Heisey <[email protected]> wrote:

> Would I need to upgrade beyond 1.5 to get that working?


I don't have any info about your precise problem, but here's a quote from
Willy's 1.9 thread within the last couple of months:

"Oh, before I forget, since nobody asked for 1.4 to continue to be
maintained, I've just marked it "unmaintained", and 1.5 now entered
the "critical fixes only" status. 1.4 will have lived almost 8 years
(1.4.0 was released on 2010-02-26). Given that it doesn't support
SSL, it's unlikely to be found exposed to HTTP traffic in sensitive
places anymore. If you still use it, there's nothing wrong for now,
as it's been one of the most stable versions of all times. But please
at least regularly watch the activity on the newer ones and consider
upgrading it once you see that some issues might affect it. For those
who can really not risk to face a bug, 1.6 is a very good candidate
now and is still well supported 2 years after its birth."
You might get a solution to this and your other 1.5 problem on the list -
it has a very helpful and knowledgeable population :-)

But if you can possibly upgrade to 1.6 or later, I suspect the frequency of
answers you get and the flexibility they'll have to help you will improve
markedly.

HTH!
J
--
Jonathan Matthews
London, UK
http://www.jpluscplusm.com/contact.html
Hello,


On 15 April 2018 at 21:53, Shawn Heisey <[email protected]> wrote:
> I'm working on making my application capable of handling service restarts on
> the back end with zero loss or interruption. It runs on two servers behind
> haproxy.
>
> At application shutdown, I'm setting a flag that makes the healthcheck fail,
> and then keeping the application running for thirty seconds in order to
> finish up all requests that the server has already received.
>
> It seems that when haproxy's health check fails while a request is underway,
> the machine making the request will be sent a 502 response instead of the
> good response that the server WILL make. This is probably a good thing for
> haproxy to do in general, but in this case, I know that my application's
> shutdown hook *WILL* allow enough time for the request to finish before it
> forcibly halts the application.

You'll have to share your entire configuration for us to be able to
comment on the behavior.

Having said that, you'd be better off setting the server to
maintenance mode instead of letting the health check fail (via
webinterface or stats socket):

http://cbonte.github.io/haproxy-dconv/1.5/configuration.html#9.2-set%20server
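
For example, assuming socat is available on the load balancer host, the
whole drain can be driven through the stats socket (a sketch; the
backend/server names match the ones that appear later in this thread,
and the pause is whatever your drain window needs):

    # maintenance mode: haproxy immediately stops sending new requests
    echo "disable server be-cdn-9000/planet" | socat stdio /etc/haproxy/stats.socket
    sleep 30    # let in-flight requests finish, restart the application
    echo "enable server be-cdn-9000/planet" | socat stdio /etc/haproxy/stats.socket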



The upgrade to a more recent build is, thanks to Vincent's work, very
simple on Debian and Ubuntu:
https://haproxy.debian.net



Cheers,
Lukas
Hi,

On Mon, Apr 16, Lukas Tribus wrote:
> On 15 April 2018 at 21:53, Shawn Heisey <[email protected]> wrote:
> > I'm working on making my application capable of handling service restarts on
> > the back end with zero loss or interruption. It runs on two servers behind
> > haproxy.
> >
> > At application shutdown, I'm setting a flag that makes the healthcheck fail,
> > and then keeping the application running for thirty seconds in order to
> > finish up all requests that the server has already received.
> >
> > It seems that when haproxy's health check fails while a request is underway,
> > the machine making the request will be sent a 502 response instead of the
> > good response that the server WILL make. This is probably a good thing for
> > haproxy to do in general, but in this case, I know that my application's
> > shutdown hook *WILL* allow enough time for the request to finish before it
> > forcibly halts the application.
>
> You'll have to share your entire configuration for us to be able to
> comment on the behavior.
>
> Having said that, you'd be better off setting the server to
> maintenance mode instead of letting the health check fail (via
> webinterface or stats socket):
>
> http://cbonte.github.io/haproxy-dconv/1.5/configuration.html#9.2-set%20server

There's also http-check disable-on-404
(http://cbonte.github.io/haproxy-dconv/1.5/configuration.html#4.2-http-check%20disable-on-404)

So maybe first set a flag that returns 404 on the health check, and
only after thirty seconds fail the health check.
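
In configuration terms, that would look something like this (a sketch;
the /health path is illustrative):

    backend be-app
        option httpchk GET /health
        http-check disable-on-404
        # a 404 on the check puts the server in NOLB state: it is excluded
        # from load balancing but still serves the connections it already has
        server planet 10.0.0.1:9000 check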

-Jarno

--
Jarno Huuskonen
Hello Shawn,



please keep the mailing-list in the loop.



On 16 April 2018 at 16:53, Shawn Heisey <[email protected]> wrote:
>> Having said that, you'd be better off setting the server to
>> maintenance mode instead of letting the health check fail (via
>> webinterface or stats socket):
>>
>>
>> http://cbonte.github.io/haproxy-dconv/1.5/configuration.html#9.2-set%20server
>
>
> The back end servers don't know anything about the load balancer. And since
> the load balancer does send them requests from the Internet, I think it
> would be a potential security issue if it was able to affect the load
> balancer -- that load balancer handles a lot more than just this service.

I don't follow? Why is using a restricted admin socket a security issue?

You are already exposing the admin socket locally in your
configuration on line 16:
stats socket /etc/haproxy/stats.socket level admin

My suggestion was to use that admin interface to send the "set server" command.



> The disable-on-404 setting that Jarno mentioned might do what we need. I
> will give it a try. That's very easy to do in my application.

Yes, that may be more elegant depending on the environment; the final
result is the same: to put the server into maintenance mode.



> I have placed a slightly redacted version of my config here:

I think your original issue may be due to the "retries 1"
configuration you have in there. I would recommend removing that.




Regards,
Lukas
On 4/16/2018 6:43 AM, Jarno Huuskonen wrote:
> There's also http-check disable-on-404
> (http://cbonte.github.io/haproxy-dconv/1.5/configuration.html#4.2-http-check%20disable-on-404)
>
> So maybe first set flag that returns 404 on health check and only after
> thirty seconds fail the health check.

This looks really promising, but then I saw this in the documentation
for that option:

"If the server responds 2xx or 3xx again, it will immediately be
reinserted into the farm."

Is that referring to a 2xx or 3xx on the health check, or a 2xx/3xx on
external requests already sent to that server?  If it's the former, then
there's no problem, but if it's the latter, then that isn't what I want
at all.  My guess about this is that it's the former, but I'd like
confirmation.

Thanks,
Shawn
On 4/16/2018 9:15 AM, Lukas Tribus wrote:
> Hello Shawn,
>
> please keep the mailing-list in the loop.

Sorry about that.  It looks like the haproxy list doesn't set a reply-to
header directing replies to the list.  Most mailing lists I have dealt
with do, so just hitting "reply" does the right thing.  I sometimes
forget to use the "reply list" option.

> I don't follow? Why is using a restricted admin socket a security issue?
>
> You are already exposing the admin socket locally in your
> configuration on line 16:
> stats socket /etc/haproxy/stats.socket level admin
>
> My suggestion was to use that admin interface to send the "set server" command.

I enabled the admin socket so that I could renew OCSP stapling. As far
as I understand, it can only be used on the load balancer machine
itself, and I think this is the only way to renew stapling other than
restarting the program, which isn't something I want to do.
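
For what it's worth, the stapling update itself goes through that same
socket with "set ssl ocsp-response", which takes a base64-encoded DER
OCSP response (a sketch; the certificate path is illustrative):

    echo "set ssl ocsp-response $(base64 -w 10000 /etc/haproxy/cert.pem.ocsp)" \
        | socat stdio /etc/haproxy/stats.socket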

As for the possible security issue: If somebody were to compromise the
back end server and the back end server had knowledge about the load
balancer, then the attacker might have enough information to fiddle with
the load balancer for *other* things the load balancer is handling that
are more sensitive.

> I think your original issue may be due to the "retries 1"
> configuration you have in there. I would recommend removing that.

The documentation for 1.5 says the default value for retries is 3. 
Wouldn't removing it make whatever problems a retry causes *worse*?  If
retries are bad, then perhaps I should set it to 0.  I have no
recollection about why I have this setting in the config.  The
default/global settings were created years ago and don't change much.

Thanks,
Shawn
Hello Shawn,



On 16 April 2018 at 17:39, Shawn Heisey <[email protected]> wrote:
> I enabled the admin socket so that I could renew OCSP stapling. As far as I
> understand, it can only be used on the load balancer machine itself, and I
> think this is the only way to renew stapling other than restarting the
> program, which isn't something I want to do.
>
> As for the possible security issue: If somebody were to compromise the back
> end server and the back end server had knowledge about the load balancer

Why would the backend need to have any knowledge about the
load-balancer? You'd adjust your workflow and command the switch from
the load-balancer instead of from your backend application; that's it.
Your backend does not need to access the load-balancer in any way.



> then the attacker might have enough information to fiddle with the load
> balancer for *other* things the load balancer is handling that are more
> sensitive.
>
>> I think your original issue may be due to the "retries 1"
>> configuration you have in there. I would recommend removing that.
>
>
> The documentation for 1.5 says the default value for retries is 3. Wouldn't
> removing it make whatever problems a retry causes *worse*? If retries are
> bad, then perhaps I should set it to 0. I have no recollection about why I
> have this setting in the config. The default/global settings were created
> years ago and don't change much.

Retries are a good thing, and the default of 3 is a good value.
Changing this to a non-default value will have an impact, especially in
cases where a server is going down or is about to go down.

By removing the "retries" configuration altogether you are using the
default value of 3, which is the recommended configuration.
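
Concretely, nothing needs to replace the removed line; with no "retries"
directive in the defaults section or the backend, the built-in value
applies (a sketch, with "mode http" assumed):

    defaults
        mode http
        # no "retries" line: haproxy falls back to its default of 3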



Regards,
Lukas
On 4/16/2018 6:43 AM, Jarno Huuskonen wrote:
> There's also http-check disable-on-404
> (http://cbonte.github.io/haproxy-dconv/1.5/configuration.html#4.2-http-check%20disable-on-404)

I couldn't get this to work at first.  If I put the disable-on-404
option in the actual back end, it complains like this:

[WARNING] 105/095152 (5379) : config : 'disable-on-404' will be ignored
for backend 'be-cdn-9000' (requires 'option httpchk').

That makes sense, because I'm using tracking, not actual health checks
in that back end.  So I moved it to the check back end, and it gave me
much worse errors:

[ALERT] 105/095234 (7186) : config : backend 'be-cdn-9000', server
'planet': unable to use chk-cdn-9000/planet for tracking: disable-on-404
option inconsistency.
[ALERT] 105/095234 (7186) : config : backend 'be-cdn-9000', server
'hollywood': unable to use chk-cdn-9000/hollywood for tracking:
disable-on-404 option inconsistency.
[ALERT] 105/095234 (7186) : Fatal errors found in configuration.

Eliminating the "track" config and doing the health checks in the actual
back end has fixed that.  I need to do some testing to see whether it
does what I want it to do.
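
For the record, the shape that haproxy accepted is roughly this (the
backend and server names are the real ones from the errors above; the
addresses and check path are illustrative):

    backend be-cdn-9000
        option httpchk GET /health
        http-check disable-on-404
        server planet    10.0.0.1:9000 check
        server hollywood 10.0.0.2:9000 check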

I am curious about why I couldn't use "track".

Thanks,
Shawn
On Mon, Apr 16, 2018 at 10:03:44AM -0600, Shawn Heisey wrote:
> I am curious about why I couldn't use "track".

"track" means that your current server will always be in the same state
as the designated one. It will never run its own checks, and will receive
notifications from the other one's state change events.

So you can simply not have any check-specific stuff on a server tracking
another one. However if you use disable-on-404 on the tracked one, the
tracking one will obviously adapt.
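
A minimal tracking setup looks like this (check path and addresses are
illustrative); only the tracked backend runs checks, and the tracking
server simply mirrors its state:

    backend chk-cdn-9000
        option httpchk GET /health
        server planet 10.0.0.1:9000 check

    backend be-cdn-9000
        # no "check" here: planet's state follows chk-cdn-9000/planet
        server planet 10.0.0.1:9000 track chk-cdn-9000/planet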

Willy
On 4/16/2018 1:46 PM, Willy Tarreau wrote:
> On Mon, Apr 16, 2018 at 10:03:44AM -0600, Shawn Heisey wrote:
>> I am curious about why I couldn't use "track".
> "track" means that your current server will always be in the same state
> as the designated one. It will never run its own checks, and will receive
> notifications from the other one's state change events.
>
> So you can simply not have any check-specific stuff on a server tracking
> another one. However if you use disable-on-404 on the tracked one, the
> tracking one will obviously adapt.

Thanks to you and everyone else who has replied.

That didn't work.  I tried to use disable-on-404 on the tracked
backend.  I got the fatal configuration error I mentioned before:

[ALERT] 105/095234 (7186) : config : backend 'be-cdn-9000', server
'planet': unable to use chk-cdn-9000/planet for tracking: disable-on-404
option inconsistency.
[ALERT] 105/095234 (7186) : config : backend 'be-cdn-9000', server
'hollywood': unable to use chk-cdn-9000/hollywood for tracking:
disable-on-404 option inconsistency.
[ALERT] 105/095234 (7186) : Fatal errors found in configuration.

I also tried it with that option in both backend configurations.  That
didn't work either.  I don't recall the error, but it was probably the
same as one of the other errors I had gotten before.

This is on 1.5.12, and I can't really blame you if the "standard" mental
map you keep of the project doesn't include that version.  It's got to
be hard enough keeping that straight for just 1.8 and 1.9-dev!  Maybe
the error I encountered would be solved by upgrading.  Upgrading is on
the (really long) todo list.

I do have a config that works.  I'm no longer tracking another backend,
but doing the health checks in the load-balancing backend.  The whole
reason I had migrated server checks to dedicated back ends was because I
wanted to reduce the number of check requests being sent, and I'm
sharing the check backends with multiple balancing backends in some
cases.  For the one I've been describing, I don't need to share the
check backend.

I ran into other problems on the application side with how process
shutdowns work, but resolved those by adding an endpoint into my app
with the URL path of "/lbdisable" and handling the disable/pause in the
init script instead of the application.  I can now restart my custom
application at will without any loss, and without a client even noticing
there was a problem.
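
The restart sequence in the init script boils down to this sketch (the
port, the endpoint behavior and the stop/start steps are specific to my
setup):

    curl -s http://localhost:9000/lbdisable   # health check now answers 404
    sleep 30                                  # haproxy drains in-flight requests
    do_stop                                   # stop/start as defined by the init script
    do_start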

As of a little while ago, I have solved all the problems I encountered
on the road to graceful application restarts except the one where a
backup server is not promoted to active as soon as the primary servers
are all down.  I described that issue in a separate message to the
list.  I do have a workaround to that issue -- I'm no longer using
"backup" on any server entries for this service.

Thanks,
Shawn
On Mon, Apr 16, 2018 at 04:13:28PM -0600, Shawn Heisey wrote:
> [ALERT] 105/095234 (7186) : config : backend 'be-cdn-9000', server
> 'planet': unable to use chk-cdn-9000/planet for tracking: disable-on-404
> option inconsistency.
> [ALERT] 105/095234 (7186) : config : backend 'be-cdn-9000', server
> 'hollywood': unable to use chk-cdn-9000/hollywood for tracking:
> disable-on-404 option inconsistency.
> [ALERT] 105/095234 (7186) : Fatal errors found in configuration.
>
> I also tried it with that option in both backend configurations.  That
> didn't work either.  I don't recall the error, but it was probably the
> same as one of the other errors I had gotten before.

Well, the doc about "track" says this:

track [<proxy>/]<server>
This option enables ability to set the current state of the server by tracking
another one. It is possible to track a server which itself tracks another
server, provided that at the end of the chain, a server has health checks
enabled. If <proxy> is omitted the current one is used. If disable-on-404 is
used, it has to be enabled on both proxies.

So it might have been a different error that you got, possibly caused by an
incompatibility with something else.
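
Going by that paragraph, a combination like the following ought to be
accepted, with "option httpchk" present in both backends so that neither
disable-on-404 directive is ignored (an untested sketch; addresses and
check path are assumed):

    backend chk-cdn-9000
        option httpchk GET /health
        http-check disable-on-404
        server planet 10.0.0.1:9000 check

    backend be-cdn-9000
        option httpchk GET /health
        http-check disable-on-404
        server planet 10.0.0.1:9000 track chk-cdn-9000/planet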

> This is on 1.5.12, and I can't really blame you if the "standard" mental
> map you keep of the project doesn't include that version.

Not at all! Versions which are considered supported are still watched,
and even though they are updated less often, they still receive fixes if
needed. The latest 1.5 is 1.5.19, so your version is at least affected
by all the issues referenced here:

http://www.haproxy.org/bugs/bugs-1.5.12.html

But none of them seem to involve tracking nor disable-on-404 at first
glance.

> It's got to
> be hard enough keeping that straight for just 1.8 and 1.9-dev!  Maybe
> the error I encountered would be solved by upgrading.  Upgrading is on
> the (really long) todo list.

There are two distinct things:
- getting issues fixed
- upgrading to get new features or support

As long as an issue exists in a supported version, it has to be fixed.
Some issues may be the result of an architectural limitation which will
require an upgrade. But most often it is not the case.

Here I'm afraid we're all wasting a lot of time trying to guess what
you have in your config that causes the problem. It's OK if you cannot
post your config here, but please at least post a smaller one reproducing
the issue so that we can help you.

> I do have a config that works.  I'm no longer tracking another backend,
> but doing the health checks in the load-balancing backend.  The whole
> reason I had migrated server checks to dedicated back ends was because I
> wanted to reduce the number of check requests being sent, and I'm
> sharing the check backends with multiple balancing backends in some
> cases.

That's fine, it's exactly what "track" is made for.

> For the one I've been describing, I don't need to share the
> check backend.

OK but it possibly uncovers another issue in your config.

> I ran into other problems on the application side with how process
> shutdowns work, but resolved those by adding an endpoint into my app
> with the URL path of "/lbdisable" and handling the disable/pause in the
> init script instead of the application.  I can now restart my custom
> application at will without any loss, and without a client even noticing
> there was a problem.

OK.

> As of a little while ago, I have solved all the problems I encountered
> on the road to graceful application restarts except the one where a
> backup server is not promoted to active as soon as the primary servers
> are all down.

This is normally done with the "backup" keyword on server lines.

> I described that issue in a separate message to the
> list.  I do have a workaround to that issue -- I'm no longer using
> "backup" on any server entries for this service.

Then I don't see how it can work for you. It's a bit confusing I'm afraid.

Willy
On 4/17/2018 3:41 AM, Willy Tarreau wrote:
> Here I'm afraid we're all wasting a lot of time trying to guess what
> you have in your config that causes the problem. It's OK if you cannot
> post your config here, but please at least post a smaller one reproducing
> the issue so that we can help you.

I did send a message with the config, but it turns out I accidentally
sent it only to Lukas.  Sorry about that! Here is the redacted version
of my config (the paste has a one-month expiration):

https://apaste.info/xVwg

>> As of a little while ago, I have solved all the problems I encountered
>> on the road to graceful application restarts except the one where a
>> backup server is not promoted to active as soon as the primary servers
>> are all down.
> This is normally done with the "backup" keyword on server lines.
>
>> I described that issue in a separate message to the
>> list.  I do have a workaround to that issue -- I'm no longer using
>> "backup" on any server entries for this service.
> Then I don't see how it can work for you. It's a bit confusing I'm afraid.

Originally, the "hollywood" entry on the be-cdn-9000 backend (which you
can see at the config I linked above) had the backup keyword.  But what
I noticed happening was that when planet went down, it took about ten
additional seconds (no precise timing was done) for hollywood to go
active, and during that time, the client got "no server available"
messages.  Removing the backup keyword caused hollywood to be active
full time and requests to be load balanced to both servers, so when one
server goes down, the other is already active and everything works.

Removing the backup keyword from this particular backend is not a
problem, but I do have other backends where I really do want only one
server to get requests unless that server goes down.  I would like the
highest weight backup server to take over immediately when the primary
is marked down, not wait for another timeout to pass.
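
If the extra delay turns out to be check timing (an assumption on my
part; roughly inter times fall before the primary is marked down),
tightening those parameters would shrink the window, as in this sketch
with illustrative names and values:

    backend be-other
        option httpchk GET /health
        server primary   10.0.0.1:9443 check inter 2s fall 2
        server secondary 10.0.0.2:9443 check inter 2s fall 2 backup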

Thanks,
Shawn
Hello Shawn,



On 17 April 2018 at 15:24, Shawn Heisey <[email protected]> wrote:
>>> I described that issue in a separate message to the
>>> list. I do have a workaround to that issue -- I'm no longer using
>>> "backup" on any server entries for this service.
>>
>> Then I don't see how it can work for you. It's a bit confusing I'm afraid.
>
>
> Originally, the "hollywood" entry on the be-cdn-9000 backend (which you can
> see at the config I linked above) had the backup keyword. But what I
> noticed happening was that when planet went down, it took about ten
> additional seconds (no precise timing was done) for hollywood to go active,
> and during that time, the client got "no server available" messages.

You said this is about a 502 error. But "no server available" is not an
error that haproxy emits, and a 503 would be "Service Unavailable".
I cannot reproduce the issue here locally with haproxy 1.5.12.

Can you clarify what error you are seeing exactly and also the
relevant haproxy logs (including server down message).



Lukas
On 4/17/2018 2:54 PM, Lukas Tribus wrote:
>> Originally, the "hollywood" entry on the be-cdn-9000 backend (which you can
>> see at the config I linked above) had the backup keyword. But what I
>> noticed happening was that when planet went down, it took about ten
>> additional seconds (no precise timing was done) for hollywood to go active,
>> and during that time, the client got "no server available" messages.
>
> You said this is about a 502 error. But "no server available" is not an
> error that haproxy emits, and a 503 would be "Service Unavailable".
> I cannot reproduce the issue here locally with haproxy 1.5.12.

I have figured out how to resolve the difficulty I was encountering on
this thread, and I can now do service restarts with zero loss. The
disable-on-404 config, combined with a change to the way the init script
does the restart, has fixed it.

I switched gears, and that part about the 'backup' keyword is for a
different problem, one I already emailed the list about separately and
haven't gotten a response to. Feel free to drop this thread and move to
the other one. Here is the message in the list archive:

https://www.mail-archive.com/[email protected]/msg29615.html

I apologize for any confusion. The "no server available" message is
something I saw in the output from curl when the primary server went
down and the backup had not yet been changed to Active.

I don't have a transcript of that ssh session, and right now I can't
take the steps to reproduce the error -- the servers are being used
heavily for something that can't be postponed, and reproducing the error
would cause some of those requests to fail.

Thanks,
Shawn