Welcome! Log In Create A New Profile

Advanced

haproxy 1.8 ssl backend server leads to server session aborts

Posted by Bart Geesink 
Hi,

We have an issue in haproxy > 1.8 on CentOS when using SSL in the server
configuration. Haproxy sometimes logs a http status code "-1" followed
by the termination_state SDxx. This happens every few requests. When
using one backend, the clients don't notice it. When using multiple
backends, this can result in redirecting traffic to the wrong backend
(the proxy inserts a cookie to track which backend is used).

Removing the SSL configuration and using plain http solves the issue, as
does downgrading to version 1.7.

This is our backend configuration:

backend backend_name
option httpchk GET / HTTP/1.0
option forwardfor
rspdel ^Strict-Transport-Security:.*
mode http
balance roundrobin
option httpclose

cookie HTTPSERVERID insert nocache indirect httponly secure

server servername ip.add.rr.ess:portnumber cookie servername check
inter 8000 fall 5 rise 2 maxconn 1024 ssl verify required verifyhost
hostname ca-file /etc/pki/haproxy/cafile.pem

The backend (Apache in this case) does not log any problems.

Any insights would be welcome,

Thanks,

Bart
Hello Christopher,


On 16 January 2018 at 15:01, Bart Geesink <[email protected]> wrote:
> Hi,
>
> We have an issue in haproxy > 1.8 on CentOS when using SSL in the server
> configuration. Haproxy sometimes logs a http status code "-1" followed
> by the termination_state SDxx. This happens every few requests. When
> using one backend, the clients don't notice it. When using multiple
> backends, this can result in redirecting traffic to the wrong backend
> (the proxy inserts a cookie to track which backend is used).
>
> Removing the SSL configuration and using plain http solves the issue, as
> does downgrading to version 1.7.

Also see:
https://discourse.haproxy.org/t/session-state-sdnn-http-status-code-200-when-upgrading-to-1-7/1409/10



cheers,
lukas
Christopher Faulet
Re: haproxy 1.8 ssl backend server leads to server session aborts
January 17, 2018 10:40AM
Le 16/01/2018 à 16:18, Lukas Tribus a écrit :
> Hello Christopher,
>
>
> On 16 January 2018 at 15:01, Bart Geesink <[email protected]> wrote:
>> Hi,
>>
>> We have an issue in haproxy > 1.8 on CentOS when using SSL in the server
>> configuration. Haproxy sometimes logs a http status code "-1" followed
>> by the termination_state SDxx. This happens every few requests. When
>> using one backend, the clients don't notice it. When using multiple
>> backends, this can result in redirecting traffic to the wrong backend
>> (the proxy inserts a cookie to track which backend is used).
>>
>> Removing the SSL configuration and using plain http solves the issue, as
>> does downgrading to version 1.7.
>
> Also see:
> https://discourse.haproxy.org/t/session-state-sdnn-http-status-code-200-when-upgrading-to-1-7/1409/10
>

Hi Lukas,

Thanks, I will check these 2 issues. First of all, I need to reproduce
them to be sure. This one seems to be a bit different because the status
code is "-1".

Bart, when you said your backend does not log any problems, it means
that for a request logged with SD termination state on haproxy, you have
a 200 on Apache side ? And what does the response look like from the
client side ? (truncated / good / error ...)

--
Christopher Faulet
Hi,

On 01/17/2018 10:16 AM, Christopher Faulet wrote:
> Le 16/01/2018 à 16:18, Lukas Tribus a écrit :
>> Hello Christopher,
>>
>>
>> On 16 January 2018 at 15:01, Bart Geesink <[email protected]>
>> wrote:
>>> Hi,
>>>
>>> We have an issue in haproxy > 1.8 on CentOS when using SSL in the server
>>> configuration. Haproxy sometimes logs a http status code "-1" followed
>>> by the termination_state SDxx. This happens every few requests. When
>>> using one backend, the clients don't notice it. When using multiple
>>> backends, this can result in redirecting traffic to the wrong backend
>>> (the proxy inserts a cookie to track which backend is used).
>>>
>>> Removing the SSL configuration and using plain http solves the issue, as
>>> does downgrading to version 1.7.
>>
>> Also see:
>> https://discourse.haproxy.org/t/session-state-sdnn-http-status-code-200-when-upgrading-to-1-7/1409/10
>>
>>
>
> Hi Lukas,
>
> Thanks, I will check these 2 issues. First of all, I need to reproduce
> them to be sure. This one seems to be a bit different because the status
> code is "-1".
>
It also different since it occurs in both 1.8.1 and 1.8.3

> Bart, when you said your backend does not log any problems, it means
> that for a request logged with SD termination state on haproxy, you have
> a 200 on Apache side ? And what does the response look like from the
> client side ? (truncated / good / error ...)
>
Apache logs a 200. The reponse looks fine from the client side. The only
thing that seems to be missing is the cookie inserted by the proxy for
some requests. Other requests seem not to experience this behaviour.
There is no real pattern: I can happen when a valid proxy cookie is
present, but also when a new browser session is started and a proxy
cookie is not yet present. Downgrading to 1.7 helps, as does using a
http backend without ssl.

Regards,

Bart
Hi

On 2018-01-17 11:37, Bart Geesink wrote:
> Hi,
>
> On 01/17/2018 10:16 AM, Christopher Faulet wrote:
>> Le 16/01/2018 à 16:18, Lukas Tribus a écrit :
>>> Hello Christopher,
>>>
>>>
>>> On 16 January 2018 at 15:01, Bart Geesink <[email protected]>
>>> wrote:
>>>> Hi,
>>>>
>>>> We have an issue in haproxy > 1.8 on CentOS when using SSL in the server
>>>> configuration. Haproxy sometimes logs a http status code "-1" followed
>>>> by the termination_state SDxx. This happens every few requests. When
>>>> using one backend, the clients don't notice it. When using multiple
>>>> backends, this can result in redirecting traffic to the wrong backend
>>>> (the proxy inserts a cookie to track which backend is used).
>>>>
>>>> Removing the SSL configuration and using plain http solves the issue, as
>>>> does downgrading to version 1.7.
>>> Also see:
>>> https://discourse.haproxy.org/t/session-state-sdnn-http-status-code-200-when-upgrading-to-1-7/1409/10
>>>
>>>
>> Hi Lukas,
>>
>> Thanks, I will check these 2 issues. First of all, I need to reproduce
>> them to be sure. This one seems to be a bit different because the status
>> code is "-1".
>>
> It also different since it occurs in both 1.8.1 and 1.8.3
>
>> Bart, when you said your backend does not log any problems, it means
>> that for a request logged with SD termination state on haproxy, you have
>> a 200 on Apache side ? And what does the response look like from the
>> client side ? (truncated / good / error ...)
>>
> Apache logs a 200. The reponse looks fine from the client side. The only
> thing that seems to be missing is the cookie inserted by the proxy for
> some requests. Other requests seem not to experience this behaviour.
> There is no real pattern: I can happen when a valid proxy cookie is
> present, but also when a new browser session is started and a proxy
> cookie is not yet present. Downgrading to 1.7 helps, as does using a
> http backend without ssl.
>
> Regards,
>
> Bart
>
I have same issue. It's pretty random as I would say about 60-70%
requests are OK, but rest is failing. I compiled all 1.8 versions and
was able to isolate this a little bit. It's fine up to 1.8.0-dev3 branch
and it's failing since 1.8.0-rc1.
The problem is for SSL connections and after digging in my apps logs it
looks like it's related to reusing keep alived connections between
client and haproxy.
By default I have in config "option http-server-close" present and when
it's there I can see the problem. When this option is removed - problem
is solved.

Regards,
Tomek
Hi Tomek,

On Sat, Feb 03, 2018 at 08:47:35AM +0100, Tomek Gacek wrote:
> I have same issue. It's pretty random as I would say about 60-70% requests
> are OK, but rest is failing. I compiled all 1.8 versions and was able to
> isolate this a little bit. It's fine up to 1.8.0-dev3 branch and it's
> failing since 1.8.0-rc1.
> The problem is for SSL connections and after digging in my apps logs it
> looks like it's related to reusing keep alived connections between client
> and haproxy.
> By default I have in config "option http-server-close" present and when it's
> there I can see the problem. When this option is removed - problem is
> solved.

This is extremely valuable information. I suspect we might have damaged
something in the backend code when inserting the mux layer. Given that
the connection and stream are a bit more independant from each other now
due to the mux, it might be possible that depending the sequence of some
events (eg: connection close), it changes how the close event is
interpreted in the stream. This will definitely help us narrow down the
cause of the issue.

Thanks!
Willy
Hi Willy

On 2018-02-03 10:05, Willy Tarreau wrote:
> Hi Tomek,
>
> On Sat, Feb 03, 2018 at 08:47:35AM +0100, Tomek Gacek wrote:
>> I have same issue. It's pretty random as I would say about 60-70% requests
>> are OK, but rest is failing. I compiled all 1.8 versions and was able to
>> isolate this a little bit. It's fine up to 1.8.0-dev3 branch and it's
>> failing since 1.8.0-rc1.
>> The problem is for SSL connections and after digging in my apps logs it
>> looks like it's related to reusing keep alived connections between client
>> and haproxy.
>> By default I have in config "option http-server-close" present and when it's
>> there I can see the problem. When this option is removed - problem is
>> solved.
> This is extremely valuable information. I suspect we might have damaged
> something in the backend code when inserting the mux layer. Given that
> the connection and stream are a bit more independant from each other now
> due to the mux, it might be possible that depending the sequence of some
> events (eg: connection close), it changes how the close event is
> interpreted in the stream. This will definitely help us narrow down the
> cause of the issue.
>

My tests show the problem is caused by this commit
http://git.haproxy.org/?p=haproxy-1.8.git;a=commitdiff;h=c2aae74f010f97a3415542fe649198a5d3be1ea8
This is first snapshot were I was able to recreate problem.

Unable to recreate in previous commit -
http://git.haproxy.org/?p=haproxy-1.8.git;a=commit;h=253c62b257c137e7da5c273f42bc5d6eacd31d2c


Regards,
Tomek
Hi everyone,

I've narrowed down my problem down to the same commit as Tomek Gacek -
c2aae74f010f97a3415542fe649198a5d3be1ea8 (MEDIUM: ssl: Handle early data
with OpenSSL 1.1.1), so I guess it may be related. In my case, since
upgrade to 1.8, some responses from some backends (not sure what exactly
triggers the bug) do not have their headers modified (despite
http-response add-header and http-response del-header being set).

Applying patch part-by-part, I got to a point where it seems that that
was caused by changes to ssl_sock_to_buf function in src/ssl_sock.c
(lines 396-431):
https://gist.github.com/mkwm/13dd32fe2b5ec21182f8a06a304228df#file-break-patch-L396-L431

Code at out_error label behave a bit differently from part removed in
this commit - namely, it sets conn->flags |= CO_FL_ERROR
unconditionally, while previously there was an additional check
(skipping error flag setting if errno was set to EAGAIN). My problems
went straight away when I've changed out_error to match old code.


There is also another issue with this commit - it seems that one "1" got
lost in OPENSSL_VERSION_NUMBER comparison (line 267):
https://gist.github.com/mkwm/13dd32fe2b5ec21182f8a06a304228df#file-break-patch-L267

Throughout this commit all additions of similar ifdefs use 0x10101000L,
which translates to OpenSSL 1.1.1 - and this one oddly translates to
version 0.1.1.


Hope this helps!

Best regards
Mateusz Malek
Hi Mateusz,

I'm CCing Emeric (SSL maintainer) and Olivier (who added early-data support),
and responding to some points below.

On Sat, Feb 10, 2018 at 06:26:42PM +0100, Mateusz Malek wrote:
> Hi everyone,
>
> I've narrowed down my problem down to the same commit as Tomek Gacek -
> c2aae74f010f97a3415542fe649198a5d3be1ea8 (MEDIUM: ssl: Handle early data
> with OpenSSL 1.1.1), so I guess it may be related. In my case, since upgrade
> to 1.8, some responses from some backends (not sure what exactly triggers
> the bug) do not have their headers modified (despite http-response
> add-header and http-response del-header being set).
>
> Applying patch part-by-part, I got to a point where it seems that that was
> caused by changes to ssl_sock_to_buf function in src/ssl_sock.c (lines
> 396-431):
> https://gist.github.com/mkwm/13dd32fe2b5ec21182f8a06a304228df#file-break-patch-L396-L431
>
> Code at out_error label behave a bit differently from part removed in this
> commit - namely, it sets conn->flags |= CO_FL_ERROR unconditionally, while
> previously there was an additional check (skipping error flag setting if
> errno was set to EAGAIN). My problems went straight away when I've changed
> out_error to match old code.

Interesting, it would be possible that sometimes an SSL error remains on
the stack, and that it randomly gets trapped here. It used to happen in
the past and we went through a great pain trying to always clean all of
them.

I have no idea why this specific condition was removed. It's not mentionned
at all in Olivier's commit message and suspiciously looks like an accident.
Olivier, do you have an idea on this one ? It's commit c2aae74f. It may be
related to an issue Emeric was chasing recently where some server-side
connections would occasionally fail if my memory serves me right.

> There is also another issue with this commit - it seems that one "1" got
> lost in OPENSSL_VERSION_NUMBER comparison (line 267):
> https://gist.github.com/mkwm/13dd32fe2b5ec21182f8a06a304228df#file-break-patch-L267
>
> Throughout this commit all additions of similar ifdefs use 0x10101000L,
> which translates to OpenSSL 1.1.1 - and this one oddly translates to version
> 0.1.1.

Yes but this one was fixed in 1.8-rc3 by commit cfdef2e3 so it cannot be
this one. But the removed part above definitely is an interesting one.

> Hope this helps!

Sure it does!

Thanks!
Willy
Olivier Houchard
Re: haproxy 1.8 ssl backend server leads to server session aborts
February 13, 2018 03:40PM
Hi guys,

On Sat, Feb 10, 2018 at 06:26:42PM +0100, Mateusz Małek wrote:
> Hi everyone,
>
> I've narrowed down my problem down to the same commit as Tomek Gacek -
> c2aae74f010f97a3415542fe649198a5d3be1ea8 (MEDIUM: ssl: Handle early data
> with OpenSSL 1.1.1), so I guess it may be related. In my case, since upgrade
> to 1.8, some responses from some backends (not sure what exactly triggers
> the bug) do not have their headers modified (despite http-response
> add-header and http-response del-header being set).
>
> Applying patch part-by-part, I got to a point where it seems that that was
> caused by changes to ssl_sock_to_buf function in src/ssl_sock.c (lines
> 396-431):
> https://gist.github.com/mkwm/13dd32fe2b5ec21182f8a06a304228df#file-break-patch-L396-L431
>
> Code at out_error label behave a bit differently from part removed in this
> commit - namely, it sets conn->flags |= CO_FL_ERROR unconditionally, while
> previously there was an additional check (skipping error flag setting if
> errno was set to EAGAIN). My problems went straight away when I've changed
> out_error to match old code.
>

Thanks a lot for the detailed analyze, and sorry for the late answer.
You're probably right, SSL_ERROR_SYSCALL shouldn't be treated as an
unrecoverable error.
So, what you basically did was something equivalent to the patch attached ?

Thanks a lot !

Olivier
From b423f94273be2c7040ce0861bd4a21617b4c5c2b Mon Sep 17 00:00:00 2001
From: Olivier Houchard <[email protected]>
Date: Tue, 13 Feb 2018 15:17:23 +0100
Subject: [PATCH] MINOR/BUG: ssl: Don't always treat SSL_ERROR_SYSCALL as
unrecovarable.

SSL_Raad() might return <= 0, and SSL_get_erro() return SSL_ERROR_SYSCALL,
without meaning the connection is gone. Before flagging the conection
as in error, check the errno value.

This should be backported to 1.8.
---
src/ssl_sock.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/ssl_sock.c b/src/ssl_sock.c
index aee3cd965..687133b0d 100644
--- a/src/ssl_sock.c
+++ b/src/ssl_sock.c
@@ -5452,7 +5452,9 @@ static int ssl_sock_to_buf(struct connection *conn, struct buffer *buf, int coun
ssl_sock_dump_errors(conn);
ERR_clear_error();

- conn->flags |= CO_FL_ERROR;
+ if ((ret != SSL_ERROR_SYSCALL) ||
+ (errno && errno != EAGAIN))
+ conn->flags |= CO_FL_ERROR;
goto leave;
}

--
2.14.3
Hi,

On 13.02.2018 15:27, Olivier Houchard wrote:
> Thanks a lot for the detailed analyze, and sorry for the late answer.
> You're probably right, SSL_ERROR_SYSCALL shouldn't be treated as an
> unrecoverable error.
> So, what you basically did was something equivalent to the patch attached ?

Yeah, it looked exactly like this ;) I'm glad I could help.

Best regards
Mateusz Małek
On Tue, Feb 13, 2018 at 05:29:21PM +0100, Mateusz Malek wrote:
> Hi,
>
> On 13.02.2018 15:27, Olivier Houchard wrote:
> > Thanks a lot for the detailed analyze, and sorry for the late answer.
> > You're probably right, SSL_ERROR_SYSCALL shouldn't be treated as an
> > unrecoverable error.
> > So, what you basically did was something equivalent to the patch attached ?
>
> Yeah, it looked exactly like this ;) I'm glad I could help.

So does this mean I should take the patch as is ? Please just let me know.

Thanks,
Willy
Emmanuel Hocdet
Re: haproxy 1.8 ssl backend server leads to server session aborts
February 13, 2018 05:50PM
Hi Olivier

> Le 13 févr. 2018 à 15:27, Olivier Houchard <[email protected]> a écrit :
>
> Thanks a lot for the detailed analyze, and sorry for the late answer.
> You're probably right, SSL_ERROR_SYSCALL shouldn't be treated as an
> unrecoverable error.
> So, what you basically did was something equivalent to the patch attached ?
>
> Thanks a lot !
>
> Olivier
> <0001-MINOR-BUG-ssl-Don-t-always-treat-SSL_ERROR_SYSCALL-a.patch>


'ret' can be unassigned
errno must be check earlier.

++
Manu
Olivier Houchard
Re: haproxy 1.8 ssl backend server leads to server session aborts
February 13, 2018 06:20PM
Hi Emmanuel,

On Tue, Feb 13, 2018 at 05:40:00PM +0100, Emmanuel Hocdet wrote:
> Hi Olivier
>
> > Le 13 févr. 2018 à 15:27, Olivier Houchard <[email protected]> a écrit :
> >
> > Thanks a lot for the detailed analyze, and sorry for the late answer.
> > You're probably right, SSL_ERROR_SYSCALL shouldn't be treated as an
> > unrecoverable error.
> > So, what you basically did was something equivalent to the patch attached ?
> >
> > Thanks a lot !
> >
> > Olivier
> > <0001-MINOR-BUG-ssl-Don-t-always-treat-SSL_ERROR_SYSCALL-a.patch>
>
>
> 'ret' can be unassigned
> errno must be check earlier.

I (mostly) disagree.
There are 3 goto out_error, and on the only one that mattered, ret and
errno should be fine.
However, I rewrote the patch to make it clearer what the intent is.
Do you like it better ?

Cheers,

Olivier
From 8de2963753a33ca948beebcdd70c5f46bd01e4cf Mon Sep 17 00:00:00 2001
From: Olivier Houchard <[email protected]>
Date: Tue, 13 Feb 2018 15:17:23 +0100
Subject: [PATCH] MINOR/BUG: ssl: Don't always treat SSL_ERROR_SYSCALL as
unrecovarable.

SSL_Raad() might return <= 0, and SSL_get_erro() return SSL_ERROR_SYSCALL,
without meaning the connection is gone. Before flagging the conection
as in error, check the errno value.

This should be backported to 1.8.
---
src/ssl_sock.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/ssl_sock.c b/src/ssl_sock.c
index aee3cd965..83e9b86e4 100644
--- a/src/ssl_sock.c
+++ b/src/ssl_sock.c
@@ -5313,6 +5313,7 @@ reneg_ok:
static int ssl_sock_to_buf(struct connection *conn, struct buffer *buf, int count)
{
int ret, done = 0;
+ int set_error_flag = 1;
int try;

conn_refresh_polling_flags(conn);
@@ -5434,6 +5435,13 @@ static int ssl_sock_to_buf(struct connection *conn, struct buffer *buf, int coun
break;
} else if (ret == SSL_ERROR_ZERO_RETURN)
goto read0;
+ /* For SSL_ERROR_SYSCALL, make sure the error is
+ * unrecoverable before flagging the connection as
+ * in error.
+ */
+ if (ret == SSL_ERROR_SYSCALL || (!errno || errno ==
+ EAGAIN))
+ set_error_flag = 0;
/* otherwise it's a real error */
goto out_error;
}
@@ -5452,7 +5460,8 @@ static int ssl_sock_to_buf(struct connection *conn, struct buffer *buf, int coun
ssl_sock_dump_errors(conn);
ERR_clear_error();

- conn->flags |= CO_FL_ERROR;
+ if (set_error_flag)
+ conn->flags |= CO_FL_ERROR;
goto leave;
}

--
2.14.3
Hi Olivier,

On Tue, Feb 13, 2018 at 06:07:36PM +0100, Olivier Houchard wrote:
> Hi Emmanuel,
>
> On Tue, Feb 13, 2018 at 05:40:00PM +0100, Emmanuel Hocdet wrote:
> > Hi Olivier
> >
> > > Le 13 févr. 2018 à 15:27, Olivier Houchard <[email protected]> a écrit :
> > >
> > > Thanks a lot for the detailed analyze, and sorry for the late answer.
> > > You're probably right, SSL_ERROR_SYSCALL shouldn't be treated as an
> > > unrecoverable error.
> > > So, what you basically did was something equivalent to the patch attached ?
> > >
> > > Thanks a lot !
> > >
> > > Olivier
> > > <0001-MINOR-BUG-ssl-Don-t-always-treat-SSL_ERROR_SYSCALL-a.patch>
> >
> >
> > 'ret' can be unassigned
> > errno must be check earlier.
>
> I (mostly) disagree.
> There are 3 goto out_error, and on the only one that mattered, ret and
> errno should be fine.
> However, I rewrote the patch to make it clearer what the intent is.
> Do you like it better ?
>
> Cheers,
>
> Olivier

> >From 8de2963753a33ca948beebcdd70c5f46bd01e4cf Mon Sep 17 00:00:00 2001
> From: Olivier Houchard <[email protected]>
> Date: Tue, 13 Feb 2018 15:17:23 +0100
> Subject: [PATCH] MINOR/BUG: ssl: Don't always treat SSL_ERROR_SYSCALL as
> unrecovarable.
>
> SSL_Raad() might return <= 0, and SSL_get_erro() return SSL_ERROR_SYSCALL,
> without meaning the connection is gone. Before flagging the conection
> as in error, check the errno value.
>
> This should be backported to 1.8.
> ---
> src/ssl_sock.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/src/ssl_sock.c b/src/ssl_sock.c
> index aee3cd965..83e9b86e4 100644
> --- a/src/ssl_sock.c
> +++ b/src/ssl_sock.c
> @@ -5313,6 +5313,7 @@ reneg_ok:
> static int ssl_sock_to_buf(struct connection *conn, struct buffer *buf, int count)
> {
> int ret, done = 0;
> + int set_error_flag = 1;
> int try;
>
> conn_refresh_polling_flags(conn);
> @@ -5434,6 +5435,13 @@ static int ssl_sock_to_buf(struct connection *conn, struct buffer *buf, int coun
> break;
> } else if (ret == SSL_ERROR_ZERO_RETURN)
> goto read0;
> + /* For SSL_ERROR_SYSCALL, make sure the error is
> + * unrecoverable before flagging the connection as
> + * in error.
> + */
> + if (ret == SSL_ERROR_SYSCALL || (!errno || errno ==
> + EAGAIN))
> + set_error_flag = 0;
> /* otherwise it's a real error */
> goto out_error;
> }
> @@ -5452,7 +5460,8 @@ static int ssl_sock_to_buf(struct connection *conn, struct buffer *buf, int coun
> ssl_sock_dump_errors(conn);
> ERR_clear_error();
>
> - conn->flags |= CO_FL_ERROR;
> + if (set_error_flag)
> + conn->flags |= CO_FL_ERROR;
> goto leave;
> }

Such type of construct tends to scare me (probably because I'm not reading
the whole code). It means we're supposed to set an error by default unless
we pass by a specific path. I fear that we'll get future issues (possibly
as the result of further fixes for unrelated stuff).

Also I'm really not fond of the fact that the out_error label isn't used
anymore to set the error, but sometimes to set it, sometimes to clear it.

Why instead not handle it almost similarly to the way it was before the
patch that removed the code ? I'm seeing that in the past we used to only
differenciate between read0 and error. But if we're failing on SYSCALL
with ret previously set to zero, an EOF was observed, so that's akin to
a read0 (this information is lost in the recent code since ret is
overwritten). May I suggest that we change all this block to a switch/case
instead ? It would more or less give something like this :

if (ret > 0) {
blah;
}
else switch (SSL_get_error(ctx, ret)) {
case SSL_ERROR_WANT_WRITE: blah; break;
case SSL_ERROR_WANT_READ: blah; break;
case SSL_ERROR_ZERO_RETURN: goto read0;
case SSL_ERROR_SYSCALL:
if (!ret)
goto read0;
else if (errno && errno != EAGAIN)
goto out_error;
else /* valid cases, connection establishment, etc */
goto leave;
default:
goto out_error;
}

It's just a suggestion, it can be done by storing the result of
SSL_get_error() anywhere else, but definitely we need to make a clear
distinction between all these cases and for now the distinction between
the read0, completed connection and other errors is lost.

I'm also seeing that the whole block could be rearranged so that ret <= 0
is handled first. That would remove some gotos and would leave the main
part after all error handling.

BTW this makes me realize that your inverted condition above seems wrong
(|| instead of &&).

Cheers,
Willy
Olivier Houchard
Re: haproxy 1.8 ssl backend server leads to server session aborts
February 14, 2018 05:40PM
Hi Willy,

On Tue, Feb 13, 2018 at 08:05:44PM +0100, Willy Tarreau wrote:
> Hi Olivier,
> Such type of construct tends to scare me (probably because I'm not reading
> the whole code). It means we're supposed to set an error by default unless
> we pass by a specific path. I fear that we'll get future issues (possibly
> as the result of further fixes for unrelated stuff).
>
> Also I'm really not fond of the fact that the out_error label isn't used
> anymore to set the error, but sometimes to set it, sometimes to clear it.
>

What about what's attached, instead ?

> Why instead not handle it almost similarly to the way it was before the
> patch that removed the code ? I'm seeing that in the past we used to only
> differenciate between read0 and error. But if we're failing on SYSCALL
> with ret previously set to zero, an EOF was observed, so that's akin to
> a read0 (this information is lost in the recent code since ret is
> overwritten). May I suggest that we change all this block to a switch/case
> instead ? It would more or less give something like this :
>
> if (ret > 0) {
> blah;
> }
> else switch (SSL_get_error(ctx, ret)) {
> case SSL_ERROR_WANT_WRITE: blah; break;
> case SSL_ERROR_WANT_READ: blah; break;
> case SSL_ERROR_ZERO_RETURN: goto read0;
> case SSL_ERROR_SYSCALL:
> if (!ret)
> goto read0;
> else if (errno && errno != EAGAIN)
> goto out_error;
> else /* valid cases, connection establishment, etc */
> goto leave;
> default:
> goto out_error;
> }
>
> It's just a suggestion, it can be done by storing the result of
> SSL_get_error() anywhere else, but definitely we need to make a clear
> distinction between all these cases and for now the distinction between
> the read0, completed connection and other errors is lost.
>

Because we can't rely on ret == 0 being meaningful. The manpage for
OpenSSL 1.1.1 states :
Old documentation indicated a difference between 0 and -1, and that -1 was retryable. You should instead call SSL_get_error() to find out if it's retryable.
And the switch would lead to more goto, as we need to break from outside the
loop :)

> I'm also seeing that the whole block could be rearranged so that ret <= 0
> is handled first. That would remove some gotos and would leave the main
> part after all error handling.
>

I'm not sure I get that part. I don't mind one way or another, but I don't
understand how it would remove gotos.

> BTW this makes me realize that your inverted condition above seems wrong
> (|| instead of &&).
>

Oops, that is true, those things are too complicated.

Regards,

Olivier
From fca75937f4f2b1927f5c82c137cba292c82dac53 Mon Sep 17 00:00:00 2001
From: Olivier Houchard <[email protected]>
Date: Tue, 13 Feb 2018 15:17:23 +0100
Subject: [PATCH] MINOR/BUG: ssl: Don't always treat SSL_ERROR_SYSCALL as
unrecovarable.

SSL_read() might return <= 0, and SSL_get_erro() return SSL_ERROR_SYSCALL,
without meaning the connection is gone. Before flagging the conection
as in error, check the errno value.

This should be backported to 1.8.
---
src/ssl_sock.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/ssl_sock.c b/src/ssl_sock.c
index aee3cd965..b24db079d 100644
--- a/src/ssl_sock.c
+++ b/src/ssl_sock.c
@@ -5434,6 +5434,13 @@ static int ssl_sock_to_buf(struct connection *conn, struct buffer *buf, int coun
break;
} else if (ret == SSL_ERROR_ZERO_RETURN)
goto read0;
+ /* For SSL_ERROR_SYSCALL, make sure the error is
+ * unrecoverable before flagging the connection as
+ * in error.
+ */
+ if (ret == SSL_ERROR_SYSCALL && (!errno || errno ==
+ EAGAIN))
+ goto clear_ssl_error;
/* otherwise it's a real error */
goto out_error;
}
@@ -5448,11 +5455,12 @@ static int ssl_sock_to_buf(struct connection *conn, struct buffer *buf, int coun
conn_sock_read0(conn);
goto leave;
out_error:
+ conn->flags |= CO_FL_ERROR;
+clear_ssl_error:
/* Clear openssl global errors stack */
ssl_sock_dump_errors(conn);
ERR_clear_error();

- conn->flags |= CO_FL_ERROR;
goto leave;
}

--
2.14.3
On Wed, Feb 14, 2018 at 05:29:57PM +0100, Olivier Houchard wrote:
> Hi Willy,
>
> On Tue, Feb 13, 2018 at 08:05:44PM +0100, Willy Tarreau wrote:
> > Hi Olivier,
> > Such type of construct tends to scare me (probably because I'm not reading
> > the whole code). It means we're supposed to set an error by default unless
> > we pass by a specific path. I fear that we'll get future issues (possibly
> > as the result of further fixes for unrelated stuff).
> >
> > Also I'm really not fond of the fact that the out_error label isn't used
> > anymore to set the error, but sometimes to set it, sometimes to clear it.
> >
>
> What about what's attached, instead ?

I think it should work. Mateusz, care to give it a try to confirm ?
If OK, I'll merge it.

> > It's just a suggestion, it can be done by storing the result of
> > SSL_get_error() anywhere else, but definitely we need to make a clear
> > distinction between all these cases and for now the distinction between
> > the read0, completed connection and other errors is lost.
> >
>
> Because we can't rely on ret == 0 being meaningful. The manpage for
> OpenSSL 1.1.1 states :
> Old documentation indicated a difference between 0 and -1, and that -1 was
> retryable. You should instead call SSL_get_error() to find out if it's
> retryable.

Interesting, indeed the doc and man page has changed between versions.
My man page explicitly mentions the fact that you need to use
SSL_get_error() to detect SSL_ERROR_SYSCALL, and then act depending
on what <ret> value was passed to it. Let's hope they only changed the
doc and not the semantics...

> And the switch would lead to more goto, as we need to break from outside the
> loop :)

Possibly.

> > I'm also seeing that the whole block could be rearranged so that ret <= 0
> > is handled first. That would remove some gotos and would leave the main
> > part after all error handling.
> >
>
> I'm not sure I get that part. I don't mind one way or another, but I don't
> understand how it would remove gotos.

Note that I have not made the exercise.

Willy
Hi,

On 14.02.2018 17:53, Willy Tarreau wrote:
> On Wed, Feb 14, 2018 at 05:29:57PM +0100, Olivier Houchard wrote:
>> What about what's attached, instead ?
> I think it should work. Mateusz, care to give it a try to confirm ?
> If OK, I'll merge it.

I confirm, with this patch applied problem is gone.

Best regards
Mateusz Małek
On Wed, Feb 14, 2018 at 06:20:42PM +0100, Mateusz Malek wrote:
> Hi,
>
> On 14.02.2018 17:53, Willy Tarreau wrote:
> > On Wed, Feb 14, 2018 at 05:29:57PM +0100, Olivier Houchard wrote:
> > > What about what's attached, instead ?
> > I think it should work. Mateusz, care to give it a try to confirm ?
> > If OK, I'll merge it.
>
> I confirm, with this patch applied problem is gone.

Excellent, thanks for the quick test. Patch now applied.

cheers,
Willy
Christopher Faulet
Re: haproxy 1.8 ssl backend server leads to server session aborts
February 19, 2018 03:30PM
Le 14/02/2018 à 18:53, Willy Tarreau a écrit :
> On Wed, Feb 14, 2018 at 06:20:42PM +0100, Mateusz Malek wrote:
>> Hi,
>>
>> On 14.02.2018 17:53, Willy Tarreau wrote:
>>> On Wed, Feb 14, 2018 at 05:29:57PM +0100, Olivier Houchard wrote:
>>>> What about what's attached, instead ?
>>> I think it should work. Mateusz, care to give it a try to confirm ?
>>> If OK, I'll merge it.
>>
>> I confirm, with this patch applied problem is gone.
>
> Excellent, thanks for the quick test. Patch now applied.
>

Hi,

Someone on discourse reports a problem with this patch:


https://discourse.haproxy.org/t/random-sa-errors-with-haproxy-1-8-3/2116/6

I asked him to test the attached patch. But It could be cool to have
more feedback on the fix. The bug can easily be reproduced by closing a
connection opened with openssl s_client with a control-c.

Thanks
--
Christopher Faulet
Hi Christopher,

On Mon, Feb 19, 2018 at 03:24:09PM +0100, Christopher Faulet wrote:
> Someone on discourse reports a problem with this patch:
>
> https://discourse.haproxy.org/t/random-sa-errors-with-haproxy-1-8-3/2116/6
>
> I asked him to test the attached patch. But It could be cool to have more
> feedback on the fix. The bug can easily be reproduced by closing a
> connection opened with openssl s_client with a control-c.

Hehe so it seems I was not completely dumb when saying that the error
used to compete with read0 in previous version ;-)

It it's not enough, we may decide to fall back to the old method that
was documented till OpenSSL 1.0.2 consisting in checking whether ret=0
was passed to SSL_get_error() or not to decide whether it's an error
or a shutdown. But I think that in the current situation these ones will
be equivalent for our use case anyway.

Now applied, thanks!
Willy
Sorry, only registered users may post in this forum.

Click here to login