upstream - behavior on pool exhaustion

B.R. via nginx
upstream - behavior on pool exhaustion
April 14, 2017 09:50AM
Hello,

Reading the upstream
<https://nginx.org/en/docs/http/ngx_http_upstream_module.html#upstream>
docs: on upstream pool exhaustion, every backend should be tried once, and
then, if all fail, the response should be crafted based on the one from the
last server attempted.
So far so good.

I recently faced a server farm which implements a blunt nightly restart of
every node, without sequencing them, resulting in the possibility of all
nodes being offline at the same time.

However, I collected log entries which did not match what I expected.
For 6 backend nodes, I got:
- log format: $status $body_bytes_sent $request_time $upstream_addr
  $upstream_response_time
- log entry: 502 568 0.001
  $upstream_addr: <IP address 1>:<port>, <IP address 2>:<port>, <IP address
  3>:<port>, <IP address 4>:<port>, <IP address 5>:<port>, <IP address
  6>:<port>, php-fpm
  $upstream_response_time: 0.000, 0.000, 0.000, 0.000, 0.001, 0.000, 0.000
I got 7 entries for $upstream_addr & $upstream_response_time, instead of
the expected 6.
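
For reference, a minimal sketch of the log_format behind those fields (the
format name and log path here are placeholders, not the actual ones):

log_format upstream_dbg '$status $body_bytes_sent $request_time '
                        '$upstream_addr $upstream_response_time';
access_log /var/log/nginx/upstream.log upstream_dbg;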

Here are the interesting parts of the configuration:

upstream php-fpm {
    server <machine 1>:<port> down;
    server <machine 2>:<port> down;
    [...]
    server <machine N-5>:<port>;
    server <machine N-4>:<port>;
    server <machine N-3>:<port>;
    server <machine N-2>:<port>;
    server <machine N-1>:<port>;
    server <machine N>:<port>;
    keepalive 128;
}

server {
    set $fpm_pool "php-fpm$fpm_pool_ID";
    [...]
    location ~ \.php$ {
        [...]
        fastcgi_read_timeout 600;
        fastcgi_keep_conn on;
        fastcgi_index index.php;

        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        [...]
        fastcgi_pass $fpm_pool;
    }
}

The question is:
php-fpm being an upstream group name, how come it was tried as a
domain name in the end?
Stated otherwise, is this because the upstream group is considered 'down',
thus somehow removed from the possibilities, and nginx tried the name one
last time as a domain name to see if something answers?
This 7th request is definitely strange from my point of view. Is it a bug or
a feature?
---
*B. R.*
Ruslan Ermilov
Re: upstream - behavior on pool exhaustion
April 14, 2017 10:30AM
On Fri, Apr 14, 2017 at 09:41:36AM +0200, B.R. via nginx wrote:
> Hello,
> [...]
> The question is:
> php-fpm being an upstream group name, how come it was tried as a
> domain name in the end?
> [...]
> This 7th request is definitely strange from my point of view. Is it a bug
> or a feature?

A feature.

Most $upstream_* variables are vectored ones, and the number of entries
in their values corresponds to the number of tries made to select a peer.
When a peer cannot be selected at all (as in your case), the status is
502 and the name equals the upstream group name.

There could be several reasons why none of the peers can be selected.
For example, some peers are marked "down", and other peers were failing
and are now in the "unavailable" state.

The number of tries is limited by the number of servers in the group,
unless further restricted by proxy_next_upstream_tries. In your case,
since there are two "down" servers, and the other servers are unavailable,
you reach the situation where a peer cannot be selected. If you comment
out the two "down" servers, and try a few requests in a row when all
servers are physically unavailable, the first log entry will list all
of the attempted servers, and then for the next 10 seconds (in the
default config) you'll see only the upstream group name and 502 in
$upstream_status, until the servers become available again (see
max_fails/fail_timeout).
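
For reference, a minimal sketch of the per-server parameters involved (the
address is a placeholder; the values shown are the defaults):

upstream php-fpm {
    # after max_fails failed attempts within fail_timeout, the server is
    # considered unavailable for the duration of fail_timeout
    server <machine>:<port> max_fails=1 fail_timeout=10s;
}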

Hope this makes things a little bit clearer.
B.R. via nginx
Re: upstream - behavior on pool exhaustion
April 15, 2017 04:00AM
Let me be clear here:
I got 6 active servers (not marked down), and the logs show 1 attempt on
each. They all failed for a known reason, and there is no problem there.
Subsequently, the whole pool was 'down' and the response was 502.
Everything perfectly normal so far.

What is unclear is the feature (as you classified it) of having a fake
node named after the pool appearing in the list of tried upstream servers.
It brings confusion more than anything else: a 502 response plus the
list of all tried (and failed) nodes matching the list of active
nodes is more than enough to describe what happened.
The name of the upstream group does not correspond to any real asset; it
is a purely virtual classification. It thus makes no sense at all to me to
have it appear as a 7th 'node' in the list... and how do you interpret
its response time (where a 7th item also appeared in the list)?
Moreover, it is confusing, since proxy_pass handles domain names and one
could believe nginx treated the upstream group name as such.
---
*B. R.*

On Fri, Apr 14, 2017 at 10:21 AM, Ruslan Ermilov <[email protected]> wrote:

> A feature.
>
> Most $upstream_* variables are vectored ones, and the number of entries
> in their values corresponds to the number of tries made to select a peer.
> When a peer cannot be selected at all (as in your case), the status is
> 502 and the name equals the upstream group name.
> [...]
Ruslan Ermilov
Re: upstream - behavior on pool exhaustion
April 17, 2017 06:10PM
On Sat, Apr 15, 2017 at 03:55:20AM +0200, B.R. via nginx wrote:
> [...]
> The name of the upstream group does not correspond to any real asset; it
> is a purely virtual classification. It thus makes no sense at all to me to
> have it appear as a 7th 'node' in the list... and how do you interpret
> its response time (where a 7th item also appeared in the list)?
> [...]

Without the six attempts, if all of the servers are unreachable (either
"down" or "unavailable" because they have failed previously) at the time
the request starts, what do you expect to see in $upstream_*?

--
Ruslan Ermilov
Assume stupidity not malice
B.R. via nginx
Re: upstream - behavior on pool exhaustion
April 18, 2017 02:50PM
'down' should not translate into any kind of attempt, so nothing should
really appear for servers in that static state.
For 'unavailable' servers, the content of the variables should for the
most part be the same.

Starting from the example I provided, here is what I expected to see:
- $upstream_addr: <IP address 1>:<port>, <IP address 2>:<port>, <IP address
3>:<port>, <IP address 4>:<port>, <IP address 5>:<port>, <IP address
6>:<port>
- $upstream_response_time: 0.000, 0.000, 0.000, 0.000, 0.001, 0.000

That, combined with the 502 HTTP response status, is sufficient to
interpret the log entry as: the request failed to find a proper backend
after having attempted communication with the 6 specified active backends.
It is pretty straightforward.
If you want to add something that explicitly states the whole upstream group
is down, this should go to the error log.
At the very least, if the current behavior is kept, the grammar of
the content of the $upstream_* variables should be documented.

Does that not seem reasonable?
---
*B. R.*

On Mon, Apr 17, 2017 at 6:09 PM, Ruslan Ermilov <[email protected]> wrote:

> Without the six attempts, if all of the servers are unreachable (either
> "down" or "unavailable" because they have failed previously) at the time
> the request starts, what do you expect to see in $upstream_*?
Ruslan Ermilov
Re: upstream - behavior on pool exhaustion
April 19, 2017 11:00AM
On Tue, Apr 18, 2017 at 02:39:27PM +0200, B.R. via nginx wrote:
> 'down' should not translate into any kind of attempt, so nothing should
> really appear for servers in that static state.
> For 'unavailable' servers, the content of the variables should for the
> most part be the same.
>
> Starting from the example I provided, here is what I expected to see:
> - $upstream_addr: <IP address 1>:<port>, <IP address 2>:<port>, <IP address
> 3>:<port>, <IP address 4>:<port>, <IP address 5>:<port>, <IP address
> 6>:<port>
> - $upstream_response_time: 0.000, 0.000, 0.000, 0.000, 0.001, 0.000
> [...]

And what about the next request, when all of the servers are either "down"
or "unavailable"?

I'm just trying to give you a simple example of when the $upstream_addr
variable has a special value equal to the upstream group name, and why it
can be useful.

--
Ruslan Ermilov
Assume stupidity not malice
B.R. via nginx
Re: upstream - behavior on pool exhaustion
April 19, 2017 05:30PM
On Wed, Apr 19, 2017 at 10:51 AM, Ruslan Ermilov <[email protected]> wrote:

> And what about the next request, when all of the servers are either "down"
> or "unavailable"?
>

To me, all the unavailable servers have been 'tried', i.e. their state has
been checked, as the docs
<http://nginx.org/en/docs/http/ngx_http_upstream_module.html#upstream> say
every server of an upstream group will be checked.
Thus, they should all appear, along with the time it took for that lookup
to be made (usually 0.000, since no communication occurred and a simple
internal check was probably made for the duration of the 'unavailable'
state).

No special value whatsoever, as none is required, and the grammar of the 2
variables stays stable.
If an explicit message about attempts against completely unavailable
upstream groups should be logged, it belongs in the error log.

---
*B. R.*
Ruslan Ermilov
Re: upstream - behavior on pool exhaustion
April 20, 2017 03:00PM
On Wed, Apr 19, 2017 at 05:26:17PM +0200, B.R. via nginx wrote:
> [...]
> No special value whatsoever, as none is required, and the grammar of the 2
> variables stays stable.
> If an explicit message about attempts against completely unavailable
> upstream groups should be logged, it belongs in the error log.

Given the following config,

upstream u {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
}

log_format u "upstream_addr={$upstream_addr} upstream_status={$upstream_status}";

server {
    access_log logs/u.log u;

    listen 8000;

    location / {
        proxy_pass http://u;
    }
}

when none of the upstream servers are available, the first request will
log

upstream_addr={127.0.0.1:8001, 127.0.0.1:8002, 127.0.0.1:8003} upstream_status={502, 502, 502}

And the next request will log:

upstream_addr={u} upstream_status={502}


If you add a 4th server, "server 127.0.0.1:8004 down;", and start afresh,
then the first request will end up with

upstream_addr={127.0.0.1:8001, 127.0.0.1:8002, 127.0.0.1:8003, u} upstream_status={502, 502, 502, 502}

This is because the number of attempts equals the number of servers by
default.
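
As a side note, a minimal sketch of capping the attempts explicitly (the
value here is purely illustrative):

location / {
    proxy_pass http://u;
    # limit each request to at most 2 attempts at selecting a server
    proxy_next_upstream_tries 2;
}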


Again, this is not a bug but a feature.

The number of values in $upstream_* always corresponds to the number
of attempts made to select a server. The last value can be a special
value with the name of the upstream group when an attempt to select
a peer has failed. This can happen on the 1st attempt, and then the
only value will be the upstream group name.
B.R. via nginx
Re: upstream - behavior on pool exhaustion
April 22, 2017 02:00AM
I do not know if your detailed explanation was aimed at me, or at the list
in general, but as far as I am concerned I got all that already.

To me, when an attempt is made against an upstream group where no peer can
be selected, a 502 should be returned for that request, and, no upstream
having been selected, no $upstream_* variable should contain anything. An
error message should be generated in the error log.
I fail to see the benefit of having the group name treated as an upstream
peer... On the downside, that name might get confused with a domain name,
and the grammar of the $upstream_addr variable is made more complex.
Note that, as stated before, the docs merely say $upstream_addr should
contain IP addresses and ports of peers; there is no mention of the
upstream group name there.

Well, it seems your design decisions are made, and even though I see things
differently, I understand what I did not get before.
Is my view flawed, i.e. is there some benefit of your implementation that I
am failing to see?

Another related question:
I got some cases in which $upstream_response_time was '-' for some peers
(and not a numeric value like 0.000).
What is the meaning of that? A failed connection? I am not logging the
$upstream_status variable, nor $upstream_connect_time, thus I have limited
information.
Could that '-' appear anywhere in the list?
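
For the record, a sketch of an extended log format that would give more
detail next time (the format name is a placeholder; $upstream_connect_time
requires nginx 1.9.1+):

log_format upstream_full '$status $upstream_addr $upstream_status '
                         '$upstream_connect_time $upstream_response_time';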
---
*B. R.*

On Thu, Apr 20, 2017 at 2:58 PM, Ruslan Ermilov <[email protected]> wrote:

> [...]
> The number of values in $upstream_* always corresponds to the number
> of attempts made to select a server. The last value can be a special
> value with the name of the upstream group when an attempt to select
> a peer has failed. This can happen on the 1st attempt, and then the
> only value will be the upstream group name.