upstream timeouts I can not explain

Руслан Закиров
upstream timeouts I can not explain
January 08, 2017 03:10PM
Hello,

nginx 1.10.2 [1] running on FreeBSD 10.3-RELEASE-p7

We have quite a lot of "upstream timed out" errors. For example:

2017/01/08 16:20:54 [error] 82154#0: *494223064 upstream timed out (60:
Operation timed out) while connecting to upstream, client: 192.168.1.43,
server: q0.sports.ru, request: "POST /pub?topic=docs HTTP/1.1", upstream: "
http://192.168.1.206:4151/pub?topic=docs", host: "q0.sports.ru"
Here is the tcpdump output:

16:20:53.218916 IP myself-fe.28957 > q0-2.4151: Flags [S], seq 2479999411,
win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 787079162 ecr 0],
length 0
16:20:53.219721 IP q0-2.4151 > myself-fe.28957: Flags [S.], seq 605231398,
ack 2479999412, win 28960, options [mss 1460,sackOK,TS val 352522447 ecr
787079162,nop,wscale 7], length 0
16:20:53.219742 IP myself-fe.28957 > q0-2.4151: Flags [.], ack 1, win 1040,
options [nop,nop,TS val 787079163 ecr 352522447], length 0
16:20:54.572712 IP myself-fe.28957 > q0-2.4151: Flags [F.], seq 1, ack 1,
win 1040, options [nop,nop,TS val 787080516 ecr 352522447], length 0
16:20:54.573173 IP q0-2.4151 > myself-fe.28957: Flags [F.], seq 1, ack 2,
win 227, options [nop,nop,TS val 352522786 ecr 787080516], length 0
16:20:54.573195 IP myself-fe.28957 > q0-2.4151: Flags [.], ack 2, win 1040,
options [nop,nop,TS val 787080516 ecr 352522786], length 0

What I don't see here is a packet with the HTTP request. I checked the access
log, and only two requests to the .206 upstream finished at "16:20:54". The
other request looked like this:

16:20:54.026151 IP myself-fe.29450 > q0-2.4151: Flags [S], seq 347540799,
win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 787079969 ecr 0],
length 0
16:20:54.026502 IP q0-2.4151 > myself-fe.29450: Flags [S.], seq 2176380567,
ack 347540800, win 28960, options [mss 1460,sackOK,TS val 352522649 ecr
787079969,nop,wscale 7], length 0
16:20:54.026526 IP myself-fe.29450 > q0-2.4151: Flags [.], ack 1, win 1040,
options [nop,nop,TS val 787079970 ecr 352522649], length 0
16:20:54.026537 IP myself-fe.29450 > q0-2.4151: Flags [P.], seq 1:205, ack
1, win 1040, options [nop,nop,TS val 787079970 ecr 352522649], length 204
16:20:54.026779 IP q0-2.4151 > myself-fe.29450: Flags [.], ack 205, win
235, options [nop,nop,TS val 352522649 ecr 787079970], length 0
16:20:54.027136 IP q0-2.4151 > myself-fe.29450: Flags [P.], seq 1:119, ack
205, win 235, options [nop,nop,TS val 352522649 ecr 787079970], length 118
16:20:54.027145 IP q0-2.4151 > myself-fe.29450: Flags [F.], seq 119, ack
205, win 235, options [nop,nop,TS val 352522649 ecr 787079970], length 0
16:20:54.027156 IP myself-fe.29450 > q0-2.4151: Flags [.], ack 120, win
1038, options [nop,nop,TS val 787079970 ecr 352522649], length 0
16:20:54.027179 IP myself-fe.29450 > q0-2.4151: Flags [F.], seq 205, ack
120, win 1040, options [nop,nop,TS val 787079970 ecr 352522649], length 0
16:20:54.027354 IP q0-2.4151 > myself-fe.29450: Flags [.], ack 206, win
235, options [nop,nop,TS val 352522649 ecr 787079970], length 0

It has a packet with the HTTP request right after the TCP connection is
established.

Any idea where I should look next? Is it some known issue?

[1] nginx -V

nginx version: nginx/1.10.2
built with OpenSSL 1.0.2j 26 Sep 2016
TLS SNI support enabled
configure arguments: --prefix=/usr/local/etc/nginx --with-cc-opt='-I
/usr/local/include' --with-ld-opt='-L /usr/local/lib'
--conf-path=/usr/local/etc/nginx/nginx.conf
--sbin-path=/usr/local/sbin/nginx --pid-path=/var/run/nginx.pid
--error-log-path=/var/log/nginx/error.log --user=www --group=www
--with-file-aio
--http-client-body-temp-path=/var/tmp/nginx/client_body_temp
--http-fastcgi-temp-path=/var/tmp/nginx/fastcgi_temp
--http-proxy-temp-path=/var/tmp/nginx/proxy_temp
--http-scgi-temp-path=/var/tmp/nginx/scgi_temp
--http-uwsgi-temp-path=/var/tmp/nginx/uwsgi_temp
--http-log-path=/var/log/nginx/access.log --with-http_dav_module
--with-http_geoip_module --with-http_gzip_static_module
--with-http_realip_module --with-http_slice_module
--with-http_stub_status_module --with-http_sub_module --with-pcre
--with-http_v2_module --with-stream --with-stream_ssl_module
--with-http_ssl_module

--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ http://www.sports.ru/
Maxim Dounin
Re: upstream timeouts I can not explain
January 09, 2017 04:20PM
Hello!

On Sun, Jan 08, 2017 at 05:04:47PM +0300, Руслан Закиров wrote:

> Hello,
>
> nginx 1.10.2 [1] running on FreeBSD 10.3-RELEASE-p7
>
> We have quite a lot of "upstream timed out" errors. For example:
>
> 2017/01/08 16:20:54 [error] 82154#0: *494223064 upstream timed out (60:
> Operation timed out) while connecting to upstream, client: 192.168.1.43,
> server: q0.sports.ru, request: "POST /pub?topic=docs HTTP/1.1", upstream: "
> http://192.168.1.206:4151/pub?topic=docs", host: "q0.sports.ru"
> [...]
>
> What I don't see here is a packet with HTTP request.

The "upstream timeout ... while connecting to upstream" suggests
that nginx wasn't able to see the connect event.

[...]

> Any idea where I should look next?

Some things to consider:

- Make sure you are looking at tcpdump on the nginx host, and
there are no firewalls on the host to interfere with.

- Try collecting nginx debug logs, they should have enough
information to see if the event was reported by the kernel or
not.
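
For example, a minimal sketch of turning it on (this assumes an nginx binary
built with --with-debug, which the configure arguments above don't show, so a
rebuild may be needed; the log path is just an example):

  error_log  /var/log/nginx/error-debug.log  debug;

If the full debug log is too noisy, debug logging can be limited to selected
client addresses with the debug_connection directive inside the "events"
block.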

> Is it some known issue?

No.

--
Maxim Dounin
http://nginx.org/
Руслан Закиров
Re: upstream timeouts I can not explain
January 09, 2017 08:00PM
On Mon, Jan 9, 2017 at 6:14 PM, Maxim Dounin <mdounin@mdounin.ru> wrote:

> Hello!
>
> On Sun, Jan 08, 2017 at 05:04:47PM +0300, Руслан Закиров wrote:
>
> > [...]
> >
> > What I don't see here is a packet with HTTP request.
>
> The "upstream timeout ... while connecting to upstream" suggests
> that nginx wasn't able to see the connect event.
>
> [...]
>

OK, I spent the whole day learning and investigating.

> Any idea where I should look next?
>
> Some things to consider:
>
> - Make sure you are looking at tcpdump on the nginx host, and
> there are no firewalls on the host to interfere with.
>

These tcpdumps were taken on the nginx host. I have a dump from the other end
as well, and they are symmetrical. We have proxy_connect_timeout set to 300ms
at the top level of the config. When we first started investigating, we
increased the timeout to 1s for this location. An hour ago I increased it to
5 seconds and finally couldn't reproduce the problem with a simple "bomber"
script.

From the dumps you can see that the connection was established within 10ms.
What can stop nginx from receiving the event for more than a second?

This happens on all served domains, as the connect timeout is 300ms pretty
much everywhere. If I tail -F the error log and count occurrences of this
error grouped by second, I see 1-3 second spikes: silence or <5 errors for
20-40 seconds, then ~200 errors within a few seconds. Is there anything that
may block event processing in nginx for quite a while?
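
For reference, the per-second counts come from something along these lines
(the log path is an example; fields 1 and 2 of the default error_log line are
the date and the time with one-second resolution):

  grep 'upstream timed out' /var/log/nginx/error.log \
      | awk '{ print $1, $2 }' | sort | uniq -c | sort -rn | head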

> - Try collecting nginx debug logs, they should have enough
> information to see if the event was reported by the kernel or
> not.
>

As you can see from the above, it is reported by the kernel, but with quite a
big delay. I'm not sure whether the delay is in the kernel or in nginx; there
is no delay on the network.

> Is it some known issue?
>
> No.



--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ http://www.sports.ru/
Maxim Dounin
Re: upstream timeouts I can not explain
January 09, 2017 09:10PM
Hello!

On Mon, Jan 09, 2017 at 09:56:54PM +0300, Руслан Закиров wrote:

> On Mon, Jan 9, 2017 at 6:14 PM, Maxim Dounin <mdounin@mdounin.ru> wrote:
>
> > Hello!
> >
> > On Sun, Jan 08, 2017 at 05:04:47PM +0300, Руслан Закиров wrote:
> >
> > > [...]
> > > What I don't see here is a packet with HTTP request.
> >
> > The "upstream timeout ... while connecting to upstream" suggests
> > that nginx wasn't able to see the connect event.
> >
> > [...]
> >
>
> Ok, spent whole day learning and investigating.
>
> > Any idea where I should look next?
> >
> > Some things to consider:
> >
> > - Make sure you are looking at tcpdump on the nginx host, and
> > there are no firewalls on the host to interfere with.
> >
>
> These were tcpdumps from nginx host. I have dump from other end and they
> are symmetrical. We have proxy_connect_timeout at 300ms at the top level of
> the config. When we first started to investigate it we increased timeout to
> 1s for
> this location. An hour ago I increased it to 5 seconds and finally couldn't
> reproduce
> the problem with a simple "bomber" script.
>
> From dumps you can see that connection was established within 10ms. What
> can stop nginx from receiving the event for more than a second?
>
> This happens on all served domains, as the connect timeout is 300ms pretty
> much everywhere. If I tail -F the error log and count occurrences of this
> error grouped by second, I see 1-3 second spikes: silence or <5 errors for
> 20-40 seconds, then ~200 errors within a few seconds. Is there anything that
> may block event processing in nginx for quite a while?

Typical kern.sched.quantum is about 100ms, so several
CPU-intensive tasks can delay processing of the events enough to
trigger a timeout if a context switch happens at a bad time.

Note well that various blocking operations in nginx itself -
either disk or CPU-intensive ones - can also delay processing of
various events, and this in turn can trigger unexpected timeouts
when using timers comparable to a typical delay introduced on each
event loop iteration.

--
Maxim Dounin
http://nginx.org/
Руслан Закиров
Re: upstream timeouts I can not explain
January 10, 2017 06:50PM
>
> > > The "upstream timeout ... while connecting to upstream" suggests
> > > that nginx wasn't able to see the connect event.
> > >
> > > [...]
> > >
> > > Some things to consider:
> > >
> > > - Make sure you are looking at tcpdump on the nginx host, and
> > > there are no firewalls on the host to interfere with.
> > >
> >
> > These were tcpdumps from nginx host. I have dump from other end and they
> > are symmetrical. We have proxy_connect_timeout at 300ms at the top level
> of
> > the config. When we first started to investigate it we increased timeout
> to
> > 1s for
> > this location. An hour ago I increased it to 5 seconds and finally
> couldn't
> > reproduce
> > the problem with a simple "bomber" script.
> >
> > From dumps you can see that connection was established within 10ms. What
> > can stop nginx from receiving the event for more than a second?
> >
> > This happens on all served domains as pretty much everywhere connect
> > timeout is 300ms. If I tail -F error log, count this error occurrences
> > grouped by second then I see 1-3 seconds spikes: silence or <5 errors for
> > 20-40 seconds then ~200 errors in a few seconds. Is there anything that
> may
> > block events processing nginx for quite a while?
>
> Typical kern.sched.quantum is about 100ms, so several
> CPU-intensive tasks can delay processing of the events enough to
> trigger a timeout if a context switch happens at a bad time.
>
> Note well that various blocking operations in nginx itself -
> either disk or CPU-intensive ones - can also delay processing of
> various events, and this in turn can trigger unexpected timeouts
> when using timers comparable to a typical delay introduced on each
> event loop iteration.


We tuned the upstreams' parameters to avoid both backend servers being marked
as unavailable during these spikes. This prevents bogus errors.

Also, in this particular service I'm experimenting with keepalive
connections between nginx and upstreams.
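
Roughly like this, to be concrete (a sketch only; the upstream name, the
second server and all the numbers are made-up examples rather than our real
values):

  upstream docs_backend {
      server 192.168.1.206:4151 max_fails=0;  # don't mark the peer down on sporadic failures
      server 192.168.1.207:4151 max_fails=0;
      keepalive 16;                           # pool of idle connections to the upstream
  }

  location /pub {
      proxy_pass http://docs_backend;
      proxy_http_version 1.1;
      proxy_set_header Connection "";         # required for upstream keepalive
      proxy_connect_timeout 1s;               # raised from the global 300ms
  }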

The steps above don't solve the root cause. Can you suggest further steps to
localize the issue? I'm not sure how to detect whether it's a blocking
operation in nginx, OS scheduling, or something else.

--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ http://www.sports.ru/
Peter Booth
Re: upstream timeouts I can not explain
January 10, 2017 08:20PM
All hosts have characteristic stalls and blips, but the scale of this issue can vary 100x depending on the configuration. You can get some data about these stalls using Solarflare's sysjitter utility or Gil Tene's jHiccup.

Sent from my iPhone

On Jan 10, 2017, at 12:46 PM, Руслан Закиров <ruz@sports.ru> wrote:

>> > > The "upstream timeout ... while connecting to upstream" suggests
>> > > that nginx wasn't able to see the connect event.
>> > >
>> > > [...]
>> > >
>> > > Some things to consider:
>> > >
>> > > - Make sure you are looking at tcpdump on the nginx host, and
>> > > there are no firewalls on the host to interfere with.
>> > >
>> >
>> > These were tcpdumps from nginx host. I have dump from other end and they
>> > are symmetrical. We have proxy_connect_timeout at 300ms at the top level of
>> > the config. When we first started to investigate it we increased timeout to
>> > 1s for
>> > this location. An hour ago I increased it to 5 seconds and finally couldn't
>> > reproduce
>> > the problem with a simple "bomber" script.
>> >
>> > From dumps you can see that connection was established within 10ms. What
>> > can stop nginx from receiving the event for more than a second?
>> >
>> > This happens on all served domains as pretty much everywhere connect
>> > timeout is 300ms. If I tail -F error log, count this error occurrences
>> > grouped by second then I see 1-3 seconds spikes: silence or <5 errors for
>> > 20-40 seconds then ~200 errors in a few seconds. Is there anything that may
>> > block events processing nginx for quite a while?
>>
>> Typical kern.sched.quantum is about 100ms, so several
>> CPU-intensive tasks can delay processing of the events enough to
>> trigger a timeout if a context switch happens at a bad time.
>>
>> Note well that various blocking operations in nginx itself -
>> either disk or CPU-intensive ones - can also delay processing of
>> various events, and this in turn can trigger unexpected timeouts
>> when using timers comparable to a typical delay introduced on each
>> event loop iteration.
>
> We tuned upstreams' parameters to avoid both backend servers marked as unavailable during these spikes. This prevents bogus errors.
>
> Also, in this particular service I'm experimenting with keepalive connections between nginx and upstreams.
>
> Above steps don't solve root cause. Can you suggest me further steps to localize the issue? I'm not sure how to detect if it's blocking operation in nginx, OS scheduling or something else.
>
> --
> Руслан Закиров
> Руководитель отдела разработки веб-сервисов
> +7(916) 597-92-69, ruz @
> _______________________________________________
> nginx mailing list
> nginx@nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx
_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx
Руслан Закиров
Re: upstream timeouts I can not explain
January 11, 2017 04:10PM
On Mon, Jan 9, 2017 at 10:59 PM, Maxim Dounin <mdounin@mdounin.ru> wrote:

> Typical kern.sched.quantum is about 100ms, so several
> CPU-intensive tasks can delay processing of the events enough to
> trigger a timeout if a context switch happens at a bad time.
>

Here is what I see in the truss output:

38.820523207 0.000006568 kevent(28,{ },0,{
198,EVFILT_WRITE,EV_CLEAR,0x0,0x8218,0x821405071 },512,{ 6.215000000 }) = 1
(0x1)
39.783094188 0.000022875 kevent(28,{ },0,{
52,EVFILT_READ,0x0,0x0,0x30b,0x81f800068
204,EVFILT_WRITE,EV_CLEAR,0x0,0x8218,0x821401588
51,EVFILT_READ,0x0,0x0,0xec,0x81f800000
68,EVFILT_READ,EV_CLEAR,0x0,0x8bf,0x81f816580
7,EVFILT_READ,EV_CLEAR,0x0,0x27f,0x81f813869
57,EVFILT_READ,EV_CLEAR,0x0,0x767,0x81f817bd8
203,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0x248,0x81f8030c1
181,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0x9b77,0x81f80ea68
178,EVFILT_READ,EV_CLEAR,0x0,0x39d,0x81f8010a9
198,EVFILT_READ,EV_CLEAR,0x0,0x3d3,0x81f805071
204,EVFILT_READ,EV_CLEAR,0x0,0x9da,0x81f801588
190,EVFILT_READ,EV_CLEAR,0x0,0x4ff,0x81f80fc48
154,EVFILT_READ,EV_CLEAR,0x0,0x88e,0x81f8130b1
151,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0xc1db,0x81f814290
157,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0xe841,0x81f80c029
195,EVFILT_READ,EV_CLEAR,0x0,0x952,0x81f8090a1
194,EVFILT_READ,EV_CLEAR,0x0,0x929,0x81f809ac8
201,EVFILT_READ,EV_CLEAR,0x0,0x4ef,0x81f80c980
174,EVFILT_READ,EV_CLEAR,0x0,0x51e,0x81f816518
77,EVFILT_READ,EV_CLEAR|EV_EOF,0x0,0x1168,0x81f811c61 },512,{ 5.253000000
}) = 20 (0x14)


There is a 1-second delay between the two syscalls. Then nginx goes nuts
processing all it missed during this second. I can not tell from this output
how much time was spent in these syscalls. Can anyone?
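
(For reference, output in this two-column format should come from truss run
with relative timestamps and deltas, roughly:

  truss -d -D -o truss.out -p <worker pid>

where -d prefixes each line with the time since the trace started and -D with
the time since the previous line, so neither column shows how long a single
syscall itself took.)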

What I don't like is the timeout greater than 5 seconds. Doesn't it mean that
the system is allowed to block for the whole timeout to collect events?


--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ http://www.sports.ru/
Maxim Dounin
Re: upstream timeouts I can not explain
January 11, 2017 04:50PM
Hello!

On Wed, Jan 11, 2017 at 06:07:01PM +0300, Руслан Закиров wrote:

> On Mon, Jan 9, 2017 at 10:59 PM, Maxim Dounin <mdounin@mdounin.ru> wrote:
>
> > Typical kern.sched.quantum is about 100ms, so several
> > CPU-intensive tasks can delay processing of the events enough to
> > trigger a timeout if a context switch happens at a bad time.
> >
>
> Here what I see in truss' output:
>
> [...]
>
> 1 second delay between two syscalls. Then nginx goes nuts processing all it
> missed during this second. I can not tell from this output how much time
> was spent in these syscalls. Can anyone?

Using ktrace / kdump might be a better option; it shows both
syscall entry and syscall return with exact timestamps.
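
Something along these lines ("-t c" limits tracing to syscalls, kdump -T
prints absolute timestamps; the paths are just an example):

  ktrace -f /tmp/nginx.ktrace -t c -p <worker pid>
  # ... wait for a spike to show up in the error log ...
  ktrace -C
  kdump -T -f /tmp/nginx.ktrace | less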

> What I don't like is timeout greater than 5 seconds. Doesn't it mean that
> system is allowed to block for timeout time to collect events?

The timeout passed to kevent() is the nearest timer set in nginx.
That is, nginx has nothing to do until the time specified, and
allows the kernel to block for that long if there are no events. The
kernel is expected to return events as soon as it has any, or to
return 0 if the specified time limit expires.

--
Maxim Dounin
http://nginx.org/
Руслан Закиров
Re: upstream timeouts I can not explain
January 11, 2017 07:00PM
On Wed, Jan 11, 2017 at 6:44 PM, Maxim Dounin <mdounin@mdounin.ru> wrote:

> Hello!
>
> On Wed, Jan 11, 2017 at 06:07:01PM +0300, Руслан Закиров wrote:
>
> > On Mon, Jan 9, 2017 at 10:59 PM, Maxim Dounin <mdounin@mdounin.ru>
> wrote:
> >
> > > Typical kern.sched.quantum is about 100ms, so several
> > > CPU-intensive tasks can delay processing of the events enough to
> > > trigger a timeout if a context switch happens at a bad time.
> > >
> >
> > Here what I see in truss' output:
> >
> > [...]
> >
> > 1 second delay between two syscalls. Then nginx goes nuts processing all
> it
> > missed during this second. I can not tell from this output how much time
> > was spent in these syscalls. Can anyone?
>
> Using ktrace / kdump might be a better option, it shows both
> syscall enter and syscall return with exact timestamps.
>

I tried it and found at least one issue, with sendfile blocking:

66193 nginx 1484152162.182601 CALL
openat(AT_FDCWD,0x80639aac4,0x4<O_NONBLOCK>,<unused>0)
66193 nginx 1484152162.182607 NAMI ...
66193 nginx 1484152162.182643 RET openat 40/0x28
.... gz lookup and stats ...
66193 nginx 1484152162.182683 CALL
setsockopt(0x211,0x6,0x4,0x7fffffffd96c,0x4)
66193 nginx 1484152162.182687 RET setsockopt 0
66193 nginx 1484152162.182689 CALL
sendfile(0x28,0x211,0,0x806f,0x7fffffffe1d8,0x7fffffffe240,<invalid=0>)
66193 nginx 1484152163.541770 RET sendfile 0

Sendfile blocks for 1.3 seconds. However, the machine is:

$ sysctl hw.model hw.machine hw.ncpu
hw.model: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
hw.machine: amd64
hw.ncpu: 16

So sfbufs don't apply here, but the documentation doesn't say which resource
does apply... or I couldn't find the right doc.

> What I don't like is timeout greater than 5 seconds. Doesn't it mean that
> > system is allowed to block for timeout time to collect events?
>
> Timeout as passed to kevent() is the nearest timer set in nginx.
> That is, nginx has nothing to do till the time specified, and
> allows kernel to block that long if there are no events. Kernel
> is expected to return events as soon as it has any, or return 0 if
> the specified time limit expires.
>

This much I had figured out from reading the nginx source code.

--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ http://www.sports.ru/
Maxim Dounin
Re: upstream timeouts I can not explain
January 11, 2017 07:50PM
Hello!

On Wed, Jan 11, 2017 at 08:54:31PM +0300, Руслан Закиров wrote:

> On Wed, Jan 11, 2017 at 6:44 PM, Maxim Dounin <mdounin@mdounin.ru> wrote:
>
> [...]
>
> Tried. Found at least one issue with sendfile blocking:
>
> 66193 nginx 1484152162.182601 CALL
> openat(AT_FDCWD,0x80639aac4,0x4<O_NONBLOCK>,<unused>0)
> 66193 nginx 1484152162.182607 NAMI ...
> 66193 nginx 1484152162.182643 RET openat 40/0x28
> ... gz lookup and stats ...
> 66193 nginx 1484152162.182683 CALL
> setsockopt(0x211,0x6,0x4,0x7fffffffd96c,0x4)
> 66193 nginx 1484152162.182687 RET setsockopt 0
> 66193 nginx 1484152162.182689 CALL
> sendfile(0x28,0x211,0,0x806f,0x7fffffffe1d8,0x7fffffffe240,<invalid=0>)
> 66193 nginx 1484152163.541770 RET sendfile 0
>
> Sendfile blocks for 1.3 seconds. However, it's:
>
> $ sysctl hw.model hw.machine hw.ncpu
> hw.model: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> hw.machine: amd64
> hw.ncpu: 16
>
> So sfbufs don't apply here, but documentation doesn't tell what resource
> applies... or I couldn't find correct doc.

On amd64 sendfile() uses mbufs / mbuf clusters. Try looking into
"vmstat -z" to see if there are enough mbuf clusters in various
zones. Note well that sendfile() normally blocks on disk, and it
might simply mean that your disk subsystem is (occasionally)
overloaded. Try gstat to see some details.
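
E.g., as a starting point:

  vmstat -z | grep -i mbuf   # FAIL / SLEEP columns point at allocation problems
  netstat -m                 # mbuf usage summary, including denied requests
  gstat -a                   # per-disk load; %busy near 100 means the disk is saturated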

Either way, blocking for 1.3s in a syscall perfectly explains why
300ms timeouts are sometimes triggered unexpectedly. If a timer
is set after such a blocking call and just before kevent() (that
is, the connection is highly unlikely to be established before
kevent() returns), then on the next event loop iteration nginx will
think that 1.3+ seconds have passed since the timer was set, and
will trigger a timeout (after processing all the events reported
by kevent()).

--
Maxim Dounin
http://nginx.org/
Руслан Закиров
Re: upstream timeouts I can not explain
January 11, 2017 08:30PM
On Jan 11, 2017, at 21:40, "Maxim Dounin" <mdounin@mdounin.ru> wrote:

[...]

On amd64 sendfile() uses mbufs / mbuf clusters. Try looking into
"vmstat -z" to see if there are enough mbuf clusters in various
zones. Note well that sendfile() normally blocks on disk, and it
might simply mean that your disk subsystem is (occasionally)
overloaded. Try gstat to see some details.


It goes red periodically, so I should look further there.

At this moment I wonder if turning off sendfile will help. Will nginx use
aio to serve static files from disk without blocking?
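
Something like this is what I have in mind (an untested sketch; as far as I
understand, aio in nginx on FreeBSD 10 needs the aio(4) kernel module to be
loaded):

  location /static/ {
      sendfile        off;       # avoid blocking in sendfile() on disk I/O
      aio             on;        # read files asynchronously instead
      output_buffers  2 512k;    # example sizing
  }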

Also, the sendfile syscall has two flags on BSD to avoid blocking. Can these be
used to avoid the blocking?

Руслан Закиров
Re: upstream timeouts I can not explain
January 11, 2017 10:10PM
On Wed, Jan 11, 2017 at 10:25 PM, Руслан Закиров <ruz@sports.ru> wrote:

>
> On amd64 sendfile() uses mbufs / mbuf clusters. Try looking into
> "vmstat -z" to see if there are enough mbuf clusters in various
> zones. Note well that sendfile() normally blocks on disk, and it
> might simply mean that your disk subsystem is (occasionally)
> overloaded. Try gstat to see some details.
>
>
> It goes red periodically, so I should look further there.
>

It was the disk subsystem. atime updates were enabled (we needed them for one
thing), kern.filedelay is 30s, and top -mio -S showed the syncer jumping up
every 30 seconds. Re-mounting the disk with noatime dropped the number of
errors down to <10 per minute from 200+.
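
For the record, the change itself was just (the mount point here is an
example, not our real one):

  mount -u -o noatime /var/www
  # plus "noatime" added to the options field of the matching /etc/fstab line

and the syncer activity can be watched with:

  sysctl kern.filedelay
  top -mio -S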

> At this moment I wonder if turning off sendfile will help. Will nginx use
> aio to serve static files from disk without blocking?
>

Still an open question.

> Also, the sendfile syscall has two flags on BSD to avoid blocking. Can these be
> used to avoid the blocking?
>

It looks like FreeBSD 11 has improvements in this area. At least the man page
says that sendfile doesn't block on disk I/O and can return early with
success.


--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ http://www.sports.ru/
Maxim Konovalov
Re: upstream timeouts I can not explain
January 11, 2017 10:20PM
On 1/12/17 12:04 AM, Руслан Закиров wrote:
> It looks like FreeBSD 11 has improvements in this area. At least the man
> page says that sendfile doesn't block on disk I/O and can return early
> with success.

That's a correct observation.

https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/

--
Maxim Konovalov