limit_rate based on User-Agent; how to exempt /robots.txt ?

Posted by Cameron Kerr 
Hi all, I’ve recently deployed a rate-limiting configuration aimed at protecting myself from spiders.

nginx version: nginx/1.15.1 (RPM from nginx.org)

I did this based on the excellent Nginx blog post at https://www.nginx.com/blog/rate-limiting-nginx/ and have consulted the documentation for limit_req and limit_req_zone.

I understand that you can have multiple zones in play, and that the most restrictive of all matches will apply to any matching request. I want to go the other way, though: I want to exempt /robots.txt from being rate-limited for spiders.
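(To illustrate what I mean by multiple zones, here is a hypothetical sketch, not my real config, where a per-IP zone and a per-user-agent zone both apply to the same requests, so a request is rejected as soon as it exceeds either limit; the zone names per_ip and per_ua are made up:)

limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;
limit_req_zone $http_user_agent    zone=per_ua:1m  rate=100r/m;

server {
    location / {
        # both zones are checked for every request; whichever limit trips first wins
        limit_req zone=per_ip burst=20;
        limit_req zone=per_ua;
        proxy_pass http://routing_layer_http/;
    }
}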

To put this in context, here is the gist of the relevant config, which aims to implement a caching (and rate-limiting) layer in front of a much more complex request routing layer (httpd).

http {
    map $http_user_agent $user_agent_rate_key {
        default                            "";
        "~our-crawler"                     "wanted-robot";
        "~*(bot/|crawler|robot|spider)"    "robot";
        "~ScienceBrowser/Nutch"            "robot";
        "~Arachni/"                        "robot";
    }

    limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
    limit_req_status 429;

    server {
        limit_req zone=per_spider_class;

        location / {
            proxy_pass http://routing_layer_http/;
        }
    }
}



Option 1: (working, but has issues)

Should I instead put the limit_req inside the "location / {}" stanza, and have a separate "location /robots.txt {}" (or some generalised form using a map) without limit_req inside that stanza?

That would mean that any other configuration inside the location stanzas would get duplicated, which would be a manageability concern. I just want to override the limit_req.

server {
    location /robots.txt {
        proxy_pass http://routing_layer_http/;
    }

    location / {
        limit_req zone=per_spider_class;
        proxy_pass http://routing_layer_http/;
    }
}

I've tested this, and it works.
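If I went this way, I suppose the duplication could at least be kept manageable by factoring the shared proxy settings out into an include file, something like this (untested sketch; the include path is made up):

server {
    location /robots.txt {
        # shared proxy_pass / proxy_* settings live in the included file
        include snippets/routing_proxy.conf;
    }

    location / {
        limit_req zone=per_spider_class;
        include snippets/routing_proxy.conf;
    }
}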


Option 2: (working, but has issues)

Should I create a "location /robots.txt {}" stanza that has a limit_req with a high burst, say burst=500? It's not a whitelist, but perhaps it would still be useful?

But I still end up with replicated location stanzas... I don't think I like this approach.

server {
    limit_req zone=per_spider_class;

    location /robots.txt {
        limit_req zone=per_spider_class burst=500;
        proxy_pass https://routing_layer_https/;
    }

    location / {
        proxy_pass https://routing_layer_https/;
    }
}
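(If I did keep a limit here, I understand that burst on its own would still pace the excess requests out at the configured 100r/m; adding nodelay, roughly as below, would let them through immediately while still counting them against the burst.)

location /robots.txt {
    # hypothetical variant: allow a large burst without pacing the excess requests
    limit_req zone=per_spider_class burst=500 nodelay;
    proxy_pass https://routing_layer_https/;
}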


Option 3: (does not work)

Some other way... perhaps I need to create a map that takes the path and produces a $path_exempt variable, and then somehow combine that with $user_agent_rate_key, returning "" when $path_exempt is set, or $user_agent_rate_key otherwise.

map $http_user_agent $user_agent_rate_key {
    default                            "";
    "~otago-crawler"                   "wanted-robot";
    "~*(bot/|crawler|robot|spider)"    "robot";
    "~ScienceBrowser/Nutch"            "robot";
    "~Arachni/"                        "robot";
}

map $uri $rate_for_spider_exempting {
    default          $user_agent_rate_key;
    "/robots.txt"    "";
}

#limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;


However, this does not work because the second map is not returning $user_agent_rate_key; the effect is that non-robots are affected (and the load-balancer health-probes start getting rate-limited).

I'm guessing my reasoning about how this works is incorrect, or that there is a limitation or some sort of implicit ordering issue.


Option 4: (does not work)

http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate

I see that there is a variable $limit_rate that can be used, and this would seem to be the cleanest option, except that in testing it doesn't seem to work (I still get 429 responses when testing with a bot User-Agent).

server {
    limit_req zone=per_spider_class;

    location /robots.txt {
        set $limit_rate 0;
    }

    location / {
        proxy_pass http://routing_layer_http/;
    }
}


I'm still fairly new to Nginx, so I want something that decomposes cleanly into an Nginx configuration. I would quite like to have just one place where I specify the map of URLs I wish to exempt (I imagine there could be others, such as ~/.well-known/something, that could pop up).
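For example, I imagine the exemption map from Option 3 could simply grow extra entries as needed, something like this (untested, and the .well-known pattern is only a guess at what I might need):

map $uri $rate_for_spider_exempting {
    default               $user_agent_rate_key;
    "/robots.txt"         "";
    "~^/\.well-known/"    "";
}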

Thank you very much for your time.

--
Cameron Kerr
Systems Engineer, Information Technology Services
University of Otago

Peter Booth via nginx
Re: limit_rate based on User-Agent; how to exempt /robots.txt ?
August 07, 2018 08:00AM
So it’s very easy to get caught in the trap of having unrealistic mental models of how web servers work. If your host is a single recent (< 5 years) host then you can probably support 300,000 requests per second for your robots.txt file. That’s because the file will be served from your Linux file cache (memory).
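(For what it's worth, if the goal is just to make /robots.txt as cheap as possible, it could even be served straight from nginx instead of being proxied; a rough, untested sketch with a made-up docroot:)

location = /robots.txt {
    # served from local disk, so after the first hit it comes straight out of the page cache
    root /var/www/static;
}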

Sent from my iPhone

> On Aug 6, 2018, at 10:45 PM, Cameron Kerr <[email protected]> wrote:
>
> Hi all, I’ve recently deployed a rate-limiting configuration aimed at protecting myself from spiders.
>
> [...]
Hello!

On Tue, Aug 07, 2018 at 02:45:02AM +0000, Cameron Kerr wrote:

> Hi all, I’ve recently deployed a rate-limiting configuration
> aimed at protecting myself from spiders.
>
> nginx version: nginx/1.15.1 (RPM from nginx.org)
>
> I did this based on the excellent Nginx blog post at
> https://www.nginx.com/blog/rate-limiting-nginx/ and have
> consulted the documentation for limit_req and limit_req_zone.
>
> I understand that you can have multiple zones in play, and that
> the most-restrictive of all matches will apply for any matching
> request. I want to go the other way though. I want to exempt
> /robots.txt from being rate limited by spiders.
>
> To put this in context, here is the gist of the relevant config,
> which aims to implement a caching (and rate-limiting) layer in
> front of a much more complex request routing layer (httpd).
>
> http {
> map $http_user_agent $user_agent_rate_key {
> default "";
> "~our-crawler" "wanted-robot";
> "~*(bot/|crawler|robot|spider)" "robot";
> "~ScienceBrowser/Nutch" "robot";
> "~Arachni/" "robot";
> }
>
> limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
> limit_req_status 429;
>
> server {
> limit_req zone=per_spider_class;
>
> location / {
> proxy_pass http://routing_layer_http/;
> }
> }
> }
>
>
>
> Option 1: (working, but has issues)
>
> Should I instead put the limit_req inside the "location / {}"
> stanza, and have a separate "location /robots.txt {}" (or some
> generalised form using a map) and not have limit_req inside that
> stanza
>
> That would mean that any other configuration inside the location
> stanzas would get duplicated, which would be a manageability
> concern. I just want to override the limit_req.
>
> server {
> location /robots.txt {
> proxy_pass http://routing_layer_http/;
> }
>
> location / {
> limit_req zone=per_spider_class;
> proxy_pass http://routing_layer_http/;
> }
> }
>
> I've tested this, and it works.

This is the simplest and most nginx-like way: provide exact
configurations in particular locations, and this is what I would
recommend using.

[...]

> Option 3: (does not work)
>
> Some other way... perhaps I need to create some map that takes
> the path and produces a $path_exempt variable, and then somehow
> use that with the $user_agent_rate_key, returning "" when
> $path_exempt, or $user_agent_rate_key otherwise.
>
> map $http_user_agent $user_agent_rate_key {
> default "";
> "~otago-crawler" "wanted-robot";
> "~*(bot/|crawler|robot|spider)" "robot";
> "~ScienceBrowser/Nutch" "robot";
> "~Arachni/" "robot";
> }
>
> map $uri $rate_for_spider_exempting {
> default $user_agent_rate_key;
> "/robots.txt" "";
> }
>
> #limit_req_zone $user_agent_rate_key zone=per_spider_class:1m rate=100r/m;
> limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;
>
>
> However, this does not work because the second map is not
> returning $user_agent_rate_key; the effect is that non-robots
> are affected (and the load-balancer health-probes start getting
> rate-limited).
>
> I'm guessing my reasoning of how this works is incorrect, or
> there is a limitation or some sort of implicit ordering issue.

This approach is expected to work fine (assuming you've used
limit_req somewhere), and I've just tested the exact configuration
snippet provided to be sure. If it doesn't work for you, the
problem is likely elsewhere.

> Option 4: (does not work)
>
> http://nginx.org/en/docs/http/ngx_http_core_module.html#limit_rate
>
> I see that there is a variable $limit_rate that can be used, and
> this would seem to be the cleanest, except in testing it doesn't
> seem to work (still gets 429 responses as a User-Agent that is a
> bot)

The limit_rate directive (and the $limit_rate variable) controls
bandwidth, and it is completely unrelated to the limit_req
module.
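For example, something like the following (illustrative only; the location and rate are made up) would throttle how fast a response is sent, which is a different thing from limiting how many requests are accepted:

location /downloads/ {
    # cap response bandwidth at 100 KB/s per connection; this does not reject requests
    set $limit_rate 100k;
    proxy_pass http://routing_layer_http/;
}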

--
Maxim Dounin
http://mdounin.ru/
Hi Maxim, that's very helpful...

> -----Original Message-----
> From: nginx [mailto:[email protected]] On Behalf Of Maxim Dounin
> On Tue, Aug 07, 2018 at 02:45:02AM +0000, Cameron Kerr wrote:


> > Option 3: (does not work)

> This approach is expected to work fine (assuming you've used limit_req
> somewhere), and I've just tested the exact configuration snippet provided
> to be sure. If it doesn't work for you, the problem is likely elsewhere.

Thank you for the confirmation; I've retried it, and testing with ab, it seems to work, so I'm not sure what I was doing wrong previously.

I like the pattern of chaining maps; it's nicely functional in my way of thinking.

For the sake of others, my configuration looks like the following:

http {

    map $http_user_agent $user_agent_rate_key {
        default                              "";
        "~*(bot[/-]|crawler|robot|spider)"   "robot";
        "~ScienceBrowser/Nutch"              "robot";
        "~Arachni/"                          "robot";
    }

    map $uri $rate_for_spider_exempting {
        default          $user_agent_rate_key;
        "/robots.txt"    '';
    }

    limit_req_zone $rate_for_spider_exempting zone=per_spider_class:1m rate=100r/m;

    limit_req_status 429;
    server_tokens off;

    server {
        limit_req zone=per_spider_class;

        location / {
            proxy_pass http://routing_layer_http/;
        }
    }
}


And my testing:

// spider with non-exempted (ie. rate-limited for spiders) URI

$ ab -H 'User-Agent: spider' -n100 https://.../hostname | grep -e '^Complete requests:' -e '^Failed requests:'
Complete requests: 100
Failed requests: 98

// spider with exempted (ie. no-rate-limiting for spiders) URI

$ ab -H 'User-Agent: spider' -n100 https://.../robots.txt | grep -e '^Complete requests:' -e '^Failed requests:'
Complete requests: 100
Failed requests: 0

// non-spider with exempted (ie. no-rate-limiting for spiders) URI

$ ab -n100 https://.../robots.txt | grep -e '^Complete requests:' -e '^Failed requests:'
Complete requests: 100
Failed requests: 0

// non-spider with non-exempted (ie. rate-limited for spiders) URI

$ ab -n100 https://.../hostname | grep -e '^Complete requests:' -e '^Failed requests:'
Complete requests: 100
Failed requests: 0


Thanks again for your feedback

Cheers,
Cameron
