rewrite question

Posted by shiz 
shiz
rewrite question
June 08, 2018 02:00AM
Hi,

Recently, Google has started spidering my website and, in addition to the
normal pages, it requests every URL with "&amp" appended, even for pages
excluded by robots.txt.

e.g. page.php?page=aaa -> page.php?page=aaa&amp

Any idea how to redirect/rewrite this?

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,280093,280093#msg-280093

_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx
Richard Stanway via nginx
Re: rewrite question
June 11, 2018 12:10PM
This is almost certainly not Google, as they obey robots.txt. The & to &amp;
conversion is another sign of a poor-quality crawler. Check the RDNS and
you will find it's probably some IP faking the Google UA. I suggest blocking
at the network level.

shiz
Re: rewrite question
June 11, 2018 03:50PM
I see another poster wrote this and deleted it afterwards.

`This is almost certainly not Google as they obey robots.txt. The & to
&amp;
conversion is another sign of a poor quality crawler. Check the RDNS and
you will find it's probably some IP faking Google UA, I suggest blocking at
network level.`

My actual reply:


1 - It is Google.
2 - They do not always use a user-friendly user agent. That is a fact.
3 - When they don't, they also don't follow robots.txt.

So my problem remains.

I don't want to block those IP ranges at the iptables level because it's Google.
So a rewrite or a redirect is badly needed; I'm not sure exactly which at the
moment. It depends on the URL.

Here are the IP ranges, definitely Google. They are referenced in
https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/issues/175

And here is a copy of my original message.

"Hi,

I'm still faithful to your script. It does great things to my websites.
Thanks for that.

Not a bug, properly speaking, just an observation you might find useful.

Over the last 1-2 months, I got a lot of strange, impossible requests, all
with the same User-Agent, no referrer and HTTP/1.1. All came from Google.
They do not respect robots.txt and sniff everywhere they're not supposed to.
I thought you should be made aware of it.

I know you whitelist Google IPs, but after inspection from other users, you
might want to revisit those.

User-agent:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/28.0.1500.71 Safari/537.36"

Ranges:
66.249.64.0/19
72.14.199.0/24

Examples of request:
72.14.199.18 - - [27/May/2018:14:12:01 -0700] "GET
/page.php?page%3Dabout_himeji_forklifts&amp HTTP/1.1" 301 178 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/28.0.1500.71 Safari/537.36"
72.14.199.4 - - [27/May/2018:14:12:24 -0700] "GET
/page.php?page%3Dabout_himeji_forklifts&amp HTTP/1.1" 302 165 "-"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/28.0.1500.71 Safari/537.36"

In the meantime, I circumvented your whitelist by issuing manual range bans.
After 6 weeks, there are no more of those strange requests, and bandwidth has
dropped significantly, since those 2 ranges were requesting quite a few hundred
megabytes each day!

Thanks again."

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,280093,280117#msg-280117

shiz
Re: rewrite question
June 11, 2018 04:00PM
'The & to &amp; conversion is another sign of a poor quality crawler.'

I wasn't referring to either of those but to '&amp'. Important difference.
It also explains my failure to filter it out as a parameter, since parameters
contain an equals sign, e.g. ...&amp=something or even &amp=

& or &amp; would also be easy to filter out. But that is not the problem I'm
having here. It's different, hence my request for assistance from the nginx
community.

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,280093,280118#msg-280118

Francis Daly
Re: rewrite question
June 11, 2018 05:10PM
On Thu, Jun 07, 2018 at 07:57:43PM -0400, shiz wrote:

Hi there,

> Recently, Google has started spidering my website and in addition to normal
> pages, appended "&amp" to all urls, even the pages excluded by robots.txt
>
> e.g. page.php?page=aaa -> page.php?page=aaa&amp
>
> Any idea how to redirect/rewrite this?

Untested, but:

if ($args ~ "&amp$") { return 400; }

should handle all requests that end in the four characters you report.

You may prefer a different response code.
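
If a redirect is preferred over an error response, a variant along the same lines could capture everything before the trailing "&amp" and redirect to the URL without it. This is only a sketch and equally untested; the variable name $clean_args is made up for illustration, and 301 is just one possible status code.

```nginx
# Sketch only: place inside the relevant server block.
# Matches a query string ending in the four characters "&amp"
# and captures whatever precedes them.
if ($args ~ "^(.*)&amp$") {
    # $clean_args is a hypothetical name; it holds the query string
    # without the trailing "&amp".
    set $clean_args $1;

    # Redirect to the same path with the cleaned query string.
    # Note that $uri is the normalized (decoded) path, which may differ
    # from the raw request path if it contains escaped characters.
    return 301 $uri?$clean_args;
}
```

With this in place, a request for page.php?page=aaa&amp should be answered with a redirect to page.php?page=aaa, which no longer matches the condition.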

Good luck with it,

f
--
Francis Daly francis@daoine.org
Richard Stanway via nginx
Re: rewrite question
June 11, 2018 05:40PM
That IP resolves to rate-limited-proxy-72-14-199-18.google.com - this is
not the Google search crawler, hence why it ignores your robots.txt. No one
seems to know for sure what the rate-limited-proxy IPs are used for. They
could represent random Chrome users using the Google data saving feature,
hence the varying user-agents you will see. Either way, they are probably
best not blocked, as they could represent many end user IPs. Maybe there is
an X-Forwarded-For header you could look at.

The Google search crawler will resolve to an IP like
crawl-66-249-64-213.googlebot.com.
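
To see whether those requests carry an X-Forwarded-For header, one option is to log it alongside the usual fields. A minimal sketch, assuming the common combined-style log fields; the format name with_xff and the log file path are placeholders, not anything from this thread:

```nginx
# In the http block: a combined-style format with X-Forwarded-For appended.
# "with_xff" is a hypothetical format name.
log_format with_xff '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent" '
                    'xff="$http_x_forwarded_for"';

# In the server block: write an additional log using that format.
# The path is an assumption; adjust it to the local layout.
access_log /var/log/nginx/access_xff.log with_xff;
```

If the header is absent, nginx logs the field as "-", so an empty result for these ranges would suggest there is nothing further to learn from it.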



shiz
Re: rewrite question
June 12, 2018 12:10PM
'if ($args ~ "&amp$") { return 400; }'

Thanks a lot! Exactly what I needed :)

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,94128,280124#msg-280124
