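Looking through the Apache access log, I found a frenzy of requests from the Baidu crawler.
1. I don't want to block the Baidu crawler entirely, at least not yet.
2. I have already changed the 09-tibet-photo-show link to 2009-tibet-photo-show and, in the .htaccess file, blocked hotlinking of images under the NextGEN Gallery directory. Requests for pages like 09-tibet-photo-show now only return a 301 redirect.
3. I have set an access password on 2009-tibet-photo-show, but requests for 2009-tibet-photo-show still return a normal 200 response.
Is there any way to reduce the system load caused by the Baidu crawler? Or is the current setup already sufficient?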

123.125.66.25 - - [11/May/2010:14:33:55 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.49 - - [11/May/2010:14:33:56 -0500] "GET /09-tibet-photo-show?amp&replytocom=145646&nggpage=2&show=slide HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.24 - - [11/May/2010:14:33:57 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.31 - - [11/May/2010:14:33:59 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&show=gallery&pid=151&nggpage=12 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.53 - - [11/May/2010:14:33:59 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&show=gallery&pid=150&nggpage=11 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.46 - - [11/May/2010:14:33:59 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&show=gallery&pid=477&nggpage=2 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.35 - - [11/May/2010:14:33:56 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.49 - - [11/May/2010:14:34:00 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.26 - - [11/May/2010:14:34:00 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
66.249.71.166 - - [11/May/2010:14:34:02 -0500] "GET /09-tibet-photo-show?pid=463&nggpage=9&show=slide HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.166 - - [11/May/2010:14:34:02 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
123.125.66.18 - - [11/May/2010:14:34:00 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.101 - - [11/May/2010:14:34:02 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&show=gallery&nggpage=2&pid=150 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.101 - - [11/May/2010:14:34:03 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.49 - - [11/May/2010:14:34:04 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&pid=480&show=gallery&nggpage=3 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.29 - - [11/May/2010:14:34:04 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&show=gallery&nggpage=11&pid=151 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
220.181.125.71 - - [11/May/2010:14:34:05 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
123.125.66.18 - - [11/May/2010:14:34:05 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.37 - - [11/May/2010:14:34:05 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.107 - - [11/May/2010:14:34:09 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&pid=478&nggpage=7 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.40 - - [11/May/2010:14:34:09 -0500] "GET / HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.107 - - [11/May/2010:14:34:10 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&pid=153&nggpage=13&show=slide HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.50 - - [11/May/2010:14:34:10 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&pid=153&show=slide HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.28 - - [11/May/2010:14:34:10 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&pid=151&show=gallery&nggpage=4 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.18 - - [11/May/2010:14:34:10 -0500] "GET / HTTP/1.1" 200 48778 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.31 - - [11/May/2010:14:34:11 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.50 - - [11/May/2010:14:34:11 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.30 - - [11/May/2010:14:34:10 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.34 - - [11/May/2010:14:34:15 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&nggpage=8&pid=480 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.45 - - [11/May/2010:14:34:15 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&nggpage=5&pid=478 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.26 - - [11/May/2010:14:34:15 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&pid=151&nggpage=11&show=gallery HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.47 - - [11/May/2010:14:34:15 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&nggpage=4&pid=478&show=slide HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
220.181.125.71 - - [11/May/2010:14:34:16 -0500] "GET /09-tibet-photo-show?nggpage=9&pageid=3169&show=slide&pid=160 HTTP/1.1" 301 - "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
123.125.66.25 - - [11/May/2010:14:34:16 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.33 - - [11/May/2010:14:34:16 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.47 - - [11/May/2010:14:34:16 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.110 - - [11/May/2010:14:34:16 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.33 - - [11/May/2010:14:34:22 -0500] "GET /09-tibet-photo-show?amp&replytocom=144939&show=slide&pid=480 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.100 - - [11/May/2010:14:34:11 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.116 - - [11/May/2010:14:34:23 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&nggpage=2&pid=480&show=gallery HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.34 - - [11/May/2010:14:34:23 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&nggpage=2&pid=480 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.109 - - [11/May/2010:14:34:23 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.112 - - [11/May/2010:14:34:24 -0500] "GET /09-tibet-photo-show?amp&replytocom=145001&nggpage=13&pid=480&show=slide HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.22 - - [11/May/2010:14:34:24 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.32 - - [11/May/2010:14:34:25 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.47 - - [11/May/2010:14:34:24 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
220.181.125.71 - - [11/May/2010:14:34:26 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
123.125.66.100 - - [11/May/2010:14:34:27 -0500] "GET /09-tibet-photo-show?amp&replytocom=144939&show=slide&pid=478 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.21 - - [11/May/2010:14:34:27 -0500] "GET /09-tibet-photo-show?replyto&sh&replytocom=145001&nggpage=13&show=gallery&pid=150 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.37 - - [11/May/2010:14:34:28 -0500] "GET /09-tibet-photo-show?replyto&sh&replytocom=145001&nggpage=13&show=gallery HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.102 - - [11/May/2010:14:34:28 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.107 - - [11/May/2010:14:34:28 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.113 - - [11/May/2010:14:34:29 -0500] "GET /09-tibet-photo-show?replyto&sh&replytocom=145001&nggpage=13&pid=149&show=gallery HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.102 - - [11/May/2010:14:34:29 -0500] "GET /09-tibet-photo-show?replyto&sh&replytocom=145001&nggpage=12&show=gallery&pid=478 HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.18 - - [11/May/2010:14:34:29 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.100 - - [11/May/2010:14:34:30 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.21 - - [11/May/2010:14:34:30 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.102 - - [11/May/2010:14:34:32 -0500] "GET /09-tibet-photo-show?replyto&sh&replytocom=145001&nggpage=12&pid=151&show=slide HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
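
To judge whether the crawler traffic is really a load problem, it may help to tally the log by crawler and status code first. The following is only a rough sketch in Python, assuming the log uses Apache's standard "combined" format with plain ASCII quotes; the names `LOG_PATTERN`, `crawler_name`, `summarize` and `SAMPLE` are made up for illustration, and the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Apache "combined" format:
# ip ident user [time] "method path protocol" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def crawler_name(agent):
    """Map a User-Agent string to a short crawler label (claimed, not verified)."""
    lowered = agent.lower()
    for name in ("Baiduspider", "Googlebot", "Sogou"):
        if name.lower() in lowered:
            return name
    return "other"


def summarize(lines):
    """Count hits and bytes sent per (crawler, status) pair."""
    hits = Counter()
    bytes_sent = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that are not in combined format
        key = (crawler_name(match.group("agent")), match.group("status"))
        hits[key] += 1
        size = match.group("size")
        # Apache logs "-" when no response body was sent (e.g. a 301).
        bytes_sent[key] += int(size) if size.isdigit() else 0
    return hits, bytes_sent


SAMPLE = [
    '123.125.66.25 - - [11/May/2010:14:33:55 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"',
    '123.125.66.49 - - [11/May/2010:14:33:56 -0500] "GET /09-tibet-photo-show?amp&replytocom=145646&nggpage=2&show=slide HTTP/1.1" 301 - "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"',
    '66.249.71.166 - - [11/May/2010:14:34:02 -0500] "GET /09-tibet-photo-show?pid=463&nggpage=9&show=slide HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '220.181.125.71 - - [11/May/2010:14:34:05 -0500] "GET /2009-tibet-photo-show HTTP/1.1" 200 31812 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"',
]

hits, bytes_sent = summarize(SAMPLE)
for key in sorted(hits):
    print(key, hits[key], bytes_sent[key])
```

On a real server, `summarize` can be given an open file object for the access log instead of `SAMPLE`. The 301 responses carry no body, so most of the bandwidth here would come from the repeated 200 responses for /2009-tibet-photo-show.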

2 Responses to "How to stop the frenzied 'Baidu' crawler?"
  1. Tintin says:

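    老牛 – Use a robots.txt file to block search engines. But the IPs in these records are clearly not Baidu's; Baidu would not use ADSL IPs. Someone is impersonating the crawler to scrape data. I used to do that myself.
    Baidu the company may have no principles, but its crawler still does. 10:16 am
    Phil Chen – What a great line: Baidu the company may have no principles, but its crawler still does. 10:21 am
    Juntao JIANG – I have updated robots.txt, but I don't know when the search engines will pick up the change, or whether it has any effect on these impersonating crawlers.
    I searched online last night and found that this IP is indeed problematic. 10:47 am
    Patrick He – Just block the IP. 11:58 am
    Juntao JIANG – The IP is already blocked. What I want now is to make requests for that URL return 403 directly from .htaccess. The file no longer exists, so every request returns 404, but the traffic from those 404s and the number of hits on Apache are still significant.
    I tried
    Redirect 403 /myfiles/music/firefly_a-teens.mp3
    but it had no effect. 12:03 pm
    Juntao JIANG – Settings like RewriteCond and RewriteRule are rather complicated. I tried several times without getting them right, and some mistakes make the whole site return 500 errors...
    I have emailed support and am waiting for a reply... 12:05 pm
    Juntao JIANG – Making a custom 404 page is also worth considering, but the 404 page has been taken over by some WordPress setting, so I don't know whether my own setting takes effect.
    Still, the current 404 page is pretty annoying and will have to be replaced sooner or later. 12:07 pm
    Juntao JIANG – It turns out this statement is wrong:
    Redirect 403 /myfiles/music/firefly_a-teens.mp3

    It actually redirected to http://ttrek.net403/. Ugh. 12:10 pm
    Juntao JIANG – Solved:
    ErrorDocument 403 /myfiles/music/firefly_a-teens.mp3
    12:14 pm
    Juntao JIANG – The custom 404 page I set up indeed has no effect. Leaving it alone for now. 12:23 pm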
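
Regarding the robots.txt update mentioned in this thread: one way to check what a rule would mean for a well-behaved crawler, without waiting for the search engines to re-read the file, is Python's standard `urllib.robotparser`. This is only a sketch. The rules in `ROBOTS_TXT` are hypothetical, since the actual robots.txt is not shown in the post, and `allowed` is an illustrative helper name. robots.txt is advisory, so a client that merely impersonates Baiduspider can simply ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep Baiduspider away from the old gallery URL only.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /09-tibet-photo-show
"""


def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if a crawler honoring robots_txt may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The old URL with its query-string variants is blocked for Baiduspider ...
print(allowed("Baiduspider", "/09-tibet-photo-show?amp&replytocom=145001&nggpage=2"))
# ... while the renamed page and other crawlers are unaffected.
print(allowed("Baiduspider", "/2009-tibet-photo-show"))
print(allowed("Googlebot", "/09-tibet-photo-show"))
```

Because the rule is a path prefix, it covers every `replytocom`, `nggpage` and `pid` variant of the old URL with a single line.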

  2. Tintin says:

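    Jetcheng Chu – Just go into cPanel and set up an IP block; you can block the whole IP range in one go. 12:24 pm
    Juntao JIANG – @Jetcheng Chu, the .htaccess settings seem to be the fundamental fix, and that is also what support recommended. cPanel settings are generally less effective than editing the Apache and WordPress configuration files directly.

    There is one more odd thing.

    For the page blocked via ErrorDocument:
    the log shows 404 for some requests and 500 for others.
    Visiting the page directly, the title shows a 403 error while the page body shows a 500 internal error.

    For the page blocked via IP deny:
    the log only shows 403 errors.
    Visiting the page directly, both the title and the page body show a 403 error (I tested by blocking my own IP). 12:31 pm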
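
Before blocking a whole IP range, it may be worth checking whether the addresses really belong to the crawler they claim to be. A common approach is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and make sure it returns the same IP. The sketch below assumes that genuine Baiduspider hosts resolve to names ending in `.baidu.com` or `.baidu.jp`; that suffix list and the function name `is_genuine_crawler` are assumptions for illustration and should be checked against Baidu's own webmaster documentation.

```python
import socket

# Assumption: hostname suffixes used by genuine Baiduspider machines.
TRUSTED_SUFFIXES = (".baidu.com", ".baidu.jp")


def is_genuine_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    reverse and forward are injectable so the logic can be tried offline.
    """
    try:
        hostname = reverse(ip)[0]          # PTR lookup: IP -> hostname
    except OSError:
        return False                       # no reverse record at all
    if not hostname.lower().rstrip(".").endswith(suffixes):
        return False                       # e.g. an ADSL provider hostname
    try:
        addresses = forward(hostname)[2]   # A lookup: hostname -> IPs
    except OSError:
        return False
    return ip in addresses                 # must map back to the same IP
```

Calling `is_genuine_crawler("123.125.66.25")` on the server performs real DNS queries; an address that fails the check is a safer candidate for an IP deny rule than one that passes.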
