Blocking and Restricting Access by user_agent on an Nginx Web Server: Examples

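Attentive site owners may notice that a site with little or no real traffic can still run under heavy load at times, and the access logs show large numbers of useless crawler requests. Some come from foreign search-engine spiders, others from content-scraping bots. Can we block them by technical means? Since Nginx is what most of us run, this post matches on user_agent to block or restrict such requests.

In this post I (Lao Jiang) record a few examples of user_agent-based blocking rules. Some of them will probably be needed again later, so they are collected here as notes.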

Block clients with an empty User-Agent

if ($http_user_agent ~ ^$) {
    return 403;
}
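The snippets in this post are bare `if` blocks, so it may help to see where one would sit. The following is a minimal sketch, assuming a hypothetical virtual host; `example.com`, the `root` path, and the `location` block are placeholders that do not come from the original article. The `if` directive is valid in both `server` and `location` context.

```nginx
# Hypothetical virtual host; server_name and root are placeholders.
server {
    listen 80;
    server_name example.com;
    root /var/www/html;

    # Server-level check: evaluated for every request to this host.
    if ($http_user_agent ~ ^$) {
        return 403;
    }

    location / {
        try_files $uri $uri/ =404;
    }
}
```

After editing, `nginx -t` checks the configuration syntax and `nginx -s reload` applies it.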

Block scraping tools such as Scrapy

# ~* is a case-insensitive match, so "curl" and "Curl" are both caught.
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}

Block specific UAs

# ~ is a case-sensitive match, so [Pp]ython covers both "python-requests" and "Python-urllib".
if ($http_user_agent ~ "ApacheBench|WebBench|HttpClient|Java|[Pp]ython|Go-http-client|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Feedly|UniversalFeedParser|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|FlightDeckReports Bot|YYSpider|DigExt|YisouSpider|MJ12bot|heritrix|EasouSpider|LinkpadBot|Ezooms") {
    return 403;
}
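As the list of patterns grows, a long regular expression inside `if` becomes hard to maintain. One possible alternative is nginx's `map` directive, which sets a flag variable from the User-Agent with one pattern per line. This is only a sketch: the variable name `$blocked_ua`, the `example.com` host, and the selection of patterns are illustrative assumptions, not part of the original rules.

```nginx
# Goes in the http {} context.
map $http_user_agent $blocked_ua {
    default         0;
    ""              1;   # empty or missing User-Agent
    ~*scrapy        1;   # "~*" prefix: case-insensitive regex
    ~*httpclient    1;
    ~ApacheBench    1;   # "~" prefix: case-sensitive regex
    ~AhrefsBot      1;
    ~MJ12bot        1;
}

# Goes alongside the other virtual hosts (also inside http {}).
server {
    listen 80;
    server_name example.com;   # placeholder

    # An "if" on a bare variable is false when the value is empty or "0".
    if ($blocked_ua) {
        return 403;
    }
}
```

With this layout, adding or removing a UA means editing a single line in the `map` block, and the same flag can be reused by several `server` blocks.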

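Block search-engine crawlers

# Caution: several of these are mainstream search-engine crawlers;
# block them only if you do not want those services to fetch the site.
if ($http_user_agent ~* "qihoobot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
    return 403;
}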
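Returning 403 is the blocking half of the title; for crawlers you would rather slow down than shut out, the restricting half can be sketched with `limit_req`. The example below assumes a broad, illustrative pattern (`bot|spider|slurp`), a hypothetical zone named `crawlers`, and arbitrary rate and burst values, none of which come from the original article. It relies on the documented behavior that requests with an empty key are not counted by `limit_req_zone`.

```nginx
# Goes in the http {} context.
map $http_user_agent $crawler_key {
    default                "";                    # empty key: request is not rate limited
    ~*(bot|spider|slurp)   $binary_remote_addr;   # matching crawlers are limited per client IP
}

# 10 MB of shared state; matching clients are allowed about 1 request per second.
limit_req_zone $crawler_key zone=crawlers:10m rate=1r/s;

server {
    listen 80;
    server_name example.com;   # placeholder

    location / {
        # Allow short bursts of up to 5 requests; reject the excess immediately.
        limit_req zone=crawlers burst=5 nodelay;
    }
}
```

Requests over the limit receive a 503 response by default, while ordinary browsers never enter the zone because their key is empty.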

Block request methods other than GET, HEAD, and POST

if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
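The same method restriction can also be written without `if`, using the `limit_except` directive, which is only valid inside a `location`. This is a sketch of that alternative, not the original author's rule; the `location /` scope is an assumption.

```nginx
location / {
    # Allowing GET also allows HEAD; every method other than GET, HEAD, and POST is denied.
    limit_except GET POST {
        deny all;
    }
}
```

`deny all` answers with 403, matching the `if` version above, but the rule applies per location rather than to the whole server.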

Block one particular user_agent

# Parentheses and dots are escaped so they match literally.
if ($http_user_agent ~ "Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1; SV1; \.NET CLR 1\.1\.4322; \.NET CLR 2\.0\.50727\)") {
    return 404;
}
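When the goal is to match one complete User-Agent string, an exact comparison with `=` avoids regex escaping altogether. The sketch below assumes the intended target is the well-known IE6 string `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)` and that no other text surrounds it in the header.

```nginx
# "=" compares the whole header value with the literal string; no regex is involved.
if ($http_user_agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)") {
    return 404;
}
```

Unlike the regex form, this does not match a User-Agent that merely contains the string, so it is the stricter of the two.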

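Here are some commonly seen crawlers and tools.

UA type	Description
ApacheBench	Performance/stress testing
WebBench	Performance/stress testing
WinHttp	Scraping, CC attacks
HttpClient	TCP attacks
Jmeter	Load testing
BOT/0.1 (BOT for JCE)	SQL injection
CrawlDaddy	SQL injection
Indy Library	Scanning
ZmEu	phpMyAdmin scanning
Microsoft URL Control	Scanning
jaunty	WordPress scanner
Java	Content scraping
Python-urllib	Content scraping
Jullo	Content scraping
FeedDemon	Content scraping
Feedly	Content scraping
UniversalFeedParser	Content scraping
Alexa Toolbar	Content scraping
Swiftbot	Low-value crawler
YandexBot	Low-value crawler
AhrefsBot	Low-value crawler
JikeSpider	Low-value crawler
MJ12bot	Low-value crawler
oBot	Low-value crawler
FlightDeckReports Bot	Low-value crawler
Linguee Bot	Low-value crawler
EasouSpider	Low-value crawler
YYSpider	Low-value crawler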

Decide what to block according to your own needs. For reference, here are the User-Agent strings of some common search-engine crawlers.

Baidu crawler

Baiduspider+(+http://www.baidu.com/search/spider.htm)

Google crawlers

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot/2.1 (+http://www.googlebot.com/bot.html)

Googlebot/2.1 (+http://www.google.com/bot.html)

Yahoo crawlers (China and US, respectively)

Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Sina iAsk crawlers

iaskspider/2.0(+http://iask.com/help/help_index.html)

Mozilla/5.0 (compatible; iaskspider/1.0; MSIE 6.0)

Sogou crawlers

Sogou web spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)

Sogou Push Spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)

NetEase (Youdao) crawler

Mozilla/5.0 (compatible; YodaoBot/1.0; http://www.yodao.com/help/webmaster/spider/; )

MSN crawler

msnbot/1.0 (+http://search.msn.com/msnbot.htm)

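Other related Nginx operations articles:

1. Routing different crawlers to different backends in Nginx

2. Blocking access to specific pages and directories in Nginx and Apache (users can still access the site)

3. Using Nginx user_agent to block specific crawlers and redirect them

4. Limiting concurrent connections per IP in Nginx and Apache to reduce server load

5. Hardening an Nginx server with Naxsi as a web application firewall (WAF)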
