Something quick and dirty like this might be a good start:
return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i
Note: Rails code, but the regex is generally applicable.
I'm pretty sure a large proportion of bots don't use robots.txt, but that was my first thought.
In my opinion, the best way to detect a bot is by the time between requests: if the time between requests is consistently fast, then it's a bot.
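A rough sketch of that idea, not taken from the answer itself: keep a short history of request times per client IP and flag clients whose gaps are consistently shorter than a human would produce. The five-request window and two-second threshold are invented for illustration.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public class RequestTimingDetector
{
    // Recent request timestamps per client IP.
    private readonly ConcurrentDictionary<string, Queue<DateTime>> _history =
        new ConcurrentDictionary<string, Queue<DateTime>>();

    private const int WindowSize = 5;                                        // requests to look back over (assumed)
    private static readonly TimeSpan HumanMinimum = TimeSpan.FromSeconds(2); // assumed human floor

    public bool LooksLikeBot(string clientIp)
    {
        var times = _history.GetOrAdd(clientIp, _ => new Queue<DateTime>());
        lock (times)
        {
            times.Enqueue(DateTime.UtcNow);
            if (times.Count > WindowSize)
                times.Dequeue();
            if (times.Count < WindowSize)
                return false; // not enough data yet

            // Bot if every gap in the window is faster than the human floor.
            DateTime? previous = null;
            foreach (var t in times)
            {
                if (previous.HasValue && t - previous.Value >= HumanMinimum)
                    return false;
                previous = t;
            }
            return true;
        }
    }
}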
Any visitor whose entry page is /robots.txt is probably a bot.
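A minimal sketch of that check, assuming an ASP.NET context; the helper and its isFirstRequestOfVisit parameter are hypothetical, since you'd need your own session or IP tracking to know which request opened the visit.

using System;
using System.Web;

public static class BotHeuristics
{
    // Ordinary browser users never request /robots.txt themselves,
    // so a visit that starts there is almost certainly a crawler.
    public static bool EntryLooksLikeBot(HttpRequest request, bool isFirstRequestOfVisit)
    {
        return isFirstRequestOfVisit &&
               string.Equals(request.Path, "/robots.txt",
                             StringComparison.OrdinalIgnoreCase);
    }
}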
You said that matching on the user agent might be awkward, but we've found it to be a very good match. Our research showed it covers about 98% of the hits you receive, and we haven't run into any false positives with it either. If you want to raise that to 99.9%, you can include a few other well-known matches such as 'crawler' and 'baiduspider'. We have tested this on our production systems over millions of hits.
Here are some C# solutions for you:
Option 1: Fastest when processing a miss, i.e. traffic from a non-bot (a normal user). Catches 99%+ of crawlers.
bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
Option 2: Fastest when processing a hit, i.e. traffic from a bot; fast for misses too. Catches close to 100% of crawlers. It matches the generic terms 'bot', 'crawler', and 'spider' up front, and you can add any other known crawlers to the list.
List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
Option 3: Quite fast, but a little slower than options 1 and 2. It's the most accurate, and it lets you maintain the lists as needed. If you're worried about false positives down the road, you can maintain a separate list of names that contain 'bot'. If we get a short match, we log it and check it for a false positive.
// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};
// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
string match = null;
if (ua.Contains("bot"))
    match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else
    match = Crawlers2.FirstOrDefault(x => ua.Contains(x));
// Short names are riskier matches, so log them for a manual false-positive check.
if (match != null && match.Length < 5)
    Log("Possible new crawler found: ", ua);
bool iscrawler = match != null;
Honeypot
The only real alternative is to create a honeypot link on your site that only a bot will reach. Then log the user-agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.
Positives: It will match some unknown crawlers that don't declare themselves.
Negatives: Not all crawlers dig deep enough to hit every link on your site, so they may never reach your honeypot.
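A hedged sketch of the logging half of that idea, assuming an ASP.NET Web Forms page wired up as the honeypot target; the BotDb connection string and HoneypotHits table are invented names.

using System;
using System.Configuration;
using System.Data.SqlClient;
using System.Web.UI;

public partial class HoneypotPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Record who reached the honeypot so the user-agent strings
        // can later be used to classify crawlers.
        var cs = ConfigurationManager.ConnectionStrings["BotDb"].ConnectionString; // assumed name
        using (var conn = new SqlConnection(cs))
        using (var cmd = new SqlCommand(
            "INSERT INTO HoneypotHits (HitAt, Ip, UserAgent) VALUES (@at, @ip, @ua)", conn))
        {
            cmd.Parameters.AddWithValue("@at", DateTime.UtcNow);
            cmd.Parameters.AddWithValue("@ip", Request.UserHostAddress);
            cmd.Parameters.AddWithValue("@ua", Request.UserAgent ?? "");
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}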
One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users won't see the link, leaving only spiders and bots to follow it. For example, an empty anchor tag pointing to a subfolder would record a GET request in your logs...
<a href="dontfollowme.aspx"></a>
Many people use this method while running a HoneyPot to catch malicious bots that aren't following the robots.txt file. I use the empty-anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
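This isn't the author's actual solution, but a minimal sketch of the trap-and-block idea in a Global.asax: any client that fetches the hidden dontfollowme.aspx URL goes onto an in-memory blocklist and is refused from then on. A real deployment would persist the list.

using System;
using System.Collections.Concurrent;
using System.Web;

public class Global : HttpApplication
{
    // IPs that followed the hidden anchor; in-memory only for this sketch.
    private static readonly ConcurrentDictionary<string, DateTime> BlockedIps =
        new ConcurrentDictionary<string, DateTime>();

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        string ip = Request.UserHostAddress;

        // Following the hidden anchor springs the trap.
        if (Request.Path.EndsWith("dontfollowme.aspx", StringComparison.OrdinalIgnoreCase))
            BlockedIps[ip] = DateTime.UtcNow;

        // Refuse anything from a trapped client, including the trap request itself.
        if (BlockedIps.ContainsKey(ip))
        {
            Response.StatusCode = 403;
            Response.End();
        }
    }
}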