Something quick and dirty like this might be a good start:
return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i
Note: Rails code, but the regex is generally applicable.
I'm pretty sure a large proportion of bots don't use robots.txt, but that was my first thought.
In my opinion, the best way to detect a bot is by the time between requests: if the time between requests is consistently fast, then it's a bot.
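A rough sketch of that idea, not taken from the answer itself: keep a short history of request times per client IP and flag clients whose gaps are consistently shorter than a human would produce. The five-request window and two-second threshold are invented for illustration.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public class RequestTimingDetector
{
    // Recent request timestamps per client IP.
    private readonly ConcurrentDictionary<string, Queue<DateTime>> _history =
        new ConcurrentDictionary<string, Queue<DateTime>>();

    private const int WindowSize = 5;                                        // requests to look back over (assumed)
    private static readonly TimeSpan HumanMinimum = TimeSpan.FromSeconds(2); // assumed human floor

    public bool LooksLikeBot(string clientIp)
    {
        var times = _history.GetOrAdd(clientIp, _ => new Queue<DateTime>());
        lock (times)
        {
            times.Enqueue(DateTime.UtcNow);
            if (times.Count > WindowSize)
                times.Dequeue();
            if (times.Count < WindowSize)
                return false; // not enough data yet

            // Bot if every gap in the window is faster than the human floor.
            DateTime? previous = null;
            foreach (var t in times)
            {
                if (previous.HasValue && t - previous.Value >= HumanMinimum)
                    return false;
                previous = t;
            }
            return true;
        }
    }
}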
Any visitor whose entry page is /robots.txt is probably a bot.
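A minimal sketch of that check, assuming an ASP.NET context; the helper and its isFirstRequestOfVisit parameter are hypothetical, since you'd need your own session or IP tracking to know which request opened the visit.

using System;
using System.Web;

public static class BotHeuristics
{
    // Ordinary browser users never request /robots.txt themselves,
    // so a visit that starts there is almost certainly a crawler.
    public static bool EntryLooksLikeBot(HttpRequest request, bool isFirstRequestOfVisit)
    {
        return isFirstRequestOfVisit &&
               string.Equals(request.Path, "/robots.txt",
                             StringComparison.OrdinalIgnoreCase);
    }
}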
You said that matching on the user agent might be awkward, but we've found it to be a very good match. Our research showed it covers about 98% of the hits you receive, and we haven't run into any false positives with it either. If you want to raise that to 99.9%, you can include a few other well-known matches such as 'crawler' and 'baiduspider'. We have tested this on our production systems over millions of hits.
Here are some C# solutions for you:
Option 1: Fastest when processing a miss, i.e. traffic from a non-bot (a normal user). Catches 99%+ of crawlers.
bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
Option 2: Fastest when processing a hit, i.e. traffic from a bot; fast for misses too. Catches close to 100% of crawlers. It matches the generic terms 'bot', 'crawler', and 'spider' up front, and you can add any other known crawlers to the list.
List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
Option 3: Quite fast, but a little slower than options 1 and 2. It's the most accurate, and it lets you maintain the lists as needed. If you're worried about false positives down the road, you can maintain a separate list of names that contain 'bot'. If we get a short match, we log it and check it for a false positive.
// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};
// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
string match = null;
if (ua.Contains("bot"))
    match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else
    match = Crawlers2.FirstOrDefault(x => ua.Contains(x));
// Short names are riskier matches, so log them for a manual false-positive check.
if (match != null && match.Length < 5)
    Log("Possible new crawler found: ", ua);
bool iscrawler = match != null;
Honeypot
The only real alternative is to create a honeypot link on your site that only a bot will reach. Then log the user-agent strings that hit the honeypot page to a database. You can then use those logged strings to classify crawlers.
Positives: It will match some unknown crawlers that don't declare themselves.
Negatives: Not all crawlers dig deep enough to hit every link on your site, so they may never reach your honeypot.
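A hedged sketch of the logging half of that idea, assuming an ASP.NET Web Forms page wired up as the honeypot target; the BotDb connection string and HoneypotHits table are invented names.

using System;
using System.Configuration;
using System.Data.SqlClient;
using System.Web.UI;

public partial class HoneypotPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Record who reached the honeypot so the user-agent strings
        // can later be used to classify crawlers.
        var cs = ConfigurationManager.ConnectionStrings["BotDb"].ConnectionString; // assumed name
        using (var conn = new SqlConnection(cs))
        using (var cmd = new SqlCommand(
            "INSERT INTO HoneypotHits (HitAt, Ip, UserAgent) VALUES (@at, @ip, @ua)", conn))
        {
            cmd.Parameters.AddWithValue("@at", DateTime.UtcNow);
            cmd.Parameters.AddWithValue("@ip", Request.UserHostAddress);
            cmd.Parameters.AddWithValue("@ua", Request.UserAgent ?? "");
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}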
One suggestion is to create an empty anchor on your page that only a bot would follow. Normal users won't see the link, leaving only spiders and bots to follow it. For example, an empty anchor tag pointing to a subfolder would record a GET request in your logs...
<a href="dontfollowme.aspx"></a>
Many people use this method while running a HoneyPot to catch malicious bots that aren't following the robots.txt file. I use the empty-anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
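This isn't the author's actual solution, but a minimal sketch of the trap-and-block idea in a Global.asax: any client that fetches the hidden dontfollowme.aspx URL goes onto an in-memory blocklist and is refused from then on. A real deployment would persist the list.

using System;
using System.Collections.Concurrent;
using System.Web;

public class Global : HttpApplication
{
    // IPs that followed the hidden anchor; in-memory only for this sketch.
    private static readonly ConcurrentDictionary<string, DateTime> BlockedIps =
        new ConcurrentDictionary<string, DateTime>();

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        string ip = Request.UserHostAddress;

        // Following the hidden anchor springs the trap.
        if (Request.Path.EndsWith("dontfollowme.aspx", StringComparison.OrdinalIgnoreCase))
            BlockedIps[ip] = DateTime.UtcNow;

        // Refuse anything from a trapped client, including the trap request itself.
        if (BlockedIps.ContainsKey(ip))
        {
            Response.StatusCode = 403;
            Response.End();
        }
    }
}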