
Python - Beautiful Soup returns an error

Author: 丫头
Published: 2025-03-18 05:48:43 (28 days ago)
Reposted from:

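…-philosophical-association
/core/journals/journal-of-the-australian-mathematical-society
/core/journals/journal-of-the-gilded-age-and-progressive-era
/core/journals/journal-of-the-…-economic-…-association
/core/journals/journal-of-the-marine-biological-association-of-the-united-kingdom
/core/journals/journal-of-the-royal-asiatic-society
/core/journals/journal-of-the-society-for-american-music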

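You should simplify your code and your scraping strategy, although I can see that not all journal pages have the same structure. On most pages you can easily get the ISSN from a form value. On the others (the free-access ones, I think) you need to apply some kind of heuristic to get the ISSN. Also, I don't know why you are using both httplib2 and requests, since they provide more or less the same functionality. Anyway, here is some code that does what you want... sort of (I also removed the CSV code, since it isn't needed that way):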


    
    import requests
    from bs4 import BeautifulSoup

    with open("listoflinks.csv", encoding="utf8") as f:
        for line in f:
            path = line.strip()
            print("getting", path)
            response = requests.get("https://www.cambridge.org" + path)
            soup = BeautifulSoup(response.text, "html.parser")
            try:
                # Most pages expose the ISSN as a form value.
                issn = soup.find("input", attrs={"name": "productIssn"}).get("value")
            except AttributeError:
                # find() returned None: fall back to the "(Online)" span heuristic.
                issn = None
                values = soup.find_all("span", class_="value")
                for v in values:
                    if v.string and "(Online)" in v.string:
                        issn = v.string.split(" ")[0]
                        break
            print("issn:", issn)
            if issn is None:
                # No ISSN found on this page, so there is no file name to save under.
                continue
            details_container = soup.find("div", class_="details-container")
            image = details_container.find("img")
            # The src is protocol-relative ("//host/..."); drop the leading slashes.
            imgurl = image["src"][2:]
            print("imgurl:", imgurl)
            with open(issn + ".jpg", "wb") as output:
                output.write(requests.get("http://" + imgurl).content)
</code>
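The answer mentions that some pages need "some kind of heuristic" to get the ISSN. One possible heuristic, which is not part of the original answer, is to scan the page text for ISSN-shaped strings and keep only those whose check digit is correct. The sketch below uses only the standard library; `is_valid_issn` and `find_issns` are hypothetical helper names introduced here for illustration, and the check relies on the standard ISSN checksum (weights 8 down to 2 on the first seven digits, modulo 11, with 10 written as X).

```python
import re

# ISSN shape: four digits, a hyphen, three digits, then a digit or X.
ISSN_PATTERN = re.compile(r"\b(\d{4})-(\d{3})([\dXx])\b")


def is_valid_issn(candidate):
    """Return True if candidate has the form NNNN-NNNC with a correct check digit."""
    match = ISSN_PATTERN.fullmatch(candidate.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the first seven digits by 8, 7, ..., 2.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3).upper() == expected


def find_issns(text):
    """Hypothetical helper: return checksum-valid ISSNs found in text, in order."""
    found = []
    for match in ISSN_PATTERN.finditer(text):
        issn = match.group(0).upper()
        if is_valid_issn(issn) and issn not in found:
            found.append(issn)
    return found
```

With the scraper above, something like `find_issns(soup.get_text())` could serve as a last-resort fallback when neither the form value nor the "(Online)" span is present. It cannot tell a print ISSN from an online one, so it is only a rough filter.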