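You should simplify both your code and your scraping strategy, although I can see that not all journal pages have the same structure. On most pages you can get the ISSN easily from a form value. On the others (the free-access ones, I think) you need to apply some kind of heuristic to get the ISSN. Also, I don't know why you are using both httplib2 and requests, since they provide more or less the same functionality. Anyway, here is some code that does what you want... sort of (I have also removed the CSV code, since it isn't needed in that form):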
import requests
from bs4 import BeautifulSoup

with open('listoflinks.csv', encoding="utf8") as f:
    for line in f:
        path = line.strip()
        print("getting", path)
        response = requests.get("https://www.cambridge.org" + path)
        soup = BeautifulSoup(response.text, "html.parser")
        try:
            # Most journal pages expose the ISSN in a hidden form field.
            issn = soup.find("input", attrs={'name': 'productIssn'}).get('value')
        except AttributeError:
            # No such field: fall back to the "<issn> (Online)" text on the page.
            issn = None
            values = soup.find_all("span", class_="value")
            for v in values:
                if v.string and "(Online)" in v.string:
                    issn = v.string.split(" ")[0]
                    break
        if issn is None:
            print("no ISSN found, skipping", path)
            continue
        print("issn:", issn)
        details_container = soup.find("div", class_="details-container")
        image = details_container.find("img")
        # The src is protocol-relative ("//host/..."), so drop the leading "//".
        imgurl = image['src'][2:]
        print("imgurl:", imgurl)
        # Save the cover image under the journal's ISSN.
        with open(issn + ".jpg", 'wb') as output:
            output.write(requests.get("http://" + imgurl).content)
</code>
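If the "(Online)" heuristic turns out to be too loose, one option is to check that the token you picked up really is an ISSN before using it as a file name. The following is only a rough sketch: `is_valid_issn` and `pick_online_issn` are hypothetical helper names of mine, and I am assuming the page text looks like `1234-5678 (Online)`, as the fallback in the code above does. The check-digit rule itself is the standard ISSN one (weights 8 down to 2, modulo 11, with `X` standing for 10).

```python
import re

# An ISSN looks like "1234-567X"; the last character is a digit or X.
ISSN_RE = re.compile(r"\b(\d{4})-(\d{3})([\dXx])\b")


def is_valid_issn(issn):
    """Return True if `issn` has the NNNN-NNNC shape and a correct check digit."""
    match = ISSN_RE.fullmatch(issn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weights run from 8 down to 2 over the first seven digits.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3).upper() == expected


def pick_online_issn(values):
    """Hypothetical helper: first valid ISSN labelled "(Online)", else None."""
    for text in values:
        # Skip empty entries (e.g. tags whose .string is None).
        if not text or "(Online)" not in text:
            continue
        match = ISSN_RE.search(text)
        if match and is_valid_issn(match.group(0)):
            return match.group(0).upper()
    return None
```

In the script you could then call something like `pick_online_issn([v.string for v in soup.find_all("span", class_="value")])` in the fallback branch and skip the page when it returns `None`.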