Extracting part of the text from a web page title with BeautifulSoup

Author: 扬尘
Published: 2024-05-15 05:26:42 (2 months ago)

3 replies

0#
Gassyc加西可 | 2019-08-31 10-32

<div class="post-text" itemprop="text">
<p>You could try this library (<a href="https://github.com/grangier/python-goose" rel="nofollow">Goose</a>).</p>
<p>I tried to write my own extractors for a few sites with BeautifulSoup, but then I realized that Goose does exactly what I need.</p>
</div>

1#
关于贤的记忆 | 2019-08-31 10-32

<div class="post-text" itemprop="text">
<p>The <code>.string</code> attribute gives you the tag's text:</p>
<pre><code>uni_name = soup.title.string</code></pre>
<p>If you only want the first part, split on the <code>|</code> pipe character:</p>
<pre><code>uni_name = soup.title.string.partition('|')[0].strip()</code></pre>
<p>This uses <a href="http://docs.python.org/2/library/stdtypes.html#str.partition" rel="nofollow"><code>str.partition()</code></a> to split only once (for efficiency), takes the first part of the result, and strips any extra whitespace around it.</p>
<p>Demo:</p>
<pre><code>>>> soup.title
&lt;title&gt;College of Agriculture & Life Sciences | The University of Arizona, Tucson, Arizona&lt;/title&gt;
>>> soup.title.string
u'College of Agriculture & Life Sciences | The University of Arizona, Tucson, Arizona'
>>> soup.title.string.partition('|')[0].strip()
u'College of Agriculture & Life Sciences'
</code></pre>
</div>
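For anyone who wants to run this end to end, here is a minimal sketch of the same idea on Python 3 with the `bs4` package. The helper name `first_title_part` and the sample HTML string are made up for illustration and are not from the answer above. The sketch also guards against a case the one-liner does not handle: `.string` is `None` when the page has no `<title>`, or when the title is empty or has more than one child node.

```python
from bs4 import BeautifulSoup


def first_title_part(html, sep="|"):
    """Return the <title> text before the first `sep`, or None if no usable title.

    `first_title_part` is a hypothetical helper name, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title
    # .string is None if <title> is missing, empty, or has several child nodes
    if title is None or title.string is None:
        return None
    # partition() splits once at the first separator; [0] is the part before it
    return title.string.partition(sep)[0].strip()


# Sample markup built from the title shown in the demo (an assumption, not a real page)
html = (
    "<html><head><title>College of Agriculture &amp; Life Sciences | "
    "The University of Arizona, Tucson, Arizona</title></head><body></body></html>"
)

print(first_title_part(html))  # College of Agriculture & Life Sciences
```

If the title contains no `|` at all, `partition()` returns the whole string as its first element, so the helper simply returns the full title with surrounding whitespace stripped.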
