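I'm trying to scrape some news articles. I have a larger list of about 3k articles from this site, selected by certain criteria, and (considering I'm new to Python) I came up with this script to scrape them: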
import ...
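OK. I fixed the problem you were having with the soup object being stored as a string, so you can now use bs4 to parse the HTML. I also opted to use pandas .to_csv(), since I'm more familiar with it, but it produces the desired output: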
import pandas as pd
from bs4 import BeautifulSoup
import requests

# get the URL list
list1 = [
    'https://www.dnes.bg/sofia/2019/03/13/borisov-se-pohvali-prihodite-ot-gorivata-sa-sys-7-poveche.404467',
    'https://www.dnes.bg/obshtestvo/2019/03/13/pazim-ezika-si-pravopis-pod-patronaja-na-radeva.404462',
    'https://www.dnes.bg/politika/2019/01/03/politikata-nekanen-gost-na-praznichnata-novogodishna-trapeza.398091',
]

# slice here to test on a subset, e.g. list1[0:10]
list2 = list1

rows = []
for url in list2:
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')

    title = soup.find("h1", "title").text        # the title string
    subtitle = soup.find("div", "descr").text    # the subtitle string
    time = soup.find("div", "art_author").text   # the author/timestamp line

    # collect the paragraphs of this article only; a list shared across
    # iterations would keep returning the first article's text
    paragraphs = soup.find("div", id="art_start").find_all("p")
    # keep only the first paragraph of the article body
    article = paragraphs[0].text.strip() if paragraphs else ""

    rows.append([title, subtitle, time, article])

# build the DataFrame once; DataFrame.append was removed in pandas 2.0
results = pd.DataFrame(rows, columns=['title', 'subtitle', 'time', 'article'])
results.to_csv("scraped.csv", index=False, encoding='utf-8-sig')
Output:
print(results.to_string())

[Printed table: rows 0 to 2 with the columns title, subtitle, time, and article. The Bulgarian cell text was corrupted by an encoding mismatch and cannot be fully recovered. The time column reads approximately "Обновена: 13 мар 2019 13:24 | 13 мар 2019 11:3..." for row 0, "Обновена: 13 мар 2019 11:34 | 13 мар 2019 11:2..." for row 1, and "3 яну 2019 10:45, ..." followed by the author's name for row 2.]
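With about 3k pages, some will probably lack one of these elements. In that case soup.find returns None and calling .text on it raises an AttributeError, which stops the whole run. Below is a minimal sketch of a more defensive parser, assuming the same selectors as in the script above. The helper names text_or_empty and parse_article are hypothetical, and the sample markup is made up to mimic the page structure rather than copied from the site. Unlike the script above, it joins all paragraphs instead of keeping only the first.

```python
from bs4 import BeautifulSoup


def text_or_empty(node):
    """Return the stripped text of a tag, or "" when the tag was not found."""
    return node.get_text(strip=True) if node is not None else ""


def parse_article(html):
    """Extract title, subtitle, time and article text from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", id="art_start")
    # an absent article body yields an empty paragraph list, not an error
    paragraphs = body.find_all("p") if body is not None else []
    return {
        "title": text_or_empty(soup.find("h1", "title")),
        "subtitle": text_or_empty(soup.find("div", "descr")),
        "time": text_or_empty(soup.find("div", "art_author")),
        # join every paragraph rather than keeping only the first one
        "article": "\n".join(p.get_text(strip=True) for p in paragraphs),
    }


# Hypothetical markup that mimics the selectors used in the script
sample = """
<h1 class="title">Example title</h1>
<div class="descr">Example subtitle</div>
<div class="art_author">13 Mar 2019 13:24</div>
<div id="art_start"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""
record = parse_article(sample)
print(record["article"])
```

In the loop you would call parse_article(html.text) for each URL, append the returned dict to a list, and pass that list to pd.DataFrame once at the end.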