In addition to the ONE module, ONE一個 also has a ONE Article module and a ONE Question module.
This note describes how to scrape these two modules.
Fifteen years ago, the internet was a place to escape from reality; now, reality is a place to escape from the internet.
From ONE一個
Getting the article link list
Continuing from the previous note, we again use Chrome to analyze the page structure.
As the figure shows, the main content of each article is not on the home page; it can only be reached by following a link. So we need to extract each article's url, build a list of article links, and then use that list to scrape the articles themselves.
Looking at the page source, we find that the article links all live inside the div with class fp-one-articulo, so we walk every a tag inside that div and pull the url out of each tag.
def getArticlelist(page):
    # Collect the URLs of the ONE Article entries on the home page.
    article_list = []
    soup = BeautifulSoup(page, 'html.parser')
    for i in soup.find_all('div', class_='fp-one-articulo'):
        for j in i.find_all('a'):
            article_url = j['href']
            article_list.append(article_url)
    return article_list
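As a quick sanity check, the function can be driven directly from the front page. A minimal sketch (the home-page URL is taken from the full source below):

import requests

# Fetch the front page and list the article links it carries.
home_page = requests.get('http://www.wufazhuce.com/').content
for url in getArticlelist(home_page):
    print url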
Running this yields the article link list:
'http://wufazhuce.com/article/2818',
'http://wufazhuce.com/article/2819',
'http://wufazhuce.com/article/2816',
'http://wufazhuce.com/article/2810',
'http://wufazhuce.com/article/2808',
'http://wufazhuce.com/article/2815',
'http://wufazhuce.com/article/2812'
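Note that the links arrive in page order rather than by article id. If chronological order matters, here is one small sketch, assuming the trailing path segment is the numeric article id (as the URLs above suggest):

# Sort in place by the numeric id at the end of each URL; this relies
# on the 'http://wufazhuce.com/article/<id>' pattern holding.
article_list.sort(key=lambda u: int(u.rsplit('/', 1)[-1]))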
Getting the article content
With the article link list in hand, we iterate over it and parse each url, aiming to extract each article's author, title, and content.
def getArticle(url_list):
    # Visit every article URL and extract title, author and body text.
    artlist = []
    for url in url_list:
        page_article = requests.get(url).content
        soup = BeautifulSoup(page_article, 'html.parser')
        articulo = soup.find('div', class_='one-articulo')
        data = {
            'title': articulo.h2.text,
            'autor': articulo.p.text,
            'article': articulo.find('div', class_='articulo-contenido').text
        }
        artlist.append(data)
    return artlist
The function returns a list of dictionaries holding the title, author, and content of every article.
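To spot-check the result, one can peek at the first entry (artlist here is the value returned by getArticle, fed with the article_list built above):

# Quick inspection of the first scraped article.
artlist = getArticle(article_list)
print artlist[0]['title']
print artlist[0]['autor']
print artlist[0]['article'][:100]   # first 100 characters of the body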
The Question module
The Question module works the same way as the Article module above: first build the url list, then scrape each url.
def getQuestionlist(page):
    # Collect the URLs of the ONE Question entries on the home page.
    question_list = []
    soup = BeautifulSoup(page, 'html.parser')
    for i in soup.find_all('div', class_='fp-one-cuestion'):
        for j in i.find_all('a'):
            question_url = j['href']
            question_list.append(question_url)
    return question_list

def getQuestion(url_list):
    # Visit every question URL and extract title, brief and answer.
    queslist = []
    for url in url_list:
        page_question = requests.get(url).content
        soup = BeautifulSoup(page_question, 'html.parser')
        contenido = soup.find_all('div', class_='cuestion-contenido')
        data = {
            'ques_title': soup.find('div', class_='one-cuestion').h4.text,
            'ques_brief': contenido[0].text,
            'ques_content': contenido[1].text
        }
        queslist.append(data)
    return queslist
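Wired together, the question pipeline mirrors the article one (home_page is the same front page fetched earlier):

# Build the question url list from the front page, then scrape each.
question_list = getQuestionlist(home_page)
question_dict = getQuestion(question_list)
print question_dict[0]['ques_title']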
Merging the dictionaries
Above, we obtained separate dictionary lists for the ONE module, the ONE Article module, and the ONE Question module. How do we combine the three into a single dictionary object?
dict_list = []
for one, art, ques in zip(one_dict, article_dict, question_dict):
    dic = {}
    dic.update(one)    # fields from the ONE module
    dic.update(art)    # fields from the Article module
    dic.update(ques)   # fields from the Question module
    dict_list.append(dic)

for item in dict_list:
    for key in item:
        print key, ':', item[key]
This gives us the final dict_list data list.
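One caveat: zip stops at the shortest of its inputs, so if any module yields fewer entries than the others, the surplus entries are silently dropped. A minimal illustration:

# zip truncates to the shortest input list (Python 2 returns a list).
a = [1, 2, 3]
b = ['x', 'y']
print zip(a, b)   # [(1, 'x'), (2, 'y')] -- the 3 is dropped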
Source code
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

def getPage(url):
    # Fetch the raw HTML of a page.
    return requests.get(url).content

def getOne(page):
    # Parse the ONE module on the home page: image, daily sentence,
    # volume id and date of each item.
    one_list = []
    soup = BeautifulSoup(page, 'html.parser')
    for i in soup.find_all('div', class_='item'):
        onelist = i.find_all('a')
        image = onelist[0].img['src']
        word = onelist[1].text
        infolist = i.find_all('p')
        one_id = infolist[0].text
        date = infolist[1].text + ' ' + infolist[2].text
        data = {
            'image': image,
            'word': word,
            'id': one_id,
            'date': date
        }
        one_list.append(data)
    return one_list

def getArticlelist(page):
    # Collect the URLs of the ONE Article entries on the home page.
    article_list = []
    soup = BeautifulSoup(page, 'html.parser')
    for i in soup.find_all('div', class_='fp-one-articulo'):
        for j in i.find_all('a'):
            article_url = j['href']
            article_list.append(article_url)
    return article_list

def getQuestionlist(page):
    # Collect the URLs of the ONE Question entries on the home page.
    question_list = []
    soup = BeautifulSoup(page, 'html.parser')
    for i in soup.find_all('div', class_='fp-one-cuestion'):
        for j in i.find_all('a'):
            question_url = j['href']
            question_list.append(question_url)
    return question_list

def getArticle(url_list):
    # Visit every article URL and extract title, author and body text.
    artlist = []
    for url in url_list:
        page_article = requests.get(url).content
        soup = BeautifulSoup(page_article, 'html.parser')
        articulo = soup.find('div', class_='one-articulo')
        data = {
            'title': articulo.h2.text,
            'autor': articulo.p.text,
            'article': articulo.find('div', class_='articulo-contenido').text
        }
        artlist.append(data)
    return artlist

def getQuestion(url_list):
    # Visit every question URL and extract title, brief and answer.
    queslist = []
    for url in url_list:
        page_question = requests.get(url).content
        soup = BeautifulSoup(page_question, 'html.parser')
        contenido = soup.find_all('div', class_='cuestion-contenido')
        data = {
            'ques_title': soup.find('div', class_='one-cuestion').h4.text,
            'ques_brief': contenido[0].text,
            'ques_content': contenido[1].text
        }
        queslist.append(data)
    return queslist

if __name__ == '__main__':
    url = "http://www.wufazhuce.com/"
    dict_list = []
    one_page = getPage(url)
    one_dict = getOne(one_page)
    article_list = getArticlelist(one_page)
    article_dict = getArticle(article_list)
    question_list = getQuestionlist(one_page)
    question_dict = getQuestion(question_list)
    # Merge the three per-day records into one dictionary per day.
    for one, art, ques in zip(one_dict, article_dict, question_dict):
        dic = {}
        dic.update(one)
        dic.update(art)
        dic.update(ques)
        dict_list.append(dic)
    for item in dict_list:
        for key in item:
            print key, ':', item[key]
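The script assumes every request succeeds. For regular use it may be worth hardening getPage with a timeout and explicit error checking; a hedged variant:

def getPage(url):
    # Variant with a timeout and HTTP error checking; raises on 4xx/5xx
    # instead of silently returning an error page body.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content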
Problems
Although the module content is now handled, the small amount of data remains a problem. The ONE一個 site only exposes the most recent seven days of data. How can we get more, perhaps even a whole year's worth? And how should that much data be stored?
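For storage, one lightweight option is to dump each run's dict_list to a JSON file; and since the article URLs above look like sequential numeric ids, walking ids downward is one possible (untested) way to reach older entries. A sketch of both ideas, with save_dict_list and getOldArticleUrls as hypothetical helpers:

import json

def save_dict_list(dict_list, path='one_data.json'):
    # Persist one snapshot of the scraped records as JSON
    # (non-ASCII text is stored as \uXXXX escapes by default).
    with open(path, 'w') as f:
        json.dump(dict_list, f, indent=2)

def getOldArticleUrls(newest_id, count):
    # Hypothetical back-fill: assumes the sequential-id URL pattern
    # seen in the scraped links continues into the past (untested).
    return ['http://wufazhuce.com/article/%d' % i
            for i in range(newest_id, newest_id - count, -1)]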