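For the underlying principles, see the previous post.
Tools
XPath Helper, a Chrome extension (available from the Chrome Web Store)
Scraping news from the ifeng home page
Using the extension
To extract every item instead of a single one, just adjust the XPath expression.
How do we use this in Python?
The code is as follows: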
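As a minimal sketch of what "adjusting the XPath" means here: a path copied for one clicked item usually carries a positional index such as li[1], and dropping that index makes the same path match every sibling. The markup below is a made-up stand-in, not the real ifeng page.

```python
from lxml import etree

# Hypothetical markup standing in for a news list; not the real ifeng page.
SAMPLE_HTML = """
<div id="root">
  <ul>
    <li><a href="/c/1" title="First story">First story</a></li>
    <li><a href="/c/2" title="Second story">Second story</a></li>
    <li><a href="/c/3" title="Third story">Third story</a></li>
  </ul>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# With a positional index, only the first <li> is matched.
first_only = tree.xpath('//*[@id="root"]/ul/li[1]/a/@title')

# Without the index, every <li> under the <ul> is matched.
all_titles = tree.xpath('//*[@id="root"]/ul/li/a/@title')

print(first_only)
print(all_titles)
```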
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import json

import requests
from lxml import etree
from lxml.html import tostring  # serializes an element node to a string


def getNews():
    url = 'https://news.ifeng.com/'
    response = requests.get(url=url)
    html = response.content.decode('utf-8')
    news_tree = etree.HTML(html)
    # xpath() returns a list; if 20 items match, the list has length 20
    titles = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/a/@title')
    hrefs = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/a/@href')
    imgs = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/a/img/@src')
    times = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/div/div/time')
    tags = news_tree.xpath(
        '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li/div/div/span')
    # Walk the lists in step, build one dict per news item,
    # collect the dicts in a list, and return the list as JSON
    array = []
    for title, link, img, time_node, tag_node in zip(
            titles, hrefs, imgs, times, tags):
        dic = {'title': title, 'href': link, 'img': img,
               'time': time_node.text, 'tag': tag_node.text}
        array.append(dic)
    return json.dumps(array, ensure_ascii=False)


if __name__ == "__main__":
    jsonstring = getNews()
    print(jsonstring)
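One caveat with querying five separate lists: if any item lacks an image or a tag, the lists no longer line up item for item. A possible alternative, sketched here on made-up markup rather than the real ifeng page, is to select each <li> once and run relative XPath queries inside it. The names parse_items and first_or_none are hypothetical helpers introduced only for this illustration.

```python
import json

from lxml import etree

# Hypothetical markup; the second item deliberately has no <img>.
SAMPLE_HTML = """
<ul id="list">
  <li>
    <a href="/c/1" title="Story with image"><img src="/img/1.png"></a>
    <div><div><time>10:00</time><span>Source A</span></div></div>
  </li>
  <li>
    <a href="/c/2" title="Story without image"></a>
    <div><div><time>09:30</time><span>Source B</span></div></div>
  </li>
</ul>
"""


def first_or_none(values):
    """Return the first XPath result, or None when nothing matched."""
    return values[0] if values else None


def parse_items(html):
    """Build one dict per <li>, querying relative to that <li> only."""
    tree = etree.HTML(html)
    items = []
    for li in tree.xpath('//ul[@id="list"]/li'):
        items.append({
            'title': first_or_none(li.xpath('./a/@title')),
            'href': first_or_none(li.xpath('./a/@href')),
            'img': first_or_none(li.xpath('./a/img/@src')),
            'time': first_or_none(li.xpath('./div/div/time/text()')),
            'tag': first_or_none(li.xpath('./div/div/span/text()')),
        })
    return items


items = parse_items(SAMPLE_HTML)
print(json.dumps(items, ensure_ascii=False, indent=2))
```

With this shape a missing field shows up as null in the JSON for that one item instead of shifting the fields of every item after it.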
The printed output looks like this:
[{
    "title": "綠地回應被舉報高管貪腐問題:調查中 不會姑息",
    "href": "http://news.ifeng.com/c/7weTelvvWbY",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/CDB7AA8A2B55483B843DAF99CE559E11_w698_h392.png",
    "time": "今天 12:05",
    "tag": "中國新聞網"
}, {
    "title": "美國抗議者在白宮外放裝尸袋辦“葬禮” 問責政府抗疫不力",
    "href": "http://news.ifeng.com/c/7weTH6IwesH",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/0370BFA6C72EAB721A55DB02731CED811930349E_w698_h392.png",
    "time": "今天 12:05",
    "tag": "環球網"
}, {
    "title": "張文宏:各地有偶發病例是大概率事件,應長期保持適當社交距離",
    "href": "http://news.ifeng.com/c/7weRdCzXkJc",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/1F2D720F73E54AF8956B39DB212606C6_w690_h387.jpg",
    "time": "今天 11:37",
    "tag": "張文宏醫生"
}, {
    "title": "又美又有才,難道她就是特朗普的“完美”發言人?",
    "href": "http://news.ifeng.com/c/7weRLg43Viq",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/2CB1E314289C482395DE1CD313E0CCD2_w698_h392.jpg",
    "time": "今天 11:33",
    "tag": "冰汝看美國"
}, {
    "title": "美國傳染病專家福奇兩周未接受采訪,美媒懷疑其被禁聲",
    "href": "http://news.ifeng.com/c/7weOmUniq6O",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/9DE3F1B1A36F4832BA4DF6D12267D80C_w698_h392.jpg",
    "time": "今天 11:15",
    "tag": "澎湃新聞"
}, {
    "title": "酒駕致廣東援鄂醫生王爍殉職案開庭 被告曾以涉嫌交通肇事罪被批捕",
    "href": "http://news.ifeng.com/c/7wePraJu7kG",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/3968E83E17854629AC6BDEE647F8C3B4_w698_h392.png",
    "time": "今天 11:10",
    "tag": "南方都市報"
}, {
    "title": "全國政協會議將為抗疫犧牲烈士和逝世同胞默哀一分鐘",
    "href": "http://news.ifeng.com/c/7wePxtxWRZA",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/4F5C7FED0DA045EE96DDD311B4542436_w533_h299.jpg",
    "time": "今天 11:09",
    "tag": "工人日報"
}, {
    "title": "全國人大代表姚勁波:降低公積金繳存比例,減輕企業經營負擔",
    "href": "http://news.ifeng.com/c/7wePkyXwfho",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/740C78C4AE2878A548CAFB829EA511B7B5405646_w698_h392.jpg",
    "time": "今天 11:08",
    "tag": "澎湃新聞網"
}, {
    "title": "人民日報:把“黑暴”趕出香港,得從根上拔除“毒瘤”",
    "href": "http://news.ifeng.com/c/7wePQZK5wUS",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/07991C78F2EA42DB85E525BE4E847C6F_w600_h336.jpg",
    "time": "今天 11:05",
    "tag": "人民日報"
}, {
    "title": "華為美國高管:美國斷供我們能挺過去,不過大量美國人會失業",
    "href": "http://news.ifeng.com/c/7wePHsPV6UC",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/01D0AD2B338D469286850CC5CD8F19AE_w569_h319.jpg",
    "time": "今天 11:04",
    "tag": "環球網"
}, {
    "title": "人大代表建議:取消生育三孩以上的處罰政策 國家給予育兒補貼",
    "href": "http://news.ifeng.com/c/7weNTNgpLOi",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/630A01F7A7A78464A6D06536A6A6873858EFD058_w698_h392.jpg",
    "time": "今天 11:00",
    "tag": "新京報"
}, {
    "title": "瘋狂的頭盔:我10天賺了800萬",
    "href": "http://news.ifeng.com/c/7weOSjP6hN2",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/9CCED3FB9DFF4554B5C9F9FC4599B608_w512_h287.jpg",
    "time": "今天 10:53",
    "tag": "縱相新聞"
}, {
    "title": "美國加州聯邦參議員提議案 譴責“中國病毒”等詞匯指稱新冠",
    "href": "http://news.ifeng.com/c/7weNkE8jFA0",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/FF274567E4C0492496E65C1ECF119A87_w698_h392.jpg",
    "time": "今天 10:44",
    "tag": "中國日報網"
}, {
    "title": "王學坤委員:建議建立農民退休制度 讓65歲以上農民“洗腳上田,老有所養”",
    "href": "http://news.ifeng.com/c/7weNbtSo7v6",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/75DF69024398483AAA24F657C0EC764F_w602_h338.png",
    "time": "今天 10:39",
    "tag": "最高人民檢察院"
}, {
    "title": "特朗普叫囂“中國有個瘋子”,評論區翻車",
    "href": "http://news.ifeng.com/c/7weN4fqF7BI",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/4969C19002F340678BCC48B9266B1D2C_w698_h392.jpg",
    "time": "今天 10:32",
    "tag": "觀察者網"
}, {
    "title": "軍報頭版評論:“蓬佩奧們”邊喊抓賊邊做賊,下場注定可悲",
    "href": "http://news.ifeng.com/c/7weMvlBWShM",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/636D31AEB733412F96581882FCEFC64E_w698_h392.png",
    "time": "今天 10:31",
    "tag": "解放軍報"
}, {
    "title": "特殊時期的中國兩會 外媒都在關注這些",
    "href": "http://news.ifeng.com/c/7weMghINhz6",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/D4270B52C0C14B72909202DBABECE1B6_w698_h392.jpg",
    "time": "今天 10:28",
    "tag": "央視新聞客戶端"
}, {
    "title": "雷軍建議:進一步降低民營企業進入衛星互聯網門檻",
    "href": "http://news.ifeng.com/c/7weL0NllooO",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/AB6E6070DCCBE7465220469B2578CD910EA67390_w698_h392.jpg",
    "time": "今天 10:20",
    "tag": "澎湃新聞"
}, {
    "title": "北京15座王府14座被占,政協委員:應設騰退協調機構",
    "href": "http://news.ifeng.com/c/7weKz9vBGm5",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/429BCCAD3FC016669563C909F36859F71B506DE0_w698_h392.jpg",
    "time": "今天 10:20",
    "tag": "新京報"
}, {
    "title": "荷蘭政府:水貂可能將新冠病毒傳給人 清查所有養殖場",
    "href": "http://news.ifeng.com/c/7weKeI1Yr6D",
    "img": "http://d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/09FBF81BFE594527AAF2C36D2ED4EEDF_w519_h291.jpg",
    "time": "今天 10:16",
    "tag": "觀察者網"
}]
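What if you also need the full article?
Option 1: return it directly in the list. That is, inside getNews(), first collect the links in hrefs, then loop over them and, for each href, fetch that page and parse it with lxml again. This is not very friendly when the result is returned straight to a client, for two reasons: the JSON payload becomes too large, and the wait becomes too long.
Option 2: write a separate scraping function that takes the URL of the article page.
The code for fetching the article details is as follows: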
def getNewsContent(url):
    # Relies on the requests, etree and tostring imports from the listing above
    response = requests.get(url=url)
    html = response.content.decode('utf-8')
    news_content_tree = etree.HTML(html)
    # This XPath is expected to match exactly one article element,
    # so take the first result
    content = news_content_tree.xpath(
        '//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]')[0]
    # tostring() returns bytes by default, and str() on bytes gives "b'...'".
    # encoding='unicode' returns a str directly, so no slicing is needed.
    return tostring(content, encoding='unicode')
The printed result looks like this:
<div class="main_content-LcrEruCc"><div><div class="text-3zQ3cZD4"><p>近日,21岁的冼嘉豪因暴动罪被香港法院判刑4年,他在求情信中说:“没有一天不后悔”。2019年6月至2020年4月15日,8001人被捕,1365人被起诉,566人被控暴动罪。个体的悲剧还在持续上演,数字的揪心让人持久难平,一场“修例风波”造就的暴力旋涡,已让多少香港年轻人命运脱轨、前途毁弃。</p><p>曾经拥有的东西因为参与非法暴力活动而丧失,一直拥有的生活因为暴力破坏而止步,狮子山下的纷乱伤害了多少逐梦路上的人。回望香港“修例风波”,正是因为反中乱港分子鼓吹暴力、煽惑暴力,被洗脑的年轻人迷信暴力、使用暴力,香港才结出了孩子有家难回、有梦难圆,市民有工难开、无工可开的苦果,让繁荣稳定的香港陷入危机困境。</p><p><img src="https://x0.ifengimg.com/ucms/2020_21/A1688E829DE205EEBC309384E3783FE8BA15437D_w1080_h1920.jpg"></p><p>这是香港市民想要的吗?最基本的安全被剥夺,出行怕有人又去砸地铁,营业怕黑衣人又来打砸,饭桌上有不同政见也不敢轻易发表,校园里竟成了“兵工厂”;人被贴上标签,店被贴上标签,被起底、被排斥、被攻击,在所谓“私了”和“装修”之下,黑色恐怖的利刃戳进市民的心,让人普遍变得焦虑、恐惧。因为暴徒,个人这小家被黑暗包裹,因为暴力,香港这个大家已满目疮痍,怎能不让人心痛、不让人愤慨,不让人期盼香港重归祥和安定!</p><p>在“修例风波”中,人们已经看尽暴力的危害、暴徒的凶残。特区政府警务处处长邓炳强此前表示,香港正面临本土恐怖主义的威胁,威胁到香港市民的人身安全,也在对国家安全造成冲击。反暴力,是因为暴力已渗透进香港市民的日常生活,危险近在咫尺;是因为暴力还有延续、扩散和升级的可能,要摧毁家园;是因为暴力不止,暴徒将更加猖狂,反中乱港分子将更加嚣张,香港要葬送掉一代代人辛苦建立的基业,辉煌篇章被恐怖主义湮灭。</p><p>通过香港警方严正执法,香港暴徒的气焰已被压制;由于香港市民拥护止暴制乱,香港暴力的土壤正被逐步铲除。但发生在香港的暴力并未绝迹,蠢蠢欲动的暴徒还在伺机而动。5月份前后,人们又看到了暴徒投掷的燃烧弹,看到了暴徒寄出的恐吓邮件。香港市民需要强化共识,一起向暴力说不;香港警方需要再接再厉,不给暴徒任何喘息之机。更需从根本上想办法,根治“黑暴”这个毒瘤。只有让暴徒、暴力成过街老鼠、众矢之的,纵暴、施暴的人付出沉重的代价,香港才有岁月静好,市民才能安心生活。</p></div><span></span><div class="end-37GBinZ_"></div></div></div>
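If a client needs plain text and image links rather than raw HTML, the fragment returned by getNewsContent could be parsed once more. This is only a rough sketch: the fragment below is a shortened, made-up example in the same shape as the output above, and split_content is a hypothetical helper name.

```python
from lxml import html as lxml_html

# Made-up fragment shaped like the article HTML above (class names simplified).
SAMPLE_FRAGMENT = (
    '<div class="main_content"><div><div class="text">'
    '<p>First paragraph.</p>'
    '<p><img src="https://example.com/a.jpg"></p>'
    '<p>Second paragraph.</p>'
    '</div></div></div>'
)


def split_content(fragment):
    """Split an article fragment into paragraph texts and image URLs."""
    root = lxml_html.fromstring(fragment)
    # text_content() flattens a <p> and everything nested in it to plain text.
    paragraphs = [p.text_content().strip() for p in root.xpath('.//p')]
    # Drop paragraphs that only wrap an image and therefore have no text.
    paragraphs = [text for text in paragraphs if text]
    images = list(root.xpath('.//p/img/@src'))
    return {'paragraphs': paragraphs, 'images': images}


result = split_content(SAMPLE_FRAGMENT)
print(result)
```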