Scraping dynamic web pages with Python

1. First, download PhantomJS and install Selenium, and put the phantomjs executable in a directory that is on your PATH environment variable.
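If the phantomjs executable is not on the PATH, Selenium can also be pointed at it directly through executable_path. A minimal check that the setup works (the install location here is only an example, not a required path):

from selenium import webdriver

# hypothetical install location; adjust to wherever phantomjs.exe actually lives
driver = webdriver.PhantomJS(executable_path='C:/phantomjs/bin/phantomjs.exe')
driver.get('http://tv.sohu.com')
print(driver.title)   # a page title printed here means PhantomJS and Selenium are wired up
driver.quit()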

2. Fetch a single page after its JavaScript has executed.

from bs4 import BeautifulSoup as bs
from selenium import webdriver

def getHtml(url):
    # PhantomJS runs the page's JavaScript before we read the source
    driver = webdriver.PhantomJS()
    driver.get(url)
    html = driver.page_source
    driver.quit()
    return html

page_info = getHtml('http://tv.sohu.com/item/MTIwNjUzMg==.html')
soup = bs(page_info, 'html5lib')
tp = soup.find('em', 'total-play').string
print(tp)

The play count now appears in the rendered source, so the scrape succeeds.

3. Scraping multiple pages

Because the crawler has to drive a browser and wait for the JavaScript to finish loading, processing a page before it has fully loaded yields nothing. Guaranteeing a complete load means adding wait times, which makes crawling slow. Rewriting the code with multiprocessing speeds things up, but throughput is still limited.
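One way to pay less for the wait is to replace the fixed time.sleep with Selenium's explicit waits, which return as soon as the target element is rendered. A minimal sketch, assuming the play count is still the em.total-play element from step 2:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def getHtmlWhenReady(url, timeout=30):
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        # block until the JS-rendered play-count element exists, at most `timeout` seconds
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'em.total-play')))
        return driver.page_source
    finally:
        driver.quit()

The full multi-page script below keeps the original fixed sleeps.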

from urllib import request
import urllib
from bs4 import BeautifulSoup as bs
import re  
import pandas as pd
import time
import random
import multiprocessing
from itertools import chain
from selenium import webdriver
import sys
sys.setrecursionlimit(10000000)
def urlAdd():
  list_type=['1100','1101','1102','1103','1104','1105','1106','1107','1108','1109','1110','1111','1112','1113','1114'
  ,'1115','1116','1117','1118','1119','1120','1121','1122','1123','1124','1125','1127','1128']
  list_loc=['1000','1001','1002','1003','1004','1015','1007','1006','1014']
  list_time=['2017','2016','2015','2014','2013','2012','2011','2010','11','90','80','1']
  return list_loc,list_type,list_time
  
def PageCreate():
   urlsys=[]
   list_loc,list_type,list_time=urlAdd()
   for loc in list_loc:
     for type_1 in list_type:
       for time_1 in list_time:
            url1='http://so.tv.sohu.com/list_p1101_p210%s_p3%s_p4%s_p5_p6_p7_p8_p92_p101_p11_p12_p13.html'%(type_1,loc,time_1)
            urlsys.append(url1)
   return urlsys 
   
def urlsPages(url):
  url_hrefs=[]
  time.sleep(5+random.uniform(-1,1)) 
  req=urllib.request.Request(url)
  req.add_header("Origin","http://so.tv.sohu.com")
  req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2657.3 Safari/537.36')
  resp=request.urlopen(req,timeout=300)
  html=resp.read().decode("utf-8")
  soup = bs(html, 'lxml') 
  if soup.find_all(attrs={'title':'下一頁'})==[]:
     page=1
  else:
     page=int(soup.find_all(attrs={'title':'下一頁'})[-1].find_previous_sibling().string)
  for i in range(page):
      url_page=url.replace('p101',('p10'+str(i+1)))
      time.sleep(5+random.uniform(-1,1)) 
      print(url_page)
      req=urllib.request.Request(url_page)
      req.add_header("Origin","http://so.tv.sohu.com")
      req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2657.3 Safari/537.36')
      resp=request.urlopen(req,timeout=300)
      html=resp.read().decode("utf-8")
      soup = bs(html, 'html5lib') 
      hrefs=soup.find_all(attrs={'pb-url':'meta$$list'})
      for hreftq in hrefs:
          url_href=hreftq.get('href')
          url_hrefs.append("http:"+url_href)
  return url_hrefs
  
  
def getInfo(url):
    info=[]
    cap = webdriver.DesiredCapabilities.PHANTOMJS
    cap["phantomjs.page.settings.resourceTimeout"] = 100000
    cap["phantomjs.page.settings.loadImages"] = False
    cap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2657.3 Safari/537.36"
    driver=webdriver.PhantomJS(desired_capabilities=cap)
    driver.implicitly_wait(30)
    driver.get(url) 
    time.sleep(3+random.uniform(-1,1))
    ps=driver.page_source
    driver.quit()   # release the PhantomJS process; otherwise every call leaks one
    soup = bs(ps,'html5lib')
    print('total-play found:', soup.find_all('em','total-play'))
    if(len(soup.find_all('span','vname'))>=1):
       name=soup.find_all('span','vname')[0].getText()
       info.append(name)
    else:
       info.append('missing')
    if(len(soup.find_all('em','total-play'))>=1):
       tp=soup.find_all('em','total-play')[0].getText()
       info.append(tp)
    else:
       info.append('missing')
    print(info) 
    return info
   
if __name__ == "__main__":
    url_hrefss=[]
    urlsys=PageCreate()
    print(urlsys)
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    url_hrefss.append(pool.map(urlsPages,urlsys))
    pool.close()
    pool.join()
    url_links = list(chain(*url_hrefss))
    url_links = list(chain(*url_links))
    url_links = [i for i in url_links if i != []]
    url_links = list(set(url_links))
    print('all detail pages', url_links)
    infos=[]
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    infos.extend(pool.map(getInfo,url_links))
    pool.close()
    pool.join()
    print(infos)
    sohu_infos=pd.DataFrame(infos)
    sohu_infos.to_csv("c:/tv_his_sohu.csv")
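A refinement that is not part of the script above, so treat it as an assumption: instead of launching a new PhantomJS inside getInfo for every URL, each worker process can keep one driver alive via the pool's initializer, which removes most of the start-up cost.

import multiprocessing
from selenium import webdriver

_driver = None   # one PhantomJS instance per worker process

def init_worker():
    global _driver
    _driver = webdriver.PhantomJS()

def fetch_source(url):
    # reuse the worker's long-lived driver instead of spawning a new one per URL
    _driver.get(url)
    return _driver.page_source

if __name__ == "__main__":
    urls = ['http://tv.sohu.com/item/MTIwNjUzMg==.html']   # in practice: url_links
    pool = multiprocessing.Pool(multiprocessing.cpu_count(), initializer=init_worker)
    pages = pool.map(fetch_source, urls)
    pool.close()
    pool.join()

The sketch does not quit the drivers; a production version would also terminate them when the pool shuts down.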

4. Getting the data from the JSON endpoint
The detail page loads a JSON file to display the play count.



Each plids value corresponds to one record.



The plids value can be extracted from the page source; splicing it into the corresponding position of the JSON URL returns the information we need.
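A condensed sketch of that idea for a single detail page; it follows the same steps as the full script below (extract playlistId from the inline script tag, then call the count endpoint):

import re
from urllib import request
from bs4 import BeautifulSoup as bs

url = 'http://tv.sohu.com/item/MTIwNjUzMg==.html'
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://tv.sohu.com'}

html = request.urlopen(request.Request(url, headers=headers), timeout=60).read()
soup = bs(html, 'html5lib')
# the playlist id sits in an inline <script> as: playlistId="...";
plids = re.findall('playlistId="(.+)";', soup.script.getText())[0]

url_json = ('http://count.vrs.sohu.com/count/queryext.action'
            '?callback=playCountVrs&plids=' + plids)
resp = request.urlopen(request.Request(url_json, headers=headers),
                       timeout=60).read().decode('utf-8')
print(resp)   # JSONP wrapper that contains the play count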
from urllib import request
import urllib
from bs4 import BeautifulSoup as bs
import re  
import pandas as pd
import time
import random
import sys
import multiprocessing
from itertools import chain

sys.setrecursionlimit(10000000)
def urlAdd():
  list_type=['1100','1101','1102','1103','1104','1105','1106','1107','1108','1109','1110','1111','1112','1113','1114'
  ,'1115','1116','1117','1118','1119','1120','1121','1122','1123','1124','1125','1127','1128']
  list_loc=['1000','1001','1002','1003','1004','1015','1007','1006','1014']
  list_time=['2017','2016','2015','2014','2013','2012','2011','2010','11','90','80','1']
  return list_loc,list_type,list_time
  
def PageCreate():
   urlsys=[]
   list_loc,list_type,list_time=urlAdd()
   for loc in list_loc:
     for type_1 in list_type:
       for time_1 in list_time:
            url1='http://so.tv.sohu.com/list_p1101_p210%s_p3%s_p4%s_p5_p6_p7_p8_p92_p101_p11_p12_p13.html'%(type_1,loc,time_1)
            urlsys.append(url1)
   return urlsys 
   
def urlsPages(url):
  url_hrefs=[]
  time.sleep(2+random.uniform(-1,1)) 
  req=urllib.request.Request(url)
  req.add_header("Origin","http://so.tv.sohu.com")
  req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2657.3 Safari/537.36')
  resp=request.urlopen(req,timeout=180)
  html=resp.read().decode("utf-8")
  soup = bs(html, 'lxml') 
  if soup.find_all(attrs={'title':'下一頁'})==[]:
     page=1
  else:
     page=int(soup.find_all(attrs={'title':'下一頁'})[-1].find_previous_sibling().string)
  for i in range(page):
      url_page=url.replace('p101',('p10'+str(i+1)))
      time.sleep(2+random.uniform(-1,1)) 
      print('collecting page', url_page)
      req=urllib.request.Request(url_page)
      req.add_header("Origin","http://so.tv.sohu.com")
      req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2657.3 Safari/537.36')
      resp=request.urlopen(req,timeout=180)
      html=resp.read().decode("utf-8")
      soup = bs(html, 'html5lib') 
      hrefs=soup.find_all(attrs={'pb-url':'meta$$list'})
      for hreftq in hrefs:
          url_href=hreftq.get('href')
          url_hrefs.append("http:"+url_href)
  return url_hrefs
  
def getInfo(url):
  info=[]
  user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
  headers={"User-Agent":user_agent,'Referer':"http://tv.sohu.com"}
  #time.sleep(2+random.uniform(-1,1))
  req=urllib.request.Request(url,headers=headers)
  resp=request.urlopen(req,timeout=300)
  html=resp.read()
  soup = bs(html, 'html5lib')
  try:
     title=soup.find('span','vname').getText()
     plids=soup.script.getText()
     match=re.findall('playlistId="(.+)";',plids)[0]
     url_json='http://count.vrs.sohu.com/count/queryext.action?callback=playCountVrs&plids='+match
     data = urllib.request.Request(url=url_json,headers=headers)
     resp=request.urlopen(data,timeout=300).read().decode("utf-8")
     total = re.findall(r'(\w*[0-9]+)\w*',resp)[1]
  except Exception as e:
      print(e,url)
      title="missing" 
      total="missing"
  info.append(title)
  info.append(total)
  print('info', info)
  return info
  
if __name__ == "__main__":
    print('start')
    url_hrefss=[]
    urlsys=PageCreate()
    print(urlsys)
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    url_hrefss.append(pool.map(urlsPages,urlsys))
    pool.close()
    pool.join()
    url_links = list(chain(*url_hrefss))
    url_links = list(chain(*url_links))
    url_links = [i for i in url_links if i != []]
    url_links = list(set(url_links))
    print('all detail pages', url_links)
    url_exp=pd.Series(url_links)
    url_exp.to_csv("c:/tv_his_sohu_url_exp.csv")
    infos=[]
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    infos.extend(pool.map(getInfo,url_links))
    pool.close()
    pool.join()
    print(infos)
    sohu_infos=pd.DataFrame(infos)
    sohu_infos.to_csv("c:/tv_his_sohu.csv")
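The regex over the count response simply grabs the second number in the text. Since the endpoint returns a JSONP wrapper (playCountVrs(...)), a somewhat more robust alternative, assuming the payload inside the parentheses is valid JSON, is to strip the callback and parse it:

import json
import re

def parse_jsonp(text, callback='playCountVrs'):
    # strip the "playCountVrs( ... )" wrapper and parse the JSON payload inside
    m = re.search(r'%s\((.*)\)\s*;?\s*$' % re.escape(callback), text, re.S)
    return json.loads(m.group(1)) if m else None

The play count would then be read by field name rather than by its position in a regex match; the exact field name is not shown in this write-up, so that part is left open.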

Finally, the results are written out as a CSV file.
