Python Web Scraping: BeautifulSoup (Part 3)

I. Introduction to Beautiful Soup

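Beautiful Soup is a Python library whose main job is extracting data from web pages.
The official description reads:
Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically. In that case you only need to state the original encoding.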
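
As a minimal sketch of stating the original encoding yourself: the snippet below feeds Beautiful Soup bytes in a legacy encoding and names that encoding through the from_encoding argument. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Bytes in a legacy encoding, with no <meta charset> to declare it
raw = "<h1>Sacré bleu!</h1>".encode("iso-8859-1")

# from_encoding tells Beautiful Soup how to decode the input
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

heading = soup.h1.string            # decoded to a Unicode str
utf8_bytes = soup.encode("utf-8")   # serialized back out as UTF-8 bytes

print(heading)
print(utf8_bytes)
```

Passing from_encoding is only needed when automatic detection guesses wrong; for well-declared pages it can be left out.
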
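Beautiful Soup sits on top of Python parsers such as lxml and html5lib, so you can choose between more flexible parsing strategies and raw speed.
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, Python's built-in parser is used. The lxml parser is more capable and faster, so it is the recommended choice.
Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/latest/
Main parsers for BeautifulSoup4: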

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 and 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only supported XML parser	Requires an external C library
html5lib parser	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires the external html5lib Python package
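
To see how the parsers differ in practice, the sketch below runs the same invalid snippet through each parser name from the table and skips any parser that is not installed. The helper name parse_with and the sample markup are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with(markup, parser):
    """Return the repaired markup, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None

broken = "<a></p>"  # invalid: a closing </p> with no opening tag
results = {name: parse_with(broken, name)
           for name in ("html.parser", "lxml", "html5lib")}

for name, repaired in results.items():
    print(name, "->", repaired if repaired is not None else "(not installed)")
```

Each parser repairs broken markup in its own way, so it is worth naming the parser explicitly when results need to be reproducible across machines.
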

II. Installing the modules

Required: the lxml module.
Note: in the author's environment, lxml 4.3.2 did not include etree.
Install 3.7.2 instead:

(film) C:\Users\Administrator>conda install lxml==3.7.2
# Install lxml
pip install lxml
# Python 2.x (Beautiful Soup 3)
pip install BeautifulSoup
# Python 3.x: install this package instead
pip install beautifulsoup4

II. Using BeautifulSoup4

Suppose we have the following HTML, saved as index.html:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link  rel="stylesheet" type="text/css" />
    <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
  <div id="wrapper">
    <div id="head">
        <div class="head_wrapper">
          <div id="u1">
            <a class="mnav"  name="tj_trnews">新聞 </a>
            <a class="mnav"  name="tj_trhao123">hao123 </a>
            <a class="mnav"  name="tj_trmap">地圖 </a>
            <a class="mnav"  name="tj_trvideo">視頻 </a>
            <a class="mnav"  name="tj_trtieba">貼吧 </a>
            <a class="bri"  name="tj_briicon" style="display: block;">更多產品 </a>
          </div>
        </div>
    </div>
  </div>
</body>
</html>

Create a BeautifulSoup object and read content from it:

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")
print(bs.prettify())     # pretty-print with indentation
print(bs.title)          # the whole title tag
print(bs.title.name)     # the name of the title tag
print(bs.title.string)   # the text inside the title tag
print(bs.head)           # the whole head tag
print(bs.div)            # the first div tag and everything in it
print(bs.div["id"])      # the id attribute of the first div tag
print(bs.a)              # the first a tag
print(bs.find_all("a"))  # every a tag
print(bs.find(id="u1"))  # the tag with id="u1"
for item in bs.find_all("a"):
    print(item.get("href"))  # the href of each a tag (None if the tag has no href)
for item in bs.find_all("a"):
    print(item.get_text())   # the text of each a tag

III. The four object types in BeautifulSoup4

BeautifulSoup4 turns a complex HTML document into a tree of Python objects. Every node falls into one of four types:
Tag
NavigableString
BeautifulSoup
Comment

1. Tag

Put simply, a Tag is one of the tags in the HTML. For example:

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

# The whole title tag
print(bs.title)

# The whole head tag
print(bs.head)

# The first a tag
print(bs.a)

# Its type
print(type(bs.a))

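Appending a tag name to the soup object (bs in these examples) returns that tag and its content. These objects are of type bs4.element.Tag.
Note, however, that this only finds the first matching tag in the document.
A Tag has two important attributes, name and attrs: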

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

# The bs object itself is special: its name is [document]
print(bs.name)
# For any other tag, name is the tag's own name, here head
print(bs.head.name)
# Print all attributes of the first a tag; the result is a dict
print(bs.a.attrs)
# Read one attribute by key or with get(); the two are equivalent when the attribute exists
print(bs.a['class']) # same as bs.a.get('class')
# Attributes and content can be modified
bs.a['class'] = "newClass"
print(bs.a)
# An attribute can also be deleted
del bs.a['class']
print(bs.a)
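
The difference between indexing and get() only shows up when an attribute is missing. Here is a small sketch on a one-tag document (the inline markup stands in for index.html); it also shows that class comes back as a list, because a tag can carry several classes.

```python
from bs4 import BeautifulSoup

html = '<a class="mnav" name="tj_trnews">news</a>'
a = BeautifulSoup(html, "html.parser").a

css_classes = a["class"]   # class can hold several values, so it is a list
missing = a.get("href")    # get() returns None for a missing attribute

try:
    a["href"]              # indexing raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(css_classes, missing, raised)
```
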

2. NavigableString

Use .string to get the text inside a tag; the result is a NavigableString.

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

print(bs.title.string)
print(type(bs.title.string))

3. BeautifulSoup

A BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a special kind of Tag, and read its type, name, and attributes:

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

print(type(bs.name))
print(bs.name)
print(bs.attrs)

4. Comment

A Comment object is a special type of NavigableString; its output does not include the comment markers.

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

print(bs.a)
# This only works when there are no spaces or newlines around the comment, i.e. the a tag looks like:
# <a class="mnav"  name="tj_trnews"><!--新聞--></a>
print(bs.a.string)          # 新聞
print(type(bs.a.string))    # <class 'bs4.element.Comment'>
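
Because .string prints the same for a comment and for ordinary text, code that must tell them apart can check the type. A minimal sketch, using a made-up two-link snippet and a hypothetical helper named is_comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<div><a id="c"><!--News--></a><a id="t">News</a></div>'
soup = BeautifulSoup(html, "html.parser")

def is_comment(tag):
    """True when the tag's only child is an HTML comment."""
    return isinstance(tag.string, Comment)

flags = {a["id"]: is_comment(a) for a in soup.find_all("a")}
print(flags)
```
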

IV. Traversing the document tree

1. .contents: returns all direct children of a Tag as a list

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

# The .contents attribute returns a tag's children as a list
print(bs.head.contents)
# Index into the list to get a single child
print(bs.head.contents[1])

2. .children: returns all direct children of a Tag as an iterator

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

for child in  bs.body.children:
    print(child)

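3. .descendants: iterates over all descendants of a Tag (children, grandchildren, and so on)

4. .strings: if a Tag contains more than one string, that is, there is text in its descendants, use this to iterate over them

5. .stripped_strings: same as .strings, but strips surrounding whitespace and skips whitespace-only strings

6. .parent: the parent of a Tag

7. .parents: walks up through all ancestors of an element; returns a generator

8. .previous_sibling: the previous node at the same level as the current Tag. This is often a string or whitespace rather than a tag: in practice it is usually the comma and newline between the current tag and the previous one

9. .next_sibling: the next node at the same level as the current Tag. This is often a string or whitespace rather than a tag: in practice it is usually the comma and newline between the current tag and the next one

10. .previous_siblings: all siblings before the current Tag; returns a generator

11. .next_siblings: all siblings after the current Tag; returns a generator

12. .previous_element: the object (string or tag) parsed immediately before this one; it may be the same as .previous_sibling, but usually is not

13. .next_element: the object (string or tag) parsed immediately after this one; it may be the same as .next_sibling, but usually is not

14. .previous_elements: returns a generator that walks backward through the document in parse order

15. .next_elements: returns a generator that walks forward through the document in parse order

16. .has_attr(name): checks whether a Tag has the given attribute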
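
Several of the attributes listed above can be tried out on one small snippet. The markup below is a made-up stand-in for the u1 block of index.html, written on a single line so that the whitespace between tags is predictable.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<div id="u1"><a name="n1">News</a>, <a name="n2">More</a></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

parent_id = first.parent["id"]                    # .parent
ancestors = [p.name for p in first.parents]       # .parents
sibling = first.next_sibling                      # the text between the two links
element = first.next_element                      # the string inside the first link
texts = list(soup.div.stripped_strings)           # .stripped_strings
tag_names = [d.name for d in soup.div.descendants if isinstance(d, Tag)]
has_href = first.has_attr("href")                 # .has_attr()

print(parent_id, ancestors, repr(sibling), repr(element), texts, tag_names, has_href)
```

Note how next_sibling and next_element differ: the sibling is the ", " between the links, while the next parsed element is the text inside the first link.
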

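V. Searching the document tree

1. find_all(name, attrs, recursive, text, **kwargs)
The examples above gave a quick look at find_all. This section covers more of what it can do, starting with filters. Filters run through the entire search API and can be applied to a tag's name, a node's attributes, and so on.
(1) The name parameter:
String filter: finds tags whose name exactly matches the string: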

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

a_list = bs.find_all("a")
print(a_list)

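Regular expression filter: if you pass a compiled regular expression, BeautifulSoup4 matches tag names with its search() method, so re.compile("a") matches every tag whose name contains the letter a

from bs4 import BeautifulSoup
# regular expressions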
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

t_list = bs.find_all(re.compile("a"))
for item in t_list:
   print(item)
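
Since the pattern is applied with search(), an unanchored pattern matches more tags than you might expect. The sketch below compares an unanchored and an anchored pattern on a made-up document:

```python
import re
from bs4 import BeautifulSoup

html = ("<html><head><meta charset='utf-8'><title>t</title></head>"
        "<body><a>x</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Any tag whose name contains the letter "a"
contains_a = [t.name for t in soup.find_all(re.compile("a"))]
# Only tags whose name is exactly "a"
exactly_a = [t.name for t in soup.find_all(re.compile("^a$"))]

print(contains_a)
print(exactly_a)
```
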

List filter: if you pass a list, BeautifulSoup4 returns every node that matches any item in the list

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

t_list = bs.find_all(["meta","link"])
for item in t_list:
    print(item)

Function filter: pass a function that takes a tag and returns True for the tags you want

from bs4 import BeautifulSoup

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

# Define the filter function
def name_is_exists(tag):
    return tag.has_attr("name")

# Use it as the filter
t_list = bs.find_all(name_is_exists)
for item in t_list:
    print(item)

(2) Keyword arguments (kwargs):

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

# Find tags with id="head"
t_list = bs.find_all(id="head")
print(t_list)
# Find tags whose href attribute contains http://news.baidu.com
t_list = bs.find_all(href=re.compile("http://news.baidu.com"))
print(t_list)
# Find every tag that has a class attribute (class is a Python keyword, so the argument is spelled class_)
t_list = bs.find_all(class_=True)
for item in t_list:
    print(item)

(3) The attrs parameter:
Not every attribute can be searched this way. HTML data-* attributes, for example, are not valid Python keyword argument names:

t_list = bs.find_all(data-foo="value")

Running this code raises a SyntaxError. Instead, pass a dict through the attrs parameter to search for tags with such attributes:

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

t_list = bs.find_all(attrs={"data-foo":"value"})
for item in t_list:
    print(item)
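
The sample index.html has no data-foo attribute, so the search above returns an empty list there. A self-contained sketch with made-up markup shows the attrs filter matching, both for an exact value and for "attribute present with any value":

```python
from bs4 import BeautifulSoup

html = ('<div data-foo="value">first</div>'
        '<div data-foo="other">second</div>'
        '<div>third</div>')
soup = BeautifulSoup(html, "html.parser")

# Exact value match
exact = [t.get_text() for t in soup.find_all(attrs={"data-foo": "value"})]
# True matches any tag that has the attribute at all
any_value = [t.get_text() for t in soup.find_all(attrs={"data-foo": True})]

print(exact)
print(any_value)
```
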

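(4) The text parameter:
The text parameter searches the strings in the document. Like name, it accepts a string, a regular expression, or a list. Since Beautiful Soup 4.4.0 the same parameter is also available under the name string: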

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

t_list = bs.find_all(text="hao123")
for item in t_list:
    print(item)
t_list = bs.find_all(text=["hao123", "地圖", "貼吧"])
for item in t_list:
    print(item)
# Raw string so that \d reaches the regex engine unchanged
t_list = bs.find_all(text=re.compile(r"\d"))
for item in t_list:
    print(item)
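
A text search returns the matching strings themselves, not the tags around them. As a rough sketch (the inline markup is a cut-down stand-in for index.html), .parent gets back to the enclosing tag, and combining a tag name with the string argument returns tags directly:

```python
import re
from bs4 import BeautifulSoup

html = '<div><a name="tj_trhao123">hao123</a><a name="tj_trmap">map</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Strings that contain a digit
strings = soup.find_all(string=re.compile(r"\d"))
# Step from each string up to the tag that holds it
parent_names = [s.parent["name"] for s in strings]
# Tag name plus string filter returns the tags themselves
tags = soup.find_all("a", string=re.compile(r"\d"))

print(strings, parent_names, tags)
```
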

When you need a more specific condition on the text, you can also pass a function:

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

# Define the filter function
def length_is_two(text):
    return text and len(text) == 2

t_list = bs.find_all(text=length_is_two)
for item in t_list:
    print(item)

(5) The limit parameter:
Pass limit to cap the number of results. If a search matches 5 items and limit=2, only the first 2 are returned:

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

t_list = bs.find_all("a",limit=2)
for item in t_list:
    print(item)

Besides the standard forms above, find_all also has a shorthand: calling a BeautifulSoup object or a Tag directly is the same as calling find_all on it:

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

# These two are equivalent:
# t_list = bs.find_all("a") => t_list = bs("a")
t_list = bs("a")
# And so are these:
# t_list = bs.a.find_all(text="新聞") => t_list = bs.a(text="新聞")
t_list = bs.a(text="新聞")

2. find()

find() returns the first Tag that matches. When you only need one Tag, use find(). You could also call find_all() with limit=1 and take the first item, but that is needlessly verbose.

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

# Returns a list with a single result
t_list = bs.find_all("title",limit=1)
print(t_list)
# Returns the single match itself
t = bs.find("title")
print(t)
# Returns None when nothing is found
t = bs.find("abc")
print(t)

As the output shows, find_all still returns a list even with limit=1, so find() is more convenient when you need a single value. Keep in mind that find() returns None when nothing matches.
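
Because find() can return None, calling a method on its result without checking raises AttributeError when the tag is absent. One possible guard, using a hypothetical helper named text_of and a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>Example</title></head></html>",
                     "html.parser")

def text_of(soup, tag_name, default=""):
    """Return the text of the first matching tag, or default if there is none."""
    tag = soup.find(tag_name)
    return tag.get_text() if tag is not None else default

found = text_of(soup, "title")
fallback = text_of(soup, "abc", default="(missing)")
print(found, fallback)
```
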

bs.div returns the first div tag. To get the first div inside the first div, chain the lookups:

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

t = bs.div.div
# equivalent to
t = bs.find("div").find("div")
print(t)

VI. CSS selectors

BeautifulSoup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags with CSS selector syntax:

Find by tag name

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

print(bs.select('title'))
print(bs.select('a'))

Find by class name

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

print(bs.select('.mnav'))

Find by id

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

print(bs.select('#u1'))

Combined selectors

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

print(bs.select('div .bri'))

Find by attribute

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

print(bs.select('a[class="bri"]'))
print(bs.select('a[name="tj_trmap"]'))

Direct child selector

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

t_list = bs.select("head > title")
print(t_list)

Sibling selector

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

t_list = bs.select(".mnav ~ .bri")
print(t_list)

Getting the content

from bs4 import BeautifulSoup
import re

file = open('./index.html', 'rb')
html = file.read()
bs = BeautifulSoup(html,"html.parser")

t_list = bs.select("title")
print(bs.select('title')[0].get_text())
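
When only the first match is needed, select_one() saves the [0] indexing and returns None instead of raising IndexError when nothing matches. A short sketch on a made-up fragment modeled on the u1 block:

```python
from bs4 import BeautifulSoup

html = """
<div id="u1">
  <a class="mnav" name="tj_trnews">News</a>
  <a class="mnav" name="tj_trmap">Map</a>
  <a class="bri" name="tj_briicon">More</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# id, child combinator, and class in one selector
names = [a["name"] for a in soup.select("#u1 > a.mnav")]
# First match only
more = soup.select_one('a[name="tj_briicon"]')
# No match: None rather than an exception
missing = soup.select_one("a.nope")

print(names, more.get_text(), missing)
```
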

VII. Examples

Parsing 58.com listings with BeautifulSoup

# encoding: UTF-8
from bs4 import BeautifulSoup
import requests
import time
import json

url = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'

wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text,'lxml')

# Collect the URL of every item on a listing page
def get_links_from(who_sells):
    urls = []
    list_view = 'http://bj.58.com/pbdn/pn{}/'.format(str(who_sells))
    print('list_view:{}'.format(list_view))
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text,'lxml')
    #for link in soup.select('td.t > a.t'):
    for link in soup.select('td.t  a.t'):  # matches the same links as the selector above
        print(link)
        urls.append(link.get('href').split('?')[0])
    return urls

# Print the URL of each category on 58.com (tablets, phones, and so on)
def get_classify_url():
    url58 = 'http://bj.58.com'
    wb_data = requests.get(url58)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    for link in soup.select('span.jumpBusiness a'):
        classify_href = link.get('href')
        print(classify_href)
        classify_url = url58 + classify_href
        print(classify_url)

# Fetch the details of every item
def get_item_info(who_sells=0):

    urls = get_links_from(who_sells)
    for url in urls:
        print(url)
        wb_data = requests.get(url)
        # print(wb_data.text)
        soup = BeautifulSoup(wb_data.text,'lxml')
        print(soup.select('span[class="price_now"]')[0].text)
        print(soup.select('div[class="palce_li"]')[0].text)
        data = {
            'title':soup.title.text,
            'price': soup.select('span[class="price_now"]')[0].text,
            'area': soup.select('div[class="palce_li"]')[0].text if soup.find_all('div', 'palce_li') else None,
            'date' :soup.select('.look_time')[0].text,
            'cate' :'individual' if who_sells == 0 else 'merchant',
        }
        print(data)
        # Serialize with json and ensure_ascii=False so Chinese text prints as readable characters
        result = json.dumps(data, ensure_ascii=False)
        print(result)

# get_item_info(url)

# get_links_from(1)

get_item_info(2)
#get_classify_url()

Mtime box office top 100
Mtime: http://movie.mtime.com/boxoffice/

# -*- coding:UTF-8 -*-

from bs4 import BeautifulSoup
import pandas as pd
import requests

"""
Requires:
pandas
requests
BeautifulSoup
openpyxl

Scrape movie box office data from Mtime
"""
def sgw(year):
    # Create a session
    s = requests.session()
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Host': 'movie.mtime.com',
        'Referer': 'http://movie.mtime.com/boxoffice/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.15 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }
    # Apply the headers to the session
    s.headers.update(headers)
    # One row per movie
    df = pd.DataFrame(columns=('Rank', 'Movie', 'Genre', 'Opening-day box office (CNY)', 'Annual box office (CNY)', 'Release date'))
    x = 0

    for i in range(10):
        # Two parameters: the year and the page number
        url = 'http://movie.mtime.com/boxoffice/?year={}&area=china&type=MovieRankingYear&category=all&page={}&display=table&timestamp=1547015331595&version=07bb781100018dd58eafc3b35d42686804c6df8d&dataType=json'.format(
            year,str(i))
        req = s.get(url=url, verify=False).text
        bs = BeautifulSoup(req, 'lxml')
        tr = bs.find_all('tr')
        # Skip the header row
        for j in tr[1:]:
            td = j.find_all('td')
            row = []
            for k in range(6):
                if k == 1:
                    # The movie title sits inside a link
                    nm = td[k].find('a').text
                    print(td[k].a.string)
                    row.append(nm)
                else:
                    row.append(td[k].text)
            df.loc[x] = row
            x = x + 1
    print(df)
    # Writing .xlsx requires openpyxl
    df.to_excel('mtime.xlsx', index=False)

# Run the scraper
sgw(2019)

Douban Top 250 movies
https://movie.douban.com/top250

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

'''
Requires:
requests
BeautifulSoup
lxml

Douban Top 250 movies
'''

rank = 1
def write_one_page(soup):
    global rank
    # Locate each movie's info block
    for k in soup.find('div',class_='article').find_all('div',class_='info'):
        name = k.find('div',class_='hd').find_all('span')       # movie title
        score = k.find('div',class_='star').find_all('span')    # rating

        # One-line summary; some movies have none
        quote_tag = k.find('p',class_='quote')
        inq = quote_tag.find('span').string if quote_tag is not None else ''

        # Year, country, and genre
        actor_infos_html = k.find(class_='bd')

        # strip() removes leading and trailing characters (whitespace by default)
        actor_infos = actor_infos_html.find('p').get_text().strip().split('\n')
        # \xa0 is the non-breaking space (&nbsp;)
        actor_infos1 = actor_infos[0].split('\xa0\xa0\xa0')
        director = actor_infos1[0][3:]
        role = actor_infos[1]
        year_area = actor_infos[1].lstrip().split('\xa0/\xa0')
        year = year_area[0]
        country = year_area[1]
        genre = year_area[2]
        # rank, title, rating, summary, year, country, genre
        print(rank,name[0].string,score[1].string,inq,year,country,genre)
        # Append to the text file
        write_to_file(rank,name[0].string,score[1].string,year,country,genre,inq)
        # Move on to the next rank
        rank=rank+1

# Append one movie as a semicolon-separated line
def write_to_file(rank,name,score,year,country,genre,quote):
    with open('Top_250_movie.txt', 'a', encoding='utf-8') as f:
        f.write(str(rank)+';'+str(name)+';'+str(score)+';'+str(year)+';'+str(country)+';'+str(genre)+';'+str(quote)+'\n')

if __name__ == '__main__':
    for i in range(10):
        a = i*25
        # https://movie.douban.com/top250
        url = "https://movie.douban.com/top250?start="+str(a)+"&filter="
        f = requests.get(url)
        # Build the soup
        soup = BeautifulSoup(f.content, "lxml")

        write_one_page(soup)
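
The year, country, and genre split above indexes straight into the result, which raises IndexError if a line has fewer than three fields. The sketch below isolates that step so it can be tried without a network connection. The helper name split_year_country_genre and the sample line are made up for illustration; the real page layout may differ.

```python
NBSP = "\xa0"  # non-breaking space, written &nbsp; in HTML

def split_year_country_genre(line):
    """Split 'year / country / genre' (separated by NBSP-slash-NBSP) into a tuple.

    Returns None when the line does not have exactly three fields.
    """
    parts = [p.strip() for p in line.strip().split(NBSP + "/" + NBSP)]
    if len(parts) != 3:
        return None
    return tuple(parts)

# Hypothetical sample line in the shape the scraper expects
sample = "   1994\xa0/\xa0USA\xa0/\xa0Crime Drama"
print(split_year_country_genre(sample))
print(split_year_country_genre("1994 / USA"))
```

Returning None for malformed lines lets the caller skip or log a movie instead of stopping the whole run.
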

Last edited: 2019.12.07 11:36:03
