Background:
The Jikexueyuan website has a knowledge-system map under its course tab. The curriculum is fairly complete, so I paid for a month of VIP and wanted to download the course series to watch at my own pace, which led to the crawler below. (For personal study only; please do not use it commercially. I will take this down on request.)
Technical documentation:
To learn Python crawling you will want the Requests user guide and the Beautiful Soup documentation.
Code:
- config.py: basic crawler settings
proxies = {
    'https': '42.123.125.181:8088',
}
headers = {
    # fill in your own cookie here
    'Cookie': '_uab_collina=xxxxx; PHPSESSID=xxxxx; jkxyid_v2=xxxx; _ga=xxxx; _gid=xxxx; gr_user_id=xxxx; uname=xxxxx; uid=xxxx; code=xxxx; authcode=xxxxx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
KnowledgeSystemUrl = 'https://www.jikexueyuan.com/path/'
To dodge the site's anti-crawling limits on a single local IP, a proxies entry with a proxy server address is added; you can find servers on domestic high-anonymity proxy lists. Downloading some of the videos requires a VIP login, so grab the request headers your browser sends when visiting Jikexueyuan (inspect the HTTP request headers in the browser's developer tools) and reuse the Cookie value to skip the login step.
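Before running the full crawler, it is worth confirming that the cookie and the proxy actually work. A minimal check, assuming config.py is already filled in (it tests against the knowledge-system page itself; any VIP-only page would do):

import requests
from config import headers, proxies, KnowledgeSystemUrl

r = requests.get(KnowledgeSystemUrl, headers=headers, proxies=proxies, timeout=20)
print(r.status_code)   # 200 means the proxy is reachable and the request went through
# if the cookie is valid the page renders as logged in; search the HTML for
# the uname value from your own cookie (the 'xxxxx' below is a placeholder)
print('xxxxx' in r.text)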
- crawl.py: fetch the HTML
import requests
from config import headers
from config import proxies

class Crawl(object):
    def getText(self, url):
        # fetch a page and return its decoded HTML ('' on failure, so
        # callers that parse the result with BeautifulSoup still work)
        try:
            r = requests.get(url, headers=headers, proxies=proxies, timeout=20)
            r.encoding = r.apparent_encoding
            print(r.status_code, 'request')
            self.html = r.text
            return r.text
        except requests.RequestException as e:
            print('getText error:', e)
            return ''

    def getResponse(self, url):
        # fetch a page and return the raw Response object (None on failure)
        try:
            r = requests.get(url, headers=headers, proxies=proxies, timeout=20)
            r.encoding = r.apparent_encoding
            print(r.status_code, 'request')
            return r
        except requests.RequestException as e:
            print('getResponse error:', e)
            return None
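A quick way to exercise the class on its own, using only what config.py and crawl.py define:

from crawl import Crawl
from config import KnowledgeSystemUrl

c = Crawl()
html = c.getText(KnowledgeSystemUrl)   # prints e.g. "200 request" on success
print(len(html))                       # 0 means the request failed and '' came back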
- KnowledgeSystem.py: fetch the knowledge-system list
from config import KnowledgeSystemUrl
from crawl import Crawl
from bs4 import BeautifulSoup

class KnowledgeSystem(Crawl):
    class listData():
        # track names and their page URLs, filled in by getList
        nameList = []
        srcList = []

    def getList(self):
        try:
            html = self.getText(KnowledgeSystemUrl)
            soup = BeautifulSoup(html, 'html.parser')
            print('---- Looking up the knowledge-system list ----')
            srcList = []
            nameList = []
            index = 1
            # a bare string passed as attrs is matched against the class attribute
            cf = soup.find_all(attrs='pathlist-one cf')
            for member in cf:
                h2 = member.find('h2')
                print('%d ' % (index) + h2.string)
                nameList.append(h2.string)
                srcList.append('https:' + member['href'])
                index = index + 1
            ld = self.listData()
            ld.nameList = nameList
            ld.srcList = srcList
            return ld
        except Exception as e:
            print('getList error:', e)

    def select(self):
        n = input('----- Enter the number of the track you want to download -----\n')
        return int(n)
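Run stand-alone, the class behaves like the first half of main.py below:

from KnowledgeSystem import KnowledgeSystem

KS = KnowledgeSystem()
ld = KS.getList()    # prints a numbered list of the tracks
n = KS.select()      # the user picks one by number
print(ld.nameList[n - 1], ld.srcList[n - 1])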
- courseList.py: fetch the course list for each chapter
from crawl import Crawl
from bs4 import BeautifulSoup

class CourseList(Crawl):
    class CourseData:
        chapterName = ''
        lessonNameList = []
        lessonSrcList = []

    class CourseList:
        # holds one CourseData per chapter
        chapterList = []

    def getCourse(self, url):
        print('------- Fetching the course info for this track ---------')
        chapterListHtml = self.getText(url)
        chapterListSoup = BeautifulSoup(chapterListHtml, 'html.parser')
        temp = chapterListSoup.find_all(attrs='pathstage mar-t30')
        self.CourseList.chapterList = []
        for each in temp:
            # store the chapter name in a CourseData instance
            CD = self.CourseData()
            CD.chapterName = each.find('h2').string
            lessonInfoList = each.find_all(attrs='lesson-info-h2')
            index = 1
            # reset the lesson-name and lesson-URL lists
            CD.lessonNameList = []
            CD.lessonSrcList = []
            for info in lessonInfoList:
                # store the lesson name in the CourseData name list
                courseName = str(index) + '.' + info.string
                CD.lessonNameList.append(courseName)
                # store the lesson URL in the CourseData url list
                lessonSrc = 'https:' + info.a['href']
                CD.lessonSrcList.append(lessonSrc)
                index = index + 1
            # keep the finished CourseData in chapterList
            self.CourseList.chapterList.append(CD)

    def printChapterNameList(self):
        print('----- This track has the following chapters: -----')
        for each in self.CourseList.chapterList:
            print(each.chapterName)

    def printLessonNameList(self):
        for each in self.CourseList.chapterList:
            for lessonName in each.lessonNameList:
                print(lessonName)

    def printLessonSrcList(self):
        for each in self.CourseList.chapterList:
            for lessonSrc in each.lessonSrcList:
                print(lessonSrc)
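A short stand-alone sketch of the class (the track URL here is an assumed example; real URLs come from KnowledgeSystem.getList):

from courseList import CourseList

CL = CourseList()
CL.getCourse('https://www.jikexueyuan.com/path/android/')  # assumed example URL
CL.printChapterNameList()   # chapter names
CL.printLessonNameList()    # numbered lesson names
CL.printLessonSrcList()     # lesson page URLs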
- section.py: fetch the section info for each lesson
from crawl import Crawl
from bs4 import BeautifulSoup
import bs4

class Section(Crawl):
    class SectionData:
        sectionNameList = []
        sectionSrcList = []

    def getSection(self, url):
        print('-------- Fetching the section info for this lesson -------')
        lessonHtml = self.getText(url)
        soup = BeautifulSoup(lessonHtml, 'html.parser')
        temp = soup.find(attrs='lessonvideo-list')
        # the page sometimes comes back without the list; re-fetch until it parses
        while not isinstance(temp, bs4.element.Tag):
            lessonHtml = self.getText(url)
            soup = BeautifulSoup(lessonHtml, 'html.parser')
            print('lessonvideo-list not found, retrying')
            temp = soup.find(attrs='lessonvideo-list')
        aTag = temp.find_all('a')
        self.SectionData.sectionNameList = []
        self.SectionData.sectionSrcList = []
        for each in aTag:
            self.SectionData.sectionNameList.append(each.string)
            self.SectionData.sectionSrcList.append('https:' + each['href'])
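The retry loop above re-fetches forever if a page never yields the lessonvideo-list block, which can hang on a dead link. One way to bound it, as a sketch (findWithRetry and maxRetries are names introduced here, not part of the original scripts):

import bs4
from bs4 import BeautifulSoup

def findWithRetry(crawl, url, selector, maxRetries=5):
    # re-fetch url until `selector` (a class string) matches a Tag,
    # giving up after maxRetries attempts
    for attempt in range(maxRetries):
        soup = BeautifulSoup(crawl.getText(url), 'html.parser')
        tag = soup.find(attrs=selector)
        if isinstance(tag, bs4.element.Tag):
            return tag
        print('retry %d: %s not found yet' % (attempt + 1, selector))
    return None  # the caller must handle a permanent failure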
- download.py: download the videos
from crawl import Crawl
from section import Section
from bs4 import BeautifulSoup
import bs4
import os
import requests

class Download(Crawl):
    class DownloadData:
        sourceList = []
        nameList = []

    def findVideoSrc(self, SectionData):
        print('----- Fetching the video links for this lesson -------')
        self.DownloadData.sourceList = []
        self.DownloadData.nameList = SectionData.sectionNameList
        for Src in SectionData.sectionSrcList:
            html = self.getText(Src)
            soup = BeautifulSoup(html, 'html.parser')
            sourceTag = soup.find('source')
            # re-fetch until the <source> tag shows up in the page
            while not isinstance(sourceTag, bs4.element.Tag):
                print('<source> tag not found, retrying')
                html = self.getText(Src)
                soup = BeautifulSoup(html, 'html.parser')
                sourceTag = soup.find('source')
            source = sourceTag['src']
            self.DownloadData.sourceList.append(source)

    def makeDir(self, dirName):
        print('------- Creating path: %s ------' % dirName)
        try:
            if os.path.exists(dirName):
                return dirName
            else:
                os.mkdir(dirName)
                return dirName
        except OSError:
            print('The path being created was: ' + dirName)
            dirName = input('Creation failed, please type a path by hand: ')
            dirName = self.makeDir(dirName)
            return dirName

    def saveVideoFile(self, path, videoName, videoSrc):
        videoFilePath = path + '/' + videoName + '.mp4'
        if os.path.exists(videoFilePath):
            print('  Video already exists: %s' % (videoName))
            return
        else:
            print('  Downloading video %s' % (videoName))
            video = requests.get(videoSrc)
            print('  Saving video %s' % (videoName))
            with open(videoFilePath, 'wb') as f:
                f.write(video.content)

    def downloadVideo(self, path):
        path = self.makeDir(path)
        for i in range(len(self.DownloadData.sourceList)):
            self.saveVideoFile(path, self.DownloadData.nameList[i], self.DownloadData.sourceList[i])
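saveVideoFile buffers the whole video in memory (video.content) before writing it out. For long lessons a streamed download is lighter on RAM; a possible drop-in variant, as a sketch rather than the original author's code:

import requests

def saveVideoFileStreamed(path, videoName, videoSrc):
    # stream the video to disk in 1 MiB chunks instead of buffering it all
    videoFilePath = path + '/' + videoName + '.mp4'
    with requests.get(videoSrc, stream=True, timeout=20) as video:
        with open(videoFilePath, 'wb') as f:
            for chunk in video.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)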
- main.py: the main program
import sys
import io
from KnowledgeSystem import KnowledgeSystem
from courseList import CourseList
from section import Section
from download import Download
import os

# optional: force UTF-8 stdout if the console garbles Chinese names
#sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')

if __name__ == '__main__':
    KS = KnowledgeSystem()
    # fetch the knowledge-system list
    KSLD = KS.getList()
    # let the user pick the track to download
    num = KS.select()
    # name and URL of the chosen track
    ksName = KSLD.nameList[num - 1]
    ksSrc = KSLD.srcList[num - 1]
    # fetch all courses of that track
    CL = CourseList()
    CL.getCourse(ksSrc)
    CL.printChapterNameList()
    sec = Section()
    dld = Download()
    pathTemp = dld.makeDir('./' + ksName)                              # e.g. ./android
    for each in CL.CourseList.chapterList:
        pathTemp2 = dld.makeDir(pathTemp + '/' + each.chapterName)     # e.g. ./android/1.環(huán)境搭建
        for i in range(len(each.lessonSrcList)):
            path = pathTemp2 + '/' + each.lessonNameList[i]            # e.g. ./android/1.環(huán)境搭建/1.Android 集成開發(fā)環(huán)境搭建
            sec.getSection(each.lessonSrcList[i])
            # if the lesson's last section is already on disk, skip the whole lesson
            videoFilePath = path + '/' + sec.SectionData.sectionNameList[-1] + '.mp4'
            if os.path.exists(videoFilePath):
                print('File already exists, skipping %s' % videoFilePath)
            else:
                dld.findVideoSrc(sec.SectionData)
                dld.downloadVideo(path)
    print('download successful')
Preview of the results:
Fetching the resources you want
Downloading the videos
The downloaded video files