Scraping the article content from every page of a blog site with Python

For more tutorials, head over to: 洛涼博客

For help, head over to: Python自學技術交流

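For the past few days I have been writing image scrapers: the XXOO site, the girl pictures on 煎蛋網, the 桌酷 wallpaper site, and the 最好大學 university rankings.
If you would like to see all of the code, you can pull it from git.
The repository is here: https://github.com/YGQ8988/reptile.git
Lately I have been wondering whether scraping text is harder than scraping images.
So today I picked a blog site more or less at random and gave it a try. At first I scraped a single article and managed to get its title and content. Then I tried saving it locally, and that worked too.
Then I took another look at the pages: the title and content sit at the same place in the source of every article.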
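Because the title and body sit in the same place on every article page, one pair of lookups can be reused for all of them. Here is a minimal sketch, assuming the `link_title` and `article_content` class names that the full script uses. The HTML fragment and the `extract_article` helper are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, shaped like the layout the script expects.
SAMPLE_HTML = """
<span class="link_title"><a href="/post/1"> Hello scraping </a></span>
<div class="article_content"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""


def extract_article(html):
    """Return (title, body) for one article page, or None if the layout differs."""
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('span', {'class': 'link_title'})
    content_tag = soup.find('div', {'class': 'article_content'})
    if title_tag is None or content_tag is None:
        return None
    title = title_tag.get_text(strip=True)
    # Join the text of each paragraph, one paragraph per line.
    body = '\n'.join(p.get_text(strip=True) for p in content_tag.find_all('p'))
    return title, body


result = extract_article(SAMPLE_HTML)
print(result)
```

Returning `None` for an unexpected layout lets the caller skip the page instead of crashing on it.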
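Next I tried scraping 10 pages of data and found that some articles could not be saved. The error was an encoding problem, even though my code sets the encoding on every requests call, and I have not found a fix yet.
In the end I took the crude route and wrapped the code in try and except to filter those articles out.
If an article fails, it is skipped and the crawl moves on to the next one.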
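One possible cause of the encoding error, offered as a guess since the traceback is not shown: `open()` without an `encoding` argument uses the platform default, which on a Chinese-locale Windows machine is typically GBK, and GBK cannot represent every character that turns up in web text. Setting the encoding on the requests response does not change how the file is written. The sketch below reproduces the failure with an explicit `gbk` codec and shows that `utf-8` avoids it. The sample string and file name are made up.

```python
import os
import tempfile

# Made-up sample text: the emoji has no GBK representation.
text = 'Python 爬蟲筆記 \U0001F600'

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Writing with GBK fails on the emoji.
try:
    with open(path, 'w', encoding='gbk') as f:
        f.write(text)
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Writing with UTF-8 works for any valid Unicode text.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding='utf-8') as f:
    round_trip = f.read()

print('GBK write succeeded:', gbk_ok)
print('UTF-8 round trip intact:', round_trip == text)
```

If this is indeed the cause, passing `encoding='utf-8'` to every `open()` call in the scraper should let the failing articles be saved instead of skipped.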
I also changed the code a little and cleaned up the file names.
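Titles can also contain characters that are not allowed in file names (Windows rejects `\ / : * ? " < > |`), so stripping whitespace alone may not be enough. Here is a small sketch of one way to clean a title. `safe_filename` and its `max_len` parameter are hypothetical and not part of the original script.

```python
import re


def safe_filename(title, max_len=100):
    """Turn an article title into a name that is safe to use on Windows and Linux."""
    # Drop line breaks and surrounding whitespace first.
    name = title.replace('\n', '').replace('\r', '').strip()
    # Replace the characters Windows forbids in file names with an underscore.
    name = re.sub(r'[\\/:*?"<>|]', '_', name)
    # Fall back to a placeholder if nothing is left, and cap the length.
    return (name or 'untitled')[:max_len]


print(safe_filename('\n  What is C/C++? A "quick" tour\r\n'))
```

Unlike the script, this version keeps the spaces inside a title. Removing them as well would be a one-line change.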

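Later I changed the code again so that it scrapes every page. I also added the time module to sleep between requests, so that my IP would not get banned for hitting the site too often.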
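For the pause to actually slow the crawl down, it has to happen between requests rather than once at start-up. Here is a minimal sketch of that idea. The `throttled` helper and its `delay` and `sleep_fn` parameters are made up for illustration, and the URLs are placeholders.

```python
import time


def throttled(items, delay=1.0, sleep_fn=time.sleep):
    """Yield items one by one, pausing `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep_fn(delay)  # no pause before the very first item
        yield item


# Placeholder URLs; in the crawler each one would be passed to requests.get().
urls = ['http://example.com/page/%d' % n for n in range(1, 4)]
for url in throttled(urls, delay=0.1):
    print('fetching', url)
```

Passing the sleep function in as a parameter is only there so the helper can be tested without waiting. In normal use the default `time.sleep` is enough.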
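I am still a beginner: I do not know how to set up proxy IPs yet, and I have not learned the scrapy framework either.
I recently got a free Alibaba Cloud server, so once the changes were done I simply put the script on the server and ran it there.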

The code is pasted below. I did not write many comments this time, so try it out yourself.
That way you will understand what each line does.

import os
from time import sleep
import bs4
import requests
from bs4 import BeautifulSoup
url_list = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
def url_all():  # Build the URLs of list pages 1 through 400
    for page in range(1, 401):
        url = 'http://blog.csdn.net/?ref=toolbar_logo&page=' + str(page)
        url_list.append(url)
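def essay_url():  # Collect the URL of every article
    blog_urls = []
    for url in url_list:
        html = requests.get(url, headers=headers, timeout=10)
        html.encoding = html.apparent_encoding
        soup = BeautifulSoup(html.text, 'html.parser')
        for h3 in soup.find_all('h3'):
            link = h3.find('a', href=True)  # skip headings that carry no link
            if link is not None:
                blog_urls.append(link['href'])
        sleep(1)  # pause between list pages so the site is not hit too often
    return blog_urls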
def save_path():
    s_path = 'D:/blog/'
    # Create the output folder on the first run; do nothing if it already exists.
    os.makedirs(s_path, exist_ok=True)
    return s_path
def save_essay(blog_urls, s_path):  # Fetch every article and save its title and content
    for url in blog_urls:
        sleep(1)  # pause before each request so the IP is not blocked for frequent access
        try:
            blog_html = requests.get(url, headers=headers, timeout=10)
            blog_html.encoding = blog_html.apparent_encoding
            soup = BeautifulSoup(blog_html.text, 'html.parser')
            title = soup.find('span', {'class': 'link_title'})
            content = soup.find('div', {'class': 'article_content'})
            if title is None or content is None:
                print('Skipped (unexpected page layout):', url)
                continue
            print('-----Article title-----:', title.text.strip())
            # Strip line breaks and spaces so the title can be used as a file name.
            blogname = title.text.replace('\n', '').replace('\r', '').replace(' ', '')
            # Write as UTF-8 explicitly: the platform default (for example GBK on
            # Chinese-locale Windows) may not be able to encode every character.
            with open(s_path + blogname + '.txt', 'w', encoding='utf-8') as file:
                file.write(title.text)
                for p in content.children:
                    if isinstance(p, bs4.element.Tag):
                        file.write(p.text)
        except Exception as e:
            # Skip any article that fails and carry on with the next one.
            print(e)
    print('---------------All pages processed----------------')
url_all()
save_essay(essay_url(), save_path())

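I bought three books but have not been reading them lately. What they cover is mostly the built-in modules.
I have recently found that writing crawlers is quite fun, and I am continuing to study and experiment.
I hope to learn a framework, so that I could handle the work of a junior crawler engineer.
Haha, am I getting ahead of myself?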

Last edited: 2017.11.24 09:09:58

© Copyright belongs to the author. For reprints or content collaboration, please contact the author.