Hi everyone, I'm Xie Wei, a programmer.
So far we have covered:
Golang Language Learning Column -- Issue 1: setting up the Golang environment, setting the GOPATH and GOROOT parameters, Govendor package management, and the Goland IDE
Golang Language Learning Column -- Issue 2: Golang basics: variable declarations, basic data types, basic data structures (map, array, slice, struct), control flow, loops, and so on
Golang Language Learning Column -- Issue 3: Golang functions: parameters, return values, anonymous functions, functions as arguments, functions as return values
Golang Language Learning Column -- Issue 4: Golang structs: declaration and definition, composition, formatted printing, field access, method definitions
Golang Language Learning Column -- Issue 5: Golang's error-handling mechanism
Golang Language Learning Column -- Issue 6: Golang structs
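Whatever you study, if you never get the chance to get up to speed quickly, you lose the motivation to keep learning, and with it the chance to go deep into a skill. This matters a great deal for beginners and self-learners: otherwise you keep scooping up sand without ever building a tower. That is why confidence is so important.
In this issue we put what we have learned so far to use and complete a small task.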
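As programmers, while we focus on studying technology, we also need to keep an eye on what is trending. So how do we follow technical trends, such as what engineers are currently researching and paying attention to?
The answer, of course, is to visit the mainstream technical communities and see what engineers are working on right now.
By mainstream technical community I mean Github here, because this hosting site holds a huge amount of open-source technology worth studying.
Github has a dedicated page that lists the day's hottest projects. From it we can get a rough picture of the popular projects in popular languages.
You can also view trending projects by programming language:
For example: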
Language | Link |
---|---|
Python | Github Trending Python |
Go | Github Trending Go |
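Our goal is to scrape some information about these trending projects. (I have found that, whether in Python or Go, crawlers always seem to spark learners' interest.)
The task is the content shown in the two screenshots above:
- Define the fields to scrape
- Fetch the web page
- Parse the web page
- Schedule the tasks
- Main function entry point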
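One more point: beginners often pay little attention to the structure of their projects. That is, they focus on the implementation, assume the project is essentially finished once the feature works, and take it for granted that their development task is done. In practice, development work at a company is very different from a personal practice project, because company tasks change as requirements change. Imagine a school project where the teacher sets a task, changes it midway, and then changes the goal again when your code is nearly finished. You would probably complain about the teacher, yet this is everyday life in company development. So I recommend that beginners and self-learners adopt a good project layout from the start and keep adjusting within it afterwards (the skeleton stays, the internal details change). In fact, many design patterns and software architectures come with a fixed project layout, which helps keep a project extensible and loosely coupled.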
Project structure
For a crawler project, I recommend the following directory layout:
workspace
    download
        download.go
    engine
        engine.go
        object.go
    infra
        util.go
    main
        main.go
    parse
        github
            github_trending_parse.go
Here is what each directory and file is for:
download (role: downloader)
    download.go: fetches the web page
engine (role: scheduling engine)
    engine.go: schedules the crawl tasks
    object.go: defines the fields to scrape
infra (role: infrastructure)
    util.go: helper functions the project needs
main
    main.go: the main function entry point. Nothing more to say.
parse (role: parser)
    github
        github_trending_parse.go: the parse functions for the Github site
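Downloader
// download.go
package download

import (
	"errors"
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// Sentinel errors returned by this package.
var (
	ErrorNil       = errors.New("response is nil")
	ErrorWrongCode = errors.New("http response code is wrong")
)

// Download fetches url and parses the response body into a goquery document.
func Download(url string) (*goquery.Document, error) {
	resp, err := http.Get(url)
	if err != nil {
		// Keep the underlying cause; callers can still match ErrorNil with errors.Is.
		return nil, fmt.Errorf("%w: %v", ErrorNil, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, ErrorWrongCode
	}
	return goquery.NewDocumentFromReader(resp.Body)
}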
- Pay attention to function naming
- Pay attention to error handling: I suggest defining the error values at the top of each file
Parser
// github_trending_parse.go
package github

import (
	"fmt"

	"go-example-for-live/seven_learning/infra"

	"github.com/PuerkitoBio/goquery"
)

// ParseForGithub prints one line per repository listed on https://github.com/trending.
// The CSS selectors depend on the page's markup and must be updated if Github changes it.
func ParseForGithub(document *goquery.Document) {
	document.Find("div.explore-content ol.repo-list li").Each(func(i int, selection *goquery.Selection) {
		RespName, _ := infra.HandleCommon(selection.Find("div h3 a").Text())
		URL, _ := infra.HandlerURL(selection.Find("div").Eq(0).Find("h3 a").AttrOr("href", "None"))
		Description, _ := infra.HandleCommon(selection.Find("div").Eq(2).Find("p").Text())
		Stars, _ := infra.HandleCommon(selection.Find("div").Eq(3).Find("a").Eq(0).Text())
		Fork, _ := infra.HandleCommon(selection.Find("div").Eq(3).Find("a").Eq(1).Text())
		TodayStars, _ := infra.HandleCommon(selection.Find("div").Eq(3).Find("span").Eq(1).Text())
		fmt.Println(RespName, URL, Description, Stars, Fork, TodayStars)
	})
}

// ParseForDevelopers prints one line per entry listed on https://github.com/trending/developers.
func ParseForDevelopers(document *goquery.Document) {
	document.Find("div.explore-content ol li").Each(func(i int, selection *goquery.Selection) {
		DevName, _ := infra.HandleCommon(selection.Find("li div div").Eq(1).Find("h2 a").Text())
		Description, _ := infra.HandleCommon(selection.Find("li div div").Eq(1).Find("a span").Text())
		// HandleCommon keeps the relative path (for example /google); infra.HandlerURL would make it absolute.
		URL, _ := infra.HandleCommon(selection.Find("li div div").Eq(1).Find("h2 a").AttrOr("href", "None"))
		fmt.Println(DevName, Description, URL)
	})
}
- One parse function handles:
https://github.com/trending/developers
- The other parse function handles:
https://github.com/trending
Scheduler
// engine.go
package engine

import (
	"errors"
	"fmt"

	"go-example-for-live/seven_learning/download"
)

var (
	ErrorDocWrong = errors.New("document wrong")
)

// Trending is the scheduling engine. It carries no state.
type Trending struct {
}

// Run downloads the page at request.URL and hands the document to request.ParseFunc.
func (t Trending) Run(request RequestForGithub) {
	doc, err := download.Download(request.URL)
	if err != nil {
		// Print the cause as well, so a failed download can be diagnosed.
		fmt.Println(ErrorDocWrong, err)
		return
	}
	if doc != nil {
		fmt.Println("Game start!")
		request.ParseFunc(doc)
	} else {
		fmt.Println("Game over!")
	}
}
- Wires the downloader and the parser together to obtain the scraped fields
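// object.go
package engine

import "github.com/PuerkitoBio/goquery"

// RequestForGithub is a seed: the URL to fetch plus the function that parses it.
type RequestForGithub struct {
	URL       string
	ParseFunc func(doc *goquery.Document)
}

// Repositories holds the fields scraped from https://github.com/trending.
type Repositories struct {
	RespName    string
	URL         string
	Stars       int
	Fork        int
	TodayStars  string
	Description string
}

// Developers holds the fields scraped from https://github.com/trending/developers.
type Developers struct {
	DevName     string
	Description string
	URL         string
}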
- Three structs are defined:
1. RequestForGithub, which I call the seed: it holds the URL and the parse function
2. Developers: the fields scraped from https://github.com/trending/developers
3. Repositories: the fields scraped from https://github.com/trending
Infrastructure
// util.go
package infra

import (
	"errors"
	"strings"
)

var (
	ErrorStringSpace = errors.New("string trim error")
)

// HandleCommon removes newlines and tabs, then trims surrounding whitespace.
func HandleCommon(oldString string) (string, error) {
	newReplacer := strings.NewReplacer("\n", "", "\t", "")
	return strings.TrimSpace(newReplacer.Replace(oldString)), nil
}

// HandlerURL turns a relative Github path into an absolute URL.
func HandlerURL(oldString string) (string, error) {
	return "https://github.com" + strings.TrimSpace(oldString), nil
}
In other words, a few string-handling functions, such as a replace function and a concatenation function.
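To see what these two helpers do to the kind of text that comes out of an HTML node, here is a small standalone run. The two functions are copied from util.go; the input strings are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// HandleCommon is copied from util.go: drop newlines and tabs, then trim.
func HandleCommon(oldString string) (string, error) {
	newReplacer := strings.NewReplacer("\n", "", "\t", "")
	return strings.TrimSpace(newReplacer.Replace(oldString)), nil
}

// HandlerURL is copied from util.go: prefix a relative path with the host.
func HandlerURL(oldString string) (string, error) {
	return "https://github.com" + strings.TrimSpace(oldString), nil
}

func main() {
	// Made-up inputs that imitate text taken from an HTML node.
	name, _ := HandleCommon("\n      google / gvisor\n\t")
	url, _ := HandlerURL(" /google/gvisor ")
	fmt.Printf("%q\n", name) // "google / gvisor"
	fmt.Printf("%q\n", url)  // "https://github.com/google/gvisor"
}
```

Note that HandleCommon only strips newlines and tabs and trims both ends, so runs of spaces in the middle of the text are kept as they are.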
Main entry point
// main.go
package main

import (
	"go-example-for-live/seven_learning/engine"
	"go-example-for-live/seven_learning/parse/github"
)

func main() {
	var simplerTest engine.Trending
	// Seed 1: trending repositories.
	simplerTest.Run(
		engine.RequestForGithub{
			URL:       "https://github.com/trending",
			ParseFunc: github.ParseForGithub,
		},
	)
	// Seed 2: trending developers.
	simplerTest.Run(
		engine.RequestForGithub{
			URL:       "https://github.com/trending/developers",
			ParseFunc: github.ParseForDevelopers,
		},
	)
}
Result:
Game start!
xingshaocheng / architect-awesome https://github.com/xingshaocheng/architect-awesome 后端架构师技术图谱 13,220 3,150 1,528 stars today
google / gvisor https://github.com/google/gvisor Container Runtime Sandbox 5,080 190
davideuler / architecture.of.internet-product https://github.com/davideuler/architecture.of.internet-product 互联网公司技术架构,微信/淘宝/微博/腾讯/阿里/美团点评/百度/Google/Facebook/Amazon/eBay的架构,欢迎PR补充 7,058 1,123 1,427 stars today
kusti8 / proton-native https://github.com/kusti8/proton-native A React environment for cross platform native desktop apps 6,216 153
github / gh-ost https://github.com/github/gh-ost GitHub's Online Schema Migrations for MySQL 5,136 335
pytorch / ELF https://github.com/pytorch/ELF ELF: a platform for game research 1,525 218
cyanharlow / purecss-francine https://github.com/cyanharlow/purecss-francine HTML/CSS drawing in the style of an 18th-century oil painting. Hand-coded entirely in HTML & CSS. 4,035 169
sallar / github-contributions-chart https://github.com/sallar/github-contributions-chart Generate an image of all your Github contributions 2,228 60
RelaxedJS / ReLaXed https://github.com/RelaxedJS/ReLaXed Create PDF documents using web technologies 7,116 181
sindresorhus / ow https://github.com/sindresorhus/ow Function argument validation for humans 1,790 18
xx45 / dayjs https://github.com/xx45/dayjs Fast 2KB immutable date library alternative to Moment.js with the same modern API 9,310 269
sharkdp / bat https://github.com/sharkdp/bat A cat(1) clone with wings. 2,102 26
CyC2018 / Interview-Notebook https://github.com/CyC2018/Interview-Notebook 技术面试需要掌握的基础知识整理,欢迎编辑~ 21,845 5,763 256 stars today
shimohq / chinese-programmer-wrong-pronunciation https://github.com/shimohq/chinese-programmer-wrong-pronunciation 中国程序员容易发音错误的单词 5,963 530 257 stars today
binhnguyennus / awesome-scalability https://github.com/binhnguyennus/awesome-scalability High Scalability, High Availability, High Stability, High Performance, and High Intelligence Back-End Design Patterns 10,778 810 246 stars today
nhnent / tui.calendar https://github.com/nhnent/tui.calendar A JavaScript calendar that everything you need. 4,759 192
YadiraF / PRNet https://github.com/YadiraF/PRNet The source code of 'Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network'. 1,296 108
hasura / skor https://github.com/hasura/skor Listen to postgres events and forward them as JSON payloads to a webhook 946 19
roytseng-tw / Detectron.pytorch https://github.com/roytseng-tw/Detectron.pytorch A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available. 926 114
iotexproject / iotex-core https://github.com/iotexproject/iotex-core Connecting the physical world, block by block. 654 58
layerJS / layerJS https://github.com/layerJS/layerJS layerJS: Javascript UI composition framework 1,039 26
AllThingsSmitty / css-protips https://github.com/AllThingsSmitty/css-protips A collection of tips to help take your CSS skills pro 11,921 785 188 stars today
tabler / tabler https://github.com/tabler/tabler Tabler is free and open-source HTML Dashboard UI Kit built on Bootstrap 4 13,629 983
tmcw / big https://github.com/tmcw/big presentations for busy messy hackers 2,436 137
cgoldsby / LoginCritter https://github.com/cgoldsby/LoginCritter An animated avatar that responds to text field interactions 3,351 141
Game start!
google (Google) (Google) material-design-icons material-design-icons Material Design icons by Google /google
davideuler (david l euler) (david l euler) architecture.of.internet-product architecture.of.internet-product 互联网公司技术架构,微信/淘宝/微博/腾讯/阿里/美团点评/百度/Google/Facebook/Amazon/eBay的架构,欢迎PR补充 /davideuler
xingshaocheng architect-awesome architect-awesome 后端架构师技术图谱 /xingshaocheng
cyanharlow (Diana Smith) (Diana Smith) purecss-francine purecss-francine HTML/CSS drawing in the style of an 18th-century oil painting. Hand-coded entirely in HTML & CSS. /cyanharlow
kusti8 (Gustav Hansen) (Gustav Hansen) proton-native proton-native A React environment for cross platform native desktop apps /kusti8
pytorch pytorch pytorch Tensors and Dynamic neural networks in Python with strong GPU acceleration /pytorch
github (GitHub) (GitHub) gitignore gitignore A collection of useful .gitignore templates /github
sindresorhus (Sindre Sorhus) (Sindre Sorhus) awesome awesome Curated list of awesome lists /sindresorhus
sallar (Sallar Kaboli) (Sallar Kaboli) github-contributions-chart github-contributions-chart Generate an image of all your Github contributions /sallar
RelaxedJS (ReLaXed) (ReLaXed) ReLaXed ReLaXed Create PDF documents using web technologies /RelaxedJS
symfony (Symfony) (Symfony) symfony symfony The Symfony PHP framework /symfony
facebook (Facebook) (Facebook) react react A declarative, efficient, and flexible JavaScript library for building user interfaces. /facebook
Microsoft (Microsoft) (Microsoft) vscode vscode Visual Studio Code /Microsoft
xx45 dayjs dayjs Fast 2KB immutable date library alternative to Moment.js with the same modern API /xx45
apache (The Apache Software Foundation) (The Apache Software Foundation) incubator-echarts incubator-echarts A powerful, interactive charting and visualization library for browser /apache
tensorflow tensorflow tensorflow Computation using data flow graphs for scalable machine learning /tensorflow
sharkdp (David Peter) (David Peter) fd fd A simple, fast and user-friendly alternative to 'find' /sharkdp
vuejs (vuejs) (vuejs) vue vue A progressive, incrementally-adoptable JavaScript framework for building UI on the web. /vuejs
CyC2018 Interview-Notebook Interview-Notebook 技术面试需要掌握的基础知识整理,欢迎编辑~ /CyC2018
nhnent (NHN Entertainment) (NHN Entertainment) tui.editor tui.editor Markdown WYSIWYG Editor. GFM Standard + Chart & UML Extensible. /nhnent
binhnguyennus (Binh Nguyen) (Binh Nguyen) awesome-scalability awesome-scalability High Scalability, High Availability, High Stability, High Performance, and High Intelligence Back-End Design Patterns /binhnguyennus
shimohq (Shimo Docs) (Shimo Docs) chinese-programmer-wrong-pronunciation chinese-programmer-wrong-pronunciation 中国程序员容易发音错误的单词 /shimohq
YadiraF PRNet PRNet The source code of 'Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network'. /YadiraF
yyx990803 (Evan You) (Evan You) pod pod Git push deploy for Node.js /yyx990803
roytseng-tw (Roy) (Roy) Detectron.pytorch Detectron.pytorch A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available. /roytseng-tw
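It is worth stressing that this project layout extends well: if I want to scrape another site, I only need to define a new parser under parse, and everything else can be reused.
Also note that the scraped fields are never actually filled into the structs defined earlier.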
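One possible way to close that gap is sketched below. Nothing here comes from the original project except the Repositories struct: parseCount and newRepository are hypothetical helpers, and the assumption is that star and fork counts arrive as strings with thousands separators, as in the printed output (for example 13,220).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Repositories mirrors the struct defined in object.go.
type Repositories struct {
	RespName    string
	URL         string
	Stars       int
	Fork        int
	TodayStars  string
	Description string
}

// parseCount is a hypothetical helper that converts "13,220" into 13220.
func parseCount(s string) (int, error) {
	return strconv.Atoi(strings.ReplaceAll(strings.TrimSpace(s), ",", ""))
}

// newRepository is a hypothetical constructor that fills the struct from
// the already-cleaned strings a parse function has extracted.
func newRepository(name, url, description, stars, fork, today string) (Repositories, error) {
	starCount, err := parseCount(stars)
	if err != nil {
		return Repositories{}, fmt.Errorf("stars %q: %w", stars, err)
	}
	forkCount, err := parseCount(fork)
	if err != nil {
		return Repositories{}, fmt.Errorf("fork %q: %w", fork, err)
	}
	return Repositories{
		RespName:    name,
		URL:         url,
		Stars:       starCount,
		Fork:        forkCount,
		TodayStars:  today,
		Description: description,
	}, nil
}

func main() {
	repo, err := newRepository("google / gvisor", "https://github.com/google/gvisor",
		"Container Runtime Sandbox", "5,080", "190", "")
	if err != nil {
		fmt.Println("skip:", err)
		return
	}
	fmt.Println(repo.RespName, repo.Stars, repo.Fork) // google / gvisor 5080 190
}
```

A parse function could then collect these values into a slice and return it instead of printing, which would also require changing the ParseFunc signature in the seed.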
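Finally, this project may not look like much, but someone has in fact already built it: a project that scrapes github trending every day, writes the results to a file, and hosts them on github. If you are interested, take a look at how they implemented it.
If you are a self-learner without access to enterprise projects, I suggest finding a project on github in the language you care about and rewriting it yourself. It is like setting yourself an exercise that comes with a reference answer: you get feedback and keep sharpening your skills.
That's all. I hope you enjoyed learning.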