golang-proxy v3.0
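golang-proxy is an out-of-the-box crawler for highly anonymous proxies, and it is language-agnostic.
Project address: https://github.com/storyicon/golang-proxy
Chinese Document
Golang-Proxy is a simple and efficient crawler for free proxies. It collects the free proxies published on the web and maintains your own pool of highly anonymous proxies, for use in web crawlers, resource downloading, and similar tasks.
What's new in v3.0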
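- Still offers a highly flexible API. Once the main program is running, open localhost:9999/all or localhost:9999/random in a browser to fetch the crawled proxies directly. You can even run simple SQL statements through localhost:9999/sql?query= to define your own proxy filtering rules.
- Still ships out-of-the-box builds for Windows, Linux, and Mac: Download Release v3.0
- Detects proxy types automatically. schemeType tells you how far a proxy supports http and https.
- Supports MySQL. See Config for details.
- Supports starting services individually. When launching the compiled binary, pass -mode= to start only one of producer, consumer, assessor, or service.
- Redesigned the data tables. Note that this means the API has changed.
- Redesigned the data structure of a source and removed fields such as filter. Note that this means v2.0 sources may cause problems when fed directly to v3.0.
- Updated some sources.
- The -source startup parameter is no longer supported.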
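How to use golang-proxy
1. Use the out-of-the-box release
The Release page provides archives for each system environment. Extract the one for your system and run it.
Out-of-the-box download: Download Release v3.0
After the download finishes, extract the binary and the source directory from the archive into the same location and start the binary. The program starts the following services:
- producer: periodically crawls the sources defined in the source directory and writes the crawled proxies to the crude_proxy table
- consumer: periodically reads a batch of proxies from crude_proxy, determines their proxy type and availability, and writes them to the proxy table
- assessor: periodically reads a batch of proxies from the proxy table and evaluates their quality
- service: the HTTP API provided by golang-proxy, which lets you filter and fetch proxies from the crude_proxy and proxy tables through localhost:9999/all, localhost:9999/random, and localhost:9999/sql?query=
When you start the compiled binary, these services start one after another by default. In v3.0 you can add the -mode startup parameter to start a single service, for example: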
golang-proxy -mode=service
Run this way, only the service module starts. Once service is running, you can open the following endpoints in a browser to get the corresponding proxies:
url | description |
---|---|
localhost:9999/all | Get all crawled proxies in the proxy table |
localhost:9999/all?table=proxy | Get all crawled proxies in the proxy table |
localhost:9999/all?table=crude_proxy | Get all crawled proxies in the crude_proxy table |
localhost:9999/random | Get one random proxy from the proxy table |
localhost:9999/random?table=proxy | Get one random proxy from the proxy table |
localhost:9999/random?table=crude_proxy | Get one random proxy from the crude_proxy table |
localhost:9999/sql?query= | Append an SQL statement after query= to get its result. Only fairly simple queries are supported |
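The endpoints in the table above can also be called from a program instead of a browser. An SQL statement contains spaces and characters such as > and *, so it should be percent-encoded before it is placed after query=. The following is a minimal sketch, assuming the service decodes standard query-string encoding; the helper names tableURL and sqlURL are hypothetical and not part of golang-proxy.

```go
package main

import (
	"fmt"
	"net/url"
)

// baseURL is the default address used throughout this README.
const baseURL = "http://localhost:9999"

// tableURL builds the address of /all or /random for a given table.
// An empty table leaves the parameter off, which selects the proxy table.
func tableURL(endpoint, table string) string {
	u := baseURL + "/" + endpoint
	if table != "" {
		u += "?table=" + url.QueryEscape(table)
	}
	return u
}

// sqlURL builds the address of /sql with the statement percent-encoded.
// (Hypothetical helper; assumes the server accepts standard query encoding.)
func sqlURL(query string) string {
	return baseURL + "/sql?query=" + url.QueryEscape(query)
}

func main() {
	fmt.Println(tableURL("random", "crude_proxy"))
	fmt.Println(sqlURL("SELECT * FROM proxy WHERE score > 5 ORDER BY score DESC"))
}
```

Browsers usually encode the address bar for you, which is why the raw statement works when typed by hand; an HTTP client library generally does not.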
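Note that crude_proxy is only a temporary store for crawled proxies, and their quality is not guaranteed. Proxies in the proxy table are evaluated continuously by assessor. The score field of the proxy table gives a fairly complete picture of a proxy's quality, and a proxy is deleted when its quality drops too low.
API example: localhost:9999/sql
For example, visiting localhost:9999/sql?query=SELECT * FROM PROXY WHERE SCORE > 5 ORDER BY SCORE DESC returns every proxy in the proxy table with a score above 5, ordered from highest to lowest score: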
{
    "error": "",
    "message": [
        {
            "id": 2,
            "ip": "45.113.69.177",
            "port": "1080",
            // scheme_type takes one of the following values:
            // 0: the proxy supports http only
            // 1: the proxy supports https only
            // 2: the proxy supports both http and https
            "scheme_type": 0,
            "content": "45.113.69.177:1080",
            // number of evaluations
            "assess_times": 9,
            // number of successful evaluations; success_times/assess_times gives the connection success rate
            "success_times": 9,
            // average response time
            "avg_response_time": 0.098,
            // number of consecutive failures
            "continuous_failed_times": 0,
            // score; proxies scoring above 5 are recommended
            "score": 68.45106053570785,
            "insert_time": 1540793312,
            "update_time": 1540797880
        }
    ]
}
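A reply in this shape can be decoded into Go structs. The sketch below is one possible mapping, not the project's own types: the names Proxy, Response, and decode are hypothetical, the field types are inferred from the sample above, and the comments in the sample are annotations that a real reply presumably does not carry. It also assumes message is an array, as it is for this query.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Proxy mirrors one element of the "message" array shown above.
type Proxy struct {
	ID                    int     `json:"id"`
	IP                    string  `json:"ip"`
	Port                  string  `json:"port"`
	SchemeType            int     `json:"scheme_type"`
	Content               string  `json:"content"`
	AssessTimes           int     `json:"assess_times"`
	SuccessTimes          int     `json:"success_times"`
	AvgResponseTime       float64 `json:"avg_response_time"`
	ContinuousFailedTimes int     `json:"continuous_failed_times"`
	Score                 float64 `json:"score"`
	InsertTime            int64   `json:"insert_time"`
	UpdateTime            int64   `json:"update_time"`
}

// Response is the envelope around the list of proxies.
type Response struct {
	Error   string  `json:"error"`
	Message []Proxy `json:"message"`
}

// SuccessRate returns success_times / assess_times, or 0 before any evaluation.
func (p Proxy) SuccessRate() float64 {
	if p.AssessTimes == 0 {
		return 0
	}
	return float64(p.SuccessTimes) / float64(p.AssessTimes)
}

// decode parses a reply body into a Response.
func decode(body []byte) (Response, error) {
	var r Response
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	body := []byte(`{"error":"","message":[{"id":2,"ip":"45.113.69.177","port":"1080",
		"scheme_type":0,"content":"45.113.69.177:1080","assess_times":9,"success_times":9,
		"avg_response_time":0.098,"continuous_failed_times":0,"score":68.45106053570785,
		"insert_time":1540793312,"update_time":1540797880}]}`)
	r, err := decode(body)
	if err != nil {
		panic(err)
	}
	for _, p := range r.Message {
		fmt.Printf("%s success rate %.2f\n", p.Content, p.SuccessRate())
	}
}
```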
2. Build from source
go get -u github.com/storyicon/golang-proxy
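Enter the golang-proxy directory, run go build main.go, and then run the resulting binary.
Note:
The ./source folder in the project root is required at run time. It holds the definitions of the website sources, while every other folder contains only project source code. After compiling, you can therefore move the main binary together with the source folder to any location, and you can rename main to anything you like.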
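Why use Golang-Proxy
- Stable and fast. The crawling module can reach 1000 pages per second of concurrent fetching on a single core.
- Highly configurable and extensible. You do not need to write any code: filling in a configuration file takes a minute or two and adds a new website source.
- Built-in evaluation. The Assessor module periodically tests proxy quality and computes a combined score from independent factors such as test success rate, anonymity, number of tests, volatility, and response speed. The algorithm is highly configurable, so the weight of each factor can be adjusted independently to suit your project.
- A highly flexible API. Once the main program is running, open localhost:9999/all or localhost:9999/random in a browser to fetch the crawled proxies directly. You can even run SQL statements through localhost:9999/sql?query= to define your own proxy filtering rules.
- No dependency on any database server. Download it with one click and use it out of the box!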
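How to configure a new source
Every yml file under ./source/ is a source. You can add sources, make the program ignore one by prefixing its file name with a ., or delete the file to remove the source for good. The Source parameters are described below: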
# Page settings
page:
    entry: "https://xxx/1.html"
    template: "https://xxx/{page}.html"
    from: 2
    to: 10
# producer first crawls entry, that is https://xxx/1.html,
# and then uses template, from, and to to crawl in order:
#   https://xxx/2.html
#   https://xxx/3.html
#   https://xxx/4.html
#   ...
#   https://xxx/10.html
# Selector settings
selector:
    iterator: ".table tbody tr"
    ip: "td:nth-child(1)"
    port: "td:nth-child(2)"
# The configuration above scrapes the following HTML structure:
# <table class="table">
#   <tbody>
#     <tr>
#       <td>187.3.0.1</td>
#       <td>8080</td>
#       <td>HTTP</td>
#     </tr>
#     <tr>
#       <td>164.23.1.2</td>
#       <td>80</td>
#       <td>HTTPS</td>
#     </tr>
#     <tr>
#       <td>131.9.2.3</td>
#       <td>8080</td>
#       <td>HTTP</td>
#     </tr>
#   </tbody>
# </table>
# Selectors are ordinary jQuery selectors. iterator selects the repeated element,
# for example the rows of a table with one proxy per row. ip and port are then
# looked up among the child elements of each node matched by iterator.
category:
    # Number of parallel fetches
    parallelnumber: 1
    # After fetching each page of this source,
    # wait a random 5 to 20 seconds before fetching the next page
    delayRange: [5, 20]
    # How often this source is run
    # @every 10s, @every 10h, ...
    interval: "@every 10m"
debug: true
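The delayRange setting describes a random pause between two page fetches. As an illustration only, the sketch below picks such a pause; randomDelay is a hypothetical helper, and treating both bounds as inclusive whole seconds is an assumption rather than something this document states about golang-proxy's implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomDelay picks a whole number of seconds in [lo, hi], both inclusive.
// If the range is empty or reversed it falls back to lo.
func randomDelay(lo, hi int) time.Duration {
	if hi <= lo {
		return time.Duration(lo) * time.Second
	}
	return time.Duration(lo+rand.Intn(hi-lo+1)) * time.Second
}

func main() {
	// Corresponds to delayRange: [5, 20] in the source file above.
	fmt.Println("waiting", randomDelay(5, 20), "before the next page")
}
```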
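Request for comments
- If you run into any problem, open an issue
- If you find a new source that works well, you are welcome to submit it and share it
- Since you are here, leave a Star before you go : )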
English Document
Golang-proxy is an efficient crawler for free proxies. It ensures that the captured proxies are highly anonymous and keeps their quality in check. You can use the captured proxies to download network resources while keeping your own identity private.
1. Features
- Very fast crawling, up to 1000 pages per second.
- Customizable proxy sources with an extremely simple configuration file.
- Compiled releases that ship with a SQLite database and also support MySQL.
- A built-in API, so every function is available with one click.
- A proxy evaluation system that maintains the quality of the proxy pool.
2. How to use
golang-proxy provides compiled binaries, so you do not need Go installed on the machine. Download the archive that matches your system type from the Release page, unzip it, and run it. After a few minutes, you can open localhost:9999/all in a browser to see the crawl results.
Before going into the details of golang-proxy, here is the most useful information first.
API interface
After you start the binary, you can open the following endpoints in a browser to get proxies:
url | description |
---|---|
localhost:9999/all | Get all highly available proxies |
localhost:9999/all?table=proxy | Get all highly available proxies |
localhost:9999/random | Get one random highly available proxy |
localhost:9999/all?table=crude_proxy | Get the proxies in the temporary table (their quality cannot be guaranteed) |
localhost:9999/random?table=crude_proxy | Get one random proxy from the temporary table (its quality cannot be guaranteed) |
localhost:9999/sql?query= | Write the SQL statement you want to execute after query= to apply your own filter rules |
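A proxy returned by these endpoints can be plugged into an HTTP client. The following is a minimal sketch, assuming the proxy is an HTTP proxy reachable at the ip:port value of its content field; proxyURL and newClient are hypothetical helper names, and the 10 second timeout is an arbitrary choice.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// proxyURL turns a content value ("ip:port") into a proxy address.
func proxyURL(content string) (*url.URL, error) {
	u, err := url.Parse("http://" + content)
	if err != nil {
		return nil, err
	}
	if u.Hostname() == "" || u.Port() == "" {
		return nil, fmt.Errorf("expected ip:port, got %q", content)
	}
	return u, nil
}

// newClient returns an HTTP client that sends its requests through the proxy.
func newClient(content string) (*http.Client, error) {
	u, err := proxyURL(content)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(u)},
		Timeout:   10 * time.Second, // free proxies are often slow or dead
	}, nil
}

func main() {
	client, err := newClient("45.113.69.177:1080")
	if err != nil {
		panic(err)
	}
	fmt.Println("client ready, timeout:", client.Timeout)
}
```

Free proxies fail often, so callers would normally retry with another proxy from the pool when a request errors out.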
With the above, you can already use about 50% of golang-proxy's functionality. The last endpoint, however, executes custom SQL statements, so you need to know at least the structure of the tables. The next section describes them.
3. Advanced
golang-proxy consists of the following parts:
- two data tables
- one configuration file
- one source folder
- four modules
two data tables
1. Table Crude Proxy
The crude_proxy table stores temporary proxies. It is defined as follows.
field | type | example | description |
---|---|---|---|
id | int | - | - |
ip | string | 192.168.0.1 | - |
port | string | 255 | - |
content | string | 192.168.0.1:255 | - |
insert_time | int | 1540798717 | - |
update_time | int | 1540798717 | - |
The crude_proxy table stores proxies exactly as they were crawled, so their quality cannot be guaranteed.
2. Table Proxy
A proxy in the crude_proxy table enters the proxy table once it passes pre assess, a rough check that verifies the proxy is available and tests its support for https and http.
field | type | example | description |
---|---|---|---|
id | int | - | - |
ip | string | 192.168.0.1 | - |
port | string | 255 | - |
scheme_type | int | 2 | How far the proxy supports http and https. 0: http only, 1: https only, 2: both http and https |
content | string | 192.168.0.1:255 | - |
assess_times | int | 5 | The number of times the proxy has been evaluated |
success_times | int | 5 | The number of times the proxy passed the evaluation |
avg_response_time | float | 0.001 | - |
continuous_failed_times | int | 0 | The number of consecutive failures during the proxy evaluation process |
score | float | 25 | The higher the better |
insert_time | int | 1540798717 | - |
update_time | int | 1540798717 | - |
Proxies in the proxy table are evaluated periodically and their scores are updated. Proxies with low scores are deleted.
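The scheme_type values from the table above can be checked in code before a proxy is used for a given target. This is a small sketch with hypothetical names (the constants and supports are not part of golang-proxy); only the three numeric values come from the table.

```go
package main

import "fmt"

// Values of the scheme_type column, as listed in the table above.
const (
	SchemeHTTPOnly  = 0
	SchemeHTTPSOnly = 1
	SchemeBoth      = 2
)

// supports reports whether a proxy with the given scheme_type can be used
// for the target scheme ("http" or "https").
func supports(schemeType int, scheme string) bool {
	switch scheme {
	case "http":
		return schemeType == SchemeHTTPOnly || schemeType == SchemeBoth
	case "https":
		return schemeType == SchemeHTTPSOnly || schemeType == SchemeBoth
	default:
		return false
	}
}

func main() {
	for st := SchemeHTTPOnly; st <= SchemeBoth; st++ {
		fmt.Printf("scheme_type %d: http=%v https=%v\n",
			st, supports(st, "http"), supports(st, "https"))
	}
}
```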
one configuration file
For convenience, golang-proxy stores proxies in the portable SQLite database by default. To use MySQL instead, add a config.yml file to the directory that contains the executable.
For details, see the Config page.
one source folder
golang-proxy needs sources to define what it crawls and how. The directory that golang-proxy runs from therefore needs a source folder containing at least one source in yml format.
The source is defined as follows:
page:
    entry: "http://www.xxx.com/http/?page=1"
    template: "http://www.xxx.com/http/?page={page}"
    from: 1
    to: 2000
selector:
    iterator: ".list .item"
    ip: ".ip"
    port: ".port"
category:
    parallelnumber: 3
    delayRange: [10, 30]
    interval: "@every 10m"
debug: true
With the definition above, producer first crawls the entry page and then crawls:
http://www.xxx.com/http/?page=1
http://www.xxx.com/http/?page=2
http://www.xxx.com/http/?page=3
...
http://www.xxx.com/http/?page=2000
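The way the page section turns into a list of addresses can be sketched as follows. This illustrates the described behavior and is not the producer's actual code: the Page type and URLs method are hypothetical, and whether the real producer skips an entry address that the template range also produces is not stated here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Page mirrors the page section of a source file.
type Page struct {
	Entry    string
	Template string
	From     int
	To       int
}

// URLs returns the entry page followed by the template filled in for
// every page number from From to To, inclusive.
func (p Page) URLs() []string {
	urls := []string{p.Entry}
	for n := p.From; n <= p.To; n++ {
		urls = append(urls, strings.ReplaceAll(p.Template, "{page}", strconv.Itoa(n)))
	}
	return urls
}

func main() {
	p := Page{
		Entry:    "http://www.xxx.com/http/?page=1",
		Template: "http://www.xxx.com/http/?page={page}",
		From:     1,
		To:       3,
	}
	for _, u := range p.URLs() {
		fmt.Println(u)
	}
}
```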
This source definition expects pages in the following format:
<html>
    ...
    <div class="list">
        <div class="item">
            <div class="ip"> 127.0.0.1 </div>
            <div class="port"> 80 </div>
            ...
        </div>
        <div class="item">
            <div class="ip"> 125.4.0.1 </div>
            <div class="port"> 8080 </div>
            ...
        </div>
        ...
    </div>
    ...
</html>
When producer parses a page, it first iterates over the nodes matched by iterator and then, within each node, reads the elements matched by the ip and port selectors. The source definition above also works for the following HTML structure.
<html>
    ...
    <div class="list">
        <div class="item">
            <div class="ip"> 127.0.0.1:80 </div>
        </div>
        <div class="item">
            <div class="ip"> 125.4.0.1:8080</div>
        </div>
        ...
    </div>
    ...
</html>
This works because, when the port selector matches nothing, producer tries to parse the port from the text matched by the ip selector.
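The fallback described above can be sketched like this. It is an illustration under assumptions, not the producer's code: resolve is a hypothetical helper that receives the text matched by the ip and port selectors, and splitting on the last colon with net.SplitHostPort is one plausible way to recover the port.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// resolve returns the ip and port for one iterator node.
// If portText is empty, it tries to split "ip:port" out of ipText.
func resolve(ipText, portText string) (ip, port string, ok bool) {
	ipText = strings.TrimSpace(ipText)
	portText = strings.TrimSpace(portText)
	if portText != "" {
		return ipText, portText, ipText != ""
	}
	host, p, err := net.SplitHostPort(ipText)
	if err != nil || host == "" || p == "" {
		return "", "", false
	}
	return host, p, true
}

func main() {
	fmt.Println(resolve(" 127.0.0.1 ", " 80 "))   // separate ip and port elements
	fmt.Println(resolve(" 125.4.0.1:8080", ""))   // port embedded in the ip text
}
```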
Once a source is saved in the source folder in yml format, its definition is complete. golang-proxy reads it the next time it starts and adds it to the crawl list.
If a source file name starts with a ., the source is not read.
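The naming rule can be expressed as a small filter. This is a sketch with a hypothetical helper, isActiveSource; it assumes only files with the .yml extension count as sources, which is how this document describes them.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isActiveSource reports whether a file in the source folder would be read:
// it must be a .yml file whose name does not start with a dot.
func isActiveSource(name string) bool {
	base := filepath.Base(name)
	return !strings.HasPrefix(base, ".") && filepath.Ext(base) == ".yml"
}

func main() {
	for _, name := range []string{"example.yml", ".example.yml", "notes.txt"} {
		fmt.Printf("%-14s active=%v\n", name, isActiveSource(name))
	}
}
```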
four modules
golang-proxy consists of four modules that work together.
module name | description |
---|---|
producer | Periodically fetch the source defined in the source directory, and write the fetched proxy to the crude_proxy table. |
consumer | Periodically read a certain number of proxies from crude_proxy , determine their proxy scheme type and availability, and write them to the proxy table. |
assessor | Periodically read a number of proxies from the proxy table to evaluate their quality. |
service | Serves the HTTP API of golang-proxy, which lets you filter and fetch the proxies in the crude_proxy and proxy tables through localhost:9999/all, localhost:9999/random, and localhost:9999/sql. |
When you start the golang-proxy executable, it starts these modules in turn. You can add the -mode startup parameter to make golang-proxy start only one module, like this:
golang-proxy -mode=service
This will only start the HTTP API interface service.
At this point, you have mastered about 95% of golang-proxy's functionality. To learn more, read the source code at the project address above and help improve it.
Request for comments
Issues are welcome.
If golang-proxy helps you, please give the project a star or watch it. Thanks!