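I'm on an English-language system, and Chinese-language interfaces come out garbled once they are converted to strings, so all the tests here were run against English web pages.
PowerShell 5 adds a new cmdlet called ConvertFrom-String, which converts strings into objects. One of its parameters takes a template: the parts of the string that match the template are extracted and turned into objects. We can use this to scrape tables out of web pages.
Let's start with a basic example.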
<pre class="public-DraftStyleDefault-pre" data-offset-key="e3a03-0-0" style="margin: 1.4em 0px; padding: 0.88889em; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: auto; background: rgb(246, 246, 246); border-radius: 4px; color: rgb(18, 18, 18); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">
<pre class="Editable-styled" data-block="true" data-editor="4naj0" data-offset-key="e3a03-0-0" style="margin: 0px; padding: 0px; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: initial; background: rgb(246, 246, 246); border-radius: 0px;">
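# $a was missing from this copy; the rows are reconstructed from the template and the output described below.
$a = @'
1 2 3 4
5 6 7 8
9 2 2 3
'@
$t = @'
{Co1:1} {Co2:2} {Co3:3} {Co4:4}
{Co1:5} 6 7 8
'@
# Split on CRLF: one object with properties P1, P2, P3 (one per line).
$a | ConvertFrom-String -Delimiter "\r\n"
# Match against the template: one object per row, one property per column.
$a | ConvertFrom-String -TemplateContent $t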
</pre>
</pre>
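Both calls take the same string. The first splits it on the carriage-return/line-feed delimiter to build one object; the second matches it against a custom template. Note how the template tags properties: each value is wrapped in braces, the first row uses the full {PropertyName:value} form for every field, later rows do not have to tag every field, and the template needs at least two rows of sample data.
You can see that the first call produces one object with three properties: P1 is 1 2 3 4, P2 is 5 6 7 8, and P3 is 9 2 2 3.
The second call produces objects matched automatically column by column (the template already covers the first two rows).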
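To make the template rules above a bit more concrete, here is a minimal sketch on made-up contact data. The names, phone numbers, property names, and variable names are illustrative only and do not come from the examples in this post. It assumes Windows PowerShell 5.x, since ConvertFrom-String is not included in PowerShell 6 and later. The asterisk after a property name marks where each record starts, and the optional [int] prefix asks for a typed property.

```powershell
# Two tagged example rows teach ConvertFrom-String the record layout.
# "Name*" marks where each record starts; "[int]" requests an integer property.
$sampleTemplate = @'
{Name*:Phoebe Cat}, {Phone:425-123-6789}, {[int]Age:6}
{Name*:Lucky Shot}, {Phone:(206) 987-4321}, {[int]Age:12}
'@

# Hypothetical input: the first two rows repeat the examples, the rest are new.
$sampleText = @'
Phoebe Cat, 425-123-6789, 6
Lucky Shot, (206) 987-4321, 12
Elephant Wise, 425-888-7766, 87
Wild Shrimp, (111) 222-3333, 1
'@

$people = $sampleText | ConvertFrom-String -TemplateContent $sampleTemplate
$people | Format-Table Name, Phone, Age
```

Because the cmdlet learns the layout from examples, it is worth checking the result by eye; if a row is missed, adding another tagged example row to the template is usually the first thing to try.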
Next, let's look at two real-world examples.
The first is the web page below, which lists Australian proxy servers. I want to scrape that list.
<pre class="public-DraftStyleDefault-pre" data-offset-key="1jo7c-0-0" style="margin: 1.4em 0px; padding: 0.88889em; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: auto; background: rgb(246, 246, 246); border-radius: 4px; color: rgb(18, 18, 18); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">
<pre class="Editable-styled" data-block="true" data-editor="4naj0" data-offset-key="1jo7c-0-0" style="margin: 0px; padding: 0px; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: initial; background: rgb(246, 246, 246); border-radius: 0px;">
http://www.proxylisty.com/country/Australia-ip-list
</pre>
</pre>
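The basic idea: Invoke-RestMethod fetches the whole page and automatically converts it into a string object.
Then we design a matching template. Because the page is an HTML file, all of its HTML markup ends up in the string, so the key question is how to produce a table template that includes that markup.
That turns out to be easy. Every web page lets you view its HTML source, so the large chunk of HTML below is simply two table rows copied and pasted from the page source, lightly edited to add property names.
Matching the page against the template then generates the corresponding table objects automatically.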
<pre class="public-DraftStyleDefault-pre" data-offset-key="6f9u-0-0" style="margin: 1.4em 0px; padding: 0.88889em; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: auto; background: rgb(246, 246, 246); border-radius: 4px; color: rgb(18, 18, 18); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">
<pre class="Editable-styled" data-block="true" data-editor="4naj0" data-offset-key="6f9u-0-0" style="margin: 0px; padding: 0px; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: initial; background: rgb(246, 246, 246); border-radius: 0px;">
$template = @'
<tr>
<td>{IP:203.56.188.145}</td>
<td><a title='Port 8080 Proxy List'>{Port:8080}</a></td>
<td>HTTP</td>
<td><a style='color:red;' href='http://www.proxylisty.com/anonymity/High anonymous / Elite proxy-ip-list' title='High anonymous / Elite proxy Proxy List'>High anonymous / Elite proxy</a></td>
<td>No</td>
<td><a title='Australia IP Proxy List'><img style='margin: 0px 5px 0px 0px; padding: 0px;' src='http://www.proxylisty.com/assets/flags/AU.png' title='Australia IP Proxy List'/>Australia</a></td>
<td>13 Months</td>
<td>2.699 Sec</td>
<td><div id="progress-bar" class="all-rounded">
<div title='50%' id="progress-bar-percentage" class="all-rounded" style="width: 50%">{Reliability:50%}</div></div></td>
</tr>
<tr>
<td>{IP:103.25.182.1}</td>
<td><a title='Port 8081 Proxy List'>{Port:8081}</a></td>
<td>HTTP</td>
<td><a style='color:red;' href='http://www.proxylisty.com/anonymity/Anonymous proxy-ip-list' title='Anonymous proxy Proxy List'>Anonymous proxy</a></td>
<td>No</td>
<td><a title='Australia IP Proxy List'><img style='margin: 0px 5px 0px 0px; padding: 0px;' src='http://www.proxylisty.com/assets/flags/AU.png' title='Australia IP Proxy List'/>Australia</a></td>
<td>15 Months</td>
<td>7.242 Sec</td>
<td><div id="progress-bar" class="all-rounded">
<div title='55%' id="progress-bar-percentage" class="all-rounded" style="width: 55%">{Reliability:55%}</div></div></td>
</tr>
'@
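# Fetch the page as a string, then match it against the template.
$web = 'http://www.proxylisty.com/country/Australia-ip-list'
$result = Invoke-RestMethod -Uri $web
ConvertFrom-String -TemplateContent $template -InputObject $result | Sort-Object Reliability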
</pre>
</pre>
The scrape works.
I can take this a step further: to check whether the scraped addresses are actually usable, I wrote a function to test them.
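One caveat: ConvertFrom-String ships with Windows PowerShell 5.x and is not available in PowerShell 6 and later. For a simple, regular table, a plain regular expression can do a similar job there. This is a different technique from the template matching above, sketched on made-up HTML; the addresses are documentation placeholders, and the variable names and row layout are assumptions for illustration.

```powershell
# Stand-in for the page text; hosts and ports are made-up documentation addresses.
$miniHtml = @'
<table>
<tr><td>203.0.113.10</td><td>8080</td><td>HTTP</td></tr>
<tr><td>198.51.100.7</td><td>3128</td><td>HTTPS</td></tr>
<tr><td>192.0.2.44</td><td>8081</td><td>HTTP</td></tr>
</table>
'@

# One <tr> with three <td> cells; named groups become the property values.
$rowPattern = '<tr><td>(?<IP>[\d.]+)</td><td>(?<Port>\d+)</td><td>(?<Type>\w+)</td></tr>'

$rows = foreach ($m in [regex]::Matches($miniHtml, $rowPattern)) {
    [pscustomobject]@{
        IP   = $m.Groups['IP'].Value
        Port = [int]$m.Groups['Port'].Value
        Type = $m.Groups['Type'].Value
    }
}
$rows | Sort-Object Port | Format-Table IP, Port, Type
```

The trade-off is that the pattern has to be written by hand and breaks as soon as the markup changes, whereas the template approach only needs two copied rows.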
<pre class="public-DraftStyleDefault-pre" data-offset-key="132b-0-0" style="margin: 1.4em 0px; padding: 0.88889em; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: auto; background: rgb(246, 246, 246); border-radius: 4px; color: rgb(18, 18, 18); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">
<pre class="Editable-styled" data-block="true" data-editor="4naj0" data-offset-key="132b-0-0" style="margin: 0px; padding: 0px; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: initial; background: rgb(246, 246, 246); border-radius: 0px;">
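function Test-Proxy {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true,
                   ValueFromPipelineByPropertyName = $true)]
        [string]$server,
        [string]$url = "http://www.google.com"
    )
    Write-Host "Testing proxy $server" -NoNewline
    # Route a WebClient request through the proxy under test.
    $WebClient = New-Object System.Net.WebClient
    $WebClient.Proxy = New-Object System.Net.WebProxy($server)
    try {
        $content = $WebClient.DownloadString($url)
        Write-Host " Opened $url" -ForegroundColor Yellow
    }
    catch {
        # This branch was lost in this copy and is reconstructed.
        Write-Host " Unable to open $url" -ForegroundColor Red
    }
}
foreach ($r in $result) {
    $servername = $r.IP + ":" + $r.Port
    Test-Proxy -server $servername -url "http://www.google.com"
}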
</pre>
</pre>
The test shows they are all duds.
Similarly, I (Douzi) have been paying attention to healthy eating lately and wanted to see which foods have a low GI.
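Since most of the proxies turn out to be dead, a version of the check that returns a value and gives up quickly can be handy. The following is a minimal sketch of such a variation, not the function used above: Test-ProxyQuick and its parameter names are hypothetical, and it swaps WebClient for Invoke-WebRequest with its -Proxy and -TimeoutSec parameters.

```powershell
# Hypothetical helper: returns $true if $Url can be fetched through $Server.
function Test-ProxyQuick {
    param(
        [Parameter(Mandatory = $true)][string]$Server,   # "host:port"
        [string]$Url = 'http://www.google.com',
        [int]$TimeoutSec = 5
    )
    try {
        # -Proxy expects a URI, so add the scheme in front of host:port.
        $null = Invoke-WebRequest -Uri $Url -Proxy "http://$Server" `
            -TimeoutSec $TimeoutSec -UseBasicParsing -ErrorAction Stop
        return $true
    }
    catch {
        return $false
    }
}

# 127.0.0.1:1 is used here only as an address where no proxy is expected to listen.
Test-ProxyQuick -Server '127.0.0.1:1' -TimeoutSec 2
```

Piping the scraped rows through Where-Object with a helper like this would keep only the proxies that actually respond.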
<pre class="public-DraftStyleDefault-pre" data-offset-key="6v8j6-0-0" style="margin: 1.4em 0px; padding: 0.88889em; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: auto; background: rgb(246, 246, 246); border-radius: 4px; color: rgb(18, 18, 18); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">
<pre class="Editable-styled" data-block="true" data-editor="4naj0" data-offset-key="6v8j6-0-0" style="margin: 0px; padding: 0px; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: initial; background: rgb(246, 246, 246); border-radius: 0px;">
http://ultimatepaleoguide.com/glycemic-index-food-list
</pre>
</pre>
The goal is to scrape the table on that page.
<pre class="public-DraftStyleDefault-pre" data-offset-key="chnbs-0-0" style="margin: 1.4em 0px; padding: 0.88889em; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: auto; background: rgb(246, 246, 246); border-radius: 4px; color: rgb(18, 18, 18); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">
<pre class="Editable-styled" data-block="true" data-editor="4naj0" data-offset-key="chnbs-0-0" style="margin: 0px; padding: 0px; font-size: 0.9em; word-break: normal; overflow-wrap: normal; white-space: pre; overflow: initial; background: rgb(246, 246, 246); border-radius: 0px;">
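$web2 = 'http://ultimatepaleoguide.com/glycemic-index-food-list/'
$result1 = Invoke-RestMethod -Uri $web2
# $t2 is the template for this table; its definition did not survive in this copy.
ConvertFrom-String -TemplateContent $t2 -InputObject $result1 | Out-GridView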
</pre>
</pre>
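Success!
This approach is very useful, especially when you need to pull list data out of a web page. Of course, if the site itself offers a RESTful interface, you can fetch the content directly as JSON, which is even less work.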
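To show why the JSON route is less work, here is a small sketch. The JSON string is made up and stands in for what a REST endpoint might return; no real endpoint is implied. Invoke-RestMethod performs this conversion automatically when the response is JSON, so ConvertFrom-Json is used here only to demonstrate the result offline.

```powershell
# Made-up JSON standing in for what a REST endpoint might return.
$json = '[{"IP":"203.0.113.10","Port":8080},{"IP":"198.51.100.7","Port":3128}]'

# Invoke-RestMethod performs this conversion automatically for JSON responses.
$proxies = $json | ConvertFrom-Json
$proxies | Sort-Object Port | Format-Table IP, Port
```

No template is needed: the property names come straight from the JSON keys.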