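Preface
Recently I used a web crawler to scrape Mars exploration images from the National Aeronautics and Space Administration, the NASA you often see in movies. About 10,000 images, give or take.
Yeah, no big deal, no big deal.
Once it was done I was a little excited, so I wrote this article. It covers the following: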
- Why I scraped NASA's images
- How I scraped NASA's images (in great detail)
- What I got (high-resolution images)
- What secret I discovered (a real bombshell)
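Why I scraped NASA's images
I'm over 35 now, and I shiver wondering when I'll be laid off.
Every day I think about what I would do if I lost my job. I figured I could try running a self-media channel and ramble to everyone about historical mysteries, the secrets of the universe, and the like. So I set my sights on NASA.
NASA runs all kinds of space exploration missions and publishes the related articles, interviews, images, and videos. It is a rare resource library.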
How I scraped NASA's images (in great detail)
NASA's website is publicly accessible at
https://www.nasa.gov/
Once it opens, the home page shows all sorts of content. There is also a search box in the top right corner, where we type Mars.
After a moment, it lists all kinds of content related to Mars, including an item called Mars Exploration.
Clicking it opens a new page. From there, find Images, which takes us to the target page for our scraping:
https://www.nasa.gov/mission_pages/mars/images/index.html
Scroll down and you will see a big button labeled MORE IMAGES. Click it and you will find that the page content is not loaded with the page itself; it is rendered asynchronously after an API request.
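Press F12 to open the browser's developer tools, repeat the steps above, and watch the network requests. Every click triggers the same kind of request.
Its URL looks very important, so let's start with the request address: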
https://www.nasa.gov/api/2/ubernode/_search?size=24&from=24&sort=promo-date-time%3Adesc&q=((ubernode-type%3Aimage)%20AND%20(topics%3A3152))&_source_include=promo-date-time%2Cmaster-image%2Cnid%2Ctitle%2Ctopics%2Cmissions%2Ccollections%2Cother-tags%2Cubernode-type%2Cprimary-tag%2Csecondary-tag%2Ccardfeed-title%2Ctype%2Ccollection-asset-link%2Clink-or-attachment%2Cpr-leader-sentence%2Cimage-feature-caption%2Cattachments%2Curi
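Note these parameters in it:
size=24&from=24
Clearly, size is the number of images returned per request. Experimenting shows that from is the starting offset of the query, so we can change it to fetch other content.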
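As a rough sketch of how size and from could be set programmatically, the snippet below builds the same search request with Dart's Uri class instead of string concatenation. The helper name buildSearchUri is hypothetical. The query values are copied from the request URL above in decoded form, and Uri percent-encodes them itself, so the exact encoding may differ slightly from what the browser sent.

```dart
// Fields requested through _source_include, decoded from the URL above.
const sourceFields = [
  'promo-date-time', 'master-image', 'nid', 'title', 'topics', 'missions',
  'collections', 'other-tags', 'ubernode-type', 'primary-tag',
  'secondary-tag', 'cardfeed-title', 'type', 'collection-asset-link',
  'link-or-attachment', 'pr-leader-sentence', 'image-feature-caption',
  'attachments', 'uri',
];

/// Hypothetical helper: builds the search URI for one page of results.
Uri buildSearchUri({required int from, int size = 24}) {
  return Uri(
    scheme: 'https',
    host: 'www.nasa.gov',
    path: '/api/2/ubernode/_search',
    queryParameters: {
      'size': size.toString(),
      'from': from.toString(),
      'sort': 'promo-date-time:desc',
      'q': '((ubernode-type:image) AND (topics:3152))',
      '_source_include': sourceFields.join(','),
    },
  );
}

void main() {
  // Third page: skip the first 48 results.
  print(buildSearchUri(from: 48));
}
```

Keeping the parameters in a map makes it harder to break the long query string by accident when only from needs to change.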
Now let's look at the response:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 659,
"max_score": null,
"hits": [{
"_index": "nasa-public",
"_type": "ubernode",
"_id": "450040",
"_score": null,
"_source": {
"image-feature-caption": "Mars 2020 rover underwent an eye exam after several cameras were installed on the rover. ",
"topics": ["3140", "3152"],
"nid": "450040",
"title": "NASA 'Optometrists' Verify Mars 2020 Rover's 20/20 Vision",
"type": "ubernode",
"uri": "/image-feature/jpl/nasa-optometrists-verify-mars-2020-rovers-2020-vision",
"collections": ["4525", "5246"],
"link-or-attachment": "link",
"missions": ["6336"],
"primary-tag": "6336",
"cardfeed-title": "NASA 'Optometrists' Verify Mars 2020 Rover's 20/20 Vision",
"promo-date-time": "2019-08-05T17:49:00-04:00",
"secondary-tag": "3140",
"master-image": {
"fid": "603128",
"alt": "Engineers test cameras on the top of the Mars 2020 rover’s mast and front chassis. ",
"width": "1600",
"id": "603128",
"title": "Engineers test cameras on the top of the Mars 2020 rover’s mast and front chassis. ",
"uri": "public://thumbnails/image/pia23314-16.jpg",
"height": "900"
},
"ubernode-type": "image"
},
"sort": [1565041740000]
}, {
"_index": "nasa-public",
"_type": "ubernode",
"_id": "433172",
"_score": null,
"_source": {
"image-feature-caption": "NASA still hasn't heard from the Opportunity rover, but at least we can see it again.",
"topics": ["3152"],
"nid": "433172",
"title": "Opportunity Emerges in a Dusty Picture",
"type": "ubernode",
"uri": "/image-feature/opportunity-emerges-in-a-dusty-picture",
"collections": ["7628"],
"link-or-attachment": "link",
"missions": ["3639"],
"primary-tag": "3152",
"cardfeed-title": "Opportunity Emerges in a Dusty Picture",
"promo-date-time": "2018-09-26T12:39:00-04:00",
"secondary-tag": "7628",
"master-image": {
"fid": "584263",
"alt": "NASA's Opportunity rover appears as a blip in the center of this square",
"width": "1400",
"id": "584263",
"title": "NASA's Opportunity rover appears as a blip in the center of this square",
"uri": "public://thumbnails/image/pia22549-16.jpg",
"height": "788"
},
"ubernode-type": "image"
},
"sort": [1537979940000]
}]
}
}
The JSON above was too long, so I removed the repeated entries. The real hits array holds 24 entries, the same as the number of images shown on the page. So we can be fairly sure that the information on the page comes from this array.
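The response also reports hits.total (659 in the sample above). Assuming that field is the total number of matching records, as it is in Elasticsearch-style responses, the offsets worth requesting could be derived from it rather than hard-coded. A minimal sketch, with pageOffsets as a hypothetical helper name:

```dart
/// Hypothetical helper: every `from` offset needed to cover `total` results.
List<int> pageOffsets(int total, {int size = 24}) {
  return [for (var from = 0; from < total; from += size) from];
}

void main() {
  final offsets = pageOffsets(659);
  print(offsets.length); // number of requests needed
  print(offsets.last); // offset of the final, partial page
}
```

With a total of 659 and a page size of 24, this gives 28 offsets, the last one being 648.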
Comparing further, the master-image field holds the information we need, including the image address, the image dimensions, and the image title.
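To make the structure concrete, here is a small sketch that decodes a trimmed copy of one hit from the sample response and pulls out the master-image fields. The class MasterImage and the function parseImages are hypothetical names used only for illustration; the field names come from the JSON above.

```dart
import 'dart:convert';

/// Hypothetical value class for the fields of one master-image entry.
class MasterImage {
  final String uri;
  final String title;
  final int width;
  final int height;

  MasterImage(this.uri, this.title, this.width, this.height);

  /// Width and height arrive as strings in the sample, so they are parsed.
  factory MasterImage.fromJson(Map<String, dynamic> json) {
    return MasterImage(
      json['uri'] as String,
      json['title'] as String,
      int.parse(json['width'] as String),
      int.parse(json['height'] as String),
    );
  }
}

/// Extracts the master-image entry of every hit in a response body.
List<MasterImage> parseImages(String body) {
  final decoded = jsonDecode(body) as Map<String, dynamic>;
  final hits = decoded['hits']['hits'] as List<dynamic>;
  return [
    for (final hit in hits)
      MasterImage.fromJson(
          hit['_source']['master-image'] as Map<String, dynamic>),
  ];
}

// A trimmed copy of one hit from the sample response.
const sampleBody = '''
{"hits": {"total": 659, "hits": [{"_id": "433172", "_source": {"master-image": {
  "fid": "584263", "width": "1400", "height": "788",
  "title": "NASA's Opportunity rover appears as a blip in the center of this square",
  "uri": "public://thumbnails/image/pia22549-16.jpg"}}}]}}
''';

void main() {
  for (final image in parseImages(sampleBody)) {
    print('${image.title}: ${image.width}x${image.height} ${image.uri}');
  }
}
```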
Now for the code, in three steps: build the request URL, fetch the content, and download the images.
I used Dart; use whatever language you like.
import 'dart:convert';

import 'package:dio/dio.dart';

// One shared HTTP client for all requests.
final dio = Dio();

Future<void> main() async {
  // The page size is fixed at 24; only the starting offset changes.
  for (int from = 0; from < 24 * 100; from = from + 24) {
    await getPage(from);
  }
}

// Fetch one page of results and download every image on it.
Future<void> getPage(int from) async {
  String url = 'https://www.nasa.gov/api/2/ubernode/_search?size=24&from=' +
      from.toString() +
      '&sort=promo-date-time%3Adesc&q=((ubernode-type%3Aimage)%20AND%20(topics%3A3152))&_source_include=promo-date-time%2Cmaster-image%2Cnid%2Ctitle%2Ctopics%2Cmissions%2Ccollections%2Cother-tags%2Cubernode-type%2Cprimary-tag%2Csecondary-tag%2Ccardfeed-title%2Ctype%2Ccollection-asset-link%2Clink-or-attachment%2Cpr-leader-sentence%2Cimage-feature-caption%2Cattachments%2Curi';
  // Fetch the content.
  var res = await dio.get(url);
  var map = jsonDecode(res.toString());
  // A plain for loop, so every download is awaited before getPage returns.
  // (forEach with an async callback would not wait for the downloads.)
  for (final element in map['hits']['hits'] as List<dynamic>) {
    Uri fileUri = Uri.parse(getUri(element));
    String savePath = getSavePath(element);
    await dio.downloadUri(fileUri, savePath);
    print('Downloaded: ' + savePath);
  }
}

// Build the image download address from the public:// URI.
String getUri(dynamic element) {
  String uri = element['_source']['master-image']['uri'].toString();
  uri = uri.replaceAll('public://',
      'https://www.nasa.gov/sites/default/files/styles/full_width_feature/public/');
  return uri;
}

// Assemble the local save path from the id, fid, title, and file extension.
String getSavePath(dynamic element) {
  String id = element['_id'];
  String fid = element['_source']['master-image']['fid'].toString();
  String title = element['_source']['master-image']['title'].toString();
  String uri = element['_source']['master-image']['uri'].toString();
  String savePath =
      id + '_' + fid + '_' + title.trim() + '.' + uri.split('.').last;
  // Strip characters that would break the file path.
  savePath = savePath.replaceAll('/', '');
  savePath = savePath.replaceAll('\\', '');
  savePath = savePath.replaceAll('"', '');
  savePath = 'images/' + savePath;
  return savePath;
}
The code above is quite simple; anyone with some experience should understand it at a glance.
Let's run it.
Downloaded: images/470436_643588_This is the third color image taken by NASA’s Ingenuity helicopter.jpg
Downloaded: images/470435_643587_This is the second color image taken by NASA’s Ingenuity helicopter.jpg
Downloaded: images/468546_639327_This is the first high-resolution, color image to be sent back by the Hazard Cameras (Hazcams).jpg
Downloaded: images/452007_605784_Danielson Crater on Mars.jpg
Downloaded: images/458478_615132_Gullies on Mars.jpg
Downloaded: images/469416_641582_A field of sand dunes occupies this frosty 5-kilometer diameter crater in the high-latitudes of the northern plains of Mars..jpeg
Downloaded: images/458075_614251_Mars 2020 With Sample Tubes (Artist's Concept).jpg
Downloaded: images/470381_643473_CME.jpg
Downloaded: images/458813_615896_Mars.jpg
Downloaded: images/467026_635309_Illustration of NASA’s Perseverance rover begins its descent through the Martian atmosphere.jpg
Downloaded: images/470438_643591_This black and white image was taken by NASA’s Ingenuity helicopter during its third flight on April 25, 2021.jpg
Downloaded: images/465488_631398_Cliffs in Ancient Ice on Mars.jpg
Downloaded: images/463659_626874_Avalanche on Mars.jpg
Downloaded: images/470251_643164_This image from NASA’s Perseverance rover shows the agency’s Ingenuity Mars Helicopter right after it successfully completed a high-speed spin-up test..jpeg
Downloaded: images/468636_639726_Mars' Jezero Crater.jpg
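Some of the saved names above end with a doubled period before the extension (for example the sand dunes image, which ends in Mars..jpeg) because the title itself ends with a period, and long titles produce very long file names. As a possible refinement, not part of the original script, a stricter name builder might look like the sketch below. The helper name safeFileName, the set of stripped characters, and the 100-character cap are all assumptions chosen for illustration.

```dart
/// Hypothetical helper: turns an image title into a tidy file name.
String safeFileName(String title, String extension, {int maxLength = 100}) {
  // Drop characters that are not allowed in file names on common systems.
  var stem = title.trim().replaceAll(RegExp(r'[\\/:*?"<>|]'), '');
  // Remove trailing periods so the result never contains "..ext".
  stem = stem.replaceAll(RegExp(r'\.+$'), '');
  // Cap the length, then trim any space left at the cut point.
  if (stem.length > maxLength) {
    stem = stem.substring(0, maxLength).trimRight();
  }
  return '$stem.$extension';
}

void main() {
  print(safeFileName('Gullies on Mars.', 'jpeg'));
  print(safeFileName('Mars 2020 With Sample Tubes (Artist\'s Concept)', 'jpg'));
}
```

The id and fid prefixes used by the original script could still be prepended to the returned name to keep file names unique.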
What I got
These images
And these
I now have the images together with their titles, enough to keep me browsing for a month, I reckon.
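What secret I discovered
This image is my favorite. One side is so clear and the other so murky. Why is that? A Martian crack generator?
All right, the real secret is:
NASA's website has no anti-scraping protection at all. Try it yourself if you don't believe me...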