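Anyone reading this probably already knows 500px.com, a site that offers a lot of high-quality images. The site's JavaScript is fairly aggressive, though: you cannot download an image with a plain right-click. You can still dig the image URL out of the HTML with Dev Tools, but that gets tedious when you have to download the images one at a time.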
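Considerations:
- The site has far too many images to load them all at once, so there must be an API behind it. The "Network" tab confirms this: once the page has loaded, an API is called to load further images dynamically.
- An image is only enlarged when clicked and is shown as a small image before that, so each image is available in several sizes. The API's GET parameters show that the size is specified with an array-style parameter.
- PHP would be too easy, so let's write it in Python.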
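To make the array-style GET parameter concrete, here is a minimal sketch using only the standard library. The endpoint and the `ids` and `image_size[]` parameter names are the ones used in the script in this post; `build_photos_query` is a hypothetical helper name, and requesting more than one size in a single call (the value 4 is purely illustrative) is an assumption based on the array syntax, not something verified against the API.

```python
from urllib.parse import urlencode, parse_qs

API_URL = "https://api.500px.com/v1/photos"  # v1 endpoint used by the script


def build_photos_query(ids, sizes):
    """Return the request URL for the given photo ids and image sizes.

    Hypothetical helper: asking for several sizes at once is an assumption
    derived from the array-style parameter name.
    """
    params = {
        "ids": ",".join(ids),         # several ids, comma-separated
        "image_size[]": list(sizes),  # emitted once per requested size
    }
    # doseq=True expands the list into repeated image_size[]=... pairs
    return API_URL + "?" + urlencode(params, doseq=True)


url = build_photos_query(["225733431", "225576949"], [4, 2048])
print(url)
```

The brackets and commas are percent-encoded in the printed URL; decoding the query string gives back the original parameter names and values.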
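Code
The site has since been updated: API/v1 (the first version) has been superseded by API/v2, and every image on the site is now served in Google's WebP format instead of JPEG as before.
Fortunately the v1 endpoint still works for now. I don't know whether v1 will be shut down once v2 is stable, but leaving that aside, v1 is still usable today.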
#!/usr/bin/python3
# file-name : byId2.py
# get picture by id from 500px.com
# by Ray Lee
# raylee.bio#qq.com
###############################
# 20170825
# Still works, but the site has switched every image to webp,
# no wonder the files would not open
###############################
import sys, requests, json, shutil, os
import re
from urllib import parse # added 2017.03.04
# determine the destination folder
if sys.platform == 'linux':
    destFolder = "/mnt/c/Users/xxx/Pictures/500px/"
    logFolder = destFolder+".cache/"
else:
    destFolder = "C:\\Users\\xxx\\Pictures\\500px\\"
    logFolder = destFolder+".cache\\"
def log(cur, ib):
    '''write the photo metadata (a JSON string) to <logFolder>/<id>.log'''
    log_file = logFolder + "{}.log".format(ib)
    # overwrite instead of appending, so the file always holds exactly one
    # JSON document and can be read back with json.loads later
    with open(log_file, "w") as f:
        print(cur, file=f)
def id_existed_or_not(id):
    '''check if the image id has existed in the dest folder
    if yes, skip
    '''
    existedFiles = os.listdir(destFolder)
    for f in existedFiles:
        # saved files are named "<id>_<slug>.<ext>"
        if f.startswith(id+"_"):
            return True
    return False
def id_valid_or_not(id):
    '''is the image id valid (at least 8 characters long)'''
    return len(id) >= 8
def get_photo_urls(ib, headers):
    ''' get urls of the given image ids (one comma-separated string)'''
    try:
        # headers2 is one of the header dicts that are not included in this
        # post; it must be defined before this function is called
        r = requests.get(
            "https://api.500px.com/v1/photos",
            params={"expanded_user_info": True, "ids": ib,
                    "image_size[]": 2048, "include_licensing": True,
                    "include_releases": True, "include_tags": True},
            headers=headers2)
        cont = json.loads(r.text)
    except ConnectionError as e:
        raise e
    else:
        if 'error' in cont: # added 2017.4.30
            print("\n"+ str(cont['status']) +" "+ cont['error'])
            sys.exit(1)
        return cont["photos"]
def save_photo(dat):
    '''save image data to file; dat is [file name, url]'''
    # decode percent-escapes first, then replace characters that are not
    # allowed in file names (decoding last could reintroduce them)
    img_file_name = parse.unquote(dat[0])
    img_file_name = re.sub(r'[:<>"/|?*\\]',"-",img_file_name)
    url = dat[1]
    if url.startswith("/photo"):
        url = 'https://drscdn.500px.org'+url
    # headers3 is another header dict that is not included in this post
    res = requests.get(url, headers=headers3, stream=True)
    if res.status_code == 200:
        res.raw.decode_content = True
        with open(destFolder+img_file_name, mode="wb") as img_file:
            shutil.copyfileobj(res.raw, img_file)
        print("# \033[01;32m>>{}\033[00m".format(img_file_name), file=sys.stderr, flush=True)
    else:
        print("x \033[01;31m<<Error to download\033[00m", file=sys.stderr)
# ids: several ids can be requested at once, separated by commas
# 1. check id's validity
valid_ids = []
if __name__ == '__main__':
    inputs = sys.argv[1:]
    # at least one argument is required
    if len(inputs) < 1:
        print("at least one argument required", file=sys.stderr)
        sys.exit(1)
    for ib in inputs:
        if id_existed_or_not(ib):
            print(ib+" existed, skipped", file=sys.stderr, flush=True)
            continue
        if not id_valid_or_not(ib):
            print(ib+" invalid, skipped", file=sys.stderr, flush=True)
            continue
        valid_ids.append(ib)
    if len(valid_ids) < 1:
        print("no valid id")
        sys.exit(1)
    # 1.1 check if the url has been cached under .cache folder
    cached_ids = []
    cachedFiles = os.listdir(logFolder)
    for ib in list(valid_ids):  # iterate over a copy, valid_ids is modified below
        if ib+".log" in cachedFiles:
            cached_ids.append(ib)
            valid_ids.remove(ib)
    # 2, fetch url of each id that is not cached (skip the request if none are left)
    urlhub = get_photo_urls(','.join(valid_ids), headers) if valid_ids else {}
    # 3, save images
    # 3.1 add cached urls
    for x in cached_ids:
        with open(logFolder+x+".log") as y:
            urlhub[x] = json.loads(y.read())
    # 3.2 iterate the url hub and save image stream
    for i in urlhub:
        cur = urlhub[i]
        name = cur["name"]
        ext = 'webp'  # the site now serves webp; cur["image_format"] is no longer used
        url = cur["image_url"][-1]
        suri = cur["url"].split('/')[-1]
        fname = "{}_{}.{}".format(i, suri, ext)
        log(json.dumps(cur), i)
        save_photo([fname, url])
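The `.cache` handling in steps 1.1 and 3.1 boils down to writing one JSON document per id and reading it back on a later run. The following is a minimal sketch of that round trip, not the script's own code: `write_cache` and `read_cache` are hypothetical names, a temporary directory stands in for `logFolder`, and the metadata dict is made-up sample data.

```python
import json
import os
import tempfile


def write_cache(cache_dir, photo_id, meta):
    """Store one photo's metadata as <cache_dir>/<photo_id>.log (overwriting)."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, "{}.log".format(photo_id))
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(meta, fh)
    return path


def read_cache(cache_dir, photo_id):
    """Return the cached metadata, or None if nothing usable is cached."""
    path = os.path.join(cache_dir, "{}.log".format(photo_id))
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = os.path.join(tmp, ".cache")
    sample = {"name": "demo", "image_url": ["u1", "u2"]}  # made-up sample data
    write_cache(cache_dir, "225733431", sample)
    hit = read_cache(cache_dir, "225733431")   # cached a moment ago
    miss = read_cache(cache_dir, "225576949")  # never cached
    print(hit, miss)
```

Treating an unreadable file like a cache miss means a damaged log only costs one extra API request instead of aborting the run.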
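Running
- Set the folder where the images should be saved (destFolder in the script).
- id_valid_or_not requires an id to be at least 8 digits long, but some older images may have much shorter ids. I only added the check to guard against copy-paste mistakes when downloading, so you can leave it out.
- The images are now in WebP format, which browsers can open by default; "browsers" here means web browsers other than IE and Edge.
- headers, headers2 and headers3 contain personal information and are not included. Email me if you need them, but I will not publish them.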
python3 byId2.py 225733431 225576949
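Since the script always saves files with a `.webp` extension, it can be handy to check what a downloaded file really contains. The sketch below is only an illustration and not part of the script: `looks_like_webp` and `guess_extension` are hypothetical helpers that inspect the first bytes of a file (a WebP file is a RIFF container whose bytes 8 to 11 read `WEBP`, and a JPEG file starts with `FF D8 FF`).

```python
def looks_like_webp(data):
    """True if data starts with a RIFF container header tagged WEBP."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP"


def guess_extension(data):
    """Guess a file extension from the leading bytes of an image file."""
    if looks_like_webp(data):
        return "webp"
    if data[:3] == b"\xff\xd8\xff":  # JPEG start-of-image marker
        return "jpg"
    return "bin"  # unknown: keep the bytes, but do not claim a format


# Synthetic headers for demonstration, not real image files
webp_head = b"RIFF" + (1234).to_bytes(4, "little") + b"WEBP" + b"VP8 "
jpeg_head = b"\xff\xd8\xff\xe0" + b"\x00" * 8

print(guess_extension(webp_head), guess_extension(jpeg_head), guess_extension(b"hello"))
```

Reading the first 12 bytes of a saved file and passing them to `guess_extension` is enough to tell whether the `.webp` name matches the content.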