Dear all:
- I wrote a simple Python script to download images from Google Images.
- It can download all the images shown for a specific search term on the Google Images results page.
- How to set up the environment and how to use the script are described below.
- I hope it will be of some help for Sephra's project.
- As J also uses Python, I think he will be able to understand the script and set up the environment.
1. Python Version and Libraries
1.1 Version
- Python 3.6 with anaconda3
- Chrome browser
1.2 Libraries
- selenium: can be installed easily with pip
- Reference: (https://github.com/SeleniumHQ/selenium)
- Note: As the Google Image Search API has not been usable since 2011, many developers now use this library to write web page crawlers.
pip install selenium
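Before running the script, it can help to confirm that the interpreter and the library match the versions listed above. The following is a minimal sketch; the function name check_environment is made up for illustration and is not part of the script.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 6), package="selenium"):
    """Return (python_ok, package_found) for the current interpreter.

    check_environment is a hypothetical helper, not part of the script.
    """
    # Compare only (major, minor) against the required minimum version.
    python_ok = sys.version_info[:2] >= min_version
    # find_spec returns None when a top-level package cannot be imported.
    package_found = importlib.util.find_spec(package) is not None
    return python_ok, package_found


if __name__ == "__main__":
    python_ok, package_found = check_environment()
    print("Python 3.6 or newer:", python_ok)
    print("selenium importable:", package_found)
```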
2. Full Code
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.action_chains import ActionChains
import os
import sys
import time
import urllib.request

# Build the search URL from the search term given on the command line
Url_Behind = "https://www.google.co.jp/search?q="
Url_SearchTerm = sys.argv[1]  # search term
Url_After = "&newwindow=1&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjTqs6EwPDXAhXCa7wKHQzNAn0Q_AUICigB&biw=1920&bih=1069"
searchUrl = Url_Behind + Url_SearchTerm + Url_After
print(searchUrl)

# Path to the Chrome driver; see section 3.1 for how to download and configure it
path = 'D:\\DATA\\chromedriver.exe'

# Directory where the images are saved; you can set it yourself
save_dir = "D:\\DATA\\PicCrawler\\"
os.makedirs(save_dir, exist_ok=True)

# Start Chrome
driver = webdriver.Chrome(path)
# Maximize the browser window, because each pass only sees the images inside the window
driver.maximize_window()
# Start the search
driver.get(searchUrl)

pos = 0  # scroll position
m = 0  # number of the image being downloaded
img_url_list = []  # URLs that have already been downloaded
for i in range(20):
    # Scroll down so that the page shows more images
    pos += i * 500
    js = "document.documentElement.scrollTop=%d" % pos
    driver.execute_script(js)
    time.sleep(1)
    # Find all images currently on the page (Selenium 3 style call)
    for element in driver.find_elements_by_tag_name("img"):
        # Find the image download URL
        img_url = element.get_attribute('src')
        # Keep only URLs whose host starts with "encrypted" (the characters
        # right after "https://") and that have not been saved yet
        if isinstance(img_url, str) and img_url[8:17] == "encrypted" and img_url not in img_url_list:
            img_url_list.append(img_url)
            m += 1
            # File name of the saved image; you can set it yourself
            filename = save_dir + Url_SearchTerm + str(m) + ".jpg"
            urllib.request.urlretrieve(img_url, filename)
            print("Save Picture %s" % filename)
    # Click the "show more images" button so that the page shows further images.
    # The click is skipped when the button is missing or cannot be clicked yet.
    try:
        click_btn = driver.find_element_by_id('smb')
        ActionChains(driver).click(click_btn).perform()
    except WebDriverException:
        pass

# Close the web page
driver.close()
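The script puts the search term into the URL as it is, so a term containing spaces or characters such as "&" may not produce the intended query. A minimal sketch of one way to encode the term first, using urllib.parse.quote_plus from the standard library. The function name build_search_url and the shortened suffix "&tbm=isch" are assumptions made for illustration; in the script the suffix would be Url_After.

```python
from urllib.parse import quote_plus


def build_search_url(term,
                     prefix="https://www.google.co.jp/search?q=",
                     suffix="&tbm=isch"):
    """Return prefix + URL-encoded term + suffix.

    build_search_url is a hypothetical helper; the default suffix is a
    shortened stand-in for Url_After in the script.
    """
    # quote_plus turns spaces into "+" and escapes reserved characters
    # such as "&" so they stay part of the search term.
    return prefix + quote_plus(term) + suffix


if __name__ == "__main__":
    print(build_search_url("red panda"))
```

In the script this would amount to building searchUrl from Url_Behind, the encoded Url_SearchTerm and Url_After.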
3. Usage
3.1 Download Chrome driver
- Download URL: (https://chromedriver.storage.googleapis.com/index.html?path=2.33/)
- Put chromedriver.exe in the directory shown in the code, or in any directory you like.
path = 'D:\\DATA\\chromedriver.exe'
- If you use a different directory, change this path in the script accordingly.
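If the driver is not at the configured path, the script fails when it tries to start Chrome. A minimal sketch of a check that could be run first. The helper name find_driver and the second candidate path are made up for illustration.

```python
from pathlib import Path


def find_driver(candidates):
    """Return the first candidate that is an existing file, or None.

    find_driver is a hypothetical helper, not part of the script.
    """
    for candidate in candidates:
        driver_path = Path(candidate)
        if driver_path.is_file():
            return driver_path
    return None


if __name__ == "__main__":
    # The first path is the one used in the script; the second is only
    # an example of an alternative location.
    found = find_driver(['D:\\DATA\\chromedriver.exe', 'chromedriver.exe'])
    if found is None:
        print("chromedriver.exe not found, please check the path")
    else:
        print("Using driver at", found)
```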
3.2 Install selenium library
pip install selenium
3.3 Start Download
- Run the script with the term you want to search for as its argument.
- For instance, if you want to download images of soccer, run it like this:
python filename.py soccer
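The script reads only the first argument, so a search term of several words has to be quoted on the command line, and the term is also used as the start of each file name. A minimal sketch of how the arguments could be joined and the file name made safe for Windows. The helper names parse_search_term and image_filename are made up for illustration.

```python
import re
import sys


def parse_search_term(argv):
    """Join all arguments after the script name into one search term."""
    if len(argv) < 2:
        raise SystemExit("usage: python filename.py <search term>")
    return " ".join(argv[1:])


def image_filename(term, index):
    """Return a file name such as soccer1.jpg for the given term and number."""
    # Replace whitespace and characters that Windows forbids in file names.
    safe_term = re.sub(r'[\\/:*?"<>|\s]+', "_", term)
    return "%s%d.jpg" % (safe_term, index)


if __name__ == "__main__":
    term = parse_search_term(sys.argv)
    print(term, "->", image_filename(term, 1))
```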
4. Discussion
- When you search for a term on Google Images, the page usually shows fewer than 1000 images (I don't know why). The script can only download the images that are shown, so if we can make the page show more images for a search term, we can download more images.
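The reason the script is limited to the shown images is that it only collects the src values of the img elements currently on the page and skips the ones it has already saved. A minimal sketch of that collection step without a browser. The helper name new_thumbnail_urls and the example URLs are made up for illustration, and the prefix test is a simplified form of the check used in the script.

```python
def new_thumbnail_urls(srcs, seen):
    """Return the thumbnail URLs in srcs that are not in seen yet.

    Every returned URL is also added to seen, so a later pass over the
    same page content returns nothing new.
    """
    fresh = []
    for src in srcs:
        # src can be None or a data: URI; only https://encrypted... URLs count.
        if isinstance(src, str) and src.startswith("https://encrypted") and src not in seen:
            seen.add(src)
            fresh.append(src)
    return fresh


if __name__ == "__main__":
    seen = set()
    # Made-up src values standing in for two passes over the page.
    first_pass = ["https://encrypted-tbn0.example/a", None, "data:image/gif;base64,AAAA"]
    second_pass = ["https://encrypted-tbn0.example/a", "https://encrypted-tbn0.example/b"]
    print(new_thumbnail_urls(first_pass, seen))
    print(new_thumbnail_urls(second_pass, seen))
```

Once the page stops showing new images, every further pass returns an empty list, which is why more scrolling alone does not help.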