Three different methods in data scraping: six.urllib, beautifulsoup, RE-xpath
Just writing down what I've learned about web data scraping so that I won't forget everything and have to start all over next time I need the technique.
To make the same code work under Python 2.x more easily, use the "six" library:
from six.moves import urllib
Typical request format would be:
url = ...
hdr = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36', 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'}
The User-Agent string describes your browser and operating system, so it will vary from machine to machine; copy whatever your own browser sends.
req = urllib.request.Request(url, headers=hdr)
doc = urllib.request.urlopen(req).read()
This gives you the raw HTML of the page (a byte string under Python 3, a plain str under Python 2).
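Putting it together, a minimal end-to-end fetch might look like the sketch below (the example URL, the shortened header, and the UTF-8 decode are my own assumptions; the real page may declare a different charset):
from six.moves import urllib

url = 'http://example.com'  # hypothetical URL, substitute the page you want to scrape
hdr = {'User-Agent': 'Mozilla/5.0'}  # shortened header just for the sketch

req = urllib.request.Request(url, headers=hdr)
doc = urllib.request.urlopen(req).read()  # raw bytes of the page
html = doc.decode('utf-8')  # decode to text before parsing (assumes UTF-8)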
Now it comes down to which parsing tool you prefer: beautifulsoup, regular expressions, and so on. What I have tried is RE-xpath, RE-pattern matching, and beautifulsoup.
For RE-xpath:
IP_ADDRESS_PATH = '//td[2]/text()'
PORT_ADDRESS_PATH = '//tr/td[3]/text()'
You need to understand the HTML file and know how to construct the XPath to the nodes you want to extract. So the IP_ADDRESS_PATH above says: starting from anywhere in the document, find the text of every second td (XPath indices start at 1), and PORT_ADDRESS_PATH does the same for the third td of each tr.
IP_list = list(set(re.findall(IP_ADDRESS_PATH, doc)))
Then use the re.findall() method to collect all the matches. set() makes the elements unique and list() turns the result back into a list.
** This wasn't working for me, even though I'm pretty sure the XPath was constructed correctly (it was verified by an online HTML/XPath tester). In hindsight, the reason is that re.findall() interprets its first argument as a regular expression, not as XPath, so an XPath has to be evaluated by an actual HTML parser; see the sketch below.
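If you do want to stick with XPath, a minimal sketch using lxml to evaluate the same paths (assuming the page really is an HTML table of proxies; lxml is an extra dependency, not part of the original code):
from lxml import html as lxml_html  # pip install lxml

IP_ADDRESS_PATH = '//td[2]/text()'
PORT_ADDRESS_PATH = '//tr/td[3]/text()'

tree = lxml_html.fromstring(doc)  # parse the fetched HTML into an element tree
IP_list = list(set(tree.xpath(IP_ADDRESS_PATH)))      # unique IP strings
port_list = list(set(tree.xpath(PORT_ADDRESS_PATH)))  # unique port strings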
For RE-pattern match:
prep = re.compile(r"""<tr\s.*>….\n....</tr>""", re.VERBOSE)
\s matches a whitespace character in the regex, \n matches a newline, and .* matches anything. Together they describe the pattern of the specific block that is repeated many times and contains what you're interested in.
proxy_list = prep.findall(doc)
proxy_list = list(set(proxy_list))
proxy_list now contains all the HTML blocks that match the pattern.
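A more concrete variant of the same idea is to capture the fields directly instead of whole <tr> blocks. The sketch below assumes each row holds an IP and a port in adjacent <td> cells, which may not match the real page's markup:
import re

# Hypothetical row layout: <td>1.2.3.4</td><td>8080</td>
ROW_PATTERN = re.compile(
    r'<td>\s*(\d{1,3}(?:\.\d{1,3}){3})\s*</td>\s*'  # capture the IP address
    r'<td>\s*(\d{1,5})\s*</td>',                    # capture the port
    re.IGNORECASE)

pairs = list(set(ROW_PATTERN.findall(doc)))  # [(ip, port), ...]; decode doc to str first on Python 3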
For beautifulsoup:
You still need six.moves.urllib to open the URL, plus an import for BeautifulSoup itself.
from bs4 import BeautifulSoup as bs
req = urllib.request.Request(url, headers=hdr)
doc = urllib.request.urlopen(req).read()
soup = bs(doc, 'lxml')
So now you've opened up the HTML file and can start parsing with the beautiful beautifulsoup.
list1 = [tr.find_all('td') for tr in soup.find_all('tr')]
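From here, getting the actual strings out of those td lists is just another comprehension. A sketch, again assuming the IP sits in the second column and the port in the third (the column indices are my guess, not verified against the page):
# Skip short rows (headers, ads), then read the second <td> as the IP and the third as the port.
proxies = [(tds[1].get_text(strip=True), tds[2].get_text(strip=True))
           for tds in list1 if len(tds) >= 3]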