use case - generic spiders have useful methods for common crawling actions, such as following all links on a site based on certain rules, crawling from Sitemaps, or parsing an XML/CSV feed
CrawlSpider
rules - a list of one or more Rule objects that define the crawling behavior
parse_start_url
- a method that can be overridden to parse the initial responses; it must return an item object, a Request object, or an iterable containing any of them (see the sketch below)
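A minimal sketch of overriding parse_start_url; the spider name, start URL, and URL pattern are hypothetical, for illustration only:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StartUrlSpider(CrawlSpider):
    # hypothetical name/URL for illustration only
    name = "start_url_demo"
    start_urls = ["http://example.com"]

    rules = (
        Rule(LinkExtractor(allow=r"/items/"), callback="parse_item"),
    )

    def parse_start_url(self, response):
        # called for the responses to start_urls; must return (or yield)
        # item objects, Request objects, or an iterable containing any of them
        yield {"landing_title": response.css("title::text").get()}

    def parse_item(self, response):
        yield {"url": response.url}
```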
Rules
scrapy.spiders.Rule
can declare multiple rules for followed links; always add a trailing comma at the end (with only one Rule, the comma is what keeps rules a tuple)
- link_extractor - defines how links will be extracted from each crawled page; allow/deny take regex patterns to keep or ignore matching URLs, and allow_domains/deny_domains do the same for whole domains
- callback - the method to call on each response extracted with this rule; if no callback is specified, follow defaults to True. Avoid using parse as the callback, since CrawlSpider reserves the parse method to implement its rule logic.
- follow - a boolean specifying whether links should be followed from each response extracted with this rule; Scrapy filters out duplicate links by default
beware that start_urls should not contain a trailing slash - a start URL without the trailing slash works, while the same URL with a trailing slash does not
- process_links - a callable (or the name of a spider method) that receives each list of extracted links, mainly used for filtering; see the sketch after this list
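A sketch that pulls the Rule parameters together; the domain, URL patterns, and the process_links filter are illustrative assumptions, not from these notes:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CatalogueSpider(CrawlSpider):
    # hypothetical names/URLs/patterns for illustration only
    name = "catalogue"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]  # no trailing slash

    rules = (
        Rule(
            # allow/deny are regex patterns matched against the extracted URLs
            LinkExtractor(allow=r"/catalogue/", deny=r"/login"),
            callback="parse_item",   # not "parse": CrawlSpider reserves that name
            follow=True,             # keep following links from matched pages
            process_links="drop_tracking_links",
        ),
        # no callback here, so follow defaults to True (pagination-style rule)
        Rule(LinkExtractor(allow=r"/page/\d+")),  # trailing comma: with a single Rule it keeps rules a tuple
    )

    def drop_tracking_links(self, links):
        # process_links hook: filter the extracted Link objects before they are crawled
        return [link for link in links if "utm_" not in link.url]

    def parse_item(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }
```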