python - Scrapy finishes crawl prematurely
I've had a crawler running for months, but for the last few weeks it has been finishing prematurely, after only a few crawled pages out of the tens of thousands of pages that should be crawled.
It's a SitemapSpider with the following sitemap_rules:
from scrapy.spiders import SitemapSpider

class FooSitemapSpider(SitemapSpider):
    name = "foo"
    sitemap_urls = ["http://www.foo.se/sitemap.xml"]
    sitemap_rules = [
        ('/bostad/', 'parse_house')
    ]
All the URLs I want to crawl look like this:
http://www.foo.se/bostad/address-1-259413
http://www.foo.se/bostad/address-2-275754
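(As far as I understand, SitemapSpider treats the first element of each sitemap_rules entry as a regular expression and sends every sitemap URL that matches it to the named callback, so URLs like the two above should match the '/bostad/' rule. Roughly, the matching is equivalent to:)

import re

url = "http://www.foo.se/bostad/address-1-259413"
# the rule pattern is applied with re.search against each <loc> from the sitemap
print(bool(re.search('/bostad/', url)))  # True -> would be routed to parse_house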
There are approximately 50,000+ pages that should be crawled, but instead the spider stops after only a handful of crawled pages, without any error. It says:
2015-06-25 19:37:38 [scrapy] INFO: Closing spider (finished)
2015-06-25 19:37:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 106313,
 'downloader/request_count': 310,
 'downloader/request_method_count/GET': 310,
 'downloader/response_bytes': 2809108,
 'downloader/response_count': 310,
 'downloader/response_status_count/200': 309,
 'downloader/response_status_count/404': 1,
 'file_count': 21,
 'file_status_count/downloaded': 21,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 25, 17, 37, 38, 154000),
 'item_scraped_count': 4,
 'log_count/DEBUG': 1717,
 'log_count/INFO': 9,
 'log_count/WARNING': 8,
 'request_depth_max': 2,
 'response_received_count': 310,
 'scheduler/dequeued': 289,
 'scheduler/dequeued/memory': 289,
 'scheduler/enqueued': 289,
 'scheduler/enqueued/memory': 289,
 'start_time': datetime.datetime(2015, 6, 25, 17, 35, 51, 868000)}
2015-06-25 19:37:38 [scrapy] INFO: Spider closed (finished)
I've tried changing USER_AGENT, DOWNLOAD_DELAY, and the server/IP I run the spider from, to make sure it's not the target site blocking my requests.
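For reference, the settings I've been varying look roughly like this (placeholder values, not my exact ones):

# settings.py -- placeholder values, not the exact ones from my project
USER_AGENT = 'Mozilla/5.0 (compatible; foo-crawler)'
DOWNLOAD_DELAY = 2.0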
Any ideas? Any suggestions on how I should debug this? It's difficult since there are no errors.
Here is the complete log of a crawl with 0 errors: http://pastebin.com/psqx6bck
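One thing I've been thinking of trying, in case the sitemap itself only exposes a handful of /bostad/ URLs rather than the spider dropping them, is to parse the sitemap outside Scrapy and count the matching entries. A rough sketch, assuming the requests library is available and that sitemap.xml is a plain urlset rather than a sitemap index (it would need an extra loop over child sitemaps if it is):

import re
import requests
from scrapy.utils.sitemap import Sitemap

body = requests.get("http://www.foo.se/sitemap.xml").content
sm = Sitemap(body)                      # iterates the <url> entries as dicts
urls = [entry['loc'] for entry in sm]
matching = [u for u in urls if re.search('/bostad/', u)]
print(sm.type, len(urls), "entries,", len(matching), "matching /bostad/")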