Arachnado: a visual, point-and-click crawling tool built on Scrapy
Preview and where to find the project
First, the Git repository: https://github.com/TeamHG-Memex/arachnado
The library went live in August last year; both the author's code and the overall UI are quite good.
Here is a demo recording, downloaded from YouTube and re-uploaded to Youku.
The overall result really is impressive: it is built on Tornado for efficiency, wraps a set of scrapyd-webservice-style APIs, and stores all data in MongoDB, so you are free to customize it. The one drawback is that, for now, the crawl logic can only be tailored by editing the spider code itself; fortunately that code is not complicated, so you can study it and wrap some APIs of your own on top.
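Since everything runs on Tornado, exposing an endpoint of your own is mostly a matter of writing a handler. The sketch below is only illustrative; the handler name, route, and response payload are my own assumptions, not part of arachnado:

```python
import tornado.ioloop
import tornado.web


class CrawlStatusHandler(tornado.web.RequestHandler):
    """Hypothetical handler returning a static status payload."""
    def get(self):
        # A real integration would read state from the crawler process
        # or from MongoDB instead of returning a constant.
        self.write({"status": "ok", "jobs": []})


def make_app():
    return tornado.web.Application([
        (r"/api/status", CrawlStatusHandler),
    ])


if __name__ == "__main__":
    app = make_app()
    app.listen(8888)  # any free port
    tornado.ioloop.IOLoop.current().start()
```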
Customizing the spider: can any website be crawled?
https://github.com/TeamHG-Memex/arachnado/blob/master/arachnado/spider.py
```python
import datetime
import logging

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse

# Helpers shipped with arachnado (the import path may differ between versions)
from arachnado.utils import add_scheme_if_missing, get_netloc


class ArachnadoSpider(scrapy.Spider):
    """
    A base spider that contains common attributes and utilities for all
    Arachnado spiders
    """
    crawl_id = None
    domain = None
    motor_job_id = None

    def __init__(self, *args, **kwargs):
        super(ArachnadoSpider, self).__init__(*args, **kwargs)
        # don't log scraped items
        logging.getLogger("scrapy.core.scraper").setLevel(logging.INFO)

    def get_page_item(self, response, type_='page'):
        # Every crawled page becomes a plain dict that the pipeline
        # later writes to MongoDB.
        return {
            'crawled_at': datetime.datetime.utcnow(),
            'url': response.url,
            'status': response.status,
            'headers': response.headers,
            'body': response.body_as_unicode(),
            'meta': response.meta,
            '_type': type_,
        }


class CrawlWebsiteSpider(ArachnadoSpider):
    """
    A spider which crawls all the website.
    To run it, set its ``crawl_id`` and ``domain`` arguments.
    """
    name = 'crawlwebsite'
    custom_settings = {
        'DEPTH_LIMIT': 10,
    }

    def __init__(self, *args, **kwargs):
        super(CrawlWebsiteSpider, self).__init__(*args, **kwargs)
        self.start_url = add_scheme_if_missing(self.domain)

    def start_requests(self):
        self.logger.info("Started job %s#%d for domain %s",
                         self.motor_job_id, self.crawl_id, self.domain)
        yield scrapy.Request(self.start_url, self.parse_first,
                             dont_filter=True)

    def parse_first(self, response):
        # If there is a redirect in the first request, use the target domain
        # to restrict crawl instead of the original.
        self.domain = get_netloc(response.url)
        self.crawler.stats.set_value('arachnado/start_url', self.start_url)
        self.crawler.stats.set_value('arachnado/domain', self.domain)

        allow_domain = self.domain
        if self.domain.startswith("www."):
            allow_domain = allow_domain[len("www."):]

        self.get_links = LinkExtractor(
            allow_domains=[allow_domain]
        ).extract_links

        for elem in self.parse(response):
            yield elem

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            self.logger.info("non-HTML response is skipped: %s" % response.url)
            return
        yield self.get_page_item(response)
        # Follow every in-domain link found on the page.
        for link in self.get_links(response):
            yield scrapy.Request(link.url, self.parse)
```
In effect it only pulls out pages and the ordinary links between them; it does not parse any site-specific data.
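If you want structured fields rather than raw pages, you have to edit or subclass the spider yourself. A minimal sketch of that idea, where the class name, selectors, and extra field names are my own assumptions, not part of arachnado:

```python
import scrapy
from scrapy.http import HtmlResponse

# Import path assumed from the repository layout shown above.
from arachnado.spider import CrawlWebsiteSpider


class ArticleSpider(CrawlWebsiteSpider):
    """Hypothetical subclass that adds a few named fields per page."""
    name = 'article'

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        # Keep the generic page item, but enrich it with specific fields.
        item = self.get_page_item(response, type_='article')
        item['title'] = response.css('title::text').extract_first()
        item['h1'] = response.css('h1::text').extract_first()
        yield item
        # self.get_links is set up by parse_first in the parent class.
        for link in self.get_links(response):
            yield scrapy.Request(link.url, self.parse)
```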
How the data is processed
All items go through a pipeline; you can read the code here:
https://github.com/TeamHG-Memex/arachnado/blob/master/arachnado/motor_exporter/pipelines.py
The items are written into MongoDB; below is roughly what a record looks like once you pull it back out.
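A quick way to inspect those records is to query MongoDB directly. A minimal sketch using pymongo; the connection string, database, and collection names are assumptions, so check your own arachnado configuration for the actual values:

```python
from pymongo import MongoClient

# Connection string, database and collection names are assumptions;
# adjust them to match your arachnado settings.
client = MongoClient("mongodb://localhost:27017")
items = client["arachnado"]["items"]

# Print a few crawled pages.
for doc in items.find({"_type": "page"}).limit(5):
    print(doc["url"], doc["status"], doc["crawled_at"])
```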
- Original author: 大鱼
- Original link: https://brucedone.com/archives/496/
- Copyright notice: This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For non-commercial reposts, please credit the source (author and original link); for commercial use, please contact the author for permission.