In the group chat I'm often asked how to enable different pipelines for different spiders in the same project. The usual answer is to branch on one of the method's parameters.

def process_item(self, item, spider):
    # self._client is assumed to be a pymongo Collection; ideally the index
    # would be created once in open_spider rather than on every item
    self._client.create_index([('msg_id', pymongo.DESCENDING)], background=True)
    # upsert by msg_id: insert the item, or overwrite the existing document
    self._client.update_one(filter={'msg_id': item['msg_id']},
                            update={'$set': dict(item)}, upsert=True)
    return item

This method takes two important parameters: item and spider. Set a breakpoint and you can inspect everything the spider object carries.

The simplest approach is to branch on the spider's name:

if spider.name == 'spider_1':
    pass  # spider_1-specific processing
elif spider.name == 'spider_2':
    pass  # spider_2-specific processing

This does work, but it has a downside: if the project has many spiders and the pipeline handles complex business logic, everything gets tangled together. It is ugly to read, and worse, it makes the code hard to maintain and update.

So, can different spiders enable different configurations? Obviously, yes.

First, look at this snippet:

    class MySpider(scrapy.Spider):
        name = 'myspider'
    
        custom_settings = {
            'SOME_SETTING': 'some value',
        }

The custom_settings attribute is extremely handy: whatever you configure there overrides the corresponding values in settings.py. So how are the settings loaded?
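For the pipeline problem, this means each spider can carry its own ITEM_PIPELINES entry, so only the pipelines it needs are enabled. A minimal sketch (the project path myproject and the pipeline class names are hypothetical):

```python
import scrapy

class SpiderOne(scrapy.Spider):
    name = 'spider_1'
    # only MongoPipeline runs for this spider (hypothetical class/path)
    custom_settings = {
        'ITEM_PIPELINES': {'myproject.pipelines.MongoPipeline': 300},
    }

class SpiderTwo(scrapy.Spider):
    name = 'spider_2'
    # only CsvPipeline runs for this spider (hypothetical class/path)
    custom_settings = {
        'ITEM_PIPELINES': {'myproject.pipelines.CsvPipeline': 300},
    }
```

With this, process_item no longer needs any spider.name checks, because each pipeline only ever receives items from the spiders that enabled it.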

    default_settings → project_settings (settings.py) → custom_settings

    Loaded from left to right, so values on the right override those on the left.
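That merge order can be sketched with plain dicts (illustrative only: Scrapy actually merges through its Settings class with numeric priorities, not dict updates):

```python
# Three layers of settings, lowest priority first (values are made up)
default_settings = {'DOWNLOAD_DELAY': 0, 'RETRY_TIMES': 2}
project_settings = {'DOWNLOAD_DELAY': 2}   # from settings.py
custom_settings = {'DOWNLOAD_DELAY': 5}    # on the spider class

final = {}
for layer in (default_settings, project_settings, custom_settings):
    final.update(layer)  # later layers override earlier ones

print(final)  # {'DOWNLOAD_DELAY': 5, 'RETRY_TIMES': 2}
```

Keys that no layer overrides (RETRY_TIMES here) simply keep their default value.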

The result is the final merged configuration. That is why we see settings we never wrote in our own settings.py — for example, these log lines at startup:

    2016-12-29 16:55:34 [scrapy] INFO: Scrapy 1.1.1 started (bot: weibo)
    2016-12-29 16:55:34 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'weibo.spiders', 'CONCURRENT_REQUESTS_PER_DOMAIN': 16, 'CONCURRENT_REQUESTS': 32, 'SPIDER_MODULES': ['weibo.spiders'], 'AUTOTHROTTLE_START_DELAY': 10, 'CONCURRENT_REQUESTS_PER_IP': 16, 'BOT_NAME': 'weibo', 'RETRY_TIMES': 5, 'COOKIES_ENABLED': False, 'RETRY_HTTP_CODES': [500, 502, 503, 504, 408, 403, 414], 'TELNETCONSOLE_ENABLED': False, 'AUTOTHROTTLE_ENABLED': True, 'DOWNLOAD_DELAY': 2}
    2016-12-29 16:55:34 [py.warnings] WARNING: /Users/brucedone/anaconda/envs/scrapy3/lib/python2.7/site-packages/scrapy/utils/deprecate.py:156: ScrapyDeprecationWarning: `scrapy.telnet.TelnetConsole` class is deprecated, use `scrapy.extensions.telnet.TelnetConsole` instead
      ScrapyDeprecationWarning)
    
    2016-12-29 16:55:34 [scrapy] INFO: Enabled extensions:
    ['scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.throttle.AutoThrottle']
    2016-12-29 16:55:35 [scrapy] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2016-12-29 16:55:35 [scrapy] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2016-12-29 16:55:35 [scrapy] INFO: Enabled item pipelines:

In fact, the defaults ship inside the Scrapy package itself, in scrapy/settings/default_settings.py.