Table of Contents

  • Origin
  • Preparation
  • Analysis
  • Practice
  • Summary

Origin

There are now plenty of convenient cloud storage services, such as Alibaba Cloud OSS, Amazon S3, and Azure Blob Storage. For large volumes of files or images, cloud storage is extremely convenient. In this post we will work out how to use a Scrapy pipeline to upload the images we download directly to Alibaba Cloud OSS.

Code: https://github.com/BruceDone/scrapy_demo/tree/master/douban_oss

Background reading

Preparation

  • Python: 2.7
  • Dependencies (see the install command after this list)
    • Scrapy==1.5.0
    • oss2==2.4.0
    • Pillow==5.1.0
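
These can be installed with pip, pinning the versions listed above:

pip install Scrapy==1.5.0 oss2==2.4.0 Pillow==5.1.0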

Analysis

When we get a new requirement, we can first check whether an existing project already contains a reference implementation. We know the Scrapy project has long shipped support for S3 (Amazon's cloud storage), so we can go straight to the Scrapy source code.

Looking at ImagesPipeline, we find that when the class is instantiated, an S3 store object is initialized:

https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/images.py

ImagesPipeline inherits from FilesPipeline, so we dig one level deeper to see how this S3 store class is implemented.

OK, having walked through the source, the conclusion is simple: if we implement an AliOSS store with the same interface, we can persist images to our own cloud service.
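
For orientation, the store classes in scrapy/pipelines/files.py satisfy a small duck-typed contract. Here is a minimal sketch of that contract (the method names match the Scrapy source; the class name and bodies below are illustrative, not Scrapy's code):

class DummyStore(object):
    """Illustrative only: the interface FilesPipeline expects of a store."""

    def persist_file(self, path, buf, info, meta=None, headers=None):
        # Receives the relative file path and an in-memory buffer with the
        # downloaded bytes; the store decides where and how to save them.
        raise NotImplementedError

    def stat_file(self, path, info):
        # Returns metadata such as {'last_modified': ..., 'checksum': ...}
        # so the pipeline can skip files that are already up to date;
        # returning {} forces a fresh download every time.
        return {}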

Practice

Since we upload directly to an OSS path, we have no need for Scrapy's thumbnail feature (the IMAGES_THUMBS setting).

Building the AliOssStore

import logging
import os

import oss2

from scrapy.pipelines.files import FileException

logger = logging.getLogger(__name__)


class ImageException(FileException):
    """General image error exception"""


class AliOssStore(object):
    def __init__(self, host_base, access_key_id, access_key_secret, bucket_name):
        """
        A custom store object for Ali OSS; for the interface it implements, see
        https://github.com/scrapy/scrapy/blob/0ede017d2ac057b1c3f9fb77a875e4d083e65401/scrapy/pipelines/files.py
        :param host_base: the OSS endpoint, e.g. http://oss-cn-hangzhou.aliyuncs.com
        :param access_key_id: the Ali OSS access key id
        :param access_key_secret: the Ali OSS access key secret
        :param bucket_name: the name of the target bucket
        """
        self._auth = oss2.Auth(access_key_id, access_key_secret)
        self._bucket = oss2.Bucket(self._auth, host_base, bucket_name)

    def stat_file(self, path, info):
        # Always return an empty result, forcing the media request to download the file.
        return {}

    def _check_file(self, path):
        # Local-path existence check; a helper only, not part of the store
        # interface that FilesPipeline calls.
        return os.path.exists(path)

    def persist_file(self, path, buf, info, meta=None, headers=None):
        """Upload file to Ali oss storage"""
        self._upload_file(path, buf)

    def _upload_file(self, path, buf):
        logger.info('uploading image %s to oss', path)
        self._bucket.put_object(key=path, data=buf.getvalue())
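
To sanity-check the store on its own, something like the following should work (the endpoint and credentials below are placeholders, not real values):

from io import BytesIO

# Hypothetical values: substitute your own endpoint, keys and bucket.
store = AliOssStore(
    host_base='http://oss-cn-hangzhou.aliyuncs.com',
    access_key_id='<access-key-id>',
    access_key_secret='<access-key-secret>',
    bucket_name='<bucket-name>',
)
store.persist_file('test/demo.jpg', BytesIO(b'fake image bytes'), info=None)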

Building the pipeline

import json

try:
    from cStringIO import StringIO as BytesIO
except ImportError:
    from io import BytesIO

from PIL import Image

from scrapy.http.request import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import NotConfigured, DropItem

# AliOssStore and ImageException are defined in the snippet above; if you
# keep them in a separate module, import them from there instead.

class DoubanOssStorePipeline(ImagesPipeline):
    MEDIA_NAME = 'image'

    # Uppercase attributes kept for backward compatibility with code that subclasses
    # ImagesPipeline. They may be overridden by settings.
    MIN_WIDTH = 0
    MIN_HEIGHT = 0

    DEFAULT_IMAGES_RESULT_FIELD = 'images'

    def __init__(self, ali_oss_config):
        self.ali_oss_config = ali_oss_config
        self.folder = ali_oss_config.get('folder', '')
        super(DoubanOssStorePipeline, self).__init__(ali_oss_config)

    def _get_store(self, uri):
        # get the ali oss store object
        return AliOssStore(
            self.ali_oss_config['host_base'],
            self.ali_oss_config['access_key_id'],
            self.ali_oss_config['access_key_secret'],
            self.ali_oss_config['bucket_name'],
        )

    def get_media_requests(self, item, info):
        if not item.get('url'):
            raise DropItem('item does not contain any url: {}'.format(json.dumps(item)))

        yield Request(url=item.get('url'))

    def get_images(self, response, request, info):
        path = self.file_path(request, response=response, info=info)
        orig_image = Image.open(BytesIO(response.body))

        width, height = orig_image.size
        if width < self.min_width or height < self.min_height:
            raise ImageException("Image too small (%dx%d < %dx%d)" %
                                 (width, height, self.min_width, self.min_height))

        image, buf = self.convert_image(orig_image)
        yield path, image, buf

    def file_path(self, request, response=None, info=None):
        img_path = super(DoubanOssStorePipeline, self).file_path(request, response, info)
        # The default path looks like full/abc.jpg; we only need the file name.
        image_name = img_path.rsplit('/', 1)[-1]
        if self.folder:
            # OSS object keys always use '/' as the separator, so avoid os.path.join.
            image_name = '{}/{}'.format(self.folder, image_name)

        return image_name

    @classmethod
    def from_settings(cls, settings):
        ali_oss_config = settings.getdict('ALI_OSS_CONFIG', {})
        if not ali_oss_config:
            raise NotConfigured('You should set ALI_OSS_CONFIG to enable this pipeline')

        return cls(ali_oss_config)

Configure our OSS settings in the settings.py file:

# Enable our pipeline
ITEM_PIPELINES = {
    'douban_oss.pipelines.DoubanOssStorePipeline': 300,
}

ALI_OSS_CONFIG = {
    'host_base': "",
    'access_key_id': "",
    'access_key_secret': "",
    'bucket_name': "",
    'folder': 'douban',  # an empty string means the root folder
}

For the meaning of these configuration keys, refer to the Alibaba Cloud OSS SDK documentation.
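
As an example, a filled-in configuration might look like this (all values are placeholders; use the endpoint for your own OSS region):

ALI_OSS_CONFIG = {
    'host_base': 'http://oss-cn-hangzhou.aliyuncs.com',  # region endpoint
    'access_key_id': '<access-key-id>',
    'access_key_secret': '<access-key-secret>',
    'bucket_name': 'my-image-bucket',
    'folder': 'douban',
}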

Result screenshot

Summary

This post worked through downloading images with Scrapy's ImagesPipeline and uploading them to our own cloud storage. Our thought process was:

Hypothesis -> Verification -> Practice -> Demonstration -> Conclusion

When you pick up a requirement, start from existing logic and project code for reference. Don't be afraid: there is always a solution, and it pays to read the source code yourself.
The code for this post has been pushed to GitHub; stars and forks are welcome.

Code: https://github.com/BruceDone/scrapy_demo/tree/master/douban_oss