[scrapy] Images Pipeline Analysis: Uploading Downloaded Images to Aliyun OSS
Contents
- Origin
- Preparation
- Analysis
- Practice
- Summary
Origin
There are plenty of convenient cloud storage services nowadays, such as Aliyun OSS, Amazon S3, and Azure Blob Storage. For large volumes of files or images, cloud storage is extremely convenient. In this post we will analyze how to use a Scrapy pipeline to upload the images we download directly to our Aliyun OSS service.
Code: https://github.com/BruceDone/scrapy_demo/tree/master/douban_oss
Background
- Aliyun OSS: a massive, secure, low-cost, highly reliable cloud storage service offering 99.999999999% data durability. It can store and serve data from anywhere on the internet through a RESTful API, scales capacity and throughput elastically, and offers multiple storage classes to optimize cost.
- Product page: https://cn.aliyun.com/product/oss?utm_content=se_1272306
- SDK downloads: https://promotion.aliyun.com/ntms/act/ossdoclist.html?spm=5176.7933691.744462.yuan1.68546a561PwRab
Preparation
- Python: 2.7
- Dependencies (a one-line install is shown after the list)
- Scrapy==1.5.0
- oss2==2.4.0
- Pillow==5.1.0
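All three can be installed in one step:

pip install Scrapy==1.5.0 oss2==2.4.0 Pillow==5.1.0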
Analysis
When we pick up a requirement like this, a good first step is to check whether an existing project already implements something similar that we can use as a reference. We know Scrapy has long shipped with support for S3 (Amazon's cloud storage), so we can go look directly in the Scrapy source code.
Reading ImagesPipeline, we find that when the pipeline class is instantiated, it initializes an S3 store object (S3FilesStore):
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/images.py
ImagesPipeline inherits from FilesPipeline, so we dig one level further there.
OK, with the source read, the conclusion is straightforward: implement an Ali OSS store with the same interface as the built-in stores, and our files will be persisted to our own cloud service.
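For reference, the store object that FilesPipeline works with is duck-typed: it only needs a stat_file() method and a persist_file() method. Here is a minimal sketch of that interface (method names and arguments follow the Scrapy 1.5 source linked above; the class name itself is ours, not Scrapy's):

class StoreInterface(object):
    """Skeleton of the store interface used by FilesPipeline (Scrapy 1.5)."""

    def stat_file(self, path, info):
        # Called before downloading; return a dict such as
        # {'last_modified': ..., 'checksum': ...}. An empty dict means
        # "not cached, download the file again".
        return {}

    def persist_file(self, path, buf, info, meta=None, headers=None):
        # Called after a successful download.
        # path: relative key, e.g. 'full/<sha1 of the url>.jpg'
        # buf:  a BytesIO holding the downloaded bytes
        raise NotImplementedError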
Practice
Since we upload straight to an OSS bucket, we have no use for the thumbnail feature at all.
Building the AliOssStore
import logging

import oss2
from scrapy.pipelines.files import FileException

logger = logging.getLogger(__name__)


class ImageException(FileException):
    """General image error exception"""


class AliOssStore(object):
    def __init__(self, host_base, access_key_id, access_key_secret, bucket_name):
        """
        Store object with the same interface as Scrapy's built-in stores;
        for more detail please refer to
        https://github.com/scrapy/scrapy/blob/0ede017d2ac057b1c3f9fb77a875e4d083e65401/scrapy/pipelines/files.py
        :param host_base: region endpoint, e.g. http://oss-cn-hangzhou.aliyuncs.com
        :param access_key_id: Aliyun access key id
        :param access_key_secret: Aliyun access key secret
        :param bucket_name: target OSS bucket
        """
        self._auth = oss2.Auth(access_key_id, access_key_secret)
        self._bucket = oss2.Bucket(self._auth, host_base, bucket_name)

    def stat_file(self, path, info):
        # Always return an empty result, forcing the media pipeline to
        # download the file again instead of treating it as cached.
        return {}

    def persist_file(self, path, buf, info, meta=None, headers=None):
        """Upload the downloaded file to Ali OSS storage."""
        self._upload_file(path, buf)

    def _upload_file(self, path, buf):
        logger.info('uploading image %s to oss', path)
        self._bucket.put_object(key=path, data=buf.getvalue())
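Before wiring the store into Scrapy, it is worth smoke-testing the credentials with oss2 by hand. A minimal sketch; the endpoint, keys and bucket name below are placeholders you must replace with your own:

import oss2

# Placeholders: fill in your own credentials, region endpoint and bucket.
auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'http://oss-cn-hangzhou.aliyuncs.com', '<bucket_name>')

# Upload a tiny object and read it back to verify connectivity.
bucket.put_object('test/hello.txt', b'hello oss')
print(bucket.get_object('test/hello.txt').read())  # b'hello oss'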
Building the pipeline
import json
try:
    from cStringIO import StringIO as BytesIO
except ImportError:
    from io import BytesIO

from PIL import Image
from scrapy.http.request import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import NotConfigured, DropItem

# AliOssStore and ImageException are defined above, in the same
# pipelines.py module.


class DoubanOssStorePipeline(ImagesPipeline):
    MEDIA_NAME = 'image'
    # Uppercase attributes kept for backward compatibility with code that
    # subclasses ImagesPipeline. They may be overridden by settings.
    MIN_WIDTH = 0
    MIN_HEIGHT = 0
    DEFAULT_IMAGES_RESULT_FIELD = 'images'

    def __init__(self, ali_oss_config):
        self.ali_oss_config = ali_oss_config
        self.folder = ali_oss_config.get('folder', '')
        super(DoubanOssStorePipeline, self).__init__(ali_oss_config)

    def _get_store(self, uri):
        # Ignore the uri and always return our Ali OSS store object.
        return AliOssStore(
            self.ali_oss_config['host_base'],
            self.ali_oss_config['access_key_id'],
            self.ali_oss_config['access_key_secret'],
            self.ali_oss_config['bucket_name'],
        )

    def get_media_requests(self, item, info):
        if not item.get('url'):
            raise DropItem('item has no url field: {}'.format(json.dumps(item)))
        yield Request(url=item.get('url'))

    def get_images(self, response, request, info):
        # Overridden to skip thumbnail generation: we only upload the
        # full-size image.
        path = self.file_path(request, response=response, info=info)
        orig_image = Image.open(BytesIO(response.body))
        width, height = orig_image.size
        if width < self.min_width or height < self.min_height:
            raise ImageException("Image too small (%dx%d < %dx%d)" %
                                 (width, height, self.min_width, self.min_height))
        image, buf = self.convert_image(orig_image)
        yield path, image, buf

    def file_path(self, request, response=None, info=None):
        img_path = super(DoubanOssStorePipeline, self).file_path(request, response, info)
        # The default path looks like full/abc.jpg; we only need the name.
        image_name = img_path.rsplit('/', 1)[-1] if '/' in img_path else img_path
        if self.folder:
            # OSS keys always use forward slashes, so avoid os.path.join here.
            image_name = '{}/{}'.format(self.folder, image_name)
        return image_name

    @classmethod
    def from_settings(cls, settings):
        ali_oss_config = settings.getdict('ALI_OSS_CONFIG', {})
        if not ali_oss_config:
            raise NotConfigured('Set ALI_OSS_CONFIG in settings.py to enable this pipeline')
        return cls(ali_oss_config)
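The pipeline above expects each item to carry a single url field. A hypothetical minimal spider that feeds it (the start URL and the CSS selector are illustrative only):

import scrapy


class DoubanCoverSpider(scrapy.Spider):
    # Hypothetical spider: yields one item per image found on the page.
    name = 'douban_cover'
    start_urls = ['https://movie.douban.com/']  # illustrative start page

    def parse(self, response):
        for src in response.css('img::attr(src)').extract():
            # Each item only needs the 'url' field consumed by
            # DoubanOssStorePipeline.get_media_requests().
            yield {'url': response.urljoin(src)}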
Configure OSS in the settings.py file
ITEM_PIPELINES = {
    'douban_oss.pipelines.DoubanOssStorePipeline': 300,  # enable our pipeline
}

ALI_OSS_CONFIG = {
    'host_base': "",
    'access_key_id': "",
    'access_key_secret': "",
    'bucket_name': "",
    'folder': 'douban',  # an empty string means the bucket root
}
For the meaning of each of these fields, refer to the Aliyun OSS SDK documentation.
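For example, a bucket in the Hangzhou region would be configured roughly like this (all values below are placeholders; look up the endpoint for your own region in the SDK docs):

ALI_OSS_CONFIG = {
    'host_base': 'http://oss-cn-hangzhou.aliyuncs.com',  # region endpoint
    'access_key_id': '<your access key id>',
    'access_key_secret': '<your access key secret>',
    'bucket_name': '<your bucket name>',
    'folder': 'douban',  # key prefix inside the bucket
}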
Result screenshot
Summary
In this post we walked through downloading images with Scrapy's ImagesPipeline and uploading them to our own cloud storage. The thought process was:
hypothesis -> verification -> practice -> demonstration -> conclusion
When you pick up a requirement, look at existing logic and existing project code first. Don't be intimidated: there is always a solution, and reading the source code yourself pays off.
The code for this post has been pushed to GitHub; stars and forks are welcome.
Code: https://github.com/BruceDone/scrapy_demo/tree/master/douban_oss
- Author: 大鱼
- Original link: https://brucedone.com/archives/1160/
- License: this work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license. Non-commercial reposts must credit the author and link to the original; for commercial use, contact the author for permission.