Table of Contents
- Origin
- Preparation
- Analysis
- Practice
- Summary
Origin
There are plenty of convenient cloud storage services these days, such as Alibaba Cloud OSS, Amazon S3, and Azure Blob Storage. For large volumes of files or images, cloud storage is extremely convenient. In this post we will analyze how to use a Scrapy pipeline to upload the images we download directly to Alibaba Cloud OSS.
Code repository: https://github.com/BruceDone/scrapy_demo/tree/master/douban_oss
Background
- Alibaba Cloud OSS: a massive, secure, low-cost, and highly reliable cloud storage service offering 99.999999999% data durability. Through a RESTful API, data can be stored and accessed from anywhere on the internet; capacity and throughput scale elastically, and multiple storage classes are available to optimize storage costs.
- Link: https://cn.aliyun.com/product/oss?utm_content=se_1272306
- SDK: https://promotion.aliyun.com/ntms/act/ossdoclist.html?spm=5176.7933691.744462.yuan1.68546a561PwRab
Preparation
- Python: 2.7
- Dependencies
- Scrapy==1.5.0
- oss2==2.4.0
- Pillow==5.1.0
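If you want to reproduce the environment, the pinned versions above can be installed with pip:

```
pip install Scrapy==1.5.0 oss2==2.4.0 Pillow==5.1.0
```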
Analysis
When handed a requirement like this, we can first check whether an existing project already contains an implementation we can use as a reference. Scrapy has long supported S3 (Amazon's cloud storage), so we can go straight to the Scrapy source code.
Looking at ImagesPipeline, we find that an S3 store object is initialized when the class is instantiated:
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/images.py
ImagesPipeline inherits from FilesPipeline, so we dig one level further.
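For orientation, the store-selection logic in FilesPipeline looks roughly like this (abridged and paraphrased from the Scrapy 1.5 source linked above; see GitHub for the authoritative version):

```python
# Abridged from scrapy/pipelines/files.py (Scrapy 1.5).
class FilesPipeline(MediaPipeline):

    STORE_SCHEMES = {
        '': FSFilesStore,    # bare paths go to the local filesystem
        'file': FSFilesStore,
        's3': S3FilesStore,  # the one we will imitate for Ali OSS
    }

    def _get_store(self, uri):
        # pick a store class based on the scheme of FILES_STORE / IMAGES_STORE
        if os.path.isabs(uri):  # treat bare absolute paths as local directories
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)
```

The pipeline calls self.store.stat_file(...) before downloading and self.store.persist_file(...) after; those two methods are essentially the whole contract a custom store has to satisfy.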
OK, having analyzed the source, we reach a simple conclusion: if we implement an Ali OSS store with the same interface, we can persist files to our own cloud service.
Practice
Since we upload directly to an OSS directory, we have no need for the thumbnail feature at all.
Building the AliOssStore
```python
import os
import json

try:
    from cStringIO import StringIO as BytesIO
except ImportError:
    from io import BytesIO

from PIL import Image
import oss2
from scrapy.http.request import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import NotConfigured, DropItem
from scrapy.pipelines.files import FileException
from scrapy.log import logger


class ImageException(FileException):
    """General image error exception"""


class AliOssStore(object):

    def __init__(self, host_base, access_key_id, access_key_secret, bucket_name):
        """
        A custom store object; for more detail please refer to
        https://github.com/scrapy/scrapy/blob/0ede017d2ac057b1c3f9fb77a875e4d083e65401/scrapy/pipelines/files.py
        :param host_base: the OSS endpoint
        :param access_key_id:
        :param access_key_secret:
        :param bucket_name:
        """
        self._auth = oss2.Auth(access_key_id, access_key_secret)
        self._bucket = oss2.Bucket(self._auth, host_base, bucket_name)

    def stat_file(self, path, info):
        # always return an empty result, forcing the media request to download the file
        return {}

    def _check_file(self, path):
        if not os.path.exists(path):
            return False
        return True

    def persist_file(self, path, buf, info, meta=None, headers=None):
        """Upload file to Ali OSS storage"""
        self._upload_file(path, buf)

    def _upload_file(self, path, buf):
        logger.warning('now i will upload the image {}'.format(path))
        self._bucket.put_object(key=path, data=buf.getvalue())
```
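Before wiring the store into a pipeline, it can be exercised on its own. A minimal sketch, assuming the class above is importable; the endpoint and credentials below are placeholders, and running this performs a real upload against your bucket:

```python
from io import BytesIO

store = AliOssStore(
    host_base='http://oss-cn-hangzhou.aliyuncs.com',  # placeholder: your region's OSS endpoint
    access_key_id='<your-access-key-id>',
    access_key_secret='<your-access-key-secret>',
    bucket_name='<your-bucket>',
)

# upload a small fake payload under the key 'douban/test.jpg'
store.persist_file('douban/test.jpg', BytesIO(b'not-really-an-image'), info=None)
```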
Building the pipeline
```python
import os
import json

try:
    from cStringIO import StringIO as BytesIO
except ImportError:
    from io import BytesIO

from PIL import Image
import oss2
from scrapy.http.request import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import NotConfigured, DropItem
from scrapy.pipelines.files import FileException
from scrapy.log import logger


class DoubanOssStorePipeline(ImagesPipeline):
    MEDIA_NAME = 'image'

    # Uppercase attributes kept for backward compatibility with code that subclasses
    # ImagesPipeline. They may be overridden by settings.
    MIN_WIDTH = 0
    MIN_HEIGHT = 0
    DEFAULT_IMAGES_RESULT_FIELD = 'images'

    def __init__(self, ali_oss_config):
        self.ali_oss_config = ali_oss_config
        self.folder = ali_oss_config.get('folder', '')
        super(DoubanOssStorePipeline, self).__init__(ali_oss_config)

    def _get_store(self, uri):
        # build the Ali OSS store object instead of the default filesystem/S3 stores
        return AliOssStore(
            self.ali_oss_config['host_base'],
            self.ali_oss_config['access_key_id'],
            self.ali_oss_config['access_key_secret'],
            self.ali_oss_config['bucket_name'],
        )

    def get_media_requests(self, item, info):
        if not item.get('url'):
            raise DropItem('item does not contain any url: {}'.format(json.dumps(item)))
        yield Request(url=item.get('url'))

    def get_images(self, response, request, info):
        path = self.file_path(request, response=response, info=info)
        orig_image = Image.open(BytesIO(response.body))

        width, height = orig_image.size
        if width < self.min_width or height < self.min_height:
            raise ImageException("Image too small (%dx%d < %dx%d)" %
                                 (width, height, self.min_width, self.min_height))

        image, buf = self.convert_image(orig_image)
        yield path, image, buf

    def file_path(self, request, response=None, info=None):
        img_path = super(DoubanOssStorePipeline, self).file_path(request, response, info)
        # the default path looks like 'full/abc.jpg'; we just need the image name
        image_name = img_path.rsplit('/', 1)[-1] if '/' in img_path else img_path
        if self.folder:
            image_name = os.path.join(self.folder, image_name)
        return image_name

    @classmethod
    def from_settings(cls, settings):
        ali_oss_config = settings.getdict('ALI_OSS_CONFIG', {})
        if not ali_oss_config:
            raise NotConfigured('You should configure ALI_OSS_CONFIG to enable this pipeline')
        return cls(ali_oss_config)
```
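The pipeline above expects each item to carry a single url field pointing at an image. A minimal hypothetical spider that produces items in that shape (the spider name, start URL, and selector are illustrative only, not from the original project):

```python
import scrapy


class DoubanImageSpider(scrapy.Spider):
    name = 'douban_images'
    start_urls = ['https://movie.douban.com/top250']  # illustrative

    def parse(self, response):
        # yield one item per image; DoubanOssStorePipeline reads item['url']
        for src in response.css('img::attr(src)').extract():
            yield {'url': src}
```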
Configure OSS in settings.py
```python
# enable our pipeline
ITEM_PIPELINES = {
    'douban_oss.pipelines.DoubanOssStorePipeline': 300,
}

ALI_OSS_CONFIG = {
    'host_base': "",
    'access_key_id': "",
    'access_key_secret': "",
    'bucket_name': "",
    'folder': 'douban',  # an empty string means the root folder
}
```
For the values of these configuration keys, refer to the Alibaba Cloud OSS SDK documentation.
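As a rough example, a filled-in config typically looks like the following; every value here is a placeholder, and the exact host_base for your region comes from the OSS documentation:

```python
ALI_OSS_CONFIG = {
    'host_base': 'http://oss-cn-hangzhou.aliyuncs.com',  # placeholder region endpoint
    'access_key_id': '<your-access-key-id>',
    'access_key_secret': '<your-access-key-secret>',
    'bucket_name': 'my-crawler-images',                  # placeholder bucket name
    'folder': 'douban',
}
```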
Screenshot of the result
Summary
This time we worked through downloading images with Scrapy's ImagesPipeline and uploading them to our own cloud service. Our thought process was:
```
hypothesis -> verification -> practice -> demonstration -> conclusion
```
When you pick up a requirement, start from the existing logic and project code. Don't be intimidated; there is always a solution, and reading the source code yourself is how you find it.
The code for this post has been pushed to GitHub; stars and forks are welcome.
Code: https://github.com/BruceDone/scrapy_demo/tree/master/douban_oss
How would one handle uploading to Qiniu Cloud instead?
Refer to their SDK, https://developer.qiniu.com/kodo/sdk/1242/python — a small modification to the code will do.
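For readers with the same question, here is a minimal sketch of a Qiniu-backed store exposing the same stat_file/persist_file interface, assuming the official qiniu Python SDK; the class name and token lifetime are illustrative, not from the original post:

```python
from qiniu import Auth, put_data


class QiniuStore(object):
    """Hypothetical drop-in replacement for AliOssStore, backed by Qiniu."""

    def __init__(self, access_key, secret_key, bucket_name):
        self._auth = Auth(access_key, secret_key)
        self._bucket_name = bucket_name

    def stat_file(self, path, info):
        # as in AliOssStore: always force the media request to download the file
        return {}

    def persist_file(self, path, buf, info, meta=None, headers=None):
        # upload token scoped to this bucket/key, valid for one hour (illustrative)
        token = self._auth.upload_token(self._bucket_name, path, 3600)
        put_data(token, path, buf.getvalue())
```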
I tried adapting it for Qiniu the way you described, but it keeps failing. Could you take a look at how best to write it?
You can set breakpoints and debug it to see what the actual problem is; it's not hard.
I still couldn't get it to run~~~
You need to learn to debug it yourself~
It's probably because I'm using Splash for JS rendering, but that shouldn't have anything to do with the images, right...
In principle it shouldn't matter much. Splash acts as an intermediate layer, and when the response is handed back to the middleware it carries an extra upstream Referer; just remove that Referer from the returned response and you should be fine.
Still no luck. I found the media_to_download method in scrapy.pipelines.files and tried stripping the referer there, but it still raised errors and the images still could not be uploaded to OSS. So I tried another approach: in ImagesPipeline's get_media_requests I replaced the original Request object with a SplashRequest, and that worked, but the rendered images have a white background and the size is fixed at 1024x768. Is there a better way to tell ImagesPipeline to download with the original Request based on image_url??
How should I change this to download GIF images?
You can use the original FilesPipeline for downloading; the adaptation approach is exactly the same.
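To make that concrete: ImagesPipeline re-encodes every image to JPEG via convert_image, which breaks GIFs, whereas FilesPipeline persists the raw response body untouched. A minimal sketch of the same OSS idea built on FilesPipeline (the class name is illustrative; it reuses AliOssStore from above):

```python
from scrapy.exceptions import DropItem, NotConfigured
from scrapy.http.request import Request
from scrapy.pipelines.files import FilesPipeline


class AliOssFilePipeline(FilesPipeline):
    """Hypothetical raw-file variant: no JPEG conversion, so GIFs survive intact."""

    def __init__(self, ali_oss_config):
        self.ali_oss_config = ali_oss_config
        super(AliOssFilePipeline, self).__init__(ali_oss_config)

    def _get_store(self, uri):
        # reuse the same store object as the image pipeline
        return AliOssStore(
            self.ali_oss_config['host_base'],
            self.ali_oss_config['access_key_id'],
            self.ali_oss_config['access_key_secret'],
            self.ali_oss_config['bucket_name'],
        )

    def get_media_requests(self, item, info):
        if not item.get('url'):
            raise DropItem('item does not contain any url')
        yield Request(url=item['url'])

    @classmethod
    def from_settings(cls, settings):
        ali_oss_config = settings.getdict('ALI_OSS_CONFIG', {})
        if not ali_oss_config:
            raise NotConfigured('ALI_OSS_CONFIG is required to enable this pipeline')
        return cls(ali_oss_config)
```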