Contents
- Environment setup
- Startup
- Usage guide
- Code analysis
- Summary
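Environment setup
- macOS (Ubuntu 14.04 or 16.04 also works)
- docker (search for the installer and install it)
- docker-compose (install with pip)
- a Docker registry mirror
Why a registry mirror this time? Docker pulls images from hub.docker.com by default, and from inside China those pulls are extremely slow, so it helps to configure a mirror (the same idea as a pip mirror). I personally recommend the Alibaba Cloud mirror: it is convenient and stable.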
Startup
First, prepare the local working directories:

    # working directory
    mkdir -p /Users/brucedone/Projects/portia_projects
    # portia directory
    mkdir -p /Users/brucedone/Scripts/docker_compose/portia

Then change into /Users/brucedone/Scripts/docker_compose/portia and create a file named docker-compose.yml with the following content:

    portia:
      image: scrapinghub/portia:portia-2.0.8
      ports:
        - 9001:9001
      volumes:
        - /Users/brucedone/Projects/portia_projects:/app/data/projects

Run:

    sudo docker-compose up -d

Once the container is up, open http://127.0.0.1:9001/ in a browser and you should see the Portia start page.
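Usage guide
See "[Visual Scraping] A First Look at Portia 2.0 and Thoughts on Automation" [1].
Make sure list data already shows up in your interface.
Why stress list data? In most crawling scenarios we extract items from a list page, so working through this case is the best way to see how a concrete spider actually operates.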
Code analysis
Structure analysis
Change into the actual project directory, /Users/brucedone/Projects/portia_projects.
The project we just created is already there. I used cnblogs as the example, so my local directory contains only cnblogs. Inside that directory:

    2017-08-02 17:19:37 ☆ DoneBruces-MacBook-Pro in ~/Projects/portia_projects/cnblogs
    ○ → tree
    .
    |____extractors.json
    |____items.json
    |____project.json
    |____scrapy.cfg
    |____setup.py
    |____spiders
    | |______init__.py
    | |____settings.py
    | |____www.cnblogs.com
    | | |____dd60-46f2-bbea
    | | | |____original_body.html
    | | | |____rendered_body.html
    | | |____dd60-46f2-bbea.json
    | |____www.cnblogs.com.json

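- items.json: the field definitions we created in the front end
- settings.py: the usual settings file
- dd60-46f2-bbea: the data template produced by this run
- www.cnblogs.com.json: the configuration for the whole spider
- dd60-46f2-bbea.json: the configuration of the XPath selections produced by our point-and-click annotations in the front end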
Key analysis
- www.cnblogs.com.json
- dd60-46f2-bbea.json
www.cnblogs.com.json
Let's look at its contents first:

    ○ → cat www.cnblogs.com.json
    {
        " ": [],
        "exclude_patterns": [],
        "follow_patterns": [],
        "id": "www.cnblogs.com",
        "js_disable_patterns": [],
        "js_enable_patterns": [],
        "js_enabled": false,
        "links_to_follow": "none",
        "respect_nofollow": false,
        "start_urls": [
            {
                "url": "https://www.cnblogs.com/",
                "type": "url"
            }
        ]
    }

The links_to_follow, follow_patterns and allowed_domains entries mirror the settings in the front end. We leave them untouched for now, since this is only a simple test; a later article will hook into this part.
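To make the role of these keys concrete, here is a minimal sketch that reads a spider config and pulls out the crawl-related settings. The config is a trimmed copy of the file above; the function name `summarize_spider` and the shape of its return value are assumptions made for illustration, not part of Portia.

```python
import json

# Trimmed copy of the spider config shown above (only the keys used below).
SPIDER_CONFIG = json.loads("""
{
    "id": "www.cnblogs.com",
    "follow_patterns": [],
    "js_enabled": false,
    "links_to_follow": "none",
    "start_urls": [{"url": "https://www.cnblogs.com/", "type": "url"}]
}
""")


def summarize_spider(config):
    """Return the crawl-related settings from a Portia spider config dict.

    Hypothetical helper: it only reads keys that appear in the file above.
    """
    urls = [
        entry["url"]
        for entry in config.get("start_urls", [])
        if entry.get("type") == "url"
    ]
    return {
        "spider": config.get("id"),
        "start_urls": urls,
        "links_to_follow": config.get("links_to_follow"),
        "follow_patterns": config.get("follow_patterns", []),
    }


summary = summarize_spider(SPIDER_CONFIG)
print(summary)
```

With `links_to_follow` set to `"none"` and empty pattern lists, the summary shows a spider that only visits its start URL, which matches the simple test described here.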
dd60-46f2-bbea.json
    {
        "extractors": {},
        "id": "dd60-46f2-bbea",
        "name": "main_page",
        "page_id": "",
        "page_type": "item",
        "plugins": {
            "annotations-plugin": {
                "extracts": [
                    {
                        "annotations": {
                            "#portia-content": "#dummy"
                        },
                        "container_id": null,
                        "id": "687e-4ce3-83b1#parent",
                        "item_container": true,
                        "repeated": false,
                        "required": [],
                        "schema_id": "0098-4bf5-b042",
                        "selector": "#post_list",
                        "siblings": 0,
                        "tagid": null,
                        "text-content": "#portia-content"
                    },
                    {
                        "annotations": {
                            "#portia-content": "#dummy"
                        },
                        "container_id": "687e-4ce3-83b1#parent",
                        "id": "687e-4ce3-83b1",
                        "item_container": true,
                        "repeated": true,
                        "required": [],
                        "schema_id": "0098-4bf5-b042",
                        "selector": ".post_item_body",
                        "siblings": 0,
                        "tagid": null,
                        "text-content": "#portia-content"
                    },
                    {
                        "accept_selectors": [
                            ".post_item:nth-child(1) > .post_item_body > .post_item_summary",
                            ".post_item:nth-child(2) > .post_item_body > .post_item_summary"
                        ],
                        "container_id": "687e-4ce3-83b1",
                        "data": {
                            "7fc5-4eae-ae67": {
                                "attribute": "content",
                                "extractors": {},
                                "field": "3872-4c4e-aad1",
                                "required": false
                            }
                        },
                        "id": "30be-4ae8-9be4",
                        "text-content": "content",
                        "post_text": null,
                        "pre_text": null,
                        "reject_selectors": [],
                        "required": [],
                        "repeated": false,
                        "selection_mode": "auto",
                        "selector": ".post_item_body > .post_item_summary",
                        "tagid": null,
                        "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_summary \")]"
                    },
                    {
                        "accept_selectors": [
                            ".post_item:nth-child(1) > .post_item_body > .post_item_foot > .article_comment > .gray"
                        ],
                        "container_id": "687e-4ce3-83b1",
                        "data": {
                            "ba8a-4352-a3b6": {
                                "attribute": "content",
                                "extractors": {},
                                "field": "133c-4825-9e7b",
                                "required": false
                            }
                        },
                        "id": "0892-4756-92ed",
                        "text-content": "content",
                        "post_text": null,
                        "pre_text": null,
                        "reject_selectors": [],
                        "required": [],
                        "repeated": false,
                        "selection_mode": "auto",
                        "selector": ".post_item_body > .post_item_foot > .article_comment > .gray",
                        "tagid": null,
                        "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" article_comment \")]/*[contains(concat(\" \", @class, \" \"), \" gray \")]"
                    },
                    {
                        "accept_selectors": [
                            ".post_item:nth-child(1) > .post_item_body > .post_item_foot > .lightblue"
                        ],
                        "container_id": "687e-4ce3-83b1",
                        "data": {
                            "786a-4fb2-91b3": {
                                "attribute": "content",
                                "extractors": {},
                                "field": "be7f-423a-8647",
                                "required": false
                            }
                        },
                        "id": "e303-4467-b55e",
                        "text-content": "content",
                        "post_text": null,
                        "pre_text": null,
                        "reject_selectors": [],
                        "required": [],
                        "repeated": false,
                        "selection_mode": "auto",
                        "selector": ".post_item_body > .post_item_foot > .lightblue",
                        "tagid": null,
                        "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" lightblue \")]"
                    },
                    {
                        "accept_selectors": [
                            ".post_item:nth-child(1) > .post_item_body > .post_item_foot > .article_view > .gray"
                        ],
                        "container_id": "687e-4ce3-83b1",
                        "data": {
                            "dbc1-4fb5-99e9": {
                                "attribute": "content",
                                "extractors": {},
                                "field": "af74-46cd-a93f",
                                "required": false
                            }
                        },
                        "id": "775d-4c38-ab15",
                        "text-content": "content",
                        "post_text": null,
                        "pre_text": null,
                        "reject_selectors": [],
                        "required": [],
                        "repeated": false,
                        "selection_mode": "auto",
                        "selector": ".post_item_body > .post_item_foot > .article_view > .gray",
                        "tagid": null,
                        "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" article_view \")]/*[contains(concat(\" \", @class, \" \"), \" gray \")]"
                    },
                    {
                        "accept_selectors": [
                            ".post_item:nth-child(1) > .post_item_body > .post_item_foot > .article_comment > .gray"
                        ],
                        "container_id": "687e-4ce3-83b1",
                        "data": {
                            "f5ea-47ad-8015": {
                                "attribute": "href",
                                "extractors": {},
                                "field": "de8a-4b9b-b19c",
                                "required": false
                            }
                        },
                        "id": "68d4-4193-a5e2",
                        "text-content": "content",
                        "post_text": null,
                        "pre_text": null,
                        "reject_selectors": [],
                        "required": [],
                        "repeated": false,
                        "selection_mode": "auto",
                        "selector": ".post_item_body > .post_item_foot > .article_comment > .gray",
                        "tagid": null,
                        "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" article_comment \")]/*[contains(concat(\" \", @class, \" \"), \" gray \")]"
                    }
                ]
            }
        },
        "scrapes": "0098-4bf5-b042",
        "spider": "www.cnblogs.com",
        "url": "https://www.cnblogs.com/",
        "version": "0.13.0b37"
    }

Let's focus on two entries. The first:

    {
        "annotations": {
            "#portia-content": "#dummy"
        },
        "container_id": null,
        "id": "687e-4ce3-83b1#parent",
        "item_container": true,
        "repeated": false,
        "required": [],
        "schema_id": "0098-4bf5-b042",
        "selector": "#post_list",
        "siblings": 0,
        "tagid": null,
        "text-content": "#portia-content"
    },

This is the root node of the list. Put simply, we take "selector": "#post_list" and express it in XPath as //*[@id="post_list"] to reach the root container of the list. Next, the item node:

    {
        "annotations": {
            "#portia-content": "#dummy"
        },
        "container_id": "687e-4ce3-83b1#parent",
        "id": "687e-4ce3-83b1",
        "item_container": true,
        "repeated": true,
        "required": [],
        "schema_id": "0098-4bf5-b042",
        "selector": ".post_item_body",
        "siblings": 0,
        "tagid": null,
        "text-content": "#portia-content"
    },

Here we look at "selector": ".post_item_body". With the XPath //*[@id="post_list"]//*[@class="post_item_body"] we get exactly the list elements we want. Convenient, isn't it? Now the attribute node:

    {
        "accept_selectors": [
            ".post_item:nth-child(1) > .post_item_body > .post_item_summary",
            ".post_item:nth-child(2) > .post_item_body > .post_item_summary"
        ],
        "container_id": "687e-4ce3-83b1",
        "data": {
            "7fc5-4eae-ae67": {
                "attribute": "content",
                "extractors": {},
                "field": "3872-4c4e-aad1",
                "required": false
            }
        },
        "id": "30be-4ae8-9be4",
        "text-content": "content",
        "post_text": null,
        "pre_text": null,
        "reject_selectors": [],
        "required": [],
        "repeated": false,
        "selection_mode": "auto",
        "selector": ".post_item_body > .post_item_summary",
        "tagid": null,
        "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_summary \")]"
    },

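In the attribute node, the `selector` and `xpath` fields appear to describe the same path in two notations. The sketch below reproduces that pattern for the simple case seen in this template (class names joined by `>`). The functions `class_step` and `selector_to_xpath` are hypothetical helpers written for this article; how Portia itself generates the `xpath` field is not shown in the template, so treat this as an illustration of the pattern only.

```python
def class_step(class_name):
    """Build an XPath step matching any element whose class list contains class_name."""
    return '*[contains(concat(" ", @class, " "), " %s ")]' % class_name


def selector_to_xpath(selector):
    """Convert a selector made only of `.class` parts joined by `>` into XPath.

    Anything else (ids, tag names, descendant combinators, :nth-child) is
    rejected, because this sketch covers only the form seen in the template.
    """
    steps = []
    for part in selector.split(">"):
        part = part.strip()
        if not part.startswith(".") or " " in part or ":" in part:
            raise ValueError("unsupported selector part: %r" % part)
        steps.append(class_step(part[1:]))
    return "//" + "/".join(steps)


xpath = selector_to_xpath(".post_item_body > .post_item_summary")
print(xpath)
```

The `concat(" ", @class, " ")` idiom pads the class attribute with spaces so that a class name is matched as a whole token, which is why the generated XPath still matches elements that carry several classes.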
I won't go through every one of them. At this level we have reached the attribute nodes, so to sum up:

    root node (list root element) => item node => attribute node

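The three levels can be recovered from the template mechanically. The following sketch sorts the entries of `extracts` into root, item and attribute nodes using the `item_container` and `container_id` keys seen in the JSON. The sample data is abbreviated from the template, and `classify_extracts` is a hypothetical helper whose rules are inferred from this one file, not from Portia's documentation.

```python
# Abbreviated entries from the "extracts" list of the template above.
EXTRACTS = [
    {"id": "687e-4ce3-83b1#parent", "container_id": None,
     "item_container": True, "repeated": False, "selector": "#post_list"},
    {"id": "687e-4ce3-83b1", "container_id": "687e-4ce3-83b1#parent",
     "item_container": True, "repeated": True, "selector": ".post_item_body"},
    {"id": "30be-4ae8-9be4", "container_id": "687e-4ce3-83b1",
     "selector": ".post_item_body > .post_item_summary",
     "data": {"7fc5-4eae-ae67": {"attribute": "content",
                                 "field": "3872-4c4e-aad1"}}},
]


def classify_extracts(extracts):
    """Split annotation entries into (root, item, attribute nodes).

    Assumed rules: a container without a parent is the list root, a
    container with a parent is the repeated item, everything else is
    an attribute node.
    """
    root, item, fields = None, None, []
    for entry in extracts:
        if entry.get("item_container"):
            if entry.get("container_id") is None:
                root = entry
            else:
                item = entry
        else:
            fields.append(entry)
    return root, item, fields


root, item, fields = classify_extracts(EXTRACTS)
print(root["selector"], "=>", item["selector"], "=>",
      [field["selector"] for field in fields])
```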
We could just as well take this JSON and use it in our own spider. Keep in mind that Scrapy already supports this syntax:

    yield {'a': 'a', 'b': 'b', 'c': 'c'}

that is, returning a plain dict directly as the item.
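As a rough sketch of how the root, item and attribute levels turn into dict items, the generator below walks a small hand-written page and yields one plain dict per list entry. It uses only the standard library so that it runs anywhere; in a real Scrapy spider the same loop would sit in the `parse` callback and use the response's selectors instead. The sample markup and the field name `summary` are assumptions (the template itself refers to fields by ID).

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the list page; not the real cnblogs markup.
SAMPLE_PAGE = """
<div>
  <div id="post_list">
    <div class="post_item">
      <div class="post_item_body">
        <p class="post_item_summary">first summary</p>
      </div>
    </div>
    <div class="post_item">
      <div class="post_item_body">
        <p class="post_item_summary">second summary</p>
      </div>
    </div>
  </div>
</div>
""".strip()


def parse(document):
    """Yield one plain dict per list item: root => item => attribute."""
    root = document.find(".//*[@id='post_list']")  # list root element
    if root is None:
        return
    for item in root.findall(".//*[@class='post_item_body']"):  # item nodes
        summary = item.find("./*[@class='post_item_summary']")  # attribute node
        text = summary.text.strip() if summary is not None and summary.text else None
        yield {"summary": text}


items = list(parse(ET.fromstring(SAMPLE_PAGE)))
print(items)
```

ElementTree supports only a subset of XPath (exact attribute matches, no `contains()`), so the class tests here are exact matches rather than the token match used in the template's `xpath` field.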
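Summary
This article covered:
- installing with docker-compose and mounting a local directory
- an analysis of the data templates Portia generates
- how the templates work and how they can be combined with our own Scrapy project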
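Hi,
I didn't set this up with docker; I deployed the source from GitHub directly. After configuring the data storage path, though, I can't find any data. Do you know where Portia stores the scraped data, or how to configure storage (MySQL, local)? (Running init_mysql_db with the MySQL configuration throws an error, so I switched to local storage, but there is still no output.)
If you run from source, Portia only generates the project folders and the related JSON files. To write the spider's output locally, fill in the pipeline.py file of the generated spider and then run the spider as usual.
Hi,
your article helped me a lot!!!
You're welcome, glad it helped.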