[Visual Scraping] Portia 2.0 First Look and Thoughts on Automation: Data Output and How It Works
Contents
- Environment setup
- Getting started
- Usage guide
- Code analysis
- Summary
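Environment setup
- macOS (Ubuntu 14.04 or 16.04 also works)
- Docker (look up the installation steps for your platform)
- docker-compose (install with pip)
- a Docker registry mirror (for faster image pulls)
Why a registry mirror this time? Docker pulls images from hub.docker.com by default, and for reasons outside our control those pulls are extremely slow from inside China. A mirror fixes that, on the same principle as a pip mirror. I personally recommend the Aliyun mirror: it is convenient and stable.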
Getting started
First prepare the local working directories
# working directory
mkdir -p /Users/brucedone/Projects/portia_projects
# portia directory
mkdir -p /Users/brucedone/Scripts/docker_compose/portia
Then switch to /Users/brucedone/Scripts/docker_compose/portia and create a file named docker-compose.yml with the following content
portia:
  image: scrapinghub/portia:portia-2.0.8
  ports:
    - 9001:9001
  volumes:
    - /Users/brucedone/Projects/portia_projects:/app/data/projects
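Run the command
sudo docker-compose up -d
Once the container is up and running, open http://127.0.0.1:9001/ in a browser; the Portia interface should load.
Usage guide
See the first post in this series: [Visual Scraping] Portia 2.0 First Look and Thoughts on Automation (1)
Make sure your interface is already showing list data.
Why insist on list data? In most scraping scenarios we extract items from a list page, so working through that case is the best way to see how a concrete spider actually operates.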
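Code analysis
Structure
Switch to the actual working directory /Users/brucedone/Projects/portia_projects
The project we just created is already there. I used cnblogs as the example, so my local directory contains only cnblogs. Switch into it: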
2017-08-02 17:19:37 ☆ DoneBruces-MacBook-Pro in ~/Projects/portia_projects/cnblogs
○ → tree
.
|____extractors.json
|____items.json
|____project.json
|____scrapy.cfg
|____setup.py
|____spiders
| |______init__.py
| |____settings.py
| |____www.cnblogs.com
| | |____dd60-46f2-bbea
| | | |____original_body.html
| | | |____rendered_body.html
| | |____dd60-46f2-bbea.json
| |____www.cnblogs.com.json
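- items.json: the field definitions we made in the front end
- settings.py: the usual settings file
- dd60-46f2-bbea: the data template (sample) produced by this run
- www.cnblogs.com.json: the configuration for the spider as a whole
- dd60-46f2-bbea.json: the selector and XPath configuration produced by pointing and clicking in the front end
Key files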
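The layout described above can also be walked programmatically. The following is a minimal sketch, assuming the directory structure shown in the tree output (a `spiders/<spider>.json` file plus a `spiders/<spider>/` folder of sample JSON files); `describe_project` is a hypothetical helper name, not part of Portia.

```python
import json
from pathlib import Path


def describe_project(project_dir):
    """Return {spider_name: [(sample_id, sample_name), ...]} for a Portia project."""
    spiders_dir = Path(project_dir) / "spiders"
    layout = {}
    # Each spider has a "<spider>.json" config file directly under spiders/.
    for spider_file in sorted(spiders_dir.glob("*.json")):
        # "www.cnblogs.com.json" -> "www.cnblogs.com" (only the last suffix is dropped)
        spider_name = spider_file.stem
        sample_dir = spiders_dir / spider_name
        samples = []
        if sample_dir.is_dir():
            # Each sample (template) is stored as "<sample-id>.json" in that folder.
            for sample_file in sorted(sample_dir.glob("*.json")):
                sample = json.loads(sample_file.read_text(encoding="utf-8"))
                samples.append((sample["id"], sample.get("name")))
        layout[spider_name] = samples
    return layout
```

Called on the cnblogs project directory from this post, it should report one spider, www.cnblogs.com, with a single sample named main_page.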
- www.cnblogs.com.json
- dd60-46f2-bbea.json
www.cnblogs.com.json
Let's look at its contents first
○ → cat www.cnblogs.com.json
{
"allowed_domains": [],
"exclude_patterns": [],
"follow_patterns": [],
"id": "www.cnblogs.com",
"js_disable_patterns": [],
"js_enable_patterns": [],
"js_enabled": false,
"links_to_follow": "none",
"respect_nofollow": false,
"start_urls": [
{
"url": "https://www.cnblogs.com/",
"type": "url"
}
]
}
The links_to_follow, follow_patterns and allowed_domains fields mirror the settings made in the front end. We leave them empty for now because this is only a simple test; a later post will hook into them.
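To make the mapping concrete, here is a rough sketch of how these fields could be turned into the attributes a hand-written Scrapy spider usually declares. `spider_kwargs` is a hypothetical helper, only start URL entries of type "url" are handled, and falling back to the start URL hostnames when allowed_domains is empty is an assumption of this sketch rather than documented Portia behaviour.

```python
from urllib.parse import urlparse

# Trimmed copy of the spider config shown above.
CONFIG = {
    "id": "www.cnblogs.com",
    "allowed_domains": [],
    "follow_patterns": [],
    "links_to_follow": "none",
    "start_urls": [{"url": "https://www.cnblogs.com/", "type": "url"}],
}


def spider_kwargs(config):
    """Translate a Portia spider config dict into Scrapy-style spider attributes."""
    # Keep only plain URL entries; any other entry type is ignored in this sketch.
    start_urls = [entry["url"] for entry in config.get("start_urls", [])
                  if entry.get("type") == "url"]
    allowed = list(config.get("allowed_domains") or [])
    if not allowed:
        # Assumption: restrict the crawl to the hosts of the start URLs.
        allowed = sorted({urlparse(url).netloc for url in start_urls})
    return {
        "name": config["id"],
        "start_urls": start_urls,
        "allowed_domains": allowed,
        # "none" means the spider only visits the start URLs.
        "follow_links": config.get("links_to_follow", "none") != "none",
    }
```

For the config above this yields the name www.cnblogs.com, one start URL, the single allowed domain www.cnblogs.com, and link following switched off.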
dd60-46f2-bbea.json
{
"extractors": {},
"id": "dd60-46f2-bbea",
"name": "main_page",
"page_id": "",
"page_type": "item",
"plugins": {
"annotations-plugin": {
"extracts": [
{
"annotations": {
"#portia-content": "#dummy"
},
"container_id": null,
"id": "687e-4ce3-83b1#parent",
"item_container": true,
"repeated": false,
"required": [],
"schema_id": "0098-4bf5-b042",
"selector": "#post_list",
"siblings": 0,
"tagid": null,
"text-content": "#portia-content"
},
{
"annotations": {
"#portia-content": "#dummy"
},
"container_id": "687e-4ce3-83b1#parent",
"id": "687e-4ce3-83b1",
"item_container": true,
"repeated": true,
"required": [],
"schema_id": "0098-4bf5-b042",
"selector": ".post_item_body",
"siblings": 0,
"tagid": null,
"text-content": "#portia-content"
},
{
"accept_selectors": [
".post_item:nth-child(1) > .post_item_body > .post_item_summary",
".post_item:nth-child(2) > .post_item_body > .post_item_summary"
],
"container_id": "687e-4ce3-83b1",
"data": {
"7fc5-4eae-ae67": {
"attribute": "content",
"extractors": {},
"field": "3872-4c4e-aad1",
"required": false
}
},
"id": "30be-4ae8-9be4",
"text-content": "content",
"post_text": null,
"pre_text": null,
"reject_selectors": [],
"required": [],
"repeated": false,
"selection_mode": "auto",
"selector": ".post_item_body > .post_item_summary",
"tagid": null,
"xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_summary \")]"
},
{
"accept_selectors": [
".post_item:nth-child(1) > .post_item_body > .post_item_foot > .article_comment > .gray"
],
"container_id": "687e-4ce3-83b1",
"data": {
"ba8a-4352-a3b6": {
"attribute": "content",
"extractors": {},
"field": "133c-4825-9e7b",
"required": false
}
},
"id": "0892-4756-92ed",
"text-content": "content",
"post_text": null,
"pre_text": null,
"reject_selectors": [],
"required": [],
"repeated": false,
"selection_mode": "auto",
"selector": ".post_item_body > .post_item_foot > .article_comment > .gray",
"tagid": null,
"xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" article_comment \")]/*[contains(concat(\" \", @class, \" \"), \" gray \")]"
},
{
"accept_selectors": [
".post_item:nth-child(1) > .post_item_body > .post_item_foot > .lightblue"
],
"container_id": "687e-4ce3-83b1",
"data": {
"786a-4fb2-91b3": {
"attribute": "content",
"extractors": {},
"field": "be7f-423a-8647",
"required": false
}
},
"id": "e303-4467-b55e",
"text-content": "content",
"post_text": null,
"pre_text": null,
"reject_selectors": [],
"required": [],
"repeated": false,
"selection_mode": "auto",
"selector": ".post_item_body > .post_item_foot > .lightblue",
"tagid": null,
"xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" lightblue \")]"
},
{
"accept_selectors": [
".post_item:nth-child(1) > .post_item_body > .post_item_foot > .article_view > .gray"
],
"container_id": "687e-4ce3-83b1",
"data": {
"dbc1-4fb5-99e9": {
"attribute": "content",
"extractors": {},
"field": "af74-46cd-a93f",
"required": false
}
},
"id": "775d-4c38-ab15",
"text-content": "content",
"post_text": null,
"pre_text": null,
"reject_selectors": [],
"required": [],
"repeated": false,
"selection_mode": "auto",
"selector": ".post_item_body > .post_item_foot > .article_view > .gray",
"tagid": null,
"xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" article_view \")]/*[contains(concat(\" \", @class, \" \"), \" gray \")]"
},
{
"accept_selectors": [
".post_item:nth-child(1) > .post_item_body > .post_item_foot > .article_comment > .gray"
],
"container_id": "687e-4ce3-83b1",
"data": {
"f5ea-47ad-8015": {
"attribute": "href",
"extractors": {},
"field": "de8a-4b9b-b19c",
"required": false
}
},
"id": "68d4-4193-a5e2",
"text-content": "content",
"post_text": null,
"pre_text": null,
"reject_selectors": [],
"required": [],
"repeated": false,
"selection_mode": "auto",
"selector": ".post_item_body > .post_item_foot > .article_comment > .gray",
"tagid": null,
"xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" article_comment \")]/*[contains(concat(\" \", @class, \" \"), \" gray \")]"
}
]
}
},
"scrapes": "0098-4bf5-b042",
"spider": "www.cnblogs.com",
"url": "https://www.cnblogs.com/",
"version": "0.13.0b37"
}
Let's focus on two entries. The first one:
"annotations": {
"#portia-content": "#dummy"
},
"container_id": null,
"id": "687e-4ce3-83b1#parent",
"item_container": true,
"repeated": false,
"required": [],
"schema_id": "0098-4bf5-b042",
"selector": "#post_list",
"siblings": 0,
"tagid": null,
"text-content": "#portia-content"
},
This is the root node of the list. Put simply, we take "selector": "#post_list"; the equivalent XPath, //*[@id="post_list"], selects the root container of the list. Next, the item node:
"annotations": {
"#portia-content": "#dummy"
},
"container_id": "687e-4ce3-83b1#parent",
"id": "687e-4ce3-83b1",
"item_container": true,
"repeated": true,
"required": [],
"schema_id": "0098-4bf5-b042",
"selector": ".post_item_body",
"siblings": 0,
"tagid": null,
"text-content": "#portia-content"
},
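Look at "selector": ".post_item_body". With the XPath //*[@id="post_list"]//*[@class="post_item_body"] we get exactly the list elements we want (as long as each element's class attribute is exactly post_item_body). Convenient, isn't it? Now let's look at a field node: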
{
"accept_selectors": [
".post_item:nth-child(1) > .post_item_body > .post_item_summary",
".post_item:nth-child(2) > .post_item_body > .post_item_summary"
],
"container_id": "687e-4ce3-83b1",
"data": {
"7fc5-4eae-ae67": {
"attribute": "content",
"extractors": {},
"field": "3872-4c4e-aad1",
"required": false
}
},
"id": "30be-4ae8-9be4",
"text-content": "content",
"post_text": null,
"pre_text": null,
"reject_selectors": [],
"required": [],
"repeated": false,
"selection_mode": "auto",
"selector": ".post_item_body > .post_item_summary",
"tagid": null,
"xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_summary \")]"
},
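The "selector" and "xpath" values in this entry describe the same path in two notations. As an illustration, the sketch below converts a very small CSS subset (#id, .class, the > child combinator and the descendant combinator) into XPath. `css_to_xpath` is a hypothetical helper and is not taken from Portia's source; for the selectors in this sample its output happens to match the stored xpath strings.

```python
def css_to_xpath(selector):
    """Convert a tiny CSS subset (#id, .class, '>' and descendant) to XPath."""
    xpath = ""
    axis = "//"  # the first step may match anywhere in the document
    for token in selector.split():
        if token == ">":
            axis = "/"  # the next step must be a direct child
            continue
        if token.startswith("#"):
            step = '*[@id="%s"]' % token[1:]
        elif token.startswith("."):
            # Pad with spaces so the test matches one whole class name,
            # even when the element carries several classes.
            step = '*[contains(concat(" ", @class, " "), " %s ")]' % token[1:]
        else:
            raise ValueError("unsupported selector part: %r" % token)
        xpath += axis + step
        axis = "//"  # without '>', the next step is a descendant
    return xpath
```

The contains(concat(...)) form is more tolerant than @class="post_item_body": the latter only matches when the class attribute is exactly that string, while the former still matches an element such as class="post_item_body featured".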
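We won't go through every entry one by one. At this level we have reached the field nodes, so the overall structure is:
root node (list root element) => item node => field node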
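This hierarchy can be recovered from the sample file itself, because every entry points at its parent through container_id. The sketch below uses a trimmed copy of four entries from the JSON above and prints them as an indented outline; `outline` is a hypothetical helper name.

```python
# Trimmed copy of four entries from the "extracts" list shown earlier.
EXTRACTS = [
    {"id": "687e-4ce3-83b1#parent", "container_id": None,
     "selector": "#post_list"},
    {"id": "687e-4ce3-83b1", "container_id": "687e-4ce3-83b1#parent",
     "selector": ".post_item_body"},
    {"id": "30be-4ae8-9be4", "container_id": "687e-4ce3-83b1",
     "selector": ".post_item_body > .post_item_summary"},
    {"id": "e303-4467-b55e", "container_id": "687e-4ce3-83b1",
     "selector": ".post_item_body > .post_item_foot > .lightblue"},
]


def outline(extracts):
    """Return the selectors as indented lines, nested by container_id."""
    by_parent = {}
    for entry in extracts:
        by_parent.setdefault(entry.get("container_id"), []).append(entry)

    def walk(parent_id, depth):
        lines = []
        for entry in by_parent.get(parent_id, []):
            lines.append("  " * depth + entry["selector"])
            lines.extend(walk(entry["id"], depth + 1))
        return lines

    # Entries whose container_id is null sit at the top of the hierarchy.
    return walk(None, 0)


if __name__ == "__main__":
    print("\n".join(outline(EXTRACTS)))
```

The outline has three levels: #post_list at the top, .post_item_body beneath it, and the two field selectors beneath that.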
We could just as well take this JSON and use it directly in our own spider. Keep in mind that Scrapy already supports returning a plain dict as an item, like this:
yield {'a': 'a', 'b': 'b', 'c': 'c'}
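Putting the pieces together, the root => item => field walk can be written as a generator that yields one plain dict per list entry. To keep the sketch runnable without Scrapy it uses the standard library's xml.etree.ElementTree on a small, invented, well-formed HTML fragment that only imitates the class names from the sample; the field names summary and author are hypothetical, since the JSON stores field ids rather than names. In a real spider the same loop would sit in the parse callback and use the response's selectors.

```python
import xml.etree.ElementTree as ET

# Invented markup that imitates the cnblogs class names used in the sample.
SAMPLE_HTML = """
<html><body>
<div id="post_list">
  <div class="post_item"><div class="post_item_body">
    <p class="post_item_summary">first summary</p>
    <div class="post_item_foot"><a class="lightblue" href="/u/a/">author-a</a></div>
  </div></div>
  <div class="post_item"><div class="post_item_body">
    <p class="post_item_summary">second summary</p>
    <div class="post_item_foot"><a class="lightblue" href="/u/b/">author-b</a></div>
  </div></div>
</div>
</body></html>
"""


def parse(document):
    """Yield one plain dict per list item: root => item => field."""
    tree = ET.fromstring(document.strip())
    # Root node: the container of the whole list.
    post_list = tree.find(".//*[@id='post_list']")
    # Item nodes: one per entry inside the root container.
    for body in post_list.findall(".//*[@class='post_item_body']"):
        # Field nodes, located relative to the item node.
        # This sketch assumes both elements exist in every item.
        summary = body.find("./*[@class='post_item_summary']")
        author = body.find("./*[@class='post_item_foot']/*[@class='lightblue']")
        yield {"summary": summary.text.strip(), "author": author.text.strip()}


if __name__ == "__main__":
    for item in parse(SAMPLE_HTML):
        print(item)
```

ElementTree only understands a limited XPath subset and needs well-formed markup, so it stands in for a real HTML selector library purely for illustration.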
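Summary
This post covered:
- installing with docker-compose and mounting a local directory
- analysing the data template that Portia generates
- how the template works and how it can be combined with our own Scrapy project
- Original author: 大鱼
- Original link: https://brucedone.com/archives/1059/
- License: this work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For non-commercial reposts, credit the source (author and original link); for commercial reposts, contact the author for permission.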