Contents

  • Environment setup
  • Startup
  • Usage guide
  • Code analysis
  • Summary

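Environment setup

  • macOS (Ubuntu 14.04 or 16.04 also works)
  • docker (search for the installer for your platform)
  • docker-compose (install with pip)
  • a Docker registry mirror

You may ask why a registry mirror shows up this time. Docker pulls images from hub.docker.com by default, and from mainland China those pulls are painfully slow, so it helps to configure a mirror (the same idea as using a pip mirror). I personally recommend the Aliyun mirror: it is convenient and stable.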

Startup

First, prepare the local working directories

# working directory
mkdir -p /Users/brucedone/Projects/portia_projects

# portia directory
mkdir -p /Users/brucedone/Scripts/docker_compose/portia

Then switch to /Users/brucedone/Scripts/docker_compose/portia and create a file named docker-compose.yml with the following content

    portia:
        image: scrapinghub/portia:portia-2.0.8
        ports:
          - 9001:9001
        volumes:
          - /Users/brucedone/Projects/portia_projects:/app/data/projects

Run the command

    sudo docker-compose up -d

Once the container is up and running, open http://127.0.0.1:9001/ in a browser and you should see the Portia interface.

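Usage guide

See the earlier post: [Visual scraping] A first look at Portia 2.0 and thoughts on automation (1)

Make sure your interface already shows list data

Why the emphasis on list data? In most scraping scenarios we extract items from a list page, so going through this workflow is the best way to understand how a concrete spider actually operates.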

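Code analysis

Structure

Switch to the actual project directory, /Users/brucedone/Projects/portia_projects
The project we just created is already there. I used cnblogs as the example, so my local directory contains only cnblogs. Switch into it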

     2017-08-02 17:19:37 ☆  DoneBruces-MacBook-Pro in ~/Projects/portia_projects/cnblogs
    ○ → tree
    .
    |____extractors.json
    |____items.json
    |____project.json
    |____scrapy.cfg
    |____setup.py
    |____spiders
    | |______init__.py
    | |____settings.py
    | |____www.cnblogs.com
    | | |____dd60-46f2-bbea
    | | | |____original_body.html
    | | | |____rendered_body.html
    | | |____dd60-46f2-bbea.json
    | |____www.cnblogs.com.json
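
  • items.json: the field definitions we created in the frontend
  • settings.py: the usual settings file
  • dd60-46f2-bbea: the data template (sample) for this run
  • www.cnblogs.com.json: the configuration for the whole spider
  • dd60-46f2-bbea.json: the selector (XPath) configuration produced by pointing and clicking in the frontend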
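
To make the layout concrete, here is a minimal sketch that walks a project directory and lists the templates stored for each spider. It assumes only the layout shown in the tree output above (a `<spider>.json` file next to a `<spider>/` directory holding one `<template>.json` per sample); the helper name `list_templates` is my own, not part of Portia.

```python
from pathlib import Path


def list_templates(project_dir):
    """Map each spider name to the template ids found on disk."""
    spiders_dir = Path(project_dir) / "spiders"
    result = {}
    # Each spider has a <name>.json config next to a <name>/ directory.
    for spider_file in sorted(spiders_dir.glob("*.json")):
        name = spider_file.stem  # "www.cnblogs.com.json" -> "www.cnblogs.com"
        template_dir = spiders_dir / name
        if template_dir.is_dir():
            # Every <id>.json inside that directory is one annotated sample.
            result[name] = sorted(p.stem for p in template_dir.glob("*.json"))
        else:
            result[name] = []
    return result
```

Calling `list_templates(".")` from inside the cnblogs project directory should report one spider with one template, matching the tree listing.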

Key files

www.cnblogs.com.json

Let's look at its content first

    ○ → cat www.cnblogs.com.json
    {
        "allowed_domains": [],
        "exclude_patterns": [],
        "follow_patterns": [],
        "id": "www.cnblogs.com",
        "js_disable_patterns": [],
        "js_enable_patterns": [],
        "js_enabled": false,
        "links_to_follow": "none",
        "respect_nofollow": false,
        "start_urls": [
            {
                "url": "https://www.cnblogs.com/",
                "type": "url"
            }
        ]
    }

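The links_to_follow, follow_patterns and allowed_domains fields here mirror what is configured in the frontend. We leave them empty for now, since this is only a simple test; in later posts we will hook into these settings.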
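
As a rough illustration of how this file could be consumed from our own code, the sketch below parses a trimmed copy of the JSON above and pulls out the fields that decide where the crawl starts and which links it follows. The function name `crawl_settings` and the returned dict shape are my own choices for the example, not a Portia API.

```python
import json

# Trimmed copy of the spider config shown above.
SPIDER_JSON = """
{
    "exclude_patterns": [],
    "follow_patterns": [],
    "id": "www.cnblogs.com",
    "links_to_follow": "none",
    "start_urls": [{"url": "https://www.cnblogs.com/", "type": "url"}]
}
"""


def crawl_settings(raw):
    """Extract the start URLs and link-following rules from a spider config."""
    config = json.loads(raw)
    return {
        "start_urls": [entry["url"] for entry in config.get("start_urls", [])],
        "links_to_follow": config.get("links_to_follow", "none"),
        "follow_patterns": config.get("follow_patterns", []),
        "exclude_patterns": config.get("exclude_patterns", []),
    }


settings = crawl_settings(SPIDER_JSON)
print(settings["start_urls"])  # ['https://www.cnblogs.com/']
```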

dd60-46f2-bbea.json

    {
      "extractors": {},
      "id": "dd60-46f2-bbea",
      "name": "main_page",
      "page_id": "",
      "page_type": "item",
      "plugins": {
        "annotations-plugin": {
          "extracts": [
            {
              "annotations": {
                "#portia-content": "#dummy"
              },
              "container_id": null,
              "id": "687e-4ce3-83b1#parent",
              "item_container": true,
              "repeated": false,
              "required": [],
              "schema_id": "0098-4bf5-b042",
              "selector": "#post_list",
              "siblings": 0,
              "tagid": null,
              "text-content": "#portia-content"
            },
            {
              "annotations": {
                "#portia-content": "#dummy"
              },
              "container_id": "687e-4ce3-83b1#parent",
              "id": "687e-4ce3-83b1",
              "item_container": true,
              "repeated": true,
              "required": [],
              "schema_id": "0098-4bf5-b042",
              "selector": ".post_item_body",
              "siblings": 0,
              "tagid": null,
              "text-content": "#portia-content"
            },
            {
              "accept_selectors": [
                ".post_item:nth-child(1) > .post_item_body > .post_item_summary",
                ".post_item:nth-child(2) > .post_item_body > .post_item_summary"
              ],
              "container_id": "687e-4ce3-83b1",
              "data": {
                "7fc5-4eae-ae67": {
                  "attribute": "content",
                  "extractors": {},
                  "field": "3872-4c4e-aad1",
                  "required": false
                }
              },
              "id": "30be-4ae8-9be4",
              "text-content": "content",
              "post_text": null,
              "pre_text": null,
              "reject_selectors": [],
              "required": [],
              "repeated": false,
              "selection_mode": "auto",
              "selector": ".post_item_body > .post_item_summary",
              "tagid": null,
              "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_summary \")]"
            },
            {
              "accept_selectors": [
                ".post_item:nth-child(1) > .post_item_body > .post_item_foot > .article_comment > .gray"
              ],
              "container_id": "687e-4ce3-83b1",
              "data": {
                "ba8a-4352-a3b6": {
                  "attribute": "content",
                  "extractors": {},
                  "field": "133c-4825-9e7b",
                  "required": false
                }
              },
              "id": "0892-4756-92ed",
              "text-content": "content",
              "post_text": null,
              "pre_text": null,
              "reject_selectors": [],
              "required": [],
              "repeated": false,
              "selection_mode": "auto",
              "selector": ".post_item_body > .post_item_foot > .article_comment > .gray",
              "tagid": null,
              "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" article_comment \")]/*[contains(concat(\" \", @class, \" \"), \" gray \")]"
            },
            {
              "accept_selectors": [
                ".post_item:nth-child(1) > .post_item_body > .post_item_foot > .lightblue"
              ],
              "container_id": "687e-4ce3-83b1",
              "data": {
                "786a-4fb2-91b3": {
                  "attribute": "content",
                  "extractors": {},
                  "field": "be7f-423a-8647",
                  "required": false
                }
              },
              "id": "e303-4467-b55e",
              "text-content": "content",
              "post_text": null,
              "pre_text": null,
              "reject_selectors": [],
              "required": [],
              "repeated": false,
              "selection_mode": "auto",
              "selector": ".post_item_body > .post_item_foot > .lightblue",
              "tagid": null,
              "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" lightblue \")]"
            },
            {
              "accept_selectors": [
                ".post_item:nth-child(1) > .post_item_body > .post_item_foot > .article_view > .gray"
              ],
              "container_id": "687e-4ce3-83b1",
              "data": {
                "dbc1-4fb5-99e9": {
                  "attribute": "content",
                  "extractors": {},
                  "field": "af74-46cd-a93f",
                  "required": false
                }
              },
              "id": "775d-4c38-ab15",
              "text-content": "content",
              "post_text": null,
              "pre_text": null,
              "reject_selectors": [],
              "required": [],
              "repeated": false,
              "selection_mode": "auto",
              "selector": ".post_item_body > .post_item_foot > .article_view > .gray",
              "tagid": null,
              "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" article_view \")]/*[contains(concat(\" \", @class, \" \"), \" gray \")]"
            },
            {
              "accept_selectors": [
                ".post_item:nth-child(1) > .post_item_body > .post_item_foot > .article_comment > .gray"
              ],
              "container_id": "687e-4ce3-83b1",
              "data": {
                "f5ea-47ad-8015": {
                  "attribute": "href",
                  "extractors": {},
                  "field": "de8a-4b9b-b19c",
                  "required": false
                }
              },
              "id": "68d4-4193-a5e2",
              "text-content": "content",
              "post_text": null,
              "pre_text": null,
              "reject_selectors": [],
              "required": [],
              "repeated": false,
              "selection_mode": "auto",
              "selector": ".post_item_body > .post_item_foot > .article_comment > .gray",
              "tagid": null,
              "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_foot \")]/*[contains(concat(\" \", @class, \" \"), \" article_comment \")]/*[contains(concat(\" \", @class, \" \"), \" gray \")]"
            }
          ]
        }
      },
      "scrapes": "0098-4bf5-b042",
      "spider": "www.cnblogs.com",
      "url": "https://www.cnblogs.com/",
      "version": "0.13.0b37"
    }

Let's focus on two of these entries

    {
      "annotations": {
        "#portia-content": "#dummy"
      },
      "container_id": null,
      "id": "687e-4ce3-83b1#parent",
      "item_container": true,
      "repeated": false,
      "required": [],
      "schema_id": "0098-4bf5-b042",
      "selector": "#post_list",
      "siblings": 0,
      "tagid": null,
      "text-content": "#portia-content"
    },

This is the root node of the list. Put simply, we take "selector": "#post_list" and, written as XPath, //*[@id="post_list"] gives us the root container of the list. Next, the item node

    {
      "annotations": {
        "#portia-content": "#dummy"
      },
      "container_id": "687e-4ce3-83b1#parent",
      "id": "687e-4ce3-83b1",
      "item_container": true,
      "repeated": true,
      "required": [],
      "schema_id": "0098-4bf5-b042",
      "selector": ".post_item_body",
      "siblings": 0,
      "tagid": null,
      "text-content": "#portia-content"
    },

Look at "selector": ".post_item_body". With the XPath //*[@id="post_list"]//*[@class="post_item_body"] we get exactly the list elements we want. Pretty convenient, right? Now let's look at an attribute node

{
  "accept_selectors": [
    ".post_item:nth-child(1) > .post_item_body > .post_item_summary",
    ".post_item:nth-child(2) > .post_item_body > .post_item_summary"
  ],
  "container_id": "687e-4ce3-83b1",
  "data": {
    "7fc5-4eae-ae67": {
      "attribute": "content",
      "extractors": {},
      "field": "3872-4c4e-aad1",
      "required": false
    }
  },
  "id": "30be-4ae8-9be4",
  "text-content": "content",
  "post_text": null,
  "pre_text": null,
  "reject_selectors": [],
  "required": [],
  "repeated": false,
  "selection_mode": "auto",
  "selector": ".post_item_body > .post_item_summary",
  "tagid": null,
  "xpath": "//*[contains(concat(\" \", @class, \" \"), \" post_item_body \")]/*[contains(concat(\" \", @class, \" \"), \" post_item_summary \")]"
},
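
A side note on the root and item selectors discussed earlier: turning them into XPath can be done mechanically. The sketch below is a hypothetical helper (`simple_css_to_xpath` is my own name) that handles only a single `#id` or `.class` selector, nothing like full CSS. For classes it uses the `contains(concat(...))` token match that Portia itself writes into the "xpath" field above, which, unlike `@class="..."`, still matches when an element carries several classes.

```python
def simple_css_to_xpath(selector):
    """Translate a single `#id` or `.class` selector into XPath.

    Hypothetical helper: only these two forms are handled.
    """
    if selector.startswith("#"):
        return '//*[@id="%s"]' % selector[1:]
    if selector.startswith("."):
        # Token match on @class, the same pattern as in the template's "xpath" field.
        return '//*[contains(concat(" ", @class, " "), " %s ")]' % selector[1:]
    raise ValueError("unsupported selector: %r" % selector)


# Root container of the list, then every item below it.
root_xpath = simple_css_to_xpath("#post_list")
item_xpath = root_xpath + simple_css_to_xpath(".post_item_body")
```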

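I won't go through every entry one by one. At this level we have already reached the attribute nodes, so to summarize

    root node (list root element) => item node => attribute node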
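
The three levels can be recovered from the template programmatically. The following sketch works on a trimmed copy of the template JSON and labels each annotation. The rule it uses (an `item_container` with `repeated` false is the root, with `repeated` true is the item, anything else is an attribute) is my reading of this one file, not a documented Portia rule.

```python
import json

# Trimmed copy of dd60-46f2-bbea.json: one root, one item, one attribute.
TEMPLATE_JSON = """
{"plugins": {"annotations-plugin": {"extracts": [
  {"id": "687e-4ce3-83b1#parent", "container_id": null,
   "item_container": true, "repeated": false, "selector": "#post_list"},
  {"id": "687e-4ce3-83b1", "container_id": "687e-4ce3-83b1#parent",
   "item_container": true, "repeated": true, "selector": ".post_item_body"},
  {"id": "30be-4ae8-9be4", "container_id": "687e-4ce3-83b1",
   "selector": ".post_item_body > .post_item_summary"}
]}}}
"""


def classify(extract):
    """Label an annotation as a root, item, or attribute node."""
    if extract.get("item_container"):
        # The repeated container is the per-row item; its parent is the list root.
        return "item" if extract.get("repeated") else "root"
    return "attribute"


extracts = json.loads(TEMPLATE_JSON)["plugins"]["annotations-plugin"]["extracts"]
levels = [(classify(e), e["selector"]) for e in extracts]
for level, selector in levels:
    print(level, selector)
```

Each attribute's `container_id` points at the item node, and the item's `container_id` points at the root, so the same data could also be used to rebuild the tree explicitly.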

Taking this JSON and using it directly in your own spider works perfectly well. Keep in mind that Scrapy already supports returning a plain dict as an item, like this

    yield {'a': 'a', 'b': 'b', 'c': 'c'}
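
Putting the pieces together, a spider callback built on the template's selectors might look like the sketch below. It assumes `response` is a Scrapy response (anything with parsel-style `.xpath()` and `.get()` works). The field names `summary` and `comments` are hypothetical labels, since the template only stores field ids such as "3872-4c4e-aad1", and the relative XPaths are adapted by hand from the template's "xpath" values.

```python
# XPath for each list item, built from the root and item selectors above.
ITEM_XPATH = (
    '//*[@id="post_list"]'
    '//*[contains(concat(" ", @class, " "), " post_item_body ")]'
)

# Relative XPaths per field; the field names are hypothetical labels.
FIELD_XPATHS = {
    "summary": './*[contains(concat(" ", @class, " "), " post_item_summary ")]',
    "comments": (
        './/*[contains(concat(" ", @class, " "), " article_comment ")]'
        '/*[contains(concat(" ", @class, " "), " gray ")]'
    ),
}


def parse(response):
    """Yield one plain dict per list item; usable as a Scrapy callback."""
    for node in response.xpath(ITEM_XPATH):
        item = {}
        for name, xpath in FIELD_XPATHS.items():
            # string(.) collapses the matched element to its text content.
            item[name] = node.xpath(xpath).xpath("string(.)").get()
        yield item
```

Inside a `scrapy.Spider` subclass the same body would go into `parse(self, response)`; no Item class is needed because the dicts are yielded as they are.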

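Summary

This post mainly covered

  • installing with docker-compose and mounting a local directory
  • analyzing the data template that Portia generates
  • how the template works and how to combine it with our own Scrapy project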