Common small errors, grouped by category
Internal errors
- TypeError
    - Symptom: TypeError: 'float' object is not iterable
    - Reference: https://github.com/scrapy/scrapy/issues/2461
    - Fix: sudo pip install -U Twisted==16.6.0
- ERROR: Unable to read the instance data, giving up
    - Symptom: the spider errors out immediately and returns no data
    - Reference: none
    - Fix: a callback must return (or yield) Request objects or Item objects; returning only those types resolves the error
- Library not loaded: /opt/local/lib/libssl.1.0.0.dylib (LoadError)
    - Fix: uninstall first with brew remove openssl, then reinstall with brew install openssl
- unknown command: crawl error
Peripheral (tooling) errors
- scrapyd run spider raises TypeError: __init__() got an unexpected keyword argument '_job'
    - Fix: change the spider's init signature to __init__(self, *args, **kwargs) and forward the arguments to the parent class
    - Reference: https://github.com/scrapy/scrapyd/issues/78
A question for the blogger: how would you solve the following? I used to do this with requests and pika; now I am switching to this framework and its components.
0. The blogger's site
http://brucedone.com/
1. Problem-driven questions
How do I send a POST request?
How do I send POST parameters?
How do I set custom request headers?
How do I read response cookies?
How do I send request cookies?
Related code:
# -*- coding: utf-8 -*-
from scrapy_redis.spiders import RedisSpider
import scrapy


class PeopleSpider(RedisSpider):
    name = "people"
    redis_key = "ithome:people"  # was misspelled "reids_key", so the key was never picked up
    allowed_domains = ["it.ithome.com"]

    def __init__(self, *args, **kwargs):
        super(PeopleSpider, self).__init__(*args, **kwargs)
        print('debug info:', args, kwargs)
        self.page = '1'
        self.start_urls = ['http://it.ithome.com/ithome/getajaxdata.aspx']
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
            'Accept': 'text/html, */*; q=0.01',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Referer': 'http://it.ithome.com/people/',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Origin': 'http://it.ithome.com',
            'Host': 'it.ithome.com',
            'X-Requested-With': 'XMLHttpRequest',
            'Accept-Encoding': 'gzip, deflate',
        }

    def start_requests(self):
        urls = self.start_urls
        formdata = {
            'categoryid': '34',
            'type': 'pccategorypage',
            'page': self.page,
        }
        for url in urls:
            yield scrapy.FormRequest(url=url, method='POST', headers=self.headers,
                                     callback=self.parse, formdata=formdata)

    def parse(self, response):
        # photos = response.xpath('./ul/li/a/img')
        # titles = response.xpath('./div/h2')   # stray ')' removed from the original XPath
        # contents = response.xpath('./div/div/p')
        print('debug info:', response.text)
        # cannot receive the page parameter here
Final questions:
How do I pass a page number to a scrapy_redis spider?
Can't I just pass it via meta?