python - Help with a Scrapy pipeline error

Views: 111 | Date: 2022-08-09 08:55:51

Problem description

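I don't really understand how data is passed around in Scrapy, and I have been stuck on this for almost half a month. I have read a lot of material and still don't get it. My fundamentals are weak, so I am asking for help here.

Leaving customization aside and taking Scrapy's default behaviour as the example: what format does the spider need to return? A dict like {a:1,b:2,.....}, or a list of dicts like [{a:1,aa:11},{b:2,bb:22},{......}]? And where does the returned value go? Is it the item in the code below?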

    class pipeline:
        def process_item(self, item, spider):

I am very much a beginner, but I really want to learn, and I hope to get some help. My code is below; please point out its shortcomings.

spider:

    # -*- coding: utf-8 -*-
    import scrapy
    from pm25.items import Pm25Item
    import re


    class InfospSpider(scrapy.Spider):
        name = 'infosp'
        allowed_domains = ['pm25.com']
        start_urls = ['http://www.pm25.com/rank/1day.html', ]

        def parse(self, response):
            item = Pm25Item()
            re_time = re.compile(r'\d+-\d+-\d+')
            date = response.xpath('/html/body/p[4]/p/p/p[2]/span').extract()[0]  # parse the DATE separately
            # items = []
            selector = response.selector.xpath('/html/body/p[5]/p/p[3]/ul[2]/li')  # narrow the response down to the rows to parse
            for subselector in selector:  # parse the rows one by one
                try:  # guard against [0] raising an error
                    rank = subselector.xpath('span[1]/text()').extract()[0]
                    quality = subselector.xpath('span/em/text()')[0].extract()
                    city = subselector.xpath('a/text()').extract()[0]
                    province = subselector.xpath('span[3]/text()').extract()[0]
                    aqi = subselector.xpath('span[4]/text()').extract()[0]
                    pm25 = subselector.xpath('span[5]/text()').extract()[0]
                except IndexError:
                    print(rank, quality, city, province, aqi, pm25)
                item['date'] = re_time.findall(date)[0]
                item['rank'] = rank
                item['quality'] = quality
                item['province'] = city
                item['city'] = province
                item['aqi'] = aqi
                item['pm25'] = pm25
                # items.append(item)
                yield item  # I don't understand how to use this or what format comes out;
                            # some tutorials return items instead, so I would appreciate some guidance

pipeline:

    import time


    class Pm25Pipeline(object):
        def process_item(self, item, spider):
            today = time.strftime('%y%m%d', time.localtime())
            fname = str(today) + '.txt'
            with open(fname, 'a') as f:
                for tmp in item:  # not sure whether this is written correctly;
                                  # my understanding is that the item returned by the spider is a list of yielded dicts:
                                  # [{a:1,aa:11},{b:2,bb:22},{......}]
                    f.write(tmp['date'] + '\t' +
                            tmp['rank'] + '\t' +
                            tmp['quality'] + '\t' +
                            tmp['province'] + '\t' +
                            tmp['city'] + '\t' +
                            tmp['aqi'] + '\t' +
                            tmp['pm25'] + '\n')
            f.close()
            return item

items:

    import scrapy


    class Pm25Item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        date = scrapy.Field()
        rank = scrapy.Field()
        quality = scrapy.Field()
        province = scrapy.Field()
        city = scrapy.Field()
        aqi = scrapy.Field()
        pm25 = scrapy.Field()
        pass

Part of the error output:

    Traceback (most recent call last):
      File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
        tmp['pm25'] + '\n'
    TypeError: string indices must be integers
    2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '30', 'city': '新疆', 'date': '2017-04-02', 'pm25': '13 ', 'province': '伊犁哈萨克州', 'quality': '优', 'rank': '357'}
    Traceback (most recent call last):
      File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
        tmp['pm25'] + '\n'
    TypeError: string indices must be integers
    2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28', 'city': '西藏', 'date': '2017-04-02', 'pm25': '11 ', 'province': '林芝', 'quality': '优', 'rank': '358'}
    Traceback (most recent call last):
      File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
        tmp['pm25'] + '\n'
    TypeError: string indices must be integers
    2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28', 'city': '云南', 'date': '2017-04-02', 'pm25': '11 ', 'province': '丽江', 'quality': '优', 'rank': '359'}
    Traceback (most recent call last):
      File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
        tmp['pm25'] + '\n'
    TypeError: string indices must be integers
    2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '27', 'city': '云南', 'date': '2017-04-02', 'pm25': '15 ', 'province': '玉溪', 'quality': '优', 'rank': '360'}
    Traceback (most recent call last):
      File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
        tmp['pm25'] + '\n'
    TypeError: string indices must be integers
    2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '26', 'city': '云南', 'date': '2017-04-02', 'pm25': '10 ', 'province': '楚雄州', 'quality': '优', 'rank': '361'}
    Traceback (most recent call last):
      File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
        tmp['pm25'] + '\n'
    TypeError: string indices must be integers
    2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '24', 'city': '云南', 'date': '2017-04-02', 'pm25': '11 ', 'province': '迪庆州', 'quality': '优', 'rank': '362'}
    Traceback (most recent call last):
      File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
        tmp['pm25'] + '\n'
    TypeError: string indices must be integers
    2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '22', 'city': '云南', 'date': '2017-04-02', 'pm25': '9 ', 'province': '怒江州', 'quality': '优', 'rank': '363'}
    Traceback (most recent call last):
      File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
        tmp['pm25'] + '\n'
    TypeError: string indices must be integers
    2017-04-03 10:23:14 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-04-03 10:23:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 328,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 38229,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2017, 4, 3, 2, 23, 14, 972356),
     'log_count/DEBUG': 2,
     'log_count/ERROR': 363,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2017, 4, 3, 2, 23, 13, 226730)}
    2017-04-03 10:23:14 [scrapy.core.engine] INFO: Spider closed (finished)

I hope to get some help from you all. Thanks again!

Answers

Answer 1:

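Just write the item directly; there is no need for a loop. Items are processed one at a time, so the pipeline does not receive the list you were imagining: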

    import time


    class Pm25Pipeline(object):
        def process_item(self, item, spider):
            today = time.strftime('%y%m%d', time.localtime())
            fname = str(today) + '.txt'
            with open(fname, 'a') as f:
                # item is a single dict-like record, so index it directly
                f.write(item['date'] + '\t' +
                        item['rank'] + '\t' +
                        item['quality'] + '\t' +
                        item['province'] + '\t' +
                        item['city'] + '\t' +
                        item['aqi'] + '\t' +
                        item['pm25'] + '\n')
            # no explicit f.close() is needed: the with block closes the file
            return item

Answer 2:

Search for "TypeError: string indices must be integers", work out what the error means, locate the offending line, and fix it.

Answer 3:

A Scrapy Item is similar to a Python dict, with a few extra features added.

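By design, Scrapy passes each Item to the pipeline as soon as it is generated. The `for tmp in item` loop you wrote iterates over the keys of the item, and those keys are strings. Indexing a string with another string (the `__getitem__` syntax, as in `tmp['date']`) then raises the error telling you that the index is not an integer.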
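
The behaviour described above can be reproduced without Scrapy. This is a minimal sketch, assuming a plain dict stands in for the Scrapy item; the values are copied from the error log in the question.

```python
# A plain dict stands in for the Scrapy item here (an assumption for illustration).
item = {"date": "2017-04-02", "rank": "357", "pm25": "13 "}

# Iterating over a dict-like object yields its keys, which are strings.
keys = [tmp for tmp in item]

# Indexing one of those strings with another string raises TypeError.
try:
    keys[0]["date"]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Indexing the item itself works, because the item is the mapping.
line = "\t".join([item["date"], item["rank"], item["pm25"]]) + "\n"
```

The exact wording of the message can vary slightly between Python versions, but it should start with "string indices must be integers", matching the traceback in the question.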

Answer 4:

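You can think of an item as a dictionary. Strictly speaking, it provides a dict-like mapping interface rather than literally subclassing dict, but it behaves the same way here. When you iterate over the item directly in the pipeline, each tmp you get is one of its keys, which is a string, so an operation like tmp['pm25'] raises a TypeError: the index of a string object must be an int.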
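
To make the data flow concrete: each `yield` in `parse` hands one item to the pipeline, and `process_item` is called once per item. What follows is a rough simulation in plain Python, not Scrapy's actual engine. The names `fake_parse`, `TsvPipeline`, `run`, and `out_path` are hypothetical and made up for this sketch, and the sample rows are taken from the log in the question. `open_spider` and `close_spider` are the optional hook methods Scrapy calls on a pipeline when the spider starts and finishes; they are used here so that the file is opened only once instead of once per item.

```python
import os
import tempfile

FIELDS = ["date", "rank", "quality", "province", "city", "aqi", "pm25"]


def fake_parse(rows):
    """Hypothetical stand-in for a spider's parse(): yields one dict per row."""
    for row in rows:
        yield dict(zip(FIELDS, row))  # a fresh dict for every row


class TsvPipeline:
    """Writes every item it receives as one tab-separated line."""

    def __init__(self, path):
        # Simplification: a real Scrapy pipeline is instantiated by Scrapy,
        # so the path would normally come from settings, not a constructor argument.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # item is ONE record, so it is indexed directly and never looped over.
        self.file.write("\t".join(item[name] for name in FIELDS) + "\n")
        return item


def run(rows, pipeline):
    """Toy driver imitating how yielded items reach process_item one by one."""
    pipeline.open_spider(spider=None)
    try:
        return [pipeline.process_item(item, spider=None) for item in fake_parse(rows)]
    finally:
        pipeline.close_spider(spider=None)


rows = [
    ("2017-04-02", "357", "优", "新疆", "伊犁哈萨克州", "30", "13"),
    ("2017-04-02", "358", "优", "西藏", "林芝", "28", "11"),
]
out_path = os.path.join(tempfile.mkdtemp(), "pm25.txt")
processed = run(rows, TsvPipeline(out_path))
```

In other words, the spider yields single dict-like items rather than a list of dicts, and the `item` argument of `process_item` is exactly one of those yielded objects.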
