python - Approaches to merging large text data files

Views: 127
Date: 2022-08-12 15:46:37

Problem description

Background:

I have three csv files, as follows:

afile: userid, username, ...
bfile: postid, userid, postname, ...
cfile: postid, postnum, ...

afile = 10G
bfile = 150G
cfile = 20G

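Note: the field separator is not a single character (such as a comma) but a string of several special characters. Some fields can contain single-character separators; I tried every single character on the keyboard and each of them appears somewhere in the data, so a multi-character string is used as the delimiter instead. The files are therefore not strictly csv, which is the most painful part.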
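One way to remove this obstacle is to rewrite each file once as standard, quoted csv, so that ordinary csv tools can read it. The following is a minimal sketch, not the asker's method: the name `convert_to_csv` is hypothetical, the separator `##` is taken from the `sep` argument in the question's code, and it assumes one record per line (no newlines inside fields).

```python
import csv
import io

SEP = "##"  # assumed separator, copied from sep='##' in the question's code


def convert_to_csv(src, dst, sep=SEP):
    """Rewrite lines delimited by `sep` as standard, quoted csv.

    `src` is any iterable of text lines, `dst` is a writable text file.
    Assumes one record per line. Returns the number of rows written.
    """
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for line in src:
        # Split on the multi-character separator; csv.writer then quotes
        # any field that contains a comma or a double quote.
        writer.writerow(line.rstrip("\r\n").split(sep))
        count += 1
    return count


# Small in-memory demonstration with fields containing a comma and quotes.
sample = io.StringIO('1##hello, world##42\n2##say "hi"##7\n')
out = io.StringIO()
rows = convert_to_csv(sample, out)
print(out.getvalue())
```

For the real files, `src` and `dst` would be file objects opened with `open(..., newline="")`. The conversion streams line by line, so memory use stays small regardless of file size.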

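Goal:

I want to merge these three files: join bfile and cfile on the postid column, then join the result with afile on the userid column. The final output should look roughly like postid, userid, postname, postnum, username.

My current pseudocode is as follows: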

    import pandas as pd

    # afile, bfile and cfile are the paths of the three input files.
    CHUNKSIZE = 1000000  # 1,000,000 rows per chunk; no problems so far
    SEP = '##'           # multi-character separator, requires engine='python'


    def read_chunks(path):
        # Return an iterator over the file as DataFrame chunks.
        return pd.read_csv(path, chunksize=CHUNKSIZE, engine='python', sep=SEP)


    try:
        result_chunks = []
        # Read bfile chunk by chunk.
        for bchunk in read_chunks(bfile):
            # Re-read cfile chunk by chunk for every bfile chunk.
            for cchunk in read_chunks(cfile):
                bc_chunk = pd.merge(bchunk, cchunk, on='postid')
                # Non-empty means bfile and cfile share some postid values.
                if bc_chunk.empty:
                    continue
                # Re-read afile chunk by chunk for every matching pair.
                for achunk in read_chunks(afile):
                    chunk_result = pd.merge(bc_chunk, achunk, on='userid')
                    # Non-empty means there are shared userid values.
                    if not chunk_result.empty:
                        result_chunks.append(chunk_result)
        if result_chunks:
            pd.concat(result_chunks).to_csv('result.csv', index=False)
    except Exception as e:
        print(e)

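But this feels very inefficient, so I would be grateful for better approaches and suitable tools.

PS: this is a pseudo "big data" problem; the data is just somewhat large.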

Answers

Answer 1:

Don't write code for this. It looks like a job for a one-line shell script: use the xsv join subcommand.
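If staying in Python is preferred, one alternative to the nested chunk loops in the question is to load the three files into an on-disk database and let it perform the join. The sketch below uses the standard-library `sqlite3` module; this is a swapped-in technique, not what the answer proposes. The helper names `load_rows` and `join_all`, the table names `a`, `b`, `c`, and the toy rows are all hypothetical, and only the columns named in the question are included.

```python
import sqlite3


def load_rows(conn, table, columns, rows):
    """Create `table` with the given columns and bulk-insert `rows`.

    `rows` may be any iterable of tuples, including a generator that
    streams records from a large file.
    """
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE {table} ({cols})")
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)


def join_all(conn):
    """Index the lookup tables on their join keys and return a row cursor."""
    conn.execute("CREATE INDEX idx_c_postid ON c (postid)")
    conn.execute("CREATE INDEX idx_a_userid ON a (userid)")
    query = """
        SELECT b.postid, b.userid, b.postname, c.postnum, a.username
        FROM b
        JOIN c ON c.postid = b.postid
        JOIN a ON a.userid = b.userid
    """
    return conn.execute(query)


# Toy data standing in for afile, bfile and cfile.
conn = sqlite3.connect(":memory:")
load_rows(conn, "a", ["userid", "username"],
          [("u1", "alice"), ("u2", "bob")])
load_rows(conn, "b", ["postid", "userid", "postname"],
          [("p1", "u1", "first"), ("p2", "u2", "second"), ("p3", "u9", "orphan")])
load_rows(conn, "c", ["postid", "postnum"],
          [("p1", "10"), ("p2", "20")])

# Sorted only to make the demonstration output deterministic.
result = sorted(join_all(conn))
print(result)
```

As with the default `pd.merge` in the question, these are inner joins, so the post whose userid has no match in `a` is dropped. For the real files, the connection would point at a database file on disk instead of `":memory:"`, and each `rows` argument would be a generator that splits input lines on the multi-character separator.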
