
Fixing Garbled Text When Reading HDFS File Data in Java

Date: 2022-08-21 09:51:08  Author: 猪猪

Pitfalls: garbled text when reading HDFS files with the Java API

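I wanted to write an endpoint that reads part of a file on HDFS for preview. After implementing it along the lines of blog posts found online, I noticed that the output was sometimes garbled. For example, when reading a CSV file whose strings are separated by commas:

- An English string such as aaa displays correctly.
- A Chinese string such as “你好” displays correctly.
- A mixed Chinese-English string such as “aaa你好” comes out garbled.

I went through many blog posts, and the suggested fix was nearly always the same: decode with charset xxx. Skeptical, I tried them one by one, and sure enough none of them worked.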

Approach

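HDFS stores a file as raw bytes and keeps no record of its character encoding, and each local file may well use a different encoding. Uploading a local file simply writes the bytes produced by that file's encoding into the file system. When we GET the data back, we therefore face byte streams from different files in different encodings, and no single fixed charset can decode all of them correctly.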
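To make the mismatch concrete, here is a minimal sketch using only the JDK. The class and method names are hypothetical, and GBK is just an assumed example of a local file encoding. It stores a string as GBK bytes and then decodes those bytes once with GBK and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
class FixedCharsetMismatchDemo {

    // Encodes text with one charset and decodes the stored bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] stored = text.getBytes(encodeWith);
        return new String(stored, decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumed encoding of the uploaded file
        String mixed = "aaa\u4f60\u597d";     // "aaa" followed by the two characters of "你好"

        // Same charset on both sides: the text survives.
        System.out.println(roundTrip(mixed, gbk, gbk).equals(mixed));
        // GBK bytes decoded as UTF-8: the Chinese part is damaged.
        System.out.println(roundTrip(mixed, gbk, StandardCharsets.UTF_8).equals(mixed));
        // Pure ASCII is encoded identically in both, so it survives either way.
        System.out.println(roundTrip("aaa", gbk, StandardCharsets.UTF_8).equals("aaa"));
    }
}
```

ASCII-only content decodes correctly under either charset, which is one reason a wrong fixed charset can go unnoticed until a file contains non-ASCII characters.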

So there are really two possible solutions:

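1. Fix the charset used with HDFS. For example, choose UTF-8 and convert every file to UTF-8 at upload time, so that the bytes of all files are stored as UTF-8. Decoding with UTF-8 on read is then safe. The transcoding step still has many problems of its own, though, and is hard to implement.
2. Decode dynamically. Choose the decoding charset according to the encoding of each file. This leaves the file's original bytes untouched and rarely produces garbled text.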
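For comparison, the first option could look roughly like the sketch below (JDK only; the class and method names are hypothetical). It also shows where the difficulty lies: the source charset has to be known at upload time, which is the same detection problem again.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original article.
class Utf8Transcoder {

    // Re-encodes bytes from a known source charset into UTF-8 before storage.
    // If sourceCharset is wrong, the damage is baked into the stored file.
    static byte[] toUtf8(byte[] original, Charset sourceCharset) {
        String text = new String(original, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumed encoding of the local file
        String text = "aaa\u4f60\u597d";
        byte[] stored = toUtf8(text.getBytes(gbk), gbk);
        // Readers can now always decode with UTF-8.
        System.out.println(new String(stored, StandardCharsets.UTF_8).equals(text));
    }
}
```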

I chose dynamic decoding. The hard part is deciding which charset to decode with. The material below gave me the solution.

Detecting the encoding of text (a byte stream) in Java

Requirement:

Detect the encoding of a given file or byte stream.

Implementation:

Based on jchardet

<dependency>
    <groupId>net.sourceforge.jchardet</groupId>
    <artifactId>jchardet</artifactId>
    <version>1.0</version>
</dependency>

The code is as follows:

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.IOUtils; // assumed: Apache Commons IO
import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

public class DetectorUtils {

    private DetectorUtils() {
    }

    // Remembers the charset that jchardet reports once it is confident.
    static class ChineseCharsetDetectionObserver implements nsICharsetDetectionObserver {
        private boolean found = false;
        private String result;

        public ChineseCharsetDetectionObserver(boolean found, String result) {
            super();
            this.found = found;
            this.result = result;
        }

        public void Notify(String charset) {
            found = true;
            result = charset;
        }

        public boolean isFound() {
            return found;
        }

        public String getResult() {
            return result;
        }
    }

    // Returns the detected charset, or the list of probable charsets when
    // the detector cannot settle on a single one.
    public static String[] detectChineseCharset(InputStream in) throws Exception {
        String[] prob = null;
        BufferedInputStream imp = null;
        try {
            boolean found = false;
            String result = StandardCharsets.UTF_8.name();
            int lang = nsPSMDetector.CHINESE;
            nsDetector det = new nsDetector(lang);
            ChineseCharsetDetectionObserver detectionObserver =
                    new ChineseCharsetDetectionObserver(found, result);
            det.Init(detectionObserver);
            imp = new BufferedInputStream(in);
            byte[] buf = new byte[1024];
            int len;
            boolean isAscii = true;
            while ((len = imp.read(buf, 0, buf.length)) != -1) {
                // Keep checking for pure ASCII until a non-ASCII byte shows up.
                if (isAscii) {
                    isAscii = det.isAscii(buf, len);
                }
                // Feed non-ASCII data to the detector; stop once it is sure.
                if (!isAscii) {
                    if (det.DoIt(buf, len, false)) {
                        break;
                    }
                }
            }
            det.DataEnd();
            boolean isFound = detectionObserver.isFound();
            if (isAscii) {
                isFound = true;
                prob = new String[] { "ASCII" };
            } else if (isFound) {
                prob = new String[] { detectionObserver.getResult() };
            } else {
                prob = det.getProbableCharsets();
            }
            return prob;
        } finally {
            IOUtils.closeQuietly(imp);
            IOUtils.closeQuietly(in);
        }
    }
}

Test:

String file = "C:/3737001.xml";
String[] probableSet = DetectorUtils.detectChineseCharset(new FileInputStream(file));
for (String charset : probableSet) {
    System.out.println(charset);
}

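There are also ready-made packages for detecting the encoding of a byte stream; the solution code in this article uses UniversalDetector from juniversalchardet, a Java port of Mozilla's detector that was distributed through Google Code. With that, the plan is clear: read some bytes from the file, detect their encoding with the tool, then decode with the detected charset.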
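The detect-then-decode flow can be sketched independently of any particular library. In the sketch below the detector is a hypothetical stand-in, modelled as a function from bytes to a charset name, and the class and method names are illustrative only. The one thing it adds is a guard: a detector may return null or a name the JVM does not support, so the name is resolved to a usable Charset before decoding, with UTF-8 assumed as the fallback.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical sketch, not part of the original article.
class DetectThenDecode {

    // Maps a detector's answer to a usable Charset, falling back to UTF-8
    // when the name is missing, malformed, or unsupported by this JVM.
    static Charset resolve(String detectedName) {
        if (detectedName == null || detectedName.isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.isSupported(detectedName)
                    ? Charset.forName(detectedName)
                    : StandardCharsets.UTF_8;
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException is a subclass of IllegalArgumentException.
            return StandardCharsets.UTF_8;
        }
    }

    // detector stands in for a real charset detection library.
    static String decode(byte[] bytes, Function<byte[], String> detector) {
        return new String(bytes, resolve(detector.apply(bytes)));
    }

    public static void main(String[] args) {
        String text = "aaa\u4f60\u597d";
        byte[] gbkBytes = text.getBytes(Charset.forName("GBK"));
        // A stub detector that always answers "GBK".
        System.out.println(decode(gbkBytes, b -> "GBK").equals(text));
    }
}
```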

The actual fix

pom

<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

Logic for reading part of a file from HDFS for preview

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.commons.lang3.StringUtils; // assumed: Apache Commons Lang 3
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.mozilla.universalchardet.UniversalDetector;

// These methods live in a service class that already provides fs (a Hadoop
// FileSystem), logger (SLF4J-style) and getHdfsPath(String).

// Get part of the file's data for preview
public List<String> getFileDataWithLimitLines(String filePath, Integer limit) {
    FSDataInputStream fileStream = openFile(filePath);
    return readFileWithLimit(fileStream, limit);
}

// Open the file's data stream; returns null if the file cannot be opened
private FSDataInputStream openFile(String filePath) {
    FSDataInputStream fileStream = null;
    try {
        fileStream = fs.open(new Path(getHdfsPath(filePath)));
    } catch (IOException e) {
        logger.error("fail to open file:{}", filePath, e);
    }
    return fileStream;
}

// Read at most limit lines of file data
private List<String> readFileWithLimit(FSDataInputStream fileStream, Integer limit) {
    byte[] bytes = readByteStream(fileStream);
    String data = decodeByteStream(bytes);
    if (data == null) {
        return null;
    }
    // Accept both \r\n and \n line endings
    List<String> rows = Arrays.asList(data.split("\\r?\\n"));
    return rows.stream().filter(StringUtils::isNotEmpty)
            .limit(limit)
            .collect(Collectors.toList());
}

// Read the raw bytes from the file's data stream, then close the stream
private byte[] readByteStream(FSDataInputStream fileStream) {
    if (fileStream == null) {
        return null;
    }
    byte[] bytes = new byte[1024 * 30];
    int len;
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    try (FSDataInputStream in = fileStream) {
        while ((len = in.read(bytes)) != -1) {
            stream.write(bytes, 0, len);
        }
    } catch (IOException e) {
        logger.error("read file bytes stream failed.", e);
        return null;
    }
    return stream.toByteArray();
}

// Decode the bytes with the detected charset
private String decodeByteStream(byte[] bytes) {
    if (bytes == null) {
        return null;
    }
    String encoding = guessEncoding(bytes);
    String data = null;
    try {
        data = new String(bytes, encoding);
    } catch (Exception e) {
        logger.error("decode byte stream failed.", e);
    }
    return data;
}

// Detect the encoding with UniversalDetector; fall back to UTF-8
private String guessEncoding(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (StringUtils.isEmpty(encoding)) {
        encoding = "UTF-8";
    }
    return encoding;
}
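One thing to watch in the code above: readByteStream loads the whole file into memory even though only the first few lines are shown. For a preview of a large file it may be enough to read a bounded prefix. The following is a minimal sketch with hypothetical class and method names. It works on any InputStream, which includes FSDataInputStream, and the byte cap is an assumed parameter the caller would choose.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the original article.
class PreviewReader {

    // Reads at most maxBytes from the stream and returns what was read.
    // The stream is not closed here; the caller owns it.
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream before the cap was reached
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        System.out.println(readPrefix(new ByteArrayInputStream(data), 10).length);
    }
}
```

Because the cut is made at a byte boundary, the last line of the prefix may be incomplete and may even end in the middle of a multi-byte character, so a caller would probably want to discard the final partial line before returning the preview.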

