
Troubleshooting a Spark mystery: Premature EOF: no length prefix available (by 高世之智)


Recently, Spark jobs in our production SaaS environment kept failing, and executors on certain nodes showed an abnormally high task-failure count. The investigation took a few twists and turns, so I'm writing up the whole diagnosis here.

The problem showed up in the executor log as follows:

22/07/22 10:20:28 WARN DFSClient: Error Recovery for block BP-888461729-172.16.34.148-1397820377004:blk_15089246483_16183344527 in pipeline 172.16.34.64:50010, 172.16.34.223:50010: bad datanode 172.16.34.64:50010 [DataStreamer for file /bdp/data/u9083189ae0349dbaf8f2f2ee3351579_temp/.hive-staging_hive_2022-07-22_10-16-47_909_6320972307505128897-167/-ext-10000/_temporary/0/_temporary/attempt_20220722101347_2794_m_000034_0/part-00034-c0b2f98a-7049-4a5d-bcf3-b3042cd16645-c000 block BP-888461729-172.16.34.148-1397820377004:blk_15089246483_16183344527]
22/07/22 10:21:07 WARN DFSClient: Slow ReadProcessor read fields took 51412ms (threshold=30000ms); ack: seqno: 10 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1027966, targets: [172.16.34.64:50010, 172.16.34.95:50010] [ResponseProcessor for block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712]
22/07/22 10:22:18 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712 [ResponseProcessor for block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712]
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2207)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
22/07/22 10:23:09 WARN DFSClient: DataStreamer Exception [DataStreamer for file /bdp/data/9ca8f3a33ce543909594541286116aec_temp/.hive-staging_hive_2022-07-22_10-19-47_845_1102463989488675558-143/-ext-10000/_temporary/0/_temporary/attempt_20220722101658_2821_m_000049_0/part-00049-b0b65b5f-f8f0-4b60-9e72-04dde1343ede-c000 block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712]
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.hdfs.DFSOutputStream$Packet.writeTo(DFSOutputStream.java:321)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:643)
22/07/22 10:23:09 WARN DFSClient: Error Recovery for block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712 in pipeline 172.16.34.64:50010, 172.16.34.95:50010: bad datanode 172.16.34.64:50010 [DataStreamer for file /bdp/data/9ca8f3a33ce543909594541286116aec_temp/.hive-staging_hive_2022-07-22_10-19-47_845_1102463989488675558-143/-ext-10000/_temporary/0/_temporary/attempt_20220722101658_2821_m_000049_0/part-00049-b0b65b5f-f8f0-4b60-9e72-04dde1343ede-c000 block BP-888461729-172.16.34.148-1397820377004:blk_15089250682_16183349712]

The executor error log shows the DFSClient hitting "Premature EOF: no length prefix available", which is what triggered the pipeline error recovery and the task retries.

So the next step was to check the DataNode log on that node:

2022-07-21 12:04:27,437 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2022-07-21 12:04:27,437 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:496)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:889)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:757)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:239)
    at java.lang.Thread.run(Thread.java:745)
2022-07-21 12:04:27,438 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975 received exception java.io.IOException: Premature EOF from inputStream
2022-07-21 12:04:27,438 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: bdp-171:50010:DataXceiver error processing WRITE_BLOCK operation src: /172.16.34.214:9107 dst: /172.16.34.214:50010
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:496)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:889)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:757)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:239)
    at java.lang.Thread.run(Thread.java:745)
2022-07-21 12:04:27,441 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975 src: /172.16.34.214:10348 dest: /172.16.34.214:50010
2022-07-21 12:04:27,441 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-888461729-172.16.34.148-1397820377004:blk_15087313175_16181097975
2022-07-21 12:04:27,441 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_15087313175_16181097975, RBW
    getNumBytes()     = 1040384
    getBytesOnDisk()  = 1040384
    getVisibleLength()= 0
    getVolume()       = /data3/hdfs/data/current
    getBlockFile()    = /data3/hdfs/data/current/BP-888461729-172.16.34.148-1397820377004/current/rbw/blk_15087313175
    bytesAcked=0
    bytesOnDisk=1040384

I had never seen this exception before. Googling the DataNode error led me to this post: https://stackoverflow.com/questions/32060049/hadoop-mapreduce-job-i-o-exception-due-to-premature-eof-from-inputstream

In short: a premature EOF can occur for several reasons, one of which is spawning a large number of threads to write to disk on a single reducer node when using FileOutputCommitter. The MultipleOutputs class, which lets you write files with custom names, does exactly this: it spawns one thread per file and binds a port to it for writing to disk. The knob that caps these transfers is dfs.datanode.max.transfer.threads; the community default cited there is 256, with a value range of [1, 8192].

So the first stop was hdfs-site.xml: the property turned out to be configured already, and with a value that was by no means small.
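For reference, a quick way to confirm what a node is actually running with (the config path below is the typical one; adjust for your distribution):

    # print the effective value as the client sees it
    hdfs getconf -confKey dfs.datanode.max.transfer.threads
    # or read it straight out of the config file
    grep -A1 dfs.datanode.max.transfer.threads /etc/hadoop/conf/hdfs-site.xml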

That was a bit odd. My next guess: could an overly small file-handle limit be the cause?

    # check the file-handle limit
    ulimit -n
    # result
    1024000

So the file-handle limit was fine as well.
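One caveat worth noting in passing: ulimit -n in a login shell reports that shell's limit, not necessarily the DataNode's. To be thorough, the limits of the running process itself can be inspected; a minimal sketch, assuming the process command line contains the usual DataNode class name:

    # locate the DataNode process and check the limit it actually runs with
    DN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode | head -n1)
    grep 'open files' /proc/$DN_PID/limits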

Later I came across an article suggesting packet loss as a possible cause, so I checked whether this node was dropping packets; it was not a network problem either.
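A few checks of that kind, as a sketch (172.16.34.64 is the "bad datanode" from the executor log above; the interface name is an assumption):

    # packet loss toward a pipeline peer
    ping -c 100 172.16.34.64
    # NIC error/drop counters
    ip -s link show eth0
    # TCP retransmission statistics
    netstat -s | grep -i retrans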

Finally my thoughts turned to the disks. This physical machine had been in service for a long time; could a disk be going bad? So I ran a disk read/write benchmark in both environments (a command sketch follows the two results):

1. Healthy environment (benchmark output shown as a screenshot in the original post)

2. Problem environment (benchmark output shown as a screenshot in the original post)
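Since the screenshots are not preserved, here is a representative way to compare raw sequential throughput (my sketch, not necessarily the author's exact commands; /data3 is one of the DataNode volumes from the log above, and the test file name is illustrative):

    # sequential write throughput, bypassing the page cache
    dd if=/dev/zero of=/data3/ddtest.bin bs=1M count=1024 oflag=direct
    # sequential read throughput
    dd if=/data3/ddtest.bin of=/dev/null bs=1M iflag=direct
    rm -f /data3/ddtest.bin
    # in another terminal: per-disk utilization and await latency
    iostat -x 1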

The disk in the problem environment turned out to be drastically slower at both reads and writes, so we could be fairly confident the root cause was a failing disk.
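To pin it down further, SMART data from the suspect drive is worth pulling (assuming smartmontools is installed; the device name here is an assumption):

    # overall health verdict plus raw error and reallocated-sector counters
    smartctl -H -A /dev/sdc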

Takeaway: for more than half of the mystery ailments in the software industry, remember to go find the ops team.


