Yesterday a colleague ran into a Hadoop failure and couldn't figure it out after half a day of staring at it, so it landed on me; it took a little while to sort out. This is probably the last fault I'll fix for Baofeng's cluster; from here on, who knows whose problems I'll be solving.
I only managed to capture the NameNode's error log; the DataNode's had already scrolled off the screen, but the errors were much the same.
```
2013-09-03 18:11:44,021 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.blockReceived: blk_8094241928859719036_2147969 is received from dead or unregistered node 192.168.1.99:50010
2013-09-03 18:11:44,022 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs cause:java.io.IOException: Got blockReceived message from unregistered or dead node blk_8094241928859719036_2147969
2013-09-03 18:11:44,022 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000, call blockReceived(DatanodeRegistration(192.168.1.99:50010, storageID=DS-1925877777-192.168.1.99-50010-1372745739682, infoPort=50075, ipcPort=50020), [Lorg.apache.hadoop.hdfs.protocol.Block;@4ec371c, [Ljava.lang.String;@301611ca) from 192.168.1.99:18853: error: java.io.IOException: Got blockReceived message from unregistered or dead node blk_8094241928859719036_2147969
java.io.IOException: Got blockReceived message from unregistered or dead node blk_8094241928859719036_2147969
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.blockReceived(FSNamesystem.java:4188)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.blockReceived(NameNode.java:1069)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Unknown Source)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
```
At first glance it looks like an IPC error. Reading from the bottom up: what appear to be permission errors, then a DataNode that cannot register, and finally a blockReceived report arriving from an unregistered or dead node. That's what stumped my colleague: how can a node that is already dead still be reporting blocks?
Restarting the DataNode only helped briefly; before long it died again.
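For what it's worth, a quick way to confirm what the NameNode thinks of the node before logging into it, on Hadoop 1.x clusters like this one:

```bash
# Ask the NameNode for its view of cluster membership; a DataNode that
# keeps dying shows up in the dead list, or with a stale "Last contact".
hadoop dfsadmin -report
```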
I logged into the DataNode and first checked the permissions on the dfs data directories: all correct. Then I ran df -h and saw that /var was full. Ops had been stingy and given /var only 20G, so once Hadoop could no longer write its logs there, the DataNode naturally died. After I deleted the historical logs under /var/log/hadoop/hdfs, the DataNode started up fine. Going forward there are really only two fixes: either set up a scheduled script that deletes old logs every day, or symlink /var/log/hadoop/hdfs onto a disk with more room.
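A minimal sketch of both options, assuming the log directory is /var/log/hadoop/hdfs as above; the 7-day retention window and the /data mount point are placeholders of mine, not anything from the actual cluster:

```bash
#!/bin/bash
# Option 1: cron job that prunes rotated logs older than 7 days
# (the retention window is arbitrary; tune it to the real log volume).
# Install via crontab, e.g.:  0 3 * * * /root/clean_hdfs_logs.sh
find /var/log/hadoop/hdfs -type f -name '*.log.*' -mtime +7 -delete

# Option 2: one-time move of the log directory onto a larger disk,
# symlinked back into place (/data is a hypothetical mount point).
# Stop the DataNode first so nothing holds the old log files open:
#   mv /var/log/hadoop/hdfs /data/hadoop-hdfs-logs
#   ln -s /data/hadoop-hdfs-logs /var/log/hadoop/hdfs
```

The cron route keeps /var small but needs the retention window matched to how fast the logs grow; the symlink route is a one-time change, and on a box with a 20G /var it's the one I'd lean toward.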
I'm leaving Baofeng Yingyin soon. I have plenty of gripes, and I'll get around to airing them later.