Definition: Solves the single point of failure in the system. If the only NameNode hangs, the cluster stops working and data may be lost.
Principle:
(1) Metadata changes are recorded in QJournal, a distributed log management system. The active NameNode writes its edit log to QJournal, and the standby (inactive) NameNode keeps reading those edits from QJournal and applying them to its own state. Because the standby is refreshed continually, no data is lost when it takes over. The fsimage image file is also refreshed regularly.
(2) If the active NameNode hangs, how is the other NameNode told to take over the service? A controller called ZKFC (ZKFailoverController) monitors the status of the NameNode process in real time and stays in contact with ZooKeeper. When ZKFC believes the NameNode has hung, it notifies the other NameNode to take over.
(3) Why "believes" it has hung? ZKFC judges by the state of the process, and sometimes the NameNode is not actually dead. If the standby is promoted while the old NameNode is still running, two NameNodes manage the DataNodes at the same time. This is split-brain: with two bosses, the system is no longer coordinated. There are two ways to prevent it: SSH kill (send a kill command over SSH) and a shell script. If the SSH kill command is sent and the NameNode does not respond, the shell script is used to kill it. The other NameNode is started as active only after the kill has been confirmed.
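The fencing order described in (3) can be sketched in a few lines of Python. This is only an illustration of the decision logic, not Hadoop's implementation: the function names (`fence`, `fail_over`), the host names `nn1` and `nn2`, and the two stand-in fencing functions are all hypothetical, and nothing here actually opens an SSH connection or kills a process.

```python
from typing import Callable, List, Optional, Tuple

# A fencing method receives the host of the old active NameNode and returns
# True only if it can confirm that the old NameNode has been stopped.
FencingMethod = Callable[[str], bool]


def fence(old_active: str,
          methods: List[Tuple[str, FencingMethod]]) -> Tuple[bool, List[str]]:
    """Try each fencing method in order and stop at the first success."""
    tried = []
    for name, method in methods:
        tried.append(name)
        if method(old_active):
            return True, tried
    return False, tried


def fail_over(old_active: str, standby: str,
              methods: List[Tuple[str, FencingMethod]]) -> Optional[str]:
    """Return the host to promote, or None if fencing could not be confirmed."""
    fenced, _ = fence(old_active, methods)
    # Promote the standby only after the old NameNode is confirmed dead.
    # Otherwise two NameNodes could be active at once (split-brain).
    return standby if fenced else None


# Hypothetical stand-ins for the two methods named in the text.
def ssh_kill_no_response(host: str) -> bool:
    return False  # simulates the case where the SSH kill gets no response


def shell_script_kill(host: str) -> bool:
    return True  # simulates a shell script that kills and confirms


methods = [("ssh kill", ssh_kill_no_response),
           ("shell script", shell_script_kill)]

print(fence("nn1", methods))             # (True, ['ssh kill', 'shell script'])
print(fail_over("nn1", "nn2", methods))  # nn2
print(fail_over("nn1", "nn2", methods[:1]))  # None
```

The last call shows the safety property: when no fencing method succeeds, the standby is not promoted, so the cluster stays without an active NameNode rather than risking two.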