Impossibility of Distributed Consensus with One Faulty Process
Proceedings of the 2nd ACM SIGACT-SIGMOD symposium on Principles of database systems, March 1983
https://dl.acm.org/doi/abs/10.1145/588058.588060
In this paper, it is shown that every protocol for consensus problem has the possibility of nontermination, even with only one faulty process.
在异步环境下,或者真实的系统中,往往会出现进程崩溃、网络分区、消息丢失/乱序/冗余等问题。本文在假设网络可靠(it delivers all messages correctly and exactly once)、non-Byzantine Failure 的异步系统中,即使只有一个进程在不合适的时间停机,也会导致任何分布式提交协议失败。
FLP result 的证明还有如下几点假设:
- make no assumptions about the relative speeds of processes or about the delay time in delivering a message.
- the processes do not have access to synchronized clocks, so algorithms based on time-outs can not be used.
- do not postulate the ability to detect the death of a process
证明过程有点晦涩,可以结合参考中的连接和 paper 一起理解,我只是浏览了一遍,后面需要可能需要继续学习 :(
论文的结论中指出:
FLP result 并非认为 fault-tolerant cooperative computing
问题在实际中不可解决,而是指出了如果想要解决这些问题,需要更细化的错误模型,如对通信时间、失败检测的假设。
Paxos 和 Raft 理论上可能永远运行却不能达成共识(即 nontermination),软件工程通常使用 random 超时机制来减少这种可能性。
References
[1] A Brief Tour of FLP Impossibility
[2] John Feminella on Impossibility of Distributed Consensus with One Faulty Process