flp - 论文阅读笔记

Impossibility of Distributed Consensus with One Faulty Process

Proceedings of the 2nd ACM SIGACT-SIGMOD symposium on Principles of database systems, March 1983

https://dl.acm.org/doi/abs/10.1145/588058.588060

In this paper, it is shown that every protocol for consensus problem has the possibility of nontermination, even with only one faulty process.

在异步环境下，或者真实的系统中，往往会出现进程崩溃、网络分区、消息丢失/乱序/冗余等问题。本文在假设网络可靠（it delivers all messages correctly and exactly once）、non-Byzantine Failure 的异步系统中，即使只有一个进程在不合适的时间停机，也会导致任何分布式提交协议失败。

FLP result 的证明还有如下几点假设:

make no assumptions about the relative speeds of processes or about the delay time in delivering a message.
the processes do not have access to synchronized clocks, so algorithms based on time-outs can not be used.
do not postulate the ability to detect the death of a process

证明过程有点晦涩，可以结合参考中的连接和 paper 一起理解，我只是浏览了一遍，后面需要可能需要继续学习 :(

论文的结论中指出：

FLP result 并非认为 fault-tolerant cooperative computing 问题在实际中不可解决，而是指出了如果想要解决这些问题，需要更细化的错误模型，如对通信时间、失败检测的假设。

Paxos 和 Raft 理论上可能永远运行却不能达成共识（即 nontermination），软件工程通常使用 random 超时机制来减少这种可能性。

References

[1] A Brief Tour of FLP Impossibility
[2] John Feminella on Impossibility of Distributed Consensus with One Faulty Process