A principle for resilient sharing of distributed resources

Proceedings of the 2nd International Conference on Software Engineering, October 1976


本文提出了 primary/backup 复制模型,其本质是将其中一个节点作为主,其它几点作为备份。任一备份节点都可以执行主节点的职责,因此 primary 角色可以迁移到任一节点。


  • It is able to detect and recover from a given maximum number of errors
  • It is reliable to a sufficiently high degree that a user of the resilient service can ignore the possibility of service failure
  • If the service provides perfect detection and recovery from n errors, the (n+1)st error is not catastrophic. A "best effort" is made to continue service
  • The abuse of the service by a single user should have negligible effect on other users of the service

在本文的 Resiliency 服务中,更新操作可以被发送给主或任意备节点,然后用户请求被阻塞,等待服务返回或超时后重新发送请求。如果更新操作被发送给备节点,备节点会将请求转发给主,所有的更新必须由主节点开始更新。


Summary of the message flow for the resiliency scheme



  • host failure (easy to handle)
  • network partition (hard to handle)

Host Failure 的恢复使用 structure modification 消息来处理,当一个节点监测到其下游不可用时,会向 primary 报告,primary 则将需要的节点信息传递给下游,其过程图:

Restructuring after a Host Failure

网络分区的处理则先通过发送 are you alive 消息,来判断是否进入 partitioned operation mode,在此模式下,服务对更新的的处理往往依赖应用程序。