A principle for resilient sharing of distributed resources

Proceedings of the 2nd International Conference on Software Engineering, October 1976

https://dl.acm.org/doi/10.5555/800253.807732

本文提出了 primary/backup 复制模型,其本质是将其中一个节点作为主,其它几点作为备份。任一备份节点都可以执行主节点的职责,因此 primary 角色可以迁移到任一节点。

Resiliency(即恢复能力),具有如下四个主要特性:

  • It is able to detect and recover from a given maximum number of errors
  • It is reliable to a sufficiently high degree that a user of the resilient service can ignore the possibility of service failure
  • If the service provides perfect detection and recovery from n errors, the (n+1)st error is not catastrophic. A "best effort" is made to continue service
  • The abuse of the service by a single user should have negligible effect on other users of the service

在本文的 Resiliency 服务中,更新操作可以被发送给主或任意备节点,然后用户请求被阻塞,等待服务返回或超时后重新发送请求。如果更新操作被发送给备节点,备节点会将请求转发给主,所有的更新必须由主节点开始更新。

备份的架构可以是线性的(linearly),如下图:

Summary of the message flow for the resiliency scheme

也可以是广播的模式,广播的模式具有最小的延迟、需要更少的发送消息,但相对线性模式其错误恢复更复杂。

系统中有两种失败需要被处理:

  • host failure (easy to handle)
  • network partition (hard to handle)

Host Failure 的恢复使用 structure modification 消息来处理,当一个节点监测到其下游不可用时,会向 primary 报告,primary 则将需要的节点信息传递给下游,其过程图:

Restructuring after a Host Failure

网络分区的处理则先通过发送 are you alive 消息,来判断是否进入 partitioned operation mode,在此模式下,服务对更新的的处理往往依赖应用程序。