Recovery
Except in stateless systems, recovery depends heavily on accessible data replicas. This means the recovery approach is closely tied to the replication approach.
Snapshots and checkpoints
The most common recovery approach is to restore from a snapshot of the last known system state. Periodically saving the state of the distributed system is known as checkpointing.
In the event of a failure, the system can be rolled back to the last known good checkpoint to restore a consistent state. Any data written after that checkpoint is lost, so the amount of data loss depends on how often snapshots are taken.
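The checkpoint-and-rollback behavior described above can be illustrated with a minimal single-process sketch. The class name CheckpointedStore, the checkpoint_every parameter, and the in-memory dictionary are all assumptions made for illustration; a real distributed system would write checkpoints to durable, replicated storage and coordinate them across nodes.

```python
import copy


class CheckpointedStore:
    """Toy in-memory key-value store that checkpoints every N writes.

    Hypothetical example: the checkpoint lives in memory here, whereas a
    real system would persist it to durable storage.
    """

    def __init__(self, checkpoint_every=3):
        self.state = {}
        self.checkpoint_every = checkpoint_every
        self._checkpoint = {}
        self._writes_since_checkpoint = 0

    def put(self, key, value):
        self.state[key] = value
        self._writes_since_checkpoint += 1
        if self._writes_since_checkpoint >= self.checkpoint_every:
            self.take_checkpoint()

    def take_checkpoint(self):
        # Deep copy so later writes cannot mutate the saved snapshot.
        self._checkpoint = copy.deepcopy(self.state)
        self._writes_since_checkpoint = 0

    def recover(self):
        """Roll back to the last checkpoint; return the number of lost writes."""
        lost = self._writes_since_checkpoint
        self.state = copy.deepcopy(self._checkpoint)
        self._writes_since_checkpoint = 0
        return lost


store = CheckpointedStore(checkpoint_every=3)
for key, value in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]:
    store.put(key, value)

lost_writes = store.recover()  # the write to "d" happened after the checkpoint
```

In this sketch, a smaller checkpoint_every reduces the number of writes that can be lost, at the cost of copying the state more often.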
Change logs
A system state can also be restored by replaying the change logs of all operations and transactions within the distributed system.
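As a rough sketch of log replay, the example below rebuilds a key-value state by applying every logged operation in order, starting from an empty state. The operation format (a tuple of kind, key, value) and the function names apply_op and replay are hypothetical; real systems define their own log record formats and must also handle ordering across nodes.

```python
def apply_op(state, op):
    """Apply one logged operation to the state in place (hypothetical format)."""
    kind, key, value = op
    if kind == "set":
        state[key] = value
    elif kind == "delete":
        state.pop(key, None)
    else:
        raise ValueError(f"unknown operation kind: {kind!r}")


def replay(log):
    """Rebuild the state from scratch by applying every logged operation in order."""
    state = {}
    for op in log:
        apply_op(state, op)
    return state


change_log = [
    ("set", "a", 1),
    ("set", "b", 2),
    ("delete", "a", None),
    ("set", "b", 3),
]

restored = replay(change_log)
print(restored)  # {'b': 3}
```

Because replay starts from an empty state, its cost grows with the length of the log.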
It’s common to recover distributed systems using a combination of checkpoints and change logs. This is similar to the event sourcing recovery method mentioned...