There are N replicas in a group. A replica group stores a collection of items; for example it might store files or file pages. Each item has an identifier and a state.

Each replica stores the state for each item, plus an extra piece of information: a version number. The version number is a pair v1.val v2.val or (v1.val = v2.val and v1.cid v2.cid)

The idea is that if different replicas store different version numbers for an item, the state associated with a larger version number is more recent than the state associated with a smaller version number.

Reads go to a read quorum of size R (for the read to be considered successful it must be acknowledged from R replicas) and writes go to a write quorum of size W (for the write to be successful it must be acknowledged from W replicas). Furthermore,

we require that R+W > N, i.e., read quorums always intersect with write quorums.

<aside> 💡 This relation between W,R,N is a typical tradeoff between latency and consistency. If W=1 or R=1 an operation is returned quickly because of coordinator only needs to wait for a response from any of the replicas. If W or R > 1, the system offers better consistency. If W + R > N then strong consistency is guaranteed because there will be some overlapping replica common in both of them.

</aside>

This will ensure that read results always reflect the result of the most recent write (because the read quorum will include at least one replica that was involved in the most recent write).

For example, consider a group of 3 replicas. Then we have the following possibilities:

R=3 and W=1. This improves performance for writes at the expense of reads, which is probably a bad idea since generally reads are more common than writes. In addition, this choice of quorums is bad because a write might happen at a single replica that then fails. If that replica were to lose its state, the outcome of the write would be lost. So generally we would like to have W>1.
R=1 and W=3. This works very well for reads which is generally good since reads are common. But it is undesirable for writes because if one of the replicas is down or inaccessible, a write cannot complete until that replica recovers.
R=2 and W=2. This choice is a good compromise compared to the R=1 and W=3 choice, since it increases the cost of reads in return for providing reasonable availability for writes.

How it works

Each client machine runs a client-side proxy that carries out the replication protocols. The proxy provides two operations that can be called by user code on the client machine:

write (x, s) s <- read (x)

Here x is an identifier that indicates the item of interest (e.g., the name of a file). The write operation takes the new state that is intended to be stored for that item, and the read operation returns the current state of the item.

The write operation

A write requires either one or two phases. These phases require communication with a quorum, and the phase is complete when a quorum of responses has arrived at the client side.

The proxy sends a read request for the version number to all the replicas, and waits for replies from a read quorum. Then it takes the biggest version number , where c2 is its own client-id.
The proxy sends a write request containing the state and new version number to all the replicas and waits to receive acknowledgements from a write quorum. At that point the write operation is complete and the proxy can return to the user code.

If there are concurrent writers, the one choosing the largest version number will prevail. The protocol doesn't provide any synchronization; instead this would be done outside, e.g., by using locks.

The first phase can be avoided if the client knows what version number to provide without having to read it. This will be true if there is just one writer, i.e., one client modifies the item and all the other clients just read it. It can also be true if there is some way that writers coordinate to find out the proper version number without reading it.

Note that if the first phase is needed, it doesn't do any good to choose a small write quorum and a larger read quorum since completing a write will involve communicating with a read quorum.

The Read Operation