0. Prelude
This article is relevant to lecture 14. Although the paper spends a lot of space on fault recovery, I will skip those parts, which are unimportant for the lecture.
1. Overview
- FaRM is a distributed in-memory database consisting of many machines (including clients) in one datacenter, leveraging RDMA and DRAM backed by UPS (so DRAM is effectively non-volatile).
- The configuration is a tuple <i, S, F, CM> representing a unique configuration identifier, the set of machines comprising FaRM, a mapping from machines to failure domains, and the configuration manager, respectively; it is distributed via Zookeeper. FaRM exposes to the application a global address space composed of 2 GB regions.
- The configuration manager maintains a mapping from each region identifier to the primary and backup machines storing the corresponding region (this mapping is not stored in Zookeeper and is cached on other machines) and is responsible for allocating new regions.
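To make the configuration concrete, here is a minimal sketch of the configuration tuple and the region mapping the CM maintains. All class and field names here are my own, not from the paper:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of FaRM's configuration state; names are mine.
@dataclass
class Configuration:
    config_id: int         # i: unique configuration identifier
    machines: set          # S: set of machines in this configuration
    failure_domains: dict  # F: machine -> failure domain (e.g. rack)
    cm: str                # CM: the configuration manager machine

@dataclass
class ConfigurationManager:
    config: Configuration
    # region id -> (primary, [backups]); cached by other machines, not in Zookeeper
    region_map: dict = field(default_factory=dict)

    def allocate_region(self, region_id, primary, backups):
        # Place the primary and each backup in distinct failure domains.
        domains = {self.config.failure_domains[m] for m in [primary, *backups]}
        assert len(domains) == len(backups) + 1, "replicas must span failure domains"
        self.region_map[region_id] = (primary, list(backups))

cm = ConfigurationManager(Configuration(
    config_id=1,
    machines={"m1", "m2", "m3"},
    failure_domains={"m1": "rack1", "m2": "rack2", "m3": "rack3"},
    cm="m1",
))
cm.allocate_region(0, "m1", ["m2"])
print(cm.region_map[0])  # ('m1', ['m2'])
```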
2. Transaction process
An atomic distributed transaction, which is driven by the client as in Spanner, can be split into the following steps.
- Read all the objects the transaction needs from the primary servers storing them, and remember each object's version number.
- Buffer all writes in local memory.
- Send each primary server storing a written object a LOCK log record using RDMA (in fact, each machine reserves one RDMA queue pair for messages and one for logs for every other machine).
- Each primary server tries to lock each written object and checks whether its version number has changed. If locking any object fails or any version number has changed, the transaction must be aborted; otherwise the server replies with a LOCK-REPLY message to the client, which plays the coordinator role in two-phase commit.
- Once it has received all LOCK-REPLY messages, the client validates all objects it read but did not write by re-reading their version numbers.
- The client sends a COMMIT-BACKUP log record to each backup server.
- As soon as it has received the hardware ACKs from all backup servers, the client sends a COMMIT-PRIMARY message to all primary servers. Once it receives one ACK, the client API can return COMMIT SUCCESS to the caller (from the recovery protocol, we can see that a system with F backups per object can guarantee this in-flight transaction will still be committed even if F of the machines storing the object fail after one COMMIT-PRIMARY ACK is received).
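The client-driven commit flow above can be sketched as a toy single-machine model. This is my own illustration, not FaRM's implementation: RDMA, backups, and failure handling are elided, and every written object is assumed to have been read first.

```python
# Toy model of FaRM's client-driven commit; names and structure are mine.

class Primary:
    def __init__(self, store):
        self.store = store   # oid -> (version, value)
        self.locks = set()   # oids currently locked by a committing writer

    def lock(self, oid, read_version):
        # LOCK record: fail if already locked or the version changed since the read.
        if oid in self.locks or self.store[oid][0] != read_version:
            return False
        self.locks.add(oid)
        return True

    def unlock(self, oid):
        self.locks.discard(oid)

    def validate(self, oid, read_version):
        # Read validation: version unchanged AND not locked by a writer.
        return oid not in self.locks and self.store[oid][0] == read_version

    def commit(self, oid, value):
        version, _ = self.store[oid]
        self.store[oid] = (version + 1, value)
        self.locks.discard(oid)

def commit_transaction(primaries, read_set, write_set):
    """read_set: {oid: version seen}; write_set: {oid: new value}."""
    # Phase 1: LOCK records for the write set.
    locked = []
    for oid in write_set:
        if primaries[oid].lock(oid, read_set[oid]):
            locked.append(oid)
        else:
            for l in locked:
                primaries[l].unlock(l)
            return "ABORT"
    # Validate objects read but not written.
    for oid, ver in read_set.items():
        if oid not in write_set and not primaries[oid].validate(oid, ver):
            for l in locked:
                primaries[l].unlock(l)
            return "ABORT"
    # COMMIT-BACKUP / COMMIT-PRIMARY messages elided: install the writes.
    for oid, value in write_set.items():
        primaries[oid].commit(oid, value)
    return "COMMIT"

p = Primary({"x": (0, 0), "y": (0, 0)})
primaries = {"x": p, "y": p}
print(commit_transaction(primaries, {"x": 0, "y": 0}, {"y": 1}))  # COMMIT
print(commit_transaction(primaries, {"x": 0, "y": 0}, {"x": 1}))  # ABORT: y's version changed
```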
3. Discussion about correctness
Actually, the transaction process I described above has a fatal error (I did not notice it when reading the paper until the teacher pointed it out in the lecture).
Before discussing where the error is, let us first define the serialization point as a timestamp at which all operations specified in a transaction appear to complete instantaneously.
In a two-phase locking scheme, it is easy to see that we can take the timestamp at which the transaction holds all the locks it needs as the serialization point.
We will see that the locking scheme of this paper is essentially identical to two-phase locking, so it defines the same serialization point. Consider the following example.
T1 begin
read x
if x = 0
set y = 1
T1 end
T2 begin
read y
if y = 0
set x = 1
T2 end
And the following execution order.
--------> time
T1: read x lock y validate x commit
T2: read y lock x validate y commit
If in the validation stage we only check version numbers, T1 and T2 will both commit successfully, which is not a serializable result!
So we do need an adjustment in the validation stage. Now, validating object y checks not only its version number but also whether y's lock is currently held by someone else.
This adjustment essentially guarantees that no one can read y once we have locked y.
If we think of reading object y as acquiring a read lock at the same time, and releasing the read lock when y is validated successfully, the optimistic concurrency control of the paper reduces to a two-phase locking scheme!
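To see the fix at work, here is a toy replay of the T1/T2 interleaving above, with and without the lock check during validation. The code is my own illustration, not from the paper:

```python
# Toy replay of the T1/T2 race; all names are my own.

def run_race(check_locks_in_validation):
    version = {"x": 0, "y": 0}
    locked = set()

    # Both transactions read before either one locks.
    t1_read_x, t2_read_y = version["x"], version["y"]

    # T1 locks y, T2 locks x (the write sets are disjoint, so both succeed).
    locked.update({"y", "x"})

    def validate(oid, read_version):
        if version[oid] != read_version:
            return False  # version changed since the read
        if check_locks_in_validation and oid in locked:
            return False  # the adjustment: a locked object fails validation
        return True

    t1_ok = validate("x", t1_read_x)  # T1 validates its read of x
    t2_ok = validate("y", t2_read_y)  # T2 validates its read of y
    return t1_ok, t2_ok

print(run_race(check_locks_in_validation=False))  # (True, True): both commit -- not serializable
print(run_race(check_locks_in_validation=True))   # (False, False): both abort, which is safe
```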
4. Limitation
- Application code can observe inconsistencies while executing transactions that will eventually abort, for example when a transaction reads a big object at the same time that a committing transaction is overwriting it.
- Scalability?
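The torn-read problem in the first limitation can be illustrated with a seqlock-style version re-check. This is my own sketch of the general technique, not FaRM's exact mechanism: a reader copies the object and then re-reads the version; if a writer bumped it mid-copy, the copy may be torn and must be retried.

```python
# Seqlock-style torn-read detection; my own sketch, not FaRM's mechanism.

import threading

class VersionedObject:
    def __init__(self, words):
        self.version = 0          # even: stable; odd: write in progress
        self.words = list(words)
        self._mu = threading.Lock()

    def write(self, words):
        with self._mu:
            self.version += 1     # odd: mark write in progress
            self.words = list(words)
            self.version += 1     # even: write complete

    def consistent_read(self):
        while True:
            v1 = self.version
            snapshot = list(self.words)
            v2 = self.version
            # Retry if a writer raced with the copy.
            if v1 == v2 and v1 % 2 == 0:
                return snapshot

obj = VersionedObject([1, 2, 3])
obj.write([4, 5, 6])
print(obj.consistent_read())  # [4, 5, 6]
```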