Monday, September 4, 2017

CAP theorem is mostly misunderstood

I have always been a big fan of the CAP theorem and its corollary PACELC. Many articles discuss these theorems, but they often end up making misdirected claims by categorizing systems one way or another.

After much googling, I was not able to find a single article that clearly explained how these rules should be applied in real life. So, here is a way to make it easy for a layperson:

CAP Recap

CAP is actually well explained by most write-ups, and the Wikipedia article is a good starting point. The one-sentence explanation is: if a system is distributed, you can expect operations on it to be either consistent or available, but not both.

The PACELC corollary is essentially the CAP theorem, where Latency replaces Availability.

The CAP theorem can also be extended by replacing Consistency with Durability, if a system chooses to achieve Durability by writing to multiple nodes. But that is another topic.

The theorem gets misunderstood because people try to categorize systems as CA, CP, or AP.

In reality, the theorem must be applied per-operation. A system is capable of exhibiting all of the above properties, just not at the same time.

This is why it's incorrect to categorize a system into a corner of the CAP triangle.

Vitess as Example

Since I work with Vitess, I'll explain how it can satisfy all the above categories. However, the rules can be extended to any distributed system.

CA

CA means that you want data to be Consistent and Available. This means that you have to forego partition tolerance. The way to achieve this is to co-locate your app with the master databases. You can then read and write to the master nodes and enjoy the CA benefits.

Variation (suggested by Dahlia Malkhi)
CA can also be achieved by using intersecting quorums. This is natural for systems that use consensus algorithms like Paxos or Raft. In the case of Vitess, this is achieved through automated master fail-overs followed by optional fail-over of app server traffic.

CP

CP means that you want consistent data at all costs. If there is a network partition, you'd rather wait. Obviously, everyone would prefer CA over CP. One chooses CP only because the app needs to be distributed across multiple locations.

Variation 1
Send all reads to the master database no matter where you are. If there is a network partition, a read will not be possible until you can talk to the master again.

Variation 2
Wait for the master to replicate the latest data to a replica before performing a local read. In such cases, the read can be held up waiting for replication to be up-to-date. Note: this is currently not supported in Vitess.
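
For illustration, here's a minimal sketch of what such a read could look like against plain MySQL using GTIDs, outside of Vitess. The connection handles, the query, and the exact behavior of WAIT_FOR_EXECUTED_GTID_SET are assumptions to verify against your MySQL version.

```go
package readconsistency

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // assumed driver
)

// consistentReplicaRead waits until the replica has applied everything the
// master has executed, then performs the read locally. If replication can't
// catch up within the timeout (e.g. during a partition), the read is refused:
// that's the CP choice described above.
func consistentReplicaRead(master, replica *sql.DB, query string) (*sql.Rows, error) {
	// Capture the master's current GTID position.
	var gtidSet string
	if err := master.QueryRow("SELECT @@GLOBAL.gtid_executed").Scan(&gtidSet); err != nil {
		return nil, err
	}

	// Block on the replica until that position is applied (10-second timeout).
	// WAIT_FOR_EXECUTED_GTID_SET returns nonzero if the timeout expired;
	// check the semantics for your MySQL version.
	var timedOut int
	if err := replica.QueryRow("SELECT WAIT_FOR_EXECUTED_GTID_SET(?, 10)", gtidSet).Scan(&timedOut); err != nil {
		return nil, err
	}
	if timedOut != 0 {
		return nil, fmt.Errorf("replica still lagging; refusing to serve a stale read")
	}

	// The replica is now at least as current as the master was above.
	return replica.Query(query)
}
```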

Variation 3
Make the writes wait till the data is copied to all relevant replicas. This will cause writes to be slow, or hang if there is a partition. Vitess semi-sync replication falls in this category. However, semi-sync is meant more for achieving Durability rather than Consistency.

In all the above schemes, a network partition will cause the system to stall. And in the absence of a network partition, the system will still be subject to the cost of round-trip latency across data centers.

AP

AP means that you can tolerate stale data. To achieve this, you can provision replicas in all places where the apps are, and just read from them. If there is a partition, the reads will still be successful and quick, but the data will be stale.

AP-CP Trade-off

There are ways to trade-off between AP and CP. For example, you can say that you'll tolerate staleness up to X seconds. Operations will be AP as long as the lag is within limits. If it becomes too high, the operation becomes CP, and refuses to serve the data until replication is caught up.

In Vitess, this trade-off is controlled with two durations:

  • discovery_low_replication_lag: Vitess disregards replicas that are lagging by more than this value, unless all of them are lagged, in which case it serves the queries anyway.
  • discovery_high_replication_lag_minimum_serving: a higher value that tells Vitess to unconditionally disregard replicas lagging beyond this amount.
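
To make the two thresholds concrete, here's a small sketch of the filtering behavior they describe. The flag names above are real Vitess flags, but this code is only an illustration; the type and function names (tablet, pickServable) are made up and are not Vitess's implementation.

```go
package lagfilter

import "time"

// tablet is a hypothetical stand-in for a serving replica and its current lag.
type tablet struct {
	name string
	lag  time.Duration
}

// pickServable applies the two-threshold trade-off:
//   - replicas lagging beyond highLag are unconditionally disregarded (CP cut-off),
//   - replicas within lowLag are preferred (AP with bounded staleness),
//   - if every remaining replica exceeds lowLag, they are served anyway.
func pickServable(tablets []tablet, lowLag, highLag time.Duration) []tablet {
	var fresh, usable []tablet
	for _, t := range tablets {
		if t.lag > highLag {
			continue // never serve from these
		}
		usable = append(usable, t)
		if t.lag <= lowLag {
			fresh = append(fresh, t)
		}
	}
	if len(fresh) > 0 {
		return fresh
	}
	return usable // all are lagged beyond lowLag, but still under highLag
}
```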

Behavior of Systems

Some systems may prefer CP over CA, or vice-versa. Some of them may also choose to give you only one. However, any system can choose to support all modes of operation.

In this light, the better way to evaluate a system is to ask which of its operations are CA and which are CP.

CA is either an end-user choice, achieved by deciding to co-locate the app with the master nodes, or an operational decision based on how quorums are configured or how failovers are managed.

Tuesday, September 6, 2016

Distributed Durability in MySQL

This blog post proposes modifications to the MySQL semi-sync replication process in order to improve the overall consistency and resilience of the system.

This is based on my previous blog post on Flexible Paxos and the related paper Flexible Paxos: Quorum intersection revisited by Howard, Malkhi and Spiegelman.


Background

Durability requirements have changed over the last few years. Traditional systems considered their data Durable if it was written to disk. However, this is not acceptable any more. In today’s world, data is considered durable only if it has been replicated to more than one machine.

MySQL is one of the few databases that has made an effort to satisfy this form of durability. It supported the ability to replicate data from a master to its replicas in near-realtime. But this ability did not address the failure mode where a master database experiences a network partition. If this happened, then commits would continue to succeed, but the data would fail to get replicated. An eventual loss of such a master would result in data loss.

In order to address this problem, MySQL introduced semi-sync replication. This feature requires that a transaction be successfully sent to at least S replicas (where S is the number of semi-sync acks) before it is considered a success.

There are many versions of semi-sync. For this particular blog, we’re assuming the lossless version where the master waits for the acks before committing. Also, semi-sync needs to be configured with async fallback "disabled" (by setting the timeout to several decades, and enabling wait_no_slave).
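
For reference, here's a hedged sketch of how this configuration might be applied from Go. The variable names are the standard MySQL 5.7 semi-sync plugin settings, but treat the exact names, value limits, and the helper below as assumptions to check against your MySQL version.

```go
package semisyncsetup

import (
	"database/sql"
	"fmt"
	"math"
)

// configureLosslessSemiSync enables lossless semi-sync on the master with
// async fallback effectively disabled, requiring s replica acks per commit.
func configureLosslessSemiSync(masterDB *sql.DB, s int) error {
	stmts := []string{
		"SET GLOBAL rpl_semi_sync_master_enabled = ON",
		// Lossless: wait for acks before the commit becomes visible.
		"SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC'",
		// A very large timeout (in ms) so the master never falls back to async,
		// the post's "several decades"; the maximum accepted value depends on
		// the MySQL version.
		fmt.Sprintf("SET GLOBAL rpl_semi_sync_master_timeout = %d", uint64(math.MaxUint32)),
		// Keep waiting even if no semi-sync replica is currently connected.
		"SET GLOBAL rpl_semi_sync_master_wait_no_slave = ON",
		// S: number of replica acks required before a commit succeeds.
		fmt.Sprintf("SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = %d", s),
	}
	for _, stmt := range stmts {
		if _, err := masterDB.Exec(stmt); err != nil {
			return fmt.Errorf("%s: %w", stmt, err)
		}
	}
	return nil
}
```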

This approach has a few advantages:

  • If the master database crashes, then we know that at least S other replicas have received all its committed transactions. So, one of them could be designated as the new master, and the system can resume without data loss.
  • If the master database encountered a network partition, none of its transactions would succeed. If this happened, we can safely failover to another replica that is still able to connect to the rest of the cluster. We can then abandon the old master because we know it could not have progressed ahead of the replicas.
  • The system can survive the failure of up to S nodes without loss of durability.

There are also a few drawbacks:

  • If a master is only able to replicate to fewer than S replicas, then it won't be able to commit those transactions. In this case, the replicas themselves will be ahead of the master. Applications that read from those replicas will end up seeing future values (uncommitted reads).
  • Extending the above story, the eventual failure of such replicas could lead to loss of replicated data that may not be recoverable (ghost reads).
  • Semi-sync does not specify how the system should be repaired if faced with various failures. Without this specification, the story remains incomplete and open to the interpretations of the implementers, and they may not have thought through all possible scenarios. Tools like Orchestrator have tried to close this gap, but the solutions are not perfect.

Let’s do a quick analysis of the failure modes. They fall into two categories:

  1. Master failure: This is the easy part. We figure out which replica is most up-to-date, make that the master, and point all the other replicas to it. Note that this is a simplification; the actual workflow involves many steps and verifications.
  2. Network partition: If the master is unable to reach S nodes, then it will lock up. We then have to look for the most progressed replica on the other side of the partition and make that the master. However, the nodes that got left out might have already diverged. If so, they’ll have to be rebuilt from scratch before they can join the quorum.


The network partition issue discourages the use of high values of S. Most systems tend to run with S=1.

Although not directly stated, this is basically a distributed consensus problem.


The solution

To address the above failure modes, all we need to do is make semi-sync a two-step process:

  • Send the data to the replicas as before and wait for S acks. However, the replicas do not commit those transactions yet.
  • Once the necessary acks are received, send commit confirmations, which will inform the replicas that they can commit the data now.
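
Here's a bare-bones sketch of that flow from the master's point of view. The replicaConn interface and its methods are hypothetical stand-ins; this only illustrates the ordering of the two steps, not a real wire protocol.

```go
package twostep

import "errors"

// replicaConn is a hypothetical connection to a semi-sync replica.
type replicaConn interface {
	Relay(txn []byte) error   // step 1: ship the transaction (relay log only, no commit)
	Confirm(txn []byte) error // step 2: tell the replica it may now commit
}

// commitTwoStep ships the transaction, waits for s successful acks, and only
// then commits locally and sends the commit confirmations.
func commitTwoStep(replicas []replicaConn, txn []byte, s int, commitLocally func() error) error {
	acks := 0
	for _, r := range replicas {
		if err := r.Relay(txn); err == nil {
			acks++
		}
	}
	if acks < s {
		// The transaction is never confirmed, so no replica commits it and
		// no reader can observe an uncommitted "future" value.
		return errors.New("insufficient semi-sync acks; transaction abandoned")
	}
	if err := commitLocally(); err != nil {
		return err
	}
	// Confirmations are lightweight; a real implementation could piggyback
	// them on the next transaction.
	for _, r := range replicas {
		_ = r.Confirm(txn)
	}
	return nil
}
```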

If we work through the failure scenarios, it will be evident that this change addresses the previously described problems. But there's an easier way to convince ourselves, which is described below.


Proof

The change essentially makes the algorithm match the formal steps required by a consensus protocol like Paxos. To be more precise, it follows the requirements of the new FPaxos generalization. Once we’re convinced that we comply, we can rely on the existing proofs and guarantees to gain confidence.

In my previous post about Flexible Paxos, I’ve listed three steps:

  1. Leader election
  2. Proposal Agreement
  3. Proposal Finalization

Semi-sync replication is a combination of steps 2 & 3. The only difference is that replicas that receive a proposal automatically treat it as final. With the new change, replicas have to wait till they receive a finalization message. This change, along with the master waiting for the required number of acks before sending the commit message, will make this Paxos compliant.

Traditional Paxos requires majority agreement for steps 1 and 2. However, the new FPaxos generalization shows that any kind of overlap between the two quorums is sufficient. The L+P > N formula gives you that guarantee. In our case, P = S+1 (semi-sync replicas plus the master).
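
As a quick worked example (with illustrative numbers): if N = 5 nodes and the master waits for S = 2 semi-sync acks, then P = S+1 = 3, and any leader-election quorum of L >= N-P+1 = 3 nodes preserves the L+P > N overlap.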

The failover process is the other side of the coin. We need to show that such a process honors the fundamentals of the leader election process (step 1). A tool like Orchestrator finds the most progressed replica, designates it as the master and points all the other replicas to it. Once all the replicas are repointed, the failover is complete. Comparing these steps with Paxos:

  • The act of finding the most progressed replica amounts to honoring the proposal(s) of the last leader.
  • The act of pointing a replica to the new master is equivalent to gaining a vote from that replica. If a sufficient number of replicas have been repointed to the new master, then the old master will never be able to get the necessary acks to succeed. This allows the new master to start accepting traffic.

In other words, Orchestrator just needs to reach L nodes to consider the failover process successful, where L=N-S.
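
Below is a toy sketch of that failover check. It is not Orchestrator code; the types are made up, and the integer position stands in for a real GTID comparison.

```go
package failover

import "errors"

// replicaState is a hypothetical view of a replica during failover.
type replicaState struct {
	name     string
	position int64 // stand-in for a GTID-based "how far has it progressed" comparison
}

// electNewMaster picks the most progressed reachable replica and succeeds only
// if at least L = N - S nodes can be repointed to it, so the old master can
// never again collect S acks.
func electNewMaster(reachable []replicaState, totalN, semiSyncS int) (string, error) {
	if len(reachable) == 0 {
		return "", errors.New("no reachable replicas")
	}
	// Honor the old leader's proposals: choose the most progressed replica.
	best := reachable[0]
	for _, r := range reachable[1:] {
		if r.position > best.position {
			best = r
		}
	}
	// Repointing a replica is the equivalent of gaining its vote.
	votes := len(reachable) // assume every reachable replica gets repointed
	if required := totalN - semiSyncS; votes < required {
		return "", errors.New("cannot reach L = N - S nodes; failover must not proceed")
	}
	return best.name, nil
}
```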


Performance considerations

The main concern with the proposed change is that the new scheme costs two round trips instead of one. Let’s look at how to optimize this:

  • The steps up to the point where the master gets the acks and commits don’t really change. So, there’s no added cost there. The new system’s commit performance remains unaffected.
  • After the commit is done, the confirmation needs to be sent. However, it's lightweight. For high-QPS systems, this message can actually be piggybacked onto the next transaction.
  • On the replica side, replicas are allowed to proactively replay the transaction. They just shouldn't commit until the confirmation is received.

The details


The intent of this blog is to only present the basic idea and its feasibility. A consensus algorithm requires additional details to be worked out. For example, it’s possible that replicas contain diverged uncommitted transactions. Fortunately, Paxos has rules about how to resolve these conflicts. We just need to apply them in the new context.

Thursday, August 25, 2016

2PC Without Prepare?

In case you didn't know, I also publish blogs on behalf of Vitess. I wanted to cross-reference my post on how to build Prepare functionality on top of a transactional system that does not inherently support it, and also a design doc on how we're going to make it work for Vitess.

Thursday, August 11, 2016

A More Flexible Paxos

With systems getting more and more distributed, the Paxos algorithm has been gaining popularity. However, a major drawback with today’s configurations is that you cannot run too many Paxos servers in a quorum. The sweet spot seems to be five servers. Those who run only three have to deal with the risk of downtime. Those who run seven or more have to face degraded write performance.

In short, it seems that you have to trade off between uptime and write performance. However, it is possible to have both if we redefine the quorum rules, and this blog intends to show how.


Disclaimer

I've learnt about Paxos relatively recently. So, I could have misunderstood some things. Feel free to correct me if there are flaws.
The proposed variation looks fairly straightforward to me. Strangely, I haven't seen it discussed elsewhere. Also, if the idea is noteworthy, a formal proof will need to be worked out.


Paxos in very few words

The original Paxos paper is hard to understand. The RAFT paper tries to make consensus algorithms more understandable. Here’s how I would paraphrase Paxos:

Paxos is a three-step process:

  1. Leader Election
  2. Proposal Agreement
  3. Proposal Finalization

Leader Election

A server becomes a leader through an election process. If the majority of voters agree, it considers itself to be the leader.


Proposal Agreement

The elected leader is responsible for getting agreement on a proposal. A follower will accept a proposal as long as it has not voted for a newer leader. If a majority of followers have agreed on a proposal, then it's final.


Proposal Finalization

Once majority is reached, the leader communicates the proposal as final to all members.


The honor code

For the above steps to work correctly, an important rule must be added: during the leader election process, the voters have to communicate any proposals they have previously accepted to the new leader. If there are any, the new leader's responsibility is to propagate the latest proposal it has seen from the voters. In this situation, the leader is not allowed to make an alternate proposal.


Why it works

There are two critical rules that make this algorithm work:

  1. The honor code gives us the assurance that a proposal will not be changed once it’s accepted by the majority, because any new server that subsequently becomes a leader is guaranteed to see at least one follower with the latest proposal. So, it will be forced to propagate it.
  2. If a new leader was elected before a proposal reached majority, the old leader will not succeed at getting its proposal through, because it will encounter at least one follower that has changed allegiance before majority is reached.

Multi-Paxos

In real life, Paxos implementations do not perform all three steps for every proposal. Instead, they go with a leader lease. Once a leader is elected, then no other servers attempt to become a leader for an agreed period of time. Additionally, every time a proposal is successfully accepted, the lease is implicitly renewed.

This means that a server typically remains a leader for long periods of time, sometimes weeks. Leader election is basically a low QPS event. On the other hand, the proposal agreements and finalizations are high QPS.

This fact will be used to change the rules in our favor.


Reinterpreting the rules

If we take a step back and re-evaluate the two ‘why it works’ rules, we can restate them more generally:

  1. If a new leader is chosen after a proposal was successful, then it has to see it, and must honor it.
  2. If a new leader is chosen before a proposal was successful, then the old leader must not succeed at that proposal.

We’re specifically refraining from defining how a leader gets chosen, and how a proposal is deemed successful. Requiring the leader and the proposal to get a majority vote as described by Paxos is a specific way of achieving this. But this can be generalized as follows:

  • If there are N servers in a quorum,
  • if the number of voters required for a leader is L,
  • if the number of voters required for a proposal is P,
  • then, as long as L+P > N, the two rules are still preserved.

A similar concept was previously used in Dynamo for a simpler read vs. write quorum rule: R+W > N.

For standard Paxos, L and P are set to N/2+1, which makes (N/2+1)*2 > N. Instead, we can choose N and P such that P <= N, and L can be computed as N-P+1.
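
A tiny helper makes the relationship concrete. The function name is mine, not from any Paxos library:

```go
package main

import "fmt"

// minLeaderQuorum returns the smallest leader-election quorum L that still
// overlaps every proposal quorum of size P in an N-node group (L + P > N).
func minLeaderQuorum(n, p int) (int, error) {
	if p < 1 || p > n {
		return 0, fmt.Errorf("P must satisfy 1 <= P <= N (got N=%d, P=%d)", n, p)
	}
	return n - p + 1, nil
}

func main() {
	l, _ := minLeaderQuorum(11, 3)
	fmt.Println(l) // 9, matching the example below
}
```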

An example:

  • N=11
  • L=9
  • P=3

In the above case, 9 voters have to agree for a server to be a leader, but only 3 voters are needed for a proposal to be final. This system will have the write performance of a 5-node Paxos, but the uptime of an 11-node cluster.

The formula works for even values of N also. For example, if you had a 5-node cluster that you wanted to deploy in three zones, then the zone that has only one node becomes ‘weaker’ than the other two. This can now be overcome by setting N=6 (two nodes per zone) and preserving P at 3.

High values of N give us high uptime, while low values of P give us high write performance. The two parameters become independently tunable, and we can now get the best of both worlds.


Caveat

If there is a total network partition, a standard Paxos system can still operate normally. But an additional node failure could cause the system to lock up.

With the new changes, a network partition could happen in such a way that a leader election may not be possible on any of the sides. Such a system will still make progress because the current leader will be able to reach the necessary number of acceptors. But if an additional failure triggers a leader election, the system will lock up.

So, in both cases, two simultaneous failures are needed for the system to lock up. Standard Paxos does have an additional flexibility: If a network partition is the only failure, it can still execute a successful leader change while the new system cannot.

If a double failure were to happen, a human operator will have to intervene. The corrective action will be to subset the quorum to re-satisfy L+P > Nnew, and the system will resume. Once the network issue is resolved, the blackholed servers can be added back to the quorum.
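
As an illustration with the earlier numbers: if N=11, L=9, P=3 and four servers are blackholed, the operator could shrink the quorum to Nnew=7 and lower L to 5 while keeping P=3, since 5+3 > 7.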

There are probably situations where the system itself can perform this subsetting without losing data integrity. Solutions need to be brainstormed.


Acknowledgements

I'd like to thank Anthony Yeh, Michael Berlin and Erez Louidor for reviewing this write-up.

Coincidentally, Howard, Malkhi and Spiegelman independently came up with the same idea, and have published a formal paper that also proves correctness and safety.