Tuesday, September 6, 2016

Distributed Durability in MySQL

This blog post proposes modifications to the MySQL semi-sync replication process in order to improve the overall consistency and resilience of the system.

This is based on my previous blog post on Flexible Paxos and the related paper Flexible Paxos: Quorum intersection revisited by Howard, Malkhi and Spiegelman.


Durability requirements have changed over the last few years. Traditional systems considered their data Durable if it was written to disk. However, this is not acceptable any more. In today’s world, data is considered durable only if it has been replicated to more than one machine.

MySQL is one of the few databases that has made an effort to satisfy this form of durability. It supported the ability to replicate data from a master to its replicas in near-realtime. But this ability did not address the failure mode where a master database experiences a network partition. If this happened, then commits would continue to succeed, but the data would fail to get replicated. An eventual loss of such a master would result in a data loss.

In order to address this problem, MySQL introduced semi-sync replication. This feature required that a transaction needed to be successfully sent to at least S (number of semi-sync acks) replicas before it was considered a success.

There are many versions of semi-sync. For this particular blog, we’re assuming the lossless version where the master waits for the acks before committing. Also, semi-sync needs to be configured with async fallback "disabled" (by setting the timeout to several decades, and enabling wait_no_slave).

This approach has a few advantages:

  • If the master database crashes, then we know that at least S other replicas have received all its committed transactions. So, one of them could be designated as the new master, and the system can resume without data loss.
  • If the master database encountered a network partition, none of its transactions would succeed. If this happened, we can safely failover to another replica that is still able to connect to the rest of the cluster. We can then abandon the old master because we know it could not have progressed ahead of the replicas.
  • The system can survive the failure of up to S nodes without loss of durability.

There are also a few drawbacks:

  • If a master is only able to replicate to less than S replicas, then it won’t be able to commit those transactions. In this case, the replicas themselves will be ahead of the master. Applications that read from those replicas will end up seeing future values (uncommitted reads).
  • Extending the above story, the eventual failure of such replicas could lead to loss of replicated data that may not be recoverable (ghost reads).
  • Semi-sync does not specify how the system should be repaired if faced with various failures. Without this specification, the story remains incomplete and open to the interpretations of the implementers, and they may not have thought through all possible scenarios. Tools like Orchestrator have tried to close this gap, but the solutions are not perfect.

Let’s do a quick analysis of the failure modes. They fall into two categories:

  1. Master failure: This is the easy part. We figure out which replica is most up-to-date, we make that the master and point all the other replicas to it. Note that this is a paraphrase. The actual workflow will have many steps and verifications.
  2. Network partition: If the master is unable to reach S nodes, then it will lock up. We then have to look for the most progressed replica on the other side of the partition and make that the master. However, the nodes that got left out might have already diverged. If so, they’ll have to be rebuilt from scratch before they can join the quorum.

The network partition issue discourages the use of high values of S. Most systems tend to run with S=1.

Although not directly stated, this is basically a distributed consensus problem.

The solution

To address the above failure modes, all we need to do is make semi-sync a two-step process:

  • Send the data like before to the replicas and wait for S acks. However, the replicas do not commit those transactions yet.
  • Once the necessary acks are received, send commit confirmations, which will inform the replicas that they can commit the data now.

If we work through the failure scenarios, it will be evident that this change will address the previously described problems. But then, there’s an easier way, which is described below.


The change essentially makes the algorithm match the formal steps required by a consensus protocol like Paxos. To be more precise, it follows the requirements of the new FPaxos generalization. Once we’re convinced that we comply, we can rely on the existing proofs and guarantees to gain confidence.

In my previous post about Flexible Paxos, I’ve listed three steps:

  1. Leader election
  2. Proposal Agreement
  3. Proposal Finalization

Semi-sync replication is a combination of steps 2 & 3. The only difference is that replicas that receive a proposal automatically treat it as final. With the new change, replicas have to wait till they receive a finalization message. This change, along with the master waiting for the required number of acks before sending the commit message, will make this Paxos compliant.

Traditional Paxos requires majority agreement for steps 1 and 2. However, the new FPaxos generalization shows that any kind of overlap between the two steps is sufficient. The L+P>N formula gives you that guarantee. In our case, P=S+1 (sem-sync replicas+master).

The failover process is the other side of the coin. We need to show that such a process honors the fundamentals of the leader election process (step 1). A tool like Orchestrator finds the most progressed replica, designates it as the master and points all the other replicas to it. Once all the replicas are repointed, the failover is complete. Comparing these steps with Paxos:

  • The act of finding the most progressed replica amounts to honoring the proposal(s) of the last leader.
  • The act of pointing a replica to the new master is equivalent to gaining a vote from that replica. If sufficient number of replicas have been repointed to the new master, then the old master will never be able to get the necessary acks to succeed. This allows the new master to start accepting traffic.

In other words, Orchestrator just needs to reach L nodes to consider the failover process successful, where L=N-S.

Performance considerations

The main concern with the proposed change is that the new scheme costs two round trips instead of one. Let’s look at how to optimize this:

  • The steps up to the point where the master gets the acks and commits don’t really change. So, there’s no added cost there. The new system’s commit performance remains unaffected.
  • After the commit is done, the confirmation needs to be sent. However, it’s lightweight. For high QPS systems, this message can actually be piggy-bagged with the next transaction.
  • On the replica side, they are allowed to proactively replay the transaction. They just shouldn’t commit until the confirmation is received.

The details

The intent of this blog is to only present the basic idea and its feasibility. A consensus algorithm requires additional details to be worked out. For example, it’s possible that replicas contain diverged uncommitted transactions. Fortunately, Paxos has rules about how to resolve these conflicts. We just need to apply them in the new context.

Thursday, August 25, 2016

2PC Without Prepare?

In case you didn't know, I also publish blogs on behalf of vitess. I wanted to cross-reference my post on how to build a Prepare functionality on top of a transactional system that does not inherently support it, and also a design doc on how we're going to make it work for vitess.

Thursday, August 11, 2016

A More Flexible Paxos

With systems getting more and more distributed, the Paxos algorithm has been gaining popularity. However, a major drawback with today’s configurations is that you cannot run too many Paxos servers in a quorum. The sweet spot seems to be five servers. Those who run only three have to deal with the risk of downtime. Those who run seven or more have to face degraded write performance.

In short, it seems that you have to trade-off between uptime and write performance. However, it is possible to have both if we redefined the quorum rules, and this blog intends to show how.


I've learnt about Paxos relatively recently. So, I could have misunderstood some things. Feel free to correct me if there are flaws.
The proposed variation looks fairly straightforward to me. Strangely, I haven't seen this discussed elsewhere. Also, if the idea is noteworthy, a formal proof will need to be worked on.

Paxos in very few words

The original Paxos paper is hard to understand. The RAFT paper tries to make consensus algorithms more understandable. Here’s how I would paraphrase Paxos:

Paxos is a three-step process:

  1. Leader Election
  2. Proposal Agreement
  3. Proposal Finalization

Leader Election

A server becomes a leader through an election process. If the majority of voters agree, it considers itself to be the leader.

Proposal Agreement

The elected leader is responsible for getting agreement on a proposal. A follower will accept a proposal as long as it’s not voted for a newer leader. If a majority of followers have agreed on a proposal, then it’s final.

Proposal Finalization

Once majority is reached, the leader communicates the proposal as final to all members.

The honor code

For the above steps to work correctly, an important rule must be added: During the leader election process, the voters have to communicate previous proposals to the new leader. If so, the new leader’s responsibility is to propagate the latest proposal it has seen from the voters. In this situation, the leader is not allowed to make an alternate proposal.

Why it works

There are two critical rules that make this algorithm work:

  1. The honor code gives us the assurance that a proposal will not be changed once it’s accepted by the majority, because any new server that subsequently becomes a leader is guaranteed to see at least one follower with the latest proposal. So, it will be forced to propagate it.
  2. If a new leader was elected before a proposal reached majority, the old leader will not succeed at getting its proposal through, because it will encounter at least one follower that has changed allegiance before majority is reached.


In real life, Paxos implementations do not perform all three steps for every proposal. Instead, they go with a leader lease. Once a leader is elected, then no other servers attempt to become a leader for an agreed period of time. Additionally, every time a proposal is successfully accepted, the lease is implicitly renewed.

This means that a server typically remains a leader for long periods of time, sometimes weeks. Leader election is basically a low QPS event. On the other hand, the proposal agreements and finalizations are high QPS.

This fact will be used to change the rules in our favor.

Reinterpreting the rules

If we took a step back and re-evaluated the two ‘why it works’ rules, we can restate them more generally:

  1. If a new leader is chosen after a proposal was successful, then it has to see it, and must honor it.
  2. If a new leader is chosen before a proposal was successful, then the old leader must not succeed at that proposal.

We’re specifically refraining from defining how a leader gets chosen, and how a proposal is deemed successful. Requiring the leader and the proposal to get a majority vote as described by Paxos is a specific way of achieving this. But this can be generalized as follows:

  • If there are N servers in a quorum,
  • if the number of voters required for a leader is L,
  • if the number of voters required for a proposal is P,
  • then, as long as L+P > N, the two rules are still preserved.

A similar concept was previously used in Dynamo for a simpler read vs. write quorum as R+W >N.

For standard Paxos, L and P are set to N/2+1, which makes (N/2+1)*2 > N. Instead, we can choose N and P such that P <= N, and L can be computed as N-P+1.

An example:

  • N=11
  • L=9
  • P=3

In the above case, 9 voters have to agree for a server to be a leader, but only 3 voters are needed for a proposal to be final. This system will have the performance of a 5-node paxos, but the uptime of 11 nodes.

The formula works for even values of N also. For example, if you had a 5-node cluster that you wanted to deploy in three zones, then the zone that has only one node becomes ‘weaker’ than the other two. This can now be overcome by setting N=6 (two nodes per zone) and preserving P at 3.

High values of N give us high uptime, while low values of P give us high write performance. The two parameters become independently tunable, and we can now get the best of both worlds.


If there is a total network partition, a standard Paxos system can still operate normally. But an additional node failure could cause the system to lock up.

With the new changes, a network partition could happen in such a way that a leader election may not be possible on any of the sides. Such a system will still make progress because the current leader will be able to reach the necessary number of acceptors. But if an additional failure triggers a leader election, the system will lock up.

So, in both cases, two simultaneous failures are needed for the system to lock up. Standard Paxos does have an additional flexibility: If a network partition is the only failure, it can still execute a successful leader change while the new system cannot.

If a double-failure were to happen, a human operator will have to intervene. The corrective action will be to subset the quorum to re-satisfy L+P>Nnew, and the system will resume. Once the network issue is resolved, the blackholed servers can be added back to the quorum.

There are probably situations where the system itself can perform this subsetting without losing data integrity. Solutions need to be brainstormed.


I'd like to thank Anthony Yeh, Michael Berlin and Erez Louidor for reviewing this write-up.

Coincidentally, Howard, Malkhi and Spiegelman independently came up with the same idea, and have published a formal paper that also proves correctness and safety.