Summary:
Symptoms:
Cause:
Consider a write at CL=1 with RF=3, going to replicas A, B, and C. We begin bootstrapping node D, and write a row K to the range being moved from C to D.
If the cluster is heavily loaded, it's possible that we write one copy to C and all the other writes get dropped. Once bootstrap completes, C no longer owns the range and D never received the row, so we lose it. Conversely, if the only copy lands on D and bootstrap is then cancelled, we again lose the row.
As said above, we want to satisfy CL for both the pre- and post-bootstrap nodes (in case bootstrap aborts). This requires treating the old/new range owner as a unit: both D and C need to accept the write for it to count towards CL. So rather than considering
{A, B, C, D}
we should consider
{A, B, (C, D)}
This is a lot of complexity to introduce. A simplification that preserves correctness is to continue treating nodes independently but require one more node than normal CL. So CL=1 would actually require 2 nodes; CL=Q would require 3 (for RF=3), and so forth. (Note that Q(3) + 1 is the same as Q(4), which is what the existing code computes; that is one reason I chose a CL=1 example to start with, since those are not the same even for the simple case of RF=3.)
This would mean we may fail a few writes unnecessarily (a write to A or B is actually sufficient to satisfy CL=1, but this scheme would time that out) but never allow a write to succeed that would leave CL unsatisfied post-bootstrap (or if bootstrap is cancelled).
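The trade-off described above can be checked by brute force for the A, B, C, D example. The following is a minimal sketch, not Cassandra's implementation; the function names (`safe`, `simplified`, `check`) are hypothetical and exist only for this illustration. It enumerates every possible set of acknowledging nodes and compares the "one more node than normal CL" rule against the strict requirement that CL holds for both the old and the new replica set.

```python
from itertools import combinations

OLD_REPLICAS = {"A", "B", "C"}          # range owners before bootstrap
NEW_REPLICAS = {"A", "B", "D"}          # range owners once D finishes bootstrapping
ALL_TARGETS = OLD_REPLICAS | NEW_REPLICAS
PENDING = NEW_REPLICAS - OLD_REPLICAS   # {"D"}

def quorum(rf):
    """Number of replicas in a quorum for replication factor rf."""
    return rf // 2 + 1

def safe(acked, cl):
    """True if CL holds whether bootstrap completes or is cancelled."""
    return len(acked & OLD_REPLICAS) >= cl and len(acked & NEW_REPLICAS) >= cl

def simplified(acked, cl):
    """Treat nodes independently, but require one extra ack per pending node."""
    return len(acked) >= cl + len(PENDING)

def all_ack_sets():
    """Yield every subset of the write targets, smallest first."""
    targets = sorted(ALL_TARGETS)
    for n in range(len(targets) + 1):
        for combo in combinations(targets, n):
            yield set(combo)

def check(cl):
    """Return (writes wrongly accepted, writes needlessly rejected)."""
    unsafe_accepts = [a for a in all_ack_sets() if simplified(a, cl) and not safe(a, cl)]
    needless_rejects = [a for a in all_ack_sets() if safe(a, cl) and not simplified(a, cl)]
    return unsafe_accepts, needless_rejects

for name, cl in (("ONE", 1), ("QUORUM", quorum(3))):
    unsafe, needless = check(cl)
    print(name, "unsafe accepts:", unsafe, "needless rejects:", needless)
```

Under these assumptions the simplified rule never accepts an unsafe write. The only writes it rejects unnecessarily are those acknowledged by A alone or B alone at CL=1, and by A and B together at CL=Q.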
++++++++++++++++++++++++++++++++++++++++++++++++++++
NOTE (1): If you use a LOCAL_* CL instead of CL=ONE (as in the example above), only local replicas are required to acknowledge the write. So if a node is currently bootstrapping or decommissioning in the local DC that is being written to, you will need acknowledgement from both the old and the new replica. If the bootstrapping or decommissioning node is in a remote DC, you will not. If you observe that Cassandra is actually requiring an ack from a remote replica even though you have specified a local CL, you might be hitting CASSANDRA-8058, a regression fixed in Cassandra versions 2.0.11+ and 2.1.1+.
This diagram also attempts to clarify this
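The LOCAL_* rule in NOTE (1) can be sketched the same way. This is an illustration under stated assumptions, not Cassandra's code: `local_block_for`, the node-to-DC map, and the DC names `dc1` and `dc2` are all hypothetical. It assumes the "one extra ack per pending node" scheme from above and counts only the pending nodes that live in the local DC.

```python
def local_block_for(base_acks, pending_nodes, node_dc, local_dc):
    """Acks required for a LOCAL_* write: the normal count for the CL,
    plus one for each pending (bootstrapping/decommissioning) node
    that sits in the local DC. Remote pending nodes are ignored."""
    local_pending = [n for n in pending_nodes if node_dc[n] == local_dc]
    return base_acks + len(local_pending)

# Hypothetical topology: D is joining in dc1, E is joining in dc2.
node_dc = {"D": "dc1", "E": "dc2"}

# Written from dc1 with one base ack required (e.g. LOCAL_ONE):
print(local_block_for(1, ["D"], node_dc, "dc1"))  # pending node is local
print(local_block_for(1, ["E"], node_dc, "dc1"))  # pending node is remote
```

With the pending node in the local DC the write needs one extra acknowledgement. With the pending node in the remote DC the requirement is unchanged, which is the behaviour expected once CASSANDRA-8058 is fixed.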