Users often ask whether running the nodetool repair -pr command on one data center would repair the full data set in the local data center. This is typically followed by a question about how to find out, ahead of time, which token ranges nodetool repair -pr would repair. This article explains the internals of nodetool repair -pr.
The assignment of the primary token ranges
Adding the -pr flag to the nodetool repair command tells Cassandra to repair the primary token ranges of the node.
What are the primary token ranges? A primary range is simply a token range for which a node is the first replica in the ring. Each primary range ends at one of the tokens that was assigned to the node during the bootstrap process.
There are three ways to find the primary token ranges:
- Query the system.size_estimates table on each node. This system table is available on both Cassandra OSS and DSE clusters; the DataStax OpsCenter repair service also uses it to retrieve the token ranges for subrange repairs. A scripted version of this query is sketched after this list.
cqlsh> select * from system.size_estimates where keyspace_name = 'demodb' and table_name = 'test';
keyspace_name | table_name | range_start | range_end | mean_partition_size | partitions_count
---------------+------------+---------------------+----------------------+---------------------+------------------
demodb | test | -9223372036854775798 | -3074457345618258603 | 0 | 0
(1 rows)
- Sort the output of nodetool ring regardless of the data centers, and recreate the token ranges from the output. Please refer to the examples below.
- The repaired token ranges will be printed in the output of the nodetool repair <options> command. The output can be used to verify the token ranges identified by other methods.
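Below is a minimal sketch, assuming the DataStax Python driver (cassandra-driver), of how the first method can be scripted. Because system.size_estimates is node-local, the query has to be repeated against every node; the whitelist policy shown here pins the connection to a single host. The host, keyspace, and table names are the ones used in the examples in this article.

from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy

host = "10.101.32.114"  # the node whose primary ranges we want
cluster = Cluster(
    contact_points=[host],
    # pin all queries to this host, since system tables are node-local
    load_balancing_policy=WhiteListRoundRobinPolicy([host]),
)
session = cluster.connect()

rows = session.execute(
    "SELECT range_start, range_end FROM system.size_estimates "
    "WHERE keyspace_name = 'demodb' AND table_name = 'test'"
)
for row in rows:
    # range_start and range_end are stored as text in this system table
    print(f"({row.range_start}, {row.range_end}]")

cluster.shutdown()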
For simplicity, let's first see how the primary token ranges are determined in a single-token cluster. The following example is a cluster with two data centers, each containing three nodes.
automaton@ip-10-101-32-114:~$ nodetool ring
Datacenter: DC1
==========
Address        Rack   Status  State   Load        Owns  Token
                                                        3074457345618258602
10.101.32.114  rack1  Up      Normal  107.6 KiB   ?     -9223372036854775808
10.101.34.15   rack1  Up      Normal  91.83 KiB   ?     -3074457345618258603
10.101.34.47   rack1  Up      Normal  102.61 KiB  ?     3074457345618258602

Datacenter: DC2
==========
Address        Rack   Status  State   Load        Owns  Token
                                                        3074457345618258612
10.101.33.33   rack1  Up      Normal  91.82 KiB   ?     -9223372036854775798
10.101.33.199  rack1  Up      Normal  91.8 KiB    ?     -3074457345618258593
10.101.33.192  rack1  Up      Normal  102.61 KiB  ?     3074457345618258612
By querying the system.size_estimates table on each node, we find its primary token range:
{"10.101.32.114": "(3074457345618258612, -9223372036854775808]"}
{"10.101.34.15": "(-9223372036854775798, -3074457345618258603]"}
{"10.101.34.47": "(-3074457345618258593, 3074457345618258602]"}
{"10.101.33.33": "(-9223372036854775808, -9223372036854775798]"}
{"10.101.33.199": "(-3074457345618258603, -3074457345618258593]"}
{"10.101.33.192": "(3074457345618258602, 3074457345618258612]"}
Next, we sort the initial tokens from the nodetool ring output, regardless of data center:
10.101.32.114 -9223372036854775808
10.101.33.33 -9223372036854775798
10.101.34.15 -3074457345618258603
10.101.33.199 -3074457345618258593
10.101.34.47 3074457345618258602
10.101.33.192 3074457345618258612
Now we can see how the primary token ranges are formed: the end token of each range is the initial token of the local node, and the start token is the token of the preceding node in the token-sorted list. For example, for node 10.101.34.15 the end token is -3074457345618258603, the token assigned to that node, and the start token is -9223372036854775798, the token of 10.101.33.33, its predecessor in the sorted output.
Note: 10.101.33.33 happens to be from a remote data center in this case.
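The pairing rule above is easy to script. Here is a minimal Python sketch using the token assignments from this example; the Murmur3 ring wraps around, so the predecessor of the first token in the sorted list is the last token.

# token -> node, taken from the nodetool ring output above
tokens = {
    -9223372036854775808: "10.101.32.114",
    -9223372036854775798: "10.101.33.33",
    -3074457345618258603: "10.101.34.15",
    -3074457345618258593: "10.101.33.199",
    3074457345618258602:  "10.101.34.47",
    3074457345618258612:  "10.101.33.192",
}

sorted_tokens = sorted(tokens)
for i, end in enumerate(sorted_tokens):
    start = sorted_tokens[i - 1]  # wraps to the last token when i == 0
    print(f"{tokens[end]}: ({start}, {end}]")

The output reproduces the per-node mapping we retrieved from system.size_estimates above.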
When we run repair -pr on 10.101.34.15 to verify this, the repair output confirms the primary token range:
automaton@ip-10-101-34-15:~$ nodetool repair -pr demodb test
[2020-05-05 16:46:56,932] Starting repair command #2 (0833f720-8ef0-11ea-b408-cdadc58ca007), repairing keyspace demodb with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 1, ColumnFamilies: [test], dataCenters: [], hosts: [], runAntiCompaction: false, # of ranges: 1, pull repair: false)
[2020-05-05 16:46:56,958] Repair session 08370460-8ef0-11ea-b408-cdadc58ca007 for range [(-9223372036854775798,-3074457345618258603]] finished (progress: 100%)
With Virtual Nodes (vnodes) enabled, the primary token range assignment works the same way. The only difference is that each node holds multiple tokens, which results in more, but smaller, primary ranges distributed throughout the cluster.
For example,
cqlsh> select * from system.size_estimates where keyspace_name = 'demodb' and table_name = 'test';
keyspace_name | table_name | range_start | range_end | mean_partition_size | partitions_count
---------------+------------+----------------------+----------------------+---------------------+------------------
demodb | test | -1166339920912110788 | -689126574447853997 | 71 | 1
demodb | test | -2305282840026761736 | -2133161161366689222 | 71 | 1
demodb | test | -2528205690402994935 | -2305282840026761736 | 71 | 1
demodb | test | -4288621780242584267 | -4247223339848916897 | 71 | 1
demodb | test | -6798969080905713247 | -6712864315351077558 | 71 | 1
demodb | test | 1734825422640040056 | 1876849918621508530 | 71 | 1
demodb | test | 1876849918621508530 | 2376155006289856525 | 71 | 1
demodb | test | 7969461472240666403 | 8045067393681522426 | 71 | 1
From the sorted output of nodetool ring, we confirm the start/end tokens for the primary token ranges on the local node, 10.101.34.2:
10.101.33.51 -6798969080905713247
10.101.34.2 -6712864315351077558
...
10.101.33.157 -4288621780242584267
10.101.34.2 -4247223339848916897
...
10.101.33.51 -2528205690402994935
10.101.34.2 -2305282840026761736
10.101.34.2 -2133161161366689222
10.101.33.51 -1166339920912110788
10.101.34.2 -689126574447853997
...
10.101.33.157 1734825422640040056
10.101.34.2 1876849918621508530
10.101.34.2 2376155006289856525
...
10.101.33.157 7969461472240666403
10.101.34.2 8045067393681522426
From this example, we can see that vnodes are slightly more complicated in that multiple primary token ranges exist on a single node, but the concept is still the same.
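The same pairing rule handles vnodes naturally, since a node simply appears in the sorted list once per token. The sketch below groups the resulting ranges per node, using a contiguous slice of the sorted vnode ring shown above; the wrap-around pair is skipped because this is only an excerpt of the full ring.

# token -> node, a contiguous slice of the sorted vnode ring above
tokens = {
    -2528205690402994935: "10.101.33.51",
    -2305282840026761736: "10.101.34.2",
    -2133161161366689222: "10.101.34.2",
    -1166339920912110788: "10.101.33.51",
    -689126574447853997:  "10.101.34.2",
}

sorted_tokens = sorted(tokens)
primary_ranges = {}
for start, end in zip(sorted_tokens, sorted_tokens[1:]):
    primary_ranges.setdefault(tokens[end], []).append((start, end))

# 10.101.34.2 ends up with three of the four ranges in this slice,
# matching the size_estimates rows shown earlier
print(primary_ranges["10.101.34.2"])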
Why do we need to run nodetool repair -pr on all nodes in all data centers?
From the assignment of the primary token ranges, we can see that every node in the cluster, regardless of its data center, influences which ranges are primary where. If nodetool repair -pr is run on a single node, or only on the nodes of a single data center, some primary ranges are never repaired, so the goal of repairing the entire cluster cannot be reached. The command must be run on every node in every data center.
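Continuing the single-token example, a quick sketch makes this concrete: if -pr were run only on DC1's nodes, the three primary ranges that end at DC2 tokens would never be repaired, even though DC1 holds replicas of that data.

# token -> node, the same single-token ring as before
tokens = {
    -9223372036854775808: "10.101.32.114",
    -9223372036854775798: "10.101.33.33",
    -3074457345618258603: "10.101.34.15",
    -3074457345618258593: "10.101.33.199",
    3074457345618258602:  "10.101.34.47",
    3074457345618258612:  "10.101.33.192",
}
dc1 = {"10.101.32.114", "10.101.34.15", "10.101.34.47"}

sorted_tokens = sorted(tokens)
missed = [
    (sorted_tokens[i - 1], end)
    for i, end in enumerate(sorted_tokens)
    if tokens[end] not in dc1  # primary ranges of DC2 nodes
]
print(missed)  # three ranges left unrepaired if -pr only runs in DC1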
Related article: https://www.datastax.com/blog/2014/07/repair-cassandra