Summary
This technical note addresses troubleshooting of false positives of the following error found in the OpsCenter logs:
Based on current repair throughput, it appears that the Repair Service will not complete within the specified repair window
Applies to
- DataStax OpsCenter 6.7 (potentially all versions)
Symptoms
Reviewing the current opscenterd.log or historical ones (opscenterd.1.log, …), the following error is found in the logs on a regular basis:
2020-05-31 03:21:46,983 [cluster] ERROR: Based on current repair throughput, it appears that the Repair Service will not complete within the specified repair window.
Tasks remaining to repair: 674.63 GB. Repair must complete by Tue Jun 09 03:11:23 UTC 2020, which requires throughput of: 910.4 KB/s; however, the actual throughput is: 71.6 KB/s. Estimated completion is Tue Sep 22 12:42:13 UTC 2020. For more information on tuning the repair service, see http://www.datastax.com/documentation/opscenter/help/repair_services_advanced.html (MainThread)
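To gauge how frequently the error fires across the current and rotated logs, a simple count helps (a minimal sketch using standard tools):

grep -c 'Repair Service will not complete within the specified repair window' opscenterd*.log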
Cause
This error can appear during a warm-up period of 30 minutes or more following any pause or (re)start of the Repair Service, such as at the start of a new repair cycle, after a schema change occurs, or after a slowdown of repair, and can therefore be a false positive.
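For context, the figures in the message follow from simple arithmetic: the required throughput is the remaining bytes divided by the time left until the deadline, and the estimated completion extrapolates the remaining bytes over the actual throughput. Below is a minimal sketch reproducing the numbers from the Symptoms excerpt, assuming OpsCenter reports binary units (1 GB = 2^30 bytes, 1 KB = 1024 bytes):

awk 'BEGIN {
  remaining = 674.63 * 2^30   # tasks remaining to repair, in bytes
  window    = 776977          # seconds from 2020-05-31 03:21:46 to the 2020-06-09 03:11:23 deadline
  actual    = 71.6 * 1024     # actual throughput, in bytes/s
  printf "required throughput: %.1f KB/s\n", remaining / window / 1024   # ~910 KB/s
  printf "estimated days left: %.1f\n", remaining / actual / 86400       # ~114 days, i.e. late September
}'

During the warm-up period the measured actual throughput is artificially low, which inflates the estimated completion date and triggers the error.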
Analysis
The following script allows you to:
- Find whether the Repair Service finished a cycle and is starting a new one, by looking for the following string:
Rolling subrange repair success, starting a new run
- Identify whether a schema change or other operations occurred around repair, which lead to a pause of repair, by searching for the string:
Repair Service
- Parse the error message to trim and show only the relevant information (sed command), as such:
remain: 31.85 GB tocomplete: Wed Jul 01 14:39:34 CEST 2020 throughputreq: 3.51 MB/s throughputactual: 522.9 KB/s estimate: Thu Jul 02 05:49:08 CEST 2020
cd to the OpsCenter log folder prior to running the command below:
for i in $(find . -name 'opscenterd.*log' | sort -V -r); do
  echo "++++ $i ++++"
  # Keep the repair-window errors, cycle-start markers, and Repair Service events,
  # then trim each error message down to the relevant figures
  grep -E 'Based on current repair|Rolling subrange repair success, starting a new run|Repair Service' "$i" | \
  sed -E 's_(2020.*ERROR):.* repair: ([0-9\.]{1,} [TGKM]B).*complete by (.*2020).*requires throughput of: ([0-9.]{1,} [TGKM]B\/s).*actual throughput is: ([0-9.]{1,} [TGKMb].*B?\/s).*Estimated completion is (.*20[0-9]{2}).*_\1\t remain: \2\t tocomplete: \3\t throughputreq: \4\t throughputactual: \5\t estimate: \6_g'
done | less
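Note that the sed expression assumes timestamps from 2020; when analyzing logs from another year, adjust the year patterns accordingly. The trimmed ERROR lines make it easy to spot cases where throughputactual starts low right after a cycle boundary or a pause and then recovers.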
Based on this output, we can identify the scenarios where a schema change or a new cycle led to a false positive. Note the timestamp of the initial event and the ensuing error.
Scenarios
1- End of a Repair cycle and start of a new one
The error shows during the first 30 minutes or more following the start of a new cycle:
2020-06-18 06:44:08,027 [cluster] INFO: Rolling subrange repair success, starting a new run (MainThread)
2020-06-18 07:44:42,287 [cluster] ERROR remain: 105.22 GB tocomplete: Sat Jun 27 06:49:07 UTC 2020 throughputreq: 142.5 KB/s throughputactual: 9.5 KB/s estimate: Fri Oct 30 08:01:00 UTC 2020
2020-06-19 08:10:48,771 [cluster] ERROR remain: 79.99 GB tocomplete: Sat Jun 27 06:49:07 UTC 2020 throughputreq: 122.2 KB/s throughputactual: 23.5 KB/s estimate: Thu Jul 30 14:54:31 UTC 2020
2020-06-20 08:11:05,127 [cluster] ERROR remain: 43.01 GB tocomplete: Sat Jun 27 06:49:07 UTC 2020 throughputreq: 75.2 KB/s throughputactual: 43.6 KB/s estimate: Thu Jul 02 07:20:40 UTC 2020
2020-06-20 23:59:16,846 [cluster] INFO: Rolling subrange repair success, starting a new run (MainThread)
The error repeats after every repair cycle (a low gc_grace value leads to multiple repair runs in a day; see the check after the excerpt below):
++++ ./opscenterd.log ++++
2020-07-02 10:26:46,480 [cluster] INFO: Rolling subrange repair success, starting a new run (MainThread)
2020-07-02 10:38:13,926 [cluster] ERROR remain: 1.24 GB tocomplete: Thu Jul 02 15:08:46 CEST 2020 throughputreq: 143.5 KB/s throughputactual: 33.2 KB/s estimate: Thu Jul 02 23:28:30 CEST 2020
2020-07-02 11:12:35,688 [cluster] INFO: Rolling subrange repair success, starting a new run (MainThread)
2020-07-02 11:23:55,749 [cluster] ERROR remain: 1.33 GB tocomplete: Thu Jul 02 15:54:36 CEST 2020 throughputreq: 153.7 KB/s throughputactual: 54.3 KB/s estimate: Thu Jul 02 20:30:40 CEST 2020
2020-07-02 12:20:35,824 [cluster] INFO: Rolling subrange repair success, starting a new run (MainThread)
2020-07-02 12:23:12,117 [cluster] ERROR remain: 32.64 GB tocomplete: Thu Jul 02 17:02:36 CEST 2020 throughputreq: 3.49 MB/s throughputactual: 1.07 MB/s estimate: Thu Jul 02 23:05:54 CEST 2020
2020-07-02 13:03:44,105 [cluster] INFO: Rolling subrange repair success, starting a new run (MainThread)
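To confirm this case, the gc_grace_seconds of the tables in scope can be inspected directly, since low values force the Repair Service to cycle more often. A sketch, assuming Cassandra/DSE 3.0+ where table metadata lives in system_schema.tables:

# Look for tables whose gc_grace_seconds is well below the default of 864000 (10 days)
cqlsh -e "SELECT keyspace_name, table_name, gc_grace_seconds FROM system_schema.tables;"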
2- Schema change leads to a pause of repair and the error follows (full cycle over 7 days):
++++ ./opsclogs/var/log/opscenter/opscenterd.5.log ++++
2020-04-23 18:13:50,176 [cluster] INFO: Rolling subrange repair success, starting a new run (MainThread)
2020-04-23 18:36:08,101 [cluster] ERROR remain: 981.52 GB tocomplete: Sat May 02 18:17:27 UTC 2020 throughputreq: 1.29 MB/s throughputactual: 359.5 KB/s estimate: Tue May 26 21:48:40 UTC 2020
2020-04-23 23:43:45,335 [cluster] INFO: Detected a keyspace changed schema change. The Repair Service will pause for 5 minutes then the Repair Service will activate again. This period of time is configurable with [repair_service].restart_period. (MainThread)
2020-04-23 23:56:01,217 [cluster] ERROR remain: 1.3 TB tocomplete: Sat May 02 23:52:12 UTC 2020 throughputreq: 1.76 MB/s throughputactual: 419.2 KB/s estimate: Mon Jun 01 13:43:35 UTC 2020
2020-04-24 23:56:06,993 [cluster] ERROR remain: 559.09 GB tocomplete: Sat May 02 23:52:12 UTC 2020 throughputreq: 848.4 KB/s throughputactual: 22.6 KB/s estimate: Thu Feb 18 19:12:42 UTC 2021
2020-04-26 01:23:57,832 [cluster] ERROR remain: 202.95 GB tocomplete: Sat May 02 23:52:12 UTC 2020 throughputreq: 355.1 KB/s throughputactual: 1.3 KB/s estimate: Sun Sep 07 16:56:18 UTC 2025
2020-04-29 09:33:23,899 [cluster] INFO: Rolling subrange repair success, starting a new run (MainThread)
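The INFO message above shows the default 5-minute pause after a schema change and names the knob that controls it, [repair_service].restart_period. If schema changes are frequent, the pause can be adjusted in the cluster-specific configuration file; a hedged example follows (the file path and the unit of seconds are assumptions, check your installation and the OpsCenter documentation):

# Example excerpt from the cluster configuration file,
# e.g. /etc/opscenter/clusters/<cluster_name>.conf (path is an assumption)
[repair_service]
restart_period = 300   # pause after a schema change; assumed to be in seconds (default 5 minutes)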
Solution
The scenarios above indicate false positives of the error, which can be ignored.
In case of doubt, or if the OpsCenter Repair Service fails to complete a full 100% cycle, tuning of the Repair Service may be necessary. Please contact DataStax Support if need be.