Summary
In the OpsCenter console, the Repair Service is shown as FAILED.
However, manual repairs complete successfully.
Symptoms
An error like the following appears in repair_service.log:
Repair service needs to run 2 range repairs in parallel; maximum is 1. The repair cannot complete without adversely affecting the cluster. Data left: 862620416738.87, time left: 763080.00, required throughput: 1130445.58, actual throughput: 1087764.45. More information on tuning the repair service can be found here:http://www.datastax.com/documentation/opscenter/help/repair_services_advanced.html
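The figures in this message are consistent with a simple calculation, sketched below. This is an illustration only, not OpsCenter's actual code: it assumes that the required throughput is the data left divided by the time left, that the number of parallel repairs is the required throughput divided by the actual throughput and rounded up, and that the units are bytes, seconds, and bytes per second. The variable names are hypothetical.

```python
import math

# Figures copied from the log message above. Units are assumed to be
# bytes, seconds, and bytes per second; the log does not state them.
data_left = 862620416738.87
time_left = 763080.00
actual_throughput = 1087764.45

# Assumption: required throughput is simply data left / time left.
required_throughput = data_left / time_left

# Assumption: the number of parallel range repairs is the ratio of
# required to actual throughput, rounded up to a whole number.
parallel_needed = math.ceil(required_throughput / actual_throughput)

print(f"required throughput: {required_throughput:.2f}")
print(f"parallel repairs needed: {parallel_needed}")
```

Under these assumptions the required throughput only slightly exceeds the actual throughput, but that is enough to push the estimate to 2 parallel repairs, which is above the configured maximum of 1.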
Cause
By default, a single repair has a predefined timeout of 3600 seconds.
Whenever a repair segment takes longer than 1 hour, the OpsCenter Repair Service marks it as failed.
However, the system log may show that the repair session is still progressing normally.
Workaround
Set the following parameters in the [repair_service] section of /etc/opscenter/opscenterd.conf.
These values allow a 4-hour window (14400 seconds) for each single repair and a maximum of 4 concurrent repair sessions:
[repair_service]
max_parallel_repairs = 4
single_repair_timeout = 14400
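If the change has to be applied on several OpsCenter hosts, it can be scripted. The following is a minimal sketch using Python's standard configparser module, assuming opscenterd.conf parses as a plain INI file. The function name apply_repair_settings and the sample [webserver] content are hypothetical and used only for illustration. Note that configparser discards comments when it writes the file back, so keep a backup of the original.

```python
import configparser
import io


def apply_repair_settings(conf_text, max_parallel=4, timeout_secs=14400):
    """Return conf_text with the [repair_service] workaround values set.

    Hypothetical helper: assumes the file is plain INI. Comments in the
    input are not preserved in the output.
    """
    # interpolation=None keeps any '%' characters in existing values intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("repair_service"):
        parser.add_section("repair_service")
    parser.set("repair_service", "max_parallel_repairs", str(max_parallel))
    parser.set("repair_service", "single_repair_timeout", str(timeout_secs))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Illustrative input only; a real opscenterd.conf will contain more sections.
example = "[webserver]\nport = 8888\n"
updated = apply_repair_settings(example)
print(updated)
```

To use it against the real file, read /etc/opscenter/opscenterd.conf into a string, pass it to the function, and write the result back after reviewing it.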
Solution
Support has requested a formal review of this mechanism. In the meantime, the workaround above should restore a functional Repair Service. The default values may not work on clusters where I/O load and other factors prevent a repair session from completing within an hour.