The Repair Service was observed to fail after upgrading to OpsCenter 6.0.0 through 6.0.2.
Failure messages appeared in the OpsCenter UI, and repair_service.log may show errors such as the following:
2016-08-11 06:01:44,723 [MyCluster] ERROR: Error running task: [RepairTask 0xc7bd3: [Node 10.1.2.3: listen_address 10.1.2.3, token -1001410347016415507, num_tokens 256], repair range: [-1572525048168036963, -1553460003985855589]]
Traceback (most recent call last):
  File "/usr/share/opscenter/lib/py/twisted/internet/defer.py", line 1122, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/share/opscenter/lib/py/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/share/opscenter/jython/Lib/site-packages/opscenterd/cluster/Repair.py", line 362, in _doRepair
DefaultException: Check log for further details (MainThread)
2016-08-11 06:01:44,724 [MyCluster] ERROR: [RepairTask 0xc7bd3: [Node 10.1.2.3: listen_address 10.1.2.3, token -1001410347016415507, num_tokens 256], repair range: [-1572525048168036963, -1553460003985855589]] has failed 1 times. (MainThread)
2016-08-11 06:01:44,724 [MyCluster] ERROR: 52 errors have occurred out of 100 allowed. (MainThread)
2016-08-11 06:01:44,724 [MyCluster] INFO: Adding repair task to end of queue (MainThread)
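To see how close the Repair Service is to exhausting its error allowance, the "errors have occurred out of ... allowed" lines can be pulled out of repair_service.log. The following is a minimal Python sketch, not an OpsCenter tool: the function name latest_error_budget is hypothetical, and the pattern assumes the log line format matches the excerpt above.

```python
import re

# Assumed line format, taken from the log excerpt above:
#   "... ERROR: 52 errors have occurred out of 100 allowed. (MainThread)"
ERROR_BUDGET = re.compile(r"ERROR: (\d+) errors have occurred out of (\d+) allowed")


def latest_error_budget(lines):
    """Return (errors, allowed) from the last matching line, or None if absent."""
    latest = None
    for line in lines:
        match = ERROR_BUDGET.search(line)
        if match:
            latest = (int(match.group(1)), int(match.group(2)))
    return latest


# Sample lines copied from the excerpt above.
sample = [
    "2016-08-11 06:01:44,724 [MyCluster] ERROR: 52 errors have occurred out of 100 allowed. (MainThread)",
    "2016-08-11 06:01:44,724 [MyCluster] INFO: Adding repair task to end of queue (MainThread)",
]
print(latest_error_budget(sample))
```

To check a real log, pass an open file handle for repair_service.log in place of sample; the location of that file depends on the installation.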
agent.log on the node may show the following error:
ERROR [ClientNotifForwarder-3] 2016-08-29 13:51:53,020 Repair messages lost via Cassandra JMX, please check cassandra logs for status. #<JMXConnectionNotification javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi.RMIConnector: jmxServiceURL=service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi][type=jmx.remote.connection.notifs.lost][message=May have lost up to 461 notifications]>
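To confirm whether a node is affected, agent.log can be scanned for the lost-notification message shown above. The following is a minimal Python sketch under the assumption that the message text matches that sample line; the function name lost_notification_counts is hypothetical and not part of OpsCenter or the agent.

```python
import re

# Assumed message format, taken from the agent.log sample above.
LOST_NOTIFICATIONS = re.compile(
    r"Repair messages lost via Cassandra JMX.*?"
    r"May have lost up to (\d+) notifications"
)


def lost_notification_counts(lines):
    """Return the 'may have lost up to N' count from every matching line."""
    counts = []
    for line in lines:
        match = LOST_NOTIFICATIONS.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


# Sample line copied from the agent.log excerpt above.
sample = [
    "ERROR [ClientNotifForwarder-3] 2016-08-29 13:51:53,020 Repair messages lost via "
    "Cassandra JMX, please check cassandra logs for status. #<JMXConnectionNotification "
    "javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi."
    "RMIConnector: jmxServiceURL=service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi]"
    "[type=jmx.remote.connection.notifs.lost][message=May have lost up to 461 notifications]>",
]
print(lost_notification_counts(sample))
```

A non-empty result alongside repair failures in repair_service.log suggests the failures may have been reported because of lost notifications, rather than because the repairs themselves failed.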
OpsCenter was treating lost JMX notifications as an indication that repairs had failed, when in fact they had not. The issue is tracked in the following internal Jira:
OPSC-10112 - Improve logging when Repair Service repairs fail due to lost jmx notifications
The following workaround may be implemented until such time as an upgrade can be scheduled:
Add the following into the
This will require a restart of the DSE process to take effect.
The above Jira (OPSC-10112) is fixed in OpsCenter 6.0.3 and later. See the following release notes:
Note that even in 6.0.3 and later, JMX notifications may still be lost; however, repairs will still attempt to run as normal.