Summary
The Repair Service has been observed to fail after an upgrade to OpsCenter versions 6.0.0 through 6.0.2.
Symptoms
Repair Service failure messages appear in the OpsCenter UI, and repair_service.log may show error messages such as the following:
2016-08-11 06:01:44,723 [MyCluster] ERROR: Error running task: [RepairTask 0xc7bd3: [Node 10.1.2.3: listen_address 10.1.2.3, token -1001410347016415507, num_tokens 256], repair range: [-1572525048168036963, -1553460003985855589]] Traceback (most recent call last):
File "/usr/share/opscenter/lib/py/twisted/internet/defer.py", line 1122, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/share/opscenter/lib/py/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/share/opscenter/jython/Lib/site-packages/opscenterd/cluster/Repair.py", line 362, in _doRepair
DefaultException: Check log for further details
(MainThread)
2016-08-11 06:01:44,724 [MyCluster] ERROR: [RepairTask 0xc7bd3: [Node 10.1.2.3: listen_address 10.1.2.3, token -1001410347016415507, num_tokens 256], repair range: [-1572525048168036963, -1553460003985855589]] has failed 1 times. (MainThread)
2016-08-11 06:01:44,724 [MyCluster] ERROR: 52 errors have occurred out of 100 allowed. (MainThread)
2016-08-11 06:01:44,724 [MyCluster] INFO: Adding repair task to end of queue (MainThread)
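To see how close the Repair Service is to its error limit, the most recent "errors have occurred" line can be pulled from repair_service.log. The following is a minimal sketch: the function name latest_error_budget is hypothetical, the log path is supplied by the caller, and the pattern is based only on the message format shown in the excerpt above.

```shell
# Print the most recent "N errors have occurred out of M allowed" message
# found in a repair_service.log file. Prints nothing if no such line exists.
# latest_error_budget is an illustrative helper name, not an OpsCenter tool.
latest_error_budget() {
    log_file="$1"
    # -o prints only the matching part of each line (supported by GNU and BSD grep)
    grep -o '[0-9][0-9]* errors have occurred out of [0-9][0-9]* allowed' "$log_file" \
        | tail -n 1
}
```

Call it with the path to the log on the OpsCenter host, for example latest_error_budget /path/to/repair_service.log.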
The agent.log on the node may show the following error:
ERROR [ClientNotifForwarder-3] 2016-08-29 13:51:53,020 Repair messages lost via Cassandra JMX, please check cassandra logs for status. #<JMXConnectionNotification javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi.RMIConnector: jmxServiceURL=service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi][type=jmx.remote.connection.notifs.lost][message=May have lost up to 461 notifications]>
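To gauge how many notifications a node has reported as lost, the counts in the "May have lost up to N notifications" messages can be summed. The following is a rough sketch: the function name sum_lost_notifications is hypothetical, the log path is supplied by the caller, and the pattern assumes the message wording shown in the agent.log error above.

```shell
# Sum the counts reported by "May have lost up to N notifications" messages
# in an agent.log file. Prints 0 when no such message is present.
# sum_lost_notifications is an illustrative helper name.
sum_lost_notifications() {
    log_file="$1"
    # grep -o isolates the message; the count is then the 6th whitespace-separated field
    grep -o 'May have lost up to [0-9][0-9]* notifications' "$log_file" \
        | awk '{ total += $6 } END { print total + 0 }'
}
```

Because each message states an upper bound ("up to"), the total is an upper bound as well, not an exact count.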
Cause
OpsCenter treated lost JMX notifications as an indication that the repairs had failed, when in fact they had not. The issue is tracked in the following internal JIRA:
OPSC-10112 - Improve logging when Repair Service repairs fail due to lost jmx notifications
Workaround
The following workaround can be applied until an upgrade can be scheduled.
Add the following line to the cassandra-env.sh file:
JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
This will require a restart of the DSE process to take effect.
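When applying the workaround across several nodes, it helps to add the option only if it is not already present. The following is a minimal sketch: the function name add_jmx_buffer_opt is hypothetical, and the location of cassandra-env.sh depends on the installation type, so the path is passed in by the caller.

```shell
# Append the JMX notification buffer option to a cassandra-env.sh file,
# unless a line containing the same option is already there.
# add_jmx_buffer_opt is an illustrative helper name.
add_jmx_buffer_opt() {
    env_file="$1"
    opt='-Djmx.remote.x.notification.buffer.size=5000'
    # -F matches the option as a fixed string; -e is needed because it starts with "-"
    if grep -qF -e "$opt" "$env_file"; then
        echo "already set in $env_file"
    else
        # \$JVM_OPTS stays literal so cassandra-env.sh expands it at startup
        printf '%s\n' "JVM_OPTS=\"\$JVM_OPTS $opt\"" >> "$env_file"
        echo "added to $env_file"
    fi
}
```

Run it with the path to the node's cassandra-env.sh, for example add_jmx_buffer_opt /path/to/cassandra-env.sh, and then restart DSE on that node. The sketch only checks for this exact value, so a file that already sets a different buffer size would end up with two entries and should be edited by hand.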
Solution
The above JIRA is fixed in OpsCenter 6.0.3 and later. See the following release notes:
http://docs.datastax.com/en/opscenter/6.0/opsc/release_notes/opscReleaseNotes603.html
Note that even in 6.0.3 and later, JMX notifications may still be lost; however, repairs will still attempt to run as normal.