DataStax Help Center

Repair service may fail due to lost JMX notifications from Cassandra

Summary

After upgrading to OpsCenter 6.0.0 through 6.0.2, the repair service may fail.

Symptoms

Repair service failure messages appear in the OpsCenter UI, and repair_service.log may show errors such as the following:

2016-08-11 06:01:44,723 [MyCluster] ERROR: Error running task: [RepairTask 0xc7bd3: [Node 10.1.2.3: listen_address 10.1.2.3, token -1001410347016415507, num_tokens 256], repair range: [-1572525048168036963, -1553460003985855589]] Traceback (most recent call last):
  File "/usr/share/opscenter/lib/py/twisted/internet/defer.py", line 1122, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/share/opscenter/lib/py/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/share/opscenter/jython/Lib/site-packages/opscenterd/cluster/Repair.py", line 362, in _doRepair
DefaultException: Check log for further details
 (MainThread)
2016-08-11 06:01:44,724 [MyCluster] ERROR: [RepairTask 0xc7bd3: [Node 10.1.2.3: listen_address 10.1.2.3, token -1001410347016415507, num_tokens 256], repair range: [-1572525048168036963, -1553460003985855589]] has failed 1 times. (MainThread)
2016-08-11 06:01:44,724 [MyCluster] ERROR: 52 errors have occurred out of 100 allowed. (MainThread)
2016-08-11 06:01:44,724 [MyCluster]  INFO: Adding repair task to end of queue (MainThread)

The agent.log on the node may show the following error:

ERROR [ClientNotifForwarder-3] 2016-08-29 13:51:53,020 Repair messages lost via Cassandra JMX, please check cassandra logs for status. #<JMXConnectionNotification javax.management.remote.JMXConnectionNotification[source=javax.management.remote.rmi.RMIConnector: jmxServiceURL=service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi][type=jmx.remote.connection.notifs.lost][message=May have lost up to 461 notifications]>
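
As a quick way to check whether a node is reporting this symptom, the sketch below counts the agent.log lines that carry the lost-notification type string shown in the error above. The function name count_lost_jmx_notifications is illustrative, and the log path in the usage comment is an assumption; pass the actual location of agent.log for your installation.

```shell
#!/bin/sh
# Count agent.log lines that report lost JMX notifications.
# The log file path is supplied by the caller; no default location is assumed.
count_lost_jmx_notifications() {
    log_file="$1"
    # -F matches the notification type as a fixed string; -c prints the match count.
    # grep exits non-zero when nothing matches, so "|| true" keeps a zero count
    # from being reported as a failure.
    grep -cF 'jmx.remote.connection.notifs.lost' "$log_file" || true
}

# Example (hypothetical path, adjust for your installation):
# count_lost_jmx_notifications /var/log/datastax-agent/agent.log
```

A non-zero count suggests the node has hit the condition described in this article; a count of zero does not rule out other causes of repair failures.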

Cause

OpsCenter treated lost JMX notifications as an indication that repairs had failed, even when they had not. The issue is tracked in the following internal Jira ticket:

OPSC-10112 - Improve logging when Repair Service repairs fail due to lost jmx notifications

Workaround

Until an upgrade can be scheduled, the following workaround may be applied:

Add the following line to the cassandra-env.sh file to raise the JMX notification buffer size to 5000:

JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"

This will require a restart of the DSE process to take effect.
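
If the workaround has to be applied on several nodes, a small helper can keep the change idempotent. The sketch below appends the option only when the file does not already mention the buffer-size property. The function name ensure_jmx_buffer_opt is illustrative, and the path in the usage comment is an assumption, since the location of cassandra-env.sh depends on the installation type.

```shell
#!/bin/sh
# Append the JMX notification buffer option to cassandra-env.sh unless the
# property is already mentioned in the file.
ensure_jmx_buffer_opt() {
    env_file="$1"
    opt='-Djmx.remote.x.notification.buffer.size=5000'
    if grep -qF 'jmx.remote.x.notification.buffer.size' "$env_file"; then
        # Property already present (possibly with another value): leave it alone.
        echo "already set"
    else
        # Single quotes keep $JVM_OPTS literal so it is expanded when
        # cassandra-env.sh is sourced, not when this function runs.
        printf 'JVM_OPTS="$JVM_OPTS %s"\n' "$opt" >> "$env_file"
        echo "added"
    fi
}

# Example (hypothetical path, adjust for your installation):
# ensure_jmx_buffer_opt /etc/dse/cassandra/cassandra-env.sh
```

The helper only edits the file; the DSE restart mentioned above is still needed afterwards. If the property is already present with a different value, the helper reports "already set" and the value should be reviewed by hand.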

Solution

The Jira ticket above is fixed in OpsCenter 6.0.3 and later. See the following release notes:

http://docs.datastax.com/en/opscenter/6.0/opsc/release_notes/opscReleaseNotes603.html

Note: even in OpsCenter 6.0.3 and later, JMX notifications may still be lost; however, repairs will still attempt to run as normal.
