DataStax Help Center

OpsCenter 6.0 agents unable to save metrics, leads to "OutOfMemoryError"

Summary

OpsCenter agents cannot save data to the OpsCenter keyspace and eventually crash with an OutOfMemoryError.

Symptoms

In the agent.log, nodes report availability exceptions when attempting to save data. Here is a sample entry that is logged repeatedly:

ERROR [performance-service-1] 2016-08-02 06:11:03,715 Unhandled exception updating slowest query cache
 com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.2.3.4:9042 \
(com.datastax.driver.core.exceptions.UnavailableException: \
Not enough replicas available for query at consistency LOCAL_ONE \
(1 required but only 0 alive)))

These errors are also logged repeatedly:

ERROR [async-dispatch-2] 2016-08-02 06:11:06,591 Error starting RollupComponent.
 clojure.lang.ExceptionInfo: throw+: {:type :opsagent.cassandra/storage-db-down, :message "Failed to load pdps, storage database connection appears to be down."} \
{:type :opsagent.cassandra/storage-db-down, :message "Failed to load pdps, storage database connection appears to be down."}

Eventually the agent's heap is exhausted:

ERROR [node-details-1] 2016-08-02 07:16:29,890 Error getting data directory device java.util.concurrent.ExecutionException:
java.lang.OutOfMemoryError: GC overhead limit exceeded

Cause

The default replication strategy for the OpsCenter keyspace is SimpleStrategy. This strategy is designed for clusters with a single data centre only.

In a cluster with multiple data centres, agents in DCs that hold no local replicas cannot save their data, because the agent writes it with a consistency level of LOCAL_ONE. With no replica in the agent's own data centre, Cassandra cannot satisfy the request (CASSANDRA-12053) and returns the unavailable exception (internal defect ID OPSC-9239).
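The interaction between replica placement and LOCAL_ONE can be sketched with a toy model. This is not Cassandra's code: the four-node ring, the node names, the tokens and the replication factor of 1 are all hypothetical, every node is assumed to be up, and the NetworkTopologyStrategy function ignores racks. It only illustrates why a coordinator in a data centre without a replica rejects a LOCAL_ONE write that ONE would accept.

```python
from bisect import bisect_left

# Hypothetical ring, sorted by token: (token, node name, data centre).
RING = [
    (0, "dc1-node-a", "DC1"),
    (25, "dc1-node-b", "DC1"),
    (50, "dc2-node-a", "DC2"),
    (75, "dc2-node-b", "DC2"),
]


def _start_index(key_token, ring):
    """Index of the first node whose token is >= key_token, wrapping around."""
    tokens = [token for token, _, _ in ring]
    return bisect_left(tokens, key_token) % len(ring)


def simple_strategy_replicas(key_token, ring, rf):
    """Take the next rf nodes clockwise, with no regard for data centres."""
    start = _start_index(key_token, ring)
    return [ring[(start + i) % len(ring)] for i in range(min(rf, len(ring)))]


def network_topology_replicas(key_token, ring, rf_per_dc):
    """Walk clockwise until each data centre has its configured replica count.

    Simplified: real NetworkTopologyStrategy also takes racks into account.
    """
    start = _start_index(key_token, ring)
    chosen, counts = [], {}
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)]
        dc = node[2]
        if counts.get(dc, 0) < rf_per_dc.get(dc, 0):
            chosen.append(node)
            counts[dc] = counts.get(dc, 0) + 1
    return chosen


def one_satisfied(replicas):
    """ONE needs a live replica anywhere (all nodes are assumed alive here)."""
    return len(replicas) >= 1


def local_one_satisfied(replicas, coordinator_dc):
    """LOCAL_ONE needs a live replica in the coordinator's own data centre."""
    return any(dc == coordinator_dc for _, _, dc in replicas)


if __name__ == "__main__":
    simple = simple_strategy_replicas(10, RING, rf=1)
    nts = network_topology_replicas(10, RING, {"DC1": 1, "DC2": 1})
    print("SimpleStrategy replicas:", simple)
    print("  ONE from DC2:", one_satisfied(simple))
    print("  LOCAL_ONE from DC2:", local_one_satisfied(simple, "DC2"))
    print("NetworkTopologyStrategy replicas:", nts)
    print("  LOCAL_ONE from DC2:", local_one_satisfied(nts, "DC2"))
```

In this model the only SimpleStrategy replica for token 10 sits in DC1, so a write coordinated in DC2 passes the ONE check and fails the LOCAL_ONE check. With one replica per data centre, the LOCAL_ONE check passes.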

As a side effect of not being able to connect to Cassandra, the agent repeatedly restarts its components. Each restart starts new loader threads, but the previous threads are never stopped (internal defect ID OPSC-9712). Because threads keep accumulating, the agent's heap is eventually exhausted.
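The leak pattern described here can be illustrated with a small, self-contained sketch. It is not the agent's code (the agent runs on the JVM, as the log entries show); the Loader and Component classes and the stop_previous flag are hypothetical names used only to contrast a restart that abandons its old worker thread with one that stops it first.

```python
import threading


class Loader:
    """Stand-in for a loader thread: it idles until it is told to stop."""

    def __init__(self):
        self._stop = threading.Event()
        # Daemon thread so an abandoned loader cannot block interpreter exit.
        self._thread = threading.Thread(target=self._stop.wait, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def alive(self):
        return self._thread.is_alive()


class Component:
    """Hypothetical component that starts a fresh loader on every restart."""

    def __init__(self, stop_previous):
        self.stop_previous = stop_previous
        self.loaders = []  # every loader ever started, kept for inspection

    def restart(self):
        # The leaky variant skips this step and abandons the old thread.
        if self.stop_previous and self.loaders:
            self.loaders[-1].stop()
        self.loaders.append(Loader())

    def live_loaders(self):
        return sum(1 for loader in self.loaders if loader.alive())


if __name__ == "__main__":
    leaky = Component(stop_previous=False)
    fixed = Component(stop_previous=True)
    for _ in range(5):
        leaky.restart()
        fixed.restart()
    print("leaky live loaders:", leaky.live_loaders())  # 5
    print("fixed live loaders:", fixed.live_loaders())  # 1
```

Each abandoned thread holds on to its own stack and whatever it references, so a restart loop that never stops the previous loader grows without bound.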

Workaround

When adding new data centres to a cluster, always review the replication settings and change them to use NetworkTopologyStrategy, as described in the Adding a data center to a cluster document. For example, for a cluster with data centres named DC1 and DC2:

cqlsh> ALTER KEYSPACE "OpsCenter" WITH REPLICATION = {
   ... 'class' : 'NetworkTopologyStrategy',
   ... 'DC1' : 3, 'DC2' : 3 };
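For clusters with more data centres, the statement can be generated rather than typed by hand. The helper below is a minimal sketch, not part of any DataStax tool: the function name is hypothetical, and it assumes the keyspace was created with the mixed-case name OpsCenter, which CQL only preserves when the identifier is wrapped in double quotes. The data centre names must match the names the cluster itself reports.

```python
def alter_keyspace_replication(keyspace, replicas_per_dc):
    """Build an ALTER KEYSPACE statement that uses NetworkTopologyStrategy.

    keyspace        -- keyspace name; double-quoted so its case is preserved
    replicas_per_dc -- mapping of data centre name to replication factor
    """
    if not replicas_per_dc:
        raise ValueError("at least one data centre is required")
    options = ["'class' : 'NetworkTopologyStrategy'"]
    for dc, rf in replicas_per_dc.items():
        if isinstance(rf, bool) or not isinstance(rf, int) or rf < 1:
            raise ValueError(f"replication factor for {dc!r} must be a positive integer")
        # Escape single quotes inside the data centre name for the CQL literal.
        options.append("'" + dc.replace("'", "''") + "' : " + str(rf))
    # Escape double quotes inside the keyspace name for the quoted identifier.
    quoted = '"' + keyspace.replace('"', '""') + '"'
    body = ", ".join(options)
    return "ALTER KEYSPACE " + quoted + " WITH REPLICATION = { " + body + " };"


if __name__ == "__main__":
    print(alter_keyspace_replication("OpsCenter", {"DC1": 3, "DC2": 3}))
```

The function only builds a string; review the result and run it in cqlsh yourself.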

Solution

In the upcoming release of OpsCenter 6.0.2, agent connections will default to a consistency level of ONE (internal ID OPSC-9659). Note, however, that this change is not required if replication for the OpsCenter keyspace is configured correctly as described in the Workaround section above.

See also

Jira - [CASSANDRA-12053] "ONE != LOCAL_ONE for SimpleStrategy"

DataStax doc - Adding a data center to a cluster

DataStax doc - Changing keyspace replication strategy
