DataStax Help Center

Repair Service Won't Start - Can't Allocate Parallel Repair Threads

Summary

Starting with OpsCenter 4.0, a Repair Service can be started for each cluster managed by OpsCenter.

Symptoms

When starting the Repair Service, the following error may appear:

#ERROR: Repair service cannot complete without adversely affecting the cluster. Required parallel repairs: 18.0, Max parallel repairs: 11.0, Shutting down repair service.

Cause

 

When starting the repair service, OpsCenter estimates how much time it has to complete the repairs. This time to completion is 90% of the minimum value of gc_grace_seconds in the cluster, discounting values of 0 and non-existent values. OpsCenter supplies this estimate as the suggested value when starting the repair service.

For example, from the schema output for a column family:

ColumnFamily: schema_keyspaces
    "keyspace definitions"
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Cells sorted by: org.apache.cassandra.db.marshal.UTF8Type
      GC grace seconds: 8640
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      DC Local Read repair chance: 0.0
      Populate IO Cache on flush: false
      Replicate on write: true
      Caching: KEYS_ONLY
      Bloom Filter FP chance: 0.01
      Built indexes: []
      Column Metadata:
        Column Name: strategy_options
          Validation Class: org.apache.cassandra.db.marshal.UTF8Type
        Column Name: durable_writes
          Validation Class: org.apache.cassandra.db.marshal.BooleanType
        Column Name: strategy_class
          Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
      Compression Options:
        sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

 

 

Using the above GC grace seconds value of 8640, the time to completion is calculated as 0.09 days (about 2.2 hours):

(8640 * 0.9) / (60 * 60 * 24) = 0.09 days
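The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not OpsCenter code; the function name `time_to_completion_days` and its parameters are hypothetical.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def time_to_completion_days(gc_grace_values, fraction=0.9):
    """Return `fraction` of the smallest usable gc_grace_seconds, in days.

    Hypothetical helper: values of 0 and missing values (None) are
    discounted, as the article describes.
    """
    usable = [v for v in gc_grace_values if v]  # drops 0 and None
    if not usable:
        raise ValueError("no usable gc_grace_seconds values found")
    return min(usable) * fraction / SECONDS_PER_DAY


if __name__ == "__main__":
    # One column family at 8640 seconds, one at the 864000 default,
    # plus a 0 and a missing value that are ignored.
    days = time_to_completion_days([864000, 8640, 0, None])
    print(round(days, 4), "days")
```

With only the default value of 864000 present, the same function gives 9 days.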

 

The lower the time to completion is, the more repairs must run in parallel to repair the entire cluster in that time. Nine (9) days is the most common time to completion because the default GC grace value is 864000 seconds (10 days).

If the cluster can't be repaired within the time to completion without exceeding the maximum number of parallel repairs, OpsCenter reports the error and will not allow the service to start. This prevents OpsCenter from introducing performance issues into the cluster.

Increasing the time to completion lowers the number of parallel repairs required, which is what allows the repair service to start.

OpsCenter supports configuring the repair service parameters as well. Refer to the OpsCenter 4.0 Repair Service documentation for these values.

Solution

Specify a larger time to completion when starting the repair service, so that the required number of parallel repairs no longer exceeds the maximum. This should allow the repair service to turn on and run with little impact. If the repairs don't take that full amount of time, parallel repairs and throughput will be re-estimated on the next repair service cycle.
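As a rough way to pick a starting point, the sketch below estimates how much larger the time to completion needs to be. It assumes, purely for illustration, that the required number of parallel repairs scales inversely with the time to completion. The article only states that a lower time needs more parallel repairs; OpsCenter's actual formula is not given here, so treat the result as a lower bound and choose a comfortably larger value. The function name and parameters are hypothetical.

```python
def min_time_to_completion_days(current_days, required_parallel, max_parallel):
    """Estimate the smallest time to completion that fits max_parallel.

    ASSUMPTION (not from OpsCenter): required parallel repairs are
    inversely proportional to the time to completion.
    """
    if current_days <= 0 or required_parallel <= 0 or max_parallel <= 0:
        raise ValueError("all arguments must be positive")
    # Never suggest less than the current value.
    scale = max(1.0, required_parallel / max_parallel)
    return current_days * scale


if __name__ == "__main__":
    # Numbers from the error message: 18 required, 11 allowed,
    # with the 0.09-day time to completion from the example above.
    estimate = min_time_to_completion_days(0.09, 18.0, 11.0)
    print(f"try at least {estimate:.3f} days")
```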

 
