Summary
This article discusses an issue where the OpsCenter Backup Service fails on nodes with too many data files.
Applies to
- OpsCenter 6.0.x
Symptoms
When performing a backup using the OpsCenter Backup Service, a snapshot attempt fails and reports that there are too many files to back up. Here is an example error in opscenterd.log:
2017-09-20 01:23:45,678 [MyCluster] ERROR: Snapshot of keyspaces [playlist] on node 10.1.2.3 failed: \
Cannot take snapshot, number of files to be backed up 56789 is larger than the queue size 10000. \
Increase queue size by changing backup_file_queue_max in the agent configuration. (MainThread)
Cause
Prior to the snapshot phase, the OpsCenter Backup Service traverses each keyspace's data directory and records metadata about each file onto a work queue. This queue is used to determine which files are included in the snapshot and sent to the respective backup destination.
The issue arises during the traversal of the data directory. The traversal unnecessarily includes the snapshot subdirectories when compiling the list of files for the work queue, which results in too many files for processing.
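The effect of including snapshot subdirectories can be illustrated with a short Python sketch. This is not the agent's actual traversal code. It assumes the standard Cassandra on-disk layout, in which each table directory under a keyspace contains a snapshots subdirectory; the function name is hypothetical.

```python
import os

def count_keyspace_files(keyspace_dir, include_snapshots=True):
    """Count files under a keyspace data directory.

    Assumes the layout <keyspace_dir>/<table>/snapshots/<snapshot_name>/...
    """
    total = 0
    for _root, dirs, files in os.walk(keyspace_dir):
        if not include_snapshots and "snapshots" in dirs:
            # Prune in place so os.walk does not descend into snapshots
            dirs.remove("snapshots")
        total += len(files)
    return total
```

Calling the function twice on the keyspace directory of the affected node, once with include_snapshots=True and once with include_snapshots=False, gives a rough idea of how much of the total comes from old snapshots.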
Workaround
Follow one of these procedures to allow backups to complete successfully.
OPTION A - Remove old snapshots
Step A1 - Manually delete snapshots that you no longer require on the affected node for the affected keyspace. For example, to delete a specific snapshot for the playlist keyspace:
$ nodetool clearsnapshot -t <snapshot_name> -- playlist
Step A2 - Repeat the step above until the number of files in the subdirectories for the playlist keyspace is below backup_file_queue_max (default is 10000 files).
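To decide which snapshots to remove first, it can help to see how many files each snapshot contributes. The following Python sketch tallies files per snapshot name across all tables of a keyspace. It assumes the standard Cassandra layout <keyspace_dir>/<table>/snapshots/<snapshot_name>/ and the function name is hypothetical.

```python
import os
from collections import Counter

def files_per_snapshot(keyspace_dir):
    """Return a Counter mapping snapshot name -> file count across all tables."""
    counts = Counter()
    for table in os.listdir(keyspace_dir):
        snapshots_dir = os.path.join(keyspace_dir, table, "snapshots")
        if not os.path.isdir(snapshots_dir):
            continue  # table has no snapshots (or entry is not a directory)
        for name in os.listdir(snapshots_dir):
            snap_path = os.path.join(snapshots_dir, name)
            if not os.path.isdir(snap_path):
                continue
            # Sum files at every level below this snapshot directory
            for _root, _dirs, files in os.walk(snap_path):
                counts[name] += len(files)
    return counts
```

Calling files_per_snapshot(...).most_common() lists the largest snapshots first; the names it returns are the values to pass to nodetool clearsnapshot -t.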
OPTION B - Increase the queue size
Step B1 - For the affected node, increase the maximum backup queue size by adding the following line to address.yaml:
backup_file_queue_max: 60000
Step B2 - This causes the agent to consume more memory, so we recommend increasing the agent's heap size. By default, the agent is allocated 128MB of heap in conf/datastax-agent-env.sh:
JVM_OPTS="$JVM_OPTS -Xmx128M -Djclouds.mpu.parts.magnitude=100000 -Djclouds.mpu.parts.size=16777216"
Increase the heap to 256MB or 512MB by changing the -Xmx value (for example, -Xmx256M).
Step B3 - Restart the agent for the changes to take effect.
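One way to choose the value for backup_file_queue_max in Step B1 is to round the file count reported in the error message up to the next multiple of 10000. The Python sketch below does this; the rounding rule and the function names are assumptions for illustration, not a documented recommendation.

```python
def suggest_queue_max(file_count, step=10000):
    """Round file_count up to the next multiple of step that is strictly greater."""
    return (file_count // step + 1) * step

def address_yaml_line(file_count):
    """Format the suggested value as a line for address.yaml."""
    return "backup_file_queue_max: %d" % suggest_queue_max(file_count)
```

For the 56789 files in the example error, address_yaml_line(56789) yields the same backup_file_queue_max: 60000 line used in Step B1.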
Solution
The backup_file_queue_max option was deprecated in OpsCenter 6.1.0 (internal ID OPSC-8045) since the functionality it provided was no longer required. Upgrade to OpsCenter 6.1.0+ to take advantage of the new Backup Service algorithm.