In some circumstances, nodes running incremental repair have been observed to run out of heap space.
During a manual incremental repair run with "nodetool repair -inc -par <keyspace>", G1GC has been observed to perform increasingly frequent garbage collections, with Old Generation pauses in the range of 10 seconds. These pauses become so frequent that 6 of them occur within a minute. Each collection reclaims only a small fraction of the Old Generation, and this continues for many minutes until the node crashes with an OutOfMemoryError.
When incremental repair runs, it keeps references to all the sstables it needs until the end of the repair session. With a large number of sstables, the number of references held in memory can grow until the heap is exhausted. This can cause an OutOfMemoryError, leading to JVM instability.
A JVM heap dump will show the objects referenced by "org.apache.cassandra.service.ActiveRepairService".
One way to alleviate the problem and reduce the risk of OOM is to manually run "nodetool repair -inc -par" on one table at a time, which avoids the buildup of a large number of sstable references in memory.
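The per-table workaround can be scripted. The following is a minimal sketch, not a supported tool: it assumes "nodetool" is on the PATH of the node being repaired, and the keyspace and table names ("my_keyspace", "users", "events") as well as the helper function names are hypothetical placeholders. By default it only prints the commands it would run.

```python
import subprocess
from typing import List, Sequence


def build_repair_commands(keyspace: str, tables: Sequence[str]) -> List[List[str]]:
    """Build one 'nodetool repair -inc -par' command per table."""
    return [
        ["nodetool", "repair", "-inc", "-par", keyspace, table]
        for table in tables
    ]


def repair_one_table_at_a_time(
    keyspace: str, tables: Sequence[str], dry_run: bool = True
) -> List[List[str]]:
    """Run (or, in dry-run mode, just print) the repairs sequentially.

    Each repair finishes before the next one starts, so a single session
    only ever covers the sstables of one table.
    """
    commands = build_repair_commands(keyspace, tables)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop if a repair exits with an error.
            subprocess.run(cmd, check=True)
    return commands


# Dry run with hypothetical names: prints the commands without executing them.
repair_one_table_at_a_time("my_keyspace", ["users", "events"])
```

Passing dry_run=False executes the repairs for real. Running them strictly one after another is deliberate: launching them concurrently would bring back the buildup of sstable references that the workaround is meant to avoid.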
Two Jiras are tracking the fix for this issue:
- internal jira: DSP-9640
OutOfMemoryError heap repair incremental