Full ERROR Message Example
=================
INFO [main] 2020-08-19 11:50:10,860 ColumnFamilyStore.java:429 - Initializing keyspace1.cf1
ERROR [SSTableBatchOpen:1] 2020-08-19 11:50:10,893 StartupDiskErrorHandler.java:41 - Exiting forcefully due to file system exception on startup, disk failure policy "stop"
org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /var/lib/cassandra/data/keyspace1/cf1-fe434300e20b11eab65aff186e227ea0/aa-1-bti-Data.db
    at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:510)
    at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:350)
    at org.apache.cassandra.io.sstable.format.SSTableReader$2.run(SSTableReader.java:533)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: io.netty.channel.unix.Errors$NativeIoException: open(..) failed: Permission denied
... skipped for readability ...
What does this ERROR message mean?
During DSE startup, a disk or sstable problem prevented an sstable from being opened, so the startup process could not complete. The disk may be unmounted, corrupt, or simply full. In the example above, the "Caused by" line points to a file permission problem (open(..) failed: Permission denied).
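The conditions listed above can be checked quickly from a shell. The following is a minimal sketch, not an official DSE tool: check_sstable_file is a hypothetical helper name, and the readability test only reflects the OS user that runs it, so run it as the same user that runs DSE.

```shell
# Hypothetical helper: report whether a file exists and is readable by the
# current OS user, then show the filesystem that holds it.
check_sstable_file() {
    file="$1"
    if [ ! -e "$file" ]; then
        # A missing file can also mean the data disk is not mounted.
        echo "missing: $file"
        return 1
    fi
    if [ ! -r "$file" ]; then
        # Show owner and mode so a permission problem is visible.
        echo "unreadable: $file"
        ls -l "$file"
        return 2
    fi
    echo "readable: $file"
    # df prints the filesystem, its usage, and its mount point.
    df -h "$(dirname "$file")"
}
```

For example, pass it the path reported in the CorruptSSTableException line: check_sstable_file /var/lib/cassandra/data/keyspace1/cf1-fe434300e20b11eab65aff186e227ea0/aa-1-bti-Data.db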
Why does this ERROR occur?
It is often preferable to prevent the node from becoming available, so that the administrator can deal with the corrupted file, disk failure, etc. while the node is offline and not serving application traffic.
You can configure the disk_failure_policy in cassandra.yaml to define what DSE should do when a disk error is encountered during startup. There are multiple options, depending on how you want to handle the error. Frequently the failure results in, or is triggered by, sstable corruption or missing sstables.
The configurable options are:
die
Shut down gossip and client transports and kill the JVM for any file system errors or single-sstable errors, so the node can be replaced.
stop_paranoid
Shut down gossip and client transports even for single-sstable errors. Kill the JVM for errors during startup.
stop
Shut down gossip and client transports, leaving the node effectively dead but still available for inspection via JMX. Kill the JVM for errors during startup.
best_effort
Stop using the failed disk and respond to requests based on the remaining available sstables. This means you WILL see obsolete data at CL.ONE!
ignore
Ignore fatal errors and let requests fail.
Note that stop is the default behavior on a new DSE node.
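To confirm which policy a node is actually configured with, the setting can be read straight from cassandra.yaml. This is a rough sketch under stated assumptions: get_disk_failure_policy is a hypothetical helper, it expects an unquoted value on an uncommented line, and it falls back to stop (the default noted above) when the key is absent.

```shell
# Hypothetical helper: print the disk_failure_policy value from a
# cassandra.yaml file, or "stop" when the key is absent or commented out.
# Assumes the value is written without quotes.
get_disk_failure_policy() {
    yaml="$1"
    # Match only uncommented lines that start with the key, keep the value.
    policy=$(sed -n 's/^disk_failure_policy:[[:space:]]*\([A-Za-z_]*\).*/\1/p' "$yaml" | head -n 1)
    echo "${policy:-stop}"
}
```

The location of cassandra.yaml depends on the installation type. On a package install it is typically under /etc/dse/cassandra/, which is an assumption here, for example: get_disk_failure_policy /etc/dse/cassandra/cassandra.yaml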
How do you fix this ERROR?
If you determine the cause to be a corrupt sstable rather than a disk issue, the corruption will need to be repaired.
To attempt to repair the corruption online, run:
nodetool scrub <keyspace> <table>
Some types of corruption cannot be repaired online. When this happens, bring down the node and run the offline tool:
sstablescrub <keyspace> <table>
If the sstable is too corrupted to be repaired, but the replication factor for the keyspace is greater than one, delete the corrupted sstable from the node and run a full repair on the node to restore the data from the other replicas, for example:
nodetool repair --full <keyspace> <table>
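An sstable is made up of several component files that share a name prefix (aa-1-bti- in the example log), and all of them need to be removed together, not only the -Data.db file. The sketch below moves the components into a quarantine directory rather than deleting them, which is a more cautious variation on the delete step above. It assumes the node is stopped. The helper name quarantine_sstable and the quarantine directory are hypothetical.

```shell
# Hypothetical helper: move every component file of one sstable into a
# quarantine directory. Run only while the node is stopped.
quarantine_sstable() {
    data_file="$1"       # e.g. .../aa-1-bti-Data.db
    quarantine_dir="$2"  # destination outside the table's data directory
    dir=$(dirname "$data_file")
    # Strip the component name to get the shared prefix, e.g. "aa-1-bti-".
    prefix=$(basename "$data_file" | sed 's/[^-]*$//')
    mkdir -p "$quarantine_dir"
    for f in "$dir/$prefix"*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        mv "$f" "$quarantine_dir/"
        echo "moved: $(basename "$f")"
    done
}
```

After the files are moved, start the node and run the full repair shown above. The quarantined files can be deleted once the repair completes and the data has been verified.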