Summary
This article explains the cause of, and solution for, Spark workers being unavailable in the cluster.
Symptoms
Depending on the status of the workers, it may not be possible to submit jobs since no resources are available.
The Spark master user interface (UI) either reports no workers or the incorrect number of workers available. For example:
URL: spark://10.1.2.3:7077
REST URL: spark://10.1.2.3:6066 (cluster mode)
Workers: 0
Cores: 0 Total, 0 Used
Memory: 0.0 B Total, 0.0 B Used
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
In some instances, attempts to start a Spark shell return the following error:
ERROR 2016-05-13 06:39:58,209 org.apache.spark.util.Utils: Failed to create dir in /var/lib/spark/rdd. Ignoring this directory.
ERROR 2016-05-13 06:39:58,210 org.apache.spark.storage.DiskBlockManager: Failed to create any local dir.
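For context, the Spark shell in DSE is launched via the dse command, and the error above surfaces at startup when the shell cannot write to the local RDD directory. The invocation below assumes the dse binary is on the PATH:

$ dse spark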
Cause
A review of the system.log on the nodes shows that workers fail during initialisation because the DSE process is unable to access the Spark directories:
ERROR [SPARK-WORKER-INIT-0] 2016-05-13 03:57:08,510 SparkWorkerRunner.java:118 - Failed to configure Spark Worker
java.nio.file.AccessDeniedException: /var/lib/spark/worker/worker.configuration
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84) ~[na:1.8.0_91]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[na:1.8.0_91]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[na:1.8.0_91]
    at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244) ~[na:1.8.0_91]
    at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108) ~[na:1.8.0_91]
    at java.nio.file.Files.deleteIfExists(Files.java:1165) ~[na:1.8.0_91]
    at com.datastax.bdp.transport.server.DigestAuthUtils.saveFile(DigestAuthUtils.java:126) ~[dse-core-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.util.Utils$.createConfigurationFile(Utils.scala:111) ~[dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.util.Utils.createConfigurationFile(Utils.scala) ~[dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.SparkWorkerRunner.args(SparkWorkerRunner.java:114) ~[dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.AbstractSparkRunner.initService(AbstractSparkRunner.java:55) [dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.AbstractSparkRunner.initService(AbstractSparkRunner.java:19) [dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.hadoop.mapred.ServiceRunner.run(ServiceRunner.java:126) [dse-hadoop-4.8.6.jar:4.8.6]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
ERROR [SPARK-WORKER-INIT-0] 2016-05-13 03:57:08,511 AbstractSparkRunner.java:126 - SparkWorker-0 threw exception in state STARTING:
java.lang.RuntimeException: java.nio.file.AccessDeniedException: /var/lib/spark/worker/worker.configuration
    at com.datastax.bdp.spark.SparkWorkerRunner.args(SparkWorkerRunner.java:119) ~[dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.AbstractSparkRunner.initService(AbstractSparkRunner.java:55) ~[dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.AbstractSparkRunner.initService(AbstractSparkRunner.java:19) ~[dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.hadoop.mapred.ServiceRunner.run(ServiceRunner.java:126) ~[dse-hadoop-4.8.6.jar:4.8.6]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: java.nio.file.AccessDeniedException: /var/lib/spark/worker/worker.configuration
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84) ~[na:1.8.0_91]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[na:1.8.0_91]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[na:1.8.0_91]
    at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244) ~[na:1.8.0_91]
    at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108) ~[na:1.8.0_91]
    at java.nio.file.Files.deleteIfExists(Files.java:1165) ~[na:1.8.0_91]
    at com.datastax.bdp.transport.server.DigestAuthUtils.saveFile(DigestAuthUtils.java:126) ~[dse-core-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.util.Utils$.createConfigurationFile(Utils.scala:111) ~[dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.util.Utils.createConfigurationFile(Utils.scala) ~[dse-spark-4.8.6.jar:4.8.6]
    at com.datastax.bdp.spark.SparkWorkerRunner.args(SparkWorkerRunner.java:114) ~[dse-spark-4.8.6.jar:4.8.6]
    ... 4 common frames omitted
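A quick way to check whether a node is hitting this failure is to search the log for the access error. The log path below assumes a default package installation; adjust it to match your environment:

$ grep "AccessDeniedException" /var/log/cassandra/system.log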
Solution
Check the permissions on the Spark directories, e.g. ensure that the cassandra user has read/write access.
The Spark directories are defined in spark-env.sh:
SPARK_WORKER_DIR (default: /var/lib/spark/worker)
SPARK_LOCAL_DIRS (default: /var/lib/spark/rdd)
SPARK_WORKER_LOG_DIR (default: /var/log/spark/worker)
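To inspect the current ownership and permissions before changing anything, list the directories. The paths below assume the defaults above; substitute the values from your spark-env.sh:

$ ls -ld /var/lib/spark/worker /var/lib/spark/rdd /var/log/spark/worker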
Reset the ownership and permissions as appropriate. For example:
$ sudo chown -R cassandra:cassandra /var/lib/spark/worker
$ sudo chown -R cassandra:cassandra /var/lib/spark/rdd
$ sudo chown -R cassandra:cassandra /var/log/spark/worker
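After resetting ownership, restart DSE on the affected nodes so the workers re-initialise, then confirm they register with the master. The steps below are a sketch assuming a package installation (service name dse); tarball installations are restarted with the dse script in the installation directory:

$ sudo service dse restart
$ grep -r "Successfully registered with master" /var/log/spark/worker

The Spark master UI should then report the expected number of workers with non-zero cores and memory.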