Summary
This article helps diagnose and work around a condition in DSE 5.1 where the Spark Master thread may fail to launch during DSE node startup, causing the Spark Master service to shut down.
Symptoms
During node startup, initialization of the Spark Master service generates the following exceptions, after which the Spark Master logs a shutdown, closes all child processes, and finishes shutting down the service:
ERROR [dispatcher-event-loop-1] 2018-03-09 11:26:57,787 SPARK-MASTER Logging.scala:91 - Exception encountered
java.io.InvalidClassException: org.apache.spark.deploy.rm.DseAppProxy; local class incompatible: stream classdesc serialVersionUID = -3776121086854317023, local class serialVersionUID = -2505941516339812882
...
<trace>
...
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_141]
ERROR [dispatcher-event-loop-1] 2018-03-09 11:26:57,788 SPARK-MASTER Logging.scala:91 - DseSparkMaster error
com.datastax.bdp.spark.ha.CassandraPersistenceEngine$CassandraPersistenceEngineException: Failed to deserialize app_ with id=app_app-20180307135949-0003 in dc=dc1. Consider cleaning up the recovery data and restarting the workers - see 'dsetool sparkmaster cleanup' and 'dsetool sparkworker restart' commands in DSE help.
...
<trace>
...
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_141]
Caused by: java.io.InvalidClassException: org.apache.spark.deploy.rm.DseAppProxy; local class incompatible: stream classdesc serialVersionUID = -3776121086854317023, local class serialVersionUID = -2505941516339812882
...
<trace>
...
... 26 common frames omitted
INFO [dispatcher-event-loop-2] 2018-03-09 11:27:02,286 SPARK-MASTER Logging.scala:54 - Registering worker <some_ip>:33655 with 3 cores, 8.1 GB RAM
INFO [pool-20-thread-1] 2018-03-09 11:27:02,753 SPARK-MASTER Logging.scala:35 - Shutting down SPARK-MASTER service...
INFO [pool-20-thread-1] 2018-03-09 11:27:02,754 SPARK-MASTER Logging.scala:35 - SPARK-MASTER service has been shut down
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,755 SPARK-MASTER AbstractConnector.java:306 - Stopped ServerConnector@2b5f956e{HTTP/1.1}{<some_ip>:7080}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,756 SPARK-MASTER ContextHandler.java:865 - Stopped o.s.j.s.ServletContextHandler@7f2667c8{/driver/kill,null,UNAVAILABLE}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,756 SPARK-MASTER ContextHandler.java:865 - Stopped o.s.j.s.ServletContextHandler@7fd39b94{/app/kill,null,UNAVAILABLE}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,756 SPARK-MASTER ContextHandler.java:865 - Stopped o.s.j.s.ServletContextHandler@438e15b7{/static,null,UNAVAILABLE}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,756 SPARK-MASTER ContextHandler.java:865 - Stopped o.s.j.s.ServletContextHandler@5c3aa3a2{/json,null,UNAVAILABLE}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,756 SPARK-MASTER ContextHandler.java:865 - Stopped o.s.j.s.ServletContextHandler@44096b0{/,null,UNAVAILABLE}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,756 SPARK-MASTER ContextHandler.java:865 - Stopped o.s.j.s.ServletContextHandler@5a9cdfa5{/app/json,null,UNAVAILABLE}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,756 SPARK-MASTER ContextHandler.java:865 - Stopped o.s.j.s.ServletContextHandler@629fa694{/app,null,UNAVAILABLE}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,757 SPARK-MASTER AbstractConnector.java:306 - Stopped ServerConnector@a00ea59{HTTP/1.1}{<some_ip>:6066}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,757 SPARK-MASTER ContextHandler.java:865 - Stopped o.s.j.s.ServletContextHandler@54b5ab5b{/,null,UNAVAILABLE}
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,757 SPARK-MASTER Logging.scala:35 - Created DelegationTokenRenewalRunner
INFO [dispatcher-event-loop-1] 2018-03-09 11:27:02,757 SPARK-MASTER Logging.scala:35 - Shutting down DelegationTokenRenewalRunner
INFO [SPARK-MASTER] 2018-03-09 11:27:02,758 SPARK-MASTER Logging.scala:35 - SPARK-MASTER service finished
Other symptoms that the Spark Master is down include Spark Workers being unable to connect to it and the Spark Master UI being unavailable.
Cause
During Spark Master setup, an endpoint object for RPC calls is generated and serialized. The serialized form is saved by the Spark Master's high-availability mechanism, and after an upgrade the previously serialized object may no longer have a compatible class signature (its serialVersionUID has changed). The Spark Master then encounters recovery data it cannot deserialize safely, and shuts down (internal defect ID DSP-15679).
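The mechanism can be sketched with a minimal, generic Java serialization example: data serialized by one version of a class cannot be deserialized once the local class's serialVersionUID differs from the one recorded in the stream. The AppProxy class below is an illustrative stand-in, not DSE's actual org.apache.spark.deploy.rm.DseAppProxy; the demo simulates the upgrade by patching the UID bytes in the serialized stream.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class SerialMismatchDemo {
    // Stand-in for a persisted recovery object (NOT DSE's DseAppProxy).
    static class AppProxy implements Serializable {
        private static final long serialVersionUID = 1L;
        String appId = "app-0001";
    }

    // Serializes an AppProxy, corrupts the stored serialVersionUID to mimic
    // an upgrade that changed the class signature, and returns the resulting
    // InvalidClassException message.
    static String demonstrateMismatch() throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(new AppProxy());
        }
        byte[] bytes = buf.toByteArray();

        // In the Java serialization stream format, the 8-byte
        // serialVersionUID immediately follows the UTF-encoded class name.
        byte[] name = AppProxy.class.getName().getBytes(StandardCharsets.UTF_8);
        int i = indexOf(bytes, name);
        bytes[i + name.length] ^= (byte) 0xFF; // flip one UID byte

        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            in.readObject();
            return "no exception";
        } catch (InvalidClassException e) {
            return e.getMessage(); // "...; local class incompatible: ..."
        }
    }

    // Returns the index of the first occurrence of needle in haystack.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++)
                if (haystack[i + j] != needle[j]) continue outer;
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("InvalidClassException: " + demonstrateMismatch());
    }
}
```

This is the same exception class and "local class incompatible" message seen in the Spark Master log above; in DSE the mismatched stream lives in the master's recovery data rather than an in-memory buffer, which is why clearing that data is the workaround.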
Workaround
The workaround stated in the error message, clearing the Spark Master recovery data and restarting all Spark Workers, is effective in resolving this condition. The commands are:
dsetool sparkmaster cleanup
and
dsetool sparkworker restart
Solution
A fix for DSP-15679 is under investigation at this time.