There is a bug in the Spark project, SPARK-12963, where setting SPARK_LOCAL_IP in spark-env.sh prevents jobs from completing successfully when they are submitted with the --deploy-mode cluster flag.
SPARK_LOCAL_IP is often set as a workaround for the following error:
Exception in thread "main" java.net.BindException: Failed to bind to: /10.1.1.7:0: Service 'Driver' failed after 16 retries!
However, this breaks the --deploy-mode cluster flag and causes errors like the following:
Exception in thread "main" java.net.BindException: Failed to bind to: /126.96.36.199:0: Service 'sparkDriver' failed after 16 retries!
Unfortunately, cluster mode uses SPARK_LOCAL_IP to bind the driver to that IP, which is the wrong address whenever the driver starts on a different node. This behavior is not intended, but it affects all versions of Spark running the standalone scheduler up to 1.6.0.
So the workaround to all this is pretty simple:
Never ever set SPARK_LOCAL_IP until there is a fix for SPARK-12963.
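To confirm that a node is not still setting the variable, a small check along these lines may help. It is only a sketch: the function name has_spark_local_ip is made up for illustration, and the location of spark-env.sh depends on your installation, so the path is passed in rather than assumed.

```shell
# Succeeds (exit status 0) when the given spark-env.sh contains an active,
# uncommented SPARK_LOCAL_IP assignment, with or without "export".
# has_spark_local_ip is a hypothetical helper name, not part of DSE or Spark.
has_spark_local_ip() {
    grep -Eq '^[[:space:]]*(export[[:space:]]+)?SPARK_LOCAL_IP=' "$1"
}

# Example usage; replace the placeholder path with your own spark-env.sh:
# if has_spark_local_ip /path/to/spark-env.sh; then
#     echo "SPARK_LOCAL_IP is set; remove it or comment it out" >&2
# fi
```

Lines that are commented out with # are ignored, so a disabled setting does not trigger the warning.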
For all other commands, pass the appropriate IP address. This is more effort, but it just works:
- dse spark-submit --deploy-mode client --conf spark.driver.host=<routable ip>
- dse spark --master spark://<master ip you want>:7077
- dse spark-sql --master spark://<master ip you want>:7077
- dse spark-submit --deploy-mode cluster # don't have to pass spark.driver.host with --deploy-mode cluster
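Because the client-mode form is the one that needs the extra option, it can be convenient to wrap it so the driver host is never forgotten. The following is a minimal sketch, not an official DSE tool: the function name submit_client and the DSE_BIN override variable are assumptions introduced here for illustration.

```shell
# Submit in client mode with an explicit driver host.
# $1 is the routable IP of the machine running the driver; all remaining
# arguments are passed through to spark-submit unchanged.
# DSE_BIN is a hypothetical override for the dse executable and defaults to
# "dse"; pointing it at "echo" gives a dry run that only prints the command.
submit_client() {
    driver_host="$1"
    shift
    "${DSE_BIN:-dse}" spark-submit --deploy-mode client \
        --conf "spark.driver.host=${driver_host}" "$@"
}

# Example usage (the IP and the jar name are placeholders):
# submit_client 10.1.1.7 my-app.jar
```

Running it once with DSE_BIN=echo shows the command that would be executed without submitting anything.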
Pending. Once the Spark project has fixed SPARK-12963, a later release of DSE can incorporate the fix.