Summary
Spark jobs can be submitted in "cluster" mode or "client" mode. The former launches the driver on one of the cluster nodes; the latter launches the driver on the local machine from which the job is submitted.
When submitting in cluster mode, a class-not-found error can occur if the relevant jar files are not accessible to the driver. This note works through an example showing how to make them available.
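To illustrate the difference between the two modes, here is a minimal sketch of each (the master URL, class name, and jar are placeholders based on the example later in this note, and the ports assume a standalone cluster with default settings):
# Client mode (the default): the driver runs on the submitting machine
$ dse spark-submit --deploy-mode client --master spark://10.1.2.3:7077 \
--class "com.test.example" spark-job-1.0.jar
# Cluster mode: the driver is launched on one of the worker nodes, so any
# jars the driver needs must be reachable from that node (6066 is the REST
# endpoint typically used for cluster-mode submission on standalone clusters)
$ dse spark-submit --deploy-mode cluster --master spark://10.1.2.3:6066 \
--class "com.test.example" spark-job-1.0.jar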
Symptoms
The following is typical of the error that might be seen:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$
Cause
Although the jar files were made available to all nodes in the cluster (e.g. via an NFS share), the --driver-class-path option was not included, so the driver launched in cluster mode could not find the classes on its classpath.
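Before changing the submit command, it can be worth confirming that the missing class really is packaged in one of the shared jars. A quick check along these lines (the path matches the jar list used in the solution below):
# The grep should match entries such as
# org/apache/spark/streaming/kafka/KafkaUtils$.class
$ jar tf /home/bob/spark_job/lib/spark-streaming-kafka_2.10-1.4.1.jar | grep KafkaUtils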
Solution
The following is an example of what was used to resolve the issue:
$ sudo -u cassandra dse spark-submit -v --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --jars $JARS \
--files $INPUT_PATH \
--executor-memory 512M \
--total-executor-cores 2 \
--deploy-mode "cluster" \
--master spark://10.1.2.3:6066 \
--supervise \
--driver-class-path $JARS_COLON_SEP \
--class "com.test.example" $APP_JAR "$INPUT_PATH"
Note that spark-submit options such as --files must appear before the application jar; anything after the jar is passed as an argument to the application itself.
The environment variables referenced above were set as follows:
JARS=/home/bob/spark_job/lib/nscala-time_2.10-2.0.0.jar,/home/bob/spark_job/lib/kafka_2.10-0.8.2.1.jar,/home/bob/spark_job/lib/kafka-clients-0.8.2.1.jar,/home/bob/spark_job/lib/spark-streaming-kafka_2.10-1.4.1.jar,/home/bob/spark_job/lib/zkclient-0.3.jar,/home/bob/spark_job/lib/protobuf-java-2.4.0a.jar
JARS_COLON_SEP=/home/bob/spark_job/lib/nscala-time_2.10-2.0.0.jar:/home/bob/spark_job/lib/kafka_2.10-0.8.2.1.jar:/home/bob/spark_job/lib/kafka-clients-0.8.2.1.jar:/home/bob/spark_job/lib/spark-streaming-kafka_2.10-1.4.1.jar:/home/bob/spark_job/lib/zkclient-0.3.jar:/home/bob/spark_job/lib/protobuf-java-2.4.0a.jar
APP_JAR=spark-job-1.0.jar
INPUT_PATH=test.json
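Maintaining two hand-written copies of the same jar list invites drift between them. A sketch of deriving both from the library directory (this assumes the directory contains exactly the jars required, no more and no fewer):
# Build the comma- and colon-separated jar lists from the directory contents
LIB_DIR=/home/bob/spark_job/lib
JARS=$(ls $LIB_DIR/*.jar | paste -sd, -)
JARS_COLON_SEP=$(ls $LIB_DIR/*.jar | paste -sd: -)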
Further info
The following page outlines the available Spark configuration options:
https://spark.apache.org/docs/1.4.1/configuration.html
The following links discuss examples of this usage. The main distinction is that if the jar files need to be on the driver's system classpath, the --driver-class-path option is required; a short sketch follows the links below.
https://issues.apache.org/jira/browse/SPARK-9384
https://forums.databricks.com/questions/706/how-can-i-attach-a-jar-library-to-the-cluster-that.html
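As a rough rule of thumb (a sketch, not an exhaustive treatment): --jars distributes the jars and places them on the executor classpath, but in standalone cluster mode they may not end up on the driver's own classpath (the issue discussed in SPARK-9384 above), hence the explicit --driver-class-path:
# Ship jars to the executors (comma-separated list) and also put them on the
# driver's classpath (colon-separated list)
$ dse spark-submit --deploy-mode cluster \
--jars $JARS \
--driver-class-path $JARS_COLON_SEP \
--class "com.test.example" $APP_JAR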