
Spark submit fails with "class not found" when deploying in cluster mode

Summary

Spark jobs can be submitted in "cluster" mode or "client" mode. In cluster mode the driver is launched on one of the cluster's worker nodes; in client mode the driver is launched on the local machine where spark-submit is invoked.
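
As a minimal sketch of the difference (the master URL, ports and my-app.jar below are placeholders, not values from the original job):

$ dse spark-submit --deploy-mode client --master spark://10.1.2.3:7077 my-app.jar   # driver runs on the local machine
$ dse spark-submit --deploy-mode cluster --master spark://10.1.2.3:6066 my-app.jar  # driver runs on a worker node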

When submitting in cluster mode, a "class not found" error can occur if the relevant jar files are not accessible to the driver. This article works through an example showing how to make them available.

Symptoms

The following stack trace is typical of the error that might be seen:

Exception in thread "main" java.lang.reflect.InvocationTargetException 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
    at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$

Cause

Although the jar files were made available to all nodes in the cluster (e.g. via an NFS share), the --driver-class-path option was not included, so the jars were not on the class path of the driver when it was launched on the worker node.
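
A sketch of the kind of submission that triggers the error (a reconstruction: it is the working command from the Solution below minus --driver-class-path). The jars are shipped with --jars, but nothing puts them on the class path of the driver launched on the worker node:

$ sudo -u cassandra dse spark-submit \
--jars $JARS \
--deploy-mode "cluster" \
--master spark://10.1.2.3:6066 \
--class "com.test.example" $APP_JAR "$INPUT_PATH"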

Solution

The following command was used to resolve the issue:

$ sudo -u cassandra dse spark-submit -v \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--jars $JARS \
--executor-memory 512M \
--total-executor-cores 2 \
--deploy-mode "cluster" \
--master spark://10.1.2.3:6066 \
--supervise \
--driver-class-path $JARS_COLON_SEP \
--files $INPUT_PATH \
--class "com.test.example" $APP_JAR "$INPUT_PATH"

The environment variables referenced above were set as follows:

JARS=/home/bob/spark_job/lib/nscala-time_2.10-2.0.0.jar,/home/bob/spark_job/lib/kafka_2.10-0.8.2.1.jar,/home/bob/spark_job/lib/kafka-clients-0.8.2.1.jar,/home/bob/spark_job/lib/spark-streaming-kafka_2.10-1.4.1.jar,/home/bob/spark_job/lib/zkclient-0.3.jar,/home/bob/spark_job/lib/protobuf-java-2.4.0a.jar
JARS_COLON_SEP=/home/bob/spark_job/lib/nscala-time_2.10-2.0.0.jar:/home/bob/spark_job/lib/kafka_2.10-0.8.2.1.jar:/home/bob/spark_job/lib/kafka-clients-0.8.2.1.jar:/home/bob/spark_job/lib/spark-streaming-kafka_2.10-1.4.1.jar:/home/bob/spark_job/lib/zkclient-0.3.jar:/home/bob/spark_job/lib/protobuf-java-2.4.0a.jar
APP_JAR=spark-job-1.0.jar
INPUT_PATH=test.json
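
Maintaining the comma-separated and colon-separated lists by hand is error-prone since they must stay in sync. A sketch of a helper, assuming all dependency jars live under the job's lib directory, that derives both lists from the same set of files:

# Sketch only: build the comma-separated list for --jars and the
# colon-separated list for --driver-class-path from one directory.
LIB_DIR=/home/bob/spark_job/lib
JARS=$(ls "$LIB_DIR"/*.jar | paste -sd, -)
JARS_COLON_SEP=$(ls "$LIB_DIR"/*.jar | paste -sd: -)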

Further info

The Spark 1.4.1 configuration documentation outlines the available options:

https://spark.apache.org/docs/1.4.1/configuration.html

The following links discuss examples of using these options (see also the sketch after the links). The main distinction is that when the jar files need to be on the system class path, the --driver-class-path option is required:

https://issues.apache.org/jira/browse/SPARK-9384

https://forums.databricks.com/questions/706/how-can-i-attach-a-jar-library-to-the-cluster-that.html
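
As a sketch of that distinction (the paths, jar names and main class below are placeholders): --jars ships the jars to the cluster with the job, while --driver-class-path additionally puts them on the driver JVM's class path, which is what a cluster-mode driver needs:

# Placeholder jars on a share visible to all nodes; com.example.Main is hypothetical.
$ dse spark-submit \
--deploy-mode cluster \
--master spark://10.1.2.3:6066 \
--jars /nfs/lib/dep1.jar,/nfs/lib/dep2.jar \
--driver-class-path /nfs/lib/dep1.jar:/nfs/lib/dep2.jar \
--class com.example.Main my-app.jar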
