Overview
Currently, DataStax Studio 2.0 does not support running Spark jobs. Apache Zeppelin is a web-based notebook, similar to DataStax Studio, that does support Spark. This article shows two ways to use Apache Zeppelin with DSE Spark.
Option 1
Using the binaries found on the Apache Zeppelin download website
- Download Apache Zeppelin and install it
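A minimal sketch of this step, assuming a tarball install into the current directory; the version shown is only illustrative, so use the build listed on the Zeppelin download page:
wget https://archive.apache.org/dist/zeppelin/zeppelin-0.7.3/zeppelin-0.7.3-bin-all.tgz
tar xzf zeppelin-0.7.3-bin-all.tgz
export ZEPPELIN_HOME=$PWD/zeppelin-0.7.3-bin-all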
- Copy $ZEPPELIN_HOME/conf/zeppelin-env.sh.template to $ZEPPELIN_HOME/conf/zeppelin-env.sh
- Edit $ZEPPELIN_HOME/conf/zeppelin-env.sh and add:
export MASTER=spark://<spark_DSE_master_IP>:7077
export JAVA_HOME=<path_to_java>
export DSE_HOME=<path_to_dse_home>
- Package install: export DSE_HOME=/usr
- Tarball install: export DSE_HOME=<dse_install_location>
export SPARK_HOME=<path_to_spark_home>
- Package install: export SPARK_HOME=/usr/share/dse/spark
- Tarball install: export SPARK_HOME=$DSE_HOME/resources/spark
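For reference, a complete zeppelin-env.sh sketch for a package install; the master IP and Java path are placeholders, so adjust them to your environment:
export MASTER=spark://10.10.1.1:7077
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export DSE_HOME=/usr
export SPARK_HOME=/usr/share/dse/spark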
- Copy $ZEPPELIN_HOME/conf/zeppelin-site.xml.template to $ZEPPELIN_HOME/conf/zeppelin-site.xml
- Edit $ZEPPELIN_HOME/bin/interpreter.sh and change the line
SPARK_SUBMIT="${SPARK_HOME}/bin/spark-submit"
to
SPARK_SUBMIT="${DSE_HOME}/bin/dse spark-submit"
- If using security, you will need to pass credentials to the spark-submit command here (see the sketch below)
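One possible form, assuming the dse command accepts its usual -u/-p authentication options in front of the subcommand (verify the exact syntax against the documentation for your DSE version):
SPARK_SUBMIT="${DSE_HOME}/bin/dse -u <dse_username> -p <dse_password> spark-submit"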
- Start Zeppelin
$ZEPPELIN_HOME/bin/zeppelin-daemon.sh start
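If the UI does not come up, a quick check, assuming the default log location under $ZEPPELIN_HOME/logs:
$ZEPPELIN_HOME/bin/zeppelin-daemon.sh status
tail $ZEPPELIN_HOME/logs/zeppelin-*.log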
- Point your browser to the Zeppelin host: http://<zeppelin_host>:8080
- Click anonymous => interpreters
- Search for Spark => edit
- Under Dependencies add the following artifacts
- org.apache.commons:commons-csv:1.1
- com.datastax.spark:spark-cassandra-connector_2.<scala version>:<spark-cassandra-connector version required for your installed DSE>
- You can find the versions of Spark and the spark-cassandra-connector in the release notes for your DSE version.
Example: com.datastax.spark:spark-cassandra-connector_2.10:1.4.2
- You can now create a notebook and run Spark code
Simple example: println("Spark version: " + sc.version)
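Going a step further, a minimal sketch that reads a table through the spark-cassandra-connector dependency added above; the keyspace and table names are hypothetical, so substitute your own:
import com.datastax.spark.connector._

// Load a Cassandra table from the DSE cluster into an RDD
val rdd = sc.cassandraTable("my_keyspace", "my_table")

// Count the rows and print the first one (assumes the table contains data)
println("Row count: " + rdd.count)
println(rdd.first)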
Option 2
Build your own Zeppelin with the DataStax spark-cassandra-connector
Prerequisites
Building
- Add the following profile to $ZEPPELIN_HOME/spark-dependencies/pom.xml, changing the cassandra-spark- profile id and spark.version to match the Spark version in your DSE release, and the spark-cassandra-connector dependency version to match your DSE version. You can find the versions of Spark and the spark-cassandra-connector in the release notes for your DSE version.
<profile>
  <id>cassandra-spark-1.6.3</id>
  <properties>
    <spark.version>1.6.3</spark.version>
    <spark.py4j.version>0.9</spark.py4j.version>
    <akka.group>com.typesafe.akka</akka.group>
    <akka.version>2.3.11</akka.version>
    <protobuf.version>2.5.0</protobuf.version>
    <guava.version>16.0.1</guava.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>com.datastax.spark</groupId>
      <artifactId>spark-cassandra-connector_${scala.binary.version}</artifactId>
      <version>1.6.5</version>
      <exclusions>
        <exclusion>
          <groupId>org.joda</groupId>
          <artifactId>joda-convert</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</profile>
- In $ZEPPELIN_HOME, run mvn clean package, specifying the profile you just added and the Hadoop version that matches your DSE version. For example, for DSE 5.0.8:
mvn clean package -Pcassandra-spark-1.6.3 -Dhadoop.version=2.7.1 -Phadoop-2.7 -DskipTests
- Once the build completes, configure Zeppelin:
- Copy $ZEPPELIN_HOME/conf/zeppelin-env.sh.template to $ZEPPELIN_HOME/conf/zeppelin-env.sh
- Edit $ZEPPELIN_HOME/conf/zeppelin-env.sh and add:
export MASTER=spark://<spark_DSE_master_IP>:7077
export DSE_HOME=<path_to_dse_home>
- Package install: export DSE_HOME=/usr
- Tarball install: export DSE_HOME=<dse_install_location>
export SPARK_HOME=<path_to_spark_home>
- Package install: export SPARK_HOME=/usr/share/dse/spark
- Tarball install: export SPARK_HOME=$DSE_HOME/resources/spark
- Copy $ZEPPELIN_HOME/conf/zeppelin-site.xml.template to $ZEPPELIN_HOME/conf/zeppelin-site.xml
- Edit $ZEPPELIN_HOME/bin/interpreter.sh and change the line
SPARK_SUBMIT="${SPARK_HOME}/bin/spark-submit"
to
SPARK_SUBMIT="${DSE_HOME}/bin/dse spark-submit"
- If using security, you will need to pass credentials to the spark-submit command here, as in Option 1
- Start Zeppelin
$ZEPPELIN_HOME/bin/zeppelin-daemon.sh start
- Point your browser to the Zeppelin host: http://<zeppelin_host>:8080
- You can now create a notebook and run Spark code (see the example in Option 1)