This article aims to clarify the following documentation topic:
Running Spark commands against a remote cluster
Normally, the Spark shell is expected to be launched from one of the DSE Analytics nodes as follows:
$ dse spark [-u username -p password]
However, this assumes the user has an actual Unix login on the node, which in some cases is not allowed.
In that case, the end user can run the Spark shell from a client machine, provided there is unrestricted network access to and from the DSE Analytics nodes. If security is enabled, the user still needs to authenticate.
First, the DSE nodes need to be configured as Spark+Hadoop nodes, since Hadoop mode is required to use the Cassandra File System (CFS), as shown below.
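As a sketch (the flags can vary by DSE version, so verify them against the documentation for your release), a tarball-installed node can be started with both Spark and the DSE Hadoop trackers enabled:
$ dse cassandra -k -t
On a package install, the same is typically done by setting the following in /etc/default/dse before starting the service:
SPARK_ENABLED=1
HADOOP_ENABLED=1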
The client machine requires the following:
- Run a Linux-based or other Unix-like OS (the example below uses macOS)
- Have Java installed - the same version as the nodes, if possible
- Install the exact same DSE version running on the nodes
- Configure DSE security just as if it were an actual node (see the sketch below)
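For example, if internal authentication is enabled, one way to supply credentials to the DSE tools without typing them on every command is a ~/.dserc file in the home directory on the client machine (the user name and password below are placeholders):
username=myuser
password=mypassword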
Depending on the type of install on your client machine, rename the original Hadoop configuration directory:
mv ${DSE_HOME}/resources/hadoop/conf ${DSE_HOME}/resources/hadoop/conf.original
OR
mv /etc/dse/hadoop /etc/dse/hadoop.original
Collect the DSE Hadoop configuration from the nodes and copy it to your client machine in place of the directory you just renamed.
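As an illustration, assuming a tarball install on both the node and the client machine (the remote user and paths below are placeholders; for a package install the destination would be /etc/dse/hadoop instead):
$ scp -r admin@10.0.0.2:/path/to/dse/resources/hadoop/conf ${DSE_HOME}/resources/hadoop/conf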
Test that dse commands work. In the following example, the Spark+Hadoop node is at 10.0.0.2 and the client machine is at 10.0.0.8:
iMac:dse Pepe$ ls -la /tmp/*.txt
-rw-r--r-- 1 Pepe wheel 38 Dec 18 15:25 /tmp/text.txt
iMac:dse Pepe$ bin/dse hadoop fs -put /tmp/text.txt /
iMac:dse Pepe$ bin/dse hadoop fs -ls /
Found 3 items
-rwxrwxrwx 1 Pepe staff 38 2015-12-18 18:43 /text.txt
drwxrwxrwx - josemartinezpoblete staff 0 2015-12-14 11:46 /tmp
drwxrwxrwx - josemartinezpoblete staff 0 2015-12-14 11:46 /user
iMac:dse Pepe$ bin/dse spark --conf spark.driver.host=10.0.0.8 --master spark://10.0.0.2:7077 [-u username -p password]
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.2
      /_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_75)
Type in expressions to have them evaluated.
Type :help for more information.
Creating SparkContext...
Initializing SparkContext with MASTER: spark://10.0.0.2:7077
Created spark context..
Spark context available as sc.
Hive context available as hc.
CassandraSQLContext available as csc.
scala>
Rather than the client machine's filesystem, CFS should be used for file processing.
Operations such as these, which reference the client machine's local filesystem, will fail:
scala> val inputFile = sc.textFile("file:///my/local/machine/file.txt")
scala> rdd1.saveAsTextFile("file:///my/local/file.txt");
First copy your input file from your client machine to CFS (with dse hadoop fs -put, as shown above), then read it and write results back to CFS:
scala> val inputFile = sc.textFile("/my/CFS/file.txt")
scala> rdd1.saveAsTextFile("/my/CFS/output")
Then copy your file back to your local machine using:
$ bin/dse hadoop fs -get <CFS file> <Local file>
Alternatively, you could save results to your client machine's filesystem using Java I/O utilities, but not with the RDD method saveAsTextFile:
import java.io._
// Pull the RDD contents back to the driver one partition at a time
// and write them to a file on the client machine's local filesystem
val pw = new PrintWriter(new File("LocalTextFile"))
for (line <- sc.parallelize(1 to 100000).map(num => s"$num::Line").toLocalIterator) { pw.println(line) }
pw.close()
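Note that toLocalIterator brings the data back to the driver one partition at a time, so this approach is only practical when the results fit comfortably on the client machine; for larger outputs, save to CFS and copy the files back with dse hadoop fs -get as shown above.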