Connecting to DSE Spark Hadoop from a client machine

This article aims to clarify the following document:

Running Spark commands against a remote cluster

Normally, the Spark shell is assumed to be executed from one of the Analytic nodes, as follows:

$ dse spark [-u username -p password]

However, this assumes the user has an actual Unix login on the node, which in some cases is not allowed.

In that case, the end user can use a client machine, provided there is unrestricted network access to and from the DSE Analytic nodes. If security is enabled, the user still needs to authenticate.

First, the DSE nodes need to be configured as Spark+Hadoop nodes (Hadoop mode is needed to use CFS).

The client machine requires the following:

  • Run a Linux-based OS
  • Have Java installed - the same version as the nodes, if possible
  • Install the exact same DSE version running on the nodes
  • Configure DSE security just as if it were an actual node

Depending on the type of install on your client machine, rename the Hadoop configuration directory. Use the first command for a tarball install, or the second for a package install:

mv ${DSE_HOME}/resources/hadoop/conf ${DSE_HOME}/resources/hadoop/conf.original
mv /etc/dse/hadoop /etc/dse/hadoop.original
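
The rename step can also be wrapped so that it is safe to run more than once. The following is a minimal sketch, not part of DSE: the function name backup_hadoop_conf is hypothetical, and it only renames a directory that exists and has not already been backed up.

```shell
#!/bin/sh
# Rename a Hadoop configuration directory to <dir>.original.
# Returns 0 if the rename happened, 1 if there was nothing to do
# (directory missing, or a .original backup already present).
backup_hadoop_conf() {
    conf_dir="$1"
    if [ -d "$conf_dir" ] && [ ! -e "$conf_dir.original" ]; then
        mv "$conf_dir" "$conf_dir.original"
        echo "renamed $conf_dir to $conf_dir.original"
        return 0
    fi
    return 1
}
```

For example, backup_hadoop_conf /etc/dse/hadoop on a package install, or backup_hadoop_conf "${DSE_HOME}/resources/hadoop/conf" on a tarball install.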

Collect the DSE Hadoop configuration directory from one of the nodes and copy it to the same location on your client machine, in place of the directory you just renamed.

Test that the dse commands work from the client machine. In the following example, the Spark+Hadoop node is at <node address> and the client machine is <client address>:

iMac:dse Pepe$ ls -la /tmp/*.txt
-rw-r--r--  1 Pepe  wheel  38 Dec 18 15:25 /tmp/text.txt

iMac:dse Pepe$ bin/dse hadoop fs -put  /tmp/text.txt /

iMac:dse Pepe$ bin/dse hadoop fs -ls /
Found 3 items
-rwxrwxrwx   1 Pepe                staff         38 2015-12-18 18:43 /text.txt
drwxrwxrwx   - josemartinezpoblete staff          0 2015-12-14 11:46 /tmp
drwxrwxrwx   - josemartinezpoblete staff          0 2015-12-14 11:46 /user

iMac:dse Pepe$ bin/dse spark --conf  --master spark:// [-u username -p password]
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.2

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_75)
Type in expressions to have them evaluated.
Type :help for more information.
Creating SparkContext...
Initializing SparkContext with MASTER: spark://
Created spark context..
Spark context available as sc.
Hive context available as hc.
CassandraSQLContext available as csc.
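
Before launching the shell, it can help to assemble the command line and review it. The sketch below only prints a dse spark command and does not run it. The function name build_spark_cmd and the variables DSE_USER and DSE_PASS are hypothetical, the default port 7077 is an assumption (the standard Spark master port), and the --conf option shown above is left out because its value depends on your setup.

```shell
#!/bin/sh
# Print a "dse spark" command line for a remote Spark master.
#   $1 = master host (supplied by the caller)
#   $2 = master port (assumed to be 7077 when omitted)
# DSE_USER and DSE_PASS are hypothetical variable names for the
# optional credentials; they are appended only when both are set.
build_spark_cmd() {
    master_host="$1"
    master_port="${2:-7077}"
    cmd="bin/dse spark --master spark://${master_host}:${master_port}"
    if [ -n "${DSE_USER:-}" ] && [ -n "${DSE_PASS:-}" ]; then
        cmd="$cmd -u $DSE_USER -p $DSE_PASS"
    fi
    printf '%s\n' "$cmd"
}
```

Calling build_spark_cmd with the address of your Spark master prints the command to run. Note that credentials passed on the command line are visible to other users of the client machine in the process list.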


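The Cassandra File System (CFS), rather than the client machine's filesystem, should be used for file processing. The Spark executors run on the cluster nodes and cannot see files that exist only on the client machine.

Operations such as these will fail: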

scala> val inputFile = sc.textFile("file:///my/local/machine/file.txt")
scala> rdd1.saveAsTextFile("file:///my/local/file.txt");

Instead, first copy your file from the client machine to CFS, then load and save using CFS paths:

scala> val inputFile = sc.textFile("/my/CFS/file.txt")
scala> rdd1.saveAsTextFile("/my/CFS/file.txt");

Then copy the result from CFS back to your client machine using:

$ bin/dse hadoop fs -get <CFS file> <Local file>
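
The copy-in and copy-out steps can be wrapped in two small helpers. This is a sketch under stated assumptions: the function names cfs_put and cfs_get and the DSE_BIN variable are hypothetical, and the only commands used are the dse hadoop fs -put and -get calls shown in this article.

```shell
#!/bin/sh
# Path to the dse launcher. bin/dse matches the examples in this
# article; set DSE_BIN to override it for your install.
DSE_BIN="${DSE_BIN:-bin/dse}"

# Copy a local file ($1) to a CFS path ($2).
cfs_put() {
    "$DSE_BIN" hadoop fs -put "$1" "$2"
}

# Copy a CFS path ($1) back to the local filesystem ($2).
cfs_get() {
    "$DSE_BIN" hadoop fs -get "$1" "$2"
}
```

A typical round trip would be cfs_put /tmp/text.txt / before starting the Spark shell, followed by cfs_get with the CFS output path and a local destination once the job has finished.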

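Alternatively, you can write to the client machine's filesystem using Java I/O utilities instead of the RDD method saveAsTextFile. This works because toLocalIterator brings the data back to the driver, which runs on the client machine:

import java.io.{File, PrintWriter}
val pw = new PrintWriter(new File("LocalTextFile"))
for (line <- sc.parallelize(1 to 100000).map(num => s"$num::Line").toLocalIterator) { pw.println(line) }
pw.close()  // flush buffered lines and release the file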
