This article provides instructions on how to store files on the Cassandra filesystem and get the full path. It is intended for users who have not used CFS in the past.
The Cassandra file system (CFS) is provided with DataStax Enterprise as a replacement for the Hadoop distributed filesystem (HDFS). It is more fault-tolerant than HDFS since files are stored in Cassandra just like any normal application data.
CFS makes files shared and accessible on every node in a given datacentre where the CFS data is replicated.
For new users, CFS can be a bit daunting, so this article aims to make it simpler to understand how to work with files on CFS.
Since CFS data is stored in Cassandra, increase the replication factor of the CFS keyspaces to at least 3 to take advantage of Cassandra's replication and fault tolerance.
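As a sketch, the replication factor can be raised from cqlsh and followed by a repair so existing data reaches the new replicas. The keyspace name cfs and datacentre name DC1 below are assumptions; adjust both to match your cluster.

```shell
# Raise the replication factor of the CFS keyspace to 3.
# Keyspace name "cfs" and datacentre name "DC1" are assumptions;
# adjust them to match your cluster.
cqlsh -e "ALTER KEYSPACE cfs
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"

# Repair on each node so existing CFS data is streamed to the new replicas.
nodetool repair cfs
```

Run the repair on every node in the datacentre, not just one.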
Placing files on CFS
Follow these steps to create a directory and copy files onto CFS.
Step A1 - By default, a directory will already exist on CFS for the Unix user. For example:
$ whoami
automaton
$ bin/dse hadoop fs -ls /
Found 1 items
drwxrwxrwx   - automaton automaton          0 2016-04-02 02:59 /user
$ bin/dse hadoop fs -ls /user
Found 1 items
drwxrwxrwx   - automaton automaton          0 2016-04-02 02:59 /user/automaton
You can choose to use this directory or create a new one.
Step A2 - Create a CFS directory as follows:
$ bin/dse hadoop fs -mkdir /myCFSdir
To check the directories in the root CFS:
$ bin/dse hadoop fs -ls /
Found 2 items
drwxrwxrwx   - automaton automaton          0 2016-04-02 03:32 /myCFSdir
drwxrwxrwx   - automaton automaton          0 2016-04-02 02:59 /user
Step A3 - In this example, we want to place a sample JAR onto CFS:
$ ls -lh demos/spark/spark-10-day-loss.jar
-rw-rw-r-- 1 automaton automaton 49K Apr  1 03:00 demos/spark/spark-10-day-loss.jar
We place it onto CFS using the -copyFromLocal command:
$ bin/dse hadoop fs -copyFromLocal demos/spark/spark-10-day-loss.jar /myCFSdir
To check the contents:
$ bin/dse hadoop fs -ls /myCFSdir
Found 1 items
-rwxrwxrwx   1 automaton automaton      49803 2016-04-02 03:44 /myCFSdir/spark-10-day-loss.jar
Step A4 - For help on other filesystem commands, run:
$ bin/dse hadoop fs -help
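The reverse operations work the same way. As a sketch, -copyToLocal retrieves a file from CFS and -rm deletes it; the local destination path /tmp/spark-10-day-loss.jar below is just an example.

```shell
# Copy the file from CFS back to the local filesystem.
# /tmp/spark-10-day-loss.jar is an arbitrary local destination.
bin/dse hadoop fs -copyToLocal /myCFSdir/spark-10-day-loss.jar /tmp/spark-10-day-loss.jar

# Remove a file from CFS when it is no longer needed.
bin/dse hadoop fs -rm /myCFSdir/spark-10-day-loss.jar
```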
Get CFS path
In CFS, the file URI is prefixed with cfs: instead of hdfs:.
Using the example above, we can get the full CFS path with the dsetool checkcfs command.
Step B1 - Generate the CFS path for /myCFSdir/spark-10-day-loss.jar as follows:
$ bin/dsetool checkcfs /myCFSdir/spark-10-day-loss.jar
Path: cfs://10.1.2.3/myCFSdir/spark-10-day-loss.jar
  INode header:
    File type: FILE
    User: automaton
    Group: automaton
    Permissions: rwxrwxrwx (777)
    Block size: 67108864
    Compressed: true
    First save: true
    Modification time: Sat Apr 02 03:44:36 UTC 2016
  INode:
    Block count: 1
    Blocks:                                   subblocks   length   start     end
      (B) 3865c280-f885-11e5-bd7e-8bf6f52da1d7:      1    49803       0    49803
          386637b0-f885-11e5-bd7e-8bf6f52da1d7:           49803       0    49803
  Block locations:
    3865c280-f885-11e5-bd7e-8bf6f52da1d7: [10.1.2.3, 10.1.2.4, 10.1.2.5]
  Data:
    All data blocks ok.
The output above shows that the file is replicated to 3 nodes:
Block locations:
  3865c280-f885-11e5-bd7e-8bf6f52da1d7: [10.1.2.3, 10.1.2.4, 10.1.2.5]
and that all replicas are consistent:
Data: All data blocks ok.
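Once you have the full cfs:// URI, it can be passed to tools that accept an HDFS-style URL, since it resolves from any node in the datacentre. The sketch below uses the host address 10.1.2.3 reported by checkcfs, and the class name com.example.Demo is purely illustrative.

```shell
# The cfs:// URI is readable from any node in the datacentre,
# so it can be used wherever an HDFS-style URL is accepted.
bin/dse hadoop fs -ls cfs://10.1.2.3/myCFSdir

# For example, submitting the JAR stored on CFS to Spark.
# The class name com.example.Demo is a hypothetical placeholder.
bin/dse spark-submit --class com.example.Demo \
  cfs://10.1.2.3/myCFSdir/spark-10-day-loss.jar
```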
Additional resources
DSE doc - About the Cassandra File System (CFS)
DSE doc - Hadoop getting started tutorial
DataStax blog - Cassandra File System Design