FAQ - How to store files on CFS and get the correct path

Overview

This article explains how to store files on the Cassandra File System (CFS) and obtain their full CFS path. It is intended for users who are new to CFS.

Background

The Cassandra File System (CFS) ships with DataStax Enterprise (DSE) as a replacement for the Hadoop Distributed File System (HDFS). It is more fault tolerant than HDFS because files are stored in Cassandra just like any other application data.

CFS makes files shared and accessible from every node in any datacenter where the CFS keyspaces are replicated.

Working with CFS can be daunting for new users, so this article aims to make it easier to understand how to manage files on CFS.

Prerequisites

Since CFS data is stored in Cassandra, increase the replication factor of the CFS keyspaces to at least 3 to take advantage of Cassandra's replication and fault tolerance.
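
CFS data lives in the cfs and cfs_archive keyspaces. As a minimal sketch, assuming a datacenter named DC1 (substitute your own datacenter name), the replication factor can be raised from the shell with cqlsh:

$ cqlsh -e "ALTER KEYSPACE cfs WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"
$ cqlsh -e "ALTER KEYSPACE cfs_archive WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"

As with any replication change, run nodetool repair on the affected keyspaces afterwards so that existing data is copied to the new replicas. See the documentation linked at the end of this article for the full procedure.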

Placing files on CFS

Follow these steps to create a directory and copy files onto CFS.

Step A1 - By default, a home directory already exists on CFS for the current Unix user. For example:

$ whoami
automaton
$ bin/dse hadoop fs -ls /
Found 1 items
drwxrwxrwx   - automaton automaton          0 2016-04-02 02:59 /user
$ bin/dse hadoop fs -ls /user
Found 1 items
drwxrwxrwx   - automaton automaton          0 2016-04-02 02:59 /user/automaton

You can choose to use this directory or create a new one.

Step A2 - Create a CFS directory as follows:

$ bin/dse hadoop fs -mkdir /myCFSdir

To list the directories in the CFS root:

$ bin/dse hadoop fs -ls /
Found 2 items
drwxrwxrwx   - automaton automaton          0 2016-04-02 03:32 /myCFSdir
drwxrwxrwx   - automaton automaton          0 2016-04-02 02:59 /user

Step A3 - In this example, we want to place a sample JAR onto CFS:

$ ls -lh demos/spark/spark-10-day-loss.jar
-rw-rw-r-- 1 automaton automaton 49K Apr  1 03:00 demos/spark/spark-10-day-loss.jar

We place it onto CFS using the -copyFromLocal switch:

$ bin/dse hadoop fs -copyFromLocal demos/spark/spark-10-day-loss.jar /myCFSdir

To check the contents:

$ bin/dse hadoop fs -ls /myCFSdir
Found 1 items
-rwxrwxrwx   1 automaton automaton      49803 2016-04-02 03:44 /myCFSdir/spark-10-day-loss.jar
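
To sanity-check the round trip, the file can also be copied back out of CFS with the -copyToLocal switch (the /tmp destination here is just an example):

$ bin/dse hadoop fs -copyToLocal /myCFSdir/spark-10-day-loss.jar /tmp/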

Step A4 - For help on other filesystem commands, run:

$ bin/dse hadoop fs -help
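
To see the usage of a single command rather than the full listing, pass the command name to -help. For example:

$ bin/dse hadoop fs -help copyFromLocal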

Get CFS path

In CFS, URIs are prefixed with the cfs:// scheme.

Using the example above, we can obtain the full CFS path with the dsetool checkcfs command.

Step B1 - Retrieve the CFS path for /myCFSdir/spark-10-day-loss.jar as follows:

$ bin/dsetool checkcfs /myCFSdir/spark-10-day-loss.jar
Path: cfs://10.1.2.3/myCFSdir/spark-10-day-loss.jar
  INode header:
    File type: FILE
    User: automaton
    Group: automaton
    Permissions: rwxrwxrwx (777)
    Block size: 67108864
    Compressed: true
    First save: true
    Modification time: Sat Apr 02 03:44:36 UTC 2016
  INode:
    Block count: 1
    Blocks:                               subblocks     length         start           end
      (B) 3865c280-f885-11e5-bd7e-8bf6f52da1d7:   1      49803             0         49803
          386637b0-f885-11e5-bd7e-8bf6f52da1d7:          49803             0         49803
  Block locations:
    3865c280-f885-11e5-bd7e-8bf6f52da1d7: [10.1.2.3, 10.1.2.4, 10.1.2.5]
  Data:
    All data blocks ok.

The output above shows that the file's block is replicated to 3 nodes:

  Block locations:
    3865c280-f885-11e5-bd7e-8bf6f52da1d7: [10.1.2.3, 10.1.2.4, 10.1.2.5]

and that all replicas are consistent:

  Data:
    All data blocks ok.
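
Once the full cfs:// path is known, it can be used anywhere a Hadoop filesystem URI is accepted, for example when a job needs a JAR that is shared across the cluster. As a quick check, listing the file by its fully qualified URI should return the same entry as the path-only form (assuming the node is running DSE Analytics, so the cfs:// scheme is registered):

$ bin/dse hadoop fs -ls cfs://10.1.2.3/myCFSdir/spark-10-day-loss.jar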

See also

DSE doc - About the Cassandra File System (CFS)

DSE doc - Hadoop getting started tutorial

DSE doc - Setting the replication factor for CFS keyspaces

DataStax blog - Cassandra File System Design
