Overview
This article provides instructions on how to connect from R to a DataStax Enterprise cluster.
Applies to
- DataStax Enterprise 5.0.x
Background
With the growing popularity of the R programming language, there is increasing demand from the field to use R to query data in Cassandra.
DataStax Enterprise 5.1 now supports SparkR for R analytic processing. However, for customers who have not yet upgraded to DSE 5.1, this article provides an alternative way to connect R to a DSE 5.0 cluster.
Prerequisites
- a running DSE Analytics cluster with Spark enabled
- Spark SQL Thrift Server running on one of the DSE Analytics nodes
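If the Spark SQL Thrift Server is not yet running, it can be started on an analytics node with the dse command. This is a sketch only; it assumes the dse binary is on the PATH and that the node was already started in analytics (Spark) mode:

```shell
# Start the Spark SQL Thrift Server on a DSE Analytics node
# (the node must already be running in analytics mode, e.g. "dse cassandra -k")
dse spark-sql-thriftserver start
```

By default the Thrift Server listens on port 10000; this is the port used in the connection URL later in this article.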
Procedure
Follow these steps to set up your R environment.
Step 1 - Download the Simba JDBC Driver from the DataStax Drivers Download page and unpack it on the machine where R is installed.
Step 2 - Add SparkJDBC41.jar and the rest of the JAR files (from step 1) to the classpath.
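As a sketch of step 2, the driver JARs could be added to the classpath via an environment variable before launching R; the unpack directory below is a hypothetical path, so substitute the location used in step 1:

```shell
# Hypothetical directory where the Simba driver package was unpacked
DRIVER_DIR=/opt/simba/spark-jdbc

# Build a colon-separated classpath from every JAR in the driver directory
# (assumes the JAR file names contain no spaces) and export it before starting R
export CLASSPATH="$(echo "$DRIVER_DIR"/*.jar | tr ' ' ':')"
```

Alternatively, the classpath can be passed directly to the RJDBC JDBC() call, as shown in the connection example later in this article.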
Step 3 - Add the JDBC driver class com.simba.spark.jdbc41.Driver.
Step 4 - Set the connection URL using the host and port (default 10000) of the node where the Spark SQL Thrift Server is running in your DSE cluster, for example jdbc:spark://10.1.2.3:10000.
Here is an example of a connection configured in R:
library(RJDBC)

# Load the Simba Spark JDBC driver, supplying the driver class and JAR path
drv <- JDBC("com.simba.spark.jdbc41.Driver", "/path/to/SparkJDBC41.jar")

# Connect to the Spark SQL Thrift Server
conn <- dbConnect(drv, "jdbc:spark://10.1.2.3:10000", "user", "password")
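Once connected, standard DBI functions such as dbGetQuery() can be used to run Spark SQL statements against the cluster. This is a sketch only; my_keyspace.my_table is a hypothetical table name:

```r
# List the tables visible through the Spark SQL Thrift Server
dbGetQuery(conn, "SHOW TABLES")

# Query a Cassandra table through Spark SQL into an R data frame
# (my_keyspace.my_table is a placeholder; substitute your own keyspace and table)
result <- dbGetQuery(conn, "SELECT * FROM my_keyspace.my_table LIMIT 10")

# Close the connection when finished
dbDisconnect(conn)
```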
NOTE - Details of how to configure your R environment depend on the distribution installed. Consult the documentation appropriate for your R distribution.
For more information about the Simba driver, see the documentation included in the driver download in step 1.