DataStax Help Center

FAQ - Why are there different places to configure Spark Worker memory?


This article discusses the various scripts and configuration files associated with a Spark Worker memory configuration.

Applies to

NOTE - This article was specifically written for the following software versions. Although it may apply to earlier or future versions, no guarantee is given since it may change at any time.

  • DataStax Enterprise (DSE) 4.7.x
  • Apache Spark 1.2.x


Spark configuration can be a minefield for both new and experienced users.

For seasoned Spark users, the additional configuration items introduced by DSE can sometimes be confusing. For new users, the DSE configuration file appears to be yet another place to configure Spark.

In this article, we will try to explain a common Spark configuration item to make sense of the various scripts and configuration files.


John has a DSE Analytics cluster running where each worker has 8 GB of memory:

URL: spark://
Workers: 2
Cores: 16 Total, 16 Used
Memory: 16.2 GB Total, 1024.0 MB Used
Applications: 1 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

The problem is that the Spark UI shows that each of the 2 executors is only using 512 MB of memory (2 x 512 = 1024.0 MB used).

John believes he has configured the memory correctly but is baffled by the different locations where it can be done.

Files explained

spark-env.sh

This is the generic Spark startup script in which all Spark environment variables are configured. This script comes from open-source Apache Spark.

In the example scenario above, each Spark worker is configured in spark-env.sh with:

export SPARK_WORKER_MEMORY="8g"
export SPARK_WORKER_CORES="8"

NOTE - For new Spark users, it is not necessary to specify these since DSE sets them automatically based on dse.yaml (see below).


spark-defaults.conf

This is the generic Spark configuration file where Spark properties (as opposed to the environment variables in the startup script above) are configured. This file also comes from open-source Spark.

Configure Spark properties in this file to set system-wide defaults.
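For instance, a system-wide default could look like the following (the property values here are illustrative only, not recommendations):

```
spark.executor.memory     2g
spark.cores.max           8
```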


dse.yaml

This is the primary file used specifically by DataStax Enterprise for configuring the enhanced functionality that DSE brings to Cassandra and to components such as Spark, Solr and Hadoop. The key point is that it configures the enhanced parts of those software components, but does not replace their own configuration files.

Put simply, dse.yaml does not replace the configuration files for Spark. It is merely the place where the DSE-enhanced parts of Spark (i.e. enhancements not in open-source Spark) are configured.

In relation to Spark memory and core management, dse.yaml contains the following default property:

initial_spark_worker_resources: 0.7

This property is used to calculate the default memory and cores available to the Spark Worker (unless they are explicitly set in spark-env.sh above). This means that the values for SPARK_WORKER_MEMORY and SPARK_WORKER_CORES are set automatically when DSE is started in Analytics mode.
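As a rough illustration of a fraction-based default (this is a sketch, not DSE's exact formula; the function name and the Cassandra memory figure below are assumptions for the example):

```python
def default_worker_resources(fraction, system_memory_mb, cassandra_memory_mb, system_cores):
    """Sketch: derive Worker memory/cores as a fraction of what remains
    after Cassandra's share, with small minimums as a floor."""
    memory_mb = max(64, int(fraction * (system_memory_mb - cassandra_memory_mb)))
    cores = max(1, int(fraction * system_cores))
    return memory_mb, cores

# With the default fraction of 0.7 on a 16 GB node where Cassandra is
# assumed to hold 4 GB, roughly 8.4 GB and 5 of 8 cores would remain
# for the Spark Worker.
mem_mb, cores = default_worker_resources(0.7, 16384, 4096, 8)
```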


The 8 GB of available memory is the total amount that a Spark Worker can allocate to executors on a Spark node (remember that one or more executors can run per node).

The 512 MB of memory used is what the executor on each node consumed out of the available 8 GB. This is because the default memory allocation per executor is 512 MB.
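The accounting behind the Master UI figures in John's scenario can be checked with simple arithmetic (figures taken from the scenario above):

```python
# Two Workers, each able to hand out 8 GB to executors.
workers = 2
worker_memory_mb = 8 * 1024            # per-Worker allocatable memory

# The application gets one executor per Worker, each at the 512 MB default.
default_executor_memory_mb = 512

total_available_mb = workers * worker_memory_mb       # ~16 GB "Total" in the UI
total_used_mb = workers * default_executor_memory_mb  # the "1024.0 MB Used"
```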

If, for example, John wanted each executor to use 4 GB for his application, he should add the following line to spark-defaults.conf:

spark.executor.memory        4g

NOTE - This requires a DSE restart for the change to take effect.
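Alternatively, if the larger executors are only needed for a single job, standard Spark allows the same setting to be supplied at submission time, which applies only to that application and needs no restart (the application jar name below is hypothetical):

```
dse spark-submit --executor-memory 4g my-app.jar
```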

See also

DataStax doc - Configuring Spark nodes

Apache Spark doc - Spark 1.2.1 Configuration

Apache Spark doc - Spark 1.2.1 Available Properties
