DataStax Help Center

FAQ - Why are there different places to configure Spark Worker memory?

Overview

This article explains the scripts and configuration files that affect Spark Worker memory configuration, and how they relate to each other.

Applies to

NOTE - This article was written specifically for the software versions listed below. It may also apply to earlier or later versions, but no guarantee is given since the behaviour described here may change at any time.

  • DataStax Enterprise (DSE) 4.7.x
  • Apache Spark 1.2.x

Background

Spark configuration can be a minefield for both new and experienced users.

For seasoned Spark users, the additional configuration items introduced by DSE can sometimes be confusing. For new users, the DSE configuration file appears to be yet another place to configure Spark.

In this article, we will try to explain a common Spark configuration item to make sense of the various scripts and configuration files.

Scenario

John has a DSE Analytics cluster running where each worker has 8 GB of memory:

URL: spark://10.1.2.3:7077
Workers: 2
Cores: 16 Total, 16 Used
Memory: 16.2 GB Total, 1024.0 MB Used
Applications: 1 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

The problem is that the Spark UI shows each of the 2 executors using only 512 MB of memory (2 x 512 MB = 1024.0 MB used), even though each worker has 8 GB available.

John believes he has configured the memory correctly but is baffled by the different locations where it can be done.

Files explained

spark-env.sh

This is the generic Spark startup script where environment variables are configured. It comes from open-source Apache Spark.

In the example scenario above, each Spark worker is configured in spark-env.sh with:

export SPARK_WORKER_MEMORY=8192m
export SPARK_WORKER_CORES=8

NOTE - For new Spark users, it is not necessary to specify these since DSE automatically calculates them from a property in dse.yaml (see below).

spark-defaults.conf

This is a generic Spark configuration file where Spark properties (as opposed to environment variables in the startup script above) are configured. This is also from open-source Spark.

Set Spark properties in this file to establish system-wide defaults.

dse.yaml

This is the primary file used specifically by DataStax Enterprise for configuring enhanced functionality brought by DSE to Cassandra and other components such as Spark, Solr and Hadoop. The key point is that this is used to configure the enhanced parts of the software components, but does not replace the configuration files of those components.

Put simply, dse.yaml does not replace the configuration files for Spark. It is merely the place where the DSE-enhanced parts of Spark (i.e. enhancements not in open-source Spark) are configured.

In relation to Spark memory and core management, dse.yaml contains the following default property:

initial_spark_worker_resources: 0.7

This property is used for calculating the default memory and cores available to the Spark worker (unless explicitly set in spark-env.sh above). This means that the values for SPARK_WORKER_MEMORY and SPARK_WORKER_CORES get automatically set when DSE is started in Analytics mode.
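As a rough illustration of how that fraction turns into worker defaults, the sketch below assumes the calculation described in the DataStax documentation: worker memory is the fraction applied to system memory left over after Cassandra's share, and worker cores is the fraction applied to all system cores. The function name, the node sizes and the round-down behaviour are assumptions for illustration only, not DSE's actual implementation.

```python
def default_worker_resources(total_memory_mb, cassandra_memory_mb, total_cores,
                             initial_spark_worker_resources=0.7):
    """Estimate the worker defaults derived from dse.yaml (assumed formula)."""
    # Assumption: memory is scaled from what remains after Cassandra's share.
    remaining_mb = total_memory_mb - cassandra_memory_mb
    memory_mb = int(initial_spark_worker_resources * remaining_mb)
    # Assumption: cores are scaled from all cores on the node, rounded down.
    cores = int(initial_spark_worker_resources * total_cores)
    return memory_mb, cores


# Hypothetical node: 16 GB RAM, 4 GB assigned to Cassandra, 8 cores.
memory_mb, cores = default_worker_resources(16384, 4096, 8)
print(memory_mb, cores)
```

Setting SPARK_WORKER_MEMORY or SPARK_WORKER_CORES explicitly in spark-env.sh bypasses this calculation for that value.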

Resolution

The 8 GB of memory available is the total amount of memory that a Spark Worker can allocate to executors on a Spark node (remember that there could be 1 or more executors running per node).

The 512 MB of memory used is what the single executor on each node has taken out of the available 8 GB. The executor uses only 512 MB because that is the default value of spark.executor.memory in Spark 1.2.x, and John's application does not override it.
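The relationship between the worker's memory pool and the executor size can be checked with some simple arithmetic. The sketch below uses the figures from the scenario; the helper name is hypothetical, and it considers memory only (in practice the number of available cores also limits how many executors a worker can host).

```python
def executors_that_fit(worker_memory_mb, executor_memory_mb):
    """How many executors of a given size fit in a worker's memory pool."""
    return worker_memory_mb // executor_memory_mb


WORKER_MEMORY_MB = 8192     # SPARK_WORKER_MEMORY from the scenario
DEFAULT_EXECUTOR_MB = 512   # default executor memory in the scenario

# Memory reported as used: one 512 MB executor on each of the 2 workers.
used_mb = 2 * DEFAULT_EXECUTOR_MB
print(used_mb)

# Executors a single worker could host at the default size and at 4 GB each.
print(executors_that_fit(WORKER_MEMORY_MB, DEFAULT_EXECUTOR_MB))
print(executors_that_fit(WORKER_MEMORY_MB, 4096))
```

In other words, the worker setting caps what can be handed out, while the executor setting decides how much each application actually takes.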

If, for example, John wanted each executor to use 4 GB for his application, he should add the following line to spark-defaults.conf:

spark.executor.memory        4g

NOTE - This requires a DSE restart for the change to take effect.
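Before restarting, it can be useful to confirm what the file will actually request. The following is a minimal sketch of a checker, not part of DSE or Spark: the function names are hypothetical, it handles only whitespace-separated "property value" lines and the m and g suffixes, and Spark itself accepts more forms than this.

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf style text into a dict of property -> value."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # property name, then the rest as the value
        if len(parts) == 2:
            props[parts[0]] = parts[1].strip()
    return props


def memory_to_mb(value):
    """Convert a memory string such as '512m' or '4g' to megabytes."""
    units = {"m": 1, "g": 1024}
    return int(value[:-1]) * units[value[-1].lower()]


# Sample file contents matching the line added above.
sample = """
# system-wide Spark defaults
spark.executor.memory        4g
"""
props = parse_spark_defaults(sample)
executor_mb = memory_to_mb(props["spark.executor.memory"])
print(executor_mb)
```

The resulting figure should not exceed the worker memory (8192 MB in the scenario), otherwise the worker cannot satisfy the executor request.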

See also

DataStax doc - Configuring Spark nodes

Apache Spark doc - Spark 1.2.1 Configuration

Apache Spark doc - Spark 1.2.1 Available Properties
