This article discusses the trade-offs of processing data locally versus on another DSE Analytics node.
When a job runs, Spark determines where to execute each task based on factors such as the available memory and cores on a node, where the data is located in the cluster, and which executors are available.

One of the most important features of DataStax Enterprise Analytics is its awareness of where data resides in the cluster. This means that it will first try to execute tasks local to the data, avoiding the overhead of transferring that data across the network to an executor on another node.

One of the configurable scheduling properties in Apache Spark is spark.locality.wait, which defaults to 3 seconds. Spark waits this long to launch a task on an executor local to the data. If a data-local executor is still unavailable after this period, Spark gives up on locality and launches the task on a less-local node.
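The fallback behaviour can be pictured as a simple timer over the locality levels Spark uses (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY). The sketch below is an illustration of the idea only, not Spark's actual scheduler code; the function name and the simplification that every level uses the same wait are assumptions for this example.

```python
# Illustrative sketch of Spark's delay-scheduling idea: prefer the most
# data-local placement, and relax to the next (less local) level each
# time the configured wait elapses without a suitable slot.
# This is NOT Spark's implementation -- just a model of the behaviour.

LOCALITY_LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def allowed_locality(seconds_waited, wait_per_level=3.0):
    """Return the least-local level the scheduler may use after waiting.

    Each time `wait_per_level` (modelled on spark.locality.wait,
    default 3s) passes without a slot at the current level, the
    scheduler relaxes to the next level, ending at ANY.
    """
    level = int(seconds_waited // wait_per_level)
    return LOCALITY_LEVELS[min(level, len(LOCALITY_LEVELS) - 1)]
```

With the default 3-second wait, a task that has waited under 3 seconds may only launch process-local; after roughly 9 seconds it may launch anywhere.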
SCENARIO 1 - Large data sets
Jobs are taking too long to complete. On investigation, it was determined that tasks are running on non-local nodes, so large amounts of data are being transferred to remote nodes (shuffling), delaying processing.
SCENARIO 2 - Small data sets
Despite jobs processing only small amounts of data, they still take too long because only 1 or 2 tasks run on data-local nodes while the rest of the cluster sits mostly idle.
When configuring a value for spark.locality.wait, remember that there is a penalty associated with setting it either too high or too low.
In the first scenario above, transferring large amounts of data across the network is expensive. Data locality should be the primary consideration, so increase the wait time so that tasks wait longer to launch on data-local nodes.
In the second scenario, where the data sets are small, shuffling data around the cluster is relatively inexpensive compared with leaving nodes idle. It makes sense to reduce the wait time so tasks are distributed across the rest of the cluster (higher parallelisation).
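As a concrete illustration, the wait can be adjusted per job at submit time. The script names and the 10s/1s values below are placeholders for this example, not recommendations; tune them for your own workload.

```shell
# Scenario 1: large data sets -- favour locality by waiting longer
# before falling back to a non-local executor.
dse spark-submit --conf spark.locality.wait=10s large_dataset_job.py

# Scenario 2: small data sets -- favour parallelism by waiting less,
# letting tasks spread to otherwise idle nodes sooner.
dse spark-submit --conf spark.locality.wait=1s small_dataset_job.py
```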
As a general rule, long-running jobs benefit from a higher wait time since the cost of waiting has less impact on the job's overall completion time. Jobs that only take a short time (e.g. 0.5-2 seconds) are better off with a small wait time. In some circumstances, very short jobs should use a wait time of zero so tasks execute immediately on the next available node.
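For those very short jobs, the wait can be disabled outright. Spark also exposes per-level overrides (spark.locality.wait.process, spark.locality.wait.node, spark.locality.wait.rack) that fall back to spark.locality.wait when unset. The script name and values here are placeholders for illustration.

```shell
# Very short jobs: disable the locality wait so tasks launch
# immediately on the next available executor.
dse spark-submit --conf spark.locality.wait=0 short_job.py

# Alternatively, customise the wait per locality level, e.g. give up
# on process-locality quickly but still wait for a node-local slot.
dse spark-submit \
  --conf spark.locality.wait.process=0 \
  --conf spark.locality.wait.node=3s \
  short_job.py
```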
Ultimately, determine the best wait time by balancing the cost of shuffling data across the cluster against the degree of task parallelisation achieved.
Apache Spark - Spark Configuration - Scheduling properties