The Thrift server was failing with a JVM out-of-memory error, either "Java heap space" or "GC overhead limit exceeded", due to a full-table request from a client application.
The following errors may be observed in the logs:
2017-04-11 00:41:55,017 org.apache.spark.util.Utils: Uncaught exception in thread task-result-getter-0 java.lang.OutOfMemoryError: GC overhead limit exceeded

2017-04-11 15:01:52,277 org.apache.spark.util.Utils: Uncaught exception in thread task-result-getter-0 java.lang.OutOfMemoryError: Java heap space
The problem was caused by a client application pulling the entire contents of a given table from DSE. Note: while this is not a common operation, there may well be scenarios where it is necessary.
Spark runs all tasks in the job in parallel; as a result, the combined task results do not fit into the JVM heap of the Thrift server, causing these out-of-memory conditions.
The Thrift server can be configured so that Spark tasks run one at a time. While this is slower, it means the results are fetched incrementally rather than all at once. The relevant parameter may be passed on the Thrift server command line.
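The exact flag is not preserved in this article, so the following is a sketch based on standard Spark configuration: capping the total executor cores at one forces tasks to run sequentially.

```shell
# Sketch: restrict the Spark SQL Thrift server to a single core so only
# one task runs at a time and results are returned incrementally.
# spark.cores.max is a standard Spark property; the precise flag used in
# the original incident was not recorded, so treat this as an assumption.
dse spark-sql-thriftserver start --conf spark.cores.max=1
```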
The user should also consider ways to pull back a smaller data set from DSE, using the distributed computation that Spark offers to do the "heavy lifting" on the cluster itself rather than in an upstream client application.
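For example, rather than issuing a full SELECT * against the table, the client can push filtering or aggregation into Spark SQL so that only a small result set crosses the wire. The JDBC URL, keyspace, table, and column names below are illustrative, not taken from the original incident:

```shell
# Hypothetical example: push aggregation down to the cluster via the
# Thrift server's JDBC endpoint instead of pulling every row back to
# the client. Table and column names are illustrative only.
beeline -u jdbc:hive2://localhost:10000 -e "
  SELECT sensor_id, avg(reading) AS avg_reading
  FROM my_keyspace.sensor_readings
  GROUP BY sensor_id"
```

Here the GROUP BY aggregation runs on the Spark executors across the cluster, so the client receives one row per sensor instead of the entire table.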