Summary
This article relates to an issue where Spark Worker entries from a previous version of DSE become orphaned in the Spark Master recovery table and create error messages in the system.log.
Applies to
- DataStax Enterprise 6.7.x
- DataStax Enterprise 6.0.x
Symptoms
After upgrading to DSE 6.x, error messages like the following might appear in the system.log:
ERROR [dispatcher-event-loop-1] 2019-07-22 19:18:59,832 Logging.scala:91 - DseSparkMaster error org.apache.spark.deploy.CassandraPersistenceEngine$CassandraPersistenceEngineException: Failed to deserialize worker_ with id=worker_worker-20181010025210-10.101.26.109-35648 in dc=nc-us-1-rxms_devqe. Consider cleaning up the recovery data and restarting the workers - see 'dsetool sparkmaster cleanup' and 'dsetool sparkworker restart' commands in DSE help. ... Caused by: java.lang.IllegalArgumentException: DSE version < 6.0.0 is not supported
This happened even after steps were taken to verify that all SSTables had been upgraded to the format and version expected by the new DSE release. For example, in DSE 6.7.x and 6.0.x, the expected SSTable format and version is bti-aa.
Example:
1. We ran the following command on every node in the cluster, where /var/lib/cassandra/data is the data_file_directories value configured in cassandra.yaml:
$ sudo find /var/lib/cassandra/data -type f -exec ls -lart {} \; > ${HOSTNAME}_dfd.txt
2. We analyzed all of the results to confirm that every data file is in the expected bti-aa format and version (a combined per-node check is sketched after this list):
$ find . -type f -name "*dfd.txt" -exec ls -l {} \;
-rw-r--r--@ 1 cassdba staff 36214342 Jul 22 14:45 ./cassnode01_dfd.txt
-rw-r--r--@ 1 cassdba staff 32398027 Jul 22 14:57 ./cassnode03_dfd.txt
-rw-r--r--@ 1 cassdba staff 39276510 Jul 22 14:47 ./cassnode02_dfd.txt
-rw-r--r--@ 1 cassdba staff 38100173 Jul 22 14:57 ./cassnode06_dfd.txt
-rw-r--r--@ 1 cassdba staff 37044999 Jul 22 14:57 ./cassnode04_dfd.txt
-rw-r--r--@ 1 cassdba staff 30484459 Jul 22 14:57 ./cassnode05_dfd.txt
$ find . -type f -name "*dfd.txt" -exec cat {} \; | egrep -v 'solr.data|snapshots|.txt|aa-.*-bti|.log'
$ # no records returned by the above command, indicating only bti-aa SSTables are present
3. Finally, even running the sparkmaster cleanup did not resolve the issue, and the error still appeared in the system.log:
# Drops and recreates the Spark Master recovery table
$ dsetool sparkmaster cleanup
# run on each node
$ dsetool sparkworker restart
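As a convenience, the per-node filtering from step 2 can be wrapped in a small loop. This is a minimal sketch, not part of the original procedure; it assumes the *_dfd.txt listings collected in step 1 have been copied into the current working directory:
$ # report, per node, how many data files are NOT in the bti-aa format (expect 0 everywhere)
$ for f in *_dfd.txt; do
>   printf '%s: ' "${f}"
>   egrep -v 'solr.data|snapshots|.txt|aa-.*-bti|.log' "${f}" | wc -l
> done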
Cause
This issue can be caused by orphaned entries in the Spark Master recovery table, dse_analytics.rm_shared_data. These entries represent running workers that a node is expected to contact when it restarts, and DSE normally removes them after a node restart. However, if the entire cluster is brought down suddenly, or after the cluster is upgraded, the entries can become orphaned. Work on this issue is tracked in the internal Jira ticket DSP-19468.
Workaround
1. Query the dse_analytics.rm_shared_data table to verify and identify orphaned entries:
select dc, id, version from dse_analytics.rm_shared_data;
In the following example, the cluster is running DSE 6.7.3, but there are entries from DSE 6.0.3, DSE 6.0.4, and DSE 5.0, as well as from a DC (dc1-mycluster) that had previously been removed from the cluster.
Example:
cqlsh> select dc, id, version from dse_analytics.rm_shared_data;
dc | id | version
--------------+--------------------------------------------------+-----------------------
dc1-mycluster | worker_worker-20181010011337-10.101.26.113-39262 | 6.0.3
dc1-mycluster | worker_worker-20181010013924-10.101.26.114-40991 | 6.0.3
dc1-mycluster | worker_worker-20181010025210-10.101.26.109-35648 | 5.0.0-Unknown
dc1-mycluster | worker_worker-20181010025551-10.101.26.110-35403 | 5.0.0-Unknown
dc2-mycluster | worker_worker-20181214143708-10.101.37.4-36468 | 6.0.4
dc2-mycluster | worker_worker-20181214144133-10.101.37.7-45010 | 6.0.4
dc2-mycluster | worker_worker-20181214144607-10.101.37.8-43466 | 6.0.4
dc2-mycluster | worker_worker-20181214144900-10.101.37.9-41278 | 6.0.4
dc2-mycluster | worker_worker-20181214145229-10.101.37.10-32782 | 6.0.4
dc2-mycluster | worker_worker-20181214145651-10.101.37.11-40608 | 6.0.4
2. Delete the orphaned entries:
delete from dse_analytics.rm_shared_data where dc=[dc of orphaned row] and id=[id of orphaned row];
Example:
# deletes the first row in the output above, left over from when the cluster was on DSE 6.0.3
cqlsh> delete from dse_analytics.rm_shared_data where dc='dc1-mycluster' and id='worker_worker-20181010011337-10.101.26.113-39262';
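If there are many orphaned rows, the individual deletes can be scripted from the shell. This is a minimal sketch, not part of the original workaround; it assumes cqlsh can connect to the local node without additional options (add credentials and host flags as required by your environment), and it uses id values copied from the example output in step 1:
$ # delete a list of orphaned worker rows for a single DC
$ DC='dc1-mycluster'
$ IDS='worker_worker-20181010011337-10.101.26.113-39262
> worker_worker-20181010013924-10.101.26.114-40991'
$ for id in ${IDS}; do
>   cqlsh -e "delete from dse_analytics.rm_shared_data where dc='${DC}' and id='${id}';"
> done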
Solution
As always, DataStax recommends reading the DSE release notes and upgrading to the latest version. Watch for a resolution of the associated Jira ticket, DSP-19468.