Introduction
In certain circumstances, when Lifecycle Manager (LCM) jobs have been run at different levels (for example, a datacenter-level job followed by a cluster-level job) and failures have occurred at the lower levels, LCM may show that the cluster has failed to deploy even after a successful job has been run at the cluster level.
Workaround
The job-status of clusters and datacenters is a rolled-up aggregate status that may reflect the outcome of many jobs run on different nodes and datacenters in the cluster. When a datacenter-level job fails on a specific datacenter, the cluster-level job-status will not turn green after a successful cluster-level job alone. You must eventually run a successful datacenter-level job on the same datacenter to clear the datacenter-level failure from the datacenter and cluster status rollups.
If it's not obvious which job failure is spoiling the rollup, there are two procedures one can follow to find the problematic job:
- Visually inspect the status column in the jobs page. Look for recent failed jobs at the datacenter or node level that have no successful follow-up job at the same level. This is often sufficient for small sites with a small number of recent jobs.
- Alternatively, one can use the API to determine exactly what job-id is spoiling the rollup.
  1. Get the last-job-id for the cluster that is showing the failed status using the API per the instructions at https://docs.datastax.com/en/opscenter/6.1/api/docs/lcm_cluster.html#lcm-cluster.
  2. Get the status for that job using the API per the instructions at https://docs.datastax.com/en/opscenter/6.1/api/docs/lcm_jobs.html#lcm-jobs. If the job status is not successful, correct the issue and rerun a successful job at the cluster level. The rollup may or may not clear up at this point.
  3. If the rollup is not yet clear, get the last-job-id for any datacenters that are showing a failed status using the API per the instructions at https://docs.datastax.com/en/opscenter/6.1/api/docs/lcm_datacenter.html#lcm-datacenter.
  4. Get the status for each job using the procedure from step 2. If any jobs are not successful, correct the issue and rerun a successful job at the datacenter level. The rollup frequently clears up at this point.
  5. If the rollup is not yet clear, get the last-job-id for any nodes showing a failed status using the API per the instructions at https://docs.datastax.com/en/opscenter/6.1/api/docs/lcm_node.html#lcm-node.
  6. Get the status for each job using the procedure from step 2. If any jobs are not successful, correct the issue and rerun a successful job at the node level. The rollup should be clear at this point.
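The API procedure above can be sketched in Python as a rough illustration. Several details here are assumptions rather than facts from this article: the base URL, the resource path in the usage note, the "status" and "name" field names, and the use of "COMPLETE" as the successful status value. Check them against the linked API documentation before relying on them. For simplicity the sketch checks every entity that has a last-job-id, not only those currently showing a failed status.

```python
import json
from urllib.request import urlopen

# Assumptions, not taken from this article: the base URL, the resource
# paths passed to fetch(), and the status value that means "success".
BASE_URL = "http://localhost:8888/api/v2/lcm"
SUCCESS_STATUSES = {"COMPLETE"}


def fetch(path):
    """GET one resource from the (assumed) LCM API and decode its JSON body."""
    with urlopen(f"{BASE_URL}/{path}") as response:
        return json.load(response)


def find_spoiling_jobs(cluster, datacenters, nodes, get_job):
    """Return (level, name, job_id, status) for each unsuccessful last job.

    Entities are checked in the order of the procedure: the cluster first,
    then its datacenters, then its nodes. An entity without a last-job-id
    never had a job run directly against it and is skipped.
    """
    spoilers = []
    levels = [("cluster", [cluster]), ("datacenter", datacenters), ("node", nodes)]
    for level, entities in levels:
        for entity in entities:
            job_id = entity.get("last-job-id")
            if job_id is None:
                continue
            # "status" as the job's outcome field is an assumption.
            status = get_job(job_id)["status"]
            if status not in SUCCESS_STATUSES:
                spoilers.append((level, entity.get("name"), job_id, status))
    return spoilers
```

With the cluster, datacenter, and node records already retrieved as dictionaries, a call might look like `find_spoiling_jobs(cluster, datacenters, nodes, lambda job_id: fetch(f"jobs/{job_id}"))`, where the `jobs/` path is again an assumption. Passing the job lookup in as a function keeps the traversal logic separate from the HTTP details.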
Technical Details
The meaning of the status column on the LCM jobs page is relatively straightforward to understand as it corresponds to the outcome of exactly one job.
The meaning of the status icon on the LCM cluster topology page, and of the job-status API field on clusters, datacenters, and nodes that determines the behavior of that icon, is more nuanced. Each node or datacenter in a cluster may have last been updated by a different job, yet when viewing the job-status for an aggregate entity like a cluster or datacenter we want a single rolled-up status that reflects the currently deployed state of the whole thing. Most critically, we don't want to have to click through every datacenter and node in a cluster to see if the cluster is currently deployed cleanly with no errors. In order to provide rolled-up statuses, LCM calculates job-status fields as follows:
- For nodes, display the job-status for the node itself. No rollup behavior is necessary.
- For datacenters, generate a list of statuses including:
  - If the datacenter itself has a last-job-id, meaning that a datacenter-level job was run on that specific datacenter, include the status of that job. Note that cluster-level jobs are not considered here.
  - The status of every node in the datacenter.
  - The job-status for the datacenter becomes the highest-severity status present in the list.
- For clusters, generate a list of statuses including:
  - If the cluster itself has a last-job-id, meaning that a cluster-level job was run on that specific cluster, include the status of that job.
  - If any datacenter in the cluster has a last-job-id, meaning that a datacenter-level job was run on that specific datacenter, include the status of that job. Again, only datacenter-level jobs run against a specific datacenter are added at this point.
  - The status of every node in the cluster.
  - The job-status for the cluster becomes the highest-severity status present in the list.
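The rollup rules above can be illustrated with a small Python sketch. It is not LCM's actual implementation: the status names and their severity order are assumptions made for the example, since this article only states that the highest-severity status in the list wins. The function names are hypothetical as well.

```python
# Assumption: these status names and their severity order are illustrative.
SEVERITY = {"COMPLETE": 0, "PENDING": 1, "RUNNING": 2, "FAILED": 3}


def highest_severity(statuses):
    """Return the most severe status, or None if the list is empty."""
    return max(statuses, key=SEVERITY.__getitem__, default=None)


def datacenter_status(dc_job_status, node_statuses):
    """Roll up a datacenter: its own datacenter-level job (if any) plus its nodes."""
    statuses = list(node_statuses)
    if dc_job_status is not None:
        statuses.append(dc_job_status)
    return highest_severity(statuses)


def cluster_status(cluster_job_status, dc_job_statuses, node_statuses):
    """Roll up a cluster: cluster-level job, datacenter-level jobs, and all nodes."""
    statuses = list(node_statuses)
    # Datacenters that never had a datacenter-level job contribute nothing.
    statuses += [s for s in dc_job_statuses if s is not None]
    if cluster_job_status is not None:
        statuses.append(cluster_job_status)
    return highest_severity(statuses)


# Scenario from the introduction: a datacenter-level job failed on one
# datacenter, then a cluster-level job succeeded on every node.
nodes = ["COMPLETE", "COMPLETE", "COMPLETE"]
print(cluster_status("COMPLETE", ["FAILED", None], nodes))    # FAILED
# After a successful datacenter-level job on that same datacenter:
print(cluster_status("COMPLETE", ["COMPLETE", None], nodes))  # COMPLETE
```

The first call shows why a successful cluster-level job alone does not clear the rollup: the failed datacenter-level job is still the last job recorded against that datacenter, so its status remains in the list.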