Summary
This article discusses adding a new data center with a Search workload (Solr) to an existing cluster. Nodes running a Search workload differ in important ways from nodes running Cassandra or other workloads. Whether you are spinning up a standalone Search data center or adding one to an existing cluster, you must consider resource utilization and configuration, Lucene indexes and indexing performance, and Solr configuration and tuning, among other factors. This article focuses on adding a new Search data center to an existing cluster and covers the high-level sequence of steps to do so. Many ancillary topics, from advanced security to data encryption, may also apply depending on your environment; we will not cover those in detail beyond calling out items to consider during certain steps.
Applies to
DataStax Enterprise 5.1.x
DataStax Enterprise 5.0.x
DataStax Enterprise 4.8.x
Procedure
- Stand up new hardware for nodes and ensure multi-DC operation.
- Set Linux kernel configurations according to DataStax recommendations
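As an illustrative sketch only (the values shown are common examples; confirm them against the DataStax recommended production settings for your OS and DSE version), the kernel and resource-limit changes typically include disabling swap, raising vm.max_map_count, and raising user limits:

```
# Illustrative only -- verify against DataStax recommended production settings
sudo swapoff --all                       # Cassandra/Solr perform poorly with swap
sudo sysctl -w vm.max_map_count=1048575  # many memory-mapped SSTable/index files

# /etc/security/limits.d/cassandra.conf
cassandra - memlock unlimited
cassandra - nofile  100000
cassandra - nproc   32768
cassandra - as      unlimited
```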
- As a best practice, to prevent I/O contention between Cassandra and Solr operations, create a physical mount point for the solr.data directories that is separate, including separate disk drives, from the Cassandra data directories (data, commitlog, saved_caches).
- It is important that your application executes requests at LOCAL consistency levels. Otherwise, when the new nodes join the cluster in the data center being built, the application driver will start sending requests to them as soon as they reach the “UN” state, which could cause problems for both reads and writes. Ensure your application is modified appropriately to operate in a multi-DC environment before proceeding.
- Install DSE (do not start DSE).
- Make DSE configuration changes as required for your specific environment.
- cassandra.yaml
- Add auto_bootstrap: false <== Important step before starting DSE
- Determine whether you are running vnodes or single-token nodes and configure accordingly. When not running single-token, best practice for Solr nodes is to set num_tokens in cassandra.yaml to 16 or 32 to limit the performance impact on Solr queries. It is OK for the Solr data center to use a different number of tokens than your existing data center.
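For example, the relevant cassandra.yaml fragment on each new Search node might look like the following sketch (choose num_tokens per the guidance above):

```
# cassandra.yaml (new Search nodes only)
auto_bootstrap: false   # important: prevents streaming at startup; rebuild runs later
num_tokens: 16          # 16 or 32 recommended for Search nodes using vnodes
```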
- dse.yaml
- Configure Solr Data Directories according to best practice
- By default, the solr.data or solrconfig_data_dir directories are located in the following locations and can be managed using the information at the links provided.
- DSE 5.1: solrconfig_data_dir/keyspace_name.table_name
- DSE 5.0: cassandra_data_dir/solr.data
- DSE 4.8: cassandra_data_dir/solr.data
- Configure DSE Search Indexing based on your requirements and DSE version.
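As a hedged example, pointing the Solr data directory at a dedicated mount in dse.yaml might look like the following (the option name and path are illustrative; verify the exact setting for your DSE version):

```
# dse.yaml (illustrative; confirm the option name for your DSE release)
solr_data_dir: /mnt/solr_data   # dedicated mount, separate disks from Cassandra data
```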
- cassandra-env.sh
- gossip configs
- Set proper DC’s and racks per your snitch requirements in the appropriate file.
- Set node type to Search in “/etc/default/dse” (SOLR_ENABLED=1)
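For example, with GossipingPropertyFileSnitch the per-node settings might look like this (the data center and rack names are hypothetical):

```
# cassandra-rackdc.properties (new Search nodes)
dc=SearchDC1
rack=RAC1

# /etc/default/dse (package installations)
SOLR_ENABLED=1   # start this node with the Search (Solr) workload
```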
- Start DSE on all nodes in new data center one at a time.
- Note: DSE will start on the node and it will join the cluster (from a gossip perspective), but it will not bootstrap or stream data at this point. You should see it as part of the cluster in nodetool status, listed as ‘UN’ in the proper data center.
- Note: If adding the new Search data center to a cluster that already has an existing Search data center, the Solr cores will load when DSE is started for the first time, because the solr_admin keyspace uses the ‘everywhere’ strategy and will replicate resources to nodes in the new data center when they join. Once data starts streaming after the “nodetool rebuild” command is executed, Solr will begin indexing the streamed data for the loaded cores.
- Ensure all nodes join the cluster and nodetool status shows all DCs and nodes in the cluster as UN.
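A quick check before moving on:

```
nodetool status
# Every node in every data center should show 'UN' (Up/Normal) before proceeding.
```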
- From a node in the existing data center, alter keyspaces below, if present, to include the new DC in replication strategy. Note: Once your data keyspaces have been altered to include the new data center, data that is being ingested currently via the existing data center will start replicating to the new data center immediately. (old data will not replicate at this point)
- System Keyspaces
- system_auth, system_distributed, system_traces
- DSE Defined Keyspaces
- dse_perf, dse_security, dse_leases
- "OpsCenter" - (if you have OPSC running on prod cluster)
- Non-system Keyspaces (the keyspaces for your data)
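As a sketch (the keyspace shown, the data center names, and the replication factors are illustrative), the alteration is done in cqlsh with NetworkTopologyStrategy, keeping the existing data center's replication and adding the new one:

```
-- cqlsh, from a node in the existing data center (names/RFs illustrative)
ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy', 'ExistingDC1': 3, 'SearchDC1': 3 };

-- Repeat for system_distributed, system_traces, dse_perf, dse_security,
-- dse_leases, "OpsCenter" (if present), and each of your data keyspaces.
```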
- Run “nodetool repair -full keyspace_name table_name” on the System and DSE Defined Keyspaces on each node in the existing data center. Do NOT run repair on the OpsCenter or Non-system keyspaces (your data).
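For example, on every node in the existing data center (keyspace list per the step above):

```
# Existing-DC nodes only; do NOT repair OpsCenter or your data keyspaces
nodetool repair -full system_auth
nodetool repair -full system_distributed
nodetool repair -full system_traces
nodetool repair -full dse_perf
nodetool repair -full dse_security
nodetool repair -full dse_leases
```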
- Create Solr cores on nodes in the new data center. Note: If building the new Search data center from a cluster that does not have an existing Search data center, you must create the cores manually.
- You must create the cores manually by running the “dsetool” command on one of the new nodes as follows:
- If you do not have custom resources created, use: “dsetool create_core keyspace_name.table_name distributed=true generateResources=true reindex=true” for each Solr core you have defined.
- If you have created your own resources, use: “dsetool create_core keyspace_name.table_name schema=/path/to/schema.xml solrconfig=/path/to/solrconfig.xml distributed=true”
- Best Practice: It is critical that your schema.xml is configured properly; the following process will save indexing time.
- Run “dsetool create_core keyspace_name.table_name distributed=true generateResources=true reindex=false”
- Use “dsetool infer_solr_schema keyspace_name.table_name > schema.xml” and review the inferred schema in the output. Change any text fields that do not need tokenizing to StrField, and add docValues=true to fields that will be sorted, faceted, or grouped, except for text fields; do not set docValues=true on text fields unless they are NOT tokenized. Make any other changes required by your use case.
- Run “dsetool reload_core keyspace_name.table_name schema=/path/to/schema.xml distributed=true”
- You will also need to configure your solrconfig.xml appropriately for your use case and environment, but only changes to schema.xml require reindexing, which is why we call it out in this section. You can go back later and reload the core with a custom solrconfig.xml without the need to reindex.
- Documentation Links for Core Creation
- DSE 5.1 - dsetool create_core and dsetool infer_solr_schema
- DSE 5.0 - dsetool create_core and dsetool infer_solr_schema
- DSE 4.8 - dsetool create_core and dsetool infer_solr_schema
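Putting the best-practice sequence together (the keyspace/table name and the schema path are placeholders):

```
# On ONE of the new Search nodes
dsetool create_core ks1.tbl1 distributed=true generateResources=true reindex=false
dsetool infer_solr_schema ks1.tbl1 > schema.xml
# ...edit schema.xml: StrField for non-tokenized text, docValues=true where needed...
dsetool reload_core ks1.tbl1 schema=/path/to/schema.xml distributed=true
```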
- Flush all source DC nodes.
- This will ensure that any data existing in memtables prior to altering the keyspaces is flushed to disk and will be included in the rebuild process.
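For example:

```
# Run on EVERY node in the existing (source) data center
nodetool flush
```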
- Run nodetool rebuild on each node in the new DC, specifying the existing data center as the source.
- Start with one node at a time, monitor cluster health, and make sure the cluster is not being overloaded. Run rebuild on additional nodes in parallel only if the cluster remains stable.
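The rebuild command takes the name of the source data center as an argument (the name shown is a placeholder):

```
# On each node in the NEW Search data center, starting with one node
nodetool rebuild -- ExistingDC1
```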
- Remove or comment out auto_bootstrap: false in cassandra.yaml on the new nodes.