Summary
This article discusses an issue that can cause DSE Search nodes to become unresponsive to client and sharding requests.
Applies to
- DSE 5.1.0 to 5.1.13
- DSE 6.0.0 to 6.0.7
- DSE 6.7.0 to 6.7.2
Symptoms
If the Apache Solr channel lock is interrupted when an internode handshake fails, a recursive error condition can occur that leaves the DSE Search node unresponsive to client requests, to sharding requests, and to requests from peer DSE Search nodes. Queries may time out, and applications subsequently receive errors as a result.
Cause
When an index writer operates on the Lucene index, it must hold a lock on the index directory (usually by means of a lock file). As originally reported in LUCENE-8262, the problems described above can occur if the thread holding that lock is interrupted.
The following error stack may also be observed in the logs:
partition 'pkey1' on 'myks.mytable':
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:836)
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:850)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1663)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:286)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:218)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:173)
at com.datastax.bdp.search.solr.handler.update.CassandraDirectUpdateHandler.indexDoc(CassandraDirectUpdateHandler.java:545)
at com.datastax.bdp.search.solr.handler.update.CassandraDirectUpdateHandler.addDoc(CassandraDirectUpdateHandler.java:138)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex.doIndex(AbstractSolrSecondaryIndex.java:1272)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex.doUpdate(AbstractSolrSecondaryIndex.java:1028)
at com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex.lambda$indexDocument$7(Cql3SolrSecondaryIndex.java:815)
at io.reactivex.internal.operators.single.SingleFlatMapCompletable$FlatMapCompletableObserver.onSuccess(SingleFlatMapCompletable.java:83)
at io.reactivex.internal.operators.single.SingleObserveOn$ObserveOnSingleObserver.run(SingleObserveOn.java:81)
at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
at org.apache.cassandra.concurrent.StagedScheduler$TPCAwareDisposeTask.run(StagedScheduler.java:203)
at io.reactivex.internal.schedulers.ScheduledRunnable.run(ScheduledRunnable.java:61)
at io.reactivex.internal.schedulers.ScheduledRunnable.call(ScheduledRunnable.java:52)
at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:73)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:465)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/disk2/dse/cassandra/data/solr.data/myks.mytable/index/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],ctime=2019-01-08T11:12:40Z)
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:166)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
at org.apache.lucene.codecs.lucene53.Lucene53NormsConsumer.<init>(Lucene53NormsConsumer.java:43)
at org.apache.lucene.codecs.lucene53.Lucene53NormsFormat.normsConsumer(Lucene53NormsFormat.java:77)
at org.apache.lucene.index.DefaultIndexingChain.writeNorms(DefaultIndexingChain.java:366)
at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:193)
at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:725)
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:571)
at org.apache.lucene.index.DocumentsWriter.lambda$flushInParallel$0(DocumentsWriter.java:744)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 common frames omitted
Note: On some systems, DataStax has observed that incorrect disk readahead settings can trigger this problem, although readahead settings are not its root cause.
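To rule out that trigger, the current readahead value can be inspected with blockdev. This is a minimal sketch only; the device path is an example, so substitute the block device that backs the Solr data directory (for example, the disk hosting /disk2/dse/cassandra/data/solr.data in the stack trace above):
blockdev --getra /dev/sdb    # readahead for one device, in 512-byte sectors
blockdev --report            # readahead for all block devices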
Internal JIRA DSP-18211 was raised to cover this scenario.
Workaround
Reload the Solr core with reindex and distributed set to false. Example using dsetool:
dsetool reload_core myks.mytable reindex=false distributed=false
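Because distributed=false applies the reload only to the node where the command runs, repeat the command on every affected Search node. A minimal sketch, assuming hypothetical hostnames node1 through node3 with SSH access and dsetool on each node's PATH:
for host in node1 node2 node3; do
    ssh "$host" "dsetool reload_core myks.mytable reindex=false distributed=false"
done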
Note: In some cases the node may experience commitlog growth after the reload. If you observe this, drain the node and restart DSE as soon as convenient.
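A hedged sketch of that drain-and-restart sequence, assuming a package installation where DSE runs as the dse service (tarball installations stop and start DSE with the dse script instead):
nodetool drain          # flush memtables and stop accepting new writes
sudo service dse stop
sudo service dse start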
Solution
Upgrade to a version with the fix for this issue:
- DSE 5.1 - Upgrade to DSE 5.1.14 or later
- DSE 6.0 - Upgrade to DSE 6.0.8 or later
- DSE 6.7 - Upgrade to DSE 6.7.3 or later