Summary
On a working DSE Search cluster, changes to application logic or data content can produce inconsistent query results due to indexing errors. This article discusses an indexing error caused by a string that was unexpectedly long after an external application change.
Applies to
- DSE 5.1.0 to 5.1.16
- DSE 6.0.0 to 6.0.7
- DSE 6.7.0 to 6.7.2
Symptoms
A previously working query or piece of application logic starts to produce results that are missing some content. A review of the logs shows the following error:
ERROR [wiki.immense Index WorkPool work thread-1] 2019-08-06 16:31:47,306 Cql3SolrSecondaryIndex.java:772 - [wiki.immense]: Exception writing document id 31 to the index; possible analysis error: Document contains at least one immense term in field="field1" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[51, 50, 55, 54, 55, 32, 49, 32, 50, 32, 51, 32, 52, 32, 53, 32, 54, 32, 55, 32, 56, 32, 57, 32, 49, 48, 32, 49, 49, 32]...', original message: bytes can be at most 32766 in length; got 185501. Perhaps the document has an indexed string field (solr.StrField) which is too large
org.apache.solr.common.SolrException: Exception writing document id 31 to the index; possible analysis error: Document contains at least one immense term in field="field1" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[51, 50, 55, 54, 55, 32, 49, 32, 50, 32, 51, 32, 52, 32, 53, 32, 54, 32, 55, 32, 56, 32, 57, 32, 49, 48, 32, 49, 49, 32]...', original message: bytes can be at most 32766 in length; got 185501. Perhaps the document has an indexed string field (solr.StrField) which is too large
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:206)
at com.datastax.bdp.search.solr.handler.update.CassandraDirectUpdateHandler.indexDoc(CassandraDirectUpdateHandler.java:709)
at com.datastax.bdp.search.solr.handler.update.CassandraDirectUpdateHandler.addDoc(CassandraDirectUpdateHandler.java:150)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex.doIndex(AbstractSolrSecondaryIndex.java:1285)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex.doUpdate(AbstractSolrSecondaryIndex.java:1007)
at com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex$2.run(Cql3SolrSecondaryIndex.java:761)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex$2.run(AbstractSolrSecondaryIndex.java:943)
at com.datastax.bdp.concurrent.Worker.run(Worker.java:86)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Cause
A string field (solr.StrField) is not tokenized: the entire value is indexed as a single term. For example, when the string “1 2 3 4 5 … 32767” is not tokenized into separate terms, it is treated as one long string. When that single term exceeds the maximum length allowed for the field type (32766 bytes), the "Document contains at least one immense term" error occurs. Typical triggers include application changes, underlying data changes, or both. For example, an application might start attaching XML or JSON content to a field. Such changes can increase the length of the string being indexed for the field until it exceeds the limit.
In our example, the Solr schema looked like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
  <types>
    <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
    <fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
    <fieldType class="org.apache.solr.schema.TextField" name="TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field docValues="true" indexed="true" multiValued="false" name="key" stored="true" type="TrieIntField"/>
    <field indexed="true" multiValued="false" name="field1" stored="true" type="StrField"/>
    <field indexed="true" multiValued="false" name="field2" stored="true" type="TextField"/>
  </fields>
  <uniqueKey>key</uniqueKey>
</schema>
A long string containing all of the numbers from 1 to 32767 was generated and inserted into DSE. While field2 (a text field) was able to accept and index such a long string, field1 (a string field) was not.
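For reference, here is a minimal reproduction sketch using the DataStax Python driver. The contact point and the exact generation logic are assumptions, although the resulting string matches both the term prefix ("32767 1 2 3 ...") and the 185501-byte length reported in the error above.

from cassandra.cluster import Cluster

# The contact point below reuses the node address from the Solr query later
# in this article; substitute your own DSE node.
session = Cluster(["10.101.34.15"]).connect()

# Build "32767 1 2 3 ... 32767", which is 185501 bytes as UTF-8 and far
# exceeds Lucene's 32766-byte maximum for a single indexed term.
immense = "32767 " + " ".join(str(i) for i in range(1, 32768))

# field2 (TextField) is tokenized and indexes fine; field1 (StrField) is
# indexed as one immense term and triggers the error shown in Symptoms.
session.execute(
    "INSERT INTO wiki.immense (key, field1, field2) VALUES (%s, %s, %s)",
    (31, immense, immense),
)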
In Cassandra, we see keys 1, 21, and 31:
$ cqlsh -e "select key from wiki.immense"
 key
-----
   1
  21
  31

(3 rows)
In Solr, we see only documents 1 and 21 (remember that document 31 failed to index):
http://10.101.34.15:8983/solr/wiki.immense/select?q=key%3A*&fl=key&wt=json&indent=true
{
  "responseHeader":{
    "status":0,
    "QTime":8},
  "response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
      {
        "key":1},
      {
        "key":21}]
  }}
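As an alternative check, assuming CQL search queries are enabled for the table, the same result can be confirmed from cqlsh using DSE Search's solr_query column:

$ cqlsh -e "select key from wiki.immense where solr_query = 'key:*'"

This should return only keys 1 and 21, mirroring the JSON response above.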
Solution
The correct solution is to identify the potentially problematic row or rows in the database and verify their data integrity.
The document id is displayed in the error message:
Exception writing document id 31 to the index; possible analysis error:
The document id is the uniqueKey in the Solr schema, which maps to the partition key in the database. You can then query the database to verify the row or rows and identify the required corrective action (an example query follows the list below):
- Change the field type, if the new application logic legitimately requires strings of this length.
- Manually correct the row or rows, if they contain invalid data.
- Implement data validation checks to enforce controls on this field.
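For example, to inspect the failing row from this article (document id 31), run a query along these lines:

$ cqlsh -e "select key, field1 from wiki.immense where key = 31"

If the field type must change (for example, switching field1 from StrField to TextField so that the value is tokenized), reload the search core with reindexing enabled, such as with dsetool reload_core wiki.immense reindex=true. If instead the application should reject oversized values, a minimal client-side validation sketch (the 32766-byte limit comes from the error above) could look like:

MAX_TERM_BYTES = 32766  # Lucene's maximum indexed term length, per the error above

def validate_field(value: str) -> None:
    # Reject values whose UTF-8 encoding cannot be indexed as a single term
    if len(value.encode("utf-8")) > MAX_TERM_BYTES:
        raise ValueError("value exceeds the maximum indexable term length")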
See also
- Understanding field types (DataStax documentation)
- JSON and DSE Search (DataStax Developer Blog)