The agent request cache can overflow in some situations involving long-running operations (such as an S3 backup that takes a considerable amount of time to complete). In some cases, this can abort the long-running operation.
We believe this issue to be quite rare; we've seen a confirmed case of it terminating S3 backups and we believe it has also resulted in Repair Service failures in another case.
This document describes the symptoms observed in cases so far. Note that the same underlying issue could lead to symptoms different from those described here, although the following should provide a good guide to detecting future occurrences.
Note that we believe this problem requires long-running S3 backups in order to provide enough time for other requests to fill up the cache. If you aren't seeing S3 backups that take a very long time (hours or even days), you probably aren't seeing this issue.
In the Backup Service case, you'll see something like the following in the OpsCenter logs:
2015-09-22 19:09:19+0000  DEBUG: Polling status of request 11d65eea-bbb8-4df1-828c-a177bddd0553
2015-09-22 19:09:19+0000  DEBUG: Performing HTTP request (GET): http://220.127.116.11:61621/request/11d65eea-bbb8-4df1-828c-a177bddd0553/status?, body: None
The request in question is the parent request for the S3 backup, as described below. OpsCenter is polling it to determine the state of the async uploads to S3. Normally the GET returns a 200 response and no further logging is needed; however, if the cache overflows, you'll see an additional message in the logs:
2015-09-22 19:09:19+0000  WARN: HTTP request http://18.104.22.168:61621/request/11d65eea-bbb8-4df1-828c-a177bddd0553/status? failed: 404 Not Found
2015-09-22 19:09:19+0000  WARN: OpsCenter is unable to determine the status of the upload to this destination because the agent is no longer aware of the backup reqeust. This usually indicates that the agent was restarted while an upload was in progress.
2015-09-22 19:09:19+0000  WARN: Marking request 926aa13b-d90b-41dd-8c67-2f69db56cc52 as failed: OpsCenter is unable to determine the status of the upload to this destination because the agent is no longer aware of the backup reqeust. This usually indicates that the agent was restarted while an upload was in progress.
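To locate these failures in a large OpsCenter log, something like the following Python sketch may help. It assumes the WARN line has the same shape as the sample above; the function name find_lost_requests is hypothetical and not part of OpsCenter.

```python
import re

# Matches the WARN line logged when a status poll for a request returns 404.
# The pattern is an assumption based on the sample log lines in this document.
NOT_FOUND = re.compile(
    r"WARN: HTTP request \S*/request/(?P<request_id>[0-9a-f-]{36})/status\?"
    r" failed: 404 Not Found"
)


def find_lost_requests(log_lines):
    """Return request IDs whose status poll returned 404, in first-seen order."""
    seen = []
    for line in log_lines:
        for match in NOT_FOUND.finditer(line):
            request_id = match.group("request_id")
            if request_id not in seen:
                seen.append(request_id)
    return seen
```

Passing it an open log file (for example, find_lost_requests(open(path)) with the path to your OpsCenter log) yields the parent request IDs to look for in the agent logs.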
By themselves, these messages aren't conclusive, so further investigation is necessary. Look for additional request activity, most likely from repairs being handled by the agent. You should see a fair number of messages like the following in the agent logs after the backup has been initiated:
DEBUG [Thread-137] 2015-09-22 16:29:21,810 Request 11d65eea-bbb8-4df1-828c-a177bddd0553 finished with state :success
For a final confirmation, count the number of such requests logged after the backup was initiated. If that number is 100 or very close to it (100 being the default size of the finished-request cache), you're likely observing this issue.
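The count can be automated with a short Python sketch along these lines. It assumes agent log lines follow the format of the sample above (timestamp with comma-separated milliseconds, then "Request <id> finished with state"); the function name count_finished_since is hypothetical, and the backup start time must be supplied by you.

```python
from datetime import datetime
import re

# Pattern is an assumption based on the sample agent log line in this document.
FINISHED = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"Request (?P<request_id>[0-9a-f-]{36}) finished with state"
)


def count_finished_since(log_lines, backup_started):
    """Count distinct requests that finished at or after backup_started."""
    finished_ids = set()
    for line in log_lines:
        match = FINISHED.search(line)
        if match is None:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        if ts >= backup_started:
            finished_ids.add(match.group("request_id"))
    return len(finished_ids)
```

Timestamps are compared as naive datetimes, so the backup start time should be given in the same time zone the agent log uses.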
The agent maintains an internal cache of ongoing requests for various operations. Example requests include the parent requests for an S3 backup as well as certain restore operations. For S3 backups, OpsCenter regularly asks the agent for the status of the parent request while individual snapshot files are asynchronously sent to S3. The cache in question is a bounded-size cache with an eviction policy, so if enough additional requests come in while the S3 backup is ongoing, the parent request can be evicted from the cache. To OpsCenter it looks like the parent request has simply disappeared, which in many cases leads to a cancellation of the long-running job.
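The eviction behavior can be illustrated with a minimal Python sketch. This is not the agent's implementation: the class BoundedRequestCache and the request names are hypothetical, and the oldest-first eviction used here is an assumption, since the agent's actual eviction policy is not described in this document.

```python
from collections import OrderedDict


class BoundedRequestCache:
    """Toy bounded cache that evicts the oldest entry once max_size is exceeded."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = OrderedDict()

    def put(self, request_id, status):
        self._entries[request_id] = status
        self._entries.move_to_end(request_id)
        # Drop the oldest entries until the cache is back within its bound.
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)

    def get(self, request_id):
        # Returns None when the request has been evicted (or never existed),
        # which corresponds to the 404 that OpsCenter sees when polling.
        return self._entries.get(request_id)


cache = BoundedRequestCache(max_size=100)
cache.put("s3-backup-parent", "success")

# 100 other requests (for example, repairs) finish while the upload continues.
for i in range(100):
    cache.put(f"repair-{i}", "success")

parent_status = cache.get("s3-backup-parent")  # the parent entry is gone
```

In this sketch the 100th additional request pushes the parent out of a 100-entry cache, which matches the count of roughly 100 finished requests used for confirmation above.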
OPSC-6650 has a wealth of additional detail.
This issue is addressed (in the short term) by OPSC-6916, which introduces two new agent configuration parameters in 5.2.3. These parameters are "running-request-cache-size" and "finished-request-cache-size", which allow for customer-specific cache sizing. The default values for these caches are 500 and 100, respectively. For Backup Service failures like the one described above, "finished-request-cache-size" is the relevant setting.