OpsCenter backups to AWS S3 fail with read timeouts

Summary

This article discusses issues with backing up SSTables to AWS S3 buckets using OpsCenter.

Symptoms

When attempting to back up SSTables to AWS S3, the backup fails for some tables. Below are sample entries from the agent.log of a node running OpsCenter agent 5.2.4:

ERROR [async-dispatch-14] 2016-03-29 12:56:47,864 Mar 29, 2016 12:56:47 PM org.jclouds.logging.jdk.JDKLogger logError
SEVERE: Cannot retry after server error, command is not replayable: [method=org.jclouds.aws.s3.AWSS3AsyncClient.public abstract com.google.common.util.concurrent.ListenableFuture org.jclouds.s3.S3AsyncClient.putObject(java.lang.String,org.jclouds.s3.domain.S3Object,org.jclouds.s3.options.PutObjectOptions[])[myBackup, [metadata=[key=snapshots/1b4b5e54-88cc-4dd6-b18e-9f53d0e1cdda/sstables/1234567890-myKS-myTable-ka-56789-Data.db, bucket=null, uri=null, eTag=null, cacheControl=null, contentMetadata=[contentDisposition=null, contentEncoding=null, contentLanguage=null, contentLength=7927171, contentMD5=null, contentType=application/octet-stream, expires=null], lastModified=null, owner=null, storageClass=STANDARD, userMetadata={}]], [Lorg.jclouds.s3.options.PutObjectOptions;@5277ac16], request=PUT https://myBackup.s3-us-west-1.amazonaws.com/snapshots/1b4b5e54-88cc-4dd6-b18e-9f53d0e1cdda/sstables/1234567890-myKS-myTable-ka-56789-Data.db HTTP/1.1]
WARN [async-dispatch-14] 2016-03-29 12:56:47,864 Transfer failed, retrying
org.jclouds.http.HttpResponseException: Read timed out connecting to PUT https://myBackup.s3-us-west-1.amazonaws.com/snapshots/1b4b5e54-88cc-4dd6-b18e-9f53d0e1cdda/sstables/1234567890-myKS-myTable-ka-56789-Data.db HTTP/1.1
        at org.jclouds.http.internal.BaseHttpCommandExecutorService.invoke(BaseHttpCommandExecutorService.java:117)
        at org.jclouds.rest.internal.InvokeSyncToAsyncHttpMethod.invoke(InvokeSyncToAsyncHttpMethod.java:128)
        at org.jclouds.rest.internal.InvokeSyncToAsyncHttpMethod.apply(InvokeSyncToAsyncHttpMethod.java:94)
...
        at org.jclouds.aws.s3.blobstore.strategy.internal.SequentialMultipartUploadStrategy.execute(SequentialMultipartUploadStrategy.java:119)
        at org.jclouds.aws.s3.blobstore.AWSS3BlobStore.putBlob(AWSS3BlobStore.java:89)
        at opsagent.jclouds$put_blob.invoke(jclouds.clj:28)
        at opsagent.backups.destinations$create_blob$fn__12590.invoke(destinations.clj:61)
...
Caused by: java.net.SocketTimeoutException: Read timed out
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
...
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
...

Cause

OpsCenter uses the Apache jclouds toolkit to access AWS resources. Performance issues have since been identified in the jclouds implementation that cause transfers to be much slower than expected.

As a result of the slow transfers, an upload may not complete within the timeout period, and the transfer is marked as failed.

Alternatives are currently being investigated under an internal request (OPSC-6644), and we hope to provide a solution in a future release of OpsCenter.

Workaround

To work around this issue, adjust the agent configuration by adding the following JVM options to datastax-agent-env.sh on each node:

JVM_OPTS="$JVM_OPTS -Xmx512M -Djclouds.mpu.parts.magnitude=100000 -Djclouds.mpu.parts.size=32000000"
JVM_OPTS="$JVM_OPTS -Djclouds.connection-timeout=120000"
JVM_OPTS="$JVM_OPTS -Djclouds.so-timeout=120000"

AWS S3 supports multipart file uploads of up to 10,000 parts. The first line sets the part (chunk) size to 32MB, which means the largest file that can be uploaded is 320GB (10,000 parts x 32MB); adjust the part size accordingly to accommodate the largest SSTable in the cluster (see the example below). Since the larger chunk size requires more memory, the maximum heap size for the agent is increased to 512MB.
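
For example, to check whether the 32MB part size is large enough, find the largest SSTable data file on a node and divide its size by 10,000. This is a quick sketch; the path below assumes a default package install, so adjust it to match the cluster's data directory:

# Print the size in bytes of the largest SSTable data file
# (assumes GNU find and the default /var/lib/cassandra/data location)
find /var/lib/cassandra/data -name '*-Data.db' -printf '%s\n' | sort -n | tail -1

# Example: a 500GB SSTable needs parts of at least
# 500000000000 / 10000 = 50000000 bytes, i.e. -Djclouds.mpu.parts.size=50000000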

The next two JVM options increase the connection and transfer timeouts to 2 minutes (120,000 milliseconds).
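
If transfers still time out over slow links, these same properties can be raised further; for example, to allow 5 minutes:

JVM_OPTS="$JVM_OPTS -Djclouds.connection-timeout=300000"
JVM_OPTS="$JVM_OPTS -Djclouds.so-timeout=300000"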

NOTE - The agent on each node must be restarted for these changes to take effect.
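
On package installs, the agent can typically be restarted with the service command (the service name and init system may vary by platform):

# Restart the agent so the new JVM options are picked up
sudo service datastax-agent restart

# Optionally confirm the options appear on the agent's command line
ps -ef | grep '[d]atastax-agent' | grep -o 'jclouds[^ ]*'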

See also

DataStax doc - Configuring the agent to upload very large files to Amazon S3
