First, grep for AntiEntropy in the system.log to see all the messages related to repairs. Each repair session will have an identifier like
[repair #52cb6440-6526-11e2-0000-d2191516b4ff]. Once you identify the most recent repair session, grep for that identifier to see just the messages pertaining to that repair session.
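The log-mining steps above can be sketched as a short shell session. The sample log lines below are illustrative (the message text and session IDs will differ on a real cluster), and the location of system.log depends on your installation:

```shell
# Build a tiny sample system.log to illustrate; on a real node you would
# point these greps at the actual file (often /var/log/cassandra/system.log).
cat > system.log <<'EOF'
 INFO [AntiEntropySessions:14] 2013-01-24 12:51:07,100 AntiEntropyService.java (line 651) [repair #41ab2dc0-6526-11e2-0000-d2191516b4ff] new session: will sync /172.26.233.25 on range (1,2]
 INFO [AntiEntropySessions:15] 2013-01-24 12:52:30,512 AntiEntropyService.java (line 651) [repair #52cb6440-6526-11e2-0000-d2191516b4ff] new session: will sync /172.26.233.25 on range (2,3]
EOF

# Step 1: find all repair-related messages.
grep AntiEntropy system.log

# Step 2: pull out the identifier of the most recent repair session.
session=$(grep -o 'repair #[0-9a-f-]*' system.log | tail -1 | cut -d'#' -f2)
echo "$session"

# Step 3: keep only the messages for that session.
grep "$session" system.log > repair_session.log
```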
Also, check the output of nodetool compactionstats and nodetool netstats while the repair appears to be stuck to see whether any streams are still active or a validation compaction is underway. A validation compaction means that the node is still calculating merkle trees.
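As a sketch of what to look for, the snippet below greps captured copies of the two commands' output. The sample output is approximate for the 1.x-era nodetool format and will vary by version:

```shell
# On a live node you would run "nodetool compactionstats" and
# "nodetool netstats" directly; these captured samples are illustrative.
cat > compactionstats.out <<'EOF'
pending tasks: 1
          compaction type        keyspace   column family   bytes compacted     bytes total  progress
               Validation             ks1           users          12345678        98765432    12.50%
EOF

cat > netstats.out <<'EOF'
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
EOF

# A validation compaction means the node is still building merkle trees.
grep -c Validation compactionstats.out

# "Not sending/receiving any streams" means no repair data is moving
# over the network right now.
grep -c "any streams" netstats.out
```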
If a node goes down while the repair is in progress, or if the streaming connection from the source node back to the requester is interrupted, nodetool can hang indefinitely. If a node goes down, you will see an error like this in the log on the node where the repair was run:
ERROR [AntiEntropySessions:15] 2013-01-24 12:52:50,498 AntiEntropyService.java (line 716) [repair #52cb6440-6526-11e2-0000-d2191516b4ff] session completed with the following error java.io.IOException: Endpoint /172.26.233.27 died
In this case, you need to check the log of the node that went down and try to figure out why. Look for errors and for messages from GCInspector indicating a long pause. If you find a long GC pause around the time of the dead-node error, you will need to tune your heap to eliminate such pauses. If you can't see any obvious reason for the node to have died, the cause is most likely network latency or congestion. You can try increasing the
phi_convict_threshold in your cassandra.yaml to 12 to reduce the likelihood that nodes will declare each other dead.
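The change is a single line in cassandra.yaml. As a sketch (the default threshold is 8 in this era of Cassandra; 12 is a common choice for high-latency or congested networks such as EC2):

```yaml
# cassandra.yaml -- raise the failure-detector threshold so brief network
# hiccups or GC pauses are less likely to mark a peer as down.
# Default is 8; higher values make nodes slower to declare each other dead.
phi_convict_threshold: 12
```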
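To spot long GC pauses quickly, you can grep the downed node's log for GCInspector messages. The sample line below is illustrative of the 1.x-era log format; the exact wording varies by version:

```shell
# Sample GCInspector line (format approximate; a real log will have many).
cat > system.log <<'EOF'
 INFO [ScheduledTasks:1] 2013-01-24 12:52:44,001 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 14320 ms for 2 collections, 7654321098 used; max is 8422162432
EOF

# Pull out the pause durations; anything in the multi-second range around
# the time of the dead-node error is a likely culprit.
grep GCInspector system.log | grep -o '[0-9][0-9]* ms'
```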
If you don't see any error stating that a node has died, a stream has probably been interrupted by a network issue. On the node where you ran repair, you will find a message that the node requested merkle trees from a list of replicas. Later you should see that it received merkle trees from some of the replicas, but one or more may be missing.
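You can spot the missing replica by comparing the request and receive messages for the session. The log lines below are illustrative of the 1.x-era message format, with sample replica addresses:

```shell
# Sample log: merkle trees were requested from two replicas, but only one
# (/172.26.233.26) ever answered.
cat > system.log <<'EOF'
 INFO [AntiEntropySessions:15] 2013-01-24 12:52:30,520 AntiEntropyService.java (line 666) [repair #52cb6440-6526-11e2-0000-d2191516b4ff] requesting merkle trees for users (to [/172.26.233.26, /172.26.233.27])
 INFO [AntiEntropyStage:1] 2013-01-24 12:52:45,101 AntiEntropyService.java (line 214) [repair #52cb6440-6526-11e2-0000-d2191516b4ff] Received merkle tree for users from /172.26.233.26
EOF

# Which replicas were asked for merkle trees?
grep "requesting merkle trees" system.log

# How many replied? One reply for a two-replica request means the stream
# from the other replica (/172.26.233.27 here) was interrupted.
grep -c "Received merkle tree" system.log
```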
In this case, you won't see an error on the other node unless you have debug logging enabled for OutboundTcpConnection. To enable it, add the following line to your log4j-server.properties:
log4j.logger.org.apache.cassandra.net.OutboundTcpConnection=DEBUG
If a stream is disrupted, you should see an Exception like this:
DEBUG [WRITE-/172.30.77.197] 2013-05-03 12:43:09,107 OutboundTcpConnection.java (line 165) error writing to /172.30.77.197
java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
    at org.apache.cassandra.net.OutboundTcpConnection.write(OutboundTcpConnection.java:200)
    at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:152)
    at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:126)
In this case you need to investigate the cause of the network error and address it.
One final possibility is that the JMX connection between nodetool and cassandra has timed out. In this case, you will see a message in the system.log that the repair session has completed. If you see this, the repair was successful even if nodetool is hanging. You can simply kill nodetool and you don't need to rerun the repair.
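A quick way to confirm this case is to check the log for the completion message before killing nodetool. The sample line below is illustrative of the 1.x-era message format:

```shell
# Sample log showing the session finished despite the hung nodetool.
cat > system.log <<'EOF'
 INFO [AntiEntropySessions:15] 2013-01-24 13:02:11,003 AntiEntropyService.java (line 712) [repair #52cb6440-6526-11e2-0000-d2191516b4ff] session completed successfully
EOF

# If the session is marked completed, the repair finished and the hung
# nodetool process can safely be killed without rerunning the repair.
if grep -q "session completed successfully" system.log; then
  echo "repair finished; safe to kill nodetool"
fi
```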