When a node becomes unavailable, the cluster waits 10 minutes by default before beginning reprotect actions, giving the node time to come back online. This default can be changed with the following query:
sql> SET GLOBAL rebalancer_reprotect_queue_interval_s=num_seconds;
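To confirm the change took effect, the current value can be inspected with a standard SHOW VARIABLES statement (this assumes a MySQL-compatible interface, which the sql> prompt above suggests):
sql> SHOW VARIABLES LIKE 'rebalancer_reprotect_queue_interval_s';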
Each node sends a heartbeat to the other nodes every second to confirm it is online. If a node stops responding, the cluster forms a new group without the unavailable node and the rebalancer_reprotect_queue_interval_s timer begins.
Once the rebalancer_reprotect_queue_interval_s timer expires, the cluster begins reprotecting the data that resided on the failed node.
You can see the rebalancer actions being performed with the following query:
sql> SELECT * FROM system.rebalancer_activity_log ORDER BY started DESC LIMIT 25;
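Because the log records a started timestamp, the same query can be narrowed to recent activity, for example everything begun in the last hour (standard SQL interval arithmetic, assuming the cluster's SQL dialect supports it):
sql> SELECT * FROM system.rebalancer_activity_log WHERE started > NOW() - INTERVAL 1 HOUR ORDER BY started DESC;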
Once the cluster has fully reprotected around the failed node, an alert indicating Full Protection Restored is sent to the alerts list. At this point the cluster can safely tolerate another node failure.
To the application this has the following effect: by default, every slice has two replicas. Every replica of a slice is written during a write, and the ranked replica is read during a read, so a query that touches a replica on the failed node must be retried. If the ranked replica is on the failed node, the remaining replica becomes the ranked replica until a new replica is created. For single-statement transactions, an internal retry mechanism handles these retries transparently to the client. During a node failure, the group change requires that all in-flight queries be retried.