region/client: re-establish connection on ServerNotRunningYetException
When receiving ServerNotRunningYetException, a client should not retry
sending the request to the same server. Instead, the client should be
closed and the region lookup should happen again.

There are two cases in which ServerNotRunningYetException is returned:
- when the RegionServer is listening but not online yet: in that case,
  retrying the RPC on the same server may succeed if the RegionServer
  becomes ready and if the region is indeed assigned to it. But most
  likely the region will have been reassigned to another RegionServer,
  and the server will return NotServingRegionException on the following
  request. If the RegionServer is stuck in its startup phase, the
  client could also be stuck in a retry loop even though the HBase
  master may have already detected the issue and moved the region to
  another RegionServer.

- when the HBase master is currently not active: in that case,
  retrying the RPC on the same server is guaranteed to fail until a
  failover. The client will be stuck retrying forever.
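
Both cases lead to the same handling: treat the exception as a server error, close the region client, and look the region up again. The sketch below illustrates that classification in isolation. It is a simplified illustration, not the library's code: the `action` type, the `classify` function, and the reduced map contents are hypothetical, and the real client also matches exceptions against stack traces.

```go
package main

import "fmt"

// action is a hypothetical enum describing what the client does next.
type action int

const (
	fail            action = iota // return the error to the caller
	retrySameServer               // back off, then resend to the same region server
	reestablish                   // close the region client and redo the region lookup
)

// retryable is a reduced, illustrative subset of exceptions for which
// resending to the same server makes sense.
var retryable = map[string]struct{}{
	"org.apache.hadoop.hbase.PleaseHoldException":    {},
	"org.apache.hadoop.hbase.RegionTooBusyException": {},
}

// serverErrors is a reduced, illustrative subset of exceptions that mean
// the connection to this server should be dropped.
var serverErrors = map[string]struct{}{
	"org.apache.hadoop.hbase.ipc.ServerNotRunningYetException":  {},
	"org.apache.hadoop.hbase.exceptions.MasterStoppedException": {},
}

// classify maps a Java exception class name to the next client action.
// Server errors are checked first so they always win.
func classify(exceptionClass string) action {
	if _, ok := serverErrors[exceptionClass]; ok {
		return reestablish
	}
	if _, ok := retryable[exceptionClass]; ok {
		return retrySameServer
	}
	return fail
}

func main() {
	name := "org.apache.hadoop.hbase.ipc.ServerNotRunningYetException"
	fmt.Println(classify(name) == reestablish)
}
```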

If we receive multiple ServerErrors for the same RPC, we back off
before retrying (the first two retries are immediate; backoff starts
from the third ServerError). This is to avoid overwhelming HBase. One
scenario where this could happen is a cluster recovering from a
catastrophic failure, with all HBase masters still trying to start
(for example, while recovering WALs).

Also add MasterStoppedException and PleaseHoldException to the list of
known exceptions that can be returned by HBase.

Fix #265
dethi committed Jul 13, 2024
1 parent 1504720 commit bb20fd0
Showing 2 changed files with 21 additions and 7 deletions.
12 changes: 6 additions & 6 deletions region/client.go
@@ -65,25 +65,25 @@ var (
 	}

 	// If a Java exception listed here is returned by HBase, the client should
-	// backoff and resend the RPC message to the same region and region server
+	// backoff and resend the RPC message to the same region and region server.
 	// The value of exception should be contained in the stack trace.
 	javaRetryableExceptions = map[string]string{
 		"org.apache.hadoop.hbase.CallQueueTooBigException": "",
 		"org.apache.hadoop.hbase.exceptions.RegionOpeningException": "",
-		"org.apache.hadoop.hbase.ipc.ServerNotRunningYetException": "",
 		"org.apache.hadoop.hbase.quotas.RpcThrottlingException": "",
 		"org.apache.hadoop.hbase.RetryImmediatelyException": "",
 		"org.apache.hadoop.hbase.RegionTooBusyException": "",
+		"org.apache.hadoop.hbase.PleaseHoldException": "",
 	}

-	// javaServerExceptions is a map where all Java exceptions that signify
-	// the RPC should be sent again are listed (as keys). If a Java exception
-	// listed here is returned by HBase, the RegionClient will be closed and a new
-	// one should be established.
+	// If a Java exception listed here is returned by HBase, the RegionClient
+	// will be closed and a new one should be established.
 	// The value of exception should be contained in the stack trace.
 	javaServerExceptions = map[string]string{
 		"org.apache.hadoop.hbase.regionserver.RegionServerAbortedException": "",
 		"org.apache.hadoop.hbase.regionserver.RegionServerStoppedException": "",
+		"org.apache.hadoop.hbase.exceptions.MasterStoppedException": "",
+		"org.apache.hadoop.hbase.ipc.ServerNotRunningYetException": "",
 	}
 )

16 changes: 15 additions & 1 deletion rpc.go
@@ -91,6 +91,7 @@ func (c *client) SendRPC(rpc hrpc.Call) (msg proto.Message, err error) {
 	}()

 	backoff := backoffStart
+	serverErrorCount := 0
 	for {
 		rc, err := c.getRegionAndClientForRPC(ctx, rpc)
 		if err != nil {
@@ -105,7 +106,20 @@ func (c *client) SendRPC(rpc hrpc.Call) (msg proto.Message, err error) {
 				return msg, err
 			}
 			continue // retry
-		case region.ServerError, region.NotServingRegionError:
+		case region.ServerError:
+			// Retry ServerError immediately, as we want failover fast to
+			// another server. But if HBase keep sending us ServerError, we
+			// should start to backoff. We don't want to overwhelm HBase.
+			if serverErrorCount > 1 {
+				sp.AddEvent("retrySleep")
+				backoff, err = sleepAndIncreaseBackoff(ctx, backoff)
+				if err != nil {
+					return msg, err
+				}
+			}
+			serverErrorCount++
+			continue // retry
+		case region.NotServingRegionError:
 			continue // retry
 		}
 		return msg, err