# gRPC Error Model
Unlike HTTP, with its dozens of status codes, gRPC defines a fixed set of 17 status codes (0–16) that apply across all transports. Every gRPC call completes with a status code and an optional message string. Understanding these codes is the foundation of gRPC debugging.
gRPC errors are surfaced differently per language:
```python
# Python
import grpc

try:
    response = stub.GetUser(request)
except grpc.RpcError as e:
    print(e.code())     # grpc.StatusCode.NOT_FOUND
    print(e.details())  # 'User 42 not found'
```
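On the server side, the usual way to produce one of these statuses is `context.abort()`, which raises immediately with the given code and details. A minimal sketch — the `UserServicer` class and its `users` dict are illustrative stand-ins (real code subclasses the generated `*_pb2_grpc` servicer base and queries a datastore):

```python
import grpc

class UserServicer:
    """Sketch servicer; real code subclasses the generated base class."""
    users = {"1": {"name": "Ada"}}

    def GetUser(self, request, context):
        user = self.users.get(request.id)
        if user is None:
            # abort() raises; nothing after this line runs. The client
            # observes the code via e.code() and the text via e.details().
            context.abort(grpc.StatusCode.NOT_FOUND,
                          f"User {request.id} not found")
        return user
```

The details string is for humans debugging; clients should branch on the status code, never on the message text.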
## Status Code Reference
| Code | Name | HTTP Equivalent | Meaning |
|---|---|---|---|
| 0 | `OK` | 200 | Success |
| 1 | `CANCELLED` | — | Client cancelled the request |
| 2 | `UNKNOWN` | 500 | Unexpected error |
| 3 | `INVALID_ARGUMENT` | 400 | Bad input |
| 4 | `DEADLINE_EXCEEDED` | 504 | Timeout expired |
| 5 | `NOT_FOUND` | 404 | Resource not found |
| 6 | `ALREADY_EXISTS` | 409 | Conflict |
| 7 | `PERMISSION_DENIED` | 403 | Forbidden |
| 8 | `RESOURCE_EXHAUSTED` | 429 | Rate limit / quota |
| 9 | `FAILED_PRECONDITION` | 400 | System not in required state |
| 10 | `ABORTED` | 409 | Concurrency conflict |
| 11 | `OUT_OF_RANGE` | 400 | Value outside the valid range |
| 12 | `UNIMPLEMENTED` | 501 | Method not implemented |
| 13 | `INTERNAL` | 500 | Server-side bug |
| 14 | `UNAVAILABLE` | 503 | Server temporarily unavailable |
| 15 | `DATA_LOSS` | 500 | Unrecoverable data loss or corruption |
| 16 | `UNAUTHENTICATED` | 401 | Missing or invalid credentials |
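When a gRPC backend sits behind an HTTP gateway, a table like the one above can be encoded directly in code. A sketch of such a translation — the `to_http` helper and its fallback of 500 for unmapped codes are assumptions for illustration, not a standard API:

```python
import grpc

# Mapping mirroring the status-code table; codes without a clean HTTP
# equivalent (e.g. CANCELLED) fall through to the 500 default below.
GRPC_TO_HTTP = {
    grpc.StatusCode.OK: 200,
    grpc.StatusCode.INVALID_ARGUMENT: 400,
    grpc.StatusCode.DEADLINE_EXCEEDED: 504,
    grpc.StatusCode.NOT_FOUND: 404,
    grpc.StatusCode.ALREADY_EXISTS: 409,
    grpc.StatusCode.PERMISSION_DENIED: 403,
    grpc.StatusCode.RESOURCE_EXHAUSTED: 429,
    grpc.StatusCode.FAILED_PRECONDITION: 400,
    grpc.StatusCode.ABORTED: 409,
    grpc.StatusCode.INTERNAL: 500,
    grpc.StatusCode.UNAVAILABLE: 503,
    grpc.StatusCode.UNAUTHENTICATED: 401,
    grpc.StatusCode.UNKNOWN: 500,
}

def to_http(code):
    """Translate a grpc.StatusCode to an HTTP status for a gateway."""
    return GRPC_TO_HTTP.get(code, 500)
```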
## UNAVAILABLE vs INTERNAL
These two are the most commonly confused:
UNAVAILABLE (14) — the server cannot be reached or is temporarily overwhelmed. It is safe to retry with backoff. Common causes: server is starting up, overloaded, or a network partition is in progress.
INTERNAL (13) — a bug or unexpected condition in the server code. It is not safe to retry automatically without investigation. The same request will likely produce the same error.
gRPC client libraries automatically retry UNAVAILABLE when configured with a service config. Do not retry INTERNAL.
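That retry behavior is configured by passing a JSON service config when the channel is created. A sketch — the service name, attempt count, and backoff values are illustrative, not recommended defaults:

```python
import json
import grpc

# Retry UNAVAILABLE only; INTERNAL is deliberately absent from
# retryableStatusCodes, so those errors surface immediately.
service_config = json.dumps({
    "methodConfig": [{
        "name": [{"service": "UserService"}],
        "retryPolicy": {
            "maxAttempts": 4,
            "initialBackoff": "0.1s",
            "maxBackoff": "1s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
})

channel = grpc.insecure_channel(
    "localhost:50051",
    options=[("grpc.service_config", service_config)],
)
```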
## Deadline Exceeded Debugging
DEADLINE_EXCEEDED means the deadline set by the caller expired before the RPC completed. The deadline propagates through the call chain — if a client sets a 500ms deadline and calls Service A which calls Service B, Service B also has at most 500ms (minus A's processing time).
```python
# Set a deadline per call
response = stub.GetUser(request, timeout=0.5)  # 500ms
```
Debugging checklist:
- Log the deadline remaining at each service hop
- Add distributed tracing (OpenTelemetry) to identify which service consumed the most time
- Check database query times — a slow DB query is the most common cause
- Distinguish `grpc.StatusCode.DEADLINE_EXCEEDED` from `grpc.StatusCode.CANCELLED` — CANCELLED means the client gave up before the deadline
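To log the deadline remaining at a hop, a server handler can read `context.time_remaining()`. A sketch — the `downstream_timeout` helper and the 50ms safety margin are arbitrary choices for illustration, not part of the gRPC API:

```python
import grpc

def downstream_timeout(remaining, margin=0.05):
    """Reserve a margin so this hop can still serialize its own response."""
    return max(remaining - margin, 0.0)

class UserServicer:
    def GetUser(self, request, context):
        # Seconds until the caller's deadline; None if no deadline was set
        remaining = context.time_remaining()
        if remaining is not None and remaining < 0.05:
            # Fail fast rather than do work the caller will never see
            context.abort(grpc.StatusCode.DEADLINE_EXCEEDED,
                          "insufficient deadline budget")
        # Propagate a reduced deadline to the next hop, e.g.:
        # profile_stub.GetProfile(request,
        #                         timeout=downstream_timeout(remaining))
```

Logging `remaining` at each hop makes it obvious which service consumed the budget when a DEADLINE_EXCEEDED finally fires.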
## Channel and Connection Issues
gRPC uses HTTP/2, which multiplexes many RPCs over a single TCP connection. Connection issues affect all in-flight RPCs simultaneously.
```python
# Observe channel connectivity state; the sync Python API exposes this
# via subscribe() (grpc.aio channels also offer get_state())
channel = grpc.insecure_channel('localhost:50051')

def on_state_change(state):
    print(state)  # IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN

channel.subscribe(on_state_change, try_to_connect=True)
```
`TRANSIENT_FAILURE` — a connection attempt failed and gRPC will retry with backoff. This is normal during startup; it is a problem only if it persists.
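A common readiness pattern is to block until the channel leaves CONNECTING/TRANSIENT_FAILURE and reaches READY, with a bound on how long to wait. A sketch using `grpc.channel_ready_future` (the `wait_ready` wrapper and the timeouts are illustrative):

```python
import grpc

def wait_ready(channel, timeout):
    """Return True if the channel reaches READY within `timeout` seconds."""
    try:
        grpc.channel_ready_future(channel).result(timeout=timeout)
        return True
    except grpc.FutureTimeoutError:
        return False

channel = grpc.insecure_channel("localhost:50051")
if not wait_ready(channel, timeout=2.0):
    # Persistent TRANSIENT_FAILURE: check the address, TLS config,
    # and the server's logs before issuing RPCs
    channel.close()
```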
## grpcurl for Testing
`grpcurl` is the curl equivalent for gRPC:
```shell
# List services (requires server reflection)
grpcurl -plaintext localhost:50051 list

# Describe a service
grpcurl -plaintext localhost:50051 describe UserService

# Make a unary call
grpcurl -plaintext -d '{"id": "42"}' \
  localhost:50051 UserService/GetUser

# With TLS and metadata
grpcurl -H 'Authorization: Bearer TOKEN' \
  -d '{"id": "42"}' api.example.com:443 UserService/GetUser
```
## Distributed Tracing for gRPC
gRPC integrates natively with OpenTelemetry. Add the gRPC instrumentation interceptor to capture every RPC as a trace span:
```python
from opentelemetry.instrumentation.grpc import GrpcInstrumentorClient

GrpcInstrumentorClient().instrument()
# All subsequent stub calls are automatically traced
```
In your tracing UI (Jaeger, Zipkin, Grafana Tempo), filter by `rpc.grpc.status_code != 0` to find failed RPCs quickly.