# gRPC Error Model
Unlike HTTP, with its dozens of status codes, gRPC defines a fixed set of 17 status codes (0–16) that apply across all transports. Every gRPC call completes with a status code and an optional message string. Understanding these codes is the foundation of gRPC debugging.
gRPC errors are surfaced differently per language:
```python
# Python
import grpc

try:
    response = stub.GetUser(request)
except grpc.RpcError as e:
    print(e.code())     # grpc.StatusCode.NOT_FOUND
    print(e.details())  # 'User 42 not found'
```
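On the server side, the usual way to produce one of these statuses is `context.abort()`, which raises immediately with the given code and details. A minimal sketch — the `UserServicer` class and its `users` dict are illustrative stand-ins (real code subclasses the generated `*_pb2_grpc` servicer base and queries a datastore):

```python
import grpc

class UserServicer:
    """Sketch servicer; real code subclasses the generated base class."""
    users = {"1": {"name": "Ada"}}

    def GetUser(self, request, context):
        user = self.users.get(request.id)
        if user is None:
            # abort() raises; nothing after this line runs. The client
            # observes the code via e.code() and the text via e.details().
            context.abort(grpc.StatusCode.NOT_FOUND,
                          f"User {request.id} not found")
        return user
```

The details string is for humans debugging; clients should branch on the status code, never on the message text.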
## Status Code Reference
| Code | Name | HTTP Equivalent | Meaning |
|---|---|---|---|
| 0 | `OK` | 200 | Success |
| 1 | `CANCELLED` | — | Client cancelled the request |
| 2 | `UNKNOWN` | 500 | Unexpected error |
| 3 | `INVALID_ARGUMENT` | 400 | Bad input |
| 4 | `DEADLINE_EXCEEDED` | 504 | Timeout expired |
| 5 | `NOT_FOUND` | 404 | Resource not found |
| 6 | `ALREADY_EXISTS` | 409 | Conflict |
| 7 | `PERMISSION_DENIED` | 403 | Forbidden |
| 8 | `RESOURCE_EXHAUSTED` | 429 | Rate limit / quota |
| 9 | `FAILED_PRECONDITION` | 400 | System not in required state |
| 10 | `ABORTED` | 409 | Concurrency conflict |
| 11 | `OUT_OF_RANGE` | 400 | Value outside the valid range |
| 12 | `UNIMPLEMENTED` | 501 | Method not implemented |
| 13 | `INTERNAL` | 500 | Server-side bug |
| 14 | `UNAVAILABLE` | 503 | Server temporarily unavailable |
| 15 | `DATA_LOSS` | 500 | Unrecoverable data loss or corruption |
| 16 | `UNAUTHENTICATED` | 401 | Missing or invalid credentials |
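When a gRPC backend sits behind an HTTP gateway, a table like the one above can be encoded directly in code. A sketch of such a translation — the `to_http` helper and its fallback of 500 for unmapped codes are assumptions for illustration, not a standard API:

```python
import grpc

# Mapping mirroring the status-code table; codes without a clean HTTP
# equivalent (e.g. CANCELLED) fall through to the 500 default below.
GRPC_TO_HTTP = {
    grpc.StatusCode.OK: 200,
    grpc.StatusCode.INVALID_ARGUMENT: 400,
    grpc.StatusCode.DEADLINE_EXCEEDED: 504,
    grpc.StatusCode.NOT_FOUND: 404,
    grpc.StatusCode.ALREADY_EXISTS: 409,
    grpc.StatusCode.PERMISSION_DENIED: 403,
    grpc.StatusCode.RESOURCE_EXHAUSTED: 429,
    grpc.StatusCode.FAILED_PRECONDITION: 400,
    grpc.StatusCode.ABORTED: 409,
    grpc.StatusCode.INTERNAL: 500,
    grpc.StatusCode.UNAVAILABLE: 503,
    grpc.StatusCode.UNAUTHENTICATED: 401,
    grpc.StatusCode.UNKNOWN: 500,
}

def to_http(code):
    """Translate a grpc.StatusCode to an HTTP status for a gateway."""
    return GRPC_TO_HTTP.get(code, 500)
```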
## UNAVAILABLE vs INTERNAL
These two are the most commonly confused:
UNAVAILABLE (14) — the server cannot be reached or is temporarily overwhelmed. It is safe to retry with backoff. Common causes: server is starting up, overloaded, or a network partition is in progress.
INTERNAL (13) — a bug or unexpected condition in the server code. It is not safe to retry automatically without investigation. The same request will likely produce the same error.
gRPC client libraries automatically retry UNAVAILABLE when configured with a service config. Do not retry INTERNAL.
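That retry behavior is configured by passing a JSON service config when the channel is created. A sketch — the service name, attempt count, and backoff values are illustrative, not recommended defaults:

```python
import json
import grpc

# Retry UNAVAILABLE only; INTERNAL is deliberately absent from
# retryableStatusCodes, so those errors surface immediately.
service_config = json.dumps({
    "methodConfig": [{
        "name": [{"service": "UserService"}],
        "retryPolicy": {
            "maxAttempts": 4,
            "initialBackoff": "0.1s",
            "maxBackoff": "1s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
})

channel = grpc.insecure_channel(
    "localhost:50051",
    options=[("grpc.service_config", service_config)],
)
```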
## Deadline Exceeded Debugging
DEADLINE_EXCEEDED means the deadline set by the caller expired before the RPC completed. The deadline propagates through the call chain — if a client sets a 500ms deadline and calls Service A which calls Service B, Service B also has at most 500ms (minus A's processing time).
```python
# Set a deadline per call
response = stub.GetUser(request, timeout=0.5)  # 500ms
```
Debugging checklist:
- Log the deadline remaining at each service hop
- Add distributed tracing (OpenTelemetry) to identify which service consumed the most time
- Check database query times — a slow DB query is the most common cause
- Distinguish `grpc.StatusCode.DEADLINE_EXCEEDED` from `grpc.StatusCode.CANCELLED` — CANCELLED means the client gave up before the deadline
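To log the deadline remaining at a hop, a server handler can read `context.time_remaining()`. A sketch — the `downstream_timeout` helper and the 50ms safety margin are arbitrary choices for illustration, not part of the gRPC API:

```python
import grpc

def downstream_timeout(remaining, margin=0.05):
    """Reserve a margin so this hop can still serialize its own response."""
    return max(remaining - margin, 0.0)

class UserServicer:
    def GetUser(self, request, context):
        # Seconds until the caller's deadline; None if no deadline was set
        remaining = context.time_remaining()
        if remaining is not None and remaining < 0.05:
            # Fail fast rather than do work the caller will never see
            context.abort(grpc.StatusCode.DEADLINE_EXCEEDED,
                          "insufficient deadline budget")
        # Propagate a reduced deadline to the next hop, e.g.:
        # profile_stub.GetProfile(request,
        #                         timeout=downstream_timeout(remaining))
```

Logging `remaining` at each hop makes it obvious which service consumed the budget when a DEADLINE_EXCEEDED finally fires.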
## Channel and Connection Issues
gRPC uses HTTP/2, which multiplexes many RPCs over a single TCP connection. Connection issues affect all in-flight RPCs simultaneously.
```python
# Observe channel connectivity state; the sync Python API exposes this
# via subscribe() (grpc.aio channels also offer get_state())
channel = grpc.insecure_channel('localhost:50051')

def on_state_change(state):
    print(state)  # IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN

channel.subscribe(on_state_change, try_to_connect=True)
```
`TRANSIENT_FAILURE` — a connection attempt failed and gRPC will retry with backoff. This is normal during startup; it is a problem only if it persists.
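A common readiness pattern is to block until the channel leaves CONNECTING/TRANSIENT_FAILURE and reaches READY, with a bound on how long to wait. A sketch using `grpc.channel_ready_future` (the `wait_ready` wrapper and the timeouts are illustrative):

```python
import grpc

def wait_ready(channel, timeout):
    """Return True if the channel reaches READY within `timeout` seconds."""
    try:
        grpc.channel_ready_future(channel).result(timeout=timeout)
        return True
    except grpc.FutureTimeoutError:
        return False

channel = grpc.insecure_channel("localhost:50051")
if not wait_ready(channel, timeout=2.0):
    # Persistent TRANSIENT_FAILURE: check the address, TLS config,
    # and the server's logs before issuing RPCs
    channel.close()
```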
## grpcurl for Testing
`grpcurl` is the curl equivalent for gRPC:
```shell
# List services (requires server reflection)
grpcurl -plaintext localhost:50051 list

# Describe a service
grpcurl -plaintext localhost:50051 describe UserService

# Make a unary call
grpcurl -plaintext -d '{"id": "42"}' \
  localhost:50051 UserService/GetUser

# With TLS and metadata
grpcurl -H 'Authorization: Bearer TOKEN' \
  -d '{"id": "42"}' api.example.com:443 UserService/GetUser
```
## Distributed Tracing for gRPC
gRPC integrates natively with OpenTelemetry. Add the gRPC instrumentation interceptor to capture every RPC as a trace span:
```python
from opentelemetry.instrumentation.grpc import GrpcInstrumentorClient

GrpcInstrumentorClient().instrument()
# All subsequent stub calls are automatically traced
```
In your tracing UI (Jaeger, Zipkin, Grafana Tempo), filter by `rpc.grpc.status_code != 0` to find failed RPCs quickly.