Graceful Degradation in Distributed Systems

What Is Graceful Degradation?

Graceful degradation means your application continues to function — even if in a limited capacity — when one or more of its dependencies are unavailable. The alternative is total failure: one broken microservice takes down your entire product.

The goal is not perfection under failure. The goal is to give users something useful rather than an error page.

Fallback Strategies

1. Cached Data Fallback

Return the last successful response when a live fetch fails:

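import json

import httpx
import redis

cache = redis.Redis()

def get_product(product_id: int) -> dict:
    cache_key = f'product:{product_id}'
    try:
        response = httpx.get(
            f'https://inventory.internal/products/{product_id}',
            timeout=2.0,
        )
        response.raise_for_status()  # Treat 4xx/5xx responses as failures
        data = response.json()
        # The copy expires after 300 s, so it only covers outages shorter than that
        cache.setex(cache_key, 300, json.dumps(data))
        return data
    except (httpx.HTTPError, ValueError):
        # Network error, timeout, error status, or a body that is not valid JSON
        cached = cache.get(cache_key)
        if cached:
            return json.loads(cached)  # Stale but better than nothing
        return {'id': product_id, 'available': False}  # Minimal fallback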

2. Default / Empty State Fallback

Return a safe default when the dependency is unavailable:

Recommendations engine down → show top-selling items
Personalization service down → show generic homepage
Search service down → disable search, show category browse
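
The mappings above share one shape: try the live dependency, and return a precomputed default if it fails. A minimal sketch follows, assuming a hypothetical with_default helper and a static TOP_SELLERS list; neither name comes from a real library, and the two fetch functions are stand-ins for a real recommendations client.

```python
from typing import Callable

# Hypothetical static default: a precomputed list of top-selling product IDs.
TOP_SELLERS = [101, 102, 103]


def with_default(fetch: Callable[[], list], default: list) -> list:
    """Call fetch(); return default if the dependency raises."""
    try:
        return fetch()
    except Exception:
        # Broad catch is deliberate in this sketch: any failure of a
        # non-core dependency maps to the same safe default.
        return default


def broken_recommendations() -> list:
    # Stand-in for a recommendations client whose backend is down.
    raise ConnectionError("recommendations engine unavailable")


def healthy_recommendations() -> list:
    # Stand-in for the same client when the backend is healthy.
    return [7, 8, 9]


degraded = with_default(broken_recommendations, TOP_SELLERS)
normal = with_default(healthy_recommendations, TOP_SELLERS)
```

Here degraded holds the top-seller list and normal holds the live result, so the caller renders the same template either way.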

3. Functional Subset

Identify the core user journey and protect it. Non-essential features can be degraded independently:

Payment service works even if loyalty points service is down
Product pages render even if review service is down
Checkout works even if recommendation service is down
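
One way to express this in code is to let failures in core steps propagate while isolating each non-core step. The sketch below assumes hypothetical charge_payment and award_loyalty_points functions as stand-ins for real service clients; only the structure of checkout is the point.

```python
import logging

logger = logging.getLogger("checkout")


def charge_payment(order_id: int) -> str:
    # Stand-in for the core payment call; failures here must propagate.
    return f"receipt-{order_id}"


def award_loyalty_points(order_id: int) -> int:
    # Stand-in for a non-core service that is currently down.
    raise TimeoutError("loyalty service timed out")


def checkout(order_id: int) -> dict:
    # Core step: no try/except, so a payment failure fails the checkout.
    receipt = charge_payment(order_id)

    # Non-core step: isolate the failure and log that we degraded.
    try:
        points = award_loyalty_points(order_id)
    except Exception:
        logger.warning("loyalty points skipped for order %s", order_id)
        points = None

    return {"receipt": receipt, "loyalty_points": points}


result = checkout(42)
```

The order completes with loyalty_points set to None, which the caller can treat as "points will be credited later" rather than as an error.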

Feature Flags

Feature flags let you disable non-essential features at runtime without deploying code:

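import logging

from django.conf import settings
from django.shortcuts import render

logger = logging.getLogger(__name__)

def render_product_page(request, product_id):
    product = get_product(product_id)
    reviews = []
    if settings.FEATURE_REVIEWS_ENABLED:
        try:
            reviews = get_reviews(product_id)  # Assumed helper, defined elsewhere
        except Exception:
            # Reviews are non-core: degrade, but log so the failure stays visible
            logger.warning('Reviews unavailable for product %s', product_id, exc_info=True)
    return render(request, 'product.html', {
        'product': product,
        'reviews': reviews,
    })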

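When get_reviews() becomes unreliable, flip FEATURE_REVIEWS_ENABLED to False in your feature flag system. No deployment is needed, provided the flag is read from a store that can change at runtime; a value hard-coded in a Django settings module only changes on a redeploy or restart.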
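
For a flag to be flipped without a deployment, the application has to re-read it while running. The following is a minimal sketch of that idea, assuming a hypothetical FlagStore class whose loader stands in for whatever backend holds the flags (a database row, a Redis key, a vendor SDK). It is not the API of any particular feature flag product.

```python
import time
from typing import Callable, Dict, Optional


class FlagStore:
    """Reads flags through a loader and caches them for a short TTL."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, bool]],
        ttl: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._flags: Dict[str, bool] = {}
        self._loaded_at: Optional[float] = None

    def is_enabled(self, name: str, default: bool = False) -> bool:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            try:
                self._flags = self._loader()
                self._loaded_at = now
            except Exception:
                # Flag backend is down: keep the last known values.
                pass
        return self._flags.get(name, default)


# Usage with an in-memory backend and a controllable clock.
backend = {"reviews": True}
now = [0.0]
store = FlagStore(lambda: dict(backend), ttl=5.0, clock=lambda: now[0])

first = store.is_enabled("reviews")   # loaded from the backend
backend["reviews"] = False            # operator flips the flag
now[0] = 1.0
second = store.is_enabled("reviews")  # still within the TTL, cached value
now[0] = 6.0
third = store.is_enabled("reviews")   # TTL elapsed, flag is re-read
```

The short TTL bounds how long a flip takes to reach every process, and keeping the last known values means the flag system itself degrades gracefully.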

Serving Stale Data

Stale data is almost always better than an error. Use stale-while-revalidate in HTTP caching to serve cached responses while fetching fresh ones asynchronously:

Cache-Control: max-age=60, stale-while-revalidate=3600

To keep serving cached content when the origin fails, add stale-if-error, which is also a Cache-Control directive:

Cache-Control: max-age=60, stale-if-error=86400

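This tells CDNs and caches that support the directive to serve stale content for up to 24 hours (86400 seconds) if the origin fails with a 500, 502, 503, or 504 error. Support varies between CDNs and browsers, so confirm that yours honors it.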
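
The same semantics can be approximated inside an application when no HTTP cache sits in front of a dependency. The sketch below assumes a hypothetical in-process StaleIfErrorCache class; the parameter names mirror the header directives, and the fetch callable stands in for the origin call.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleIfErrorCache:
    """Serves fresh entries for max_age seconds, and stale entries for a
    further stale_if_error seconds when the origin call raises."""

    def __init__(
        self,
        max_age: float,
        stale_if_error: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_age = max_age
        self._stale_if_error = stale_if_error
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]  # Still fresh: no origin call needed.
        try:
            value = fetch()
        except Exception:
            # Origin failed: serve the stale copy if it is within the window.
            stale_limit = self._max_age + self._stale_if_error
            if entry is not None and now - entry[0] < stale_limit:
                return entry[1]
            raise
        self._entries[key] = (now, value)
        return value


def origin_ok() -> str:
    return "v1"


def origin_down() -> str:
    raise ConnectionError("origin unavailable")


now = [0.0]
page_cache = StaleIfErrorCache(max_age=60, stale_if_error=86400, clock=lambda: now[0])

fresh = page_cache.get("home", origin_ok)    # fetched and stored
now[0] = 120.0                               # entry is now stale
stale = page_cache.get("home", origin_down)  # origin fails, stale copy served
```

Once the stale window has also elapsed, the error from the origin propagates, so a caller can fall back to a default or surface the failure.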

Read-Only Mode

When your database or write services are degraded, switch to read-only mode:

Disable all write endpoints (return 503 with Retry-After)
Continue serving reads from replicas or cache
Display a user-visible banner: 'Our system is experiencing issues. Browse and search are available; checkout is temporarily unavailable.'
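
The first two steps can be sketched as a gate that runs before request routing. This is a framework-neutral illustration: read_only_gate and its parameters are hypothetical names, and the 120-second Retry-After value is an arbitrary assumption.

```python
from typing import Dict, Optional, Tuple

# HTTP methods that modify state and are blocked in read-only mode.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

Response = Tuple[int, Dict[str, str], str]


def read_only_gate(
    method: str,
    read_only: bool,
    retry_after_seconds: int = 120,
) -> Optional[Response]:
    """Return a 503 response for writes in read-only mode, else None."""
    if read_only and method.upper() in WRITE_METHODS:
        headers = {"Retry-After": str(retry_after_seconds)}
        return (503, headers, "Writes are temporarily unavailable.")
    # None means "not blocked": the request continues to normal handling.
    return None


blocked = read_only_gate("POST", read_only=True)
allowed_read = read_only_gate("GET", read_only=True)
allowed_write = read_only_gate("POST", read_only=False)
```

In a real service the read_only value would come from a runtime flag, so operators can enter and leave read-only mode without a deployment.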

Monitoring Degradation

Track degradation in your metrics:

Fallback rate: fallbacks_served / total_requests — alert when >1%
Feature flag overrides: Log when features are disabled
Stale cache hits: How often are you serving stale data?

Degradation should be temporary and visible, not a silent permanent state.
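
The fallback rate and its alert threshold can be sketched with a pair of counters. FallbackMetrics is a hypothetical in-process class for illustration; a production system would more likely export these counters to its metrics backend and alert there.

```python
class FallbackMetrics:
    """Counts requests and fallbacks, and checks the rate against a threshold."""

    def __init__(self, alert_threshold: float = 0.01) -> None:
        self.total_requests = 0
        self.fallbacks_served = 0
        self.alert_threshold = alert_threshold

    def record(self, used_fallback: bool) -> None:
        self.total_requests += 1
        if used_fallback:
            self.fallbacks_served += 1

    @property
    def fallback_rate(self) -> float:
        # Avoid dividing by zero before any request has been recorded.
        if self.total_requests == 0:
            return 0.0
        return self.fallbacks_served / self.total_requests

    def should_alert(self) -> bool:
        return self.fallback_rate > self.alert_threshold


# 3 fallbacks out of 200 requests is 1.5%, above the 1% threshold.
metrics = FallbackMetrics(alert_threshold=0.01)
for i in range(200):
    metrics.record(used_fallback=(i < 3))

alerting = metrics.should_alert()
```

Counting over a sliding time window rather than over the process lifetime would make the alert clear again once the dependency recovers.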

Summary

Graceful degradation requires deliberate design: identify your core user journey, implement fallbacks for every non-core dependency, use feature flags for quick disables, serve stale data rather than errors, and monitor your fallback rates to know when degradation occurs.