Why Fallbacks Matter
The choice between returning an error and returning a degraded-but-useful response is a product decision with real user-experience consequences. Consider two failure modes for a product recommendation engine:
- Without fallback: product page shows a 500 error. User bounces.
- With fallback: product page shows bestsellers from last hour's cache. User may still purchase.
A 503 Service Unavailable is honest but expensive. A degraded response that acknowledges the limitation is often better for both users and business metrics.
The key principle: fail partial, not total. Identify which parts of your service are critical (checkout, authentication) versus nice-to-have (recommendations, analytics, personalization), then design fallbacks only for the non-critical parts. Never trade correctness for availability in financial or safety-critical paths.
Cached Response Fallback
The most common and effective fallback: serve the last known-good response when the upstream is unavailable.
stale-while-revalidate
The stale-while-revalidate Cache-Control directive lets browsers (and CDNs) serve stale content immediately while fetching a fresh copy in the background:
```http
Cache-Control: max-age=60, stale-while-revalidate=600, stale-if-error=3600
```
- `max-age=60`: serve fresh for 60 seconds.
- `stale-while-revalidate=600`: serve stale for up to 10 minutes while revalidating in the background. The user sees no delay.
- `stale-if-error=3600`: if revalidation fails with a 5xx, serve stale for up to 1 hour rather than returning an error to the user.
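On the server side, the directive is just a response header. A minimal sketch of composing it, assuming the helper name `build_cache_control` (illustrative, not a library API):

```python
def build_cache_control(max_age: int, swr: int, sie: int) -> str:
    """Compose a Cache-Control value with stale-serving directives."""
    return (
        f"max-age={max_age}, "
        f"stale-while-revalidate={swr}, "
        f"stale-if-error={sie}"
    )

# Attach to any response object, e.g. resp['Cache-Control'] = header in Django
header = build_cache_control(max_age=60, swr=600, sie=3600)
```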
CDNs implement the same idea at the edge. Cloudflare's Always Online setting, for example, serves a cached copy of a page when your origin is unreachable, so users see stale content instead of an error page.
Application-Level Cache Fallback
```python
from django.core.cache import cache

CACHE_KEY = 'product_recommendations:{user_id}'
FALLBACK_KEY = 'product_recommendations:bestsellers'  # Always kept warm

async def get_recommendations(user_id: str) -> list[dict]:
    cache_key = CACHE_KEY.format(user_id=user_id)

    # Try fresh personalized recommendations from the upstream service
    try:
        recs = await fetch_recommendations(user_id, timeout=1.0)
        # Cache with a long TTL so stale data is available as a fallback
        await cache.aset(cache_key, recs, timeout=86400)  # 24h stale cache
        return recs
    except (TimeoutError, ConnectionError):
        pass

    # Fallback 1: stale personalized cache (up to 24h old)
    stale = await cache.aget(cache_key)
    if stale:
        return stale

    # Fallback 2: generic bestsellers (always available)
    return await cache.aget(FALLBACK_KEY, default=[])
```
The layered approach — fresh → stale personalized → generic — maximizes the chance of returning something useful while degrading gracefully at each level.
Cache Aside vs Cache on Write
For fallback caches to be effective, they must be populated before the failure:
- Cache-aside (lazy): populate on first cache miss. If the first request after a cache eviction coincides with a service failure, there is nothing to fall back to.
- Cache-on-write (eager): populate the fallback cache every time data is successfully written. The fallback is always current up to the last successful write, even if the cache was cold when the failure started.
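The cache-on-write pattern can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dict stands in for the primary store and the Redis fallback cache, and that `save_product` is a hypothetical write path rather than a real API:

```python
primary_db: dict[str, dict] = {}
fallback_cache: dict[str, dict] = {}

def save_product(product: dict) -> None:
    """Write to the primary store and eagerly refresh the fallback copy."""
    primary_db[product['id']] = product       # Primary write
    fallback_cache[product['id']] = product   # Fallback is warm before any failure

save_product({'id': 'p1', 'name': 'Widget'})
```

Because the fallback is refreshed on the write path rather than the read path, it is populated before any failure can begin.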
Default Value Fallback
When no cached data is available, static defaults can keep the user experience functional:
```python
from config.feature_defaults import FEATURE_DEFAULTS

async def get_feature_config(user_id: str) -> dict:
    """Fetch dynamic feature configuration with static default fallback."""
    try:
        return await feature_service.get_config(
            user_id=user_id,
            timeout=0.5,  # Hard timeout: 500ms
        )
    except Exception:
        # Static default: safe, conservative feature flags
        return FEATURE_DEFAULTS
```
```python
# config/feature_defaults.py — the 'safe' state of every feature flag
FEATURE_DEFAULTS: dict = {
    'new_checkout_flow': False,   # Off by default — safe to disable
    'ai_recommendations': False,  # Off — reduces load during degradation
    'dark_mode': True,            # Harmless UI preference
    'max_cart_items': 50,         # Conservative limit
    'currency': 'USD',            # Safe default currency
}
```
Design defaults to be conservative — they should represent the safe, lowest-risk behavior, not the most feature-rich state. A new checkout flow that is accidentally enabled for all users during a feature flag service outage may cause more harm than the outage itself.
Degraded Mode
Degraded mode is a deliberate operational state where the service provides reduced functionality rather than failing completely.
Read-Only Mode
When the primary database is unavailable but read replicas are healthy, switch to read-only mode:
```python
# Django middleware: enforce read-only mode
from django.core.cache import cache
from django.http import HttpRequest, HttpResponse

class ReadOnlyModeMiddleware:
    SAFE_METHODS = frozenset(['GET', 'HEAD', 'OPTIONS'])

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request: HttpRequest) -> HttpResponse:
        if self._is_read_only_mode() and request.method not in self.SAFE_METHODS:
            return HttpResponse(
                'Service is in read-only mode. Please try again later.',
                status=503,
                headers={'Retry-After': '120'},
            )
        return self.get_response(request)

    def _is_read_only_mode(self) -> bool:
        return cache.get('system:read_only_mode', False)
```
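An operator or automated health check flips the flag the middleware reads. A minimal sketch, assuming a plain dict as a stand-in for the shared Django cache (in production these would be `cache.set` / `cache.delete` calls against the same `system:read_only_mode` key):

```python
flag_store: dict[str, bool] = {}  # Stand-in for the shared cache backend

def enter_read_only_mode() -> None:
    """Set the flag; every app server's middleware starts rejecting writes."""
    flag_store['system:read_only_mode'] = True

def exit_read_only_mode() -> None:
    """Clear the flag to resume normal operation."""
    flag_store.pop('system:read_only_mode', None)

def is_read_only() -> bool:
    return flag_store.get('system:read_only_mode', False)
```

Storing the flag in the shared cache means a single toggle takes effect across every app server without a deploy.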
Feature Shedding
Disable non-critical features to reduce load and protect core functionality:
```python
class FeatureShedder:
    """Disable features in priority order as system load increases."""

    SHED_ORDER = [
        'ai_recommendations',   # First to shed: expensive, non-critical
        'product_suggestions',  # Second: slightly cheaper
        'user_activity_feed',   # Third: only affects engagement
        # Core: checkout, auth, product browsing — never shed
    ]

    def __init__(self, load_threshold: float = 0.85):
        self.load_threshold = load_threshold

    def is_enabled(self, feature: str) -> bool:
        current_load = self._get_load_factor()
        if current_load < self.load_threshold:
            return True  # All features enabled
        # Shed one more feature for every 5% of load above the threshold
        shed_count = int((current_load - self.load_threshold) / 0.05) + 1
        shed_features = set(self.SHED_ORDER[:shed_count])
        return feature not in shed_features

    def _get_load_factor(self) -> float:
        """Return current load as a 0.0–1.0 factor (CPU, queue depth, etc.)."""
        raise NotImplementedError  # Wire to your metrics source
Implementation Patterns
Decorator Pattern
Apply fallback logic uniformly across service calls without modifying each call site:
```python
import logging
from functools import wraps
from typing import Any, Awaitable, Callable, TypeVar

T = TypeVar('T')

logger = logging.getLogger(__name__)

def with_fallback(
    fallback: Callable[[], T],
    exceptions: tuple[type[Exception], ...] = (Exception,),
):
    """Decorator: return fallback() if the coroutine raises a listed exception."""
    def decorator(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> T:
            try:
                return await func(*args, **kwargs)
            except exceptions as exc:
                logger.warning(
                    'Fallback triggered',
                    extra={'function': func.__name__, 'error': str(exc)},
                )
                return fallback()
        return wrapper
    return decorator

@with_fallback(fallback=lambda: [], exceptions=(TimeoutError, ConnectionError))
async def get_related_products(product_id: str) -> list[dict]:
    return await recommendations_service.get_related(product_id, timeout=1.0)
```
Resilience Library Integration
Libraries like Resilience4j (Java), Polly (.NET), and Failsafe (Java) provide fallback as a first-class concept composable with circuit breakers and retries:
```java
// Resilience4j: fallback composed with a circuit breaker
CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("recommendations");

Supplier<List<Product>> decorated = Decorators
    .ofSupplier(() -> recommendationService.get(userId))
    .withCircuitBreaker(circuitBreaker)
    .withFallback(
        List.of(CallNotPermittedException.class,  // Circuit open
                TimeoutException.class),          // Timeout
        throwable -> bestsellersCache.get())      // Static fallback function
    .decorate();
```
Communicating Degraded State to Users
When serving degraded content, be transparent:
```python
from django.http import JsonResponse

# Include a degradation signal in the API response
def recommendations_view(request):
    recs, is_degraded = get_recommendations_with_fallback(request.user.id)
    return JsonResponse({
        'recommendations': recs,
        'degraded': is_degraded,  # Client can show a notice
        'degraded_reason': 'cached' if is_degraded else None,
    })
```
The client can then show an unobtrusive banner: *'Showing popular items — personalized recommendations will return shortly.'* This is more honest than silently serving stale data, and sets user expectations correctly.