Retry Logic, Fallbacks, and Consistent Error Shapes — All Without Writing Code
Every service team reimplements retry and error handling from scratch. Here's how to move that logic to the gateway layer where it belongs — configured once, applied everywhere, observable in one place.
- workflows
- resilience
- api-management
- developer-experience
Count the retry libraries in your codebase. There is probably one per language runtime, and several per team. Some of them use exponential backoff. Some use fixed intervals. Some retry on 429s. Some do not. None of them return errors in the same format. And none of this was a deliberate decision — it accumulated, one incident at a time, one team at a time.
The reason this happened is not that your teams made bad choices. It is that retry logic, fallback behaviour, and error formatting are not product features — they are infrastructure concerns that ended up in application code because there was no better place to put them. The gateway is that better place.
Here is how it works, what each piece looks like in practice, and what your teams stop writing the day you move this into the gateway.
What teams stop writing
Before the "how," the "what stops":
Retry libraries and retry configuration scattered across services. The gateway handles retries for all upstream calls that go through it. Each service team stops configuring their own retry count, backoff interval, and jitter. There is one configuration, in one place, that every team's upstream calls benefit from.
Custom error response formatting. Each service has its own error format: some return {"error": "..."}, some return {"message": "..."}, some return a status code with no body, some return HTML error pages that leak stack traces. When clients hit the gateway, they receive a consistent error format — regardless of what the upstream returned. Services stop being responsible for error formatting.
Fallback logic in application code. "If the recommendation service is down, return the last cached result" is gateway logic. It does not need to be in the calling service. The gateway workflow handles the fallback — the service team stops writing and maintaining that branch.
Health check polling for upstream availability. When a service team discovers that their upstream is degraded, they write circuit breaker logic to stop sending requests to it. The gateway can handle circuit-breaking as a workflow concern, reducing the load on a degraded upstream without application-layer changes.
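For a sense of scale, here is a minimal sketch (in Python, with illustrative names) of the circuit-breaker code a service team typically writes and maintains, and stops maintaining once the gateway owns the concern:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    rejects calls for a cooldown period, then allows a trial request."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True  # half-open: let one trial request through
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Every service that calls a flaky upstream ends up with some version of this, plus tests for it. When the gateway circuit-breaks as a workflow concern, this class and its state management disappear from application code.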
The retry workflow: what each node does
A retry workflow in the gateway has distinct, configurable nodes. Here is what each one controls:
The upstream call node. Calls the backend service with a configured timeout. If the response is successful (2xx), the workflow terminates and returns the response. If the response is a transient failure (503, 429, or a timeout), the workflow moves to the retry node.
Upstream Call
- URL: https://api-internal/payments
- Timeout: 2000ms
- Retry on: [503, 429, timeout]
- Pass-through on: [4xx except 429]
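As a sketch, the routing decision this node makes can be written as a small function. The status codes mirror the config above; `timeout` is a stand-in for a call that exceeded the 2000ms budget, and statuses the config leaves unspecified (for example 500) are routed to the error node here as an assumption:

```python
def classify_response(status):
    """Route an upstream response per the node config above.
    'timeout' stands in for a call that exceeded the timeout budget."""
    if status == "timeout" or status in (503, 429):
        return "retry"         # transient failure: hand off to the retry node
    if 200 <= status < 300:
        return "success"       # done: return the response to the client
    if 400 <= status < 500:
        return "pass-through"  # client error (429 already handled): return as-is
    return "error"             # anything else goes to the error response node
```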
The retry node. Configures how many retries to attempt and with what timing. Exponential backoff with jitter is the correct default for most production cases: it prevents thundering-herd retry storms while giving the backend time to recover.
Retry
- Max attempts: 3
- Backoff: exponential
- Initial delay: 500ms
- Max delay: 5000ms
- Jitter: true (±20%)
Three retries with exponential backoff and jitter means the second attempt waits ~500ms, the third waits ~1000ms, the fourth waits ~2000ms — each with ±20% randomisation to desynchronise concurrent retries from different clients.
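The schedule can be sketched in a few lines, with values mirroring the node config above:

```python
import random

def backoff_delays(max_retries=3, initial_ms=500, max_ms=5000, jitter=0.2):
    """Delay before each retry: exponential doubling, capped at max_ms,
    with +/-20% jitter to desynchronise concurrent retriers."""
    delays = []
    for attempt in range(max_retries):
        base = min(initial_ms * (2 ** attempt), max_ms)
        spread = base * jitter
        delays.append(base + random.uniform(-spread, spread))
    return delays  # roughly [500, 1000, 2000] before jitter
```

This is exactly the arithmetic the retry node applies; the point is that no service team writes or tunes it individually.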
The fallback node. Executes when all retries are exhausted. A fallback can be:
- A cached response (if the endpoint is a read operation and you have a recent cached value)
- A static default response (a known-good empty state, like {"recommendations": []})
- A call to an alternate backend (a secondary data source or a degraded-mode endpoint)
- A pass-through to the error node
Fallback
- Type: cached (if available) → static default
- Cache key: {client_id}:{path}
- Cache TTL: 300s
- Static default: {"status": "degraded", "data": []}
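A sketch of the chain this node encodes, cached value first, then the static default. The cache interface and key format are illustrative, not any particular gateway's API:

```python
STATIC_DEFAULT = {"status": "degraded", "data": []}

def fallback_response(cache, client_id, path):
    """All retries exhausted: try the cache, then the static default."""
    cache_key = f"{client_id}:{path}"  # mirrors the {client_id}:{path} key above
    cached = cache.get(cache_key)      # assumes entries expire per the 300s TTL
    if cached is not None:
        return cached
    return STATIC_DEFAULT
```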
The error response node. Returns a standardised error when fallback is not possible or appropriate. This node controls the error format every client receives — regardless of what the upstream returned.
What consistent error shapes actually enable
The value of consistent error shapes is not aesthetic. It is operational.
Clients write error handling once. If every API endpoint behind the gateway returns the same error structure, a client library handles all errors with one code path. If error formats differ by service, the client has conditional logic for each integration. That conditional logic is the source of subtle bugs and missed error cases.
Monitoring uses the same fields. When every error has a type field (a URI that identifies the error category), you can aggregate errors across all APIs by type. "How many upstream-timeout errors did we see in the last hour, across all products?" is a single query, not a join across log formats.
Clients know when to retry. A consistent error response includes a Retry-After header when the error is transient, and omits it when the error is final. Clients that know to look for this header can implement intelligent retry on their side — in addition to the gateway's retry — without guessing.
Incidents are faster to diagnose. An error response with instance (the request path), type (the error category), and detail (a human-readable explanation) gives the on-call engineer the context they need without log diving. "Payment initiation failed: upstream service did not respond after 3 attempts. Request ID: req_abc123" is actionable. {"status": 503} is not.
A consistent error format across all your APIs looks like this:
{
"type": "https://errors.yourapi.com/upstream-timeout",
"title": "Upstream service unavailable",
"status": 503,
"detail": "The payment service did not respond after 3 attempts.",
"instance": "/api/payments/initiate",
"requestId": "req_abc123",
"retryAfter": 30
}
The gateway's error response node produces this format. Every API. Every client. Every time.
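A sketch of what "one code path" can look like on the client side, assuming the field names in the example above. The exception classes are illustrative, not part of any library:

```python
class TransientError(Exception):
    """Error the client may retry after a delay."""
    def __init__(self, detail, retry_after):
        super().__init__(detail)
        self.retry_after = retry_after

class FinalError(Exception):
    """Error the client should not retry."""

def handle_error(body):
    """One handler for every API behind the gateway: the presence of
    retryAfter distinguishes transient from final errors."""
    if "retryAfter" in body:
        raise TransientError(body["detail"], body["retryAfter"])
    raise FinalError(f'{body["title"]} ({body["type"]}) on {body["instance"]}')
```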
The fallback design decision
Not every endpoint can fall back gracefully, and designing fallback behaviour requires thinking about what a degraded-but-functional response means for your specific endpoint:
Read endpoints are the easiest case. If you can serve a cached response from 60 seconds ago, most clients are fine with slightly stale data during a backend degradation. The fallback is: serve cache if available, return error if not.
Write endpoints are harder. A payment initiation that fails should not silently "fall back" to anything — the client needs to know the write did not complete. The fallback here is: return a clear, consistent error with a Retry-After that tells the client when to try again.
Idempotent writes are the middle case. A request with an idempotency key that previously succeeded can return the cached success response. The backend may be down, but the operation was already applied — the gateway can return the previous result.
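The three cases can be summarised as one decision function. This is a sketch with illustrative names, not any particular gateway's configuration model:

```python
def choose_fallback(method, cached=None, idempotent_result=None):
    """What a degraded-but-functional response means per endpoint type."""
    if method == "GET":
        # Read: serve stale data if we have it, otherwise a clear error
        return cached if cached is not None else "error"
    if idempotent_result is not None:
        # Idempotent write already applied: return the previous success
        return idempotent_result
    # Plain write: never silently fall back; tell the client when to retry
    return "error-with-retry-after"
```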
Designing this in a gateway workflow forces explicit decisions about fallback semantics for each endpoint type — which is the right conversation to have, and is easier to review as a workflow configuration than as exception-handling code scattered through service implementations.
From "my service handles this" to "the gateway handles this"
The migration path for teams that currently implement retry, fallback, and error formatting in their services:
Step 1: Add the gateway workflow for new integrations. Any new upstream integration gets its retry and error config in the gateway. Existing integrations keep their service-level retry until they are next touched.
Step 2: Remove service-level retry when upstream calls move behind the gateway. When a service's upstream calls are proxied through the gateway, the gateway retry replaces the service retry. The service team removes their retry library configuration.
Step 3: Standardise error formats at the gateway boundary. The gateway's error response node transforms any upstream error into the standard format. Services no longer need to format error responses for clients — only for internal error propagation.
Step 4: Remove fallback code from services as it moves to the gateway. Fallback logic that was in application code — "if recommendation service fails, return empty array" — moves to the gateway workflow. The service removes that branch.
The result: services handle their business logic. The gateway handles resilience and client-facing error formatting. Each team owns less infrastructure code and more product code.
Zerq's workflow designer lets you configure retry, backoff, fallback, and RFC 9457 error responses as configurable workflow nodes — one definition, applied consistently across every API. See the resilience and error handling use case or request a demo to walk through your current retry and error architecture.