Stop implementing retry logic in every service. Put it in the gateway instead.
When every team implements retry and fallback differently, you get inconsistent resilience and inconsistent error shapes. Here's the case for moving this logic to the gateway layer — and what it looks like in practice.
- api-management
- workflows
- resilience
- developer-experience
Every microservices codebase has retry logic. Most of it was added after an incident. None of it was designed together. The result is a fleet of services where Service A retries 3 times with exponential backoff, Service B retries 5 times with a fixed 1-second delay, Service C does not retry at all, and none of them return errors in the same format.
This is not a resilience problem. It is an architecture problem. And the fix is moving that logic to the one place that sits in front of all of them: the gateway.
Why distributed retry logic fails
The appeal of implementing retry at the service level is obvious: it is local, controllable, and the team can iterate on it without coordination. The problem is what happens at scale.
Thundering herd. When a backend experiences a transient failure, every upstream service that talks to it retries simultaneously. You get a surge of traffic hitting the recovering backend — exactly when it is least able to handle it. Without coordination (and ideally jitter), retries amplify the failure rather than absorbing it.
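One widely used way to decorrelate retries is exponential backoff with "full jitter": each delay is drawn uniformly between zero and an exponentially growing cap, so retries from many callers spread out instead of arriving in waves. A minimal sketch (the function name and defaults are illustrative, not from any particular library):

```python
import random

def retry_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    The backoff cap doubles each attempt (base * 2^attempt, bounded by cap),
    and the actual delay is drawn uniformly from [0, backoff] so that
    simultaneous retriers do not hit the recovering backend in lockstep.
    """
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0, backoff)
```

The first three retry delays fall somewhere in [0, 0.5], [0, 1.0], and [0, 2.0] seconds respectively, rather than at fixed, synchronised intervals.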
Inconsistent client experience. If Service A retries silently and Service B surfaces the 503 immediately, clients calling through Service A experience a 2-second delay with eventual success, and clients calling through Service B get an error they have to handle themselves. The same underlying failure produces different behaviour depending on which service path the request took.
No visibility across the retry behaviour. When retries are embedded in service code, you have no single place to see retry rates, fallback usage, or the correlation between retry storms and backend recovery. Each service team instruments this independently — if at all.
What gateway-level retry provides
When retry logic lives in the gateway workflow:
Coordination is structural. The gateway is the single caller of the upstream backend. Retry behaviour is defined once, in the workflow, and applied consistently. You tune it in one place. Jitter, backoff intervals, and retry counts are not scattered across service codebases.
Fallback is composable. When retries are exhausted, the next step in the workflow is explicit: serve a cached response, call an alternate backend, or return a standardised error. The fallback is part of the same workflow definition, not a separate code path in a different service.
Visibility is unified. Retry counts, fallback invocations, and final error rates are in one metrics and log pipeline. When a backend has a degraded period, you see the full picture: how many retries were attempted, how many fell back to cache, how many ultimately failed, and what error was returned to the client.
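Because the gateway is configuration rather than service code, the whole policy can be expressed declaratively. A hypothetical workflow definition (field names are illustrative, not any gateway's actual schema) might look like:

```yaml
route: /payments/initiate
steps:
  - call: payment-service
    retry:
      max_attempts: 3
      retry_on: [503, timeout]
      backoff: { initial: 500ms, multiplier: 2, jitter: full }
  - fallback:
      - cache: { ttl: 60s }
      - respond:
          status: 503
          content_type: application/problem+json
          headers: { Retry-After: "30" }
```

Tuning retry counts or backoff then becomes a configuration change in one file, not a coordinated deployment across service teams.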
RFC 9457 error shapes: the problem with inconsistent errors
When retries are exhausted and a request fails, the client receives an error. What that error looks like is as important as what caused it.
RFC 9457 (which obsoletes the earlier RFC 7807) defines a standard format for HTTP API errors: Problem Details. The structure is:
```json
{
  "type": "https://example.com/errors/upstream-unavailable",
  "title": "Upstream service temporarily unavailable",
  "status": 503,
  "detail": "The payment service did not respond after 3 attempts. Retry after 30 seconds.",
  "instance": "/payments/initiate"
}
```
A client that knows it will always receive this format for errors can handle failures consistently, regardless of which service path produced them. It can display a meaningful message, implement appropriate retry behaviour on its own side, and log the instance and type fields for investigation.
When every service returns errors differently — some as {"error": "service unavailable"}, some as plain text 503s, some as empty responses — clients have to handle each case specifically. Integration becomes fragile.
The gateway is the right place to standardise error shapes, because the gateway is where the client boundary is. A custom response node in the gateway workflow returns the same Problem Details format for every backend failure — regardless of what the failing backend returned.
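As a sketch, the gateway-side translation step might look like the following. The helper name, error type URI, and fixed Retry-After value are illustrative assumptions, not a real gateway API:

```python
import json

def to_problem_details(path: str, attempts: int):
    """Translate an exhausted upstream call into a consistent RFC 9457
    response, regardless of what error shape the backend itself returned.

    Returns (status, headers, body) for the gateway to send to the client.
    """
    body = {
        "type": "https://example.com/errors/upstream-unavailable",
        "title": "Upstream service temporarily unavailable",
        "status": 503,
        "detail": f"The backend did not respond after {attempts} attempts.",
        "instance": path,
    }
    headers = {
        "Content-Type": "application/problem+json",  # media type from RFC 9457
        "Retry-After": "30",
    }
    return 503, headers, json.dumps(body)
```

Every backend failure flows through this one function, so clients always see the same shape.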
What gateway-level resilience looks like in practice
A workflow for a resilient API endpoint has these components:
1. The primary backend call. The workflow calls the upstream service. If it returns a success response, the workflow returns it to the client.
2. Retry on transient failure. If the call returns a 503 or times out, the workflow retries — with configurable backoff and jitter. A typical production configuration is 2-3 retries, with exponential backoff starting at 500ms, and jitter to prevent synchronised retry storms.
3. Fallback on exhausted retries. After retries are exhausted, the workflow branches to a fallback: a cached response (if the endpoint is a read), an alternate backend (if one is available), or a degraded response (if a partial answer is better than an error).
4. Standardised error on final failure. If even the fallback fails, the workflow returns a Problem Details error with a consistent format, an appropriate status code, and a Retry-After header that tells the client when it should try again.
5. Metrics at each step. The workflow records which path was taken: primary success, retry success, fallback, or final error. These metrics feed the same Prometheus dashboard as your other API metrics.
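Taken together, the five steps above can be sketched as a single handler. Everything here is a simplified stand-in (the TransientError class, the injected callables, the metrics hook); a real gateway expresses this as workflow configuration rather than code:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a 503 or timeout from the upstream backend."""

def handle_request(call_primary, read_cache, record_metric, max_retries=3):
    """Gateway workflow sketch: primary call, jittered exponential backoff
    on transient failure, cache fallback, then a standardised
    Problem Details error as the last resort."""
    # Steps 1-2: primary call with retries.
    for attempt in range(max_retries + 1):
        try:
            response = call_primary()
            record_metric("primary_success" if attempt == 0 else "retry_success")
            return 200, response
        except TransientError:
            if attempt < max_retries:
                # Full jitter: uniform in [0, 0.5 * 2^attempt] seconds.
                time.sleep(random.uniform(0, 0.5 * (2 ** attempt)))
    # Step 3: retries exhausted, try the fallback.
    cached = read_cache()
    if cached is not None:
        record_metric("fallback_cache")
        return 200, cached
    # Step 4: final failure, return a standardised Problem Details error.
    record_metric("final_error")
    return 503, {
        "type": "https://example.com/errors/upstream-unavailable",
        "title": "Upstream service temporarily unavailable",
        "status": 503,
        "detail": f"No response after {max_retries + 1} attempts and no cached copy.",
        "instance": "/payments/initiate",
    }
```

The `record_metric` calls are step 5: each request emits exactly one path label, which is what makes the primary/retry/fallback/error breakdown visible on a single dashboard.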
When to move this to the gateway
The gateway-level approach is the right choice when:
- Multiple services share the same backend and are implementing retry independently
- Client error handling is fragile because error formats vary by service
- You want to tune retry behaviour without coordinating a change across multiple service deployments
- Fallback to cache or an alternate backend needs to be implemented consistently
The service-level approach still makes sense when:
- Retry behaviour needs to carry application state (e.g., re-authenticating a session)
- The service has business logic that determines whether a failure is retryable
- The retry target is internal to a service, not an external HTTP backend
For the common case — external HTTP call, transient network or backend failure, consistent error response to the client — the gateway is the right layer.
Zerq's workflow designer lets you model retry, fallback, and standardised error responses as a configurable workflow — one place to define, one place to tune, one place to observe. See the error handling and resilience use case or request a demo to walk through your current retry and error handling architecture.