Sometimes we ask a computer to do something, but it fails to do it. Sad, but we’re clever! We’ll just try again and hope for the best! It’s amazing (depressing?) how often this solves the problem.

Industrial-strength retries

But there’s not just one way to do retries. In some situations you want to retry very rapidly, sometimes more slowly, sometimes with the delay increasing on each attempt. Sometimes you want to retry for a long time, sometimes only a short time. We need a retry strategy that’s tailored to the specific failure scenario we’re dealing with.

The Twitter Finagle library, which among other things can act as an HTTP client, is a great example of how sophisticated retry strategies can get. Here’s an idea of some of the retry capabilities it offers:

limit retries to a certain number
backoff retries, with multiple options including exponential
conditionally retry based on the outcome of the request, such as
- when certain exceptions are thrown
- when certain HTTP status codes are received
retry budgets, which can do advanced things like limit the percentage of requests that can be retried, while ensuring a minimum retry rate
retry policies, which define the conditions under which retries should occur; some examples of conditions are:
- when certain exceptions are thrown
- when certain HTTP status codes are received

All in all, a powerful, thoughtfully designed API for retries. So we read the docs, understand all these capabilities and apply them to get the retry strategy we want for our situation.

But then we find we want to retry other kinds of operations that aren’t implemented using Finagle. Perhaps we want to retry JDBC operations, or Redis commands. None of this advanced Finagle retry code is of any use to us when using other libraries, and neither is any of the knowledge we gained about Finagle’s API. For each library, we have to re-learn what retry capabilities are supported and how they are used.

ZIO’s universal retry

ZIO offers us a huge advantage here: it gives a single retry mechanism that is very powerful and works the same way for any operation. Any operation that can be captured as an effect value (which is essentially anything you can ask a computer to do) can be retried using whatever retry strategy you can dream up. Let’s tally up the wins:

only one API to learn, that can handle retries for any operation being performed by any library
if you implement a retry strategy, it can be used in any situation where appropriate, regardless of the library being used
libraries do not need to implement their own retry logic; everyone gets the same general-purpose retry mechanism for free
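
As a rough sketch of what “one API for any operation” can look like, the following assumes ZIO 2.x and Scala 3 on the classpath. The flaky helper and the queryDatabase and readCache functions are hypothetical stand-ins for calls into two unrelated libraries (say, a JDBC driver and a Redis client); only ZIO.attempt, retry and the Schedule constructors are ZIO calls.

```scala
import zio._

object UniversalRetry extends ZIOAppDefault {

  // One retry strategy, defined once: up to 3 retries, 10 ms apart.
  val retryPolicy = Schedule.spaced(10.milliseconds) <* Schedule.recurs(3)

  // Hypothetical helper: builds a function that throws on its first two
  // calls and returns `result` from the third call onwards.
  def flaky[A](name: String, result: A): () => A = {
    var calls = 0
    () => {
      calls += 1
      if (calls < 3) throw new RuntimeException(s"$name failed (call $calls)")
      else result
    }
  }

  // Hypothetical stand-ins for two unrelated libraries.
  val queryDatabase: () => Int = flaky("database", 42)
  val readCache: () => String  = flaky("cache", "cached value")

  // ZIO.attempt captures side-effecting code as an effect that can fail
  // with a Throwable, so the same policy applies to both operations.
  val dbResult: Task[Int]       = ZIO.attempt(queryDatabase()).retry(retryPolicy)
  val cacheResult: Task[String] = ZIO.attempt(readCache()).retry(retryPolicy)

  def run =
    for {
      n <- dbResult
      s <- cacheResult
      _ <- Console.printLine(s"database: $n, cache: $s")
    } yield ()
}
```

Neither stand-in knows anything about retries; the policy is an ordinary value that is applied from the outside.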

What’s the catch? Only that you have to use ZIO and program with effect values (which brings a host of advantages beyond just retries).

The ZIO retry API

The core idea of ZIO is that any effects you want to perform are represented by so-called “workflow” values with the type ZIO[R, E, A], which we can then execute as needed. If you were to execute this workflow and it succeeded, it would produce a value of type A, whereas if you executed the workflow and it failed it would produce a value of type E (the R type is not relevant to this retry discussion, so we’ll be ignoring it).

You can call a method called retry on any ZIO value that can fail (it’s valid to have workflows that cannot fail, where E=Nothing, in which case there’s never a need to retry). The method signature of retry is (with minor simplifications):

def retry[R1 <: R, S](policy: => Schedule[R1, E, S]): ZIO[R1, E, A]

So this:

myImportantWorkflow.retry(myRetrySchedule)

means “if myImportantWorkflow fails, retry it according to the schedule defined by myRetrySchedule”. A schedule is a value which defines which failures will be retried, and when, and how often. So the ZIO way to implement the kind of retry logic we saw offered by Finagle is to use Schedule.
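
To make this concrete, here is a minimal sketch, assuming ZIO 2.x. The flaky workflow is a made-up example that fails on its first two executions; Schedule.recurs(5) is simply a schedule that allows up to five retries.

```scala
import zio._

object RetryBasics extends ZIOAppDefault {

  // A workflow that fails on its first two executions and succeeds on the
  // third. The Ref counts how many times it has been executed.
  def flaky(attempts: Ref[Int]): ZIO[Any, String, Int] =
    attempts.updateAndGet(_ + 1).flatMap { n =>
      if (n < 3) ZIO.fail(s"attempt $n failed")
      else ZIO.succeed(n)
    }

  // Schedule.recurs(5) allows up to 5 retries after the initial attempt.
  val program: ZIO[Any, String, Int] =
    for {
      attempts <- Ref.make(0)
      result   <- flaky(attempts).retry(Schedule.recurs(5))
    } yield result

  def run = program.flatMap(n => Console.printLine(s"succeeded on attempt $n"))
}
```

Here the third execution succeeds, so the remaining retries are never used. If every attempt had failed, the retried workflow would fail with the last error once the schedule was exhausted.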

As the ZIO retry mechanism relies on ZIO’s “failure channel” (represented by the E type) to determine when retries are needed, having thoughtfully designed error types does make applying retries easier to get right. It’s a little disappointing to see some ZIO libraries just using Throwable as the E type; tighten up those error types to make retries and many other things better!

Schedule

Now, the ZIO Schedule API can be a little intimidating at first. It is very general purpose and has a lot of methods and constructors. But it has the super-power of being composable, which means we can build complex retry strategies by composing more basic schedules together. Once you get the hang of it, you find it usually works just as you expect.

Let’s say we’re making a REST request, and we want to retry failures with jittered exponential backoff:

val myRetrySchedule = Schedule.exponential(
  100.milliseconds,
  2d
).jittered

This schedule will keep retrying forever, with longer and longer intervals between retries. Let’s put a limit so it won’t retry more than 10 times:

val myRetrySchedule = Schedule.exponential(
  100.milliseconds,
  2d
).jittered <* 
  Schedule.recurs(10)

The <* operator produces the intersection of two schedules, meaning a schedule that only recurs when both schedules do. Schedules also produce output values; for example, exponential outputs the current duration between recurrences. By using <*, the composed schedule keeps only the outputs from the left schedule and ignores the outputs from the right. We could also use *> to keep only the right side outputs, or && to get both outputs as a tuple.
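
The difference between the three operators is easiest to see in the types. The sketch below assumes ZIO 2.x; the explicit type annotations are there only for illustration, and the names are arbitrary.

```scala
import zio._

object ScheduleOutputs {

  // Outputs the current delay between recurrences.
  val backoff: Schedule[Any, Any, Duration] = Schedule.exponential(100.milliseconds, 2d)

  // Outputs the number of recurrences so far.
  val limit: Schedule[Any, Any, Long] = Schedule.recurs(10)

  // All three recur only while both sides recur; they differ in what they output.
  val keepLeft: Schedule[Any, Any, Duration]         = backoff <* limit
  val keepRight: Schedule[Any, Any, Long]            = backoff *> limit
  val keepBoth: Schedule[Any, Any, (Duration, Long)] = backoff && limit
}
```

For plain retries the output is discarded, so the choice of operator mostly matters when something downstream consumes the schedule’s output.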

Schedules can also accept inputs. In the case of retrying, the input will be whatever the error type E of the workflow is. So far, our schedule does not examine its input, so it will retry for any failure. But let’s say we only want to retry IOExceptions:

val myRetrySchedule = Schedule.exponential(
  100.milliseconds,
  2d
).jittered <*
  Schedule.recurs(10) <*
  Schedule.recurWhile(_.isInstanceOf[IOException])
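
A small sketch of the effect of such a filter, assuming ZIO 2.x. The attemptsFor helper is hypothetical: it runs a workflow that always fails with the given error and counts how often it was executed. The input type of recurWhile is spelled out as Throwable here to keep the example explicit.

```scala
import zio._
import java.io.IOException

object SelectiveRetry extends ZIOAppDefault {

  // Up to 3 retries, but only while the failure is an IOException.
  val ioOnly =
    Schedule.recurs(3) <* Schedule.recurWhile[Throwable](_.isInstanceOf[IOException])

  // Runs a workflow that always fails with `error` and reports how many
  // times it was executed before the schedule gave up.
  def attemptsFor(error: Throwable): UIO[Int] =
    for {
      counter <- Ref.make(0)
      _       <- (counter.update(_ + 1) *> ZIO.fail(error)).retry(ioOnly).ignore
      n       <- counter.get
    } yield n

  def run =
    for {
      io    <- attemptsFor(new IOException("connection reset"))
      other <- attemptsFor(new IllegalStateException("bug"))
      _     <- Console.printLine(s"IOException: $io, IllegalStateException: $other")
    } yield ()
}
```

The IOException case is executed four times (the initial attempt plus three retries), while the IllegalStateException case is executed once and never retried.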

Schedules can be composed in other ways, including union and sequence. They’re also not just for retries: you can use a schedule to repeat successful workflows, or schedule workflows to run at a certain time. For example, to start a background fiber that runs a workflow at 3am every day:

myDailyWorkflow
  .scheduleFork(Schedule.hourOfDay(3))
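
The other compositions and uses mentioned above might look like the following sketch, assuming ZIO 2.x. The value names are arbitrary, and the counter exists only to make the number of executions observable.

```scala
import zio._

object OtherScheduleUses extends ZIOAppDefault {

  // Union: recurs as long as either schedule wants to, using the shorter
  // of the two delays.
  val union = Schedule.spaced(1.second) || Schedule.exponential(100.milliseconds)

  // Sequence: follows the first schedule until it is done, then switches
  // to the second.
  val sequence = Schedule.recurs(3).andThen(Schedule.spaced(10.milliseconds))

  // Repeating a successful workflow: one initial run plus 3 repetitions.
  val program: UIO[Int] =
    for {
      counter <- Ref.make(0)
      _       <- counter.update(_ + 1).repeat(Schedule.recurs(3))
      n       <- counter.get
    } yield n

  def run = program.flatMap(n => Console.printLine(s"executed $n times"))
}
```

Note the symmetry with retry: repeat feeds the schedule the workflow’s successes, whereas retry feeds it the failures.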

ZIO includes a bunch of commonly used retry tools (like exponential) out of the box, but if you need more complex behaviours (for example, the retry budget offered by Finagle), these can be built using Schedule. See the documentation for more details.

Failing so we can retry

It makes total sense that only failures can be retried. But sometimes you might want to retry an operation that the API or library you are using classifies as successful. A common example is when using an HTTP client: the workflow value you get might be something like ZIO[Any, IOException, Response], meaning that it fails if a network error prevents us from receiving the response, but succeeds otherwise.

But what if we want to retry requests where a 503 status code is received? From the point of view of the ZIO workflow, it was successful; it just returned a particular status code in the response. If we called .retry on such a workflow, it would never retry when a 503 response was received.

Fortunately ZIO allows us to deal with this easily, by converting successful outcomes into failures. For this situation, we can use .filterOrFail:

enum HttpFailure:
  case ServiceUnavailable

val sendHttpRequest: ZIO[Any, IOException, Response] = ???

val failOn503: ZIO[Any, IOException | HttpFailure, Response] =
  sendHttpRequest
    .filterOrFail(_.statusCode != 503)(HttpFailure.ServiceUnavailable)

failOn503.retry(myRetrySchedule)
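
Here is a self-contained sketch of the same idea, assuming ZIO 2.x and Scala 3. The Response case class and the simulated sendHttpRequest (which answers 503 twice, then 200) are hypothetical stand-ins for a real HTTP client. One thing to watch: a schedule restricted to IOExceptions with recurWhile, like the last variant shown earlier, would stop as soon as it saw ServiceUnavailable, so this sketch uses a schedule that retries any failure.

```scala
import zio._
import java.io.IOException

// Hypothetical minimal response type, standing in for an HTTP client's own.
final case class Response(statusCode: Int)

enum HttpFailure {
  case ServiceUnavailable
}

object RetryOn503 extends ZIOAppDefault {

  // Simulated request: answers 503 on the first two calls, then 200.
  def sendHttpRequest(calls: Ref[Int]): ZIO[Any, IOException, Response] =
    calls.updateAndGet(_ + 1).map(n => Response(if (n < 3) 503 else 200))

  // Turns a "successful" 503 response into a failure so it can be retried.
  def failOn503(calls: Ref[Int]): ZIO[Any, IOException | HttpFailure, Response] =
    sendHttpRequest(calls)
      .filterOrFail(_.statusCode != 503)(HttpFailure.ServiceUnavailable)

  // Retries any failure (IOException or HttpFailure), at most 5 times.
  val retryPolicy = Schedule.spaced(10.milliseconds) <* Schedule.recurs(5)

  val program: ZIO[Any, IOException | HttpFailure, (Response, Int)] =
    for {
      calls    <- Ref.make(0)
      response <- failOn503(calls).retry(retryPolicy)
      n        <- calls.get
    } yield (response, n)

  def run = program.flatMap { case (response, n) =>
    Console.printLine(s"status ${response.statusCode} after $n requests")
  }
}
```

The first two responses are converted into failures and retried; the third has status 200 and is returned as the result.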

Nice!

Reasons ZIO is awesome: retries