Supervision, Failure, and Recovery
By lesson seven, you should already have the right instinct about Akka: it is useful when the system has real concurrency, real state, and real failure boundaries. That last phrase matters more than many teams expect.
In ordinary application code, failure is often treated as a local event. A function throws, the caller catches, a response becomes an error, and the request ends. In actor systems, that mental model is too small. Failures are rarely just local mistakes. They affect state, message flow, retries, timeouts, and the health of neighboring components.
This is why supervision is such a central part of Akka. It is not an optional runtime feature you add later. It is one of the main ways you decide how much damage a failure can do.
In this lesson, we will look at supervision in the practical way production teams need it:
- what supervision actually does
- when restart, stop, and resume are appropriate
- how poisoned messages differ from transient failures
- what recovery means when state is in memory
- how to contain failure without lying to yourself about what the system has lost
The goal is not to memorize APIs. The goal is to learn how to think clearly when a real system starts failing at 2 a.m.
Failure in Akka Is a Design Concern, Not a Logging Concern
Many systems are operationally weak because failure handling begins and ends with logging.
Something throws an exception. The stack trace is printed. A retry happens somewhere. A dashboard turns red. But the important engineering questions remain unanswered:
- Is the component still in a valid state?
- Should it keep processing new messages?
- Has in-memory state been corrupted or partially updated?
- Should the failure be isolated to one child actor or should a larger workflow be stopped?
- Is the problem transient, deterministic, or caused by bad input?
Akka forces these questions into the design because actor systems are built from long-lived components. A failure inside a long-lived actor is not just a failed line of code. It is a question about the future behavior of that component.
Imagine a payments pipeline with these responsibilities:
- one actor validates incoming payment requests
- one actor calls an external gateway
- one actor tracks settlement progress
- one actor updates internal accounting state
If the gateway actor starts failing, you need more than a stack trace. You need to decide whether the actor should restart, whether requests should be retried, whether in-flight messages may be duplicated, and whether upstream actors should keep accepting work.
That is what supervision is for.
What Supervision Actually Means
In Akka, supervision is the policy a parent applies when a child actor fails.
That sounds simple, but it carries a lot of architectural meaning.
Supervision says:
- which actor owns the failed child
- how failure is contained
- whether the child should stop, restart, or resume
- how much local state is discarded during recovery
This is one reason parent-child boundaries matter so much in actor systems. They are not just a hierarchy for organization. They are failure boundaries.
In Akka Typed, supervision is often attached directly when a behavior is created:
import akka.actor.typed.Behavior
import akka.actor.typed.SupervisorStrategy
import akka.actor.typed.scaladsl.Behaviors
import scala.concurrent.duration.*

object PaymentGatewayWorker {
  sealed trait Command
  final case class Authorize(paymentId: String, amount: BigDecimal) extends Command

  final class GatewayUnavailable(message: String) extends RuntimeException(message)
  final class InvalidGatewayResponse(message: String) extends RuntimeException(message)

  def apply(client: GatewayClient): Behavior[Command] =
    // Each supervise layer handles one exception type. The layers nest
    // because onFailure returns a plain Behavior, not another builder.
    Behaviors
      .supervise(
        Behaviors
          .supervise(active(client))
          .onFailure[GatewayUnavailable](
            SupervisorStrategy.restartWithBackoff(500.millis, 10.seconds, 0.2)
          )
      )
      .onFailure[InvalidGatewayResponse](SupervisorStrategy.stop)

  private def active(client: GatewayClient): Behavior[Command] =
    Behaviors.receiveMessage {
      case Authorize(paymentId, amount) =>
        val result = client.authorize(paymentId, amount)
        if (!result.isWellFormed) {
          throw new InvalidGatewayResponse(
            s"Gateway returned malformed data for $paymentId"
          )
        }
        if (!result.accepted) {
          throw new GatewayUnavailable(
            s"Gateway could not authorize $paymentId right now"
          )
        }
        Behaviors.same
    }
}
The specific API matters less than the decision-making behind it.
In this example:
- a temporary gateway outage is treated as a restartable problem
- malformed data is treated as a stop-worthy problem
- the failure policy is visible at actor construction time instead of being hidden in random catch blocks
That visibility is important. Good supervision policy should be readable from the outside.
Restart Is Not a Magic Undo Button
One of the most common beginner mistakes is to think restart means "make the problem go away and continue as before." That is not what restart means.
When an actor restarts, the previous actor instance is discarded and a new one is created from the behavior definition. In practice, that means any in-memory state owned by the old actor is gone unless you rebuild it explicitly.
That is sometimes exactly what you want.
If a child actor holds a cache of temporary request data, or a short-lived connection wrapper, restarting can be a sensible way to return it to a clean baseline.
But if the actor owns business-critical in-memory state, restart can be dangerous unless you know how that state is reconstructed.
Consider a shopping cart actor that tracks pending discounts, item reservations, and checkout state purely in memory. If it crashes and restarts empty, the system may technically be "running" again while the user workflow is now inconsistent.
That is not recovery. That is state loss hidden behind uptime.
So the right question is never just "can I restart this actor?" The right question is:
What does a fresh instance actually mean for the workflow this actor owns?
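The shopping cart scenario can be made concrete with a small sketch. `CartSession` is a hypothetical plain class, not an Akka API, but an actor restart has the same effect: Akka rebuilds the actor from its behavior definition, so the fresh instance starts with none of the old instance's in-memory state.

```scala
// Hypothetical sketch: a "restart" means a fresh instance, and any state
// the old instance held only in memory is simply gone.
final class CartSession {
  private var reservedItems: List[String] = Nil
  def reserve(item: String): Unit = reservedItems = item :: reservedItems
  def reserved: List[String] = reservedItems
}

object RestartDemo {
  def main(args: Array[String]): Unit = {
    var session = new CartSession
    session.reserve("sku-123")
    session.reserve("sku-456")
    println(session.reserved.size) // prints 2: reservations exist before the crash

    session = new CartSession      // the "restart": old instance discarded
    println(session.reserved.size) // prints 0: the workflow state is lost
  }
}
```

If the system reports healthy after that second instance appears, it is reporting uptime, not recovery.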
When restart Is Usually the Right Choice
Restart is strongest when all of the following are true:
- the failure is plausibly transient
- the actor can safely begin again from a clean state
- replaying or resubmitting work is acceptable or already handled elsewhere
- the actor is narrow in responsibility and does not own irreplaceable in-memory business truth
Good candidates include:
- connection-managing actors that can re-establish external links
- parser or adapter actors dealing with flaky upstream systems
- short-lived workers spawned for isolated tasks
- stateless or nearly stateless integration actors
Restart with backoff is often better than immediate restart for unstable dependencies. If a downstream API is down, fast restarts can become a retry storm that damages your own system and the dependency at the same time.
Backoff gives the surrounding platform room to breathe.
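The schedule behind restart-with-backoff can be sketched in plain Scala. The helper below is ours, for illustration only; it mirrors the `minBackoff` and `maxBackoff` parameters of `SupervisorStrategy.restartWithBackoff` but omits the random jitter factor.

```scala
// Illustrative exponential backoff: each consecutive restart doubles the
// delay until it hits the cap, giving an unstable dependency time to recover.
object BackoffSchedule {
  def delayMillis(restartCount: Int, minMillis: Long, maxMillis: Long): Long = {
    val exponential = minMillis.toDouble * math.pow(2.0, restartCount.toDouble)
    math.min(exponential, maxMillis.toDouble).toLong
  }
}

object BackoffDemo {
  def main(args: Array[String]): Unit = {
    val schedule = (0 to 5).map(BackoffSchedule.delayMillis(_, 500, 10000))
    println(schedule.mkString(", ")) // prints 500, 1000, 2000, 4000, 8000, 10000
  }
}
```

The real strategy also multiplies each delay by a random factor so that many failing children do not all retry in the same instant.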
When stop Is the Honest Decision
Stopping can sound harsh, but it is often the most correct choice.
You should seriously consider stop when:
- the input is invalid in a deterministic way
- the actor has entered an unknown or corrupted state
- continuing would risk duplicate side effects or broken business rules
- the failure indicates a programming error that should not be hidden by endless restart loops
This is especially important for poisoned messages.
A poisoned message is not just a message that caused an exception once. It is a message that will likely keep causing failure every time it is processed because the data is structurally bad or violates an invariant the code actually depends on.
Examples:
- a payment event is missing a required currency field
- a downstream integration returns data that breaks a mandatory contract
- a supposedly unique identifier is duplicated in a workflow where duplicates are not safe
- deserialized input is syntactically valid but semantically impossible
Restarting on poisoned messages usually just creates noise. The actor comes back, encounters the same bad input, and fails again. That is not resilience. That is a loop.
In those cases, you usually want one of these outcomes instead:
- reject the message before it reaches the fragile actor
- stop the failing child and surface an operational alert
- route the problematic input to a dead-letter or quarantine workflow
- preserve enough context for human diagnosis and replay
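The first and third options can be combined in a validation gate in front of the fragile actor. This is a plain-Scala sketch under assumed names (`PaymentEvent`, `Quarantine`); in a real system the quarantined events would go to a dead-letter store or an operational topic rather than an in-memory value.

```scala
// Hypothetical validation gate: deterministic bad input is quarantined with
// its context preserved, instead of being retried into a restart loop.
final case class PaymentEvent(paymentId: String, currency: Option[String], amount: BigDecimal)

sealed trait GateResult
final case class Pass(event: PaymentEvent) extends GateResult
final case class Quarantine(event: PaymentEvent, reason: String) extends GateResult

object ValidationGate {
  def check(event: PaymentEvent): GateResult =
    if (event.currency.isEmpty)
      Quarantine(event, s"Missing currency for ${event.paymentId}") // poisoned: fails every time
    else if (event.amount <= 0)
      Quarantine(event, s"Non-positive amount for ${event.paymentId}")
    else
      Pass(event) // safe to hand to an actor that assumes these invariants
}
```

The gate is cheap, and it converts "the worker keeps crashing" into "this specific event is bad, here is why", which is the difference between an alert and an outage investigation.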
Why resume Is Rarely the First Answer
Akka also gives you the option to resume after a failure, but experienced teams use this more carefully than newcomers often expect.
Resume says: keep the existing actor instance and continue processing future messages.
That may sound attractive, but it assumes the actor's state is still valid after the exception. That is a strong assumption.
If the failure happened before any state mutation and the actor remains internally consistent, resume may be acceptable. But many real actor handlers do some state work before failure occurs. Once that happens, resume can mean continuing from a partially updated state that no longer matches reality.
That is why resume should feel like a special-case optimization, not a default resilience strategy.
If you cannot explain precisely why the actor's state remains safe, do not use resume casually.
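The partial-update hazard is easy to demonstrate without any Akka machinery. `SettlementTracker` is a hypothetical stand-in for an actor's internal state; resume would keep exactly this half-applied instance alive.

```scala
// Plain-Scala illustration of why resume is risky: the handler mutates state,
// then throws, leaving the two fields inconsistent with each other.
final class SettlementTracker {
  var recordedTotal: BigDecimal = BigDecimal(0)
  var settledCount: Int = 0

  def recordSettlement(amount: BigDecimal): Unit = {
    recordedTotal += amount // first mutation is applied
    if (amount > 1000)
      throw new IllegalStateException("limit check failed after mutation")
    settledCount += 1       // never reached on the failing path
  }
}

object ResumeDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new SettlementTracker
    try tracker.recordSettlement(BigDecimal(5000))
    catch { case _: IllegalStateException => () }
    // Resuming means continuing with this state: the total includes the
    // failed settlement, the count does not.
    println(s"${tracker.recordedTotal} / ${tracker.settledCount}") // prints 5000 / 0
  }
}
```

Restart would discard this instance entirely; resume carries the inconsistency forward into every future message.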
A More Realistic Example: Gateway Calls With Explicit Recovery Boundaries
Let us look at a more practical structure. Suppose you have a PaymentCoordinator actor that accepts business requests, and it delegates external authorization to a child worker.
That parent-child split is already a supervision decision.
- the coordinator owns the workflow state
- the worker owns the risky external call
- a child crash should not automatically destroy all parent state
Here is a sketch:
import akka.actor.typed.{ActorRef, Behavior, SupervisorStrategy}
import akka.actor.typed.scaladsl.Behaviors
import scala.concurrent.duration.*

object PaymentCoordinator {
  sealed trait Command
  final case class StartPayment(paymentId: String, amount: BigDecimal, replyTo: ActorRef[Result]) extends Command
  // Not private: the GatewayAuthorizer child needs to construct these.
  final case class AuthorizationSucceeded(paymentId: String) extends Command
  final case class AuthorizationFailed(paymentId: String, reason: String) extends Command

  sealed trait Result
  final case class Accepted(paymentId: String) extends Result
  final case class Rejected(paymentId: String, reason: String) extends Result

  private final case class State(inFlight: Map[String, ActorRef[GatewayAuthorizer.Command]])

  def apply(client: GatewayClient): Behavior[Command] =
    running(State(Map.empty), client)

  private def running(state: State, client: GatewayClient): Behavior[Command] =
    Behaviors.receive { (context, message) =>
      message match {
        case StartPayment(paymentId, amount, replyTo) =>
          val worker = context.spawn(
            Behaviors
              .supervise(GatewayAuthorizer(paymentId, amount, client, context.self))
              .onFailure[GatewayAuthorizer.TemporaryGatewayFailure](
                SupervisorStrategy.restartWithBackoff(1.second, 15.seconds, 0.2)
              ),
            s"gateway-$paymentId"
          )
          worker ! GatewayAuthorizer.Run
          replyTo ! Accepted(paymentId)
          running(state.copy(inFlight = state.inFlight + (paymentId -> worker)), client)
        case AuthorizationSucceeded(paymentId) =>
          running(state.copy(inFlight = state.inFlight - paymentId), client)
        case AuthorizationFailed(paymentId, reason) =>
          context.log.warn("Authorization failed for {}: {}", paymentId, reason)
          running(state.copy(inFlight = state.inFlight - paymentId), client)
      }
    }
}

object GatewayAuthorizer {
  sealed trait Command
  case object Run extends Command

  final class TemporaryGatewayFailure(message: String) extends RuntimeException(message)

  def apply(
      paymentId: String,
      amount: BigDecimal,
      client: GatewayClient,
      coordinator: ActorRef[PaymentCoordinator.Command]
  ): Behavior[Command] =
    // Note: a restarted worker starts fresh and waits for a new Run message;
    // a fuller version would re-trigger itself from Behaviors.setup.
    Behaviors.receiveMessage {
      case Run =>
        val result = client.authorize(paymentId, amount)
        if (result.timedOut) {
          throw new TemporaryGatewayFailure(s"Timeout for $paymentId")
        } else if (result.accepted) {
          coordinator ! PaymentCoordinator.AuthorizationSucceeded(paymentId)
          Behaviors.stopped
        } else {
          coordinator ! PaymentCoordinator.AuthorizationFailed(paymentId, "Authorization rejected")
          Behaviors.stopped
        }
    }
}
This example is deliberately incomplete, but the design lessons are real.
The Parent Owns Business Progress
The coordinator decides what counts as in-flight work. It can track what started, what finished, and what still needs attention.
That means a child restart does not erase the parent's broader understanding of the workflow.
The Child Owns a Narrow Risky Operation
The gateway worker does one risky thing: call the external gateway and report back. That makes restart semantics easier to reason about.
If the child fails, you are not restarting a giant actor with many unrelated responsibilities. You are restarting a small operation whose failure mode is understood.
Recovery Is Explicit, Not Implied
Notice that recovery here is not "Akka will fix payments for us." Recovery means:
- the child may be restarted with backoff for transient gateway issues
- the coordinator keeps workflow state separate from the flaky integration
- terminal failures are reported clearly instead of being retried forever
That is the mindset to keep.
Poisoned Messages and the Cost of Pretending
Operationally, poisoned messages are some of the most expensive failure patterns because they tempt teams into fake resilience.
The actor crashes. It restarts. The queue drains a little. The dashboards look active. But one or more messages are permanently unprocessable, and the system has no honest path for surfacing that fact.
This is how systems end up with silent data loss, infinite retry loops, or mysterious backlog growth.
The engineering response should be explicit:
- validate early
- classify deterministic failures separately from transient ones
- preserve identifiers and context for investigation
- route bad work somewhere observable instead of letting it disappear into restart cycles
In practice, this often means doing a first layer of validation before spawning or messaging actors that assume stronger invariants. It can also mean recording failure metadata in a store or publishing an operational event that downstream monitoring can act on.
Resilience is not the ability to hide bad input. It is the ability to fail without corrupting the rest of the system.
Recovery Means More Than Restarting a Process
Recovery in production usually combines multiple layers:
- supervision policy inside the actor tree
- time-based retry or backoff policy
- durable state recovery if persistence is involved
- replay or re-drive workflow for failed messages
- operational visibility for humans and automation
If your system depends only on actor restart and has no answer for lost in-memory state, duplicate commands, or replay of side effects, then your recovery story is incomplete.
This becomes especially important in systems with external side effects.
Suppose an actor sends a charge request, crashes before recording success, and then restarts. What happens next?
Possible outcomes include:
- the payment gets charged twice
- the payment succeeded remotely but looks failed locally
- a retry occurs with no idempotency protection
- the workflow stalls because the actor no longer knows what happened
Akka supervision helps contain failure, but it does not solve idempotency or business reconciliation for you. Those have to be designed deliberately.
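A minimal sketch of the idempotency side, assuming a key-to-result store. Here the store is an in-memory map for illustration; in production it would be durable and shared, so the key survives a crash.

```scala
// Hedged sketch: a restart may re-send the charge command, so the charge path
// itself deduplicates on an idempotency key. The store here is an assumption.
object IdempotentCharge {
  private val completed = scala.collection.mutable.Map.empty[String, String]

  // Runs the side effect at most once per key; later attempts reuse the result.
  def charge(idempotencyKey: String, performCharge: () => String): String =
    completed.getOrElseUpdate(idempotencyKey, performCharge())
}
```

With something like this in place, the restart-after-crash scenario above degrades from "charged twice" to "charged once, result replayed", provided the key and result actually outlive the process.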
Choosing the Right Strategy in Practice
A useful practical heuristic is to ask four questions for any actor that may fail.
1. What state does this actor own?
If the state is disposable or easily rebuilt, restart is safer.
If the state is business-critical and only lives in memory, restart may conceal data loss.
2. Is the failure transient or deterministic?
Transient failures often justify backoff and retry.
Deterministic failures usually need rejection, quarantine, or stop.
3. What side effects may already have happened?
If the actor may have partially completed external work, recovery must account for duplication and reconciliation.
4. What is the blast radius if this child dies?
If a small child failure currently forces a large parent workflow to collapse, the actor boundaries may be wrong.
These questions are often more valuable than memorizing every supervision option in the API.
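The first two questions can even be collapsed into a rough decision helper. The enum and the returned strings are ours, purely illustrative; real policies also weigh side effects and blast radius case by case.

```scala
// Illustrative mapping from the questions above to a default strategy.
enum FailureKind {
  case Transient, Deterministic
}

object StrategyHint {
  def choose(kind: FailureKind, stateIsDisposable: Boolean): String =
    (kind, stateIsDisposable) match {
      case (FailureKind.Transient, true)  => "restart with backoff"
      case (FailureKind.Transient, false) => "restart only if the state can be rebuilt"
      case (FailureKind.Deterministic, _) => "stop and quarantine the input"
    }
}
```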
Common Supervision Mistakes
There are a few mistakes that appear repeatedly in real Akka codebases.
Treating Every Exception as Restartable
This hides programming errors and invalid input behind activity. Some failures should stop the component and demand attention.
Putting Too Much State in One Actor
Large actors are harder to supervise because restart semantics become muddy. If one actor owns too many responsibilities, you cannot recover one concern without risking all the others.
Ignoring What Restart Does to State
If a restart recreates the actor from scratch, be honest about what was lost. Do not call it recovery if it merely reset the process.
Using Resume Without Strong Invariants
Resume can keep throughput moving, but it can also keep corrupted state alive. It is more dangerous than it first appears.
Forgetting Observability
A supervision strategy without clear logs, metrics, or alerts is incomplete. Operators need to know when restart loops, dead letters, or repeated stop events are happening.
Testing Failure Behavior Matters as Much as Testing Success
It is easy to test the happy path and declare the actor correct. That is not enough for systems where failure handling is part of the design.
You should test at least these questions:
- does a transient child failure actually trigger the intended restart behavior?
- does the parent preserve the right state after child failure?
- are deterministic failures surfaced instead of retried forever?
- do restart policies create duplicate side effects?
- does the system remain observable when failures repeat?
If your production safety depends on supervision, then supervision behavior is application logic, not just framework plumbing.
Summary
Akka supervision is best understood as disciplined failure containment.
It gives you a structured answer to questions that thread-based systems often leave vague: who owns the failing component, what happens to its state, whether it should restart or stop, and how much damage should spread to the rest of the workflow.
The hard part is not choosing an API method. The hard part is choosing honest semantics.
- restart is useful when the actor can safely return to a clean baseline
- stop is often the right answer for poisoned messages or broken invariants
- resume is only safe when you can prove the state is still valid
- recovery is incomplete if it ignores business state, idempotency, and observability
If you keep those principles in mind, supervision becomes more than a runtime feature. It becomes one of the main tools you use to design actor systems that fail in controlled, understandable ways.
In the next lesson, we will look at the ask pattern, fire-and-forget messaging, and the workflow boundaries that decide when actors should wait for replies and when they should keep work asynchronous.