Supervision, Failure, and Recovery
By lesson seven, you should already have the right instinct about Akka: it is useful when the system has real concurrency, real state, and real failure boundaries. That last phrase matters more than many teams expect.
In ordinary application code, failure is often treated as a local event. A function throws, the caller catches, a response becomes an error, and the request ends. In actor systems, that mental model is too small. Failures are rarely just local mistakes. They affect state, message flow, retries, timeouts, and the health of neighboring components.
This is why supervision is such a central part of Akka. It is not an optional runtime feature you add later. It is one of the main ways you decide how much damage a failure can do.
In this lesson, we will look at supervision in the practical way production teams need it:
- what supervision actually does
- when restart, stop, and resume are appropriate
- how poisoned messages differ from transient failures
- what recovery means when state is in memory
- how to contain failure without lying to yourself about what the system has lost
The goal is not to memorize APIs. The goal is to learn how to think clearly when a real system starts failing at 2 a.m.
Failure in Akka Is a Design Concern, Not a Logging Concern
Many systems are operationally weak because failure handling begins and ends with logging.
Something throws an exception. The stack trace is printed. A retry happens somewhere. A dashboard turns red. But the important engineering questions remain unanswered:
- Is the component still in a valid state?
- Should it keep processing new messages?
- Has in-memory state been corrupted or partially updated?
- Should the failure be isolated to one child actor or should a larger workflow be stopped?
- Is the problem transient, deterministic, or caused by bad input?
Akka forces these questions into the design because actor systems are built from long-lived components. A failure inside a long-lived actor is not just a failed line of code. It is a question about the future behavior of that component.
Imagine a payments pipeline with these responsibilities:
- one actor validates incoming payment requests
- one actor calls an external gateway
- one actor tracks settlement progress
- one actor updates internal accounting state
If the gateway actor starts failing, you need more than a stack trace. You need to decide whether the actor should restart, whether requests should be retried, whether in-flight messages may be duplicated, and whether upstream actors should keep accepting work.
That is what supervision is for.
What Supervision Actually Means
In Akka, supervision is the policy a parent applies when a child actor fails.
That sounds simple, but it carries a lot of architectural meaning.
Supervision says:
- which actor owns the failed child
- how failure is contained
- whether the child should stop, restart, or resume
- how much local state is discarded during recovery
This is one reason parent-child boundaries matter so much in actor systems. They are not just a hierarchy for organization. They are failure boundaries.
In Akka Typed, supervision is often attached directly when a behavior is created:
import akka.actor.typed.Behavior
import akka.actor.typed.SupervisorStrategy
import akka.actor.typed.scaladsl.Behaviors
import scala.concurrent.duration.*

object PaymentGatewayWorker {
  sealed trait Command
  final case class Authorize(paymentId: String, amount: BigDecimal) extends Command

  final class GatewayUnavailable(message: String) extends RuntimeException(message)
  final class InvalidGatewayResponse(message: String) extends RuntimeException(message)

  def apply(client: GatewayClient): Behavior[Command] =
    // Each supervise layer handles one exception type. The layers nest
    // because onFailure returns a plain Behavior, not another builder.
    Behaviors
      .supervise(
        Behaviors
          .supervise(active(client))
          .onFailure[GatewayUnavailable](
            SupervisorStrategy.restartWithBackoff(500.millis, 10.seconds, 0.2)
          )
      )
      .onFailure[InvalidGatewayResponse](SupervisorStrategy.stop)

  private def active(client: GatewayClient): Behavior[Command] =
    Behaviors.receiveMessage {
      case Authorize(paymentId, amount) =>
        val result = client.authorize(paymentId, amount)
        if (!result.isWellFormed) {
          throw new InvalidGatewayResponse(
            s"Gateway returned malformed data for $paymentId"
          )
        }
        if (!result.accepted) {
          throw new GatewayUnavailable(
            s"Gateway could not authorize $paymentId right now"
          )
        }
        Behaviors.same
    }
}
The specific API matters less than the decision-making behind it.
In this example:
- a temporary gateway outage is treated as a restartable problem
- malformed data is treated as a stop-worthy problem
- the failure policy is visible at actor construction time instead of being hidden in random catch blocks
That visibility is important. Good supervision policy should be readable from the outside.
Restart Is Not a Magic Undo Button
One of the most common beginner mistakes is to think restart means "make the problem go away and continue as before." That is not what restart means.
When an actor restarts, the previous actor instance is discarded and a new one is created from the behavior definition. In practice, that means any in-memory state owned by the old actor is gone unless you rebuild it explicitly.
That is sometimes exactly what you want.
If a child actor holds a cache of temporary request data, or a short-lived connection wrapper, restarting can be a sensible way to return it to a clean baseline.
But if the actor owns business-critical in-memory state, restart can be dangerous unless you know how that state is reconstructed.
Consider a shopping cart actor that tracks pending discounts, item reservations, and checkout state purely in memory. If it crashes and restarts empty, the system may technically be "running" again while the user workflow is now inconsistent.
That is not recovery. That is state loss hidden behind uptime.
So the right question is never just "can I restart this actor?" The right question is:
What does a fresh instance actually mean for the workflow this actor owns?
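The shopping cart scenario can be made concrete with a small sketch. `CartSession` is a hypothetical plain class, not an Akka API, but an actor restart has the same effect: Akka rebuilds the actor from its behavior definition, so the fresh instance starts with none of the old instance's in-memory state.

```scala
// Hypothetical sketch: a "restart" means a fresh instance, and any state
// the old instance held only in memory is simply gone.
final class CartSession {
  private var reservedItems: List[String] = Nil
  def reserve(item: String): Unit = reservedItems = item :: reservedItems
  def reserved: List[String] = reservedItems
}

object RestartDemo {
  def main(args: Array[String]): Unit = {
    var session = new CartSession
    session.reserve("sku-123")
    session.reserve("sku-456")
    println(session.reserved.size) // prints 2: reservations exist before the crash

    session = new CartSession      // the "restart": old instance discarded
    println(session.reserved.size) // prints 0: the workflow state is lost
  }
}
```

If the system reports healthy after that second instance appears, it is reporting uptime, not recovery.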
When restart Is Usually the Right Choice
Restart is strongest when all of the following are true:
- the failure is plausibly transient
- the actor can safely begin again from a clean state
- replaying or resubmitting work is acceptable or already handled elsewhere
- the actor is narrow in responsibility and does not own irreplaceable in-memory business truth
Good candidates include:
- connection-managing actors that can re-establish external links
- parser or adapter actors dealing with flaky upstream systems
- short-lived workers spawned for isolated tasks
- stateless or nearly stateless integration actors
Restart with backoff is often better than immediate restart for unstable dependencies. If a downstream API is down, fast restarts can become a retry storm that damages your own system and the dependency at the same time.
Backoff gives the surrounding platform room to breathe.
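The schedule behind restart-with-backoff can be sketched in plain Scala. The helper below is ours, for illustration only; it mirrors the `minBackoff` and `maxBackoff` parameters of `SupervisorStrategy.restartWithBackoff` but omits the random jitter factor.

```scala
// Illustrative exponential backoff: each consecutive restart doubles the
// delay until it hits the cap, giving an unstable dependency time to recover.
object BackoffSchedule {
  def delayMillis(restartCount: Int, minMillis: Long, maxMillis: Long): Long = {
    val exponential = minMillis.toDouble * math.pow(2.0, restartCount.toDouble)
    math.min(exponential, maxMillis.toDouble).toLong
  }
}

object BackoffDemo {
  def main(args: Array[String]): Unit = {
    val schedule = (0 to 5).map(BackoffSchedule.delayMillis(_, 500, 10000))
    println(schedule.mkString(", ")) // prints 500, 1000, 2000, 4000, 8000, 10000
  }
}
```

The real strategy also multiplies each delay by a random factor so that many failing children do not all retry in the same instant.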
When stop Is the Honest Decision
Stopping can sound harsh, but it is often the most correct choice.
You should seriously consider stop when:
- the input is invalid in a deterministic way
- the actor has entered an unknown or corrupted state
- continuing would risk duplicate side effects or broken business rules
- the failure indicates a programming error that should not be hidden by endless restart loops
This is especially important for poisoned messages.
A poisoned message is not just a message that caused an exception once. It is a message that will likely keep causing failure every time it is processed because the data is structurally bad or violates an invariant the code actually depends on.
Examples:
- a payment event is missing a required currency field
- a downstream integration returns data that breaks a mandatory contract
- a supposedly unique identifier is duplicated in a workflow where duplicates are not safe
- deserialized input is syntactically valid but semantically impossible
Restarting on poisoned messages usually just creates noise. The actor comes back, encounters the same bad input, and fails again. That is not resilience. That is a loop.
In those cases, you usually want one of these outcomes instead:
- reject the message before it reaches the fragile actor
- stop the failing child and surface an operational alert
- route the problematic input to a dead-letter or quarantine workflow
- preserve enough context for human diagnosis and replay
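The first and third options can be combined in a validation gate in front of the fragile actor. This is a plain-Scala sketch under assumed names (`PaymentEvent`, `Quarantine`); in a real system the quarantined events would go to a dead-letter store or an operational topic rather than an in-memory value.

```scala
// Hypothetical validation gate: deterministic bad input is quarantined with
// its context preserved, instead of being retried into a restart loop.
final case class PaymentEvent(paymentId: String, currency: Option[String], amount: BigDecimal)

sealed trait GateResult
final case class Pass(event: PaymentEvent) extends GateResult
final case class Quarantine(event: PaymentEvent, reason: String) extends GateResult

object ValidationGate {
  def check(event: PaymentEvent): GateResult =
    if (event.currency.isEmpty)
      Quarantine(event, s"Missing currency for ${event.paymentId}") // poisoned: fails every time
    else if (event.amount <= 0)
      Quarantine(event, s"Non-positive amount for ${event.paymentId}")
    else
      Pass(event) // safe to hand to an actor that assumes these invariants
}
```

The gate is cheap, and it converts "the worker keeps crashing" into "this specific event is bad, here is why", which is the difference between an alert and an outage investigation.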
Why resume Is Rarely the First Answer
Akka also gives you the option to resume after a failure, but experienced teams use this more carefully than newcomers often expect.
Resume says: keep the existing actor instance and continue processing future messages.
That may sound attractive, but it assumes the actor's state is still valid after the exception. That is a strong assumption.
If the failure happened before any state mutation and the actor remains internally consistent, resume may be acceptable. But many real actor handlers do some state work before failure occurs. Once that happens, resume can mean continuing from a partially updated state that no longer matches reality.
That is why resume should feel like a special-case optimization, not a default resilience strategy.
If you cannot explain precisely why the actor's state remains safe, do not use resume casually.
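The partial-update hazard is easy to demonstrate without any Akka machinery. `SettlementTracker` is a hypothetical stand-in for an actor's internal state; resume would keep exactly this half-applied instance alive.

```scala
// Plain-Scala illustration of why resume is risky: the handler mutates state,
// then throws, leaving the two fields inconsistent with each other.
final class SettlementTracker {
  var recordedTotal: BigDecimal = BigDecimal(0)
  var settledCount: Int = 0

  def recordSettlement(amount: BigDecimal): Unit = {
    recordedTotal += amount // first mutation is applied
    if (amount > 1000)
      throw new IllegalStateException("limit check failed after mutation")
    settledCount += 1       // never reached on the failing path
  }
}

object ResumeDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new SettlementTracker
    try tracker.recordSettlement(BigDecimal(5000))
    catch { case _: IllegalStateException => () }
    // Resuming means continuing with this state: the total includes the
    // failed settlement, the count does not.
    println(s"${tracker.recordedTotal} / ${tracker.settledCount}") // prints 5000 / 0
  }
}
```

Restart would discard this instance entirely; resume carries the inconsistency forward into every future message.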
A More Realistic Example: Gateway Calls With Explicit Recovery Boundaries
Let us look at a more practical structure. Suppose you have a PaymentCoordinator actor that accepts business requests, and it delegates external authorization to a child worker.
That parent-child split is already a supervision decision.
- the coordinator owns the workflow state
- the worker owns the risky external call
- a child crash should not automatically destroy all parent state
Here is a sketch:
import akka.actor.typed.{ActorRef, Behavior, SupervisorStrategy}
import akka.actor.typed.scaladsl.Behaviors
import scala.concurrent.duration.*

object PaymentCoordinator {
  sealed trait Command
  final case class StartPayment(paymentId: String, amount: BigDecimal, replyTo: ActorRef[Result]) extends Command
  // Not private: the GatewayAuthorizer child needs to construct these.
  final case class AuthorizationSucceeded(paymentId: String) extends Command
  final case class AuthorizationFailed(paymentId: String, reason: String) extends Command

  sealed trait Result
  final case class Accepted(paymentId: String) extends Result
  final case class Rejected(paymentId: String, reason: String) extends Result

  private final case class State(inFlight: Map[String, ActorRef[GatewayAuthorizer.Command]])

  def apply(client: GatewayClient): Behavior[Command] =
    running(State(Map.empty), client)

  private def running(state: State, client: GatewayClient): Behavior[Command] =
    Behaviors.receive { (context, message) =>
      message match {
        case StartPayment(paymentId, amount, replyTo) =>
          val worker = context.spawn(
            Behaviors
              .supervise(GatewayAuthorizer(paymentId, amount, client, context.self))
              .onFailure[GatewayAuthorizer.TemporaryGatewayFailure](
                SupervisorStrategy.restartWithBackoff(1.second, 15.seconds, 0.2)
              ),
            s"gateway-$paymentId"
          )
          worker ! GatewayAuthorizer.Run
          replyTo ! Accepted(paymentId)
          running(state.copy(inFlight = state.inFlight + (paymentId -> worker)), client)
        case AuthorizationSucceeded(paymentId) =>
          running(state.copy(inFlight = state.inFlight - paymentId), client)
        case AuthorizationFailed(paymentId, reason) =>
          context.log.warn("Authorization failed for {}: {}", paymentId, reason)
          running(state.copy(inFlight = state.inFlight - paymentId), client)
      }
    }
}

object GatewayAuthorizer {
  sealed trait Command
  case object Run extends Command

  final class TemporaryGatewayFailure(message: String) extends RuntimeException(message)

  def apply(
      paymentId: String,
      amount: BigDecimal,
      client: GatewayClient,
      coordinator: ActorRef[PaymentCoordinator.Command]
  ): Behavior[Command] =
    // Note: a restarted worker starts fresh and waits for a new Run message;
    // a fuller version would re-trigger itself from Behaviors.setup.
    Behaviors.receiveMessage {
      case Run =>
        val result = client.authorize(paymentId, amount)
        if (result.timedOut) {
          throw new TemporaryGatewayFailure(s"Timeout for $paymentId")
        } else if (result.accepted) {
          coordinator ! PaymentCoordinator.AuthorizationSucceeded(paymentId)
          Behaviors.stopped
        } else {
          coordinator ! PaymentCoordinator.AuthorizationFailed(paymentId, "Authorization rejected")
          Behaviors.stopped
        }
    }
}
This example is deliberately incomplete, but the design lessons are real.
The Parent Owns Business Progress
The coordinator decides what counts as in-flight work. It can track what started, what finished, and what still needs attention.
That means a child restart does not erase the parent's broader understanding of the workflow.
The Child Owns a Narrow Risky Operation
The gateway worker does one risky thing: call the external gateway and report back. That makes restart semantics easier to reason about.
If the child fails, you are not restarting a giant actor with many unrelated responsibilities. You are restarting a small operation whose failure mode is understood.
Recovery Is Explicit, Not Implied
Notice that recovery here is not "Akka will fix payments for us." Recovery means:
- the child may be restarted with backoff for transient gateway issues
- the coordinator keeps workflow state separate from the flaky integration
- terminal failures are reported clearly instead of being retried forever
That is the mindset to keep.
Poisoned Messages and the Cost of Pretending
Operationally, poisoned messages are some of the most expensive failure patterns because they tempt teams into fake resilience.
The actor crashes. It restarts. The queue drains a little. The dashboards look active. But one or more messages are permanently unprocessable, and the system has no honest path for surfacing that fact.
This is how systems end up with silent data loss, infinite retry loops, or mysterious backlog growth.
The engineering response should be explicit:
- validate early
- classify deterministic failures separately from transient ones
- preserve identifiers and context for investigation
- route bad work somewhere observable instead of letting it disappear into restart cycles
In practice, this often means doing a first layer of validation before spawning or messaging actors that assume stronger invariants. It can also mean recording failure metadata in a store or publishing an operational event that downstream monitoring can act on.
Resilience is not the ability to hide bad input. It is the ability to fail without corrupting the rest of the system.
Recovery Means More Than Restarting a Process
Recovery in production usually combines multiple layers:
- supervision policy inside the actor tree
- time-based retry or backoff policy
- durable state recovery if persistence is involved
- replay or re-drive workflow for failed messages
- operational visibility for humans and automation
If your system depends only on actor restart and has no answer for lost in-memory state, duplicate commands, or replay of side effects, then your recovery story is incomplete.
This becomes especially important in systems with external side effects.
Suppose an actor sends a charge request, crashes before recording success, and then restarts. What happens next?
Possible outcomes include:
- the payment gets charged twice
- the payment succeeded remotely but looks failed locally
- a retry occurs with no idempotency protection
- the workflow stalls because the actor no longer knows what happened
Akka supervision helps contain failure, but it does not solve idempotency or business reconciliation for you. Those have to be designed deliberately.
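A minimal sketch of the idempotency side, assuming a key-to-result store. Here the store is an in-memory map for illustration; in production it would be durable and shared, so the key survives a crash.

```scala
// Hedged sketch: a restart may re-send the charge command, so the charge path
// itself deduplicates on an idempotency key. The store here is an assumption.
object IdempotentCharge {
  private val completed = scala.collection.mutable.Map.empty[String, String]

  // Runs the side effect at most once per key; later attempts reuse the result.
  def charge(idempotencyKey: String, performCharge: () => String): String =
    completed.getOrElseUpdate(idempotencyKey, performCharge())
}
```

With something like this in place, the restart-after-crash scenario above degrades from "charged twice" to "charged once, result replayed", provided the key and result actually outlive the process.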
Choosing the Right Strategy in Practice
A useful practical heuristic is to ask four questions for any actor that may fail.
1. What state does this actor own?
If the state is disposable or easily rebuilt, restart is safer.
If the state is business-critical and only lives in memory, restart may conceal data loss.
2. Is the failure transient or deterministic?
Transient failures often justify backoff and retry.
Deterministic failures usually need rejection, quarantine, or stop.
3. What side effects may already have happened?
If the actor may have partially completed external work, recovery must account for duplication and reconciliation.
4. What is the blast radius if this child dies?
If a small child failure currently forces a large parent workflow to collapse, the actor boundaries may be wrong.
These questions are often more valuable than memorizing every supervision option in the API.
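The first two questions can even be collapsed into a rough decision helper. The enum and the returned strings are ours, purely illustrative; real policies also weigh side effects and blast radius case by case.

```scala
// Illustrative mapping from the questions above to a default strategy.
enum FailureKind {
  case Transient, Deterministic
}

object StrategyHint {
  def choose(kind: FailureKind, stateIsDisposable: Boolean): String =
    (kind, stateIsDisposable) match {
      case (FailureKind.Transient, true)  => "restart with backoff"
      case (FailureKind.Transient, false) => "restart only if the state can be rebuilt"
      case (FailureKind.Deterministic, _) => "stop and quarantine the input"
    }
}
```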
Common Supervision Mistakes
There are a few mistakes that appear repeatedly in real Akka codebases.
Treating Every Exception as Restartable
This hides programming errors and invalid input behind activity. Some failures should stop the component and demand attention.
Putting Too Much State in One Actor
Large actors are harder to supervise because restart semantics become muddy. If one actor owns too many responsibilities, you cannot recover one concern without risking all the others.
Ignoring What Restart Does to State
If a restart recreates the actor from scratch, be honest about what was lost. Do not call it recovery if it merely reset the process.
Using Resume Without Strong Invariants
Resume can keep throughput moving, but it can also keep corrupted state alive. It is more dangerous than it first appears.
Forgetting Observability
A supervision strategy without clear logs, metrics, or alerts is incomplete. Operators need to know when restart loops, dead letters, or repeated stop events are happening.
Testing Failure Behavior Matters as Much as Testing Success
It is easy to test the happy path and declare the actor correct. That is not enough for systems where failure handling is part of the design.
You should test at least these questions:
- does a transient child failure actually trigger the intended restart behavior?
- does the parent preserve the right state after child failure?
- are deterministic failures surfaced instead of retried forever?
- do restart policies create duplicate side effects?
- does the system remain observable when failures repeat?
If your production safety depends on supervision, then supervision behavior is application logic, not just framework plumbing.
Summary
Akka supervision is best understood as disciplined failure containment.
It gives you a structured answer to questions that thread-based systems often leave vague: who owns the failing component, what happens to its state, whether it should restart or stop, and how much damage should spread to the rest of the workflow.
The hard part is not choosing an API method. The hard part is choosing honest semantics.
- restart is useful when the actor can safely return to a clean baseline
- stop is often the right answer for poisoned messages or broken invariants
- resume is only safe when you can prove the state is still valid
- recovery is incomplete if it ignores business state, idempotency, and observability
If you keep those principles in mind, supervision becomes more than a runtime feature. It becomes one of the main tools you use to design actor systems that fail in controlled, understandable ways.
In the next lesson, we will look at the ask pattern, fire-and-forget messaging, and the workflow boundaries that decide when actors should wait for replies and when they should keep work asynchronous.