Cluster Sharding and Location-Transparent State

In the previous lesson, the main shift was architectural: once actors stop living in one JVM, you have to think like a distributed-systems engineer. That means node membership, discovery, partial failure, and the fact that the network is now part of the runtime model.

The next question is more specific. Suppose your system does not just have a few long-lived actors. Suppose it has hundreds of thousands of logical entities: shopping carts, customer accounts, IoT devices, fraud cases, support sessions, or game lobbies. Each entity has its own state and behavior. Each one should process messages sequentially for correctness. But you do not want to hand-write routing logic for every message or manually decide which node owns which entity.

That is the problem Cluster Sharding solves.

Cluster Sharding lets you address entities by logical identity instead of physical location. You send a message to "cart-123" or "device-88" without needing to know which node currently hosts that actor. Akka decides where the entity should live, starts it when needed, moves shards when the cluster changes, and preserves the illusion that you are talking to a stable logical component.

That is powerful, but it is not free. Sharding introduces another layer of distributed coordination, and teams often reach for it too early because the API feels elegant. In this lesson, we will make the tradeoff concrete: what sharding is actually doing, what it is good at, how a realistic Akka Typed setup looks, and where a simpler design is the better engineering choice.

Why Ordinary Cluster Actors Stop Scaling Operationally

Imagine a retail platform where each active shopping cart needs to hold short-lived state:

selected items
quantities
coupon state
checkout status
expiration timing

At low scale, you can keep this in one JVM. At somewhat higher scale, you might partition carts manually by hashing a cart ID and routing messages to region actors you created yourself.

That works for a while, but the operational problems appear quickly:

who decides which node owns a given cart?
what happens when that node leaves the cluster?
how do callers find the right actor instance?
how do you rebalance entities when traffic becomes uneven?
how much custom routing code will your team maintain six months from now?

This is the core reason sharding exists. It is not about making actors more abstract. It is about removing repetitive, failure-prone infrastructure code around large numbers of stateful entities.

Without sharding, many teams reinvent the same mechanisms badly:

a registry that maps entity IDs to nodes
hand-managed routing tables
ad hoc rebalancing logic
custom lifecycle rules for idle actors
brittle assumptions about entity location

Sharding standardizes those concerns so application code can focus on the entity behavior itself.

What Cluster Sharding Actually Gives You

Cluster Sharding is best understood as an entity management layer on top of Akka Cluster.

In practice, it gives you several things at once:

a way to define an entity type, such as ShoppingCart or DeviceTwin
a way to address one entity instance by its logical ID
automatic placement of entities into shards that are distributed across the cluster
automatic startup of entities when the first message arrives
rebalancing when cluster topology changes
passivation support so idle entities do not consume memory forever

The key phrase in the lesson title is location-transparent state.

Location transparency does not mean location stops mattering operationally. It means callers do not need to care where the entity currently lives in order to interact with it. That is a big API simplification. It lets the rest of the application speak in domain identities instead of infrastructure addresses.

In other words, callers say:

send AddItem to cart cart-123
send MarkOnline to device device-88
send FreezeAccount to account acct-91

They do not say:

find node payments-node-4
look up the current actor path
reconnect if the node moved
update some shared registry if rebalancing happened

That separation is the main engineering value.

A Concrete Example: Sharded Shopping Carts

Shopping carts are a good sharding example because they have all the right properties:

many independent entities
entity-local state
sequential message handling matters
workload can spread across many nodes
entities can be started on demand and passivated when idle

Here is a simplified Akka Typed cart entity:

import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.EntityTypeKey

object ShoppingCart {
  sealed trait Command

  final case class AddItem(sku: String, quantity: Int, replyTo: ActorRef[Confirmation]) extends Command
  final case class RemoveItem(sku: String, replyTo: ActorRef[Confirmation]) extends Command
  final case class GetCart(replyTo: ActorRef[Summary]) extends Command
  case object ExpireIfIdle extends Command

  sealed trait Confirmation
  final case class Accepted(message: String) extends Confirmation
  final case class Rejected(reason: String) extends Confirmation

  final case class Summary(cartId: String, items: Map[String, Int], checkedOut: Boolean)

  final case class State(items: Map[String, Int], checkedOut: Boolean) {
    def addItem(sku: String, quantity: Int): State =
      copy(items = items.updated(sku, items.getOrElse(sku, 0) + quantity))

    def removeItem(sku: String): State =
      copy(items = items - sku)
  }

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("shopping-cart")

  def apply(cartId: String): Behavior[Command] =
    active(cartId, State(Map.empty, checkedOut = false))

  private def active(cartId: String, state: State): Behavior[Command] =
    Behaviors.receiveMessage {
      case AddItem(_, quantity, replyTo) if quantity <= 0 =>
        replyTo ! Rejected("Quantity must be positive")
        Behaviors.same

      case AddItem(sku, quantity, replyTo) if state.checkedOut =>
        replyTo ! Rejected("Cart is already checked out")
        Behaviors.same

      case AddItem(sku, quantity, replyTo) =>
        replyTo ! Accepted(s"Added $quantity of $sku")
        active(cartId, state.addItem(sku, quantity))

      case RemoveItem(sku, replyTo) if state.checkedOut =>
        replyTo ! Rejected("Cart is already checked out")
        Behaviors.same

      case RemoveItem(sku, replyTo) =>
        replyTo ! Accepted(s"Removed $sku")
        active(cartId, state.removeItem(sku))

      case GetCart(replyTo) =>
        replyTo ! Summary(cartId, state.items, state.checkedOut)
        Behaviors.same

      case ExpireIfIdle if state.items.isEmpty =>
        Behaviors.stopped

      case ExpireIfIdle =>
        Behaviors.same
    }
}

There is nothing especially "distributed" inside the entity behavior. That is a good sign.

The cart actor owns cart state. It processes one message at a time. It validates commands against current state. It exposes a protocol that makes sense in the domain.

This is how sharding should feel from the entity's point of view: the business actor remains understandable even though the runtime may place it anywhere in the cluster.

Register the Entity With Sharding

The next step is telling the cluster how to create cart entities.

import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity}

object CartServiceApp {
  def apply(): Behaviors.Receive[Nothing] =
    Behaviors.setup[Nothing] { context =>
      val sharding = ClusterSharding(context.system)

      sharding.init(
        Entity(ShoppingCart.TypeKey) { entityContext =>
          ShoppingCart(entityContext.entityId)
        }
      )

      Behaviors.empty
    }
}

val system = ActorSystem[Nothing](CartServiceApp(), "cart-cluster")

This line is doing a lot of work:

Entity(ShoppingCart.TypeKey) { entityContext =>
  ShoppingCart(entityContext.entityId)
}

It tells Akka that for the entity type shopping-cart, each logical entity should be created by passing the entity ID into ShoppingCart(...).

That means when a message arrives for cart cart-123, Akka can start that entity if it is not already running, route the message to it, and keep the rest of the application ignorant of which node was chosen.

Sending Messages Without Knowing the Node

Once sharding is initialized, callers use an EntityRef instead of a traditional actor path.

import akka.cluster.sharding.typed.scaladsl.ClusterSharding

def addItemToCart(system: ActorSystem[_], cartId: String): Unit = {
  val sharding = ClusterSharding(system)

  val cartRef =
    sharding.entityRefFor(ShoppingCart.TypeKey, cartId)

  cartRef ! ShoppingCart.AddItem("sku-123", 2, replyTo = ???)
}

This is the part many teams like immediately. The call site is simple.

You do not need:

a node lookup service
a routing table in Redis
a hand-built hashing layer
actor selection based on remote addresses

You just need the entity type and the entity ID.

That is what location transparency buys you. The call site stays stable even when entity placement changes.

What a Shard Is and Why It Exists

The word "sharding" can sound more mysterious than it is.

A shard is just a grouping unit. Akka does not usually move individual entities around one by one for balancing. Instead, it groups entities into shards, and shards are distributed across nodes.

That design matters because moving thousands of entities individually would be noisy and expensive. Shards make rebalancing coarser and more manageable.

From an application perspective, the important facts are:

each entity belongs to exactly one shard
shards are distributed across cluster nodes
a rebalance may move a shard from one node to another
callers still address entities by logical ID, not by current location

You should not build application logic that depends on a specific node owning a specific entity forever. The whole point is that ownership may change.

Where Sharding Fits Best

Cluster Sharding is strongest when the domain already wants entity-local ownership.

Good fits usually look like this:

Per-Entity Stateful Workflows

Examples include carts, accounts, devices, sessions, and order aggregates. Each entity has a stable identity and a message stream that should be processed in order.

High Cardinality

You may have tens of thousands or millions of logical entities, but only a subset are active at once. Sharding helps you model them uniformly without keeping every entity permanently resident.

Hot State That Should Live Close to Behavior

If each entity needs short-lived in-memory state to answer messages quickly, an actor per entity can be a strong fit. Sharding then gives that model a scalable placement mechanism.

Systems Already Committed to an Actor-Centric Runtime

If the surrounding architecture already uses Akka actors, cluster membership, and message-driven workflows, sharding integrates naturally. It extends the existing model rather than introducing a second mental model.

A More Realistic Example: Device Digital Twins

Another classic sharding case is an IoT backend where each device has recent state, connectivity status, throttling rules, and command sequencing requirements.

import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.EntityTypeKey
import java.time.Instant

object DeviceTwin {
  sealed trait Command

  final case class ReportTelemetry(temperature: Double, at: Instant) extends Command
  final case class MarkDisconnected(at: Instant) extends Command
  final case class GetStatus(replyTo: ActorRef[Status]) extends Command

  final case class Status(deviceId: String, online: Boolean, lastTemperature: Option[Double], lastSeen: Option[Instant])

  private final case class State(
      online: Boolean,
      lastTemperature: Option[Double],
      lastSeen: Option[Instant]
  )

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("device-twin")

  def apply(deviceId: String): Behavior[Command] =
    running(deviceId, State(online = false, None, None))

  private def running(deviceId: String, state: State): Behavior[Command] =
    Behaviors.receiveMessage {
      case ReportTelemetry(value, at) =>
        running(deviceId, state.copy(online = true, lastTemperature = Some(value), lastSeen = Some(at)))

      case MarkDisconnected(at) =>
        running(deviceId, state.copy(online = false, lastSeen = Some(at)))

      case GetStatus(replyTo) =>
        replyTo ! Status(deviceId, state.online, state.lastTemperature, state.lastSeen)
        Behaviors.same
    }
}

This kind of entity often benefits from sharding because each device has a natural ID, its own stream of updates, and localized state transitions. The application wants to talk to device-88, not to "whatever node currently happens to own device state."

That is the real mental model: sharding helps the system stay organized around domain identities.

Passivation: Not Every Entity Should Stay Alive Forever

One operational reason sharding is useful is that many entity types are bursty.

A cart may be active for ten minutes and then disappear. A user session may be active during one browser flow and then go quiet. A device may report once an hour.

If every entity stayed live forever, memory usage would become the limiting factor long before cluster capacity was used efficiently.

Passivation addresses that by allowing idle entities to stop and later restart when new messages arrive.

The main point is not the specific API detail. The main point is architectural:

entities can be demand-driven
memory usage can follow active workload rather than total entity count
callers still use the same logical identity after restart

This makes sharding especially attractive for large fleets of mostly idle entities.

But passivation also means entity code should be written so restart is normal, not exceptional. If the entity needs durable history or exact recovery semantics, sharding alone is not enough. That is where persistence enters the picture in the next lessons.

What Sharding Does Not Solve

This is where teams need discipline.

Cluster Sharding solves entity placement and location-transparent messaging. It does not automatically solve every stateful-system problem.

It does not remove the need to think about:

durable storage
exactly-once myths
idempotency
backpressure across service boundaries
observability
network partitions
slow downstream dependencies

It also does not make bad entity boundaries good.

If one actor owns too much state, receives too many unrelated commands, or becomes a bottleneck because the domain was modeled poorly, sharding will distribute the pain rather than eliminate it.

When Sharding Is Not Worth It

This is the most important design judgment in the lesson.

Do not use Cluster Sharding just because "one actor per entity" sounds clean. Use it when the operational and modeling benefits outweigh the distributed complexity.

Sharding is often the wrong choice when:

The State Does Not Need to Live in Actors

If the real source of truth is a relational database and requests are simple CRUD operations, a standard stateless HTTP service may be much easier to build and run.

There Are Too Few Entities

If you have a small fixed set of workers, ordinary cluster-aware actors or routers may be enough. Sharding helps most when there is high cardinality.

Access Patterns Are Mostly Batch or Query Driven

If the system mostly runs analytics, scans large datasets, or executes scheduled jobs, stream processing or database-native approaches may fit better than entity actors.

Your Team Is Not Ready to Operate a Distributed Actor System

Sharding inherits all the realities of clustered Akka: node churn, rebalancing, message timing, monitoring, deployment coordination, and careful debugging. If the team only needs a queue plus a database, that simpler architecture usually wins.

Entity Behavior Is Thin and Stateless

If the actor does almost nothing except proxy a database call, sharding is probably decorative rather than useful.

Practical Design Heuristics

When evaluating whether to shard a domain, ask these questions:

Does each entity have a stable identity that matters to the business?
Does each entity own meaningful behavior or short-lived state?
Is sequential handling per entity important for correctness?
Are there enough entities that manual routing would become painful?
Would callers benefit from not caring where an entity currently lives?
Is the team prepared to run clustered Akka in production?

If most of those answers are no, you probably do not need sharding.

If most are yes, sharding may be exactly the right tool.

The Architectural Boundary to Keep Clear

One common mistake is to think sharding should spread everywhere once it exists in the system.

Usually it should not.

A healthy architecture often looks more like this:

HTTP or gRPC at the edge
sharded entities in the core where identity and state matter
databases and external systems at explicit boundaries
streams or projections where event flow and integration matter

That keeps sharding in the part of the system where it creates real leverage instead of turning the entire application into an actor-only design exercise.

Summary

Cluster Sharding is Akka's answer to a specific distributed-systems problem: how to manage very large numbers of stateful logical entities across a cluster without forcing the application to hand-route messages or track physical placement.

Its real value is location-transparent state. You talk to cart cart-123, device device-88, or account acct-91 as business identities. Akka handles where those entities run, when they start, and how shards rebalance as the cluster changes.

That is powerful when the domain naturally consists of many stateful entities with ordered message streams. It is unnecessary when the problem is mostly CRUD, batch processing, or thin request forwarding.

The right lesson to carry forward is not "sharding is advanced, therefore use it." The right lesson is narrower and more useful: use sharding when entity-local behavior and scale genuinely require it, and keep the rest of the system simpler than your abstractions would like.

In the next lesson, we will build on this idea of entity-local truth by looking at event sourcing and Akka Persistence, where actor state stops being merely in-memory and becomes durable, replayable, and historically meaningful.

Cluster Sharding and Location-Transparent State

Cluster Sharding and Location-Transparent State

Why Ordinary Cluster Actors Stop Scaling Operationally

What Cluster Sharding Actually Gives You

A Concrete Example: Sharded Shopping Carts

Register the Entity With Sharding

Sending Messages Without Knowing the Node

What a Shard Is and Why It Exists

Where Sharding Fits Best

Per-Entity Stateful Workflows

High Cardinality

Hot State That Should Live Close to Behavior

Systems Already Committed to an Actor-Centric Runtime

A More Realistic Example: Device Digital Twins

Passivation: Not Every Entity Should Stay Alive Forever

What Sharding Does Not Solve

When Sharding Is Not Worth It

The State Does Not Need to Live in Actors

There Are Too Few Entities

Access Patterns Are Mostly Batch or Query Driven

Your Team Is Not Ready to Operate a Distributed Actor System

Entity Behavior Is Thin and Stateless

Practical Design Heuristics

The Architectural Boundary to Keep Clear

Summary

Comments

Feedback

Cluster Sharding and Location-Transparent State

Why Ordinary Cluster Actors Stop Scaling Operationally

What Cluster Sharding Actually Gives You

A Concrete Example: Sharded Shopping Carts

Register the Entity With Sharding

Sending Messages Without Knowing the Node

What a Shard Is and Why It Exists

Where Sharding Fits Best

Per-Entity Stateful Workflows

High Cardinality

Hot State That Should Live Close to Behavior

Systems Already Committed to an Actor-Centric Runtime

A More Realistic Example: Device Digital Twins

Passivation: Not Every Entity Should Stay Alive Forever

What Sharding Does Not Solve

When Sharding Is Not Worth It

The State Does Not Need to Live in Actors

There Are Too Few Entities

Access Patterns Are Mostly Batch or Query Driven

Your Team Is Not Ready to Operate a Distributed Actor System

Entity Behavior Is Thin and Stateless

Practical Design Heuristics

The Architectural Boundary to Keep Clear

Summary

Comments

Feedback

Sign In

Sign Up

Reset Password

Sign Out

Reset Password