Cluster Sharding and Location-Transparent State
In the previous lesson, the main shift was architectural: once actors stop living in one JVM, you have to think like a distributed-systems engineer. That means node membership, discovery, partial failure, and the fact that the network is now part of the runtime model.
The next question is more specific. Suppose your system does not just have a few long-lived actors. Suppose it has hundreds of thousands of logical entities: shopping carts, customer accounts, IoT devices, fraud cases, support sessions, or game lobbies. Each entity has its own state and behavior. Each one should process messages sequentially for correctness. But you do not want to hand-write routing logic for every message or manually decide which node owns which entity.
That is the problem Cluster Sharding solves.
Cluster Sharding lets you address entities by logical identity instead of physical location. You send a message to "cart-123" or "device-88" without needing to know which node currently hosts that actor. Akka decides where the entity should live, starts it when needed, moves shards when the cluster changes, and preserves the illusion that you are talking to a stable logical component.
That is powerful, but it is not free. Sharding introduces another layer of distributed coordination, and teams often reach for it too early because the API feels elegant. In this lesson, we will make the tradeoff concrete: what sharding is actually doing, what it is good at, how a realistic Akka Typed setup looks, and where a simpler design is the better engineering choice.
Why Ordinary Cluster Actors Stop Scaling Operationally
Imagine a retail platform where each active shopping cart needs to hold short-lived state:
- selected items
- quantities
- coupon state
- checkout status
- expiration timing
At low scale, you can keep this in one JVM. At somewhat higher scale, you might partition carts manually by hashing a cart ID and routing messages to region actors you created yourself.
That works for a while, but the operational problems appear quickly:
- who decides which node owns a given cart?
- what happens when that node leaves the cluster?
- how do callers find the right actor instance?
- how do you rebalance entities when traffic becomes uneven?
- how much custom routing code will your team maintain six months from now?
This is the core reason sharding exists. It is not about making actors more abstract. It is about removing repetitive, failure-prone infrastructure code around large numbers of stateful entities.
Without sharding, many teams reinvent the same mechanisms badly:
- a registry that maps entity IDs to nodes
- hand-managed routing tables
- ad hoc rebalancing logic
- custom lifecycle rules for idle actors
- brittle assumptions about entity location
Sharding standardizes those concerns so application code can focus on the entity behavior itself.
What Cluster Sharding Actually Gives You
Cluster Sharding is best understood as an entity management layer on top of Akka Cluster.
In practice, it gives you several things at once:
- a way to define an entity type, such as
ShoppingCartorDeviceTwin - a way to address one entity instance by its logical ID
- automatic placement of entities into shards that are distributed across the cluster
- automatic startup of entities when the first message arrives
- rebalancing when cluster topology changes
- passivation support so idle entities do not consume memory forever
The key phrase in the lesson title is location-transparent state.
Location transparency does not mean location stops mattering operationally. It means callers do not need to care where the entity currently lives in order to interact with it. That is a big API simplification. It lets the rest of the application speak in domain identities instead of infrastructure addresses.
In other words, callers say:
- send
AddItemto cartcart-123 - send
MarkOnlineto devicedevice-88 - send
FreezeAccountto accountacct-91
They do not say:
- find node
payments-node-4 - look up the current actor path
- reconnect if the node moved
- update some shared registry if rebalancing happened
That separation is the main engineering value.
A Concrete Example: Sharded Shopping Carts
Shopping carts are a good sharding example because they have all the right properties:
- many independent entities
- entity-local state
- sequential message handling matters
- workload can spread across many nodes
- entities can be started on demand and passivated when idle
Here is a simplified Akka Typed cart entity:
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.EntityTypeKey
object ShoppingCart {
sealed trait Command
final case class AddItem(sku: String, quantity: Int, replyTo: ActorRef[Confirmation]) extends Command
final case class RemoveItem(sku: String, replyTo: ActorRef[Confirmation]) extends Command
final case class GetCart(replyTo: ActorRef[Summary]) extends Command
case object ExpireIfIdle extends Command
sealed trait Confirmation
final case class Accepted(message: String) extends Confirmation
final case class Rejected(reason: String) extends Confirmation
final case class Summary(cartId: String, items: Map[String, Int], checkedOut: Boolean)
final case class State(items: Map[String, Int], checkedOut: Boolean) {
def addItem(sku: String, quantity: Int): State =
copy(items = items.updated(sku, items.getOrElse(sku, 0) + quantity))
def removeItem(sku: String): State =
copy(items = items - sku)
}
val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("shopping-cart")
def apply(cartId: String): Behavior[Command] =
active(cartId, State(Map.empty, checkedOut = false))
private def active(cartId: String, state: State): Behavior[Command] =
Behaviors.receiveMessage {
case AddItem(_, quantity, replyTo) if quantity <= 0 =>
replyTo ! Rejected("Quantity must be positive")
Behaviors.same
case AddItem(sku, quantity, replyTo) if state.checkedOut =>
replyTo ! Rejected("Cart is already checked out")
Behaviors.same
case AddItem(sku, quantity, replyTo) =>
replyTo ! Accepted(s"Added $quantity of $sku")
active(cartId, state.addItem(sku, quantity))
case RemoveItem(sku, replyTo) if state.checkedOut =>
replyTo ! Rejected("Cart is already checked out")
Behaviors.same
case RemoveItem(sku, replyTo) =>
replyTo ! Accepted(s"Removed $sku")
active(cartId, state.removeItem(sku))
case GetCart(replyTo) =>
replyTo ! Summary(cartId, state.items, state.checkedOut)
Behaviors.same
case ExpireIfIdle if state.items.isEmpty =>
Behaviors.stopped
case ExpireIfIdle =>
Behaviors.same
}
}
There is nothing especially "distributed" inside the entity behavior. That is a good sign.
The cart actor owns cart state. It processes one message at a time. It validates commands against current state. It exposes a protocol that makes sense in the domain.
This is how sharding should feel from the entity's point of view: the business actor remains understandable even though the runtime may place it anywhere in the cluster.
Register the Entity With Sharding
The next step is telling the cluster how to create cart entities.
import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity}
object CartServiceApp {
def apply(): Behaviors.Receive[Nothing] =
Behaviors.setup[Nothing] { context =>
val sharding = ClusterSharding(context.system)
sharding.init(
Entity(ShoppingCart.TypeKey) { entityContext =>
ShoppingCart(entityContext.entityId)
}
)
Behaviors.empty
}
}
val system = ActorSystem[Nothing](CartServiceApp(), "cart-cluster")
This line is doing a lot of work:
Entity(ShoppingCart.TypeKey) { entityContext =>
ShoppingCart(entityContext.entityId)
}
It tells Akka that for the entity type shopping-cart, each logical entity should be created by passing the entity ID into ShoppingCart(...).
That means when a message arrives for cart cart-123, Akka can start that entity if it is not already running, route the message to it, and keep the rest of the application ignorant of which node was chosen.
Sending Messages Without Knowing the Node
Once sharding is initialized, callers use an EntityRef instead of a traditional actor path.
import akka.cluster.sharding.typed.scaladsl.ClusterSharding
def addItemToCart(system: ActorSystem[_], cartId: String): Unit = {
val sharding = ClusterSharding(system)
val cartRef =
sharding.entityRefFor(ShoppingCart.TypeKey, cartId)
cartRef ! ShoppingCart.AddItem("sku-123", 2, replyTo = ???)
}
This is the part many teams like immediately. The call site is simple.
You do not need:
- a node lookup service
- a routing table in Redis
- a hand-built hashing layer
- actor selection based on remote addresses
You just need the entity type and the entity ID.
That is what location transparency buys you. The call site stays stable even when entity placement changes.
What a Shard Is and Why It Exists
The word "sharding" can sound more mysterious than it is.
A shard is just a grouping unit. Akka does not usually move individual entities around one by one for balancing. Instead, it groups entities into shards, and shards are distributed across nodes.
That design matters because moving thousands of entities individually would be noisy and expensive. Shards make rebalancing coarser and more manageable.
From an application perspective, the important facts are:
- each entity belongs to exactly one shard
- shards are distributed across cluster nodes
- a rebalance may move a shard from one node to another
- callers still address entities by logical ID, not by current location
You should not build application logic that depends on a specific node owning a specific entity forever. The whole point is that ownership may change.
Where Sharding Fits Best
Cluster Sharding is strongest when the domain already wants entity-local ownership.
Good fits usually look like this:
Per-Entity Stateful Workflows
Examples include carts, accounts, devices, sessions, and order aggregates. Each entity has a stable identity and a message stream that should be processed in order.
High Cardinality
You may have tens of thousands or millions of logical entities, but only a subset are active at once. Sharding helps you model them uniformly without keeping every entity permanently resident.
Hot State That Should Live Close to Behavior
If each entity needs short-lived in-memory state to answer messages quickly, an actor per entity can be a strong fit. Sharding then gives that model a scalable placement mechanism.
Systems Already Committed to an Actor-Centric Runtime
If the surrounding architecture already uses Akka actors, cluster membership, and message-driven workflows, sharding integrates naturally. It extends the existing model rather than introducing a second mental model.
A More Realistic Example: Device Digital Twins
Another classic sharding case is an IoT backend where each device has recent state, connectivity status, throttling rules, and command sequencing requirements.
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.EntityTypeKey
import java.time.Instant
object DeviceTwin {
sealed trait Command
final case class ReportTelemetry(temperature: Double, at: Instant) extends Command
final case class MarkDisconnected(at: Instant) extends Command
final case class GetStatus(replyTo: ActorRef[Status]) extends Command
final case class Status(deviceId: String, online: Boolean, lastTemperature: Option[Double], lastSeen: Option[Instant])
private final case class State(
online: Boolean,
lastTemperature: Option[Double],
lastSeen: Option[Instant]
)
val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("device-twin")
def apply(deviceId: String): Behavior[Command] =
running(deviceId, State(online = false, None, None))
private def running(deviceId: String, state: State): Behavior[Command] =
Behaviors.receiveMessage {
case ReportTelemetry(value, at) =>
running(deviceId, state.copy(online = true, lastTemperature = Some(value), lastSeen = Some(at)))
case MarkDisconnected(at) =>
running(deviceId, state.copy(online = false, lastSeen = Some(at)))
case GetStatus(replyTo) =>
replyTo ! Status(deviceId, state.online, state.lastTemperature, state.lastSeen)
Behaviors.same
}
}
This kind of entity often benefits from sharding because each device has a natural ID, its own stream of updates, and localized state transitions. The application wants to talk to device-88, not to "whatever node currently happens to own device state."
That is the real mental model: sharding helps the system stay organized around domain identities.
Passivation: Not Every Entity Should Stay Alive Forever
One operational reason sharding is useful is that many entity types are bursty.
A cart may be active for ten minutes and then disappear. A user session may be active during one browser flow and then go quiet. A device may report once an hour.
If every entity stayed live forever, memory usage would become the limiting factor long before cluster capacity was used efficiently.
Passivation addresses that by allowing idle entities to stop and later restart when new messages arrive.
The main point is not the specific API detail. The main point is architectural:
- entities can be demand-driven
- memory usage can follow active workload rather than total entity count
- callers still use the same logical identity after restart
This makes sharding especially attractive for large fleets of mostly idle entities.
But passivation also means entity code should be written so restart is normal, not exceptional. If the entity needs durable history or exact recovery semantics, sharding alone is not enough. That is where persistence enters the picture in the next lessons.
What Sharding Does Not Solve
This is where teams need discipline.
Cluster Sharding solves entity placement and location-transparent messaging. It does not automatically solve every stateful-system problem.
It does not remove the need to think about:
- durable storage
- exactly-once myths
- idempotency
- backpressure across service boundaries
- observability
- network partitions
- slow downstream dependencies
It also does not make bad entity boundaries good.
If one actor owns too much state, receives too many unrelated commands, or becomes a bottleneck because the domain was modeled poorly, sharding will distribute the pain rather than eliminate it.
When Sharding Is Not Worth It
This is the most important design judgment in the lesson.
Do not use Cluster Sharding just because "one actor per entity" sounds clean. Use it when the operational and modeling benefits outweigh the distributed complexity.
Sharding is often the wrong choice when:
The State Does Not Need to Live in Actors
If the real source of truth is a relational database and requests are simple CRUD operations, a standard stateless HTTP service may be much easier to build and run.
There Are Too Few Entities
If you have a small fixed set of workers, ordinary cluster-aware actors or routers may be enough. Sharding helps most when there is high cardinality.
Access Patterns Are Mostly Batch or Query Driven
If the system mostly runs analytics, scans large datasets, or executes scheduled jobs, stream processing or database-native approaches may fit better than entity actors.
Your Team Is Not Ready to Operate a Distributed Actor System
Sharding inherits all the realities of clustered Akka: node churn, rebalancing, message timing, monitoring, deployment coordination, and careful debugging. If the team only needs a queue plus a database, that simpler architecture usually wins.
Entity Behavior Is Thin and Stateless
If the actor does almost nothing except proxy a database call, sharding is probably decorative rather than useful.
Practical Design Heuristics
When evaluating whether to shard a domain, ask these questions:
- Does each entity have a stable identity that matters to the business?
- Does each entity own meaningful behavior or short-lived state?
- Is sequential handling per entity important for correctness?
- Are there enough entities that manual routing would become painful?
- Would callers benefit from not caring where an entity currently lives?
- Is the team prepared to run clustered Akka in production?
If most of those answers are no, you probably do not need sharding.
If most are yes, sharding may be exactly the right tool.
The Architectural Boundary to Keep Clear
One common mistake is to think sharding should spread everywhere once it exists in the system.
Usually it should not.
A healthy architecture often looks more like this:
- HTTP or gRPC at the edge
- sharded entities in the core where identity and state matter
- databases and external systems at explicit boundaries
- streams or projections where event flow and integration matter
That keeps sharding in the part of the system where it creates real leverage instead of turning the entire application into an actor-only design exercise.
Summary
Cluster Sharding is Akka's answer to a specific distributed-systems problem: how to manage very large numbers of stateful logical entities across a cluster without forcing the application to hand-route messages or track physical placement.
Its real value is location-transparent state. You talk to cart cart-123, device device-88, or account acct-91 as business identities. Akka handles where those entities run, when they start, and how shards rebalance as the cluster changes.
That is powerful when the domain naturally consists of many stateful entities with ordered message streams. It is unnecessary when the problem is mostly CRUD, batch processing, or thin request forwarding.
The right lesson to carry forward is not "sharding is advanced, therefore use it." The right lesson is narrower and more useful: use sharding when entity-local behavior and scale genuinely require it, and keep the rest of the system simpler than your abstractions would like.
In the next lesson, we will build on this idea of entity-local truth by looking at event sourcing and Akka Persistence, where actor state stops being merely in-memory and becomes durable, replayable, and historically meaningful.
Comments
Be the first to comment on this lesson!