-
Yeah, your issue here might be that the producer gets rate-limited by how quickly messages are being ACKed back to the sender. That's a great feature to have to ensure high deliverability, but yeah, it can impact your throughput. Also, using the durable event-sourced queue will slow things down further, because it adds a layer of persistence to both the SEND and the ACK. The internal default buffer size is quite large though, so you should have headroom to let it rip, unless you're already hitting that value I suppose. How big are your CSV rows?
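(If the import really is hitting that internal buffer limit, it should be tunable via HOCON. The key below is an assumption that Akka.NET's reliable-delivery settings mirror the JVM reference.conf, so double-check the exact path against your version's reference.conf before relying on it.)

```hocon
# Assumed key path - verify against your Akka.NET version's reference.conf.
# Raises the ShardingProducerController's internal buffer for messages that
# arrive while entities have no outstanding demand.
akka.reliable-delivery.sharding.producer-controller {
  buffer-size = 5000
}
```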
Use stashing to solve this: https://www.youtube.com/watch?v=S8FNHZ7SES8
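For reference, a minimal sketch of that stash-while-idle pattern, assuming the ShardingProducerController API as used in the Akka.NET ShoppingCart sample; IProductProtocol and its ProductId property are illustrative stand-ins for the real CSV-row message type:

```csharp
using Akka.Actor;
using Akka.Cluster.Sharding;
using Akka.Cluster.Sharding.Delivery;

// Illustrative domain contract - stands in for the real message type.
public interface IProductProtocol
{
    string ProductId { get; }
}

public sealed class ProductProducerActor : ReceiveActor, IWithStash
{
    private IActorRef _sendNextTo;

    public IStash Stash { get; set; }

    public ProductProducerActor() => BecomeIdle();

    private void BecomeIdle() => Become(() =>
    {
        // No outstanding demand: buffer incoming domain messages.
        Receive<IProductProtocol>(_ => Stash.Stash());

        Receive<ShardingProducerController.RequestNext<IProductProtocol>>(next =>
        {
            _sendNextTo = next.SendNextTo;
            BecomeActive();
            Stash.Unstash(); // replay one buffered message into the active behavior
        });
    });

    private void BecomeActive() => Become(() =>
    {
        Receive<IProductProtocol>(msg =>
        {
            // One message per RequestNext, wrapped in a ShardingEnvelope so the
            // controller can route it to the right entity.
            _sendNextTo.Tell(new ShardingEnvelope(msg.ProductId, msg));
            _sendNextTo = null;
            BecomeIdle(); // wait for the next demand signal before sending again
        });

        // A fresh demand signal can arrive before we have anything to send.
        Receive<ShardingProducerController.RequestNext<IProductProtocol>>(next =>
            _sendNextTo = next.SendNextTo);
    });
}
```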
-
It should be around 100–200 bytes per row, I think. Here's an example row:

In total, I have around 2000–3000 rows. This is the class being used:
Thanks for the suggestion on stashing — I’ve added that to the producer when it's idle. However, when I import the file, the messages get stashed but I never receive a ShardingProducerController.RequestNext message.
Right after stashing, I start getting the following error in the logs:
This starts happening very quickly after startup; I'm attaching the full logs below for reference. And this is how I configured persistence, since the error seems related to it:
-
Thanks for the detailed reply! I'll go ahead and give Akka.Streams a try for this use case, especially following the pattern in the RabbitMqConsumerActor sample you linked. The idea of streaming rows from the CSV and piping them into the sharding system via Ask makes a lot of sense, and it could help bypass some of the friction I'm seeing with DurableConsumers.

In the meantime, I put together a minimal repro project that consistently triggers the issue I'm encountering with Akka.Cluster.Sharding.Delivery. You can find it here: https://github.com/vulcanUPB/AkkaShardingEventFromCSV

It's not necessarily the cleanest codebase (it was quickly assembled to isolate the issue), but it should still provide a reproducible context. There's definitely room to make it cleaner and more idiomatic, but I hope it's useful for debugging purposes. Let me know if there's anything you'd like me to tweak or isolate further. Thanks again!
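Roughly, the streaming approach I have in mind looks like the sketch below. It's only my interpretation of the suggested pattern, not code from the linked sample: stream lines from the CSV, map each to a command, and Ask the shard region with bounded parallelism so the file read is back-pressured by the entities' replies. ParseProductRow, AddProduct, and Ack are placeholder names.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.IO;
using Akka.Streams;
using Akka.Streams.Dsl;

public static class CsvIngestion
{
    // Hypothetical messages: the sharded entity is assumed to reply with an Ack.
    public sealed record AddProduct(string ProductId, string Name);
    public sealed record Ack(string ProductId);

    public static Task IngestAsync(ActorSystem system, IActorRef shardRegion, string path)
    {
        var materializer = system.Materializer();

        return FileIO.FromFile(new FileInfo(path))
            // Split the byte stream into lines.
            .Via(Framing.Delimiter(ByteString.FromString("\n"), maximumFrameLength: 4096, allowTruncation: true))
            .Select(bytes => bytes.ToString(Encoding.UTF8))
            .Select(ParseProductRow)
            // Bounded parallelism: at most 8 rows in flight; the stream only pulls
            // more from the file once entities have replied.
            .SelectAsync(8, cmd => shardRegion.Ask<Ack>(cmd, TimeSpan.FromSeconds(10)))
            .RunWith(Sink.Ignore<Ack>(), materializer);
    }

    private static AddProduct ParseProductRow(string line)
    {
        // Hypothetical CSV shape: "productId,name,..." - adapt to the real file.
        var cols = line.Split(',');
        return new AddProduct(cols[0], cols[1]);
    }
}
```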
-
Here's a message throughput benchmark for Akka.Delivery with sharding and the durable queue, using an in-memory journal and snapshot store:
-
Akka.Cluster.Delivery message throughput benchmark without the event-sourced queue:

I think it is inherently slow as-is.
-
I want to chime in that I experienced the first issue (dropping messages) weeks ago when implementing my producer according to the following example: https://github.com/akkadotnet/akka.net/blob/dev/src/examples/Cluster/ClusterSharding/ShoppingCart/Producer.cs#L84

I also tried fixing it by stashing the domain message while in Idle and then unstashing everything before (if I remember correctly) becoming Active, but had issues with that as well. In my scenario I have an Akka.Hosting

Fortunately, I fixed my issues by removing the Idle state from the producer actor so it does not have switchable behaviors. It contains just the
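Roughly, a single-behavior producer along those lines could look like the sketch below. This is a reconstruction, not the actual code from this thread: one receive set, a pending queue, and the latest SendNextTo reference, with IProductProtocol and ProductId as illustrative placeholders (the same stand-in used in the stashing sketch above).

```csharp
using System.Collections.Generic;
using Akka.Actor;
using Akka.Cluster.Sharding;
using Akka.Cluster.Sharding.Delivery;

// Illustrative domain contract - stands in for the real message type.
public interface IProductProtocol
{
    string ProductId { get; }
}

public sealed class ProductProducerActor : ReceiveActor
{
    private readonly Queue<IProductProtocol> _pending = new();
    private IActorRef _sendNextTo; // non-null only while the controller has demand

    public ProductProducerActor()
    {
        Receive<ShardingProducerController.RequestNext<IProductProtocol>>(next =>
        {
            _sendNextTo = next.SendNextTo;
            TrySend();
        });

        Receive<IProductProtocol>(msg =>
        {
            _pending.Enqueue(msg);
            TrySend();
        });
    }

    private void TrySend()
    {
        // Send only when we have both demand and a buffered message.
        if (_sendNextTo is null || _pending.Count == 0)
            return;

        var msg = _pending.Dequeue();
        _sendNextTo.Tell(new ShardingEnvelope(msg.ProductId, msg));
        _sendNextTo = null; // consumed: wait for the next RequestNext
    }
}
```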
-
@danne931 Can you share your producer message handler code?
-
Hello,
I'm building a high-throughput message ingestion system using ShardingProducerController, targeting a sharded ProductActor region. I’ve encountered two main challenges:
Use Case
I register a single producer actor (ProductProducerActor) at system startup using the WithActors(...) hook like this:
Then, I initialize the ShardingProducerController inside the AddStartup(...) hook, after the cluster is fully joined:
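(The exact snippets aren't reproduced here; the sketch below is only an approximation of that wiring. It assumes Akka.Hosting's WithActors/AddStartup hooks and the ShardingProducerController.Create signature shown in the Akka.NET docs, with illustrative names such as ProductProducerActor, IProductProtocol, and "product-producer"; verify the exact signatures against your package versions.)

```csharp
using Akka.Actor;
using Akka.Cluster.Sharding.Delivery;
using Akka.Hosting;
using Akka.Util;
using Microsoft.Extensions.DependencyInjection;

// Inside Program.cs, where `builder` is the host builder.
builder.Services.AddAkka("product-system", (akkaBuilder, _) =>
{
    akkaBuilder
        .WithActors((system, registry) =>
        {
            // Register the single producer actor at system startup.
            var producer = system.ActorOf(
                Props.Create(() => new ProductProducerActor()), "product-producer");
            registry.Register<ProductProducerActor>(producer);
        })
        .AddStartup(async (system, registry) =>
        {
            // Runs after the cluster has formed: wire the producer up to the
            // ShardingProducerController for the ProductActor region.
            var shardRegion = await registry.GetAsync<ProductActor>();
            var producer = await registry.GetAsync<ProductProducerActor>();

            var controller = system.ActorOf(
                ShardingProducerController.Create<IProductProtocol>(
                    producerId: "product-producer",
                    shardRegion: shardRegion,
                    durableQueue: Option<Props>.None,
                    settings: ShardingProducerController.Settings.Create(system)),
                "product-producer-controller");

            controller.Tell(new ShardingProducerController.Start<IProductProtocol>(producer));
        });
});
```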
Problem 1: Slow throughput with a single producer
My registered ProductProducerActor sends messages to sharded entities by waiting for RequestNext and responding with a single message each time. Here’s a simplified version:
The issue:
Problem 2: Race condition when dynamically creating producers
To work around the bottleneck, I experimented with spawning multiple producer actors dynamically (one per CSV row) using a service like this:
The issue:
The Start message takes time to reach the producer and initialize it.
If I immediately send a message to the producer (e.g., AddProductProtocol), it's often dropped because the actor is still in Idle() and hasn’t received its first RequestNext.
Questions
Any advice or best practices around high-throughput ingestion with reliable delivery would be greatly appreciated.
Thanks