Which knob to adjust for scaling behavior? #1817

mpaul31 · 2021-05-03T19:40:40Z

I have a function app running on the Consumption plan that consumes messages from a Service Bus queue, signals an entity (12 or so entities perform some aggregations), and the entity sends an HTTP request to a third party (there's much more complexity, but you get the gist).

When looking at the live metrics from Application Insights, I'm seeing roughly 60-90 requests per second and 18 or so server instances handling the load, with each instance somewhere between 5-12% CPU utilization.

I have partitionCount set to 8 in host.json, and I'm wondering if I should adjust maxConcurrentOrchestratorFunctions to better utilize my resources, or maybe partitionCount is set too high? It's difficult to know what is influencing the scaling behavior, but for the work that is occurring the instance count seems a bit too high IMO.

Would really appreciate any thoughts or suggestions on where else to look!

Thanks!

Answered by cgillum

cgillum · 2021-05-03T21:43:04Z

Maintainer

If you're seeing 60-90 requests per second and 18 or so app instances, then I'm guessing you're processing Service Bus queues much faster than your entities. The low CPU% likely means that the entities cannot dequeue and/or process the signals as quickly as they are being enqueued.

Given the low density of entities to partitions, adjusting maxConcurrentOrchestratorFunctions is not likely to have much effect. You could lower partitionCount, but that might not help much since it seems the Service Bus load is driving the scale-out more so than entity load (entity load won't cause you to scale higher than your partition count).

As you mentioned in another reply, if you signal up to 100K distinct entities per day, then increasing maxConcurrentOrchestratorFunctions should be the first thing you try before increasing partitionCount. Increasing partitionCount can be expensive because of the additional constant load on your storage account from partition management. It also won't address low utilization. Increase maxConcurrentOrchestratorFunctions until you see better utilization in terms of CPU and memory. Once utilization is up, then you should consider increasing partitionCount to improve overall throughput.

If your entity operations are really fast and aren't slowing things down with I/O operations, then you could also consider increasing controlQueueBufferThreshold. This will cause your app to prefetch messages more aggressively and feed them to your entities in larger batches. We've observed pretty big throughput improvements in some of our tests when increasing this value for entities that execute very quickly. That may also help improve your utilization as well as your overall throughput.
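
For reference, here is a minimal host.json sketch of where the settings discussed above live, assuming the Durable Functions 2.x schema with the default Azure Storage provider. The partitionCount of 8 is the value from the original question; the values shown for maxConcurrentOrchestratorFunctions and controlQueueBufferThreshold are illustrative assumptions only, not recommendations, and are meant to be raised gradually while watching CPU and memory.

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentOrchestratorFunctions": 20,
      "storageProvider": {
        "partitionCount": 8,
        "controlQueueBufferThreshold": 256
      }
    }
  }
}
```

Changing one value at a time makes it easier to tell which setting actually moved utilization.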

2 replies

mpaul31 May 4, 2021
Author

> If you're seeing 60-90 requests per second and 18 or so app instances, then I'm guessing you're processing Service Bus queues much faster than your entities.

Do you think this would justify increasing the partitionCount to 16?

cgillum May 4, 2021
Maintainer

It's not obvious to me that increasing the partitionCount would improve utilization. I feel that would be more appropriate if your CPU or memory usage was too high, and it sounds like that's not the case. I would instead focus on increasing per-node concurrency.

mpaul31 · 2021-05-03T22:16:21Z

Author

Thanks Chris! Just to be clear, when I say 12 entities I am not talking about specific instances. I haven't looked at that specifically, but it's well over 100K distinct entities per day on sale days. We receive between 500K and 1 million Service Bus messages per day.

1 reply

cgillum May 4, 2021
Maintainer

Got it. Thanks for clarifying. I'll update my original response.

olitomlinson · 2021-05-05T08:28:25Z

@mpaul31 just to be clear, is your objective to decrease the number of online app instances, yet retain the current performance/throughput you have?

9 replies

mpaul31 May 6, 2021
Author

@mpaul31 Got you!

Not sure if this will work, but it's my understanding that you can increase the ServiceBus.prefetchCount configuration in host.json so that each online instance grabs more messages from Service Bus, thereby increasing the throughput of your app, which should drain the queues faster. The result should be that the scale controller scales out less aggressively and brings fewer instances online to handle the number of messages on the queue.

I think what could be happening here is that even if your Entity Partitions are under-utilised, the service bus scaling out of instances is effectively forcing your Entity Partitions to re-balance too among the available hosts, even if they don't necessarily need to.

Another option is to limit the max scale-out of the Function App to the same number of partitions you're running, which would be 8. That would force all Service Bus work to be done on the same host instances as your entity partitions, thereby increasing utilisation.
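
A rough sketch of the prefetch change, assuming the Service Bus extension reads prefetchCount from the extensions.serviceBus section of host.json. The value 100 is an arbitrary illustrative number, not something taken from this thread.

```json
{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "prefetchCount": 100
    }
  }
}
```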

So I updated the ServiceBus.prefetchCount and didn't see any changes in the scaling behavior. However, after adjusting the controlQueueBufferThreshold to 256, I'm seeing far fewer app instances being needed (<=5) with CPU utilization < 20%, so I'm now going to slowly tweak maxConcurrentOrchestratorFunctions to squeeze out a little more. With the new host.json defaults for the Consumption plan, I think I can bump it up to more than 5.

olitomlinson May 6, 2021

Very interesting!

I do wonder how changing that setting reduced the number of app instances so substantially, as 10 of the 18 app instances won’t have been doing any DF stuff, so how would this DF specific setting affect those particular instances?!

mpaul31 May 6, 2021
Author

I wonder the same thing and I looked at the scale controller logs to hopefully get some more insights into what is triggering the scale events but unfortunately the reason dimension is empty :/

My guess is that my entities are now being fed larger batches of messages, so they can drain the control queue faster?! I added some logging a while ago on the batch sizes, so I'm going to go peek soon and see how much they have increased.

mpaul31 May 6, 2021
Author

I spoke too soon. The number of instances is lower, but not the ~5 I initially thought. I'm going to keep digging and trying some more things.

mpaul31 May 6, 2021
Author

@cgillum do you have access to the internal telemetry to provide any more insight into the reason for scaling? Like I mentioned above, I tried using the new preview scale controller logging, but it does not provide a reason.

@cgillum I also just opened a support case, hoping it will surface some information I am unable to see myself. I can share it if you would like.
