
Azure AI Deployment: Unlocking the Power of Provisioned Throughput

5 min read · Jul 9, 2024

A short guide to Cost-Effective and Scalable Solutions

The buzz around large language models has slowly settled down, even as generative AI continues to advance across all modalities. Most enterprises today have crossed the exploratory stage: they have clearly defined strategic approaches to AI and a healthy roadmap to deploy custom copilots and Gen AI applications in production. According to Gartner, a top barrier to Gen AI adoption has been estimating and demonstrating the value of AI. Before launching your application into production, it is critical to evaluate it against key performance indicators (KPIs). Estimating the business value is crucial, especially given the uncertainties of scaling and of calculating the total cost of ownership. This blog provides prescriptive guidance for deploying Gen AI applications on Azure using Azure OpenAI.

Problem background

During the dev/test phase, enterprises typically use the pay-as-you-go (PayGo) model for Gen AI applications. In this model, a fixed number of tokens per minute (TPM) is allocated per subscription, region, and model, and every deployment of that model draws from the same allocation. For example, if 120K TPM is allocated to the GPT-3.5 Turbo model and one deployment consumes 50K TPM, 70K TPM is left for other deployments. The image below shows how the TPM can be distributed across deployments.

Token allocation per deployment in PayGo Model (Image by Author)

Let us look at some of the issues you may run into in production if you go ahead with this approach.

High latency: In the PayGo model, both tokens and requests have per-minute upper bounds (6 RPM per 1K TPM). Beyond these limits, the API starts throwing throttling errors. With client-side retries in place (see the sketch after this list), the user may experience high latency.

Variable throughput: PayGo deployments can experience fluctuating throughput because they run on shared processing power, whose availability depends on the current demand for the service across all users. Users may notice lags in token generation.

Limited scale: TPM and requests per minute (RPM) limits apply per model, per region, and per subscription, so the scalability of the solution is capped by these limits. Beyond them, the solution can only be scaled by creating additional deployments of the model in a new region or a new subscription.

Managing cost: With the PayGo model, usage costs depend on the number of tokens consumed. Predicting Azure expenses can be challenging because the total cost is only known towards the end of the month.
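To make the latency point concrete, here is a minimal sketch of client-side retries with exponential backoff against a PayGo deployment, using the openai Python SDK. The endpoint, key, and deployment names are placeholders; adjust them to your environment.

```python
# A minimal sketch of client-side retries with exponential backoff for
# PayGo throttling. Endpoint, key, and deployment names are placeholders.
import time

from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint="https://my-paygo-resource.openai.azure.com",  # placeholder
    api_key="<your-api-key>",                                     # placeholder
    api_version="2024-02-01",
)

def chat_with_retries(messages, max_retries=5):
    """Call the deployment, backing off exponentially when throttled.

    Each retry adds seconds of user-visible latency, which is why
    sustained throttling in the PayGo model feels slow to end users.
    """
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-35-turbo",  # placeholder deployment name
                messages=messages,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError("Still throttled after retries")

response = chat_with_retries([{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)
```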

By now, you may have guessed the solution: dedicated capacity. Azure OpenAI offers Provisioned Throughput Units (PTUs). These units are reserved, so you get predictable latency, throughput, and cost. A PTU deployment acts like a private channel to your LLMs that only you can use.

Benefits of using PTUs

PTUs address all the challenges above by committing to the number of tokens that will be available to your deployment. If demand grows, you can scale by purchasing additional PTUs. In addition, you can further reduce latency and optimize cost if prompts are crafted to support KV caching. More about it here.

Note: Purchasing PTUs is only possible by engaging your Azure sales representative; a link to the purchase process can be found here.

The Azure OpenAI team has published a calculator that takes the number of requests per minute and the number of tokens per request and per response, and calculates the PTUs required. PTUs can only be purchased in blocks of units; in the screenshot below (for GPT-4), you may notice that 94.97 is rounded up to the nearest block.

PTU calculator (Snapshot taken from https://oai.azure.com/portal/calculator)
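For intuition, the rounding step can be sketched in a few lines of Python. The throughput-per-PTU figure and the block sizes below are illustrative assumptions, not published rates; always use the official calculator for real numbers.

```python
# A back-of-the-envelope sketch of the calculator's rounding step.
# tpm_per_ptu, min_ptus, and increment are illustrative assumptions,
# not published rates; use the official calculator for real numbers.
import math

def estimate_ptus(requests_per_min, tokens_per_request,
                  tpm_per_ptu=300, min_ptus=100, increment=100):
    tpm_needed = requests_per_min * tokens_per_request
    raw_ptus = tpm_needed / tpm_per_ptu                 # e.g. 94.97
    blocks = math.ceil(raw_ptus / increment) * increment
    return max(min_ptus, blocks)

# 57 requests/min at 500 tokens each -> 28.5K TPM -> 95 raw PTUs,
# rounded up to a 100-PTU block.
print(estimate_ptus(57, 500))
```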

PTUs are not a one-shot solution to every production-readiness problem. The next section focuses on strategies to use PTUs effectively and combine them with PayGo deployments for optimal cost and user experience.

Matters to ponder about PTUs

For optimal user experience and effective utilization of allocated throughput, it’s crucial to understand the following aspects of your application.

  • What is the usage pattern of the application? Examples: sinusoidal, seasonal, occasional peaks
  • What is the SLA commitment to the end user? Example: 95% of requests served in < 1 sec
  • What is the budget for the total cost of ownership? Example: $10K per month

Usage-driven PTUs

PTUs can be perceived as costly if not strategically allocated. Unlike other capacity-planning exercises, PTUs are not usually sized for peak loads. An economical strategy is to provision enough PTUs to cover regular usage and divert the peaks to the PayGo model, assuming the overall performance still meets the SLA commitment to the end user. To elaborate, say 95% of your requests follow a predictable pattern and 5% are sudden peaks: allocate PTUs that serve the 95% and redirect the sudden peaks to PayGo. This is why it is so important to understand how your application will be used. More usage-centric provisioning strategies are explained here.

If you decide to use PTUs and PayGo together, the question remains: how do we redirect traffic to the different OpenAI endpoints? Azure APIM is an excellent solution to this problem, providing a common endpoint in front of both the PTU and PayGo endpoints. Using Azure APIM, you can scale your deployments with different cost strategies, measure usage, apply security policies, and divert traffic intelligently. The image below shows one such strategy, which routes requests based on user tier. For more on such strategies, read this blog post.

User Profile based routing using APIM as gateway
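In production this routing logic would live in an APIM policy, but the spillover idea itself is simple. Here is a minimal application-level sketch in Python, with placeholder endpoints and deployment names: regular traffic goes to the PTU deployment first, and throttled requests fall back to PayGo.

```python
# A minimal application-level sketch of spillover routing: send traffic
# to the PTU deployment first and fall back to PayGo when PTU capacity
# is saturated. In production, encode this as an APIM policy instead.
# All endpoint and deployment names are placeholders.
from openai import AzureOpenAI, RateLimitError

ptu_client = AzureOpenAI(
    azure_endpoint="https://my-ptu-resource.openai.azure.com",    # placeholder
    api_key="<ptu-key>", api_version="2024-02-01",
)
paygo_client = AzureOpenAI(
    azure_endpoint="https://my-paygo-resource.openai.azure.com",  # placeholder
    api_key="<paygo-key>", api_version="2024-02-01",
)

def complete(messages):
    try:
        # Regular traffic (the ~95%) is served by reserved capacity.
        return ptu_client.chat.completions.create(
            model="gpt-4-ptu", messages=messages)   # placeholder deployment
    except RateLimitError:
        # Sudden peaks (the ~5%) spill over to the PayGo deployment.
        return paygo_client.chat.completions.create(
            model="gpt-4", messages=messages)       # placeholder deployment
```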

Benchmark application with PTUs

How do you know whether the allocated PTUs meet your traffic needs, given parameters like token size, model name, and so on? To find out, benchmark how the model responds to variable traffic at different token sizes, model versions, etc. You may use the OAI benchmarking solution to evaluate response times, retry strategies, and tokens used.
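If you want a quick sanity check before running the full benchmarking solution, a rough latency probe is easy to sketch. The following is a minimal, illustrative load test, not the official tool; the endpoint and deployment names are placeholders.

```python
# A rough latency probe (not the official Azure OpenAI benchmarking
# tool): fire N concurrent requests and record end-to-end times.
# Endpoint and deployment names are placeholders.
import asyncio
import time

from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint="https://my-ptu-resource.openai.azure.com",  # placeholder
    api_key="<your-api-key>", api_version="2024-02-01",
)

async def timed_call(prompt):
    start = time.perf_counter()
    await client.chat.completions.create(
        model="gpt-4-ptu",  # placeholder deployment
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return time.perf_counter() - start

async def main(concurrency=20):
    latencies = await asyncio.gather(
        *[timed_call("Summarize PTUs in one line.") for _ in range(concurrency)])
    latencies.sort()
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    print(f"p95 latency: {p95:.2f}s over {concurrency} concurrent requests")

asyncio.run(main())
```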

Understanding cost savings from PTUs

It is important to understand how PTUs are calculated to know whether you benefit from moving to them. Under the PayGo system, your expenses grow linearly with the number of tokens you use; in the PTU model, you pay for blocks of capacity. For example, if you consume 1,000 tokens per request (input and output combined) at 10 requests per minute, the estimated load is around 10K TPM. Although the recommended PTU count in this case is 36, the calculator rounds it up to 100 PTUs, so we are clearly paying for unused capacity. In a nutshell, the gap between the recommended PTU count and the rounded-up PTU count should be small for PTUs to pay off. You benefit in the long term provided the PTUs are allocated with the right strategy, by understanding and classifying your application by usage pattern and promised SLA. Since the PTU purchase model follows step-wise increments, the nearer your consumption is to the upper bound of a block, the greater the savings, whereas PayGo cost simply grows linearly with consumption.
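To make the break-even intuition concrete, here is a back-of-the-envelope comparison. All prices and the per-PTU throughput implied below are illustrative assumptions, not published rates; substitute figures from your own pricing sheet.

```python
# A back-of-the-envelope comparison of PayGo vs. PTU cost. All prices
# here are illustrative assumptions, not published rates.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def paygo_cost(tpm, price_per_1k_tokens=0.04):
    tokens_per_month = tpm * MINUTES_PER_MONTH
    return tokens_per_month / 1000 * price_per_1k_tokens

def ptu_cost(ptus, price_per_ptu_month=260):
    return ptus * price_per_ptu_month

# 10K TPM needs ~36 PTUs but is rounded up to a 100-PTU block, so most
# of the reserved capacity sits idle and PayGo wins at this volume.
print(f"PayGo at 10K TPM: ${paygo_cost(10_000):,.0f}/month")  # ~$17,280
print(f"100 PTUs:         ${ptu_cost(100):,.0f}/month")       # ~$26,000
# Near the block's upper bound (~28K TPM if one PTU handles ~278 TPM),
# the same 100 PTUs serve ~3x the traffic at a fixed price and win.
print(f"PayGo at 28K TPM: ${paygo_cost(28_000):,.0f}/month")  # ~$48,384
```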

References:

- https://nathan.gs/2024/02/27/azure-openai-multi-applications-and-scale/

- https://github.com/microsoft/AzureOpenAI-with-APIM?tab=readme-ov-file#api-management-to-azure-openai

- https://techcommunity.microsoft.com/t5/apps-on-azure-blog/build-an-enterprise-ready-azure-openai-solution-with-azure-api/ba-p/3907562

- https://techcommunity.microsoft.com/t5/fasttrack-for-azure/using-azure-api-management-circuit-breaker-and-load-balancing/ba-p/4041003

Written by Srikanth Machiraju

Cloud Solution Architect at Microsoft | AI & ML Professional | Published Author | Research Student
