Understanding and Managing Cost in Fabric

Declan Morris

With the advent of fully SaaSified end-to-end data platforms like Fabric, throwing money at problems has never been easier. Fabric specifically (in contrast to Databricks and Snowflake) abstracts away the underlying compute infrastructure entirely, adopting the “trust me bro” model where we assume that Microsoft provisions the appropriate VMs for the job and charges us reasonably for them. Of course, not having built-in infrastructure visibility doesn’t mean we can’t verify this assumption through other means. Running workloads at scale using a free trial capacity and analyzing the utilized compute lets you ensure you aren’t increasing costs excessively by migrating workloads to Fabric. This is also Microsoft’s recommended way to estimate which capacity tier to provision for your use case (I expect this will change once the SKU Estimator tool matures).

Fabric Capacities

Compute Billing

Fabric capacities are available in tiers based on the number of provisioned Capacity Units (CUs), offered in either pay-as-you-go or reservation models. They encapsulate all of the managed compute behind your Fabric workloads, allowing for easy price calculation and predictability despite the wide variety in Fabric’s workload engines. There is no universal definition for what a CU actually represents in terms of compute power; rather, CUs are mostly used to track levels of consumption within a capacity (e.g. resource X consumed 0.12 CU seconds). We can, however, derive a rough idea of what a CU is capable of from the smattering of CU-to-other-metric translations across the Fabric documentation. For example, when running Spark jobs in Fabric, 1 CU roughly equals 2 Spark VCores. SQL Database in Fabric, on the other hand, uses the conversion 1 CU equals 0.383 database vCores. Even within their specific contexts these conversions have nuances, so apply them with care.
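As a back-of-the-envelope illustration, the documented conversions above can be turned into simple sizing helpers. This is a sketch only: the ratios are the ones quoted above, and the assumption that an F64 SKU provides 64 CUs follows Fabric's F-SKU naming convention.

```python
# Rough CU conversions quoted in the Fabric docs; the figures and their
# caveats may change, so treat these strictly as sizing heuristics.
SPARK_VCORES_PER_CU = 2        # Spark jobs: 1 CU ~ 2 Spark VCores
SQL_DB_VCORES_PER_CU = 0.383   # SQL Database in Fabric: 1 CU ~ 0.383 vCores

def spark_vcores(capacity_units: float) -> float:
    """Approximate Spark VCores available on a capacity of the given size."""
    return capacity_units * SPARK_VCORES_PER_CU

def sql_db_vcores(capacity_units: float) -> float:
    """Approximate SQL Database vCores for the same capacity."""
    return capacity_units * SQL_DB_VCORES_PER_CU

# An F64 capacity provides 64 CUs (per the F-SKU naming convention):
print(spark_vcores(64))   # 128 Spark VCores
print(sql_db_vcores(64))  # ~24.5 vCores
```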

Storage Billing

Fabric storage bills against your capacity at a flat per-GB-per-month rate that varies by storage type, and it is charged regardless of whether your capacity is paused. The primary storage types are standard, cache, and BCDR OneLake storage. Non-OneLake storage types have their own rates but often mirror automatically into OneLake for analytics. You also get a hefty allowance of free mirrored data in OneLake, which matters because certain Fabric items mirror their data into OneLake by default; the exact amount of free mirrored data scales with your capacity tier.
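A minimal sketch of how this billing model composes, with deliberately hypothetical numbers: the per-GB rate below is illustrative only (check the Fabric pricing page for actual regional rates), and the free-mirroring allowance is modeled on the docs' description of roughly 1 TB of free mirrored storage per CU, which you should verify before relying on.

```python
def monthly_storage_cost(gb: float, rate_per_gb_month: float) -> float:
    """Flat-rate OneLake-style storage billing: size x rate, charged
    whether or not the capacity is paused."""
    return gb * rate_per_gb_month

def billable_mirrored_tb(mirrored_tb: float, capacity_units: float) -> float:
    """Mirrored data billable after the free allowance, assuming the
    documented ~1 TB free per CU (verify the current figure)."""
    return max(0.0, mirrored_tb - capacity_units)

# Illustrative rate only, not a quoted price:
cost = monthly_storage_cost(500, 0.023)
excess = billable_mirrored_tb(100, 64)   # 36 TB billable on an F64
```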

Other Costs

In addition to compute and storage resources, there are a few more Fabric features that consume capacity resources. Fabric’s AI integration is rapidly evolving and currently provides a specialized Copilot for most developer/consumer experiences. AI Functions and Fabric Data Agents leverage pre-trained and custom models from Azure AI Foundry and AI Services and bill directly against your capacity compute. This makes them trivial to set up and use, but the ensuing costs are entirely abstracted from the user. The only way (as of this writing) to gauge their cost is to try them out and track the resulting CU usage in the Capacity Metrics App. Unfortunately, AI features are unavailable in trial capacities, so you will need a paid capacity to do this cost analysis.

Pay-as-you-go, not pay-as-you-use

When you provision a Fabric capacity, you choose between a reservation or a pay-as-you-go capacity. Reservations are ~40% cheaper, but you are locked in for a year minimum. PAYG capacities can be scaled up and down, paused, resumed, and deprovisioned at any point. An important detail to keep in mind is that PAYG does NOT mean you are billed based on actual usage. As long as the capacity is active, you will always pay the full PAYG rate even if no workloads are running. For those conducting research using an AI assistant, beware: this is something current copilots and chatbots frequently misinterpret. To get the true ‘only pay for what you use’ experience, you’ll need to manually suspend or scale down the capacity when it is not in use.
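To make the "pay-as-you-go is not pay-as-you-use" point concrete, here is a sketch of the billing math. The hourly rate is a placeholder, not a quoted price; the 730-hours figure is Azure's usual monthly billing convention.

```python
HOURS_PER_MONTH = 730  # Azure's standard billing convention for a month

def payg_monthly_cost(hourly_rate: float,
                      active_hours: float = HOURS_PER_MONTH) -> float:
    """PAYG bills for every hour the capacity is *active*, regardless of
    whether any workload actually runs during those hours. Only pausing
    (reducing active_hours) reduces the bill."""
    return hourly_rate * active_hours

rate = 1.0  # hypothetical hourly rate for illustration only

always_on = payg_monthly_cost(rate)                        # 730.0
# Suspended nights and weekends: ~10 active hours on ~22 weekdays:
business_hours_only = payg_monthly_cost(rate, 10 * 22)     # 220.0
```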

Cost Control Strategies

PAYG or Reservation?

First of all, any time you need roughly 60% or higher uptime for your Fabric workload and don’t anticipate a need for upscaling/downscaling in the short term, you should provision a reservation capacity (at ~40% savings, a reservation breaks even once a PAYG capacity would be active more than ~60% of the time). This assumes you have a predictable workload that needs to run for at least one year. In other cases, consider using a PAYG capacity and leveraging the Azure REST API to programmatically pause the capacity when it is not in use (the Fabric REST API manages Fabric items, but not capacities). This can be done via HTTP request or an available SDK. Of course, this approach requires some upfront engineering and problem solving on your end.
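As a sketch of the raw-HTTP route, the Azure Resource Manager API exposes suspend and resume operations on `Microsoft.Fabric/capacities` resources. The api-version string below is an assumption (check the current Azure REST reference), and acquiring the bearer token (e.g. via `azure-identity`) is left out for brevity.

```python
import urllib.request

API_VERSION = "2023-11-01"  # assumed Microsoft.Fabric api-version; verify

def capacity_action_url(subscription_id: str, resource_group: str,
                        capacity_name: str, action: str) -> str:
    """Build the ARM URL for a capacity operation.
    `action` is 'suspend' or 'resume'."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Fabric/capacities"
        f"/{capacity_name}/{action}?api-version={API_VERSION}"
    )

def post_capacity_action(url: str, bearer_token: str) -> int:
    """POST the suspend/resume request and return the HTTP status code."""
    req = urllib.request.Request(
        url, method="POST", data=b"",
        headers={"Authorization": f"Bearer {bearer_token}"})
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The same operations are available through the Azure SDKs if you prefer not to hand-roll requests.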

The Capacity Metrics App

Fabric does not currently provide an out-of-the-box solution for auto-suspending a capacity based on usage. The Capacity Metrics App provides some visibility into CU consumption and can trigger event-driven workflows when used with Data Activator, but Activator’s triggerable actions are currently limited to Fabric items and Teams/email messages. Fabric items, of course, include notebooks which can make the API calls necessary to pause or scale capacities appropriately.

This comes with several limitations, the most apparent of which is that notebooks themselves require an active capacity to run. To suspend a capacity via a Fabric notebook, you would need to maintain a separate Fabric capacity to run those notebooks. You could have a notebook suspend its own capacity, but it would not be able to programmatically resume that capacity afterwards. This renders CU-usage-based capacity suspension inflexible compared to the alternatives.

Monitoring at the Azure Resource Level

Rather than pausing and resuming capacities based on usage, a better-supported approach is a schedule- or spending-based strategy. The Capacity Metrics App introduces the limitations addressed earlier, so to get around them you can instead monitor spending at the capacity level - that is, at the Azure resource rather than at Fabric items. Monitoring at the capacity level means working through the Azure portal, which has a more mature cost monitoring and budgeting system: Azure resources come integrated with cost monitoring capabilities, allowing us to define budgets as well as action groups that trigger when those budgets are exceeded.

One such triggerable action is calling an Azure Function, which we can use to pause or resume a capacity. This solution can actually be implemented free of cost using the consumption tier of Azure Functions, but keep in mind that you will lose observability into the API call results. Since the suspend/resume request can exceed the timeout limit of consumption-tier functions, you will need a tier with longer timeout limits to receive the results of your request. Note that the capacity will still undergo the suspend/resume operation; you will just get a timeout error on your function instead of a success or failure message.
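One way to cope with this on the cheap tier is to treat a client-side timeout as "outcome unknown" rather than failure, and verify the capacity's state with a follow-up GET later. This is a sketch under stated assumptions: the state values ('Active'/'Paused') and the `properties.state` field reflect my reading of the ARM capacity resource and should be verified against the current REST reference.

```python
import json
import urllib.error
import urllib.request

def post_with_timeout(url: str, token: str, timeout_s: int = 30):
    """POST suspend/resume with a short timeout. On timeout, the
    operation usually still completes server-side, so we return None
    ('unknown') instead of raising."""
    req = urllib.request.Request(
        url, method="POST", data=b"",
        headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status
    except (TimeoutError, urllib.error.URLError):
        return None

def classify_result(status) -> str:
    """Map an HTTP status (or None for a timeout) onto an outcome."""
    if status is None:
        return "unknown-verify-later"  # confirm via a follow-up GET
    return "accepted" if 200 <= status < 300 else "failed"

def capacity_state(resource_url: str, token: str):
    """GET the capacity resource to confirm the operation landed
    (assumed field: properties.state, e.g. 'Active' or 'Paused')."""
    req = urllib.request.Request(
        resource_url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("properties", {}).get("state")
```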

This approach provides a quick, easily implemented solution for monitoring and limiting Fabric spending, but it comes with a few limitations beyond the technicalities of Azure Functions and API calls. For one, you will not have granular visibility into consumption by domain/resource. Your budgets and action groups will only be able to monitor the total cost accrued by the capacity (this aggregates across all workspaces’ storage and compute usage). To limit Fabric spending by department or team, you will need a distinct capacity assigned to each entity.

Further Reading

Fabric is rapidly evolving, and its cost model may well become more flexible and easier to manage in the future. To keep up to date on the latest updates and features, refer to the Microsoft Fabric documentation and Fabric Update Blog.
