Automating Microsoft Fabric: Extracting Identity Support data

Peer Grønnerup

In Microsoft Fabric, REST APIs play a crucial role in automating and optimizing various aspects of platform management, from CI/CD processes to maintaining a data lakehouse. They enable seamless interactions with Fabric items, making it easier to streamline data workflows and handle large-scale operations with minimal manual intervention. Understanding which identities - such as service principals or managed identities - are supported by different Fabric REST API endpoints is essential to ensure secure and efficient platform management.

And wouldn't it be great if we didn't have to visit each individual API documentation page to check which Microsoft Entra identities are supported? Constantly navigating through multiple pages to find this information can be time-consuming and inefficient. Fortunately, there's a way to automate this process, allowing us to extract and centralize the data with ease - saving both time and effort.

In this blog post, I'll walk through how to scrape Microsoft Fabric REST API documentation using a Fabric Notebook to extract information on supported identities for each endpoint. Once the data is extracted, we can leverage Semantic Link Labs to build a semantic model that exposes data from the Fabric Lakehouse.

And finally, we can create a report using Semantic Link Labs, offering insights into how these identities are supported across the various Fabric APIs.

The task of accomplishing the above breaks down into three steps:

  • Extracting information from the Fabric REST API documentation

  • Creating a semantic model using Semantic Link Labs

  • Creating a Power BI report using Semantic Link Labs

Extracting Fabric REST API identity support information

To automate the extraction of identity support information from the Microsoft Fabric REST API documentation, I used BeautifulSoup (from the bs4 library) to scrape the necessary data directly from the Microsoft Learn site. Here's a brief overview of how the process works:

  1. Setting up Fabric Items: Start by creating a new Workspace and assigning it to a Fabric capacity. Next, create a new Lakehouse within the workspace, import the two sample notebooks, and upload the report.json file to the unmanaged Files section of your Lakehouse. You can find a link to these samples in the Conclusion section of this blog post.

  2. Fetching the API Documentation: The code starts by making an HTTP request using requests.get() to fetch the table of contents (TOC) from Microsoft Learn, which is structured in JSON format. The TOC contains links to each API's documentation page.

  3. Parsing the HTML: For each API page, BeautifulSoup parses the HTML content, looking for a specific section that lists the supported Microsoft Entra identities (e.g., User, Service Principal, and Managed Identities).

  4. Extracting the Identity Data: Once the correct section is found, the code extracts the table containing identity types. The table rows are iterated over to capture the identity information for each API endpoint, storing the results in a structured format (data_list).

  5. Handling Nested Documentation: My function extract_all_articles() recursively navigates through nested API documentation sections, ensuring that all relevant pages are checked, even when organized in hierarchical structures.

import requests

from pyspark.sql.types import StructType, StructField, StringType
from bs4 import BeautifulSoup
from pyspark.sql.functions import *

baseurl = "https://learn.microsoft.com/en-us/rest/api/fabric/"

# Fetch the Fabric REST API documentation table of contents (TOC)
response = requests.get(baseurl + "toc.json")
data = response.json()

# Call the extract_all_articles function and store the return value as data_list
data_list = extract_all_articles(data)
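
The helper that does the heavy lifting is extract_all_articles(), which is included in the sample notebooks linked in the Conclusion. For readers who want to see the mechanics of steps 3 to 5 without opening the samples, here is a minimal sketch of what such a scraper can look like. The TOC keys (items, href, toc_title), the heading text it searches for, and the output field names are assumptions about the Microsoft Learn page structure, so align them with the live documentation before relying on the results.

import requests
from bs4 import BeautifulSoup

def extract_identity_table(article_url, article_title, data_list):
    # Fetch a single API reference page and pull the supported-identities table, if present
    soup = BeautifulSoup(requests.get(article_url).text, "html.parser")

    # Locate the section listing the supported Microsoft Entra identities
    # (the heading text below is an assumption - adjust it to match the live docs)
    heading = soup.find(lambda tag: tag.name in ("h2", "h3")
                        and "supported identities" in tag.get_text(strip=True).lower())
    if heading is None:
        return

    table = heading.find_next("table")
    if table is None:
        return

    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2:
            data_list.append({
                "api": article_title,
                "url": article_url,
                "identity": cells[0],
                "support": cells[1],
            })

def extract_all_articles(node, data_list=None):
    # Recursively walk the TOC JSON and scrape every article it links to
    if data_list is None:
        data_list = []
    for item in node.get("items", []):
        if item.get("href"):
            extract_identity_table(baseurl + item["href"], item.get("toc_title", ""), data_list)
        if item.get("items"):
            extract_all_articles(item, data_list)
    return data_list

In this sketch, the returned data_list contains one entry per API endpoint and identity type, which is the shape we persist in the next step.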

This approach allows us to programmatically gather the identity support data, eliminating the need to manually check each API page.

Once collected, the data can be processed further or integrated into a Fabric Lakehouse for analysis. In our case, we convert data_list to a Spark DataFrame and write it to a Delta table in our lakehouse. We also create a manual table holding each identity option; this table will be used for grouping and filtering APIs in the Power BI report we will create later.
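
The write itself is straightforward. The sketch below assumes the field names from the scraping sketch above and uses hypothetical table names (fabric_api_identities and identity_options); the sample notebooks may name things differently.

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("api", StringType(), True),
    StructField("url", StringType(), True),
    StructField("identity", StringType(), True),
    StructField("support", StringType(), True),
])

# Persist the scraped identity support data as a Delta table in the attached lakehouse
df_apis = spark.createDataFrame(data_list, schema=schema)
df_apis.write.mode("overwrite").format("delta").saveAsTable("fabric_api_identities")

# Manual dimension table with one row per identity option,
# used for grouping and filtering in the Power BI report
identity_options = [("User",), ("Service principal",), ("Managed identity",)]
df_identities = spark.createDataFrame(identity_options, ["identity"])
df_identities.write.mode("overwrite").format("delta").saveAsTable("identity_options")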

After extracting the necessary data from the Microsoft Fabric REST API documentation, the next step is to leverage Semantic Link Labs to create a semantic model. Semantic Link Labs is a Python library designed for use in Microsoft Fabric notebooks. It extends the capabilities of Semantic Link, offering additional functionality that integrates seamlessly with it, and it simplifies building semantic models, reports, and more directly from our Fabric notebooks.

To use Semantic Link Labs we first need to install the Semantic Link Labs package within our Fabric Notebook environment. This can be done by running:

%pip install semantic-link-labs

Once Semantic Link Labs is installed, we can generate a blank semantic model as a foundation to which we will add our extracted data.

This blank model serves as a starting point, where we’ll later introduce the tables and data derived from our scraping process, along with the specific measures and hierarchies needed for reporting.

import sempy_labs as labs
from sempy_labs.tom import connect_semantic_model
from sempy_labs import report

lakehouse_name = "FabricDocs"
lakehouse = mssparkutils.lakehouse.get(lakehouse_name)
workspace_name = notebookutils.runtime.context.get("currentWorkspaceName")

# Create a new blank semantic model
semantic_model_name = f"{lakehouse_name}_Model"
labs.create_blank_semantic_model(semantic_model_name)

After creating the blank model, we will connect to it (using connect_semantic_model) and add objects such as tables, expressions, and hierarchies.
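
As a rough illustration of what that step can look like, the sketch below opens the model in write mode and adds a Direct Lake table with its columns and a simple measure. The table and column names follow the hypothetical names used earlier, and the exact TOM wrapper methods and parameters (add_expression, add_entity_partition, add_data_column, add_measure) may vary between Semantic Link Labs versions, so check the library documentation against your environment.

import sempy_labs as labs
from sempy_labs import directlake
from sempy_labs.tom import connect_semantic_model

with connect_semantic_model(dataset=semantic_model_name, readonly=False, workspace=workspace_name) as tom:
    # Direct Lake models need a shared expression pointing at the lakehouse SQL endpoint
    tom.add_expression(
        name="DatabaseQuery",
        expression=directlake.generate_shared_expression(item_name=lakehouse_name, workspace=workspace_name)
    )

    # Add a table backed by the Delta table written earlier (entity partition = Direct Lake)
    tom.add_table(name="fabric_api_identities")
    tom.add_entity_partition(table_name="fabric_api_identities", entity_name="fabric_api_identities")

    for col in ["api", "url", "identity", "support"]:
        tom.add_data_column(table_name="fabric_api_identities", column_name=col, source_column=col, data_type="String")

    # A simple measure for the report
    tom.add_measure(table_name="fabric_api_identities", measure_name="API Count", expression="COUNTROWS('fabric_api_identities')")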

Finally, after setting up the semantic model, we will create a report that exposes the extracted data from our Direct Lake semantic model. This is also achieved using Semantic Link Labs, which enables us to seamlessly generate reports based on the data stored in the model.

The following code is used to create the report:

import json

# Read the report definition file as a DataFrame where each row represents a line in the file
df = spark.read.text("Files/report.json")

# Convert the DataFrame rows (lines) into a single string and parse it as JSON
json_raw = ''.join(df.rdd.map(lambda row: row[0]).collect())
jobject = json.loads(json_raw)

# Create a new report based on the report.json file located in our Lakehouse
labs.report.create_report_from_reportjson(
    report="Fabric REST API Docs", 
    dataset=semantic_model_name, 
    report_json=jobject, 
    workspace=workspace_name
    )

This code reads the JSON file containing the report definition and uses it to create a new report tied to the semantic model we built earlier, making it easy to visualize and analyze the identity data extracted from the Microsoft Fabric REST API documentation directly within our Fabric environment.

Conclusion

The Microsoft Fabric APIs are essential for automating key components of your Fabric setup, providing a strong foundation for CI/CD, governance, and scaling your data platform. By extracting and centralizing identity support information from the API documentation, you can streamline processes and ensure that your platform is built with both efficiency and security in mind.

In the near future, I’ll be publishing more articles on how to leverage the Fabric REST APIs to jumpstart your Fabric Lakehouse Data Platform, manage CI/CD pipelines, and much more. So stay tuned for more insights!

You can download the notebooks and the report.json file used in this post here: https://github.com/gronnerup/Fabric/tree/main/FabricRestApiDocs.


Written by

Peer Grønnerup

Principal Architect | Microsoft Fabric Expert | Data & AI Enthusiast

With over 15 years of experience in Data and BI, I specialize in Microsoft Fabric, helping organizations build scalable data platforms with cutting-edge technologies. As a Principal Architect at twoday, I focus on automating data workflows, optimizing CI/CD pipelines, and leveraging Fabric REST APIs to drive efficiency and innovation. I share my insights and knowledge through my blog, Peer Insights, where I explore how to leverage Microsoft Fabric REST APIs to automate platform management, manage CI/CD pipelines, and kickstart Fabric journeys.