Unstructured To Structured Data : Extracting Features From Fabric Monthly Update Blog

I am about 4 months late publishing this blog. I wrote it back in September last year and forgot about it until I saw the latest update from the Fabric team 😁.
The Microsoft Fabric team releases a new update to the Fabric platform every month. The blogs are long, packed with details, and sometimes hard to consume simply because of their length. This is especially true for updates released around major events like FabCon, Build, Ignite, etc. So I wanted to see if/how I could use LLMs to extract data from these monthly update blogs as structured output so it's easier to consume. My other goal was to test the AI services in Fabric, which I have talked about in the past. While I was successful in extracting the data, I ran into some challenges with Fabric, which I will cover in the second half of the blog. First, I will talk about my strategy and how I set it up, which should be applicable with or without Fabric.
Input
Typically, if you want to extract data from a webpage, you would use BeautifulSoup to scrape it. The challenge with this method is that you need to clean the data, remove HTML tags, and ensure only the relevant content is passed to reduce the number of tokens used. However, a better and more efficient approach that worked for me is to extract the webpage as markdown, which has several advantages:
Keeps the structure of the page, e.g. the monthly blog has headers and sub-headers; markdown preserves this hierarchy, which the LLM can use
Reduces the number of tokens because there are no HTML/XML tags to clean. You only need to clean special characters (#, !) and whitespace
The text is LLM-friendly. Any ALT text in images and videos is also captured in the link, which can provide additional context to the LLM
Thankfully, Jina.AI makes this very easy, and it's free. All you have to do is prepend https://r.jina.ai/ to any link, and it will return the markdown version of the blog, which can be fed to the LLM without much processing.
import requests
from textwrap import dedent

def get_blog(url):
    ## Get blog content as markdown using the Jina reader API
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Error: {e}")
        return None

url = "https://r.jina.ai/https://blog.fabric.microsoft.com/en-US/blog/microsoft-fabric-january-2025-update/"
content = dedent(get_blog(url)) # dedent to remove whitespace and reduce #tokens. I could have preprocessed the text more.
Jina is not the only way. You could use Docling, markdown-it, and other libraries, but with Jina there is no installation needed and the output is well structured.
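The light cleanup mentioned above (special characters and whitespace) can be sketched with a small regex pass. This is just an illustration of the idea, not the exact preprocessing I used; the `clean_markdown` helper and the sample text are hypothetical, and the right cleaning rules depend on the page:

```python
import re
from textwrap import dedent

def clean_markdown(text: str) -> str:
    """Light cleanup of reader-style markdown to trim tokens without losing structure."""
    text = dedent(text)
    # Drop image embeds like ![alt](url) but keep the alt text for context
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Collapse runs of 3+ newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim trailing whitespace on each line
    text = "\n".join(line.rstrip() for line in text.splitlines())
    return text.strip()

sample = "# Header\n\n\n\n![chart alt text](https://example.com/img.png)   \nBody text."
print(clean_markdown(sample))
```

Note how the ALT text survives while the URL and markdown syntax are dropped, so the LLM keeps the context at a lower token cost.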
The Prompt
This is the most important part. I did a lot of research on prompt strategies, methods, and research papers on how best to extract entities and summaries from text. I will write about that some other time, but in general they all revolve around a couple of key ideas:
Always provide guardrails and constraints to the model for the expected output. This can be in the form of schema, data type etc.
Always provide some examples, if applicable. This is the zero-shot vs. few-shot distinction; in general, the more diverse the examples you can provide, the better. The order of the examples also matters.
Provide very specific instructions on what the model should and should not return.
Provide instructions for the model to deal with ambiguity, e.g. list of possible values and what to do when the model is unsure.
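These four ideas can be combined into a reusable prompt-assembly skeleton. This is a generic sketch of the pattern, not the prompt I actually used (which follows below); the `build_extraction_prompt` helper and its inputs are hypothetical:

```python
def build_extraction_prompt(schema: str, examples: list[str], allowed_values: dict) -> str:
    """Assemble a prompt with schema guardrails, few-shot examples, and ambiguity rules."""
    parts = [
        "Extract structured data matching this JSON schema:",
        schema,
        "Allowed values:",
    ]
    # Guardrails: constrain each field and say what to do when unsure
    for field, values in allowed_values.items():
        parts.append(f"- {field}: one of {', '.join(values)}. If unsure, leave blank.")
    if examples:
        parts.append("Examples:")
        parts.extend(examples)  # order matters: put the most representative last
    parts.append("Return only JSON, no explanation.")
    return "\n".join(parts)

prompt = build_extraction_prompt(
    '{"status": "..."}',
    ['Input: "Now generally available" -> {"status": "New"}'],
    {"status": ["New", "Update", "Deprecation"]},
)
print(prompt)
```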
This research paper gave the best overall prompt structure; I followed their template as a starting point and tuned it from there. In my tests, I experimented with 20+ different prompt strategies and variations before settling on the prompt below:
instruction = """
Extract the following structured data from the given Microsoft Fabric update blog post:
{
  "publicationDate": "The date the blog post was published (YYYY-MM-DD format)",
  "updates": [
    {
      "workload": "The main category (e.g. Power BI Desktop, Power BI, Core, OneLake, Data Warehouse, Data Engineering, Data Science, Real-Time Intelligence, Data Factory, Databases)",
      "fabricItem": "The Fabric item type (must be one of: Power BI Desktop, Data pipeline, Dataflow Gen1, Dataflow Gen2, Eventstream, Mirrored Azure Cosmos DB, Mirrored Azure SQL Database, Mirrored Snowflake, Mirrored database, Notebook, Spark Job Definition, Datamart, Eventhouse, Lakehouse, Sample warehouse, Semantic model, Warehouse, Apache Airflow Job, Azure Data Factory, AI Skill, Environment, Experiment, ML model, KQL Queryset, Activator, SQL database, Scorecard, Dashboard, Paginated report, Real-Time Dashboard, Report, API for GraphQL, Fabric User Data Functions)",
      "feature": "The specific new feature or update being described",
      "description": "A brief 1-2 sentence summary of what the feature does or its key benefits, improvements",
      "status": "Is this a new feature being announced, an update to an existing feature, or a deprecation of a feature. Must be one of: Update, New, Deprecation",
      "tasks": ["List of applicable tasks, must be one or more of: Get data, Store data, Prepare data, Analyze and train data, Track data, Visualize data, Develop data"]
    }
  ]
}
Constraints:
Extract the publication date from the blog post metadata
Use the hierarchical structure to identify workloads, Fabric items, and features
Analyze the description and context to identify the most appropriate Fabric item from the provided list. If enough information is not available, leave it blank.
Summarize descriptions concisely, focusing on key functionality and benefits.
Analyze the description to assign relevant task tags, using only the provided task options
Return only the extracted JSON data without any explanation
Apply the following extraction patterns:
Use the Keyword Trigger Extractor pattern to identify main workloads, Fabric items, and individual features
Apply the Semantic Extractor pattern to understand and summarize the descriptions, and to infer the appropriate Fabric item when not explicitly mentioned
Utilize the Pattern Matcher to identify availability status, and applicable tasks
Employ the Specify Constraints pattern to ensure adherence to the provided lists for Fabric items and tasks
"""
Note the last part. I used Keyword Trigger Extractor etc. but never defined what they mean. I found that including them still helped because it seemed to give the model some direction. In summary, I am asking the model to:
Return the output as structured JSON
Follow the JSON schema I defined
Choose from the provided list of possible values, with instructions for handling ambiguity
Give a 1-2 line description of each item and whether the feature is new, an update, or a deprecation
The JSON returned by the model API follows the schema defined in the prompt above.
Each Fabric item is categorized by one or more tasks in the service. I also instruct the model to analyze the feature description and classify it based on those tasks, e.g. "Get data, Visualize data".
The prompt may seem unnecessary given that Gemini Flash, GPT-4o, Claude, and many other LLMs let you define structured JSON output directly. However, in my tests with Gemini, the output was much better and more robust with the detailed prompt above. Structured output will likely work better with GPT-4o, but I plan to cover that in a future blog when it becomes available in Fabric.
The Model
The task here is not super complicated given what LLMs can now do. However, the main challenge is the input token size. The blog is long, ~25,000 tokens, which limits the choice of models. While most models have an input limit of 32K, I found that as you approach that limit, the quality of the response deteriorates. Hence, in this case I used the gemini-1.5-flash-002 model, which has an input limit of 1M tokens, plus it's cheap (free for personal use). Create a Google AI Studio account to get the API key.
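A quick way to sanity-check input size before picking a model is a rough token estimate. The ~4 characters per token figure below is a common heuristic for English text, not the model's real tokenizer (the Gemini SDK exposes an exact count via `model.count_tokens` if you need precision); `estimate_tokens` is a hypothetical helper:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: English text averages ~4 characters per token."""
    return int(len(text) / chars_per_token)

blog_text = "word " * 20000  # stand-in for the ~100K-character blog markdown
est = estimate_tokens(blog_text)
print(f"~{est} tokens")  # helps decide whether a 32K-context model is enough
```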
## install the Gemini SDK
## %pip install google-generativeai --q
import google.generativeai as genai
import json
import pandas as pd

def create_and_use_model(api_key, model_name, config, instruction, content):
    """
    Returns Gemini's response.

    Parameters:
        api_key (str): The API key for authentication.
        model_name (str): The name of the model.
        config (dict): The configuration settings for the model.
        instruction (str): The system instruction for the model.
        content (str): The content to send as a message.

    Returns:
        dict: The parsed JSON from the model's response.
    """
    genai.configure(api_key=api_key)

    model = genai.GenerativeModel(
        model_name=model_name,
        generation_config=config,
        system_instruction=instruction,
    )

    chat_session = model.start_chat(history=[])
    response = chat_session.send_message(content)
    result = json.loads(response.text)
    return result

api_key = "<>"
model_name = "gemini-1.5-flash-002"
config = {
    "temperature": 1,
    # "top_p": 0.95,
    # "top_k": 40,
    "max_output_tokens": 8192,
    "response_mime_type": "application/json",
}

result = create_and_use_model(api_key, model_name, config, instruction, content)
You can look at the full output here.
Next, tabulate the JSON output for exploration:
date = result['publicationDate']
df = pd.json_normalize(result["updates"]).assign(publicationDate = date)
display(df)
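Once the updates are in a DataFrame, a quick pivot shows how announcements spread across workloads and statuses. A minimal sketch, using hypothetical rows with the column names from the schema above:

```python
import pandas as pd

# Hypothetical extracted rows matching the "updates" schema
df = pd.DataFrame([
    {"workload": "Data Factory", "status": "New"},
    {"workload": "Data Factory", "status": "Update"},
    {"workload": "Power BI", "status": "New"},
])

# Count features per workload, with one column per status
summary = (
    df.groupby(["workload", "status"])
      .size()
      .unstack(fill_value=0)
)
print(summary)
```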
Eval
So how well did it do? Very well, I think. It extracted all the features and announcements and summarized the feature descriptions well, as expected. It sometimes could not identify the relevant Fabric item, but it did not misclassify any. For example, Mirrored database as an item has no workload defined in the blog, so it was classified under Data Factory. Look at the JSON and let me know your thoughts.
Overall, for evaluating different models, configs, and prompts, I looked at:
JSON schema
accuracy
completion of output
classification of items and tasks
reproducibility & consistency
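Some of these checks, like schema conformance and the classification constraints, can be partly automated. A minimal validator sketch against the schema above (the `validate_update` helper is hypothetical, not part of my mlflow instrumentation):

```python
ALLOWED_STATUS = {"Update", "New", "Deprecation"}
ALLOWED_TASKS = {"Get data", "Store data", "Prepare data", "Analyze and train data",
                 "Track data", "Visualize data", "Develop data"}
REQUIRED_KEYS = {"workload", "fabricItem", "feature", "description", "status", "tasks"}

def validate_update(update: dict) -> list[str]:
    """Return a list of schema violations for one extracted update record."""
    errors = []
    missing = REQUIRED_KEYS - update.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if update.get("status") not in ALLOWED_STATUS:
        errors.append(f"bad status: {update.get('status')}")
    bad_tasks = set(update.get("tasks", [])) - ALLOWED_TASKS
    if bad_tasks:
        errors.append(f"bad tasks: {sorted(bad_tasks)}")
    return errors

good = {"workload": "Data Factory", "fabricItem": "Data pipeline", "feature": "X",
        "description": "Y", "status": "New", "tasks": ["Get data"]}
bad = {"feature": "Z", "status": "GA", "tasks": ["Ship data"]}
print(validate_update(good))  # []
print(validate_update(bad))
```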
I created instrumentation for this using mlflow, but I will skip that to keep the blog short.
What About Fabric AI Services?
As I mentioned, one of my goals was to use the Azure OpenAI integration in Fabric. It worked well when I initially tested it on a sample of the blog, but I got errors when I tried the whole blog, for a few reasons:
The most advanced model available in Fabric is gpt-4-32k (as of Jan 30, 2025), which should work for my task, but I think it struggled because of the 32K max token limit. At 32K, to get the full 4K output, the input must be under 28K, which is close to my input size.
Fabric AI services most likely have strict rate limits, because after a couple of attempts the API request failed.
Because of this model's token limit, I only got 32 features back. To be fair, I could have chunked the input to get around this limit.
GPT-4o is available according to the docs, but it has not been rolled out yet. It will certainly address the above limitations, and I will update the post when it is.
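The chunking workaround mentioned above can be sketched as splitting the markdown on section headers so each chunk fits a smaller context window. A hypothetical sketch (the `## ` header convention and the character budget are assumptions; ~4 chars/token is a rough proxy):

```python
def chunk_markdown(text: str, max_chars: int = 80_000) -> list[str]:
    """Split markdown on section headers so each chunk fits a smaller context window.

    max_chars is a rough proxy for tokens (~4 chars/token); tune per model.
    """
    chunks, current = [], []
    size = 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk at a header once the current one is over budget,
        # so features are never cut mid-section
        if line.startswith("## ") and size > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

doc = ("## Section A\n" + "x" * 50 + "\n") * 5
print(len(chunk_markdown(doc, max_chars=100)))
```

Each chunk would then be sent through the same prompt, and the per-chunk `updates` arrays merged afterward.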
I truly hope the Fabric team makes the latest models, especially Microsoft's Phi SLMs, available in Fabric (ideally without the F64+ limitation). This is a fantastic opportunity for Fabric to seamlessly bring LLMs to the masses, and it will be a huge differentiator as AI gains adoption.
Here is the full code if you want to try it in Fabric (in F64+ capacity):
import requests
from textwrap import dedent
import json
import pyspark.sql.functions as F
from synapse.ml.services.openai import OpenAIChatCompletion
from pyspark.sql.types import StructType, StructField, StringType
import pandas as pd

def get_blog(url):
    ## Get blog content as markdown using the Jina reader API
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Error: {e}")
        return None

url = "https://r.jina.ai/https://blog.fabric.microsoft.com/en-US/blog/microsoft-fabric-january-2025-update/"
content = dedent(get_blog(url)) # dedent to remove whitespace and reduce #tokens. I could have preprocessed the text more.

schema = StructType([
    StructField("fabric_blog", StringType(), True)
])

data = [(content, )]
df = spark.createDataFrame(data, schema)

instructions = """
Extract the following structured data from the given Microsoft Fabric update blog post:
{
  "publicationDate": "The date the blog post was published (YYYY-MM-DD format)",
  "updates": [
    {
      "workload": "The main category (e.g. Power BI Desktop, Power BI, Core, OneLake, Data Warehouse, Data Engineering, Data Science, Real-Time Intelligence, Data Factory, Databases)",
      "fabricItem": "The Fabric item type (must be one of: Power BI Desktop, Data pipeline, Dataflow Gen1, Dataflow Gen2, Eventstream, Mirrored Azure Cosmos DB, Mirrored Azure SQL Database, Mirrored Snowflake, Mirrored database, Notebook, Spark Job Definition, Datamart, Eventhouse, Lakehouse, Sample warehouse, Semantic model, Warehouse, Apache Airflow Job, Azure Data Factory, AI Skill, Environment, Experiment, ML model, KQL Queryset, Activator, SQL database, Scorecard, Dashboard, Paginated report, Real-Time Dashboard, Report, API for GraphQL, Fabric User Data Functions)",
      "feature": "The specific new feature or update being described",
      "description": "A brief 1-2 sentence summary of what the feature does or its key benefits, improvements",
      "status": "Is this a new feature being announced, an update to an existing feature, or a deprecation of a feature. Must be one of: Update, New, Deprecation",
      "tasks": ["List of applicable tasks, must be one or more of: Get data, Store data, Prepare data, Analyze and train data, Track data, Visualize data, Develop data"]
    }
  ]
}
Constraints:
Extract the publication date from the blog post metadata
Use the hierarchical structure to identify workloads, Fabric items, and features
Analyze the description and context to identify the most appropriate Fabric item from the provided list. If enough information is not available, leave it blank.
Summarize descriptions concisely, focusing on key functionality and benefits.
Analyze the description to assign relevant task tags, using only the provided task options
Return only the extracted JSON data without any explanation
Apply the following extraction patterns:
Use the Keyword Trigger Extractor pattern to identify main workloads, Fabric items, and individual features
Apply the Semantic Extractor pattern to understand and summarize the descriptions, and to infer the appropriate Fabric item when not explicitly mentioned
Utilize the Pattern Matcher to identify availability status, and applicable tasks
Employ the Specify Constraints pattern to ensure adherence to the provided lists for Fabric items and tasks
"""

chat_df = (df
    .withColumn("messages",
        F.array(
            F.struct(
                F.lit("system").alias("role"),
                F.lit("system").alias("name"),
                F.lit(instructions).alias("content")),
            F.struct(
                F.lit("user").alias("role"),
                F.lit("user").alias("name"),
                F.col("fabric_blog").alias("content")
            )
        )
    )
)

chat_completion = (
    OpenAIChatCompletion()
    .setDeploymentName("gpt-4-32k")
    .setMessagesCol("messages")
    .setErrorCol("error")
    .setMaxTokens(8192)
    .setTemperature(0.7)
    .setOutputCol("completions")
)

result = chat_completion.transform(chat_df).cache()
json_df = result.select(F.col("completions.choices").getItem(0).getField("message").getField("content"))
result2 = json_df.toPandas().iloc[0, 0]
json_result = json.loads(result2.replace('\n', ''))
final_df = pd.json_normalize(json_result['updates'])
final_df
You can download the notebook from here.
Notes:
This week it is obligatory to mention DeepSeek in anything related to LLMs 😁, so I will just mention that back in September I actually used DeepSeek's model as the starting point but hit the token limit. The new R1 LRM will not work directly because it does not support structured JSON output; you would have to pair it with another small LLM. One nice feature of DeepSeek is that you can use the OpenAI SDK, so it's easy to swap out models for testing.
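The OpenAI-compatible swap boils down to changing the base URL and model name; the request payload itself stays the same. A sketch of that payload with placeholder values (the model string and prompt contents are illustrative; check each provider's docs for the exact endpoint, model names, and supported `response_format` options):

```python
import json

def build_chat_request(model: str, system_prompt: str, user_content: str) -> dict:
    """OpenAI-style chat-completions payload; works with any compatible endpoint."""
    return {
        "model": model,  # swapping providers usually only changes this and the base URL
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        "response_format": {"type": "json_object"},  # if the model supports it
    }

payload = build_chat_request("deepseek-chat", "Extract JSON...", "blog text here")
print(json.dumps(payload, indent=2)[:120])
```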
I used synapseml above, but you could use Python too. The Python notebook doesn't have the OpenAI SDK installed, but you can call the API directly. As I mentioned above, I will update the blog when GPT-4o is available in Fabric.
Written by

Sandeep Pawar
Microsoft MVP with expertise in data analytics, data science and generative AI using Microsoft data platform.