Scan Fabric Workspaces With Scanner API Using Semantic Link Labs

It’s finally here 🎉! Thanks to Michael Kovalsky, one of the most requested and anticipated APIs is now available in Semantic Link Labs (v0.8.10): the Scanner API. The Scanner API in the Fabric Admin REST APIs allows Fabric administrators to retrieve detailed metadata about their organization's Fabric items, supporting governance and compliance efforts. It provides information such as item names, descriptions, creation dates, lineage, connection strings, etc. It’s not new; we have been using it in Power BI for a long time, but in the Fabric world it’s even more important given the number of items and configurations.
It’s an admin API, so you need to be a Fabric administrator to use it.
All the API requirements and limitations still apply; read the documentation for details.
Semantic Link Labs automatically handles the continuation token and returns the JSON object.
You can scan the entire tenant or selected workspaces. Currently, you can scan up to 100 workspaces per call, but you can iterate and parallelize to scan more. You can make up to 16 simultaneous calls (see my code below); read the documentation for details and limitations. I have not tested it with thousands of workspaces, so be mindful of the API limits and stay within the documented guidelines.
Currently, you cannot use a Service Principal in Semantic Link Labs, but hopefully that will be available soon.
You need to extract the individual items yourself, but expect that to be supported out of the box very soon. I have tested my extraction functions below as much as I could. Each item type has its own metadata, so if you want to contribute, please share your function in the comments.
You can use Polars, DuckDB, or Spark to save the dataframes to a Lakehouse. Be sure to either explode the arrays or convert them to strings; otherwise the SQL analytics endpoint will not recognize those tables (see the example at the end of this post).
Semantic Link Labs is an open-source project, and I highly encourage you to contribute to it.
I use Semantic Link Labs extensively every day, and this is going to make my job much easier.
Scan Workspaces:
The API is simple and consistent: labs.admin.scan_workspaces(workspace). You can pass one workspace or a list of workspaces. By default, if you don’t specify the workspace(s), only the notebook’s own workspace is scanned.
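For a quick scan, the call can be as simple as the snippet below (the workspace IDs are placeholders); the batching code that follows is only needed once you go past the 100-workspace limit:

import sempy_labs as labs

# scan only the workspace this notebook lives in (default behavior)
result = labs.admin.scan_workspaces()

# or scan a specific list of workspaces (up to 100 per call)
result = labs.admin.scan_workspaces(workspace=["<workspace-id-1>", "<workspace-id-2>"])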
To overcome the 100-workspace limit, I batch the workspaces below, make concurrent calls, and combine the JSON:
## using Fabric Python notebook
%pip install semantic-link-labs -q

import sempy_labs as labs
import sempy.fabric as fabric
import pandas as pd
import math
import json
import concurrent.futures
from functools import partial

# max number of workspaces per scan call, must be <= 100
N = 100

def scan_batch(batch):
    return labs.admin.scan_workspaces(workspace=batch)

def process_workspaces_in_batches(workspace_list):
    """
    Sandeep Pawar | fabric.guru
    Process workspaces in batches and combine results into a single scanner output.
    """
    batch_size = N
    # Split the workspace list into batches of at most batch_size
    batches = [workspace_list[i:i + batch_size] for i in range(0, len(workspace_list), batch_size)]

    combined_result = {
        'workspaces': [],
        'datasourceInstances': [],
        'misconfiguredDatasourceInstances': []
    }

    # stay within the documented limit of 16 simultaneous calls
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
        batch_results = list(executor.map(scan_batch, batches))

    for batch_result in batch_results:
        if batch_result:
            if 'workspaces' in batch_result:
                combined_result['workspaces'].extend(batch_result['workspaces'])
            if 'datasourceInstances' in batch_result:
                combined_result['datasourceInstances'].extend(batch_result['datasourceInstances'])
            if 'misconfiguredDatasourceInstances' in batch_result:
                combined_result['misconfiguredDatasourceInstances'].extend(
                    batch_result['misconfiguredDatasourceInstances']
                )

    print(f"Total workspaces processed: {len(combined_result['workspaces'])}")
    return combined_result

workspace_list = list(fabric.list_workspaces().query('`Type` != "AdminInsights"').Id.unique())  # exclude the admin monitoring workspace

if len(workspace_list) <= N:
    scanner_json = labs.admin.scan_workspaces(workspace=workspace_list)
else:
    print(f"Number of workspaces is > {N}. Processing in {math.ceil(len(workspace_list)/N)} batches of up to {N} workspaces")
    scanner_json = process_workspaces_in_batches(workspace_list)
Output:
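The combined result is a plain Python dict. Roughly, it looks like this (abridged and illustrative; the exact item-type keys and fields depend on what exists in your tenant and the scan options):

{
    "workspaces": [
        {
            "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
            "Lakehouse": [ { "id": "...", "name": "...", "extendedProperties": { ... } } ],
            "Notebook": [ ... ],
            "datasets": [ ... ],
            "reports": [ ... ]
        }
    ],
    "datasourceInstances": [ ... ],
    "misconfiguredDatasourceInstances": [ ... ]
}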
Extract Items:
We can now parse the above JSON to extract items. I have extracted as many as I could, but it’s not exhaustive. There are plenty of other item types, and each one has its own configuration and nested JSON; I will leave that up to you (a generic starting point is sketched after these functions). Again, this is temporary. Just like all things Labs, I expect this to be supported out of the box in the future.
def extract_lakehouses(json_data):
    """
    Extracts lakehouses from the JSON data and extracts the tdsEndpoint.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing lakehouse details, including the tdsEndpoint.
    """
    lakehouses = []
    for workspace in json_data['workspaces']:
        for lakehouse in workspace.get('Lakehouse', []):
            lakehouse_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': lakehouse['id'],
                'name': lakehouse['name'],
                'description': lakehouse.get('description'),
                'state': lakehouse['state'],
                'lastUpdatedDate': lakehouse['lastUpdatedDate'],
                'createdDate': lakehouse['createdDate'],
                'modifiedBy': lakehouse['modifiedBy'],
                'createdBy': lakehouse['createdBy'],
                'modifiedById': lakehouse['modifiedById'],
                'createdById': lakehouse['createdById'],
                'relations': lakehouse['relations'] if 'relations' in lakehouse else None,
                **lakehouse['extendedProperties']
            }
            # tdsEndpoint from DwProperties
            try:
                dw_properties = json.loads(lakehouse_record['DwProperties'])
                lakehouse_record['tdsEndpoint'] = dw_properties.get('tdsEndpoint', None)
            except (json.JSONDecodeError, KeyError):
                lakehouse_record['tdsEndpoint'] = None
            lakehouses.append(lakehouse_record)
    return pd.DataFrame(lakehouses)

def extract_data_pipelines(json_data):
    """
    Extracts data pipelines from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing data pipeline details.
    """
    data_pipelines = []
    for workspace in json_data['workspaces']:
        for data_pipeline in workspace.get('DataPipeline', []):
            data_pipeline_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': data_pipeline['id'],
                'name': data_pipeline['name'],
                'description': data_pipeline.get('description'),
                'state': data_pipeline['state'],
                'lastUpdatedDate': data_pipeline['lastUpdatedDate'],
                'createdDate': data_pipeline['createdDate'],
                'modifiedBy': data_pipeline['modifiedBy'],
                'createdBy': data_pipeline['createdBy'],
                'modifiedById': data_pipeline['modifiedById'],
                'createdById': data_pipeline['createdById'],
                'relations': data_pipeline['relations'] if 'relations' in data_pipeline else None,
                'extendedProperties': data_pipeline['extendedProperties'],
                'datasourceUsages': data_pipeline['datasourceUsages'] if 'datasourceUsages' in data_pipeline else None
            }
            data_pipelines.append(data_pipeline_record)
    return pd.DataFrame(data_pipelines)

def extract_notebooks(json_data):
    """
    Extracts notebooks from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing notebook details.
    """
    notebooks = []
    for workspace in json_data['workspaces']:
        for notebook in workspace.get('Notebook', []):
            notebook_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': notebook['id'],
                'name': notebook['name'],
                'description': notebook.get('description'),
                'state': notebook['state'],
                'lastUpdatedDate': notebook['lastUpdatedDate'],
                'createdDate': notebook['createdDate'],
                'modifiedBy': notebook['modifiedBy'],
                'createdBy': notebook['createdBy'],
                'modifiedById': notebook['modifiedById'],
                'createdById': notebook['createdById'],
                'relations': notebook['relations'] if 'relations' in notebook else None,
                'extendedProperties': notebook['extendedProperties']
            }
            notebooks.append(notebook_record)
    return pd.DataFrame(notebooks)

def extract_reports(json_data):
    """
    Extracts reports from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing report details.
    """
    reports = []
    for workspace in json_data['workspaces']:
        for report in workspace.get('reports', []):
            report_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'reportType': report['reportType'],
                'id': report['id'],
                'name': report['name'],
                'createdDateTime': report['createdDateTime'],
                'modifiedDateTime': report['modifiedDateTime'],
                'modifiedBy': report['modifiedBy'],
                'createdBy': report['createdBy'],
                'modifiedById': report['modifiedById'],
                'createdById': report['createdById']
            }
            reports.append(report_record)
    return pd.DataFrame(reports)

def extract_dashboards(json_data):
    """
    Extracts dashboards from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing dashboard details.
    """
    dashboards = []
    for workspace in json_data['workspaces']:
        for dashboard in workspace.get('dashboards', []):
            dashboard_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': dashboard['id'],
                'displayName': dashboard['displayName'],
            }
            dashboards.append(dashboard_record)
    return pd.DataFrame(dashboards)

def extract_datasets(json_data):
    """
    Extracts datasets from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing dataset details.
    """
    datasets = []
    for workspace in json_data['workspaces']:
        for dataset in workspace.get('datasets', []):
            dataset_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': dataset['id'],
                'name': dataset['name'],
                'tables': dataset['tables'],
                'configuredBy': dataset['configuredBy'],
                'configuredById': dataset['configuredById'],
                'isEffectiveIdentityRequired': dataset['isEffectiveIdentityRequired'],
                'isEffectiveIdentityRolesRequired': dataset['isEffectiveIdentityRolesRequired'],
                'targetStorageMode': dataset['targetStorageMode'],
                'createdDate': dataset['createdDate'],
                'contentProviderType': dataset['contentProviderType'],
            }
            datasets.append(dataset_record)
    return pd.DataFrame(datasets)

def extract_dataflows(json_data):
    """
    Extracts dataflows from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing dataflow details.
    """
    dataflows = []
    for workspace in json_data['workspaces']:
        for dataflow in workspace.get('dataflows', []):
            dataflow_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'objectId': dataflow['objectId'],
                'name': dataflow['name'],
                'configuredBy': dataflow['configuredBy'],
                'modifiedBy': dataflow['modifiedBy'],
                'modifiedDateTime': dataflow['modifiedDateTime'],
                'generation': dataflow['generation']
            }
            dataflows.append(dataflow_record)
    return pd.DataFrame(dataflows)

def extract_warehouses(json_data):
    """
    Extracts warehouses from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing warehouse details.
    """
    warehouses = []
    # iterate over the workspaces (not the top-level json) so warehouses are actually found
    for workspace in json_data['workspaces']:
        for warehouse in workspace.get('warehouses', []):
            warehouse_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': warehouse['id'],
                'name': warehouse['name'],
                'configuredBy': warehouse['configuredBy'],
                'configuredById': warehouse['configuredById'],
                'modifiedBy': warehouse['modifiedBy'],
                'modifiedById': warehouse['modifiedById'],
                'modifiedDateTime': warehouse['modifiedDateTime']
            }
            warehouses.append(warehouse_record)
    return pd.DataFrame(warehouses)

def extract_sql_analytics_endpoints(json_data):
    """
    Extracts SQL Analytics Endpoints from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing SQL Analytics Endpoint details.
    """
    sql_analytics_endpoints = []
    for workspace in json_data['workspaces']:
        for endpoint in workspace.get('SQLAnalyticsEndpoint', []):
            endpoint_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': endpoint['id'],
                'name': endpoint['name'],
                'configuredBy': endpoint['configuredBy'],
                'configuredById': endpoint['configuredById'],
                'modifiedBy': endpoint['modifiedBy'],
                'modifiedById': endpoint['modifiedById'],
                'modifiedDateTime': endpoint['modifiedDateTime'],
                'relations': endpoint['relations'] if 'relations' in endpoint else None
            }
            sql_analytics_endpoints.append(endpoint_record)
    return pd.DataFrame(sql_analytics_endpoints)

def extract_kql_databases(json_data):
    """
    Extracts KQL Databases from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing KQL Database details.
    """
    kql_databases = []
    for workspace in json_data['workspaces']:
        for database in workspace.get('KQLDatabase', []):
            database_record = {
                'workspaceId': workspace['id'],
                'id': database['id'],
                'name': database['name'],
                'description': database.get('description'),
                'state': database['state'],
                'lastUpdatedDate': database['lastUpdatedDate'],
                'createdDate': database['createdDate'],
                'relations': database.get('relations', []),
                'QueryServiceUri': database['extendedProperties'].get('QueryServiceUri'),
                'IngestionServiceUri': database['extendedProperties'].get('IngestionServiceUri'),
                'Region': database['extendedProperties'].get('Region'),
                'KustoDatabaseType': database['extendedProperties'].get('KustoDatabaseType')
            }
            kql_databases.append(database_record)
    return pd.DataFrame(kql_databases)

def extract_eventhouses(json_data):
    """
    Extracts Eventhouses from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing Eventhouse details.
    """
    eventhouses = []
    for workspace in json_data['workspaces']:
        for eventhouse in workspace.get('Eventhouse', []):
            eventhouse_record = {
                'workspaceId': workspace['id'],
                'id': eventhouse['id'],
                'name': eventhouse['name'],
                'description': eventhouse.get('description'),
                'state': eventhouse['state'],
                'lastUpdatedDate': eventhouse['lastUpdatedDate'],
                'createdDate': eventhouse['createdDate'],
                'relations': eventhouse.get('relations', [])
            }
            eventhouses.append(eventhouse_record)
    return pd.DataFrame(eventhouses)

def extract_kql_querysets(json_data):
    """
    Extracts KQL Querysets from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing KQL Queryset details.
    """
    kql_querysets = []
    for workspace in json_data['workspaces']:
        for queryset in workspace.get('KQLQueryset', []):
            queryset_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': queryset['id'],
                'name': queryset['name'],
                'description': queryset.get('description'),
                'state': queryset['state'],
                'lastUpdatedDate': queryset['lastUpdatedDate'],
                'createdDate': queryset['createdDate'],
                'modifiedBy': queryset['modifiedBy'],
                'createdBy': queryset['createdBy'],
                'modifiedById': queryset['modifiedById'],
                'createdById': queryset['createdById'],
                'relations': queryset['relations'] if 'relations' in queryset else None,
                **queryset['extendedProperties']
            }
            kql_querysets.append(queryset_record)
    return pd.DataFrame(kql_querysets)

def extract_ml_experiments(json_data):
    """
    Extracts ML Experiments from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing ML Experiment details.
    """
    ml_experiments = []
    for workspace in json_data['workspaces']:
        for experiment in workspace.get('MLExperiment', []):
            experiment_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': experiment['id'],
                'name': experiment['name'],
                'description': experiment.get('description'),
                'state': experiment['state'],
                'lastUpdatedDate': experiment['lastUpdatedDate'],
                'createdDate': experiment['createdDate'],
                'modifiedBy': experiment['modifiedBy'],
                'createdBy': experiment['createdBy'],
                'modifiedById': experiment['modifiedById'],
                'createdById': experiment['createdById'],
                'relations': experiment['relations'] if 'relations' in experiment else None,
                **experiment['extendedProperties']
            }
            ml_experiments.append(experiment_record)
    return pd.DataFrame(ml_experiments)

def extract_ml_models(json_data):
    """
    Extracts ML Models from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing ML Model details.
    """
    ml_models = []
    for workspace in json_data['workspaces']:
        for model in workspace.get('MLModel', []):
            model_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': model['id'],
                'name': model['name'],
                'description': model.get('description'),
                'state': model['state'],
                'lastUpdatedDate': model['lastUpdatedDate'],
                'createdDate': model['createdDate'],
                'modifiedBy': model['modifiedBy'],
                'createdBy': model['createdBy'],
                'modifiedById': model['modifiedById'],
                'createdById': model['createdById'],
                'relations': model['relations'] if 'relations' in model else None,
                'endorsementDetails': model.get('endorsementDetails', None),  # handle optional endorsementDetails
                **model['extendedProperties']
            }
            ml_models.append(model_record)
    return pd.DataFrame(ml_models)

def extract_environments(json_data):
    """
    Extracts environments from the JSON data.

    Args:
        json_data: Scanner API json
    Returns:
        A DataFrame containing environment details.
    """
    environments = []
    for workspace in json_data['workspaces']:
        for environment in workspace.get('Environment', []):
            environment_record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': environment['id'],
                'name': environment['name'],
                'description': environment.get('description'),
                'state': environment['state'],
                'lastUpdatedDate': environment['lastUpdatedDate'],
                'createdDate': environment['createdDate'],
                'modifiedBy': environment['modifiedBy'],
                'createdBy': environment['createdBy'],
                'modifiedById': environment['modifiedById'],
                'createdById': environment['createdById'],
                **environment['extendedProperties']
            }
            environments.append(environment_record)
    return pd.DataFrame(environments)

def extract_graphql_apis(json_data):
    """Extracts GraphQL APIs from the JSON data."""
    apis = []
    for workspace in json_data['workspaces']:
        for api in workspace.get('GraphQLApi', []):
            api_record = {
                'workspaceId': workspace['id'],
                'id': api['id'],
                'name': api['name'],
                'description': api.get('description'),
                'state': api['state'],
                'lastUpdatedDate': api['lastUpdatedDate'],
                'createdDate': api['createdDate'],
                'modifiedBy': api['modifiedBy'],
                'createdBy': api['createdBy'],
                'modifiedById': api['modifiedById'],
                'createdById': api['createdById'],
                **api['extendedProperties']  # contains GraphQLEndpoint
            }
            apis.append(api_record)
    return pd.DataFrame(apis)

def extract_eventstreams(json_data):
    """Extracts Eventstreams from the JSON data."""
    streams = []
    for workspace in json_data['workspaces']:
        for stream in workspace.get('Eventstream', []):
            stream_record = {
                'workspaceId': workspace['id'],
                'id': stream['id'],
                'name': stream['name'],
                'description': stream.get('description'),
                'state': stream['state'],
                'lastUpdatedDate': stream['lastUpdatedDate'],
                'createdDate': stream['createdDate'],
                'modifiedBy': stream['modifiedBy'],
                'createdBy': stream['createdBy'],
                'modifiedById': stream['modifiedById'],
                'createdById': stream['createdById'],
                **stream['extendedProperties']
            }
            streams.append(stream_record)
    return pd.DataFrame(streams)

def extract_kql_dashboards(json_data):
    """Extracts KQL Dashboards from the JSON data."""
    dashboards = []
    for workspace in json_data['workspaces']:
        for dashboard in workspace.get('KQLDashboard', []):
            dashboard_record = {
                'workspaceId': workspace['id'],
                'id': dashboard['id'],
                'name': dashboard['name'],
                'description': dashboard.get('description'),
                'state': dashboard['state'],
                'lastUpdatedDate': dashboard['lastUpdatedDate'],
                'createdDate': dashboard['createdDate'],
                'modifiedBy': dashboard['modifiedBy'],
                'createdBy': dashboard['createdBy'],
                'modifiedById': dashboard['modifiedById'],
                'createdById': dashboard['createdById'],
                **dashboard['extendedProperties']
            }
            dashboards.append(dashboard_record)
    return pd.DataFrame(dashboards)

def extract_spark_job_definitions(json_data):
    """Extracts Spark Job Definitions from the JSON data."""
    jobs = []
    for workspace in json_data['workspaces']:
        for job in workspace.get('SparkJobDefinition', []):
            job_record = {
                'workspaceId': workspace['id'],
                'id': job['id'],
                'name': job['name'],
                'description': job.get('description'),
                'state': job['state'],
                'lastUpdatedDate': job['lastUpdatedDate'],
                'createdDate': job['createdDate'],
                'modifiedBy': job['modifiedBy'],
                'createdBy': job['createdBy'],
                'modifiedById': job['modifiedById'],
                'createdById': job['createdById'],
                **job['extendedProperties']  # has OneLakeRootPath
            }
            jobs.append(job_record)
    return pd.DataFrame(jobs)

def extract_sql_databases(json_data):
    """Extracts SQL Databases from the JSON data."""
    databases = []
    for workspace in json_data['workspaces']:
        for db in workspace.get('SQLDatabase', []):
            db_record = {
                'workspaceId': workspace['id'],
                'id': db['id'],
                'name': db['name'],
                'description': db.get('description'),
                'state': db['state'],
                'lastUpdatedDate': db['lastUpdatedDate'],
                'createdDate': db['createdDate'],
                'modifiedBy': db['modifiedBy'],
                'createdBy': db['createdBy'],
                'modifiedById': db['modifiedById'],
                'createdById': db['createdById'],
                'relations': db.get('relations'),
                **db['extendedProperties']  # has SourceServerDnsName, DnsConnectionString, etc.
            }
            databases.append(db_record)
    return pd.DataFrame(databases)

def extract_reflexes(json_data):
    """Extracts Reflexes from the JSON data."""
    reflexes = []
    for workspace in json_data['workspaces']:
        for reflex in workspace.get('Reflex', []):
            reflex_record = {
                'workspaceId': workspace['id'],
                'id': reflex['id'],
                'name': reflex['name'],
                'description': reflex.get('description'),
                'state': reflex['state'],
                'lastUpdatedDate': reflex['lastUpdatedDate'],
                'createdDate': reflex['createdDate'],
                'modifiedBy': reflex['modifiedBy'],
                'createdBy': reflex['createdBy'],
                'modifiedById': reflex['modifiedById'],
                'createdById': reflex['createdById'],
                **reflex['extendedProperties']
            }
            reflexes.append(reflex_record)
    return pd.DataFrame(reflexes)

def extract_copy_jobs(json_data):
    """Extracts Copy Jobs from the JSON data."""
    jobs = []
    for workspace in json_data['workspaces']:
        for job in workspace.get('CopyJob', []):
            job_record = {
                'workspaceId': workspace['id'],
                'id': job['id'],
                'name': job['name'],
                'description': job.get('description'),
                'state': job['state'],
                'lastUpdatedDate': job['lastUpdatedDate'],
                'createdDate': job['createdDate'],
                'modifiedBy': job['modifiedBy'],
                'createdBy': job['createdBy'],
                'modifiedById': job['modifiedById'],
                'createdById': job['createdById'],
                **job['extendedProperties']
            }
            jobs.append(job_record)
    return pd.DataFrame(jobs)

def extract_explorations(json_data):
    """Extracts Explorations from the JSON data."""
    explorations = []
    for workspace in json_data['workspaces']:
        for exploration in workspace.get('Exploration', []):
            exploration_record = {
                'workspaceId': workspace['id'],
                'id': exploration['id'],
                'name': exploration['name'],
                'description': exploration.get('description'),
                'state': exploration['state'],
                'lastUpdatedDate': exploration['lastUpdatedDate'],
                'createdDate': exploration['createdDate'],
                'modifiedBy': exploration['modifiedBy'],
                'createdBy': exploration['createdBy'],
                'modifiedById': exploration['modifiedById'],
                'createdById': exploration['createdById'],
                **exploration['extendedProperties']
            }
            explorations.append(exploration_record)
    return pd.DataFrame(explorations)

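For item types not covered above, a generic extractor following the same pattern is a reasonable starting point. This is a minimal sketch, not part of Semantic Link Labs; it keeps only the common metadata fields, reuses the pandas and sempy.fabric imports from the cells above, and the item-type key you pass (e.g. 'Datamart') is simply whatever key appears in your scanner JSON:

def extract_generic_items(json_data, item_type):
    """
    Generic fallback: flattens any item type from the scanner output into a DataFrame
    using the fields most item types share. Adjust the columns per item type as needed.
    """
    records = []
    for workspace in json_data['workspaces']:
        for item in workspace.get(item_type, []):
            record = {
                'workspaceId': workspace['id'],
                'workspaceName': fabric.resolve_workspace_name(workspace['id']),
                'id': item.get('id'),
                'name': item.get('name'),
                'description': item.get('description'),
                'state': item.get('state'),
                'createdDate': item.get('createdDate'),
                'lastUpdatedDate': item.get('lastUpdatedDate'),
                'createdBy': item.get('createdBy'),
                'modifiedBy': item.get('modifiedBy'),
                'relations': item.get('relations'),
                # keep any extended properties as-is; they differ per item type
                **item.get('extendedProperties', {})
            }
            records.append(record)
    return pd.DataFrame(records)

# example (hypothetical item type key): datamart_df = extract_generic_items(scanner_json, 'Datamart')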
Parallelize the extraction:
def parallel_extract(data):
    extraction_tasks = {
        'lakehouse_df': extract_lakehouses,
        'data_pipeline_df': extract_data_pipelines,
        'notebook_df': extract_notebooks,
        'report_df': extract_reports,
        'dashboard_df': extract_dashboards,
        'dataset_df': extract_datasets,
        'dataflow_df': extract_dataflows,
        'warehouse_df': extract_warehouses,
        'sql_analytics_endpoint_df': extract_sql_analytics_endpoints,
        'kql_database_df': extract_kql_databases,
        'eventhouse_df': extract_eventhouses,
        'kql_queryset_df': extract_kql_querysets,
        'ml_experiment_df': extract_ml_experiments,
        'ml_model_df': extract_ml_models,
        'environment_df': extract_environments
    }
    results = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future_to_name = {
            executor.submit(func, data): name
            for name, func in extraction_tasks.items()
        }
        for future in concurrent.futures.as_completed(future_to_name):
            name = future_to_name[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                print(f'{name} exception: {exc}')
                results[name] = None
    return results

def process_workspace_data(data):
    dfs = parallel_extract(data)
    lakehouse_df = dfs['lakehouse_df']
    data_pipeline_df = dfs['data_pipeline_df']
    notebook_df = dfs['notebook_df']
    report_df = dfs['report_df']
    dashboard_df = dfs['dashboard_df']
    dataset_df = dfs['dataset_df']
    dataflow_df = dfs['dataflow_df']
    warehouse_df = dfs['warehouse_df']
    sql_analytics_endpoint_df = dfs['sql_analytics_endpoint_df']
    kql_database_df = dfs['kql_database_df']
    eventhouse_df = dfs['eventhouse_df']
    kql_queryset_df = dfs['kql_queryset_df']
    ml_experiment_df = dfs['ml_experiment_df']
    ml_model_df = dfs['ml_model_df']
    environment_df = dfs['environment_df']
    return {
        'lakehouse_df': lakehouse_df,
        'data_pipeline_df': data_pipeline_df,
        'notebook_df': notebook_df,
        'report_df': report_df,
        'dashboard_df': dashboard_df,
        'dataset_df': dataset_df,
        'dataflow_df': dataflow_df,
        'warehouse_df': warehouse_df,
        'sql_analytics_endpoint_df': sql_analytics_endpoint_df,
        'kql_database_df': kql_database_df,
        'eventhouse_df': eventhouse_df,
        'kql_queryset_df': kql_queryset_df,
        'ml_experiment_df': ml_experiment_df,
        'ml_model_df': ml_model_df,
        'environment_df': environment_df
    }

all_dfs = process_workspace_data(scanner_json)
lakehouse_df = all_dfs['lakehouse_df']
data_pipeline_df = all_dfs['data_pipeline_df']
notebook_df = all_dfs['notebook_df']
report_df = all_dfs['report_df']
dashboard_df = all_dfs['dashboard_df']
dataset_df = all_dfs['dataset_df']
dataflow_df = all_dfs['dataflow_df']
warehouse_df = all_dfs['warehouse_df']
sql_analytics_endpoint_df = all_dfs['sql_analytics_endpoint_df']
kql_database_df = all_dfs['kql_database_df']
eventhouse_df = all_dfs['eventhouse_df']
kql_queryset_df = all_dfs['kql_queryset_df']
ml_experiment_df = all_dfs['ml_experiment_df']
ml_model_df = all_dfs['ml_model_df']
environment_df = all_dfs['environment_df']
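To persist these dataframes to a Lakehouse, remember the note above about nested columns. Here is a minimal sketch of one option: convert list/dict columns to JSON strings and write with Spark. It assumes a Spark-enabled notebook with a default lakehouse attached, and the table name is just an example; Polars, DuckDB, or the deltalake package would work similarly in a Python notebook.

def stringify_nested_columns(df):
    """Convert list/dict columns to JSON strings so the SQL analytics endpoint can read the table."""
    out = df.copy()
    for col in out.columns:
        if out[col].apply(lambda v: isinstance(v, (list, dict))).any():
            out[col] = out[col].apply(
                lambda v: json.dumps(v) if isinstance(v, (list, dict)) else v
            )
    return out

# example: save the lakehouse inventory as a Delta table in the attached lakehouse
spark.createDataFrame(stringify_nested_columns(lakehouse_df)).write.mode("overwrite").saveAsTable(
    "scanner_lakehouses"  # example table name
)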