Syncing Existing GitHub Repositories to Fabric with Semantic Link Labs / Sempy

Managing and syncing Microsoft Fabric workspaces with GitHub repositories is an essential practice for ensuring proper version control, seamless collaboration, and efficient data management. Previously, Sandeep Pawar blogged about how to sync Fabric Workspace artifacts to GitHub programmatically.

All good. But what if you need to do the reverse: sync existing GitHub repositories back into Fabric? In this post, a sequel to Sandeep’s, I explain how I achieved that.

💡
This approach is particularly beneficial when an organization undergoes changes, such as onboarding new users or setting up a new capacity. Instead of manually identifying and syncing each repository one by one, this automated solution ensures that all relevant Fabric artifacts from GitHub are seamlessly restored.

Thanks to the power of Michael Kovalsky’s Semantic Link Labs and Sempy, we can automate this process and make it efficient. In this blog, we’ll walk through how to accomplish this in a structured and automated way.


What I have done previously

In my case, I had previously synced my existing Fabric workspaces to GitHub, naming each repository with a ‘Bkp_’ prefix (see Sandeep’s earlier article).

Synced Workspaces to GitHub

Now I can filter for these repositories and sync them into Fabric. For each one, I programmatically create a workspace named after the repository, minus the ‘Bkp_’ prefix. If a workspace with that name already exists, I skip creation and proceed straight to Git initialization.
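The naming rule above can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: the function name workspace_name_for is hypothetical, and it assumes the prefix should only be removed from the start of the repository name.

```python
from typing import Optional


def workspace_name_for(repo_name: str, prefix: str = "Bkp_") -> Optional[str]:
    """Map a backup repo name to its workspace name.

    Returns None when the repo does not carry the backup prefix,
    or when nothing is left after removing it.
    (Hypothetical helper, shown for illustration only.)
    """
    if not repo_name.startswith(prefix):
        return None
    # Slice off only the leading prefix; later occurrences are kept.
    name = repo_name[len(prefix):]
    return name or None
```

Slicing, rather than a global string replace, keeps a name such as Bkp_Sales_Bkp_2024 intact after the leading prefix is removed.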


Prerequisites

Before we dive into the process, ensure you have:

  1. GitHub Personal Access Token (PAT): Required for authentication with the GitHub API. Create a classic token with the scopes needed to read your repositories (typically the repo scope if the repositories are private).

  2. Microsoft Fabric Connection ID: This is generated when setting up a GitHub connection in Fabric.

  3. Administrative Rights: You must have admin access to Fabric workspaces to sync repositories.

  4. Python packages: PyGithub and Semantic Link Labs, installed in the notebook session:

     pip install PyGithub semantic-link-labs

💡
Before running the notebook code, I had no workspaces apart from the one I created to hold the notebook itself.

To effectively sync Fabric workspaces with GitHub, we rely on several key functions from Semantic Link Labs and Sempy.

Semantic Link Labs Functions

  1. labs.admin.list_git_connections()

    • Lists the Git connections that already link Fabric workspaces to repositories. This is an admin API, so it needs Fabric administrator rights.
  2. labs.connect_workspace_to_github(owner_name, repository_name, branch_name, directory_name, connection_id, workspace)

    • Establishes a Git connection between a Fabric workspace and a GitHub repository.
  3. labs.initialize_git_connection(workspace)

    • Initializes the GitHub connection for a Fabric workspace and returns the latest commit hash.
  4. labs.update_from_git(workspace, remote_commit_hash, conflict_resolution_policy='PreferRemote')

    • Synchronizes the Fabric workspace with the latest version from GitHub.

Sempy Functions

  1. fabric.list_workspaces()

    • Retrieves a list of existing workspaces in Microsoft Fabric.
  2. fabric.create_workspace(display_name, description, capacity_id)

    • Creates a new Fabric workspace with the specified name, description, and capacity.

These functions form the backbone of our automation, enabling us to efficiently manage workspaces and maintain synchronization between GitHub and Fabric.
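How these calls fit together depends on the state of each repository and workspace. The decision logic can be sketched on its own, without calling Fabric or GitHub. The Action enum and plan_action function below are hypothetical names introduced for illustration; the rules mirror the ones described in this post (prefix check, existing workspace check, existing Git connection check).

```python
from enum import Enum


class Action(Enum):
    SKIP_NOT_BACKUP = "skip: repo lacks the backup prefix"
    SKIP_ALREADY_CONNECTED = "skip: workspace already has a Git connection"
    CONNECT_EXISTING = "connect existing workspace, then initialize and pull"
    CREATE_AND_CONNECT = "create workspace, connect, then initialize and pull"


def plan_action(repo_name, workspace_ids_by_name, connected_workspace_ids,
                prefix="Bkp_"):
    """Decide what to do for one repo (hypothetical dry-run helper).

    workspace_ids_by_name: dict of workspace name -> workspace id
    connected_workspace_ids: set of workspace ids that already have Git
    """
    if not repo_name.startswith(prefix):
        return Action.SKIP_NOT_BACKUP
    workspace_name = repo_name[len(prefix):]
    workspace_id = workspace_ids_by_name.get(workspace_name)
    if workspace_id is None:
        return Action.CREATE_AND_CONNECT
    if workspace_id in connected_workspace_ids:
        return Action.SKIP_ALREADY_CONNECTED
    return Action.CONNECT_EXISTING
```

Running a planner like this first gives a dry run: you can print the planned action per repository and review it before anything is created in Fabric.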


Step-by-Step Guide

1. Initialize GitHub and Fabric Connections

First, we set up our connections to GitHub and Microsoft Fabric:

from github import Github
import sempy.fabric as fabric
import sempy_labs as labs
import time

# GitHub Personal Access Token and Connection ID
g = Github("your_github_pat")
conn_id = "your_connection_id"
capacity_id = "your_capacity_id"
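Hardcoding the PAT in a notebook makes it easy to leak. One possible alternative, sketched below, reads the values from environment variables. The helper load_setting and the variable names GITHUB_PAT, FABRIC_CONNECTION_ID and FABRIC_CAPACITY_ID are assumptions made for this example, not names used by GitHub or Fabric.

```python
import os


def load_setting(name: str) -> str:
    """Read a required setting from the environment (hypothetical helper)."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Possible usage in the notebook (variable names are assumptions):
#   g = Github(load_setting("GITHUB_PAT"))
#   conn_id = load_setting("FABRIC_CONNECTION_ID")
#   capacity_id = load_setting("FABRIC_CAPACITY_ID")
```

Failing early with a clear message is easier to debug than an authentication error halfway through the loop.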

2. Fetch Existing GitHub Repositories

To retrieve all repositories associated with your GitHub account:

user = g.get_user()
repos = list(user.get_repos())  # Convert iterator to list
print(f"ℹ️ Total repositories found: {len(repos)}")

3. List Existing Workspaces and Git Connections

Before creating new Fabric workspaces, check which ones already exist:

existing_workspaces_df = fabric.list_workspaces()
git_connections_df = labs.admin.list_git_connections()
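If the repository list is long, filtering the DataFrames on every iteration can be replaced with lookups built once up front. The sketch below is an optional variation, not the approach used in this post. It assumes the column names that the loop in step 4 relies on ('Name', 'Id' and 'Workspace Id'), and build_lookups is a hypothetical helper that works on plain row dictionaries.

```python
def build_lookups(workspace_rows, connection_rows):
    """Build a name -> id mapping and a set of already-connected workspace ids.

    Both arguments are iterables of dicts, one per row.
    (Hypothetical helper; column names are taken from the loop in step 4.)
    """
    ids_by_name = {row["Name"]: row["Id"] for row in workspace_rows}
    connected_ids = {row["Workspace Id"] for row in connection_rows}
    return ids_by_name, connected_ids


# In the notebook, the rows could come from the DataFrames in this step:
#   ids_by_name, connected_ids = build_lookups(
#       existing_workspaces_df.to_dict("records"),
#       git_connections_df.to_dict("records"),
#   )
```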

4. Automate Workspace Creation and Connect Workspaces to Git

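For each repository whose name starts with “Bkp_”, we check whether a corresponding workspace exists. If it does not, we create it. We then connect the workspace to its GitHub repository, unless it already has a Git connection.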

# Initialize counters
total_repos_processed = 0
workspaces_created = 0
workspaces_skipped = 0
workspaces_already_exist = 0
git_connections_created = 0
git_connections_skipped = 0
workspaces_synced_from_github = 0

for repo in repos:
    total_repos_processed += 1
    try:
        if not repo.name.startswith("Bkp_"):
            print(f"⚠️ Skipping '{repo.name}', does not match 'Bkp_' prefix.")
            workspaces_skipped += 1
            continue

        # Strip only the leading prefix; replace() would also remove later occurrences
        workspace_name = repo.name[len("Bkp_"):]
        print(f"Processing repo: {repo.name} -> Workspace name: {workspace_name}")

        existing_workspace = existing_workspaces_df[existing_workspaces_df['Name'] == workspace_name]

        if not existing_workspace.empty:
            workspace_id = existing_workspace.iloc[0]['Id']
            print(f"ℹ️ Workspace '{workspace_name}' already exists with ID: {workspace_id}.")
            workspaces_already_exist += 1

            if workspace_id in git_connections_df['Workspace Id'].values:
                print(f"⚠️ Workspace '{workspace_name}' already has a Git connection. Skipping.")
                git_connections_skipped += 1
                continue
        else:
            workspace_id = fabric.create_workspace(
                display_name=workspace_name,
                description=f"Workspace for {repo.name} synced from GitHub",
                capacity_id=capacity_id
            )
            print(f"🟢 Workspace '{workspace_name}' created with ID: {workspace_id}.")
            workspaces_created += 1

        labs.connect_workspace_to_github(
            owner_name=user.login,
            repository_name=repo.name,
            branch_name="main",
            directory_name="/",
            connection_id=conn_id,
            workspace=workspace_id
        )
        print(f"🟢 Connected workspace '{workspace_name}' to GitHub repository '{repo.name}'.")
        git_connections_created += 1
        # The loop body continues in step 5, where this try block is closed.

5. Initialize Git Connection and Git Pull

The following code continues inside the same for loop and try block started in step 4:

        # Wait for Git connection to initialize
        time.sleep(10)  # Adjust as necessary
        try:
            commit_hash = labs.initialize_git_connection(workspace=workspace_id)
            print(f"🟢 Git connection initialized for '{workspace_name}'. Commit: {commit_hash}")
        except Exception as e:
            print(f"⚠️ Error initializing Git for '{workspace_name}': {e}")
            continue  # Move to the next repo

        # Automatically update workspace from Git
        try:
            update_status = labs.update_from_git(workspace=workspace_id, remote_commit_hash=commit_hash, conflict_resolution_policy='PreferRemote')
            print(f"🟢 Updated workspace '{workspace_name}' from Git. Status: {update_status}")
            workspaces_synced_from_github += 1
        except Exception as e:
            print(f"⚠️ Error updating workspace '{workspace_name}' from Git: {e}")

        # All steps for this repo are complete
        print(f"🟢 Workspace '{workspace_name}' setup completed. Moving to next repo.")

    except Exception as e:
        print(f"⚠️ Unexpected error processing '{repo.name}': {e}")
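The fixed time.sleep(10) above is a guess at how long the connection needs to settle. A possible alternative is to retry the call with a growing delay. The helper below is a generic sketch: call_with_retry is a hypothetical name, and it assumes the wrapped call is safe to repeat after a failure, which I have not verified for every Fabric Git operation.

```python
import time


def call_with_retry(func, attempts=5, initial_delay=2.0, backoff=2.0,
                    sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Waits initial_delay seconds after the first failure and multiplies
    the wait by backoff after each further failure. The last error is
    re-raised if every attempt fails. (Hypothetical helper.)
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff


# Possible usage inside the loop:
#   commit_hash = call_with_retry(
#       lambda: labs.initialize_git_connection(workspace=workspace_id)
#   )
```

The sleep parameter exists so the wait can be replaced in tests; in the notebook the default time.sleep is used.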

Output of the code


Workspaces are now automatically created and content is synced in Fabric.


The following workspace pulled its content from the GitHub repo and is now connected to GitHub automatically.

Conclusion

With these steps, you can bring previously synced GitHub artifacts back into Microsoft Fabric with minimal effort and no manual configuration. For Azure DevOps repos, the process would be largely the same.

Full code is available in this GitHub Gist.

Happy syncing - Thanks for Reading!

Written by

Nalaka Wanniarachchi

Nalaka Wanniarachchi is an accomplished data analytics and data engineering professional with over 18 years of experience. As a CIMA(ACMA/CGMA) UK qualified ex-banker with strong analytical skills, he transitioned into building robust data solutions. Nalaka specializes in Microsoft Fabric and Power BI, delivering advanced analytics and engineering solutions. He holds a Microsoft certification as a Fabric Analytic Engineer and Power BI Professional, combining technical expertise with a deep understanding of financial and business analytics.