Access Blob & ADLS2 data in Azure Synapse Serverless pool

I have worked with Azure Synapse Analytics quite extensively. Synapse Analytics should be your go-to platform if you want a combination of big data and warehousing capabilities under a single roof. There are two major compute components in Synapse Analytics: the Serverless SQL pool and the Dedicated SQL pool.

In the Serverless SQL pool, as the name implies, there is no need to provision or manage any infrastructure. It is primarily used for analysis of semi-structured and unstructured data stored on Blob and ADLS2 storage in various file formats.

On the other hand, a Dedicated SQL pool is a fully provisioned environment that provides large-scale data warehousing capabilities and runs on a Massively Parallel Processing (MPP) architecture.

The differences between the two are quite extensive, and are detailed here and here.

In this article I will highlight the different data access options available through Managed Identity and Service Principal on the serverless pool. There are some significant differences and caveats between the two.

If you are not aware of service principals and managed identities in Azure, you can refer to my article on service principals here and my article on managed identities here.

Setup

To start, let's create three CSV files that store customer information:

Customer_1.csv

CustomerID,Name,Email,City
1,Aarav Joshi,aarav.joshi@example.com,Pune
2,Isha Kulkarni,isha.kulkarni@example.com,Mumbai
3,Vikram Singh,vikram.singh@example.com,Bangalore
4,Neha Patil,neha.patil@example.com,Nashik
5,Rohan Deshmukh,rohan.deshmukh@example.com,Sangli

Customer_2.csv

CustomerID,Name,Email,City
6,Priya Chavan,priya.chavan@example.com,Pune
7,Aditya Jadhav,aditya.jadhav@example.com,Solapur
8,Sneha Nair,sneha.nair@example.com,Ahmedabad
9,Arjun Shinde,arjun.shinde@example.com,Aurangabad
10,Kavya Chawla,kavya.chawla@example.com,Lucknow

Customer_3.csv

CustomerID,Name,Email,City
11,Manoj Khanna,manoj.khanna@example.com,Surat
12,Ananya Bansal,ananya.bansal@example.com,Kanpur
13,Karan Malhotra,karan.malhotra@example.com,Visakhapatnam
14,Tanvi Pawar,tanvi.pawar@example.com,Nagpur
15,Rajesh Gaikwad,rajesh.gaikwad@example.com,Thane

We will upload these files to the blob storage under a directory named Customers, and likewise to the ADLS2 storage.

Once done, we will test the different read options through Managed Identity and Service Principal.

We then create a service principal on Microsoft Entra. If you are not aware of the steps to create a Service Principal, you can refer to one of my articles here.

There are a few pre-existing service principals in my Entra tenant. I will use the Fabric OAUTH2 service principal.


I have an existing user-assigned Managed Identity in my Azure subscription. I will use that one.


Next, create a database in the serverless pool. I created one named SQLDB.

Then create a master key in the database where the credentials will live; in our case, SQLDB:

USE SQLDB
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Any Strong Password';
GO
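
To confirm the key was created, you can query the sys.symmetric_keys catalog view; the database master key shows up under the name ##MS_DatabaseMasterKey##:

-- Verify the database master key exists
SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';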

Let's access and query the data through the different options.

Access data on Blob Storage with Managed Identity

Grant the Storage Blob Data Contributor role to the Managed Identity on the blob storage account. In our case the identity is Fabric_Identity.


Next, create a database scoped credential. You can read more about database scoped credentials here.

CREATE DATABASE SCOPED CREDENTIAL MyCredential
WITH IDENTITY = 'MANAGED IDENTITY'; -- MANAGED IDENTITY IS THE KEYWORD

Then create an external data source. You can read about external data sources here.

CREATE EXTERNAL DATA SOURCE SampleSource
WITH (
    -- LOCATION should point to the container on the blob storage
    LOCATION = 'https://storageaccount.blob.core.windows.net/container',
    CREDENTIAL = MyCredential
);
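
You can verify that both objects exist through the corresponding catalog views:

-- Verify the credential and the data source
SELECT name, credential_identity FROM sys.database_scoped_credentials;
SELECT name, location FROM sys.external_data_sources;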

We will use OPENROWSET to query the data source. OPENROWSET reads the remote files and returns their contents as a rowset that can be queried like a table.

Run the queries using PARSER_VERSION = '1.0' and PARSER_VERSION = '2.0', and notice the syntax difference between the two. With PARSER_VERSION = '2.0' you don't have to specify the WITH clause to predefine the structure of the underlying source; HEADER_ROW = TRUE infers the column names from the header row.

-- Using PARSER_VERSION = '1.0'
SELECT  
       customerid,
       name,
       email,
       city
FROM   OPENROWSET(BULK 'Customers', -- This is the directory where source files exist
                  DATA_SOURCE = 'SampleSource',
                  FORMAT='csv',
                  PARSER_VERSION = '1.0', 
                  FIRSTROW = 2) 
WITH (
    [Customerid] [int] ,
    [Name] [varchar](40) ,
    [Email] [varchar](40) ,
    [city] [varchar](40) 
) AS [tbl]
GO

-- Using PARSER_VERSION = '2.0'
SELECT customerid,
       name,
       email,
       city
FROM   OPENROWSET(BULK 'Customers', -- This is the directory where source files exist
                  DATA_SOURCE = 'SampleSource',
                  FORMAT = 'csv',
                  PARSER_VERSION = '2.0',
                  FIELDTERMINATOR = ',',
                  HEADER_ROW = TRUE) AS tbl
GO

Running either of the two queries should return the underlying data.

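Since Customers is a directory, the result combines rows from all three files. Serverless SQL also exposes the filename() and filepath() functions on the OPENROWSET alias, which are handy for checking which file each row came from. A quick sketch:

-- Show which source file each row came from
SELECT tbl.filename() AS source_file,
       customerid,
       name
FROM   OPENROWSET(BULK 'Customers/*.csv',
                  DATA_SOURCE = 'SampleSource',
                  FORMAT = 'csv',
                  PARSER_VERSION = '2.0',
                  HEADER_ROW = TRUE) AS tbl
GO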

Access data on ADLS2 storage through Managed Identity

Next we will try to access data on ADLS2 storage through Managed Identity.

Drop the existing data source and scoped credential. The data source has to be dropped first, since it references the credential.

DROP EXTERNAL DATA SOURCE SampleSource
DROP DATABASE SCOPED CREDENTIAL MyCredential

Grant the Storage Blob Data Contributor role to the Managed Identity on the ADLS2 storage account.


Next, create the scoped credential and the data source. This time we will use the abfss protocol.

CREATE DATABASE SCOPED CREDENTIAL MyCredential
WITH IDENTITY = 'MANAGED IDENTITY'; -- MANAGED IDENTITY IS THE KEYWORD
GO

CREATE EXTERNAL DATA SOURCE SampleSource
WITH (
    -- LOCATION should point to the ADLS2 storage container
    LOCATION = 'abfss://container@storageaccount.blob.core.windows.net/',
    CREDENTIAL = MyCredential
);
GO

Run the same PARSER_VERSION = '1.0' and PARSER_VERSION = '2.0' queries as before. Again, with PARSER_VERSION = '2.0' there is no need for the WITH clause.

-- Using PARSER_VERSION = '1.0'
SELECT  
       customerid,
       name,
       email,
       city
FROM   OPENROWSET(BULK 'Customers', -- This is the directory where source files exist
                  DATA_SOURCE = 'SampleSource',
                  FORMAT='csv',
                  PARSER_VERSION = '1.0', 
                  FIRSTROW = 2) 
WITH (
    [Customerid] [int] ,
    [Name] [varchar](40) ,
    [Email] [varchar](40) ,
    [city] [varchar](40) 
) AS [tbl]
GO

-- Using PARSER_VERSION = '2.0'
SELECT customerid,
       name,
       email,
       city
FROM   OPENROWSET(BULK 'Customers', -- This is the directory where source files exist
                  DATA_SOURCE = 'SampleSource',
                  FORMAT = 'csv',
                  PARSER_VERSION = '2.0',
                  FIELDTERMINATOR = ',',
                  HEADER_ROW = TRUE) AS tbl
GO

The queries should return the data as before.

Access data on Blob storage through Service Principal

Drop the existing data source and scoped credentials.

DROP EXTERNAL DATA SOURCE SampleSource
DROP DATABASE SCOPED CREDENTIAL MyCredential

As mentioned earlier, we will use the existing service principal called Fabric OAUTH2.


-- ClientId is the service principal's application (client) ID,
-- tenantId is your Entra tenant ID, and secret is the client secret
CREATE DATABASE SCOPED CREDENTIAL MyCredential
WITH IDENTITY = 'ClientId@https://login.microsoftonline.com/tenantId/oauth2/v2.0/token',
SECRET = 'secret';

Next, create a data source that points to the blob storage:

CREATE EXTERNAL DATA SOURCE SampleSource
WITH (
    LOCATION = 'https://storageaccount.blob.core.windows.net/container',
    CREDENTIAL = MyCredential
);

Execute the queries:

-- Using PARSER_VERSION = '1.0'
SELECT  
       customerid,
       name,
       email,
       city
FROM   OPENROWSET(BULK 'Customers', -- This is the directory where source files exist
                  DATA_SOURCE = 'SampleSource',
                  FORMAT='csv',
                  PARSER_VERSION = '1.0', 
                  FIRSTROW = 2) 
WITH (
    [Customerid] [int] ,
    [Name] [varchar](40) ,
    [Email] [varchar](40) ,
    [city] [varchar](40) 
) AS [tbl]
GO

-- Using PARSER_VERSION = '2.0'
SELECT customerid,
       name,
       email,
       city
FROM   OPENROWSET(BULK 'Customers', -- This is the directory where source files exist
                  DATA_SOURCE = 'SampleSource',
                  FORMAT = 'csv',
                  PARSER_VERSION = '2.0',
                  FIELDTERMINATOR = ',',
                  HEADER_ROW = TRUE) AS tbl
GO

The queries error out.


This is because the Service Principal has no access to the storage account. Let’s grant the Storage Blob Data Contributor role to the Service Principal Fabric OAUTH2.


After granting the access, give it a couple of minutes for the changes to take effect.

Re-executing the query should now return the data.


Access data on ADLS2 storage through Service Principal

Drop the data source and credential and re-create them just as we did in the previous step.

DROP EXTERNAL DATA SOURCE SampleSource
DROP DATABASE SCOPED CREDENTIAL MyCredential

Note that below I have used the blob endpoint. You could also use the dfs endpoint if you wish; a sketch follows the block below.

CREATE DATABASE SCOPED CREDENTIAL MyCredential
WITH IDENTITY = 'ClientId@https://login.microsoftonline.com/tenantId/oauth2/v2.0/token',
SECRET = 'secret';
GO

CREATE EXTERNAL DATA SOURCE SampleSource
WITH (
    -- LOCATION should point to the ADLS2 storage container
    LOCATION = 'abfss://container@storageaccount.blob.core.windows.net/',
    CREDENTIAL = MyCredential
);
GO
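
For reference, a dfs-endpoint version of the same data source would look something like this (same credential; SampleSourceDfs is just a placeholder name):

-- Equivalent data source using the dfs endpoint
CREATE EXTERNAL DATA SOURCE SampleSourceDfs
WITH (
    LOCATION = 'abfss://container@storageaccount.dfs.core.windows.net/',
    CREDENTIAL = MyCredential
);
GO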

The query would fail due to lack of underlying access to the source.


Grant the Storage Blob Data Contributor role to the Service Principal Fabric OAUTH2 on the ADLS2 storage.

Give it a couple of minutes for the changes to take effect, then re-execute the queries; they should now return the data.


External Tables

In case you wish to store the output of a query in a table, you can't just create a regular table and do an INSERT INTO from a SELECT query; serverless pools have no local storage for user tables.

The option available is to create an External Table and predefine an External File Format.

You can read about External Tables here and External File Formats here.

Create an External File Format that describes the structure of the source files. FIRST_ROW = 2 skips the header row when reading:

CREATE EXTERNAL FILE FORMAT [CsvFormatWithHeader] WITH (
    FORMAT_TYPE = DELIMITEDTEXT, 
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',', 
        FIRST_ROW  = 2,
        STRING_DELIMITER = '"',
        USE_TYPE_DEFAULT = False
        )
)
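
Like the other external objects, the file format is visible through a catalog view:

-- Verify the file format
SELECT name, format_type FROM sys.external_file_formats;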

Then create the External Table and query it:

CREATE EXTERNAL TABLE dbo.Customers
(
    [Customerid] [int],
    [Name] [varchar](40),
    [Email] [varchar](40),
    [city] [varchar](40)
)
WITH (
    LOCATION = '/Customers',
    DATA_SOURCE = SampleSource,
    FILE_FORMAT = [CsvFormatWithHeader]
);
GO

SELECT * FROM [Customers]
GO

The created External Table is available under the External Tables folder in SSMS.


To drop the External Table, just use the DROP statement:

DROP EXTERNAL TABLE [dbo].[Customers]
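
If you want to persist the output of a query back to storage, serverless pools also support CETAS (CREATE EXTERNAL TABLE AS SELECT). A minimal sketch, reusing the SampleSource data source from above; ParquetFormat and the /CustomersPune output path are placeholder names:

-- A Parquet file format for the output
CREATE EXTERNAL FILE FORMAT [ParquetFormat]
WITH (FORMAT_TYPE = PARQUET);
GO

-- Materialize the query output to storage and register it as an external table
CREATE EXTERNAL TABLE dbo.CustomersPune
WITH (
    LOCATION = '/CustomersPune/',
    DATA_SOURCE = SampleSource,
    FILE_FORMAT = [ParquetFormat]
)
AS
SELECT * FROM dbo.Customers
WHERE city = 'Pune';
GO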

Conclusion

In this article we delved into the different options Azure Synapse serverless SQL pools provide to query data directly from Blob and Azure Data Lake Storage without requiring data movement. Using External Tables and OPENROWSET, one can seamlessly query semi-structured and unstructured data in formats like Parquet, CSV, and JSON. This approach reduces storage duplication and operational overhead.

Proper authentication with Managed Identities, Service Principals, and database scoped credentials reduces the overall risk and overhead of maintaining user credentials separately.

In the next article we will see how to leverage the different options available in Synapse Analytics Dedicated pools to query external data sources on Blob and ADLS2 storage.

Thanks for reading!!!
