Access Blob & ADLS2 data in Azure Synapse Serverless pool


I have worked with Azure Synapse Analytics quite extensively. Synapse Analytics should be your go-to platform if you want a combination of big data and warehousing capabilities under a single roof. There are two major compute components in Synapse Analytics: Serverless and Dedicated SQL pools.
In a Serverless SQL pool, as the name implies, there is no need to provision or manage any infrastructure. It is primarily used for data analysis across semi-structured and unstructured data stored on Blob and ADLS2 storage in various file formats.
On the other hand, a Dedicated SQL pool is a fully provisioned environment that provides large-scale data warehousing capabilities on a Massively Parallel Processing (MPP) architecture.
The differences between the two are quite extensive, and are detailed here and here.
In this article I will highlight the different data access options available through Managed Identity and Service Principal on the serverless pool. There are some significant caveats between the two.
If you are not aware of service principals and managed identities in Azure, you can refer to my article on service principals here and my article on managed identities here.
Setup
To start, let's create three CSV files that store customer information:
Customer_1.csv
CustomerID,Name,Email,City
1,Aarav Joshi,aarav.joshi@example.com,Pune
2,Isha Kulkarni,isha.kulkarni@example.com,Mumbai
3,Vikram Singh,vikram.singh@example.com,Bangalore
4,Neha Patil,neha.patil@example.com,Nashik
5,Rohan Deshmukh,rohan.deshmukh@example.com,Sangli
Customer_2.csv
CustomerID,Name,Email,City
6,Priya Chavan,priya.chavan@example.com,Pune
7,Aditya Jadhav,aditya.jadhav@example.com,Solapur
8,Sneha Nair,sneha.nair@example.com,Ahmedabad
9,Arjun Shinde,arjun.shinde@example.com,Aurangabad
10,Kavya Chawla,kavya.chawla@example.com,Lucknow
Customer_3.csv
CustomerID,Name,Email,City
11,Manoj Khanna,manoj.khanna@example.com,Surat
12,Ananya Bansal,ananya.bansal@example.com,Kanpur
13,Karan Malhotra,karan.malhotra@example.com,Visakhapatnam
14,Tanvi Pawar,tanvi.pawar@example.com,Nagpur
15,Rajesh Gaikwad,rajesh.gaikwad@example.com,Thane
We will upload these files to the blob storage under a directory named Customers, and to the ADLS2 storage as well. Once done, we will test the different read options through Managed Identity and Service Principal.
We then create a service principal on Microsoft Entra. If you are not aware of the steps to create a service principal, you can refer to one of my articles here.
There are a few pre-existing service principals on my Entra tenant. I will use the Fabric OAUTH2 service principal.
I had an existing user-assigned Managed Identity under my Azure resources. I will use that one.
Next, create a database in the Serverless pool. I created one under the name SQLDB.
Then create a master key (encryption key). A database scoped credential with a secret requires a master key in the same database, so create it in SQLDB:
USE SQLDB
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Any Strong Password'
GO
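If a master key already exists in the database, the CREATE statement above will fail. As an optional sanity check (a minimal sketch; ##MS_DatabaseMasterKey## is the default name SQL Server assigns to a database master key), you can query sys.symmetric_keys first:
-- Check whether a database master key already exists
SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';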
Let's access and query the data through the different options.
Access data on Blob Storage with Managed Identity
Grant the Storage Blob Data Contributor role to the Managed Identity on the blob storage. In our case it is Fabric_Identity.
Next, create database scoped credentials. You can read more about database scoped credentials here.
CREATE DATABASE SCOPED CREDENTIAL MyCredential
WITH IDENTITY = 'MANAGED IDENTITY'; -- MANAGED IDENTITY IS THE KEYWORD
Then create an external data source. You can read about external data sources here.
CREATE EXTERNAL DATA SOURCE SampleSource
WITH (
    -- location should point to the container on the blob storage
    location = 'https://storageaccount.blob.core.windows.net/container',
    CREDENTIAL = MyCredential
);
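As an optional sanity check, the catalog views can confirm both objects exist in the current database:
-- List the scoped credentials and external data sources
SELECT name, credential_identity FROM sys.database_scoped_credentials;
SELECT name, location FROM sys.external_data_sources;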
We will use OPENROWSET to query the data source. OPENROWSET returns the data from the source in tabular format.
Run the queries using PARSER_VERSION = '1.0' and PARSER_VERSION = '2.0'. Notice the syntax difference between the two: with PARSER_VERSION = '2.0' you don't have to specify the WITH clause to predefine the structure of the underlying source, because HEADER_ROW = TRUE lets the column names be inferred from the header row.
-- Using PARSER_VERSION = '1.0'
SELECT
    customerid,
    name,
    email,
    city
FROM OPENROWSET(BULK 'Customers', -- This is the directory where the source files exist
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '1.0',
                FIRSTROW = 2)
WITH (
    [Customerid] [int],
    [Name] [varchar](40),
    [Email] [varchar](40),
    [city] [varchar](40)
) AS [tbl]
GO
-- Using PARSER_VERSION = '2.0'
SELECT
    customerid,
    name,
    email,
    city
FROM OPENROWSET(BULK 'Customers', -- This is the directory where the source files exist
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '2.0',
                FIELDTERMINATOR = ',',
                HEADER_ROW = TRUE) AS tbl
GO
Running either of the two queries should return the underlying data.
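As a side note, the BULK path is not limited to a bare directory. A couple of handy variations, sketched against the same SampleSource (Customer_1.csv is one of the files we uploaded earlier):
-- Wildcard: read only the csv files under the Customers directory
SELECT COUNT(*) AS rows_read
FROM OPENROWSET(BULK 'Customers/*.csv',
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '2.0',
                HEADER_ROW = TRUE) AS tbl
GO
-- Single file: read one specific csv file
SELECT *
FROM OPENROWSET(BULK 'Customers/Customer_1.csv',
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '2.0',
                HEADER_ROW = TRUE) AS tbl
GO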
Access data on ADLS2 storage through Managed Identity
Next, we will try to access the data on ADLS2 storage through Managed Identity.
Drop the existing data source and scoped credentials.
DROP EXTERNAL DATA SOURCE SampleSource
DROP DATABASE SCOPED CREDENTIAL MyCredential
Grant the Storage Blob Data Contributor role to the Managed Identity on the ADLS2 storage.
Next, create the scoped credentials and data source. We will use the abfss protocol.
CREATE DATABASE SCOPED CREDENTIAL MyCredential
WITH IDENTITY = 'MANAGED IDENTITY'; -- MANAGED IDENTITY IS THE KEYWORD
GO
CREATE EXTERNAL DATA SOURCE SampleSource
WITH (
    -- location should point to the ADLS2 storage container
    location = 'abfss://container@storageaccount.blob.core.windows.net/',
    CREDENTIAL = MyCredential
);
GO
As before, run the queries using PARSER_VERSION = '1.0' and PARSER_VERSION = '2.0'. The same syntax difference applies: with PARSER_VERSION = '2.0' you don't have to specify the WITH clause to predefine the structure of the underlying source.
-- Using PARSER_VERSION = '1.0'
SELECT
    customerid,
    name,
    email,
    city
FROM OPENROWSET(BULK 'Customers', -- This is the directory where the source files exist
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '1.0',
                FIRSTROW = 2)
WITH (
    [Customerid] [int],
    [Name] [varchar](40),
    [Email] [varchar](40),
    [city] [varchar](40)
) AS [tbl]
GO
-- Using PARSER_VERSION = '2.0'
SELECT
    customerid,
    name,
    email,
    city
FROM OPENROWSET(BULK 'Customers', -- This is the directory where the source files exist
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '2.0',
                FIELDTERMINATOR = ',',
                HEADER_ROW = TRUE) AS tbl
GO
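When a directory holds multiple files, it can be useful to know which file each row came from. Serverless pools expose the filepath() function on the OPENROWSET alias for exactly this; a minimal sketch against the same source:
-- Show the source file path alongside each row
SELECT tbl.filepath() AS source_file,
       customerid,
       name,
       city
FROM OPENROWSET(BULK 'Customers/*.csv',
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '2.0',
                HEADER_ROW = TRUE) AS tbl
GO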
Access data on Blob storage through Service Principal
Drop the existing data source and scoped credentials.
DROP EXTERNAL DATA SOURCE SampleSource
DROP DATABASE SCOPED CREDENTIAL MyCredential
As mentioned earlier, we will use the existing service principal called Fabric OAUTH2.
CREATE DATABASE SCOPED CREDENTIAL MyCredential
WITH IDENTITY = 'ClientId@https://login.microsoftonline.com/tenantId/oauth2/v2.0/token', -- substitute your client id and tenant id
SECRET = 'secret' -- the service principal's client secret
Next, create a data source that points to the blob storage:
CREATE EXTERNAL DATA SOURCE SampleSource
WITH (
    location = 'https://storageaccount.blob.core.windows.net/container',
    CREDENTIAL = MyCredential
);
Execute the queries:
-- Using PARSER_VERSION = '1.0'
SELECT
    customerid,
    name,
    email,
    city
FROM OPENROWSET(BULK 'Customers', -- This is the directory where the source files exist
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '1.0',
                FIRSTROW = 2)
WITH (
    [Customerid] [int],
    [Name] [varchar](40),
    [Email] [varchar](40),
    [city] [varchar](40)
) AS [tbl]
GO
-- Using PARSER_VERSION = '2.0'
SELECT
    customerid,
    name,
    email,
    city
FROM OPENROWSET(BULK 'Customers', -- This is the directory where the source files exist
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '2.0',
                FIELDTERMINATOR = ',',
                HEADER_ROW = TRUE) AS tbl
GO
The queries error out. This is because the Service Principal has no access to the storage account. Let's grant the Storage Blob Data Contributor role to the Service Principal Fabric OAUTH2.
After granting the access, give it a couple of minutes for the changes to take effect. Re-executing the queries should now return the data.
Access data on ADLS2 storage through Service Principal
Drop the data source and credentials and re-create them just the way we did in the previous step.
DROP EXTERNAL DATA SOURCE SampleSource
DROP DATABASE SCOPED CREDENTIAL MyCredential
Note that below I have used the blob endpoints. You could also use the dfs endpoints if you wish.
CREATE DATABASE SCOPED CREDENTIAL MyCredential
WITH IDENTITY = 'ClientId@https://login.microsoftonline.com/tenantId/oauth2/v2.0/token', -- substitute your client id and tenant id
SECRET = 'secret' -- the service principal's client secret
GO
CREATE EXTERNAL DATA SOURCE SampleSource
WITH (
    -- location should point to the ADLS2 storage container
    location = 'abfss://container@storageaccount.blob.core.windows.net/',
    CREDENTIAL = MyCredential
);
GO
The query would fail due to lack of access to the underlying source. Grant the Storage Blob Data Contributor role to the Service Principal Fabric OAUTH2 on the ADLS2 storage. Give it a couple of minutes for the changes to take effect, then re-execute the queries; they should now return the data.
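For reference, since the data source above uses the blob endpoint, the equivalent definition with the dfs endpoint would look like the sketch below (SampleSourceDfs is a hypothetical name; substitute your own container and storage account):
CREATE EXTERNAL DATA SOURCE SampleSourceDfs
WITH (
    -- dfs endpoint variant of the same ADLS2 container
    location = 'abfss://container@storageaccount.dfs.core.windows.net/',
    CREDENTIAL = MyCredential
);
GO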
External Tables
In case you wish to store the output of a query in a table, you can't just create a table and do an INSERT INTO through a SELECT query.
The only option available is to create an External Table and predefine an External File Format.
You can read about External Tables here and External File Formats here.
Create an External File Format that defines the structure of the external source:
CREATE EXTERNAL FILE FORMAT [CsvFormatWithHeader] WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        FIRST_ROW = 2, -- skip the header row in each file
        STRING_DELIMITER = '"',
        USE_TYPE_DEFAULT = False
    )
)
and then create an External Table:
CREATE EXTERNAL TABLE dbo.Customers
(
    [Customerid] [int],
    [Name] [varchar](40),
    [Email] [varchar](40),
    [city] [varchar](40)
)
WITH (
    LOCATION = '/Customers',
    DATA_SOURCE = SampleSource,
    FILE_FORMAT = [CsvFormatWithHeader]
)
GO
SELECT * FROM [Customers]
The created External Table is available under the External Tables folder in SSMS.
To drop the External Table, just use the DROP statement:
DROP EXTERNAL TABLE [dbo].[Customers]
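As a side note, serverless pools also support CETAS (CREATE EXTERNAL TABLE AS SELECT), which materializes the result of a query as new files in storage rather than merely mapping a table over existing files. A sketch, assuming a writable output directory CustomersOut and a header-less file format CsvFormatNoHeader (both hypothetical names), with the Storage Blob Data Contributor role already granted on the data source:
-- CETAS needs a file format without FIRST_ROW, since the output files have no header row
CREATE EXTERNAL FILE FORMAT [CsvFormatNoHeader] WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
)
GO
CREATE EXTERNAL TABLE dbo.CustomersOut
WITH (
    LOCATION = '/CustomersOut', -- hypothetical output directory on the data source
    DATA_SOURCE = SampleSource,
    FILE_FORMAT = [CsvFormatNoHeader]
)
AS
SELECT customerid, name, email, city
FROM OPENROWSET(BULK 'Customers/*.csv',
                DATA_SOURCE = 'SampleSource',
                FORMAT = 'csv',
                PARSER_VERSION = '2.0',
                HEADER_ROW = TRUE) AS src
GO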
Conclusion
In this article we delved into the different options Azure Synapse serverless SQL pools provide to query data directly from Blob and Azure Data Lake Storage without requiring data movement. Using External Tables and OPENROWSET, one can seamlessly query unstructured and semi-structured data formats like Parquet, CSV, and JSON. This approach reduces storage duplication and operational overhead.
Proper authentication through Managed Identities, Service Principals, and database scoped credentials reduces the overall risks and overheads of maintaining user credentials separately.
In the next article we will see how we can leverage the different options available in Synapse Analytics Dedicated pools to query external data sources on Blob and ADLS2 storage.
Thanks for reading!!!