Azure Synapse: Leveraging Serverless and Dedicated SQL Pools for Data Analytics with Data Quality Testing


In today's data-centric world, it's not just about gathering data—it's about ensuring the data is accurate, consistent, and reliable for decision-making. In this blog, I’ll guide you through the process of loading a CSV file, customer.csv, from Azure Data Lake Storage Gen2 into Azure Synapse Analytics, performing queries using both serverless and dedicated SQL pools, and incorporating robust data quality testing practices.
Step 1: Setting Up Azure Synapse Analytics
Before diving into the integration, ensure that you have an Azure Synapse workspace set up. This workspace will act as the hub for your data processing and analytics needs. Additionally, configure your Azure Data Lake Storage Gen2 account to allow access from Synapse Analytics.
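Exactly how you grant access depends on your environment; a common pattern is to assign the Synapse workspace's managed identity the Storage Blob Data Contributor role on the storage account, and optionally expose that identity to SQL through a database scoped credential. Below is a minimal sketch (the credential name is illustrative) run inside a serverless SQL pool database:
-- A master key is required once per database before credentials can be created
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';
GO
-- Illustrative credential that authenticates to storage as the workspace managed identity
CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity
WITH IDENTITY = 'Managed Identity';
GO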
Step 2: Loading Data from Azure Data Lake Storage Gen2
Connect to Azure Data Lake Storage Gen2:
Navigate to your Synapse workspace.
In the left-hand navigation pane, open the Manage hub, select Linked services, and add your Azure Data Lake Storage Gen2 account as a linked service.
Import the CSV File:
- Use a Synapse data integration pipeline to import the customer.csv file into a table in the dedicated SQL pool.
Step 3: Querying Data with Serverless SQL Pool
Azure Synapse also provides a serverless SQL pool for querying data directly from Azure Data Lake Storage Gen2. This approach is perfect for ad-hoc analytics without the need to pre-load data into a table.
-- Auto-generated by Synapse Studio: preview the first 100 rows directly from the data lake
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://pocdlsgen2acc.dfs.core.windows.net/inbound/customer.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [result];
Serverless SQL pools allow querying large datasets efficiently, reducing the overhead of managing physical storage.
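If you want explicit column names and types instead of the inferred defaults, you can add a WITH clause to OPENROWSET. The column names below are assumptions based on the customer table used later, and HEADER_ROW = TRUE simply tells parser version 2.0 to skip the header line if the file has one:
-- Sketch: query the CSV with typed columns (adjust names and types to the actual file)
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://pocdlsgen2acc.dfs.core.windows.net/inbound/customer.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) WITH (
    customer_id NVARCHAR(50),
    name        NVARCHAR(50),
    email       NVARCHAR(50),
    country     NVARCHAR(50)
) AS [result];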
Step 4: Querying Data with Dedicated SQL Pool
Synapse's dedicated SQL pool is ideal for querying structured and semi-structured data at scale. Once the customer.csv file is loaded, create a table in the dedicated SQL pool to store the data:
-- Distribute on customer_id and store as clustered columnstore for analytic queries
CREATE TABLE [dbo].[customer]
(
    customer_id NVARCHAR(50),
    name NVARCHAR(50),
    email NVARCHAR(50),
    country NVARCHAR(50)
)
WITH
(
    DISTRIBUTION = HASH (customer_id),
    CLUSTERED COLUMNSTORE INDEX
)
GO

-- Load the CSV from ADLS Gen2; the number after each column maps it to a field position in the file
COPY INTO [dbo].[customer]
(customer_id 1, name 2, email 3, country 4)
FROM 'https://pocdlsgen2acc.dfs.core.windows.net/inbound/customer.csv'
WITH
(
    FILE_TYPE = 'CSV'
);
SELECT * FROM [dbo].[customer] ORDER BY customer_id;
Use SQL queries to analyze the data, identify trends, and generate insights.
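For example, a simple aggregation (illustrative only, assuming the country column is populated) shows how customers are distributed:
-- Count customers per country to spot skew or unexpected values
SELECT country, COUNT(*) AS customer_count
FROM [dbo].[customer]
GROUP BY country
ORDER BY customer_count DESC;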
Step 5: Ensuring Data Quality
Data quality testing is a crucial step in any data pipeline to ensure the accuracy and reliability of your analytics. We focused on the following metrics to validate the integrity of the data:
Consistency: Verify that the data in Azure Synapse matches the source system to ensure alignment and correctness.
Completeness: Check for mismatched row counts between pipeline stages and confirm that all expected columns are present.
Accuracy: Validate column data types and confirm that numeric values are rounded off correctly, adhering to the desired precision.
Uniqueness: Ensure that column values are unique where necessary, such as in primary keys.
Null/Empty Values: Identify and address null or empty values in mandatory columns, as these could impact downstream processes.
Referential Integrity: Confirm that every fact table row references an existing dimension row; orphaned foreign keys (values with no matching primary key) should not exist.
By embedding data quality checks in your pipeline, you can trust your data to be both accurate and useful for decision-making.
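Many of these checks can be expressed directly in T-SQL against the loaded table. The queries below are sketches for the customer table; the mandatory columns chosen are assumptions and should match your own data contract.
-- Completeness: compare the loaded row count against the expected source count
SELECT COUNT(*) AS loaded_rows FROM [dbo].[customer];

-- Uniqueness: customer_id values that appear more than once
SELECT customer_id, COUNT(*) AS occurrences
FROM [dbo].[customer]
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Null/empty values: mandatory columns that are null or blank
SELECT COUNT(*) AS bad_rows
FROM [dbo].[customer]
WHERE customer_id IS NULL OR LTRIM(RTRIM(customer_id)) = ''
   OR email IS NULL OR LTRIM(RTRIM(email)) = '';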
Step 6: Synapse Pipeline Error Codes
Error code: 3250
Message:
There are not enough resources available in the workspace, details: '%errorMessage;'
Cause: Insufficient resources
Recommendation: Try ending the running job(s) in the workspace, reducing the number of vCores requested, increasing the workspace quota, or using another workspace.
Error code: 3251
Message:
There are not enough resources available in the pool, details: '%errorMessage;'
Cause: Insufficient resources
Recommendation: Try ending the running job(s) in the pool, reducing the number of vCores requested, increasing the pool maximum size, or using another pool.
Error code: 3252
Message:
There are not enough vcores available for your spark job, details: '%errorMessage;'
Cause: Insufficient virtual cores
Recommendation: Try reducing the number of vCores requested or increasing your vCore quota. For more information, see Apache Spark core concepts.
Error code: 3253
Message:
There are substantial concurrent MappingDataflow executions which is causing failures due to throttling under the Integration Runtime used for ActivityId: '%activityId;'.
Cause: Throttling threshold was reached.
Recommendation: Retry the request after a wait period.
Error code: 3254
Message:
AzureSynapseArtifacts linked service has invalid value for property '%propertyName;'.
Cause: Bad format or missing definition of property '%propertyName;'.
Recommendation: Check if the linked service has property '%propertyName;' defined with correct data.
GitHub link:
https://github.com/vipinputhanveetil/azure_synapse_sql_pools
Key Benefits of Azure Synapse Analytics
Flexibility: Choose between dedicated and serverless SQL pools based on your needs.
Scalability: Handle large volumes of data effortlessly.
Cost-Efficiency: Serverless pools offer a pay-per-query model billed on data processed, eliminating upfront provisioning costs.
Written by Vipin
Highly skilled Data Test Automation professional with over 10 years of experience in data quality assurance and software testing. Proven ability to design, execute, and automate testing across the entire SDLC (Software Development Life Cycle) utilizing Agile and Waterfall methodologies. Expertise in End-to-End DWBI project testing and experience working in GCP, AWS, and Azure cloud environments. Proficient in SQL and Python scripting for data test automation.