Introduction to Azure Data Factory

Azure Data Factory is a code-free ETL-as-a-Service offering.
It is a managed cloud orchestration service built for ETL (Extract-Transform-Load), ELT (Extract-Load-Transform), and data integration projects.
Features of Azure Data Factory
Data Compression - during a copy activity, data can be compressed and the compressed data written to the target data store. This helps optimize bandwidth usage while copying.
Extensive Connectivity Support for different data sources - ADF (Azure Data Factory) provides broad connectivity support for a wide range of data sources, which lets us pull data from, or write data to, many different systems.
Custom Event Triggers - as a managed orchestration tool, ADF allows us to automate data processing using custom event triggers. This lets us run certain actions automatically when a specific event occurs or completes (see the sketch after this list).
Data Preview and Validation - during a copy activity, we can preview and validate data, so we can be confident in the ADF orchestration steps.
Customizable Data Flows - custom actions or steps can be added to ADF data flows.
Integrated Security - security features such as Microsoft Entra ID integration and role-based access control (RBAC) are built into ADF.
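
To make the event-trigger idea concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK. Every name here (subscription, resource group, factory, storage account, pipeline) is a placeholder assumption, and exact model names can vary slightly between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, BlobEventsTrigger,
    TriggerPipelineReference, PipelineReference,
)

# Placeholder names -- substitute your own.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "my-rg", "my-data-factory"

# Fire a (hypothetical) pipeline whenever a new blob lands under a path.
event_trigger = TriggerResource(
    properties=BlobEventsTrigger(
        scope=("/subscriptions/<subscription-id>/resourceGroups/my-rg"
               "/providers/Microsoft.Storage/storageAccounts/<account>"),
        events=["Microsoft.Storage.BlobCreated"],
        blob_path_begins_with="/input-container/blobs/raw/",
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                reference_name="DemoPipeline", type="PipelineReference"))],
    )
)
adf_client.triggers.create_or_update(rg, df, "NewFileTrigger", event_trigger)
```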
Main Concepts in ADF
Pipelines
A pipeline is a logical grouping of activities that performs a unit of work.
The activities in a pipeline together perform a task.
Activities in a pipeline can be chained to operate sequentially, or they can operate independently in parallel.
A data factory can have multiple pipelines.
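
As a rough illustration, here is how a pipeline might be created with the azure-mgmt-datafactory Python SDK; the names are placeholders, and the single Wait activity simply stands in for real work:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

# Placeholder names -- substitute your own.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "my-rg", "my-data-factory"

# A pipeline is a named collection of activities.
pipeline = PipelineResource(
    description="Demo pipeline with a single activity",
    activities=[WaitActivity(name="WaitTenSeconds", wait_time_in_seconds=10)],
)
adf_client.pipelines.create_or_update(rg, df, "DemoPipeline", pipeline)
```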
Activities
An activity is a single step in a pipeline.
For example: a Copy activity copies data from one data store to another; a Hive activity runs a Hive query.
ADF supports three types of activities: data movement, data transformation, and control activities.
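
For instance, a data-movement step could look like the following sketch, assuming two blob datasets named InputDataset and OutputDataset already exist (the type arguments on the references may be implicit in older SDK versions):

```python
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, BlobSource, BlobSink,
)

# One data-movement activity: read from a source dataset,
# write to a sink dataset (both defined separately).
copy_step = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(reference_name="InputDataset", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="OutputDataset", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)
```

The activity object is then added to a pipeline's activities list, as in the earlier pipeline sketch.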
Datasets
A dataset represents the data structure within a data store; it points to the data an activity uses as input or output.
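
A sketch of a blob dataset, assuming a linked service named BlobStorageLinkedService exists and using placeholder container and file names:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureBlobDataset, LinkedServiceReference,
)

# Placeholder names -- substitute your own.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "my-rg", "my-data-factory"

# A dataset describing one file in blob storage, reached through
# an existing linked service.
blob_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="BlobStorageLinkedService",
            type="LinkedServiceReference",
        ),
        folder_path="input-container/raw",
        file_name="orders.csv",
    )
)
adf_client.datasets.create_or_update(rg, df, "InputDataset", blob_ds)
```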
Linked Services
A linked service is much like a connection string: it defines the connection information ADF needs to connect to external resources.
A linked service is used for two purposes:
To represent a data store - for example a SQL Server database, an Oracle database, a file share, or Azure Blob Storage.
To represent a compute resource that can host the execution of an activity.
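
Here is a sketch of a storage linked service, following the shape of the ADF Python quickstart; the connection string is a placeholder:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService, SecureString,
)

# Placeholder names -- substitute your own.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "my-rg", "my-data-factory"

# The linked service holds the connection definition itself.
ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(rg, df, "BlobStorageLinkedService", ls)
```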
Data Flows
Data flows are managed graphs of data transformation logic that can transform data of any size.
We can build up a reusable library of data transformations.
We can execute these transformations in a scaled-out manner from ADF pipelines.
The logic runs on a Spark cluster that spins up and down as needed.
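
Data flows themselves are usually authored visually, but running one from a pipeline is just another activity. A sketch, assuming a mapping data flow named TransformOrders already exists in the factory:

```python
from azure.mgmt.datafactory.models import ExecuteDataFlowActivity, DataFlowReference

# One pipeline step that runs an existing mapping data flow;
# ADF provisions the Spark cluster behind it on demand.
run_flow = ExecuteDataFlowActivity(
    name="RunTransformOrders",
    data_flow=DataFlowReference(
        reference_name="TransformOrders", type="DataFlowReference"
    ),
)
```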
Integration Runtimes
An integration runtime provides the bridge between an activity (the action to be performed) and the linked service (the target data store or compute service).
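
For example, reaching a data store inside a private network calls for a self-hosted integration runtime. A sketch of registering one (all names are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

# Placeholder names -- substitute your own.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "my-rg", "my-data-factory"

# A self-hosted integration runtime bridges ADF to resources that
# the default Azure-managed runtime cannot reach directly.
ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="Runs on an on-premises VM")
)
adf_client.integration_runtimes.create_or_update(rg, df, "OnPremIR", ir)
```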
Triggers
A trigger is the unit of processing that determines when a pipeline run needs to kick off.
There are different types of triggers for different kinds of events, such as schedule, tumbling window, and event-based triggers.
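
A sketch of a schedule trigger that starts the hypothetical DemoPipeline once a day (begin_start is the newer SDK spelling; older versions expose start):

```python
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

# Placeholder names -- substitute your own.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "my-rg", "my-data-factory"

# A schedule trigger that kicks off the pipeline once a day.
trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day",
            interval=1,
            start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
            time_zone="UTC",
        ),
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                reference_name="DemoPipeline", type="PipelineReference"
            ),
            parameters={},
        )],
    )
)
adf_client.triggers.create_or_update(rg, df, "DailyTrigger", trigger)
adf_client.triggers.begin_start(rg, df, "DailyTrigger").result()  # activate it
```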
Pipeline Runs
A pipeline run is an instance of a pipeline execution.
It is typically instantiated by passing arguments to the parameters defined in the pipeline.
These arguments can be passed manually or within a trigger definition.
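
A sketch of starting a run on demand and polling its status; the pipeline name and the inputPath parameter are assumptions:

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder names -- substitute your own.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "my-rg", "my-data-factory"

# Start a run manually, supplying arguments for the pipeline's parameters.
run = adf_client.pipelines.create_run(
    rg, df, "DemoPipeline", parameters={"inputPath": "input-container/raw"}
)

# Each run has its own id; poll it for status.
time.sleep(30)
status = adf_client.pipeline_runs.get(rg, df, run.run_id).status
print(status)  # e.g. "InProgress", "Succeeded", or "Failed"
```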
Parameters
Parameters are key-value pairs of read-only configuration.
They are defined on the pipeline.
Activities within the pipeline consume the parameter values.
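
A sketch of declaring a parameter on a pipeline; at run time, activities read its value through the @pipeline().parameters expression, and the Wait activity here is just a placeholder:

```python
from azure.mgmt.datafactory.models import (
    PipelineResource, ParameterSpecification, WaitActivity,
)

# A pipeline with one read-only String parameter. Activities refer
# to its value with the expression "@pipeline().parameters.inputPath".
pipeline = PipelineResource(
    parameters={"inputPath": ParameterSpecification(type="String")},
    activities=[WaitActivity(name="Placeholder", wait_time_in_seconds=1)],
)
```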
Control Flow
Control flow is the orchestration of pipeline activities: chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments when the pipeline is invoked on demand or from a trigger.
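
A sketch of the simplest control-flow construct, sequential chaining: the second (placeholder) activity runs only when the first reports Succeeded, and other dependency conditions such as Failed or Completed enable branching:

```python
from azure.mgmt.datafactory.models import WaitActivity, ActivityDependency

# Two activities chained in sequence through an activity dependency.
first = WaitActivity(name="StepOne", wait_time_in_seconds=5)
second = WaitActivity(
    name="StepTwo",
    wait_time_in_seconds=5,
    depends_on=[ActivityDependency(
        activity="StepOne", dependency_conditions=["Succeeded"]
    )],
)
```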