Introduction to Azure Data Factory

Vijaya Malla

Azure Data Factory is a code-free ETL-as-a-Service offering.

It is a managed cloud orchestration service built for ETL (Extract-Transform-Load), ELT (Extract-Load-Transform), and data integration projects.

Features of Azure Data Factory

  1. Data Compression - during a copy activity, data can be compressed and written in its compressed form to the target data store. This helps optimize bandwidth usage while copying.

  2. Extensive Connectivity Support for different data sources - ADF (Azure Data Factory) provides broad connector coverage, which enables us to read data from and write data to a wide range of data stores.

  3. Custom Event Triggers - as a managed orchestration tool, ADF allows us to automate data processing using event triggers, so that a pipeline runs when a specific event occurs (see the trigger sketch after this list).

  4. Data Preview and Validation - during a copy activity, we can preview and validate data, so we can be confident in the ADF orchestration steps.

  5. Customizable Data Flows - custom actions or steps can be added to ADF data flows.

  6. Integrated Security - security features such as Microsoft Entra ID integration and role-based access control (RBAC) are built into ADF.
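
As a sketch of the event-trigger feature above, the snippet below registers a blob-created trigger with the azure-mgmt-datafactory Python SDK. Every name here (subscription, resource group, factory, storage account, pipeline) is a hypothetical placeholder:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger,
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
)

# Hypothetical names - replace with your own resources.
SUB, RG, DF = "<subscription-id>", "my-rg", "my-adf"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUB)

# Fire a pipeline whenever a new blob lands under the given path prefix.
trigger = TriggerResource(
    properties=BlobEventsTrigger(
        scope=(
            f"/subscriptions/{SUB}/resourceGroups/{RG}"
            "/providers/Microsoft.Storage/storageAccounts/mystorage"
        ),
        events=["Microsoft.Storage.BlobCreated"],
        blob_path_begins_with="/landing/blobs/",
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="IngestPipeline"
                )
            )
        ],
    )
)
adf_client.triggers.create_or_update(RG, DF, "OnNewBlob", trigger)
```

The later sketches in this post reuse the same adf_client and the RG/DF placeholders.
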

Main Concepts in ADF

  1. Pipelines

    • This is a logical grouping of activities that performs a unit of work.

    • The activities in a pipeline together perform a task.

    • The activities in a pipeline can be chained to run sequentially, or run independently in parallel.

    • A data factory can have multiple pipelines (a minimal example is sketched below).
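
A minimal sketch of creating a pipeline, reusing the hypothetical client and names from the trigger sketch above; the pipeline contains a single Wait activity just to keep it small:

```python
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

# A pipeline is a named, logical grouping of activities.
pipeline = PipelineResource(
    description="Smallest possible pipeline: one Wait activity.",
    activities=[WaitActivity(name="PauseBriefly", wait_time_in_seconds=5)],
)
adf_client.pipelines.create_or_update(RG, DF, "DemoPipeline", pipeline)
```
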

  2. Activities

    • This is a single step in a pipeline.

    • Examples: a Copy activity, which copies data from one data store to another, or a Hive activity, which runs a Hive query.

    • ADF supports three types of activities: data movement, data transformation, and control activities (a copy-activity sketch follows).
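
Continuing with the hypothetical names from above, a data movement activity that copies blobs between two datasets might look like this:

```python
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

# A copy activity: move data from an input dataset to an output dataset.
copy = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)
adf_client.pipelines.create_or_update(
    RG, DF, "CopyPipeline", PipelineResource(activities=[copy])
)
```
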

  3. Datasets

    • These represent data structures within the data stores, pointing to the data that activities consume as inputs or produce as outputs (see the sketch below).

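A sketch of the input dataset used by the copy activity above; the folder, file, and linked service names are placeholders:

```python
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    DatasetResource,
    LinkedServiceReference,
)

# A dataset names the data an activity reads or writes - here, a file in a
# blob container reached through a linked service (defined further below).
dataset = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="BlobStorageLS"
        ),
        folder_path="landing/blobs",
        file_name="input.txt",
    )
)
adf_client.datasets.create_or_update(RG, DF, "InputBlobDataset", dataset)
```
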
  4. Linked Services

    • This is much like a connection string: it defines the connection information ADF needs to connect to external resources (a storage example follows below).

    • A linked service serves two purposes:

      • To represent a data store - e.g., a SQL Server database, an Oracle database, a file share, or Azure Blob Storage.

      • To represent a compute resource that can host the execution of an activity.
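
The BlobStorageLS linked service referenced by the dataset sketch above could be created like this; the connection string values are placeholders:

```python
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService,
    LinkedServiceResource,
    SecureString,
)

# The linked service holds the connection information, much like a
# connection string in application code.
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(RG, DF, "BlobStorageLS", storage_ls)
```
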

  5. Data Flows

    • Managed graphs of data transformation logic that can be used to transform data of any size.

    • We can build up a reusable library of data transformations.

    • We can execute these transformations in a scaled-out manner through ADF pipelines (see the sketch below).

    • The logic is executed on a Spark cluster that spins up and down as needed, so we never manage the cluster ourselves.
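
Authoring a full mapping data flow involves a transformation script, so the sketch below only shows the pipeline side: running an already-authored data flow. The data flow name TransformSales is hypothetical:

```python
from azure.mgmt.datafactory.models import (
    DataFlowReference,
    ExecuteDataFlowActivity,
    PipelineResource,
)

# Run an already-authored mapping data flow from a pipeline; ADF spins up
# a Spark cluster behind the scenes to execute it.
run_flow = ExecuteDataFlowActivity(
    name="RunTransform",
    data_flow=DataFlowReference(
        type="DataFlowReference", reference_name="TransformSales"
    ),
)
adf_client.pipelines.create_or_update(
    RG, DF, "TransformPipeline", PipelineResource(activities=[run_flow])
)
```
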

  6. Integration Runtimes

    • This provides the bridge between an activity (the action to be performed) and the linked service (the target data store or compute service). A self-hosted runtime is sketched below.

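For example, registering a self-hosted integration runtime (the bridge to on-premises data) with the same hypothetical client might look like:

```python
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

# Register a self-hosted integration runtime, then fetch the auth keys
# used when installing the runtime agent on an on-premises machine.
ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="On-prem bridge")
)
adf_client.integration_runtimes.create_or_update(RG, DF, "OnPremIR", ir)
keys = adf_client.integration_runtimes.list_auth_keys(RG, DF, "OnPremIR")
print(keys.auth_key1)
```
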
  7. Triggers

    • This represents the unit of processing that determines when a pipeline execution needs to be kicked off.

    • There are different types of triggers for different types of events - e.g., schedule, tumbling window, and storage or custom events (a schedule trigger is sketched below).
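
A daily schedule trigger for the hypothetical CopyPipeline from earlier; note that a trigger only fires after it is started (begin_start assumes a recent SDK version):

```python
from datetime import datetime, timedelta

from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

# A schedule trigger: kick off CopyPipeline once a day.
schedule = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day",
            interval=1,
            start_time=datetime.utcnow() + timedelta(minutes=15),
            time_zone="UTC",
        ),
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="CopyPipeline"
                )
            )
        ],
    )
)
adf_client.triggers.create_or_update(RG, DF, "DailyRun", schedule)
adf_client.triggers.begin_start(RG, DF, "DailyRun")
```
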

  8. Pipeline Runs

    • This represents an instance of a pipeline execution.

    • A run is typically instantiated by passing arguments to the parameters defined in the pipeline.

    • The arguments can be passed manually or within the trigger definition (see the sketch below).
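
Kicking off and polling a run on demand, again with the hypothetical names from above (the inputFile parameter is assumed to be defined on the pipeline):

```python
import time

# Kick off a run, passing an argument for a pipeline parameter,
# then poll the run's status.
run = adf_client.pipelines.create_run(
    RG, DF, "CopyPipeline", parameters={"inputFile": "input.txt"}
)
time.sleep(30)
status = adf_client.pipeline_runs.get(RG, DF, run.run_id)
print(status.status)  # Queued, InProgress, Succeeded, Failed, ...
```
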

  9. Parameters

    • These are key-value pairs of read-only configuration.

    • They are defined in the pipeline.

    • Activities within the pipeline consume the parameter values (as in the sketch below).
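
A sketch of declaring a parameter and consuming it in an activity through an ADF expression; the parameter and variable names are made up:

```python
from azure.mgmt.datafactory.models import (
    ParameterSpecification,
    PipelineResource,
    SetVariableActivity,
    VariableSpecification,
)

# Declare a read-only pipeline parameter and consume its value in an
# activity through an ADF expression.
pipeline = PipelineResource(
    parameters={
        "sourceFile": ParameterSpecification(type="String", default_value="input.txt")
    },
    variables={"resolvedFile": VariableSpecification(type="String")},
    activities=[
        SetVariableActivity(
            name="ReadParam",
            variable_name="resolvedFile",
            value="@pipeline().parameters.sourceFile",
        )
    ],
)
adf_client.pipelines.create_or_update(RG, DF, "ParamPipeline", pipeline)
```
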

  10. Control Flow

    • This is the orchestration of pipeline activities: chaining activities in a sequence, branching, and defining parameters at the pipeline level.

    • It also covers passing arguments while invoking the pipeline on demand or from a trigger (a chaining-and-branching sketch follows).
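
To close the loop, a sketch that chains one activity after another and branches on a pipeline parameter; every name here is a placeholder:

```python
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    Expression,
    IfConditionActivity,
    ParameterSpecification,
    PipelineResource,
    WaitActivity,
)

prepare = WaitActivity(name="Prepare", wait_time_in_seconds=1)

# Chaining: BranchOnEnv runs only after Prepare succeeds, then branches
# on the value of a pipeline parameter.
branch = IfConditionActivity(
    name="BranchOnEnv",
    depends_on=[
        ActivityDependency(activity="Prepare", dependency_conditions=["Succeeded"])
    ],
    expression=Expression(value="@equals(pipeline().parameters.env, 'prod')"),
    if_true_activities=[WaitActivity(name="ProdPath", wait_time_in_seconds=5)],
    if_false_activities=[WaitActivity(name="DevPath", wait_time_in_seconds=1)],
)

pipeline = PipelineResource(
    parameters={"env": ParameterSpecification(type="String", default_value="dev")},
    activities=[prepare, branch],
)
adf_client.pipelines.create_or_update(RG, DF, "ControlFlowPipeline", pipeline)
```
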

💡 P.S. The cover image was created by ChatGPT.