Most analytics solutions require the batch movement of data on a regular schedule. Within the Cortana Intelligence Suite, Azure Data Factory (ADF) is the service that fulfils this requirement.
ADF provides a cloud-based data integration service that orchestrates the movement and transformation of data from various data stores.
So what do we mean by orchestration?
Think about an orchestra. The most important member of the orchestra is the conductor. The conductor does not play the instruments; they simply lead the musicians through the entire piece of music being performed. The musicians use their own skills to produce particular sounds at various stages of the symphony, so they may only learn certain parts of the music. The conductor orchestrates the entire piece, and is therefore aware of the whole score being performed. They also use specific arm movements that instruct the musicians how the piece should be played.
ADF uses a similar approach: it does not perform the actual work required to transform the data, but instead instructs another service, such as a Hadoop cluster, to run a Hive query that performs the transformation on ADF's behalf. In this case, it is Hadoop that performs the work, not ADF. ADF merely orchestrates the execution of the Hive query through Hadoop, and then provides the pipelines to move the data on to the next destination.
It also provides rich visualizations to display the lineage and dependencies between your data pipelines, and lets you monitor all your data pipelines from a single unified view to easily pinpoint issues and set up monitoring alerts.
The Azure Data Factory Process
The process for Azure Data Factory can be summarized by the following graphic.
Data Factory supports a wide variety of data sources that you can connect to through the creation of an object known as a Linked Service. This enables you to ingest data from a data source in readiness for transformation and/or analysis. In addition, Linked Services can fire up compute services on demand. For example, you may have a requirement to start an on-demand Hadoop cluster just to process data through a Hive query. So Linked Services enable you to define the data sources, or compute resources, that are required to ingest and prepare data.
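As a rough sketch of what this looks like in the ADF JSON authoring schema, a Linked Service for an Azure Storage account might be defined as follows (the account name and key are placeholders):

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "description": "Connection to the storage account that holds the raw data",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```

Each data store or compute service you want ADF to talk to gets its own Linked Service definition of this shape.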
With the Linked Service defined, Azure Data Factory is made aware of the data it should use through the creation of a Dataset object. Datasets represent data structures within the data store that is referenced by the Linked Service object. Datasets can also be used by an ADF process known as an Activity.
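For example, a Dataset pointing at a blob folder exposed by a storage Linked Service could be sketched as below; the container and folder names are illustrative placeholders, and the JSON follows the ADF v1 authoring schema:

```json
{
  "name": "InputBlobDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/input/",
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    },
    "availability": { "frequency": "Hour", "interval": 1 }
  }
}
```

The `availability` section tells ADF how often a slice of this data is produced, which drives the scheduling of any activities that consume it.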
Activities typically contain the transformation logic or the analysis commands of Azure Data Factory’s work. This could include the execution of a stored procedure, Hive query or Pig script to transform the data. You can use U-SQL with Data Lake Analytics, or push data into a Machine Learning model to perform analysis. It is not uncommon for multiple activities to take place, such as transforming data using a SQL stored procedure and then running a U-SQL job in Data Lake Analytics. In this case, multiple activities can be logically grouped together in an object referred to as a Pipeline.
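Putting these objects together, a Pipeline containing a single Hive activity might be sketched as follows; the script path, dataset names, linked service names and dates are all illustrative placeholders:

```json
{
  "name": "TransformPipeline",
  "properties": {
    "description": "Runs a Hive script on HDInsight to transform the ingested data",
    "activities": [
      {
        "name": "RunHiveQuery",
        "type": "HDInsightHive",
        "linkedServiceName": "HDInsightLinkedService",
        "typeProperties": {
          "scriptPath": "scripts/transformdata.hql",
          "scriptLinkedService": "AzureStorageLinkedService"
        },
        "inputs": [ { "name": "InputBlobDataset" } ],
        "outputs": [ { "name": "OutputBlobDataset" } ]
      }
    ],
    "start": "2017-01-01T00:00:00Z",
    "end": "2017-01-02T00:00:00Z"
  }
}
```

Note that the activity references the Linked Service for the compute (the HDInsight cluster) as well as the input and output Datasets, which is how ADF knows what to orchestrate and in what order.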
Once all the work is complete you can then use Data Factory to publish the final dataset to another linked service that can then be consumed by technologies such as Power BI or Machine Learning.
Therefore, the process discussed above can be summarized by the creation of the objects shown in the following graphic.
But I can move data with other tools, right? Absolutely. Data movement could occur, for example, using SSIS to load data from SQL Server to Azure SQL Data Warehouse. However, if you have seen the video on data loading in Azure SQL Data Warehouse, you will be aware that SSIS is not the ideal tool to use if performance of the data loads is the key objective. PolyBase is, and we can use Azure Data Factory to orchestrate the PolyBase execution to load data into SQL Data Warehouse.
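As a sketch of how this looks in an ADF Copy activity, the SQL Data Warehouse sink can be told to use PolyBase via the `allowPolyBase` flag; the dataset names and reject settings below are illustrative placeholders:

```json
{
  "name": "CopyToSqlDw",
  "type": "Copy",
  "inputs": [ { "name": "InputBlobDataset" } ],
  "outputs": [ { "name": "SqlDwDataset" } ],
  "typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": {
      "type": "SqlDWSink",
      "allowPolyBase": true,
      "polyBaseSettings": {
        "rejectType": "percentage",
        "rejectValue": 10.0,
        "rejectSampleValue": 100
      }
    }
  }
}
```

With this setting, ADF orchestrates the load through PolyBase rather than row-by-row inserts, which is what makes the load performant at scale.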
The other use case is calling on-demand services such as Hadoop clusters. ADF can create an on-demand cluster when it is required, and shut it down once your work is complete. This way, you control the usage of the Azure resources that process your data, and only pay for what you use.
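An on-demand cluster is itself defined as a Linked Service. A sketch in the ADF v1 schema might look like this, where the cluster size and time-to-live values are illustrative:

```json
{
  "name": "HDInsightOnDemandLinkedService",
  "properties": {
    "type": "HDInsightOnDemand",
    "typeProperties": {
      "clusterSize": 4,
      "timeToLive": "00:15:00",
      "osType": "Linux",
      "linkedServiceName": "AzureStorageLinkedService"
    }
  }
}
```

Here `timeToLive` controls how long the cluster may sit idle before ADF deletes it, and `linkedServiceName` points at the storage account the cluster uses, so you only pay while work is actually running.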
We can also use ADF to push data into Machine Learning models. This is particularly useful once a model has been productionized and you want to automate the process of feeding data to the model so that it can perform its analytics. ADF can perform this activity with ease.
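In ADF v1 this is done with an Azure ML batch execution activity inside a pipeline. As a hedged sketch (the linked service and dataset names are placeholders, and the web service port name depends on your published experiment):

```json
{
  "name": "ScoreWithAzureML",
  "type": "AzureMLBatchExecution",
  "linkedServiceName": "AzureMLLinkedService",
  "typeProperties": {
    "webServiceInput": "ScoringInputDataset",
    "webServiceOutputs": {
      "output1": "ScoringOutputDataset"
    }
  },
  "inputs": [ { "name": "ScoringInputDataset" } ],
  "outputs": [ { "name": "ScoringOutputDataset" } ]
}
```

Scheduled through a pipeline, this feeds fresh data slices to the published model and lands the scored results for downstream consumption.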
These examples, along with others, will be explored in future blogs. In the meantime, you can watch this 3-minute video that summarizes the capability of Azure Data Factory.