ADW5: Loading data into Azure SQL Data Warehouse

There is a wide range of tools and technologies for ingesting data into Azure SQL Data Warehouse, including BCP and SSIS. If performance is a critical metric to meet when loading a data warehouse, loads are best performed with PolyBase from Azure Storage Blobs or Azure Data Lake.

What is PolyBase?

PolyBase allows you to leverage your data from various sources, and enables you to query non-relational data held in Azure Blob storage or Azure Data Lake as though it were a regular table by using familiar T-SQL commands. As a result, PolyBase can be used to query non-relational data, or to import non-relational data into SQL Data Warehouse.

PolyBase achieves this using external tables to access non-relational data. These tables are metadata definitions stored in SQL Data Warehouse, and you can access them with SQL and tools just as you would access normal relational data. The key point to note is that external tables do not hold any data at all; they provide a metadata abstraction over the source data held in Azure Storage Blobs or Azure Data Lake. Therefore, an external table requires two supporting objects that tell it where the source data resides and what format the source data is in. These are referred to as the external data source and the external file format.

External Data Source

An External Data Source provides the connection information to the source data. Typically, this also involves using a credential to access the data source, so a credential has to be created for the external data source to use. Finally, the credential has to be protected, so a master key is also required in the database to protect the credential. The following example first creates a master key in the EquityDB database, and then creates a database scoped credential named AzureStorageCredential that contains the access key to an Azure Blob Store. Finally, an External Data Source named AzureStorage provides the connection to a container in an Azure Blob Store named datacontainer@ctostorageaccount.blob.core.windows.net, using the credential named AzureStorageCredential to access the source data.

-- A: Create a master key
CREATE MASTER KEY;

-- B: Create a database scoped credential
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'NA',
    SECRET = 'XI7amHWNZiqyrst8Ockja7Tw5Hci3D/yyxW6l2Djjg1RcdrtHfEla5h007g=='
;

-- C: Create an external data source
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://datacontainer@ctostorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

External File Format

The data read by PolyBase can be in various formats, including Parquet, delimited text, or ORC files. The CREATE EXTERNAL FILE FORMAT command is used to specify the format of the files that contain the external data. The following is a simple example of an external file format named TextFile that defines the format as delimited text with a comma as the field terminator.

-- D: Create an external file format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

 

External Tables

External tables access non-relational data. The following example creates an external table named DimDate2External with columns that reflect a date dimension table. The critical points to note are the options within the WITH clause. They point to both the external data source created earlier, named AzureStorage, and the external file format named TextFile. The location specifies the folder, or the file path and file name, for the actual data in Azure Blob storage. In this example the location starts from the root folder, the root folder being the data location specified in the external data source.

CREATE EXTERNAL TABLE dbo.DimDate2External (
    [Date] datetime2(3) NULL,
    [DateKey] decimal(38, 0) NULL,
    [MonthKey] decimal(38, 0) NULL,
    [Month] nvarchar(100) NULL,
    [Quarter] nvarchar(100) NULL,
    [Year] decimal(38, 0) NULL,
    [Year-Quarter] nvarchar(100) NULL,
    [Year-Month] nvarchar(100) NULL,
    [Year-MonthKey] nvarchar(100) NULL,
    [WeekDayKey] decimal(38, 0) NULL,
    [WeekDay] nvarchar(100) NULL,
    [Day Of Month] decimal(38, 0) NULL
)
WITH (
    LOCATION = '/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFile
);
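With the external table in place, you can query it directly with familiar T-SQL even though the data still resides in blob storage. The following query is just a simple illustration; PolyBase reads the external files at query time, so this will be slower than querying an internal table:

-- Query the external table directly; the data is read from blob storage at query time
SELECT [Year], COUNT(*) AS DaysInYear
FROM dbo.DimDate2External
GROUP BY [Year]
ORDER BY [Year];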

 

Using PolyBase to load data quickly

When you are loading or exporting large volumes of data, or when fast performance is needed, PolyBase is the best choice. PolyBase is designed to leverage the MPP (Massively Parallel Processing) architecture of SQL Data Warehouse and will therefore load and export data orders of magnitude faster than any other tool. PolyBase loads can be run using CREATE TABLE AS SELECT (CTAS). Using CTAS minimizes transaction logging and is the fastest way to load your data.

 

The following code uses a CTAS statement to create a new table in the EquityDB database named Dates that is populated with data from the external table named DimDate2External created earlier. The table also has a clustered columnstore index and uses a ROUND_ROBIN distribution.

CREATE TABLE dbo.Dates
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM [dbo].[DimDate2External];
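As a quick sanity check on the load, you can compare the row counts of the external table and the newly created internal table; this is an illustrative check rather than a required step:

-- Compare the number of rows in the external (source) table and the loaded table
SELECT COUNT(*) AS ExternalRows FROM dbo.DimDate2External;
SELECT COUNT(*) AS LoadedRows FROM dbo.Dates;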

This blog has explored using PolyBase to load data into an Azure SQL Data Warehouse. Although there are many tools to perform a load, PolyBase is the best choice from a performance perspective as it is designed to leverage the MPP (Massively Parallel Processing) architecture of SQL Data Warehouse. Other services can run PolyBase under the hood, such as Azure Data Factory, or you could use a third-party tool to migrate your data from an on-premises server, such as Redgate's Data Platform Studio, which is readily accessible from the Azure SQL Data Warehouse product home page.

See the code in the blog being executed in the following video.

Analyse Twitter Sentiments with Azure Streaming Analytics in this lab

So tonight I will be presenting at the Manchester Power BI user group, looking at Power BI and Cortana integration. A video will be imminently released showing you how this can be done.

The excellent David Moss will also be presenting a topic on Azure Streaming Analytics, and whilst I have not written any blogs on this area as yet, I am going to take a different approach where you can learn not by reading, but by doing.

What do you need to perform this lab?

  • An Azure account and subscription
  • A Twitter account

If you have these you can go ahead and perform the lab.

Streaming Analytics Lab

Want to watch a video about this?

Then visit Channel 9

 

 

 

 

Cortana and Power BI integration

Thursday found me in Manchester with the Power BI user group to discuss Cortana integration with Power BI and the Cortana Intelligence Suite. The group can be found here if you want to attend any of their future events. It is well worth it.

You can watch a 10 minute video here, or you can read the setup below.

What is Cortana?

One of the most common questions that we are asked is what is the difference between Cortana and Cortana Intelligence. Cortana is the personal digital assistant, or as my children tell me, she is my other friend! Cortana will help you find things on your PC, manage your calendar, track packages, find files, chat with you and even tell some funny jokes. The more you use Cortana, the more personalised your experience will be. I have to say that I rely on Cortana heavily in both my personal and professional life on a daily basis and she helps me get my work done on time.

What is Cortana Intelligence?

(Figure: the Cortana Intelligence Suite, with Power BI providing the dashboards and visualizations)

Cortana Intelligence is different. Cortana Intelligence is a suite of technologies that provides a platform to deliver data science and/or artificial intelligence solutions. Using this in conjunction with the Team Data Science Process, you have both the platform and the process to deliver successful solutions.

So how does Cortana get involved with this?

Cortana is still the personal digital assistant found in the intelligence section of the Cortana Intelligence Suite. You will also notice from the graphic above that Power BI provides the dashboards and visualizations that surface the data orchestrated and stored in the big data stores.

It is here that Cortana can be integrated with Power BI, so that you can call reports from your Power BI service from either your Windows 10 desktop or your Windows 10 phone.

So how can we make this happen?

First of all, you need the right version of Windows. The minimum version required is 1511, and you can run the command winver in Windows to confirm you have the right version. If you do not, you will have to upgrade to use this feature.


The next step is to set up a work account for both the Windows machine and for Power BI. From a Windows machine perspective, the account you log in with should match the account used to log in to Power BI. In my case that would be chris.testaoneill@microsoft.com.

However, if this is not the case, you can add additional accounts within Windows using Settings followed by Access Work or School. From here you can click Connect and provide the credentials of a secondary account. Be mindful, however, that there could be Domain Administrator restrictions on performing this operation, so it is worth checking and asking for permission first.


But there’s got to be more to it than that, right?

Yes. The steps above are essentially the provisioning work that is required to get the ball rolling. They are one off operations to get the integration started. There is still work you have to do within Power BI for both datasets and reports.

Within the dataset, you have to allow Cortana access to the dataset. There is a checkbox for this precise setting in the data refresh properties of the dataset itself. You can also augment Cortana's querying by adding your own custom questions to the Featured Q&A questions section.


But there is also work to be done within the individual reports. Within Power BI Desktop, in the report properties, you must modify the Page Size property in the Format section and set the Type drop-down to "Cortana". In addition, you can use the Page Information section to add alternate names for the report.


After these settings have been defined, you can publish the report to your Power BI workspace, and after a few minutes you can use Cortana on either your laptop or your mobile to ask questions. You're all set!

If you want to try this yourself, here are a couple of resources that will get you started:

  1. Enable Power BI Integration with Cortana
https://powerbi.microsoft.com/en-us/documentation/powerbi-service-cortana-enable/

  2. Use Power BI to create a custom Answer Page for Cortana
https://powerbi.microsoft.com/en-us/documentation/powerbi-service-cortana-desktop-entity-cards/

ADW4: Working with tables in SQL DW


Before we get into the implementation of tables, there are several concepts that need to be understood so that you can appreciate the ramifications of your design choices for tables. These concepts include:

  • Distributions
  • Nodes
  • Service Objective

Distributions

SQL Data Warehouse is a Massively Parallel Processing (MPP) system. Data is spread across multiple locations in the MPP system, referred to as distributions, and this is fundamental to running queries in parallel to achieve high query performance.

When a query runs in parallel, each of the 60 distributions performs a SQL query on its portion of the data. SQL Data Warehouse uses SQL Database to run each query. This shared-nothing architecture is fundamental to achieving scale-out parallel computing. Every SQL Data Warehouse has a fixed number of 60 distributions.

Nodes

There are two types of nodes within Azure SQL Data Warehouse: Control nodes and Compute nodes.

The Control node manages and optimizes queries. It is the front end that interacts with all applications and connections. In SQL Data Warehouse, the Control node is powered by SQL Database, and connecting to it looks and feels the same. Under the surface, the Control node coordinates all the data movement and computation required to run parallel queries on your distributed data across distribution nodes. When you submit a T-SQL query to SQL Data Warehouse, the Control node transforms it into separate queries that run on each Compute node in parallel.

The Compute nodes serve as the power behind SQL Data Warehouse. They are SQL Databases that store your data and process your query. When you add data, SQL Data Warehouse distributes the rows to your Compute nodes. The Compute nodes are the workers that run the parallel queries on your data. After processing, they pass the results back to the Control node. To finish the query, the Control node aggregates the results and returns the final result.

The number of Compute nodes can change according to the backend service workload and the computing capacity (DWUs) you specify for the data warehouse. When the number of Compute nodes changes, the number of distributions per Compute node also changes.

For example, if you select DWU 100, then the 60 distributions are allocated to a single compute node working with a control node.

Should you increase the DWU to 200, then the 60 distributions are divided between two compute nodes working with a control node.

Service Objective

Service objective is the generic name that describes "Data Warehouse Units (DWU)" in Azure SQL Data Warehouse and "Database Transaction Units (DTU)" in Azure SQL Database. In Azure SQL Data Warehouse the service objective is used to define the scale of the data warehouse, with a minimum of DW100 and a maximum of DW6000 at the time of writing. The higher the DWU, the more nodes are created to distribute the 60 distributions, increasing the performance capability of the Azure SQL Data Warehouse.

Table and Table Geometries

To create a table within Azure SQL Data Warehouse, we use code very similar to the CREATE TABLE statement for an on-premises SQL Server. The difference is that we can (or need to) include a WITH clause that specifies SQL Data Warehouse-specific options. The following example creates a date dimension table.

CREATE TABLE [dbo].[Dates]
(
       [Date] [datetime2](3) ,
       [DateKey] [nchar](8),
       
       
       
       [WeekDay] [nvarchar](100) ,
       [Day Of Month] [decimal](38, 0)
)
WITH
(
       DISTRIBUTION = ROUND_ROBIN
);

You can see from the code that you create a table by defining a column name and data type, and you can also specify the nullability of a column. What is different is the DISTRIBUTION option contained in the WITH clause. The distribution option defines how the data will be stored across the distributions in the Azure SQL Data Warehouse, and there are two options:

DISTRIBUTION = ROUND_ROBIN | HASH(column_name)

Round_robin distribution

A distribution of ROUND_ROBIN will evenly distribute the data across all 60 distributions. The benefit of this approach is that the data is evenly spread, but there is no control over where the data is stored. The round robin distribution works very well when initially loading data into a staging table. By default, tables are round robin distributed, which makes it easy for users to get started creating tables without having to decide how their tables should be distributed. Round robin tables may perform sufficiently for some workloads, but in most cases selecting a distribution column will perform much better.

Hash distribution

Selecting a distribution column is done when using the hash distribution; a column must be specified for the hash. Using a hash algorithm to distribute your table data can improve performance for many scenarios by reducing data movement at query time. Hash distributed tables are tables that are divided between the distributions using a hashing algorithm on a single column that you select.

The distribution column determines how the data is divided across your distributed databases. The hash function uses the distribution column to assign rows to distributions. The hashing algorithm and resulting distribution are deterministic: the same value with the same data type will always hash to the same distribution.

Therefore, the column selected for the hash distribution should be chosen carefully. Suppose a column is chosen on a 10-million-row table for the hash distribution, let's say the City column, and 98% of the records in the City column have a value of Manchester. There would then be an uneven spread of the data across the distributions, and this data skew towards Manchester can adversely affect the performance of the Azure SQL Data Warehouse. So you would want to select a column that has a relatively even distribution of data and that won't be updated. In fact, our friends in the Azure CAT team have written a great article on loading patterns in Azure SQL Data Warehouse, and I highly recommend that you read it.
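To illustrate, a hash distributed table is declared in the same way as the round robin example, but with a column named in the DISTRIBUTION clause. The table and column names below (FactEquityTrade and TradeKey) are purely hypothetical, chosen to represent a column with a reasonably even spread of values:

-- Hash distributed table; rows are assigned to distributions by hashing TradeKey
CREATE TABLE [dbo].[FactEquityTrade]
(
       [TradeKey] bigint NOT NULL,
       [DateKey] decimal(38, 0) NOT NULL,
       [City] nvarchar(100) NULL,
       [TradeValue] decimal(18, 2) NULL
)
WITH
(
       DISTRIBUTION = HASH([TradeKey])
);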

So, the choice of the distribution method, and of the column to be used when a hash distribution is chosen, is vitally important to ensure a performant data warehouse. Choose your columns carefully.

See how you can create a table in the following video.

Other Table Options

The logical architecture for the creation of tables still largely remains the same. You can still create star schema, or snowflake schemas with fact and dimension tables.

The CREATE TABLE statement in Azure SQL Data Warehouse also supports indexing and partitioning. Interestingly, there are subtle variations with partitioning in particular, but I will save those for another blog. Partitioning can be important in aiding the manageability of data, by merging and switching partitions to aid the fast loading of data.
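Purely as a taster ahead of that blog, the partition scheme is declared inside the same WITH clause as the distribution. The table, columns and boundary values below are illustrative only, a sketch rather than a recommended design:

-- Partitioned, hash distributed table; partitions are defined by DateKey boundary values
CREATE TABLE [dbo].[FactSales]
(
       [SaleKey] bigint NOT NULL,
       [DateKey] int NOT NULL,
       [SalesAmount] decimal(18, 2) NULL
)
WITH
(
       DISTRIBUTION = HASH([SaleKey]),
       PARTITION ( [DateKey] RANGE RIGHT FOR VALUES (20170101, 20170401, 20170701, 20171001) )
);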

Statistics also play an important role in helping to build good query execution plans. However, it should be noted that, at the time of writing, statistics are not automatically updated on the control node. So it is best practice to create statistics on the specific columns that aid querying, so that the information is up to date for the query optimization process.
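For example, single-column statistics can be created on the columns most commonly used for joins, filters and grouping. The statistics names and column choices below are illustrative only:

-- Create statistics on commonly queried columns of the Dates table
CREATE STATISTICS stat_Dates_DateKey ON [dbo].[Dates] ([DateKey]);
CREATE STATISTICS stat_Dates_Year ON [dbo].[Dates] ([Year]);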

Machine Learning deck and SQL DW Lab


Last week I was visiting Birmingham and spent an hour discussing the Cortana Intelligence Suite and the Team Data Science Process, focusing specifically on Machine Learning.

Simon Whiteley from Adatis also gave an excellent presentation on SQL DW. I promised the attendees a copy of the slides and a SQL DW Lab which I am now including in this post in the following links:

SQL DW Lab

Machine Learning 60 – 90 min Presentation

I hope this is of use to you.

Thanks

Chris

ADW3: Creating Databases in Azure SQL Data Warehouse


Creating databases within Azure SQL Data Warehouse uses very familiar T-SQL language constructs. However, there are subtle differences in the T-SQL between creating an on-premises SQL Server database and an Azure SQL Data Warehouse.

Let's start with the CREATE DATABASE statement itself. This can be performed once you have connected to the SQL Server instance in Visual Studio, by running the following command:

CREATE DATABASE EquityDB COLLATE SQL_Latin1_General_CP1_CI_AS
(
    EDITION           = 'DataWarehouse',
    SERVICE_OBJECTIVE = 'DW100',
    MAXSIZE           = 1024 GB
);

You will note that the command does not define the creation of data files and log files. This is not required when creating a database in SQL Data Warehouse, as the data will be distributed across all the available distributions, and the log is created under the hood.

You will see that there is a COLLATE and MAXSIZE option available that works in the same way as the options for on premise SQL Server. At the time of writing, the maximum size of a Data Warehouse is 245760 GB.

However, there is an EDITION clause and a SERVICE_OBJECTIVE clause. These are options that are not available with on-premises SQL Server, but are specific to the Azure versions of SQL Server services such as SQL Database and SQL Data Warehouse. When creating an Azure SQL Data Warehouse, you must specify the edition as DataWarehouse, and the SERVICE_OBJECTIVE can go to a maximum of DW6000. This determines the compute level used by the data warehouse.

You can also create a database using a PowerShell script as shown in the following code:

# CREATE A SQL DATABASE
$resourceGroupName = "cto_ads_prep_rg"
$serverName = "ctomsftadssqlsrv"
$databaseName = "ContosoRetailDW"
$databaseEdition = "DataWarehouse"
$RequestedServiceObjectiveName = "DW400"

New-AzureRmSqlDatabase `
    -ResourceGroupName $resourceGroupName `
    -RequestedServiceObjectiveName $RequestedServiceObjectiveName `
    -ServerName $serverName `
    -DatabaseName $databaseName `
    -Edition $databaseEdition

The following video shows you how to create a database in Azure SQL Data Warehouse.

It is important to note that within Azure SQL Data Warehouse, cross-database queries are not permitted. In scenarios where you would typically do this, your design should change to incorporate database schemas, holding data within a schema that you would previously have held in a separate database. This is performed using the CREATE SCHEMA T-SQL statement, which works in the same way as in on-premises SQL Server.
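As a simple sketch, a schema can take the place of what would have been a separate staging database; the schema name stg and the table below are purely illustrative:

-- Create a schema to hold what would previously have lived in a separate staging database
CREATE SCHEMA [stg];
GO

-- Objects are then created within that schema instead of a second database
CREATE TABLE [stg].[DimDate]
(
       [DateKey] int NOT NULL,
       [Date] datetime2(3) NULL
)
WITH
(
       DISTRIBUTION = ROUND_ROBIN
);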

With the database created, you can then proceed to create tables. That’s the next ADW blog….

 

ADW2: Scalability of Azure SQL Data Warehouse.

A quick one today. In the first post on Azure SQL Data Warehouse, we introduced the key business benefits of the service, namely the ability to scale on demand and to pause and restart the service at will.

The ability to scale a SQL Data Warehouse on demand provides several benefits:

  1. You can increase the compute of SQL Data Warehouse during periods of heavy demand. Heavy demand can include a period of sustained data loads, or periods of heavy querying. You can scale up the compute during these peak periods, and then reduce the compute when the demand is not there.
  2. You can control the cost of the data warehouse. The ability to change the compute at will also enables you to control the costs of the data warehouse. Pricing can be found at the following page. Note that the price is broken down into the cost of the compute and the cost of storing the data.

There are several methods to scale a SQL Data Warehouse. The following video shows you how to scale using the Azure Portal. But you can also scale the data warehouse using the following TSQL code:

ALTER DATABASE ContosoRetailDW
MODIFY (service_objective = 'DW100');

Finally, PowerShell could be used to scale the data warehouse:

Set-AzureRmSqlDatabase `
    -ResourceGroupName "RG_name" `
    -ServerName "SRV_name" `
    -DatabaseName "DB_name" `
    -RequestedServiceObjectiveName "DW100"
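If you prefer to confirm the current scale from T-SQL rather than the portal, the sys.database_service_objectives catalog view can be queried on the logical server; treat the exact column names below as an assumption to verify against the documentation for your version:

-- Check the edition and current service objective of each database on the logical server
SELECT db.[name], slo.edition, slo.service_objective
FROM sys.database_service_objectives AS slo
JOIN sys.databases AS db
    ON slo.database_id = db.database_id;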

 

Your organization may face a period where no access is required to the data held in a data warehouse. Perhaps during the holiday season, your organization informs the staff that IT maintenance is being performed over the holidays and that access to data and network resources is not guaranteed. In this case, you can pause the Azure SQL Data Warehouse. This means that you won’t be charged for the compute aspect of the warehouse, only the storage aspect.  You can pause the service using the Azure Portal as shown in the following video, or you can use the following PowerShell code:

Suspend-AzureRmSqlDatabase `
    -ResourceGroupName "RG_name" `
    -ServerName "SRV_name" `
    -DatabaseName "DB_name"

 

You can resume the Azure SQL Data Warehouse with similar PowerShell Code:

Resume-AzureRmSqlDatabase `
    -ResourceGroupName "RG_name" `
    -ServerName "SRV_name" `
    -DatabaseName "DB_name"

 

These simple steps can be used to help you scale and pause the service when you need to, giving you control of the performance and the costs of the solution.