Machine Learning deck and SQL DW Lab


Last week I was visiting Birmingham and spent an hour discussing the Cortana Intelligence Suite and the Team Data Science Process, focusing specifically on Machine Learning.

Simon Whiteley from Adatis also gave an excellent presentation on SQL DW. I promised the attendees a copy of the slides and a SQL DW lab, which I am now including in this post via the following links:


Machine Learning 60 – 90 min Presentation

I hope this is of use to you.



Cortana and Power BI integration

Thursday found me in Manchester with the Power BI user group to discuss Cortana integration with Power BI and the Cortana Intelligence Suite. The group can be found here if you want to attend any of their future events. It is well worth it.

What is Cortana?

One of the most common questions that we are asked is what is the difference between Cortana and Cortana Intelligence. Cortana is the personal digital assistant, or as my children tell me, she is my other friend! Cortana will help you find things on your PC, manage your calendar, track packages, find files, chat with you and even tell some funny jokes. The more you use Cortana, the more personalised your experience will be. I have to say that I rely on Cortana heavily in both my personal and professional life on a daily basis and she helps me get my work done on time.

What is Cortana Intelligence?


Cortana Intelligence is different. Cortana Intelligence is a suite of technologies that provides a platform to deliver data science and/or artificial intelligence solutions. Using this in conjunction with the Team Data Science Process, you have both the platform and the process to deliver successful solutions.

So how does Cortana get involved with this?

Cortana is still the personal digital assistant, found in the intelligence section of the Cortana Intelligence Suite. You will also notice from the graphic above that Power BI provides the dashboards and visualizations that surface the data that is orchestrated and stored in the Big Data Stores.

It is here that Cortana can be integrated with Power BI, so that you can call reports from your Power BI service from either your Windows 10 desktop or your Windows 10 phone.

So how can we make this happen?

First of all, you need the right version of Windows. The minimum version required is 1511, and you can run the winver command in Windows to confirm that you have it. If you do not have the right version, you will have to upgrade to use this feature.
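If you prefer to check from a script rather than the winver dialog, the following PowerShell one-liner is a minimal sketch; it reads the ReleaseId value from the registry, which reports the Windows version number (for example, 1511):

# Read the Windows release number (e.g. 1511) from the registry
(Get-ItemProperty "HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion").ReleaseId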


The next step is to create a work account for both the Windows machine and for Power BI. From the Windows machine's perspective, the account you sign in with should match the account used to sign in to Power BI.

However, if this is not the case, you can add additional accounts within Windows using Settings, followed by Access Work or School. From here you can click Connect and provide the credentials of a secondary account. Be mindful, however, that there could be Domain Administrator restrictions on this operation, so it is worth checking and asking permission before you try it.


But there’s got to be more to it than that, right?

Yes. The steps above are essentially the provisioning work that is required to get the ball rolling. They are one-off operations to get the integration started. There is still work to do within Power BI for both datasets and reports.

Within the dataset, you have to allow Cortana access to the dataset. There is a checkbox for this setting in the data refresh properties of the dataset itself. You can also augment Cortana's querying by adding your own custom questions to the Featured Q&A questions section.


But there is also work to be done within the individual reports. Within Power BI Desktop, in the report page properties, you must modify the Page Size property in the Format section and set the Type drop-down to "Cortana". In addition, you can use the Page Information section to add alternate names for the report.


After these settings have been defined, you can publish the report to your Power BI workspace, and after a few minutes you can use Cortana on either your laptop or your mobile to ask questions. You're all set!

If you want to try this yourself, here are a couple of resources that will get you started:

  1. Enable Power BI Integration with Cortana

  2. Use Power BI to create a custom Answer Page for Cortana

ADW4: Working with tables in SQL DW


Before we get into the implementation of tables, there are several concepts that need to be understood so that you can appreciate the ramifications of your design choices for tables. These concepts include:

  • Distributions
  • Nodes
  • Service Objective


SQL Data Warehouse is a Massively Parallel Processing (MPP) system. Data is spread across multiple locations in the MPP system, referred to as distributions, and this is fundamental to running queries in parallel to achieve high query performance.

When a query runs in parallel, each of the 60 distributions performs a SQL query on its portion of the data. SQL Data Warehouse uses SQL Database to run the query. This shared-nothing architecture is fundamental to achieving scale-out parallel computing. Every SQL Data Warehouse has a fixed number of 60 distributions.
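If you want to see the distributions for yourself, the following query is a minimal sketch that counts the entries in the sys.pdw_distributions catalog view when connected to the data warehouse; it should return 60:

-- Count the distributions in the data warehouse (expect 60)
SELECT COUNT(*) AS distribution_count
FROM sys.pdw_distributions;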


There are two types of nodes in Azure SQL Data Warehouse: Control nodes and Compute nodes.

The Control node manages and optimizes queries. It is the front end that interacts with all applications and connections. In SQL Data Warehouse, the Control node is powered by SQL Database, and connecting to it looks and feels the same. Under the surface, the Control node coordinates all the data movement and computation required to run parallel queries on your distributed data across the distributions. When you submit a T-SQL query to SQL Data Warehouse, the Control node transforms it into separate queries that run on each Compute node in parallel.

The Compute nodes serve as the power behind SQL Data Warehouse. They are SQL Databases that store your data and process your query. When you add data, SQL Data Warehouse distributes the rows to your Compute nodes. The Compute nodes are the workers that run the parallel queries on your data. After processing, they pass the results back to the Control node. To finish the query, the Control node aggregates the results and returns the final result.

The number of Compute nodes can change according to the backend service workload and the computing capacity (DWUs) you specify for the data warehouse. When the number of Compute nodes changes, the number of distributions per Compute node also changes.

For example, if you select DWU 100, then the 60 distributions are allocated to a single compute node working with a control node.

Should you increase the DWU to 200, then the 60 distributions are divided between two compute nodes working with a control node.
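You can inspect the nodes behind your data warehouse at its current scale. The following query is a minimal sketch using the sys.dm_pdw_nodes dynamic management view, which lists the control and compute nodes:

-- List the control and compute nodes for the current service objective
SELECT type, name
FROM sys.dm_pdw_nodes
ORDER BY type, name;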

Service Objective

Service objective is the generic name for the performance level: "Data Warehouse Units (DWU)" in Azure SQL Data Warehouse, and "Database Transaction Units (DTU)" in Azure SQL Database. In Azure SQL Data Warehouse, the service objective is used to define the scale of the data warehouse, from a minimum of DW100 to a maximum of DW6000 at the time of writing. The higher the DWU, the more nodes are created to host the 60 distributions, increasing the performance capability of the Azure SQL Data Warehouse.
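To check the edition and service objective that a data warehouse is currently running at, you can query the master database of the logical server. This is a minimal sketch using the sys.database_service_objectives view:

-- Run against the master database on the logical server
SELECT db.name, ds.edition, ds.service_objective
FROM sys.database_service_objectives AS ds
JOIN sys.databases AS db ON ds.database_id = db.database_id;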

Table and Table Geometries

To create a table within Azure SQL Data Warehouse, we use code very similar to the CREATE TABLE statement for an on-premises SQL Server. The difference is that we can (or need to) include a WITH clause that specifies options specific to Azure SQL Data Warehouse. The following example creates a date dimension table:

CREATE TABLE [dbo].[Dates]
(
       [Date] [datetime2](3) NOT NULL,
       [DateKey] [nchar](8) NOT NULL,
       [WeekDay] [nvarchar](100) NULL,
       [Day Of Month] [decimal](38, 0) NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN );

You can see from the code that you create a table by defining a column name and a data type, and you can also specify the nullability of a column. What is different is the WITH clause, which contains the DISTRIBUTION option. The DISTRIBUTION option defines how the data will be stored across the distributions in the Azure SQL Data Warehouse, and there are two options:


Round_robin distribution

A distribution of ROUND_ROBIN will evenly distribute the data across all 60 distributions. The benefit of this approach is that the data is evenly spread, but there is no control over where the data is stored. The round robin distribution works very well when initially loading data into a staging table. By default, tables are round robin distributed, which makes it easy for users to get started creating tables without having to decide how their tables should be distributed. Round robin tables may perform sufficiently for some workloads, but in most cases selecting a distribution column will perform much better.
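As a sketch of the staging pattern, a load table might be created as a round robin heap; the [dbo].[Stage_Sales] table and its columns here are hypothetical:

CREATE TABLE [dbo].[Stage_Sales]
(
       [SaleKey] [int] NOT NULL,
       [CustomerKey] [int] NOT NULL,
       [Amount] [decimal](18, 2) NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );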

Hash distribution

Selecting a distribution column is done when using the hash distribution; in this case, a column must be specified for the hash. Hash distributed tables are tables that are divided between the distributions using a hashing algorithm on a single column that you select. Using a hash algorithm to distribute your table data can improve performance for many scenarios by reducing data movement at query time.
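For example, a large fact table might be hash distributed on a customer key. This is a minimal sketch; the [dbo].[FactSales] table and its columns are hypothetical:

CREATE TABLE [dbo].[FactSales]
(
       [SaleKey] [int] NOT NULL,
       [CustomerKey] [int] NOT NULL,
       [DateKey] [int] NOT NULL,
       [Amount] [decimal](18, 2) NULL
)
WITH ( DISTRIBUTION = HASH([CustomerKey]) );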

The distribution column is what determines how the data is divided across your distributed databases. The hash function uses the distribution column to assign rows to distributions. The hashing algorithm and resulting distribution are deterministic: the same value with the same data type will always hash to the same distribution.

Therefore, the column that is selected for the hash distribution should be chosen carefully. Suppose the City column is chosen for the hash distribution on a 10 million row table, and 98% of the records have a City value of Manchester; there would be an uneven spread of the data across the distributions. This data skew towards Manchester can adversely affect the performance of the Azure SQL Data Warehouse. So you want to select a column that has a relatively even distribution of data and that won't be updated. In fact, our friends in the Azure CAT team have written a great article on loading patterns in Azure SQL Data Warehouse, and I highly recommend that you read it.
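Once data has been loaded, you can check for skew by comparing the rows held in each distribution. A minimal sketch, run against the hypothetical fact table from above:

-- Reports rows and space used per distribution; large variations indicate data skew
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');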

So, the choice of the distribution method, and the column to be used when a hash distribution is used is vitally important to ensure a performant Data Warehouse. Choose your columns carefully.

See how you can create tables in the following video.

Other Table Options

The logical architecture for the creation of tables largely remains the same. You can still create star schemas or snowflake schemas with fact and dimension tables.

The CREATE TABLE statement in Azure SQL Data Warehouse also supports indexing and partitioning. Interestingly, there are subtle variations with partitioning in particular, but I will save those for another blog. Partitioning can be important in aiding the manageability of data: merging and switching partitions aids the fast loading of data.
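To illustrate how these options combine in the WITH clause, the following sketch (again with hypothetical names) creates a hash distributed, clustered columnstore table partitioned on a date key:

CREATE TABLE [dbo].[FactSales_Partitioned]
(
       [SaleKey] [int] NOT NULL,
       [CustomerKey] [int] NOT NULL,
       [DateKey] [int] NOT NULL,
       [Amount] [decimal](18, 2) NULL
)
WITH
(
       DISTRIBUTION = HASH([CustomerKey]),
       CLUSTERED COLUMNSTORE INDEX,
       PARTITION ( [DateKey] RANGE RIGHT FOR VALUES (20170101, 20170401, 20170701) )
);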

Statistics also play an important role in producing good query execution plans. However, it should be noted that, at the time of writing, statistics are not automatically updated on the control node. So it is best practice to create and update statistics on the specific columns that aid querying, so that the information is up to date for the query optimization process.
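For example, single-column statistics can be created on the columns used in joins and filters, and refreshed after significant loads. A minimal sketch against the hypothetical fact table from earlier:

-- Create statistics on a column that is frequently joined or filtered on
CREATE STATISTICS [stat_FactSales_CustomerKey] ON [dbo].[FactSales] ([CustomerKey]);

-- Refresh statistics after a significant data load
UPDATE STATISTICS [dbo].[FactSales];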

ADW3: Creating Databases in Azure SQL Data Warehouse


Creating databases within Azure SQL Data Warehouse uses very familiar T-SQL constructs. However, there are subtle differences in the T-SQL between creating an on-premises SQL Server database and an Azure SQL Data Warehouse.

Let's start with the CREATE DATABASE statement itself. This can be performed once you have connected to the SQL Server instance in Visual Studio, by running the following command:


CREATE DATABASE ContosoRetailDW
COLLATE SQL_Latin1_General_CP1_CI_AS
(      EDITION                    = 'DataWarehouse'
,      SERVICE_OBJECTIVE          = 'DW100'
,      MAXSIZE                    = 1024 GB
);


You will note that the command does not define the creation of data files and log files. This is not required when creating a database in SQL Data Warehouse, as the data will be distributed across all the available distributions, and the log is created under the hood.

You will see that there are COLLATE and MAXSIZE options available that work in the same way as the options for on-premises SQL Server. At the time of writing, the maximum size of a data warehouse is 245,760 GB.

However, there are also an EDITION clause and a SERVICE_OBJECTIVE clause. These are options that are not available with SQL Server, but are specific to Azure versions of SQL Server services such as SQL Database and SQL Data Warehouse. When creating an Azure SQL Data Warehouse, you must specify the edition as 'DataWarehouse', and the SERVICE_OBJECTIVE can go up to a maximum of DW6000. This determines the compute level used by the data warehouse.

You can also create a database using a PowerShell script as shown in the following code:


$resourceGroupName = "cto_ads_prep_rg"
$serverName = "ctomsftadssqlsrv"
$databaseName = "ContosoRetailDW"
$databaseEdition = "DataWarehouse"
$RequestedServiceObjectiveName = "DW400"

New-AzureRmSqlDatabase `
   -ResourceGroupName $resourceGroupName `
   -RequestedServiceObjectiveName $RequestedServiceObjectiveName `
   -ServerName $serverName `
   -DatabaseName $databaseName `
   -Edition $databaseEdition

The following video shows you how to create a database in Azure SQL Data Warehouse.

It is important to note that cross-database queries are not permitted within Azure SQL Data Warehouse. In scenarios where you would typically do this, your design should change to incorporate database schemas, holding data within a schema that you would previously have held in a separate database. This is done using the CREATE SCHEMA T-SQL statement, which works in the same way as in on-premises SQL Server.
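For example, where you might previously have used separate staging and presentation databases, you can create schemas within the one database instead; the schema names here are just illustrative:

CREATE SCHEMA [staging];
GO
CREATE SCHEMA [presentation];
GO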

With the database created, you can then proceed to create tables. That’s the next ADW blog….


ADW2: Scalability of Azure SQL Data Warehouse.

A quick one today. In the first post on Azure SQL Data Warehouse, we introduced the key business benefits of the service: namely, the ability to scale on demand, and to pause and restart the service at will.

The ability to scale a SQL Data Warehouse on demand provides several benefits:

  1. You can increase the compute of SQL Data Warehouse during periods of heavy demand. Heavy demand can include a period of sustained data loads, or periods of heavy querying. You can scale up the compute during these peak periods, and then reduce the compute when the demand is not there.
  2. You can control the cost of the data warehouse. The ability to change the compute at will also enables you to control the costs of the data warehouse. Pricing can be found at the following page. Note that the price is broken down into the cost of the compute and the cost of storing the data.

There are several methods to scale a SQL Data Warehouse. The following video shows you how to scale using the Azure Portal, but you can also scale the data warehouse using the following T-SQL code:


ALTER DATABASE ContosoRetailDW
MODIFY (SERVICE_OBJECTIVE = 'DW100');

Finally, PowerShell can be used to scale the data warehouse:

Set-AzureRmSqlDatabase `
    -ResourceGroupName "RG_name" `
    -ServerName "SRV_name" `
    -DatabaseName "DB_name" `
    -RequestedServiceObjectiveName "DW100"


Your organization may face a period where no access is required to the data held in a data warehouse. Perhaps during the holiday season, your organization informs the staff that IT maintenance is being performed over the holidays and that access to data and network resources is not guaranteed. In this case, you can pause the Azure SQL Data Warehouse. This means that you won’t be charged for the compute aspect of the warehouse, only the storage aspect.  You can pause the service using the Azure Portal as shown in the following video, or you can use the following PowerShell code:

Suspend-AzureRmSqlDatabase `
    -ResourceGroupName "RG_name" `
    -ServerName "SRV_name" `
    -DatabaseName "DB_name"


You can resume the Azure SQL Data Warehouse with similar PowerShell Code:

Resume-AzureRmSqlDatabase `
    -ResourceGroupName "RG_name" `
    -ServerName "SRV_name" `
    -DatabaseName "DB_name"


These simple steps can be used to help you scale and pause the service when you need to, giving you control of the performance and the costs of the solution.



ADW1: Data Warehousing in the Cloud.

For a few months now, we have had a SQL Server edition in the cloud known as Azure SQL Data Warehouse. This service enables you to provision a data warehouse instance in just 3 to 5 minutes. Its main benefit is that it allows you to scale your compute in seconds to keep up with your organization's data demands. So, if you are performing heavy data loads, you can maximise the amount of compute for the duration of a data load, and then scale the compute back down once the load is complete. Furthermore, if you don't require access to your data, you can pause the compute so that you keep control of your costs while still retaining the data.

These are some of the business benefits of Azure SQL Data Warehouse, but how does it work?

SQL Data Warehouse is a massively parallel processing (MPP) distributed database system. Behind the scenes, SQL Data Warehouse spreads your data across many shared-nothing storage and processing units. The data is stored in a Premium locally redundant storage layer on top of which dynamically linked Compute nodes execute distributed queries. SQL Data Warehouse takes a “divide and conquer” approach to running loads and complex queries. Requests are received by a Control node, optimized for distribution, and then passed to Compute nodes to do their work in parallel as shown in the following graphic.


Azure SQL Data Warehouse is ideal for analytical workloads, whether it is a small workload of gigabytes or a large workload of petabytes. It can also interface with unstructured data stored in an Azure Blob store. If transactional consistency and high concurrency are your requirements, then Azure SQL Data Warehouse is not the service to use; SQL Database would be a more appropriate choice.

It only takes minutes to get this up and running, and you can either do this within the Azure Portal, or alternatively you can use PowerShell.

At this point I will give you a choice: you can either watch the following 10 minute video that demonstrates how to set up an instance of Azure SQL Data Warehouse, or you can continue to read. If you're hardcore, why not do both! But I understand your time is precious.

OK, so there is some information that you need to have to hand before creating an Azure SQL Data Warehouse:

  • The Azure subscription to use
  • A resource group name and location
  • A SQL Server (logical server) name and admin credentials
  • A database name, edition and service objective (DWU)

Armed with this information, you can then go ahead and create the SQL Data Warehouse Instance.

Alternatively, you can use the same information in a PowerShell script to sign in to an Azure subscription, create a resource group, create a SQL Server instance and, optionally, a database. The following PowerShell code creates a resource group named cto_ads_prep_rg located in North Europe using the New-AzureRmResourceGroup cmdlet. The script then creates a SQL Server instance named ctomsftadssqlsrv with an admin account named ctesta-oneill using the New-AzureRmSqlServer cmdlet.


##                                               PART I: Creating the Azure SQL Server                                             ##


# Sign in to Azure and set the Azure subscription to work with
Login-AzureRmAccount

$SubscriptionId = "XXXXXXXX-xxXX-XXxx-XXXX-xxxxxxxxxxxx"
Set-AzureRmContext -SubscriptionId $SubscriptionId



$resourceGroupName = "cto_ads_prep_rg"
$rglocation = "North Europe"

New-AzureRmResourceGroup -Name $resourceGroupName -Location $rglocation



$serverName = "ctomsftadssqlsrv"
$serverVersion = "12.0"
$serverLocation = "North Europe"

$serverAdmin = "ctesta-oneill"
$serverPassword = "P@ssw0rd"
$securePassword = ConvertTo-SecureString -String $serverPassword -AsPlainText -Force
$serverCreds = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $serverAdmin, $securePassword

$sqlDbServer = New-AzureRmSqlServer -ResourceGroupName $resourceGroupName -ServerName $serverName -Location $serverLocation -ServerVersion $serverVersion -SqlAdministratorCredentials $serverCreds


You can also use PowerShell to configure firewall settings on the SQL Server instance using the New-AzureRmSqlServerFirewallRule cmdlet. This can be performed in the Azure Portal as well.



$ip = (Test-Connection -ComputerName $env:COMPUTERNAME -Count 1 -Verbose).IPV4Address.IPAddressToString

$firewallRuleName = 'Client IP address'
$firewallStartIp = $ip
$firewallEndIp = $ip

$fireWallRule = New-AzureRmSqlServerFirewallRule -ResourceGroupName $resourceGroupName -ServerName $serverName -FirewallRuleName $firewallRuleName -StartIpAddress $firewallStartIp -EndIpAddress $firewallEndIp


With the firewall rules defined, you will then be able to access the Azure SQL Server using tools such as Visual Studio and SQL Server Management Studio, where you can run T-SQL scripts to create and manage databases. This can also be done in PowerShell using the New-AzureRmSqlDatabase cmdlet. The following code creates a data warehouse named ContosoRetailDW.



$databaseName = "ContosoRetailDW"
$databaseEdition = "DataWarehouse"
$RequestedServiceObjectiveName = "DW400"

$sqlDatabase = New-AzureRmSqlDatabase -ResourceGroupName $resourceGroupName -RequestedServiceObjectiveName $RequestedServiceObjectiveName -ServerName $serverName -DatabaseName $databaseName -Edition $databaseEdition

Once the database has been created, you can then look at how to scale and pause a database in Azure SQL Data Warehouse, but that will be the next blog.



LEX1: Azure SQL Data Warehouse. Add to your Big Data specialist skills


Whether you’ve outgrown your on-premises datacenter or need to extend a hybrid scenario, Azure SQL Data Warehouse can help. The great news is that, unlike data warehousing of the past, you can spin up this cloud-based massively parallel processing solution in just a few minutes. Push a button to view the contents in Power BI. You can even pause compute to help control costs. Get the details, in “Delivering a Data Warehouse in the Cloud,” a self-paced course now available on edX.

Experts Chris Randall and Chris Testa-O’Neill dive deep into the specifics of this breakthrough technology. Learn how to deploy, design, and load data from a variety of sources. Plus, explore PolyBase for Big Data and look at compressed in-memory indexes. Design tables and indexes to efficiently distribute data in tables across many nodes, secure and recover data, and much more.

If you’re adding to your Big Data specialist skills, this course is a must. We’ve got lots of other Big Data courses, too. Look at Analyzing and Visualizing Data with Excel or with Power BI. Explore Querying Data with Transact-SQL. Or find out about Processing Big Data with Azure Data Lake Analytics or with Hadoop in Azure HDInsight. You can even Orchestrate Big Data with Azure Data Factory. Finally, learn about Analyzing Big Data with Microsoft R Server and Implementing Predictive Models with Spark in Azure HDInsight.

All of these Big Data concepts and technologies courses are free, or you can pay $49 for a verified certificate, which offers proof that you’ve successfully completed the online course and which you can share with friends, employers, and others. Spend a few hours each week (for four to six weeks, depending on the course) with these videos, demos, hands-on labs, and assessments. Plus, get additional resources and content for further study. Roll up your sleeves, and get started today!

Sign up now!