Welcome Aboard!

“With respect, I have to disagree with you” was one of my responses in a recent interview that led me to my current role. I was on my seventh interview, and I was explaining to one of my future teammates why I didn’t regard myself as a data scientist.

At the end of the day, I had spent well over 16 years working with SQL Server, specialising in Business Intelligence and descriptive analytics. If I were to call myself a data scientist, I would be regarded as a pretty bad one. The rhetoric coming out of the industry told me that I had to look into this field, but that I would only be an apprentice data scientist at best.

That was my thought anyway, but it turns out that a SQL Server Data Professional already possesses skills that are extremely useful to data science, particularly in the realm of “Feature Engineering” – I will talk at length about this in another post. Unbeknown to me, during numerous interviews I described techniques and technologies I had used in the world of Data Migrations, Data Warehousing and Cubes that resonated with the resident data scientists. I received the call soon after saying “You’re hired.”

The last 18 months have seen me immersed in the world of cloud migrations, and I have presented content on the Cortana Intelligence Suite (CIS) at many conferences. At each conference the same themes keep recurring as I am questioned about the CIS sessions I run. Questions such as “How can I keep pace with all these changes?” and “Are there any links to information that can help me understand this?”.

So if you are feeling “Where do I start?!”, the good news is that you are not alone. This blog, along with the other resources that we are providing, is intended to help you augment your existing skills with the skills and technologies required to implement Cortana Intelligence solutions.

The first thing you should do is browse our team’s portal, which has a list of training events around the globe as well as video materials. Content will be added to this site fairly regularly in the coming months. I would also recommend that you visit Buck Woody’s blog, Backyard Data Science, where the focus is on the processes, the statistics and the technology that fit around the data science space. Ryan Swanstrom’s blog, Data Science 101, also provides useful insights into the world of data science. Let me know in the comments below what areas you would like to look at and I will get them on the blog as soon as I can.

Welcome aboard! I hope you will find future articles useful as we both learn how to navigate this new and exciting world!

 

ADW3: Creating Databases in Azure SQL Data Warehouse

AzureSQLDW

Creating databases within Azure SQL Data Warehouse uses very familiar T-SQL language constructs. However, there are subtle differences in the T-SQL between creating an on-premises SQL Server database and an Azure SQL Data Warehouse database.

Let’s start with the CREATE DATABASE statement itself. Once you have connected to the SQL Server instance in Visual Studio, you can run the following command:

CREATE DATABASE EquityDB COLLATE SQL_Latin1_General_CP1_CI_AS
(   EDITION           = 'DataWarehouse'
,   SERVICE_OBJECTIVE = 'DW100'
,   MAXSIZE           = 1024 GB
);

You will note that the command does not define the creation of data files and log files. This is not required when creating a database in SQL Data Warehouse, as the data will be distributed across all the available distributions, and the log is created under the hood.

You will see that there are COLLATE and MAXSIZE options available, which will be familiar from on-premises SQL Server. At the time of writing, the maximum size of a SQL Data Warehouse is 245,760 GB.

However, there are also EDITION and SERVICE_OBJECTIVE clauses. These options are not available in on-premises SQL Server, but are specific to Azure services such as SQL Database and SQL Data Warehouse. When creating an Azure SQL Data Warehouse, you must specify the edition as DataWarehouse, and the SERVICE_OBJECTIVE can go up to a maximum of DW6000. The service objective determines the compute level used by the data warehouse.

You can also create a database using a PowerShell script as shown in the following code:

# CREATE A SQL DATABASE
$resourceGroupName = "cto_ads_prep_rg"
$serverName = "ctomsftadssqlsrv"
$databaseName = "ContosoRetailDW"
$databaseEdition = "DataWarehouse"
$RequestedServiceObjectiveName = "DW400"

New-AzureRmSqlDatabase `
    -ResourceGroupName $resourceGroupName `
    -RequestedServiceObjectiveName $RequestedServiceObjectiveName `
    -ServerName $serverName `
    -DatabaseName $databaseName `
    -Edition $databaseEdition

The following video shows you how to create a database in Azure SQL Data Warehouse.

It is important to note that within Azure SQL Data Warehouse, cross-database queries are not permitted. In scenarios where you would typically query across databases, your design should change to use database schemas to hold the data that you would otherwise hold in separate databases. This is performed using the CREATE SCHEMA T-SQL statement, which works in the same way as in on-premises SQL Server.
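As a minimal sketch of this pattern, the following PowerShell uses Invoke-Sqlcmd (from the SQL Server PowerShell module) to create two schemas that stand in for what might previously have been two separate databases; the server, credentials and schema names here are purely illustrative:

# Placeholder connection details - replace with your own server and credentials
$server   = "yourserver.database.windows.net"
$database = "EquityDB"
$user     = "youradmin"
$password = "yourStrongPassword"

# One schema per logical "database" in the old design.
# CREATE SCHEMA must be the only statement in its batch, hence two separate calls.
Invoke-Sqlcmd -ServerInstance $server -Database $database -Username $user -Password $password -Query "CREATE SCHEMA Staging AUTHORIZATION dbo;"
Invoke-Sqlcmd -ServerInstance $server -Database $database -Username $user -Password $password -Query "CREATE SCHEMA Reporting AUTHORIZATION dbo;"

Tables that would previously have lived in separate databases can then be created as, for example, Staging.FactEquity and Reporting.FactEquity, and joined within a single query.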

With the database created, you can then proceed to create tables. That’s the next ADW blog….

 

ADW2: Scalability of Azure SQL Data Warehouse.

A quick one today. In the first post on Azure SQL Data Warehouse, we introduced the key business benefits of the service: namely, the ability to scale on demand, and to pause and restart the service at will.

The ability to scale a SQL Data Warehouse on demand provides several benefits:

  1. You can increase the compute of SQL Data Warehouse during periods of heavy demand. Heavy demand can include either a period of sustained data loads or periods of heavy querying. You can scale up the compute during these peak periods, and then reduce the compute when the demand is not there.
  2. You can control the cost of the data warehouse. The ability to change the compute at will also enables you to control the costs of the data warehouse. Pricing can be found at the following page. Note that the price is broken down into the cost of the compute and the cost of storing the data.

There are several methods to scale a SQL Data Warehouse. The following video shows you how to scale using the Azure Portal, but you can also scale the data warehouse using the following T-SQL code:

ALTER DATABASE ContosoRetailDW
MODIFY (SERVICE_OBJECTIVE = 'DW100');

Finally, PowerShell can also be used to scale the data warehouse:

Set-AzureRmSqlDatabase `
    -ResourceGroupName "RG_name" `
    -ServerName "SRV_name" `
    -DatabaseName "DB_name" `
    -RequestedServiceObjectiveName "DW100"

 

Your organization may face a period where no access is required to the data held in a data warehouse. Perhaps during the holiday season, your organization informs the staff that IT maintenance is being performed over the holidays and that access to data and network resources is not guaranteed. In this case, you can pause the Azure SQL Data Warehouse. This means that you won’t be charged for the compute aspect of the warehouse, only the storage aspect.  You can pause the service using the Azure Portal as shown in the following video, or you can use the following PowerShell code:

Suspend-AzureRmSqlDatabase `
    -ResourceGroupName "RG_name" `
    -ServerName "SRV_name" `
    -DatabaseName "DB_name"

 

You can resume the Azure SQL Data Warehouse with similar PowerShell code:

Resume-AzureRmSqlDatabase `
    -ResourceGroupName "RG_name" `
    -ServerName "SRV_name" `
    -DatabaseName "DB_name"

 

These simple steps can be used to help you scale and pause the service when you need to, giving you control of the performance and the costs of the solution.

 

 

ADW1: Data Warehousing in the Cloud.

For a few months now, we have had a SQL Server-based data warehouse service in the cloud known as Azure SQL Data Warehouse. This service enables you to provision a data warehouse instance in just 3 to 5 minutes. Its main benefit is that it allows you to scale your compute in seconds to keep up with your organization’s data demands. So, if you are performing heavy data loads, you can maximise the amount of compute for the duration of a data load, only to scale the compute back down once the load is complete. Furthermore, if you don’t require access to your data, you can pause the compute so that you can keep control of your costs, while still retaining the data.

These are some of the business benefits of Azure SQL Data Warehouse, but how does it work?

SQL Data Warehouse is a massively parallel processing (MPP) distributed database system. Behind the scenes, SQL Data Warehouse spreads your data across many shared-nothing storage and processing units. The data is stored in a Premium locally redundant storage layer on top of which dynamically linked Compute nodes execute distributed queries. SQL Data Warehouse takes a “divide and conquer” approach to running loads and complex queries. Requests are received by a Control node, optimized for distribution, and then passed to Compute nodes to do their work in parallel as shown in the following graphic.

SQLDW

Azure SQL Data Warehouse is ideal for analytical workloads, whether it is a small workload of gigabytes or a large workload of petabytes. It can also interface with unstructured data stored in Azure Blob storage. If transactional consistency and high concurrency are your requirements, then Azure SQL Data Warehouse is not the service to use; SQL Database would be a more appropriate choice.

It only takes minutes to get this up and running, and you can do so either within the Azure Portal or with PowerShell.

At this point I will give you a choice: you can either watch the following 10 minute video that demonstrates how to set up an instance of Azure SQL Data Warehouse, or you can continue to read. If you’re hardcore, why not do both! But I understand your time is precious.

OK, so there is some information that you need to have to hand before creating an Azure SQL Data Warehouse: the subscription you will use, the resource group name and location, the server name and admin credentials, and the database name, edition and service objective (DWU level).

Armed with this information, you can then go ahead and create the SQL Data Warehouse Instance.

Alternatively, you can use the same information to create a PowerShell script to sign into an Azure Subscription, create a resource group and then create a SQL Server instance, and optionally a database. The following PowerShell code creates a resource group named cto_ads_prep_rg located in North Europe using the New-AzureRmResourceGroup cmdlet.  The script then creates a SQL Server instance named ctomsftadssqlsrv with an admin account named ctesta-oneill using the New-AzureRmSqlServer cmdlet.

######################################################################
##             PART I: Creating the Azure SQL Server                ##
######################################################################

# Sign in to Azure and set the Azure subscription to work with
$SubscriptionId = "XXXXXXXX-xxXX-XXxx-XXXX-xxxxxxxxxxxx"
Add-AzureRmAccount
Set-AzureRmContext -SubscriptionId $SubscriptionId

# CREATE A RESOURCE GROUP
$resourceGroupName = "cto_ads_prep_rg"
$rglocation = "North Europe"
New-AzureRmResourceGroup -Name $resourceGroupName -Location $rglocation

# CREATE A SERVER
$serverName = "ctomsftadssqlsrv"
$serverVersion = "12.0"
$serverLocation = "North Europe"

$serverAdmin = "ctesta-oneill"
$serverPassword = "P@ssw0rd"
$securePassword = ConvertTo-SecureString -String $serverPassword -AsPlainText -Force
$serverCreds = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $serverAdmin, $securePassword
$sqlDbServer = New-AzureRmSqlServer -ResourceGroupName $resourceGroupName -ServerName $serverName -Location $serverLocation -ServerVersion $serverVersion -SqlAdministratorCredentials $serverCreds

 

You can also use PowerShell to configure firewall settings on the SQL Server instance using the New-AzureRmSqlServerFirewallRule cmdlet. This can be performed in the Azure Portal as well.

 

# CREATE A SERVER FIREWALL RULE
$ip = (Test-Connection -ComputerName $env:COMPUTERNAME -Count 1 -Verbose).IPV4Address.IPAddressToString
$firewallRuleName = 'Client IP address'
$firewallStartIp = $ip
$firewallEndIp = $ip
$fireWallRule = New-AzureRmSqlServerFirewallRule -ResourceGroupName $resourceGroupName -ServerName $serverName -FirewallRuleName $firewallRuleName -StartIpAddress $firewallStartIp -EndIpAddress $firewallEndIp

 

With the firewall rules defined, you will then be able to access the Azure SQL Server using tools such as Visual Studio and SQL Server Management Studio, where you can run T-SQL scripts to create and manage databases. This can also be done in PowerShell using the New-AzureRmSqlDatabase cmdlet. The following code creates a data warehouse named ContosoRetailDW.

 

# CREATE A SQL DATABASE
$databaseName = "ContosoRetailDW"
$databaseEdition = "DataWarehouse"
$RequestedServiceObjectiveName = "DW400"
$sqlDatabase = New-AzureRmSqlDatabase -ResourceGroupName $resourceGroupName -RequestedServiceObjectiveName $RequestedServiceObjectiveName -ServerName $serverName -DatabaseName $databaseName -Edition $databaseEdition

Once the database has been created, you can then look at how to scale and pause a database in Azure SQL Data Warehouse, but that will be the next blog.

 

 

LEX1: Azure SQL Data Warehouse. Add to your Big Data specialist skills

sqldw

Whether you’ve outgrown your on-premises datacenter or need to extend a hybrid scenario, Azure SQL Data Warehouse can help. The great news is that, unlike data warehousing of the past, you can spin up this cloud-based massively parallel processing solution in just a few minutes. Push a button to view the contents in Power BI. You can even pause compute to help control costs. Get the details in “Delivering a Data Warehouse in the Cloud,” a self-paced course now available on edX.

Experts Chris Randall and Chris Testa-O’Neill dive deep into the specifics of this breakthrough technology. Learn how to deploy, design, and load data from a variety of sources. Plus, explore PolyBase for Big Data and look at compressed in-memory indexes. Design tables and indexes to efficiently distribute data in tables across many nodes, secure and recover data, and much more.

If you’re adding to your Big Data specialist skills, this course is a must. We’ve got lots of other Big Data courses, too. Look at Analyzing and Visualizing Data with Excel or with Power BI. Explore Querying Data with Transact-SQL. Or find out about Processing Big Data with Azure Data Lake Analytics or with Hadoop in Azure HDInsight. You can even Orchestrate Big Data with Azure Data Factory. Finally, learn about Analyzing Big Data with Microsoft R Server and Implementing Predictive Models with Spark in Azure HDInsight.

All of these Big Data concepts and technologies courses are free, or you can pay $49 for a verified certificate, which offers proof that you’ve successfully completed the online course and which you can share with friends, employers, and others. Spend a few hours each week (for four to six weeks, depending on the course) with these videos, demos, hands-on labs, and assessments. Plus, get additional resources and content for further study. Roll up your sleeves, and get started today!

Sign up now! (https://www.edx.org/course/delivering-data-warehouse-cloud-microsoft-dat220x)

ADF 2: Create an Azure Data Factory Instance

There are a variety of methods that can be used to create an instance of an Azure Data Factory. This blog explores how you can create an ADF instance using the Azure Portal, PowerShell and Azure Resource Manager templates.

Azure Portal

The first step is to create the Azure Data Factory. This can be performed within the Azure portal. Click on the New icon, point to Databases and then click Data Factory.

createadf

At this point the provisioning blade for ADF appears. It is in this blade that you define a name for the Data Factory instance. You then assign the instance to a subscription that you own.

The resource group setting enables you to define whether your ADF instance will reside in a resource group that already exists, or in a new resource group that you create. Resource groups are important: billing can be viewed at the resource group level, you can define access control on a resource group, and the resources held within the group can communicate with each other without the need to write complex IaaS scripts to further define the communication between services.

The resource group can include all the resources for the solution, or only those resources that you want to manage as a group. You decide how you want to allocate resources to resource groups based on what makes the most sense for your organization. It is therefore important to understand the objective for creating a resource group and to plan your resource groups appropriately.

Finally, you will then assign the Data Factory instance to a region of your choice.

2-settingadfoptions

You can view a video here on how to setup an ADF instance in the Azure Portal.

PowerShell

You also have the ability to deploy Azure Data Factory instances using PowerShell. This requires that Azure PowerShell is installed on your computer. Once this is set up, you can use the following PowerShell commands to create a Data Factory instance. The commands set the Azure context to a subscription and then create a Data Factory instance, specifying the resource group in which the instance is hosted, followed by the name and the location.

adfpowershell
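A minimal sketch of these commands, assuming the AzureRM.DataFactories module, might look like the following; the subscription ID, resource group name, data factory name and location are placeholder values:

# Sign in and select the subscription to work with (placeholder ID)
Add-AzureRmAccount
Set-AzureRmContext -SubscriptionId "XXXXXXXX-xxXX-XXxx-XXXX-xxxxxxxxxxxx"

# Create the Data Factory instance in an existing resource group
New-AzureRmDataFactory `
    -ResourceGroupName "adf_demo_rg" `
    -Name "ContosoADF" `
    -Location "North Europe"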

You can view a 3 minute video on how to create an ADF instance here.

Azure Resource Manager templates

When dealing with ADF in production scenarios, or when dealing with multiple ADF instances, you can also make use of Azure Resource Manager (ARM) templates to make the deployment of multiple instances of ADF much easier. ARM templates are JavaScript Object Notation (JSON) files that define one or more resources to deploy to a resource group, along with the dependencies between the deployed resources. The JSON files can be parameterized to provide the flexibility to deploy ADF instances with different names, in different resource groups and regions.
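As a sketch of how such a template might be deployed, the following PowerShell uses the New-AzureRmResourceGroupDeployment cmdlet; the template and parameter file names here are hypothetical and would be replaced with your own ARM artifacts:

# Deploy an ARM template (and its parameter file) into a resource group
New-AzureRmResourceGroupDeployment `
    -ResourceGroupName "adf_demo_rg" `
    -TemplateFile ".\ADFDeployment.json" `
    -TemplateParameterFile ".\ADFDeployment-Parameters.json"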

There is an excellent article from the product group that will enable you to use ARM templates in a variety of Data Factory scenarios. I would highly recommend that you read this.

With the Data Factory instance created, it is then time to create the relevant linked services, datasets and pipelines to perform the data orchestration activities. These will be covered individually in other blog posts.

ADF 1: Orchestrating data movement in the Cortana Intelligence Suite.

The need for batch movement of data on a regular time schedule is a requirement for most analytics solutions. Within the Cortana Intelligence Suite, Azure Data Factory (ADF) is the service that can be used to fulfil such a requirement.

ADF provides a cloud-based data integration service that orchestrates the movement and transformation of data from various data stores.

So what do we mean by orchestration?

Think about an orchestra. The most important member of the orchestra is the conductor. The conductor does not play the instruments; they simply lead the symphony members through the entire piece of music that they perform. The musicians use their own skills to produce particular sounds at various stages of the symphony, so they may only learn certain parts of the music. The conductor orchestrates the entire piece of music, and is therefore aware of the entire score that is being performed. They will also use specific arm movements that provide instructions to the musicians on how a piece of music should be played.

ADF uses a similar approach: it will not perform the actual work required to transform data, but will instruct another service, such as a Hadoop cluster, to perform a Hive query to carry out the transformation on ADF’s behalf. So in this case, it would be Hadoop that performs the work, not ADF. ADF merely orchestrates the execution of the Hive query through Hadoop, and then provides the pipelines to move the data on to the next destination.

It also provides rich visualizations to display the lineage and dependencies between your data pipelines, and lets you monitor all your data pipelines from a single unified view to easily pinpoint issues and set up monitoring alerts.

The Azure Data Factory Process

The process for Azure Data Factory can be summarized by the following graphic.

3-adfprocess

Data Factory supports a wide variety of data sources that you can connect to through the creation of an object known as a Linked Service. This enables you to ingest the data from a data source in readiness to prepare the data for transformation and/or analysis. In addition, Linked Services can fire up compute services on demand. For example, you may have a requirement to start an on-demand Hadoop cluster just to process data through a Hive query. So Linked Services enable you to define the data sources, or compute resources, that are required to ingest and prepare data.
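Linked service definitions are themselves JSON files, and as a rough sketch they can be registered from PowerShell with the New-AzureRmDataFactoryLinkedService cmdlet; the resource group, data factory and JSON file names below are assumptions for illustration only:

# Register a linked service from its JSON definition file
New-AzureRmDataFactoryLinkedService `
    -ResourceGroupName "adf_demo_rg" `
    -DataFactoryName "ContosoADF" `
    -File ".\AzureStorageLinkedService.json"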

With the linked service defined, Azure Data Factory is made aware of the datasets that it should use through the creation of a Dataset object. Datasets represent data structures within the data store that is being referenced by the Linked Service object. Datasets can also be used by an ADF process known as an Activity.

Activities typically contain the transformation logic or the analysis commands of Azure Data Factory’s work. This could include the execution of a stored procedure, Hive query or Pig script to transform the data. You can use U-SQL with Data Lake Analytics, or push data into a Machine Learning model to perform analysis. It is not uncommon for multiple activities to take place, for example transforming data using a SQL stored procedure and then performing Data Lake Analytics with U-SQL. In this case, multiple activities can be logically grouped together with an object referred to as a Pipeline.
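Datasets and pipelines are defined in JSON in the same way as linked services, and a minimal sketch of registering them from PowerShell (again with hypothetical names and file paths) might look like this:

# Register a dataset from its JSON definition
New-AzureRmDataFactoryDataset `
    -ResourceGroupName "adf_demo_rg" `
    -DataFactoryName "ContosoADF" `
    -File ".\InputDataset.json"

# Register the pipeline that groups the activities operating on those datasets
New-AzureRmDataFactoryPipeline `
    -ResourceGroupName "adf_demo_rg" `
    -DataFactoryName "ContosoADF" `
    -File ".\TransformPipeline.json"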

Once all the work is complete you can then use Data Factory to publish the final dataset to another linked service that can then be consumed by technologies such as Power BI or Machine Learning.

Therefore the process discussed above can be summarized by the creation of the following objects as shown in the graphic.

4-adfcomponents

Use cases

But I can move data with other tools, right? Absolutely. Data movement could occur, for example, using SSIS to load data from SQL Server to Azure SQL Data Warehouse. However, if you have seen the video on data loading in Azure SQL Data Warehouse, you will be aware that SSIS is not the ideal tool to use if performance of the data loads is the key objective. PolyBase is, and we can use Azure Data Factory to orchestrate the PolyBase execution to load data into SQL Data Warehouse.

The other use case is if you want to call on-demand services such as Hadoop clusters. ADF can be used to create an on-demand cluster when it is required, and to shut it down once your work is complete. In this way you can control the usage of the Azure resources that process your data, and only pay for what you use.

We can also use ADF to push data into Machine Learning models. This is particularly useful once a model has been productionized and you want to automate the process of feeding data to the Machine Learning model to perform its analytics. ADF can perform this activity with ease.

These examples, along with others, will be explored in future blogs. In the meantime you can watch this 3 minute video that summarizes the capability of Azure Data Factory.

 

 

 

 

The learnanalytics.microsoft.com portal

So where can we learn about all the features and capabilities of the Cortana Intelligence Suite? Well, here at Microsoft we are looking to address this through our new portal:

www.learnanalytics.microsoft.com

This site is evolving all the time as we add webinars and videos. What’s really interesting is the wide range of in-person training events that are available all around the world. Not only is this training delivered by Microsoft, but we also partner with some of the most talented speakers and professionals in the world.

Why not learn Machine Learning with Rafal Lukawiecki from Project Botticelli, or from Presciient? Or look into big data technologies such as Apache Spark with Databricks. There is a huge list available and it keeps growing. Take a look at our training page for the latest courses.