There are a wide range of tools and technologies to ingest data into Azure SQL Data Warehouse, including BCP and SSIS. If performance is a critical metric to meet when loading a data warehouse, loads are best performed with PolyBase from Azure Storage Blobs or Azure Data Lake.
What is PolyBase?
Polybase allows you to leverage your data from various sources, and enables you to query non-relational data held in Azure Blob storage or Azure Data Lake as though it is a regular table by using familiar T-SQL commands. As a result, Polybase can be used to query non-relational data, or to import non-relational data into SQL Data Warehouse.
PolyBase achieves this using external tables to access non-relational data. The tables are metadata definitions that are stored in SQL Data Warehouse, and you can access them by using SQL and tools like you would access normal relational data. The key point to note, is that external tables do not hold any data at all, they provide a metadata abstraction over the source data held in an Azure Storage Blobs or Azure Data Lake. Therefore, an external table requires two supporting objects that informs the external table where the source data resides, and the format of the source data. These are referred to as the external data source and external file format.
External Data Source An External Data Source provides the connection information to the source data. Typically, this will also involve using a credential to access the data source. Therefore, a credential will have to be created so that the external data source can use this to access the data source. Finally, the credential has to be protected, so a master key would also be required in the database to protect the credential The following example firstly creates a master key in the EquityDB database, and then creates a database credential named AzureStorageCredential that contains the Access Key to an Azure Blob Store. Finally an External Data Source named AzureStorage provides the connection to a container in an Azure Blob Store named email@example.com that uses the credential named AzureStorageCredential to access the source data. — A: Create a master key.
CREATE MASTER KEY;
— B: Create a database scoped credential
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
IDENTITY = ‘NA’,
SECRET = ‘XI7amHWNZiqyrst8Ockja7Tw5Hci3D/yyxW6l2Djjg1RcdrtHfEla5h007g==’
— C: Create an external data source
CREATE EXTERNAL DATA SOURCE AzureStorage
TYPE = HADOOP,
LOCATION = ‘wasbs://firstname.lastname@example.org’,
CREDENTIAL = AzureStorageCredential
External File Format
The data read by Polybase can be in various formats, including Parquet, delimited files or ORC files. The External File Format command is used to specify the format of files that contain the external data. The following is a simple example of an External file format named TextFile that defines the format as delimited text with a field terminator using a comma.
— D: Create an external file format
CREATE EXTERNAL FILE FORMAT TextFile
FORMAT_TYPE = DelimitedText,
FORMAT_OPTIONS (FIELD_TERMINATOR = ‘,’)
External tables access non-relational data. The following example creates an external table named DimDate2External that has columns that reflect a date dimension table. The critical point to note are the options within the WITH clause. They point to both the external data source created earlier named AzureStorage and the external file format named TextFile . The location specifies the folder or the file path and file name for the actual data in the Azure blob storage. The location starts from the root folder in this example; the root folder being the data location specified in the external data source.
CREATE EXTERNAL TABLE dbo.DimDate2External (
[Date] datetime2(3) NULL,
[DateKey] decimal(38, 0) NULL,
[MonthKey] decimal(38, 0) NULL,
[Month] nvarchar(100) NULL,
[Quarter] nvarchar(100) NULL,
[Year] decimal(38, 0) NULL,
[Year-Quarter] nvarchar(100) NULL,
[Year-Month] nvarchar(100) NULL,
[Year-MonthKey] nvarchar(100) NULL,
[WeekDayKey] decimal(38, 0) NULL,
[WeekDay] nvarchar(100) NULL,
[Day Of Month] decimal(38, 0) NULL
Using PolyBase to load data quickly
When you are loading, or exporting large volumes of data or fast performance is needed, PolyBase is the best choice. PolyBase is designed to leverage the MPP (Massively Parallel Processing) architecture of SQL Data Warehouse and will therefore load and export data magnitudes faster than any other tool. PolyBase loads can be run using CREATE TABLE AS SELECT (CTAS). Using CTAS will minimize transaction logging and the fastest way to load your data.
The following code uses a CTAS statement to create a new table in the EquityDB database named Dates that is populated with data from the external table named DimDate2External created earlier. The table will also have a clustered columstore index and uses a round_robin distribution.
CREATE TABLE dbo.Dates
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
SELECT * FROM [dbo].[DimDate2External];
This blog has explored using PolyBase to load data into an Azure SQL DataWarehouse. Although there are many tools to perform a load. PolyBase is the best choice form a performance perspective as it is designed to leverage the MPP (Massively Parallel Processing) architecture of SQL Data Warehouse. Other services can be used to run PolyBase under the hood such as Azure Data Factory, or you could use a third party tool to migrate you data from an on premise server to such as Redgate’s Data Platform Studio, which is readily accessible from the Azure SQL Data Warehouse product home page
See the code in the blog being executed in the following video