Inderjit Rana
7 min read · May 14, 2020
--
In this article I will share some practical guidance around data loads into Synapse (formerly Azure SQL Data Warehouse) from Azure Storage using Polybase. Polybase requires the data to be in Azure Storage, and I will address the most common pattern where files are loaded from Azure Storage into Azure Synapse. I will briefly explain the Polybase concept, but my assumption is that you have some level of familiarity with Polybase, Azure Synapse, Azure Storage, Azure Data Factory, etc. and are looking for more practical guidance on how to get the best performance on data loads. If you are one of those impatient folks who wants to jump to the most important section, scroll down to the section titled “Performance Guidance around Count and Size of Files, Zipped or Unzipped”; that is the somewhat missing piece which motivated me to write this post.
I always like to add the disclaimer that things change at a fast pace. I am writing this article in May 2020 and will try my best to come back and update it as things change, or as I find more information to add.
Synapse Copy Command
Before discussing Polybase, I would like to point out that there is a new, easy-to-use method for performant data loads to Synapse, currently in Preview, referred to as the Copy Command. I won’t be surprised if this becomes the standard in the next 6–12 months, but I still think it is valuable to document Polybase guidance. A minimal example of the statement appears after the links and note below.
Useful Copy Command Links
- Quickstart to load data into Synapse using T-SQL — https://docs.microsoft.com/azure/synapse-analytics/sql-data-warehouse/quickstart-bulk-load-copy-tsql
- More details around permissions and authentication needed — https://techcommunity.microsoft.com/t5/azure-synapse-analytics/how-to-use-copy-into/ba-p/1298242
- Detailed documentation for the Copy Command; pay attention to the FAQ section, which has very detailed guidance on the number and size of files — https://docs.microsoft.com/sql/t-sql/statements/copy-into-transact-sql?view=azure-sqldw-latest
- Azure Data Factory using the Copy Command (important to note that the Azure Data Factory Copy Activity can use different mechanisms to perform the load, the Copy Command being one of them and Polybase being another) — https://docs.microsoft.com/azure/data-factory/connector-azure-sql-data-warehouse#use-copy-statement
***Note*** — Copy Command is GA as of September 2020
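For reference, here is a minimal sketch of what the COPY statement can look like. The storage URL, table name, and file format options are placeholder names of my own (not taken from the documentation) and should be adjusted to match your files and authentication method.

```sql
-- Minimal COPY sketch (placeholder names): load gzipped CSV files
-- from a storage container into an existing Synapse table.
COPY INTO dbo.FactSales
FROM 'https://mystorageaccount.blob.core.windows.net/mycontainer/sales/2020/*.csv.gz'
WITH (
    FILE_TYPE = 'CSV',
    COMPRESSION = 'GZIP',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A',
    FIRSTROW = 2,                                 -- skip a header row
    CREDENTIAL = (IDENTITY = 'Managed Identity')  -- or a SAS / storage key credential
);
```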
Polybase
Basics
In simple terms, it is a very performant way to load data from Azure Storage into Azure Synapse. It does require going through a handful of steps: create a Master Key for the database, create a Database Scoped Credential, create an External Data Source, create an External File Format, create an External Table, and then perform the load from the External Table into a Synapse native table. Polybase can be used with plain T-SQL, but ADF can also perform a Polybase load while hiding the underlying steps such as the creation of the External Data Source, External Table, etc. A minimal end-to-end sketch follows the links below.
The T-SQL process is documented well here — https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-load-from-azure-blob-storage-with-polybase
Also, you can refer to my post for a complete example, though its main purpose is to document a load from a VNET-protected ADLS Gen2 account — https://medium.com/@isinghrana/azure-synapse-data-load-using-polybase-or-copy-command-from-vnet-protected-azure-storage-da8aa6a9ac68
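To make the sequence of steps concrete, below is a minimal end-to-end sketch. All object names, the storage location, and the credential secret are placeholders of my own, and the file format should match your actual files; refer to the links above for the authoritative syntax.

```sql
-- 1. Master key (created once per database)
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';

-- 2. Database scoped credential (placeholder storage account key)
CREATE DATABASE SCOPED CREDENTIAL StorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

-- 3. External data source pointing at the storage container
CREATE EXTERNAL DATA SOURCE SalesStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = StorageCredential
);

-- 4. External file format for pipe-delimited, gzip-compressed text
CREATE EXTERNAL FILE FORMAT PipeDelimitedGzip
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

-- 5. External table over the folder, not an individual file
--    (assumes an ext schema already exists)
CREATE EXTERNAL TABLE ext.Sales (
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/', DATA_SOURCE = SalesStorage, FILE_FORMAT = PipeDelimitedGzip);

-- 6. Load from the external table into a native Synapse table with CTAS
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.Sales;
```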
Permissions
One of the concerns usually raised is that Polybase requires Control permission on the database. Control permission is indeed required, but you can take a few measures to limit the permissions given to a load user:
- You can explicitly deny access on all schemas other than the one where you are loading the data, as documented here (see the sketch after this list) — https://docs.microsoft.com/azure/synapse-analytics/sql-data-warehouse/guidance-for-loading-data#allowing-multiple-users-to-load
- Not all steps require Control permission; admins with higher-level permissions can perform the subset of steps that do require it, and then a user with limited permissions can perform the load — https://techcommunity.microsoft.com/t5/azure-synapse-analytics/how-to-create-a-polybase-user-with-only-schema-level-access/ba-p/839878
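As a rough sketch of the first approach (the database, schema, and user names below are hypothetical), the idea is to grant the required Control permission and then explicitly deny the schemas the load user should not touch:

```sql
-- Hypothetical names; adjust to your own database, schemas, and load user.
-- Grant the database-level CONTROL permission Polybase needs ...
GRANT CONTROL ON DATABASE::[MyWarehouse] TO LoadUser;

-- ... then explicitly deny the schemas the loader should not be able to touch
-- (DENY takes precedence over the broader GRANT above).
DENY CONTROL ON SCHEMA::dbo TO LoadUser;
DENY CONTROL ON SCHEMA::Reporting TO LoadUser;
```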
Azure Data Factory and Polybase
Azure Data Factory allows using Polybase even if your data is on-premises (using the Self-Hosted Integration Runtime) through the Staged Copy feature, but do keep in mind that the data is indeed copied to Azure Storage first and then loaded into Synapse (Staged Copy just provides a convenient abstraction over those two steps).
Another important thing to pay attention to is that even if the data is already in Azure Storage, certain conditions need to be met for ADF to do a direct Polybase load from Azure Storage to Synapse. Otherwise, ADF has the capability to do a staged copy to conform the data to Polybase requirements, which means the data is in Azure Storage but is transformed and copied to another temporary storage location before it is loaded into Synapse. The goal should be to land the data in Storage in the format required by Polybase (if possible) and do a direct load from Storage to Synapse without the additional staging step, because that extra step can really drain load performance.
Please see the “Direct copy by using Polybase” section of the following link to understand the requirements for a direct copy from Azure Storage to Azure Synapse — https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-data-warehouse#use-polybase-to-load-data-into-azure-sql-data-warehouse. One example requirement is that the source delimited file cannot use quoteChar, escapeChar, etc.
Performance Guidance around Count and Size of Files, Zipped or Unzipped
If you have a large amount of data to load into Synapse and have control over the size and count of files generated from your source systems, you might have questions such as: how many files do I need to create, what should the size of those files be (50 GB, 100 GB, etc.), and should they be compressed or uncompressed? (Please note that when I say compressed/uncompressed, I mean a zipped delimited text file, for example gzip.)
- In general, the following is the order of performance (highest to lowest): uncompressed multiple files, compressed multiple files, uncompressed single file, and compressed single file.
- For delimited uncompressed text files, Polybase can do parallel processing both within an individual file and at the folder level when there are multiple files in the folder. There is no real need to split the files, but there might be a slight performance advantage to splitting, so it is up to you to determine whether a slight improvement in performance is worth the extra effort of splitting the files. A better reason for splitting files is operational support, where you want to send files from on-premises to Azure Storage in reasonable chunks.
- When splitting delimited uncompressed text files, the guidance is that as long as each file is over 2 GB and the count of files does not reach the high hundreds (for example, 1000), it does not really matter whether the file size is 50 GB, 100 GB, etc.
- If you have multiple files in a folder, create the external table at the folder level; there is no need to do this at the individual file level because Polybase has parallelization built in.
- The final step of Polybase is to load from the External Table into the Synapse table; CTAS (Create Table As Select) is observed to perform better than INSERT INTO. You can learn more about CTAS here — https://docs.microsoft.com/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-develop-ctas
- When using Azure Data Factory, the native Polybase load uses INSERT INTO, so if you want the most performant load you will achieve better performance by using a Stored Procedure with CTAS (see the sketch after this list).
- Polybase cannot parallelize the processing of compressed files like gzip, so if you use compressed files it is better to have multiple files for Polybase to load.
- Polybase cannot parallelize the processing of an individual Parquet file, so having multiple Parquet files should yield better performance.
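To illustrate the CTAS pattern from the last few points, here is a hedged sketch of a load procedure that ADF (or any orchestrator) could call instead of relying on the built-in INSERT INTO behavior. The procedure, table, and external table names are placeholders of my own, and the external table is assumed to already exist (see the Polybase sketch earlier).

```sql
-- Hypothetical stored procedure wrapping a CTAS load; all names are placeholders.
CREATE PROCEDURE dbo.LoadSales
AS
BEGIN
    -- Build a fresh copy of the target with CTAS rather than INSERT INTO ... SELECT
    IF OBJECT_ID('dbo.Sales_New') IS NOT NULL
        DROP TABLE dbo.Sales_New;

    CREATE TABLE dbo.Sales_New
    WITH (DISTRIBUTION = HASH(SaleId), CLUSTERED COLUMNSTORE INDEX)
    AS
    SELECT * FROM ext.Sales;

    -- Swap the freshly loaded table in place of the old one
    IF OBJECT_ID('dbo.Sales') IS NOT NULL
        DROP TABLE dbo.Sales;
    RENAME OBJECT dbo.Sales_New TO Sales;
END;
```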
Other Best Practices — Resource Classes, Enough Compute
Updated on May 27, 2020
The Resource Class of the user performing the load is an important aspect which is commonly overlooked. It is important that the load user has enough resources to perform the load and achieve the best performance, which is controlled by Resource Classes as well as Workload Management techniques. An implicit requirement is that Synapse is running at a scale (DWU level) with enough compute.
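As a small example (the user name below is a placeholder), assigning a load user to a larger static resource class is just a database role membership change:

```sql
-- Move a hypothetical load user into a larger static resource class
-- so load statements get more memory; run against the dedicated SQL pool database.
EXEC sp_addrolemember 'staticrc40', 'LoadUser';

-- Verify which database roles (including resource classes) the user belongs to
SELECT r.name AS role_name
FROM sys.database_role_members rm
JOIN sys.database_principals r ON rm.role_principal_id = r.principal_id
JOIN sys.database_principals m ON rm.member_principal_id = m.principal_id
WHERE m.name = 'LoadUser';
```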
Relevant Links
- Detailed article on how to monitor a Polybase load to Azure Synapse Analytics from people who did this in the real world — https://army91.medium.com/monitoring-a-polybase-load-in-azure-synapse-analytics-b25d21f7ba28
- Resource Classes — https://docs.microsoft.com/azure/synapse-analytics/sql-data-warehouse/resource-classes-for-workload-management
- Workload Management — https://docs.microsoft.com/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-workload-management
- Data Loading Best Practices — https://docs.microsoft.com/azure/synapse-analytics/sql-data-warehouse/guidance-for-loading-data
- Columnstore Compression for best query performance — https://docs.microsoft.com/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-memory-optimizations-for-columnstore-compression