Practical Tips on Data Load to Synapse using Polybase

Inderjit Rana

7 min read

·

May 14, 2020

--

In this article I will share some practical guidance on data loads into Azure Synapse (formerly Azure SQL Data Warehouse) from Azure Storage using Polybase. Polybase requires the data to be in Azure Storage, so I will address the most common pattern, where files are loaded from Azure Storage into Synapse. I will briefly explain the Polybase concept, but my assumption is that you already have some familiarity with Polybase, Azure Synapse, Azure Storage, Azure Data Factory, etc. and are looking for more practical guidance on how to get the best load performance. If you are one of those impatient folks who wants to jump straight to the most important part, scroll down to the section titled "Performance Guidance around Count and Size of Files, Zipped or Unzipped"; that somewhat-missing piece is what motivated me to write this post.

I always like to add the disclaimer that things change at a fast pace. I am writing this article in May 2020 and will try my best to come back and update it as things change, or even as I find more information to add.

Synapse Copy Command

Before discussing Polybase, I would like to point out that there is a new, easy-to-use method for performant data loads to Synapse, currently in preview, referred to as the COPY command. I won't be surprised if it becomes the standard in the next 6 to 12 months, but I still think it is valuable to document the Polybase guidance.
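For a sense of how simple it is, here is a minimal sketch of a COPY-command load; the table name, storage URL, and Managed Identity authentication are illustrative assumptions, not values from this article:

```sql
-- Hedged sketch: load delimited files from a storage folder into a Synapse table.
-- dbo.FactSales, the storage account/container path, and the auth choice are assumptions.
COPY INTO dbo.FactSales
FROM 'https://mystorageaccount.blob.core.windows.net/mycontainer/sales/'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Managed Identity'),
    FIELDTERMINATOR = ',',
    FIRSTROW = 2          -- skip a header row
);
```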

Useful Copy Command Links


***Note*** — Copy Command is GA as of September 2020

Polybase

Basics

In simple terms, it is a very performant way to load data from Azure Storage to Azure Synapse. It does require going through a handful of steps: create a Master Key for the database, create a Database Scoped Credential, create an External Data Source, create an External File Format, create an External Table, and then perform the load from the External Table into a native Synapse table. Polybase can be used with plain T-SQL, but ADF can also perform a Polybase load while hiding the underlying steps such as creation of the External Data Source, External Table, etc.
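To make that sequence concrete, here is a condensed T-SQL sketch of the steps; every object name, the storage URL, and the credential secret are illustrative placeholders rather than values from the article (the final External Table to native table load is shown later in the performance section):

```sql
-- Hedged sketch of the Polybase setup steps; all names and secrets are placeholders.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>';

CREATE DATABASE SCOPED CREDENTIAL LoadCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE SalesStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = LoadCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- LOCATION points at a folder, so Polybase reads all files in it in parallel.
CREATE EXTERNAL TABLE dbo.FactSalesExternal (
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = SalesStorage,
    FILE_FORMAT = CsvFormat
);
```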

T-SQL process is documented pretty well over here — https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-load-from-azure-blob-storage-with-polybase

Also, you can refer to my post for a complete example, though its main purpose is to document the load from a VNET-protected ADLS Gen2 account — https://medium.com/@isinghrana/azure-synapse-data-load-using-polybase-or-copy-command-from-vnet-protected-azure-storage-da8aa6a9ac68

Permissions

One of the concerns usually raised is that Polybase requires CONTROL permission on the database. CONTROL permission is indeed required, but you can take a few measures to limit the permissions given to the load user.

Azure Data Factory and Polybase

Azure Data Factory allows using Polybase even if your data is on-premises (using the Self-Hosted Integration Runtime) with the Staged Copy feature, but do keep in mind that the data is indeed copied to Azure Storage first and then loaded into Synapse (so Staged Copy just provides a convenient abstraction).

Another important thing to pay attention to is that even when the data is already in Azure Storage, certain conditions must be met for ADF to do a direct Polybase load from Azure Storage to Synapse. Otherwise, ADF falls back to a staged copy to conform the data to the Polybase requirements, which means the data is in Azure Storage but is transformed and copied to another temporary storage location before it is loaded into Synapse. The goal should be to land the data in Storage in the format required by Polybase (if possible) and do a direct load from Storage to Synapse without the additional staging step, because that extra hop can really drain load performance.


Please see the "Direct copy by using PolyBase" section of the following link to understand the requirements for a direct copy from Azure Storage to Azure Synapse: https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-data-warehouse#use-polybase-to-load-data-into-azure-sql-data-warehouse. One example requirement is that the source delimited file cannot use quoteChar, escapeChar, etc.


Performance Guidance around Count and Size of Files, Zipped or Unzipped

If you have a large amount of data to load into Synapse and have control over the size and count of files generated from your source systems, you might have questions such as: how many files should I create, what should the size of those files be (50 GB, 100 GB, etc.), and should they be compressed or uncompressed? (Please note that when I say compressed/uncompressed, I mean a zipped delimited text file, e.g. gzip.)

  • In general, the order of performance (highest to lowest) is: uncompressed multiple files, compressed multiple files, uncompressed single file, and compressed single file.
  • For delimited uncompressed text files, Polybase can parallelize processing both within an individual file and across multiple files in a folder. There is no real need to split the files, but there may be a slight performance advantage to splitting, so it is up to you to decide whether that small improvement is worth the extra effort. A better reason to split the files is operational support, where you want to send files from on-premises to Azure Storage in reasonable chunks.
  • When splitting delimited uncompressed text files, the guidance is that as long as each file is over 2 GB and the file count does not reach the high hundreds (e.g. 1000), it does not really matter whether the file size is 50 GB, 100 GB, etc.
  • If you have multiple files in a folder, create the external table at the folder level; there is no need to do this at the individual file level, since Polybase has parallelization built in.
  • The final step of Polybase is the load from the External Table into the native Synapse table; CTAS (Create Table As Select) is observed to perform better than INSERT INTO (see the sketch after this list). You can learn more about CTAS here: https://docs.microsoft.com/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-develop-ctas
  • When using Azure Data Factory, the native Polybase load uses INSERT INTO, so if you want the most performant load you will achieve better results with a Stored Procedure that uses CTAS.
  • Polybase cannot parallelize processing of compressed files like gzip, so if you are using compressed files it is better to have multiple files for Polybase to load.
  • Polybase cannot parallelize processing of an individual Parquet file either, so having multiple Parquet files should yield better performance.
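As a rough illustration of the CTAS points above, here is a hedged sketch of the final load step from the External Table into a native Synapse table; the table names, column list, and distribution column are assumptions, not taken from the article:

```sql
-- CTAS load: creates the native table and loads it in one fully parallel operation.
-- dbo.FactSalesExternal / dbo.FactSales and the HASH column are illustrative.
CREATE TABLE dbo.FactSales
WITH (
    DISTRIBUTION = HASH(SaleId),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT SaleId, SaleDate, Amount
FROM dbo.FactSalesExternal;

-- For comparison only (not to be run after the CTAS above): the INSERT INTO form,
-- which is what ADF's built-in Polybase load effectively uses, assumes the target
-- table already exists and is generally slower for large loads.
INSERT INTO dbo.FactSales (SaleId, SaleDate, Amount)
SELECT SaleId, SaleDate, Amount
FROM dbo.FactSalesExternal;
```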

Other Best Practices — Resource Classes, Enough Compute

Updates on May 27

The resource class of the user performing the load is an important aspect that is commonly overlooked. It is important that the load user has enough resources to perform the load and achieve the best performance, which is controlled by resource classes as well as workload management techniques. An implicit requirement is that Synapse is running at a scale (DWU level) with enough compute.
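As a minimal sketch (the user name and the choice of resource class are assumptions, not values from the article), giving a dedicated load user more resources could look roughly like this:

```sql
-- Option 1: assign a larger static resource class to the load user
-- (resource classes are implemented as database roles in dedicated SQL pools).
EXEC sp_addrolemember 'staticrc40', 'LoadUser';

-- Option 2 (newer workload management approach): classify the load user into a
-- workload group; a classifier takes precedence over role-based resource classes.
CREATE WORKLOAD CLASSIFIER LoadClassifier
WITH (
    WORKLOAD_GROUP = 'staticrc40',
    MEMBERNAME = 'LoadUser',
    IMPORTANCE = HIGH
);
```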
