ETL performance tuning: Low hanging fruit
While significant improvement in the execution time of Data Services jobs usually requires digging into the code and making changes there are a few tweaks that can be made without changing any code.
Monitor row size/monitor sample rate
This setting tells Data Services how frequently to write to the monitor log. If you are actively watching the execution of a job the monitor log will be updated only as often as the setting specifies. In other words, if you are watching the job through the Management Console, it doesn’t do any good to hit the refresh button every 5 seconds if the monitor sample rate setting is 30 seconds.
To gain performance, increase the setting.
|Data Services Version||Default Value||Maximum Value||Minimum Value|
|<= 4.0||1000 rows||64000||0|
|>= 4.1||5 seconds||9999999+||0|
Using the minimum value will result in no entries in the monitor log. It will be completely empty. Normally you wouldn’t want this type of result. In a production system that is very stable and there is no need to review the monitor log and you need to squeeze every ounce of performance out of a job then it’s OK to use the minimum value. But if you ever have to troubleshoot a Dataflow you will find the information in the monitor log very helpful.
If the Export Execution Command has been used to generate a .bat or .sh file for running the job from a third party scheduler the monitor setting is embedded within the file. If you change the setting within the batch job’s properties then you’ll need to redo the Export Execution Command. You could also edit the .bat or .sh file directly (change the –t parameter) but making this change starting with the job itself ensures that it doesn’t get lost.
Degrees of parallelism
Within the DSConfig.txt file there is a setting for Global_DOP. For versions prior to 4.1 the default value is 1 and for versions starting with 4.1 the default value is 2. This setting only applies to Dataflows. It means that Dataflows configured to use the default DOP value will use that many degrees of parallelism. Increasing the default value for the degrees of parallelism may help to improve the performance of some Dataflows.
For many Dataflows, a change to this setting will have no impact. Simple Dataflows cannot split processing into multiple threads.
Some shops were taken by surprise in 4.1 when SAP changed the default degrees of parallelism to 2. Not every Dataflow has been written to support parallel processing. In fact, more than one client of PK found that certain Dataflows randomly failed when Global_DOP was set to 2 but never failed with a setting of 1. The use of parallel processing can change the order of the data in a Dataflow. If transforms are expecting data to be sorted in a particular order but there is no explicit ORDER BY directly upstream of that transform then there is no guarantee that the data will be in the correct order when parallel processing is enabled.
If you find that a Dataflow fails when Global_DOP is greater than 1 then change the properties of the Dataflow to override the use of the default DOP. A setting of 1 in the Dataflow’s properties will explicitly prevent the Dataflow from using the Global_DOP value.
Maximum number of engine processes
If your ETL jobs run more than seven Dataflows at the same time then this tweak is for you! See this prior blog entry: https://www.pkglobalprod.wpengine.com/blog/2015/03/24/maximum-number-engine-processes-data-services.
Use an enterprise scheduler
This could be an expensive solution but it doesn’t require any change to ETL code. Where the benefit of using a true enterprise scheduler comes into play is that it can schedule and coordinate running multiple jobs in parallel. The scheduler in the Management Console is very rudimentary and very few shops that PK consultants have worked with use it in production. It can run jobs at specific times but it is not capable of scheduling by dependency. Job “B” cannot be executed when Job “A” completes. It certainly cannot handle running a set of twenty jobs but only running four at a time. The ability to run multiple jobs in parallel and coordinate the results of all of those jobs is where you can improve the overall execution time of your ETL.
Most enterprise schedulers will execute ETL jobs using the .bat or .sh file created by the Export Execution Command in the Management Console. The one exception to this – that I know of at this time – is Cronacle by Redwood Software. It interfaces directly with the job server through web services calls without the intermediate .bat or .sh file. Not having to create and maintain the .bat/.sh files can be a huge time saver, especially if you have a large number of jobs.
However, based on my experience at a client site that was using Cronacle, you should steer clear of it. Cronacle cannot over-ride the global variable defaults and it is unable to recover from lost connections. This last issue is particularly damning as it creates a huge headache for operations.
Learn how to monitor disk space using an ETL job.Tags: Cloud Engineering, Data, Dataflows, ETL Performance Tuning