Should I have multiple production data services repositories?
Sometimes ETL repositories are like cockroaches. They’re all over the place, nobody knows how they got here and nobody knows how to get rid of them.
An overabundance of repositories poses a few problems:
- Confusion – Which one should I use? Which one uses table X?
- Space – Most Data Services repositories don’t take all that much space, especially developer repositories. But if you have enough of them then the space adds up. Repositories use space within the database (the repository schema) as well as on the job server operating system in the form of log files.
- Multiple production repositories implies that you probably have multiple central repositories.
- Maintenance – Objects replicated across multiple repositories require you to deploy that object to each repository to keep the object in sync.
It’s the last item in the list that is the most important. So why do you need multiple repositories? If your ETL code is for completely different topics you might think you need different repositories. But is that true? I don’t think so. A job server could care less what sources and targets your jobs are using. If I have hundreds of jobs does using multiple repositories make managing those jobs easier? No, in fact it makes it far more difficult.
Let’s say your task today is to improve performance for the ETL. If the jobs are broken out into ten different production repositories you now have to gather runtime statistics from ten different locations to determine the five worst running Dataflows. Sure, I can write a query unioning the data in AL_STATISTICS from all ten repositories. But it’s a lot easier to gather that information from a single repository.
One justification given for using multiple repositories was that an object needed to be coded differently for various situations. If that’s the case then it probably doesn’t make sense to try and use the same name for the object or you need to figure out a way to code the object so that it handles these situations at run time. One example of this is business logic that differs for U.S. and Canadian customers. There are ways to handle this:
- Create two Dataflows.Name them with the suffixes _US and _CA. This then requires two different jobs with similarly suffixed names. Keep in mind that there is no inheritance feature in Data Services. So once you replicate an object you now have two different objects that must be maintained.
- Use a Global Variable to indicate if the job is running for US or CA. Then, within the Dataflow provide logic in the mapping that looks at the Global Variable and acts one way for a value of ‘US’ and another way for ‘CA’. This assumes that your source and target objects are identical for both US and CA. If they aren’t then you need to go with option one.
Circling back to the space issue, I’ve seen repositories that take up far more database and file space than you would expect. Jobs that abnormally fail don’t get cleaned up. The job’s history remains forever unless someone manually cleans it up. At one client site we eliminated over 20 million rows in AL_STATISTICS and 2000 log files of jobs that were still in the history from three years ago. Just because your job history is set to 30 days doesn’t mean it cleans up everything. I would rather have to do this maintenance on a single repository than on ten repositories.
In my next post I’ll address how to consolidate multiple repositories into one.Tags: Cloud Engineering, Development, ETL Repositories