Consolidating multiple repositories
Throughout my years of consulting on Data Services ETL projects I have seen many different project environments. There is no single “right” environment. If your environment doesn’t cause you more work than it’s worth then it’s the right one for you. But if you’re working in an environment that has a lot of overhead that creates a lot of extra work then perhaps it is time to clean it up.
In this blog post I’m going to focus on the situation where a projects production environment uses multiple repositories. My experience is that at some point in time a project decided to use multiple repositories and, oftentimes, nobody knows why, it just happened. In my prior blog post I discussed the downsides of using multiple repositories. In this post I’ll show how to consolidate multiple repositories into a single repository.
There are two general consolidation methods:
- Blind Import.Export the current repositories to ATL files and then import those ATL files into a single repository.
- Selective Migration.Export individual objects or groups of objects to an ATL file and then import those ATL files into a single repository.
The Blind Import method is the easiest and least time-consuming. It is also most likely to cause problems if there are object collisions. An object collision occurs when two or more repositories have an object with the same name. This could be any object; a Job, Dataflow, Datastore or even a Table. If the definition of the object is different in any of the repositories then you have a problem. The last version of the object that is imported is the one you end up with. Which definition is the correct one? If developers didn’t keep the objects in sync then you’ll have to pick a winner after comparing the code. This can be a very tedious process.
I had one client that started with nine repositories. Every repository shared a set of “common” objects that were used in each Job. The operations group was tired of having to maintain so many repositories and one day combined them all into a single repository. Miraculously, nothing blew up. Not every project would be so lucky. I generally do not recommend this method and when it is used there should be extensive testing before turning the consolidated repository loose on production. One object out of sync could cause problems in your data.
The Selective Migration method could be performed a number of ways. The objects could be checked into a Central Repository from each repository and then brought into the consolidated repository from the Central Repository. Or, as stated earlier, the objects could be selectively exported to ATL files from the individual repositories and then imported into the consolidated repository. The net result is the same. The Central Repository makes it easier to compare objects. However, the compare tool in Designer often finds false positives such as extra carriage returns and properties differences that are insignificant.
The initial challenge is identifying the commonly named objects. For one client, here is how I identified these objects:
1. In a scratch schema, create a table called CONSOLIDATED_AL_LANG.
CREATE TABLE CONSOLIDATED_AL_LANG
2. Populate the table from each of the repositories that are candidates for consolidation.
INSERT INTO CONSOLIDATED_AL_LANG
‘<source schema name>’ as REPO_NAME
FROM <source schema name> .AL_LANG
WHERE OBJECT_TYPE NOT IN (73); — ignore data cleansing objects
3. Run the following query. (This query is specific to Oracle.)
LISTAGG(REPO_NAME, ‘, ‘) WITHIN GROUP (ORDER BY REPO_NAME) AS REPO_NAME
FROM (SELECT CONSOLIDATED_AL_LANG.NAME,
COUNT(*) AS DUPLICATE_COUNT
GROUP BY CONSOLIDATED_AL_LANG.NAME,
HAVING COUNT(*) > 1) DUP_OBJ
INNER JOIN CONSOLIDATED_AL_LANG
ON (CONSOLIDATED_AL_LANG.NAME = DUP_OBJ.NAME AND
CONSOLIDATED_AL_LANG.OBJECT_TYPE = DUP_OBJ.OBJECT_TYPE)
GROUP BY DUP_OBJ.NAME;
4. Review the list. The second column will show the repositories that each object exists in. (The list will contain a few system objects such as “di_job_al_mach_info” and “subvar_store”. You can safely ignore these objects.) The list will not only provide the scope of how many object collisions you have to deal with but the specific objects that need attention.
The list created above is not complete. It only contains Jobs, Workflows, Dataflows, Formats, Datastores, System Configurations and any other object type stored in AL_LANG. You may also want to look in AL_SCHEMA (database tables), AL_FUNCINFO (imported database functions/procedures and custom functions) and AL_PROJECTS (projects).
How will you decide which version of an object to put in the consolidated repository? Usually this comes down to inspecting the objects themselves to ensure that they are identical. If they are defined the same then it doesn’t matter which version is used. If there are differences then you’ll have to pick a version based on inspection. Hopefully your developers provided comments within each object that will help you decide. If you have a Central Repository for each of the individual repositories then you might want to use the version that was last checked into the Central by date. I would still perform a visual inspection even if this is the case.
Consolidating repositories can be an overwhelming and time-consuming endeavor. But the benefit of working with a single repository is often worth the effort. The techniques used here can help reduce the complexity and risk as well as the time involved.
Learn more about our digital engineering capabilities.Tags: Cloud Engineering, Dataflows, ETL Data Services, Repository Consolidation, Workflows