Input Records:
123,Naveen 124,Joe 125,Gary 124,Joe 126,Mahesh 127,Bob 126,Mahesh 124,Joe
We need to capture the duplicates from above data using Remove Duplicate stage along with any other stage.
Expected Output:
124,Joe 126,Mahesh 124,Joe
Solution:
We can capture duplicates using Remove Duplicate Stage, design of the parallel job would be like below.
To test this design, I use below sample data and read that data with sequential file stage. My sample records in sequential stage would be like:
Here I enable ‘Row Number Column’ property in sequential stage in order to generate unique number(this will be used in next stage) for each input record.
In next step, take a copy stage to pass the input into two output links. One link to Remove Duplicate stage side and another link to Change Capture stage side.
In Remove Duplicate stage, we will remove the duplicates based on key column. Here I use “id” as the key column to remove the duplicates. So after this stage we will have only unique records.
Also read:
Capture unique records (have no duplicates) in DataStage
Capture duplicate records in DataStage
Change Capture stage is the important part in this job design. Here Remove Duplicates stage acts as After dataset and Copy stage acts as Before dataset for Change Capture stage.
Select the ‘Unique’ (Row Number Column from sequential stage) column as Key column and Change Mode as ‘Explicit Keys, All Values’. Our aim is to capture only duplicate records, so make ‘Drop Output for Delete’ as ‘False’.
However we can use join stage instead of change capture stage to capture duplicates after Remove Duplicate stage.
Here link ordering also important as that decides Before and After dataset for Change Capture stage.
Compile job and run the job.
You can see the duplicate records in the output.
We hope this article helped to find the solution that you are looking.
If you liked this article, then please subscribe to our YouTube Channel for Wings Of Technology video tutorials. You can also find us on Twitter and Facebook.
I doubt this will work, CCD will have 3,6,7 in both before and after link, So drop o/p for delete = False won’t work here. This will give records other than 3,6,7 in o/p. Also CCD requires Sorting on Key fields which is missing here (All the stages are running in Auto Mode). Using Join stage will give the required results.
Hey Vinoth,
Thanks for your comment.
Change capture stage won’t have 3,6,7 in after dataset and it will have only 1,4 corresponding to that records.
How did you assume that remove duplicate stage passes again duplicate records 3 and 7? It won’t happen right? we are removing the duplicates records based on key i.e. 124.
Yes, all the stages are running in Auto Mode here. I kept it just for the sake of solution and as you said Change Capture stage requires sorting on Key fields and we need to pass only unique records to Change Capture stage.
Hi admin!
what would happen if duplicates arrived from the after link?
thank you