CAPTURE DUPLICATES USING REMOVE DUPLICATE STAGE IN DATASTAGE

Input Records:

123,Naveen
124,Joe
125,Gary
124,Joe
126,Mahesh
127,Bob
126,Mahesh
124,Joe

We need to capture the duplicate records from the data above using the Remove Duplicate stage together with any other stages.

Expected Output:


124,Joe
126,Mahesh
124,Joe
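
Stated as plain logic, the expected output is every record whose id has already appeared earlier in the input. The following is a minimal Python sketch of that rule, not DataStage code; the function name `capture_duplicates` is hypothetical and used only for illustration.

```python
def capture_duplicates(records):
    """Return every record whose id has already appeared earlier in the input."""
    seen = set()
    duplicates = []
    for rec_id, name in records:
        if rec_id in seen:
            # Second or later occurrence of this id: capture it.
            duplicates.append((rec_id, name))
        else:
            # First occurrence: remember the id, do not capture.
            seen.add(rec_id)
    return duplicates


# The input records from this post.
records = [
    (123, "Naveen"),
    (124, "Joe"),
    (125, "Gary"),
    (124, "Joe"),
    (126, "Mahesh"),
    (127, "Bob"),
    (126, "Mahesh"),
    (124, "Joe"),
]

for rec_id, name in capture_duplicates(records):
    print(f"{rec_id},{name}")
```

The first occurrence of each id is treated as the unique record, so 124 is captured twice and 126 once, in input order.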

Solution:

We can capture the duplicates with the Remove Duplicate stage. The parallel job design looks like this:

[Image: final job design]

To test this design, I read the sample data with a Sequential File stage. The sample records in the Sequential File stage look like this:

[Image: input records with duplicates]
Here I enable the ‘Row Number Column’ property in the Sequential File stage so that each input record gets a unique number (this will be used in the next stage).

[Image: enabling the Row Number Column property in the Sequential File stage]
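
To make the effect of the row number column concrete, here is a rough Python sketch that reads the sample records and tags each one with a running number. It only imitates the idea; the helper `read_with_row_numbers` is hypothetical, and starting the count at 0 is an assumption, chosen because the row numbers quoted in the comments on this post (3, 6, 7 for the duplicates) are consistent with it.

```python
import csv
import io

# The sample input from this post, as it would sit in the flat file.
RAW = """\
123,Naveen
124,Joe
125,Gary
124,Joe
126,Mahesh
127,Bob
126,Mahesh
124,Joe
"""


def read_with_row_numbers(text, start=0):
    """Parse 'id,name' lines and prepend a unique running row number.

    start=0 is an assumption made for this sketch.
    """
    reader = csv.reader(io.StringIO(text))
    return [
        (row_no, int(rec_id), name)
        for row_no, (rec_id, name) in enumerate(reader, start=start)
    ]


for row in read_with_row_numbers(RAW):
    print(row)
```

With the row number attached, two records that share the same id and name can still be told apart, which is what the later comparison relies on.
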
Next, add a Copy stage to pass the input to two output links: one to the Remove Duplicate stage and the other to the Change Capture stage.
In the Remove Duplicate stage, we remove the duplicates based on the key column. Here I use “id” as the key column, so after this stage only unique records remain.

[Image: Remove Duplicate stage definition]
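
The data flow described so far can be sketched in Python as follows. This is an illustration under stated assumptions, not the job itself: the function names are hypothetical, the first record per id is assumed to be the one retained, row numbers are assumed to start at 0, and a plain set difference on the row number stands in for the comparison step on the second link, whose configuration is not shown here.

```python
def remove_duplicates(rows):
    """Keep the first row seen for each id (assumes 'retain first')."""
    kept = {}
    for row in rows:
        _row_no, rec_id, _name = row
        # setdefault stores the row only if this id has not been seen yet.
        kept.setdefault(rec_id, row)
    return list(kept.values())


def capture_duplicates_by_row_number(rows):
    """Return the records that the de-duplication step dropped."""
    # Link 1 of the copy: the unique records, identified by row number.
    unique_row_nos = {row_no for row_no, _, _ in remove_duplicates(rows)}
    # Link 2 of the copy: the full input; keep what link 1 did not retain.
    return [
        (rec_id, name)
        for row_no, rec_id, name in rows
        if row_no not in unique_row_nos
    ]


# Sample input with row numbers already attached (0-based, an assumption).
rows = [
    (0, 123, "Naveen"),
    (1, 124, "Joe"),
    (2, 125, "Gary"),
    (3, 124, "Joe"),
    (4, 126, "Mahesh"),
    (5, 127, "Bob"),
    (6, 126, "Mahesh"),
    (7, 124, "Joe"),
]

for rec_id, name in capture_duplicates_by_row_number(rows):
    print(f"{rec_id},{name}")
```

Comparing on the row number rather than on the id is the point of generating it earlier: the retained 124 (row 1) and the dropped 124s (rows 3 and 7) differ only in that column.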

3 Comments

Vinoth February 4, 2016

I doubt this will work. CCD will have 3, 6, 7 in both the before and after links, so drop output for delete = False won’t work here; it will give records other than 3, 6, 7 in the output. Also, CCD requires sorting on the key fields, which is missing here (all the stages are running in Auto mode). Using a Join stage will give the required results.

- Admin February 5, 2016
  
  Hey Vinoth,
  Thanks for your comment.
  The Change Capture stage won’t have 3, 6, 7 in the after dataset; it will have only 1 and 4, corresponding to those records.
  Why do you assume that the Remove Duplicate stage passes the duplicate records 3 and 7 through again? That won’t happen, right? We are removing the duplicate records based on the key, i.e. 124.
  Yes, all the stages are running in Auto mode here. I kept it that way just for the sake of the solution. As you said, the Change Capture stage requires sorting on the key fields, and we need to pass only unique records to the Change Capture stage.
  
Filippo January 28, 2020

Hi admin!

What would happen if duplicates arrived from the after link?
Thank you

Wings Of Technology
