CAPTURE DUPLICATES USING REMOVE DUPLICATE STAGE IN DATASTAGE

Capture-Duplicates-Using-Remove-Duplicate-Stage

Input Records:

123,Naveen
124,Joe
125,Gary
124,Joe
126,Mahesh
127,Bob
126,Mahesh
124,Joe

We need to capture the duplicates from above data using Remove Duplicate stage along with any other stage.

Expected Output:


124,Joe
126,Mahesh
124,Joe

Solution:

We can capture duplicates using Remove Duplicate Stage, design of the parallel  job would be like below.

Final-output-design-would-be

To test this design, I use below sample data and read that data with sequential file stage. My sample records in sequential stage would be like:

Input-records-with-duplicates
Here I enable ‘Row Number Column’ property in sequential stage in order to generate unique number(this will be used in next stage) for each input record.

enable-row-number-column-in-sequential-file-stage
In next step, take a copy stage to pass the input into two output links. One link to Remove Duplicate stage side and another link to Change Capture stage side.
In Remove Duplicate stage, we will remove the duplicates based on key column. Here I use “id” as the key column to remove the duplicates. So after this stage we will have only unique records.

define-remove-duplicate-stage

Also read:

Capture unique records (have no duplicates) in DataStage

Capture duplicate records in DataStage

Change Capture stage is the important part in this job design. Here Remove Duplicates stage acts as After dataset and Copy stage acts as Before dataset for Change Capture stage.
Select the ‘Unique’ (Row Number Column from sequential stage) column as Key column and Change Mode as ‘Explicit Keys, All Values’. Our aim is to capture only duplicate records, so make ‘Drop Output for Delete’ as ‘False’.

define-change-capture-stage-properties

However we can use join stage instead of change capture stage to capture duplicates after Remove Duplicate stage.

Here link ordering also important as that decides Before and After dataset for Change Capture stage.

link-ordering-for-change-capture-stage
Compile job and run the job.

run-final-job-to-capture-duplicates
You can see the duplicate records in the output.

final-duplicate-records-output

We hope this article helped to find the solution that you are looking.

If you liked this article, then please subscribe to our YouTube Channel for Wings Of Technology video tutorials. You can also find us on Twitter and Facebook.

Comments

comments

3 Comments

  1. Vinoth February 4, 2016
    • Admin February 5, 2016
  2. Filippo January 28, 2020

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: