Capture duplicate records in DataStage – Job Design – 1

Input file:
Sno,name,salary,rank
1,Naveen,10000,2
2,Joe,20000,2
3,David,30000,1
4,Richard,30000,2
5,Obs,30000,1
1,Naveen,10000,2
1,Naveen,10000,2
3,David,30000,1

Output file:
1,Naveen,10000,2
1,Naveen,10000,2
3,David,30000,1

The final job design would be:

The first Sequential File stage is used to read the input file; the duplicate records are highlighted in the Sequential File stage:

In the Sort stage, we sort the records on the key column 'Sno' and set the option 'Create Key Change Column' to 'True'.
After enabling this option, a column named 'KeyChange' is added to the output.
For the first occurrence of a particular key value it assigns 1, and for subsequent occurrences of the same value it assigns 0.
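To see the key-change logic concretely, here is a minimal Python sketch (only an illustration of the equivalent logic, not part of the DataStage job itself) that sorts the sample rows on Sno and marks the first occurrence of each key with 1 and every repeat with 0, just as the Sort stage does:

# Illustrative equivalent of the Sort stage's 'Create Key Change Column' option.
# Sample input rows from this post: (Sno, name, salary, rank).
rows = [
    (1, "Naveen", 10000, 2),
    (2, "Joe", 20000, 2),
    (3, "David", 30000, 1),
    (4, "Richard", 30000, 2),
    (5, "Obs", 30000, 1),
    (1, "Naveen", 10000, 2),
    (1, "Naveen", 10000, 2),
    (3, "David", 30000, 1),
]

# Sort on the key column 'Sno' (index 0), as the Sort stage does.
sorted_rows = sorted(rows, key=lambda r: r[0])

# KeyChange = 1 for the first occurrence of a key, 0 for every repeat.
marked = []
previous_key = None
for row in sorted_rows:
    key_change = 1 if row[0] != previous_key else 0
    marked.append(row + (key_change,))
    previous_key = row[0]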

Now you can use a Filter stage with a condition like KeyChange = 0 (or KeyChange < 1) to capture the duplicate records.
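Continuing the same sketch, the Filter stage condition KeyChange = 0 corresponds to keeping only the rows marked as repeats:

# Equivalent of the Filter stage condition KeyChange = 0:
# keep only the repeated (duplicate) rows and drop the KeyChange column.
duplicates = [row[:-1] for row in marked if row[-1] == 0]

for row in duplicates:
    print(",".join(str(field) for field in row))

# Prints, matching the output file above:
# 1,Naveen,10000,2
# 1,Naveen,10000,2
# 3,David,30000,1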

After running the job:

Three records are captured by the Filter stage because their KeyChange value is '0', so the final output is:

You can watch the video tutorial of this post here:
