123,Naveen
124,Joe
125,Gary
124,Joe
126,Mahesh
127,Bob
126,Mahesh
124,Joe
We need to capture the duplicate records from the above data using the Remove Duplicates stage together with other stages. The expected output is:
124,Joe
126,Mahesh
124,Joe
We can capture the duplicates using the Remove Duplicates stage. The design of the parallel job is described below.
To test this design, I read the sample data shown above with a Sequential File stage.
Here I enable the ‘Row Number Column’ property in the Sequential File stage to generate a unique number for each input record. This number is used later as the key in the Change Capture stage.
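For readers who want to see the effect of this step outside DataStage, here is a minimal Python sketch. It is not DataStage code: the function name add_row_numbers and the column name unique are assumptions made for illustration, and the numbering starts at 1 only for readability.

```python
import csv
import io

# Sample records from this article, one "id,name" pair per line.
SAMPLE = """123,Naveen
124,Joe
125,Gary
124,Joe
126,Mahesh
127,Bob
126,Mahesh
124,Joe
"""

def add_row_numbers(text):
    """Parse id,name records and attach a unique, increasing row number."""
    reader = csv.reader(io.StringIO(text))
    return [
        {"unique": number, "id": row[0], "name": row[1]}
        for number, row in enumerate(reader, start=1)
    ]

records = add_row_numbers(SAMPLE)
for record in records:
    print(record["unique"], record["id"], record["name"])
```

The point of the extra column is that two records with the same id and name still differ in their row number, which is what lets a later stage tell a kept record apart from a dropped one.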
In the next step, add a Copy stage to pass the input to two output links: one link to the Remove Duplicates stage and the other link to the Change Capture stage.
In the Remove Duplicates stage, we remove the duplicates based on the key column. Here I use “id” as the key column, so after this stage only one record per id remains.
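The behaviour of this step can be sketched in Python as follows. This is an illustration only, under the assumption that the first record for each id is the one retained; the function name remove_duplicates is hypothetical.

```python
# (row number, id, name) for each sample record, in input order.
rows = [
    (1, "123", "Naveen"), (2, "124", "Joe"), (3, "125", "Gary"),
    (4, "124", "Joe"), (5, "126", "Mahesh"), (6, "127", "Bob"),
    (7, "126", "Mahesh"), (8, "124", "Joe"),
]
records = [{"unique": u, "id": i, "name": n} for u, i, n in rows]

def remove_duplicates(items, key="id"):
    """Keep the first record seen for each key value, preserving input order."""
    seen = set()
    kept = []
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            kept.append(item)
    return kept

deduplicated = remove_duplicates(records)
print([r["unique"] for r in deduplicated])  # row numbers of the kept records
```

Note that the dropped records are simply gone after this step, which is why a second stage is needed to recover them.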
The Change Capture stage is the important part of this job design. Here the output of the Remove Duplicates stage acts as the After dataset and the output of the Copy stage acts as the Before dataset for the Change Capture stage.
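Select the ‘Unique’ column (the Row Number Column from the Sequential File stage) as the key column and set Change Mode to ‘Explicit Keys, All Values’. Our aim is to capture only the duplicate records. These exist in the Before dataset but not in the After dataset, so the Change Capture stage treats them as deletes; therefore set ‘Drop Output for Delete’ to ‘False’.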
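The comparison that the Change Capture stage performs here can be approximated in Python. This is a rough sketch of the idea, not the stage's actual implementation; the function name capture_deletes is hypothetical, and the After dataset is written out by hand on the assumption that the first record per id was retained.

```python
# Before dataset: every input record with its row number (Copy stage output).
rows = [
    (1, "123", "Naveen"), (2, "124", "Joe"), (3, "125", "Gary"),
    (4, "124", "Joe"), (5, "126", "Mahesh"), (6, "127", "Bob"),
    (7, "126", "Mahesh"), (8, "124", "Joe"),
]
before = [{"unique": u, "id": i, "name": n} for u, i, n in rows]

# After dataset: the records that survived duplicate removal (assumed rows).
after = [rec for rec in before if rec["unique"] in {1, 2, 3, 5, 6}]

def capture_deletes(before_rows, after_rows, key="unique"):
    """Return the Before rows whose key does not appear in the After rows."""
    after_keys = {rec[key] for rec in after_rows}
    return [rec for rec in before_rows if rec[key] not in after_keys]

duplicates = capture_deletes(before, after)
for rec in duplicates:
    print(f'{rec["id"]},{rec["name"]}')
# prints:
# 124,Joe
# 126,Mahesh
# 124,Joe
```

Comparing on the row number rather than on id is the key design choice: every id in the Before dataset also appears in the After dataset, so a comparison on id would find nothing missing.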
Link ordering is also important here, as it decides which input is the Before dataset and which is the After dataset for the Change Capture stage.
Alternatively, we can use a Join stage instead of the Change Capture stage to capture the duplicates after the Remove Duplicates stage.
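The article does not spell out how the Join stage alternative would be configured. One plausible reading, sketched in Python below, is a left outer join of the Copy output with the Remove Duplicates output on the row number, keeping only the rows that find no match. The function name left_outer_join and the column name kept_id are assumptions for illustration.

```python
rows = [
    (1, "123", "Naveen"), (2, "124", "Joe"), (3, "125", "Gary"),
    (4, "124", "Joe"), (5, "126", "Mahesh"), (6, "127", "Bob"),
    (7, "126", "Mahesh"), (8, "124", "Joe"),
]
copy_output = [{"unique": u, "id": i, "name": n} for u, i, n in rows]

# Assumed result of duplicate removal: first record per id retained.
deduplicated = [r for r in copy_output if r["unique"] in {1, 2, 3, 5, 6}]

def left_outer_join(left, right, key="unique"):
    """Left outer join on key; kept_id is None when the right side has no match."""
    right_by_key = {rec[key]: rec for rec in right}
    joined = []
    for rec in left:
        match = right_by_key.get(rec[key])
        joined.append({**rec, "kept_id": match["id"] if match else None})
    return joined

joined = left_outer_join(copy_output, deduplicated)
# Rows with no match on the right side are the duplicates.
duplicates = [(r["id"], r["name"]) for r in joined if r["kept_id"] is None]
print(duplicates)
```

In this reading, the unmatched left rows play the same role as the delete rows in the Change Capture approach.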
Compile and run the job.
You can see the duplicate records in the output.
We hope this article helped you find the solution you were looking for.