1. What is the difference between Join/Lookup/Merge stages? How these will react if duplicates records come in input links?
2. In the join stage, if one input has col1,col2,col3, and others have col4,col5,col6 then how to join this and perform left outer join?
3. When we use Lookup Stage?
4. Can we use Hash Partition for the reference link in the Lookup stage?
5. How many types of joins supported by Merge/Join/Lookup stages?
6. Which Partition methods should we use in Merge/Lookup/Join, explain why?
7. Describe the Partition methods in DataStage?
Answer: There are a total of 9 partition methods.
Auto: DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the configuration file. This is the default partitioning method for most stages.
DB2: Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
Entire: Each file written to receives the entire data set.
Hash: The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus: The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
Random: The records are partitioned randomly, based on the output of a random number generator.
Round Robin: The records are partitioned on a round-robin basis as they enter the stage.
Same: Preserves the partitioning already in place.
Range: Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a pre-processing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
8. Which partition method is faster, Hash, or Modulus?
Answer: Hash and Modulus techniques are Key based on partition techniques. If all the key columns are numeric data types then we use the Modulus partition technique. If one or more key columns are text then we use the Hash partition technique. Modulus is fast comparing with Hash because we are using only numeric fields for calculation purposes and Mod function calculation will be done faster than the Hash function.
9. Job design/requirement is forced me to use only the join stage, in this case, how can I capture reject records?
Answer: The Join stage does not provide reject handling for unmatched records (such as in an Inner Join scenario). If un-matched rows must be captured or logged, an OUTER join operation must be performed.
In an OUTER join scenario, all rows on an outer link (eg. Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of the match on key values. During an Outer Join, when a match does not occur, the Join stage inserts NULL values into the unmatched columns. Care must be taken to change the column properties to allow NULL values before the Join.
This is most easily done by inserting a Copy stage and mapping a column from NON-NULLABLE to NULLABLE. However, if the column is NON-NULLABLE, it will default value for that unmatched rows like for CHAR it will pad the spaces till that length and for VARCHAR it will assign Blank(”) and for INTEGER it will assign zero.
A filter stage can be used to test for NULL values in unmatched columns. In some cases, it is simpler to use a Column Generator to add an ‘indicator’ column, with a constant value, to each of the outer links and test that column for the constant after you have performed the join.
This is also handy with Lookups that have multiple reference links.
10. How many reject links I can give in the Merge stage?
Answer: Either none or number of reject links equal to input update links. Let say If you have 3 input update links and one master input link, you should give exact 3 reject links for each update link or none.
11. How to remove duplicates in a table without using an inner query?
12. What is a “degenerate dimension”?
13. The student table has columns like Sub1,Sub2,Sub3,Sub4, etc, how to get the total marks of a student based on his name?
14. What is the difference between the WHERE & HAVING clause?
15. What is the order of execution of SQL query?
16. What is Change Data Capture (CDC) stage? how it works, explain it with simple job design
17. When we go for the CDC stage or why we use the CDC stage?
Answer: One benefit of the CDC is we can put change capture into a shared container for use across multiple jobs. This code re-use can save a lot of time. The container has a transformer in it with two input columns: keyfields and valuefields, and two output links: inserts and updates. The keyfields column contains the key fields concatenated into a string with a delimiter such as |. The value columns contain all fields concatenated with a delimiter. The job that uses the container needs to concatenate the input columns and pass them to the container and then split the output insert and updated rows. Row Merge and Row Splitter can be used to do this.
18. Difference between Funnel/Filter stages?
19. Input Number is:141675, sort this to 114567?
20. Explain about SCD (Slowly Changing Dimension) and its types like SCD1, SCD2 & SCD3?
You can practice the above questions and along with a few more by watching the below video:
Next Article: DataStage Interview Questions-2