As a developer you always define the columns and their datatypes in ‘Column’ tab in each stage.
Schemas are an alternative way to specify column definitions for the data used by parallel jobs.
As you know, by default, most parallel job stages take their meta data from the Columns tab, which contains table definitions, supplemented, where necessary by format information from the Format tab.
For some stages, you can specify a property that causes the stage to take its meta data from the specified schema file instead.
Read also: Difference between link sort and sort stage
Some stages also allow us to specify a partial schema. This allows you to define only those columns that a particular stage is processing and ignore the rest of columns.
Schema file format:
A schema contains a record (or row) definition. This describes each column (or field) that will be encountered within the record, giving column name and data type. The following is an example record schema:
name:string[255];
address:nullable string[255];
value1:int32;
value2:int32;
dateofbirth:date)
The line breaks are there for ease of reading, if you want you can omit these if you were defining a partial schema, for example
record(name:string[255];value1:int32;dateofbirth:date)
is a valid schema.
The format of each line describing a column is:
column_name:[nullability]datatype;
column_name: This is the name that identifies the column. Names must start with a letter or an underscore (_), and can contain only alphanumeric or underscore characters. The name is not case sensitive.
nullability: You can optionally specify whether a column is allowed to contain a null value, or whether this would be viewed as invalid. If the column can be null, insert the word ’nullable’. By default columns are not nullable.
You can also include ’nullable’ at record level to specify that all columns are nullable, then override the setting for individual columns by specifying ‘not nullable’. For example:
name:not nullable string[255];
value1:int32;
dateofbirth:date)
datatype: This is the datatype of the column. This uses the internal data types but not SQL data types as used on Columns tabs in stage editors.
You can include comments in schema definition files. A comment is started by a double slash //, and ended by a newline.
You should always ensure that Runtime Column Propagation is turned on if you want to use schema files to define column meta data.
Read also: How FIELD Function Works in DataStage?
The example schema corresponds to the following table definition as specified on a Columns tab of a stage editor:
What if we need to apply Unicode in varchar fields through schema files ?