Buffering is a technique used in DataStage jobs to keep data flowing steadily and without interruption between stages, so that the job does not run into deadlocks or fork-join problems.
According to IBM, the ideal situation is one in which data flows through the stages without ever being written to disk. As with buffering in any system, upstream operators should wait for downstream operators to consume their input before producing more records. That is the intention in DataStage too.
In DataStage, buffering is inserted automatically on the links that connect the stages of a job. The buffer tries to pass data smoothly from one link to the next and to avoid pushing data onto the disk.
For example, if the downstream operator is no longer reading data from the upstream operator at a good rate, or is not reading it at all, the buffer operator slows down the data coming in from the upstream stage so that the buffer does not fill to the point where data has to be written to disk.
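The throttling described above can be pictured with a toy model. This is only a sketch of a bounded buffer with back-pressure, not DataStage's actual buffer operator; the function name simulate_backpressure and all of the rates and sizes are made up for illustration.

```python
def simulate_backpressure(produce_per_tick, consume_per_tick, capacity, ticks):
    """Toy model of a bounded buffer that throttles its producer.

    Returns (peak_fill, throttled_ticks), where throttled_ticks counts the
    ticks on which the producer had to send less than it wanted to.
    """
    fill = 0
    peak_fill = 0
    throttled_ticks = 0
    for _ in range(ticks):
        # The downstream operator drains what it can from the buffer.
        fill -= min(fill, consume_per_tick)
        # The upstream operator may only write into the remaining free space.
        free_space = capacity - fill
        accepted = min(produce_per_tick, free_space)
        if accepted < produce_per_tick:
            throttled_ticks += 1
        fill += accepted
        peak_fill = max(peak_fill, fill)
    return peak_fill, throttled_ticks


# Hypothetical numbers: upstream offers 5 records per tick, downstream takes 3,
# and the buffer holds at most 10 records.
peak, throttled = simulate_backpressure(5, 3, 10, 20)
print(peak, throttled)
```

In this model the fill level never exceeds the capacity: once the buffer is full, the producer is held back to the consumer's pace instead of overflowing, which is the same idea as the buffer operator slowing the upstream stage rather than writing to disk.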
In most projects, the default buffering policy is all you need to run your jobs well. It aims to keep data from spilling onto the disk when the buffer space fills up in some part of the job.
Buffering is controlled from the DataStage Administrator by setting a suitable value for the APT_BUFFERING_POLICY environment variable. You can also change the buffering settings for an individual stage on the Advanced tab of that stage.
By default, the buffering policy is AUTOMATIC_BUFFERING, which inserts buffers on links as and when needed to avoid deadlocks. The other two choices are FORCE_BUFFERING, which buffers all links, and NO_BUFFERING, which does not insert any buffers.
If you choose to override the default buffering behavior, you can do so through the DataStage Administrator by setting the following environment variables:
APT_BUFFER_MAXIMUM_MEMORY
This variable specifies the maximum amount of virtual memory, in bytes, that each buffer may use. The default is 3145728 (3 MB), so a job that requires 3 buffers can use up to 9 MB of buffer space.
If a buffer fills to its 3 MB limit while the job is running, the remaining data is written to disk.
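The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name total_buffer_memory is hypothetical, and only the 3145728-byte default comes from the text.

```python
DEFAULT_MAX_MEMORY_BYTES = 3145728  # default APT_BUFFER_MAXIMUM_MEMORY (3 MB)


def total_buffer_memory(buffer_count, max_memory_bytes=DEFAULT_MAX_MEMORY_BYTES):
    """Upper bound on in-memory buffer space for a job with buffer_count buffers."""
    return buffer_count * max_memory_bytes


total = total_buffer_memory(3)
print(total, total / (1024 * 1024))  # bytes, then the same figure in MB
```

For 3 buffers at the default setting this gives 9437184 bytes, which is the 9 MB mentioned above.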
APT_BUFFER_DISK_WRITE_INCREMENT
This environment variable sets the size, in bytes, of the blocks of data that the buffer operator moves to and from disk. The default is 1048576 (1 MB).
Continuing the example above, once the 3 MB buffer limit has been hit, data starts to be written to disk in blocks of 1 MB each. Changing this value has advantages as well as disadvantages.
Increasing the block size reduces the number of times the buffer operator has to access the disk, but might decrease performance when data is read or written in smaller units. Reducing the block size increases throughput for small amounts of data, but might increase the number of times the disk has to be accessed.
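To get a feel for the trade-off, the number of disk writes can be estimated for different block sizes. The sketch below assumes a hypothetical 10 MB overflow and a made-up helper named disk_blocks_needed; it only counts blocks and says nothing about real throughput.

```python
DEFAULT_WRITE_INCREMENT = 1048576  # default APT_BUFFER_DISK_WRITE_INCREMENT (1 MB)


def disk_blocks_needed(spill_bytes, block_bytes=DEFAULT_WRITE_INCREMENT):
    """Number of block-sized writes needed to move spill_bytes to disk."""
    # Integer ceiling division: a partly filled last block still costs a write.
    return (spill_bytes + block_bytes - 1) // block_bytes


spill = 10 * 1024 * 1024  # hypothetical 10 MB that no longer fits in memory
for block in (524288, 1048576, 4194304):  # 512 KB, 1 MB, 4 MB
    print(block, disk_blocks_needed(spill, block))
```

With these numbers, halving the block size doubles the number of writes (20 instead of 10), while a 4 MB block needs only 3 writes, the last of which is only partly filled.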
APT_BUFFER_FREE_RUN
This is specified as a decimal fraction of the maximum buffer size. It indicates how much of the available in-memory buffer can be consumed before the buffer starts to resist new data being written to it.
As long as the fraction of the buffer in use is below the value of this variable, data moves at normal speed; as soon as that point is crossed, the buffer starts restricting the data flow.
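The default is 0.5 (50% of the maximum memory buffer size, which in this case is 1.5 MB). Values normally range from 0.0 to 1.0. IBM's documentation also allows values above 1.0, in which case the buffer keeps storing data up to that multiple of the maximum memory buffer size before writing to disk.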
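The free-run point can be worked out from the two settings. The following sketch uses the defaults quoted in this article; the function names free_run_threshold and buffer_resists are hypothetical and simply restate the rule described above, not DataStage's internal logic.

```python
DEFAULT_MAX_MEMORY_BYTES = 3145728  # default APT_BUFFER_MAXIMUM_MEMORY (3 MB)
DEFAULT_FREE_RUN = 0.5              # default APT_BUFFER_FREE_RUN


def free_run_threshold(max_memory_bytes=DEFAULT_MAX_MEMORY_BYTES,
                       free_run=DEFAULT_FREE_RUN):
    """Bytes the buffer accepts freely before it starts to resist."""
    return int(max_memory_bytes * free_run)


def buffer_resists(bytes_in_buffer,
                   max_memory_bytes=DEFAULT_MAX_MEMORY_BYTES,
                   free_run=DEFAULT_FREE_RUN):
    """True once the data held in the buffer has crossed the free-run point."""
    return bytes_in_buffer > free_run_threshold(max_memory_bytes, free_run)


print(free_run_threshold())          # threshold in bytes at the defaults
print(buffer_resists(1024 * 1024))   # 1 MB held in the buffer
print(buffer_resists(2 * 1024 * 1024))  # 2 MB held in the buffer
```

At the defaults the threshold is 1572864 bytes (1.5 MB), so a buffer holding 1 MB still accepts data freely, while one holding 2 MB has started to resist.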
Similar options are also available on the Advanced tab of the stage editor for customizing the buffering on the link of your choice. I hope this gives you a better understanding of the buffering options in DataStage, the meaning of each variable, and its effect on job performance.