Basics: Parallelism

Parallelism in DataStage jobs should be optimized rather than maximized. The degree of parallelism of a DataStage job is determined by the number of nodes defined in the configuration file (for example, four-node or eight-node). A configuration file with a larger number of nodes generates a larger number of processes, which adds processing overhead compared with a configuration file with a smaller number of nodes. Therefore, when choosing a configuration file, weigh the benefits of increased parallelism against the losses in processing efficiency (increased overhead and slower start-up time). Ideally, if the amount of data to be processed is small, use a configuration file with fewer nodes; if the data volume is large, use a configuration file with more nodes.

Partitioning

Proper partitioning of data is another aspect of DataStage job design that significantly improves overall job performance. Partitioning should be set so as to give a balanced data flow, i.e. data should be partitioned nearly equally across the nodes and data skew should be minimized.

Memory settings

In DataStage jobs that process high volumes of data, the virtual memory settings for the job should be optimized. Jobs often abort when a single Lookup stage has multiple reference links; this happens due to low temporary memory space. In such jobs, $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values.

Performance Analysis of Various Stages in DataStage

Sequential File stage

The Sequential File stage is a file stage, and the most common I/O stage in a DataStage job. It is used to read data from, or write data to, one or more flat files, and it can have only one input link or one output link. It incurs performance overhead because data must be converted before it is written to, or after it is read from, a UNIX file. To speed up reading from this stage, the number of readers per node can be increased (the default is one). When handling huge volumes of data, this stage can itself become one of the major bottlenecks, as reading from and writing to it is slow. Sequential files should be used in the following situations: when reading a flat file (fixed-width or delimited) in a UNIX environment that has been FTPed from an external system, or when UNIX operations must be performed on the file. Do not use sequential files for intermediate storage between jobs.

Data Set stage

The Data Set stage is a file stage that allows reading data from, or writing data to, a dataset. It can have a single input link or a single output link, and it can be configured to operate in sequential or parallel mode. Datasets are operating-system files which, by convention, have the suffix .ds. DataStage parallel-extender jobs use datasets to store the data being operated on in a persistent form. The data is spread across multiple nodes and is referenced by a control file; datasets are therefore not ordinary UNIX files, and no UNIX operation can be performed on them directly. Datasets are much faster than sequential files, and using them in a set of linked jobs gives good performance: they help achieve end-to-end parallelism by writing data in partitioned form and maintaining the sort order.

Lookup stage

A Lookup stage is an active stage. It is used to perform a lookup against any parallel-job stage that can output data. The Lookup stage can have a single input link, a single output link, an optional reject link and one or more reference links (a sparse lookup can have only one reference link). The optional reject link carries source records that have no corresponding entry in the lookup table(s). The Lookup stage is faster when the data volume is small, and the stage and the type of lookup should be chosen based on the functionality and the volume of data: a sparse lookup should be chosen only if the primary input data volume is small, and if the reference data volume is large, use of the Lookup stage should be avoided altogether, because all reference data is pulled into local memory.
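To illustrate the node counts discussed above, a parallel-engine configuration file for a four-node setup might look roughly like the sketch below. This is an assumption-laden example, not taken from the original: the host name and the disk and scratch-disk paths are placeholders, and only two of the four node entries are written out in full.

```
{
  node "node1"
  {
    fastname "etlhost"                      /* placeholder host name */
    pools ""
    resource disk "/data/ds" {pools ""}     /* placeholder dataset path */
    resource scratchdisk "/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds" {pools ""}
    resource scratchdisk "/scratch" {pools ""}
  }
  /* node "node3" and node "node4" follow the same pattern */
}
```

More nodes mean more processes per stage, which is exactly the overhead-versus-parallelism trade-off described above.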
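The memory-related environment variables mentioned above are normally exported into the job's environment (or set as project/job parameters) before the run. A minimal shell sketch follows; the values and the configuration-file path are illustrative assumptions to be tuned per job, and the dsjob invocation is shown commented out because the project and job names are hypothetical.

```shell
# Illustrative values only - tune to the job's actual data volume.
export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt  # hypothetical path
export APT_BUFFER_MAXIMUM_MEMORY=$((64 * 1024 * 1024))  # per-buffer memory, in bytes
export APT_MONITOR_SIZE=100000  # rows processed between monitor updates
export APT_MONITOR_TIME=10      # seconds between monitor updates

# The job could then be launched from the command line (placeholder names):
# dsjob -run -mode NORMAL MyProject MyJob
```

Raising these values gives a job with a multi-reference-link lookup more temporary buffer space, which addresses the abort scenario described above.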