IBM DataStage and QualityStage Stages

DataStage and QualityStage stages are grouped into the following logical sections:

  • General objects
  • Data Quality Stages
  • Database connectors
  • Development and Debug stages
  • File stages
  • Processing stages
  • Real Time stages
  • Restructure Stages
  • Sequence activities

Please refer to the list below for a description of the stages used in DataStage and QualityStage.
The stages are listed in order of importance and frequency of use in real-life deployments (and on certification exams). The most widely used stages are marked in bold or link to a subpage with a detailed description and examples.

DataStage and QualityStage parallel stages and activities

General elements

  • Link indicates a flow of the data. There are three main types of links in DataStage: stream, reference and reject.
  • Container (can be local or shared) – the main purpose of containers is to visually simplify a complex DataStage job design and keep the design easy to understand.
  • Annotation is used for adding floating DataStage job notes and descriptions on a job canvas. Annotations provide a great way to document the ETL process and help understand what a given job does.
  • Description Annotation shows the contents of a job description field. One description annotation is allowed in a DataStage job.

Development/Debug stages

  • Row generator produces a set of test data which fits the specified metadata (can be random or cycled through a specified list of values). Useful for testing and development.
  • Column generator adds one or more columns to the incoming flow and generates test data for these columns.
  • Peek stage prints record column values to the job log which can be viewed in Director. It can have a single input link and multiple output links.
  • Sample stage samples an input data set. It operates in two modes: percent mode and period mode (see the sketch after this list).
  • Head selects the first N rows from each partition of an input data set and copies them to an output data set.
  • Tail is similar to the Head stage. It selects the last N rows from each partition.
  • Write Range Map writes a data set in a form usable by the range partitioning method.
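
To make the Sample stage's two modes concrete, below is a minimal Python sketch of what percent mode (keep each row with a given probability) and period mode (keep every N-th row) conceptually do. The function names and the in-memory list standing in for a data set are illustrative assumptions, not DataStage APIs.

```python
import random

def sample_percent(rows, percent, seed=42):
    """Percent mode: keep each row with the given probability (0-100)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.uniform(0, 100) < percent]

def sample_period(rows, period):
    """Period mode: keep every N-th row (1-based), e.g. period=5 keeps rows 5, 10, 15..."""
    return [row for i, row in enumerate(rows, start=1) if i % period == 0]

rows = list(range(1, 21))        # stand-in for an input data set
print(sample_percent(rows, 25))  # roughly a quarter of the rows
print(sample_period(rows, 5))    # rows 5, 10, 15, 20
```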

Processing stages

  • Aggregator joins data vertically by grouping incoming data stream and calculating summaries (sum, count, min, max, variance, etc.) for each group. The data can be grouped using two methods: hash table or pre-sort.
  • Copy – copies input data (a single stream) to one or more output data flows
  • FTP stage uses FTP protocol to transfer data to a remote machine
  • Filter filters out records that do not meet specified requirements.
  • Funnel combines multiple streams into one.
  • Join combines two or more inputs according to the values of key column(s). Similar in concept to a relational DBMS SQL join (with the ability to perform inner, left, right and full outer joins). It can have one left and multiple right inputs (all need to be sorted) and produces a single output stream (no reject link).
  • Lookup combines two or more inputs according to the values of key column(s). The Lookup stage can have one source and multiple lookup tables. Records do not need to be sorted; it produces a single output stream and a reject link (see the comparison sketch after this list).
  • Merge combines one master input with multiple update inputs according to the values of key column(s). All inputs need to be sorted, and unmatched secondary entries can be captured in multiple reject links.
  • Modify stage alters the record schema of its input dataset. Useful for renaming columns, non-default data type conversions and null handling
  • Remove duplicates stage needs a single sorted data set as input. It removes all duplicate records according to a specification and writes to a single output
  • Slowly Changing Dimension automates the process of updating dimension tables, where the data changes in time. It supports SCD type 1 and SCD type 2.
  • Sort sorts the input data set by one or more key columns.
  • Transformer stage handles extracted data, performs data validation, conversions and lookups.
  • Change Capture – captures before and after state of two input data sets and outputs a single data set whose records represent the changes made.
  • Change Apply – applies the change operations to a before data set to compute an after data set. It gets data from a Change Capture stage
  • Difference stage performs a record-by-record comparison of two input data sets and outputs a single data set whose records represent the difference between them. Similar to Change Capture stage.
  • Checksum – generates a checksum from the specified columns in a row and adds it to the stream. Used to determine whether records differ (a minimal sketch follows this list).
  • Compare performs a column-by-column comparison of records in two presorted input data sets. It can have two input links and one output link.
  • Encode encodes data with an encoding command, such as gzip.
  • Decode decodes a data set previously encoded with the Encode Stage.
  • External Filter permits specifying an operating system command that acts as a filter on the processed data
  • Generic stage allows users to call an OSH operator from within a DataStage job, with options as required.
  • Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row to a single column in multiple output rows. Pivoting data results in a dataset with fewer columns but more rows.
  • Surrogate Key Generator generates surrogate keys for a column and manages the key source.
  • Switch stage assigns each input row to an output link based on the value of a selector field. Provides a similar concept to the switch statement in most programming languages.
  • Compress – packs a data set using the UNIX gzip utility (or the compress command).
  • Expand extracts a previously compressed data set back into raw binary data.
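
The Join/Lookup/Merge distinction above comes down to inputs, sorting and reject handling. Below is a minimal Python comparison sketch contrasting a merge-style inner join over sorted inputs (no reject link) with an in-memory lookup that feeds a reject stream. The lists of dictionaries standing in for data sets are illustrative assumptions; this is not how the parallel engine is implemented.

```python
def sorted_inner_join(left, right, key):
    """Join-style: both inputs are sorted on the key; unmatched rows are silently dropped."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for l in left:
        while j < len(right) and right[j][key] < l[key]:
            j += 1                     # advance to the first candidate match
        k = j
        while k < len(right) and right[k][key] == l[key]:
            out.append({**l, **right[k]})
            k += 1
    return out

def lookup(stream, reference, key):
    """Lookup-style: the reference table is held in memory; unmatched rows are rejected."""
    ref = {r[key]: r for r in reference}
    matched, rejected = [], []
    for row in stream:
        if row[key] in ref:
            matched.append({**row, **ref[row[key]]})
        else:
            rejected.append(row)
    return matched, rejected

orders = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}, {"cust": 9, "amt": 5}]
custs = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]
print(sorted_inner_join(orders, custs, "cust"))  # cust 9 is dropped
print(lookup(orders, custs, "cust"))             # cust 9 lands on the reject stream
```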
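
As a companion to the Checksum stage description, here is a minimal sketch of computing a per-row checksum over selected columns with Python's standard hashlib. The column list, the "|" separator and the MD5 digest are assumptions made for illustration, not the stage's documented algorithm.

```python
import hashlib

def add_checksum(row, columns):
    """Concatenate the selected column values and append a checksum column."""
    payload = "|".join(str(row[c]) for c in columns)
    row["checksum"] = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return row

before = add_checksum({"id": 1, "name": "Ann"}, ["id", "name"])
after = add_checksum({"id": 1, "name": "Anne"}, ["id", "name"])
print(before["checksum"] == after["checksum"])  # False: the records differ
```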

File stage types

  • Sequential file is used to read data from or write data to one or more flat (sequential) files.
  • Data Set stage allows users to read data from or write data to a dataset. Datasets are operating system files, each of which has a control file (.ds extension by default) and one or more data files (unreadable by other applications)
  • File Set stage allows users to read data from or write data to a fileset. Filesets are operating system files, each of which has a control file (.fs extension) and data files. Unlike datasets, filesets preserve formatting and are readable by other applications.
  • Complex flat file allows reading from complex file structures on a mainframe machine, such as MVS data sets, header and trailer structured files, files that contain multiple record types, QSAM and VSAM files.
  • External Source – permits reading data that is output from multiple source programs.
  • External Target – permits writing data to one or more programs.
  • Lookup File Set is similar to the File Set stage. It is a partitioned hashed file which can be used for lookups.

Database stages

  • Oracle Enterprise allows reading data from and writing data to an Oracle database (database versions 9.x to 10g are supported).
  • ODBC Enterprise permits reading data from and writing data to a database defined as an ODBC source. In most cases it is used for processing data from or to Microsoft Access databases and Microsoft Excel spreadsheets.
  • DB2/UDB Enterprise permits reading data from and writing data to a DB2 database.
  • Teradata permits reading data from and writing data to a Teradata data warehouse. Three Teradata stages are available: Teradata connector, Teradata Enterprise and Teradata Multiload
  • SQLServer Enterprise permits reading data from and writing data to Microsoft SQL Server 2005 and 2008 databases.
  • Sybase permits reading data from and writing data to Sybase databases.
  • Stored procedure stage supports Oracle, DB2, Sybase, Teradata and Microsoft SQL Server. The Stored Procedure stage can be used as a source (returns a rowset), as a target (passes a row to a stored procedure to write) or as a transform (invokes procedure processing within the database); a hedged invocation sketch follows this list.
  • MS OLEDB helps retrieve information from any type of information repository, such as a relational source, an ISAM file, a personal database, or a spreadsheet.
  • Dynamic Relational Stage (Dynamic DBMS, DRS stage) is used for reading from or writing to a number of different supported relational DB engines using native interfaces, such as Oracle, Microsoft SQL Server, DB2, Informix and Sybase.
  • Informix (CLI or Load)
  • DB2 UDB (API or Load)
  • Classic federation
  • RedBrick Load
  • Netezza Enterprise
  • iWay Enterprise
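
To illustrate the Stored Procedure stage's source and target roles outside DataStage, the sketch below calls procedures through plain ODBC using the standard {CALL ...} escape sequence. The DSN, credentials, procedure names and parameters are hypothetical; in a real job they would be configured in the stage properties rather than coded.

```python
import pyodbc  # assumes an ODBC driver and a DSN named "DWH" are configured

conn = pyodbc.connect("DSN=DWH;UID=etl;PWD=secret")  # hypothetical DSN and credentials
cur = conn.cursor()

# Source-style call: the procedure returns a rowset
cur.execute("{CALL get_customers (?)}", ("ACTIVE",))
for row in cur.fetchall():
    print(row)

# Target-style call: pass one row per execution to a writing procedure
cur.execute("{CALL insert_customer (?, ?)}", (42, "Ann"))
conn.commit()
conn.close()
```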

Real Time stages

  • XML Input stage makes it possible to transform hierarchical XML data into flat relational data sets (see the flattening sketch after this list).
  • XML Output writes tabular data (relational tables, sequential files or any DataStage data streams) to XML structures.
  • XML Transformer converts XML documents using an XSLT stylesheet.
  • WebSphere MQ stages provide a collection of connectivity options to access IBM WebSphere MQ enterprise messaging systems. There are two MQ stage types available in DataStage and QualityStage: WebSphere MQ connector and WebSphere MQ plug-in stage.
  • Web services client
  • Web services transformer
  • Java client stage can be used as a source stage, as a target and as a lookup. The java package consists of three public classes: com.ascentialsoftware.jds.Column, com.ascentialsoftware.jds.Row, com.ascentialsoftware.jds.Stage
  • Java transformer stage supports three links: input, output and reject.
  • WISD Input – Information Services Input stage
  • WISD Output – Information Services Output stage
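
The XML Input idea of flattening hierarchical XML into relational rows can be sketched with Python's standard library; the sample document and element names below are made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<orders>
  <order id="1"><item sku="A" qty="2"/><item sku="B" qty="1"/></order>
  <order id="2"><item sku="C" qty="5"/></order>
</orders>"""

root = ET.fromstring(doc)
rows = [
    {"order_id": order.get("id"), "sku": item.get("sku"), "qty": int(item.get("qty"))}
    for order in root.findall("order")
    for item in order.findall("item")
]
print(rows)  # one flat row per <item>, with the parent order id repeated
```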

Restructure stages

  • Column export stage exports data from a number of columns of different data types into a single column of data type ustring, string, or binary. It can have one input link, one output link and a reject link.
  • Column import is complementary to the Column Export stage. It is typically used to divide data arriving in a single column into multiple columns (see the sketch after this list).
  • Combine records stage combines rows which have identical keys into vectors of subrecords.
  • Make subrecord combines specified input vectors into a vector of subrecords whose columns have the same names and data types as the original vectors.
  • Make vector joins specified input columns into a vector of columns
  • Promote subrecord – promotes input subrecord columns to top-level columns
  • Split subrecord – separates an input subrecord field into a set of top-level vector columns
  • Split vector promotes the elements of a fixed-length vector to a set of top-level columns
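
A minimal sketch of the Column Import idea follows: one delimited input column divided into several named output columns. The delimiter, the target column names and the exception standing in for the reject link are assumptions made for illustration.

```python
def column_import(row, source_col, targets, delimiter="|"):
    """Split one delimited column into several named columns."""
    parts = row.pop(source_col).split(delimiter)
    if len(parts) != len(targets):
        # the stage would send such rows to its reject link
        raise ValueError(f"row does not fit the import schema: {parts}")
    row.update(dict(zip(targets, parts)))
    return row

print(column_import({"raw": "1|Ann|Oslo"}, "raw", ["id", "name", "city"]))
# {'id': '1', 'name': 'Ann', 'city': 'Oslo'}
```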

Data quality QualityStage stages

  • Investigate stage analyzes data content of specified columns of each record from the source file. Provides character and word investigation methods.
  • Match frequency stage takes input from a file, database or processing stages and generates a frequency distribution report (a minimal illustration follows this list).
  • MNS – multinational address standardization.
  • QualityStage Legacy
  • Reference Match
  • Standardize
  • Survive
  • Unduplicate Match
  • WAVES – worldwide address verification and enhancement system.
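
A frequency distribution of the kind the Match Frequency stage produces can be sketched with Python's collections.Counter; the sample values are made up. In probabilistic matching, frequent values (SMITH) carry less evidential weight than rare ones (NGUYEN).

```python
from collections import Counter

surnames = ["SMITH", "SMITH", "JONES", "SMITH", "NGUYEN", "JONES"]
freq = Counter(surnames)
for value, count in freq.most_common():
    print(f"{value}: {count} ({count / len(surnames):.0%})")
# SMITH: 3 (50%), JONES: 2 (33%), NGUYEN: 1 (17%)
```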

Sequence activity stage types

  • Job Activity specifies a DataStage server or parallel job to execute.
  • Notification Activity – used for sending emails to user defined recipients from within DataStage
  • Sequencer used for synchronization of a control flow of multiple activities in a job sequence.
  • Terminator Activity permits shutting down the whole sequence once a certain situation occurs.
  • Wait for file Activity – waits for a specific file to appear or disappear and launches the processing (see the polling sketch after this list).
  • EndLoop Activity
  • Exception Handler
  • Execute Command
  • Nested Condition
  • Routine Activity
  • StartLoop Activity
  • UserVariables Activity
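
To make the Wait for file Activity concrete, here is a hedged Python sketch of the underlying polling idea; the path, timeout and poll interval are hypothetical, and in practice the activity is configured in the job sequence rather than coded.

```python
import os
import time

def wait_for_file(path, timeout_s=300, poll_s=5):
    """Poll until the file appears, then return True; give up after the timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    return False

if wait_for_file("/data/incoming/trigger.dat"):  # hypothetical trigger file
    print("file arrived - downstream processing can start")
else:
    print("timed out - terminate the sequence")
```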