Netezza / PureData – How to add comments on a field

Netezza / PureData Table Documentation Practices

The ‘COMMENT ON COLUMN’ statement provides the same self-documentation capability as ‘COMMENT ON TABLE’, but extends it to the individual column level.  This provides an opportunity to describe the purpose, business meaning, and/or source of a field to other developers and users.  The comment code is part of the DDL and can be migrated with the table structure DDL.  The statement can be run independently, or, working with Aginity for PureData System for Analytics, the statements can be run as a group, with the table DDL, using the ‘Execute as a Single Batch’ (Ctrl+F5) command.

Basic ‘COMMENT ON field’ Syntax

  • The basic syntax to add a comment to a column is:

COMMENT ON COLUMN <<Schema.TableName.ColumnName>> IS '<<Descriptive Comment>>';

Example ‘COMMENT ON Field’ Syntax

  • This is example syntax, which would need to be changed and applied to each column field:

COMMENT ON COLUMN time_dim.time_srky IS 'time_srky is the primary key and is a surrogate key derived from the date business/natural key';
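
Where several columns need documenting, the comment statements can be stacked with the table DDL and executed together (e.g. as a single batch in Aginity, as noted above). Here is a minimal sketch, assuming a hypothetical time_dim table that also has a cal_date column:

COMMENT ON TABLE time_dim IS 'Calendar dimension; one row per calendar day';
COMMENT ON COLUMN time_dim.time_srky IS 'time_srky is the primary key and is a surrogate key derived from the date business/natural key';
COMMENT ON COLUMN time_dim.cal_date IS 'cal_date is the date business/natural key';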

 


 

What is the URL for InfoSphere Information Analyzer


This is one of those questions which usually comes up during new installs.  There are two approaches to accessing the InfoSphere Information Analyzer thin client: the first is via the InfoSphere launchpad page, and the second is to go directly to the Information Analyzer login page.  Both URLs are provided below.

InfoSphere LaunchPad URL

https://<<Host_Server>>:9443/ibm/iis/launchpad/

Information Analyzer URL 

https://<<Host_Server>>:9443/ibm/iis/dq/da/login.jsp

 

 


InfoSphere DataStage – How to Improve Sequential File Performance Using Parallel Environment Variables

APT_FILE_EXPORT_BUFFER_SIZE and APT_FILE_IMPORT_BUFFER_SIZE in DataStage Administrator

While extensive use of sequential files is not a best practice, sometimes there is no way around it, due to legacy systems and/or existing processes. However, I have recently encountered a number of customers who are seeing significant performance issues with sequential-file-intensive processes. Sometimes it is the job design, but often, when you look at the project configuration, they still have the default values. This is a quick and easy thing to check and adjust for a quick performance win, if it has not already been done.

These are delivered variables, but they should seriously be considered for adjustment in nearly all DataStage ETL projects. The adjustment must be based on the amount of available memory, the volume of workload that is sequential-file intensive, and the environment you are working in. Some experiential adjustment may be required, but I have provided a few recommendations below.

 

Environment Variable Properties

 

  • APT_FILE_EXPORT_BUFFER_SIZE
    • Category: Parallel > Operator Specific
    • Type: String
    • Prompt: Sequential write buffer size
    • Size: Adjustable in 8 KB units.  Recommended values for Dev: 2048; Test & Prod: 4096.
    • Default Value: 128
  • APT_FILE_IMPORT_BUFFER_SIZE
    • Category: Parallel > Operator Specific
    • Type: String
    • Prompt: Sequential read buffer size
    • Size: Adjustable in 8 KB units.  Recommended values for Dev: 2048; Test & Prod: 4096.
    • Default Value: 128
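
If you prefer to script the change rather than set the values through the Administrator client, the same variables can also be exported engine-wide in the $DSHOME/dsenv file, which the engine sources at startup. This is a minimal sketch, assuming test/production sizing; job- and project-level settings, where defined, take precedence:

# $DSHOME/dsenv (values are in 8 KB units)
APT_FILE_EXPORT_BUFFER_SIZE=4096; export APT_FILE_EXPORT_BUFFER_SIZE
APT_FILE_IMPORT_BUFFER_SIZE=4096; export APT_FILE_IMPORT_BUFFER_SIZE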


 

 

InfoSphere DataStage – How to calculate age in a transformer


Occasionally, there is a need to calculate the number of years between two dates for any number of reasons: for example, the age of a person, the age of an asset, or the age of an event.  So, having recently had to think about how to do this in a DataStage transformer, rather than in SQL, I thought it might be good to document a couple of approaches which can provide the age.  This code does it at the year level; however, if you need the decimal digits or other handling, then the rounding within the DecimalToDecimal function can be changed accordingly.

Age Calculation using Julian Date

DecimalToDecimal((JulianDayFromDate(Lnk_In_Tfm.PROCESSING_DT) - JulianDayFromDate(<<Input Date (e.g. Date of Birth)>>)) / 365.25, 'trunc_zero')

Age Calculation using Julian Date with Null Handling

If a date can be missing from your source input data, then null handling is recommended to prevent job failure.  This code uses 1901-01-01 as the null replacement value, but it can be any date your business requirement stipulates.

DecimalToDecimal((JulianDayFromDate(Lnk_In_Tfm.PROCESSING_DT) - JulianDayFromDate(NullToValue(<<Input Date (e.g. Date of Birth)>>, StringToDate('1901-01-01', "%yyyy-%mm-%dd")))) / 365.25, 'trunc_zero')

Calculate Age Using DaysSinceFromDate

DecimalToDecimal(DaysSinceFromDate(<<Reference Date (e.g. Processing Date)>>, <<Input Date (e.g. Date of Birth)>>) / 365.25, 'trunc_zero')
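
As a quick sanity check of the arithmetic (an illustrative example, not taken from a real job): for a date of birth of 1980-06-15 and a processing date of 2020-06-14, the day difference is 14,609 days; 14,609 / 365.25 ≈ 39.997, which 'trunc_zero' truncates to an age of 39.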

Calculate Age Using DaysSinceFromDate with Null Handling

Here is a second example of null handling being applied to the input data.

DecimalToDecimal(DaysSinceFromDate(<<Reference Date (e.g. Processing Date)>>, NullToValue(<<Input Date (e.g. Date of Birth)>>, StringToDate('1901-01-01', "%yyyy-%mm-%dd"))) / 365.25, 'trunc_zero')

 

 

 

 

Netezza Connector Stage, Table name required warning for User-defined SQL Write mode

Recently, while working at a customer site, I encountered an anomaly in the Netezza Connector stage: when choosing the ‘User-defined SQL’ write mode, the ‘Table name’ property displays a caution/warning even though a table name should not be required.  If you are using a user-defined SQL statement and/or have parameterized your SQL scripts to make the job reusable, each SQL statement and/or SQL script would have its own schema and table name being passed in.  After some investigation, a workaround was found, which allows you both to populate the table name and to leverage different schema and table names within your SQL statements and/or scripts.

Table Name, User-defined SQL, Warning

You will notice, in the screenshot below, that the ‘User-defined SQL’ write mode property has been chosen, a parameter has been placed in the ‘User-defined SQL’ property, and the ‘Read user-defined SQL from a file’ property has been set to ‘Yes’.  However, a yellow triangle displays on the ‘Table name’ property, marking it as a required item.  This also occurs when placing SQL statements in the ‘User-defined SQL’ property, whether reading from a file or not.

Netezza Connector User-Defined SQL, Table Name Required, Warning

Table Name, User-defined SQL, Warning Workaround

After some experimentation, the workaround is straightforward enough.  Basically, give the ‘Table name’ property something to read successfully, so the stage can move on to the user-defined SQL statement and/or user-defined SQL file script which the process actually needs to execute. In the screenshot below, the SYSTEM.DEFINITION_SCHEMA._V_DUAL view was used, so it could be found, and then the script file passed by the parameter runs fine.  Another view or table to which the DataStage user has access should work just as well.
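
For illustration, a parameterized user-defined SQL script of the kind being passed to the stage might look like the sketch below. The schema/table parameter names and the ORCHESTRATE column references are hypothetical placeholders, not taken from the actual customer job:

INSERT INTO <<Schema>>.<<TargetTable>> (time_srky, cal_date)
VALUES (ORCHESTRATE.TIME_SRKY, ORCHESTRATE.CAL_DATE);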

Netezza Connector, User-Defined SQL, Table Name Required Fix


 

Surrogate Key File Effective Practices


Here are a few thoughts on effectively working with IBM InfoSphere Information Server DataStage surrogate key files, which may prove useful for developers.

 

Placement

  • The main thing about placement is that it be in a consistent location. Developers and production support teams should not need to guess or look up where it is for every DataStage project. So, it is best to put the surrogate key files in the same base path, and to give each project its own subfolder, to facilitate migrations and to reduce the possibility of human error. Here is the path structure which is commonly used:

Path

  • /data/SRKY/<<Project Name>>

Parameter Sets

  • As a best practice, the surrogate key file path should be in a parameter set and the parameter used in the jobs, as needed.  This simplifies maintenance, if and when changes to the path are required, and during migrations.

Surrogate Key Parameter Set Screenshot – Example Parameter Tab

Surrogate Key Parameter Set Screenshot – Example Values Tab

Surrogate Key Job Parameter Example Path using Parameter

Permissions

  • DataStage must have permissions to:
    • The entire parent path
    • The project folder, and
    • The surrogate key files themselves.

To ensure that DataStage has access to the path and the surrogate key files, ensure:

  • The ‘dsadm’ (owner) and ‘dstage’ (group) have access to the folders in the path (directories also need the execute bit, e.g. 755), with the files themselves at a minimum of “-rw-r--r--” (644) permissions. Keeping the permissions to a minimum can, for obvious reasons, prevent inadvertent overwrites of the surrogate key files, thus avoiding some potentially serious cleanup.
  • The ‘dsadm’ (owner) and ‘dstage’ (group) have access to the surrogate key files themselves; see the command sketch below.
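
A minimal command-line sketch of establishing these permissions, assuming the common path structure above and a hypothetical project folder:

  • mkdir -p /data/SRKY/<<Project Name>>
  • chown -R dsadm:dstage /data/SRKY/<<Project Name>>
  • chmod 755 /data/SRKY/<<Project Name>>
  • chmod 644 /data/SRKY/<<Project Name>>/*.srky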

Surrogate Key File Minimum Permissions

Productivity Tip – Quickly create a new surrogate key file


This productivity tip shows how to quickly create a new surrogate key file in Linux.  This example leverages native capabilities of Red Hat Enterprise Linux (RHEL) to skip a few commands, by using an existing surrogate key file to create a new surrogate key file with a minimum of keystrokes and command-line entries.

Creating a New Surrogate Key File From an Existing File

The basic process consists of just a few steps:

  1. Navigate to the location of your existing surrogate key files
  2. Copy an existing surrogate file
  3. Empty the new surrogate key file

Navigate to the location of your existing surrogate key files

This step is a preparatory step; you will need to look at the path variable for the project you are working with to know where to go.  The actual path to the surrogate key files for your project can vary by project.

Copy an existing surrogate file

Assuming you have existing surrogate key files configured as needed, the copy (cp) command with the interactive and preserve options can eliminate the need to create the file and then set its group and permissions.   The interactive (-i) option prevents you from overwriting an existing file, in case you made a filename typo, and the preserve (-p) option preserves the specified attributes (e.g. ownership and permissions).

Basic Command

  • Here is the command format with interactive and preserve; either format works
    • cp -ip <<FileName to Be Copied>> <<New Filename>>
  • Here is the command format with preserve only
    • cp -p <<FileName to Be Copied>> <<New Filename>>

Example Command

  • cp -ip srky blogexample.srky

Copy Surrogate Key With Permissions

Empty the new surrogate key file

Setting the newly created surrogate key file to null will empty the file, so DataStage can begin from the point configured in your DataStage job.

Basic Command

  • cat /dev/null > <<FileName>>

Example Command

  • cat /dev/null > blogexample.srky

Empty Surrogate Key File
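
Putting the copy and the empty steps together, here is a minimal one-liner sketch using the same hypothetical filenames as above; the && ensures the new file is only emptied if the copy succeeded:

  • cp -p srky blogexample.srky && cat /dev/null > blogexample.srky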


 

Productivity Tip – Changing Owner and Groups on Surrogate Key File


 

This practice tip shows how to quickly update a surrogate key file’s owner and group in Linux.  This example leverages native capabilities of Red Hat Enterprise Linux (RHEL) to skip a few commands, by using a combined command to set both the owner and the group of a surrogate key file in a single command.

Surrogate Key File Owners and Groups

To ensure that DataStage has access to the path and the surrogate key files, ensure the ‘dsadm’ (owner) and ‘dstage’ (group) have access to the surrogate key files.

Setting Surrogate Key File Owners and Groups

You can change the ownership and group of a surrogate key file at the same time, in Linux, with the change owner (chown) command. To do this, navigate to the surrogate key path containing the file, then execute the combined chown command.

Basic chown command format

  • chown <<OWNER>>:<<Group>> <<File Name>>

Example chown command

  • chown dsadm:dstage blogexample.srky

Chown On Surrogate Key File
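
If an entire project folder of surrogate key files needs the same ownership, chown’s recursive option can be applied to the directory instead; a sketch, assuming the common path structure used earlier:

  • chown -R dsadm:dstage /data/SRKY/<<Project Name>>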


 

 

InfoSphere DataStage – Client Tier Image Requirements

InfoSphere Information Server (IIS) DataStage Client Tools

A frequent question encountered in the early stages of a client engagement is: what are the recommended Windows client tier image requirements (virtual machine image or desktop)?  However, I have never found this information to be satisfactorily documented by IBM.  So, invariably, we end up providing our best guidance based on experience, which, in the case of a normal InfoSphere Information Server (IIS) installation, is provided in the table below.

Recommended Developer Client Tier Characteristics

  • Application Directory Storage
    • Quantity: 8 GB or larger
  • Memory
    • Quantity: 6 GB or more
    • Notes: This should be a developer image, so more is better, especially given the other applications which the developer may also be using simultaneously.
  • CPU Cores
    • Quantity: 2 or more
    • Notes: This should be a developer image, so more is better, especially given the other applications which the developer may also be using simultaneously.
  • Permissions
    • Quantity: N/A
    • Notes: Users performing the client software installation must have full administrative rights on the virtual machine image or individual workstation.
  • Protection Software
    • Quantity: N/A
    • Notes: All firewalls and virus protection software need to be disabled on the client while the client software installation is in progress.  Also, the required InfoSphere firewall ports for the client tools must be open to use the tools.

 


Are the InfoSphere Windows Client Tools 32 bit or 64 bit?

InfoSphere Information Server (IIS) DataStage Client Tools

The question of whether the IBM InfoSphere Windows client tools are 32 bit or 64 bit comes up rather frequently. The short answer is that the InfoSphere Windows client tools are 32-bit applications, which will run on supported 64-bit Windows systems.  This is what IBM calls 64-Tolerate (32-bit applications that can run in a 64-bit operating system environment); however, the client tools do not run in native 64-bit mode and do not exploit the 64-bit operating environment to improve performance.


 

DataStage – IIS-DSEE-TBLD-00008- Processing Input Record APT_Decimal Error

IIS-DSEE-TBLD-00008 apt decimal error before disabling combination

This is another one of those nebulous error messages which can cost a lot of time in research, if you don’t know how to simplify the process a bit.  Determining where the error occurs can be a challenge if you have not encountered this error before and figured out the trick, which isn’t exactly intuitive.

In this case, as it turned out, after I had determined where the error was, it was as simple as having missed resetting the stage variable properties when the other decimal fields were increased (for example, a stage variable still defined as Decimal(8,2) receiving values from a column that had been widened to Decimal(10,2)).

How to identify where this error occurs?

Disable operator combination by:

  • adding the APT_DISABLE_COMBINATION environment variable to the job properties

  • setting the APT_DISABLE_COMBINATION environment variable to True in the job properties

  • recompiling the job and running it again

 

This approach will usually provide a more meaningful identification of the stage with the error.

Note:  Please remember to remove the APT_DISABLE_COMBINATION environment variable before promoting the job to testing and/or releasing your code into production.

Message ID

IIS-DSEE-TBLD-00008

Error Message with combine enabled:

APT_CombinedOperatorController(1),0: Exception caught in processingInputRecord() for input "0": APT_Decimal::ErrorBase: From: the source decimal has even precision, but non-zero in the leading nybble, or is too large for the destination decimal… Record dropped. Create a reject link to save the record if needed.

Error message with combine disabled

Tfm_Measures_calc,0: Exception caught in processingInputRecord() for input "0": APT_Decimal::ErrorBase: From: the source decimal has even precision, but non-zero in the leading nybble or is too large for the destination decimal… Record dropped. Create a reject link to save the record if needed.

IIS-DSEE-TBLD-00008 apt decimal error after disabling combination

Note: Tfm_Measures_calc is the transformer stage name


 

 

DataStage – How to add or subtract from a date in a transformer stage

date offset by components example, subtracting a month, and doing other work in the transformer field

From time to time there is a need to adjust a date within a DataStage transformer.  It could be as small as subtracting a day to set a batch processing date to yesterday, or something bigger.  If you need to adjust a date by year, month, and/or day, the DateOffsetByComponents function is the method.  The DateOffsetByComponents function can work with fixed values or input from the job, either an input field or a stage variable.  The main thing is knowing what to put in which position in the offset values, with the understanding that offset values can be positive or negative, thereby allowing the addition and subtraction of different parts of a date in a DataStage transformer.

Purpose of DateOffsetByComponents Function

  • This function is used to obtain a specific required date by applying an offset to a given date.

Syntax

DateOffsetByComponents(<<Input Date>>, year offset, month offset, day of month offset)

How to subtract a year

DateOffsetByComponents(DSJobStartDate, -1, 0, 0)

How to add a year

DateOffsetByComponents(DSJobStartDate, 1, 0, 0)

How to subtract a month

DateOffsetByComponents(DSJobStartDate, 0, -1, 0)

How to add a month

DateOffsetByComponents(DSJobStartDate, 0, 1, 0)

How to subtract a day

DateOffsetByComponents(DSJobStartDate, 0, 0, -1)

How to add a day

DateOffsetByComponents(DSJobStartDate, 0, 0, 1)
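
The offsets can also be combined in a single call. For instance, a sketch for a hypothetical requirement to move the job start date back one month and one day in one step:

DateOffsetByComponents(DSJobStartDate, 0, -1, -1)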

 


IBM InfoSphere Licensing – DataStage and DataQuality


Licensing is one of the major factors which can limit InfoSphere performance, flexibility, and extensibility. Licensing can affect your performance through the type of product being licensed, the quantity of CPUs/cores being licensed, and the operating system for which the license was purchased.

The Type of Product Being Licensed

The type of product license purchased can reduce your ability to tune and achieve optimal performance by reducing the ability to perform parallel processing.  For example, consider the difference between the IBM InfoSphere DataStage licenses:

  • IBM InfoSphere DataStage Server Edition – A standalone ETL solution without full parallel processing framework capabilities.
  • IBM InfoSphere DataStage – Includes parallel processing capabilities to handle a massive volume, velocity, and variety of data.

Also, Information Server packs and special environment licenses provide enhanced capabilities in specific environment scenarios.  For example:

  • IBM InfoSphere DataStage MVS Edition – Provides capability to generate COBOL programs that run natively on the mainframe to access and integrate mainframe data sources to load into a data warehouse.
  • IBM InfoSphere Information Server Pack for Oracle Applications – Allows data extraction from the entire Oracle E-Business Suite of applications, including Oracle Financials, Manufacturing, CRM and others, in a DataStage ETL job.
  • IBM InfoSphere Information Server Pack for PeopleSoft Enterprise – Provides a graphical way to extract data from PeopleSoft Enterprise in a DataStage ETL job.
  • IBM InfoSphere Information Server Pack for Salesforce.com – Enables a no-programming, meta-data based, fully configurable integration between salesforce.com and other enterprise applications and data repositories.

The Quantity of CPUs/Cores Being Licensed

The proper sizing of the license to acquire has a significant effect on DataStage’s ability to leverage the system CPUs/cores, and licenses are frequently undersized.  While there is a balance between cost and performance, undersizing your license can add cost in support, as well as increase processing times.  Furthermore, most performance issues reported to IBM are with systems licensed for fewer than eight (8) CPUs/cores.

Related References

  • Licensed IBM InfoSphere DataStage editions and feature packs activation and deactivation
  • Viewing a list of activated IBM InfoSphere DataStage editions and feature packs

How to clear a Surrogate Key file using Linux

Linux – How to Empty a Surrogate Key File

Occasionally, the question comes up of how to clear/reset the surrogate key file from the Linux command line.  It is a simple process, really, but it should be done with care and only if you need the keys in the dimension to be reset to the beginning.   A complete reset would require:

  • The target table to be truncated and,
  • All keys in use in facts to be removed, or reset after the fact, and/or the table truncated and reloaded,
  • The Surrogate Key file emptied, and
  • The ETL rerun.

Basic Command

  • cat /dev/null > <<FileName>>

Example Command

  • cat /dev/null > Season.srky
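
For what it is worth, on RHEL the same result can be achieved with shell redirection alone, or with the truncate utility; two equivalent sketches using the same example file:

  • > Season.srky
  • truncate -s 0 Season.srky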


PureData – Table Effective Practices


Here are a few tips which can make a significant difference to the efficiency and effectiveness of developers and users, by making information available to them when developing and creating analytic objects.  This information can also be very helpful to data modelers.  While some of these recommendations are not enforced by Netezza/PureData, that fact makes them no less helpful to your community.

Alter table to Identify Primary Keys (PK)

  • Visually helps developers and users to know what the primary keys of the table are
  • Primary key information can also be imported as metadata by other IBM tools (e.g. InfoSphere DataStage, Data Architect, Governance Catalog, Aginity, etc.)
  • The query optimizer will use these definitions to define efficient query execution plans

Alter table to Identify Foreign Keys (FK)

  • Illustrates table relationships for developers and users
  • Foreign key information can also be imported as metadata by other IBM tools (e.g. InfoSphere DataStage, Data Architect, Governance Catalog, Aginity, etc.)
  • The query optimizer will use these definitions to define efficient query execution plans (see the example below)
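
For illustration, here is a minimal sketch of both alterations, reusing the time_dim example from earlier in this document and a hypothetical sales_fact table; Netezza records these constraints as metadata but does not enforce them:

ALTER TABLE time_dim ADD CONSTRAINT time_dim_pk PRIMARY KEY (time_srky);
ALTER TABLE sales_fact ADD CONSTRAINT sales_fact_time_fk FOREIGN KEY (time_srky) REFERENCES time_dim (time_srky);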

Limit Distribution Key to Non-Updatable Fields

  • This one seems obvious, but this problem occurs regularly if tables and optimizations are not properly planned; an error will be generated if an update is attempted against a field contained in the distribution key of a table.

Use ‘Not Null’ on Fields

  • Using ‘Not Null’ whenever the field data and ETL transformation rules can enforce it helps improve performance by reducing the number of null condition checks performed, and it reduces storage.

Use Consistent Field Properties

  • Using the same data type and field length in all tables with the same field name reduces the amount of interpretation/conversion required by the system, developers, and report SQL.

Schedule Table Optimizations

  • Work with your DBAs to determine the best scheduling time, system user, and priority of groom and generate statistics operations; a sketch of the operations themselves appears below. Keep in mind when the optimizations occur in relation to when users need to consume the data. All too often, this operation is not performed before users need the performance, and/or it is driven by DBA choice without proper consideration of other processing performance needs.  This has proven especially true in data warehousing, when the DBA does not have data warehousing experience and/or does not understand the load patterns of the ETL/ELT process.
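
For reference, a minimal sketch of the two operations being scheduled, again using the hypothetical time_dim table:

GROOM TABLE time_dim RECORDS ALL;
GENERATE STATISTICS ON time_dim;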


 

InfoSphere – decimal_from_string Conversion Error

decimal_from_string Conversion Error

This is another one of those nebulous errors which DataStage, DataQuality, and/or DataClick can kick out.  This error can be particularly annoying, because it doesn’t identify the field or even the precise command which is causing the error.  So, there can be more than one field and/or more than one command causing the problem.

Error

Conversion error calling conversion routine decimal_from_string data may have been lost

Resolution

To resolve this error, check for the correct formatting (date format, decimal, and null value handling) before passing values to the DataStage StringToDate, DateToString, DecimalToString, or StringToDecimal functions.  Additionally, even if the formatting is correct, you may need to embed commands to completely clear the issue.

Example

Here is a recent example of command embedding which cleared the issue; the inner DecimalToDecimal normalizes the decimal value before DecimalToString converts it. I am sure you will need to apply this concept in other ways to meet all of your needs.

DecimalToString(DecimalToDecimal(<<Input Field>>, 'trunc_zero'), "suppress_zero")

Infosphere Director Scheduler


 

First, let us start by dispelling a common myth: the InfoSphere scheduler is not an enterprise scheduler application.

The scheduling service in InfoSphere leverages the operating system (OS) scheduler, which in the case of Linux is cron, and provides a graphical user interface with a time-based capability to schedule jobs at the suite component level.  The Director client scheduler can:

  • Schedule individual jobs and sequence jobs
  • Schedule jobs/sequencers to run:
    • Today
    • Tomorrow
    • Every
    • Next
    • Daily

How to set up the Daily Schedule

 

Steps to set the Daily Schedule are below:

  1. Open the DataStage Director
  2. Once the job browser is open, click on Job > Add to Schedule
  3. Click ‘Daily’ under ‘Run Job’ and choose the time
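
Because the Director scheduler rides on the OS scheduler, on Linux you can verify what was actually registered by listing the cron table for the scheduling user. A sketch, assuming ‘dsadm’ is the user under which the schedule was added (listing another user’s crontab requires root):

  • crontab -l -u dsadm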


 

How to suppress a Change_Capture_Host warning in DataStage

Change Capture Host Warning (IIS-DSEE-TFXR-00017)

Occasionally, I run into this Change Capture Host defaulting warning, so I thought this information might be useful.

Event Type

  • Warning

Message ID

  • IIS-DSEE-TFXR-00017

Example Message

  • Change_Capture_Host: When checking operator: Defaulting "<<FieldName>>" in transfer from "beforeRec" to "outputRec".

Setting Variable

  • Set APT_DISABLE_TFXR0017=1
  • This environment variable can be added either at the project level or at the job level.

Alternative Solution

  • Within the Change Capture stage properties:
    • Stage tab
    • Options
    • Property: ‘Change Mode’
    • Value: ‘Explicit Keys, All Values’