Database – What is a foreign key?


Acronyms, Abbreviations, Terms, And Definitions

Definition of a Foreign Key

  • A foreign key (FK) is a constraint that references the unique primary key (PK) of another table.

Facts About Foreign Keys

  • Foreign keys act as a cross-reference between tables, linking the foreign key (child record) to the primary key (parent record) of another table, which establishes a link/relationship between the table keys
  • Foreign keys are not enforced by all RDBMSs
  • The concept of referential integrity is derived from foreign key theory
  • Because foreign keys involve more than one table relationship, their implementation can be more complex than primary keys
  • In some database systems, a foreign-key constraint implicitly defines an index on the foreign-key column(s) in the child table; in others, manually defining a matching index may be needed and can improve join performance
  • SQL normally provides the following referential integrity actions for deletions when enforcing foreign keys (a brief example follows the action descriptions below):

Cascade

  • The deletion of a parent (primary key) record may cause the deletion of corresponding foreign-key records.

No Action

  • Forbids the deletion of a parent (primary key) record if there are dependent foreign-key records.   No Action does not mean the foreign-key constraint is suppressed.

Set null

  • The deletion of a parent (primary key) record causes the corresponding foreign-key to be set to null.

Set default

  • The deletion of a parent (primary key) record causes the corresponding foreign keys to be set to a default value instead of null.
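
As a minimal sketch of these actions, assuming hypothetical table and column names (exact syntax varies by RDBMS), a foreign key with a deletion action can be declared as part of the child table's DDL:

-- Hypothetical parent table
CREATE TABLE customer (
    customer_id   INTEGER NOT NULL PRIMARY KEY,
    customer_name VARCHAR(100)
);

-- Hypothetical child table; the foreign key cross-references the parent's primary key.
-- ON DELETE CASCADE could be swapped for NO ACTION, SET NULL, or SET DEFAULT,
-- depending on the referential integrity action desired.
CREATE TABLE customer_order (
    order_id    INTEGER NOT NULL PRIMARY KEY,
    customer_id INTEGER,
    order_date  DATE,
    CONSTRAINT fk_customer_order_customer
        FOREIGN KEY (customer_id)
        REFERENCES customer (customer_id)
        ON DELETE CASCADE
);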

Related References

 


InfoSphere / Datastage – What are the Supported Connector Stages for dashDB?

dashDB


In a recent discussion, the question came up concerning which InfoSphere DataStage connectors and/or stages are supported by IBM for dashDB.  So, it seems appropriate to share the insight gained from having the question answered.

What DataStage Connectors and/or Stages are Supported for dashDB?

You have three choices of connector, and which may best meet your needs depends on the nature of your environment and the configuration choices that have been applied:

  1. The DB2 Connector Stage
  2. The JDBC Connector stage
  3. The ODBC Stage

Related References

Connecting to IBM dashDB

InfoSphere Information Server, InfoSphere Information Server 11.5.0, Information Server on Cloud offerings, Connecting to other systems, Connecting to IBM dashDB

DB2 connector

InfoSphere Information Server, InfoSphere Information Server 11.5.0, Connecting to data sources, Databases, IBM DB2 databases, DB2 connector

ODBC stage

InfoSphere Information Server, InfoSphere Information Server 11.5.0, Connecting to data sources, Older stages for connectivity, ODBC stage

JDBC data sources

InfoSphere Information Server, InfoSphere Information Server 11.5.0, Connecting to data sources, Multiple data sources, JDBC data sources

Database – What is a Primary Key?

Database Table


What is a primary key?

What a primary key is depends somewhat on the database.  However, in its simplest form, a primary key:

  • Is a field (column) or combination of fields (columns) which uniquely identifies every row.
  • Is an index in database systems which use indexes for optimization
  • Is a type of table constraint
  • Is applied with a data definition language (DDL) alter command (see the sketch below)
  • And, depending on the data model, can define a parent-child relationship between tables
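
As a minimal sketch, assuming hypothetical table and column names (syntax varies slightly by RDBMS), a primary key can be applied to an existing table with a DDL alter command:

-- Single-column primary key on an existing table
ALTER TABLE customer
    ADD CONSTRAINT customer_pk PRIMARY KEY (customer_id);

-- A combination-of-columns (composite) primary key is declared the same way
ALTER TABLE order_line
    ADD CONSTRAINT order_line_pk PRIMARY KEY (order_id, line_number);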

Related References

Data Modeling – Fact Table Effective Practices

Database Table


Here are a few guidelines for modeling and designing fact tables.

Fact Table Effective Practices

  • The table naming convention should identify it as a fact table. For example:
    • Suffix Pattern:
      • <<TableName>>_Fact
      • <<TableName>>_F
    • Prefix Pattern:
      • FACT_<<TableName>>
      • F_<<TableName>>
  • Must contain a temporal dimension surrogate key (e.g. date dimension)
  • Measures should be nullable – this has an impact on aggregate functions (SUM, COUNT, MIN, MAX, AVG, etc.)
  • Dimension surrogate keys (srky) should have a foreign key (FK) constraint (see the sketch after this list)
  • Do not place the dimension processing in the fact jobs
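
The sketch below pulls these practices together in one hypothetical fact table; the table, column, and constraint names are illustrative only, and the exact DDL varies by RDBMS:

-- Hypothetical fact table following the suffix naming pattern
CREATE TABLE sales_fact (
    date_srky      INTEGER NOT NULL,  -- temporal dimension surrogate key (date dimension)
    product_srky   INTEGER NOT NULL,  -- dimension surrogate key
    store_srky     INTEGER NOT NULL,  -- dimension surrogate key
    sales_amount   DECIMAL(18,2),     -- nullable measure
    sales_quantity INTEGER,           -- nullable measure
    CONSTRAINT sales_fact_date_fk    FOREIGN KEY (date_srky)    REFERENCES date_dim (date_srky),
    CONSTRAINT sales_fact_product_fk FOREIGN KEY (product_srky) REFERENCES product_dim (product_srky),
    CONSTRAINT sales_fact_store_fk   FOREIGN KEY (store_srky)   REFERENCES store_dim (store_srky)
);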

Related References

Data Modeling – Dimension Table Effective Practices

Database Table


I’ve had these notes lying around for a while, so I thought I would consolidate them here.   Here are a few guidelines to ensure the quality of your dimension table structures.

Dimension Table Effective Practices

  • The table naming convention should identify it as a dimension table. For example:
    • Suffix Pattern:
      • <<TableName>>_Dim
      • <<TableName>>_D
    • Prefix Pattern:
      • Dim_<<TableName>>
      • D_<<TableName>>
  • Have a primary key (PK) assigned on the table surrogate key
  • Audit fields – Type 1 dimensions should:
    • Have a Created Date timestamp – when the record was initially created
    • Have a Last Update timestamp – when the record was last updated
  • Job Flow: Keep dimension processing outside of the fact jobs
  • Every dimension should have a zero (0), ‘Unknown’, row
  • Fields should be ‘NOT NULL’, replacing nulls with a zero (0) for numeric and integer type fields or a space (‘ ‘) for character type fields (see the sketch after this list)
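
The sketch below illustrates these practices with a hypothetical Type 1 dimension table and its zero (0) ‘Unknown’ row; names, types, and timestamp syntax vary by RDBMS:

-- Hypothetical dimension table following the suffix naming pattern
CREATE TABLE product_dim (
    product_srky     INTEGER      NOT NULL,  -- surrogate key
    product_code     VARCHAR(20)  NOT NULL,  -- business/natural key
    product_name     VARCHAR(100) NOT NULL,
    created_date     TIMESTAMP    NOT NULL,  -- audit: when the record was initially created
    last_update_date TIMESTAMP    NOT NULL,  -- audit: when the record was last updated
    CONSTRAINT product_dim_pk PRIMARY KEY (product_srky)
);

-- Zero (0) 'Unknown' row, so fact records without a matching dimension value still join
INSERT INTO product_dim (product_srky, product_code, product_name, created_date, last_update_date)
VALUES (0, ' ', 'Unknown', CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);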

Related References

 

 

Netezza / PureData – How to add comments on a field

Netezza / PureData Table Documentation Practices


The ‘Comment on Column’ statement provides the same self-documentation capability as ‘Comment On Table’, but drives the capability down to the column/field level.  This provides an opportunity to describe the purpose, business meaning, and/or source of a field to other developers and users.  The comment code is part of the DDL and can be migrated with the table structure DDL.  The statements can be run independently or, working with Aginity for PureData System for Analytics, they can be run as a group with the table DDL, using the ‘Execute as a Single Batch’ (Ctrl+F5) command.

Basic ‘COMMENT ON field’ Syntax

  • The basic syntax to add a comment to a column is:

COMMENT ON COLUMN <<Schema.TableName.ColumnName>> IS '<<Descriptive Comment>>';

Example ‘COMMENT ON Field’ Syntax

  • This is example syntax, which would need to be changed and applied to each column field:

COMMENT ON COLUMN time_dim.time_srky IS 'time_srky is the primary key and is a surrogate key derived from the date business/natural key';
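
Since ‘Comment On Table’ offers the same capability at the table level, a companion sketch is shown below; the table name and comment text are hypothetical:

COMMENT ON TABLE time_dim IS 'Time dimension table; one row per calendar date, keyed by time_srky';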

 

Related References

 

Infosphere Datastage – Client Tier Image Requirements

InfoSphere Information Server (IIS) DataStage Client Tools


A frequent question encountered in the early stages of a client engagement is: what are the recommended Windows Client Tier Image Requirements (Virtual machine image or Desktop)?  However, I have never found this information to be satisfactorily documented by IBM.  So, invariably, we end up providing our best guidance based on experience, which in the case of the normal InfoSphere Information Server (IIS) installation is provided in the table below.

Recommended Developer Client Tier Characteristics

Item – Quantity – Notes

  • Application Directory Storage – 8 GB or larger
  • Memory – 6 GB or more – This should be a developer image, so more is better, especially given the other applications the developer may also be using simultaneously.
  • CPU Cores – 2 or more – This should be a developer image, so more is better, especially given the other applications the developer may also be using simultaneously.
  • Permissions – N/A – Users performing the client software installation must have full Administrative rights on the virtual machine image or individual workstation.
  • Protection Software – N/A – All firewalls and virus protection software need to be disabled on the client while the client software installation is in progress. Also, the required InfoSphere firewall ports for the client tools must be open to use the tools.

 

Related References

Are the InfoSphere Windows Client Tools 32-bit or 64-bit?

InfoSphere Information Server (IIS) DataStage Client Tools


The question of whether the IBM InfoSphere Windows client tools are 32-bit or 64-bit actually comes up rather frequently. The short answer is that the InfoSphere Windows client tools are 32-bit applications that will run on supported 64-bit Windows systems.  This is what IBM calls 64-Tolerate (32-bit applications that can run in a 64-bit operating system environment); however, the client tools do not run in native 64-bit mode and do not exploit the 64-bit operating environment to improve performance.

Related References

 

What are the Factors Affecting the Selection of a Data Warehouse Naming Convention?

Data Warehouse Naming Convention

Database Table

The primary factors affecting the choices in the creation of Data Warehouse (DW) naming convention policy standards are the type of implementation, pattern of the implementation, and any preexisting conventions.

Type of implementation

The type of implementation will affect your naming convention choices. Basically, this boils down to, are you working with a Commercial-Off-The-Shelf (COTS) data warehouse or doing a custom build?

Commercial-Off-The-Shelf (COTS)

If it is a Commercial-Off-The-Shelf (COTS) warehouse, which you are modifying and/or enhancing, then it is very strongly recommended that you conform to the naming conventions of the COTS product.  However, you may want to add an identifier to the conventions to identify your custom objects.

Using this information as an exemplar:

  • FAV = Favinger, Inc. (Company Name – Custom Identifier)
  • GlobalSales = Global Sales (Subject)
  • MV = Materialized View (Object Type)

 

Suffix Pattern Naming Convention

<<Custom Identifier>>_<<Object Subject Name>>_<<Object Type>>

Example:  FAV_GlobalSales_MV

Prefix Pattern Naming Convention

<<Object Type>>_<<Custom Identifier>>_<<Object Subject Name>>

Example: MV_FAV_GlobalSales

Custom Data Warehouse Build

If you are creating a custom data warehouse from scratch, then you have more flexibility to choose your naming convention.  However, you will still need to take into account a few factors to achieve the maximum benefit from your naming conventions.

  • What is the high-level pattern of your design?
  • Are there any preexisting naming conventions?

Data Warehouse Patterns

Your naming convention will need to take into account the overall intent and design pattern of the data warehouse. The objects and naming conventions of each pattern will vary, if for no other reason than the differences in the objects, their purpose, and the depth of their relationships.

High level Pattern of the Data Warehouse Implementation

The high-level pattern of your design, whether an Operational Data Store (ODS), Enterprise Data Warehouse (EDW), Data Mart (DM), or something else, will need to guide your naming convention, as the depth of logical and/or processing zones of each pattern will vary, and each has some industry generally accepted conventions.

Structural Pattern of the Data Warehouse Implementation

The structural pattern of your data warehouse design, whether Snowflake, 3rd Normal Form, or Star Schema, will need to guide your naming convention, as the depth of relationships in each pattern will vary, each has some industry generally accepted conventions, and it will relate directly to your high-level data warehouse pattern.

Preexisting Conventions

An often omitted factor in data warehouse naming conventions is the set of preexisting conventions, which can have significant impacts from both an engineering and a political point of view. The sources of these conventions can vary and may or may not be formally documented.

A common source of naming convention conflict is preexisting implementations, which may not even be documented.  However, the conventions that these system objects follow, and that consumers are familiar with, will need to be taken into account when assessing impacts to systems, political culture, user training, and the creation of a standard convention for your data warehouse.

The Relational Database Management System (RDBMS) in which you intend to build the data warehouse may also have generally accepted conventions, with which consumers may be familiar and about which they may have preconceived expectations (whether expressed or implied).

Change Management

Whatever data warehouse naming convention you choose, the naming conventions, along with the data warehouse design pattern assumptions, should be well documented and placed in a managed and readily accessible change management (CM) repository.

 

 

PureData – Table Effective Practices

IBM Netezza PureData Table Effective Practices

Database Table

Here are a few tips which can make a significant difference to the efficiency and effectiveness of developers and users, by making information available to them when developing and creating analytic objects.  This information can also be very helpful to data modelers.  While some of these recommendations are not enforced by Netezza/PureData, that fact makes them no less helpful to your community.

Alter table to Identify Primary Keys (PK)

  • Visually helps developers and users to know what the primary keys of the table are
  • Primary key information can also be imported as metadata by other IBM tools (e.g. InfoSphere, DataStage, Data Architect, Governance Catalog, Aginity, etc.)
  • The query optimizer will use these definitions to define efficient query execution plans

Alter table to Identify Foreign Keys (FK)

  • Illustrates table relationships for developers and users
  • Foreign key information can also be imported as metadata by other IBM tools (e.g. InfoSphere, DataStage, Data Architect, Governance Catalog, Aginity, etc.)
  • The query optimizer will use these definitions to define efficient query execution plans (see the sketch below)
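
A minimal sketch of both alterations, using hypothetical table, column, and constraint names (Netezza/PureData accepts these key definitions even though it does not enforce them):

-- Identify the primary key on the dimension table
ALTER TABLE product_dim
    ADD CONSTRAINT product_dim_pk PRIMARY KEY (product_srky);

-- Identify the foreign key on the fact table referencing the dimension
ALTER TABLE sales_fact
    ADD CONSTRAINT sales_fact_product_fk
    FOREIGN KEY (product_srky) REFERENCES product_dim (product_srky);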

Limit Distribution Key to Non-Updatable Fields

  • This one seems obvious, but this problem occurs regularly if tables and optimizations are not properly planned; an error will be generated if an update is attempted against a field contained in the distribution key of a table.

Use ‘Not Null’ on Fields

  • Using ‘Not Null’ whenever the field data and ETL transformation rules can enforce it helps improve performance by reducing the number of null condition checks performed, and it reduces storage.

Use Consistent Field Properties

  • Using the same data type and field length in all tables with the same field name reduces the amount of interpretation/conversion required by the system, developers, and report SQL.

Schedule Table Optimizations

  • Work with your DBAs to determine the best scheduling time, system user, and priority for groom and generate statistics operations (a sketch of these commands follows). Keep in mind the relationship between when the optimizations occur and when users need to consume the data. All too often, this operation is not performed before users need the performance, and/or it is driven by DBA preference without proper consideration of other processing performance needs.  This has proven especially true in data warehousing when the DBA does not have data warehousing experience and/or does not understand the load patterns of the ETL/ELT process.
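
As a minimal sketch of the operations referred to above, with a hypothetical table name (options, scheduling, and priorities are site-specific and should be agreed with your DBAs):

-- Reclaim and reorganize space from deleted/updated rows
GROOM TABLE sales_fact;

-- Refresh optimizer statistics after large loads
GENERATE STATISTICS ON sales_fact;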

Related Links

 

The Value of Data Indicator Flags

Data Indicator Flags


I cannot count the times that using a flag (also called an indicator) has been described as a nice-to-have in database table design, at least until the code runs into complexity and/or performance challenges.

When designing your data models, ETLs, and reports, it is useful to consider how indicator flags can help.  Indicator flags are normally binary in nature, such as True/False or Yes/No, but they don’t always need to be binary.

How indicator flags can help your processes and reporting:

  • Provide an equijoin or simple predicate for complex business rules, which can otherwise result in ‘Not In List’, ‘In List’, ‘Not Exists’, ‘Exists’, ‘Not Equal To’, and sub-select SQL statements (see the sketch after this list)
  • Provide a processing marker to prevent lookups to other tables to determine an attribute. For example, a snapshot type (daily, weekly, monthly) flag, which can be used to apply data retention rules.
  • Provide a marker for special circumstances. For example, a Legal Hold flag to mark records to be exempted from removal and/or change to meet legal proceeding requirements.
  • The judicious planning and use of flags can reduce the number of full table scans required against large tables.
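
A minimal sketch of the first point above, using hypothetical tables and a hypothetical month-end flag populated during ETL; the flag replaces an ‘In List’ sub-select with a simple predicate:

-- Without a flag: a sub-select against another table is needed
SELECT f.*
FROM   account_balance_fact f
WHERE  f.snapshot_date IN (SELECT month_end_date FROM date_dim);

-- With an indicator flag: a simple equality predicate, no sub-select or extra lookup
SELECT f.*
FROM   account_balance_fact f
WHERE  f.month_end_flag = 'Y';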

Related References

Data Modeling – Database Table Field Ordering Effective Practices

Database Table Field Ordering Effective Practices

Database Table

Field ordering can help performance on inserts and updates, and it also keeps developers and users from having to search the entire table structure to be sure they have all the keys, etc. A sketch of this ordering appears after the list below.

Table Field Ordering

  1. Distribution field or fields; if no distribution field is set, the first field will be used by default.
  2. Primary Key Columns (including Parent and Child key fields)
  3. Foreign Key Columns (Not Null)
  4. Not Null Columns
  5. Nullable Columns
  6. Created Date Timestamp
  7. Modified (or Last Updated) Date Timestamp
  8. Large text Fields
  9. Large binary Columns or Binary Field references
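
A minimal sketch of this ordering, with hypothetical names and types; the DISTRIBUTE ON clause shown is Netezza-style, and the equivalent syntax varies by platform:

CREATE TABLE customer_dim (
    customer_srky    INTEGER       NOT NULL,  -- 1-2: distribution field and primary key surrogate
    customer_id      VARCHAR(20)   NOT NULL,  --      natural/business key
    region_srky      INTEGER       NOT NULL,  -- 3: foreign key column (not null)
    customer_type    CHAR(1)       NOT NULL,  -- 4: not null columns
    phone_number     VARCHAR(20),             -- 5: nullable columns
    created_date     TIMESTAMP     NOT NULL,  -- 6: created date timestamp
    last_update_date TIMESTAMP     NOT NULL,  -- 7: modified (last updated) date timestamp
    customer_notes   VARCHAR(5000)            -- 8: large text field
)
DISTRIBUTE ON (customer_srky);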

Related References

Infosphere Data Architect Install Error CRIMC1029E or CRIMC1085E

The CRIMC1029E / CRIMC1085E errors may be caused by running the incorrect InfoSphere Data Architect installer.  If you run the admin installer (launchpad.exe) on 64-bit Windows with insufficient privileges, the process will throw a CRIMC1029E / CRIMC1085E error.

What the Error Looks Like

 

Installation failed.

CRIMC1029E: Adding plug-in com.ibm.etools.cobol.win32_7.0.921.v20140409_0421 to repository E:\Program Files\IBM\SDPShared failed.

CRIMC1085E: Resumable download failed for: file:/E:/InstallIBM/INFO_DATA_ARCH_V9.1.2_WIN/disk2/ad/plugins/file000086.

‘plug-in com.ibm.etools.cobol.win32_7.0.921.v20140409_0421’ does not exist (at file:/E:/InstallIBM/INFO_DATA_ARCH_V9.1.2_WIN/disk2/ad/plugins/file000086).

Elapsed time 00:00.00.

Solution

Run the launchpad_win_nonadmin64.exe file, instead of the launchpad.exe file.

Reference Link:

Installing IBM InfoSphere Data Architect with the Installation Manager Install wizard

 

Data Modeling – What is Data Modeling?


Data Models

Data modeling is the documenting of data relationships, characteristics, and standards based on the intended use of the data.   Data modeling techniques and tools capture and translate complex system designs into easily understood representations of the data, creating a blueprint and foundation for information technology development and reengineering.

A data model can be thought of as a diagram that illustrates the relationships between data. Although capturing all the possible relationships in a data model can be very time-intensive, a well-documented model allows stakeholders to identify errors and make changes before any programming code has been written.

Data modelers often use multiple models to view the same data and ensure that all processes, entities, relationships and data flows have been identified.

There are several different approaches to data modeling, including:

Concept Data Model (CDM)

  • The Concept Data Model (CDM) identifies the high-level information entities and their relationships, which are organized in the Entity Relationship Diagram (ERD).

Logical Data Model (LDM)

  • The Logical Data Model (LDM) defines the detailed business information (in business terms) within each entity of the Concept Data Model and is a refinement of the information entities of the Concept Data Model.   Logical data models are non-RDBMS-specific business definitions of the tables, fields, and attributes contained within each information entity, from which the Physical Data Model (PDM) and Entity Relationship Diagram (ERD) are produced.

Physical Data Model (PDM)

  • The Physical Data Model (PDM) provides the actual technical details of the model and the database objects (e.g. table names, field names, etc.) to facilitate the creation of accurate, detailed technical designs and the actual database.  Physical Data Models are RDBMS-specific definitions of the logical model, used to build the database, create deployable DDL statements, and produce the Entity Relationship Diagram (ERD) (see the sketch below).
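
As a small, hypothetical illustration of the progression, a logical ‘Customer’ entity with business attributes such as Customer Number and Customer Name might be realized in the physical data model as RDBMS-specific DDL:

-- Hypothetical physical realization of a logical 'Customer' entity
CREATE TABLE customer (
    customer_id   INTEGER      NOT NULL,  -- logical attribute: Customer Number
    customer_name VARCHAR(100) NOT NULL,  -- logical attribute: Customer Name
    CONSTRAINT customer_pk PRIMARY KEY (customer_id)
);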

Related References

 

Infosphere Datastage – Useful Links


IIS Datastage – Useful Links

Here are some items which I have found useful when working with IBM InfoSphere Information Server (IIS) and DataStage.

Documentation Location

IBM Infosphere Information Server product documentation

http://www-01.ibm.com/support/knowledgecenter/SSZJPZ/welcome

Infosphere Datastage Parallel Framework Standard Practices

http://www.redbooks.ibm.com/abstracts/sg247830.html

DSXchange

http://www.dsxchange.com/

IBM Infosphere Datastage Data Flow and Job Design

http://www.redbooks.ibm.com/abstracts/sg247576.html?Open

IBM DeveloperWorks

http://www.ibm.com/developerworks/

Infosphere Information Server, Version 11.5.0.1 for Windows

http://www-01.ibm.com/support/docview.wss?uid=swg24040703

Infosphere Information Server, Version 11.5.0.1 for Linux

http://www-01.ibm.com/support/docview.wss?uid=swg24040702

Infosphere Information Server, Version 11.5.0.1 for AIX

http://www-01.ibm.com/support/docview.wss?uid=swg27016382

InfoSphere Information Server V11.5.0.1 detailed system requirements

http://www-01.ibm.com/support/docview.wss?uid=swg27047977

 

Netezza/PureData – Table Distribution Key Practices


Database Table

For those of us new to Netezza and coming from other database systems that use indexes, optimization of the distribution key in Netezza can be counterintuitive.  The distribution key is not about grouping like an index, but rather about spreading the data in a way that lets the zone map functionality improve performance.  So, here are a few effective practices to help decision making when deciding on a distribution key for a table (a sketch follows the list):

  1. Distribute on one key only
  2. Distribute on the lowest/finest possible grain of the data to improve your zone map balance
  3. Avoid distributing on foreign keys and/or foreign surrogate keys
  4. Avoid distributing on fields, which may need to be updated
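
A minimal sketch of a single-column distribution key, using hypothetical names; the column chosen is a fine-grained, stable field rather than a foreign key or an updatable field:

CREATE TABLE web_event_fact (
    event_id     BIGINT      NOT NULL,
    event_ts     TIMESTAMP   NOT NULL,
    session_srky INTEGER     NOT NULL,
    page_srky    INTEGER     NOT NULL,
    event_type   VARCHAR(30)
)
DISTRIBUTE ON (event_id);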

Why Use Materialized Views?

Database Views - Why Use Materialized Views

Database Views

If we consider Materialized Views (MV) in their simplest form, as a point-in-time stored query result, then materialized views serve two primary purposes: performance optimization and semantics simplification.

Performance Optimization

There are several ways in which materialized views can improve performance:

  • Reduce Database Workloads: materialized views can reduce database workloads by pre-assembling frequently used queries and, thereby, eliminating the repetitive execution of joins, aggregations, and filtering.
  • Facilitate Database Optimizers: in some databases, materialized views can be partitioned and indexed, and these structures are considered by database optimizers. Also, in some databases where more than one materialized view has been applied to a table, the database optimizer will consider all the associated materialized views when optimizing queries.
  • Reduce Network Workloads: by applying database replication and/or mass deployment techniques to materialized views, they can be distributed into closer proximity to the consumers, thereby reducing the data volume across the network and providing business continuation/disaster recovery capabilities, should the primary site become temporarily unavailable.
  • Precalculation and/or Preaggregation: performing calculation and aggregation of information upon creation of a materialized view eliminates the need to perform these functions on an on-demand basis as various consumers submit requests.
  • Data Subsets: by applying filters to eliminate unnecessary data (e.g. history data no longer in common reporting use) or unnecessary data attributes (e.g. unused columns or columns intended for other information purposes), the impact of filtering for these items is reduced and is effectively eliminated for consumers of the materialized view.

Semantics Simplification

Materialized views can be used to simplify the semantics provided to information consumers with ad hoc capabilities and/or to simplify the construction of reporting and analytics objects.  Depending on the database and/or integration tools in use to create them, materialized views can simplify the consumer experience by:

  • Reduce or Eliminate Join Coding: when constructed, materialized views can perform the joins and populate the materialized view with the resulting values from the joined tables, thereby eliminating the need for the consumer to perform this function as an ad hoc user or in the semantics of reporting and analytics tools
  • Pre-application of Business Rules: when constructed, materialized views can apply business rules to facilitate queries by adding indicator flags and pre-applying special business logic to the data, populating the materialized view with the resulting values and thus eliminating the need for the consumer to perform this function as an ad hoc user or in the semantics of reporting and analytics tools.
  • Precalculation and/or Pre-aggregation: performing calculation and aggregation of information upon creation of a materialized view allows the consumer to use the results without needing to build the calculations and/or aggregations into the ad hoc query or the semantics of reporting and analytics tools.  This also helps to ensure information accuracy and consistency.
  • Data Subsets: by prefiltering the data during the creation of the view, unnecessary or unused data and columns are not available to consumers and do not need to be filtered out of ad hoc queries or in the semantics of reporting and analytics tools (a combined sketch follows this list)
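
A minimal sketch combining several of these points (pre-joining, pre-aggregation, and a data subset), using hypothetical object names; materialized view syntax and the operations allowed in the defining query vary by database:

-- Pre-joined, pre-aggregated, filtered subset for reporting consumers
CREATE MATERIALIZED VIEW sales_summary_mv AS
SELECT d.calendar_year,
       p.product_category,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   date_dim    d ON d.date_srky    = f.date_srky
JOIN   product_dim p ON p.product_srky = f.product_srky
WHERE  d.calendar_year >= 2015          -- data subset: exclude history no longer in common use
GROUP BY d.calendar_year, p.product_category;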

Where do data models fit in the Software Development Life Cycle (SDLC) Process?

Data Model SDLC Relationship Diagram


In the classic Software Development Life Cycle (SDLC) process, Data Models are typically initiated, by model type, at key process steps and are maintained as data model detail is added and refinement occurs.

The Concept Data Model (CDM) is usually created in the Planning phase.   However, creation of the Concept Data Model can slide forward or backward, somewhat, within the System Concept Development, Planning, and Requirements Analysis phases, depending upon whether the application being modeled is a custom development effort or a modification of a Commercial-Off-The-Shelf (COTS) application.  The CDM is maintained, as necessary, through the remainder of the SDLC process.

The Logical Data Model (LDM) is created in the Requirement Analysis phase and is a refinement of the information entities of the Concept Data Model. The LDM is maintained, as necessary, through the remainder of the SDLC process.

The Physical Data Model (PDM) is created in the Design phase to facilitate creation of accurate detail technical designs and actual database creation. The PDM is maintained, as necessary, through the remainder of the SDLC process.

Related References: