DataStage – How to use single quoted parameter list in an Oracle Connector

Data Integration

While working with a client's DataStage 9.1 installation, I ran into a situation where they wanted to parameterize SQL where clause lists in an Oracle Connector stage, which honestly was not very straightforward to figure out.  First, if APT_OSL_PARAM_ESC_SQUOTE is not set and single quotes are used in the parameter, the job generates unquoted, invalid SQL when the parameter is populated.  Second, I found much of the available information confusing and/or incomplete in its explanation.  After some research and some trial and error, here is how I resolved the issue.  I'll endeavor to be concise, but holistic, in my explanation.

When this Variable applies

Here is where I know this process applies; there may be other circumstances to which it is applicable, but I'm listing the ones here with which I have recent experience.

Infosphere Information Server Datastage

  • Versions 9.1, 11.3, and 11.5

Oracle RDBMS

  • Versions 11g and 12c

Configuration process

Here is a brief explanation of the steps I used to implement the where clause as a parameter.  Please note that in this example I am using a job parameter to populate only a portion of the where clause; you can certainly pass the entire where clause as a parameter, if it is not too long.

Configure Project Variable in Administrator

  • Add APT_OSL_PARAM_ESC_SQUOTE to the project in Administrator
  • Populate the APT_OSL_PARAM_ESC_SQUOTE variable with the escape character, a backslash (\)

APT_OSL_PARAM_ESC_SQUOTE Project Variable

Create job parameter

Following your project naming convention, or standard practice if your customer and/or project does not have established naming conventions, create the job parameter in the job. See the jp_ItemSource parameter in the image below.

Job Parameter In Oracle Connector

Add the job parameter to the Custom SQL in the Oracle Connector (select) stage

Once the job parameter has been created, add it to the SQL statement of the job, as sketched below.

Job Parameter In SQL
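
For illustration, here is a minimal sketch of how the parameter might appear in the Connector's custom SQL; the table and column names are hypothetical, and jp_ItemSource would be populated with a single-quoted list such as 'A','B','C':

-- #jp_ItemSource# resolves to the quoted list at run time
SELECT ITEM_ID, ITEM_SOURCE
FROM BLOG.ITEM_MASTER
WHERE ITEM_SOURCE IN (#jp_ItemSource#);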

Related References

IBM Knowledge Center > InfoSphere Information Server 11.5.0

Connecting to data sources > Databases > Oracle databases > Oracle connector

IBM Support > Limitation of the Parameter APT_OSL_PARAM_ESC_SQUOTE on Plugins on Parallel Canvas

IBM Knowledge Center > InfoSphere Information Server 11.5.0

InfoSphere DataStage and QualityStage > Reference > Parallel Job Reference > Environment Variables > Miscellaneous > APT_OSL_PARAM_ESC_SQUOTE

 


How to know if your Oracle Client install is 32 Bit or 64 Bit

Oracle Database

Sometimes you just need to know if your Oracle Client install is 32 bit or 64 bit. But how do you figure that out? Here are two methods you can try.

The first method

Go to the %ORACLE_HOME%\inventory\ContentsXML folder and open the comps.xml file.
Look for <DEP_LIST>, which appears around the second screen of text.

If you see PLAT="NT_AMD64", then your Oracle Home is 64-bit.
If you see PLAT="NT_X86", then your Oracle Home is 32-bit.

It is possible to have both the 32-bit and the 64-bit Oracle Homes installed.

The second method

This method is a bit faster. Windows has a different lib directory for 32-bit and 64-bit software. Look under the ORACLE_HOME folder: if you see both a "lib" AND a "lib32" folder, you have a 64-bit Oracle Client; if you see just the "lib" folder, you've got a 32-bit Oracle Client.


OLTP vs Data Warehousing

Database

I've tried to explain the difference between OLTP systems and a Data Warehouse to my managers many times, as I've worked at a hospital as a Data Warehouse Manager/data analyst for many years. Why was the list that came from the operational applications different from the one that came from the Data Warehouse? Why couldn't I just get a list of the patients lying in the hospital right now from the Data Warehouse? So I explained, and explained again, and explained to another manager, and another. You get the picture.
In this article I will explain this very same thing to you, so you know how to explain it to your manager. Or, if you are a manager, you might understand what your data analyst can and cannot give you.

OLTP

OLTP stands for OnLine Transactional Processing. In other words: getting your data directly from the operational systems to make reports. An operational system is a system that is used for the day-to-day processes.
For example: when a patient checks in, his or her information gets entered into a Patient Information System. The doctor puts scheduled tests, a diagnosis, and a treatment plan in there as well. Doctors, nurses, and other people working with patients use this system on a daily basis to enter and get detailed information on their patients.
The data within operational systems is stored so that it can be used efficiently by the people working directly on the product, or with the patient in this case.

Data Warehousing

A Data Warehouse is a big database that fills itself with data from operational systems. It is used solely for reporting and analytical purposes. No one uses this data for day-to-day operations. The beauty of a Data Warehouse is, among other things, that you can combine the data from the different operational systems. You can actually combine the number of patients in a department with the number of nurses, for example. You can see how far a doctor is behind schedule and find the cause of that by looking at the patients. Does he run late with elderly patients? Is there a particular diagnosis that takes more time? Or does he just oversleep a lot? You can use this information to look at the past and see trends, so you can plan for the future.

The difference between OLTP and Data Warehousing

This is how a Data Warehouse works:

How a Data Warehouse works

The data gets entered into the operational systems. The ETL processes then Extract this data from those systems, Transform the data so it will fit neatly into the Data Warehouse, and then Load it into the Data Warehouse. After that, reports are built with a reporting tool from the data in the Data Warehouse.

This is how OLTP works:

How OLTP works

Reports are made directly from the data inside the database of the operational systems. Some operational systems come with their own reporting tool, but you can always use a standalone reporting tool to make reports from the operational databases.

Pros and Cons

Data Warehousing

Pros:

  • There is no strain on the operational systems during business hours
    • As you can schedule the ETL processes to run during the hours the least amount of people are using the operational system, you won’t disturb the operational processes. And when you need to run a large query, the operational systems won’t be affected, as you are working directly on the Data Warehouse database.
  • Data from different systems can be combined
    • It is possible to combine finance and productivity data for example. As the ETL process transforms the data so it can be combined.
  • Data is optimized for making queries and reports
    • You use different data in reports than you use on a day-to-day basis, and a Data Warehouse is built for this. For instance, most Data Warehouses have a separate date table where the weekday, day, month, and year are saved. You can make a query to derive the weekday from a date, but that takes processing time. By using a separate table like this you'll save time and decrease the strain on the database (see the sketch after this list).
  • Data is saved longer than in the source systems
    • The source systems need to have their old records deleted when they are no longer used in the day-to-day operations, so old records get deleted to maintain performance.
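
As a hedged illustration of that date-table point, with hypothetical table and column names, compare deriving the weekday per row against joining a pre-built date dimension:

-- Deriving the weekday at query time (computed for every row):
SELECT TO_CHAR(sale_dt, 'DAY') AS sale_day, SUM(sale_amt)
FROM sales_fact
GROUP BY TO_CHAR(sale_dt, 'DAY');

-- Using a separate date table (a simple join, no per-row derivation):
SELECT d.weekday_nm, SUM(f.sale_amt)
FROM sales_fact f
JOIN date_dim d ON d.date_srky = f.date_srky
GROUP BY d.weekday_nm;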

Cons:

  • You always look at the past
    • A Data Warehouse is updated once a night, or even just once a week. That means that you never have the latest data. Staying with the hospital example: you never know how many patients are in the hospital right now, or which surgeon didn't show up on time this morning.
  • You don’t have all the data
    • A Data Warehouse is built for discovering trends, showing the big picture. The little details, the ones not used in trends, get discarded during the ETL process.
  • Data isn’t the same as the data in the source systems
    • Because the data is older than that of the source systems, it will always be a little different; the Transformation step in the ETL process also changes the data a little. It doesn't mean one or the other is wrong, it's just a different way of looking at the data. For example: the Data Warehouse at the hospital excluded all transactions that were marked as cancelled. If you try to get the same report from both systems, and don't exclude the cancelled transactions in the source system, you'll get different results.

Online Transactional Processing (OLTP)

Pros

  • You get real time data
    • If someone is entering a new record now, you’ll see it right away in your report. No delays.
  • You’ve got all the details
    • You have access to all the details that the employees have entered into the system. No grouping, no skipping records, just all the raw data that’s available.

Cons

  • You are putting strain on an application during business hours.
    • When you are running a large query, you can take up processing capacity that would otherwise be available to the people who need the system for their day-to-day operations. And if you make an error, for instance by forgetting to put a date filter on your query, you could even bring the system down so no one can use it anymore.
  • You can’t compare the data with data from other sources.
    • Even when the systems are similar, like an HR system and a payroll system that work with each other, the data is always going to be different, because it is granular at a different level or because not all data is relevant to both systems.
  • You don’t have access to old data
    • To keep the applications at peak performance, old data that's irrelevant to day-to-day operations is deleted.
  • Data is optimized to suit day to day operations
    • And not for report making. This means you’ll have to get creative with your queries to get the data you need.

So what method should you use?

That all depends on what you need at that moment. If you need detailed information about things that are happening now, you should use OLTP.
If you are looking for trends, or insights on a higher level, you should use a Data Warehouse.


Oracle – How to get a list of user permission grants

IBM Infosphere Information Server (IIS)

Since the InfoSphere Information Server repository has to be installed manually with the scripts provided in the IBM software, sometimes you run into difficulties. So, here's a quick script which I have found useful in the past to identify user permissions for the IAUSER on Oracle databases, to help run down discrepancies in user permissions.

 

SELECT *
FROM ALL_TAB_PRIVS
WHERE GRANTEE = 'IAUSER';

 

If we cannot run against the ALL_TAB_PRIVS view, then we can try the USER_TAB_PRIVS view:

 

SELECT *
FROM USER_TAB_PRIVS
WHERE GRANTEE = 'IAUSER';

 

Related References

Oracle Help Center > Database Reference > ALL_TAB_PRIVS view

Databases – What is ACID?

Acronyms, Abbreviations, Terms, And Definitions

What does ACID mean in database technologies?

  • Concerning databases, the acronym ACID means: Atomicity, Consistency, Isolation, and Durability.

Why is ACID important?

  • Atomicity, Consistency, Isolation, and Durability (ACID) are important to databases because ACID is a set of properties that guarantee that database transactions are processed reliably.

Where is the ACID Concept described?

  • Originally described by Theo Haerder and Andreas Reuter in 1983, in 'Principles of Transaction-Oriented Database Recovery', the ACID concept has been codified in ISO/IEC 10026-1:1992, Section 4.

What is Atomicity?

  • Atomicity ensures that there are only two possible results from a transaction that changes multiple data sets (see the sketch below):
    • either the entire transaction completes successfully and is committed as a work unit,
    • or, if part of the transaction fails, all transaction data is rolled back to the database's previously unchanged data set
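
As a minimal sketch of atomicity (transaction syntax varies by database; the accounts table here is illustrative), both updates below succeed together or not at all:

-- Both statements commit as one work unit; on any failure, a ROLLBACK
-- returns the database to its state before the transaction started.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;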

What is Consistency?

  • To provide consistency, a transaction either creates a new valid data state or, if any failure occurs, returns all data to the state that existed before the transaction started. Also, if a transaction is successful, then all changes to the system will have been properly completed, the data saved, and the system left in a valid state.

What is Isolation?

  • Isolation keeps each transaction's view of the database consistent while that transaction is running, regardless of any changes performed by other transactions, allowing each transaction to operate as if it were the only transaction.

What is Durability?

  • Durability ensures that the database keeps track of pending changes in such a way that the state of the database is not affected if transaction processing is interrupted. When restarted, the database must return to a consistent state, providing all previously saved/committed transaction data.

 


Databases – Database Isolation Level Cross Reference

Database And Tables

 

Here is a quick-reference table of some common database and/or connection types that use connection-level isolation, showing their equivalent isolation levels. It may prove useful as a job aid when working with, and making decisions about, isolation level usage.

Database isolation levels (columns ordered from most to least restrictive)

Data sources            | Most restrictive | More restrictive | Less restrictive | Least restrictive
Amazon SimpleDB         | Serializable     | Repeatable read  | Read committed   | Read Uncommitted
dashDB                  | Repeatable read  | Read stability   | Cursor stability | Uncommitted read
DB2® family of products | Repeatable read  | Read stability*  | Cursor stability | Uncommitted read
Informix®               | Repeatable read  | Repeatable read  | Cursor stability | Dirty read
JDBC                    | Serializable     | Repeatable read  | Read committed   | Read Uncommitted
MariaDB                 | Serializable     | Repeatable read  | Read committed   | Read Uncommitted
Microsoft SQL Server    | Serializable     | Repeatable read  | Read committed   | Read Uncommitted
MySQL                   | Serializable     | Repeatable read  | Read committed   | Read Uncommitted
ODBC                    | Serializable     | Repeatable read  | Read committed   | Read Uncommitted
Oracle                  | Serializable     | Serializable     | Read committed   | Read committed
PostgreSQL              | Serializable     | Repeatable read  | Read committed   | Read committed
Sybase                  | Level 3          | Level 3          | Level 1          | Level 0
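
As a hedged illustration, most of these databases let you set the isolation level per transaction or per session; the exact statement and the supported level names vary by product:

-- Standard SQL form (also valid in Oracle and PostgreSQL):
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Session-scoped form (MySQL/MariaDB style):
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;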

 


 

Database – What is a foreign key?

Acronyms, Abbreviations, Terms, And Definitions

Definition of a Foreign Key

  • A Foreign Key (FK) is a constraint that references the unique primary key (PK) of another table.

Facts About Foreign Keys

  • Foreign keys act as a cross-reference between tables, linking the foreign key (child record) to the primary key (parent record) of another table, establishing a link/relationship between the table keys
  • Foreign keys are not enforced by all RDBMS
  • The concept of referential integrity is derived from foreign key theory
  • Because foreign keys involve more than one table relationship, their implementation can be more complex than primary keys
  • A foreign-key constraint implicitly defines an index on the foreign-key column(s) in the child table; however, manually defining a matching index may improve join performance in some databases
  • SQL normally provides the following referential integrity actions for deletions when enforcing foreign keys (a sketch follows the definitions below)

Cascade

  • The deletion of a parent (primary key) record may cause the deletion of corresponding foreign-key records.

No Action

  • Forbids the deletion of a parent (primary key) record if there are dependent foreign-key records. 'No Action' does not mean the foreign-key constraint is suppressed.

Set null

  • The deletion of a parent (primary key) record causes the corresponding foreign-key to be set to null.

Set default

  • The deletion of a parent (primary key) record causes the corresponding foreign keys to be set to a default value instead of null.
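
As a hedged sketch, assuming illustrative customers and orders tables, a deletion action is declared with the foreign-key constraint:

-- ON DELETE CASCADE: deleting a customer row deletes its order rows.
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id)
  REFERENCES customers (customer_id)
  ON DELETE CASCADE;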


Netezza / PureData – How to Number for day of week in SQL?

Netezza / PureData – Numeric Day of Week

I had reason today to get the number of the day of the week in PureData / Netezza, which I don't seem to have discussed in previous posts.  So, here is a simple script to get the number for the day of the week, in a couple of flavors, which may prove useful.

Basic Format

select extract(dow from <<FieldName>>) from <<SchemaName>>.<<TableName>>

Example SQL

 

SELECT
  CURRENT_DATE
, TO_CHAR(CURRENT_DATE,'DAY') AS DAY_OF_WEEK
-- WEEK STARTS ON MONDAY
, EXTRACT(DOW FROM CURRENT_DATE)-1 AS DAY_OF_WEEK_NUMBER_STARTS_ON_MONDAY
-- WEEK STARTS ON SUNDAY
, EXTRACT(DOW FROM CURRENT_DATE) AS DAY_OF_WEEK_NUMBER_STARTS_ON_SUNDAY
-- WEEK STARTS ON SATURDAY
, EXTRACT(DOW FROM CURRENT_DATE)+1 AS DAY_OF_WEEK_NUMBER_STARTS_ON_SATURDAY
FROM _V_DUAL;

 

Related References

Extract date and time values

PureData System for Analytics 7.2.1 > IBM Netezza database user documentation > Netezza SQL basics > Functions and operators > Functions > Extract date and time values

Date/time functions

PureData System for Analytics 7.2.1 > IBM Netezza database user documentation > Netezza SQL basics > Netezza SQL extensions > Date/time functions

Template patterns for date/time conversions

PureData System for Analytics 7.2.1 > IBM Netezza database user documentation > Netezza SQL basics > Netezza SQL extensions > Conversion functions > Template patterns for date/time conversions

https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_templ_patterns_date_time_conv.html

Database – What is a Composite Primary Key?

Database Table

What is a Composite Primary Key?

A Composite Primary Key is a primary key defined by having multiple fields (columns) in it.  As with a regular primary key, exactly what a composite primary key is depends somewhat on the database.  Essentially, a Composite Primary Key:

  • Is a combination of fields (columns) which uniquely identifies every row
  • Is an index in database systems which use indexes for optimization
  • Is a type of table constraint
  • Is applied with a data definition language (DDL) alter command (see the sketch below)
  • And may define a parent-child relationship between tables
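
As a hedged sketch, assuming an illustrative order_line table, a composite primary key can be applied with an alter command like this:

-- The key is the combination of order_id and line_no.
ALTER TABLE order_line
  ADD CONSTRAINT pk_order_line PRIMARY KEY (order_id, line_no);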


Database – What is a Primary Key?

Database Table

What is a Primary Key?

What a primary key is depends, somewhat, on the database.  However, in its simplest form a primary key:

  • Is a field (column) or combination of fields (columns) which uniquely identifies every row
  • Is an index in database systems which use indexes for optimization
  • Is a type of table constraint
  • Is applied with a data definition language (DDL) alter command (see the sketch below)
  • And, depending on the data model, can define a parent-child relationship between tables
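
As a hedged sketch, assuming an illustrative customer table, a primary key can be applied with an alter command like this:

ALTER TABLE customer
  ADD CONSTRAINT pk_customer PRIMARY KEY (customer_id);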


Oracle Database – Useful links

Oracle Database

Some useful Oracle documentation references:

Database SQL Language Reference (11.2)
https://docs.oracle.com/cd/E11882_01/server.112/e41084/

Database SQL Language Reference (11.2)
https://docs.oracle.com/cd/E11882_01/server.112/e16604/toc.htm

Database SQL Language Reference (12c)
https://docs.oracle.com/database/121/SQLRF/toc.htm

Database (12c) SQL*Plus User's Guide and Reference
https://docs.oracle.com/database/121/SQPUG/toc.htm

SQL Developer Documentation
http://docs.oracle.com/cd/E12151_01/

Database – What is DDL?

SQL (Structured Query Language)

What is DDL (Data Definition Language)?

DDL (Data Definition Language) statements are used to manage tables, schemas, domains, indexes, views, and privileges.  The major actions performed by DDL commands are: create, alter, drop, grant, and revoke.
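
A hedged illustration of each action, using hypothetical object and user names:

-- Create, alter, and drop a table; grant and revoke a privilege.
CREATE TABLE sales_stage (sale_id INTEGER, sale_amt DECIMAL(12,2));
ALTER TABLE sales_stage ADD sale_dt DATE;
GRANT SELECT ON sales_stage TO report_user;
REVOKE SELECT ON sales_stage FROM report_user;
DROP TABLE sales_stage;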

 


Data Modeling – Fact Table Effective Practices

Database Table

Here are a few guidelines for modeling and designing fact tables.

Fact Table Effective Practices

  • The table naming convention should identify it as a fact table. For example:
    • Suffix Pattern:
      • <<TableName>>_Fact
      • <<TableName>>_F
    • Prefix Pattern:
      • FACT_<<TableName>>
      • F_<<TableName>>
  • Must contain a temporal dimension surrogate key (e.g. date dimension)
  • Measures should be nullable – this has an impact on aggregate functions (SUM, COUNT, MIN, MAX, AVG, etc.)
  • Dimension surrogate keys (srky) should have a foreign key (FK) constraint
  • Do not place the dimension processing in the fact jobs (see the sketch below)
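
As a hedged sketch of these practices, with hypothetical names, showing a temporal surrogate key, a nullable measure, and FK constraints on the dimension surrogate keys:

CREATE TABLE Sales_Fact (
  Date_Srky    INTEGER NOT NULL,      -- temporal dimension surrogate key
  Product_Srky INTEGER NOT NULL,
  Sale_Amt     DECIMAL(12,2),         -- nullable measure
  CONSTRAINT FK_Sales_Fact_Date    FOREIGN KEY (Date_Srky)    REFERENCES Date_Dim (Date_Srky),
  CONSTRAINT FK_Sales_Fact_Product FOREIGN KEY (Product_Srky) REFERENCES Product_Dim (Product_Srky)
);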


Database – What is DML?

SQL (Structured Query Language)

What is DML (Data Manipulation Language)?

As the name indicates, Data Manipulation Language (DML) is used for working with the information inside a database structure.  There are four main DML commands, illustrated below:

  • Select – reading data rows
  • Insert – adding data rows
  • Update – changing values within data rows
  • Delete – removing data rows
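
A hedged illustration of each command, with hypothetical table and column names:

SELECT sale_id, sale_amt FROM sales_stage;                      -- read rows
INSERT INTO sales_stage (sale_id, sale_amt) VALUES (1, 100.00); -- add a row
UPDATE sales_stage SET sale_amt = 125.00 WHERE sale_id = 1;     -- change values
DELETE FROM sales_stage WHERE sale_id = 1;                      -- remove a row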


Data Modeling – Dimension Table Effective Practices

Database Table

I've had these notes lying around for a while, so I thought I'd consolidate them here.  Here are a few guidelines to ensure the quality of your dimension table structures.

Dimension Table Effective Practices

  • The table naming convention should identify it as a dimension table. For example:
    • Suffix Pattern:
      • <<TableName>>_Dim
      • <<TableName>>_D
    • Prefix Pattern:
      • Dim_<<TableName>>
      • D_<<TableName>>
  • Have a Primary Key (PK) assigned on the table surrogate key
  • Audit fields – Type 1 dimensions should:
    • Have a Created Date timestamp – when the record was initially created
    • Have a Last Update timestamp – when the record was last updated
  • Job flow: keep the dimension processing outside of the fact jobs
  • Every dimension should have a zero (0), 'Unknown', row
  • Fields should be 'NOT NULL', replacing nulls with a zero (0) for numeric and integer type fields or a space (' ') for character type fields (a sketch of these practices follows this list)
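
As a hedged sketch of these practices, with hypothetical names, including the audit timestamps and the zero (0) 'Unknown' row:

CREATE TABLE Customer_Dim (
  Customer_Srky  INTEGER NOT NULL,    -- surrogate key
  Customer_Nm    VARCHAR(100) NOT NULL,
  Created_Dt     TIMESTAMP NOT NULL,  -- when the record was initially created
  Last_Update_Dt TIMESTAMP NOT NULL,  -- when the record was last updated
  CONSTRAINT PK_Customer_Dim PRIMARY KEY (Customer_Srky)
);

INSERT INTO Customer_Dim
VALUES (0, 'Unknown', CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);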


What is DCL?

SQL (Structured Query Language)

What is DCL (Data Control Language)?

Data Control Language (DCL) comprises the statements used for administering access (permissions/security) to database content.  The main DCL commands are (illustrated below):

  • Grant
  • Revoke
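
A hedged illustration, with hypothetical object and user names:

GRANT SELECT, INSERT ON sales_fact TO report_user;
REVOKE INSERT ON sales_fact FROM report_user;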


Oracle SQL – How to limit the number of rows returned by a select query

Oracle doesn't recognize the 'LIMIT' clause; however, there is an easy way to achieve the same result: using the Oracle ROWNUM pseudo column in the 'where' clause.

ROWNUM Pseudo Column Description

The Oracle/PLSQL ROWNUM pseudo column returns a number that represents the order that a row is selected by Oracle from a table or joined tables. The first row has a ROWNUM of 1, the second has a ROWNUM of 2, and so on.

Basic ROWNUM Syntax

SELECT column_name(s)
FROM Schema.table_name
WHERE ROWNUM <= number
ORDER BY column_name(s);
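
One caution worth noting: Oracle assigns ROWNUM before the ORDER BY is applied, so the form above limits rows first and sorts afterward. To take the 'top N' rows by a sort order, a common pattern is to sort in a subquery and filter in the outer query:

SELECT *
FROM (SELECT column_name(s)
      FROM Schema.table_name
      ORDER BY column_name(s))
WHERE ROWNUM <= 10;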

 

Example SQL Limiting to One row

The one row limit has two accurate ways to get the desired result:

  • SQL limiting to one row using equal to

SELECT RSNBL_CODE,
  MVNG_AVG1_PRD
FROM Blog.SCP_FCST_CONTROL
WHERE rownum = 1;

  • SQL limiting to one row using less than or equal to

SELECT RSNBL_CODE,
  MVNG_AVG1_PRD
FROM Blog.SCP_FCST_CONTROL
WHERE rownum <= 1;

Oracle Rownum Limited to One Row

 

Example SQL Limiting to multiple rows

SELECT RSNBL_CODE,
  MVNG_AVG1_PRD
FROM Blog.SCP_FCST_CONTROL
WHERE rownum <= 10;

Oracle Rownum Limited to Multiple Rows

 

Surrogate Key File Effective Practices

Database Table

Here are a few thoughts on effectively working with IBM InfoSphere Information Server DataStage surrogate key files, which may prove useful for developers.

 

Placement

  • The main thing about placement is that it be in a consistent location. Developers and production support teams should not need to guess, or look up, where it is for every DataStage project. So, it is best to put the surrogate keys in the same base path, with each project having its own subfolder, to facilitate migrations and to reduce the possibility of human error. Here is the path structure which is commonly used:

Path

  • /data/SRKY/<<Project Name>>

Parameter Sets

  • As a best practice, the surrogate key file path should be in a parameter set and the parameter used in the jobs, as needed.  This simplifies maintenance, if and when changes to the path are required, and during migrations.

Surrogate Key Parameter Set Screenshot – Example Parameter Tab

Surrogate Key Parameter Set Screenshot – Example Values Tab

Surrogate Key Job Parameter Example Path using Parameter

Permissions

  • DataStage must have permissions to:
    • The entire parent path
    • The project folder, and
    • The surrogate key files themselves.

To ensure that DataStage has access to the path and surrogate key files, ensure:

  • The 'dsadm' (owner) and 'dstage' (group) have access to folders in the path, with at least a "-rw-r--r--" (644) permissions level. Keeping the permissions to a minimum can, for obvious reasons, prevent inadvertent overwrites of the surrogate key files, thus avoiding some potentially serious cleanup.
  • The ‘dsadm’ (owner) and ‘dstage’ (group) have access to the surrogate key files

Surrogate Key File Minimum Permissions

What are the Factors Affecting the Selection of a Data Warehouse Naming Convention?

Database Table

The primary factors affecting the choices in the creation of Data Warehouse (DW) naming convention policy standards are the type of implementation, the pattern of the implementation, and any preexisting conventions.

Type of implementation

The type of implementation will affect your naming convention choices. Basically, this boils down to, are you working with a Commercial-Off-The-Shelf (COTS) data warehouse or doing a custom build?

Commercial-Off-The-Shelf (COTS)

If it is a Commercial-Off-The-Shelf (COTS) warehouse, which you are modifying and/or enhancing, then it is very strongly recommended that you conform to the naming conventions of the COTS product.  However, you may want to add an identifier to the conventions to identify your custom objects.

Using this information as an exemplar:

  • FAV = Favinger, Inc. (Company Name – Custom Identifier)
  • GlobalSales = Global Sales (Subject)
  • MV = Materialized View (Object Type)

 

Suffix Pattern Naming Convention

<<Custom Identifier>>_<<Object Subject Name>>_<<Object Type>>

Example:  FAV_GlobalSales_MV

Prefix Pattern Naming Convention

<<Object Type>>_<<Custom Identifier>>_<<Object Subject Name>>

Example: MV_FAV_GlobalSales

Custom Data Warehouse Build

If you are creating a custom data warehouse from scratch, then you have more flexibility in choosing your naming convention.  However, you will still need to take into account a few factors to achieve the maximum benefit from your naming conventions.

  • What is the high level pattern of your design?
  • Are there any preexisting naming conventions?

Data Warehouse Patterns

Your naming convention will need to take into account the overall intent and design pattern of the data warehouse.  The objects and naming conventions of each pattern will vary, if for no other reason than the differences in the objects, their purpose, and the depth of their relationships.

High level Pattern of the Data Warehouse Implementation

The high level pattern of your design, whether an Operational Data Store (ODS), Enterprise Data Warehouse (EDW), Data Mart (DM), or something else, will need to guide your naming convention, as the depth of the logical and/or processing zones of each pattern will vary and carry some industry generally accepted conventions.

Structural Pattern of the Data Warehouse Implementation

The structural pattern of your data warehouse design, whether Snowflake, 3rd Normal Form, or Star Schema, will need to guide your naming convention, as the depth of relationships in each pattern will vary, carry some industry generally accepted conventions, and relate directly to your high level data warehouse pattern.

Preexisting Conventions

An often omitted factor in data warehouse naming conventions is the source of preexisting conventions, which can have significant impacts from both an engineering and a political point of view. The sources of these conventions can vary and may or may not be formally documented.

A common source of naming convention conflict is preexisting implementations, which may not even be documented.  However, the conventions with which system objects and consumers are already familiar will need to be taken into account when assessing impacts to systems, political culture, user training, and the creation of a standard convention for your data warehouse.

The Relational Database Management System (RDBMS) in which you intend to build the data warehouse may also have generally accepted conventions, with which consumers may be familiar and have preconceived expectations (whether expressed or implied).

Change Management

Whatever data warehouse naming convention you choose, the naming conventions, along with the data warehouse design pattern assumptions, should be well documented and placed in a managed and readily accessible change management (CM) repository.

 

 

The Value of Data Indicator Flags

Data Indicator Flags

I cannot count the times that using a flag (also called an indicator) has been described as merely a nice-to-have in database table design, at least until the code runs into complexity and/or performance challenges.

When designing your data models, ETLs, and reports, it is useful to consider how indicator flags can help.  Indicator flags are normally binary in nature, such as True/False or Yes/No, but they don't always need to be binary.

How indicator flags can help your processes and reporting:

  • Provide an equijoin for complex business rules which would otherwise result in 'Not In List', 'In List', 'Not Exists', 'Exists', 'Not Equal To', and sub-select SQL statements (a sketch follows this list)
  • Provide a processing marker to prevent lookups to other tables to determine an attribute. For example, a snapshot type (daily, weekly, monthly) flag, which can be used to apply data retention rules.
  • Provide a marker for special circumstances. For example, a Legal Hold flag to mark records to be exempted from removal and/or change to meet legal proceeding requirements.
  • The judicious planning and use of flags can reduce the number of full table scans required against large tables.
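
As a hedged sketch of the equijoin point, with hypothetical table and column names, compare a sub-select against a simple flag predicate:

-- Without a flag: a sub-select against another table
SELECT o.*
FROM orders o
WHERE o.order_id NOT IN (SELECT c.order_id FROM cancelled_orders c);

-- With a flag: a simple equality predicate on the flagged table
SELECT o.*
FROM orders o
WHERE o.cancelled_flag = 'N';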
