ETL Error Handling Effective Practices
ETL (extract, transform, and load) error handling practices vary, but three basic approaches contribute significantly to effective ETL error handling. Effective error handling begins in the requirements and design phases. All too often, error handling is left to the build phase and then falls to individual developer practices. This is an area where standard practices are not well defined or widely adopted by the ETL developer community. So, here are a few effective error handling practices that contribute to process stability, information timeliness, and information accuracy, and that reduce the level of effort required to support the application once in operation.
Anticipating ETL Errors
In the requirements and design phases, with proper consideration, many errors can be avoided altogether in the ETL process. When discussing requirements and preparing designs, consideration should be given to error handling, especially the treatment of common errors. As an effective practice, anticipated errors should be treated within the ETL process. Some examples to consider are:
- Replacement of special characters: do any special characters need to be removed if found? This is generally determined at the field level and should be considered in the source to target mapping (STTM) and business rules. Also, when passing data between different systems and working with VARCHAR fields, should the ‘Unicode’ extended property be set?
- Removal of leading and trailing spaces: removal of unnecessary leading and trailing spaces should be considered when changing fields from CHAR to VARCHAR and when working with keys used as primary keys, join keys, and/or lookup keys.
- Deduplication of data: duplicate data prevention practices and business rules should always be considered. These can be of three types:
- First, file processing conventions, such as assigning timestamps to files and removing or moving processed files to prevent reprocessing.
- Second, rules for the identification of duplicate rows, including the appropriate keys for determining duplicate rows.
- Third, if duplicate rows are being produced as a result of more than one input source system, identification of the authoritative source should be considered to resolve conflicts.
- Null Value Treatment: null value treatment can be extraordinarily important, especially when working with keys and traditional data warehouse models. It is important to be mindful that, to the database and the ETL, nulls and spaces are not the same thing. They may or may not be the same thing in the mind of the consumer of the information. So, business rules should indicate the treatment of both spaces and nulls. In some circumstances, especially when using surrogate keys in data warehousing, business processes need to know the difference between a null and a space, or even among a null, a space, and an unknown value. So these three scenarios should be considered when forming business rules and designing the ETL treatment. Here are a few questions that could be asked when forming your solution:
- Do nulls and spaces mean the same thing to the business community?
- Is a space considered an unknown value?
- Does a null need to be uniquely identified as different from a space and/or an unknown lookup value?
- If surrogate keys are in use for the field in question, which of these scenarios require a unique surrogate key other than the ‘unknown’ surrogate key?
- Missing or Invalid Value Replacement or Defaults: having replacement values or defaults is especially important for any fields which are not nullable and/or require a surrogate key for data warehouse dimensions. Replacement or default value assignments can also be important for reporting to be meaningful (e.g. for cubes and statistical calculations).
Rows should not be rejected unless there is a specific business requirement and/or need to do so. Rejecting rows causes data inaccuracies by omission and undermines the consumer’s confidence in the accuracy of the information being delivered. This can be especially problematic for accounting and other activities that must balance across information sets.
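As a minimal sketch of the treatments discussed so far (trimming keys, keeping nulls and spaces distinct, resolving duplicates by authoritative source, and defaulting rather than rejecting), the following Python example may help. The field names (`customer_id`, `region`, `source`), the default values, and the source ranking are all hypothetical; real values would come from the STTM and business rules.

```python
from typing import Dict, List, Optional

# Hypothetical replacement values; the business rules decide the real ones.
NULL_DEFAULT = "NULL"    # source value was null
BLANK_DEFAULT = "BLANK"  # source value was empty or only spaces

# Hypothetical source ranking: lower number = more authoritative.
SOURCE_PRIORITY = {"erp": 0, "crm": 1}


def clean_key(value: Optional[str]) -> Optional[str]:
    """Trim leading and trailing spaces; a null (None) stays a null."""
    return None if value is None else value.strip()


def default_region(value: Optional[str]) -> str:
    """Replace null and blank with distinct defaults instead of rejecting."""
    if value is None:
        return NULL_DEFAULT
    stripped = value.strip()
    return stripped if stripped else BLANK_DEFAULT


def _rank(row: Dict) -> int:
    """Rank a row by its source; unlisted sources rank last."""
    return SOURCE_PRIORITY.get(row["source"], len(SOURCE_PRIORITY))


def deduplicate(rows: List[Dict]) -> List[Dict]:
    """Keep one row per cleaned customer_id, preferring the authoritative source."""
    best: Dict[Optional[str], Dict] = {}
    for row in rows:
        key = clean_key(row["customer_id"])
        cleaned = {**row, "customer_id": key,
                   "region": default_region(row["region"])}
        current = best.get(key)
        if current is None or _rank(cleaned) < _rank(current):
            best[key] = cleaned
    return list(best.values())
```

Note that no row is dropped for having a missing value: a null or blank `region` receives a default, and only true duplicates on the cleaned key are collapsed.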
- If value lookups are in use:
- Unknown and null values need to have a treatment rule to prevent errors.
- Two surrogate keys or transformation default values may be necessary if the ability to distinguish between an unknown/invalid value and a null value is required.
- Make sure the lookup ‘Key Type’ (e.g. equality, caseless equality) is aligned to the formatting of both inputs to the lookup.
- Make sure the complete unique key is in use.
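The lookup rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the reserved surrogate key values (`-1`, `-2`), the two-part natural key (country, state), and the helper names are hypothetical and are not taken from any particular ETL tool.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical reserved surrogate keys; actual values are a design decision.
UNKNOWN_KEY = -1  # value was present but not found in the dimension
NULL_KEY = -2     # source value was null


def normalize(value: str) -> str:
    """Format a lookup input for caseless equality: trim and casefold."""
    return value.strip().casefold()


def build_lookup(
    dimension_rows: Iterable[Tuple[int, str, str]]
) -> Dict[Tuple[str, str], int]:
    """Index the dimension on its complete natural key (country, state)."""
    return {
        (normalize(country), normalize(state)): surrogate
        for surrogate, country, state in dimension_rows
    }


def lookup_key(
    lookup: Dict[Tuple[str, str], int],
    country: Optional[str],
    state: Optional[str],
) -> int:
    """Return a surrogate key; nulls and unknowns get distinct defaults."""
    if country is None or state is None:
        return NULL_KEY
    return lookup.get((normalize(country), normalize(state)), UNKNOWN_KEY)
```

Both sides of the lookup pass through the same `normalize` function, so the comparison is aligned regardless of how each input was originally formatted, and the lookup never raises an error for a null or unmatched value.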
Information Consistency Practices
Information consistency practices allow the information to be transformed and enriched to make it more consistent for ‘like to like’ comparisons, usability, and/or readability. As an effective practice, consider these standard formatting recommendations, which can be good requirements questions and should be included in the STTM:
- Make descriptive and/or text fields consistent in their format (e.g. mixed case, proper case, upper case).
- Use consistent date formatting when converting dates to text fields.
- When dealing with currency, convert the currency to consistent ISO currency codes (e.g. USD, CAD, EUR) and a consistent decimal precision (e.g. two decimal places).
- Classify financial records into categories (e.g. credit and debit), with a default group included (e.g. N/A or Unknown).
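The formatting recommendations above might look like the following in Python. This is a hedged sketch using only the standard library: the function names, the choice of proper case and ISO 8601 dates, the half-up rounding rule, and the `CR`/`DR` transaction codes are all assumptions made for illustration, and the actual rules belong in the STTM.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional, Tuple

# Hypothetical transaction codes mapped to reporting categories.
CATEGORY_BY_CODE = {"CR": "Credit", "DR": "Debit"}
DEFAULT_CATEGORY = "Unknown"


def standardize_text(value: str) -> str:
    """Collapse repeated whitespace and apply proper case."""
    return " ".join(value.split()).title()


def format_date(value: date) -> str:
    """Render a date as text in one consistent format (ISO 8601 assumed)."""
    return value.isoformat()


def standardize_amount(amount: str, currency: str) -> Tuple[str, Decimal]:
    """Upper-case the currency code and round to two decimal places."""
    code = currency.strip().upper()
    rounded = Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return code, rounded


def categorize(code: Optional[str]) -> str:
    """Map a transaction code to a category, defaulting when unrecognized."""
    if code is None:
        return DEFAULT_CATEGORY
    return CATEGORY_BY_CODE.get(code.strip().upper(), DEFAULT_CATEGORY)
```

`Decimal` is used rather than floating point so that rounding to two places is exact. The default category means an unrecognized or null code still lands in a reportable group instead of causing a rejected row.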