Data Warehousing vs. Data Virtualization

Information Management

Information Management

Today, a business heavily depends on data to gain insights into their processes and operations and to develop new ways to increase market share and profits. In most cases, data required to generate the insights are sourced and located in diverse places, which requires reliable access mechanism. Currently, data warehousing and data virtualization are two principal techniques used to store and access the sources of critical data in a company. Each approach offers various capabilities and can be deployed for particular use cases as described in this article.

Data Warehousing

A data warehouse is designed and developed to secure host historical data from different sources. In effect, this technique protects data sources from performance degradation caused by the impact of sophisticated analytics and enormous demands for reports. Today, various tools and platforms have been developed for data warehouse automation in companies. They can be deployed to quicken development, automate testing, maintenance, and other steps involved in data warehousing. In a data warehouse, data is stored as a series of snapshots, where a record represents data at a particular time. In effect, companies can analyze data warehouse snapshots to compare data between different periods. The results are converted into insights required to make crucial business decisions.

Moreover, a data warehouse is optimized for other functions, such as data retrieval. The technology duplicates data to allow database de-normalization that enhances query performance. The solution is further deployed to create an enterprise data warehouse (EDW) used to service the entire organization.

Data Warehouse Information Architecture

Data Warehouse Information Architecture

Features of a Data Warehouse

A data warehouse is subject-oriented, and it is designed to help entities analyze data. For instance, a company can start a data warehouse focused on sales to learn more about sales data. Analytics on this warehouse can help establish insights such as the best customer for the period. The data warehouse is subject oriented since it can be defined based on a subject matter.

A data warehouse is integrated. Data from various sources is first out into a consistent format. The process requires the firm to resolve some challenges, such as naming conflicts and inconsistencies on units of measure.

A data warehouse in nonvolatile. In effect, data entered into the warehouse should not change after it is stored. This feature increases accuracy and integrity in data warehousing.

A data warehouse is time variant since it focuses on data changes over time. Data warehousing discovers trends in business by using large amounts of historical data. In effect, a typical operation in a data warehouse scans millions of rows to return an output.

A data warehouse is designed and developed to handle ad hoc queries. In most cases, organizations may not predict the amount of workload of a data warehouse. Therefore, it is recommendable to optimize the data warehouse to perform optimally over any possible query operation.

A data warehouse is regularly updated by the ETL process using bulk data modification techniques. Therefore, end users cannot directly update the data warehouse.

Advantages of Data Warehousing

The primary motivation for developing a data warehouse is to provide timely information required for decision making in an organization. A business intelligence data warehouse serves as an initial checkpoint for crucial business data. When a company stores its data in a data warehouse, tracking it becomes natural. The technology allows users to perform quick searches to be able to retrieve and analyze static data.

Another driver for companies investing in data warehouses involves integrating data from disparate sources. This capability adds value to operational applications like customer relationship management systems. A well-integrated warehouse allows the solution to translate information to a more usable and straightforward format, making it easy for users to understand the business data.

The technology also allows organizations to perform a series of analysis on data.

A data warehouse reduces the cost to access historical data in an organization.

Data warehousing provides standardization of data across an organization. Moreover, it helps identify and eliminate errors. Before loading data, the solution shows inconsistencies to users and corrects them.

A data warehouse also improves the turnaround time for analysis and report generation.

The technology makes it easy for users to access and share data. A user can conduct a quick search on a data warehouse to find and analyze static data without wasting time.

Data warehousing removes informational processing load from transaction-oriented databases.

Disadvantages of Data Warehousing

While data warehousing technology is undoubtedly beneficial to many organizations, not all data warehouses are relevant to a business. In some cases, a data warehouse can be expensive to scale and maintain.

Preparing a data warehouse is time-consuming since it requires users to input raw data, which has to be achieved manually.

A data warehouse is not a perfect choice for handing unstructured and complex raw data. Moreover, it faces difficulties incompatibility. Depending on the data sources, companies may require a business intelligence team to ensure compatibility is achieved for data coming from sources running distinct operating systems and programs.

The technology requires a maintenance cost to continue working correctly. The solution needs to be updated with latest features that might be costly. Regularly maintaining a data warehouse will need a business to spend more on top of the initial investment.

A data warehouse use can be limited due to information privacy and confidentiality issues. In most cases, businesses collect and store sensitive data belonging to their clients. Viewing it is only allowed to individual employees, which limits the benefits offered by a data warehouse.

Data Warehousing Use Case

There are a series of ways organizations use data warehouses. Businesses can optimize the technology for performance by identifying the type of data warehouse they have.

  1. A data warehouses can be used by an organization that is struggling to report efficiently on business operations and activities. The solution makes it possible to access the required data
  2. A data warehouse is necessary for an organization where data is copied separately by different divisions for analysis in spreadsheets that are not consistent with one another.
  3. Data warehousing is crucial in organizations where uncertainties about data accuracy are causing executives to question the veracity of reports.
  4. A data warehouse is crucial for business intelligence acceleration. The technology delivers rapid data insights to analysts at different scales, concurrency, and without requiring manual tuning or optimization of a database.
Data Virtualization Information Architecture

Data Virtualization Information Architecture

Data Virtualization

Data virtualization technology does not require transfer or storage of data. Instead, users employ a combination of application programming interfaces (APIs) and metadata (data about data) to interface with data in different sources. Users use joined queries to gain access to the original data sources. In other words, data virtualization offers a simplified and integrated view to business data in real-time as requested by business users, applications, and analytics. In effect, the technology makes it possible to integrate data from distinct sources, formats, and locations, without replication. It creates a unified virtual data layer that delivers data services to support users and various business applications.

Data virtualization performs many of the same data integration functions, that is, extract, transform, and load, data replication, and federation. It leverages modern technology to deliver real-time data integration with agility, low cost, and high speed. In effect, data virtualization eliminates traditional data integration and reduces the need for replicated data warehouses and data marts in most cases.

Capabilities and Benefits of Data Virtualization

There are various benefits of implementing data virtualization in an organization.

Firstly, data virtualization allows access and leverage of all information that helps a firm achieve a competitive advantage. The solution offers a unified virtual layer that abstracts the underlying source complexity and presents disparate data sources as a single source.

Data virtualization is cheaper since it does not require actual hardware devices to be installed. In other words, organizations no longer need to purchase and dedicate a lot of IT resources and additional monetary investment to create on-site resources, similar to the one used in a data warehouse.

Data virtualization allows speedy deployment of resources. In this solution, resource provisioning is fast and straightforward. Organizations are not required to set up physical machines or to create local networks or install other IT components. Users have a single point of access to a virtual environment that can be distributed to the entire company.

Data virtualization is an energy-efficient system since the solution does not require additional local hardware and software. Therefore, an organization will not be required to install cooling systems.

Disadvantages of Data Virtualization

Data virtualization creates a security risk. In the modern world, having information is a cheap way to make money. In effect, company data is frequently targeted by hackers. Implementing data virtualization from disparate sources may give an opportunity to malicious users to steal critical information and use it for monetary gain.

Data virtualization requires a series of channels or links that must work in cohesion to perform the intended task. In this cases, all data sources should be available for virtualization to work effectively.

Data Virtualization Use Cases

  • Companies that rely on business intelligence require data virtualization for rapid prototyping to meet immediate business needs. Data virtualization can create a real-time reporting solution that unifies access to multiple internal databases.
  • Provisioning data services for single-view applications, such as in customer service and call center applications require data virtualization.

 

What Is Machine Learning?

Machine Learning

Machine Learning

Machine learning is Artificial Intelligence (AI) which enables a system to learn from data rather than through explicit programming.  Machine learning uses algorithms that iteratively learn from data to improve, describe data, and predict outcomes.  As the algorithms ingest training data to produce a more precise machine learning model. Once trained, the machine learning model, when provided data will generate predictions based on the data that taught the model.  Machine learning is a crucial ingredient for creating modern analytics models.

My Most Used Windows 10 Keyboard Shortcuts

Shortcut Keystrokes

Shortcut Keystrokes

While there are a great number of useful windows 10 shortcuts, I have the list below the combination, which I use daily.  Many of the shortcuts can be used across multiple applications (e.g. Notepad++, MS Word, SQL Server, Aginity, etc.) and save a considerable amount of mouse work.  Overall, these shortcut keys are more efficient and faster than using the mouse to perform the same task on a repetitive basis.

You may want to investigate the numerous other Windows 10 shortcuts keys, which best apply to your daily activities, but these are the ones, which I have found most useful and which I have committed to memory.

Table of My Most Used Windows Shortcuts

Key
Strokes

Behavior

Alt
+ Tab

Switch
between open apps

Ctrl
+ A

Select
all items in a document or window

Ctrl
+ Alt + Tab

Use
the arrow keys to switch between all open apps

Ctrl
+ C

Copy
the selected item

Ctrl
+ D

Delete
the selected item and move it to the Recycle Bin

Ctrl
+ F

Select
the search box

Ctrl
+ V

Paste
the selected item

Ctrl
+ X

Cut
the selected item

Esc

Stop
or leave the current task

F5

Refresh
the active window

F11

Maximize
Window

Related References

 Microsoft > Windows Support > Keyboard shortcuts in Windows

 

 

 

 

 

 

What is Source Control?

Acronyms, Abbreviations, Terms, And Definitions

Acronyms, Abbreviations, Terms, And Definitions

Source Control is an Information technology environment management system for storing, tracking and managing changes to software. This is commonly done through a process of creating branches (copies for safely creating new features) off of the stable master version of the software, then merging stable feature branches back into the master version. This is also known as version control or revision control.

Database – What is DDL?

SQL (Structured Query Language), Database, What is DDL?

SQL (Structured Query Language)

What is DDL (Data Definition Language)?

DDL (Data Definition Language), are the statements used to manage tables, schemas, domains, indexes, views, and privileges.  The the major actions performed by DDL commands are: create, alter, drop, grant, and revoke.

 

Related References

Data Modeling – Fact Table Effective Practices

Database Table

Database Table

Here are a few guidelines for modeling and designing fact tables.

Fact Table Effective Practices

  • The table naming convention should identify it as a fact table. For example:
    • Suffix Pattern:
      • <<TableName>>_Fact
      • <<TableName>>_F
    • Prefix Pattern:
      • FACT_<TableName>>
      • F_<TableName>>
    • Must contain a temporal dimension surrogate key (e.g. date dimension)
    • Measures should be nullable – this has an impact on aggregate functions (SUM, COUNT, MIN, MAX, and AVG, etc.)
    • Dimension Surrogate keys (srky) should have a foreign key (FK) constraint
    • Do not place the dimension processing in the fact jobs

Related References

Data Modeling – Dimension Table Effective Practices

Database Table

Database Table

I’ve had these notes laying around for a while, so, I thought I consolidate them here.   So, here are few guidelines to ensure the quality of your dimension table structures.

Dimension Table Effective Practices

  • The table naming convention should identify it as a dimension table. For example:
    • Suffix Pattern:
      • <<TableName>>_Dim
      • <<TableName>>_D
    • Prefix Pattern:
      • Dim_<TableName>>
      • D_<TableName>>
  • Have Primary Key (PK) assigned on table surrogate Key
  • Audit fields – Type 1 dimensions should:
    • Have a Created Date timestamp – When the record was initially created
    • have a Last Update Timestamp – When was the record last updated
  • Job Flow: Do not place the dimension processing in the fact jobs.
  • Every Dimension should have a Zero (0), Unknown, row
  • Fields should be ‘NOT NULL’ replacing nulls with a zero (0) numeric and integer type fields or space ( ‘ ‘ ) for Character type files.
  • Keep dimension processing outside of the fact jobs

Related References

 

 

Data Modeling – Database Table Field Ordering Effective Practices

Database Table Field Ordering Effective Practices

Database Table

Field ordering can help the performance on inserts and updates and, also, keeps developer and users from having to search entire table structure to be sure they have all the keys, etc.

Table Field Ordering

  1. Distribution Field Or Fields, if no distribution field is set the first field will be used by default.
  2. Primary Key Columns (including Parent and Child key fields)
  3. Foreign Key Columns (Not Null)
  4. Not Null Columns
  5. Nullable Columns
  6. Created Date Timestamp
  7. Modified (or Last Updated) Date Timestamp
  8. Large text Fields
  9. Large binary Columns or Binary Field references

Related References

Data Modeling – What is Data Modeling?

Data Models, Data Modeling, What is data Modeling, logical Model, Conceptual Model, Physical Model

Data Models

Data modeling is the documenting of data relationships, characteristics, and standards based on its intended use of the data.   Data modeling techniques and tools capture and translate complex system designs into easily understood representations of the data creating a blueprint and foundation for information technology development and reengineering.

A data model can be thought of as a diagram  that illustrates the relationships between data. Although capturing all the possible relationships in a data model can be very time-intensive, a well-documented models allow stakeholders to identify errors and make changes before any programming code has been written.

Data modelers often use multiple models to view the same data and ensure that all processes, entities, relationships and data flows have been identified.

There are several different approaches to data modeling, including:

Concept Data Model (CDM)

  • The Concept Data Model (CDM) identifies the high level information entities, their relationships, and organized in the Entity Relationship Diagram (ERD).

Logical Data Model (LDM)

  • The Logical Data Model (LDM)  defines detail business information (in business terms) within each of the Concept Data Model and is a refinement of the information entities of the Concept Data Model.   Logical data model are non-RDBMS specific  business definition of tables, fields, and attributes contained within each information entity from which the Physical Data Model (PDM) and Entity Relationship Diagram (ERD) is produced.

Physical Data Model (PDM)

  • The Physical Data Model (PDM)  provides the actual technical details of the model and database object (e.g. table names, field names, etc.) to facilitate creation of accurate detail technical designs and actual database creation.  Physical Data Models are RDBMS specific definition of the logical model used build database, create deployable DDL statements, and to produce the Entity Relationship Diagram (ERD).

Related References

 

Information Management Unit Testing

Information Management Unit Testing, UT, Unit Test

Information Management Unit Testing

 

Information management projects generally have the following development work:

  • Data movement software;
  • Data conversion software;
  • Data cleansing routines;
  • Database development DDL; and
  • Business intelligence and reporting analytical solutions.

Module testing validates that each module’s logic satisfies requirements specified in the requirements specification.

Effective  Practices

  1. Should focus on testing individual modules to ensure that they perform to specification, handle all exceptions as expected, and produce the appropriate alerts to satisfy error handling.
  2. Should be performed in the development environment.
  3. Should be conducted by the software developer who develops the code.
  4. Should validate the module’s logic, adherence to functional requirements and adherence to technical standards.
  5. Should ensure that all module source code has been executed and each conditional logic branch followed.
  6. Test data and test results should be recorded and form part of the release package when the code moves to production.
  7. Should include a code review, which should:
  • Focus on reviewing code and test results to provide additional verification that the code conforms to data movement best practices and security requirement; and
  • Verify that test results confirm that all conditional logic paths were followed and that all error messages were tested properly.

Testing Procedures

  1. Review design specification with designer.
  2. Prepare test plan before coding.
  3. Create test data and document expected test results.
  4. Ensure that test data validates the module’s logic, adherence to functional requirements and adherence to technical standards.
  5. Ensure that test data tests all module source code and each conditional logic branch.
  6. Conduct unit test in personal schema.
  7. Document test results.
  8. Place test data and test results in project documentation repository.
  9. Check code into code repository.
  10. Participate in code readiness review with Lead Developer.
  11. Schedule code review with appropriate team members.
  12. Assign code review roles as follows:
  • Author, the developer who created the code;
  • Reader, a developer who will read the code during the code review—The reader may also be the author; and
  • Scribe, a developer who will take notes.

Code Review Procedures

  1. Validate that code readiness review has been completed.
  2. Read code.
  3. Verify that code and test results conform to data movement best practices.
  4. Verify that all conditional logic paths were followed and that all error messages were tested properly.
  5. Verify that coding security vulnerability issues have been addressed.
  6. Verify that test data and test results have been placed in project documentation repository.
  7. Verify that code has been checked into code repository.
  8. Document action items.

Testing strategies

  1. Unit test data should be created by the developer and should be low volume.
  2. All testing should occur in a developer’s personal schema.

Summary

Unit testing is generally conducted by the developer who develops the code and validates that each module’s logic satisfies requirements specified in the requirements specification.

Where do data models fit in the Software Development Life Cycle (SDLC) Process?

Data Model SDLC Relationship Diagram

Data Model SDLC Relationship Diagram

In the classic Software Development Life Cycle (SDLC) process, Data Models are typically initiated, by model type, at key process steps and are maintained as data model detail is added and refinement occurs.

The Concept Data Model (CDM) is, usually, created in the Planning phase.   However,  creation the Concept Data Model  can slide forwarded or backward,  somewhat , within the System Concept Development, Planning, and Requirements Analysis phases, depending upon  whether the application being modeled is a custom development effort or a modification of a Commercial-Off-The-Shelf (COTS) application.  The CDM is maintained, as necessary, through the remainder of the SDLC process.

The Logical Data Model (LDM) is created in the Requirement Analysis phase and is a refinement of the information entities of the Concept Data Model. The LDM is maintained, as necessary, through the remainder of the SDLC process.

The Physical Data Model (PDM) is created in the Design phase to facilitate creation of accurate detail technical designs and actual database creation. The PDM is maintained, as necessary, through the remainder of the SDLC process.

Related References: