A data warehouse is a storage platform that contains historical data derived from transactional/relational databases. Data warehousing is the process of collecting, storing, and delivering decision-support data for some or all of an enterprise.
Because it integrates data from different sources, a data warehouse helps separate the analysis workload from the transaction workload. It is designed for query and analysis rather than transaction processing, so a business can consolidate data from several sources.
In addition to being an RDBMS (Relational Database Management System), a data warehouse often includes an ETL (Extract, Transform, Load) solution, an OLAP (Online Analytical Processing) engine, and client analysis tools, and can manage the process of gathering data and delivering it to business clients.
A data warehouse consists of three tightly integrated components:
- Data tables – store user data in tabular form.
- System tables – store metadata about the data in the data tables. Both the data and system tables are stored as files using the underlying operating system’s file system.
- SQL query engine – provides a SQL interface to store and analyze the data in the data tables.
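The three components can be sketched with SQLite, whose in-memory database conveniently exposes all three in a few lines (table and column names here are illustrative, not from any real warehouse):

```python
import sqlite3

# In-memory database standing in for the warehouse storage layer.
conn = sqlite3.connect(":memory:")

# Data table: stores user data in tabular form.
conn.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
conn.execute("INSERT INTO sales VALUES ('widget', 3), ('gadget', 5)")

# System table: SQLite keeps metadata about data tables in sqlite_master.
meta = conn.execute(
    "SELECT name, type FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(meta)  # [('sales', 'table')]

# SQL query engine: the same SQL interface is used to analyze the data.
total = conn.execute("SELECT SUM(qty) FROM sales").fetchone()[0]
print(total)  # 8
```

A production warehouse separates these components at much larger scale, but the division of labor is the same.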
Metadata is collected to improve supportability of the environment and to orient users to the meaning of the data.
There are two types of metadata used in an Enterprise Data Warehouse:
- Technical Metadata – data about how data is sourced, loaded, transformed, and managed.
- Business Metadata – data that helps users understand the business context, meaning, and appropriate usage of the data.
Metadata benefits include the following areas:
- Provides orientation to new team members on both the technical and business teams.
- Promotes usage by providing context and content information.
- Promotes a common understanding.
- Provides documentation on agreements about usage.
The major types of Metadata collected and available include:
- Source-to-Target Mappings – this metadata includes the transformation rules used to convert and load operational data into the data warehouse.
- Data Load Scheduling Dependencies and SLAs – this metadata describes how the data is loaded, including load dependencies and service levels for when data is available.
- Data Security & Usage Guidelines – this metadata is provided by the data stewards who own the data and specifies the requirements for data security along with any usage guidelines. The security requirements are implemented in the database, and the usage guidelines are available to all users. Security views control access to data by line of business (LOB).
- Logical Data Model (LDM) – this metadata provides the core attributes, entities, and relationships and helps users understand the data that is available for use.
- Physical Data Model (PDM) – this metadata provides the physical constructs used to distribute and access the data, including the physical table structures in the Integration Layer. This model shows the structures that help development teams produce performant code.
- Views – this metadata describes which views exist and helps users find the view that meets their needs. This improves the maintainability of the views.
- Corporate Metrics – the business and technical definitions of corporate metrics, i.e., metrics that have meaning to more than one department. These metrics are available in the Semantic Layer. Whenever users utilize this data, they should make sure they comply with the definitions provided.
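To make the idea of technical metadata concrete, a source-to-target mapping might be recorded as a structured document like the one below. Every system, table, and column name here is hypothetical, invented purely for illustration:

```python
# A minimal, hypothetical source-to-target mapping record. The source
# system, tables, columns, and schedule are invented for illustration.
mapping = {
    "source": {"system": "orders_oltp", "table": "ORDER_LINE", "column": "QTY_SOLD"},
    "target": {"layer": "Integration", "table": "SALES_FACT", "column": "quantity"},
    "transformation": "CAST(QTY_SOLD AS INTEGER); reject rows where QTY_SOLD < 0",
    "load_schedule": "daily 02:00",
}

def describe(m):
    """Render one mapping as a human-readable lineage line."""
    return (f"{m['source']['system']}.{m['source']['table']}.{m['source']['column']}"
            f" -> {m['target']['table']}.{m['target']['column']}")

print(describe(mapping))
# orders_oltp.ORDER_LINE.QTY_SOLD -> SALES_FACT.quantity
```

Real metadata repositories store thousands of such records and generate lineage documentation from them.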
Types of Layers in Enterprise Data Warehouse
The Integration Layer is the heart of the Integrated Data Warehouse. The Integration Layer contains the lowest possible granularity available from an authoritative source, in near Third Normal Form (3NF). The term ‘near 3NF’ is used because there may be requirements for slight denormalization of the base data. This is done on an exception basis. All 3NF tables will be defined using the Natural (or Business) keys of the data. Surrogate keys will not be used.
Data in the higher layers of the architecture is derived from data in this layer. This layer includes all corporate data that has business value to more than one business area, meaning that it has corporate value. Contained in this layer is the ‘base’ business data.
The source of the data in this layer is a combination of the operational systems, base data, master data and possibly applications that are resident on the EDW (e.g., Marketing Applications, Supply Chain Applications, or other applications which create data on the EDW). Aggregation/summary tables that have broad business use could also be located here.
Guiding Principles for the Integration Layer are:
- Commit to creating and maintaining a Logical Data Model (LDM) and a Physical Data Model (PDM)
- Maintain the data as appropriate to meet current and future business needs.
- Commit to backup, recovery, and business continuity that satisfy business requirements
Enterprise Base Tables: near-3NF representation of all business data. May also contain summary tables that are applicable across multiple business areas. An example of this could be Store/Item/Day sales created by summarizing all sales transactions.
Semantic / Data Access Layer
This layer consists of views that access the tables contained in the Integration Layer. Views are used to define a ‘virtual’ dimensional star-schema model that hides the complexity of the normalized data in the Integration Layer.
The semantic / data access layer provides ease of use for BI developers and ad hoc users. It presents data in a format that is easy to use and eliminates the most common joins of the physical tables. The Semantic / Data Access Layer structures provide users with a view of the data, and all applications and users consume the data via views. Tableau is the current BI tool for reporting; SQL Assistant is used for ad hoc queries. Security is managed per line of business through separate sets of semantic views.
Semantic View Benefits
- Simplification and usability – provides a business-specific view that may reduce attributes and combine tables to simplify usability for applications and for ad hoc access. Views can be used to create dimensional structures that are easier for BI tools to access and use.
- Enables and implements security by limiting the data returned based on the user’s access rights.
- Manages database locking.
- Provides a logical, more straightforward view of data for business users and applications, reducing the learning curve to use the data.
- Protects/isolates application code and user queries from changes to physical table structures.
- Allows joins to be performed in the database in parallel rather than in the application, improving performance.
This layer is intended to improve usability of the data and make access easy for both ad hoc users and BI tools. All access to Integration Layer and Performance Layer tables will be through views. This is the external view of the data warehouse.
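The simplification benefit can be sketched with SQLite: a view pre-joins two near-3NF tables so consumers query one simple structure instead of writing the join themselves (table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Near-3NF Integration Layer tables (names are illustrative).
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sale  (store_id INTEGER, amount REAL);
    INSERT INTO store VALUES (1, 'East'), (2, 'West');
    INSERT INTO sale  VALUES (1, 10.0), (1, 5.0), (2, 7.5);

    -- Semantic view: pre-joins and aggregates, so users never write the join.
    CREATE VIEW sales_by_region AS
        SELECT s.region, SUM(a.amount) AS total
        FROM sale a JOIN store s ON a.store_id = s.store_id
        GROUP BY s.region;
""")
rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(rows)  # [('East', 15.0), ('West', 7.5)]
```

If the physical tables are later restructured, only the view definition changes; queries against `sales_by_region` keep working, which is the isolation benefit listed above.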
Semantic Layer Guiding Principles
- Access to Enterprise Data or to application specific data must be performed through a view
- Use the LOCKING modifier
- Adopt and apply naming standards
- Avoid excessive nesting of views
Semantic Layer Components and Descriptions
Enterprise Views: one-to-one views on the base tables. These include views with write permissions for ETL and ELT applications as well as views that provide read access to base tables. There may also be views that define corporate metrics and logical structures used across business areas.
Business Intelligence Reporting Views: used by the reporting front end and most ad hoc queries. May combine Enterprise and Performance Layer access, and may include views that create star schemas or dimensional models to simplify data usage.
Security Views: used to limit access to any sensitive data based on access rights.
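A security view can be sketched as a row-level filter: the view exposes only the rows a given line of business is entitled to see (the table, column, and LOB names below are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base table holding data for every line of business (illustrative).
    CREATE TABLE revenue (lob TEXT, amount REAL);
    INSERT INTO revenue VALUES ('Retail', 100.0), ('Wholesale', 250.0);

    -- Security view for the Retail LOB: a row-level filter. Retail users
    -- would be granted access to this view, not to the base table.
    CREATE VIEW revenue_retail AS
        SELECT amount FROM revenue WHERE lob = 'Retail';
""")
rows = conn.execute("SELECT amount FROM revenue_retail").fetchall()
print(rows)  # [(100.0,)]
```

In a real warehouse, database grants ensure users can reach only the view for their LOB, never the base table.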
If performance requirements dictate better response time than the normalized tables in the Integration Layer can provide, denormalized versions of those tables can be created in the Performance Layer, either as physical tables or as other performance structures such as aggregate join indexes (AJIs).
Performance Layer Guiding Principles:
Physical objects will only be created when a need is demonstrated, based on performance requirements and SLAs. Core metrics and key performance indicators that are used across business areas are best defined in this layer.
- Performance can be improved through aggregates
- Indexes and Partitions can be used to limit the I/O needed
- Join Indexes can be used to pre-join data at load time, ahead of real-time application requests
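The aggregate principle can be illustrated with the Store/Item/Day example from earlier: detail transactions are pre-summarized once at load time, so summary queries never scan the detail rows (all names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detail-level fact rows in the Integration Layer (illustrative).
    CREATE TABLE sales_txn (store INTEGER, item TEXT, day TEXT, qty INTEGER);
    INSERT INTO sales_txn VALUES
        (1, 'widget', '2024-01-01', 2),
        (1, 'widget', '2024-01-01', 3),
        (2, 'widget', '2024-01-01', 4);

    -- Performance Layer aggregate: pre-summarized at store/item/day grain,
    -- so queries at this grain avoid scanning every transaction.
    CREATE TABLE sales_store_item_day AS
        SELECT store, item, day, SUM(qty) AS qty
        FROM sales_txn GROUP BY store, item, day;
""")
rows = conn.execute(
    "SELECT store, qty FROM sales_store_item_day ORDER BY store"
).fetchall()
print(rows)  # [(1, 5), (2, 4)]
```

An AJI serves the same purpose but is maintained automatically by the database rather than rebuilt by the load process.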
Data Warehouse Schema and Modeling
- Star Schema
It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of a fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas the dimension tables are denormalized. Although the star schema is the simplest architecture, it is the most commonly used today and is recommended by Oracle. It is an important special case of the snowflake schema and is more effective at handling simpler queries.
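A minimal star schema can be sketched in SQLite: one fact table at the center joined directly to each denormalized dimension (all table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables at the points of the star (denormalized).
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY,
                              day TEXT, month TEXT);
    INSERT INTO dim_product VALUES (1, 'widget', 'hardware');
    INSERT INTO dim_date    VALUES (10, '2024-01-01', '2024-01');

    -- Fact table at the center: foreign keys to dimensions plus measures.
    CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES (1, 10, 99.5), (1, 10, 0.5);
""")
# A typical star join: one hop from the fact table to each dimension.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY p.category, d.month
""").fetchall()
print(rows)  # [('hardware', '2024-01', 100.0)]
```

Note that `category` lives directly on `dim_product`; in a snowflake schema it would move into its own table.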
- Snowflake Schema
The snowflake schema is a dimensional model that is also composed of a central fact table and a set of constituent dimension tables, which are further normalized into sub-dimension tables. In a snowflake schema implementation, Warehouse Builder uses more than one table or view to store the dimension data. Separate database tables or views store data pertaining to each level in the dimension.
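Continuing the illustrative example above, the snowflake variant normalizes the product dimension so that category becomes its own sub-dimension table, at the cost of an extra join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflake: the product dimension is normalized, with category
    -- split out into a sub-dimension table (names are illustrative).
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY,
                               name TEXT, category_id INTEGER);
    CREATE TABLE fact_sales   (product_id INTEGER, amount REAL);
    INSERT INTO dim_category VALUES (1, 'hardware');
    INSERT INTO dim_product  VALUES (1, 'widget', 1);
    INSERT INTO fact_sales   VALUES (1, 20.0), (1, 5.0);
""")
# Reaching the category now takes two hops instead of one.
rows = conn.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON f.product_id = p.product_id
    JOIN dim_category c ON p.category_id = c.category_id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('hardware', 25.0)]
```

The normalization saves storage and avoids update anomalies in the dimension, which is why it pays off in the cases Kimball describes below.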
- Fact constellation schema
For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into multiple star schemas, each of which describes facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design, because many variants of aggregation must be considered and selected. Moreover, the dimension tables are still large.
When to Use a Snowflake Schema
Ralph Kimball, the data warehousing guru, proposes three cases where a snowflake implementation is not only acceptable but is also the key to a successful design:
- Large customer dimensions where, for example, 80 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions
- Financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products
- Multi-enterprise calendar dimensions, because each organization has idiosyncratic fiscal periods, seasons, and holidays
A fact table typically has two types of columns: foreign keys to dimension tables, and measures, which contain numeric facts. A fact table can contain fact data at a detail or aggregated level.
A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help describe the dimensional value; they are normally descriptive, textual values. Dimension tables are generally smaller in size than fact tables.
Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.
Data Integration in Data Warehouse
Data integration is a technique for moving data or otherwise making data available across data stores. The data integration process can include extraction, movement, validation, cleansing, transformation, standardization, and loading.
Extract Transform Load (ETL) In the ETL pattern of data integration, data is extracted from the data source and then transformed while in flight to a staging database. Data is then loaded into the data warehouse. ETL is strong for batch processing of bulk data.
Extract Load Transform (ELT) In the ELT pattern of data integration, data is extracted from the data source and loaded to staging without transformation. After that, data is transformed within staging and then loaded to the data warehouse.
Change Data Capture (CDC) The CDC pattern of data integration is strong in event processing. Database logs that contain a record of database changes are replicated near real time at staging. This information is then transformed and loaded to the data warehouse. CDC is a great technique for supporting real-time data warehouses.
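The ETL pattern described above can be sketched end to end: extract rows from a source, transform them in flight (cleansing and standardization), then load the result into the warehouse. The source data and table names are invented for illustration:

```python
import sqlite3

# Extract: rows pulled from a hypothetical source system. Quantities
# arrive as strings, and one row is invalid, to give the transform work to do.
source = [("widget", "3"), ("gadget", "-1"), ("widget", "2")]

def transform(rows):
    """Transform in flight: cast qty to int, reject negative quantities."""
    for item, qty in rows:
        qty = int(qty)
        if qty >= 0:
            yield item, qty

# Load: cleansed rows land in the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", transform(source))

total = warehouse.execute("SELECT SUM(qty) FROM sales").fetchone()[0]
print(total)  # 5
```

Under the ELT pattern the raw strings would be loaded to staging first and the cast/reject logic would run as SQL inside the database; CDC would replace the `source` list with a stream of change records read from database logs.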
[Data Warehousing](https://dzone.com/refcardz/data-warehousing)