Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are confidential to and maintained as a trade secret by Microsoft Corporation. Information in these materials is restricted to Microsoft authorized recipients only. Any use, distribution or public discussion of, and any feedback to, these materials is subject to the terms of the attached license. By providing any feedback on these materials to Microsoft, you agree to the terms of that license.
Microsoft Corporation Technical Documentation License Agreement (Standard)
READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE TERMS.
1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and Microsoft agree as follows:
(a) If You are an authorized representative of the corporation or other entity designated below ("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have authority to act on behalf of Company and agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the Confidential Information accordingly;
(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that You will treat the Confidential Information accordingly; or
(c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing the Confidential Information to any third party for five (5) years from the date of disclosure of the Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the Confidential Information; and (c) to take reasonable security precautions, at least as great as the precautions it takes to protect its own confidential information, but no less than reasonable care, to keep confidential the Confidential Information. You/Company, however, may disclose Confidential Information in accordance with a judicial or other governmental order, provided You/Company either (i) gives Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or governmental entity that it will afford the Confidential Information the highest level of protection afforded under applicable law or regulation. Confidential Information shall not include any information, however designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such information to You/Company pursuant to the terms of this Agreement; (iii) became known to You/Company from a source other than Microsoft other than by the breach of an obligation of confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible form relating to and/or including released or unreleased Microsoft software or hardware products, the marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and information received from others that Microsoft is obligated to treat as confidential.
2. You may review these Materials only (a) as a reference to assist You in planning and designing Your product, service or technology ("Product") to interface with a Microsoft Product as described in these Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone else.
3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.
4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You changes, Your right to use these Materials automatically terminates and You must destroy them.
5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback") relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason to believe is subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with any third party.
6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by Your NDA.
7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be brought in the federal or state superior courts located in King County, Washington, and You waive any defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft concerning these Materials; it may be changed only by a written document signed by both You and Microsoft.
In this Chapter
● Choosing appropriate schema type and designing
● Dimensions, Facts and Granularity
● Slowly Changing Dimensions
● Modeling Time and Date Dimension
Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouse. Dimensional model is the underlying data model used by many of the commercial OLAP products available today in the market. The dimensional data model provides a method for making databases simple and understandable. The major purpose of creating a data warehouse from transactional systems is creating intelligence out of day to day activity and it is not intended for extracting operation reports and cannot be treated merely as a report store.
The terms that are used frequently in this document, are explained below. Details of each of these terms are explained in the later sections of this chapter.
Dimension table: It is a business entity of the source system. There can be multiple normalized table represent one single business entity on the source system. Example: Customer Dimension, Product Dimension
Fact table: A central table in a data warehouse schema that contains numerical measures and keys relating facts to dimension tables.
Grain: The level of the measures within a fact table represented by the lowest level of the dimensions. It is the level of detail the facts are stored in the fact table. Example, Fact table may contain lower grain to daily transactions such as customer purchasing information and number books purchased. OR fact table may contain higher grain to simply state the number of books sold without the details of who bought which book.
Measures: It is a numeric attribute that would typically be derived from transactions. It is generally placed in fact table of dimensional structure. Example: Number of books sold today
Hierarchies and levels: This is basically group of attributes that would logically relate each other in the form a tree structure. Example: City, State, Country; Day, Month, Year. Levels is the relation between the members of the hierarchy, such as in the above example, city is at lower level in the hierarchy and Country is at higher level.
Physical Schema Structures
There are basically two types structures that are popularly used while designing the physical dimensional modeling, they are Star Schema and Snow flake Schema
• Star Schema. The simplest relational schema for querying a data warehouse databases is the star schema. The star schema has a center, represented by a fact table, and the points of the star, represented by the dimension tables. From a technical perspective, the advantage of a star schema is that joins between the dimensions and the fact tables are simple, performance, ability to slicing and easy understanding of data.
The following diagram depicts a sample star schema,
• Snowflake Schema. A variation of a star schema is called a snowflake. In a snowflake, the dimension tables are normalized. From a reporting perspective, there is no logical difference because all the tables of a snowflake dimension can be logically joined into a single star dimension. From a performance perspective, the snowflake may result in slower queries because of the additional joins required.
Designing a Warehouse Schema
Following are the basic steps for designing a data warehouse based on facts and dimensions.
Identify Facts and Fact Table Grain
o Identity business processes for analysis such as sales trend over the period or cross sell
o Determine the facts that are needed directly or indirectly for the identified business analysis
o The facts identification is the process of looking at source database and see if the source database can satisfy business needs such as source does not have any information on products name/brand sold but has only consolidated information on the kind of product that is sold
o Identify and list out source facts typically stored in transaction tables
o Determine the lowest level of granularity needed for the facts such as business analysis needs only weekly sales information
o Determine the entities for the source transactions. It can be identified from the parameters for the source measures such as Product cost is based on products, date and store
o Business entities are termed as dimensions and numeric attributes of the transactional tables are called measures. Fact table is a set of dimensions attributes and measures.
Identify Dimension Keys and Attributes
A dimension logically consists of a primary (unique) key that maps to a foreign key in the fact table. The dimension also includes additional attributes about that key. For example, in a Customer dimension, the Customer ID may be the logical primary key. Some of the attributes provide the basis for grouping the facts. These are the groupable attributes. For example, Gender, City, State, and Country may be groupable attributes of a customer. Other attributes are descriptive. For example, Mobile Phone Number is an attribute of a customer, but a report is unlikely to group by phone numbers. Some of the groupable attributes can be organized into a hierarchy, allowing drill-down reporting. For example, City rolls up to State, which rolls up to Country. Not all attributes can be organized into a single hierarchy. For example, Gender does not have any correlation to City.
o As OLTP is normalized there can be more than one table present for each business entity such as Products, Products Manufacturer details
o Extract the required attributes of each business entity such as Products needs only Product Name, Manufacturer though there are other attributes of parameters but not needed for the analysis
o Denormalize the business entity (flattening normalized tables to make unique set of attributes of each business entity)
o Prepare a logical data modal consisting structure on how business entities related each other and entities that depend on source measures
o It is not required to use all the business entities collected for each fact
o Hierarchies for the dimensions are stored with in the dimensional table in star schema
o In Star schema every dimension will have surrogate key as a primary key and there is no child/parent table present for the dimensions
o Fact tables will have a composite key containing the dimension keys
Steps in designing Snow Flake Schema
• It is similar to Star schema but with a degree of normalization is being performed
• One or more dimension tables, each with a generated primary key, which contain one or more descriptive fields used to determine some level attributes for each dimension. In addition, a dimension table may contain a join to one or more other child dimension tables that supply additional level attributes for each dimension Example: Manufacturer information is kept in a separate dimension and Product dimension has a reference of Manufacturer in its attributes. In star schema the manufacturer information would have been as attributes to Product dimension (not a separate dimension)
• In a snowflake schema, more than one table is required to completely define the set of attributes for the dimension.
• In a snow flake schema, a dimension table may have have a relationship to foreign keys in more than one dimension table. For example, a Geography table can join to the GeographyKey of Customer, Reseller, and Manufacturer
• Whether a dimension is designed as a single table (star) or multiple tables (snowflake) has no effect on the fact table.
• In a fully-normalized snowflake dimension, you can clearly see the hierarchical relationship between attributes. In a star dimension, you may not see the hierarchical relationships.
Guidelines for choosing Star vs. Snowflake
• Consider a snowflake schema to reduce storage requirements for a large dimension tables. For example, Storing the information about the manufacturer in a product dimension table would make a product dimension unnecessarily large because one manufacturer may provide multiple products
• Consider snowflaking a dimension if one entity is referenced by multiple dimensions. For example, Customers, Resellers, and Manufacturers may all have a Geography component. Extracting geography information into a normalized table allows for consistency between geographical definitions for the different dimensions. A snowflaked dimension accessed from multiple dimensions is called a reference dimension.
• Another reason for using a snowflake schema is to supply information, not about the level attributes of a dimension, but the secondary attributes of a level.
Consideration of Operational Data Source (ODS)
Operational Data Source is a topic covered in Extraction guide.
o ODS data store technique is adopted when business requirement does demand OLTP granularity for shorter duration of time.
o ODS should be considered only when information retrieval requirements are to be real time and cannot access OLTP systems because of criticality of resources.
o ODS is a logically different data store apart from warehouse.
o The data in ODS is much granular or equal to the granularity of warehouse. ODS does not maintain history of attributes and the duration of ODS data is typically to the order of synch duration. Example: If data is loaded to warehouse once in a day then intra day transactions are maintained in ODS.
o ODS does not maintain Surrogate keys but having business keys
o If ODS exists, it will be used for loading to warehouse than any other staging source
o Population to ODS is through ETL process such as Data Extraction, Transformation and Loading
OLTP data sources typically use an entity-relationship (E-R) schema technique for modeling transactional data, because OLTP transactions usually involve a large number of transactions of small amounts of data. Data warehouses, on the other hand, usually involve fewer transactions of very large amounts of data, so E-R schema technique is not as efficient. Instead, data warehouses generally employ one of two approaches towards schema design, referred to as star schema or snowflake schema.
There are various terms relating to dimensions that you should be familiar with. Following is a brief summary. Most of these will be discussed in more detail elsewhere.
Changing Dimension: Changing dimension is the dimension which has at least one attribute whose value would change over the time.
Slowly Changing Dimensions: Attributes of a dimension that would undergo changes over time. It depends on the business requirement whether particular attribute history of changes should be preserved in the warehouse. Example: Employee work location attribute may undergo changes over time not frequently. This is called a Slowly Changing Attribute and a dimension containing such an attribute is called a Slowly Changing Dimension.
Rapidly Changing Dimensions: A dimension attribute that changes frequently is a Rapidly Changing Attribute. For example, a CurrentLocation attribute for an Employee may change very frequently. If you don’t need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension. Deciding if an attribute should be moved into a Rapidly Changing Dimension or kept in a Slowly Changing Dimension is up to the data architect. The decision depends on factors such as disk space, performance, expected data volume, history preservation etc,
Junk Dimensions: A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions. If you have bit-mapped indexes available, you may not need to create separate Junk dimensions, because the multiple, small-cardinality foreign keys can be managed efficiently.
Inferred Dimensions: While loading fact records, a dimension record may not yet be ready. One solution is to generate an surrogate key with Null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.
Conformed Dimensions: A Dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses. Conformed dimensions are discussed in detail later in this chapter.
Degenerate Dimensions: A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. Common examples include Order Line Number or Transaction Number. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.
Role Playing Dimensions: A role-playing dimension is one where the same dimension key—along with its associated attributes—can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both Ship Date and Delivery Date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of Role Playing dimension.
Shrunken Dimensions: A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the Shrunken Dimension.
Outtriggers: Sometimes a dimension attribute can have multiple values for a single dimension key member. For example, a customer may have multiple accounts. One solution is to create a separate table with Customer Account as its logical primary key. An outrigger dimension is the logical inverse of a shrunken dimension.
Static Dimensions: Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually—for example with Status codes—or it can be generated by a procedure, such as a Date or Time dimension.
• Dimensions provide the 'who, what, where, when and why' that is applied to the 'how much' fact.
• Lookup tables and master tables which generally act as a primary key tables in OLTP would end up as dimensions in warehouse
• For normalization reasons, there can be more than one table for a single entity in OLTP such as customer table and customer address table, both of them can make it up to one dimension in warehouse as a part of the denormalization process
• Dimensions are typically highly denormalized and descriptive. They are typically text fields, and serve as context for the facts
• Dimensions can be left as degenerate dimensions whenever they make sense by themselves (usually a matter of opinion), provided that they do not have additional attributes. One reason for creating a separate dimension table is if you want to report on “missing” values from the fact table: the dimension table can include all possible values, not just the ones for which there are fact transactions. By creating an outer join, you can find members that do not have fact transactions.
• Dimensions are some times referred to as causal: they contain the causal factors responsible for the collected measures
• If the attributes are connected in one or more hierarchies then these are usually captured in the dimension table
• Dimension attributes are business descriptors users believe are of significance to understanding the measures in the fact table
• Dimension attributes are usually discrete (not continuous) and are often used as the source of constraints. Some dimension attributes may be continuous—for example the list price for products. Continuous values are not useful for grouping. You can often convert a continuous value into discrete buckets. For example, converting prices into High, Medium, Low.
• Dimension attributes usually have a limited cardinality at least when compared to facts, which come from OLTP transactions
• User input and evaluation of existing business reports help define the dimensions to include in the data warehouse. The words “by”, “for”, “of”, and “in” often flag a dimension.
o FOR EX 1: A user who wants to see data by sales region and by product has just identified two dimensions (geography and product).
o FOR EX 2: Business reports that group sales by salesperson or sales by customer identify two more dimensions (salesforce and customer).
o FOR EX 3: The question, "What were the sales of food items in the northwest region for the third quarter of 1999?" represents the use of three dimensions (product, geography, time) to specify the information to be summarized
Following are the guidelines pertaining to the dimension maintenance
• Avoid removing attributes from a dimension’s schema. Consider not removing an attribute from a dimension without analyzing the impact on other objects such as analytical queries, report queries, and cubes. Even if the attribute is discontinued at the source, you can add a default value in the warehouse for new rows, rather than remove the attribute from dimension. Business keys should always be retained in the dimension tables, as these provide a mechanism for tracing the report values back to the source.
• As an example, If the source schema is modified to remove an attribute such as Date of Birth, there probably already exist data for the attribute in the dimension table. So if the attribute is kept in Dimension it will have NULL, if removed you are loosing historical data
• Adding a new attribute to the dimension should not change the granularity of the dimension. For example: Adding Customer location should not change the granularity of the customer dimension. If there is only one record in customer dimension before adding location, it should still have only one record per customer in the dimension even after adding location attribute to it. If adding a new attribute takes the dimension to a new lower granularity then consider managing the value as a slowly changing attribute, or add an outrigger table to the dimension. Changing the dimension granularity would affect the fact tables with references to this dimension
• Dimensions must have the ability to add attributes to dimensions incrementally. Assume all attributes on the dimension is Type I then if a new attribute is added which could potentially result in changing the dimension to Type II this would impact the ETL logic. Consider maintenance changes while developing ETL framework such some attributes of Type I today may become Type II tomorrow or vice versa. Type I and Type II are part of slowly changing dimension design, which is covered in the later section.
• Data warehouse architectures must be designed to accommodate ongoing data updates, and to allow for future expansion with minimum impact on existing design. Fortunately, the dimensional model and its straightforward schemas simplify these activities.
• Records are added to the fact table in periodic batches, often with a little effect on most dimensions. For example, a sale of an existing product to an existing customer at an existing store will not affect the product, customer, or store dimensions at all. If the customer is new, a new record should have been added to the customer dimension table before adding the fact record. The historical nature of data warehouses means that records almost never have to be deleted from tables except to correct errors. Errors in source data are often detected in the extraction and transformation processes and are corrected before the data is loaded into the data warehouse database.
• The date and time dimensions are often created and maintained in the data warehouse independent of the other dimension tables or fact tables – updating date and time dimensions may involve only a simple annual task to physically add the records for the next year.
• An entirely new schema can be developed to a data warehouse. A new business subject area can be added by designing and creating a fact table and any new dimensions specific to the subject area. Existing dimensions can be reused without modification to maintain conformity throughout the entire warehouse. If the new fact table has a different grain, you may need to create a shrunken dimension (or add an outrigger table) while still conforming with the master dimension.
Consider the following guidelines when designing your Dimension tables:
This is the process of flattening the dimensions to form one single unit of all the related entities. Normalization is generally performed on OLTP systems to reduce redundancy and reduce the risk of update anomalies while managing many transactions. On the contrary, denormalization introduces redundancy and reduces joins, allowing for faster retrieval and more user-friendly structures
Unlike a transaction system, where new transactions can be launched from many process, in Data warehouse only the ETL process will add or update records, and the update frequency is much less than that of a typical OLTP.
Benefits of selectively violating normalization rules and introducing redundancy:
• Redundancy can reduce data retrieval time
• Redundancy can create a more user-friendly model
Potential disadvantages with denormalizing the database:
• It may cause update, delete and insert performance to suffer
• It will take up more space
In a typical data warehouse, the advantages of denormalization far outweigh the disadvantages.
Avoid encoding record information in Surrogate key
A surrogate key should be a simple running sequence number and should not have any encoded information. If the surrogate key is an integer, the query manager can perform faster joins between the fact table and the dimension table. The surrogate key should be able to change freely when you and it should be able to apply a new surrogate key to any new records added because of Type II changes. Encoding record information in surrogate key would have adverse affect when the record information is changed.
For example, a surrogate key that includes the combination of Employee ID and Location ID would need to change if the employee location changes. This defeats the decoupling purpose of the surrogate key.
Denormalize snowflake dimensions for end users
If you decide to store a dimension as a snowflake—in multiple tables—use a view to denormalize it into a star before representing it to end users. A simple dimension structure that consists of simply a dimension key and a list of attributes is easier for users to understand than a complex relationship of tables with joins.
For example, an Employee dimension may be stored as an Employee table and a Department table. If the Department ID is a foreign key in the Employee table, it is logically part of the Employee dimension. Rather than show the end user two separate entities, show the dimension as if it were a single table.
Do not show surrogate keys to the end user because they do not have any business meaning.
Consider adding zero key for Dimension tables
Adding a dummy record in each dimension as a first record whose surrogate key value is zero, so called as zero key. The business key and other attributes are defaulted to NULL or default value such as Not Available.
The fact table typically contains the transactions that are recorded at source system. Due to the nature of extraction process implemented, there is a possibility that business key is not available at the time of fact table loading. In such a case, a default key is assumed from dimension (as dimension is not yet populated with the required key) having zero value. Once the dimension record comes back in the loading, the corresponding zero key in the fact is updated with corresponding surrogate key of the business key. Zero keys also added date and time dimensions. If the date/time information is not available at the time fact table population zero keys of the corresponding dimensions are assumed.
Do not combine attributes from different business entities into a single dimension
One Dimension represents one entity and its details. Combining two entities into one dimension in order to reduce number of dimensions does not adequately model the relationship.
For example, suppose that a single Employee can participate in multiple Projects, and a single Project can include multiple Employees. This means that Employees and Projects are different entities. The relationship between Employees and Projects must be from an association table or a fact table. If you try to include Project information in the Employee dimension, you would need a separate Employee member for each Employee Project.
Create a surrogate key as the primary key of each dimension
Surrogate keys will act as a primary key for the dimension. It is required because of following advantages
Consolidating from Multiple sources: When data is consolidated from multiple sources, there is a possibility that business key repeated across sources. To have it uniquely identified, surrogate key is created. This is a single column incremented for each added record and its value is not based on the business key
Slowly changing members: Dimension members can change over time. In Type II dimensions where the history of Type II attribute is tracked will have business key repeated. Surrogate keys will uniquely identify each dimensional record irrespective of type of change in attribute columns.
Inferred dim support: When the dimension record is a late arrival after its fact records then a new dummy record (having business key and all other columns as NULL) is inserted in dimension table with a business key. Surrogate key here supports the scenario to get referenced in fact table, be aware that dummy dimensional record will get updated once the dimension has received the late arrived dimensional record.
Natural Key may change: The natural key (source key) may be modified in source system, having it referenced in fact table would lead to chaos as it needs to update entire fact table wherever the natural key is referenced. Having surrogate key independent of natural key would not have to change anything even if the natural key is modified in the source system
Performance: Having an integer surrogate key in the dimension can improve the performance of data access queries because of its integer data type and also single attribute to represent the dimension record. Consider the natural key at source is not a single column but a group of columns acting as a composite key. Having this in dimension with surrogate key support would force the facts having these two keys referenced, this would not only increase the disk usage but joins on both the columns to get dimensional record
Use a simple integer for Surrogate keys
• Integer key joins are faster that joins with other data types, The SQL query engine is optimized for integer comparison.
• If possible, use an Identity constraint to generate the surrogate key. This is much more efficient than creating the key by using ETL code
• Each time a member has a new value for a Type II attributes, you add a new record, which generate another surrogate key. This allows older fact table rows to have the old attribute value, while new fact table rows have the new value.
• Surrogate keys can be defined as unique across source systems, so that a John Reynolds from system XYZ and a John Reese from system ABC might both be identified in their respective systems as JR. However, once they are loaded into the data warehouse, they are each assigned a unique, non-overlapping value.
Identify type and relationship between the dimensions and the facts
Consider following guidelines when defining the relationship between dimensions and facts
• If an attribute in a dimension is a numeric value that can be meaningfully aggregated in user reports, consider adding the numeric value to a fact table
• A dimension can have numeric values that don’t need to be aggregated in reports. These numeric values can remain in the dimension.
• A fact table may contain only foreign keys, but no measures. This is called a factless fact table and exists when the business measures consist of only counts. This is a valid fact table design, but you may need to use distinct counts to aggregate the counts, rather than simple fact row counts.
• The purpose for a dimension table is the ability to have multiple attributes for the dimension key. If a dimension key does not have any additional attributes, you can implement the dimension as a degenerate dimension in the fact table.
Creating separate snow flake dimension tables for variable number of attributes of entities
For heterogeneous dimension items such as products, some products have 100 attributes to track and some products just 10. Having all these product details in the same dimension table would make the dimension table very wide, with no values for most attributes in any one row (sparse). One solution is to create one main dimension table containing the shared attributes and create individual dimension tables that join to the main dimension table to track the extended attributes
Create an association table when there is a many-to-many relationship between two dimensions
An association table maps the relationship between attributes from different dimensions.
Consider there are two dimensions namely Authors Dimension and BookTitle Dimension. One author can have multiple bookTitles and one Booktitle can have multiple authors. This is many to many relationship between authors and booktitles.
Consider creating a separate association table which has the association between Authors and Titles.
EmpId EmpName Age
1098 John 28
EmpId EmpName Age
1098 John 28
Also, the association is maintained without the need of any fact "happening" transactions at the source. In general, the association tables have keys mapping in many to many relationship but if underlying dimensions are type II, there is maintenance overhead. To reduce overhead, we would suggest creating association table with business keys rather than SKs. The key can easily found in Type II dimension with the help of EndKey
Guidelines for choosing an Outrigger table
If a dimension attribute is not unique for the business key then another dimension is
created also called a child dimension which would have reference of the parent dimension and also has its own surrogate key. One business key record in parent dimension can have in multiple records in the child dimension, the child dimension here called as outrigger.
This is sometimes called a Multi-valued Snowflake Dimension or a Reference Dimension
An outrigger table is different from a snowflake table. A snowflake has lower cardinality than the dimension key; An outrigger has higher cardinality. In a snow flake, the parent dimension (which is normalized from star to snowflake) will have reference of snow flaked dimension, where as here the outrigger will have the reference of parent dimension. Example, Employee dimension is snowlaked to have EmpDimension and Status Dimension. Here Emp Dimension will have an attribute called Status Dimension SK. In outrigger (example below), the outrigger table AccountOutrigger will have an attribute to store the Emp SK.
Outriggers are meant for improving maintainability, save space or increase flexibility. Outriggers are not same as Snow Flake. Outriggers are created only when there is more than one value for a dimensional attribute where snow flake is created just to take out some dimensional attribute (normalization)
Avoid snow flaking the outrigger as it could result complex, less stable & more difficult to maintain model
SK EmployeeName EmpId Location AccountId AccountType
1 Dave 123 Redmond 9287 Savings
2 Dave 123 Redmond 4235 Checking
3 Jim 124 Redmond 8923 Savings
Here employee information is repeated because employee has more than one account, to reduce the redundancy and to have only one record per employee in employee dimension, create the following structure using outrigger.
SK EmployeeName EmpId Location
1 Dave 123 Redmond
3 Jim 124 Redmond
AccSK EmpSK AccountId AccountType
1 1 9287 Savings
2 1 4235 Checking
3 3 Jim Savings
Conformed (Shared) dimensions
A conformed (or shared) dimension is one that is used with multiple fact tables, or in multiple data marts or in multiple data warehouses. Shared does not mean only one copy is present, it does mean that only one copy is loaded with formal loading process and other copies will get replicated from the loaded copy. A dimension is said to be conformed if it has been used across different schemas such as data marts, data warehouses and thus maintaining the logically same set of dimension attributes. Using conformed dimensions is critical to successful data warehouse design.
Business Entities (also called as dimensions here) can be reused for other facts such as Products entity can be reused for Sales as well as Product costs
Conformed dimension ensures the dimension data is sourced from only one process and any other derived dimensions for data mart purposes would have consistent information.
When there are multiple fact tables or data marts in the system then all these fact tables and data marts should access one logical dimension per one business entity. Example: There should be only one customer dimension the system that is being used by all data marts or fact tables or outrigger tables or other snow flake dimensions.
This dimension can be replicated or horizontally partitioned to suit the requirements of other data marts. Horizontal partitioning of a dimension means that some of the rows are in one copy and some are in another copy. For example, one dimension might have all the East region customers, while another has all the West region customers.
ETL process is to populate only one customer dimension in the system, not once for each data mart or separately for other horizontally/vertically separated dimensions
You should use conformed dimensions whenever possible instead of individual ones. Data mart and Data warehouse will use the same dimension not multiple copies of the same dimension, loading process is only once
Conformed dimensions are to be designed to satisfy all the data mart and data warehouse needs. A dimension can be taken to higher granularity but cannot be taken to lower granularity once fact tables are populated. As the confirmed dimensions are designed to be shared among data marts and data warehouse, it should be designed as lower granularity possible as per source data base schema
A data warehouse must deal with attributes that change values over time. This is not an issue for most transaction systems, because they deal only with the present. But a data warehouse will often present data historically for anywhere from two to ten years. Organization changes, customer attribute changes, and other changes must be managed.
Procedure for identifying changes in a dimension
• Identifying the attributes that would change overtime needs knowledge about the OLTP system. In principle, any attribute except the key can change over time. You need to plan for what to do in case an attribute changes.
• Identifying the attributes whose change history to be maintained in warehouse is a business analysis decision
• Changes are to be well modeled in the data model modal and data mapping sheet
• Granularity and normalization to be properly represented in data modal
• Assess the data type of surrogate keys that would sufficiently allow you to support changing dimensions
• Dimension Size considerations
o Data must be split based on attribute changes and usage
o Data that is used together is kept together to reduce joins
o Splitting a fat dimension into several smaller dimensions can improve performance
o Snowflake schema with its separated dimensions performance slightly better than the star schema with its wide flattened dimensions
• If the rapidly changing attributes are not textual and discrete, but are numeric and continuous, then consider placing them in facts
• The validity of the data warehouse is thus dependent on how well changes in the its dimensions are tracked
Slowly Changing Dimensions Explained
The Slowly Changing Dimension (SCD) should technically be called Slowly Changing Attribute, because any attribute can be handled differently from any other. There are logically three ways of dealing with a slowly changing attribute:
• Type I: The existing record is updated with change. Replacing the data is very appropriate if the old data was erroneous or it is unlikely that the changes will materially affect important analyses. Type 1 slowly changing dimensions do require re-creating any aggregations that would be affected by the change.
• Type II: History is preserved, a new record is added to dimension and old record is marked as inactive. EndDate or IsLatest is generally used to represent latest record of a business key. A new surrogate key is created for every change irrespective of the business key. Type II changes can cause a serious inflation in the number of members of a dimension.
• Type III: Type III preserves a version of the history by adding it to the dimension record in a different attribute. Typically, there are two states: current and previous. In some cases, you may have 4 or 5 states. When the number of defined states is exceeded, previous states are discarded. A Type III change does not create a new row. A Type III change does require extra columns in the table schema. As with a Type I change, a Type III change requires a dimension update, so you need to re-process all aggregations after a change.
Slowly Changing Dimension Guidelines
Consider the following guidelines for addressing slowly changing dimensions:
Avoid Type II for dimensions with a large number of columns; use number of smaller dimensions
If the SCD dimension having a Type II attribute and also has large number of columns then any change in this attribute would result in adding entire row again to the same dimension though there is no change in any of the other columns. Create smaller dimensions as an alternative solution.
Consider loading with correct SK for Late arriving facts
In the case of an inferred dimension member, where you create the member surrogate key before the attributes are available, do not create a new Type II record when the attributes arrive. Instead, load the initial attributes as if they were Type II attributes. For example, RecordX of fact1 is delayed and it is corresponding to customer when the location was L1 (SK = 100). The location of customer is changed to L2 and so the surrogate (SK=101). Now when the RecordX is added to Fact1 then you should consider SK=100 instead of the latest SK (SK=101) of customer dimension
Consider not treating attributes as Type II attributes unless required
The idea behind of Type II attributes is to maintain history for business analysis to track the changes over time. If the changes need not be tracked and unnecessary then consider the attribute as Type I (overwrite the attribute value every time it changes)
Consider size of dimension record when identifying Type II attributes
Having more number of Type II attributes in a single dimension would result in adding the entire dimension record for any change of Type II attribute. If the number of Type II attributes are unavoidable, consider breaking the dimensions
Use a sufficiently large data type for surrogate keys
When estimating the size of a data type for a surrogate key, be sure to include the additional records that will be created from Type II changes.
Example: In a dimension with an estimated 9,000,000 customers, you might be tempted to use a small integer for the key, but one Type II attribute, with only one change per customer would result in 18,000,000 records. To be safe, you may want to use bigint keys for dimensions.
Use TYPE III SCD when you need current and previous values
Type III attribute changes are relatively rare in a data warehouse, but if you do need to report only on the current and previous states, it can simplify reporting compared to a Type II implementation.
Slowly Changing Dimensions – Scenarios and Solutions
Process Type I updates in the correct sequence
With a Type I attribute in the warehouse, it is possible to have a source system that maintains multiple change entries. When applying the changes, be sure to apply them in the correct sequence so that the warehouse maintains the last possible value. For example, the Employee location was New York at 9AM. It was changed to Los Angeles at 1 PM and then changed again to Houston at 4 PM. When extraction is performed at 6 PM, it has all the changes starting from last extraction. When the data is applied to the warehouse, it is critical that the final value for the dimension member be the last change—Houston.
There are two ways to accomplish this
• Batch Aggregation: Hourly aggregate the changes and load it at the end of the day of each hourly data. In such a case if the update occurred multiple times with in an hour then the latest change is propagated to warehouse.
• Loading in Intervals: Consider loading the warehouse hourly from each hourly data extract and reduce the interval further if updates are frequent and need to be captured
Split a dimension table when attributes change at different frequencies
If there is a dimension has that Employee information and Department information, consider splitting into two tables, with a foreign key to the department table in the employee table. If both employee and department dimension are in the same dimension and as they change at different rates, it will lead to huge cardinality particularly if both are Type II in nature.
Use an Rapidly Changing Dimension (RCD) or Outrigger table for Type II attributes that change frequently
The best solution is to treat the attribute (that changes very frequently) as rapidly changing and create a separate Rapidly Changing Dimensions (RCD) for the attribute. Add the surrogate key of this new dimension to fact table to maintain the current status when the record is inserted in the fact. If, however, there is no fact record then it is difficult to know the latest value of this attribute. One solution is to create a separate fact table that has only the dimension and the RCD surrogate key. (In essence, this becomes an outrigger dimension.) .
When Current information is required on a large Type II Dimension table, consider joining on Business Key
Reports that access fact table joins Type II SCD dimensions to get latest instance of the business key. But it results in self join to get latest SK for the business key whose key referenced from the corresponding record in fact. Fact.DimSK = Dim.DimSK and Dim.BusinessKey = SelfDim.BusinessKey and SelfDim.EndDate is NULL (EndDate is typically a column on Type II dimensions to mark the start and end date of the Type II change). The solution is to add business key in the facts, not in composite keys, as an additional field would reduce the self join. This is particularly used for Type II SCD dimension having large volume and reports direct access the underlying relational tables. Typically the join condition would be
Fact.BusinessKey = Dim.BusinessKey and Dim.EndDate is NULL
If using SK key as a means to join the facts and dimensions to get latest information from Type II dimensions would as
Select Current.CustomerName, Measure
Inner Join Customer old on Fact.SK = CustomerSK
inner join Customer Current on Old.CustomerId = Current.CustomerId
If using Business key, the query can be eliminated with a self join as
Select Current.CustoerName, Measure
Inner Join Customer Current on Fact.CustomerId = Current.CustomerId and Customer.EndDate is NULL
This method of joining on the additional business key (instead of dimension surrogate key) is particularly useful when the “current” information is sought in the result set on a large volume Type II dimension table. This method will not eliminate the need of surrogate key in the facts, it only suggests having an additional attribute in fact table as dimensional business key for improved performance queries.
Use a Full Outer Join to identify Inserts, Updates, and Deletes
While loading warehouse, you need to check for the existence of the new record in the warehouse. If record does exist, then you have a Type I or Type II effect depending on the attribute that is changed. If the record does not exist, you must add one. In addition, you have the case where the record exists in the warehouse, but is not in the new extract. This signals a logical deletion. Comparing the existence of the values in both the new and old data sets can be slow, particularly when the tables are large.
A good solution is to use a Full Outer Join between the new table and the existing warehouse dimension table, using the business key for the join. You can then determine Updates, Inserts, and Deletes in one query.
SourceTable Full Outer Join Dim On SourceTable.BusinessKey = Dim.BusinessKey and Dim.EndDate is NULL
If the resulting dimension Business Key is NULL then it means the corresponding record does not exist in warehouse or that it has expired. In this case a new record is to be inserted in the warehouse.
If the resulting source Business Key is NULL, then it means the corresponding record has been deleted and should be flagged with an EndDate in the warehouse table.
If neither Business Key is NULL then it means the corresponding record exists in warehouse and is active. In this case the appropriate Type I or Type II processing can be performed.
Using the Full Outer Join approach combined with a Hash value for attributes, you can optimally find out the Type I and Type II attribute changes, as given below
• Step 1: Identify Inserts, Updates, Deletes using Full Outer Join
• Step 2: For Inserts, directly insert into dimensions
• Step 3: For Deletes, update dimensions for marking as inactive
• Step 4: For Updates, compare the Type I and Type II hash values to determine subset of records to have Type I effect and subset of records to have Type II effect
Rapidly Changing Dimension
A dimension is considered to be a rapidly changing dimension (RCD) if one or more of its attributes changes frequently in many rows. Consider the following guidelines if you are dealing with RCD.
Determine if an attribute is Rapidly Changing
When an attribute where you need to maintain history (a Type II attribute) changes frequently, you can end up with a huge inflation of rows in the dimension table. To avoid inflating the dimension table, you can move the attribute to a separate dimension, called a Rapidly Changing Dimension (RCD). The threshold for when you would move an attribute to an RCD is based on the size of the dimension, the number of changing attributes, and the frequency of change. As a broad rule of thumb, if an attribute would cause a dimension to double in size more than once per year, you may want to move it to an RCD. The cut-off limit will vary depending on your resources and business requirements.,
Slow down the rate of change by using Tracking bands
The number of changes the Rapidly Changed Attribute can be reduced if a range is
defined instead of precise value of the attribute, these ranges also termed as Tracking Bands. This would reduce the number of changes in RCD unless it crosses the range values. As the range it belongs to will not change as frequently as the value, it will not be considered as RCD. This approach would eliminate the need of handling RCD in another set of dimensions. Defining the range is not feasible for every RCD but can be implemented wherever “range” can be implemented and business requirements are met with the range definition. Example, If business requirements does not demand actual age of employee, and if age is considered to be rapidly changing attribute, then consider defining the range for the age. The range include [10-20], [20-30] etc., So, even if employee age changes from 25 to 26, the range he/she belongs does not change.
EmpId EmpName Age
1098 John 28
EmpSK EmpId EmpName AgeRange
342 1098 John 20
Determine if Type II change is needed for Rapidly Changing attributes
Determine if having history maintained for rapidly changing attributes is important for business reports/analysis needed. If it is not required, a type I change on such attributes would not create any cardinality or performance problems.
Use static time components rather than changing attributes
Rather than use Age as an attribute, you can store the Birth Date. Birth Date will not change, but can be used to calculated the exact age as needed.
Move Rapidly Changing Attributes to another dimension
When you have a rapidly changing attribute, you avoid inflating the dimension table by moving the attribute to its own dimension, with its own The solution for a dimension with rapidly changing attributes is to break the offending attributes out of the dimension and create one or more new dimensions. Consider the following example.
Consider a customer dimension having status attribute which is identified as rapidly changing attribute and retaining history is necessary. As a solution, create a CustomerStatusDimension and add the foreign key for that dimension to each row of the fact table as the fact data is loaded. Rather than add the CustomerStatusDimension to the main fact table, you can also create a separate fact table including the Date, the CustomerID, and the the CustomerStatusDimensionID. This solution would also allows end users to analyze the history of status changes.
Create Junk dimension to store multiple RCD attributes
If you create a large number of rapidly changing attributes and create a dimension for each, the number of dimensions can become very large, and the fact table can end up with a large number of foreign keys. One approach is to combine several of these mini-dimensions into a single physical dimension. This approach works because each RCD typically has a very low cardinality. This is the same technique used to create what is often called a junk dimension—a dimension that contains unrelated attributes to get them out of the fact table. A junk dimension in the fact table reduces the number of extra foreign keys in the fact table, but it is still necessary to use the fact table to resolve the relationship between the customer and the attributes.
To keep the RCD attributes completely out of the main fact table, create a separate fact table with a Date, the CustomerID, and the RCD composite key.
Modeling Date and Time dimensions
There will be only one Date dimension and one Time dimension across all facts and dimension tables. Not all date and time fields needed to be mapped to date and time dimensions. Map a date-time field to a Date or Time dimension only when you need the extra attributes from the dimension.
Typically the lowest granularity of the Date dimension is Day, and the lowest granularity of the Time dimension is Second. Many warehouses do not need a Time dimension, but a Date dimension is used in virtually every data warehouse system.
In general, create separate dimensions for Date and Time
If it is necessary to extract contiguous chunks of time that cross day boundaries (for example 11/24/2000 10 p.m. to 11/25/2000 6 a.m.), then it is easier if the hour and day are in the same dimension. Otherwise, separate dimensions for Date and Time are easier to manage and to query.
It is easier to analyze cyclical and recurring daily events if Date and Time are separate dimensions.
Number of meetings active at 9AM this week. This is an easy query if Date and Time are separate; it is much more complex if Date and Time are combined into a single dimension.
Following are issues of having date and time dimensions together in one DateTime dimension
Combining Date and Time into a single dimension creates an extremely large dimension, particularly if you use Seconds as the grain. The problem is not as great if you use Hour as the grain of the combined table
• Cardinality of Date and Time Dimensions (separated) for 10 years of data
Date Dimension: 10 * 365 = 3650
Time Dimension (granularity: seconds): 24 * 60 * 60 = 86400
• Cardinality of Date and Time Dimensions (combined) for 10 years of data
DateTime Dimensions (granularity Hourly): 10 * 365 * 24 = 87600
DateTime Dimensions (granularity Seconds): 10 * 365 * 24 * 60 * 60 = 315,360,000
The more records in the DateTime dimension, the slower query performance will be.
Having both date and time dimension in the same dimension can cause interpretation problems for fact tables that have granularity of a Day. It would be easy to inadvertently enter two records into the fact table for the same day, because of the extra granularity in the dimension table.
When do you need Time information in the Fact Table?
• Business analysis needs time granularity for roll up or for measures/aggregations based on transaction date time
• When the business analysis has requirement such as number of transactions happened between 3 PM and 4 PM on 1/1/2006
• If no date time can be extracted related to a transaction date time, alternately Synch Date Time is used in presentation. Example: There is a change of employee location in one of the employee master tables at source but source didn’t capture the date/time of the change. To have Type II affect on employee dimension, synch date/time is used as a representation of location change occurred date/time, it is opted only in case of date/time information is needed for ware house dimension and the same is not present in source tables.
• When source transaction has date and time information and fact is loaded with no aggregations
• Incase of date time information is not present, synch date time is used as a record changing/event occurred time
Store Date keys in the Fact Table
Consider using a meaningful surrogate key for the Date dimension as an integer with the structure YYYYMMDD. This makes it easy for ad hoc users and analysts to understand the underlying date. You cannot use this method if both date and time are stored in one dimension, because the structured key would be too long.
Typically a surrogate key should not contain any business value and should not be used for any specific business rules. Dates are an exception for ease of use where date dimension surrogate key format is suggested to have YYYYMMDD having data type as INT(4)
Similarly for time dimension, the suggested format is HHMMSS
Time "4:23:47 PM" would have a surrogate key as 162347
Though it is readable avoid doing any calculation using the surrogate keys directly without mapping to the corresponding date/time dimension.
A date dimension with one record per day will suffice if users do not need time granularity finer than a single day
If the business analysis or reports does not need exact time of the day the transaction occurred, then the Time dimension is not needed for this fact.
A separate Time dimension table should be constructed if a fine time granularity, such as minute or second, is needed
Consider mapping Time dimension key to the facts only when the business requirement demands fine granularity of the facts. Having the granularity to the seconds may significantly increase the data volume in the warehouse. If Time granularity is not required, you can pre-aggregate the fact table to the Day grain.
If exact event time is needed in the warehouse, the source system must provide data at the necessary grain
Business needs such as the number of books purchased between 3 and 4 PM of a day then the extracted transaction information from source transaction tables should have granular information at least up to the hourly. Business analysis requirement demands are met but limited to with what is available in source systems.
Guidelines on when to add timekey information to facts/dimensions
Just because there is a time-related field is available to put into the fact table, that doesn’t mean it needs to be there. The following guidelines would help to decide whether to add timekey information to dimensions
• Do not consider mapping to time dimension if there is need to aggregate based on Time dimension attributes. For example, a business requirement to group the report by hour would require the Hour attribute from the Time dimension.
• If there is no granularity to the hour, minute, or second level in the fact table, do not consider mapping time dimension
Summarizing data for a range of days requires joining only the date dimension table to the fact table
Even if fact table rows have a grain of seconds, you do not need to join to the Time dimension if you only need to aggregate at the day or higher grain.
For example, suppose that the fact table contains the detailed information that Book1 sold at 5 hours 30 minutes 28 seconds on 2/1/2006
When these transactions are aggregated at the end of the week to the day level, it would take away the time information but keep only the day information. In this case, there is no need to join to the fact table.
Number of books sold on 2/1/2006
Add convenient attributes to Date and Time dimensions
It is often convenient to maintain attribute columns in a date dim and time dim to provide additional convenience to business users such as calculating whether a transaction happened on a holiday, a specific day in the financial cycle, at a peak time (for examlle from 10 AM to 2 PM). You could even store aggregated values—such as the total number of holidays in the month—in each row of the Date dimension.
With multiple time zones, store local and GMT times.
• If data comes from multiple time zones, adjust all times to a standard time sone, such as GMT. You can also store an additional attribute for how to handle a specific Daylight Savings time calculation. With multiple time zones, consider using GMT time in the dimension, and storing the time zone delta in a fact table column. In this case, the time zone shift is similar in function to a Rapidly Changing Dimension.
• Special care needs to be given when reporting multi-zone times as the date can easily be misrepresented if the wrong Time attribute (GMT or local) is used.
• Take particular care when dealing with day aggregations when multiple time zones are involved. Does the day begin at local time, or at a world-wide standard time?
Fact tables contain data that describes specific events within a business, such as bank transactions or product sales
Types of Fact tables
There are basically two ways transaction-based fact table records are sourced, plus another way for non-transaction counts.
• Transaction table: This is the transactions as it happens on the source. Each transaction and its measure such as price is extracted, transformed and loaded to facts at regular intervals of time. Ideally, if there are 100 transactions happened on a given day on the source and if fact is designed at the same level of granularity, in transaction fact table you would see 100 records added once ETL process finishes loading.
• Snapshot table: This is not exactly a transaction on the source. It is generally to get the snapshot of the system. Example: Getting all customer account balance as of end of the day. Even when there is no transaction on any particular account, the snapshot fact table would gather all the current account balance information and keep it in the fact table to exactly know what the account balance of a given account is on a given date.
• Factless Fact: This fact generally is not sourced from an OLTP transaction. Occasionally you need only to capture whether an event happened. Example: You might want to capture employee attendance, you only need to know which employees came to work and which employees were absent. Absent employees will not come from attendance sheet. This is prepared separately with the help of employee table and attended employees to load into factless fact. Factless facts, as the name says, will not have any measures. Factless facts are often used with counts, and particularly distinct counts. For example, how many distinct employees worked during a given week.
Ideal fact measures are additive in nature but it is not a requirement that a fact table have numeric measures. For example, consider list of computers added and deleted from an environment. The fact table may capture the computers that are added and deleted. There is no measure here but count of computers becomes the measure. There is no measure stored in the fact table for this kind of transaction..
Identifying Facts and Measures
• Attributes of fact table are in general containing surrogate keys of the respective dimension tables
• The OLTP transactions or OLTP snapshot, are the source of fact records and would decide what business keys would make up to OLTP transaction. During the load process, these business keys are mapped to the business key of the respective dimensions to extract the corresponding surrogate keys.
• Numeric values that resulted from an OLTP transaction typically end up as measures in the fact table. One fact table can have more than one measure, including custom calculated measures based on measures derived from the OLTP system..
• OLTP transactions are almost always appended. Existing records are not updated.
• Calculated measures that include only values from a single row may be pre computed during the load process and stored in the fact table, or they may be computed on the fly as they are used. Determination of which measures should be pre computed is a design consideration. There are other considerations in addition to the usual tradeoff between storage space and computational time.
o All the facts have a specific granularity
o A set of foreign keys constitutes a logical concatenated key
o The fact table typically contains the largest volume of data
• The needs of the analyst must be supported by the facts in the fact table, there must be measures which have relevance to the business goals which the organization seeks to fulfill
• FACTS are measurable
• Facts are by nature dynamic and variable over time
• Fact tables have an unlimited number of rows—as opposed to dimension tables, which have a finite cardinality, even if very large.
• Fact data represents the measurable results of a business event, or a business state and Dimensional data provides the context for that event or state.
• Fact events vary depending on the business
• It is important to understand the basic business process and identify the business events and states when developing a dimensional data warehouse.
• Fact tables should not contain descriptive information or any data other than the numerical measurement fields and the foreign key pointers to the corresponding entries in the dimension tables.
• Identifying Measures. The most useful measures to include in a fact table are numbers that are additive.
o Additive measures. allow summary information to be obtained by adding various quantities of the measure, such as the sales of a specific item at a group of stores for a particular time period.
o Semi-additive measure: Is additive across all dimensions except time. The classic example of a semi-additive fact is a daily account balance. It's perfectly legitimate to add the balances of every account on a specific day—the fact is additive across the Account dimension—but the end-of-month balance of my savings account is not the sum of the balances for the past month—the fact is non-additive across time. When aggregating across time you need a special function such as the last non-empty value, or the first non-empty value, or the average. if you include semi-additive facts in a dimensional database, you should ideally use a front end tool that manage the appropriate semi-additive aggregation. At the very least, the front end tool should warn users if they attempt to summarize across a semi-additive dimension.
o Non-additive measures: Non additive measures are those which cannot be summarized like additive/semi additive measures. In general, you want to convert non-additive measure to additive measures. You can often do that by adding a weighting factor. For example, rather than store a rate, you can multiply the rate by a weight. After summing the weighted rate and also the weights, you can easily calculate a weighted average rate for any level of summarization.
Each fact table also includes a multipart index that contains as foreign keys the primary keys of related dimension tables, which contain the attributes that describe the fact records.
Granularity is defined as the lowest level of detail of the data in the fact tables particularly the measures. The more granular the data, the more detailed your reports and analysis can be. However, excess granularity consumes space and increases complexity. It is a trade off because once you aggregate a fact table, you cannot decompose the aggregates.
There are three types of grain: Transaction, Periodic Snapshots and Accumulating Snapshots.
• Transactions - This is the basic detail record of the warehouse as it happened in the OLTP systems. Example: Customer purchases a book or money transfer from savings to checking. The Time grain of a transaction is essentially continuous.
• Periodic Snapshots - This is a set of detail records that are repeated over time. It is to get number of customers existing, inventory of the system etc., at the end of the day. The grain of a snapshot is the interval at which the snapshot is taken—whether hourly, daily, weekly, monthly, or yearly.
• Accumulating Snapshot - This is a snapshot that can change over time. The common example is the student enrollment example. The student's data doesn't change but the dates for enrollment, admission, etc.
Fact tables Guidelines
Consider the following guidelines when designing your Fact tables:
Aggregating Data in Fact Tables
Data is generally aggregated to store the summarized information over a period of time of the fact measures such as the number of units of a product sold in a given month. The source for these aggregations is the detailed granular fact which collects transactional information from source at regular intervals of time. In some data warehouses, the aggregation is performed directly by the ETL process to move it into an aggregated fact table where the detailed transactional records do not have any value for the report users. For example, executives may not need to see the number of units sold at 6th hour of a day. Consider the following guidelines when designing your Fact tables that contain aggregate data:
• Aggregating data in the fact table should only be done after considering the consequences. Aggregation is the process of calculating summary data from detail records. It is often tempting to reduce the size of fact tables by aggregating data into summary records when the fact table is created. Detailed information is no longer directly available to the analyst. If detailed information is needed, the detail rows that were summarized will have to be identified and located, possibly in the source system that provided the data. Fact table data should be maintained at the finest granularity feasible.
• Avoid aggregation in warehouse tables if you will use an OLAP cube as the source for reports and analysis. In this case, the cube provides the aggregation functionality, but if you need to get to details it still available in the fact table.
• If the reports query the underlying relational structure then aggregation will make queries faster
• It would make sense to archive only aggregated information rather than all the warehouse records.
• To enhance relational queries, you may want to store the summary records in separate tables
• Mixing aggregated and detailed data in the fact table can cause issues and complications when using the data warehouse. For example, a sales order often contains several line items and may contain a discount, tax, or shipping cost that is applied to the order total instead of individual line items, yet the quantities and item identification are recorded at the line item level. Summarization queries become more complex in this situation. If you are using a tool such as Analysis Services, leave the aggregation to the tool
Determine if a source numeric field is to be measure or an attribute
Sometimes during the design process, it is unclear whether a numeric data field from a production data source is a measured fact or an attribute. Generally, if the numeric data field is a measurement that changes each time we sample it, it is a fact. If it is a discretely valued description of something that is more or less constant, it is a dimension attribute. One way of making the distinction is whether it is a value you would want to use as a grouping item for a report. If it is, then it is an attribute. Often, an aggregatable measure also appears as an attribute by bucketing the values. For example, order quantity is typically a measure, but Order Size (i.e., Order Quantity bucketed into 5 discrete buckets) is an attribute.
Storage is cheap and time is expensive:
This is true in most of the data warehouse processes. This means that you should store measures needed for reports as part of the fact measures. Ideally, all the measures would directly be coming from source transactions but any calculations on these measures can be computed and stored in another measure attribute of the fact. This reduces the report query time as there is no more computation involved. If reports directly query relational tables, create aggregation tables with pre-computed totals to enable reports to query at different grains with no computation during the reporting query.
Retain all measures coming from source in your fact
Have the original details of the measurements which are used for calculating the additive/non additive fields, can be used for later aggregations. Example:
If you keep all the measurements that are available from source in the fact table, any possible custom measurements later on the fact table can be done without changing the design of the fact table. This ensures if one calculation is feasible to extract from source system then the warehouse system provides the same feasibility.
EmployeeId Date StartTime EndTime
34 1/1/2006 9 17
21 1/2/2006 12 19
56 1/1/2006 13 20.5
EmployeeKey DateKey NumberOfHours
11 45 8
12 46 7
13 45 7.5
Here in the above fact the original measures such as StartTime and EndTime are not included, because of this the measures such as number of employee start working on 1/1/2006 before 10AM cannot be answered though it is possible to deduct the number from the source schema.
Here the guideline is when calculating custom measures on fact, retain the source measures as part of the fact.
Keeping the source measures also makes it possible to show an audit trail of how the value was calculated.
If the granularity of two facts tables is the same, consider combining them
As a general rule when you start your dimensional modeling each business function should have its own fact table. As your dimensional model evolves try combining the fact tables where the granularity is the same between the fact tables. Combing the facts makes analysis easier and eliminates the need for unnecessary joins.
ProductKey DateKey Inventory
23 44 289
43 44 435
65 44 658
ProductKey DateKey Sold
23 44 32
43 44 45
65 44 59
Both InventoryFact and InvoiceFact both have same granularity (Product – Daily) and represent single business entity products SKU number. Even if there is nothing sold for a product on a given day, the record can be added to the combined fact as inventory fact is a snapshot fact, not based on whether the product is sold today or not, with corresponding sold measure as 0. The cardinality of the combined fact would be same as Inventory Fact.
The combined fact would look like the following
ProductKey DateKey Inventory Sold
23 44 289 32
43 44 435 45
65 44 658 59
Combining these two fact tables into one also enables calculated measures such as percentage of SKUs are sold today, that would need to use measurements from both the facts, without this combination it is going to be cross fact and would lead to performance problems.
Use indexed views to improve the performance of relational queries
The performance of queries against the relational data warehouse can be significantly improved when you use indexed views. Indexed view is a pre-computed table comprising aggregated or joined data from fact and possibly dimension tables. A indexed view is one way of implementing particulary when the requested data set is summary values that were not pre computed or aggregated
Fact table Index Guidelines
• Consider having a composite primary key for your Fact table. Consider composite key (typically composite of all related dimension surrogate keys) as the primary key for Fact tables. Conversely, you can add an Identity column as a primary key in the fact table.
• Some dimensional modeling will have business keys also being part of facts for enabling report performance. These business keys do not need to be part of the composite key of the fact as their corresponding dimension SKs are declared in composite key. There can be non clustered index to cover these business keys for better reports performance.
• Business keys which are used by ad hoc queries should be part of non clustered index. Non-clustered indexes will the increase efficiency and responsiveness of ad-hoc queries. In the absence non key attributes covered by an index, would result in a clustered index scan (provided clustered index is created & available) which is expensive
Avoid updating fact records
Fact table represents the transactional record that was created by source business processes. There should not be any record in the fact table that does not originate in a business transaction at the source. At the same time, any change of the value of a measure on the source system should result a new record in the fact, never an update of the fact table. If you use an OLAP tool, any update of fact table records (apart from end date if exists) would require refreshing the cube. Refreshing the cube having huge data volume may take longer time and may result in down time of the system.
Keep the data types of the fact table columns as short as possible
Fact records are already huge in size; for ex. use tiny int where possible instead of INT. It is suggested to adopt the same data type for measurements that source system has. If you create aggregate tables, be sure to change the data types to be large enough to handle the maximum aggregated values.
Fact loading should not create records that are not part of OLTP system
Capture only transactional records from OLTP. Avoid having the ETL process add new records that were not exactly sourced from OLTP systems. Example, if Customer1 purchased Book1 then fact table should have only one entry for this transaction. Adding records such as Customer1 did not purchase Book2, Customer1 did not purchase Book3, will not add any value and should be avoided.
Fact records are considered as event records – i.e. transactions processed on the OLTP. No fact records should be created by the ETL, it is just a transformation.
Business needs must determine the minimum granularity of the fact table
Fact table granularity comes from business needs, you can’t have further granularity than OLTP records. It is driven by the business needs. Example: – if the business requires to track only aggregated values for the entire day instead of hourly basis, then use day aggregate as the base granularity
Ensure fact table measures have relevance to the business goals
Though it is possible to create multiple measurements from the source measures, the fact measures should represent business requirement. Create only those measures that are required for the reports/analytic needs of the business users.
Ex: If Percentage sold is not a requirement for business people, do not create this as a measurement just because it is possible to create from the available data
It is suggested to try creating additive measures and avoid having non additive measures in the fact tables.
The measures of the fact table are numeric in nature and additive. But some measurements are non additive and semi additive. Semi additive measures are those which cannot be summed up across all the products but can be summed of the same product across the year. Non additive measurements are such as ratio, percentage etc., but all these measurements are to be numeric in nature. For non additive and semi additive, custom aggregations are created but for this original source additive measurements are to be part of fact table to calculate custom aggregations.
Aggregation to higher grain
Aggregations of the facts can be performed either going to higher grain within the dimension, or by removing a dimension. If going to higher grain, you may need to create a shrunken dimension table in the dimension or split out the higher grain attributes into a snowflaked table. As an example of aggregations on non-hierarchy attributes, a fact table may track the products with color, location so the dimension has a surrogate key on the products which has two different records for two different colors for the same location and same product. But when the fact table is aggregated to remove the color hierarchy, then there should be another dimension with only one record for product and location (with no color). As an example of aggregations on hierarchy attributes, consider a dimension having city, state and country. If the aggregation of the fact table is performed to the state level then this dimension cannot be used as the granular is higher. So, use a shrunken dimension or a showflaked dimension table at the appropriate grain.
Determine when and where calculated measures are pre-computed.
Calculated measures are those measures which are not directly available from source but computed from the measures coming from source.
Example: Number Sold is the source measure. Percentage of the items sold per city is the measure calculated based on number of items sold in the city to the total items old in the given day.
Calculated measures can be created in the facts while loading enabling the reports which query the relational data warehouse table directly.
If you are using an OLAP tool such as Analysis Services, it may be easier to do calculations that involve aggregated values (such as percent of total) in the logical cube design where both detailed and aggregated values are accessible.
Keys and Indexes
This section describes guidelines for creating keys and indexes on different objects of physical warehouse schema.
Types of Keys and Indexes
Consider the following types of keys:
• Surrogate Key: A key to represent uniqueness in dimensions and acts as a method of being referenced from fact tables. It generally an integer data type and get incremented for each key, to act as Identity column of the dimension. Details are surrogate key is explained in the earlier sections of the chapter.
• Natural Key: Also termed as Business key. This is the primary key or the unique key on the source system to represent the identification of entity uniquely. Typically this is also part of physical implementation of warehouse schema participating in dimension structure but will not be as a key. Natural key is just another attribute of dimension structure.
Consider the following types of Indexes:
• Clustered: Need of having indexes on the fact tables and dimensions are very crucial in warehouse and should be optimized for data access queries. Typically clustered index on dimensions is created on surrogate key and surrogate keys used in the fact will be a composite key for fact tables.
• Non-Clustered: There is a requirement of creating non clustered indexes on measures and on other widely accessed columns of the fact table.
• Covering Index: Covering index will contain all the columns of the table, in this context it is for fact table. The idea behind of creating covering index is boosting performance for any queries need not be part of any other index At the cost of index size.
Key and Index Guidelines
Consider the following Key and Index guidelines for your dimensional model:
• Avoid using GUIDs as keys. GUIDs may be used in data from distributed source systems, but they are difficult to use as table keys. GUIDs use a significant amount of storage (16 bytes each), cannot be efficiently sorted, and are difficult for humans to read. Indexes on GUID columns may be relatively slower than indexes on integer keys because GUIDs are four times larger. GUIDs are not generally accessed in the query and so there is no need of creating indexes on GUIDs
• Have non clustered indexes created on the business keys if they are part of facts
• Have clustered index created on dimension surrogate keys in facts (or create clustered index on Identity column and use non-clustered for SK FK’s)
• There is no need to create separate indexes on measurements as they are accessed via the surrogate keys or business keys
• For each dimension table, add a clustered index on the surrogate keys (primary key)
• Make sure to use the index in the same order it is created. For example, if an index is created on columns F1, F2 in the order, try having your query filters out first on F1 and then F2
• Analyze the query execution plan for specific report queries and sample ad hoc queries before optimizing the usage
• Keep index keys smaller, do not include all the columns of the fact in the covering index
• Use Index Tuning Wizard to tune your indexes based on query load
• Uses Indexed views to improve query performance that access data through views