Online Analytical Processing (OLAP)
Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are confidential to and maintained as a trade secret by Microsoft Corporation. Information in these materials is restricted to Microsoft authorized recipients only. Any use, distribution or public discussion of, and any feedback to, these materials is subject to the terms of the attached license. By providing any feedback on these materials to Microsoft, you agree to the terms of that license.
Microsoft Corporation Technical Documentation License Agreement (Standard)
READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE TERMS.
1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and Microsoft agree as follows:
(a) If You are an authorized representative of the corporation or other entity designated below ("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have authority to act on behalf of Company and agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the Confidential Information accordingly;
(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that You will treat the Confidential Information accordingly; or
(c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing the Confidential Information to any third party for five (5) years from the date of disclosure of the Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the Confidential Information; and (c) to take reasonable security precautions, at least as great as the precautions it takes to protect its own confidential information, but no less than reasonable care, to keep confidential the Confidential Information. You/Company, however, may disclose Confidential Information in accordance with a judicial or other governmental order, provided You/Company either (i) gives Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or governmental entity that it will afford the Confidential Information the highest level of protection afforded under applicable law or regulation. Confidential Information shall not include any information, however designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such information to You/Company pursuant to the terms of this Agreement; (iii) became known to You/Company from a source other than Microsoft other than by the breach of an obligation of confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible form relating to and/or including released or unreleased Microsoft software or hardware products, the marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and information received from others that Microsoft is obligated to treat as confidential.
2. You may review these Materials only (a) as a reference to assist You in planning and designing Your product, service or technology ("Product") to interface with a Microsoft Product as described in these Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone else.
3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.
4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You changes, Your right to use these Materials automatically terminates and You must destroy them.
5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback") relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason to believe is subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with any third party.
6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by Your NDA.
7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be brought in the federal or state superior courts located in King County, Washington, and You waive any defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft concerning these Materials; it may be changed only by a written document signed by both You and Microsoft.
Business Intelligence (BI) often involves tools that enhance the speed of reporting queries and the meaningfulness of data and relationships. The Analysis Services component of SQL Server is a tool that facilitates BI solutions. When you work with BI tools, the terminology can sometimes be confusing.
OLAP stands for On-Line Analytical Processing. In general, OLAP refers to specialized data structures that enhance query performance in a multi-dimensional space. You can think of OLAP as a disk-based cache of pre-calculated aggregates. Different tools use different OLAP caching architectures. Analysis Services uses a method that is particularly well suited to handling additive aggregations over huge volumes of fact data.
The Unified Dimension Model (UDM) is the Analysis Services term for a logical model that client tools can query. Different vendors use different terms for the modeling layer. The UDM models both relational as well as OLAP structures. It includes the definitions of dimensions, measures, and calculations that are visible to end users. In addition, the UDM includes infrastructure definitions pertaining to how the data should be store and cached. Because a UDM can model both OLAP and dimensional sources, you can use the UDM as a single source for all reporting queries, regardless of how the data is stored. All queries directed to the UDM, however, whether relational or OLAP, use MDX as the query language, not SQL. One of the main points of the UDM is that after you create it, business users can create their own powerful reports without requiring the assistance of a data expert.
The UDM creates a “unified” view of both OLAP and relational worlds that is available to all, which avoids data integrity and consistency issues.
A Cube is a collection of measures and their associated dimensions. The word cube is only loosely related to a geometrical cube. A cube can have multiple dimensions (not just three), and can even consist of components (called “measure groups” in Analysis Services) with differing dimension counts.
A Data Source View (DSV) is a component of Analysis Services that is often confused with the UDM. The DSV is a component of the UDM. It tells the UDM how the relational data is stored and provides means to access the relational data.
In this chapter, you will learn principles and guidelines to help you work with OLAP tools such as Analysis Services. A good OLAP design rests on the foundation of a good Dimensional Model. This chapter focuses on concepts that relate to effective OLAP design, particularly as it relates to Analysis Services.
This is the architecture diagram of UDM interacting with relational structure and providing required business information to the client tools.
Dimension tables and Fact tables are the relational tables in dimensional modeling, guidelines of which are explained in the separate chapter.
DSV (Data Source Views) is a meta data layer that would define the relation ships between the relational tables as required for UDM (Unified Dimensional Model). DSV can still be connected to transaction sources, explained in later sections.
UDM is seldom treated as Analysis cube which is useful for storing/calculating aggregations and custom measures/formulas. UDM gives ability for end users to browse the data in a multi dimensional view. UDM can store data in a number of structures either a separate storage or with relational structure. Any of these storage methods can be chosen based on query perf and disk usage. Guidelines related to UDM are explained in later sections
Need for OLAP
There are three primary benefits you can obtain by using an OLAP-based analytical tool such as Analysis Services:
• Fast query performance with large data sets,
• Spreadsheet-like formulas for complex calculations
• Presentation layer flexibility because of server-based metadata
Fast Query Performance
Business Intelligence (BI) reporting typically shows highly summarized information. When there are millions—or hundreds of millions—of rows in the data source, creating the desired aggregations can take far too long to be suitable for interactive reporting. Using purely relational data sources, it is possible to create fast queries by creating aggregation tables, but doing so requires several steps. Analysis Services facilitates or automates each of these steps:
1. Design the aggregate tables. In principle, you could create a separate aggregate table for every combination of attributes that apply to a fact table. As a very simple example, imagine a fact table associated with two dimensions (Date and product), each of which has three groupable attributes (Year, Quarter, and Month for the Date dimension and Category, Subcategory, and Product for the Product dimension). Counting the All level for each dimension, from this fact table, you could create as many as 9 possible aggregate tables: All Time by All Product, All Time by Category, Year by All Product, Year by Category, and so forth. With a realistic data scenario, the number of theoretically possible aggregate tables is overwhelming. The first step is deciding which of those aggregate tables you should create. Analysis Services provides an aggregation design wizard that sifts through the thousands of possibilities and determines the best aggregations to use. A later section will discuss the process of designing aggregations in detail.
2. Create the aggregate tables. For each aggregate table you decide to use, you must create the table along with the appropriate indexes for accessing the data quickly. Analysis Services completely automates this step.
3. Load the aggregate tables. For each aggregate table, you need a process to load the appropriately aggregated data from the fact table into the aggregate table. Analysis Services completely automates this step.
4. Update the aggregate tables when the data changes. When the data in the fact table changes—for example, when new rows are added to the fact table—you need to synchronize all the aggregate tables to match. Likewise, if you have any Type I dimension attributes, if the value of any Type I attribute changes, you need to reload any aggregate tables that include that attribute. Analysis Services manages this process for you. Factors involved in configuring the update most efficiently for Type I and Type II attribute changes will be discussed in a later section.
5. Direct reports to the appropriate aggregate table. Aggregate tables don’t help query performance unless reports access them. With purely relational aggregate tables, you must design reports to use the most highly-aggregated table possible for that report. If you have hundreds of aggregate tables, this can be a very difficult process. Analysis Services makes this process completely transparent.
The concept of creating aggregate tables to improve the performance of queries against a large data source is not new. Analysis Services simply takes the same concept and makes it much easier to manage.
Designing appropriate aggregations is an extremely complex and sophisticated process, particularly when you have dozens of dimensions and hundreds of attributes to consider. The Analysis Services aggregation design wizard is an extremely powerful tool to help with the aggregation design. The rest of the steps are essentially administrative: for any one table, the steps are not that difficult, but when you have hundreds of aggregations, managing all of the administrative tasks is a major undertaking. Even aside from the relative performance of a specific relational aggregation table compared to an OLAP aggregation, the ability to manage the entire set of aggregations efficiently is a critical benefit of Analysis Services. The benefit increases even more if you need to revise your aggregation design periodically.
Much business analysis takes place in spreadsheets. Spreadsheets are powerful tools for data analysis because they allow users to create very sophisticated formulas by simply referencing the cell that contains the desired value. As a consequence, in most organizations, much of the business logic is stored in spreadsheets, where it is very easy to have inconsistencies between views, or errors in formulas. Creating the equivalent formulas on the server by using purely relational tools (such as SQL or even CLR-based stored procedures) is extremely difficult. An OLAP tool such as Analysis Services that supports the MDX expression language allows you to create complex calculations in a way that is very analogous to a spreadsheet. In an OLAP cube, you can address any point in the data space—detail values as well as the aggregated total at any level—by simply referencing the cell. This allows business rules to be defined, debugged and stored on a central server, so all reports show the same value. Following are examples of complex calculations that can be moved to the server by using OLAP.
• Percent of Total (or of Group) Calculating the percent this product comprises of the total requires both a detail value and a total value. Using SQL, this requires a temporary table or a separate sub query. By using OLAP, you simply reference to total. The problem is more complex when you want the percent each product comprises of its Product Category, but the OLAP formula is equally simple.
• Net Profit (and growth from last year) Calculating Net Profit involves adding some values (such as Revenue) and subtracting other values (such as Expenses). It also often involves applying rates (such as Tax Rates). This is sophisticated business logic that OLAP allows you to move to the server. Once you have a cell that contains the Net Profit, you can also reference that value for the same period of the previous year to calculate year-over-year growth.
• Revenue per Employee (vs. Company Average) Calculating the Revenue per Employee typically involves retrieving the Revenue value from one fact table aggregation and dividing it by the Employee Count from a completely different fact table. OLAP allows a seamless integration of the two fact tables into one formula. Comparing this number to Average across the entire company is another complex calculation that can be easily defined in an MDX expression.
Some calculations qualify as Key Performance Indicators. A Key Performance Indicator (KPI) is a metric specifically designed to identify whether an organizations performance is on track. One of the primary benefits of a BI system is the ability to create and display useful KPIs. An effective KPI typically involves very sophisticated calculations. OLAP provides an effective mechanism for defining KPI formulas.
Even with OLAP, certain complex calculations can take time to calculate. (A later section will discuss how to deal with unavoidably slow calculations.) But OLAP allows you to define the formulas by using a spreadsheet-like syntax that is much more appropriate to complex calculations than pure relational tools.
Presentation Layer Flexibility
A relational database, even a well-designed star-schema data warehouse, does not include information about how the entities and attributes are related. For example, in a Product dimension table, there may be columns for Color, Size, Weight, Category, Subcategory, and Group. There is nothing in the list of columns that indicates that Category, Subcategory and Group can be logically organized into a hierarchical drill-down path, while Color, Size, and Weight cannot. You can certainly write a custom application that would know when drill-down is possible and when it is not, but you would have to store information in the client tool that describes the data. By adding an OLAP metadata layer, you can record information about the data relationships in the metadata layer. This allows you to use off-the-shelf client tools to access the data. For example, when using Analysis Services, you can use any number of client tools—including Microsoft Excel, Proclarity, Panorama Nova view, Cognos PowerPlay as a client, without any customization. If you switch from one client tool to another, the data, relationships, and calculations all appear consistently in the new tool. If you make changes to the cube structure, the client tool automatically recognizes the new definition.
Alternatives to OLAP
OLAP is a powerful tool, but that doesn’t mean that it is always the best solution for a specific problem. The primary benefit of OLAP comes from the ability to pre-aggregate values. If you don’t need—or cannot create—pre-aggregated values, OLAP may not be the best solution. Following are some examples of situations where other tools may be better than OLAP.
• Real-time alerts on detailed events. If you need notification that a shop-floor machine has malfunctioned, you should build that notification into the operational system. Because this is a detailed even, the benefit of OLAP aggregations does not compensate for the latency added by using even (near) real-time OLAP. If, however, you want to be notified as soon as the total number of malfunctions in the company exceeds a certain threshold in one day, you cannot see the event until the numbers are aggregated, so using OLAP (with real-time data feeds) is an excellent option.
• Highly filtered detail from large dimensions. If you need to see the list of Purchase Orders for a specific Customer Number, you can easily create a relational report to display that information. In this case, there is no aggregation involved. If, however, you want to see which Purchase Orders were involved in the March total for the March total for Widgets, you are starting the search from an aggregated value, so using OLAP (with drill-through) is an excellent option.
• Market Basket or other data mining analysis. If you need to know which products were purchased at the same time (market basket analysis), you don’t have any way to pre-define the grouping buckets. Any analysis that involves searching for correlations between various attributes is probably better suited for a data mining tool than for OLAP.
Guidelines for OLAP
This section details the guidelines that can be adopted while designing OLAP structure for aggregations, calculated formulas, measure groups and storage structures.
Logical Design: Good Dimension design
An OLAP dimension consists of a key attribute and a collection of other attributes that relate to that key. If you are using a well-defined data warehouse as the source, the primary key of the dimension table becomes the OLAP key. All other columns from a star-schema dimension table, or all the columns of related tables from a snowflake-schema dimension—become attributes of that OLAP dimension. For example, in a Product dimension, ProductID may be the key attribute, with other attributes such as SKU, Size, Color, Category, List Price, Buyer, and so forth. Good OLAP dimension modeling allows you to specify the nature of each attribute and the relationships between attributes.
Identify Groupable and Non-Groupable Attributes
Some attributes—such as Category or Size—are groupable attributes: you may use the members on rows or columns of a report. Other attributes—such as Description or List Price—don’t make sense for grouping or filtering, because they apply only to individual members. Distinguishing between groupable and non-groupable attributes allows the OLAP engine to store aggregations and indexes only for the groupable attributes. This makes storage and queries more efficient and also simplifies the user interface. In Analysis Services, for non-groupable attributes, change the AttributeHierarchyEnabled property to False for each non-groupable attribute. In Analysis Services, each groupable attribute becomes an attribute hierarchy, which means that Analysis Services creates groupings for each member of the attribute and also for the sum of all the members—the All level of the attribute hierarchy. Disabling the attribute hierarchy removes the attribute from the structure of the cube, leaving it as a description of a related member. A non-groupable attribute is sometimes called a member property.
Define Multi-Level Hierarchies for Navigation or Calculations
You can organize groupable attributes into a multi-level hierarchy. There are two reasons to create a multi-level hierarchy. First, a multi-level hierarchy can function as a drill-down path for end users. For example, you can organize the Product attributes Category, Subcategory, and Product into a drill-down path. A drill-down path does not have to consist of natural groupings. For example, you can create a drill-down path from Size to Color. That means that Red could appear under both Small and Large.
This is not a natural grouping, but it might be a useful drill-down path if there is a different manager for each Size category. When you create a multi-level hierarchy, you may want to hide the base attribute hierarchy to minimize user confusion. The second reason for creating a multi-level hierarchy is to support relationship-based calculations. For example, even if users don’t want a pre-defined drill-down path from Year to Quarter to Month, you need to create a multi-level hierarchy in order to calculate the growth since the same month of the previous year. Likewise, calculating the percent a product is of its Category requires a hierarchy relationship between Category and Product. Do not put attributes into multi-level hierarchies unless necessary either as an end-user drill-down path or to support relationship-based calculations.
Define Attribute Relationships to Create Natural Hierarchies
Even if you don’t create an explicit multi-level hierarchy, you should clearly define the natural relationships between each attribute. A natural relationship is a one-to-many relationship. It means that if you know the value of one attribute, you can determine the value of the other. For example, if you know the Subcategory, you can determine the value of the Category.
In Analysis Services, by default each attribute is related to the key, because, by definition, if you know the value of the key you can determine the value of every other attribute. If there are other relationships, move the relationship from the key to the other attribute. For example, move the relationship for the Category attribute from ProductID to Subcategory.
If your relational dimension consists of fully normalized snowflaked tables, each table’s primary key attribute should be the source for all the other attributes in that table. Even if your dimension is a single star table, think through how you would logically denormalize the dimension to determine the appropriate attribute relationships in the dimension.
A multi-level hierarchy that consists completely of natural relationships is called a natural hierarchy. In Analysis Services a multi-level hierarchy that is also a natural hierarchy is flagged by a small blue triangle. In a natural hierarchy, there is a one-to-one correspondence between each member of the multi-level hierarchy and the corresponding member of each individual attribute. For example, in the Category to Subcategory to Product natural hierarchy, the Sports Vest member exactly corresponds to the Blue Vest member of the Product attribute. Conversely in the Size to Color hierarchy, the Blue member that appears under Large does not correspond to the Blue member from the Color attribute. Correctly defined natural hierarchies improve query performance and simplify MDX expressions.
Use a Parent Child Hierarchy to Handle Arbitrary Dimension Depth
If the number of levels in a dimension varies dynamically—particularly between branches of the dimension, consider using a parent-child hierarchy. The levels in a parent-child hierarchy are not defined by columns in the dimension table, but rather by the relationships formed by the data stored in the Dimension Key column and a Parent column.
For example, in an Organization dimension, there may be three levels between the President and a staff accountant in the Finance department, but ten levels between the President and an assembly operator in the Manufacturing department. Managing the different levels by using explicit relational columns would be difficult. A common alternative is to include the Manager ID for each employee. The simplest possible dimension table would consist of Employee ID, Manager ID, and Employee Name. By recursively joining from the Manager ID to that manager’s Employee ID, you can generate the entire organizational structure. A parent-child dimension in Analysis Services reads the parent-child relationship and constructs as many levels as necessary.
A parent-child dimension is useful when the number of levels is arbitrary or dynamic.
Following are some of the most common applications for a parent-child dimension:
Organization: Different branches of an organization can have different levels, and even a simple corporate reorganization can change the number of levels.
Financial Account: There are often differing levels in a Revenue branch of an account structure than in the Expense branch, and the relationships and groupings can be modified frequently.
Manufacturing Parts: The parts that go into a product create a tree structure that can be very unbalanced. Different products can have very different component structures.
A parent-child structure in an OLAP database typically comes from a parent-child structure in the source application database. An OLAP dimension allows you to capture and report using the parent-child relationships from the source system. It is not usually necessary to generate parent-child attributes in the ETL process; the relationships almost always originate in the source system.
Avoid unnecessary members in a Date dimension
In a relational database, members of a dimension table that are not accessed in the fact table do not cause any problems, so some organizations create Date dimensions with many years of data. In an OLAP database, dimensions are processed as a structural part of the database. Large numbers of unused Date dimension members can make the cube structures more complex, and can be confusing to end users dealing with drop-down lists. Following are some techniques for avoiding unnecessary Date dimension members:
Add Date members to the dimension table only as needed, either on an as-used basis, or one year at a time.
Create a Used flag column in the Date dimension and load it with True for any members that are referenced by fact table rows. In the DSV, filter the Date dimension table to include only rows with the Used flag set to True.
Compress old Date dimension members. For example, convert all keys for prior years to the first day of the month so that other days can be deleted. The reports for older time periods will be available only at the month grain.
Logical Design: Measures Groups and Measures
These are the guidelines related to factors that can be considered while designing measure groups and measure in Analysis Services
Create one Measure Group for each logical fact table
A measure group is simply a group of measures that share a common grain (i.e., that have the same set of foreign keys). If you have a logical fact table that is physically partitioned vertically (that is, the partitions could be combined by using a Join), either create separate measure groups or combine the fact tables as described in the Dimension Modeling chapter. If you a logical fact tables that is partitioned horizontally (that is, the partitions could be combined by using a Union), don’t create separate measure groups, but use OLAP partitioning to combine the relational partitions.
Create separate measure groups for fact tables with different grain
In the following example, the EmpTotalClasses fact table is at the Employee grain while the EmpClasses fact table is at the Employee and Class grain. If you join the two tables before loading a fact table, you will over-count the values from the EmpTotalClasses fact table. If you create a separate measure group for each fact table, you do not over-count any values, but you can still perform calculations between the two measures.
EmpId ClassId GoodFeedback
1011 A129 1
1011 B120 0
1011 D223 1
1011 A129 1
277 B289 0
277 B892 1
277 C293 1
Result of joining the two Fact Tables
EmpId ClassId TotalClasses GoodFeedback
1011 A129 4 1
1011 B120 4 0
1011 D223 4 1
1011 A129 4 1
277 B289 3 0
277 B892 3 1
277 C293 3 1
Incorrect results (based on joined table)
EmpId TotalClasses TotalGoodFeedbacks
1011 16 3
277 9 2
Correct results (based on separate measure groups)
EmpId TotalClasses TotalGoodFeedbacks
1011 4 3
277 3 2
Create one measure group in the UDM for each fact table grain. Don’t join fact tables of different grain before loading the cube.
Use a dimension table as the source of measure group in formulas.
If you have numeric attributes in a dimension table that you want to be able to use in formulas—perhaps in conjunction with measures from other fact tables, create a measure group that uses the dimension table as a fact table. For example, a product dimension may have a List Price attribute. Formulas that need to use the List Price can work more effectively if it is a measure than if it is a dimension attribute. Think of the dimension table as a fact table with the grain of a single foreign key.
Whenever possible, use measures that can be aggregated by using Sum, Min, or Max
Sum, Min, and Max are associative operations. That is, the order in which you perform the operation, you get the same answer. Associative operations give very fast query performance with relatively modest pre-aggregation, because the query engine can continue aggregating from any point. Transaction amounts (such as dollars or units) can typically be summed. Rates—such as prices—cannot be summed, but you may want to determine the minimum or maximum values. Sum, Min, and Max are all easy to manage and should be used whenever possible.
Decompose weighted averages into summable components
Suppose that you have a Net Price measure in your fact table. Typically the best way to aggregate a Net Price is by using a Weighted Average. But a Weighted Average operation is not associative. That means that it cannot be stored efficiently. Instead of storing the Net Price directly in the OLAP cube, convert it into a summable value by multiplying it by a weight—for Net Price, the transaction Units is a good weight. Then sub both the Weighted Price and the Units. At any point of aggregation—Year by Category, or Quarter by Subcategory, or Month by Product—you can easily divide the aggregated Weighted Price by the aggregated Units to calculate the Average Net Price.
Many non-additive measures can be aggregated by using a weighted average, even if the appropriate weighting is not immediately obvious. For example, if you want to determine the total.
Manage Snapshot (semi-additive) Measures over Time
When the source of the measure is a snapshot—such as an account balance, or a head-count balance—you cannot add the values over time, even though you can add them over all other dimensions, such as geography. When you model the measure in a cube, choose the appropriate way to manage the aggregations over time. Following are some of the most common semi-additive aggregation methods:
• Last Child. This is appropriate for closing balances. Assuming Calendar Time, the Year displays the December value, and Q3 displays the September value.
• Last Non-Empty. This is the same as Last Child, except that it ignores empty periods. For example, if the database is only loaded through October, the Year value will display October, rather than the empty December value.
• First Child. This is appropriate for opening balances. The Year displays January, and Q3 displays July.
• First Non-Empty. This is the same as First Child, except that it ignores empty periods. If you did not open the account until May, the Year displays May.
• Average of Children. This is sometimes used for Headcount. Rather than show the total employees at the end of the year, you may want to show the average number of employees over the year.
Analysis Services has built-in operations to support semi-additive measures, but only in the Enterprise edition. If you are using a different edition, or a different tool, write formulas to perform the semi-additive calculation.
Use Unary Operators to manage Arbitrary Aggregation Needs
Across a financial Accounts dimension, you typically have diverse aggregations from one member up to its parent. For example, Revenue adds to its parent, Expense subtracts from its parent, and HeadCount does not affect its parent at all. All these aggregations apply to a single measure. To manage this type of aggregation in Analysis Services, you create a special attribute for the Account dimension. In that attribute, you store for each member, the way the member should be applied to its parent. Because there is one operator for each member, this type of operator is called a unary operator. The most common unary operations are addition, subtraction, and no-effect.
Manage Distinct Counts Properly
There are two different types of Counts that you can include in a cube. The first Count is simply the number of rows in the fact table. For example, if there is one fact table row for each order, you can simply count the rows in the fact table to determine the number of rows. This type of count produces the same result you would get if you applied a standard Sum function to a fact table measure with the value 1 for each row. Because a simple Count is essentially additive, it is easy to use and can be aggregated easily. When you simply count rows, you don’t need to specify a fact table column (unless you’re skipping rows that have a Null fact table row).
But many times, a simple count is not the right answer. For example, if your fact table is not at the Order grain, but is rather at the Line Item grain—which is often the case—then you could not simply count the rows to find the number of orders: most orders will have more than one line item, and hence counting the rows over counts the number of orders. In order to count the Orders properly, you must use a Distinct Count. When you do a distinct count, you must specify which column to count—typically a foreign key column in the fact table. A distinct count is an extremely important operation in BI reporting. Distinct Count is not an associative operation, so the calculation engine must take special steps to perform the calculation properly.
Following are some key points for managing distinct counts properly:
• If possible, create a fact table with a suitable grain so that you can simply count the fact table rows.
• Always put a distinct count measure into a different measure group than other measures from the fact table. In Analysis Services, be sure to select Distinct Count while first creating the measure—this automatically creates a separate measure group. If you change the aggregation type to Distinct Count after the measure already exists, it will not go into a new measure group.
• Avoid creating a large number of physical aggregations for a distinct count measure group. Aggregations for distinct counts are much more expensive than aggregations for additive measures. Use restraint when designing physical aggregations or you will multiply by many times the disk space required and the time required processing the measure group.
• Create a distinct count measure rather than using the DistinctCount MDX function. The built-in aggregate type creates special structures that result in much faster queries than the MDX function.
Avoid data type mismatch for member keys
The size and data type of the member key value itself can affect processing and query performance. Avoid creating member keys with a data type different from that of the value actually stored in the underlying dimension table. Mismatched data types take time during processing. If necessary, create a view or table on the relational database that converts the data type as needed.
Logical Design: Formulas and Calculations
The ability to create sophisticated formulas is one of the most powerful benefits of using an OLAP cube. You can create formulas that represent business rules and store them on the server so that they are available to anyone who uses the cube. In an Analysis Services cube, there are two fundamental places the formulas can be created. One option is to create a new member—one that does not exist in the data source database, and use a formula to give the member a value. This option is called a calculated member. The other option is to apply a formula to an existing portion of the cube, replacing what would otherwise have appeared in the cube. This type of formula is called a script assignment. (You can also use a calculated cell to achieve the same effect, but using a script assignment is syntactically more straightforward.) Following are some key points about calculated members and script assignments:
• You can create a calculated member for any dimension. Many times, you will create new members of the Measures dimension, but you can add a new calculated member to any dimension in the cube.
• A calculated member applies to all members of all other dimensions. If you want the calculated member to behave differently for the Quarter attribute than for the Month attribute, you must include the appropriate conditional logic within the formula.
• A calculated member does not cause other aggregations in the cube to recalculate. For example, using a calculated member to calculate a value for the Month attribute will not cause the Quarter attribute to re-calculate. The formula is calculated independently everywhere it appears.
• A script assignment can apply to a single attribute or to the intersection of attributes from different dimensions. For example, you can apply a script assignment to only the Month attribute of the Date dimension where it intersects with the Forecast member of the Scenario dimension.
• A script assignment does trigger re-calculations in other attributes. If you use a script assignment to change the value for the Month attribute, the cube will automatically recalculate the Quarter and Year attributes. To prevent this from happening, you can use the Freeze statement.
• Both calculated members and script assignments are calculated at query time. Even though script assignments may have the appearance of procedural assignment statements, they are in fact formulas that apply to individual cells when a query is executed. Script assignments are not inherently slower than calculated members, but a single script assignment can cause thousands of recalculations to occur, particularly when you create the script assignment at a low level of the cube. Because of this, you must be particularly cautious when creating a script assignment that you are applying it at the appropriate part of the cube. Somewhat paradoxically, applying the script assignment to a larger subcube can actually make the formula run faster, because it may not trigger as many subsequent calculations.Following are guidelines relating to calculations and designing the custom formulas
Use a Calculated Member when you do not want recalculation effects
For simple calculations—such as dividing a Weighted Price by Units to get a Weighted Average Price—a calculated member is simpler and faster than using a script assignment.
Use a Script Assignment when you do want recalculation effects.
If you need to make an adjustment to a low-level value and have the effects propagate to higher levels, you should use a script assignment. For example, a formula that affects the Employee Benefits expense in an Account hierarchy should have the effect ripple up through the Total Expenses and on to Gross Profit. Using a script assignment for the formula allows the aggregations defined in the dimension to re-calculate. Note, however, that this can be a slow operation. If at all possible, define low-level calculations in the source.
Create KPI definitions to group calculations under a single name
Analysis Services supports the concept of KPI definitions. A KPI definition is really no different from a regular calculated member, except that a KPI typically includes several standard formulas: Value, Target, Trend, Weight, and some other optional ones. Rather than creating multiple calculated members that are differentiated only by a naming convention, you can create a single KPI name that includes each of the standard components. Some client tools know how to look for KPI definitions and use them directly.
Create calculations on the server, rather than on the client
You can create calculations—whether calculated members or script assignments—on either the server (as part of the cube definition) or on the client (as part of an MDX session or query). In general, you should create the calculations on the server whenever possible, for the following reasons:
Server-based calculations are available to all users. This is part of the “one version of the truth” concept. Users should see the same values regardless of the client tool. Storing calculations on the server preserves this business logic consistency.
Server-based calculations occur before security is applied. This means that you can use security to control access to the underlying values, but only if the calculations are defined on the server.
Server-based calculations are faster. You can think of the server-based calculations as “pre-compiled.” There is less overhead with server-based calculations, so they can perform faster than client-based calculations—at times many times faster.
Move detail-grain calculations to the load process when possible.
You can think of calculations as pre-aggregation or post-aggregation. A pre-aggregation calculation is one that must be done at the lowest possible grain, before aggregations are applied. For example, converting a price to a weighted price is a pre-aggregation calculation because you must perform the multiplication before the aggregation addition. Conversely, converting a weighted price to an average price is a post-aggregation calculation because you must perform the division after the aggregation addition. Performing pre-aggregation-style calculations over a large number of members in an MDX formula can be extremely slow. In a calculated member, if you use a Set function that returns a large number of members, the calculation is probably a pre-aggregation calculation. Likewise, if you make a script assignment at the leaf level of a cube, allowing the cube to then re-aggregate the values, the calculation is probably a pre-aggregation calculation. Try to convert these pre-aggregation calculations to SQL expressions you can include in the source query. This can make a dramatic difference in query performance. Most calculations are 1) additive aggregations, 2) pre-aggregation calculations, or 3) post-aggregation calculations. Aggregations should be additive. Addition is the simplest possible aggregation method; it responds very well to creating aggregations to speed up queries. Pre-aggregation calculation converts a measure into something that can be aggregated—for example by applying a weighting factor. Pre-aggregation calculations should happen in the relational source, before (or while) loading the row into the cube. Post-aggregation calculations convert aggregated values into a usable form, often by calculating a ratio or by dividing the weighted measure by the weighting factor.
Logical Design: Perspectives
These are the guidelines related to perspectives in analysis services. Perspectives are the implementation technique to limit the visibility of data in a cube to a particular requirement
Use Perspectives to simplify the cube space for end users
Perspectives are a way to simplify the apparent size of a cube. If you have a cube with dozens of dimensions, hundreds of attributes, and dozens of measures, it may be difficult for an end user to understand all the options. By creating a perspective, you can limit the view of the cube to relatively few measures and dimensions that relate to a specific type of analysis.
Do not use Perspectives unless you allow end-user access to the cube
If you use a cube purely as the source for structured reports (as by using Reporting Services) or as the source for a custom application then there is no benefit from creating Perspectives. Perspectives are designed to simplify ad hoc analysis.
Do not use Perspectives as a security tool
Perspectives have no bearing on security. You cannot allow some users to see one Perspective but not another. You apply security to the entire cube space. A user can see any Perspective, but only the data they are allowed to see within the entire cube space.
Logical Design: Security
In a relational database, most security is applied either at the object (i.e., table or view) level, or possibly at the column level. It is much less common to have security that applies to specific values within the rows. In an OLAP environment, it is common to create security that restricts specific values. Analysis Services security uses Windows Authentication. You apply security to roles based on Windows Users or Groups, but you can also create dynamic security based on MDX formulas that take into consideration the current User name.
Use Dimension Security to prevent visibility of member existence
Sometimes it is important to prevent visibility of member names. For example, if you are exposing the cube in an extranet environment, you may not want vendors to have access to the email aliases of people with whom they are not authorized to interact. Dimension security eliminates members from the dimension, so the members do not even appear in drop-down lists or as labels on the report. You can define dimension security as either an include or an exclude list.
o Use an include list when you want any new members to be excluded by default. Only the members explicitly on the include list are visible to members of the role.
o Use an exclude list when you want any new members to be included by default. Only the members explicitly on the exclude list are blocked from members of the role.
With dimension security, you can enable visual totals, which means that the total for the attribute is recalculated as if the blocked members did not exist. This avoids the confusion that comes from a total not matching the visible components, but also prevents a user from back-calculating the value of the missing members.
Use Cell Security to block access to arbitrary cells in a report
Suppose that you want to allow report users to see the Revenue measure for all sales representatives, but the Salary measure for only members of their direct subordinates. Dimension security is not able to apply differently to some measures than to others. Cell security is a formula that calculates for each cell retrieved in a report. If the formula returns a non-zero result for a cell, the value of that cell becomes visible. If the formula returns a zero result, the value of that cell is blocked. The security formula calculates at query time, but only for the cells in the current report view. Cell security gives you more control over which values are visible than dimension security do. Cell security is not necessarily slower than dimension security. If the security formula is simple, it does not slow down the query. If the security formula is complex, it may cause the report to display more slowly. You cannot use visual totals to recalculate total cells when a detail cell value is blocked.
Use formulas to avoid creating a large number of roles
If you need the security to behave differently for each user—for example, if you show the Salary measure for only the current user—you can avoid creating a large number of security roles by creating an MDX formula that uses the UserName system variable. In order to use UserName in a meaningful way, the UserName must be available as an attribute in at least one dimension, so that you can compare the current user’s name with an attribute in the cube.
Physical Design: Multiple Partitions
Analysis Services partitions are independent of relational partitions. A partition is simply the unit of physical storage for a measure group. A partition is also the unit of processing. You cannot process something smaller than a partition. Each measure group has at least one partition. You can create multiple partitions for a measure group for two different reasons: to improve data load performance, or to improve query performance. All Analysis Services partitions are horizontal partitions.
Use time-based partitions to enhance data load performance
One of the best techniques for speeding up processing of a measure group is to create one (or more) partitions for older, stable, time periods, and one (or more) partitions for new, volatile, time periods. It is then necessary to process only the new, changeable partitions. Suppose that you have a measure group that contains 5 years history and takes 10 hours to process. Suppose also that only the current year is changeable. If you create one partition per year and only process the current partition, the processing time will be reduced to less than two hours.
Use business group-based partitions to enhance query performance
When you load a partition, Analysis Services detects which members of each dimension have data in that partition. If users focus on different Product Categories—rarely querying for any products from other categories—you can reduce the amount of data that needs to be searched by creating a separate partition for each category. If you have a 30 gigabyte measure group and can partition the measure group between 10 categories, a user query can search only 3 gigabytes for the result, instead of the entire measure group.
Create more partitions for heavily used or updated
Partitions do not have to be evenly subdivided. The current year is typically processed and queried more frequently than prior years. Use yearly partitions for previous years and monthly partitions for the current year. For very large data sources, you may even want to create daily partitions for the current month or hourly partitions for the current day. Likewise, if the West region is much more heavily used than the East region, create a single partition for the East region, but subdivide the West region into smaller partitions. Use care when creating a sophisticated partitioning strategy that you don’t create overlaps or leave gaps.
Use distributed partitions to share loading
Use distributed partitions to share loading and querying processor load among servers. For an extremely large database, you can create multiple partitions for a measure group and assign different partitions to different servers.
Do not use a MOLAP partition as the primary data storage location
You should always maintain the relational version of your data, even if you need to archive it to an off-line storage. You should think of a MOLAP partition as a disk-based cache, not as a primary repository. With an OLAP database, there is a high probability of making structural modifications which would require reloading the data. You should always maintain the relational data necessary to reload the MOLAP structures.
In some cases, it makes sense to partition a very large measure group both by time and also by business groups. For example, with five years of data and ten product categories, you may create 50 partitions, one for each year and category combination.
Physical Design: Aggregations
Analysis Services can pre-calculate aggregations and store them on disk. With a very large data source, pre-calculating aggregations can make a dramatic difference in query performance. The aggregations Analysis Services creates are analogous to relational aggregation tables, except that creating, maintaining, and using them is transparently managed for you. To avoid the data explosion required for storing every possible aggregation, Analysis Services only creates aggregations for Sum, Min, and Max aggregation functions. With these aggregation functions, all possible aggregations can be quickly derived from relatively few aggregation tables.
Do not pre-aggregate data in Relational Tables
Because Analysis Services effectively creates aggregates internally, do not create relational aggregation tables. Never store aggregated values together with detail facts in the fact table. Doing so would cause double-counting of the data.
Sparsity is not a problem in analysis services
Data sparsity is when very few dimension members contain data—and particularly when few attribute intersections contain data. As a simple example, when intersecting Months with Quarters (from a Calendar Date dimension), you will never have data for the intersection of January with Quarter 3. As another example, you are unlikely to have sales for heaters in summer in Phoenix or for air conditioners in winter in Oslo. Some OLAP tools allocate space for possible combinations, whether data exists or not, and sparsity considerations are critical. Analysis Services never allocates space for missing values, so sparsity is not an issue.
Use the Aggregation Wizard to design physical aggregations
It is possible to explicitly create specific aggregations, but you should take advantage of the Aggregation Wizard for designing aggregations. Thinking through the thousands—or hundreds of thousands—of possible aggregation combinations is virtually impossible to do correctly by hand. Manually created aggregations often provide fast performance for specific queries but terrible performance for exploratory queries. The Wizard effectively balances all the relevant attributes to decide which combination of aggregations is best.
Include correct attributes in aggregation design pool
In a typical measure group, you may have dozens of dimensions and hundreds of attributes. Including all possible attributes in the analysis would make aggregation design take many days to complete. Most attributes can be derived quickly from other attributes, so you don’t need to include all attributes in the design. By default, Analysis Services includes the following attributes in the design:
• The key attribute for the dimension
• The All attribute for the dimension (which effectively aggregates away the dimension)
• The top level attribute for each User hierarchy
• Any levels of a User hierarchy that have a natural relationship to a higher level in the hierarchy
This default selection of attributes is appropriate in many situations. There may be certain other attributes you want to consider for aggregation. For example, in a Customer dimension with 10 million members, suppose you have a Gender attribute. The Gender attribute may be very helpful, because it is a 5-million-fold aggregation, but it does not meet the default criteria for inclusion in the aggregation design. To add a specific attribute to the pool, use the following steps:
In the Cube designer, on the Cube Design tab, in the Cube Dimensions pane, select the By Attribute option.
Expand the appropriate dimension and select the desired attribute.
In the Properties pane, change the Aggregation Usage property to Include.
Limit aggregations for Distinct Count measure groups
Even though a Distinct Count measure cannot be pre-aggregated the same as Sum, Min, or Max, Analysis Services does create aggregations for Distinct Count measures. It does this by treating the Distinct Count source column as a key, storing all the values in the aggregation table. With a Distinct Count of a large-cardinality attribute (such as number of Customers), adding all the distinct CustomerID values to each aggregation can result in a huge data explosion. If possible, do not add any aggregations to a Distinct Count measure group. If you do need some aggregations, try to use as few aggregations as possible.
Do not over-aggregate the measure group
Because Sum, Min and Max values can be quickly derived from existing aggregations, it is not necessary to create too many aggregations. In fact, not only do excessive aggregations consume disk space and processing time, they can actually slow down queries, because it is often faster to calculate the new aggregation based on values stored in memory than it is to go to disk to retrieve the pre-stored aggregations. The Performance Gain number in Analysis Services does not have a useful pragmatic meaning, but as a general rule, you should not go higher than about 30%. The graph in the Analysis Wizard can be useful for determining the appropriate number of aggregations: when the slope of the line starts to decrease, aggregations become less beneficial.
Use Usage-Based Optimization after accumulating usage history
Analysis Services allows you to sample user queries and use the distribution of the queries to determine the optimal set of aggregations for similar queries. To use Usage-Based Optimization, you must first enable the logging. Then allow several months of usage to develop a useful history. Then redesign aggregations by using the Usage Based Optimization Wizard. Do not run sample queries to populate the log; that defeats the purpose of Usage-Based analysis. Search Books on-Line for Usage-Based Optimization for specific instructions.
Physical Design: Storage Type
Analysis Services functions both as 1) an OLAP storage engine, 2) an MDX expression calculation engine, and 3) as a metadata layer. The OLAP storage engine provides fast query response time from large datasets, the MDX expression engine supports sophisticated calculations, and the metadata layer allows you to take advantage of general-purpose client tools. If you want the benefits of the MDX engine and the metadata layer without using OLAP storage, you can define a partition to use ROLAP (Relational OLAP) as the storage mode.
Storage mode can apply to both the base values in the cube and also to aggregations. Following are the six logical possibilities, along with a brief explanation of each.
• MOLAP base with no aggregations. Very fast and very efficient storage for a small to medium-size partition. MOLAP partitions as large as 10-20 GB can be very fast, even with no aggregations.
• MOLAP base with MOLAP aggregations. Very fast and very efficient storage for large partitions. By using multiple MOLAP partitions, you can create high-performance multi-terabyte cubes.
• ROLAP base with no aggregations. This option is good for when you need near-real-time results. It works best as a very small time-slice partition when the older, stable partitions are MOLAP. It can also be useful when you want the benefits of MDX expressions and the metadata layer, but don’t need the performance benefit of MOLAP storage.
• ROLAP base with ROLAP aggregations. Creating relational aggregations is effective only when the aggregates are indexed view that the relational engine can automatically update when the relational table changes. As with ROLAP in general, it is best used with a small time-slice partition.
• ROLAP base with MOLAP aggregations. This is called HOLAP (Hybrid OLAP). It can be useful when the fact table is very large, but most of the ad-hoc queries require only highly summarized aggregations. If the fact table changes without processing the aggregations, there may be inconsistencies between the values retrieved from the aggregations and values retrieved from the fact table.
• MOLAP base with ROLAP aggregations. This combination does not exist.
Following are guidelines for designing storage:
Use MOLAP storage as a general practice
MOLAP storage—both for base data and for aggregations—is very efficient, both in terms of storage space and query retrieval time. By creating a small time-slice partition, you can have near-real-time results, even using MOLAP storage.
Use ROLAP storage for infrequently accessed large stores
ROLAP can be an effective option when you want to do detailed relational-style queries by using the metadata of the UDM.
Never use static tables for ROLAP aggregations
If you do choose to create ROLAP aggregations, be sure to used indexed views. Static ROLAP aggregation tables are never useful. They must be processed the same as MOLAP aggregations, but are slower to process, take more disk space, and are slower to query than MOLAP aggregations.
Consider having indexes on the relational tables for faster cube processing
Make sure that all fields used as keys have an index. It is especially important to have a single index on the combination of foreign key fields in the fact table. These indexes make a big difference in how quickly OLAP Services can process a query. Make sure the DSV created or the data accessed query being passed by cube processor would actually make use of the indexes and efficiently collect from relational tables. This would minimize cube processing time.
Make sure reference of surrogate keys of fact to dimensions
If the value in a fact table can't be found in a dimensional table, the row may be ignored. Likewise, having a value in the dimensional table with a corresponding value in the fact table may cause empty cells to appear in your cube.
Physical Design: Using the Data Source View (DSV)
When you design a cube by using Analysis Services, you must always first create a Data Source View (DSV) to represent the relational database. You can think of the DSV as a design-time component that facilitates creating the UDM. If the relational database is a well-defined star or snowflake schema, with properly defined primary keys and foreign key relationships, the DSV simply mirrors the relational data structure. You can, however, add additional information to the DSV. Following are some of the key features and benefits of the DSV:
• The DSV allow you to cache the metadata about the data sources so that you can develop cubes without having a connection to the source data systems.
• If you do not have permission to modify the data source, you can use the DSV to define logical primary keys, logical relationships, named queries (i.e., views), calculated columns. You can also give friendly names to the relational database objects.
• In the DSV, you only need to include the relevant tables from the data source. This simplifies the working space.
• The DSV allows you to create logical connections between tables that come from different data sources.
• DSV can access views instead of underlying tables so one or more tables can be flattened and use as a source for DSV
Following are guidelines for working with a DSV:
Do not put unnecessary tables into the DSV
The DSV provides a way to simplify the view of the relational database. If you have a clean data warehouse, where the whole purpose of the warehouse is to support a UDM, then it makes sense to bring all the tables into the DSV.
Avoid complex named queries particularly when in ROLAP storage
When generating SQL, Analysis Services expands named queries into the underlying SQL syntax. If you have complex named queries, the SQL generated by Analysis Services can become so complex that it does not work well.
Avoid using the DSV as a replacement for a data warehouse
It is true that you can use a DSV directly against an OLTP database or flat file sources, using named queries to simulate a star schema. But this usually results in complex named queries, and does not have the other benefits of a data warehouse, such as the ability to retain history, and to cleanse data from the source.
Operational Design: Decreasing Latency (Real-Time Reporting)
Using MOLAP storage in a cube adds an additional data layer. The additional layer provides valuable query performance benefits, but when new data is added to the fact and dimension tables, the MOLAP structures must be processed to reflect the changes.
You can get the changes into the cube using either a pull strategy—where the cube initiates the update—or a push strategy—where the relational data source triggers the update when there is a change. Following are ways that you can implement an update strategy.
• Create an Integration Services package that performs the update on a scheduled basis. This is probably the most common option. If you have packages that load the data into the data warehouse, the final step of the package process the cube.
• Schedule Analysis Services to refresh on a specific interval. This is useful for near-real-time analysis. You can tell the Analysis Services partition to check for new values on a periodic interval, such as every five minutes. If the interval is less frequent than hourly, it is probably more effective to use an Integration Services package to explicitly set the time you want the update to occur.
• Enable proactive caching with SQL Server. This tells the SQL Server relational engine to inform Analysis Services whenever there is a change to the relational data. This is most useful when there are relatively infrequent updates to the relational table (so you don’t want to process every five minutes), but you do want the cube to refresh whenever there is a change to the relational source.
• Enable proactive caching with other data providers. Some relational database providers cannot proactively notify Analysis Services of a change. In this case, you can get a similar effect by configuring Analysis Services to send a query to the source asking for whether there are any changes. For example, you might query for the maximum value of the Last Update field.
Before deciding on an update scheduling strategy, you need to identify the business reason for the update. In many cases, in a BI application, you want a certain amount of stability in the reports. If the report values change every five minutes, it can cause confusion about why different users have different values in the reports. For strategic decision making, the volatility that comes from real-time reporting is a disadvantage. In some cases, you may want only monthly updates to a BI cube.
In cases where you do want near-real-time data updates, consider the following guidelines:
Create a very small partition for the current time period
Regardless of what ever processing techniques you use, the one essential component for getting near-real-time turnaround is to keep the current partition as small as possible. A partition is the smallest unit for processing. A small partition processes proportionately faster than a large partition. A one-hour partition will process in 1/24 the time (on average) of a one-day partition.
Choose a scheduling mechanism based on loading process
Analysis Services provides several options for how to schedule processing for a partition. The best option is usually a function of how the data gets loaded into the fact table—or, if you are sourcing the cube directly against a transaction table—how the data arrives in the source table
Data loads as part of a scheduled batch package. Include the cube processing as part of the package. SSIS includes a task for processing Analysis Services objects. Scheduling the processing as part of the load package makes the processing happen as soon as, but only when, it is needed. In an SSIS package, you can also process dimensions as soon as the dimension tables are loaded, even while the fact table is still being loaded. This can minimize total latency.
Data loads in a steady stream. Schedule the dimension or partition to process on a scheduled basis. You can do this either with an SSIS package (when the timing of the interval is critical) or by using the pro-active scheduling option in Analysis (when the interval is small and the specific timing is not important).
Data loads sporadically. Schedule the dimension or partition to process based on a trigger from the relational table. Use the proactive caching option in Analysis Services. If the data loads in sporadic clusters, set the Quiet Interval so Analysis Services will not begin processing the partition until a cluster has finished loading.
Use Lazy Aggregations when low latency is important
When Analysis Services process a partition, it first loads the base data for the partition and then loads the aggregations designed for that partition. Normally, Analysis Services directs queries to the old partition data until after the aggregations are processed. If your goal is to minimize latency as much as possible, you can configure Analysis Services to switch incoming queries to the new partition as soon as the base data has loaded, before it creates aggregations or indexes. This keeps data latency to an absolute minimum, but may result in fluctuations in query performance: if the aggregations happen to exist, the query is fast, but if the partition has recently been updated, the query is slow. As a general rule, consistency in query performance is very important in BI solutions, and keeping the current partition small is typically a better solution than using lazy aggregations. To set lazy aggregations, use the ProcessingMode property for a cube or dimension.
Operational Design: Optimizing Load Performance
Following are the guidelines that can be considered for optimizing the load, including machine configurations and other are implementation suggestions
Provide sufficient memory for processing
Make enough physical memory available for processing to work without swapping to disk. You can monitor whether processing requires writing to disk by using the Temp File Bytes Written per Second counter from the MSAS 2005: Proc Aggregations group in Performance Monitor. Remember that allocating more memory than is physically available will cause the Operating System to swap memory. Allowing Analysis Services to manage swapping is preferable to using the Operating System, because Analysis Services is aware of the meaning of memory regions and can swap out the least important portions.
Maximize Resource Availability for Processing
Avoid sharing the processor with other applications. Ensure that the network bandwidth from the relational database to the OLAP server is sufficiently fast.
Reduce grain of fact table if not needed for detailed reporting
It is possible to reduce the time needed to load a partition if you are able to pre-aggregate the data in the fact table. For example, if you need only monthly data in the cube, but the fact table contains daily data, you could reduce load time by pre-aggregating the fact table to the month grain. This should be an option, however only if you are re-loading the data on a frequent basis. If you will load it only once, pre-aggregating the data will not save any time over having the cube aggregate the data as it loads. Also, remember that the UDM allows for flexible reporting options, including drill through to the fact table. Even though the Date dimension goes only to the Month grain, users may want to drill through to see the daily detail from the fact table.
Operational Design: Minimizing Downtime
It is increasingly important for BI applications to be available with virtually no downtime. When reporting systems are interactive, you can never predict when a user will need to retrieve information. Even if all of your users work in a single time zone on a single shift, there will be times when an analyst has to work all night to complete a presentation. And the likelihood that all your users will be in a single time zone on a single shift is constantly decreasing. Following are guidelines to help you minimize downtime when processing a cube.
Process the entire database if it is not too large
For smaller databases, you can process the entire database. When you do this, Analysis Services is able to completely process all the dimensions and then all the partitions. The processed objects are given names with a sequential identifier embedded. Once all the objects are completely processed, the server redirects new queries to the new objects. Once all connections to the old objects are closed, the old objects are deleted. This approach does consume at least twice the disk space of a standard cube, and it requires enough time and resources to process all the objects, but it can be an effective way to avoid downtime for a reasonably small cube, particularly if you have many Type I attributes that are included in aggregations.
Use Type II attributes whenever possible.
Type II attributes never change. If the attribute value changes, a new key is generated. This means that dimensions can always be processed with the Update option, and old partitions never need to be reprocessed. The cube never needs to go offline.
Define Type I attributes with a Flexible attribute relationship
If you do need to have Type I attributes (which restate history), Analysis Services will need to recreate any aggregations that include the attribute when any of its values change. Setting the attribute relationship to Flexible (which is the default) creates the aggregations as flexible. Query performance against a cube with flexible aggregations can fluctuate, as the aggregations are dropped and re-created in the background, so you should be careful how you use this option, but it does retain uptime for the system.
Use a staging server to process the cube
In extreme cases, you can use one server to process the OLAP data and a different server to respond to user queries. One relatively easy way to switch between the two servers is to put them on a SAN and use the network to redirect traffic to the server that is currently responding to queries.
Operational Design: Optimizing Query Performance
The following guidelines can help improve query response
Whenever possible, make core cube measures additive
Additive measures respond very well to the aggregation techniques used in Analysis Services. If you have a non-additive measure, try to multiply it by an appropriate weighting factor, and then divide the aggregated measure by the aggregated weighting factor. For example, suppose you have a Defect Rate measure. This measure cannot be added over time or across organizations but if the defect rate is per workstation per day, create a weighting factor of 1 for each workstation-day. Multiply the Defect Rate by the WeightingFactor to get WeightedDefects at the leaf level, and then aggregate both WeightedDefects and WeightingFactor. These values can aggregate over any other dimension, including time. The create a calculated measure for Defect Rate with the formula WeightedDefects/WeightingFactor. This calculation will work at any point in the cube—day, month, year, department, factory, region, or country. You can hide the underlying values. This is an extremely powerful technique for improving query performance.
Use appropriate partitioning strategy
Partitioning is often used to reduce load time, but by partitioning based on common query usage—for example, by Region, if most queries take place within a single Region—you can substantially reduce the amount of space the server needs to search for an answer. This is particularly critical with cubes larger than 100 GB.
Use the Distinct Count aggregation function over MDX
If at all possible, avoid using the DistinctCount MDX function. It must dynamically construct a set of all the possible values and count them. This is a very slow operation. The Distinct Count aggregation function for a measure still has to do a lot of work to calculate the count, but it creates internal structures to make the process much more efficient. Make use of Distinct Count measures to decompose other complex calculations into component parts. For example, calculating the Revenue per Customer probably requires a distinct count of Customers for the denominator. With certain restrictions, even sophisticated calculations such as a standard deviation can be decomposed into additive measures if you make use of a distinct count measure.
Create sufficient aggregations
By using the performance monitor counters from the MSAS 2005: Cache and MSAS 2005: Storage Engine Queries groups, you can monitor how many times queries have to go to file as opposed to retrieving the values from cache.
Operational Design: Supporting Write-back
If you think of Business Intelligence as a reporting function, writing data back to the cube doesn’t make sense. But in a broader sense—including planning into the equation—writing values into a cube structure can be very useful.
For example, suppose that you are planning orders for key resellers. You may have product planners who select a single product and forecast the units each reseller will order. But you also have account managers who select a single reseller and forecast the units for each product. The ability to input the forecasts in one orientation, and then immediately switch them to the other orientation, or immediately see the aggregated total for the entire product or for the entire reseller is a powerful OLAP capability.
Analysis Services does provide write-back capabilities. If you understand the essence of how Analysis Services write-back works, you can benefit from its features and avoid poor performance. In the first place, you must have a client tool that support write-back. Even though the server supports write-back, the client must be able to send the data to the server.
The essence of Analysis Services write-back is that it treats write-back as simply another partition that means
1) All write-back occurs at the same grain as the fact table, even if you submit a write-back value at a higher grain
2) All write-back values are stored as a delta from the previous value. Following are some guidelines to help you make the best use of write-back.
Limit the granularity of a write-back cube
When you write a non-leaf value back to a cube, Analysis Services generates one row for each leaf level member that would feed into that cell. Suppose, for example, that you have a cube with two dimensions: Product and Date, and there are 50 products. If the grain of the Date dimension is Month, and you write a value for All Product, 2007, the system will generate 50*12 = 600 records in the write-back partition. If the grain of the Date dimension is Day, the same write-back value will generate 50*365 = 18,250 records. Adding a Region dimension with 20 records would change that to 365,000 records. Each time you add a dimension, or increase the granularity of an existing dimension, you explode the number of potential records from a write-back operation. For write-back create a special measure group with as few dimensions and as limited grain as possible for each dimension.
Limit the level at which write-back occurs
The closer to the bottom of the cube you perform a write-back operation, the fewer rows are generated. In the case of a measure group with Product and Date dimensions, writing a value for a single product for a single day generates only a single record. You can use security to limit the levels at which write-back is permitted. You may want one measure group where only low-level write back is permitted, and another measure group that allows high-level write-back, but with extremely limited granularity. The two measure groups allow bottom-up and top-down write-back capabilities respectively.
Periodically transfer write-back data to a standard fact table
Because write-back data occurs at the grain of the fact table, you can easily create an ETL process that copies the write-back values from the write-back partition into a fact table. You can then clear the write-back partition and process the fact table into a MOLAP partition. This is an effective way to maintain fast performance even as write-back data volume grows. Consider transferring the write-back data at the end of each planning cycle.
Beware changes to the underlying cell values
As the write-back values are stored as delta records, if the underlying value in a cell changes (because of changes to the fact table) then the total in the cell will be changed proportionately. One of the processing options in Analysis Services is whether to drop write-back values when you process a fact table. You should keep the write-back values only if you know the underlying values will not change—or if you want to maintain the same write-back delta for the new values.
Create relational reports for showing write-back values
If you start with an empty fact table (zero-based budgeting), you can take advantage of the time-stamp Analysis Services adds to each write-back record to show the value of a budget at any point in time. Create a relational report that uses a date-time parameter and filter the rows from the write-back table to less than or equal the value of the parameter. Adding all the records in the write-back table up to that point in time will give you the budget as of that time. You can use this technique to avoid capturing snapshots during the course of a planning cycle. You can also use relational reports against the write-back table to determine which users were responsible for which change amounts.
Limit the number of concurrent updates to same cell
Analysis Services does handle concurrent write-backs in a reasonably intelligent manner. If User A and User B both retrieve the value 10 for a given cell, and then User A posts the value 15 back to the cell, but then user B posts the value 18 back to the cell, Analysis Services logs a change of +5 for User A, and +3 for User B. In other words, it re-reads the cell for the current value before calculating the appropriate delta. There is, however, no concept of locking the value of a cell while editing it. You would be wise to limit the users who can concurrently write to a single cell. You can use Cell Security to limit who can write to a cell. When you use Cell Security to control write-back, you create an MDX formula that acts as an overlay to a cell grid. If the formula calculates True for any cell, the user is allowed to write to that cell. If it calculates False, write-back is not allowed. You can use the same formula for both read and write permissions, or you can use different formulas, allowing the user to see the value from cells that are not write-enabled.
Use Dimension write-back as needed to create new planning cycles
Dimension write-back behaves differently from data write-back. Data write-back is stored in a separate table from the underlying fact table. It is easy to remove the write-back values and get back to the underlying value for the cube. Dimension write-back is simply a short-cut method of making a change to the dimension tables and updating the dimension. Because of this, there is no concept of a “what-if” write-back for dimensions. You should be very restrictive of who can perform dimension write-back, and prevent concurrent dimension write-back if at all possible.
Challenges of Analysis Services
Though analysis services pretty much reduces the implementation problems to satisfy business needs, these are the challenges that would in general bottlenecks for olap implementation.
Managing huge data volumes
Following are some guidelines that may help you work with extremely large data sources.
Avoid re-processing old data
One of the best ways to reduce processing time for extremely large sources is to avoid reprocessing old data. Following are some key techniques for avoiding reprocessing:
Use only Type II attributes
Type II attributes never modify existing records, but only add new dimension records. These never require reprocessing old data. If you do need some Type I attributes, you can leave the Relationship Type set to Flexible for the attribute relationship for that attribute. This means that any aggregations that include the attribute are created using flexible aggregations, which are dropped and re-created when you update the dimension. Flexible aggregations can cause performance fluctuations, so consider avoiding aggregations at all on Type I attributes.
Never update fact table rows. Only append new ones.
This allows you to create time-sliced partitions and process only the most recent time period. If you make updates to fact table records, you have to go back
Use MOLAP storage and partition along member slices for queries
MOLAP storage is inherently faster than ROLAP, both for detail data and for aggregations. When Analysis Services loads a MOLAP partition, it identifies which attribute members are included in the partition. If you separate partitions based on commonly used slices, Analysis Services can avoid searching unused partitions. For example, if most queries are performed within a specific Region, partition the data by using the Region attributes. That way, queries that need only West region data will not search any other partitions. With very large cubes, you may want to partition across more than one dimension—for example, partition both by time and by region (so that, for example, there is a partition that includes only West Region, 2006 March data).
Consider partitioning incase of discount measures and multi proc
If you have Distinct Count measures and multiple processors available, partition so that the distinct count items are spread relatively evenly across partitions.
For example, if you are counting distinct customers, and customers are highly correlated with region but are not correlated with product, slicing partitions by product will result in faster queries. This guideline seems counter-intuitive, because it means that multiple partitions will be involved in each query. But calculating distinct counts is a processor-intensive activity. Only one processor can operate at a time on a single partition. If you partitioned based on Region, you would force a single processor to do all the work of calculating the distinct count. By partitioning based on Product, multiple partitions can work in parallel on the query. This approach assumes that you have processors available.
Managing unavoidably slow calculations
Analysis Services is extremely effective at producing very fast queries from very large data sets when the aggregations are additive or the calculations perform relatively simple operations on the additive measures (for example, with a weighted average). Unfortunately, calculations that do not lend themselves to strategic aggregations can still be slow, particularly with a large data source. One of the problems with slow calculations is even if it were possible to pre-calculate the values, in cube environment—where users can dynamically navigate to any part of the cube—storing all the possible values would result in a huge data explosion problem.
Use reporting services to cache calculation results
One solution for the very slow calculations problem is to use Reporting Services to query the cube and store the result for a limited data set. Reporting Services can execute a query in advance and store the resulting data set in its own cache, which can then be used to server reports very quickly for end users. This approach avoids the data explosion problem because users are limited to the view provided by the report.
As long as you do not need dynamic filters (report parameters), using Reporting Services to cache the results of slow calculations is straightforward. The slow formulas are sure to return non-additive values, so be sure to use the Aggregate function as described in the Managing Client Tools section.
If you do want to use report parameters to allow users to filter a cached data set in Reporting Services, there is a slight problem: The Aggregate function—which is critical for returning the correct result for non-additive calculations—requires an active connection to the data extension, so it is incompatible with dataset filters. Parameters in the query itself are available only with interactive queries, so it would seem impossible to use Reporting Services to cache the results of slow formulas in a report that can be filtered by using report parameters.
It is, however, possible to pre-execute the query and use parameters to filter the dataset if you bypass the new Analysis Services data extension for Reporting Services 2005. This approach is complex and can significantly slow down the report development cycle, so it should be used only as a last resort. Following are the steps to use in Reporting Services:
1) First create a report using the Report Wizard to generate the MDX. Do not include any query parameters. Modify the MDX as described in the Managing Client Tools section so that it returns all the levels.
2) Create a data source that uses the OLE DB data extension, with the Analysis Services 9.0 data provider, and change the data set for the report to use the new data source.
3) Change all the Sum functions to use First. (In other words, use First where you would use the Aggregate function with the Analysis Services data extension.)
4) Each group in the report will show an extra row with a blank total. This row corresponds to the Total row for the next higher group. To hide the blank row, add a filter to the Group. For example, to hide a Group based on the Category field, edit the Group (not the data set) and in the Filter pane, enter =Len(Fields!Category.Value) in the first column, select > in the second column, and enter =CInt(0) in the third column. Do this step for each group in the report that shows unwanted blank rows.
5) Add a calculated column in the MDX query that corresponds to the report parameter. If you use the Report Wizard to create a report parameter, you can see the MDX needed for the parameter. In the MDX for the report dataset, you need a column that corresponds to that parameter value. Typically, you will want to display all the rows that have the report parameter as an ancestor. Suppose that you will use [Product].[Category] as the report parameter. Add this calculated member to the query, and include it in the measures on the Columns axis:
WITH MEMBER Measures.CategoryKey as
6) Add a filter to the data set (or to the data region) that filters for the appropriate report parameter. In order for this to work, there must be a field in the MDX that matches the definition of the value used in the parameter. For example, to match the parameter named CategoryParam with the CategoryKey calculated measure, edit the data set, and in the Filter pane, enter =Parameters!CategoryParam.Value in the first column, select = in the second column, and enter =Fields!CategoryKey.Value in the third column.
7) After deploying the report, change the data source to include stored credentials, and change the report to execute as a snapshot. After the snapshot executes, you should be able to view the report, select the desired parameter, and quickly see the filtered results.
Managing client tool queries
Various Client tools are available that would interact with Analysis Services and show the easy browser view of the analysis cube and graphs/charts with a feature of adhoc reporting capability
Current Member formulas with Subcube Filters
In 2005, Analysis Services added a new type of filter for queries: a subcube subquery. In many ways a subcube subquery behaves in a very intuitive way. Some client tools (for example Excel 2007) now use subcube subqueries to specify page filters. This can be a problem if you have a calculated member that uses the CurrentMember of the filtered dimension. For example, suppose you have an MDX formula that calculates the Monthly Growth Rate. To calculate, this formula looks at the current member of the time dimension and compares it to the previous member. You select March 2007 as the page filter. If the client query uses a subcube subquery, the data is properly filtered to include only March data, but the current member of the Date dimension is still All Time. The Monthly Growth Rate formula tries to retrieve the previous member of the Date dimension, and there is no previous member for All Time.
In a client tool such as Excel, you have no control over the MDX. The workaround in this situation is to put the Date dimension onto the Row or Column axis of the report, and then filter for the desired month. Putting the Date dimension onto an axis—even if there is only one member selected—sets the current member for the dimension and allows the formula to work.
In a client tool such as Reporting Services, the MDX designer defaults to put the filter parameter into a subcube subquery, but you can edit the MDX to put the parameter into a Where clause. If the selected month is in the Where clause, it does affect the current member, so the growth formula will work properly.
Modify report queries to handle non additive measures
The Reporting Services MDX builder and report wizard work properly for additive measures, but not for non-additive measures. In a cube, you may create a calculated member that properly calculates a non-additive value for any point in the cube. For example, you may have a formula to calculate the Average Price. To display this value properly in a report, you need MDX that returns not just the lowest level of detail, but all levels of any displayed hierarchies. If you want to display non-additive measures from the cube in a Reporting Services report, follow the following steps after using the wizard to create the report:
1) Edit the MDX to retrieve values from all levels of the hierarchy, not just the leaf. To do this, look for the AllMembers function in the definition of the Rows axis. It will look something like this: [Product].[Categories].[Product Name].AllMembers. The first item ([Product]) is the name of the dimension; leave that alone. The second item ([Categories]) is the name of the hierarchy; leave that alone. The third item ([Product Name]) is the name of the lowest level of the hierarchy. Delete that part of the expression, including the period that follows it.
2) In the report, change any Sum functions to use the Aggregate function. For example, change Sum(Fields![Average Price].Value) to Aggregate(Fields![Average Price].Value).
These two changes will enable the report to display the non-additive values as they are calculated in the cube.
There are some types of queries that are inherently slow. In essence, they are all cases where you need highly aggregated values, but there are no predefined buckets you can use to pre-aggregate the data. These are not situations where a relational query will be faster—because even in a relational query, you would need to aggregate the data dynamically.
The user dynamically selects an arbitrary date range for grouping
For example, suppose that a store had a promotion between March 5 and June 7 and the analyst wants to compare the sales for that range with the same time period of the previous year. It would be difficult to create a promotion flag on the time dimension if the promotion is for a single region, but the time dimension is shared across all regions. You could create a separate reference fact table that contains promotion dates for each region, but that would still not provide a grouping bucket for the same period of the previous year. If the user interface allows the user to simply enter arbitrary start and end dates, it is impossible to pre-define buckets for those arbitrary dates.
Arbitrary list of members for grouping
For example, suppose you want to get the total insurance claims for all patients who were ever treated for a particular disease, perhaps cancer. In this case, you need to search the fact table twice: once to identify which patients meet the criteria, and another time to aggregate values for the selected patients. If you can pre-define the disease, you can add a TreatedForCancer flag to the patient records which could then be used to pre-aggregate the claim amounts, but if the disease selection is an arbitrary choice, it is difficult to flag the patients for all possible treatments they may have had. It is particularly difficult when a compound selection criteria is used—for example, all patients who ever had heart surgery that had also ever been prescribed a particular blood-pressure drug. One technique that can help in these cases is to create a relational flag to mark the targeted members. The flag can then serve as a grouping bucket. Of course, if you need to create the definition of the flag dynamically, it still takes time to create the flag and aggregate the values.
Analysis Services allows you to create flexible BI reporting solutions that return very fast results to ad hoc queries. End users can use general-purpose tools to create dynamic reports without waiting for data specialists. You can define business logic formulas in one place on the server, so that they can be consistently shared among all users, regardless of the specific report layout. OLAP storage makes the results of the queries very fast, even with very large data sources, and the UDM makes the data easily usable regardless of the client tool.