Business Intelligence Guidelines Conceptual Framework
Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are confidential to and maintained as a trade secret by Microsoft Corporation. Information in these materials is restricted to Microsoft authorized recipients only. Any use, distribution or public discussion of, and any feedback to, these materials is subject to the terms of the attached license. By providing any feedback on these materials to Microsoft, you agree to the terms of that license.
Microsoft Corporation Technical Documentation License Agreement (Standard)
READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE TERMS.
1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and Microsoft agree as follows:
(a) If You are an authorized representative of the corporation or other entity designated below ("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have authority to act on behalf of Company and agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the Confidential Information accordingly;
(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that You will treat the Confidential Information accordingly; or
(c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing the Confidential Information to any third party for five (5) years from the date of disclosure of the Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the Confidential Information; and (c) to take reasonable security precautions, at least as great as the precautions it takes to protect its own confidential information, but no less than reasonable care, to keep confidential the Confidential Information. You/Company, however, may disclose Confidential Information in accordance with a judicial or other governmental order, provided You/Company either (i) gives Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or governmental entity that it will afford the Confidential Information the highest level of protection afforded under applicable law or regulation. Confidential Information shall not include any information, however designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such information to You/Company pursuant to the terms of this Agreement; (iii) became known to You/Company from a source other than Microsoft other than by the breach of an obligation of confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible form relating to and/or including released or unreleased Microsoft software or hardware products, the marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and information received from others that Microsoft is obligated to treat as confidential.
2. You may review these Materials only (a) as a reference to assist You in planning and designing Your product, service or technology ("Product") to interface with a Microsoft Product as described in these Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone else.
3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.
4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You changes, Your right to use these Materials automatically terminates and You must destroy them.
5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback") relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason to believe is subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with any third party.
6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by Your NDA.
7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be brought in the federal or state superior courts located in King County, Washington, and You waive any defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft concerning these Materials; it may be changed only by a written document signed by both You and Microsoft.
An increasing number of businesses and corporations who are using traditional online transaction processing (OLTP) applications are incorporating analytics into the system. Transactional systems process massive quantities of data. Turning that data into consumable information that can facilitate decisions is not a luxury but a necessity to stay competitive in the business. Without the help of critical indicators providing timely information, decision makers spend their time organizing data, rather than extracting meaning from the data. When implemented well, analytic systems provide decision makers with the necessary tools to perform analysis (such as trend and comparative analysis) of the data at various levels of granularity, to view the details, to see relationships between the data, and to explore new possibilities.
Conceptual Framework for Business Intelligence
The term Business Intelligence (BI) incorporates the concept of deriving useful information from the data in an organization. Designing a BI application involves multiple layers. The goal of this section is to provide a common framework for architects and developers. This framework is conceptual, technology agnostic, and covers the major phases, features, and functionality required to effectively implement a BI solution. The conceptual architecture in Figure 1 is comprised of five major areas and a set of cross-cutting concerns.
Conceptual system architecture
Data Storage is the end result of Data Source and Data Integration layers. Data Storage can be termed as Data Warehouse (DW)/Data Mart (DM). Data Analysis and Data Presentation will leverage the information stored in Data Storage.
One of the challenges when working with data in a BI system is that it typically originates in many different data storage systems. Extracting data from those different sources and merging the data into a single, consistent dataset is challenging. The complexity of the source system makes the situation more difficult. In this section extraction of data sources is outlined.
Data source in conceptual system architecture
Figure 3 shows that as interactions between the potential elements in the data source increase, the complexity of the system also increases.
Conceptual system architecture
When working with diverse data sources, the following issues may be considered
• Different source environments with different systems, including different platforms and operating systems: Heterogeneous environments are silos of information due to differences in business and applications. Each application has its own database, business rules, and schema. On the other hand, homogeneous environments, where both the source and destination are using similar technologies and versions, are much easier to deal with. For example a homogeneous environment may use SQL Server 2005 on both the OTLP and online analytical processing (OLAP) systems.
• Different database systems: Source data may originate in many different types of database system, including Oracle, DB2, SQL Server or others. The components that read the source data are usually referred to as data adapters. Connecting to a wide variety of data sources means choosing the right source adaptor to facilitate the connection and retrieval of data from various data sources. Consider the type of host systems and applications, for example, the data could be read from a standard database (SAP, DB2, Oracle, SQL Server), from a flat file such as an IIS log or a queue, or data could be read via a third party broker or Web service. When connecting to a data source system, consider appropriate security precautions such as authentication, authorization, and secure communication.
• Different schemas, data formats, and naming conventions: Legacy transaction systems may use EBCDIC string formats. OR the source database systems may use VSAM files which are not relational in nature. Table and column names from an enterprise application such as SAP may be difficult to understand
• Geographically separated source locations: If the data warehouse collects data from geographically dispersed locations, it is required to consider the implications for timing requirements on timelines and bandwidth to address the latency challenges.
• Source system ownership. If the same team, product, or department owns both the OLTP and DW system, then it is relatively easy to work out issues over ownership. Ownership pertains to both access permission and data quality. Getting permission to read data from transaction system databases can be a complex political problem. However, it is still critical to understand trust boundaries so that appropriate quality gates in place. If control over the source system is restricted then consider all necessary precautions for limitations or variations of the source system.
• Impact on the source system. Retrieving large quantities of data from transactional system databases can have negative impact on the operational applications. It may be appropriate to consider techniques that would minimize usage of source resources and also extract all the required information timely. All the precautions need to be taken to extract only the data that is required for business analysis.
• Source data volatility: The source data can be volatile as a result of streaming data, transactional data, periodic snapshot data, stand-alone data, and replicated data. Consider appropriate window for data extraction from source systems as they are meant to change continuously with operational or batch process transactions.
• Source data volume: Data volume is a major consideration for data extraction operation as it would affect both processing as well as resource utilization. Consider extraction techniques that would reduce the data volume on ongoing basis to meet the business requirements.
Integration is critical for connecting business functionality, process, and data. Integration of fragmented information requires addressing a complex set of challenges. Some of these challenges are already mentioned in the "Data Source" section. In this section, critical factors relating to a data integration system are explained in detail.
Data integration in conceptual system architecture
The following issues need to considered while integrating data from various sources
Data profiling: Upon having access to data source, study the underlying data, its dependencies, and rules is a complex task. The ability to profile and analyze the source data to identify potential issues such as anomalies, outliers, dependency violations, and redundant or orphaned data is commonly referred to as data profiling. The three commonly sited aspects of data profiling are: column analysis, dependency analysis, and redundancy analysis. Column or attribute analysis evaluates the distribution range, completeness, uniqueness, format, type, size, and frequency of data. Dependency or referential analysis looks for relationships, integrity, and business rule dependencies. Finally, redundancy analysis, as the name suggests, is the technique for identifying duplicate data, in addition to orphan records.
• Data extraction. Once the source data is studied, extracting meaningful data is the next big challenge with data integration. Extracting data from the source requires connecting to heterogeneous or homogeneous data sources using a source adapter. During the extraction following issues needs to be addressed: What source format are you dealing with? What is the frequency of the extraction? What is the load? What if there is no record tracking for the transactions on the source system? How do you select and extract data that has only been changed since your last extraction? How do you minimize the load on the source data during extraction?
• Data staging. Staging is a location where data is temporarily stored before loading the data into the destination warehouse. Staging may not be necessary for simple cleansing or transformation operations; however, staging is for other reasons, for example, you may want to avoid the overhead on the source system so checks are performed locally on the staging server for cleansing and transformations. There may be procedural reasons causing a time lag between data extraction and loading. You may need to perform complex transformations that require you to access multiple sources, lookups, or fact tables.
• Data transformation. Once you have connected to the source system, identified relevant data and extracted the data from the source, it is ready to be transformed. During the transformation process you could perform various actions on the data depending on the scenario. You could sort, split, merge, lookup, address slowly changing dimensions, audit, pivot, or aggregate the data. This process may go hand in hand with data cleansing. Data transformation can be done at various stages depending on the type of operation you are performing and the data load. You could choose to do it at the source, before the staging, or before loading the data into the destination warehouse database.
• Data cleansing. Data cleansing ensures inconsistent and invalid data is cleaned before loading it into the data warehouse. During this process, common data quality problems such as absent data values, inconsistent values, duplicate values, primary key reuse, and violation of business rules are detected and corrected. Correcting data quality issues at the source is ideal; however, in most cases it is not possible. For example, you may not have ownership or influence over the source system, the source system may have dependencies which increase the complexity.
• Data loading. Populating the data warehouse is the last step of the Extract Transform, and Load (ETL) process. If you are populating the warehouse for the first time, you will load the historical data, followed by a new transactional data on a periodic basis. For loading large transaction data into fact tables, you have to consider issues such as: loading data during off-peak usage, loading data into temporary tables, creating indexes on those temporary tables similar to fact tables and merging the temporary table as a partition back into the fact table, or dropping indexes on the fact table before loading the data.
Several of these topics are the subjects of specific guides in this series.
The data stored in a warehouse is typically loaded by the ETL process. The schema for storing information in a warehouse is different from the transactional system. In this section, you will learn critical factors relating to a data storage system.
Data storage in conceptual system architecture
When you manage data storage for your warehouse, you will need to think about many issues, including the following:
• Dimensional modeling. Strategies for effectively organizing data are critical to implementing BI/DW systems. The technique used in modeling the logical data warehouse is commonly referred to as dimensional modeling. The guidelines for designing data warehouse solutions are different from transaction systems. As a designer you have to choose appropriate schema types, such as star or snowflake schema, design fact tables with measures relevant for the business and at the appropriate level of granularity, and address attributes that change over time by picking appropriate types of slow changing dimensions (SCD).
• Partitions. Warehouse databases typically contain millions of rows in tables. Dividing large tables and their indexes into multiple segments and assigning them to filegroups is called partitioning. As a designer you have to create a scalable partition that enables the best possible performance, you have to choose an appropriate partitioning strategy, choose optimal partitioning functions, appropriately place partitions in filegroups, use index alignment, and plan for data management, such as moving new data inside a partition and removing aging data out of the partition.
• Indexes: Designing an appropriate indexing strategy taking into account various factors such as usage patterns, schema design, column types, and storage needs are important for efficient operation.
Because BI systems typically contain massive quantites of data, tools and techniques for managing, summarizing, querying, and analyzing the data are critical. In this section you will learn critical factors relating to a data analysis system.
Data analysis in conceptual system architecture
When you design ways to analyze data from your warehouse, you will need to think about issues, including the following:
• OLAP. In addition to providing data storage, an OLAP engine such as SQL Server Analysis Services also facilitates is designed for analysis of business measures, optimized for bulk loads, and superior performance for business intelligence and large complex queries. Data is organized and preprocessed into multidimensional cubes based on a dimensional model that enables you to rapidly summarize information for analytical queries.
• Data Mining. You can use sophisticated and complex data mining algorithms to analyze the data for exposing interesting information useful for decision makers. You can create complex models, browse and query them, perform predictions against those models, and test the model's accuracy. Choosing an appropriate data mining algorithm for a specific business problem requires testing various algorithms separately or together and exploring the results.
Presenting meaningful and visually appealing information in interesting ways is critical to helping analysts, managers, and leaders make informed decisions. In this section, you will learn critical factors relating to a data presentation system.
Data presentation in conceptual system architecture
When you plan ways to present data from your Business Intelligence warehouse, you will need to think about many issues, including the following:
• Navigation. The presentation layer provides easy access to complex query results in a rich user interface that facilitates navigation. Users have the ability to interactively explore the data by drilling, pivoting, and drag-drop capabilities. Visualizing and navigating through data to analyze root cause helps organizations gain deeper insight into what business drivers matter the most.
• Format. Depending on the type of information available and the message that you want to deliver, it is important to choose an appropriate format. Choosing the right format such as chart, graph, table, report, dashboard, or Key Performance Indicator (KPI) depends on the usage patterns such as trends, behaviors, comparisons, correlation, change, classifications, or facts.
• Host. Choosing an appropriate client depends on the business needs. If the business demands access to information at any time, anywhere, and through any device, the design needs to accommodate such a requirement. Given the broad spectrum of client devices available in the market today – desktop PCs, Tablet PCs, and handheld mobile devices such as Pocket PCs and Smartphones – your client presentation can be a thin client, smart client, or mobile client. However, each type of client has its advantages and disadvantages. When designing your application, you will need to carefully consider the specifics of your situation before you can determine which is appropriate.
Regardless of which layer of the data flow you deal with, several issues must be constantly addressed. In this section, you will learn about the critical factors: metadata, security, performance, and operations.
General concerns in conceptual system architecture
When you plan ways to present data from your Business Intelligence warehouse, you will need to think about many issues, including the following:
The conceptual framework presented in this section helps simplify the complexity of the BI system. You can use the conceptual framework to navigate scenarios, architecture and design challenges, solutions, technical options, and also the interaction between layers and sub-categories. This framework is a starting point and will be explored extensively in more detail in subsequent chapters.