Thursday, August 7, 2008

Data Loading

ETL – Data Loading

Overview
Data loading is the process of loading processed data into the data warehouse. This chapter explains common data loading scenarios, challenges, and best practices.
Phases for Loading the Warehouse
There are four processes associated with loading the data warehouse:
• Warehouse Initialization. This happens only once. You create the required physical data structures, and then populate any tables that are not fed from the data source. For example, use a script to populate the date dimension table (see the sketch after this list). You also load other static tables, insert Unknown Member placeholder values into dimension tables, and prepare any metadata tables that will be used by the ETL processes.
• Historical Load. This typically happens only once. The goal is to load into the data warehouse historical data that has been archived from the OLTP system. This data is often stored in offline storage such as flat files or backup devices. Inconsistencies in file structures and database schema design over time need to be considered before building the ETL process for the historical load. Historical loads may require you to consider the following options:
o Load only those dimensions that are referenced in fact tables
o Consider inferred dimension members (created while loading the corresponding facts) instead of loading all dimensions up front
o Check whether all historical data actually needs to be loaded
o Consider aggregating historical data so dimensions can be rolled up
• Full Refresh. This typically happens only once, but you may choose to periodically resynchronize the data warehouse with the source systems. If you choose to completely refresh some tables but not others, be especially careful of any surrogate key values that change their meanings. Consider limiting the resynchronization to dimensions, not facts.
• Incremental Load. This is the standard, ongoing load process. You typically schedule the load on a frequent periodic basis, whether hourly, daily, weekly, or monthly. It is also possible to design a real-time incremental load that continually feeds changed records from the source system to the data warehouse.
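For the warehouse initialization step above, a minimal sketch of a date dimension seed script follows; the DimDate table and its columns are assumptions for illustration, not part of any specific schema.

-- Seed one year of rows into an assumed DimDate table, then add the Unknown Member placeholder.
DECLARE @d DATETIME;
SET @d = '20080101';
WHILE @d < '20090101'
BEGIN
    INSERT INTO dbo.DimDate (DateKey, FullDate, CalendarYear, CalendarMonth, DayOfMonth)
    VALUES (CONVERT(INT, CONVERT(CHAR(8), @d, 112)), @d, YEAR(@d), MONTH(@d), DAY(@d));
    SET @d = DATEADD(DAY, 1, @d);
END
-- Unknown Member placeholder row referenced by failed dimension lookups.
INSERT INTO dbo.DimDate (DateKey, FullDate, CalendarYear, CalendarMonth, DayOfMonth)
VALUES (-1, '19000101', 1900, 1, 1);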
Challenges during the Loading Process
Loading data into the data warehouse is often the most time-consuming part of ETL. It also has a high potential for affecting reporting and analytical users. The following are some of the key reasons the loading process can be resource intensive:
• Dependent objects such as facts can be loaded only after the corresponding dimensions are loaded.
• Dimension types (slowly changing dimensions, rapidly changing dimensions) must be identified and the incoming data changes applied accordingly.
• Fact rows carry dimension surrogate keys as well as individual business keys that must be resolved.
• Because of the high volume of loads, index statistics quickly become stale and indexes need to be reorganized.
• Fact tables often include calculated measures that require calculation time. This can become significant when the calculations are complex or the data volumes are very large.
• Disk space utilization is higher than in the source systems because the warehouse retains history.
• Data access queries are more intensive, which raises the requirements for memory and for hardware support such as disk I/O.
• Periodic housekeeping tasks such as moving the prior month’s data from the current month table to a history table can take significant time.
• Aggregation tables must be reloaded when the data changes.
• Snapshot data records must be flagged to indicate which version is current. For snapshot fact loads, marking the old snapshot records involves identifying the previous fact records and updating them with an end date. Identifying the previous set of records can take a significant amount of time, because it involves all the snapshot records from the previous synchronization and it requires an update.
• Cross-database (or cross-server) joins between the staging database and the warehouse database can slow down loads dramatically.
Loading Guidelines
Avoid Design Errors
Prepare a data mapping as part of your ETL planning process.
Prepare a data map between the source and the warehouse, including the intermediate transformations that carry business logic and data corrections. If the destination is a derived field, describe the logic for the derivation. Typically, the data mapping is a spreadsheet containing each source table, source column, data type, and business meaning for source fields, and the data warehouse table, column, data type, business meaning, derivation logic, and comments for warehouse fields. The mapping document is the source for implementing the transformations and loading the warehouse. The data mapping document is an enhanced version of the data dictionary, and it evolves after the warehouse has been dimensionally modeled.
Ensure the data is loaded in an appropriate order
When loading the transformed data into the warehouse, first populate static dimensions such as Date and Time, then the lookup dimensions such as the Status dimension. Next load dimension tables, association tables, bridge tables, and outriggers, and lastly load the fact tables.
Do test loads to verify correctness and to estimate running times
Once the data is loaded successfully, run test scripts against the loaded data to ensure it was loaded correctly. Test scripts are of two types:
Data count checking: Verifies that the expected number of records, based on the number of transactions extracted from the source, was loaded into the warehouse. This is purely a count check; no data validation is required. The count check should be done for each data warehouse object, including dimension and fact tables.
Data Validation: This process validates the loaded data, including descriptions, surrogate key references from facts, and measure values. The queries should take the extracted data, determine the transformed values that should have been written to the warehouse, compare them against the records that were updated or inserted during loading, and report any mismatches to ensure the quality of the loaded data. Often this involves comparing aggregated totals from the data warehouse with aggregated totals from the staging database, the source database, or legacy reports.
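As a hedged illustration of both checks (the table names, the BatchID column, and the measure names are placeholders, not from the source):

-- Count check: rows staged for this batch versus rows actually loaded.
SELECT (SELECT COUNT(*) FROM StagingDB.dbo.StageSales WHERE BatchID = 42) AS StagedRows,
       (SELECT COUNT(*) FROM dbo.FactSales WHERE BatchID = 42) AS LoadedRows;

-- Data validation: compare daily totals computed from staging with totals now in the warehouse.
SELECT stg.SalesDate, stg.StagedAmount, dw.LoadedAmount
FROM (SELECT SalesDate, SUM(Amount) AS StagedAmount
      FROM StagingDB.dbo.StageSales
      GROUP BY SalesDate) AS stg
FULL OUTER JOIN
     (SELECT d.FullDate AS SalesDate, SUM(f.SalesAmount) AS LoadedAmount
      FROM dbo.FactSales AS f
      JOIN dbo.DimDate AS d ON f.DateKey = d.DateKey
      GROUP BY d.FullDate) AS dw
  ON stg.SalesDate = dw.SalesDate
WHERE stg.StagedAmount <> dw.LoadedAmount
   OR stg.SalesDate IS NULL
   OR dw.SalesDate IS NULL;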
Collect and troubleshoot the failed records
In a typical data warehouse load, it should not be assumed that all records will be loaded successfully without any problems. Records commonly fail to load because of:
• Schema changes in the warehouse: DDL statements were issued against the warehouse without modifying the underlying loading procedures.
• Data type mismatches: The data type of a column in the warehouse was changed.
• Measures out of limit: This is particularly likely when loading aggregate tables, where measure data exceeds the declared limit. Example: the total number of books bought by a customer in a day was not expected to reach 37567 (so the column was declared as smallint), but the limit was reached because of a discount promotion in the store.
• Surrogate key reaching its limit: The surrogate key was declared as an integer, but because of Type II changes the limit was reached.
• No reference in dimensions: While loading facts, the corresponding dimension surrogate key could not be found, so the fact data could not be loaded.
• Unexpected data: Loading records with a value that is not present in the lookup tables. Example: the customer status was not expected to contain Inactive.
• Nullable data: Measures are not expected to contain NULLs, or business keys are not supposed to have NULL values, but the incoming data contains NULLs because of source data problems or transformation errors.
• Loading problems (surrogate key increment): The surrogate key was not declared as IDENTITY in the dimension because it is enforced by procedures, but the procedures failed to increment the surrogate key, causing data consistency problems (assuming there is no primary key on the surrogate key).
• Server unavailable: The server went down during the loading process, but no exception was raised because of problems in the loading package.
• Locking problems: Data access queries may have taken locks that block the loading process from loading the data.
As a solution, collect these failed records with an appropriate error message so that troubleshooting can eliminate future occurrences.
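One possible reject-table layout, sketched here with assumed names (the loading package or procedure redirects failed rows into it):

CREATE TABLE dbo.LoadReject (
    RejectID     INT IDENTITY(1,1) PRIMARY KEY,
    BatchID      INT            NOT NULL,            -- load run that produced the failure
    TargetTable  SYSNAME        NOT NULL,            -- warehouse table that rejected the row
    ErrorMessage NVARCHAR(2000) NULL,                -- message captured from the load step
    RawRecord    NVARCHAR(MAX)  NULL,                -- original record kept for troubleshooting
    RejectedAt   DATETIME       NOT NULL DEFAULT GETDATE()
);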
Enhance Load Performance
Consider loading data in chunks in case of huge volumes
Divide the whole load process into smaller chunks, populating a few files at a time. This way you can run smaller loads in parallel and keep some parts of the data warehouse up and running while loading the other parts.
Three parallelism techniques are typically followed:
Divide and load: A sequential file can be split into smaller files, and these smaller files are loaded in parallel into the destination table. This works with updates too, but make sure the update row locks do not escalate, because all of these parallel tasks work on the same destination table.
Divide the process: Sequential tasks of a process, such as lookup and load, can be divided into parallel processes so that records whose lookup is complete can proceed to loading while the lookup is still running for the next set of records.
Independent tasks in parallel: Independent tasks, such as loading Fact1 and loading Fact2, can run in parallel, but be aware that loading dimensions and loading facts cannot run in parallel.

Minimize warehouse downtime by using partition switching
Using partition switching, you can load a huge volume of data into warehouse tables in a short time. First load the data into an empty table that has the same schema as the destination table; this table can be loaded without the overhead of the fact table's full indexing (matching indexes and a check constraint on the partitioning column are added just before the switch). Then switch the table in as a partition of the fact table. The data is not loaded again, as long as it fits into a new partition of the fact table (same schema, same partition range).
Example: Fact table F is partitioned with day as the partition column, so there is one partition per day. To load one day of transactions into the fact table, loading them into a temporary table and then switching it in is a performance advantage, because the whole load lands in one partition of the fact table without causing any downtime for users.
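A minimal sketch of the switch, assuming a fact table FactSales partitioned by DateKey through a partition function pfSalesByDay, and a load table FactSales_Load with the same columns, aligned indexes, and a CHECK (DateKey = 20080807) constraint; all names here are illustrative assumptions.

-- Step 1: bulk load the day's rows into the separate, identically structured load table.
INSERT INTO dbo.FactSales_Load (DateKey, ProductKey, CustomerKey, SalesAmount)
SELECT DateKey, ProductKey, CustomerKey, SalesAmount
FROM StagingDB.dbo.StageSales
WHERE DateKey = 20080807;

-- Step 2: switch the load table into the matching fact partition (a metadata-only operation).
ALTER TABLE dbo.FactSales_Load
SWITCH TO dbo.FactSales PARTITION $PARTITION.pfSalesByDay(20080807);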
Choose an appropriate batch size
Because the batch commit size manages both the size of the batch inserted into the fact table and the grain at which errors are captured, a reasonable balance between these two factors should be established. Avoid setting the batch size to 0 (everything in one batch) or 1 (one batch per row).
Avoid Explicit Referential Integrity Constraints
If you can ensure that referential integrity conditions are met at the source or in the staging area, you do not need to add explicit constraints in the data warehouse. This can make loading the warehouse tables faster.
Minimize use of Explicit Transactions
Because a data warehouse is typically loaded as a batch, you can make the load more efficient by avoiding explicit transactions during the load. Add checkpoints to the load procedures, and log the beginning and completion of each task. This allows you to restart a task without requiring transactions. Loading data into a separate partition and merging the partition after the load is successful also reduces the need for using explicit transactions when loading tables.
Use Identity column for dimension Surrogate Key
Using an identity column for a table's surrogate key is generally more efficient than using code to determine the next available surrogate key value. If you do not need the additional control that comes from assigning surrogate keys manually, use an identity column.
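A minimal sketch with assumed column names:

-- Surrogate key generated by the engine instead of lookup-and-increment code in the load procedure.
CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    CustomerID   NVARCHAR(10)      NOT NULL,              -- business (application) key
    CustomerName NVARCHAR(100)     NULL,
    City         NVARCHAR(50)      NULL,
    RowStartDate DATETIME          NOT NULL,
    RowEndDate   DATETIME          NULL                   -- NULL marks the current Type II row
);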
Pre-calculate custom measures
If any calculated measures in a fact table can be derived solely from data in the staging area, you can reduce the load time by performing the calculations during the Transformation stage, rather than waiting for the Loading stage. For example, if the staging database receives an extract of the total sales for one day, the Transformation process would be able to calculate the percent of total daily sales for each row of the staging table. Calculating the percent of monthly sales would, however, not be possible purely from the staging database.
Select an appropriate time for loading based on data requests
Loading data will always have some impact on concurrent queries of the data warehouse. If your organization has clearly identified "off hours," such as the middle of the night, schedule the load during that period. If there is no off-hours window available, still schedule the load at a predictable time so that reporting users can plan accordingly.
Choose an appropriate loading window parameter
Maintain a table with the maximum acceptable duration for each load operation, scaled as needed by actual data volumes. For example, for a particular fact table, the appropriate rate may be 1 GB of data per hour. If the process takes moderately longer than the limit, it may need some tuning. If the process takes substantially longer than the limit, there may be a critical error that requires the process to be halted. If you store actual load times, you can report on the deviation from the expectation. This information is typically stored in the metadata containing the task status.
Use an incrementing key to identify current snapshot
Rather than flagging snapshot fact table rows with a Current flag that must be updated when new data is loaded, use an AsOfDate key. A dimension table can then indicate which AsOfDate is the current one. This eliminates the need to execute an UPDATE on large numbers of fact table rows, but it does add the requirement of a join when querying the fact table.
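A hedged sketch of the query side, assuming a FactAccountSnapshot fact table and a DimAsOfDate dimension that marks the current snapshot:

-- Read only the current snapshot without any Current flag on the fact rows.
SELECT f.CustomerKey, f.BalanceAmount
FROM dbo.FactAccountSnapshot AS f
JOIN dbo.DimAsOfDate AS a ON f.AsOfDateKey = a.AsOfDateKey
WHERE a.IsCurrent = 1;   -- only this one dimension row changes when a new snapshot is loaded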
Enhance Warehouse Reporting
Create appropriate indexes on warehouse tables
Proper indexes are essential for acceptable query performance.
Following are some of the guidelines for indexing fact tables (see the sketch after this list):
• Include the fact keys in the clustered index.
• Include business keys in the fact schema in addition to the dimension surrogate keys.
• Make sure columns used to join fact and dimension tables are properly indexed.
• Monitor the execution plan of data request queries to see any unnecessary inefficiencies.
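A minimal sketch of these guidelines, with assumed table and column names:

-- Clustered index on the fact keys (the date key leads to favor range scans).
CREATE CLUSTERED INDEX CIX_FactSales
    ON dbo.FactSales (DateKey, ProductKey, CustomerKey);

-- Nonclustered indexes supporting joins to the dimension tables.
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey ON dbo.FactSales (CustomerKey);
CREATE NONCLUSTERED INDEX IX_FactSales_ProductKey ON dbo.FactSales (ProductKey);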
Ensure Update Statistics are run after load
Be sure to have up-to-date statistics on the dimension table indexes, either by running the UPDATE STATISTICS statement after the data load or by using the "Auto Update Statistics" database option. The Update Statistics task can be used to refresh the statistics, enabling SQL Server to calculate query execution plans more efficiently. In general, a large data load can leave the statistics inaccurate, so it is suggested to run the task after any major data load or update process.
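For example, after a load (the table names here are assumed):

-- Refresh statistics on the tables touched by the load; FULLSCAN trades time for accuracy.
UPDATE STATISTICS dbo.DimCustomer WITH FULLSCAN;
UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;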
Reorganize indexes after loading huge data volumes or multiple delta synchs
Consider reorganizing indexes after loading a large amount of data, particularly if you observe performance degradation. Consider using the Reorganize Index task to compact and restructure the indexes rather than dropping and recreating them, because it can run while the system is in use.
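A sketch of the online-friendly option, with assumed index and table names:

-- Reorganize (compact) the index rather than rebuild it, so the table stays available to queries.
ALTER INDEX CIX_FactSales ON dbo.FactSales REORGANIZE;
-- Or reorganize every index on the table:
ALTER INDEX ALL ON dbo.FactSales REORGANIZE;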
Avoid locking for data access queries
A data warehouse is not a transactional system; typically the only process that updates the warehouse is the data load procedure. Allowing dirty reads from the warehouse makes both loading and querying faster, with no increase in data quality risk. There is no data inconsistency to read, because the only process that issues DML statements is the loading process, which typically runs at a particular time of day and generally does not use explicit transactions (explicit transactions in the loading process are discouraged). Uncommitted records can be accessed by using the NOLOCK table hint or by setting TRANSACTION ISOLATION LEVEL READ UNCOMMITTED at the beginning of the data access query.
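For example, either form below allows reads without blocking the load; the query itself is just a placeholder with assumed names:

-- Option 1: per-query table hints.
SELECT d.CalendarMonth, SUM(f.SalesAmount) AS MonthlySales
FROM dbo.FactSales AS f WITH (NOLOCK)
JOIN dbo.DimDate AS d WITH (NOLOCK) ON f.DateKey = d.DateKey
GROUP BY d.CalendarMonth;

-- Option 2: session-level setting at the top of the reporting batch.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;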

Managing Dimension Lookups and Surrogate Keys
Surrogate keys are an important concept in a data warehouse. They:
• Allow multiple source systems to be integrated
• Allow changes in history for dimension to be captured
• Allow retaining old members (even if application key is recycled)
You need to generate a new key the first time a record is entered, but re-use the existing key if one is available. When re-loading dimension data, you must use the same keys as previously, or reload all the fact tables that use the surrogate key as a foreign key.
Adopt a hashing technique for identifying Type I and Type II changes
Details about Type I and Type II changes are explained in Dimensional Modeling chapter.
The major time-consuming part of the loading process is identifying the updates and inserts to perform on the warehouse. Traditionally, the Type I columns of a dimension are compared with the incoming incremental data to determine whether a change exists; if so, the fields on the corresponding dimension records are updated. The same comparison is performed separately for the Type II columns, to add new records to the dimension. Comparing a large number of dimension fields adds overhead and degrades system performance. It is suggested to compute a hash over the Type I fields and another over the Type II fields and store both on each dimension record; computing the hash values is part of the dimension loading process. When incoming data is to be loaded into the warehouse, the hashes are first calculated for the incoming records and matched against the Type I hash of the dimension record. If they differ, all the Type I fields are updated straight away. The effect is the same for the Type II fields, except that instead of an update, a new record is inserted.
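A minimal sketch of the comparison, assuming the dimension stores HashType1 and HashType2 columns (computed with HASHBYTES when its rows were loaded) and that Phone/Email are the Type I columns and City/Region the Type II columns; all of these names are illustrative assumptions.

-- Compute hashes for the incoming rows over their Type I and Type II column sets and compare.
SELECT s.CustomerID,
       HASHBYTES('SHA1', ISNULL(s.Phone, '') + '|' + ISNULL(s.Email, '')) AS IncomingHashType1,
       HASHBYTES('SHA1', ISNULL(s.City, '') + '|' + ISNULL(s.Region, '')) AS IncomingHashType2,
       d.HashType1, d.HashType2
FROM StagingDB.dbo.StageCustomer AS s
JOIN dbo.DimCustomer AS d
  ON d.CustomerID = s.CustomerID
 AND d.RowEndDate IS NULL;           -- compare against the current dimension row only
-- IncomingHashType1 <> HashType1  =>  update the Type I fields in place.
-- IncomingHashType2 <> HashType2  =>  expire the current row and insert a new Type II row.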
Enforce constraints in load procedures rather than declaratively
Do not create declarative PK or FK constraints between dimension and fact tables; to enhance performance, enforce them in the loading procedures instead. This makes a visible difference in load times, particularly for large data volumes. This option is adopted when the fact table is huge.
Move data to the warehouse server and then load, for huge data volumes and cross-server loads
In most data warehouses, you must match data from the staging database to existing records in the data warehouse. Assuming you have a conventional star schema, you have a fact table with two kinds of fields: surrogate keys and measures. When it is time to populate the fact table, the common task is to transform application dimension keys into surrogate keys:
For fact table rows, you need to look up the surrogate key in the dimension tables
For dimension table rows, you need to check for Type I or Type II attribute changes.
In the case of huge data volumes loaded against an already large volume of data in the warehouse, this causes heavy network use and performance problems for cross-server record lookups. In the diagrams below:
The traditional approach diagram shows how the warehouse is typically loaded across two different servers by accessing records one by one from the warehouse to determine existence, attribute changes, and dimension lookups.
Traditional Approach


Enhanced Approach
The enhanced approach is a two-step process. It is particularly useful for low-bandwidth network connections as well as high-volume loads.
First, all the data to be loaded into the warehouse is moved to a set of work tables on the warehouse server, using a simple insert statement from staging into those tables.
The data from these temporary tables is then loaded into the warehouse tables after checking for existing data, attribute changes, and dimension lookups for fact loads. This gives a performance advantage because there are no cross-server or cross-database calls while loading the dimension and fact tables.
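A sketch of the two steps with assumed names; the work table lives in the warehouse database so the second step is entirely local:

-- Step 1: simple cross-server copy of the delta into a work table on the warehouse server
-- (StagingServer is an assumed linked-server name).
INSERT INTO dbo.Work_Customer (CustomerID, CustomerName, City)
SELECT CustomerID, CustomerName, City
FROM StagingServer.StagingDB.dbo.StageCustomer;

-- Step 2: local existence check and load; no per-row cross-server calls.
INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, City, RowStartDate)
SELECT w.CustomerID, w.CustomerName, w.City, GETDATE()
FROM dbo.Work_Customer AS w
LEFT JOIN dbo.DimCustomer AS d ON d.CustomerID = w.CustomerID
WHERE d.CustomerID IS NULL;          -- new members only; attribute changes are handled similarly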



Dimension Key Lookup with SSIS
Using a pure SQL approach, it is natural to use an outer join to match fact table rows to dimension keys. SSIS has a Lookup transformation. How does the Lookup compare to an outer join? Consider the following query:
SELECT
    TimeKey          = COALESCE( Northwind_Mart.dbo.Time_Dim.TimeKey, 1 ),
    CustomerKey      = COALESCE( Northwind_Mart.dbo.Customer_Dim.CustomerKey, 1 ),
    ProductKey       = COALESCE( Northwind_Mart.dbo.Product_Dim.ProductKey, 1 ),
    LineItemTotal    = [Order Details].UnitPrice * [Order Details].Quantity,
    LineItemQuantity = [Order Details].Quantity
FROM Orders
INNER JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID
LEFT JOIN Northwind_Mart.dbo.Product_Dim
    ON [Order Details].ProductID = Northwind_Mart.dbo.Product_Dim.ProductID
LEFT JOIN Northwind_Mart.dbo.Customer_Dim
    ON Orders.CustomerID = Northwind_Mart.dbo.Customer_Dim.CustomerID
LEFT JOIN Northwind_Mart.dbo.Time_Dim
    ON Orders.ShippedDate = Northwind_Mart.dbo.Time_Dim.theDate
WHERE Orders.ShippedDate IS NOT NULL
The outer joins make sure that rows with missing or wrong dimensions are still imported: the COALESCE function inserts a constant key that refers to a dummy member inserted in each dimension to identify "unknown" values.
The pure SQL approach works well with few dimension tables, but becomes slow with many dimensions (like 30 or more), especially when some of these dimensions require complex join conditions.
With SSIS, one approach is to sort the fact rows and the dimension rows and use a Merge Join transformation. The Merge Join transformation is efficient, but the Sort that precedes it is not, especially for a large fact table, and especially since the fact table needs to be re-sorted for each dimension. With Service Pack 1, SSIS allows more control over the Sort transformation, but it is still extremely resource intensive.
Another approach is to use the Lookup transformation. With the Lookup transformation, by default, the entire lookup table is loaded into memory the first time it is used. This allows subsequent rows to be processed efficiently against the memory-resident list, provided that the lookup table is small enough to fit easily in memory. With large dimensions, you can still use the Lookup, but limit the cache size, down to the logical limit of a single query for each row. When writing a customized query for the Lookup transformation, be sure to restrict the columns to only the application key and surrogate key fields. Also be sure to check the "Enable caching" checkbox in the Advanced properties of the transformation; otherwise each lookup will send a query to the database, even if the value is the same as the previous lookup.
The results are very good: low memory consumption (rows run in the pipeline without stopping for sort operations), good tuning possible for large dimensions, low load on the database server. With large numbers of lookup tables, the Lookup transformation is very likely to out-perform a single SQL query with multiple outer join clauses.
Handling Missing Keys on Dimension Lookup in SSIS
When using the SSIS lookup function, you still need to define an "unknown" member value for values that aren't found in the lookup table. This is equivalent to using a COALESCE function in an outer join SQL query.
There are two ways to do this in SSIS:
1. Configure the Lookup transformation to redirect non-matching rows to a separate output (error output), then use a derived column to specify a default value, and finally merge both the lookup success output and the output of the derived column using a union all transform.
2. Configure the Lookup transformation to ignore lookup failures and pass the row out with null values for reference data. A derived column downstream of the lookup then checks for null reference data using 'ISNULL' and replaces the value with a default value. This method is analogous to using a SQL outer join to find the missing rows.
In SSIS, the Union All transformation creates a new buffer and copies the values from the input streams into the output buffer. This is less efficient than working in the single buffer, so the second technique is better in most cases. You may want to use the error output approach if you intend to do secondary matching, such as using a Fuzzy Lookup to test for slightly misspelled words.
Performance Considerations for loading data
Consider the following Guidelines:
Slowly Changing Dimensions Handling Options:
• SQL outer joins. These can be slow with many joins. Use a full outer join to detect deleted rows (NULL on the left = deleted, NULL on the right = new, a match = check for changed attributes), as sketched after this list.
• Use temporary table to minimize number of joins. Use hash keys to minimize number of comparisons.
• If the relational data store includes a partitioned fact table, the data loading process should leverage parallelism. For parallel loads, consider one thread per partition.
• If you need to temporarily store the contents of a pipeline on disk, use the SSIS Raw file destination. This can be read only by the SSIS Raw File source.
• When possible, use SQL set operations (in the Execute SQL task) rather than individual OLE DB updates and inserts, which send one row at a time. When performing multiple transformations on a single row, however, the SSIS pipeline is more efficient than multiple sequential set operations.
• The SCD transformation generates one call to the database for each row in the pipeline. You can implement the logic of the SCD transformation by using lookups (especially on a hash key). Alternatively, you could create a stored procedure to handle the SCD logic.
• Dimension lookups (used for loading facts) are effective for smaller dimensions and larger-memory systems, because SSIS will try to load the dimension into memory.
• Consider batch updates for larger facts or dimensions
• Making use of pre-joined temp tables (for example, three dimension tables joined into one containing the required subset) or views with the required set of lookup values enables better performance in the case of larger dimension tables; consider indexed views in this scenario. In one case, we replaced the three-table join with one pre-joined temp table and passed records from this single table through the key lookup instead of joining them on the fly. If your driving joins are complex and not indexable and the data size is huge, this may be worth trying.
• For huge-volume loads, consider intermediate commits (committing to disk at regular intervals); be aware that this cannot be set on the SQL Server destination, only on the OLE DB destination.
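A sketch of the full outer join pattern from the first bullet above; table and column names are assumptions:

-- Full outer join of current dimension members with the incoming extract to classify each row.
SELECT d.CustomerKey, s.CustomerID,
       CASE WHEN s.CustomerID IS NULL THEN 'Deleted'        -- in dimension, missing from source
            WHEN d.CustomerKey IS NULL THEN 'New'           -- in source, missing from dimension
            WHEN d.CustomerName <> s.CustomerName
              OR d.City <> s.City THEN 'Changed'            -- attribute (or hash) comparison
            ELSE 'Unchanged' END AS ChangeType
FROM (SELECT CustomerKey, CustomerID, CustomerName, City
      FROM dbo.DimCustomer
      WHERE RowEndDate IS NULL) AS d                        -- current members only
FULL OUTER JOIN StagingDB.dbo.StageCustomer AS s
  ON d.CustomerID = s.CustomerID;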
Implementation Considerations
Consider the following Guidelines:
• If you are loading only rows from the current time period into the fact table, restrict the dimension lookup rows to only current rows as well, particularly if you are using the SSIS Lookup transformation and will be loading the lookup table into memory.
• Use an outer join for the lookup, in conjunction with a NULL comparison, to find only the records of interest when populating the lookup data set.
• A custom task is required to produce a Type III effect on dimensions, with the help of identifying fixed-value attributes using the SCD transformation.
• The Shrink Database task enables recovery of unused pages from tables.
• A left Merge Join transformation was used, with the source records on the left matched to the current dimension members on the right across the business key. The effect of this join was to stream in only those Item records that were actually used in the relevant facts. Columns needed from the dimension for change-type analysis were included in the data flow for matching records. A left merge was used so that new records from the source (left side) would continue down the pipeline, where they would be added to the dimension as new members.
• Look up a text file:
The OPENROWSET SQL function now allows a BULK clause, which accepts a file name and a format specification. An OLE DB driver for flat files or XML would also do the job, but this is a workaround using the native stack. The only difficulty is getting the format right in the .fmt file, but after that it just works:
SELECT * FROM OPENROWSET(BULK N'c:\temp\test.txt', FORMATFILE = 'c:\temp\test.fmt') AS x





Maintenance of the Data Warehouse:
The following guidelines will help maintain the warehouse over the long haul.

Disk defragmentation: Data is loaded into the warehouse in every delta synch, and can be huge in volume depending on the delta interval or the transaction rate at the source, so the disk becomes fragmented. Disk fragmentation can have a very large effect on system response time for report and analysis queries. It is important to monitor this situation and reorganize the data files when necessary.

Index defragmentation: Indexes are key for query performance, particularly on warehouse systems with large data volumes. As records are added and updated in the warehouse, the indexes become fragmented over time, which may require multiple lookups of the keys to find the data pages, reducing performance. It is suggested to monitor index fragmentation and to defragment (reorganize) indexes rather than rebuilding them, to reduce downtime.

Table defragmentation: Data files might have been placed on multiple disks when the schema was created, to provide maximum concurrent access across disks and speed up query performance. Over time, however, a large amount of data may have been loaded onto one particular disk, requiring further reorganization of the data files across disks. This needs a study of the queries and of the data distribution across disks.

Index tuning: Indexes need to be tuned over time, because the initial set of indexes may no longer be sufficient for the current set of data request queries. Check the query execution plans to see whether the queries use the corresponding indexes. This process may identify unused indexes (saving load time and disk space) and may also identify a new set of indexes to create for better query performance.
The Index Tuning Wizard can help tune the indexes by checking the current set of indexes against the data access queries to determine which indexes should be dropped or created on the underlying tables.

Partition tuning: Some partitions may become overloaded over time relative to others. Example: there might be 100,000 customers for DepartmentA and 10 customers for DepartmentB. If the data access queries mostly touch data from DepartmentA, then they take about as long as if there were no partitioning. Consider revamping the partitions over time to get a more uniform distribution and better query performance.

Aggregates: The current set of aggregations may no longer be sufficient for the queries, forcing aggregations to be performed at run time or manually. The aggregation granularity may also change over time because of changing business requirements. The aggregation strategy should be revisited to change the aggregation interval and granularity so that queries perform better by accessing pre-aggregated values.
Example: Data was aggregated on a monthly basis when the schema was designed, with separate monthly aggregation tables. Over time, the business requirements add weekly aggregations for analyzing weekly system throughput, and queries are forced to compute weekly aggregations at run time because the system does not provide them.

Performance of each task: A loading or transformation task may slow down because of the large data volume in the targeted table. This is particularly visible if Type I or Type II changes are applied to the underlying dimension tables and their data volume has grown large over time. Consider changing the process of identifying the records subject to Type I or Type II changes, for example by using hash comparison or the SSIS SCD objects, by changing the schedule, by scaling up resources, or by loading the data in chunks.

Update statistics: SQL Server maintains statistics about the underlying data for each table, and the query plan's choice of a particular index depends on those statistics. If the statistics are not updated, the query optimizer may choose a plan that is not optimal for the current data. This becomes visible in most queries once the statistics are out of date.
Example: When the statistics were last updated there were 100 records, so the optimizer chose a table scan because it involved only 2 pages. Now the cardinality has grown to 100,000 records, and if the optimizer still chooses a table scan, performance drops. The solution is to update the statistics so that the optimizer recognizes the change and selects the correct plan, including the appropriate index, for better query performance.

Grooming: Having a large volume of data in a targeted table, particularly a fact table, hinders the performance of loading as well as statistics maintenance. If the reports and analysts need data up to 2 months old, and data older than 2 months is rarely used, then moving fact data older than 2 months to another set of tables (or to aggregates) boosts query performance and also reduces loading time.
