
Thursday, August 7, 2008

ETL data extraction

ETL – Data Extraction

In this Chapter
● Different data extraction models and techniques
● Techniques for handling large volumes of data
● Techniques for delta detection
● Validating extracted data
Overview
A good understanding of ETL project data requires knowledge of the internal workings of a given source system, knowledge that can sometimes be obtained only from hard-coded tables, undocumented methods, and limited access to the original developers. Don't underestimate the magnitude of this issue! Consider expert consulting, in addition to programming staff and operators, when it's necessary to dig in and accurately understand the details of the ETL data.
Data Extraction Models
This section describes different models for extracting data at a conceptual level.
Pull, Push and Hybrid Extraction Models
There are basically two different approaches for extracting data from the source system and transferring it to the destination. You can also combine the two approaches into different types of hybrid models. Following is a high-level description of the two basic models and the hybrid model, with guidelines for different implementations:
• Push Model. In the push model, the source system initiates the data extraction process, which typically results in extracting and moving the data to the staging database.
• Pull Model. In the pull model, the staging database initiates the data extraction process, issuing SQL statements to query the data from the source system.
• Hybrid Model. In the hybrid model, the source system extracts data into an intermediate store, from which the staging system picks it up at regular intervals.


Figure 1
Push and Pull Model

Figure 1 illustrates the Push and Pull methods for extracting data from the source to the destination. In the diagram, "code" refers to the extraction routines or scripts that extract the data from the source systems.
Guidelines for using a pull method
You should use the pull method when it is necessary to minimize the impact on the source server. This may be the case when the source system is limited in disk space, CPU or memory, or when you do not have access to the source applications or systems and cannot enhance them for extraction process. In the pull method, the staging server initiates the extraction, prepares extraction queries, tracks the failures and maintains necessary metadata. This is the most common type of data extraction process, largely because of the need to minimize changes to the source applications.
Following are guidelines for extracting data by using the pull method:
• Use a scheduler to launch the extraction process during off-peak hours for the source system or, in cases where there is no off-peak time, at specific periodic intervals.
• Store the metadata required for the extraction on the staging server, and use the staging server to maintain the status of all extractions.
• Configure the extraction process to read the metadata for configuration parameters and to configure source connections. Make the extraction process begin by checking the schema to see whether there have been any modifications to the source since the last extraction.
• Configure the extraction process to prepare the extraction queries based on the metadata. This includes information such as the list of tables to extract and how to identify the appropriate delta period.
• Configure the extraction process to keep track of the status of each source table. Create a status code (such as initialized, in progress, completed, or failed) for each stage of the extraction. During each extraction, update the status code in the metadata table (a sketch of such a status table follows this list). Use the status code to manage recovery, whether immediately, at a specified recovery interval, or during the next normal load process. A simple restart may be sufficient if the problem was a failed connection, but actual changes to the source may require manual intervention to change the extraction process or the staging database.
• After the extraction, check that all the data transferred successfully, perhaps by comparing total row counts. In addition, you can check the data for errors such as primary key violations, null violations, and date/time violations (or you can defer quality checks to the transformation process). Also, after the extraction, update the metadata information needed to manage the next extraction.
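
As a minimal illustration of this status tracking, the following T-SQL sketch assumes a hypothetical ExtractionStatus table on the staging server; the table, column, and value names are illustrative only, not part of this guidance.

    -- Hypothetical metadata table that records the state of each table extraction
    CREATE TABLE dbo.ExtractionStatus
    (
        SourceSystem  VARCHAR(50)  NOT NULL,
        SourceTable   VARCHAR(128) NOT NULL,
        LastExtracted DATETIME     NULL,
        StatusCode    VARCHAR(20)  NOT NULL,  -- Initialized, In Progress, Completed, Failed
        CONSTRAINT PK_ExtractionStatus PRIMARY KEY (SourceSystem, SourceTable)
    );

    -- Mark the start of an extraction run for one table
    UPDATE dbo.ExtractionStatus
    SET    StatusCode = 'In Progress'
    WHERE  SourceSystem = 'Orders' AND SourceTable = 'SalesOrder';

    -- On success, record completion and the new high-water mark for the next run
    UPDATE dbo.ExtractionStatus
    SET    StatusCode = 'Completed', LastExtracted = GETDATE()
    WHERE  SourceSystem = 'Orders' AND SourceTable = 'SalesOrder';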

A pull-method extraction may be significantly more complex than suggested in the above guidelines. For example, if you extract data from multiple sources asynchronously, the extract from one server may succeed while the other fails, which requires special attention in the recovery process. Likewise, if extractions from different servers happen at different times (for example, one at 10:00 PM and one at 6:00 AM), then the extraction process needs to track each extraction separately.
Guidelines for using the push method
In the push method, the source system initiates the extraction process. This is referred to as push because the staging server does not actively extract the data; instead, the source system pushes delta records to the staging server, which subsequently processes the extracted data. The source system is responsible for extracting the incremental data, tracking the process, restarting in case of a failure, and even moving the data to the staging area.
Use the push system when the source system has special scheduling requirements, when you do not have read access to the source system databases, or when you need near real-time data transfers to the destination. Following are guidelines for using the push method to extract data from the source system.
• Schedule the extraction process at regular intervals, or when certain triggers are fired, such as when certain data values change.
• Store the metadata required to perform the extraction process on the source system (or in a neutral location).
• Choose whether to store the extracted information temporarily in the local source system or to send it directly to the staging server.
• Maintain metadata status information for recovery purposes, just as when using the pull method.

Guidelines for using Hybrid Models
A hybrid model can provide some of the best features of both push and pull. In order to use a hybrid model, you must have some ability to modify processes on the source server. One scenario involves creating triggers on the source database; another involves changing the source application. The hybrid models are better than a pure push model in that the source server does not need to know anything about the staging server—this local isolation is sometimes referred to as disconnected architecture. In addition, hybrid models are better than a pure pull model in that keeping track of delta batches is relatively straightforward.

Scenario 1
In this scenario, triggers on the source database push every transaction to audit tables. The staging server reads the audit tables and manages the triggers. You don’t need to change the source applications, but you do need to be able to create triggers that log transactions.



• Add triggers to the source system to log each transaction to alternating audit tables, using codes for inserts, updates, and deletions.
• In the staging server extraction process, read the metadata to determine which audit table currently receives the logged records, and then switch logging to the other table. For example, if the trigger is logging records to A1, change the trigger to log to A2.
• In the extraction process, extract the data from the previous log file to the staging server and, if successful, truncate the table.
• In the extraction process, update the metadata to indicate which audit table is currently receiving the logged records. This prepares for the next execution.
• In a subsequent process, load the information into the data warehouse.
In this hybrid scenario, each time an extract is successful, the audit table for that source switches to the alternate table. In the example given above, the trigger originally logs to A1. Then for the first extraction, the trigger logs to A2 while A1 data is extracted. In the next extraction, the trigger again logs to A1 while A2 data is extracted. In the case of a failure, the corresponding audit table is not truncated, so a recovery process can extract again from the same table. With this scenario, there is no need to change any application code.
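
A minimal T-SQL sketch of the extraction-side logic for this scenario follows. It assumes hypothetical audit tables SalesOrder_Audit_A1 and SalesOrder_Audit_A2, a metadata table that records which audit table is active, and that the source and staging databases are reachable from the same connection; the trigger itself would read the same metadata to decide where to log.

    -- Determine which audit table the trigger is currently logging to
    DECLARE @active CHAR(2);
    SELECT @active = ActiveAuditTable
    FROM   dbo.ExtractionMetadata
    WHERE  SourceTable = 'SalesOrder';

    -- Switch logging to the other audit table so no transactions are lost
    UPDATE dbo.ExtractionMetadata
    SET    ActiveAuditTable = CASE WHEN @active = 'A1' THEN 'A2' ELSE 'A1' END
    WHERE  SourceTable = 'SalesOrder';

    -- Extract from the previously active audit table, then truncate it
    IF @active = 'A1'
    BEGIN
        INSERT INTO Staging.dbo.SalesOrder_Delta
            SELECT * FROM Source.dbo.SalesOrder_Audit_A1;
        TRUNCATE TABLE Source.dbo.SalesOrder_Audit_A1;
    END
    ELSE
    BEGIN
        INSERT INTO Staging.dbo.SalesOrder_Delta
            SELECT * FROM Source.dbo.SalesOrder_Audit_A2;
        TRUNCATE TABLE Source.dbo.SalesOrder_Audit_A2;
    END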

Scenario 2
In this scenario the application (for example, an ASP.NET web application logging click stream data) logs records to a flat file as illustrated in figure X.
Here the data is not extracted from the source database but is written redundantly by the source application. This scenario can be adopted when the source application collects a large volume of data from the user interface (such as click-stream data). The application creates and switches log files as configured, for example one log file per day, creating a new file at the end of each day.


Figure X
Hybrid extraction – Application logging to flat file
Use a time stamp (e.g., 200608070030.txt) or a sequential number (e.g., ABC00043.txt) as the file name. The extraction process on the staging server then reads the files, loads the data into the staging database, and then deletes the source files. The process can be extended to extract data from multiple servers.
You do have to modify the source application to write to the log file for each transaction in addition to the regular updating of the source database. Writing two entries does add overhead to the source application, but the cost should be minimal, as the source application already knows everything necessary about the transaction, including the type of the change.
Using a Reporting Data Store
The primary purpose of a Reporting Data Store is to access OLTP reports from a data store separate from the OLTP system. This approach minimizes the impact reporting has on source system applications.
This data store contains only volatile, current data. This is different from the staging database in that no history is retained, and that it can be used directly for reporting.
The primary purposes for having a Reporting Data Store are as follows:
• Can be used for loading warehouse
• Allows for reporting that is closer to real time than is available in the data warehouse. You should make sure that data availability is continuous between the reporting data store and the warehouse (for example, if this data store provides the current month, the warehouse should provide data at least up to the previous month).
• Allows for reporting that is more detailed (granular) than available in the data warehouse. Of course, reporting from the reporting data store cannot be more granular than the granularity available in the source system

Scenario I (Using Snapshot/Standby Server as an alternative to OLTP for reports)
The following diagram shows how a standby or snapshot server can fit into the total warehouse architecture. The standby server acts as a data repository to feed data to staging.
Extracting data to the standby server happens in real time or near real time. No transformation is performed before loading the data to the standby server in this scenario, and no data is filtered. Because the schema and volume are the same as the OLTP system's, reports cannot be more detailed (granular) than what is available in the OLTP system. The standby server functions as an alternate OLTP database in case the OLTP system crashes, but under normal operations, this server is used as a reporting repository and as the source for data extraction. The data stored in the reporting data store is not scrubbed or transformed; it simply acts as a replica of the OLTP data source.


Figure X

Scenario 2
In this scenario, a separate Reporting Data Store is created after the staging database, but before the data warehouse. This type of Reporting Data Store provides reports with scrubbed and transformed data, at a finer grain than will be stored in the data warehouse. The data warehouse summarizes and accumulates data from this data store.


The following list summarizes key aspects of a Reporting Data Store
• Data can be extracted from the OLTP to a Reporting Data Store at near-real-time frequency, allowing for a degree of real-time reporting that is not available from the data warehouse.
• The loading of the Reporting Data Store can either go through an extensive ETL cleansing process, or it can be a simple load from the OLTP databases.
• The structure of the reporting data store is different from the OLTP structure but similar to the warehouse structure.
• In most cases, the Reporting Data Store keeps the same keys as in the source system database
• If the data is cleansed and transformed before being inserted into the Reporting Data Store, it can be used for loading to warehouse
• The Reporting Data Store is particularly useful when reports need more granular data over a limited time period. For less granular reports over a longer time period, the main data warehouse is used.
• In some contexts the Reporting Data Store is called an Operational Data Store (ODS)

Data Extraction Mode
As a designer of the ETL process, you should take into consideration the capabilities and constraints of the source system to decide whether the process should run while the source system is online or offline:
• Online Mode. In this mode, the ETL process connects to the source system and extracts data directly while it is online. The extracted data is then transformed and loaded into the warehouse without an intermediate staging step. Data should be extracted with the transaction isolation level set to read committed to ensure consistent data extraction.
• Offline Mode. In this mode, data cannot be extracted directly from the source system. This limitation could be due to not having direct access to the source system, performance constraints, a high transaction rate, disparate systems, and so on. This could potentially result in staging the data as part of the ETL process. In offline mode, the data could be made available in various formats: flat files, XML, logs, tables, or backup/archived tables.





Data Movement Options
As part of the extraction process, the data is first moved out of the source systems and then loaded into a staging database. What methods can be used for the physical process of extracting, or moving, the data out of the source system? Following are the most common techniques, along with their pros and cons:
• Standby Server: Create a standby server for the OLTP system. This can act as the live OLTP system in case the current server needs to be taken offline for any reason. Typically the standby server carries less load than the live server, and it can be used as the source of the extract to reduce the impact on the OLTP system. A disadvantage of the standby server is that its data may have some latency compared to the live server because it is updated by log shipping. An advantage is that it contains only committed data, which helps reduce data inconsistencies.
• Snapshot Server: This is a server on which a snapshot backup is restored. A snapshot backup is a highly efficient backup of the source system and is particularly useful when the source OLTP system is critical and has large data volumes. It can be based on a complete or differential database backup.
• Bulk Copy Utility (BCP): The bulk copy utility (bcp) can be used to export or import large amounts of data from a data source. This utility typically provides high performance compared to other options. It is ideal when used on a separate database, such as a standby or snapshot copy of the OLTP system, where there are no uncommitted transactions.
• SQL Server Integration Services (SSIS) Data Pump: Using the SSIS data pump, you can load data from a source directly to a SQL Server destination. The data is loaded into memory in chunks and then moved asynchronously to the destination. SSIS allows for in-memory transformations with throughput similar to that of BCP.
• Triggers: If the source system is a database that supports triggers, then you can use triggers to log insert, delete, and update transactions to a destination (temporary log tables, files, the staging server, and so on). Triggers can be particularly useful if the source does not maintain a transaction history, if you cannot make changes to the source system, or if the source is a non-database system (for example, a file-based or legacy system). If you are using triggers for extraction, be sure to test performance to measure the increased load on the transactional system.
• Source Applications: Use the OLTP application to write the data redundantly to the staging server, or to tables explicitly designed for the extraction process to read. This allows for a great deal of custom control, but does require the ability to modify the source applications.
• Replication. Typically transaction replication is used for synchronizing two servers in which transactions on the publisher are replicated to the subscriber so that the subscriber can act as a backup server in case of a source system crash. There is some latency with a replicated server, but some latency is usually acceptable in a BI system.
Each of the options listed above comes with a set of advantages and disadvantages. Choose the one that works best for your application. Critical decision factors include the type of source system, the volume of data, performance considerations, the available extraction tools, and the type of access you have to the source server.
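
For example, a flat file exported with bcp out from a standby or snapshot server might be loaded into staging with BULK INSERT, as in the following sketch (the file path, delimiter, and table name are hypothetical; bcp in or an SSIS data flow would work equally well).

    -- Load a character-format file produced by bcp out into a staging table
    BULK INSERT Staging.dbo.SalesOrder_Extract
    FROM 'D:\Extracts\SalesOrder_20060801.txt'
    WITH
    (
        FIELDTERMINATOR = '|',    -- column delimiter used during the bcp export
        ROWTERMINATOR   = '\n',
        TABLOCK                   -- reduce locking overhead during the bulk load
    );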
Handling Intermediate Values at source
Do your analysis reports need all the intermediate changes between two extraction periods? Depending on the OLTP system schema, there are generally two kinds of table structures:
Transaction tables: A history of changes, including inserts and updates, is maintained in the transaction tables themselves. These records typically have a modified date time and a modified-by user ID, also termed transaction tracking fields. It is usually straightforward to capture intermediate changes in a transaction table.
Master tables: History of changes on source master records is generally not maintained. Master tables typically have two columns that provide an audit trail of changes: Modified Date Time and Modified By User Id. Whenever there is a new addition or modification these two columns are updated. Previous details of modifications are lost. This means that if one change to the master record happened at 9 AM and another at 10 AM, then extracting data at 11 AM would reflect only the 10AM change, losing the 9AM intermediate change. Typically, the intermediate changes in the master table records are not required. If they are required, you have different options, depending on whether the source maintains history.
• The Source Maintains History. If the source system preserves values of the attributes with every change, it is easy for the delta extraction process to identify and extract the changed and newly added data. Ensure that all the changes required by the reporting application are tracked by the server. When the source logs changes, you typically capture all the changes, regardless of the extraction interval.
• The Source Does Not Maintain History. The source system may not retain changes as transactions happen. In this case, use one of the following options:
o Reduce the interval between deltas (not a complete solution)
o Keep history records of balance changes either using triggers or source application logging or transaction logs.
Consider the following extraction techniques if the source maintains only current values and does not keep track of history:
o Implement triggers to write the details to another history table. (This may have some impact on source transaction processing time.)
o Have the source application write the details to another table when updating the current table. (This requires modification of the source application.)
• Snapshot vs. Transactional Extract. A snapshot at the end of the day will not catch the intermediate changes, as shown in the following diagram. This may miss detail needed for analytical reports.



Frequency of Data Extraction
How frequently do you need to extract data from the source system? The greater the frequency of extraction, the greater the load on the system. Depending on the business requirements, your extraction process can be in real time or deferred:
• Real-Time Data Extraction. If your application needs to extract data in real time, then commonly used push techniques are:
o Capture the updates using triggers on the source system
o Have the source application write to the OLTP database and also store the data redundantly in a different database
o Use a standby server with minimal latency
You could also implement the pull option using SSIS packages, or a hybrid approach as described in earlier sections.
• Deferred Data Extraction. If you do not need to extract data and create reports on an intra-day basis, then consider the following options:
o Using a time stamp. A time stamp on the source system acts as the delta detection mechanism, and the delta extraction runs at regular intervals to capture the transactions performed during the delta interval (the time between the last extraction and the current one).
o Usage of a standby server. Typically a standby server is kept up to date by log shipping so that it can take over from the OLTP server in case of failure. This server is generally idle, so it can be used for data extraction (with some latency).
o Usage of a snapshot server. The latest SQL Server version introduces snapshots that provide a read-only copy of the OLTP system. This is a good candidate for the extraction process when the OLTP system is resource constrained.
o File comparisons. Compare the OLTP system against the staging database (the accumulated data) to find the differences and apply them as a delta at regular intervals. This is a time-consuming process for large OLTP data volumes.

Data Extraction based on Metadata
The extraction process should be driven by metadata. This information is used at run time to drive the ETL process, and it provides a flexible architecture that accommodates changes with minimal impact. Typical metadata related to extraction includes:
o List of tables to be extracted
o Columns to be extracted for each table
o Delta detection method for each table/sync
o Extraction start and end date times
o Status of each table extract
o System parameters such as folder locations and connection parameters
o Schedule durations
o Number of retries in case of failure
o Logging parameters
The metadata can be stored in configuration files, database tables, DSVs, and so on. Choose a store based on the complexity of the configuration and the flexibility you need for making changes.
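
As a minimal sketch, this metadata might be held in a table such as the following; the table and column names are illustrative, and columns for other delta techniques (for example, a last timestamp or identity value) can be added as needed.

    CREATE TABLE dbo.ExtractionMetadata
    (
        SourceTable     VARCHAR(128)  NOT NULL PRIMARY KEY,
        ColumnList      VARCHAR(4000) NOT NULL,              -- explicit field list to extract
        DeltaDetection  VARCHAR(30)   NOT NULL,              -- e.g. 'ExtractionTime', 'TimeStamp', 'Identity'
        LastExtracted   DATETIME      NULL,                  -- high-water mark of the last extraction
        StatusCode      VARCHAR(20)   NOT NULL DEFAULT ('Not Processed'),
        RetryCount      INT           NOT NULL DEFAULT (0),  -- number of retries in case of failure
        ConnectionName  VARCHAR(128)  NULL,                  -- system parameters such as connection info
        ScheduleMinutes INT           NULL                   -- schedule duration between extractions
    );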
Techniques for Delta Detection
In most cases, to minimize the extraction and processing load, you want to capture only changed records. Determining the minimum amount of data required to capture all the changes is called delta detection.

Problems caused by inadequate delta detection:
• All the records must be captured to staging and processed into the warehouse, which increases the time required for the ETL.
• The history of changes is not tracked in the warehouse, which limits the business requirements that can be met.
• Source system resources are overused because of the volume extracted in each run.

There are three basic approaches for delta detection:
• Log of transactions: For this approach, you need a log of all the changes to the source tables. This approach is most commonly used for transaction records, and for changes to a master table when the table is very large.
• Snapshot of source system: This approach is to copy the entire content of tables to staging and then determine the changed/added records. This is a time consuming process, particularly if the source tables are huge. This is a preferred method for small tables, such as some master tables. This approach does not capture intermediate changes.
• Snapshot of changed records: This approach is to capture the changed/added records directly from master or transactional tables. You do not need a separate log of changes. But this process does require the modification date and time of the changes. This approach does not capture intermediate changes.

There are various delta detection techniques to choose from, based on the capabilities of the source system and the business requirements. If the source system creates change logs, then extracting the delta is relatively straightforward. If the source system does not help with delta detection and you cannot make changes to the source, then the extraction process has to bring all the transactional records to staging and determine the changed records there. This is time consuming, and for huge OLTP data volumes it is not operationally feasible. Hence, choosing an appropriate delta detection technique is important for an efficient and effective extraction process. This section reviews the various delta detection techniques.

• Problem: Smaller source system (in size). Solution technique: Snapshot of source system.
• Problem: Each table has a record-tracking time stamp. Solution technique: Use extraction time or a time stamp.
• Problem: Identity columns are available for master/transactional tables, no updates or deletes are performed on the source, and source schema changes are not allowed. Solution technique: Use identity.
• Problem: History is not currently logged at the source. Solution technique: Transactional replication, log shipping, triggers, redundant data stores, or separate history tables (source schema changes).
• Problem: The source schema cannot be changed and no delta detection mechanism is in place. Solution technique: Transactional replication, log shipping, or snapshot of source system.

Snapshot of changed records:
a. Extraction Time
Instead of tracking each table's transactions with a per-table date and time, a higher-level approach is to track the extraction time. Typically, the last extracted date time is preserved in the metadata, and the current extraction uses that date time to query for changes. For example, the query might look like the following:
SELECT Field1, Field2, Field3 FROM SourceTable1 WHERE ModifiedDateTime > '1/1/2006 11:00:00.000'
The date time (1/1/2006 11:00:00.000) in the above sample query is the last extracted date time. Using extraction time as a means for delta detection does not require you to store the last extracted date time for each table; there is one value for each delta extraction, and any record that changes after this date time is considered for the next delta extraction. This technique assumes that the source tables maintain a modified date time field. Using this approach, inserts and updates can be tracked; hard deletes cannot be tracked, because the information is permanently deleted unless the source system preserves history. The following table shows the detection mechanism for each transaction type (a fuller sketch of the query flow appears after the summary below):
Delta Detection Mechanism
• Insert: Extract all the records whose Modified Date Time > last extracted date time.
• Update: Extract all the records whose Modified Date Time > last extracted date time.
• Delete: Hard deletes are not possible to track. Capture the deleted records in a separate table using the application or a trigger, or use a soft delete. In the soft delete approach, the deleted record is not physically deleted but flagged as deleted; it is removed later by a cleanup process.
In summary:
• Delta detection is not specific to each table. A single last extracted date time is used for all tables.
• If you want to use this technique for tables that don't have a last modified date, you may be able to add a ModifiedDateTime column that uses getdate() as the default value, or you can add a transaction commit date if the table has record tracking enabled. This requires the ability to change the table schema, but should not require changes to applications.
• This technique does not track the transactions for inserts, updates and deletes separately. It gets only changed/added records without representing each transaction separately.
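
Putting these points together, a delta run driven by a single last-extracted date time might look like the following sketch; the control table, column, and source table names are hypothetical.

    DECLARE @LastExtracted DATETIME;
    DECLARE @CurrentRun    DATETIME;
    SET @CurrentRun = GETDATE();

    -- Single high-water mark shared by all tables in this technique
    SELECT @LastExtracted = LastExtractionTime FROM dbo.ExtractionControl;

    -- Capture only rows inserted or updated since the last run
    SELECT Field1, Field2, Field3
    FROM   SourceTable1
    WHERE  ModifiedDateTime >  @LastExtracted
      AND  ModifiedDateTime <= @CurrentRun;   -- upper bound avoids missing rows modified mid-run

    -- Advance the high-water mark for the next delta extraction
    UPDATE dbo.ExtractionControl SET LastExtractionTime = @CurrentRun;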
b. TimeStamp data type column for each table
If your table already has a TimeStamp data type column, it can be used for delta detection. The TimeStamp data type is stored as binary in SQL Server. Convert the TimeStamp column to BIGINT and get the delta records recorded after the last extracted timestamp value, on a per-table basis. Store the last extracted TimeStamp value in your metadata for the next round of extraction. Using this approach, inserts and updates can be tracked, but hard deletes cannot. Either use soft deletes or maintain the deleted records in separate tables to track deletes. (A sketch follows the summary below.)
Delta Detection Mechanism
• Insert: Extract all the records whose CAST(timestamp AS BIGINT) > the last extracted timestamp value for this table.
• Update: Extract all the records whose CAST(timestamp AS BIGINT) > the last extracted timestamp value for this table.
• Delete: Hard deletes are not possible to track. Either capture them in separate tables by using the application or a trigger, or use the soft delete approach.
In summary:
o Delta detection here is specific to each table: each table's maximum TimeStamp value from the last extraction is used for delta detection.
o If you want to use this technique for tables that don't have the capability, add a new column with the TimeStamp data type.
o This technique does not track the transactions for inserts, updates and deletes separately. It gets only changed/added records without representing each transaction separately.
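
A minimal per-table sketch of this technique, assuming a hypothetical timestamp column named RowVer and a metadata table holding each table's last extracted value:

    DECLARE @LastTS BIGINT;
    SELECT @LastTS = LastTimestampValue
    FROM   dbo.ExtractionMetadata
    WHERE  SourceTable = 'SourceTable1';

    -- Rows inserted or updated since the last extraction of this table
    SELECT Field1, Field2, Field3
    FROM   SourceTable1
    WHERE  CAST(RowVer AS BIGINT) > @LastTS;

    -- Remember the highest value seen, for the next round of extraction
    UPDATE dbo.ExtractionMetadata
    SET    LastTimestampValue = (SELECT MAX(CAST(RowVer AS BIGINT)) FROM SourceTable1)
    WHERE  SourceTable = 'SourceTable1';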

c. Using Identity or Record Sequence integer
This option is applicable only if:
• The source schema was not designed with the data warehouse in mind
• Identity columns are available for master/transactional tables
• No updates/deletes are performed on source
• Source schema changes are not allowed
In this solution, every table is created with a column that tracks the record ID, also called a sequence integer. Typically, this column has an integer data type with the Identity property enabled, so every record is automatically numbered in sequence. Inserts receive new sequence numbers, making them easy to track; however, updates, soft deletes, and hard deletes (specifically, reuse of deleted IDs) cannot be detected. You must track updates using a modified date time and deletes using a soft-delete mechanism. (A sketch follows the summary below.)
Delta Detection Mechanism
• Insert: Extract all the records whose identity column value > the last extracted identity value for this table.
• Update: Cannot be tracked.
• Delete: Hard deletes are not possible to track. Either capture them in separate tables by using the application or a trigger, or use the soft delete approach.
In summary:
o Delta detection here is specific to each table: each table's maximum identity value from the last extraction is used for delta detection.
o If you want to use this technique for tables that don't have the capability, add a new column with the Identity property set. The incremental value can be enforced declaratively or by using procedures (for example, get the maximum value and increase it by 1 for each insert).
o This method is useful only for tracking inserts, but not for updates and deletes. They have to be addressed separately.
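
A similar sketch for identity-based detection, assuming a hypothetical identity column named RecordId:

    DECLARE @LastId INT;
    SELECT @LastId = LastIdentityValue
    FROM   dbo.ExtractionMetadata
    WHERE  SourceTable = 'SourceTable1';

    -- Only inserts can be detected with this technique
    SELECT RecordId, Field1, Field2
    FROM   SourceTable1
    WHERE  RecordId > @LastId;

    -- Store the highest identity value seen, for the next extraction
    UPDATE dbo.ExtractionMetadata
    SET    LastIdentityValue = (SELECT MAX(RecordId) FROM SourceTable1)
    WHERE  SourceTable = 'SourceTable1';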
Snapshot of Source System:
In this approach, the changed/added records are not identified at the source system at the time of extraction. All the source records are captured in every extraction (the extraction is no longer termed full or delta). The extracted records are loaded into the staging data store, where the records of each table are compared against the warehouse or the accumulated staging data to determine the changed, added, and deleted records.
Log Transactions
a. Transaction Replication
Using transactional replication, the transactions can be captured and applied to the staging server without the need for a separate extraction process. Typically, snapshot replication is used for the first extraction, and transactional replication is used for subsequent delta detection. The transaction code and transaction extraction date time can be set in the replication routines (stored procedures).
Delta Detection Mechanism
• Insert: Set a transaction code when the transaction is applied on staging, by changing the replication procedure for inserts.
• Update: Set a transaction code when the transaction is applied on staging, by changing the replication procedure for updates.
• Delete: Modify the replication procedure to set a flag representing the delete, instead of deleting the record from staging when the replication log is applied.
In summary:
o Delta detection here is specific to each table. You don't need a separate column for delta detection; transactional replication captures all the transactions and keeps them in the log to be applied on the staging server.
o This method tracks all inserts, updates and deletes.
b. Log Shipping
Log shipping is typically used to keep a standby server (also called a backup server) up to date in case of source database failure, minimizing downtime. The same log shipping can be applied to the staging server to keep its transactions up to date, which effectively acts as a delta extraction. The staging server in this case is an exact copy of the source system in terms of schema and data. Using this approach you can get the delta records, but you cannot specifically track inserts, updates, and deletes.
Delta Detection Mechanism
• Insert: Log shipping applies insert transactions when the staging server is configured as a standby server. There should be a column in the staging schema indicating the sync date time when the log is applied for each staging table.
• Update: Log shipping applies update transactions when the staging server is configured as a standby server. There should be a column in the staging schema indicating the sync date time when the log is applied for each staging table.
• Delete: Log shipping applies delete transactions when the staging server is configured as a standby server, but these are physical deletes, so they must be captured separately so that the deletions can be applied to the warehouse.
In summary:
o Delta detection here is specific to each table. You don't need a separate column for delta detection at the source, but you do at staging: the staging tables maintain a separate column holding the log-applied date time, enabling the transformation process to capture only those records modified by the latest log shipping run.
o To track records, a separate column is created for each staging table to know which records were actually updated or inserted during log shipping.
o Using this approach it is not possible to tag each change with an insert, update, or delete transaction code. A separate row-by-row comparison of the log against the accumulated staging/warehouse data is needed to determine the type of transaction.
c. Using Triggers
In this solution, insert, update, and delete triggers are created on the source transaction tables to add corresponding records to separate tables from which the extraction process can gather the data. Alternatively, the trigger could push the data directly to the staging server in real time without maintaining a separate set of tables at the source. Triggers add performance overhead to your existing application; depending on the performance constraints of your application, this may or may not be acceptable. Conduct appropriate performance tests before choosing this approach. (A sketch of such triggers follows the summary below.)
Delta Detection Mechanism
• Insert: Create an insert trigger on the transaction tables that adds the corresponding inserted record to the destination table.
• Update: Create an update trigger on the transaction tables that adds the corresponding updated record to the destination table.
• Delete: Create a delete trigger on the transaction tables that adds the corresponding deleted record to the destination table.
In summary:
o Delta detection here is not based on specific columns but on entire tables; the triggers push transaction records to separate tables.
o Inserts, updates and deletes can be independently tracked.
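
As an illustration, the triggers might look like the following sketch; the table names and the transaction codes 'I', 'U', and 'D' are hypothetical.

    -- History table populated by the triggers
    CREATE TABLE dbo.SalesOrder_History
    (
        OrderId         INT      NOT NULL,
        TransactionCode CHAR(1)  NOT NULL,                    -- 'I' insert, 'U' update, 'D' delete
        TransactionTime DATETIME NOT NULL DEFAULT (GETDATE())
    );
    GO
    CREATE TRIGGER trg_SalesOrder_Ins ON dbo.SalesOrder AFTER INSERT AS
        INSERT INTO dbo.SalesOrder_History (OrderId, TransactionCode)
        SELECT OrderId, 'I' FROM inserted;
    GO
    CREATE TRIGGER trg_SalesOrder_Upd ON dbo.SalesOrder AFTER UPDATE AS
        INSERT INTO dbo.SalesOrder_History (OrderId, TransactionCode)
        SELECT OrderId, 'U' FROM inserted;
    GO
    CREATE TRIGGER trg_SalesOrder_Del ON dbo.SalesOrder AFTER DELETE AS
        INSERT INTO dbo.SalesOrder_History (OrderId, TransactionCode)
        SELECT OrderId, 'D' FROM deleted;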
d. Source Application Maintains Separate Delete or History Tables
In this solution, the source tables maintain only the current values, and a separate set of tables is used to track the history of transactions, as described in the hybrid model of data extraction.
Delta Detection Mechanism
• Insert: Inserts are captured, either by the source application or by using triggers (as explained above), in separate tables from which the extraction process extracts the delta.
• Update: Updates are captured and added as separate records, either by the source application or by using triggers (as explained above), in separate tables from which the extraction process extracts the delta.
• Delete: Deletes are captured and added as separate records, either by the source application or by using triggers (as explained above), in separate tables from which the extraction process extracts the delta.
In summary:
o Delta detection here is not based on specific columns but on entire tables; the triggers push the transaction records to separate tables.
o Triggers or the source application are responsible for moving the data to the separate tables.
o Inserts, updates and deletes can be independently tracked with corresponding transaction code set in the history tables.
e. Source Application Stores Data Redundantly
In this scenario the staging database maintains the same copy of the source system data; in addition, a separate column is used to maintain the transaction date time to identify the delta. If the staging server is not reachable from the source application, then transactions are written to separate history tables, which are later picked up by the extraction process and applied on staging.
Delta Detection Mechanism
• Insert: The source application directly applies the insert transactions on the staging server.
• Update: The source application directly applies the update transactions on the staging server.
• Delete: The source application applies deletes on staging, either as hard deletes or as soft deletes.
In summary:
o There is no delta detection involved here; the source directly updates the staging tables.
o No changes are made to the source data; the source application directly pushes transactional data to the destination.
o Inserts, updates and deletes can be independently tracked by having corresponding transaction code set by the source application while applying the changes.
Hard and Soft Delete Mechanisms
Hard deletes (where the records are physically deleted) are difficult to track. If you can change the source applications, change them to use soft deletes (where the records are merely flagged as deleted); soft deletes are easy to track. Hard deletes have to be captured in separate history tables. Soft-deleted records can be removed by a grooming process run on the source system, or deleted after a successful extraction. (A sketch of a soft-delete trigger follows the list below.)
o Change source system to use soft deletes or to put deleted records into a history table. On successful extraction, these history records can be purged to capture the fresh hard deletes.
o Use a trigger to capture the deletes into the history table.
o Copy the entire table to the staging database and compare the new version with the old to look for deleted records. (Or keep a copy of the old table on the server so you can compare for differences there.) If you have to use this method, you may want to process deletions less frequently than inserts and updates.
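
For example, a soft delete can be implemented with an INSTEAD OF DELETE trigger, so applications that issue DELETE statements do not have to change; the table and column names below are hypothetical. The extraction process then picks up these rows like any other update, and a grooming job can purge them after a successful extraction.

    -- Flag rows as deleted instead of physically removing them
    CREATE TRIGGER trg_SalesOrder_SoftDelete ON dbo.SalesOrder
    INSTEAD OF DELETE
    AS
        UPDATE s
        SET    IsDeleted = 1,
               ModifiedDateTime = GETDATE()
        FROM   dbo.SalesOrder AS s
        INNER JOIN deleted    AS d ON s.OrderId = d.OrderId;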
Strategies for Handling Large Volumes of Data
The nature of your transaction application and the rate of transactions determine the volume of data your ETL process has to deal with. Dealing with large volumes of transactions is a common scenario for many warehouse applications, and there are several factors to consider for such systems. This section identifies the factors that affect the volume of data and how to deal with large volumes.
Factors affecting the volume of data
Typically, a large volume of data is caused by a high rate of transactions. Following, however, are special cases that cause higher than usual volume levels:
• Full Sync (History Load). When you run the ETL process for the first time after creating all the required warehouse structures on the staging server, you will be loading a large volume of historical data.
• Delta detection on the staging server. If the source system cannot provide a transaction log that includes inserts, updates, and deletions since the last extraction, then the extraction process may need to copy all the records to determine what has changed.
• Large interval between extractions. A greater than usual period between extractions could result in a larger volume of transactions to deal with.
• Reload after changes. Periodically you might have to synchronize the structure of the source and staging server, which would require loading a large volume of data. The remaining items are situations that might trigger the need to synchronize the data.
• Schema changes on OLTP. This is a case where the source system undergoes schema changes such as:
o Addition of new columns or tables – existing rows will not have values for the newly added columns; if a new table is added, all of its values will need to be extracted.
o Default value changes – A column that used to have the default value of 2 now uses 3 so all existing records with the old default need to be checked to see whether they should change to the new default or remain with the fixed value.
o Data type changes – for example, the data type of a column in the source table is changed from datetime to smalldatetime. This also changes the underlying values (the fractional seconds are trimmed), which may cause the rows to be treated as delta records.
• Legacy data migration. If you are migrating data from a legacy system to a new system, or if your business requires specific reports that need data from the archival storage of a legacy system, you may have to deal with a large amount of data.
• OLTP behaviors such as mass updates. If your OLTP system performs transactions that affect a large volume of records across the board, such as updating every employee's record at the end of the financial year with an increased base pay, these occasional but massive changes require all the affected records to be synchronized with the warehouse.
• Locale Changes. Consider the scenario where the source system locale changes, that is, the code page changes from US English to French. This could require all records to be considered for extraction.
Techniques for Handling Large Volumes of Data
Your ETL process might have to handle a large volume of data between two extraction intervals. Consider the following techniques for handling large volumes of data:
o Delta detection. Using a delta detection mechanism, extract only the records that were inserted, updated, or deleted since the last extraction.
o Different schedules. If you are extracting data from multiple sources, then you can put the extractions on different schedules. Consider the availability of each source system, its peak usage times, the criticality of the database, and resource availability. Based on these factors, consider extracting data at different times from each source and then consolidating it in staging before processing it all at once. Based on the nature of the data, decide whether processing of the extracted data should start only after all source extracts complete successfully, or incrementally for each source. It is also a good option to extract data from the data sources in parallel and then process one source's data at a time in sequence.
o By Partitions. If partitions are available in source system then consider extracting data from one partition in each extraction. Example: If the source table is an “append only” table, with partitions based on a record create date column, then extracting data at the end of the day can target the partition for that one day.
o Smaller delta sync intervals. Consider reducing the interval between deltas extraction, but continuing with a single transformation and load process at the end of the day. In some systems, this may cause an unacceptable ongoing drain on the source system.
o OLTP system (operational, mission critical; mass updates). In rare cases, it might be possible to replicate the action of a bulk change instead of extracting all the records again. For example, if the price of every book in the database increases by 5%, you could apply the 5% change on the staging data. In order for this to work, the OLTP system would have to be more tightly linked to the data warehouse than is typically the case.
o Extraction Model. Choose an appropriate extraction model – PULL, PUSH or Hybrid as explained at the beginning of the document. Each has its own advantages, disadvantages and overhead. Consider the following:
- Server resources
- Bandwidth overhead: how much data is passed across the network when issuing the SQL statement and receiving the results
- The difference in volume between native data (typical of a pull extract) and text data (typical of a push extract). If the native data for 100 records is 278 bytes but the flat-file data is 580 bytes, you may prefer a method that can use native transfer.
Guidelines for Data Extraction Process
Consider the following guidelines for the data extraction process.

Choose an appropriate extraction interval
If data extraction is performed once a day and history is not maintained in the source systems, intermediate changes may be lost. For example, if a record is inserted in the OLTP system with value a and then updated to value b, and the history of transactions is not maintained separately, an extraction at the end of the day picks up only the latest value, b. To capture these intermediate transactions, reduce the interval between delta extractions, or maintain a transaction history in separate OLTP history tables from which the extraction process can pick them up. Whether this is necessary is driven by the business requirement: do the analysis needs depend on intermediate status changes?
Choose an appropriate Staging location
Consider managing the status of each data source extraction from a centralized location, including logs for failure analysis. When using a pull or hybrid method, data is extracted from multiple sources to the central staging area, so tracking, configuring, and diagnosing all sources can be managed from the staging server.
For more information, see the “Staging Guidelines” chapter.
Choose an appropriate isolation level for reading source data
Your extraction process could set various isolation levels before reading the data. Consider the two common cases where you process Committed vs. Uncommitted data.
Reading Uncommitted data (dirty reads)
Consider the following pros and cons of using the read uncommitted transaction isolation level:
Pros:
• There are no shared locks on the source tables, which means transactions perform with the same level of concurrency and efficiency whether or not the extraction process is in progress.
• This allows you to extract larger volumes of data without having a significant effect on the source systems.
• Data extraction is faster because there is no need to acquire shared locks.
Cons:
• Dirty reads may lead to data inconsistencies.


Considerations:
• Reads that occur in the middle of a transaction can lead to inconsistent data. Creating smaller transactions in the source system minimizes the risk. Ideally, the extraction process should not perform any dirty reads in the middle of a transaction.
• When using dirty reads, it may be necessary to completely resynchronize the data warehouse with the source system periodically to eliminate accumulated inconsistencies. This is easier to implement if the source database is not too large.
• Avoid creating transactions as part of the extraction process (such as wrapping the data extraction for all tables in a single transaction). A transaction increases the time required to read the data and increases the probability of dirty reads. It also forces you to repeat the extract for tables that extracted successfully but were rolled back.
• Implement integrity constraints at the source to reduce the risk of data inconsistencies.
• For extraction, use a snapshot database or standby server with committed data.
Dirty reads may cause problems for the data transformation process, such as orphaned or duplicated rows. Orphaned rows exist when a foreign key does not have a corresponding primary key. Approaches to handling these problems are covered in the Data Transformation chapter.
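
For reference, a dirty-read extraction can be expressed either with the session-level isolation setting or with a per-query table hint; the table name is hypothetical.

    -- Session-level setting: subsequent reads in this session are dirty reads
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SELECT Field1, Field2 FROM dbo.SourceTable1;

    -- Equivalent per-query table hint
    SELECT Field1, Field2 FROM dbo.SourceTable1 WITH (NOLOCK);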
Reading Committed Data
Check the following pros and cons of using the Read Committed isolation level:

Pros:
• No data inconsistencies due to half-completed transactions, because only committed data is read
Cons:
• Relatively slow, because shared locks must be acquired on the source resources
• Source transactions may suffer a performance penalty because of the shared locks


Considerations:
• Not suitable for critical systems, because it holds shared locks and hampers the performance of the source system
• Make use of a snapshot database or standby server, which contains committed data, for extraction
• For large tables, read the data in chunks to reduce the duration of shared locks, as shown in the sketch below
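A minimal sketch of the chunked read under READ COMMITTED, assuming the hypothetical Orders table has an increasing OrderID key; each iteration holds shared locks only for the duration of one chunk:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

DECLARE @LastKey int, @ChunkSize int;
SET @LastKey = 0;
SET @ChunkSize = 50000;

WHILE 1 = 1
BEGIN
    -- Each iteration reads one chunk; shared locks are released when the statement completes.
    SELECT TOP (@ChunkSize) OrderID, CustomerID, OrderStatus, ModifiedDateTime
    INTO #Chunk
    FROM dbo.Orders
    WHERE OrderID > @LastKey
    ORDER BY OrderID;

    IF @@ROWCOUNT = 0
    BEGIN
        DROP TABLE #Chunk;
        BREAK;
    END;

    SELECT @LastKey = MAX(OrderID) FROM #Chunk;

    -- ... copy #Chunk to the staging database here ...

    DROP TABLE #Chunk;
END;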

Maintain a Queue of Tables
Maintain a list of the tables to be extracted from each data source in a queue, and maintain a status for each table (Not Processed, Waiting, Running, Processed, Failed) based on its current state, so that the extraction can be tracked at the table level. The list of tables, including their columns, can be maintained in the metadata store.
Typically, the extraction process does not need all the tables and views of the source system. The list of required tables and their parameters, such as the delta detection method, should be placed in the queue before the actual extraction process is initiated. The queue is prepared from the required and optional objects of the source and is configurable (it holds the list of tables to be extracted), which makes it possible to extract one more table without changing the underlying code.
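A minimal sketch of such a queue in the metadata store (table and column names are illustrative, not part of any prescribed schema):

CREATE TABLE dbo.ExtractionQueue
(
    QueueID        int IDENTITY(1,1) PRIMARY KEY,
    SourceSystem   varchar(50)   NOT NULL,   -- e.g. 'Accounts', 'Inventory'
    SourceTable    varchar(128)  NOT NULL,
    ColumnList     varchar(max)  NOT NULL,   -- explicit field list used in the SELECT
    DeltaColumn    varchar(128)  NULL,       -- e.g. 'ModifiedDateTime'; NULL means full extract
    IsRequired     bit           NOT NULL DEFAULT 1,  -- required vs. optional source object
    ExtractStatus  varchar(20)   NOT NULL DEFAULT 'Not Processed'
        CHECK (ExtractStatus IN ('Not Processed','Waiting','Running','Processed','Failed')),
    LastRunTime    datetime      NULL,
    FailureReason  varchar(1000) NULL
);

Adding one more table to extract then becomes a new row in this queue rather than a code change.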
Extract explicit field lists
Use a field list when preparing the extraction statement, for example SELECT Field1, Field2 FROM Table1 instead of SELECT * FROM Table1.
Using SELECT * in data extraction queries brings unnecessary and unwanted columns to staging, which wastes network bandwidth, CPU, and storage. It is suggested to use the field list instead of * when constructing the extraction query. An explicit field list also withstands new column additions in the source schema.
Tracking Optional Items
Validate the source schema (for the extraction tables) of critical or unsecured source systems for missing tables and columns. For optional columns that do not exist in the source, use nullability or default values while extracting. For optional tables that do not exist, do not attempt an extraction. For the rest, such as a missing required column or object, declare it as an error instead of hiding it.
In some OLTP systems, database objects such as views, tables, and individual columns are optional. A schema check is therefore required to determine what is available, and the query issued against the OLTP system is built from the existing columns, tables, and views. For missing columns, either NULL or a default value is assumed. For missing tables, no attempt is made to bring back zero records; the query is simply not issued. If the source system maintains the list of available tables and columns in a metadata table, the schema check can use that metadata table directly; otherwise, a separate check of the required objects is performed before the actual extraction process.
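A minimal sketch of such a schema check using the standard INFORMATION_SCHEMA views (the object names are placeholders):

-- Skip the extraction of an optional table if it does not exist in the source.
IF NOT EXISTS (SELECT 1
               FROM INFORMATION_SCHEMA.TABLES
               WHERE TABLE_SCHEMA = 'dbo'
                 AND TABLE_NAME   = 'OptionalSalesNotes')
BEGIN
    PRINT 'Optional table missing: no query issued.';
END

-- Substitute NULL for an optional column that is not present.
IF NOT EXISTS (SELECT 1
               FROM INFORMATION_SCHEMA.COLUMNS
               WHERE TABLE_SCHEMA = 'dbo'
                 AND TABLE_NAME   = 'Orders'
                 AND COLUMN_NAME  = 'DiscountCode')
BEGIN
    -- Build the extraction statement with NULL AS DiscountCode instead of the real column.
    PRINT 'Optional column missing: extracting NULL AS DiscountCode.';
END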
Extraction from multiple Data Sources
In a typical distributed OLTP environment, all the data is not present in one place; for example, there may be separate Accounts and Inventory systems. Because the data warehouse consolidates data from all sources, data must be extracted from each of these sources into a staging data store. Consider the following items when designing ETL for a multiple data source scenario:
• Decide whether a failure to extract from one data source should halt the complete ETL process, or whether the extractions that succeeded should still be loaded
• Lookup tables (also called master tables) may be sourced from multiple systems; for example, an Accounts table may exist in both the Accounts and the Inventory system. Should there be a column in the staging data store to indicate the source of each record of such a common table? (A sketch follows this list.)
• Handle transactional records that arrive before the corresponding master records, for example Inventory records that are extracted before the Account records. This is a likely scenario when extracting from multiple data sources
• The same business key might receive two different updates, so consider merge options such as destructive and constructive merges. In a constructive merge, the staging data store keeps two records showing the two different values of the same business key. In a destructive merge, only the change from the priority system is kept; for example, the Accounts system is treated as the authority for accounting information
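One way to record where each row of a shared master table came from is a source system column in staging. A minimal sketch, assuming a Staging database and illustrative table and column names:

CREATE TABLE Staging.dbo.Accounts
(
    AccountBusinessKey varchar(50)  NOT NULL,
    AccountName        varchar(200) NULL,
    ModifiedDateTime   datetime     NULL,
    SourceSystem       varchar(50)  NOT NULL,   -- e.g. 'Accounts' or 'Inventory'
    CONSTRAINT PK_Staging_Accounts
        PRIMARY KEY (AccountBusinessKey, SourceSystem)  -- constructive merge keeps one row per source
);

With a destructive merge, the transformation layer would instead keep only the row coming from the priority system (here, the Accounts system).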

Validating Extracted Data
Consider the following guidelines:
Verify the extracted records count
It is suggested to perform a due diligence check on the extracted rows to verify that all the expected rows were captured in the delta extraction. If some rows did not make it through the extraction process, they must be captured again; this row count check determines the completeness of the extraction process. For example, suppose an uncommitted transaction (which sets the modified datetime to the transaction start time) is not committed before the extraction process starts but is committed later, with a modified datetime that falls within the current delta range. It may then be picked up by neither the current delta nor the next one, resulting in loss of the record.
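A minimal sketch of the count check, comparing the rows the source reports for the delta window against the rows that landed in staging (the linked-server name, database names, and window variables are assumptions for illustration):

DECLARE @LastExtract datetime, @CurrentExtract datetime;   -- delta window bounds, as recorded by the extraction process
DECLARE @SourceCount int, @StagingCount int;

-- Rows the source says changed in the delta window.
SELECT @SourceCount = COUNT(*)
FROM SourceServer.SalesDB.dbo.Orders
WHERE ModifiedDateTime > @LastExtract
  AND ModifiedDateTime <= @CurrentExtract;

-- Rows that actually arrived in the staging database for the same window.
SELECT @StagingCount = COUNT(*)
FROM Staging.dbo.Orders
WHERE ModifiedDateTime > @LastExtract
  AND ModifiedDateTime <= @CurrentExtract;

IF @SourceCount <> @StagingCount
    RAISERROR('Delta extraction incomplete: source %d rows, staging %d rows.', 16, 1, @SourceCount, @StagingCount);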
Avoid losing foreign-key records
Because of source system problems, a foreign-key row might arrive at the staging server before the primary-key record. This is sometimes called a late arrival. If the extraction query joins the two tables, the detail records will not be extracted, and in the next extraction those detail records will no longer have a modified datetime within the new delta range, so they would still be missed. To avoid the problem, do not join the tables before extracting; bring the individual table records to the staging database and deal with the problem there, as discussed in the Data Transformation chapter.
Check for nulls and data types
If a null check is not enforced at source for non-nullable columns, then you should check the extracted records for null values. The source and staging databases should have the same schema definitions for data types, nullability, and, if possible, column sequence.
Check for duplicate key violations
If the source system does not enforce a primary key and there is no primary key constraint defined on the source tables, uniqueness must be enforced at staging by examining the incoming data and checking for duplicates on the candidate keys.
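Minimal sketches of both checks against a staging table (the table and column names are illustrative):

-- Null check: non-nullable business columns should contain no NULLs after extraction.
SELECT COUNT(*) AS NullViolations
FROM Staging.dbo.Orders
WHERE OrderID IS NULL OR CustomerID IS NULL;

-- Duplicate check: the candidate key should be unique even if the source did not enforce it.
SELECT OrderID, COUNT(*) AS DuplicateCount
FROM Staging.dbo.Orders
GROUP BY OrderID
HAVING COUNT(*) > 1;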
Handling Failures during Data Extraction
The extraction process may fail while getting data from the source system because of:
• Network connection problems
• Source transactions
• Source schema changes
• Incorrect configuration
When this happens, the system should retry the extraction a configured number of times before declaring the extraction a failure. The failure reason should be recorded in the log file along with the table name and the cause of failure, as well as the extraction process component where the failure was first identified.
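A minimal sketch of the retry-then-fail pattern in T-SQL (in practice this is often handled in the SSIS control flow instead; the procedure and log table names are hypothetical):

DECLARE @MaxRetries int, @Attempt int, @Succeeded bit;
SET @MaxRetries = 3;
SET @Attempt = 1;
SET @Succeeded = 0;

WHILE @Attempt <= @MaxRetries AND @Succeeded = 0
BEGIN
    BEGIN TRY
        EXEC dbo.usp_ExtractOrders;      -- hypothetical extraction procedure
        SET @Succeeded = 1;
    END TRY
    BEGIN CATCH
        -- Log the table name, the attempt number, and the failure reason.
        INSERT INTO dbo.ExtractionErrorLog (SourceTable, AttemptNo, FailureReason, LoggedAt)
        VALUES ('Orders', @Attempt, ERROR_MESSAGE(), GETDATE());

        SET @Attempt = @Attempt + 1;
        WAITFOR DELAY '00:00:30';        -- back off before retrying
    END CATCH
END

IF @Succeeded = 0
    RAISERROR('Extraction of Orders failed after %d attempts.', 16, 1, @MaxRetries);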
Performance Considerations for Data Extraction
Consider the following guidelines:
• Consider off-peak or idle CPU times for data extraction. When extracting from critical systems that cannot afford downtime or reduced availability of system resources, extract during idle CPU utilization or non-peak hours. In general, night time is considered appropriate for extracting the current day's transactions. If the data volume is huge, consider extracting at smaller intervals, such as once every six hours. Estimate how long it takes to extract one day of delta and load it into the warehouse, and check whether the extraction duration overlaps with OLTP transactions. For example, if the extraction process starts at midnight and OLTP users start using the system at 6 AM, the extraction should not take more than six hours, or it will affect source resources on critical systems. In such scenarios consider extracting in batches, for example bringing six hours of data at a time, particularly when the CPU is idle or memory utilization is below a threshold (say 60 percent)
• Record tracking on the source system. Record tracking on the source system minimizes the delta data volume and therefore improves performance
• Placement of servers. Having the source and staging systems in the same domain results in better communication and more bandwidth, which results in a shorter extraction duration
• Source transactions. If source transactions run long, shared or exclusive locks are held on the database objects. When the extraction tries to read committed data, it waits until the locks are released, so the extraction duration increases. Keep source transaction durations as short as possible by tuning the queries and keeping unnecessary inserts, updates, and deletes out of the transaction
• Extraction process. The extraction process issues queries against the OLTP system; it should minimize joins and avoid performing any transformations at extraction time using source resources
• Source resources. The extraction process consumes system resources in addition to the OLTP workload, so it is suggested that hardware sizing take into account the extraction process, its duration, and the delta data volume
• Consider the impact of extraction on the source system. If the source system is critical in terms of downtime and resources, have the extraction process initiated by another server (such as the staging server). For critical source systems that cannot afford any downtime, resource usage for data extraction should be minimal. Consider using a Pull method of data extraction, or make the transactional data available outside the transaction system, such as on a snapshot or standby server, from which the data extraction process can pick it up. Keeping the data extraction components off the OLTP server minimizes the effect of warehouse data extraction on the OLTP system
• Put source and staging in the same domain. The OLTP server and the staging server should be in the same domain for better communication when transferring the extracted data, in terms of network bandwidth, and to avoid permission checks in cross-domain scenarios. A fast network between source and staging also minimizes the duration of locks on both servers and keeps downtime to a minimum
• Do not process the existing staging data while data extraction is in progress; this would create shared lock issues
• Perform extraction in parallel rather than serially to reduce the extraction duration, but limit the parallelism on critical systems, for example by capping the number of processors the extraction can use. If the extraction process uses all available processors at a high rate, it will adversely affect source system transactions
Consolidation of Extraction Issues and Solutions
Problem scenario: Source does not maintain history
Possible techniques:
• Reduce the interval between delta extractions
• Maintain transaction log tables at the source using triggers or changes to the applications
• Ignore intermediate changes if they are not required by reporting
• Add a modified datetime or timestamp column to the tables
• Have the source application write redundantly to staging or a temporary repository

Problem scenario: Source resources are extremely limited
Possible techniques:
• Use a Pull method
• Extract at non-peak times
• Extract in small chunks during the day
• Add resources to the source system
• Use low-overhead replication mechanisms to make a copy
• Allow dirty reads during extraction to minimize lock overhead

Problem scenario: Real-time reporting is required
Possible techniques:
• Reduce the interval between extractions, possibly by using transaction replication
• Create a standby server with minimal latency
• Change source applications to write data redundantly to the staging server
• Use triggers to write to a separate repository (separate tables/storage)
• Build an ODS

Problem scenario: Source does not have record tracking
Possible techniques:
• Add a ModifiedDate column with getdate() as the default value
• Add a timestamp column
• Extract all the records to the staging server
• Use triggers to capture the transactions
• Use transaction replication

Problem scenario: Source applications cannot be changed for extraction (but the schema can be changed)
Possible techniques:
• Add a timestamp column without affecting source applications
• Use a ModifiedDate column defaulted to getdate()
• Add triggers

Problem scenario: Source implements hard deletes
Possible techniques:
• Use triggers to capture deletes
• Change applications to use soft deletes
• Make use of history tables
• Use transaction replication

Problem scenario: Source uses the transaction start date as the modified date
Possible techniques:
• Use the transaction commit date as the modified date (using a default or trigger)
• Make use of an alternate column for delta detection
• Keep transaction durations as short as possible
• Use dirty reads

Problem scenario: Source maintains a standby server
Possible techniques:
• Extract from the standby server
• If the source has a snapshot backup, extract from the restored snapshot

Problem scenario: Source has a high transaction rate
Possible techniques:
• Allow dirty reads
• Capture intermediate changes in a log
• Increase the frequency of extraction
• Accept the data inconsistencies and clean them in the transformation layer

Problem scenario: More detailed information is needed for reports
Possible techniques:
• Build an ODS to reduce the amount of data that must be maintained with the additional detail
• Note that reports cannot be more detailed than the source
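Several of the techniques above rely on a ModifiedDate column with a default value, or on triggers for record tracking and delete capture. A minimal sketch of both on a hypothetical Orders table (Orders_DeleteLog is an assumed history table):

-- Record tracking via a ModifiedDate column that defaults to the current time.
ALTER TABLE dbo.Orders
    ADD ModifiedDate datetime NOT NULL
        CONSTRAINT DF_Orders_ModifiedDate DEFAULT (GETDATE());
GO

-- Keep ModifiedDate current on updates, and capture hard deletes in a history table.
CREATE TRIGGER trg_Orders_Track
ON dbo.Orders
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Stamp updated rows so the next delta extraction picks them up.
    UPDATE o
    SET ModifiedDate = GETDATE()
    FROM dbo.Orders AS o
    JOIN inserted AS i ON i.OrderID = o.OrderID;

    -- Preserve deleted rows so deletions are visible to the warehouse.
    INSERT INTO dbo.Orders_DeleteLog (OrderID, DeletedAt)
    SELECT d.OrderID, GETDATE()
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.OrderID = d.OrderID);
END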



Real-Time data Extraction
Extracting source data as it happens and reflecting it in BI reports is useful for real-time analysis. But extracting, transforming, and loading a warehouse takes time. True real-time reporting is not possible, but near-real-time reporting is possible once the limitations and acceptable latency are understood.
Challenges for Real time data Extraction
High Transaction Rate
A high transaction rate makes it difficult to lock a source table, so extraction may be delayed, limiting real-time data availability. Dirty reads avoid the locking problem but may lead to data inconsistencies, particularly at high transaction volumes.
Source limitations: The source may not maintain history, so intermediate changes cannot be extracted. The source may not have a record tracking mechanism, so identifying changed records is not possible.
Domain functionality behavior: Limitations such as incompleteness of business data can hamper real-time data extraction. For example, intraday extraction may not be feasible because the required source data is not up to date until the end of the day; source data from different data stores can only be compared and extracted after the day ends.
Complex transformations: Transformation processing, mapping, and loading to the warehouse may be complex because of the nature of the source data and data cleanliness problems. When transformations are slow, even data extracted in near real time cannot be processed quickly.
Data inconsistencies: A high transaction rate causes data inconsistencies when reading dirty, and there is no particular time of day at which the extracted data is fully up to date with the source; in high-transaction-rate systems, the source always has transactions in progress.
Source resources and performance: Extracting data in real time may hurt the performance of OLTP transactions. This problem grows with how critical and resource-intensive the source system is.
Deferred reports: Even if the data is made available to the warehouse in real time, the reports must also be published in real time. Loading data into the relational data warehouse is not useful unless reports are executed against the latest data or the analysis cube is refreshed.
Guidelines and Solutions:
Consider real-time alerts. When the data source for a critical report changes, alert the report's users by mail or an instant communication method. This limits the latency of a report to that which is unavoidable in the data preparation process.
Consider using SSIS for end-to-end data processing. Using SSIS for each phase of data processing, from extraction through transformation to loading, also makes it possible to reduce the interval between two delta extractions so that the data in the warehouse is near real time.
