Search This Blog

Welcome to Machers Blog

Blogging the world of Technology and Testing which help people to build their career.

Thursday, August 7, 2008

Data Extraction

Data Staging
Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are confidential to and maintained as a trade secret by Microsoft Corporation. Information in these materials is restricted to Microsoft authorized recipients only. Any use, distribution or public discussion of, and any feedback to, these materials is subject to the terms of the attached license. By providing any feedback on these materials to Microsoft, you agree to the terms of that license.

Microsoft Corporation Technical Documentation License Agreement (Standard)

1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and Microsoft agree as follows:

(a) If You are an authorized representative of the corporation or other entity designated below ("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have authority to act on behalf of Company and agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the Confidential Information accordingly;

(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that You will treat the Confidential Information accordingly; or

(c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing the Confidential Information to any third party for five (5) years from the date of disclosure of the Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the Confidential Information; and (c) to take reasonable security precautions, at least as great as the precautions it takes to protect its own confidential information, but no less than reasonable care, to keep confidential the Confidential Information. You/Company, however, may disclose Confidential Information in accordance with a judicial or other governmental order, provided You/Company either (i) gives Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or governmental entity that it will afford the Confidential Information the highest level of protection afforded under applicable law or regulation. Confidential Information shall not include any information, however designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such information to You/Company pursuant to the terms of this Agreement; (iii) became known to You/Company from a source other than Microsoft other than by the breach of an obligation of confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible form relating to and/or including released or unreleased Microsoft software or hardware products, the marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and information received from others that Microsoft is obligated to treat as confidential.

2. You may review these Materials only (a) as a reference to assist You in planning and designing Your product, service or technology ("Product") to interface with a Microsoft Product as described in these Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone else.

3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.

4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You changes, Your right to use these Materials automatically terminates and You must destroy them.

5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback") relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason to believe is subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with any third party.

6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by Your NDA.

7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be brought in the federal or state superior courts located in King County, Washington, and You waive any defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft concerning these Materials; it may be changed only by a written document signed by both You and Microsoft.

In this Chapter
● Factors for staging or not staging
● Different staging techniques
Data extracted, transformed and loaded (ETL) from the source to destination may require temporary staging for various reasons such as addressing failures, reducing load on the source system, data cleansing, auditing purposes etc. The following section will iterate through various scenarios to help identify if staging is required in first place, if you do need staging then selecting the most appropriate model that fits the criteria becomes critical for an optimal ETL solution.
Decision to Stage or Not to Stage
As described in the Transformation guide, it is possible to load data directly from the extraction process into the data warehouse, without a separate staging area. The opportunity to avoid a staging database is even greater when using SSIS, which allows you to apply multiple transformations to data as it passes through memory between the source and the destination. Whether you need to stage the data or not as part of your ETL process is common design decision that you have to make.
Factors for Not Staging
In most simple scenario, the ETL process copies data from the source directly to the data warehouse. In this case, if there is a failure, you restart the job, again extracting data from the source system. Assuming that the source is continually available in case the job needs to restart, and the transformations are simple enough that they can be performed in memory, this is an extremely efficient way to load the data warehouse
Also, if the source is simple and clean enough that it does not need any data cleansing or transformations before loading to warehouse then it is a simple mapping without the need of a separate transformation process. The required data is extracted, mapped and loaded to warehouse, is the scenario here.
Factors for Staging
Following are reasons you would need a staging database:
• There is a time lag between Extraction and Loading process. If the extraction occurs at 7 AM and the loading to the warehouse happens at midnight then the extracted data has to be stored somewhere. A staging database is a reasonable option..
• Restart on failure:. If various parts of the ETL process are vulnerable to fail, recovery is easier if a staging database is available as the starting point. For example, Out of 100 tables, the extraction process is failed after extracting 60 files to staging area successfully. When the extraction process is restarted it does not need to extract all but the failed 40 as the data is already available for successful 60 in staging. When the source system resources are constrained, it is particularly important to avoid repeating an extract unnecessarily. The staging database allows the Extraction process to be decoupled from the subsequent processes.
• Memory Constraints. Some transformations particularly those that require sorting, can consume substantial memory resources. In many cases, those transformations can be done more efficiently by making use of a disk-based staging. Depending entirely on memory availability limits the ability of the ETL process to scale. When there are memory intensive transformations, having a disk-based process allows for more growth in the volume of data you can process.
• Auditing Requirements. Depending on the specifics of the extraction and transformation process, a staging area can provide a level of auditing support. For example, if it turns out that a specific customer is missing from the data warehouse, you could examine the staging database to help identify why that customer’s records were lost.
• Multiple Source systems: When the data is to be consolidated from multiple source systems, a staging database allows the consolidating to take place before processing the transformation.
Types of Staging
Assuming that there is a need for staging the extracted data, there are various types of staging repository patterns available. Consider the following common staging repository architectures when designing the ETL process.
• Staging Data from Multiple Sources
• Staggered Staging
• Persisted Staging
• Accumulated Staging
• Chunked Accumulated Staging

Staging Data from multiple sources
In this case the data arriving from different source systems are temporally stored in a staging location before the ETL process. As illustrated, the data from the external sources is extracted through the firewall to a staging database before the transformation process begins. This is the consolidation of data from multiple sources at a conceptual level as other lower level tasks such as data merging, prioritizing, timing and cleaning is also required once after the data consolidation (explained in data transformation chapter)

Staggered Staging
In this case, you create multiple staging databases for different stages of the ETL process. The reason for using this approach is if different steps of the process are very expensive to repeat and are also vulnerable to failure. There is clearly a cost in disk space for this option, but it allows multiple levels of restart capability, without having to repeat earlier stages of the process.

Persisted Staging
In this case an archive copy of the staging database is created routinely. The primary reason for using this alternative is to allow auditing of the extract and transformation processes for more than the current cycle.

Accumulated Staging
Ideally, the source systems provide delta data having the corresponding transaction details (Inserts, Updates, Deletes).
Accumulated staging approach can be adopted, when
1. If the source system does not have inbuilt delta detection mechanism
2. If the source system does not provide the kind of transaction applied for each delta detected record

If the source system does not have inbuilt delta detection, all the records from the required tables are extracted to staging then compared against the accumulated staging.
If the souce system does have delta detection mechanism and extracted only the changed/added records the next step is to determine the transaction type that occurred for these changed records (Inserts/Updates/Deletes). Following example would give more insight on how the comparison is made to determine the transaction type.

Incoming Extraction (Table Employee):
EmpId EmpName EmpLocation
1011 John Redmond
1021 Dave Seattle

Accumulated Staging (Table Employee):
EmpId EmpName EmpLocation
1011 John New York

In the above tables, when compared the incoming employee table with accumulated staging, the record EmpId 1011 can be determined as Modified where as the EmpId 1021 can be determined as Inserted.

Chunked Accumulated Staging
This is a case were you need to extract huge volume of data and you spread the extract load across the day on the source system. The accumulated information is then transformed and loaded to the destination warehouse as required by business needs. For example extraction could happen every hour throughout the day and transformation and loading could be done once at the end of the day.

Destination Considerations
The previous sections described various ways of staging data before loading the data warehouse. In some BI database implementations, there may be an additional step that involves distributing the data to separate data marts. .

For example, the data from the central data warehouse may need to be distributed to data marts at different geographical locations. In some ways, the data warehouse is functioning as a staging area for the distributed data marts. This is not technically a stage, it is still taken into consideration for choosing appropriate conditions to load the warehouse to ensure it is appropriate populated to destination marts.

1 comment:

Carly Fiorina said...

Hello Dude,

Data extraction being time consuming and tedious task, hiring professionals for data extraction need can ensure huge amount of time saving which can be utilized in core business activities. Data extraction services brings significant savings of money, time and infrastructure cost. Thanks a lot.....

Data Scraping Software