
Thursday, August 7, 2008

Data Transformations

ETL – Data Transformations

In this Chapter
● General Transformation Problems and Solutions
● Guidelines for the Transformation Process
● Data Quality Issues and Solutions
Overview
Data transformation is a key process in the ETL cycle. Transformation is the phase in which raw data is transformed into a more meaningful form that meets business analysis objectives. The following diagram shows the life cycle of transformation.



The various phases of the transformation process are briefly explained below.

Validate Extract: This is the first step after extraction; it validates the data for record count, exceptions, and partially extracted records.

Map: Mapping is the activity of mapping the extracted data to the target, which is the warehouse or data mart. The mapping can be one-to-one, many-to-one, or one-to-many columns.

Reformat: Reformatting is performed based on the mapping. The staging schema is checked for the data type, data content, and business requirements of the target column. This phase relates only to formatting the staging data to fit exactly into the structure of the target warehouse.

Cleanse: In this phase, the incoming data is cleansed of data quality problems such as redundancy, constraint violations, and other data consistency problems.

Integrate: In this phase, the cleansed data is integrated from multiple systems and from archive and current data stores, and aggregated (if the warehouse is at a higher granularity) based on the mapping, to create a load-ready data set.

Apply Business Rules: Business rules are applied to perform certain calculations and to derive meaningful, required data from the extracted data.

Need for Transformation
First, you need to decide whether you need to perform transformations at all. The following factors are the major reasons for creating transformations:
Transformation process is required
The following list describes high-level factors that indicate the need for data transformation.
• Data Quality problems. Extracted data from source data stores may have data quality issues such as repeated primary key values, NULLs that are not suitable in a data warehouse, incorrect date formats, referential integrity failures, cryptic code values in some fields, inaccurate data, and data values that are inconsistent and redundant between multiple tables. Data quality problems, guidelines, and solutions are explained in the Data Quality section. The major sources of data quality problems are data entry errors, source system errors, normalization problems, and source database design flaws.
• Multiple Data Sources. When data is extracted from multiple data sources, more than one system can update a single logical set of data that is commonly shared among the data sources. Managing discrepancies that result from using individual data islands as the source requires active transformations.
• Data type inconsistency. Data type inconsistencies come from staging tables that are not compatible with the source systems. This is particularly problematic if one staging table is populated from heterogeneous systems such as Excel and SQL Server, where the data sources have different data types.
• Business Rules. When there is a business rule, such as the requirement of a minimum balance for a savings account, that is not enforced at the source, it should be handled as part of the transformation process.
• Heterogeneous data stores. When the data is extracted from heterogeneous data stores, the source systems may have different data types, different column names, different column orders, and different data formats, particularly for date time values. These inconsistencies can be resolved during the transformation process.
• Timely. Transaction records may arrive ahead of their corresponding lookup table values. Example: an employee's basic information record is not yet available for extraction, but admission information is already available in the transaction tables. This is a typical transaction-ordering problem at source systems.
• Custom Calculations. Custom calculations may be needed in the warehouse. They can be performed during transformation and stored directly in the warehouse tables for query performance, or computed while executing the query, but the latter is a time-consuming operation.
• Data mapping. If source system objects cannot be mapped directly to the corresponding objects in the staging schema design, the required transformation must be done as part of the extraction process before moving the data to staging. For example, if the source contains a customer's birth date but you want the age stored in a staging table, the transformation process takes care of the conversion as it loads the staging database.
• Hierarchical sources. This concerns entity identification at different levels of the source hierarchy. For example, suppose that a central server keeps track of all transaction and lookup tables needed by the warehouse, and that the data stored on that central server is not real time. When you extract data from both the central server and real-time child sources, you may have duplicated data that needs to be managed in a transformation process.
• Record tracking. If there is no record tracking at source, then identifying the Insert, Update, and Delete changes would be part of the data transformation process.
Transformation process can be avoided
The transformation process is not mandatory in the data warehouse loading process; it can be eliminated in the following scenarios. Eliminating the transformation layer reduces the performance overhead of loading data into the warehouse, so deciding whether transformation is really required benefits total processing time.
The transformation layer can be eliminated when only simple data modifications are involved and there is no need for the following:
• Record tracking. Typically, if there is no record tracking mechanism and no transaction identifying mechanism at the source, the same is accomplished in the transformation layer: all records are extracted in each extraction run and compared against the warehouse or accumulated staging to determine what kind of transactions happened since the last extraction. If, however, the extracted records carry a transaction code and the process can effectively bring only the delta records (those inserted, deleted, or updated since the last extraction), the transformation can be eliminated, because the transactions can be applied directly to the warehouse.
• Data cleansing. If there are no data quality issues and the source already has constraints to keep data consistent and accurate (PK, FK, Check constraints or procedural constraints) then there is no need for separate transformations before loading the data to warehouse.
• Merging source systems. If there is no need to merge data from multiple source systems, multiple sources are not involved in accumulating data at a common place, and the source already eliminates duplicates and enforces referential integrity, a separate transformation layer is not needed for merging.
• Complex decoding. Source systems often embed codes that are interpreted or rendered by the source application. Before these codes are loaded into the warehouse, they must be converted to user-understandable values. If the decoding process is not complex enough to warrant a separate transformation layer, the layer can be eliminated.
• Complex interpretation of default values. If the default values are just codes from lookup tables or simple codes (such as 1 for male and 2 for female), they can be converted to meaningful words while extracting the data; there is no need for a separate transformation process after extraction unless a complex process is required to interpret the default values. Default values are typically used when the column does not allow nulls, the value is not appropriate for the column, the entry is incorrect, or the value is not yet available.
• Custom calculations. If there are no complex calculations involved before loading the data into the warehouse, the transformation layer can be eliminated; simple arithmetic calculations can be incorporated into the extraction procedures.
Transformation Process
Transformation process involves the following steps at a conceptual level:

1. Study the source system
2. Location of Transformation Process
3. Validate Data Extract (Selection)
a. Delta Detection
b. Continuity of Delta
c. Count Check
d. Capture bad records
4. Mapping
a. Data mapping document preparation
5. Reformat
a. Conversion
6. Data Cleansing
a. Data Quality checks
7. Splitting/Joining
8. Integrate
a. Summarization
9. Enrichment
10. Applying Business Rules

Study the source system
The source system, which is the data source for the extraction process, must be studied from a business domain point of view as well as for its structure and integrity. The source system should not be assumed to be a relational data store; it can be flat files or a database other than SQL Server. This section lists the questions (which can be treated as a checklist) that need to be discussed with the source system (OLTP) database architect.

Studying the source system is not a one-time activity; it should be performed whenever:
• Extraction process is enhanced
• Source business process is changed
• Source schema is changed
• New business functionality is to be extracted
• Existing data checks/constraints on the source system are modified

The checklist to be discussed is given below.

• Check how data is fed into the data source from which extraction is performed: by external applications, flat files, or heterogeneous systems
a. This helps you understand the complexity to deal with when interacting with the source system
b. Typically, structural complexity increases when you lack details of the source system or have little control over it
c. If multiple data sources are involved in extraction, more than one extraction process is needed to extract the data, and consolidation complexity increases
d. Knowing whether data is fed into the source system by an application lets you check whether the data input screens accept free text, where data pollution may occur

• Discuss the data consistency and data quality checks performed on the source system
a. Map your warehouse requirements against the checks done at the source to see whether those checks completely stop polluted data from getting into the source system
b. As part of learning the source system, decide at what level of detail and how much time you are going to invest in learning it. You may or may not have the luxury of learning the source system end to end (all the layers from UI to database schemas and all the dependencies). Sometimes it might be as simple as evaluating the schema; sometimes you need to understand how users enter data

• Check the integrity constraints managed in source system
a. Check whether constraints on the source system are enforced declaratively or through procedures. If the checks are enforced using procedures, check the possibility that data slipped through and caused data pollution
b. If the data checks are enforced properly, they do not need to be checked again as part of the transformation process

• Verify whether the data will be consistent at the time of data extraction, or at a given point in time
a. In some systems, processing is completed only at the end of the day; extracting data in the middle of the day would make the data inconsistent and lead to data quality problems
b. For example, consider a check transfer: if the check is processed at the end of the day, the transaction is complete and available at the source, but data extracted at 5 o'clock still shows a pending transaction. Consider the completeness of the transaction

• Verify the accuracy of the calculations performed on the source system
a. Though it is evident that any problem in source system calculations would already have been identified and fixed, the question is about unused calculations whose problems have not been identified yet but that need to be extracted for analysis requirements.
Example: source applications might not be using the employee leave balance, so calculation problems, if any, are not identified. Extracting these calculations into the warehouse would introduce incorrect values. All calculations that are extracted from the source system should be checked for accuracy.

• Discuss the columns that allow NULL on the source system
Nullability of a column should be decided by the source system; when there is no expectation of NULL values from the source, allowing nullability on a column that never receives null values is an overhead for the table.

a. Even if the source schema has columns that allow NULLs, it does not mean that the data will be null, because nullability may be enforced at the application level. Consider only those columns whose data may actually come up as NULL
b. This helps you decide whether or not to perform a nullability check on the source columns

Source     Business Rule    Is NULL Check Required?
NULL       NOT NULL         Yes
NOT NULL   NULL             No
NULL       NULL             No

• Discuss the procedures followed when schema changes are made on the source system and how the dependent systems are analyzed
a. Whenever a schema change in the source system involves objects that are extracted, there is an impact on the extraction process
i. If a column is added, it generally will not cause extraction failure unless “Select *” is used
ii. If a column is deleted that is accessed in extraction, then the extraction will fail
iii. If a column name is modified, then it will fail the extraction if used
b. Any changes in the source schema must be analyzed for impact across dependent systems such as extraction process


• Discuss whether there are columns, tables, or other database objects on the source system that may be optional from time to time
a. Optional columns and optional tables must be taken care of when issuing the extraction query to the source
b. The existence of such a column or table should be checked, and the source database queried accordingly
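A minimal T-SQL sketch of such an existence check is shown below; the Emp table and its optional LeaveBalance column are hypothetical names used only for illustration.

-- Check whether the optional column exists before building the extraction query.
IF EXISTS (SELECT 1
           FROM INFORMATION_SCHEMA.COLUMNS
           WHERE TABLE_NAME = 'Emp'
             AND COLUMN_NAME = 'LeaveBalance')
BEGIN
    SELECT EmpId, EmpName, LeaveBalance FROM Emp;
END
ELSE
BEGIN
    -- Column is absent in this source; keep the staging shape stable with a NULL placeholder.
    SELECT EmpId, EmpName, NULL AS LeaveBalance FROM Emp;
END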

• Check the compatibility of staging schema with source system
a. The staging database schema should be in line with the source system in terms of data type, precision, and scale so that the data fits appropriately, unless simple transformations such as splitting or concatenation are done on a particular column

Location of Transformation Process
Once the need for transformation is decided, the next question is where to perform it. Any of the following techniques can be considered:

1. Transformation as part of Extraction Process
2. Transformation as a separate layer
3. Transformation as part of Loading Process
4. Transformation in memory

Transformations as part of the Extraction Process: In general, it is best to do filtering types of transformations whenever possible at the source. This allows you to select only the records of interest for loading into the warehouse. Ideally, you want to extract only the data that has been changed since your last extraction. While transformations could be done at the source operational system, an important consideration is to minimize the additional load the extraction process puts on the operational system. Simple transformations such as decoding the attribute (splitting), upper/lower case operations can be performed as part of extraction process.

Transformation as a separate layer: Transformations can be done in a staging area prior to loading the data into the data warehouse. When data is extracted from a single data source, the transformation can be done as part of extraction, depending on its complexity; but when multiple data sources are involved, the transformation should happen where the multi-source data is consolidated, and that is where a separate structure and data store called staging comes into the picture. If your incoming data is in a flat file, it is probably more efficient to finish the staging processes prior to loading the data into the warehouse. Transformations that require sorting, sequential processing, and row-at-a-time operations can be done efficiently in the flat file staging area. When transformations are done in SQL Server and the data can be processed in bulk using SQL set operations, they can be done in parallel. If some of the data needed for the transformation is already in the warehouse, it may be more efficient to do the transformation in the warehouse. For example, key lookups to obtain the surrogate key from the dimension table can be performed efficiently in staging tables in the database.

Transformations as part of the Loading Process: Some simple types of transformations can be done while loading the data into the warehouse; for example, changing the case of a character column to uppercase. This is best done when a small number of rows need to be transformed, typically the delta records. Take precautions when doing complex transformations as part of the loading process, because they acquire locks on the warehouse tables and can cause a performance impact on analysis queries.

Transformations in memory: In-memory transformations are a better option for large and complex transformations with no latency between the extraction and loading processes. With this technique, the extracted data is first loaded into memory (supported by appropriate hardware sizing) and a series of transformations is applied to it using a set of transformation objects such as data cleansing, derived column, mapping, merging, and splitting. SSIS supports transformations of this type with a number of objects that can access data directly in memory and perform the transformation. However, this technique is not very useful if staging is used as a transfer area, for example when extraction is performed at 9 AM but transformation and loading are performed only at the end of the day. Until then, the data has to sit in the staging area, so using in-memory transformations in this case means landing the data after extraction and then loading it into memory again.

Validate Data Extraction (Selection)
Delta detection:
This is the starting point of data transformation. It is also called the extraction process, which collects the delta or the full set of records. Extraction from a data source is explained in detail in the extraction guide. Based on the record tracking mechanism available at the source, either the complete set of source records or only the records changed since the last extraction is extracted. Most data transformation processes need the complete set of lookup records and only the added/modified/deleted transaction records (since the last extraction) to perform decoding and lookup operations; without this, the process has to depend on accumulated staging or warehouse data.

Continuity of Delta:
Ensure there are no records missing between two subsequent delta extractions. For example, if delta detection is based on a modified date-time, the end of the first delta window should be the beginning of the next. This makes sure that no records slip through between two deltas and cause data loss.
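A minimal T-SQL sketch of modified-date-based delta detection with a continuous watermark is shown below. The Orders source table, the Staging_Orders table, and the ExtractionWatermark control table are hypothetical names used for illustration.

DECLARE @LastExtract DATETIME, @CurrentExtract DATETIME;

-- The end of the previous delta becomes the start of this one (continuity of delta).
SELECT @LastExtract = LastExtractedAt
FROM ExtractionWatermark
WHERE TableName = 'Orders';

SET @CurrentExtract = GETDATE();

-- Pull only the records changed within the current window.
INSERT INTO Staging_Orders (OrderId, CustomerId, Amount, ModifiedDate)
SELECT OrderId, CustomerId, Amount, ModifiedDate
FROM Orders
WHERE ModifiedDate >= @LastExtract
  AND ModifiedDate < @CurrentExtract;

-- Advance the watermark so the next delta starts exactly where this one ended.
UPDATE ExtractionWatermark
SET LastExtractedAt = @CurrentExtract
WHERE TableName = 'Orders';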
Count Check:
Ensure that all the records are extracted as expected. This requires comparing the number of records that were extracted with the number of records that were expected to be extracted: the first is the number of records in the staging tables after loading, and the second should be obtained by querying the source with a count(*) statement after extraction. This check should be performed on each extracted table. If the counts mismatch, declare the extraction process failed and restart it for all tables whose record counts do not match the expected count.
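The count check can be automated along the same lines; the sketch below reuses the hypothetical Orders and Staging_Orders names from the previous example, and the date window shown is only a placeholder for whatever filter the extraction actually used.

DECLARE @ExpectedCount INT, @StagedCount INT;

-- Expected: rows in the delta window at the source (same filter as the extraction query).
SELECT @ExpectedCount = COUNT(*)
FROM Orders
WHERE ModifiedDate >= '20080806' AND ModifiedDate < '20080807';

-- Staged: rows that actually landed in the staging table.
SELECT @StagedCount = COUNT(*) FROM Staging_Orders;

IF @ExpectedCount <> @StagedCount
    RAISERROR('Extraction failed for Orders: staged %d rows, expected %d.', 16, 1,
              @StagedCount, @ExpectedCount);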

Capture bad records:
Though the records are extracted from the source, it cannot be assumed that all of them will successfully be loaded into staging. Design the staging load procedure to:
• Fail when at least one of the records could not be moved to staging, mostly because of incorrect data content or data type incompatibility
• Capture the bad records and let the good records be processed through transformation
• Not discard the bad records; analyze them and take action to minimize data loss

Data mapping document preparation
Prepare a data mapping between the source and the warehouse, including the intermediate transformations that carry business logic and data corrections. If a field is derived, describe the logic for the derivation. Typically, the data mapping is a spreadsheet containing each source table, source column, data type, and business meaning for the source fields, and the data warehouse table, column, data type, business meaning, derivation logic, and comments for the warehouse fields. The mapping document is the source for implementing the transformations and loading the warehouse. The data mapping document is an enhanced version of the data dictionary, and it evolves after the warehouse is data modeled.
Conversion
This step converts the source-extracted tables into their business meaning after mapping but before loading into the warehouse. As one example, more than one source table may hold information about a bank transaction, and none of these transaction tables individually can give the balance of a customer's account; conversion calculates the account balance before mapping to the warehouse. As another example, it is very difficult to aggregate non-additive values such as interest rates. If you weight the non-additive value (for example, dollar-weighting the interest rate) and also store the value of the weights, you can then easily calculate the appropriate weighted average rate at any level of summary. Conversion standardizes values and makes them easier for end users to understand in reports.
This step also includes reformatting the incoming data to fit the schema of the target. For example, the incoming data may have first name and last name in one column (first name + last name) while the target has them as two different columns; after splitting the data, make sure that the individual name columns fit the target column data types exactly.

Data Quality checks
Data quality checks correct redundancy, duplication, orphaned records, and nullability problems. Data quality is explained further later in this guide.

Splitting/Joining
Splitting and joining derive the result set from a set of tables. It is generally an intermediate step before loading to the warehouse, because the source and warehouse schemas are not one to one. Splitting is the process of including only selected columns from a table and loading them into more than one object in the warehouse. Merging is the process of combining data from multiple sources, and it can populate one or more warehouse tables from more than one source table.
Summarization
The level of granularity of the source and the warehouse may not be the same; warehouse data is at the same or a higher granularity than the source data. Because of this, source-extracted data is summarized to the granularity of the warehouse as part of transformation before loading. As a rule of thumb, if the warehouse analysis/reports do not need the lower granularity, summarize the extracted data to the higher grain needed by the warehouse reports before loading. Keep in mind that the lower the granularity of the data in the warehouse, the greater the disk space usage and data volume, which may create a performance impact.
Example: Suppose that a fact table with daily data has one million rows per week, but that users never need daily data in reports. By pre-aggregating the data, you only need to store 200,000 rows per week. The savings from pre-aggregation are even greater when the difference in granularity is greater. For example, aggregating records from ten million customers into ten customer groups can reduce the size of the fact table by nearly a million-fold. Of course, pre-aggregating the fact table also limits the amount of detail you could drill to in a report.
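A minimal pre-aggregation sketch is shown below, assuming a hypothetical Staging_SalesDaily table at daily grain feeding a weekly-grain FactSalesWeekly table; the table and column names are illustrative.

INSERT INTO FactSalesWeekly (SaleYear, SaleWeek, ProductKey, TotalQuantity, TotalAmount)
SELECT DATEPART(YEAR, SaleDate) AS SaleYear,
       DATEPART(WEEK, SaleDate) AS SaleWeek,
       ProductKey,
       SUM(Quantity)   AS TotalQuantity,   -- additive measures roll up safely
       SUM(SaleAmount) AS TotalAmount
FROM Staging_SalesDaily
GROUP BY DATEPART(YEAR, SaleDate), DATEPART(WEEK, SaleDate), ProductKey;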
Enrichment
Enrichment is the process of rearranging or simplifying individual fields.
You may use one or more fields from the same input record to create a better view of the data.
One or more fields may also originate from multiple records, resulting in a single field for the data warehouse.
Applying Business Rules:
At this stage the incoming data is cleaned, formatted, and mapped for loading to the target, but it still lacks the business rules that determine whether the data meets the analysis requirements. For example, a business rule might be to consider the balance from the checking account if the employee has a salaried account. This is not the place to verify that all the calculations required for the warehouse are computed; it only needs to load all the parameters required for the business calculations. For example, checking year-over-year (YOY) figures is not part of this phase, but the granular data required for calculating YOY must be loaded into the warehouse.
Transformation Techniques
The following table is an overview of general transformation problems and the solutions that resolve them. The solutions are explained in detail in later sections.

Problem – Solution(s)
Encoded attributes at source – Use split/join
Multiple data sources – Merge
Heterogeneous data sources – Load staging to flat files; map using column names or column order
No delta detection – Detect using accumulated staging or warehouse joins
Incorrect schema data types – Format revisions
Data loss due to joins – Avoid joins at extraction; perform them as part of transformations
Orphan records – Reject tables; inferred entries to dimensions
Granularity of warehouse is higher – Summarization
Sources are in a hierarchy – Central location; merging; owning part of the source
Source does not have integrity constraints – Avoid filtering at source in extraction; use data quality techniques
Source does not have transactions – Check for orphans; check for data integrity in the transformation process
No transaction codes in extracted content – Detect using accumulated staging or warehouse joins
Extraction process problems – Check the extracted record count against the expected count; a single failure should fail the complete extraction process; capture the bad records and process the rest, or do not process the rest until all are resolved

Data Quality
Data quality is one of the major challenges of the transformation process. It involves a number of steps such as scrubbing, cleansing, de-duplication, nullability, data types, and orphan handling. Data quality is explained in further detail in the Data Quality section of this chapter.

Splitting
The incoming data from a data source can be encoded, with more than one data field concatenated to represent a single attribute. This must be recognized, studied, and split as part of the transformation process before loading into the appropriate warehouse tables. Combined and coded attributes in the incoming data are a challenge because they require a study of the number of values present in the encoded string and a procedure to split them, as well as business domain study to see the impact.
The following example gives an idea of how the data looks before and after the splitting process.

EmpDepartment (as in Source system)
EmpDepartId EmpName Location
1011A123 John Bellevue
1023A324 Dave Redmond

EmpDepartment (after decoding)
EmpId EmpName DepartmentId Location
1011 John A123 Bellevue
1023 Dave A324 Redmond
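A minimal T-SQL sketch of this split is shown below, assuming the encoded EmpDepartId always consists of a 4-character employee id followed by a 4-character department id, as in the rows above; the source and staging table names are illustrative.

INSERT INTO Staging_EmpDepartment (EmpId, EmpName, DepartmentId, Location)
SELECT SUBSTRING(EmpDepartId, 1, 4) AS EmpId,         -- leading employee id
       EmpName,
       SUBSTRING(EmpDepartId, 5, 4) AS DepartmentId,  -- trailing department code
       Location
FROM Source_EmpDepartment;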



Custom Measures
Computing custom measures from the same row is not complex, but a computation that involves multiple tables, or tables from multiple sources, is a complex transformation process. Such computations cannot be performed until all the required tables have been extracted. The following example gives an idea of the process of creating a custom measure based on multiple tables:

EmpSalary
Employee EmpName Salary
1011 John 7200
1024 Lesley 2030

EmpAbsentee
Employee Absentee Type NumberOfDays
1011 Personal 67
1024 Personal 23

In this case, you want to add EligibleSalary to the EmpSalary table. Here, EligibleSalary is a computed column required by the warehouse, and it is computed as part of the transformation process; it is derived from the NumberOfDays column in the EmpAbsentee table. The EmpSalary table after the transformation would look like this:

EmpEligibleSalary
Employee EmpName Salary EligibleSal
1011 John 7200 7568
1024 Lesley 2030 2109

The complexity here is that the eligible salary cannot be computed until the EmpAbsentee information is also extracted. Employee absentee information might be extracted from one data source (HR) and employee information from another (Emp).
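A minimal sketch of such a cross-table derivation is shown below. The proration formula is purely illustrative (it is not the rule that produced the figures above); the actual calculation would come from the business requirements, and the join is only possible once both EmpSalary and EmpAbsentee have been extracted.

SELECT s.Employee,
       s.EmpName,
       s.Salary,
       -- Hypothetical adjustment based on absentee days; replace with the real business rule.
       s.Salary * (365 - a.NumberOfDays) / 365.0 AS EligibleSal
FROM EmpSalary AS s
INNER JOIN EmpAbsentee AS a
    ON a.Employee = s.Employee;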


Merging
Merging is a complex activity, particularly when information about the same entity comes from multiple data sources, typically a branch database as well as a corporate database. There are two basic types of merging scenarios:
Constructive merging
Destructive merging

Constructive Merging: In constructive merging, both conflicting records are loaded into the warehouse without any data transformation, in sequence of modification date. This situation mostly occurs when two branch offices update the same record. The following example details the constructive merging process.

Employee table from Department A
Employee EmpName Department Department Loc Modified Date
1011 John A123 Bellevue 3/7/2006 11 PM
1024 Lesley 2030 Redmond 2/5/2005 7 AM

Employee table from Central location
Employee EmpName Department Department Loc Modified Date
1011 John A123 Seattle 3/8/2006 10 PM
1024 Lesley 2030 Redmond 2/5/2005 7 AM

When data is transformed using Constructive merging the resultant table would look like below

Employee table (after transformation)
Employee EmpName Department Department Loc Modified Date
1011 John A123 Bellevue 3/7/2006 11 PM
1011 John A123 Seattle 3/8/2006 10 PM
1024 Lesley 2030 Redmond 2/5/2005 7 AM
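A minimal constructive-merging sketch is shown below: both conflicting versions are kept, and identical rows (such as the Lesley record) collapse to one. The staging and warehouse table names are illustrative.

INSERT INTO Warehouse_Employee (Employee, EmpName, Department, DepartmentLoc, ModifiedDate)
SELECT Employee, EmpName, Department, DepartmentLoc, ModifiedDate
FROM Staging_Employee_DeptA
UNION                               -- UNION (not UNION ALL) drops exact duplicates
SELECT Employee, EmpName, Department, DepartmentLoc, ModifiedDate
FROM Staging_Employee_Central;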

Destructive Merging: This type of merging resolves the conflict by considering the following factors.
In a multi-source system where the sources share common business tables, the same set of records may be updated at more than one source. This introduces record update tracking problems: a single record may be updated at multiple sources, and the updated value is not uniform across sources.
Scenario 1# (Duplicate information from multiple data sources)


When data is extracted from multiple sources, the same business key in the same table may come up with two different values. The following guidelines provide some insight into possible solutions.
• Prioritizing the Source: Prioritizing the sources eliminates the multi-source update problem. Always take the updates from the higher-priority source and apply them to the warehouse.
• Modification Date: Extract the modified date of the record from each source and apply only the latest update to the warehouse.
• Deferred De-duplication: Move the data to the warehouse as it comes from each source, then keep the latest update and mark the other updates as inactive (see the sketch after this list).
• Central data extraction: Extract data from the central site instead of individual child sites. Have the multi-source data collected and consolidated at the central site, and extract from there.
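For deferred de-duplication, a minimal sketch using ROW_NUMBER (available from SQL Server 2005) is shown below: the most recent update per business key stays active and the rest are marked inactive. The Warehouse_Employee table and IsActive flag are illustrative.

WITH Ranked AS
(
    SELECT Employee,
           ModifiedDate,
           ROW_NUMBER() OVER (PARTITION BY Employee
                              ORDER BY ModifiedDate DESC) AS rn
    FROM Warehouse_Employee
)
UPDATE w
SET w.IsActive = CASE WHEN r.rn = 1 THEN 1 ELSE 0 END
FROM Warehouse_Employee AS w
INNER JOIN Ranked AS r
    ON r.Employee = w.Employee
   AND r.ModifiedDate = w.ModifiedDate;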

Scenario 2# (Multiple attributes; some shared)
In this scenario, different sets of attributes of the same entity are extracted from multiple data sources. For example, as shown below, employee information comes from both Finance and HR. The challenge here is combining these attributes. The simple approach is to have the additional attributes in the resulting transformed table, as shown in the diagram below. Make sure that the common attributes are loaded into one set of columns in the result table, and that they (the common attributes) do not differ across tables/data sources.


Scenario 3# (Conditional Merging)

In this scenario, not all the fields are considered for warehouse loading; typically a SQL query is used to extract only the required records satisfying a filter condition. This is termed merging because the resulting data comes from both tables: the original tables are not part of warehouse loading, but the merged result set is used for loading. In the given example, both the employee salary table and the employee basic information table are used as the source to produce the result set with the following query:

Select e.EmpId, s.Salary, e.Name, e.Title from EmpSalary s inner join Emp e on e.EmpId = s.EmpId where s.Salary > 9000 and e.Age < 30

The resulting data is then moved to the warehouse. Creating the temporary location for this result set is part of the transformation; performing the query as part of the loading process instead would be more complex than storing the result in temporary tables.



Scenario 4# (Source ownership)
In this scenario, two data sources have the same data content, but each source system owns part of the content, and content owned by one source is not changed by the other. For example, SSN records are stored in all the state databases, but each state is allowed to modify only the records pertaining to that state: Bob2 in the VA database cannot be modified by any state other than VA, and the same applies to WA records, because each state owns the accuracy of its content. To merge these records, consider only the owned content from each state instead of bringing in all the records of each state; this should typically happen at the time of extraction. Also consider whether there is data that is not owned by any particular state, or data that is owned by multiple states. If there is data that is not owned by a particular state, consider extracting it from any one state (not from all states). If data is owned by multiple states, the situation is similar to Scenario 1 (described above).



Data Type Discrepancy
The data type at the source system may differ from that of the staging system; it is not necessary for staging data types to match those of the source. Consider a source attribute that is split by the source query into multiple attributes: the staging schema then differs from the source, and so do the data types. Data type selection for staging is based on:
• If all the attributes are mapped to staging as they are
• If there is any transformation performed as part of extraction
• If the extraction query combines attributes from more than one table
Also, it is not necessary for the warehouse data types to match those of the staging database; a data type conversion must happen before loading the data into the warehouse. Consider a source system that stores dates as a character type: defining the staging column as a date-time field may cause a record to be rejected if the source date is not in the exact date format. So the staging column data type may remain character, and a transformation process checks and converts the data to a valid date before loading it into a date-time column in the warehouse.
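A minimal sketch of this kind of conversion check is shown below, using ISDATE (available in SQL Server 2005/2008) to validate character dates before they reach a DATETIME warehouse column; the table names are illustrative.

-- Load only the rows whose text dates convert cleanly.
INSERT INTO Warehouse_Orders (OrderId, OrderDate)
SELECT OrderId, CONVERT(DATETIME, OrderDateText)
FROM Staging_Orders
WHERE ISDATE(OrderDateText) = 1;

-- Route the rows that fail validation to a reject table for later analysis.
INSERT INTO Reject_Orders (OrderId, OrderDateText, RejectReason)
SELECT OrderId, OrderDateText, 'Invalid date format'
FROM Staging_Orders
WHERE ISDATE(OrderDateText) = 0;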

Timely lookup records availability
Typically the Extraction process does not guarantee that all the records have been retrieved. One important type of error is when a transaction record is extracted but the corresponding lookup table record is not extracted. You must take care that these records do not become orphaned and dropped from reports. Following are some of the causes for missing lookup table records:
• No Source System Referential Integrity: The source system does not ensure that a transaction cannot be added unless the corresponding lookup table has the lookup entry. This can happen if there is no referential integrity constraint between the lookup tables and the transaction tables.
• Deferred updates: The source application may log the transaction before creating the necessary lookup records. If an extraction process runs after the transaction record is added to the database but before the primary table record is added, it creates an orphaned transaction record with no entry in its corresponding primary table.
• No transactions: The source system does not ensure that primary table values are added before rows are added to the transaction tables, and there is no declarative foreign key relationship from the primary tables to the transaction tables. This happens when inserts into both the lookup and transaction tables are not implemented as an atomic transaction.
• Multiple levels: Transaction records are brought from a lower level of the hierarchy while lookup entries are extracted from the parent level, but lower-level entries take time to replicate up the hierarchy. This causes data inconsistency if extraction happens after transactions are added at the lower level but before the parent-level server is updated with the new data.
• Extraction process: The extraction process does not ensure that all the required tables are extracted, capture errors, and fail the extraction accordingly. Because of this, a problem extracting the lookup tables combined with a successful extraction of the transaction tables still declares the extraction a success.
• Dirty reads: The extract process ignores the transaction state
Following are some of the valid solutions to take care of problems of this nature:
• Usage of Reject tables: Collect the orphaned transaction records and store them in a temporary store. When the corresponding lookup entry arrives from the source in a later extraction, process the orphaned transaction records from the temporary store created for this purpose.
• Inferred dimensions: Add the transaction records to the fact table, and also add a corresponding record to the dimension table with no descriptive information. When the lookup details become available in a later extraction, the inferred dimension member is updated.
• Unknown Member: Create a special member of each dimension as the Unknown Member. When a foreign key lookup fails, assign the Unknown Member as the foreign key. Periodically attempt to re-process all records that contain an Unknown Member key as you would with a Reject Table.
• Add Referential integrity on source: As source systems are generally normalized, create the referential integrity between all tables particularly between look up table to the transaction tables.
• Define success of extraction: Do not process any foreign-key tables until the primary-key table extract has completed successfully.
• Custom Calculations: Custom calculations are performed as part of transformation so that they are stored in the warehouse as fact measures, which keeps report queries from having to compute them at presentation time.
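A minimal sketch of the inferred-member approach is shown below, assuming a hypothetical DimEmployee dimension with an IsInferred flag and a Staging_EmpAbsentee transaction feed; the names are illustrative.

-- 1. Create placeholder dimension members for transaction records whose lookup
--    entry has not arrived yet.
INSERT INTO DimEmployee (EmployeeBusinessKey, EmpName, IsInferred)
SELECT DISTINCT t.Employee, 'Unknown', 1
FROM Staging_EmpAbsentee AS t
WHERE NOT EXISTS (SELECT 1
                  FROM DimEmployee AS d
                  WHERE d.EmployeeBusinessKey = t.Employee);

-- 2. When the real lookup record arrives in a later extraction, overwrite the
--    placeholder and clear the flag.
UPDATE d
SET d.EmpName = s.EmpName,
    d.IsInferred = 0
FROM DimEmployee AS d
INNER JOIN Staging_Employee AS s
    ON s.Employee = d.EmployeeBusinessKey
WHERE d.IsInferred = 1;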


Data type inconsistency
• Data type: Data type inconsistencies include mapping a character column to an integer, a date-time column to an integer, or any other incompatible pair of data types. It is suggested to keep both sides of the mapping at the same data type, which means the staging schema should adopt the data types of the source system. With heterogeneous systems, a cast may be necessary if each system uses a different data type for the same staging table column. General data type problems originate from:
o Mapping NCHAR/NVARCHAR to CHAR/VARCHAR
o Mapping DateTime to SmallDateTime
o Mapping INT to SMALLINT
o Mapping BIGINT to INT/SMALLINT/TINYINT
o Mapping Binary column to non Binary
• Nullability: Improper nullability rules can lead to data errors, while avoiding unnecessary constraints can improve performance; a NOT NULL constraint can cause processes to fail in the case of an outer join.
If the source column is not nullable, the staging schema does not need to make it nullable.
If the source query has a left outer join with another source table, which may produce NULL values for the column, make the staging column nullable even if the corresponding column is not nullable at the source.
If the staging table data is extracted from multiple source systems and any of them defines the corresponding column as nullable, the staging table column should also allow null values.
If the source table has a primary key defined on a column, the corresponding column in staging does not need to be nullable (this does not apply if the staging table uses the same column as its primary key).
If the source records are filtered on a column for null values, the corresponding staging table column can be declared NOT NULL.
• Data type length: Incorrect data type lengths may truncate the data and cause inconsistencies, such as a char(10) column mapped to a char(5) column; mapping DATETIME to SMALLDATETIME loses granularity. Most systems warn the extraction process, but the warnings are typically ignored by the package and the user. Make sure to use correct data type lengths in staging, matching the source data type lengths.

Heterogeneous Data Stores
Excel Mapping:
When data is coming from multiple systems, ensure that either the column names or the column order is enforced to be the same. Example: if the source data store is an Excel sheet, then while importing it, use either the order of the columns or the names of the columns to map to the corresponding staging table or SSIS object. There are two ways to map the incoming data stream to the target (staging) system:
By Column Name: In the example above, if the data is provided by customers, make sure that the required schema is shared with them. It does not need to follow the column order. If you agree to follow column names, the extraction query can be
Select F1, F2 from $Worksheet
By Column Order: If you agree on column order, the query can be given with the order of columns, such as Select Column1, Column2 from $WorkSheet. It does not need to follow column names in this case.
It is particularly important to ensure schema compatibility when the sources are heterogeneous systems. It is not a problem if there is only one source, because the mapping is fixed when coding the extraction object.
Excel has a very limited number of data types. You will probably need to cast the data types to make them usable—particularly if using SSIS to import the data.

Flat file mapping:
Flat files typically do not have any column names, particularly if the source data was exported using the BCP utility. When importing into the staging database using BCP or BULK INSERT, make sure that the mapping is done properly, because the process does not give any error as long as the target data type is compatible with the data.
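A minimal sketch of such a flat-file load with BULK INSERT is shown below; the staging table name, file path, and delimiters are illustrative and must match whatever the BCP export actually produced.

BULK INSERT Staging_Employee
FROM 'C:\etl\employee_extract.txt'
WITH
(
    FIELDTERMINATOR = '|',   -- column delimiter used by the export
    ROWTERMINATOR   = '\n',  -- one record per line
    TABLOCK                  -- take a bulk table lock for the duration of the load
);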
Transformation Guidelines
Consider the following guidelines, which help build an efficient transformation process.
Consider extracting table by table with no joins
Keep the data extraction process so that it extracts one source table into one staging table, rather than using complex joins to pull data from multiple tables. Filtering rows at extraction can eventually lead to data loss and auditing problems. Each table, with an optional filtering WHERE clause, should be extracted as one single object. Merging from different tables, such as lookups against master tables or other cross-table reference joins, should be avoided; these operations should be performed as part of the transformation process. This technique minimizes the overuse of source resources and also the extraction duration.
Do not filter the transaction records based on non availability of Lookup record
The records should not be filtered out at the source just because the corresponding lookup entry is not present. Bring the transactional record into staging and then decide whether to add the lookup entry as an inferred dimension member or to process the transformed record later, when the lookup entry arrives in a subsequent extraction. Records filtered at the source because of referential integrity may not be picked up by later extractions, causing undesirable data loss.

Transformations may be done during the load process
Simple transformations such as the following can be done during either the extraction or the loading process if there are no other transformations that must be done on the staging server.
• Change Case: Changing case (upper, lower) or simple string conversions can be performed while extracting the data, as they have little impact on source resources.
• Simple data representations: Simple codes, such as 1 for Male and 2 for Female, can be converted so that descriptive fields, rather than the codes maintained in the source system, are sent to the warehouse.
• Default values: Default values can be understood by source systems, but the warehouse may have interpretation problems, because the end user deals directly with the data instead of going through the source application on the OLTP system. Example: with 99999 used as a zip code, the source system may recognize it as a default value and convert the field to an empty field on the interface, but keeping this value in the warehouse would confuse end users; converting it to "Not Available" would be more appropriate (see the sketch after this list). When converting the value in a numeric field to a string, be sure that no arithmetic calculations are performed on the field.
• Simple Arithmetic Calculations: Arithmetic calculations, such as adding two source measures to derive one fact measure, and other simple custom measure calculations can be accomplished while loading the data, without a separate transformation process or store.
• Filtering Extracted Records: If the transformation involves filtering unwanted records before they go to the warehouse, such as restricting the warehouse to one particular department or one set of employees, the filtering can be performed while loading the data without a separate transformation process.
• Business rules: Simple business rules, such as flagging employees who have children or books whose publishers are in a particular country, can be applied while loading the data without a separate transformation process.
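A minimal sketch of these simple load-time conversions is shown below, decoding a gender code and translating a sentinel default value with CASE expressions; the table and column names are illustrative.

INSERT INTO Warehouse_Customer (CustomerId, Gender, ZipCode)
SELECT CustomerId,
       CASE GenderCode
           WHEN 1 THEN 'Male'
           WHEN 2 THEN 'Female'
           ELSE 'Unknown'
       END AS Gender,
       CASE WHEN ZipCode = '99999' THEN 'Not Available'   -- sentinel default at the source
            ELSE ZipCode
       END AS ZipCode
FROM Staging_Customer;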
Though it is up to the designer whether to have staging or not, these suggestions show that eliminating the staging step, where appropriate, can expedite the loading process and optimize the usage of CPU cycles, memory and, more importantly, disk space.
Study the dependent objects
In transformation, take the dependent objects into consideration when deciding on parallelism; for example, do not build intermediate tables (if required) until all the tables they depend on have been transformed.
Example: Lookup tables should be transformed before the transaction tables are transformed.

Estimate initial database size based on the volume of data
Increasing the file size at run time is a time-consuming operation.
While creating the database, configure it to your estimated maximum size rather than leaving it to SQL Server to increase the size automatically.
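A minimal sketch of pre-sizing a staging database is shown below; the database name, sizes, paths, and growth increments are illustrative and should come from your own volume estimates.

CREATE DATABASE StagingDB
ON PRIMARY
(
    NAME = StagingDB_Data,
    FILENAME = 'D:\SQLData\StagingDB.mdf',
    SIZE = 50GB,        -- estimated maximum data size
    FILEGROWTH = 5GB    -- large increment if the estimate is exceeded
)
LOG ON
(
    NAME = StagingDB_Log,
    FILENAME = 'E:\SQLLogs\StagingDB.ldf',
    SIZE = 10GB,
    FILEGROWTH = 1GB
);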
Storing processing performance information and benchmarking data in the database.
Saving benchmarking data in the database helps pinpoint performance problems by establishing baselines of mean/median run times. This helps the team focus on tuning opportunities, gives direction on things such as hardware load balancing, troubleshooting, and SLA expectations, and facilitates better maintenance practices for your system.
Save ETL, validation and processing errors in shared database tables
All errors should be trapped, consolidated, and sent to one place: your metadata repository. This means that any errors that occur in the domain of the ETL tool are logged together with any errors encountered in the post-ETL load process, whether from loading the operational data store (ODS) or building the online analytical processing (OLAP) cube. It is important to establish error thresholds for each process in the data warehouse, as well as the actions to take when those thresholds are reached. This is usually one area where requirements gathering falls short.
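A minimal sketch of such a shared error table is shown below; the column choices are illustrative, and both the ETL tool and the post-ETL load processes would write to it.

CREATE TABLE EtlErrorLog
(
    ErrorId      INT IDENTITY(1,1) PRIMARY KEY,
    ProcessName  VARCHAR(100)  NOT NULL,   -- e.g. 'Extract Orders', 'Build OLAP cube'
    PackageName  VARCHAR(100)  NULL,       -- SSIS package or job step, if any
    ErrorNumber  INT           NULL,
    ErrorMessage VARCHAR(4000) NOT NULL,
    RowsAffected INT           NULL,
    LoggedAt     DATETIME      NOT NULL DEFAULT (GETDATE())
);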







Data Quality
Overview:


Data quality is an important layer in transformation. Data quality issues are created during different stages in the system. Most enterprises have accumulated a high volume of data over the years, which makes it challenging for the business to find useful and meaningful information in their systems.
Constant changes in business processes, and the introduction of new systems and technologies to stay competitive in the market, add further complexity to data quality.
Data, or a collection of related business data, should help a business reach its desired goals. The factors that determine data quality are listed below.
Attributes of Data Quality
Use the following list of attributes to determine quality of the data that feeds into your warehouse:
Accuracy
Accuracy refers to whether the data correctly represents the underlying business entity.
Domain Integrity:
Domain integrity is the set of values that a particular attribute can assume, typically predefined values. Example: customer status can be Active or Inactive.
Data Type:
The value of a data attribute is actually stored in the data type defined for that attribute, and the defined data type should be able to accommodate the complete value assigned to the attribute. Example: customer status cannot use char(6) to store Active and Inactive values, because the value Inactive does not fit in the defined length.
Consistency:
The business meaning of a data field is the same across multiple source systems and across multiple tables of the same data source. Example: When the customer status field appears in several places in the source database, it should always carry the same value.
Normalization:
Well-designed transaction systems store data values in only a single place. This eliminates the possibility of inconsistent data within a system. If the data in the source system is not properly normalized, the data cleansing process needs to watch for and handle inconsistent values.
Duplication:
All duplicates are fully resolved: they are identified and a cross-reference is created. Duplication in source data is a major problem, particularly in transactional systems, because modifications and references may end up pointing at the wrong copy. Example: If a customer is stored as two different records in the customer table, updating the customer's address, or deriving any other attribute from it, becomes difficult when the two addresses do not match.
Completeness:
No critical attribute of the data entity is missing; only optional attributes may be null or incomplete.
Business rules:
All attributes adhere to the business requirements.
Structural Definiteness:
Wherever a data item can naturally be structured into individual components, the item must follow that well-defined structure, generally driven by business rules and requirements.
Clarity:
Users must be able to understand the meaning of data values, and data administrators must be able to understand the meaning of database objects. Data values should be descriptive for users; column names should follow a clear naming convention and describe the business meaning of the attribute.
Timeliness:
The users determine the timeliness of the data. Example: The customer address on file is a month old, and the customer no longer lives at that address.
Adherence to Data Integrity Rules:
Source data must adhere to referential integrity and entity integrity rules such as primary key and foreign key constraints, and to business rules implemented through check constraints or within the source applications.
Sources of data pollution

Data pollution can occur in three places in an enterprise:
• Upstream system
• Staging System
• Downstream System

The following table lists pollution sources in the different systems:

System | Data pollution source | Remarks
Upstream/Staging/Downstream | Poor database design | The database design does not represent the underlying business needs.
Upstream | Data entry operators or application users | End users create data problems through lack of knowledge, training or understanding.
Staging/Downstream | Data migration | Migrating from a legacy application to a new one introduces data problems.
Staging/Downstream | Internationalization/Localization | In geo-centric systems, data is captured in different formats and languages.
Staging/Downstream | Data received from partners and suppliers | Partners and suppliers use their own standards.
Staging/Downstream | ETL | Data is lost during extraction.
Staging/Downstream | System integration | Mergers and acquisitions make system integration more complex, leading to data pollution.
Upstream/Downstream | Data aging | Information becomes outdated or stale.
Upstream/Staging/Downstream | Lack of common standards | Common standards are not followed across the different systems.
Upstream | Non-compliance with business rules | For example, a discount is not given to qualified customers.

Controlling Data Pollution

The following are some common data quality problems and possible solutions.
Duplication
Data duplication is a common problem when merging data from different source systems, or when the same data is stored through cross-reference tables.
Possible Solution 1: Microsoft SSIS provides the Fuzzy Lookup and Fuzzy Grouping data cleansing transformations. Fuzzy Lookup is used against a cleansed reference data set for validation, while Fuzzy Grouping is used against a non-standardized data set. For more information, the following is a good reference:
o http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/datasol.asp
Where to apply the pollution control: Data cleansing can be applied in the upstream, staging or downstream system; duplicate records are typically cleansed in the staging or downstream system.

Possible Solution 2: In the silo applications, maintaining appropriate integrity constraints, for example a primary key or unique constraint, helps reduce duplicates.
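A brief sketch under assumptions: dbo.Customer, CustomerId and the natural-key columns below are illustrative names, not part of the original design:

-- Find existing duplicates on a natural key before the constraint goes in.
SELECT CustomerName, BirthDate, COUNT(*) AS Copies
FROM dbo.Customer
GROUP BY CustomerName, BirthDate
HAVING COUNT(*) > 1;

-- Then block new duplicates with a primary key (or a unique constraint on the natural key).
ALTER TABLE dbo.Customer
ADD CONSTRAINT PK_Customer PRIMARY KEY (CustomerId);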

Different data formats
Data quality issues arise when different systems use different formats for the same or related data.



Possible solution:
This issue is addressed during transformation. When the format of the data differs across sources, adopt a standard format in the destination; the Microsoft SSIS Data Conversion transformation can be used for this. For more information, visit http://msdn2.microsoft.com/en-us/library/ms141706.aspx

Extraneous information
The data may contain irrelevant or blank fields

Data Representation
The same data is represented in different ways in different systems. The diagram below shows such a scenario, in which gender and marital status data are represented differently in different systems.




Possible Solution:
Create a standard representation of the data in the target database. The data can be cleansed during transformation to the target database using the Microsoft SSIS Fuzzy Lookup transformation.
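A minimal sketch of the mapping idea, assuming hypothetical source codes ('M'/'F' in one system, '1'/'2' in another) and a Staging.CustomerSource table that are not part of the original material:

-- Map source-specific codes to one standard representation while loading the target.
SELECT CustomerId,
       CASE SourceGenderCode
            WHEN 'M' THEN 'Male'
            WHEN 'F' THEN 'Female'
            WHEN '1' THEN 'Male'
            WHEN '2' THEN 'Female'
            ELSE 'Unknown'
       END AS Gender
FROM Staging.CustomerSource;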

Business Rule enforcement
Business rules may not be enforced. For example, if the rule is to give preferred customers a discount and the operator fails to comply with it, the result is a data quality problem.


Joining against the wrong primary-key (lookup) table also causes data loss or inconsistency. Suppose Status and Flag are two different lookup tables, employee status is maintained in the Employee table, and the status code is taken from the Status table. If the extraction query joins to the Flag table instead of the Status table (see the diagram above), it produces inconsistent and incorrect results.

Business rule change management
Business rules tend to change, and if the changes are not managed appropriately they create data quality issues.
Possible solution:
• SSIS provides features that allow you to build custom business rules
• Train users in applying the business rules
• Use a product that implements a business rules engine across the organization

Incorrect source transactions
The source application uses explicit transactions, and the transaction start time is recorded as the date/time of a record's modification or insertion (for warehouse tracking purposes).
Typically, when data is extracted, the latest transaction date/time is stored in metadata, and the next extraction picks up every record modified or inserted after that stored date/time; this is called delta detection.
Suppose the transaction start time is what gets stored. If a transaction starts at 7 AM and commits at 8 AM, 7 AM is stored as the record's date/time. If the delta extraction runs at 7:30 AM, this record is not extracted because it has not yet been committed, but the delta watermark is set to 7:30 AM, the maximum time considered for that extraction.
The record is then committed at 8 AM with its date/time set to 7 AM (the transaction start time). When the next extraction starts at 8:30 AM, it does not pick up this record, because the watermark is 7:30 AM and the extraction only selects records whose date/time is >= 7:30 AM, while this record's date/time is 7 AM. The data and query would be as follows:

EmpId | F1 | F2 | DateTime
1011 | A | 23 | 7:00 AM
1023 | B | 24 | 8:00 AM

SELECT F1, F2 FROM Table1 WHERE [DateTime] >= '7:30 AM'

The query above does not return record 1011 because its DateTime falls before the delta watermark, even though it was actually committed after 7:30 AM.
This causes data loss.
It is suggested to use the transaction end (commit) time as the record's date/time instead of the start time, and to keep transactions on the source as short as possible.
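A hedged sketch of delta detection keyed on a commit-time column and a stored watermark; the dbo.ExtractionMetadata table and the CommitDateTime column are assumed names used only for illustration:

-- Read the previous watermark, extract only rows committed since then,
-- and advance the watermark after a successful run.
DECLARE @LastExtract datetime, @CurrentExtract datetime;

SELECT @LastExtract = LastExtractTime
FROM dbo.ExtractionMetadata
WHERE SourceTable = 'Table1';

SET @CurrentExtract = GETDATE();

SELECT EmpId, F1, F2
FROM dbo.Table1
WHERE CommitDateTime >= @LastExtract        -- populated at commit time, not at BEGIN TRAN
  AND CommitDateTime <  @CurrentExtract;    -- upper bound avoids a moving window

UPDATE dbo.ExtractionMetadata
SET LastExtractTime = @CurrentExtract
WHERE SourceTable = 'Table1';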
Extraction Process not reporting exception to caller
Suppose the extraction process extracts 10 of 15 tables, hits an exception, but does not report it to the caller so the run can be declared a failure and restarted. Because the error was not reported, the 10 successfully extracted tables are loaded into the warehouse while the other 5 are left behind. This causes data loss, particularly when a sync date/time is used for delta detection.
Incorrect joins
Incorrect joins in the source queries issued by the extraction process can cause data loss, because valid records may be missed during extraction.

SELECT F2, F3 FROM TranTable INNER JOIN Emp ON TranTable.EmpId = Emp.EmpId AND TranTable.DateTime >= '1/1/2006 7:30 AM' AND Emp.DateTime >= '1/1/2006 7:30 AM'

Queries like the one above cause data loss, because there is no need to apply delta detection to the primary-key (lookup) table. If no Emp record has been updated, this query returns zero rows even though transactions occurred after the watermark.
It is suggested either to bring over all the primary-key table records, or not to apply delta detection to the primary-key table when joining it to the transaction table in the source extraction queries.
A better approach is to extract the primary-key table records into one staging table and the transaction table records into another, and then join them as part of the transformation:

SELECT F1, F2 FROM Emp

SELECT F2, F3 FROM TranTable WHERE DateTime >= '1/1/2006 7:30 AM'

If an accumulated staging copy of the Emp table is maintained in staging, the query against Emp can bring over only the changed records; these are merged into the accumulated staging table to rebuild the complete table, which is then joined to the transaction table in the transformation process. The query would be as follows, with a fuller sketch of the merge step after it:

SELECT F1, F2 FROM Emp WHERE DateTime >= '1/1/2006 7:30 AM'
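A sketch of folding the changed Emp rows into the accumulated staging copy and then joining it to the transaction delta; the staging table names (Staging.Emp_Accumulated, Staging.Emp_Delta, Staging.TranTable_Delta) are assumptions:

-- Fold the changed Emp rows into the accumulated copy (update existing, insert new).
UPDATE acc
SET acc.F1 = d.F1, acc.F2 = d.F2
FROM Staging.Emp_Accumulated AS acc
JOIN Staging.Emp_Delta AS d ON d.EmpId = acc.EmpId;

INSERT INTO Staging.Emp_Accumulated (EmpId, F1, F2)
SELECT d.EmpId, d.F1, d.F2
FROM Staging.Emp_Delta AS d
WHERE NOT EXISTS (SELECT 1 FROM Staging.Emp_Accumulated AS acc WHERE acc.EmpId = d.EmpId);

-- Join the complete Emp copy to the transaction delta in the transformation step.
SELECT t.F2, t.F3, e.F1
FROM Staging.TranTable_Delta AS t
JOIN Staging.Emp_Accumulated AS e ON e.EmpId = t.EmpId;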

Daylight savings
Daylight saving time can cause data loss if the last delta watermark is not adjusted. The following example illustrates the scenario.

If the clock moves forward from 5 PM to 6 PM there is no issue: no data is updated between 5 and 6 PM after the change, so the delta can run before 5 or after 6 without losing data.
However, if the clock moves back from 6 PM to 5 PM, consider this sequence:
• The first transaction happens at 5:05 PM.
• A delta runs at 5:10 PM; it extracts the first transaction successfully and the watermark is set to 5:10 PM.
• A second transaction happens at 5:20 PM.
• The clock is set back from 6 PM to 5 PM.
• A third transaction happens at 5:02 PM (new clock time).
• The next delta runs at 5:15 PM against the 5:10 PM watermark; it picks up the 5:20 PM transaction but not the third one, because its timestamp (5:02 PM) is earlier than the watermark.
• The third transaction, recorded at 5:02 PM, is lost.
Geographical time zones
Problems arise if the source system does not take time zones into consideration.
For example, 5 PM UK time is 11 AM EST.
Assume the extraction runs in the UK.
The delta run records the maximum date/time it saw, say 5:01 PM. A record is then updated in the US, which sets its date/time to 11:02 AM local time. That record is never picked up by the next delta, because the watermark is set to 5:01 PM.
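One common mitigation is to record modification times in UTC so that deltas compare consistently across zones; the sketch below assumes a dbo.Customer table, a ModifiedUtc column and an ExtractionMetadata watermark table, none of which come from the original material:

-- Add a UTC-based modification time.
ALTER TABLE dbo.Customer
ADD ModifiedUtc datetime NOT NULL DEFAULT GETUTCDATE();

-- Delta detection then compares UTC values against a UTC watermark, regardless of the server's zone.
DECLARE @LastExtractUtc datetime;
SELECT @LastExtractUtc = LastExtractTimeUtc
FROM dbo.ExtractionMetadata
WHERE SourceTable = 'Customer';

SELECT CustomerId, Status
FROM dbo.Customer
WHERE ModifiedUtc >= @LastExtractUtc;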
Orphan
Data arrives from the transaction system with no corresponding lookup entry (here, there is no SUSPENDED record in the lookup table), which leaves the warehouse data inconsistent.
As a workaround, you can temporarily hold back such transaction records from the warehouse until the lookup table catches up, keeping the orphan records in temporary tables known as rejected tables.
Once the lookup record becomes available, move the data from the rejected table into the warehouse (see the sketch after the tables below).

Another solution is to use inferred dimensions, which add a dimension record for the missing value. Inferred dimensions are discussed more fully in the "Timely lookup records availability" section below.

If you ensure that the OLTP system enforces data integrity (i.e. the constraint that each foreign key has a corresponding primary key in the lookup table), these problems can be avoided altogether.

Status table
Id | Status
1 | Active
2 | Inactive
3 | Expired

Emp table
EmpId | Status
1011 | Suspended
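A sketch of the rejected-table workaround against the tables above; the Staging.Emp, Staging.Emp_Rejected and dw.EmpFact names are assumptions:

-- Divert rows whose status has no lookup entry (e.g. 'Suspended') into a rejected table.
INSERT INTO Staging.Emp_Rejected (EmpId, Status)
SELECT e.EmpId, e.Status
FROM Staging.Emp AS e
LEFT JOIN dbo.Status AS s ON s.Status = e.Status
WHERE s.Id IS NULL;

-- On a later run, once the lookup entry exists, promote the rejected rows and clear them.
INSERT INTO dw.EmpFact (EmpId, StatusId)
SELECT r.EmpId, s.Id
FROM Staging.Emp_Rejected AS r
JOIN dbo.Status AS s ON s.Status = r.Status;

DELETE r
FROM Staging.Emp_Rejected AS r
JOIN dbo.Status AS s ON s.Status = r.Status;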


Date/Time Conversion
In geo-centric systems, data is captured across different time zones, so date/time values arrive in different formats depending on the location.
Possible Solution:
• Convert text dates (e.g., YYYYMMDD) to a datetime field (see the sketch after this list)
• EBCDIC to ASCII format
• Non-Unicode to Unicode (to handle international integration)
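A sketch of the first bullet, converting a YYYYMMDD text date during transformation; the Staging.Orders table and OrderDateText column are assumed names for illustration:

-- Style 112 interprets the text as ISO yyyymmdd; ISDATE guards against malformed values.
SELECT OrderId,
       CONVERT(datetime, OrderDateText, 112) AS OrderDate
FROM Staging.Orders
WHERE ISDATE(OrderDateText) = 1;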
Non-unique Identifier
A unique identifier does not represent the same product across two different systems.
Data Anomaly
A field must be used only for the purpose for which it is defined. Example: storing a customer address in the customer status column. These issues are typically created during data entry by application users.
Erroneous Integration
Customer information from two data sources is integrated, but the integration is incorrect; for example, a customer ends up with the same value in the AddressLine1 and AddressLine2 columns, or two customers end up with the same Social Security Number.
Possible Solution:
The Microsoft SSIS Fuzzy Lookup and Fuzzy Grouping data cleansing transformations can be used to resolve non-unique identifiers, data anomalies and erroneous integration. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/datasol.asp
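Simple SQL checks can also surface these problems before or after the fuzzy transformations run; the dw.Customer table and column names below are assumptions:

-- Two customer records sharing a Social Security Number.
SELECT SocialSecurityNumber, COUNT(*) AS Customers
FROM dw.Customer
GROUP BY SocialSecurityNumber
HAVING COUNT(*) > 1;

-- Rows where AddressLine1 and AddressLine2 collapsed into the same value.
SELECT CustomerId
FROM dw.Customer
WHERE AddressLine1 = AddressLine2
  AND AddressLine1 IS NOT NULL;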

All of the above are mostly reactive approaches to solving data quality problems.
Data quality is not only about cleaning up data; to achieve excellence, organizations need to go beyond ETL and data cleansing tools. This section highlights some important organizational requirements.
Data Governance
Data quality problems arise for many reasons across the organization, and they can be reduced or eliminated through a combination of proactive and reactive approaches. Organizations should consider creating a data governance body, responsible for defining, monitoring and enforcing the attributes of data quality. A data governance body typically consists of:
Data Steward: responsible for defining the business rules for data integrity, validity and correctness
Data Administrator: responsible for data monitoring, auditing and cleansing
Summary
This document covered the ETL life cycle, its challenges and the available solution options. Microsoft SSIS provides many of these capabilities out of the box and offers the scope to build custom solutions where needed.
