
Welcome to Machers Blog

Blogging the world of technology and testing to help people build their careers.

Thursday, August 7, 2008

Data Mining

Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are confidential to and maintained as a trade secret by Microsoft Corporation. Information in these materials is restricted to Microsoft authorized recipients only. Any use, distribution or public discussion of, and any feedback to, these materials is subject to the terms of the attached license. By providing any feedback on these materials to Microsoft, you agree to the terms of that license.


Microsoft Corporation Technical Documentation License Agreement (Standard)
READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE TERMS.

1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and Microsoft agree as follows:

(a) If You are an authorized representative of the corporation or other entity designated below ("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have authority to act on behalf of Company and agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the Confidential Information accordingly;

(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that You will treat the Confidential Information accordingly; or

(c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing the Confidential Information to any third party for five (5) years from the date of disclosure of the Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the Confidential Information; and (c) to take reasonable security precautions, at least as great as the precautions it takes to protect its own confidential information, but no less than reasonable care, to keep confidential the Confidential Information. You/Company, however, may disclose Confidential Information in accordance with a judicial or other governmental order, provided You/Company either (i) gives Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or governmental entity that it will afford the Confidential Information the highest level of protection afforded under applicable law or regulation. Confidential Information shall not include any information, however designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such information to You/Company pursuant to the terms of this Agreement; (iii) became known to You/Company from a source other than Microsoft other than by the breach of an obligation of confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential by Recipient. 
"Confidential Information" includes, without limitation, information in tangible or intangible form relating to and/or including released or unreleased Microsoft software or hardware products, the marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and information received from others that Microsoft is obligated to treat as confidential.

2. You may review these Materials only (a) as a reference to assist You in planning and designing Your product, service or technology ("Product") to interface with a Microsoft Product as described in these Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone else.

3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.

4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You changes, Your right to use these Materials automatically terminates and You must destroy them.

5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback") relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason to believe is subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with any third party.

6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by Your NDA.

7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be brought in the federal or state superior courts located in King County, Washington, and You waive any defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft concerning these Materials; it may be changed only by a written document signed by both You and Microsoft.








Overview
Data mining is the proactive process of searching for and discovering patterns in data and presenting them in a predictable, useful format. It supports making business decisions, increasing revenue, reducing costs, and boosting customer confidence.
Data mining discovers hidden knowledge within the information that cannot be reached with traditional relational or OLAP technologies.
Need for Data Mining
Identifying patterns in business data is a required step for analysts to make the right decisions at the right time for the business portfolio. The following sections give more insight into the need for data mining for specific business needs.
Relational and OLAP tools are good for finding and reporting information that already exists and is visible within the data. Hidden knowledge that represents patterns and regularities in data cannot easily be found by traditional reporting. Scenarios where relational and OLAP reporting are sufficient include the following:
• Customers who used a Visa card to pay for gas bills
• Employees over age 50 who took more than 10 sick days last year
• Patients over 40 with blood pressure over 140 who have had at least one heart attack
• The total Sales Dollars by region by month for the past three years
• The year over year growth in discount percent by customer type

In all these cases the selection and grouping criteria are known in advance. Data mining provides a solution when the specific selection and grouping criteria are not known in advance but are derived from the data values. The following are scenarios for which data mining can be highly effective:

• Predicting the seasonally adjusted sales for software products in order to prepare for customer support requirements
• Identifying demographic groups that purchase different makes of cars to plan import allotments and focus marketing efforts
• Grocery shop owners identifying products that customers typically buy together so the product physical arrangement can be optimized
• Targeting a mailing campaign to a customer base that reflects specific behavioral patterns. For example, mailing baby product discount coupons to families that have recently made similar purchases
• Focusing retention efforts on the most effective employees based on revenue contribution, alignment with company policies, and consistency in performance
• Preventing potential fraud by comparing a transaction to previous purchasing patterns for a customer
• Identifying matching customers even when data entry errors include misspellings of the customer’s name or address.

Factors that would encourage considering data mining include the following:
• Data availability in source systems: Detailed data is available from source systems, preferably on a near real-time basis. Detailed data is a good candidate for accurate and predictable results.
• Huge data volume: Large data sets that can be difficult to analyze effectively using other tools lend themselves to data mining solutions. Also, the statistical functions in data mining require a large sample set in order to produce meaningful results.
• Complexity in identifying trends: Having multiple factors enter into forecasting or discovery analysis lends itself to data mining, particularly when the appropriate grouping structures are not known in advance.
• Automation with minimal user interaction: Because data mining is driven by data values, the same solution can be implemented at different customer locations, achieving customized behavior with no changes to the application.

Data mining is not
Data warehouse: A data warehouse, relational or OLAP, can be used as a source for the mining process, but data mining itself is not a data warehouse store for warehouse objects such as facts and dimensions.
Reporting store: Data mining is not a report store. It provides a method for analyzing data and making decisions; it does not provide any reports other than the analyzed data.
OLAP: Online analytical processing stores data warehouse data in a multidimensional store and builds aggregations accordingly. Data mining does not require the data to be multidimensional or aggregated, and it cannot be treated as a replacement for an OLAP store.
Data visualization: DMX queries are issued against the data mining models to get the results of the algorithms in the mining models. There is only a limited interface for viewing data mining predictions in Business Intelligence Development Studio; additional visualization tools must be used to interface with external clients.

Though OLAP is a data store in a multidimensional structure that includes aggregations, it cannot replace data mining models, because OLAP requires pre-defined grouping buckets. OLAP data can be used as an alternative to detailed relational data as a source for data mining models; the data mining models then provide further analysis and additional insight into patterns and predictions. The following comparison shows what is and is not possible with OLAP and data mining models.

OLAP vs. Data Mining
• OLAP typically focuses on historical facts; data mining typically focuses on future outcomes or trends.
• OLAP aggregates data using pre-defined groupings; data mining requires detail data.
• OLAP is verification driven and produces factual results; data mining is discovery driven.
• OLAP uses ad hoc queries and reports; data mining uses statistical and machine learning techniques.
• OLAP has a limited ability to include reliability estimates with predictions; data mining offers models for predicting, discovering patterns, estimating, and producing accurate results for trend analysis and forecasting.
• OLAP can be used as a data source for data mining models; data mining results can in turn be used in OLAP applications by incorporating new predictive variables or scores as dimensions or attributes in your OLAP tool.
Data Mining Methodology
Data mining is frequently described as "the process of extracting valid, authentic, and actionable information from large databases." In other words, data mining derives patterns and trends that exist in data. These patterns and trends can be collected together and defined as a mining model.
Building a data mining model is part of a larger process that includes everything from defining the basic problem the model will solve, to deploying it into a working environment. This process can be divided into the following seven steps:
1. Defining the problem
2. Data Preparation
3. Exploring data
4. Building the model
5. Evaluating and validating the model
6. Deploying and updating the models
7. Accessing the models

Defining the problem:
This involves defining the expectations from data mining, including business analysts’ requirements, marketing strategies, portfolio, forecasting, and decision support requirements. For example, a grocery store owner needs to find cross-sell opportunities among customers in order to grow the customer base by offering, together, the products customers need. For a grocery store manager, projected sales information may not be as important as the cross-sell requirement. The next steps of the data mining process act on the data collected to satisfy the business requirement. More business scenarios and modeling techniques are explained in later sections of the chapter.

Data Preparation:
The Data Extraction chapter details the types of data sources and the extraction methods for collecting data effectively. Typically the source transaction system consolidates all the discrete channels of data in one place and applies transactions, but it is not necessarily the only source of information. Data mining involves analyzing the source data at a broader level that includes data both internal and external to the system. Data collection is based entirely on the business requirements; it is not necessary to extract all the data. Following are some examples:

Internal data sources: company activities, customer records, Web sites, mail campaigns, purchasing transactions, inventory

External data sources: partners and suppliers, including external credit agencies, market surveyors, and customer feedback

Data mining models can extract data either from relational structures or from an OLAP store, as explained below in detail. Data mining algorithm implementations work the same for both relational and OLAP models; the only difference is the source structure and format.

Relational Data Mining Models:
Relational mining models can be built from data that is stored in relational database systems. This type of mining model is based on data retrieved from any OLE DB data source. A data source may contain both single-case and multiple-case tables. A case table corresponds to a lookup or master table in relational transactional systems, and to a dimension in relational dimensional models. Transactional or fact tables, where more than one record can exist per case record, are treated as nested tables for data mining models. Data mining models can extract data from these relational structures for modeling and processing. Mined data (the results from the models, in the form of predictions and patterns) can be extracted using DMX queries to interface with, and provide interaction to, the analysts.
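As a sketch, a relational mining model with a case table and a nested table can be declared in DMX roughly as follows; the model, column, and table names here are all hypothetical:

```sql
-- Hypothetical DMX definition: case-level columns come from the
-- Customers case table; Purchases is a nested table holding many
-- rows (one per purchased product) for each customer case.
CREATE MINING MODEL [CustomerPurchases]
(
    [CustomerKey] LONG KEY,         -- case key
    [Age]         LONG CONTINUOUS,  -- case-level input
    [Region]      TEXT DISCRETE,    -- case-level input
    [Purchases] TABLE PREDICT       -- nested (transaction) table
    (
        [ProductName] TEXT KEY
    )
)
USING Microsoft_Association_Rules
```

The nested table models the many-to-one relationship between transaction rows and a case, which is exactly the fact-to-dimension relationship described above.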

OLAP Data Mining Models
OLAP cubes contain a complex set of members and dimensions in a multidimensional structure. Unlike relational sources, OLAP sources do not preserve the granularity of the data, because the patterns and trends in the data might be lost in the process of creating aggregations. This makes it difficult to manually find the hidden patterns within these members and dimensions. To overcome this difficulty, you can use OLAP data mining models that are based on OLAP data sources. These models allow you to find complex patterns and apply them to business decisions. As with relational data mining models, the case key and case-level columns from the OLAP data source must be specified while designing the data mining model. When specifying the nested tables, only the measure groups that apply to the dimension are displayed; select a measure group that contains the foreign key of the case dimension as the nested table for the data mining model.
More information at http://msdn2.microsoft.com/en-us/ms175645.aspx

Source data systems collect information that is static in nature, transactional, or seldom changed, either from interface systems or extracted from external/internal systems (explained above). Based on its nature, source data is segregated into the following types:

Demographic data: This generally describes personal information.
• Personal details such as gender, age, marital status, income, status, department, and geographical location
• Strengths: very stable and good for predictive modeling. This data is not subject to frequent change, compared to data that is transactional in nature (such as exchange rates, product costs, bank balances, favorites).
• Weakness: it is difficult to get accurate information on an individual basis. Inaccurate demographic data leads to incorrect results from data mining. For example, if the analysis shows that women bought more mopeds than men, it may be because gender is incorrectly stored.

Behavior data: This is a measurement of an action or behavior, typically transactional in nature.
• Typically the most predictive type of data, but it can change rapidly.
• Depends on the type of industry:
o Sales amounts, types and dates of purchases, payment dates and invoice amounts, customer service attributes, insurance claims
o Web site activity – click-stream behavior, or the exact path of each visitor
• Predicts business patterns.
• Strengths: accuracy is higher because it is recorded data (transactions) rather than collected data (demographics). A good candidate for data mining predictable columns.
• Weakness: rapid change results in high volume.

Psychographic or attitudinal data: This is characterized by opinion polls, blogs, surveys, lifestyle characteristics, or personal values.
• Traditionally associated with market research – surveys, opinion polls, customer feedback, marketing, focus groups
• Brings an added dimension to predictive modeling when companies have squeezed the predictive power out of their demographic and behavior data
• Strengths: captures overall customer feedback and is useful for broad analysis of whether customers like the products
• Weakness: not accurate and not based on any particular individual; different surveys will give different results
Data quality and transformations are explained in other chapters. In brief, even if your sources are clean, integrated, and validated, they may contain data about the real world that is simply not true. This noise can, for example, be caused by errors in user input or by plain mistakes from customers filling in questionnaires. If it does not occur too often, data mining tools are able to ignore the noise and still find the overall patterns that exist in your data.
Data preparation and cleansing is an often neglected but extremely important step in the data mining process. It is not necessary to cleanse the data before it is extracted by the data mining process if the intention is to find data quality problems using data mining models.
Data can be polluted in a number of ways, such as through user interfaces, application problems, data collection mechanisms, heterogeneous systems, or data transformation procedures. Often the method by which the data was gathered was not tightly controlled, so the data may contain out-of-range values (e.g., Income: -100) or incorrect data combinations (e.g., placing New York in Japan). If the intention is not to find data quality problems using data mining models, consider cleansing the data before using it in the models. Data quality problems and techniques are explained in the Data Transformation chapter.
Data mining can handle either numeric or text-based data.
Numeric Mining:
Here the input columns may have descriptive (text) content, but the prediction columns contain numeric data. Simple operations on numeric data, such as greater-than, less-than, and percentage, can be applied to deduce more meaningful patterns and forecasts. This chapter discusses data mining models that deal with numeric data for predictions.

Text Mining:
While data mining is typically concerned with detecting patterns in numeric data, information that is critical to the business is very often stored in the form of text. Unlike numeric data, text is often difficult to deal with. Text mining generally consists of analyzing (multiple) text documents by extracting key phrases, parsing, identifying concepts, and so on, and then preparing the text processed in that manner for further analysis with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
Exploring data:
This step involves checking whether the required data contains the expected information. For example, if cross-sell information is to be analyzed as part of the data mining models, then individual customer transactions and the products purchased must be captured. The term data reduction in the context of data mining usually refers to aggregating the information from large datasets into manageable information chunks. Data reduction methods include simple tabulation or aggregation, and more sophisticated techniques such as clustering and principal components analysis.

Building Model:
A model typically contains input columns, an identifying column, and a predictable column.

Data types for the columns can be defined in a mining structure, based on which the algorithms process the data. The following basic terms about column types are useful for understanding the rest of the sections:
Continuous column: Contains numeric measurements with no predefined upper bound, typically values such as product cost, salary, account balance, shipping date, or invoice date.
Discrete column: Contains finite, unrelated values such as gender, location, age, or telephone area codes. These values do not need to be numeric in nature and typically do not have a fractional component.
Discretized column: A continuous column converted to be discrete, for example by grouping salaries into predefined bands.
Key: The column that uniquely identifies the row, similar to a primary key. This is sometimes called the case attribute.
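These column types map directly to DMX content types in a mining structure definition. A minimal sketch, with illustrative names:

```sql
-- Hypothetical mining structure showing the four column types above
CREATE MINING STRUCTURE [CustomerStructure]
(
    [CustomerKey] LONG KEY,                         -- case key
    [Gender]      TEXT DISCRETE,                    -- discrete column
    [Income]      DOUBLE CONTINUOUS,                -- continuous column
    [Age]         LONG DISCRETIZED(EQUAL_AREAS, 5)  -- discretized into 5 bands
)
```

The DISCRETIZED content type takes the bucketing method and the number of buckets, so the banding happens at processing time rather than in the source query.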

In brief, the algorithms available in Business Intelligence Development Studio are:

Microsoft Decision Trees Algorithm: This algorithm uses the values, or states, of the designated input columns to predict the states of the column designated as predictable. It identifies the attribute tree that best predicts the result. This algorithm allows for interplay between attributes and provides a hierarchy of attribute definitions that can be used to make a decision.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms175312.aspx
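For example, a hypothetical decision-tree model with two input columns and one predictable column might be declared in DMX as follows (all names are illustrative):

```sql
CREATE MINING MODEL [BikeBuyerTree]
(
    [CustomerKey] LONG KEY,
    [Gender]      TEXT DISCRETE,          -- input column
    [Income]      DOUBLE CONTINUOUS,      -- input column
    [BikeBuyer]   LONG DISCRETE PREDICT   -- predictable column
)
USING Microsoft_Decision_Trees
```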

Microsoft Clustering Algorithm: This algorithm groups the cases in a dataset into clusters that contain similar characteristics. It identifies how the data forms subgroups and how these subgroups differ from each other. This algorithm finds patterns without a specific target result.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174879.aspx

Microsoft Naive Bayes Algorithm: Identifies the attribute that is most likely to predict the result. This algorithm is less computationally intense than other Microsoft algorithms, and therefore is useful for quickly generating a mining model to discover relationships between input columns and predictable columns. You can use this algorithm to do initial explorations of data, and then later apply the results to create additional mining models with other algorithms that are more computationally intense and more sophisticated.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174806.aspx

Microsoft Association Algorithm: Association models are built on datasets that contain identifiers both for individual cases and for the item sets that the cases contain. An association model is made up of a series of item sets and the rules that describe how those items are grouped together within the cases. The rules that the algorithm identifies can be used to predict a customer's likely future purchases, based on the items already in the customer's shopping cart. It basically identifies the subgroup of data that participates in a specific transaction.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174916.aspx
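Assuming a hypothetical association model [CustomerPurchases] whose predictable nested table is [Purchases] keyed on [ProductName], a singleton DMX query can return the top recommendations for a given basket:

```sql
-- Recommend up to 3 products for a cart containing Milk and Bread
SELECT Predict([Purchases], 3)
FROM [CustomerPurchases]
NATURAL PREDICTION JOIN
(SELECT (SELECT 'Milk'  AS [ProductName]
         UNION
         SELECT 'Bread' AS [ProductName]) AS [Purchases]) AS t
```

The inner SELECT ... UNION construct supplies the nested-table rows of the single input case inline, so no source table is needed at prediction time.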

Microsoft Sequence Clustering Algorithm: Identifies the event that is likely to happen next. The algorithm takes a sequence of events as input and is well suited for click-stream analysis. This algorithm is similar to the Microsoft Clustering algorithm; however, instead of finding clusters of cases that contain similar attributes, it finds clusters of cases that contain similar paths in a sequence.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms175462.aspx

Microsoft Time Series Algorithm: This algorithm is used for predicting continuous columns, such as product sales. Unlike the other Microsoft algorithms, a time series model is based only on the trends that the algorithm derives from the original dataset to create a forecast model. It identifies the trends that are occurring and predicts the future from the current data.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174923.aspx
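Assuming a hypothetical model [SalesForecast] trained on a [Sales Amount] series, future values can be requested with the PredictTimeSeries function in DMX:

```sql
-- Forecast the next 6 time periods of the Sales Amount series
SELECT FLATTENED PredictTimeSeries([Sales Amount], 6)
FROM [SalesForecast]
```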

Microsoft Neural Network Algorithm: Like the Microsoft Decision Trees algorithm, this algorithm identifies the attributes that best predict the result, but it can analyze more than two attributes at a time. The algorithm calculates probabilities for each possible state of the input attribute when given each state of the predictable attribute.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174941.aspx

Microsoft Logistic Regression Algorithm: The Microsoft Logistic Regression algorithm is a variation of the Microsoft Neural Network algorithm, where the HIDDEN_NODE_RATIO parameter is set to 0. This setting will create a neural network model that does not contain a hidden layer, and that therefore is equivalent to logistic regression.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174828.aspx
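In DMX terms, this corresponds to passing the parameter when the model is created; the model and column names below are illustrative:

```sql
-- A neural network with no hidden layer behaves as logistic regression
CREATE MINING MODEL [MailResponse]
(
    [CustomerKey] LONG KEY,
    [Age]         LONG CONTINUOUS,
    [Income]      DOUBLE CONTINUOUS,
    [Responded]   LONG DISCRETE PREDICT
)
USING Microsoft_Neural_Network (HIDDEN_NODE_RATIO = 0)
```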

Microsoft Linear Regression Algorithm: The Microsoft Linear Regression algorithm is a variation of the Microsoft Decision Trees algorithm, where the MINIMUM_LEAF_CASES parameter is set to be greater than or equal to the total number of cases in the dataset that the algorithm uses to train.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174824.aspx

The output of a data mining model provides analyzed and forecast data that can be readily used by business analysts. For example, if you have a budget to mail information about a new product to 1,000 people, relational or OLAP queries will not produce the optimal set of 1,000 people. By enhancing your data with a data mining attribute that you can use in your query or OLAP analysis, data mining enables you to find the 1,000 people most likely to respond. This example also shows that data mining does not replace OLAP, but enhances it.
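The mailing example can be sketched as a DMX prediction query; the model [MailResponse], the data source [Customer Data], the Prospects table, and the [Responded] column are all hypothetical:

```sql
-- Pick the 1,000 prospects most likely to respond to the mailing
SELECT TOP 1000 t.[CustomerKey]
FROM [MailResponse]
PREDICTION JOIN
OPENQUERY([Customer Data],
    'SELECT CustomerKey, Age, Income FROM Prospects') AS t
ON  [MailResponse].[Age]    = t.[Age]
AND [MailResponse].[Income] = t.[Income]
ORDER BY PredictProbability([Responded], 1) DESC
```

Ranking by PredictProbability is what makes this selection optimal in a way that a plain relational filter cannot be.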
SQL Server 2005 Business Intelligence Development Studio is an integrated environment that includes several data mining algorithms and tools for building a comprehensive data mining solution. For example, you can use the Data Mining Designer tool in Business Intelligence Development Studio to create, modify, and compare data mining models. Different data mining models are explained in detail in later sections of this chapter. A data mining model applies a mining model algorithm to the data represented by a mining structure. Model parameters and boundary values can be defined on the data mining algorithms, and usage parameters on the data mining model columns. You can define columns to be input columns, key columns, or predictable columns.

Evaluating and validating the model:
Different models for a given business problem could be used for analyzing various business scenarios, identifying the analytical requirements, tuning the parameters and evaluating the results of the models to make a business decision.
This section explains the details of considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact it sometimes involves a very elaborate process. A variety of techniques have been developed to achieve that goal, typically applying different models to the same data set and then comparing their performance to choose the best one.
The data analysis life cycle represents the maturity model of the analysis.




Operational analysis is simply business transaction reporting (closing bank balances, who was admitted to the hospital today, how many support calls were closed today, and so on).
Trend analysis tracks the growth of historical data over a period of time.
Ad hoc analysis is business-context analysis (for example, product sales by region); it can also be used for finding a root cause, such as a sudden decrease in sales of a product due to floods or another natural calamity.
Predictive analysis predicts patterns for the future (also called forecasting).

Deploying and updating the models
Once the right data mining model is chosen and trained with the source data, you deploy it on the server for DMX queries and APIs to access. These queries act directly on the deployed models and feed the result to the interface for customer interaction and decision making. The processed models may need to be updated as the business requirement changes. For example, a change in business requirements may require you to consider choosing a different mining model. Alternatively, the input data may change or you may have new values you need to predict. Data mining models can be trained (processed) and deployed by using SSIS tasks and transformations. This is helpful when you want the model to respond to source data changes on a near-real-time basis. Also, mining results can be accessed by using SSIS tasks and transformations for feeding OLAP or relational structures. This is useful when end users access and analyze the data using existing OLAP or relational reports.

Accessing the model:
Once the model is built and deployed, the next step is to access the mined information from the front-end interface for further analysis. The query language DMX is typically used for accessing data mining models. DMX is similar to the MDX query language for OLAP queries and to the SQL query language for relational queries.
The following tasks can be performed by using the DMX query language:
1. Creating mining structures and mining models
2. Processing mining structures and mining models
3. Deleting or dropping mining structures or mining models
4. Copying mining models
5. Browsing mining models
6. Predicting against mining models
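As an illustration, the DMX fragment below sketches three of these tasks, creating, training, and predicting against a model. The structure, model, column, and source names ([TargetMail], [TM_DecisionTree], CustomerKey, Age, Gender, BikeBuyer, the MyDS data source, and the dbo.vTargetMail view) are hypothetical, not part of the product.

```sql
-- 1. Create a mining structure and add a decision-tree model to it
CREATE MINING STRUCTURE [TargetMail] (
    CustomerKey LONG KEY,
    Age         LONG CONTINUOUS,
    Gender      TEXT DISCRETE,
    BikeBuyer   LONG DISCRETE
)

ALTER MINING STRUCTURE [TargetMail]
ADD MINING MODEL [TM_DecisionTree] (
    CustomerKey,
    Age,
    Gender,
    BikeBuyer PREDICT
) USING Microsoft_Decision_Trees

-- 2. Process (train) the structure and its models from a relational view
INSERT INTO MINING STRUCTURE [TargetMail] (CustomerKey, Age, Gender, BikeBuyer)
OPENQUERY(MyDS, 'SELECT CustomerKey, Age, Gender, BikeBuyer FROM dbo.vTargetMail')

-- 6. Predict against the trained model for a new case
SELECT Predict([BikeBuyer]), PredictProbability([BikeBuyer])
FROM [TM_DecisionTree]
NATURAL PREDICTION JOIN
(SELECT 35 AS Age, 'M' AS Gender) AS NewCustomer
```

These statements run against a deployed Analysis Services database, for example from a DMX query window or through the ADOMD.NET API.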

For more information on different types of DDL and DML statements offered by DMX please refer to http://msdn2.microsoft.com/en-us/library/ms132025.aspx
For more information on the functions featured in DMX please refer to http://msdn2.microsoft.com/en-us/library/ms131996.aspx
For more information on operators supported in DMX query statements please refer to http://msdn2.microsoft.com/en-us/library/ms132095.aspx
Data Mining Strategies

There are two main kinds of models in data mining: predictive and descriptive.
Predictive models can be used to forecast explicit values, based on patterns determined from known results. For example, from a database of customers who have already responded to a particular offer, a model can be built that predicts which prospects are likeliest to respond to the same offer.
Descriptive models describe patterns in existing data, and are generally used to create meaningful subgroups such as demographic clusters.
The following picture represents the hierarchical view of data mining models and segregations of algorithms into these two models.


• Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset. An example of a classification algorithm is the Microsoft Decision Trees Algorithm.
• Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset. An example of a regression algorithm is the Microsoft Regression Algorithm.
• Time series algorithms forecast patterns based on a current set of continuous predictable attributes, and the data to be used as a base for predicting future patterns can be configured. The Microsoft Time Series algorithm is the best fit for time-series business requirements such as forecasting.
• Prediction is the estimation of future outcomes, such as predicting which customers will be loyal or which customers will respond to a promotion; it works on a continuous attribute set. The Microsoft Time Series and Microsoft Decision Trees algorithms are good choices for implementing these scenarios.
• Segmentation algorithms divide data into groups, or clusters, of items that have similar properties. An example of a segmentation algorithm is the Microsoft Clustering Algorithm.
• Summarization algorithms are similar to clustering algorithms, but instead of only grouping the data, they quantify the members of each group; for example, group 1 has the largest number of line items and the highest probability of occurring. The Microsoft Clustering algorithm provides this information in addition to clustering the selected data set.
• Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis. An example of an association algorithm is the Microsoft Association Algorithm.
• Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow. An example of a sequence analysis algorithm is the Microsoft Sequence Clustering Algorithm.
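As a sketch of how one structure can feed both a descriptive and a predictive algorithm, the DMX below adds a clustering model and a decision-tree model to the same hypothetical structure; the structure name [CustomerProfile], its columns, and the CLUSTER_COUNT value are illustrative assumptions.

```sql
-- Descriptive: segment customers into 8 groups
ALTER MINING STRUCTURE [CustomerProfile]
ADD MINING MODEL [CP_Clusters] (
    CustomerKey, Age, Gender
) USING Microsoft_Clustering (CLUSTER_COUNT = 8)

-- Predictive: classify a discrete outcome from the same columns
ALTER MINING STRUCTURE [CustomerProfile]
ADD MINING MODEL [CP_Trees] (
    CustomerKey, Age, Gender,
    BikeBuyer PREDICT
) USING Microsoft_Decision_Trees
```

Because both models share one structure, they train from identical cases, which makes their lift charts directly comparable.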

Choosing the right algorithm to use for a specific business task can be a challenge. While you can use different algorithms to perform the same business task, each algorithm produces a different result, and some algorithms can produce more than one type of result. For example, you can use the Microsoft Decision Trees algorithm not only for prediction, but also as a way to reduce the number of columns in a dataset, because the decision tree can identify columns that do not affect the final mining model.
Because different algorithms produce different results for the same business task, use more than one algorithm, analyze the results, and choose the right one based on accuracy and on the business need. Lift charts are useful for checking the accuracy of the data mining models once they are built on the input data.

Use algorithms together – use some algorithms to explore data, and then use other algorithms to predict a specific outcome based on that data. For example, you can use a clustering algorithm, which recognizes patterns, to break data into groups that are more or less homogeneous, and then use the results to create a better decision tree model.

Use multiple algorithms within one solution to perform separate tasks. For example, a regression tree algorithm can be used to obtain financial forecasting information, and a rule-based algorithm can perform a market basket analysis.

Drill-down analysis on processed mining models is useful for checking granular behavior over time; it denotes the interactive exploration of data, in particular of large databases. The process of drill-down analysis begins by considering some simple breakdowns of the data by a few variables of interest (e.g., gender, geographic region).
If the individual attributes you have are transaction amounts, you should model them as continuous rather than discrete.
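In DMX terms, that choice is expressed as the column's content type when the structure is defined. A minimal sketch, with hypothetical structure and column names:

```sql
CREATE MINING STRUCTURE [Transactions] (
    TransactionID LONG KEY,
    Amount  DOUBLE CONTINUOUS,           -- model transaction amounts as continuous
    Channel TEXT DISCRETE,               -- a genuinely categorical attribute
    Age     LONG DISCRETIZED(CLUSTERS)   -- or explicitly bucket a numeric value
)
```

DISCRETIZED tells Analysis Services to bucket a numeric column for algorithms that only accept discrete inputs, whereas CONTINUOUS preserves the full numeric range for regression-style algorithms.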
The following lists some popular business scenarios for which data mining models are sought, together with the right model chosen for each.
Real Customer Scenarios:
Problem: The marketing department of a car company needs to identify the characteristics of existing customers to determine whether they are likely to buy a product in the future.
Solution: By using the Microsoft Decision Trees algorithm, the marketing department can predict whether a particular customer will purchase a product. The algorithm can make a prediction based on customer information such as demographics or past buying patterns.

Problem: The marketing department of a car company needs to predict monthly car sales for the coming year. It also needs to identify whether the sales of one model can be used to predict the sales of another model.
Solution: By using the Microsoft Time Series algorithm on the past three years of historical data, a data mining model that forecasts future car sales can be produced. Using this model, you can also make cross predictions to determine the relationship between the sales trends of individual car models.

Problem: The car company is redesigning its web site to promote the sale of products.
Solution: By using the Microsoft Association Rules algorithm on the company's records of each sale in the transactional database, the company can identify the car models and accessories that tend to be purchased together, and can then predict additional items that a customer might be interested in.

Problem: A bank needs to predict when a customer's balance will reach zero, so that it can alert the customer or automatically move funds from another account.
Solution: Use the Microsoft Decision Trees and Microsoft Neural Network algorithms. Examine the recent transactions of customers whose balances hit zero, derive input attributes such as premium repayment, seasonal spending (for example, summer bike rental), gender, and age, and then prepare the model. Alternatively, predict from the average monthly and weekly balances of all customers: when the current week's balance is less than the average weekly balance, it could reach zero if the trend continues.

Problem: A price comparison system matches the best prices for a purchase. An order can contain a number of major items, each with multiple related sub-items, and the price is also affected by variables such as trade-ins, sales running at the time of the order, and the number of units.
Solution: Decision trees, neural networks, or logistic regression are good options. Consider parameters such as product type, weight, cost, location, and time of year as inputs, and predict the cost of the product across the different variables.

Problem: A retail store selling video game consoles wants to introduce the new Xbox 360 during the holiday season. Before the store introduces the product, the manager wants to predict which existing customers are most likely to buy it.
Solution: The Microsoft Decision Trees algorithm is a good model for this scenario, studying the characteristics of customers who would most probably buy the Xbox 360 (for example, households with children under 20). Input columns to consider are Customer ID, Age, Gender, Marital Status, Total Children, Occupation, Location, and Yearly Income, based on who can afford the purchase and has time at home to play.

Usage based matrix of each data mining algorithm:
Task: Predicting a discrete attribute. For example, to predict whether the recipient of a targeted mailing campaign will buy a product.
Algorithms: Microsoft Decision Trees, Microsoft Naive Bayes, Microsoft Clustering, Microsoft Neural Network (SSAS)

Task: Predicting a continuous attribute. For example, to forecast next year's sales.
Algorithms: Microsoft Decision Trees, Microsoft Time Series

Task: Predicting a sequence. For example, to perform a clickstream analysis of a company's Web site.
Algorithms: Microsoft Sequence Clustering

Task: Finding groups of common items in transactions. For example, to use market basket analysis to suggest additional products to a customer for purchase.
Algorithms: Microsoft Association, Microsoft Decision Trees

Task: Finding groups of similar items. For example, to segment demographic data into groups to better understand the relationships between attributes.
Algorithms: Microsoft Clustering, Microsoft Sequence Clustering

Data Mining Guidelines
The following are suggestions and guidelines related to implementing and modeling data mining algorithms.

Consider updating the data mining model when the volume of data increases
Data mining accuracy depends not only on the model adopted but also on the detail of the data. The more detailed the data, the higher the accuracy, but having more data analyzed by the mining model is time consuming, so choosing the right sampling of data is required. For example, customer employer information is not necessary for predicting cross-sell opportunities, and forecasting for next year does not need data that is five years old. When the incoming data volume changes, update the mining model to keep the results accurate.

Consider slicing the source cube if OLAP is the source, or filtering if relational storage is used as the mining source
Consider filtering the data, either by slicing the cube or by using a WHERE condition, before it is processed by data mining; this eliminates unnecessary data that the mining models would otherwise have to process.
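One way to apply such a filter is in the source query used to train the model, so that unwanted rows never reach the mining structure. A sketch for a relational source; the structure, view, and data source names ([SalesForecast], dbo.vSales, MyDS) are hypothetical.

```sql
-- Train only on the last three years of sales, filtered at the source
INSERT INTO MINING STRUCTURE [SalesForecast] (ItemKey, SaleDate, Amount)
OPENQUERY(MyDS,
  'SELECT ItemKey, SaleDate, Amount
   FROM dbo.vSales
   WHERE SaleDate >= DATEADD(year, -3, GETDATE())')
```

For an OLAP source, the equivalent step is to base the mining structure on a cube slice so that only the relevant members feed the model.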

Cleanse the incoming data if the model is not intended for data quality identification
Data mining models can be used to detect data quality problems, for example, the percentage of male employees who have taken pregnancy leave; this is purely a data entry problem, and a mining model can identify it. But if the intention of the data mining model is not to find data quality problems, consider cleansing the incoming data first. Data cleansing is explained in the Data Transformation chapter.

Consider building data mining models from the same data source
All models built from the same structure must come from the same data source. However, the models can differ in which columns of the structure are used, how the columns are used, the type of algorithm used to create each model, and the parameter settings for each algorithm. Using different data sources may result in heterogeneous data sampling across the models whose lift charts are compared when selecting the appropriate model.

Source data mining models from granular stores
Data mining models can extract data from either relational or OLAP data structures; the only requirement is granularity. An OLAP structure typically holds aggregated data, while a relational store holds granular data. Sourcing granular data increases the accuracy of the resulting predictive model and enables better pattern discovery. For example, to identify patterns between customer age and motorbike purchases, you need transactional data at the granularity of each customer purchase.

Use a lift chart for finding the accuracy and deciding on the model
The lift chart is important because it helps distinguish between models in a structure that are almost the same, helping you determine which model provides the best predictions. Similarly, the lift chart shows which type of algorithm produces the best predictions for a particular situation.

Consider splitting the source data into a training set and a testing set
In order to test the predictive capability of a mining model, you can randomly divide the available source data in half, and use one half to train the model and the other half to test the predictive capability of the model when using a lift chart.
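SQL Server 2008 adds a holdout clause on the mining structure for exactly this purpose; on earlier versions you would split the rows yourself in the source query. A sketch, with hypothetical structure and column names:

```sql
-- Reserve 30 percent of the cases for testing (SQL Server 2008 syntax)
CREATE MINING STRUCTURE [TargetMail] (
    CustomerKey LONG KEY,
    Age         LONG CONTINUOUS,
    BikeBuyer   LONG DISCRETE
) WITH HOLDOUT (30 PERCENT)
```

The held-out cases are excluded from training and are used automatically when you build a lift chart against the structure's test set.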

Data mining models predict, but the decision is yours
Data mining models analyze the given data to find patterns and make predictions, but deciding whether the result is accurate for the business scenario, and giving correct sample data to the model for processing, is up to you. Do not make decisions before you finish studying the results of the models.

Consider having a statistical analyst study the models and align them with business requirements
A statistical analyst is a good fit for analyzing results from data mining models and applying them to the decision-making process. Statistical analysts typically have experience in prediction, forecasting, and pattern identification, which is exactly what data mining models do, helping them reach more accurate business decisions.

Consider denormalized tables/views as a source for data mining models
Data mining looks at data as case tables and nested tables. A case table maps its key to the nested tables in a one-to-many relationship. Nested tables generally hold the input and predictable columns that are to be analyzed by the mining models. Having one denormalized table or view with the keys and input columns, mapping directly to the transactional (nested) tables, minimizes the complexity of the data model.

Consider using Integration Services for modeling, maintaining, and reprocessing mining models
Make use of SSIS (Integration Services) to identify incoming data changes and requirement changes and to modify the data mining models accordingly. This automates the reprocessing schedules of the models and keeps them accurate whenever the underlying data changes. A requirement change then needs to be applied in only one place in the Integration Services package, by setting the corresponding parameters or by adding another model.

Consider aggregating to the required level before building a time series model
If a daily sales forecast is required, first prepare the data by aggregating sales figures per item to that level. Having multiple records with the same key at the level at which the forecast is sought gives incorrect results and may cause processing errors.

Consider the detailed transaction table as nested and its master table as a case table
Every data mining model needs an identified case table and nested table; this example gives an idea of what to select as each.
The more common approach for market basket analysis is to model each transaction (containing multiple items) as a case. Your transaction table then has a composite key comprising the transaction ID and item ID, with multiple rows for each transaction ID, comprising all items that were bought together as part of that single transaction. From the point of view of SQL Server Data Mining, you would use the same table as the "case table" as well as the "nested table" containing the associated items for each transaction. Nested tables are a natural modeling concept for one-to-many relationships, for example, products purchased by a customer or movies watched by a person.
They are also a powerful and compact way of representing variable-length, sparse cases where individual cases contain only a small subset of the potential attributes. For instance, a store might sell a thousand products. If you wanted to model each potential product purchase as an attribute without nested tables, you would need more than a thousand columns in your model, with the attribute states being "missing" or "existing" to indicate whether the customer (the "case") bought the product or not. Instead, the nested table lets you pivot the attributes and specify only the ones that are actually present for each case.
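A market basket structure of this shape can be declared and trained in DMX roughly as follows; the structure, model, column, and source names ([MarketBasket], [MB_Association], dbo.Orders, dbo.OrderLines, MyDS) and the parameter values are illustrative assumptions.

```sql
CREATE MINING STRUCTURE [MarketBasket] (
    OrderNumber TEXT KEY,
    Products TABLE (            -- nested table: the items in each order
        Product TEXT KEY
    )
)

ALTER MINING STRUCTURE [MarketBasket]
ADD MINING MODEL [MB_Association] (
    OrderNumber,
    Products PREDICT (Product)
) USING Microsoft_Association_Rules (MINIMUM_SUPPORT = 0.01, MINIMUM_PROBABILITY = 0.3)

-- SHAPE builds the case and nested rowsets; RELATE joins them on the key
INSERT INTO MINING STRUCTURE [MarketBasket] (OrderNumber, Products (SKIP, Product))
SHAPE  { OPENQUERY(MyDS,
           'SELECT OrderNumber FROM dbo.Orders ORDER BY OrderNumber') }
APPEND ( { OPENQUERY(MyDS,
             'SELECT OrderNumber, Product FROM dbo.OrderLines ORDER BY OrderNumber') }
         RELATE OrderNumber TO OrderNumber ) AS Products
```

Note that both rowsets must be ordered by the relating key, and SKIP marks the nested column that exists only to join the child rows back to their case.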
Challenges/Limitations
Data mining systems face many challenges, particularly related to the incoming data and its quality. A data mining system may work perfectly on consistent data and perform significantly worse when even a little noise exists in the incoming data. In this section we take a look at what we consider the most prominent problems and challenges of data mining systems today.

Noisy Data:
In a large database, many of the attribute values will be inconsistent and/or incorrect. This may be due to erroneous instruments, human error during data entry, migration problems, missing values or incorrect transactions at the source system.

Complex incoming data:
Sometimes the incoming data is complex enough that data modelers cannot uniquely define the requirements.

Non-Representative Data: If the data in the training set is not representative of overall patterns, data mining will not produce accurate results. For example, if you are creating a model for employee wages, and the training set includes only salaried people, the model will not accurately represent hourly employees. Even when using large training data sets, you must be careful that the training data is representative and that it belongs to the same domain as the target data set.
No Boundary Cases: To find the real differences between two classes, boundary—or unusual—cases should be present. For example, if a data mining system is classifying animals, and non-flying birds such as penguins are not included in the training set, the model may define a bird as an animal that has wings and can fly.
Databases are Dynamic:
If the source database structure changes frequently, it is a challenge for data mining, as this requires changing and re-training the models. Ideally, the trained model should reflect the content of the source data at all times, in order to make the best possible classification. An important challenge for data mining systems is to accomplish this, by changing its rules as the source data changes.
Databases may be huge:
Most data mining algorithms have been created to handle very large data sets, but what happens if you need to perform data mining on terabytes of source data? On the one hand, a huge input data source is advantageous, since it often results in more accurate modeling; on the other hand, the number of possible attribute values within a large dataset can be enormous. This means that processing the models will take a long time and query performance can be too slow. One solution is to use less complex mining algorithms to minimize processing time. Other options include randomly selecting data from the source, or summarizing to the minimum level of detail needed for the mining.
Not a completely automated solution
Data mining is not an automatic process that identifies trends on its own; the tools need guidance.
Summary:
This chapter details the need for data mining in the end-to-end implementation of business intelligence applications. The requirement for data mining models comes from business analysts who want to dig deeper into day-to-day transactions, discover the hidden patterns, and use them for business forecasting, improving customer satisfaction, and increasing profits by taking the right decisions with the help of mining models. Different data mining models are available in Business Intelligence Development Studio, and the same business scenario can be implemented by more than one model. Techniques for selecting the right model and comparing it with other similar models increase the accuracy of the expected result. Guidelines and challenges are provided on the considerations when implementing data mining solutions.
