MS SQL SERVER: Decision trees algorithm

Microsoft Decision Trees Algorithm

Overview
Decision Trees Algorithm
DMX Queries
Data Mining using Decision Trees
Model Content for a Decision Trees Model
Decision Tree Parameters
Decision Tree Stored Procedures

Decision Trees Algorithm
The Microsoft Decision Trees algorithm is a classification and regression algorithm provided by Microsoft SQL Server Analysis Services for predictive modeling of both discrete and continuous attributes. For discrete attributes, the algorithm makes predictions based on the relationships between input columns in a dataset. It uses the values of those columns, known as states, to predict the states of a column that you designate as predictable. For example, in a scenario that predicts which customers are likely to purchase a motor bike, if nine out of ten younger customers buy a motor bike but only two out of ten older customers do so, the algorithm infers that age is a good predictor of the bike purchase.
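The intuition in the motor bike example can be made concrete with a small Python sketch. It assumes exactly ten younger and ten older customers with the purchase counts quoted above, and it measures how much a split on age reduces entropy. The function names are illustrative, and the scoring method that Analysis Services actually applies may differ.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy reduction obtained by splitting `parent` counts into `children` counts."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# [buyers, non-buyers] per age group, taken from the example above.
younger = [9, 1]
older = [2, 8]
everyone = [11, 9]  # both groups combined

gain = information_gain(everyone, [younger, older])
print(f"Information gain from splitting on age: {gain:.3f} bits")
```

A gain well above zero means that knowing the age group removes a large share of the uncertainty about the purchase, which is what makes age a good candidate for a split.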

Decision Trees Algorithm
For continuous attributes, the algorithm uses linear regression to determine where a decision tree splits. If more than one column is set to predictable, or if the input data contains a nested table that is set to predictable, the algorithm builds a separate decision tree for each predictable column.

DMX Queries
Let's see how to use DMX queries by creating a simple tree model based on the School Plans data set. The School Plans table contains data about 500,000 high school students, including Parent Support, Parent Income, Sex, IQ, and whether or not the student plans to attend school. Using the Decision Trees algorithm, you can create a mining model that predicts the School Plans attribute based on the four other attributes.

DMX Queries (Classification)
Model creation:
    CREATE MINING STRUCTURE SchoolPlans (
        ID LONG KEY,
        Sex TEXT DISCRETE,
        ParentIncome LONG CONTINUOUS,
        IQ LONG CONTINUOUS,
        ParentSupport TEXT DISCRETE,
        SchoolPlans TEXT DISCRETE
    ) WITH HOLDOUT (10 PERCENT)

    ALTER MINING STRUCTURE SchoolPlans
    ADD MINING MODEL SchoolPlan (
        ID, Sex, ParentIncome, IQ, ParentSupport,
        SchoolPlans PREDICT
    ) USING Microsoft_Decision_Trees

DMX Queries (Classification)
Training the SchoolPlan model:
    INSERT INTO SchoolPlans (ID, Sex, IQ, ParentSupport, ParentIncome, SchoolPlans)
    OPENQUERY(SchoolPlans,
        'SELECT ID, Sex, IQ, ParentSupport, ParentIncome, SchoolPlans FROM SchoolPlans')

DMX Queries (Classification)
Predicting the school plans of new students. This query returns ID, SchoolPlans, and Probability.
    SELECT t.ID, SchoolPlan.SchoolPlans, PredictProbability(SchoolPlans) AS [Probability]
    FROM SchoolPlan
    PREDICTION JOIN
        OPENQUERY(SchoolPlans,
            'SELECT ID, Sex, IQ, ParentSupport, ParentIncome FROM NewStudents') AS t
    ON SchoolPlan.ParentIncome = t.ParentIncome
    AND SchoolPlan.IQ = t.IQ
    AND SchoolPlan.Sex = t.Sex
    AND SchoolPlan.ParentSupport = t.ParentSupport

DMX Queries (Classification)
This query returns the histogram of the SchoolPlans predictions in the form of a nested table. The result of this query is shown in the next slide.
    SELECT t.ID, PredictHistogram(SchoolPlans) AS [SchoolPlans]
    FROM SchoolPlan
    PREDICTION JOIN
        OPENQUERY(SchoolPlans,
            'SELECT ID, Sex, IQ, ParentSupport, ParentIncome FROM NewStudents') AS t
    ON SchoolPlan.ParentIncome = t.ParentIncome
    AND SchoolPlan.IQ = t.IQ
    AND SchoolPlan.Sex = t.Sex
    AND SchoolPlan.ParentSupport = t.ParentSupport

DMX Queries (Regression)
Regression means predicting continuous variables using linear regression formulas based on regressors that you specify. The statements below create and train a regression model that predicts ParentIncome using IQ, Sex, ParentSupport, and SchoolPlans. IQ is used as a regressor.
    ALTER MINING STRUCTURE SchoolPlans
    ADD MINING MODEL ParentIncome (
        ID,
        Sex,
        ParentIncome PREDICT,
        IQ REGRESSOR,
        ParentSupport,
        SchoolPlans
    ) USING Microsoft_Decision_Trees

    INSERT INTO ParentIncome

DMX Queries (Regression)
Continuous prediction using a decision tree: this query predicts the ParentIncome for new students and returns the estimated standard deviation for each prediction.
    SELECT t.ID, ParentIncome.ParentIncome, PredictStdev(ParentIncome) AS Deviation
    FROM ParentIncome
    PREDICTION JOIN
        OPENQUERY(SchoolPlans,
            'SELECT ID, Sex, IQ, ParentSupport, SchoolPlans FROM NewStudents') AS t
    ON ParentIncome.SchoolPlans = t.SchoolPlans
    AND ParentIncome.IQ = t.IQ
    AND ParentIncome.Sex = t.Sex
    AND ParentIncome.ParentSupport = t.ParentSupport

DMX Queries (Association)
    CREATE MINING MODEL DanceAssociation (
        ID LONG KEY,
        Gender TEXT DISCRETE,
        MaritalStatus TEXT DISCRETE,
        Shows TABLE PREDICT (
            Show TEXT KEY
        )
    ) USING Microsoft_Decision_Trees
Each Show is considered an attribute with two states: existing or missing.

DMX Queries (Association)
Training an associative trees model. Because the model contains a nested table, the training statement uses the SHAPE statement.
    INSERT INTO DanceAssociation (ID, Gender, MaritalStatus, Shows (SKIP, Show))
    SHAPE {
        OPENQUERY(DanceSurvey,
            'SELECT ID, Gender, [Marital Status] FROM Customers ORDER BY ID')
    }
    APPEND (
        { OPENQUERY(DanceSurvey, 'SELECT ID, Show FROM Shows ORDER BY ID') }
        RELATE ID TO ID
    ) AS Shows

DMX Queries (Association)
Suppose that there is a married male customer who likes the Michael Jackson show. This query returns the five other shows that this customer is most likely to find appealing.
    SELECT t.ID, Predict(DanceAssociation.Shows, 5, $AdjustedProbability) AS Recommendation
    FROM DanceAssociation
    NATURAL PREDICTION JOIN
        (SELECT '101' AS ID,
                'Male' AS Gender,
                'Married' AS MaritalStatus,
                (SELECT 'Michael Jackson' AS Show) AS Shows) AS t

Data Mining using Decision Trees
The most common data mining task for a decision tree is classification, that is, determining whether or not a set of data belongs to a specific type, or class. The principal idea of a decision tree is to split your data recursively into subsets. All inputs are evaluated to choose a split, and the same evaluation is then repeated on each resulting subset. When this recursive process is completed, a decision tree is formed.
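The recursive splitting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Microsoft implementation: it handles discrete inputs only, scores splits by weighted entropy, and uses a tiny made-up data set whose column names are borrowed from the School Plans example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, attributes, target):
    """Return the attribute whose split leaves the lowest weighted entropy."""
    def weighted_entropy(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(attributes, key=weighted_entropy)

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stop when the subset is pure or no inputs remain; the leaf is the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in sorted({row[attr] for row in rows}):
        subset = [row for row in rows if row[attr] == value]
        # Repeat the same evaluation on each subset.
        branches[value] = build_tree(subset, remaining, target)
    return {attr: branches}

# Hypothetical rows, for illustration only.
rows = [
    {"Sex": "M", "ParentSupport": "Yes", "SchoolPlans": "Yes"},
    {"Sex": "F", "ParentSupport": "Yes", "SchoolPlans": "Yes"},
    {"Sex": "M", "ParentSupport": "No", "SchoolPlans": "No"},
    {"Sex": "F", "ParentSupport": "No", "SchoolPlans": "No"},
]
tree = build_tree(rows, ["Sex", "ParentSupport"], "SchoolPlans")
print(tree)
```

In this toy data ParentSupport separates the classes perfectly, so the sketch splits on it first and never needs to split on Sex.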

Data Mining using Decision Trees
Decision trees offer several advantages over other data mining algorithms. Trees are quick to build and easy to interpret. Each node in the tree is clearly labeled in terms of the input attributes, and each path from the root to a leaf forms a rule about your target variable. Prediction based on decision trees is efficient.
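To illustrate how each root-to-leaf path reads as a rule, the following Python sketch walks a small hand-written tree and prints one rule per leaf. The nested-dictionary representation and the tree itself are assumptions made for the example; they are not the format in which Analysis Services stores a model.

```python
def extract_rules(tree, conditions=()):
    """Yield one (conditions, prediction) pair per root-to-leaf path."""
    if not isinstance(tree, dict):
        # A non-dict value is a leaf holding the predicted state.
        yield list(conditions), tree
        return
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            yield from extract_rules(subtree, conditions + ((attribute, value),))

# Hypothetical tree: inner dicts are splits, strings are leaf predictions.
tree = {
    "ParentSupport": {
        "Yes": {"Sex": {"F": "Yes", "M": "No"}},
        "No": "No",
    }
}

rules = list(extract_rules(tree))
for conditions, prediction in rules:
    clause = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {clause} THEN SchoolPlans = {prediction}")
```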

Model Content for a Decision Trees Model
The top level is the model node. The children of the model node are its tree root nodes. If a tree model contains a single tree, there is only one node in the second level. The nodes of the other levels are either intermediate nodes or leaf nodes of the tree. The probabilities of each predictable attribute state are stored in the distribution row sets.

Interpreting the Mining Model Content
A decision trees model has a single parent node that represents the model and its metadata. Underneath the parent node are independent trees that represent the predictable attributes that you select. For example, if you set up your decision tree model to predict whether customers will purchase something, and provide inputs for gender and income, the model creates a single tree for the purchasing attribute, with many branches that divide on conditions related to gender and income. However, if you then add a separate predictable attribute for participation in a customer rewards program, the algorithm creates two separate trees under the parent node. One tree contains the analysis for purchasing, and the other contains the analysis for the customer rewards program.

Decision Tree Parameters
Tree growth, tree shape, and the input and output attribute settings are controlled by the parameters described in the following slides. You can fine-tune your model's accuracy by adjusting these parameter settings.

Decision Tree Parameters
COMPLEXITY_PENALTY: When the value of this parameter is set close to 0, there is a lower penalty for tree growth, and you may see large trees. When its value is set close to 1, tree growth is penalized heavily, and the resulting trees are relatively small. The default depends on the number of input attributes: if there are fewer than 10, the value is set to 0.5; if there are between 10 and 100, the value is set to 0.9; if there are more than 100, the value is set to 0.99.
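The default rule above is easy to express as a small lookup. The Python sketch below is illustrative only: the function name is hypothetical, and because the slide does not say which default applies to exactly 10 or exactly 100 attributes, the boundary handling here (10 to 99 gives 0.9, 100 and above gives 0.99) is an assumption.

```python
def default_penalty(input_attribute_count):
    """Default growth penalty for a given number of input attributes.

    Boundary handling at exactly 10 and 100 is an assumption of this sketch.
    """
    if input_attribute_count < 10:
        return 0.5
    if input_attribute_count < 100:
        return 0.9
    return 0.99

for count in (4, 50, 250):
    print(count, default_penalty(count))
```

The School Plans model has four input attributes, so under this rule it would start from the lowest default penalty.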

Decision Tree Parameters
MINIMUM_SUPPORT: For example, if this value is set to 25, any split that would produce a child node containing fewer than 25 cases is not accepted. The default value for MINIMUM_SUPPORT is 10.
SCORE_METHOD: The three possible values for SCORE_METHOD are:
SCORE_METHOD = 1: use an entropy score for tree growth.
SCORE_METHOD = 3: use the Bayesian with K2 Prior method, meaning it will add a constant for each state of the predictable attribute in a tree node, regardless of the node level of the tree.
SCORE_METHOD = 4: use the Bayesian Dirichlet Equivalent with Uniform Prior (BDE) method.
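The minimum-support rule can be sketched as a simple check on the sizes of the child nodes that a candidate split would create. The function name and the list-of-counts input are assumptions made for this Python example.

```python
def split_is_accepted(child_case_counts, minimum_support=10):
    """Accept a split only if every child node would hold at least
    `minimum_support` cases (10 is the default quoted above)."""
    return all(count >= minimum_support for count in child_case_counts)

# With the threshold set to 25, a child with only 24 cases blocks the split.
print(split_is_accepted([30, 24], minimum_support=25))
print(split_is_accepted([25, 40], minimum_support=25))
```

Raising the threshold therefore tends to produce smaller trees, because fewer candidate splits pass the check.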

Decision Tree Parameters
SPLIT_METHOD:
SPLIT_METHOD = 1 means the tree is split only in a binary way.
SPLIT_METHOD = 2 indicates that the tree should always split completely on each attribute.
SPLIT_METHOD = 3, the default, means the decision tree automatically chooses the better of the previous two methods.
MAXIMUM_INPUT_ATTRIBUTES: When the number of input attributes is greater than this parameter value, feature selection is invoked implicitly to select the most significant input attributes.
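One way to picture the three split method values is to list the candidate partitions each would consider for a single discrete attribute. The Python sketch below is an interpretation, not the documented behavior: it assumes that a binary split separates one state from all the others, that a complete split creates one branch per state, and that method 3 simply scores both kinds of candidate and keeps the best.

```python
def candidate_splits(states, split_method):
    """Return candidate partitions of an attribute's states.

    Each partition is a list of branches; each branch is a list of states.
    """
    states = sorted(states)
    # Binary: one state on one branch, every other state on the second branch.
    binary = [[[s], [o for o in states if o != s]] for s in states]
    # Complete: a single partition with one branch per state.
    complete = [[[s] for s in states]]
    if split_method == 1:
        return binary
    if split_method == 2:
        return complete
    if split_method == 3:
        # Both kinds are candidates; a scoring step would pick the better one.
        return binary + complete
    raise ValueError("split_method must be 1, 2, or 3")

for method in (1, 2, 3):
    print(method, candidate_splits(["Low", "Medium", "High"], method))
```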

Decision Tree Parameters
MAXIMUM_OUTPUT_ATTRIBUTES: When the number of predictable attributes is greater than this parameter value, feature selection is invoked implicitly to select the most significant attributes.
FORCE_REGRESSOR: This parameter is typically used in price elasticity models. For example, suppose that you have a model to predict Sales using Price and other attributes. If you specify FORCE_REGRESSOR = Price, you get regression formulas using Price and other significant attributes for each node of the tree.

Decision Tree Stored Procedures
The Decision Tree viewer uses a set of system stored procedures. Example calls:
    CALL System.DTGetNodes('MovieAssociation')
    CALL System.DTGetNodeGraph('MovieAssociation', 60)
    CALL System.DTAddNodes('MovieAssociation', '36;34', '99;282;20;261;26;201;33;269;30;187')

Decision Tree Stored Procedures
GetTreeScores is the procedure that the Decision Tree viewer uses to populate the drop-down tree selector. It takes the name of a decision tree model as a parameter and returns a table containing a row for every tree in the model and the following three columns:
ATTRIBUTE_NAME is the name of the tree.
NODE_UNIQUE_NAME is the content node representing the root of the tree.
MSOLAP_NODE_SCORE is a number representing the amount of information (number of nodes) in the tree.

Decision Tree Stored Procedures
DTGetNodes is used by the decision tree Dependency Network viewer when you click the Add Nodes button. It returns one row for each potential node in the dependency network and has the following two columns:
NODE_UNIQUE_NAME1 is an identifier that is unique for the dependency network.
NODE_CAPTION is the name of the node.
