Familiarize yourself with CART Decision Tree technology in this beginner's tutorial using a telecommunications example dataset from the 1990s. By the end of this tutorial you should feel comfortable using CART on your own with sample or real-world data.
39. CART Representation of a Surface
Model clearly non-linear
Height of bar represents probability of response
Remaining axes represent values of two predictors
Greatest probability of response is in the corner to the right
41. Searching all splits facilitated by sorting
• On the left we sort by TELEBILC, on the right by TRAVTIMR
• Test the smallest value first, then the next smallest, etc., moving all the way down the column
• The arrow shows a split sending 10 cases to the left and all other data to the right
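The sorted search described above can be sketched in a few lines of Python. This is a minimal illustration, not CART's actual implementation: the Gini impurity criterion, the midpoint threshold, the function names, and the six-record toy data are all assumptions made for the example.

```python
def gini(positives, n):
    """Gini impurity of a node holding n cases, of which `positives` are class 1."""
    if n == 0:
        return 0.0
    p = positives / n
    return 2 * p * (1 - p)

def best_split(values, targets):
    """Sort by the predictor, then test every split point from the top of the column down."""
    pairs = sorted(zip(values, targets))
    n = len(pairs)
    total_pos = sum(t for _, t in pairs)
    best_threshold, best_impurity = None, float("inf")
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]          # case i-1 moves to the left child
        if pairs[i][0] == pairs[i - 1][0]:
            continue                          # no split possible between equal values
        left_n, right_n = i, n - i
        impurity = (left_n * gini(left_pos, left_n)
                    + right_n * gini(total_pos - left_pos, right_n)) / n
        if impurity < best_impurity:
            # Midpoint between neighbouring values (an assumption of this sketch)
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_impurity = impurity
    return best_threshold, best_impurity

# Hypothetical toy column standing in for a predictor such as TELEBILC
values = [5, 1, 4, 2, 3, 6]
targets = [1, 0, 1, 0, 0, 1]
print(best_split(values, targets))
```

Because the column is sorted once, each candidate split only needs a running count of the cases already sent left, which is why sorting makes the exhaustive search cheap.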
175. Cross-Validation Train/Test Procedure:
K mutually exclusive partitions, 1 Test, K-1 Train
[Figure: the data divided into partitions 1 through 10, shown for several CV cycles; in each cycle one partition is labeled Test and the remaining partitions are labeled Learn, and so on for the remaining cycles.]
In the example above (K = 10), each partition is in the train (learn) sample 9 times and in the test sample 1 time
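As a rough sketch of this procedure, the Python below builds K mutually exclusive partitions and rotates the test role through them. The round-robin assignment and the function name `kfold_cycles` are assumptions for illustration; the record count of 830 matches the example data set used in this tutorial.

```python
def kfold_cycles(n_records, k=10):
    """Yield (learn, test) record-index lists for each of the k CV cycles."""
    # Deal records into k mutually exclusive partitions (round-robin is an assumption)
    partitions = [list(range(start, n_records, k)) for start in range(k)]
    for test_id in range(k):
        test = partitions[test_id]
        learn = [r for pid, part in enumerate(partitions)
                 if pid != test_id for r in part]
        yield learn, test

# Count how often each record lands in the learn and test samples
learn_uses = [0] * 830
test_uses = [0] * 830
for learn, test in kfold_cycles(830, k=10):
    for r in learn:
        learn_uses[r] += 1
    for r in test:
        test_uses[r] += 1

print(set(learn_uses), set(test_uses))  # prints {9} {1}
```

Every record is used for learning in K-1 cycles and for testing in exactly one, so each test prediction comes from a tree that never saw that record.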
177. Euro_Telco_Mini.xls Data Set
          Class=0         Class=1
CVCycle   Learn   Test    Learn   Test    CVW
1         634     70      113     13      0.1026161
2         633     71      114     12      0.0960758
3         634     70      113     13      0.1026161
4         633     71      114     12      0.0960758
5         634     70      113     13      0.1026161
6         633     71      114     12      0.0960758
7         634     70      113     13      0.1026161
8         634     70      113     13      0.1026161
9         633     71      114     12      0.0960758
10        634     70      113     13      0.1026161
• Here we see the breakdown of the 830-record data set into the 10 CV folds
• The table shows sample counts for the majority and minority classes in the learn and test partitions of each fold
• Observe that CART has succeeded in making each fold almost identical in the learn/test division and in the balance between TARGET=0 and TARGET=1
• The last column is the WEIGHT that CART uses on each fold for certain calculations
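One simple way to obtain folds this evenly balanced is to sort the records by class and deal them round-robin across the folds. The sketch below does that for 704 class-0 and 126 class-1 records, the class totals implied by the table. It is an illustration under that assumption, not a description of how CART assigns folds internally, and the fold numbering it produces need not match the table's.

```python
from collections import Counter

def stratified_folds(labels, k=10):
    """Assign each record a fold by sorting on class and dealing round-robin."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    fold_of = [0] * len(labels)
    for position, record in enumerate(order):
        fold_of[record] = position % k
    return fold_of

# Class totals implied by the table: 704 records with TARGET=0, 126 with TARGET=1
labels = [0] * 704 + [1] * 126
fold_of = stratified_folds(labels, k=10)
test_counts = Counter(zip(fold_of, labels))

print("CVCycle  Learn0  Test0  Learn1  Test1")
for fold in range(10):
    test0 = test_counts[(fold, 0)]
    test1 = test_counts[(fold, 1)]
    print(fold + 1, 704 - test0, test0, 126 - test1, test1)
```

Each fold ends up with 83 test records, split 70/13 or 71/12 between the two classes, which is the same pattern of counts the table reports.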
179. Aligning the CV Trees
All automatic and the user never sees this
            Main     CV1      CV2      CV3      CV4      CV5      CV6      CV7      CV8      CV9      CV10
Nodes       2        2        3        2        2        2        2        2        2        2        2
Complexity  0.01523  0.11543  0.04915  0.12949  0.08684  0.1178   0.09157  0.11464  0.11911  0.11201  0.10531
Nodes       4        6        4        4        4        5        4        4        5        4        4
Complexity  0.01487  0.01736  0.02034  0.01598  0.03128  0.01518  0.03642  0.02188  0.01815  0.02083  0.02285
Nodes       5        7        4        4        4        5        4        4        5        4        7
Complexity  0.01189  0.01455  0.02034  0.01598  0.03128  0.01518  0.03642  0.02188  0.01815  0.02083  0.01342
Nodes       9        8        4        8        4        9        4        9        6        8        10
Complexity  0.00893  0.01118  0.02034  0.01042  0.03128  0.01219  0.03642  0.01229  0.0114   0.01259  0.01157
• We would expect the trees to be aligned by number of nodes, and this is approximately what happens
• CART aligns the trees by a measure of "complexity" discussed in other sessions
• Alignment is required to determine the estimated error rate of the main tree when it has been pruned to a specific size (complexity)
• Thus, when the main tree is pruned to 4 terminal nodes, each CV tree is aligned accordingly. In the table above, seven of the CV trees are also pruned to 4 nodes, two CV trees are pruned to 5 nodes, and one to 6 nodes