Machine Learning, Key to Your
Classification Challenges
Marc Borowczak, PRC Consulting LLC (http://www.prcconsulting.net)
February 25, 2016
Classification Challenges are Everywhere…
Step 1: Retrieve Existing Data
Step 2: Clean the Data
Step 3: Classifying Data with OneR
Step 4: Evaluating OneR Performance
Step 5: Improving Model with JRip
Step 6: Improving Model with C5.0
Step 7: Improving C5.0 using all variables
Step 8: Comparisons with Original Rules in Reference Material
Conclusions
References
Classification Challenges are Everywhere…
Whether you develop pharmaceutical, cosmetic, food, industrial or civil engineered products, you are often confronted
with the challenge of sorting and classifying to meet process or performance requirements. Traditional
Research and Development approaches the problem with experimentation, but experimental programs carry design,
time and resource constraints, and can be slow, expensive and often redundant, quickly forgotten
or rendered obsolete.
Consider the alternative that Machine Learning tools offer today. We will show that this approach is not only quick and
efficient, but arguably how the Front End of Innovation should proceed, and that it is particularly suited for
classification, an essential step in reducing complexity, optimizing product segmentation, supporting Lean Innovation
and establishing robust supply networks.
Today, we will explain how Machine Learning can shed new light on this generic and very persistent
classification and clustering challenge. Using modern algorithms, we will derive classifications that are both simple
(we prefer fewer rules) and accurate (here, perfect) on a complete dataset.
If you haven’t read about the other important aspect, formulation optimization, please consult the companion article
Machine Learning, Key to Your Formulation Challenges (http://rpubs.com/m3cinc
/machine_learning_key_to_your_formulation_challenges).
Step 1: Retrieve Existing Data
We will mirror the approach used in the formulation challenge and use another dataset hosted on the UCI Machine
Learning Repository (http://archive.ics.uci.edu/ml/datasets.html): we will classify the edible attribute of…Mushrooms
(http://archive.ics.uci.edu/ml/datasets/Mushroom), based on attributes described in The Audubon Society Field
Guide to North American Mushrooms (1981). The challenge we tackle today, correctly classifying a go/no-go
attribute, is one that scientists, engineers and business professionals must address daily. Any established R&D
organization certainly holds similar, and sometimes hidden, knowledge in its archives…
Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html
1 of 27 2/25/2016 6:27 PM
Again, we will use R to demonstrate the approach quickly on this dataset (http://archive.ics.uci.edu/ml/machine-
learning-databases/mushroom/agaricus-lepiota.data), with its full description (https://archive.ics.uci.edu
/ml/machine-learning-databases/mushroom/agaricus-lepiota.names). We continue to maintain reproducibility of
the analysis as a general practice: the analysis tool and platform are documented, all libraries are clearly listed,
and the data is retrieved programmatically and date-stamped from the repository.
We will display the structure of the mushrooms dataset and the corresponding dictionary used to translate the
property factors.
Sys.info()[1:5]
## sysname release version nodename machine
## "Windows" "7 x64" "build 9200" "STALLION" "x86-64"
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] C50_0.1.0-24 RWeka_0.4-24 stringr_1.0.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.8 grid_3.2.2 formatR_1.2
## [4] magrittr_1.5 evaluate_0.7.2 RWekajars_3.7.12-1
## [7] stringi_1.0-1 partykit_1.0-3 rmarkdown_0.9.2
## [10] splines_3.2.2 tools_3.2.2 yaml_2.1.13
## [13] survival_2.38-3 rJava_0.9-7 htmltools_0.2.6
## [16] knitr_1.11
library(stringr)
library(RWeka)
library(C50)
library(rpart)
library(rattle)
userdir <- getwd()
datadir <- "./data"
if (!file.exists("data")){dir.create("data")}
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/Mushrooms_Data.csv")
dateDownloaded <- date()
mushrooms <- read.csv("./data/Mushrooms_Data.csv",header=FALSE)
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/Names.txt")
txt <- readLines("./data/Names.txt")
lns <- data.frame(beg=which(grepl("P_1) odor=",txt)),end=which(grepl("on the whole dataset.",txt)))
# we now capture all lines of text between beg and end from txt
res <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
res <- gsub("\t", "", res, fixed = TRUE)
res <- gsub("( {2,})"," ",res, fixed=FALSE)
res <- gsub("P_","\n",res,fixed=TRUE)
writeLines(res,"./data/parsed_res.csv")
res <- readLines("./data/parsed_res.csv")
res <- res[-1]
lns <- data.frame(beg=which(grepl("7. Attribute Information:",txt)),end=which(grepl("urban=u,waste=w,woods=d",txt)))
txt <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
txt <- gsub(" ", "", txt, fixed = TRUE)
txt <- gsub("(\\d+\\.)","\n",txt, fixed=FALSE)
txt <- gsub("\nAttributeInformation:\\(","",txt,fixed=FALSE)
txt <- gsub("\\)","",txt,fixed=FALSE)
txt <- gsub(":",",",txt,fixed=TRUE)
txt <- gsub("?","",txt,fixed=TRUE)
txt <- gsub("-","_",txt,fixed=TRUE)
writeLines(txt,"./data/parsed.csv")
attrib <- readLines("./data/parsed.csv")
attrib <- sapply(1:length(attrib),function(i) {gsub(","," ",attrib[i],fixed=TRUE)})
dictionary <- sapply(1:length(attrib),function(i) {strsplit(attrib[i],' ')})
colnames(mushrooms) <- sapply(1:length(attrib),function(i) {colnames(mushrooms)[i]<-dictionary[[i]][1]})
dictionary <- sapply(1:length(attrib),function(i) {dictionary[[i]][-1]}) # contains the levels strings
dictionary <- sapply(1:length(attrib),function(i){sapply(1:lengths(dictionary[i]),function(j){p1<-strsplit(dictionary[[i]][j],"=")[[1]][1];p2<-strsplit(dictionary[[i]][j],"=")[[1]][2];dictionary[[i]][j]<-paste0(p2,',',p1)})})
Step 2: Clean the Data
We notice that the stalk_root property has a missing level indicated with ‘?’. We can attempt two analyses: first,
we keep the missing data as coded and proceed with the classification models; second, we recode the ‘?’ value
as missing (NA), drop the corresponding level, and keep only the complete cases in a new dataset,
mushrooms_complete.
mushrooms_complete<-mushrooms
mushrooms_complete$stalk_root[mushrooms_complete$stalk_root=='?']<-NA
mushrooms_complete<-mushrooms_complete[complete.cases(mushrooms_complete),]
mushrooms_complete$stalk_root<-droplevels(mushrooms_complete$stalk_root)
str(mushrooms_complete)
## 'data.frame': 5644 obs. of 23 variables:
## $ classes : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap_surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
## $ cap_color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill_attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk_shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 4 levels "b","c","e","r": 3 2 2 3 3 2 2 2 3 2 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_above_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk_color_below_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil_type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil_color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ ring_number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore_print_color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
table(mushrooms_complete$classes)
##
## e p
## 3488 2156
We can now reassign translated levels to both the original and the complete mushroom datasets.
m <- sapply(1:length(attrib),function(i){levels(mushrooms[[i]]) <- sapply(1:length(levels(mushrooms[[i]])),function(j){
a<-strsplit(dictionary[[i]][[j]],",")[[1]][1]
b<-strsplit(dictionary[[i]][[j]],",")[[1]][2]
levels(mushrooms[[i]])[levels(mushrooms[[i]])==a] <- b } )
mushrooms[[i]]} )
m <- as.data.frame(m)
colnames(m) <- colnames(mushrooms)
mushrooms <- m
m <- sapply(1:length(attrib),function(i){levels(mushrooms_complete[[i]]) <- sapply(1:length(levels(mushrooms_complete[[i]])),function(j){
a<-strsplit(dictionary[[i]][[j]],",")[[1]][1]
b<-strsplit(dictionary[[i]][[j]],",")[[1]][2]
levels(mushrooms_complete[[i]])[levels(mushrooms_complete[[i]])==a] <- b } )
mushrooms_complete[[i]]} )
m <- as.data.frame(m)
colnames(m) <- colnames(mushrooms_complete)
mushrooms_complete <- m
rm(m,lns,attrib,txt,dictionary) # cleanup
As the veil_type feature is constant (a single level across all observations), it carries no information and can be
excluded from further analysis, leaving 22 properties to examine. On the original set we observe a fairly balanced
classification problem, with 4208 edible and 3916 poisonous mushrooms.
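More generally, constant factors can be detected programmatically rather than by eye. A minimal sketch on a toy data frame (the data here is illustrative, not the mushroom dataset itself):

```r
# Toy data frame with one constant factor column
df <- data.frame(classes   = factor(c("e", "p", "e")),
                 veil_type = factor(c("p", "p", "p")),
                 odor      = factor(c("a", "n", "a")))

# A factor column is uninformative when it has a single level
constant_cols <- names(df)[sapply(df, nlevels) == 1]
constant_cols                # "veil_type"
df[constant_cols] <- NULL    # drop the constant columns
```

This generalizes the manual removal of a single known column to any number of constant factors.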
mushrooms$veil_type <- NULL
str(mushrooms)
## 'data.frame': 8124 obs. of 22 variables:
## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ...
## $ cap_color : Factor w/ 10 levels "brown","buff",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 12 levels "black","brown",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 5 levels "bulbous","club",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 5 levels "cobwebby","evanescent",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore_print_color : Factor w/ 9 levels "black","brown",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ...
table(mushrooms$classes)
##
## edible poisonous
## 4208 3916
mushrooms_complete$veil_type <- NULL
str(mushrooms_complete)
## 'data.frame': 5644 obs. of 22 variables:
## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ...
## $ cap_color : Factor w/ 8 levels "brown","buff",..: 5 8 7 7 4 8 7 7 7 8 ...
## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 7 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 9 levels "buff","chocolate",..: 3 3 4 4 3 4 1 4 5 1 ...
## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 4 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_above_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ stalk_color_below_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ veil_color : Factor w/ 2 levels "white","yellow": 1 1 1 1 1 1 1 1 1 1 ...
## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 4 levels "cobwebby","flaring",..: 4 4 4 4 1 4 4 4 4 4 ...
## $ spore_print_color : Factor w/ 6 levels "brown","buff",..: 2 3 3 2 3 2 2 3 2 2 ...
## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 6 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ...
table(mushrooms_complete$classes)
##
## edible poisonous
## 3488 2156
However, the complete set is not only smaller but also somewhat more imbalanced after removal of the missing
data, with 3488 edible and 2156 poisonous mushrooms.
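The shift in class balance can be quantified directly. A small sketch using the class counts reported in the table() outputs above:

```r
# Class counts transcribed from the table() outputs above
counts_original <- c(edible = 4208, poisonous = 3916)
counts_complete <- c(edible = 3488, poisonous = 2156)

# Proportions show how the balance shifts after dropping incomplete cases
round(prop.table(counts_original), 3)   # roughly 0.518 / 0.482
round(prop.table(counts_complete), 3)   # roughly 0.618 / 0.382
```

A roughly 52/48 split becomes 62/38, which is worth keeping in mind when comparing accuracies between the two sets.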
We will now conduct two parallel analysis streams to compare classification performance and explore multiple
approaches, aiming for a perfect classification.
Let’s start with OneR classification, from the RWeka package.
Step 3: Classifying Data with OneR
We classify the original and complete mushrooms datasets.
mushroom_1R <- OneR(classes ~ .,data = mushrooms)
mushroomc_1R <- OneR(classes ~ .,data = mushrooms_complete)
Step 4: Evaluating OneR Performance
mushroom_1R
## odor:
## almond -> edible
## anise -> poisonous
## creosote -> poisonous
## fishy -> edible
## foul -> poisonous
## musty -> edible
## none -> poisonous
## pungent -> poisonous
## spicy -> poisonous
## (8004/8124 instances correct)
summary(mushroom_1R)
##
## === Summary ===
##
## Correctly Classified Instances 8004 98.5229 %
## Incorrectly Classified Instances 120 1.4771 %
## Kappa statistic 0.9704
## Mean absolute error 0.0148
## Root mean squared error 0.1215
## Relative absolute error 2.958 %
## Root relative squared error 24.323 %
## Coverage of cases (0.95 level) 98.5229 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 8124
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 4208 0 | a = edible
## 120 3796 | b = poisonous
mushroomc_1R
## odor:
## almond -> edible
## anise -> poisonous
## creosote -> poisonous
## fishy -> edible
## foul -> poisonous
## musty -> edible
## none -> poisonous
## (5556/5644 instances correct)
summary(mushroomc_1R)
##
## === Summary ===
##
## Correctly Classified Instances 5556 98.4408 %
## Incorrectly Classified Instances 88 1.5592 %
## Kappa statistic 0.9667
## Mean absolute error 0.0156
## Root mean squared error 0.1249
## Relative absolute error 3.3022 %
## Root relative squared error 25.6994 %
## Coverage of cases (0.95 level) 98.4408 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 5644
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 3488 0 | a = edible
## 88 2068 | b = poisonous
We observe that the OneR model achieves over 98.52% correct classification using only odor as the criterion on
the original set, and 98.44% on the complete set. However, the confusion matrix reveals that 120 poisonous
mushrooms were classified as edible in the original dataset, and 88 in the complete dataset.
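Because only one direction of error is dangerous here, it is worth deriving per-class rates from the confusion matrix, not just overall accuracy. A sketch using the original-set counts reported above:

```r
# OneR confusion matrix on the original set (rows = actual, columns = predicted),
# transcribed from the summary output above
cm <- matrix(c(4208,    0,
                120, 3796),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("edible", "poisonous"),
                             predicted = c("edible", "poisonous")))

accuracy <- sum(diag(cm)) / sum(cm)
# Fraction of truly poisonous mushrooms declared edible: the costly error
false_negative_rate <- cm["poisonous", "edible"] / sum(cm["poisonous", ])
round(c(accuracy = accuracy, false_negative_rate = false_negative_rate), 4)
```

Despite 98.5% accuracy, about 3% of the poisonous mushrooms would be eaten, which is why we keep pushing for a perfect classifier.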
Let’s try to improve on the OneR model, using JRip.
Step 5: Improving Model with JRip
mushroom_JRip <- JRip(classes ~ ., data = mushrooms)
mushroom_JRip
## JRIP rules:
## ===========
##
## (odor = creosote) => classes=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = black) => classes=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0)
## (odor = anise) => classes=poisonous (192.0/0.0)
## (spore_print_color = orange) => classes=poisonous (72.0/0.0)
## (stalk_surface_below_ring = smooth) and (stalk_surface_above_ring = scaly) => classes=poisonous (68.0/0.0)
## (habitat = meadows) and (cap_color = white) => classes=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => classes=poisonous (8.0/0.0)
## => classes=edible (4208.0/0.0)
##
## Number of Rules : 9
summary(mushroom_JRip)
##
## === Summary ===
##
## Correctly Classified Instances 8124 100 %
## Incorrectly Classified Instances 0 0 %
## Kappa statistic 1
## Mean absolute error 0
## Root mean squared error 0
## Relative absolute error 0 %
## Root relative squared error 0 %
## Coverage of cases (0.95 level) 100 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 8124
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 4208 0 | a = edible
## 0 3916 | b = poisonous
mushroomc_JRip <- JRip(classes ~ ., data = mushrooms_complete)
mushroomc_JRip
## JRIP rules:
## ===========
##
## (odor = creosote) => classes=poisonous (1584.0/0.0)
## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0)
## (odor = anise) => classes=poisonous (192.0/0.0)
## (spore_print_color = orange) => classes=poisonous (72.0/0.0)
## (population = clustered) => classes=poisonous (52.0/0.0)
## => classes=edible (3488.0/0.0)
##
## Number of Rules : 6
summary(mushroomc_JRip)
##
## === Summary ===
##
## Correctly Classified Instances 5644 100 %
## Incorrectly Classified Instances 0 0 %
## Kappa statistic 1
## Mean absolute error 0
## Root mean squared error 0
## Relative absolute error 0 %
## Root relative squared error 0 %
## Coverage of cases (0.95 level) 100 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 5644
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 3488 0 | a = edible
## 0 2156 | b = poisonous
We observe that JRip, given all 22 variables, derives 9 rules and classifies the original set perfectly. On
the complete set, only 6 rules are needed to reach the same perfect classification.
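To make the compactness of the JRip result concrete, the six rules printed above for the complete set can be hand-translated into a plain R function. This is a sketch of the decision list only (rules applied in order, edible as the default), not a replacement for the fitted model:

```r
# Decision list transcribed from the JRip output on the complete set
classify_jrip <- function(odor, gill_size, spore_print_color, population) {
  if (odor == "creosote")                      return("poisonous")
  if (gill_size == "narrow" && odor == "none") return("poisonous")
  if (odor == "anise")                         return("poisonous")
  if (spore_print_color == "orange")           return("poisonous")
  if (population == "clustered")               return("poisonous")
  "edible"                                     # default rule
}

classify_jrip("almond", "broad",  "white", "scattered")  # "edible"
classify_jrip("none",   "narrow", "white", "scattered")  # "poisonous"
```

A six-line decision list is something a domain expert can audit by inspection, which is a large part of the appeal of rule learners over black-box models.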
Step 6: Improving Model with C5.0
In the next step, we’ll attempt to improve classification performance using the C5.0 package, which we’ll first apply
using odor and gill_size (the two most influential factor variables), and then compare with all 22 variables selected.
mushroom_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms, rules = TRUE)
summary(mushroom_c5rules)
##
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data = mushrooms,
## rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Feb 24 18:58:20 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 8124 cases (3 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (4328/120, lift 1.9)
## odor in {almond, fishy, musty}
## -> class edible [0.972]
##
## Rule 2: (3796, lift 2.1)
## odor in {anise, creosote, foul, none, pungent, spicy}
## -> class poisonous [1.000]
##
## Default class: edible
##
##
## Evaluation on training data (8124 cases):
##
## Rules
## ----------------
## No Errors
##
## 2 120( 1.5%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 4208 (a): class edible
## 120 3796 (b): class poisonous
##
##
## Attribute usage:
##
## 100.00% odor
##
##
## Time: 0.0 secs
mushroomc_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5rules)
##
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data
## = mushrooms_complete, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Feb 24 18:58:20 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 5644 cases (3 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (3576/88, lift 1.6)
## odor in {almond, fishy, musty}
## -> class edible [0.975]
##
## Rule 2: (2068, lift 2.6)
## odor in {anise, creosote, foul, none}
## -> class poisonous [1.000]
##
## Default class: edible
##
##
## Evaluation on training data (5644 cases):
##
## Rules
## ----------------
## No Errors
##
## 2 88( 1.6%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 3488 (a): class edible
## 88 2068 (b): class poisonous
##
##
## Attribute usage:
##
## 100.00% odor
##
##
## Time: 0.0 secs
On the original dataset, we observe that C5.0, applied to the two most influential factor variables, yields results
similar to OneR and classifies 98.52% of the mushrooms correctly, leaving 120 misclassified! On the
complete set, C5.0 results are again similar to OneR: 98.44% of the mushrooms are classified correctly, leaving 88
mushrooms misclassified.
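Since eating a poisonous mushroom is far worse than discarding an edible one, the remaining errors could also be attacked with cost-sensitive learning: the C5.0 function accepts a cost matrix penalizing one error direction more than the other. A hedged sketch; the 5:1 penalty is an arbitrary illustration, not part of the original analysis:

```r
# Cost matrix: rows = predicted class, columns = actual class.
# Predicting "edible" for an actually poisonous mushroom costs 5x
# as much as the reverse mistake; correct predictions cost nothing.
error_cost <- matrix(c(0, 5,
                       1, 0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(predicted = c("edible", "poisonous"),
                                     actual    = c("edible", "poisonous")))
error_cost
# Such a matrix would be passed via C5.0's costs argument, e.g.:
# mushroom_c5cost <- C5.0(classes ~ odor + gill_size, data = mushrooms,
#                         costs = error_cost)
```

With such a penalty the learner is pushed to eliminate poisonous-as-edible errors first, at the price of rejecting some edible mushrooms.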
Let’s apply C5.0 on the 22 variables.
Step 7: Improving C5.0 using all variables
mushroom_c5improved_rules <- C5.0(classes ~ ., data = mushrooms, rules = TRUE)
summary(mushroom_c5improved_rules)
##
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Feb 24 18:58:21 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 8124 cases (22 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (4148/4, lift 1.9)
## cap_surface in {fibrous, scaly, smooth}
## odor in {almond, fishy, musty}
## stalk_color_below_ring in {cinnamon, gray, pink, red, white}
## spore_print_color in {black, brown, buff, chocolate, green, purple,
## white, yellow}
## -> class edible [0.999]
##
## Rule 2: (3500/12, lift 1.9)
## cap_surface in {fibrous, scaly, smooth}
## odor in {almond, fishy, musty}
## stalk_root in {club, cup, equal, rhizomorphs}
## spore_print_color in {buff, chocolate, purple, white}
## -> class edible [0.996]
##
## Rule 3: (3796, lift 2.1)
## odor in {anise, creosote, foul, none, pungent, spicy}
## -> class poisonous [1.000]
##
## Rule 4: (72, lift 2.0)
## spore_print_color = orange
## -> class poisonous [0.986]
##
## Rule 5: (24, lift 2.0)
## stalk_color_below_ring = yellow
## -> class poisonous [0.962]
##
## Rule 6: (16, lift 2.0)
## stalk_root = bulbous
## stalk_color_below_ring = orange
## -> class poisonous [0.944]
##
## Rule 7: (4, lift 1.7)
## cap_surface = grooves
## -> class poisonous [0.833]
##
## Default class: edible
##
##
## Evaluation on training data (8124 cases):
##
## Rules
## ----------------
## No Errors
##
## 7 12( 0.1%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 4208 (a): class edible
## 12 3904 (b): class poisonous
##
##
## Attribute usage:
##
## 98.67% odor
## 52.83% spore_print_color
## 51.99% cap_surface
## 51.55% stalk_color_below_ring
## 43.28% stalk_root
##
##
## Time: 0.1 secs
mushroomc_c5improved_rules <- C5.0(classes ~ ., data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5improved_rules)
##
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms_complete, rules
## = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Feb 24 18:58:22 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 5644 cases (22 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (3488, lift 1.6)
## odor in {almond, fishy, musty}
## spore_print_color in {buff, chocolate, purple, white}
## population in {abundant, numerous, scattered, several, solitary}
## -> class edible [1.000]
##
## Rule 2: (2068, lift 2.6)
## odor in {anise, creosote, foul, none}
## -> class poisonous [1.000]
##
## Rule 3: (72, lift 2.6)
## spore_print_color = orange
## -> class poisonous [0.986]
##
## Rule 4: (52, lift 2.6)
## population = clustered
## -> class poisonous [0.981]
##
## Default class: edible
##
##
## Evaluation on training data (5644 cases):
##
## Rules
## ----------------
## No Errors
##
## 4 0( 0.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 3488 (a): class edible
## 2156 (b): class poisonous
##
##
## Attribute usage:
##
## 98.44% odor
## 63.08% spore_print_color
## 62.72% population
##
##
## Time: 0.0 secs
Using all 22 variables on the original dataset, C5.0 derives 7 rules and classifies all but 12 mushrooms correctly.
On the complete dataset, C5.0 derives 4 rules and classifies all mushrooms correctly! We can easily chart the tree
using the rpart and rattle packages.
tree <- rpart(mushroom_c5improved_rules,data=mushrooms,control=rpart.control(minsplit=20,cp=0,digits=6))
fancyRpartPlot(tree, palettes=c("Greys", "Oranges"),cex=0.75, main="Original Mushroom Dataset",sub="")
treec <- rpart(mushroomc_c5improved_rules,data=mushrooms_complete,control=rpart.control(minsplit=20,cp=0,digits=6))
fancyRpartPlot(treec,palettes=c("Greys", "Oranges"),cex=0.75, main="Complete Mushroom Dataset",sub="")
Finally, we will use PART to classify, and compare the results.
mushroom_PART_rules <- PART(classes ~ ., data = mushrooms)
mushroom_PART_rules
## PART decision list
## ------------------
##
## odor = creosote: poisonous (2160.0)
##
## gill_size = broad AND
## ring_number = one: edible (3392.0)
##
## ring_number = two AND
## spore_print_color = white: edible (528.0)
##
## odor = pungent: poisonous (576.0)
##
## odor = spicy: poisonous (576.0)
##
## stalk_shape = enlarging AND
## stalk_surface_below_ring = silky AND
## odor = none: poisonous (256.0)
##
## stalk_shape = enlarging AND
## odor = anise: poisonous (192.0)
##
## gill_size = narrow AND
## stalk_surface_above_ring = silky AND
## population = several: edible (192.0)
##
## gill_size = broad: poisonous (108.0)
##
## stalk_surface_below_ring = silky AND
## bruises = bruises: edible (60.0)
##
## stalk_surface_below_ring = smooth: poisonous (40.0)
##
## bruises = bruises: edible (36.0)
##
## : poisonous (8.0)
##
## Number of Rules : 13
mushroomc_PART_rules <- PART(classes ~ ., data = mushrooms_complete)
mushroomc_PART_rules
## PART decision list
## ------------------
##
## odor = musty AND
## ring_number = one AND
## veil_color = white AND
## gill_size = broad: edible (2496.0)
##
## odor = creosote: poisonous (1584.0)
##
## odor = almond: edible (400.0)
##
## odor = fishy: edible (400.0)
##
## odor = none: poisonous (256.0)
##
## odor = anise: poisonous (192.0)
##
## stalk_root = cup: edible (96.0)
##
## spore_print_color = orange: poisonous (72.0)
##
## stalk_root = bulbous AND
## population = several: edible (64.0)
##
## population = clustered: poisonous (52.0)
##
## : edible (32.0)
##
## Number of Rules : 11
On the original mushrooms dataset, PART also classifies everything correctly, but must rely on 13 rules to reach
the goal. On the complete set, PART achieves the same perfect outcome with 11 rules.
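Collecting the numbers reported above into one table makes the model comparison easier to read. The rule counts and training-set errors below are transcribed from the model outputs in this analysis (OneR is counted as a single rule on one attribute):

```r
# Summary of the models fitted above: rule counts and training-set errors
results <- data.frame(
  model           = c("OneR", "C5.0 (2 vars)", "JRip", "C5.0 (all vars)", "PART"),
  rules_original  = c(1, 2, 9, 7, 13),
  errors_original = c(120, 120, 0, 12, 0),
  rules_complete  = c(1, 2, 6, 4, 11),
  errors_complete = c(88, 88, 0, 0, 0)
)
results
```

On the complete set, C5.0 with all variables is the clear winner: zero errors with the fewest rules.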
Step 8: Comparisons with Original Rules in
Reference Material
It is always interesting to compare a solution to alternatives. In this case we can refer to the original rules,
derived in 1997 and extracted from the dataset documentation, which resulted in 48 errors, or 99.41% accuracy, on
the whole dataset:
res
## [1] "1) odor=NOT(almond.OR.anise.OR.none) 120 poisonous cases missed, 98.52% accuracy "
## [2] "2) spore-print-color=green 48 cases missed, 99.41% accuracy "
## [3] "3) odor=none.AND.stalk-surface-below-ring=scaly.AND. (stalk-color-above-ring=NOT.brown) 8 cases missed, 99.90% accuracy "
## [4] "4) habitat=leaves.AND.cap-color=white 100% accuracy Rule "
## [5] "4) may also be "
## [6] "4') population=clustered.AND.cap_color=white These rule involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule: odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green gives 48 errors, or 99.41% accuracy on the whole dataset."
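The combined edible rule quoted in the reference output can be checked directly against the data with base R. This is a sketch, not code from the original analysis: it assumes the translated mushrooms data frame from Step 2 is in scope, and that the dictionary translation produced the level names used below.

```r
# Predict "edible" when odor is almond/anise/none AND the spore print is not
# green, mirroring the negated reference rule from the 1997 documentation
pred <- ifelse(mushrooms$odor %in% c("almond", "anise", "none") &
               mushrooms$spore_print_color != "green",
               "edible", "poisonous")

table(pred, mushrooms$classes)   # confusion matrix against the true classes
mean(pred == mushrooms$classes)  # overall accuracy on the whole dataset
```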
Conclusions
The C5.0 algorithm applied to all 22 variables of the complete mushroom set classifies correctly with 4 rules. This is the best performance we achieved on the set, with the minimum number of rules derived and the most accurate (perfect) outcome obtained on the complete dataset. It also selected only 3 of the 22 variables provided: odor, spore_print_color and population. This compares favorably with the referenced document, where 6 attributes and 4 rules resulted in 99.41% accuracy.
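The winning model summarized here corresponds to the Step 7 fit on the complete set; a minimal sketch of that call (assuming the C50 package and the mushrooms_complete data frame from Step 2):

```r
library(C50)  # Quinlan's C5.0 decision trees and rule sets

# Fit C5.0 as a rule-based model on all 22 predictors of the complete set
mushroomc_c5all <- C5.0(classes ~ ., data = mushrooms_complete, rules = TRUE)

# summary() lists the derived rules, the attributes actually used,
# and the training confusion matrix
summary(mushroomc_c5all)
```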
We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help resolve classification challenges, offering a fast, efficient and economical alternative to tedious experimentation. It is easy to imagine how similar questions can be resolved in all types of R&D, whether in materials, cosmetics, food or any other scientific area. This second tool is certainly as useful as the formulation tool (http://rpubs.com/m3cinc/machine_learning_key_to_your_formulation_challenges) we reviewed previously.
Classifying rubber properties to meet rolling resistance and emissions targets, modern composites to build renewable energy sources, lightweight transportation vehicles and next-generation public transit, or innovative UV-shield ointments and tasty snacks and drinks: all present similar challenges where only the nature of the inputs and outputs varies. Therefore, this method too can and should be applied broadly!
Why not try and implement Machine Learning in your scientific or technical expert area? Remember, PRC Consulting, LLC (http://www.prcconsulting.net) is dedicated to boosting innovation through improved Analytics, one customer at a time!
References
The following sources are referenced as they provided significant help and information to develop this Machine Learning analysis applied to classification:
1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
2. mushroom documentation (https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names)
3. stringr (https://cran.r-project.org/web/packages/stringr/stringr.pdf)
4. RWeka (https://cran.r-project.org/web/packages/RWeka/RWeka.pdf)
5. C50 (https://cran.r-project.org/web/packages/C50/C50.pdf)
6. rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf)
7. rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf)
8. rattle (http://rattle.togaware.com/)
9. RStudio (https://www.rstudio.com)
10. Machine Learning, Key to Your Formulation Challenges (http://rpubs.com/m3cinc/machine_learning_key_to_your_formulation_challenges)
Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html
27 of 27 2/25/2016 6:27 PM

Más contenido relacionado

Similar a Machine Learning, Key to Your Classification Challenges

The Django Web Framework (EuroPython 2006)
The Django Web Framework (EuroPython 2006)The Django Web Framework (EuroPython 2006)
The Django Web Framework (EuroPython 2006)Simon Willison
 
UBC STAT545 2014 Cm001 intro to-course
UBC STAT545 2014 Cm001 intro to-courseUBC STAT545 2014 Cm001 intro to-course
UBC STAT545 2014 Cm001 intro to-courseJennifer Bryan
 
Rooted 2010 ppp
Rooted 2010 pppRooted 2010 ppp
Rooted 2010 pppnoc_313
 
The Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseThe Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseFred Moyer
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with RYanchang Zhao
 
Semantic search for Earth Observation products
Semantic search for Earth Observation productsSemantic search for Earth Observation products
Semantic search for Earth Observation productsGasperi Jerome
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible researchYannick Wurm
 
R language tutorial
R language tutorialR language tutorial
R language tutorialDavid Chiu
 
Why re-use core classes?
Why re-use core classes?Why re-use core classes?
Why re-use core classes?Levi Waldron
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R NotesLakshmiSarvani6
 
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...Michele Pasin
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
 
Business Analytics with R
Business Analytics with RBusiness Analytics with R
Business Analytics with REdureka!
 
Business Analytics Decision Tree in R
Business Analytics Decision Tree in RBusiness Analytics Decision Tree in R
Business Analytics Decision Tree in REdureka!
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamDoug Needham
 
[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...Functional Thursday
 
Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...Research Data Alliance
 
Reproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookReproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookKeiichiro Ono
 

Similar a Machine Learning, Key to Your Classification Challenges (20)

The Django Web Framework (EuroPython 2006)
The Django Web Framework (EuroPython 2006)The Django Web Framework (EuroPython 2006)
The Django Web Framework (EuroPython 2006)
 
UBC STAT545 2014 Cm001 intro to-course
UBC STAT545 2014 Cm001 intro to-courseUBC STAT545 2014 Cm001 intro to-course
UBC STAT545 2014 Cm001 intro to-course
 
Easy R
Easy REasy R
Easy R
 
Rooted 2010 ppp
Rooted 2010 pppRooted 2010 ppp
Rooted 2010 ppp
 
The breakup
The breakupThe breakup
The breakup
 
The Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseThe Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL Database
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Semantic search for Earth Observation products
Semantic search for Earth Observation productsSemantic search for Earth Observation products
Semantic search for Earth Observation products
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research
 
R language tutorial
R language tutorialR language tutorial
R language tutorial
 
Why re-use core classes?
Why re-use core classes?Why re-use core classes?
Why re-use core classes?
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
Business Analytics with R
Business Analytics with RBusiness Analytics with R
Business Analytics with R
 
Business Analytics Decision Tree in R
Business Analytics Decision Tree in RBusiness Analytics Decision Tree in R
Business Analytics Decision Tree in R
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug Needham
 
[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...
 
Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...
 
Reproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookReproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter Notebook
 

Machine Learning, Key to Your Classification Challenges

  • 1. Machine Learning, Key to Your Classification Challenges Marc Borowczak, PRC Consulting LLC (http://www.prcconsulting.net) February 25, 2016 Classification Challenges are Everywhere… Step 1: Retrieve Existing Data Step 2: Clean the Data Step 3: Classifying Data with OneR Step 4: Evaluating OneR Performance Step 5: Improving Model with JRip Step 6: Improving Model with C5.0 Step 7: Improving C5.0 using all variables Step 8: Comparisons with Original Rules in Reference Material Conclusions References Classification Challenges are Everywhere… You develop pharmaceutical, cosmetic, food, industrial or civil engineered products, and are often confronted with the challenge of sorting and classifying to meet process or performance properties. While traditional Research and Development does approach the problem with experimentation, it generally involves designs, time and resource constraints, and can be considered slow, expensive and often times redundant, fast forgotten or perhaps obsolete. Consider the alternative Machine Learning tools offers today. We will show this is not only quick, efficient and ultimately the only way Front End of Innovation should proceed, and how it is particularly suited for classification, an essential step used to reduce complexity and optimize product segmentation, Lean Innovation and establishing robust source of supply networks. Today, we will explain how Machine Learning can shed new light on this generic and very persistent classification and clustering challenge. We will derive with modern algorithms simple (we prefer less rules) and accurate (perfect) classifications on a complete dataset. If you didn’t read about the other important aspect of formulation optimization, please consult Machine Learning, Key to Your Formulation Challenges (http://rpubs.com/m3cinc /machine_learning_key_to_your_formulation_challenges) communication. 
Step 1: Retrieve Existing Data We will mirror the approach used in the formulation challenge and use another dataset hosted on UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html), to classify the edible attribute of…Mushrooms (http://archive.ics.uci.edu/ml/datasets/Mushroom) based on attribute described in The Audubon Society Field Guide to North American Mushrooms (1981). The challenge we tackle today is to classify properly a go/no-go attribute which scientists, engineers and business professionals must address daily. Any established R&D would certainly have similar and sometimes hidden knowledge in its archives… Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 1 of 27 2/25/2016 6:27 PM
  • 2. Again, We will use R to demonstrate quickly the approach on this dataset (http://archive.ics.uci.edu/ml/machine- learning-databases/mushroom/agaricus-lepiota.data), and its full description (https://archive.ics.uci.edu /ml/machine-learning-databases/mushroom/agaricus-lepiota.names). We continue to maintain reproducibility of the analysis as a general practice. The analysis tool and platform are documented, all libraries clearly listed, while data is retrieved programmatically and date stamped from the repository. We will display a structure of the mushrooms dataset and the corresponding dictionary to translate the property factors. Sys.info()[1:5] ## sysname release version nodename machine ## "Windows" "7 x64" "build 9200" "STALLION" "x86-64" sessionInfo() ## R version 3.2.2 (2015-08-14) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 8 x64 (build 9200) ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] C50_0.1.0-24 RWeka_0.4-24 stringr_1.0.0 ## ## loaded via a namespace (and not attached): ## [1] digest_0.6.8 grid_3.2.2 formatR_1.2 ## [4] magrittr_1.5 evaluate_0.7.2 RWekajars_3.7.12-1 ## [7] stringi_1.0-1 partykit_1.0-3 rmarkdown_0.9.2 ## [10] splines_3.2.2 tools_3.2.2 yaml_2.1.13 ## [13] survival_2.38-3 rJava_0.9-7 htmltools_0.2.6 ## [16] knitr_1.11 Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 2 of 27 2/25/2016 6:27 PM
  • 3. library(stringr) library(RWeka) library(C50) library(rpart) library(rattle) userdir <- getwd() datadir <- "./data" if (!file.exists("data")){dir.create("data")} fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus -lepiota.data?accessType=DOWNLOAD" download.file(fileUrl,destfile="./data/Mushrooms_Data.csv") dateDownloaded <- date() mushrooms <- read.csv("./data/Mushrooms_Data.csv",header=FALSE) fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricu s-lepiota.names?accessType=DOWNLOAD" download.file(fileUrl,destfile="./data/Names.txt") txt <- readLines("./data/Names.txt") lns <- data.frame(beg=which(grepl("P_1) odor=",txt)),end=which(grepl("on the whole dat aset.",txt))) # we now capture all lines of text between beg and end from txt res <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[ l],by=1)],collapse=" ")}) res <- gsub("t", "", res, fixed = TRUE) res <- gsub("( {2,})"," ",res, fixed=FALSE) res <- gsub("P_","n",res,fixed=TRUE) writeLines(res,"./data/parsed_res.csv") res <- readLines("./data/parsed_res.csv") res<-res[-1] lns <- data.frame(beg=which(grepl("7. 
Attribute Information:",txt)),end=which(grepl("u rban=u,waste=w,woods=d",txt))) txt <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[ l],by=1)],collapse=" ")}) txt <- gsub(" ", "", txt, fixed = TRUE) txt <- gsub("(d+.)","n",txt, fixed=FALSE) txt <- gsub("nAttributeInformation:(","",txt,fixed=FALSE) txt <- gsub(")","",txt,fixed=FALSE) txt <- gsub(":",",",txt,fixed=TRUE) txt <- gsub("?","",txt,fixed=TRUE) txt <- gsub("-","_",txt,fixed=TRUE) writeLines(txt,"./data/parsed.csv") attrib <- readLines("./data/parsed.csv") attrib <- sapply (1:length(attrib),function(i) {gsub(","," ",attrib[i],fixed=TRUE)}) dictionary <- sapply (1:length(attrib),function(i) {strsplit(attrib[i],' ')}) colnames(mushrooms)<-sapply(1:length(attrib),function(i) {colnames(mushrooms)[i]<-dict ionary[[i]][1]}) dictionary<-sapply (1:length(attrib),function(i) {dictionary[[i]][-1]}) # contains the levels strings dictionary<-sapply(1:length(attrib),function(i){sapply(1:lengths(dictionary[i]),functi on(j){p1<-strsplit(dictionary[[i]][j],"=")[[1]][1];p2<-strsplit(dictionary[[i]][j],"=" )[[1]][2];dictionary[[i]][j]<-paste0(p2,',',p1)})}) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 3 of 27 2/25/2016 6:27 PM
  • 4. Step 2: Clean the Data We notice that the stalk_root property has a missing level indicated with ‘?’. We can attempt two analysis: First, we keep the missing data as coded and proceed with the classification models. We also can easily recode as missing with the value, drop the corresponding level, and omit all non-complete cases in a new dataset mushrooms_complete. mushrooms_complete<-mushrooms mushrooms_complete$stalk_root[mushrooms_complete$stalk_root=='?']<-NA mushrooms_complete<-mushrooms_complete[complete.cases(mushrooms_complete),] mushrooms_complete$stalk_root<-droplevels(mushrooms_complete$stalk_root) str(mushrooms_complete) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 4 of 27 2/25/2016 6:27 PM
  • 5. ## 'data.frame': 5644 obs. of 23 variables: ## $ classes : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ... ## $ cap_shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ... ## $ cap_surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ... ## $ cap_color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ... ## $ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ... ## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ... ## $ gill_attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ... ## $ gill_spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ... ## $ gill_size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ... ## $ gill_color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ... ## $ stalk_shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ... ## $ stalk_root : Factor w/ 4 levels "b","c","e","r": 3 2 2 3 3 2 2 2 3 2 ... ## $ stalk_surface_above_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_surface_below_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_color_above_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ... ## $ stalk_color_below_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ... ## $ veil_type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ... ## $ veil_color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ... ## $ ring_number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 .. . ## $ ring_type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ... ## $ spore_print_color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ... ## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ... ## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ... 
table(mushrooms_complete$classes) ## ## e p ## 3488 2156 We can now reassign translated levels to both the original and the complete mushroom datasets. Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 5 of 27 2/25/2016 6:27 PM
  • 6. m <- sapply (1:length(attrib),function(i){levels(mushrooms[[i]]) <- sapply(1:length (l evels(mushrooms[[i]])),function(j){ a<-strsplit(dictionary[[i]][[j]],",")[[1]][1] b<-strsplit(dictionary[[i]][[j]],",")[[1]][2] levels(mushrooms[[i]])[levels(mushrooms[[i]])==a] <- b } ) mushrooms[[i]]} ) m <- as.data.frame(m) colnames(m) <- colnames(mushrooms) mushrooms <- m m <- sapply (1:length(attrib),function(i){levels(mushrooms_complete[[i]]) <- sapply(1: length (levels(mushrooms_complete[[i]])),function(j){ a<-strsplit(dictionary[[i]][[j]],",")[[1]][1] b<-strsplit(dictionary[[i]][[j]],",")[[1]][2] levels(mushrooms_complete[[i]])[levels(mushrooms_complete[[i]])==a] <- b } ) mushrooms_complete[[i]]} ) m <- as.data.frame(m) colnames(m) <- colnames(mushrooms_complete) mushrooms_complete <- m rm(m,lns,attrib,txt,dictionary)# cleanup As we observe that the veil_type feature is absolutely common with a single factor, we can exclude it from further analysis and examine the remaining 22 properties: we’ll observe a fairly balanced classification set with 4208 edible and 3916 poisonous mushrooms on the original set. mushrooms$veil_type <- NULL str(mushrooms) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 6 of 27 2/25/2016 6:27 PM
  • 7. ## 'data.frame': 8124 obs. of 22 variables: ## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ... ## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ... ## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ... ## $ cap_color : Factor w/ 10 levels "brown","buff",..: 5 10 9 9 4 10 9 9 9 10 ... ## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ... ## $ odor : Factor w/ 9 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ... ## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ... ## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ... ## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ... ## $ gill_color : Factor w/ 12 levels "black","brown",..: 5 5 6 6 5 6 3 6 8 3 ... ## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ... ## $ stalk_root : Factor w/ 5 levels "bulbous","club",..: 4 3 3 4 4 3 3 3 4 3 ... ## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ... ## $ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ... ## $ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ... ## $ ring_type : Factor w/ 5 levels "cobwebby","evanescent",..: 5 5 5 5 1 5 5 5 5 5 ... ## $ spore_print_color : Factor w/ 9 levels "black","brown",..: 3 4 4 3 4 3 3 4 3 3 ... ## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ... ## $ habitat : Factor w/ 7 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ... 
table(mushrooms$classes) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 7 of 27 2/25/2016 6:27 PM
  • 8. ## ## edible poisonous ## 4208 3916 mushrooms_complete$veil_type <- NULL str(mushrooms_complete) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 8 of 27 2/25/2016 6:27 PM
  • 9. ## 'data.frame': 5644 obs. of 22 variables: ## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ... ## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ... ## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ... ## $ cap_color : Factor w/ 8 levels "brown","buff",..: 5 8 7 7 4 8 7 7 7 8 ... ## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ... ## $ odor : Factor w/ 7 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ... ## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ... ## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ... ## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ... ## $ gill_color : Factor w/ 9 levels "buff","chocolate",..: 3 3 4 4 3 4 1 4 5 1 ... ## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ... ## $ stalk_root : Factor w/ 4 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ... ## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_color_above_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ... ## $ stalk_color_below_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ... ## $ veil_color : Factor w/ 2 levels "white","yellow": 1 1 1 1 1 1 1 1 1 1 ... ## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ... ## $ ring_type : Factor w/ 4 levels "cobwebby","flaring",..: 4 4 4 4 1 4 4 4 4 4 ... ## $ spore_print_color : Factor w/ 6 levels "brown","buff",..: 2 3 3 2 3 2 2 3 2 2 ... ## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ... ## $ habitat : Factor w/ 6 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ... 
table(mushrooms_complete$classes) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 9 of 27 2/25/2016 6:27 PM
  • 10. ## ## edible poisonous ## 3488 2156 However, the complete set is not only smaller, but a bit more imbalanced after removal of the missing data with 3488 edible and 2156 poisonous mushrooms. We now will conduct 2 parallel analysis streams to compare performance classification and explore multiple approaches, to attempt a perfect classification. Let’s start with OneR classification, from the RWeka package. Step 3: Classifying Data with OneR We classify the original and complete mushrooms datasets. mushroom_1R <- OneR(classes ~ .,data = mushrooms) mushroomc_1R <- OneR(classes ~ .,data = mushrooms_complete) Step 4: Evaluating OneR Performance mushroom_1R ## odor: ## almond -> edible ## anise -> poisonous ## creosote -> poisonous ## fishy -> edible ## foul -> poisonous ## musty -> edible ## none -> poisonous ## pungent -> poisonous ## spicy -> poisonous ## (8004/8124 instances correct) summary(mushroom_1R) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 10 of 27 2/25/2016 6:27 PM
  • 11. ## ## === Summary === ## ## Correctly Classified Instances 8004 98.5229 % ## Incorrectly Classified Instances 120 1.4771 % ## Kappa statistic 0.9704 ## Mean absolute error 0.0148 ## Root mean squared error 0.1215 ## Relative absolute error 2.958 % ## Root relative squared error 24.323 % ## Coverage of cases (0.95 level) 98.5229 % ## Mean rel. region size (0.95 level) 50 % ## Total Number of Instances 8124 ## ## === Confusion Matrix === ## ## a b <-- classified as ## 4208 0 | a = edible ## 120 3796 | b = poisonous mushroomc_1R ## odor: ## almond -> edible ## anise -> poisonous ## creosote -> poisonous ## fishy -> edible ## foul -> poisonous ## musty -> edible ## none -> poisonous ## (5556/5644 instances correct) summary(mushroomc_1R) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 11 of 27 2/25/2016 6:27 PM
  • 12. ## ## === Summary === ## ## Correctly Classified Instances 5556 98.4408 % ## Incorrectly Classified Instances 88 1.5592 % ## Kappa statistic 0.9667 ## Mean absolute error 0.0156 ## Root mean squared error 0.1249 ## Relative absolute error 3.3022 % ## Root relative squared error 25.6994 % ## Coverage of cases (0.95 level) 98.4408 % ## Mean rel. region size (0.95 level) 50 % ## Total Number of Instances 5644 ## ## === Confusion Matrix === ## ## a b <-- classified as ## 3488 0 | a = edible ## 88 2068 | b = poisonous We observe the OneR model provides more than 98.52% correct classification using only the odor as criteria on the original set and 98.44% on the complete set. However the confusion matrix reveals 120 poisonous mushrooms were classified as edible in the original dataset and 88 in the complete dataset. Let’s try to improve on the OneR model, using JRip. Step 5: Improving Model with JRip mushroom_JRip <- JRip(classes ~ ., data = mushrooms) mushroom_JRip ## JRIP rules: ## =========== ## ## (odor = creosote) => classes=poisonous (2160.0/0.0) ## (gill_size = narrow) and (gill_color = black) => classes=poisonous (1152.0/0.0) ## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0) ## (odor = anise) => classes=poisonous (192.0/0.0) ## (spore_print_color = orange) => classes=poisonous (72.0/0.0) ## (stalk_surface_below_ring = smooth) and (stalk_surface_above_ring = scaly) => class es=poisonous (68.0/0.0) ## (habitat = meadows) and (cap_color = white) => classes=poisonous (8.0/0.0) ## (stalk_color_above_ring = yellow) => classes=poisonous (8.0/0.0) ## => classes=edible (4208.0/0.0) ## ## Number of Rules : 9 summary(mushroom_JRip) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 12 of 27 2/25/2016 6:27 PM
  • 13. ## ## === Summary === ## ## Correctly Classified Instances 8124 100 % ## Incorrectly Classified Instances 0 0 % ## Kappa statistic 1 ## Mean absolute error 0 ## Root mean squared error 0 ## Relative absolute error 0 % ## Root relative squared error 0 % ## Coverage of cases (0.95 level) 100 % ## Mean rel. region size (0.95 level) 50 % ## Total Number of Instances 8124 ## ## === Confusion Matrix === ## ## a b <-- classified as ## 4208 0 | a = edible ## 0 3916 | b = poisonous mushroomc_JRip <- JRip(classes ~ ., data = mushrooms_complete) mushroomc_JRip ## JRIP rules: ## =========== ## ## (odor = creosote) => classes=poisonous (1584.0/0.0) ## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0) ## (odor = anise) => classes=poisonous (192.0/0.0) ## (spore_print_color = orange) => classes=poisonous (72.0/0.0) ## (population = clustered) => classes=poisonous (52.0/0.0) ## => classes=edible (3488.0/0.0) ## ## Number of Rules : 6 summary(mushroomc_JRip) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 13 of 27 2/25/2016 6:27 PM
  • 14. ## ## === Summary === ## ## Correctly Classified Instances 5644 100 % ## Incorrectly Classified Instances 0 0 % ## Kappa statistic 1 ## Mean absolute error 0 ## Root mean squared error 0 ## Relative absolute error 0 % ## Root relative squared error 0 % ## Coverage of cases (0.95 level) 100 % ## Mean rel. region size (0.95 level) 50 % ## Total Number of Instances 5644 ## ## === Confusion Matrix === ## ## a b <-- classified as ## 3488 0 | a = edible ## 0 2156 | b = poisonous We observe that JRip derives 9 rules with 22 variables, and can classify correctly the original set. However, on the complete set, only 6 rules are derived to reach the same perfect classification. Step 6: Improving Model with C5.0 In the next step, we’ll attempt to improve selection performance using the C5.0 package, which we’ll apply using odor and gill_size (the two most influent factor variables), and then compare with all 22 variables selected. mushroom_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms, rules = TRUE) summary(mushroom_c5rules) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 14 of 27 2/25/2016 6:27 PM
## 
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data = mushrooms,
##  rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]    Wed Feb 24 18:58:20 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 8124 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (4328/120, lift 1.9)
##  odor in {almond, fishy, musty}
##  -> class edible [0.972]
## 
## Rule 2: (3796, lift 2.1)
##  odor in {anise, creosote, foul, none, pungent, spicy}
##  -> class poisonous [1.000]
## 
## Default class: edible
## 
## 
## Evaluation on training data (8124 cases):
## 
##      Rules
##  ----------------
##    No      Errors
## 
##     2  120( 1.5%)   <<
## 
## 
##  (a)   (b)    <-classified as
##  ----  ----
##  4208         (a): class edible
##   120  3796   (b): class poisonous
## 
## 
##  Attribute usage:
## 
##  100.00% odor
## 
## 
## Time: 0.0 secs
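The annotations on each C5.0 rule can be decoded as follows: "(n/m, lift x)" gives the cases covered (n) and misclassified (m); per the C5.0 documentation, the bracketed confidence is the Laplace ratio (n - m + 1)/(n + 2), and lift is that confidence divided by the predicted class's base rate in the training data. A quick check of Rule 1 above (Python, for illustration):

```python
# Decode the "(4328/120, lift 1.9)" annotation on Rule 1 of the original set.
n, m = 4328, 120         # cases covered / misclassified by the rule
base_rate = 4208 / 8124  # fraction of edible mushrooms in the original set

confidence = (n - m + 1) / (n + 2)  # Laplace estimate of rule accuracy
lift = confidence / base_rate

print(round(confidence, 3))  # -> 0.972, matching "class edible [0.972]"
print(round(lift, 1))        # -> 1.9, matching "lift 1.9"
```

The same arithmetic reproduces the annotations on the complete-set rules in the next output.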
mushroomc_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5rules)
## 
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data
##  = mushrooms_complete, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]    Wed Feb 24 18:58:20 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 5644 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (3576/88, lift 1.6)
##  odor in {almond, fishy, musty}
##  -> class edible [0.975]
## 
## Rule 2: (2068, lift 2.6)
##  odor in {anise, creosote, foul, none}
##  -> class poisonous [1.000]
## 
## Default class: edible
## 
## 
## Evaluation on training data (5644 cases):
## 
##      Rules
##  ----------------
##    No      Errors
## 
##     2   88( 1.6%)   <<
## 
## 
##  (a)   (b)    <-classified as
##  ----  ----
##  3488         (a): class edible
##    88  2068   (b): class poisonous
## 
## 
##  Attribute usage:
## 
##  100.00% odor
## 
## 
## Time: 0.0 secs

On the original dataset, we observe that C5.0 applied to the two most influential factor variables yields results similar to OneR, classifying 98.52% of the mushrooms correctly and leaving 120 misclassified! On the complete set, C5.0 results are again similar to OneR: 98.44% of the mushrooms are classified correctly, leaving 88
mushrooms misclassified. Let’s now apply C5.0 to all 22 variables.

Step 7: Improving C5.0 using all variables

mushroom_c5improved_rules <- C5.0(classes ~ ., data = mushrooms, rules = TRUE)
summary(mushroom_c5improved_rules)
## 
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]    Wed Feb 24 18:58:21 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 8124 cases (22 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (4148/4, lift 1.9)
##  cap_surface in {fibrous, scaly, smooth}
##  odor in {almond, fishy, musty}
##  stalk_color_below_ring in {cinnamon, gray, pink, red, white}
##  spore_print_color in {black, brown, buff, chocolate, green, purple,
##                        white, yellow}
##  -> class edible [0.999]
## 
## Rule 2: (3500/12, lift 1.9)
##  cap_surface in {fibrous, scaly, smooth}
##  odor in {almond, fishy, musty}
##  stalk_root in {club, cup, equal, rhizomorphs}
##  spore_print_color in {buff, chocolate, purple, white}
##  -> class edible [0.996]
## 
## Rule 3: (3796, lift 2.1)
##  odor in {anise, creosote, foul, none, pungent, spicy}
##  -> class poisonous [1.000]
## 
## Rule 4: (72, lift 2.0)
##  spore_print_color = orange
##  -> class poisonous [0.986]
## 
## Rule 5: (24, lift 2.0)
##  stalk_color_below_ring = yellow
##  -> class poisonous [0.962]
## 
## Rule 6: (16, lift 2.0)
##  stalk_root = bulbous
##  stalk_color_below_ring = orange
##  -> class poisonous [0.944]
## 
## Rule 7: (4, lift 1.7)
##  cap_surface = grooves
##  -> class poisonous [0.833]
## 
## Default class: edible
## 
## 
## Evaluation on training data (8124 cases):
## 
##      Rules
##  ----------------
##    No      Errors
## 
##     7   12( 0.1%)   <<
## 
## 
##  (a)   (b)    <-classified as
##  ----  ----
##  4208         (a): class edible
##    12  3904   (b): class poisonous
## 
## 
##  Attribute usage:
## 
##   98.67% odor
##   52.83% spore_print_color
##   51.99% cap_surface
##   51.55% stalk_color_below_ring
##   43.28% stalk_root
## 
## 
## Time: 0.1 secs

mushroomc_c5improved_rules <- C5.0(classes ~ ., data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5improved_rules)
## 
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms_complete, rules
##  = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]    Wed Feb 24 18:58:22 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 5644 cases (22 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (3488, lift 1.6)
##  odor in {almond, fishy, musty}
##  spore_print_color in {buff, chocolate, purple, white}
##  population in {abundant, numerous, scattered, several, solitary}
##  -> class edible [1.000]
## 
## Rule 2: (2068, lift 2.6)
##  odor in {anise, creosote, foul, none}
##  -> class poisonous [1.000]
## 
## Rule 3: (72, lift 2.6)
##  spore_print_color = orange
##  -> class poisonous [0.986]
## 
## Rule 4: (52, lift 2.6)
##  population = clustered
##  -> class poisonous [0.981]
## 
## Default class: edible
## 
## 
## Evaluation on training data (5644 cases):
## 
##      Rules
##  ----------------
##    No      Errors
## 
##     4    0( 0.0%)   <<
## 
## 
##  (a)   (b)    <-classified as
##  ----  ----
##  3488         (a): class edible
##        2156   (b): class poisonous
## 
## 
##  Attribute usage:
## 
##   98.44% odor
##   63.08% spore_print_color
##   62.72% population
## 
## 
## Time: 0.0 secs

Using all 22 variables on the original dataset, C5.0 derives 7 rules and classifies all but 12 mushrooms correctly. On the complete dataset, C5.0 derives 4 rules and classifies all mushrooms correctly! We can easily chart the trees using the rpart and rattle packages (rpart expects a formula, so we fit its trees directly on the same data):

tree <- rpart(classes ~ ., data = mushrooms, control = rpart.control(minsplit = 20, cp = 0))
fancyRpartPlot(tree, palettes = c("Greys", "Oranges"), cex = 0.75, main = "Original Mushroom Dataset", sub = "")
treec <- rpart(classes ~ ., data = mushrooms_complete, control = rpart.control(minsplit = 20, cp = 0))
fancyRpartPlot(treec, palettes = c("Greys", "Oranges"), cex = 0.75, main = "Complete Mushroom Dataset", sub = "")
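The accuracy percentages quoted in these steps all follow from the same arithmetic, accuracy = 1 - errors/total; a quick check (Python, for illustration, with the `accuracy` helper being ours):

```python
# Verify the accuracy figures quoted in the text from the error counts
# reported in the model summaries.
def accuracy(errors, total):
    """Percent correct, rounded to two decimals."""
    return round(100 * (1 - errors / total), 2)

print(accuracy(120, 8124))  # -> 98.52  (odor + gill_size rules, original set)
print(accuracy(88, 5644))   # -> 98.44  (odor + gill_size rules, complete set)
print(accuracy(12, 8124))   # -> 99.85  (all-variable C5.0 rules, original set)
```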
Finally, we will use PART to classify, and compare the results.

mushroom_PART_rules <- PART(classes ~ ., data = mushrooms)
mushroom_PART_rules
## PART decision list
## ------------------
## 
## odor = creosote: poisonous (2160.0)
## 
## gill_size = broad AND
## ring_number = one: edible (3392.0)
## 
## ring_number = two AND
## spore_print_color = white: edible (528.0)
## 
## odor = pungent: poisonous (576.0)
## 
## odor = spicy: poisonous (576.0)
## 
## stalk_shape = enlarging AND
## stalk_surface_below_ring = silky AND
## odor = none: poisonous (256.0)
## 
## stalk_shape = enlarging AND
## odor = anise: poisonous (192.0)
## 
## gill_size = narrow AND
## stalk_surface_above_ring = silky AND
## population = several: edible (192.0)
## 
## gill_size = broad: poisonous (108.0)
## 
## stalk_surface_below_ring = silky AND
## bruises = bruises: edible (60.0)
## 
## stalk_surface_below_ring = smooth: poisonous (40.0)
## 
## bruises = bruises: edible (36.0)
## 
## : poisonous (8.0)
## 
## Number of Rules : 13

mushroomc_PART_rules <- PART(classes ~ ., data = mushrooms_complete)
mushroomc_PART_rules
## PART decision list
## ------------------
## 
## odor = musty AND
## ring_number = one AND
## veil_color = white AND
## gill_size = broad: edible (2496.0)
## 
## odor = creosote: poisonous (1584.0)
## 
## odor = almond: edible (400.0)
## 
## odor = fishy: edible (400.0)
## 
## odor = none: poisonous (256.0)
## 
## odor = anise: poisonous (192.0)
## 
## stalk_root = cup: edible (96.0)
## 
## spore_print_color = orange: poisonous (72.0)
## 
## stalk_root = bulbous AND
## population = several: edible (64.0)
## 
## population = clustered: poisonous (52.0)
## 
## : edible (32.0)
## 
## Number of Rules : 11

On the original mushrooms dataset, PART classifies everything correctly but must rely on 13 rules to do so. On the complete set, PART achieves the same perfect outcome with 11 rules.

Step 8: Comparisons with Original Rules in Reference Material

It is always interesting to compare a solution to alternatives. In this case we can refer to the original rules derived in 1997 and extracted from the documentation, which resulted in 48 errors, or 99.41% accuracy, on the whole dataset:

res
## [1] "1) odor=NOT(almond.OR.anise.OR.none) 120 poisonous cases missed, 98.52% accuracy "
## [2] "2) spore-print-color=green 48 cases missed, 99.41% accuracy "
## [3] "3) odor=none.AND.stalk-surface-below-ring=scaly.AND.(stalk-color-above-ring=NOT.brown) 8 cases missed, 99.90% accuracy "
## [4] "4) habitat=leaves.AND.cap-color=white 100% accuracy Rule "
## [5] "4) may also be "
## [6] "4') population=clustered.AND.cap_color=white These rule involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule: odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green gives 48 errors, or 99.41% accuracy on the whole dataset."

Conclusions

The C5.0 algorithm applied to all 22 variables of the complete mushroom set classifies correctly with only 4 rules. This is the best performance we achieved on the set: the minimum number of rules, with a perfect outcome on the complete dataset. It also selected only 3 variables (odor, spore_print_color and population) out of the 22 provided, compared with the referenced document, where 6 attributes and 4 rules resulted in 99.41% accuracy.

We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help resolve classification challenges, offering a fast, efficient and economical alternative to tedious experimentation. It is easy to imagine how similar questions can be resolved in all types of R&D, in materials, cosmetics, food or any scientific area. This second tool is certainly as useful as the formulation tool (http://rpubs.com/m3cinc/machine_learning_key_to_your_formulation_challenges) we reviewed previously.
Classifying rubber compounds to meet rolling-resistance and emissions targets, modern composites for renewable energy sources, lightweight transportation vehicles and next-generation public transit, or innovative UV-shield ointments and tasty snacks and drinks: all present similar challenges where only the nature of the inputs and outputs varies. Therefore, this method too can and should be applied broadly! Why not try and implement Machine Learning in your scientific or technical expert area? Remember, PRC Consulting, LLC (http://www.prcconsulting.net) is dedicated to boosting innovation through improved Analytics, one customer at a time!

References
The following sources are referenced as they provided significant help and information to develop this Machine Learning analysis applied to classification:

1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
2. mushroom documentation (https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names)
3. stringr (https://cran.r-project.org/web/packages/stringr/stringr.pdf)
4. RWeka (https://cran.r-project.org/web/packages/RWeka/RWeka.pdf)
5. C50 (https://cran.r-project.org/web/packages/C50/C50.pdf)
6. rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf)
7. rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf)
8. rattle (http://rattle.togaware.com/)
9. RStudio (https://www.rstudio.com)
10. Machine Learning, Key to Your Formulation Challenges (http://rpubs.com/m3cinc/machine_learning_key_to_your_formulation_challenges)