Machine Learning, Key to Your
Classification Challenges
Marc Borowczak, PRC Consulting LLC (http://www.prcconsulting.net)
February 25, 2016
Classification Challenges are Everywhere…
Step 1: Retrieve Existing Data
Step 2: Clean the Data
Step 3: Classifying Data with OneR
Step 4: Evaluating OneR Performance
Step 5: Improving Model with JRip
Step 6: Improving Model with C5.0
Step 7: Improving C5.0 using all variables
Step 8: Comparisons with Original Rules in Reference Material
Conclusions
References
Classification Challenges are Everywhere…
Whether you develop pharmaceutical, cosmetic, food, industrial or civil engineered products, you are often confronted
with the challenge of sorting and classifying to meet process or performance requirements. Traditional
Research and Development approaches the problem with experimentation, but experimental programs carry design,
time and resource constraints, and can be slow, expensive and often redundant, quickly forgotten
or rendered obsolete.
Consider the alternative that Machine Learning tools offer today. We will show that this approach is not only quick and
efficient, but arguably how the Front End of Innovation should proceed, and that it is particularly suited for
classification, an essential step in reducing complexity, optimizing product segmentation, supporting Lean Innovation
and establishing robust supply networks.
Today, we will explain how Machine Learning can shed new light on this generic and very persistent
classification and clustering challenge. Using modern algorithms, we will derive classifications that are both simple
(we prefer fewer rules) and accurate (here, perfect) on a complete dataset.
If you haven’t read about the other important aspect, formulation optimization, please consult the companion article
Machine Learning, Key to Your Formulation Challenges (http://rpubs.com/m3cinc
/machine_learning_key_to_your_formulation_challenges).
Step 1: Retrieve Existing Data
We will mirror the approach used in the formulation challenge and use another dataset hosted on the UCI Machine
Learning Repository (http://archive.ics.uci.edu/ml/datasets.html): we will classify the edible attribute of…Mushrooms
(http://archive.ics.uci.edu/ml/datasets/Mushroom), based on attributes described in The Audubon Society Field
Guide to North American Mushrooms (1981). The challenge we tackle today, correctly classifying a go/no-go
attribute, is one that scientists, engineers and business professionals must address daily. Any established R&D
organization certainly holds similar, and sometimes hidden, knowledge in its archives…
Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html
1 of 27 2/25/2016 6:27 PM
Again, we will use R to demonstrate the approach quickly on this dataset (http://archive.ics.uci.edu/ml/machine-
learning-databases/mushroom/agaricus-lepiota.data), with its full description (https://archive.ics.uci.edu
/ml/machine-learning-databases/mushroom/agaricus-lepiota.names). We continue to maintain reproducibility of
the analysis as a general practice: the analysis tool and platform are documented, all libraries are clearly listed,
and the data is retrieved programmatically and date-stamped from the repository.
We will display the structure of the mushrooms dataset and the corresponding dictionary used to translate the
property factors.
Sys.info()[1:5]
## sysname release version nodename machine
## "Windows" "7 x64" "build 9200" "STALLION" "x86-64"
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] C50_0.1.0-24 RWeka_0.4-24 stringr_1.0.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.8 grid_3.2.2 formatR_1.2
## [4] magrittr_1.5 evaluate_0.7.2 RWekajars_3.7.12-1
## [7] stringi_1.0-1 partykit_1.0-3 rmarkdown_0.9.2
## [10] splines_3.2.2 tools_3.2.2 yaml_2.1.13
## [13] survival_2.38-3 rJava_0.9-7 htmltools_0.2.6
## [16] knitr_1.11
library(stringr)
library(RWeka)
library(C50)
library(rpart)
library(rattle)
userdir <- getwd()
datadir <- "./data"
if (!file.exists("data")){dir.create("data")}
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/Mushrooms_Data.csv")
dateDownloaded <- date()
mushrooms <- read.csv("./data/Mushrooms_Data.csv",header=FALSE)
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/Names.txt")
txt <- readLines("./data/Names.txt")
lns <- data.frame(beg=which(grepl("P_1) odor=",txt)),end=which(grepl("on the whole dataset.",txt)))
# we now capture all lines of text between beg and end from txt
res <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
res <- gsub("\t", "", res, fixed = TRUE)
res <- gsub("( {2,})"," ",res, fixed=FALSE)
res <- gsub("P_","\n",res,fixed=TRUE)
writeLines(res,"./data/parsed_res.csv")
res <- readLines("./data/parsed_res.csv")
res <- res[-1]
lns <- data.frame(beg=which(grepl("7. Attribute Information:",txt)),end=which(grepl("urban=u,waste=w,woods=d",txt)))
txt <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[l],by=1)],collapse=" ")})
txt <- gsub(" ", "", txt, fixed = TRUE)
txt <- gsub("(\\d+\\.)","\n",txt, fixed=FALSE)
txt <- gsub("\nAttributeInformation:\\(","",txt,fixed=FALSE)
txt <- gsub("\\)","",txt,fixed=FALSE)
txt <- gsub(":",",",txt,fixed=TRUE)
txt <- gsub("?","",txt,fixed=TRUE)
txt <- gsub("-","_",txt,fixed=TRUE)
writeLines(txt,"./data/parsed.csv")
attrib <- readLines("./data/parsed.csv")
attrib <- sapply(1:length(attrib),function(i) {gsub(","," ",attrib[i],fixed=TRUE)})
dictionary <- sapply(1:length(attrib),function(i) {strsplit(attrib[i],' ')})
colnames(mushrooms) <- sapply(1:length(attrib),function(i) {colnames(mushrooms)[i]<-dictionary[[i]][1]})
dictionary <- sapply(1:length(attrib),function(i) {dictionary[[i]][-1]}) # contains the levels strings
dictionary <- sapply(1:length(attrib),function(i){sapply(1:lengths(dictionary[i]),function(j){p1<-strsplit(dictionary[[i]][j],"=")[[1]][1];p2<-strsplit(dictionary[[i]][j],"=")[[1]][2];dictionary[[i]][j]<-paste0(p2,',',p1)})})
Step 2: Clean the Data
We notice that the stalk_root property has a missing level indicated with ‘?’. We can attempt two analyses: first,
we keep the missing data as coded and proceed with the classification models; second, we recode the ‘?’ value
as missing (NA), drop the corresponding level, and keep only the complete cases in a new dataset,
mushrooms_complete.
mushrooms_complete<-mushrooms
mushrooms_complete$stalk_root[mushrooms_complete$stalk_root=='?']<-NA
mushrooms_complete<-mushrooms_complete[complete.cases(mushrooms_complete),]
mushrooms_complete$stalk_root<-droplevels(mushrooms_complete$stalk_root)
str(mushrooms_complete)
## 'data.frame': 5644 obs. of 23 variables:
## $ classes : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap_surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
## $ cap_color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill_attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk_shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 4 levels "b","c","e","r": 3 2 2 3 3 2 2 2 3 2 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_above_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk_color_below_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil_type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil_color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ ring_number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore_print_color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
table(mushrooms_complete$classes)
##
## e p
## 3488 2156
We can now reassign translated levels to both the original and the complete mushroom datasets.
m <- sapply(1:length(attrib),function(i){levels(mushrooms[[i]]) <- sapply(1:length(levels(mushrooms[[i]])),function(j){
a<-strsplit(dictionary[[i]][[j]],",")[[1]][1]
b<-strsplit(dictionary[[i]][[j]],",")[[1]][2]
levels(mushrooms[[i]])[levels(mushrooms[[i]])==a] <- b } )
mushrooms[[i]]} )
m <- as.data.frame(m)
colnames(m) <- colnames(mushrooms)
mushrooms <- m
m <- sapply(1:length(attrib),function(i){levels(mushrooms_complete[[i]]) <- sapply(1:length(levels(mushrooms_complete[[i]])),function(j){
a<-strsplit(dictionary[[i]][[j]],",")[[1]][1]
b<-strsplit(dictionary[[i]][[j]],",")[[1]][2]
levels(mushrooms_complete[[i]])[levels(mushrooms_complete[[i]])==a] <- b } )
mushrooms_complete[[i]]} )
m <- as.data.frame(m)
colnames(m) <- colnames(mushrooms_complete)
mushrooms_complete <- m
rm(m,lns,attrib,txt,dictionary) # cleanup
As the veil_type feature is constant (a single level across all observations), it carries no information and can be
excluded from further analysis, leaving 22 properties to examine. On the original set we observe a fairly balanced
classification problem, with 4208 edible and 3916 poisonous mushrooms.
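More generally, constant factors can be detected programmatically rather than by eye. A minimal sketch on a toy data frame (the data here is illustrative, not the mushroom dataset itself):

```r
# Toy data frame with one constant factor column
df <- data.frame(classes   = factor(c("e", "p", "e")),
                 veil_type = factor(c("p", "p", "p")),
                 odor      = factor(c("a", "n", "a")))

# A factor column is uninformative when it has a single level
constant_cols <- names(df)[sapply(df, nlevels) == 1]
constant_cols                # "veil_type"
df[constant_cols] <- NULL    # drop the constant columns
```

This generalizes the manual removal of a single known column to any number of constant factors.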
mushrooms$veil_type <- NULL
str(mushrooms)
## 'data.frame': 8124 obs. of 22 variables:
## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ...
## $ cap_color : Factor w/ 10 levels "brown","buff",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 12 levels "black","brown",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 5 levels "bulbous","club",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 5 levels "cobwebby","evanescent",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore_print_color : Factor w/ 9 levels "black","brown",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ...
table(mushrooms$classes)
##
## edible poisonous
## 4208 3916
mushrooms_complete$veil_type <- NULL
str(mushrooms_complete)
## 'data.frame': 5644 obs. of 22 variables:
## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ...
## $ cap_color : Factor w/ 8 levels "brown","buff",..: 5 8 7 7 4 8 7 7 7 8 ...
## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 7 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill_color : Factor w/ 9 levels "buff","chocolate",..: 3 3 4 4 3 4 1 4 5 1 ...
## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk_root : Factor w/ 4 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk_color_above_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ stalk_color_below_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ veil_color : Factor w/ 2 levels "white","yellow": 1 1 1 1 1 1 1 1 1 1 ...
## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring_type : Factor w/ 4 levels "cobwebby","flaring",..: 4 4 4 4 1 4 4 4 4 4 ...
## $ spore_print_color : Factor w/ 6 levels "brown","buff",..: 2 3 3 2 3 2 2 3 2 2 ...
## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 6 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ...
table(mushrooms_complete$classes)
##
## edible poisonous
## 3488 2156
However, the complete set is not only smaller but also somewhat more imbalanced after removal of the missing
data, with 3488 edible and 2156 poisonous mushrooms.
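The shift in class balance can be quantified directly. A small sketch using the class counts reported in the table() outputs above:

```r
# Class counts transcribed from the table() outputs above
counts_original <- c(edible = 4208, poisonous = 3916)
counts_complete <- c(edible = 3488, poisonous = 2156)

# Proportions show how the balance shifts after dropping incomplete cases
round(prop.table(counts_original), 3)   # roughly 0.518 / 0.482
round(prop.table(counts_complete), 3)   # roughly 0.618 / 0.382
```

A roughly 52/48 split becomes 62/38, which is worth keeping in mind when comparing accuracies between the two sets.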
We will now conduct two parallel analysis streams to compare classification performance and explore multiple
approaches, aiming for a perfect classification.
Let’s start with OneR classification, from the RWeka package.
Step 3: Classifying Data with OneR
We classify the original and complete mushrooms datasets.
mushroom_1R <- OneR(classes ~ .,data = mushrooms)
mushroomc_1R <- OneR(classes ~ .,data = mushrooms_complete)
Step 4: Evaluating OneR Performance
mushroom_1R
## odor:
## almond -> edible
## anise -> poisonous
## creosote -> poisonous
## fishy -> edible
## foul -> poisonous
## musty -> edible
## none -> poisonous
## pungent -> poisonous
## spicy -> poisonous
## (8004/8124 instances correct)
summary(mushroom_1R)
##
## === Summary ===
##
## Correctly Classified Instances 8004 98.5229 %
## Incorrectly Classified Instances 120 1.4771 %
## Kappa statistic 0.9704
## Mean absolute error 0.0148
## Root mean squared error 0.1215
## Relative absolute error 2.958 %
## Root relative squared error 24.323 %
## Coverage of cases (0.95 level) 98.5229 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 8124
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 4208 0 | a = edible
## 120 3796 | b = poisonous
mushroomc_1R
## odor:
## almond -> edible
## anise -> poisonous
## creosote -> poisonous
## fishy -> edible
## foul -> poisonous
## musty -> edible
## none -> poisonous
## (5556/5644 instances correct)
summary(mushroomc_1R)
##
## === Summary ===
##
## Correctly Classified Instances 5556 98.4408 %
## Incorrectly Classified Instances 88 1.5592 %
## Kappa statistic 0.9667
## Mean absolute error 0.0156
## Root mean squared error 0.1249
## Relative absolute error 3.3022 %
## Root relative squared error 25.6994 %
## Coverage of cases (0.95 level) 98.4408 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 5644
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 3488 0 | a = edible
## 88 2068 | b = poisonous
We observe that the OneR model achieves over 98.52% correct classification using only odor as the criterion on
the original set, and 98.44% on the complete set. However, the confusion matrix reveals that 120 poisonous
mushrooms were classified as edible in the original dataset, and 88 in the complete dataset.
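Because only one direction of error is dangerous here, it is worth deriving per-class rates from the confusion matrix, not just overall accuracy. A sketch using the original-set counts reported above:

```r
# OneR confusion matrix on the original set (rows = actual, columns = predicted),
# transcribed from the summary output above
cm <- matrix(c(4208,    0,
                120, 3796),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("edible", "poisonous"),
                             predicted = c("edible", "poisonous")))

accuracy <- sum(diag(cm)) / sum(cm)
# Fraction of truly poisonous mushrooms declared edible: the costly error
false_negative_rate <- cm["poisonous", "edible"] / sum(cm["poisonous", ])
round(c(accuracy = accuracy, false_negative_rate = false_negative_rate), 4)
```

Despite 98.5% accuracy, about 3% of the poisonous mushrooms would be eaten, which is why we keep pushing for a perfect classifier.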
Let’s try to improve on the OneR model, using JRip.
Step 5: Improving Model with JRip
mushroom_JRip <- JRip(classes ~ ., data = mushrooms)
mushroom_JRip
## JRIP rules:
## ===========
##
## (odor = creosote) => classes=poisonous (2160.0/0.0)
## (gill_size = narrow) and (gill_color = black) => classes=poisonous (1152.0/0.0)
## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0)
## (odor = anise) => classes=poisonous (192.0/0.0)
## (spore_print_color = orange) => classes=poisonous (72.0/0.0)
## (stalk_surface_below_ring = smooth) and (stalk_surface_above_ring = scaly) => classes=poisonous (68.0/0.0)
## (habitat = meadows) and (cap_color = white) => classes=poisonous (8.0/0.0)
## (stalk_color_above_ring = yellow) => classes=poisonous (8.0/0.0)
## => classes=edible (4208.0/0.0)
##
## Number of Rules : 9
summary(mushroom_JRip)
##
## === Summary ===
##
## Correctly Classified Instances 8124 100 %
## Incorrectly Classified Instances 0 0 %
## Kappa statistic 1
## Mean absolute error 0
## Root mean squared error 0
## Relative absolute error 0 %
## Root relative squared error 0 %
## Coverage of cases (0.95 level) 100 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 8124
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 4208 0 | a = edible
## 0 3916 | b = poisonous
mushroomc_JRip <- JRip(classes ~ ., data = mushrooms_complete)
mushroomc_JRip
## JRIP rules:
## ===========
##
## (odor = creosote) => classes=poisonous (1584.0/0.0)
## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0)
## (odor = anise) => classes=poisonous (192.0/0.0)
## (spore_print_color = orange) => classes=poisonous (72.0/0.0)
## (population = clustered) => classes=poisonous (52.0/0.0)
## => classes=edible (3488.0/0.0)
##
## Number of Rules : 6
summary(mushroomc_JRip)
##
## === Summary ===
##
## Correctly Classified Instances 5644 100 %
## Incorrectly Classified Instances 0 0 %
## Kappa statistic 1
## Mean absolute error 0
## Root mean squared error 0
## Relative absolute error 0 %
## Root relative squared error 0 %
## Coverage of cases (0.95 level) 100 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 5644
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 3488 0 | a = edible
## 0 2156 | b = poisonous
We observe that JRip, given all 22 variables, derives 9 rules and classifies the original set perfectly. On
the complete set, only 6 rules are needed to reach the same perfect classification.
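To make the compactness of the JRip result concrete, the six rules printed above for the complete set can be hand-translated into a plain R function. This is a sketch of the decision list only (rules applied in order, edible as the default), not a replacement for the fitted model:

```r
# Decision list transcribed from the JRip output on the complete set
classify_jrip <- function(odor, gill_size, spore_print_color, population) {
  if (odor == "creosote")                      return("poisonous")
  if (gill_size == "narrow" && odor == "none") return("poisonous")
  if (odor == "anise")                         return("poisonous")
  if (spore_print_color == "orange")           return("poisonous")
  if (population == "clustered")               return("poisonous")
  "edible"                                     # default rule
}

classify_jrip("almond", "broad",  "white", "scattered")  # "edible"
classify_jrip("none",   "narrow", "white", "scattered")  # "poisonous"
```

A six-line decision list is something a domain expert can audit by inspection, which is a large part of the appeal of rule learners over black-box models.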
Step 6: Improving Model with C5.0
In the next step, we’ll attempt to improve classification performance using the C5.0 package, which we’ll first apply
using odor and gill_size (the two most influential factor variables), and then compare with all 22 variables selected.
mushroom_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms, rules = TRUE)
summary(mushroom_c5rules)
##
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data = mushrooms,
## rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Feb 24 18:58:20 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 8124 cases (3 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (4328/120, lift 1.9)
## odor in {almond, fishy, musty}
## -> class edible [0.972]
##
## Rule 2: (3796, lift 2.1)
## odor in {anise, creosote, foul, none, pungent, spicy}
## -> class poisonous [1.000]
##
## Default class: edible
##
##
## Evaluation on training data (8124 cases):
##
## Rules
## ----------------
## No Errors
##
## 2 120( 1.5%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 4208 (a): class edible
## 120 3796 (b): class poisonous
##
##
## Attribute usage:
##
## 100.00% odor
##
##
## Time: 0.0 secs
mushroomc_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5rules)
##
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data
## = mushrooms_complete, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Feb 24 18:58:20 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 5644 cases (3 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (3576/88, lift 1.6)
## odor in {almond, fishy, musty}
## -> class edible [0.975]
##
## Rule 2: (2068, lift 2.6)
## odor in {anise, creosote, foul, none}
## -> class poisonous [1.000]
##
## Default class: edible
##
##
## Evaluation on training data (5644 cases):
##
## Rules
## ----------------
## No Errors
##
## 2 88( 1.6%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 3488 (a): class edible
## 88 2068 (b): class poisonous
##
##
## Attribute usage:
##
## 100.00% odor
##
##
## Time: 0.0 secs
On the original dataset, we observe that C5.0, applied to the two most influential factor variables, yields results
similar to OneR and classifies 98.52% of the mushrooms correctly, leaving 120 misclassified! On the
complete set, C5.0 results are again similar to OneR: 98.44% of the mushrooms are classified correctly, leaving 88
mushrooms misclassified.
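Since eating a poisonous mushroom is far worse than discarding an edible one, the remaining errors could also be attacked with cost-sensitive learning: the C5.0 function accepts a cost matrix penalizing one error direction more than the other. A hedged sketch; the 5:1 penalty is an arbitrary illustration, not part of the original analysis:

```r
# Cost matrix: rows = predicted class, columns = actual class.
# Predicting "edible" for an actually poisonous mushroom costs 5x
# as much as the reverse mistake; correct predictions cost nothing.
error_cost <- matrix(c(0, 5,
                       1, 0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(predicted = c("edible", "poisonous"),
                                     actual    = c("edible", "poisonous")))
error_cost
# Such a matrix would be passed via C5.0's costs argument, e.g.:
# mushroom_c5cost <- C5.0(classes ~ odor + gill_size, data = mushrooms,
#                         costs = error_cost)
```

With such a penalty the learner is pushed to eliminate poisonous-as-edible errors first, at the price of rejecting some edible mushrooms.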
Let’s apply C5.0 on the 22 variables.
Step 7: Improving C5.0 using all variables
mushroom_c5improved_rules <- C5.0(classes ~ ., data = mushrooms, rules = TRUE)
summary(mushroom_c5improved_rules)
##
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Feb 24 18:58:21 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 8124 cases (22 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (4148/4, lift 1.9)
## cap_surface in {fibrous, scaly, smooth}
## odor in {almond, fishy, musty}
## stalk_color_below_ring in {cinnamon, gray, pink, red, white}
## spore_print_color in {black, brown, buff, chocolate, green, purple,
## white, yellow}
## -> class edible [0.999]
##
## Rule 2: (3500/12, lift 1.9)
## cap_surface in {fibrous, scaly, smooth}
## odor in {almond, fishy, musty}
## stalk_root in {club, cup, equal, rhizomorphs}
## spore_print_color in {buff, chocolate, purple, white}
## -> class edible [0.996]
##
## Rule 3: (3796, lift 2.1)
## odor in {anise, creosote, foul, none, pungent, spicy}
## -> class poisonous [1.000]
##
## Rule 4: (72, lift 2.0)
## spore_print_color = orange
## -> class poisonous [0.986]
##
## Rule 5: (24, lift 2.0)
## stalk_color_below_ring = yellow
## -> class poisonous [0.962]
##
## Rule 6: (16, lift 2.0)
## stalk_root = bulbous
## stalk_color_below_ring = orange
## -> class poisonous [0.944]
##
## Rule 7: (4, lift 1.7)
## cap_surface = grooves
## -> class poisonous [0.833]
##
## Default class: edible
##
##
## Evaluation on training data (8124 cases):
##
## Rules
## ----------------
## No Errors
##
## 7 12( 0.1%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 4208 (a): class edible
## 12 3904 (b): class poisonous
##
##
## Attribute usage:
##
## 98.67% odor
## 52.83% spore_print_color
## 51.99% cap_surface
## 51.55% stalk_color_below_ring
## 43.28% stalk_root
##
##
## Time: 0.1 secs
mushroomc_c5improved_rules <- C5.0(classes ~ ., data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5improved_rules)
##
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms_complete, rules
## = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Feb 24 18:58:22 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 5644 cases (22 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (3488, lift 1.6)
## odor in {almond, fishy, musty}
## spore_print_color in {buff, chocolate, purple, white}
## population in {abundant, numerous, scattered, several, solitary}
## -> class edible [1.000]
##
## Rule 2: (2068, lift 2.6)
## odor in {anise, creosote, foul, none}
## -> class poisonous [1.000]
##
## Rule 3: (72, lift 2.6)
## spore_print_color = orange
## -> class poisonous [0.986]
##
## Rule 4: (52, lift 2.6)
## population = clustered
## -> class poisonous [0.981]
##
## Default class: edible
##
##
## Evaluation on training data (5644 cases):
##
## Rules
## ----------------
## No Errors
##
## 4 0( 0.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 3488 (a): class edible
## 2156 (b): class poisonous
##
##
## Attribute usage:
##
## 98.44% odor
## 63.08% spore_print_color
## 62.72% population
##
##
## Time: 0.0 secs
Using all 22 variables on the original dataset, C5.0 derives 7 rules and classifies all but 12 mushrooms correctly.
On the complete dataset, C5.0 derives 4 rules and classifies all mushrooms correctly! We can easily chart the tree
using the rpart and rattle packages.
tree <- rpart(mushroom_c5improved_rules,data=mushrooms,control=rpart.control(minsplit=20,cp=0,digits=6))
fancyRpartPlot(tree, palettes=c("Greys", "Oranges"),cex=0.75, main="Original Mushroom Dataset",sub="")
treec <- rpart(mushroomc_c5improved_rules,data=mushrooms_complete,control=rpart.control(minsplit=20,cp=0,digits=6))
fancyRpartPlot(treec,palettes=c("Greys", "Oranges"),cex=0.75, main="Complete Mushroom Dataset",sub="")
Finally, we will use PART to classify, and compare the results.
mushroom_PART_rules <- PART(classes ~ ., data = mushrooms)
mushroom_PART_rules
## PART decision list
## ------------------
##
## odor = creosote: poisonous (2160.0)
##
## gill_size = broad AND
## ring_number = one: edible (3392.0)
##
## ring_number = two AND
## spore_print_color = white: edible (528.0)
##
## odor = pungent: poisonous (576.0)
##
## odor = spicy: poisonous (576.0)
##
## stalk_shape = enlarging AND
## stalk_surface_below_ring = silky AND
## odor = none: poisonous (256.0)
##
## stalk_shape = enlarging AND
## odor = anise: poisonous (192.0)
##
## gill_size = narrow AND
## stalk_surface_above_ring = silky AND
## population = several: edible (192.0)
##
## gill_size = broad: poisonous (108.0)
##
## stalk_surface_below_ring = silky AND
## bruises = bruises: edible (60.0)
##
## stalk_surface_below_ring = smooth: poisonous (40.0)
##
## bruises = bruises: edible (36.0)
##
## : poisonous (8.0)
##
## Number of Rules : 13
mushroomc_PART_rules <- PART(classes ~ ., data = mushrooms_complete)
mushroomc_PART_rules
## PART decision list
## ------------------
##
## odor = musty AND
## ring_number = one AND
## veil_color = white AND
## gill_size = broad: edible (2496.0)
##
## odor = creosote: poisonous (1584.0)
##
## odor = almond: edible (400.0)
##
## odor = fishy: edible (400.0)
##
## odor = none: poisonous (256.0)
##
## odor = anise: poisonous (192.0)
##
## stalk_root = cup: edible (96.0)
##
## spore_print_color = orange: poisonous (72.0)
##
## stalk_root = bulbous AND
## population = several: edible (64.0)
##
## population = clustered: poisonous (52.0)
##
## : edible (32.0)
##
## Number of Rules : 11
On the original mushrooms dataset, PART also classifies everything correctly, but must rely on 13 rules to reach
the goal. On the complete set, PART achieves the same perfect outcome with 11 rules.
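Collecting the numbers reported above into one table makes the model comparison easier to read. The rule counts and training-set errors below are transcribed from the model outputs in this analysis (OneR is counted as a single rule on one attribute):

```r
# Summary of the models fitted above: rule counts and training-set errors
results <- data.frame(
  model           = c("OneR", "C5.0 (2 vars)", "JRip", "C5.0 (all vars)", "PART"),
  rules_original  = c(1, 2, 9, 7, 13),
  errors_original = c(120, 120, 0, 12, 0),
  rules_complete  = c(1, 2, 6, 4, 11),
  errors_complete = c(88, 88, 0, 0, 0)
)
results
```

On the complete set, C5.0 with all variables is the clear winner: zero errors with the fewest rules.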
Step 8: Comparisons with Original Rules in
Reference Material
It is always interesting to compare a solution to alternatives. In this case we can refer to the original rules,
derived in 1997 and extracted from the dataset documentation, which resulted in 48 errors, or 99.41% accuracy, on
the whole dataset:
res
## [1] "1) odor=NOT(almond.OR.anise.OR.none) 120 poisonous cases missed, 98.52% accuracy "
## [2] "2) spore-print-color=green 48 cases missed, 99.41% accuracy "
## [3] "3) odor=none.AND.stalk-surface-below-ring=scaly.AND. (stalk-color-above-ring=NOT.brown) 8 cases missed, 99.90% accuracy "
## [4] "4) habitat=leaves.AND.cap-color=white 100% accuracy Rule "
## [5] "4) may also be "
## [6] "4') population=clustered.AND.cap_color=white These rule involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule: odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green gives 48 errors, or 99.41% accuracy on the whole dataset."
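The combined edible rule quoted in the reference output can be checked directly against the data with base R. This is a sketch, not code from the original analysis: it assumes the translated mushrooms data frame from Step 2 is in scope, and that the dictionary translation produced the level names used below.

```r
# Predict "edible" when odor is almond/anise/none AND the spore print is not
# green, mirroring the negated reference rule from the 1997 documentation
pred <- ifelse(mushrooms$odor %in% c("almond", "anise", "none") &
               mushrooms$spore_print_color != "green",
               "edible", "poisonous")

table(pred, mushrooms$classes)   # confusion matrix against the true classes
mean(pred == mushrooms$classes)  # overall accuracy on the whole dataset
```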
Conclusions
The C5.0 algorithm applied to all 22 variables of the complete mushroom set classifies correctly with 4 rules. This is the best performance we achieved on the set, with the minimum number of rules derived and the most accurate (perfect) outcome obtained on the complete dataset. It also selected only 3 of the 22 variables provided: odor, spore_print_color and population. This compares favorably with the referenced document, where 6 attributes and 4 rules resulted in 99.41% accuracy.
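The winning model summarized here corresponds to the Step 7 fit on the complete set; a minimal sketch of that call (assuming the C50 package and the mushrooms_complete data frame from Step 2):

```r
library(C50)  # Quinlan's C5.0 decision trees and rule sets

# Fit C5.0 as a rule-based model on all 22 predictors of the complete set
mushroomc_c5all <- C5.0(classes ~ ., data = mushrooms_complete, rules = TRUE)

# summary() lists the derived rules, the attributes actually used,
# and the training confusion matrix
summary(mushroomc_c5all)
```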
We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help resolve classification challenges, offering a fast, efficient and economical alternative to tedious experimentation. It is easy to imagine how similar questions can be resolved in all types of R&D, whether in materials, cosmetics, food or any other scientific area. This second tool is certainly as useful as the formulation tool (http://rpubs.com/m3cinc/machine_learning_key_to_your_formulation_challenges) we reviewed previously.
Classifying rubber properties to meet rolling resistance and emissions targets, modern composites to build renewable energy sources, lightweight transportation vehicles and next-generation public transit, or innovative UV-shield ointments and tasty snacks and drinks: all present similar challenges where only the nature of the inputs and outputs varies. Therefore, this method too can and should be applied broadly!
Why not try and implement Machine Learning in your scientific or technical expert area? Remember, PRC Consulting, LLC (http://www.prcconsulting.net) is dedicated to boosting innovation through improved Analytics, one customer at a time!
References
The following sources are referenced as they provided significant help and information to develop this Machine Learning analysis applied to classification:
1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
2. mushroom documentation (https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names)
3. stringr (https://cran.r-project.org/web/packages/stringr/stringr.pdf)
4. RWeka (https://cran.r-project.org/web/packages/RWeka/RWeka.pdf)
5. C50 (https://cran.r-project.org/web/packages/C50/C50.pdf)
6. rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf)
7. rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf)
8. rattle (http://rattle.togaware.com/)
9. RStudio (https://www.rstudio.com)
10. Machine Learning, Key to Your Formulation Challenges (http://rpubs.com/m3cinc/machine_learning_key_to_your_formulation_challenges)
Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html
27 of 27 2/25/2016 6:27 PM

Más contenido relacionado

Similar a Machine Learning, Key to Your Classification Challenges

The Django Web Framework (EuroPython 2006)
The Django Web Framework (EuroPython 2006)The Django Web Framework (EuroPython 2006)
The Django Web Framework (EuroPython 2006)Simon Willison
 
UBC STAT545 2014 Cm001 intro to-course
UBC STAT545 2014 Cm001 intro to-courseUBC STAT545 2014 Cm001 intro to-course
UBC STAT545 2014 Cm001 intro to-courseJennifer Bryan
 
Rooted 2010 ppp
Rooted 2010 pppRooted 2010 ppp
Rooted 2010 pppnoc_313
 
The Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseThe Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseFred Moyer
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with RYanchang Zhao
 
Semantic search for Earth Observation products
Semantic search for Earth Observation productsSemantic search for Earth Observation products
Semantic search for Earth Observation productsGasperi Jerome
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible researchYannick Wurm
 
R language tutorial
R language tutorialR language tutorial
R language tutorialDavid Chiu
 
Why re-use core classes?
Why re-use core classes?Why re-use core classes?
Why re-use core classes?Levi Waldron
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R NotesLakshmiSarvani6
 
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...Michele Pasin
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
 
Business Analytics with R
Business Analytics with RBusiness Analytics with R
Business Analytics with REdureka!
 
Business Analytics Decision Tree in R
Business Analytics Decision Tree in RBusiness Analytics Decision Tree in R
Business Analytics Decision Tree in REdureka!
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamDoug Needham
 
[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...Functional Thursday
 
Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...Research Data Alliance
 
Reproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookReproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookKeiichiro Ono
 

Similar a Machine Learning, Key to Your Classification Challenges (20)

The Django Web Framework (EuroPython 2006)
The Django Web Framework (EuroPython 2006)The Django Web Framework (EuroPython 2006)
The Django Web Framework (EuroPython 2006)
 
UBC STAT545 2014 Cm001 intro to-course
UBC STAT545 2014 Cm001 intro to-courseUBC STAT545 2014 Cm001 intro to-course
UBC STAT545 2014 Cm001 intro to-course
 
Easy R
Easy REasy R
Easy R
 
Rooted 2010 ppp
Rooted 2010 pppRooted 2010 ppp
Rooted 2010 ppp
 
The breakup
The breakupThe breakup
The breakup
 
The Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseThe Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL Database
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Semantic search for Earth Observation products
Semantic search for Earth Observation productsSemantic search for Earth Observation products
Semantic search for Earth Observation products
 
2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research2014-10-10-SBC361-Reproducible research
2014-10-10-SBC361-Reproducible research
 
R language tutorial
R language tutorialR language tutorial
R language tutorial
 
Why re-use core classes?
Why re-use core classes?Why re-use core classes?
Why re-use core classes?
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
DH11: Browsing Highly Interconnected Humanities Databases Through Multi-Resul...
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
Business Analytics with R
Business Analytics with RBusiness Analytics with R
Business Analytics with R
 
Business Analytics Decision Tree in R
Business Analytics Decision Tree in RBusiness Analytics Decision Tree in R
Business Analytics Decision Tree in R
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug Needham
 
[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...
 
Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...
 
Reproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookReproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter Notebook
 

Machine Learning, Key to Your Classification Challenges

  • 1. Machine Learning, Key to Your Classification Challenges Marc Borowczak, PRC Consulting LLC (http://www.prcconsulting.net) February 25, 2016 Classification Challenges are Everywhere… Step 1: Retrieve Existing Data Step 2: Clean the Data Step 3: Classifying Data with OneR Step 4: Evaluating OneR Performance Step 5: Improving Model with JRip Step 6: Improving Model with C5.0 Step 7: Improving C5.0 using all variables Step 8: Comparisons with Original Rules in Reference Material Conclusions References Classification Challenges are Everywhere… You develop pharmaceutical, cosmetic, food, industrial or civil engineered products, and are often confronted with the challenge of sorting and classifying to meet process or performance properties. While traditional Research and Development does approach the problem with experimentation, it generally involves designs, time and resource constraints, and can be considered slow, expensive and often times redundant, fast forgotten or perhaps obsolete. Consider the alternative Machine Learning tools offers today. We will show this is not only quick, efficient and ultimately the only way Front End of Innovation should proceed, and how it is particularly suited for classification, an essential step used to reduce complexity and optimize product segmentation, Lean Innovation and establishing robust source of supply networks. Today, we will explain how Machine Learning can shed new light on this generic and very persistent classification and clustering challenge. We will derive with modern algorithms simple (we prefer less rules) and accurate (perfect) classifications on a complete dataset. If you didn’t read about the other important aspect of formulation optimization, please consult Machine Learning, Key to Your Formulation Challenges (http://rpubs.com/m3cinc /machine_learning_key_to_your_formulation_challenges) communication. 
Step 1: Retrieve Existing Data We will mirror the approach used in the formulation challenge and use another dataset hosted on UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html), to classify the edible attribute of…Mushrooms (http://archive.ics.uci.edu/ml/datasets/Mushroom) based on attribute described in The Audubon Society Field Guide to North American Mushrooms (1981). The challenge we tackle today is to classify properly a go/no-go attribute which scientists, engineers and business professionals must address daily. Any established R&D would certainly have similar and sometimes hidden knowledge in its archives… Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 1 of 27 2/25/2016 6:27 PM
  • 2. Again, We will use R to demonstrate quickly the approach on this dataset (http://archive.ics.uci.edu/ml/machine- learning-databases/mushroom/agaricus-lepiota.data), and its full description (https://archive.ics.uci.edu /ml/machine-learning-databases/mushroom/agaricus-lepiota.names). We continue to maintain reproducibility of the analysis as a general practice. The analysis tool and platform are documented, all libraries clearly listed, while data is retrieved programmatically and date stamped from the repository. We will display a structure of the mushrooms dataset and the corresponding dictionary to translate the property factors. Sys.info()[1:5] ## sysname release version nodename machine ## "Windows" "7 x64" "build 9200" "STALLION" "x86-64" sessionInfo() ## R version 3.2.2 (2015-08-14) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 8 x64 (build 9200) ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] C50_0.1.0-24 RWeka_0.4-24 stringr_1.0.0 ## ## loaded via a namespace (and not attached): ## [1] digest_0.6.8 grid_3.2.2 formatR_1.2 ## [4] magrittr_1.5 evaluate_0.7.2 RWekajars_3.7.12-1 ## [7] stringi_1.0-1 partykit_1.0-3 rmarkdown_0.9.2 ## [10] splines_3.2.2 tools_3.2.2 yaml_2.1.13 ## [13] survival_2.38-3 rJava_0.9-7 htmltools_0.2.6 ## [16] knitr_1.11 Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 2 of 27 2/25/2016 6:27 PM
  • 3. library(stringr) library(RWeka) library(C50) library(rpart) library(rattle) userdir <- getwd() datadir <- "./data" if (!file.exists("data")){dir.create("data")} fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus -lepiota.data?accessType=DOWNLOAD" download.file(fileUrl,destfile="./data/Mushrooms_Data.csv") dateDownloaded <- date() mushrooms <- read.csv("./data/Mushrooms_Data.csv",header=FALSE) fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricu s-lepiota.names?accessType=DOWNLOAD" download.file(fileUrl,destfile="./data/Names.txt") txt <- readLines("./data/Names.txt") lns <- data.frame(beg=which(grepl("P_1) odor=",txt)),end=which(grepl("on the whole dat aset.",txt))) # we now capture all lines of text between beg and end from txt res <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[ l],by=1)],collapse=" ")}) res <- gsub("t", "", res, fixed = TRUE) res <- gsub("( {2,})"," ",res, fixed=FALSE) res <- gsub("P_","n",res,fixed=TRUE) writeLines(res,"./data/parsed_res.csv") res <- readLines("./data/parsed_res.csv") res<-res[-1] lns <- data.frame(beg=which(grepl("7. 
Attribute Information:",txt)),end=which(grepl("u rban=u,waste=w,woods=d",txt))) txt <- lapply(seq_along(lns$beg),function(l){paste(txt[seq(from=lns$beg[l],to=lns$end[ l],by=1)],collapse=" ")}) txt <- gsub(" ", "", txt, fixed = TRUE) txt <- gsub("(d+.)","n",txt, fixed=FALSE) txt <- gsub("nAttributeInformation:(","",txt,fixed=FALSE) txt <- gsub(")","",txt,fixed=FALSE) txt <- gsub(":",",",txt,fixed=TRUE) txt <- gsub("?","",txt,fixed=TRUE) txt <- gsub("-","_",txt,fixed=TRUE) writeLines(txt,"./data/parsed.csv") attrib <- readLines("./data/parsed.csv") attrib <- sapply (1:length(attrib),function(i) {gsub(","," ",attrib[i],fixed=TRUE)}) dictionary <- sapply (1:length(attrib),function(i) {strsplit(attrib[i],' ')}) colnames(mushrooms)<-sapply(1:length(attrib),function(i) {colnames(mushrooms)[i]<-dict ionary[[i]][1]}) dictionary<-sapply (1:length(attrib),function(i) {dictionary[[i]][-1]}) # contains the levels strings dictionary<-sapply(1:length(attrib),function(i){sapply(1:lengths(dictionary[i]),functi on(j){p1<-strsplit(dictionary[[i]][j],"=")[[1]][1];p2<-strsplit(dictionary[[i]][j],"=" )[[1]][2];dictionary[[i]][j]<-paste0(p2,',',p1)})}) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 3 of 27 2/25/2016 6:27 PM
  • 4. Step 2: Clean the Data We notice that the stalk_root property has a missing level indicated with ‘?’. We can attempt two analysis: First, we keep the missing data as coded and proceed with the classification models. We also can easily recode as missing with the value, drop the corresponding level, and omit all non-complete cases in a new dataset mushrooms_complete. mushrooms_complete<-mushrooms mushrooms_complete$stalk_root[mushrooms_complete$stalk_root=='?']<-NA mushrooms_complete<-mushrooms_complete[complete.cases(mushrooms_complete),] mushrooms_complete$stalk_root<-droplevels(mushrooms_complete$stalk_root) str(mushrooms_complete) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 4 of 27 2/25/2016 6:27 PM
  • 5. ## 'data.frame': 5644 obs. of 23 variables: ## $ classes : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ... ## $ cap_shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ... ## $ cap_surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ... ## $ cap_color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ... ## $ bruises : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ... ## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ... ## $ gill_attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ... ## $ gill_spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ... ## $ gill_size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ... ## $ gill_color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ... ## $ stalk_shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ... ## $ stalk_root : Factor w/ 4 levels "b","c","e","r": 3 2 2 3 3 2 2 2 3 2 ... ## $ stalk_surface_above_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_surface_below_ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_color_above_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ... ## $ stalk_color_below_ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ... ## $ veil_type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ... ## $ veil_color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ... ## $ ring_number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 .. . ## $ ring_type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ... ## $ spore_print_color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ... ## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ... ## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ... 
table(mushrooms_complete$classes) ## ## e p ## 3488 2156 We can now reassign translated levels to both the original and the complete mushroom datasets. Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 5 of 27 2/25/2016 6:27 PM
  • 6. m <- sapply (1:length(attrib),function(i){levels(mushrooms[[i]]) <- sapply(1:length (l evels(mushrooms[[i]])),function(j){ a<-strsplit(dictionary[[i]][[j]],",")[[1]][1] b<-strsplit(dictionary[[i]][[j]],",")[[1]][2] levels(mushrooms[[i]])[levels(mushrooms[[i]])==a] <- b } ) mushrooms[[i]]} ) m <- as.data.frame(m) colnames(m) <- colnames(mushrooms) mushrooms <- m m <- sapply (1:length(attrib),function(i){levels(mushrooms_complete[[i]]) <- sapply(1: length (levels(mushrooms_complete[[i]])),function(j){ a<-strsplit(dictionary[[i]][[j]],",")[[1]][1] b<-strsplit(dictionary[[i]][[j]],",")[[1]][2] levels(mushrooms_complete[[i]])[levels(mushrooms_complete[[i]])==a] <- b } ) mushrooms_complete[[i]]} ) m <- as.data.frame(m) colnames(m) <- colnames(mushrooms_complete) mushrooms_complete <- m rm(m,lns,attrib,txt,dictionary)# cleanup As we observe that the veil_type feature is absolutely common with a single factor, we can exclude it from further analysis and examine the remaining 22 properties: we’ll observe a fairly balanced classification set with 4208 edible and 3916 poisonous mushrooms on the original set. mushrooms$veil_type <- NULL str(mushrooms) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 6 of 27 2/25/2016 6:27 PM
  • 7. ## 'data.frame': 8124 obs. of 22 variables: ## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ... ## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ... ## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ... ## $ cap_color : Factor w/ 10 levels "brown","buff",..: 5 10 9 9 4 10 9 9 9 10 ... ## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ... ## $ odor : Factor w/ 9 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ... ## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ... ## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ... ## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ... ## $ gill_color : Factor w/ 12 levels "black","brown",..: 5 5 6 6 5 6 3 6 8 3 ... ## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ... ## $ stalk_root : Factor w/ 5 levels "bulbous","club",..: 4 3 3 4 4 3 3 3 4 3 ... ## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ... ## $ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ... ## $ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ... ## $ ring_type : Factor w/ 5 levels "cobwebby","evanescent",..: 5 5 5 5 1 5 5 5 5 5 ... ## $ spore_print_color : Factor w/ 9 levels "black","brown",..: 3 4 4 3 4 3 3 4 3 3 ... ## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ... ## $ habitat : Factor w/ 7 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ... 
table(mushrooms$classes) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 7 of 27 2/25/2016 6:27 PM
  • 8. ## ## edible poisonous ## 4208 3916 mushrooms_complete$veil_type <- NULL str(mushrooms_complete) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 8 of 27 2/25/2016 6:27 PM
  • 9. ## 'data.frame': 5644 obs. of 22 variables: ## $ classes : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ... ## $ cap_shape : Factor w/ 6 levels "bell","conical",..: 6 6 1 6 6 6 1 1 6 1 ... ## $ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 3 3 3 4 3 4 3 4 4 3 ... ## $ cap_color : Factor w/ 8 levels "brown","buff",..: 5 8 7 7 4 8 7 7 7 8 ... ## $ bruises : Factor w/ 2 levels "bruises","no": 2 2 2 2 1 2 2 2 2 2 ... ## $ odor : Factor w/ 7 levels "almond","anise",..: 7 1 4 7 6 1 1 4 7 1 ... ## $ gill_attachment : Factor w/ 2 levels "attached","descending": 2 2 2 2 2 2 2 2 2 2 ... ## $ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ... ## $ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ... ## $ gill_color : Factor w/ 9 levels "buff","chocolate",..: 3 3 4 4 3 4 1 4 5 1 ... ## $ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ... ## $ stalk_root : Factor w/ 4 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ... ## $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ stalk_color_above_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ... ## $ stalk_color_below_ring : Factor w/ 7 levels "brown","buff",..: 6 6 6 6 6 6 6 6 6 6 ... ## $ veil_color : Factor w/ 2 levels "white","yellow": 1 1 1 1 1 1 1 1 1 1 ... ## $ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ... ## $ ring_type : Factor w/ 4 levels "cobwebby","flaring",..: 4 4 4 4 1 4 4 4 4 4 ... ## $ spore_print_color : Factor w/ 6 levels "brown","buff",..: 2 3 3 2 3 2 2 3 2 2 ... ## $ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ... ## $ habitat : Factor w/ 6 levels "grasses","leaves",..: 6 2 4 6 2 2 4 4 2 4 ... 
table(mushrooms_complete$classes) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 9 of 27 2/25/2016 6:27 PM
  • 10. ## ## edible poisonous ## 3488 2156 However, the complete set is not only smaller, but a bit more imbalanced after removal of the missing data with 3488 edible and 2156 poisonous mushrooms. We now will conduct 2 parallel analysis streams to compare performance classification and explore multiple approaches, to attempt a perfect classification. Let’s start with OneR classification, from the RWeka package. Step 3: Classifying Data with OneR We classify the original and complete mushrooms datasets. mushroom_1R <- OneR(classes ~ .,data = mushrooms) mushroomc_1R <- OneR(classes ~ .,data = mushrooms_complete) Step 4: Evaluating OneR Performance mushroom_1R ## odor: ## almond -> edible ## anise -> poisonous ## creosote -> poisonous ## fishy -> edible ## foul -> poisonous ## musty -> edible ## none -> poisonous ## pungent -> poisonous ## spicy -> poisonous ## (8004/8124 instances correct) summary(mushroom_1R) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 10 of 27 2/25/2016 6:27 PM
  • 11. ## ## === Summary === ## ## Correctly Classified Instances 8004 98.5229 % ## Incorrectly Classified Instances 120 1.4771 % ## Kappa statistic 0.9704 ## Mean absolute error 0.0148 ## Root mean squared error 0.1215 ## Relative absolute error 2.958 % ## Root relative squared error 24.323 % ## Coverage of cases (0.95 level) 98.5229 % ## Mean rel. region size (0.95 level) 50 % ## Total Number of Instances 8124 ## ## === Confusion Matrix === ## ## a b <-- classified as ## 4208 0 | a = edible ## 120 3796 | b = poisonous mushroomc_1R ## odor: ## almond -> edible ## anise -> poisonous ## creosote -> poisonous ## fishy -> edible ## foul -> poisonous ## musty -> edible ## none -> poisonous ## (5556/5644 instances correct) summary(mushroomc_1R) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 11 of 27 2/25/2016 6:27 PM
  • 12. ## ## === Summary === ## ## Correctly Classified Instances 5556 98.4408 % ## Incorrectly Classified Instances 88 1.5592 % ## Kappa statistic 0.9667 ## Mean absolute error 0.0156 ## Root mean squared error 0.1249 ## Relative absolute error 3.3022 % ## Root relative squared error 25.6994 % ## Coverage of cases (0.95 level) 98.4408 % ## Mean rel. region size (0.95 level) 50 % ## Total Number of Instances 5644 ## ## === Confusion Matrix === ## ## a b <-- classified as ## 3488 0 | a = edible ## 88 2068 | b = poisonous We observe the OneR model provides more than 98.52% correct classification using only the odor as criteria on the original set and 98.44% on the complete set. However the confusion matrix reveals 120 poisonous mushrooms were classified as edible in the original dataset and 88 in the complete dataset. Let’s try to improve on the OneR model, using JRip. Step 5: Improving Model with JRip mushroom_JRip <- JRip(classes ~ ., data = mushrooms) mushroom_JRip ## JRIP rules: ## =========== ## ## (odor = creosote) => classes=poisonous (2160.0/0.0) ## (gill_size = narrow) and (gill_color = black) => classes=poisonous (1152.0/0.0) ## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0) ## (odor = anise) => classes=poisonous (192.0/0.0) ## (spore_print_color = orange) => classes=poisonous (72.0/0.0) ## (stalk_surface_below_ring = smooth) and (stalk_surface_above_ring = scaly) => class es=poisonous (68.0/0.0) ## (habitat = meadows) and (cap_color = white) => classes=poisonous (8.0/0.0) ## (stalk_color_above_ring = yellow) => classes=poisonous (8.0/0.0) ## => classes=edible (4208.0/0.0) ## ## Number of Rules : 9 summary(mushroom_JRip) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 12 of 27 2/25/2016 6:27 PM
  • 13. ## ## === Summary === ## ## Correctly Classified Instances 8124 100 % ## Incorrectly Classified Instances 0 0 % ## Kappa statistic 1 ## Mean absolute error 0 ## Root mean squared error 0 ## Relative absolute error 0 % ## Root relative squared error 0 % ## Coverage of cases (0.95 level) 100 % ## Mean rel. region size (0.95 level) 50 % ## Total Number of Instances 8124 ## ## === Confusion Matrix === ## ## a b <-- classified as ## 4208 0 | a = edible ## 0 3916 | b = poisonous mushroomc_JRip <- JRip(classes ~ ., data = mushrooms_complete) mushroomc_JRip ## JRIP rules: ## =========== ## ## (odor = creosote) => classes=poisonous (1584.0/0.0) ## (gill_size = narrow) and (odor = none) => classes=poisonous (256.0/0.0) ## (odor = anise) => classes=poisonous (192.0/0.0) ## (spore_print_color = orange) => classes=poisonous (72.0/0.0) ## (population = clustered) => classes=poisonous (52.0/0.0) ## => classes=edible (3488.0/0.0) ## ## Number of Rules : 6 summary(mushroomc_JRip) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 13 of 27 2/25/2016 6:27 PM
  • 14. ## ## === Summary === ## ## Correctly Classified Instances 5644 100 % ## Incorrectly Classified Instances 0 0 % ## Kappa statistic 1 ## Mean absolute error 0 ## Root mean squared error 0 ## Relative absolute error 0 % ## Root relative squared error 0 % ## Coverage of cases (0.95 level) 100 % ## Mean rel. region size (0.95 level) 50 % ## Total Number of Instances 5644 ## ## === Confusion Matrix === ## ## a b <-- classified as ## 3488 0 | a = edible ## 0 2156 | b = poisonous We observe that JRip derives 9 rules with 22 variables, and can classify correctly the original set. However, on the complete set, only 6 rules are derived to reach the same perfect classification. Step 6: Improving Model with C5.0 In the next step, we’ll attempt to improve selection performance using the C5.0 package, which we’ll apply using odor and gill_size (the two most influent factor variables), and then compare with all 22 variables selected. mushroom_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms, rules = TRUE) summary(mushroom_c5rules) Machine Learning, Key to Your Classification Challenges file:///P:/MachineLearningExamples/Machine_Learning_Classification.html 14 of 27 2/25/2016 6:27 PM
## 
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data = mushrooms,
##  rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]    Wed Feb 24 18:58:20 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 8124 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (4328/120, lift 1.9)
##  odor in {almond, fishy, musty}
##  -> class edible [0.972]
## 
## Rule 2: (3796, lift 2.1)
##  odor in {anise, creosote, foul, none, pungent, spicy}
##  -> class poisonous [1.000]
## 
## Default class: edible
## 
## 
## Evaluation on training data (8124 cases):
## 
##      Rules
##  ----------------
##    No      Errors
## 
##     2  120( 1.5%)   <<
## 
## 
##  (a)   (b)    <-classified as
##  ----  ----
##  4208         (a): class edible
##   120  3796   (b): class poisonous
## 
## 
##  Attribute usage:
## 
##  100.00% odor
## 
## 
## Time: 0.0 secs
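The annotations on each C5.0 rule can be decoded as follows: "(n/m, lift x)" gives the cases covered (n) and misclassified (m); per the C5.0 documentation, the bracketed confidence is the Laplace ratio (n - m + 1)/(n + 2), and lift is that confidence divided by the predicted class's base rate in the training data. A quick check of Rule 1 above (Python, for illustration):

```python
# Decode the "(4328/120, lift 1.9)" annotation on Rule 1 of the original set.
n, m = 4328, 120         # cases covered / misclassified by the rule
base_rate = 4208 / 8124  # fraction of edible mushrooms in the original set

confidence = (n - m + 1) / (n + 2)  # Laplace estimate of rule accuracy
lift = confidence / base_rate

print(round(confidence, 3))  # -> 0.972, matching "class edible [0.972]"
print(round(lift, 1))        # -> 1.9, matching "lift 1.9"
```

The same arithmetic reproduces the annotations on the complete-set rules in the next output.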
mushroomc_c5rules <- C5.0(classes ~ odor + gill_size, data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5rules)
## 
## Call:
## C5.0.formula(formula = classes ~ odor + gill_size, data
##  = mushrooms_complete, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]    Wed Feb 24 18:58:20 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 5644 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (3576/88, lift 1.6)
##  odor in {almond, fishy, musty}
##  -> class edible [0.975]
## 
## Rule 2: (2068, lift 2.6)
##  odor in {anise, creosote, foul, none}
##  -> class poisonous [1.000]
## 
## Default class: edible
## 
## 
## Evaluation on training data (5644 cases):
## 
##      Rules
##  ----------------
##    No      Errors
## 
##     2   88( 1.6%)   <<
## 
## 
##  (a)   (b)    <-classified as
##  ----  ----
##  3488         (a): class edible
##    88  2068   (b): class poisonous
## 
## 
##  Attribute usage:
## 
##  100.00% odor
## 
## 
## Time: 0.0 secs

On the original dataset, we observe that C5.0 applied to the two most influential factor variables yields results similar to OneR, classifying 98.52% of the mushrooms correctly and leaving 120 misclassified! On the complete set, C5.0 results are again similar to OneR: 98.44% of the mushrooms are classified correctly, leaving 88
mushrooms misclassified. Let’s now apply C5.0 to all 22 variables.

Step 7: Improving C5.0 using all variables

mushroom_c5improved_rules <- C5.0(classes ~ ., data = mushrooms, rules = TRUE)
summary(mushroom_c5improved_rules)
## 
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]    Wed Feb 24 18:58:21 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 8124 cases (22 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (4148/4, lift 1.9)
##  cap_surface in {fibrous, scaly, smooth}
##  odor in {almond, fishy, musty}
##  stalk_color_below_ring in {cinnamon, gray, pink, red, white}
##  spore_print_color in {black, brown, buff, chocolate, green, purple,
##                        white, yellow}
##  -> class edible [0.999]
## 
## Rule 2: (3500/12, lift 1.9)
##  cap_surface in {fibrous, scaly, smooth}
##  odor in {almond, fishy, musty}
##  stalk_root in {club, cup, equal, rhizomorphs}
##  spore_print_color in {buff, chocolate, purple, white}
##  -> class edible [0.996]
## 
## Rule 3: (3796, lift 2.1)
##  odor in {anise, creosote, foul, none, pungent, spicy}
##  -> class poisonous [1.000]
## 
## Rule 4: (72, lift 2.0)
##  spore_print_color = orange
##  -> class poisonous [0.986]
## 
## Rule 5: (24, lift 2.0)
##  stalk_color_below_ring = yellow
##  -> class poisonous [0.962]
## 
## Rule 6: (16, lift 2.0)
##  stalk_root = bulbous
##  stalk_color_below_ring = orange
##  -> class poisonous [0.944]
## 
## Rule 7: (4, lift 1.7)
##  cap_surface = grooves
##  -> class poisonous [0.833]
## 
## Default class: edible
## 
## 
## Evaluation on training data (8124 cases):
## 
##      Rules
##  ----------------
##    No      Errors
## 
##     7   12( 0.1%)   <<
## 
## 
##  (a)   (b)    <-classified as
##  ----  ----
##  4208         (a): class edible
##    12  3904   (b): class poisonous
## 
## 
##  Attribute usage:
## 
##   98.67% odor
##   52.83% spore_print_color
##   51.99% cap_surface
##   51.55% stalk_color_below_ring
##   43.28% stalk_root
## 
## 
## Time: 0.1 secs

mushroomc_c5improved_rules <- C5.0(classes ~ ., data = mushrooms_complete, rules = TRUE)
summary(mushroomc_c5improved_rules)
## 
## Call:
## C5.0.formula(formula = classes ~ ., data = mushrooms_complete, rules
##  = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]    Wed Feb 24 18:58:22 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 5644 cases (22 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (3488, lift 1.6)
##  odor in {almond, fishy, musty}
##  spore_print_color in {buff, chocolate, purple, white}
##  population in {abundant, numerous, scattered, several, solitary}
##  -> class edible [1.000]
## 
## Rule 2: (2068, lift 2.6)
##  odor in {anise, creosote, foul, none}
##  -> class poisonous [1.000]
## 
## Rule 3: (72, lift 2.6)
##  spore_print_color = orange
##  -> class poisonous [0.986]
## 
## Rule 4: (52, lift 2.6)
##  population = clustered
##  -> class poisonous [0.981]
## 
## Default class: edible
## 
## 
## Evaluation on training data (5644 cases):
## 
##      Rules
##  ----------------
##    No      Errors
## 
##     4    0( 0.0%)   <<
## 
## 
##  (a)   (b)    <-classified as
##  ----  ----
##  3488         (a): class edible
##        2156   (b): class poisonous
## 
## 
##  Attribute usage:
## 
##   98.44% odor
##   63.08% spore_print_color
##   62.72% population
## 
## 
## Time: 0.0 secs

Using all 22 variables on the original dataset, C5.0 derives 7 rules and classifies all but 12 mushrooms correctly. On the complete dataset, C5.0 derives 4 rules and classifies all mushrooms correctly! We can easily chart the trees using the rpart and rattle packages (rpart expects a formula, so we fit its trees directly on the same data):

tree <- rpart(classes ~ ., data = mushrooms, control = rpart.control(minsplit = 20, cp = 0))
fancyRpartPlot(tree, palettes = c("Greys", "Oranges"), cex = 0.75, main = "Original Mushroom Dataset", sub = "")
treec <- rpart(classes ~ ., data = mushrooms_complete, control = rpart.control(minsplit = 20, cp = 0))
fancyRpartPlot(treec, palettes = c("Greys", "Oranges"), cex = 0.75, main = "Complete Mushroom Dataset", sub = "")
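The accuracy percentages quoted in these steps all follow from the same arithmetic, accuracy = 1 - errors/total; a quick check (Python, for illustration, with the `accuracy` helper being ours):

```python
# Verify the accuracy figures quoted in the text from the error counts
# reported in the model summaries.
def accuracy(errors, total):
    """Percent correct, rounded to two decimals."""
    return round(100 * (1 - errors / total), 2)

print(accuracy(120, 8124))  # -> 98.52  (odor + gill_size rules, original set)
print(accuracy(88, 5644))   # -> 98.44  (odor + gill_size rules, complete set)
print(accuracy(12, 8124))   # -> 99.85  (all-variable C5.0 rules, original set)
```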
Finally, we will use PART to classify, and compare the results.

mushroom_PART_rules <- PART(classes ~ ., data = mushrooms)
mushroom_PART_rules
## PART decision list
## ------------------
## 
## odor = creosote: poisonous (2160.0)
## 
## gill_size = broad AND
## ring_number = one: edible (3392.0)
## 
## ring_number = two AND
## spore_print_color = white: edible (528.0)
## 
## odor = pungent: poisonous (576.0)
## 
## odor = spicy: poisonous (576.0)
## 
## stalk_shape = enlarging AND
## stalk_surface_below_ring = silky AND
## odor = none: poisonous (256.0)
## 
## stalk_shape = enlarging AND
## odor = anise: poisonous (192.0)
## 
## gill_size = narrow AND
## stalk_surface_above_ring = silky AND
## population = several: edible (192.0)
## 
## gill_size = broad: poisonous (108.0)
## 
## stalk_surface_below_ring = silky AND
## bruises = bruises: edible (60.0)
## 
## stalk_surface_below_ring = smooth: poisonous (40.0)
## 
## bruises = bruises: edible (36.0)
## 
## : poisonous (8.0)
## 
## Number of Rules : 13

mushroomc_PART_rules <- PART(classes ~ ., data = mushrooms_complete)
mushroomc_PART_rules
## PART decision list
## ------------------
## 
## odor = musty AND
## ring_number = one AND
## veil_color = white AND
## gill_size = broad: edible (2496.0)
## 
## odor = creosote: poisonous (1584.0)
## 
## odor = almond: edible (400.0)
## 
## odor = fishy: edible (400.0)
## 
## odor = none: poisonous (256.0)
## 
## odor = anise: poisonous (192.0)
## 
## stalk_root = cup: edible (96.0)
## 
## spore_print_color = orange: poisonous (72.0)
## 
## stalk_root = bulbous AND
## population = several: edible (64.0)
## 
## population = clustered: poisonous (52.0)
## 
## : edible (32.0)
## 
## Number of Rules : 11

On the original mushrooms dataset, PART classifies everything correctly but must rely on 13 rules to do so. On the complete set, PART achieves the same perfect outcome with 11 rules.

Step 8: Comparisons with Original Rules in Reference Material

It is always interesting to compare a solution to alternatives. In this case we can refer to the original rules derived in 1997 and extracted from the documentation, which resulted in 48 errors, or 99.41% accuracy, on the whole dataset:

res
## [1] "1) odor=NOT(almond.OR.anise.OR.none) 120 poisonous cases missed, 98.52% accuracy "
## [2] "2) spore-print-color=green 48 cases missed, 99.41% accuracy "
## [3] "3) odor=none.AND.stalk-surface-below-ring=scaly.AND.(stalk-color-above-ring=NOT.brown) 8 cases missed, 99.90% accuracy "
## [4] "4) habitat=leaves.AND.cap-color=white 100% accuracy Rule "
## [5] "4) may also be "
## [6] "4') population=clustered.AND.cap_color=white These rule involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule: odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green gives 48 errors, or 99.41% accuracy on the whole dataset."

Conclusions

The C5.0 algorithm applied to all 22 variables of the complete mushroom set classifies correctly with only 4 rules. This is the best performance we achieved on the set: the minimum number of rules, with a perfect outcome on the complete dataset. It also selected only 3 variables (odor, spore_print_color and population) out of the 22 provided, compared with the referenced document, where 6 attributes and 4 rules resulted in 99.41% accuracy.

We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help resolve classification challenges, offering a fast, efficient and economical alternative to tedious experimentation. It is easy to imagine how similar questions can be resolved in all types of R&D, in materials, cosmetics, food or any scientific area. This second tool is certainly as useful as the formulation tool (http://rpubs.com/m3cinc/machine_learning_key_to_your_formulation_challenges) we reviewed previously.
Classifying rubber compounds to meet rolling-resistance and emissions targets, modern composites for renewable energy sources, lightweight transportation vehicles and next-generation public transit, or innovative UV-shield ointments and tasty snacks and drinks: all present similar challenges where only the nature of the inputs and outputs varies. Therefore, this method too can and should be applied broadly! Why not try and implement Machine Learning in your scientific or technical expert area? Remember, PRC Consulting, LLC (http://www.prcconsulting.net) is dedicated to boosting innovation through improved Analytics, one customer at a time!

References
The following sources are referenced as they provided significant help and information to develop this Machine Learning analysis applied to classification:

1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
2. mushroom documentation (https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names)
3. stringr (https://cran.r-project.org/web/packages/stringr/stringr.pdf)
4. RWeka (https://cran.r-project.org/web/packages/RWeka/RWeka.pdf)
5. C50 (https://cran.r-project.org/web/packages/C50/C50.pdf)
6. rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf)
7. rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf)
8. rattle (http://rattle.togaware.com/)
9. RStudio (https://www.rstudio.com)
10. Machine Learning, Key to Your Formulation Challenges (http://rpubs.com/m3cinc/machine_learning_key_to_your_formulation_challenges)