Part I: Predictive models (Decision Tree and Regression) using SAS Enterprise Miner
Part II: Decision Tree using R.
Part III: Market-Basket Analysis using SAS Miner.
1 | Page 19139507
INTRODUCTION
-------------------------------------------------------------------------------------------
The purpose of this report is to document the findings from the data mining analysis conducted on a national veterans' organization's donor dataset, and from a market-basket analysis on transactional data of the Health & Beauty Aids Department and the Stationery Department. The objective of the data mining activity is to seek a better response and hit rate by targeting only those segments of customers who have been flagged as potential donors by the predictive model. Since the request for donation is also accompanied by a small gift, mailing only potentially interested customers for the upcoming campaign would substantially reduce costs for the organization, and that proportion of saved costs could be utilized for other charitable activities.
This analysis was conducted in SAS Miner and R on a database of customers who had donated in the past 12 to 24 months, with a detailed description of the steps, interpretation of the models and model comparisons (amongst models and across SAS & R) discussed in the report.
The second part of the report explores and suggests products that could be bundled or marketed together to enable the organization to maximize its revenues. This analysis was performed in SAS Miner, and relevant proposals on product bundles and placements have been advised based on the findings.
PART A
-------------------------------------------------------------------------------------------
1. Creating the SAS Miner project and pre-processing variables
After selecting an appropriate directory for the project and setting up the diagram and library, we adjust some of the default settings of the data source. Since there are various numeric variables with fewer than 20 levels in the dataset, we set the class levels count threshold to 2. This ensures that only binary variables are treated as nominal variables, while numeric variables with fewer than 20 distinct values continue to be treated as interval variables.
Similarly, we also have one class variable (DemCluster) with over 20 distinct levels; thus we set the levels count threshold to 100. The roles of the variables have been set as follows:
2. Exploration of variables
Exploring the distribution of the variables can unearth unusual patterns and behaviours in the variables. These anomalies can have a substantial effect on the modelling if not rectified.
We use the Explore window to examine the distribution of Median Income Region. The fetch size is kept at Max (20,000 records or the number of records in the variable, whichever is less) to ensure all the observations are considered in the exploration.
Figure 2: Changing the default settings of Explore.
We prepare a histogram for the variable Median Income Region to notice any abnormality in its distribution. The distribution at default settings was as follows:
Figure 1: Pre-processing variables.
Figure 3: Distribution of Median Income Region at 10 bins.
The distribution at default settings didn't look suspicious. However, each bin's range was substantial, which might have concealed an abnormality in the distribution. Hence, we change the number of bins to 200, which creates ranges of $1,000.
Figure 4: Distribution of Median Income Region with 200 bins.
Changing the bin limit sheds light on a crucial anomaly in the variable. We observe a disproportionate skew for customers with a median income of $0.
The reason behind this abnormal spike could be the fact that income is confidential information. Not everyone is open or comfortable in disclosing their income to companies or organizations. Thus, people who do not report their income might have been assigned an income of $0 by the organization. Using Median Income with such a skewed distribution could lead to the creation of a flawed model.
Thus, to rectify this anomaly, we should replace the $0 income values with missing (NA) using the Replacement node. An alternative could be to replace the $0 income values with the mean of the variable. However, replacing $0 with the mean would require a prior examination, as without a proper business context (e.g. the strata of people who didn't record their income) this replacement could be inappropriate.
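A minimal sketch in R of the two replacement options discussed above, using a made-up income vector in which $0 stands for an undisclosed income:

```r
# Hypothetical median-income values; $0 marks customers who did not disclose
inc <- c(0, 42000, 57000, 0, 61000)

# Option 1 (used in the report): treat $0 as missing
inc_na <- ifelse(inc == 0, NA, inc)

# Option 2 (the alternative): replace $0 with the mean of disclosed incomes
inc_mean <- ifelse(inc == 0, mean(inc[inc > 0]), inc)
```

Option 2 preserves the sample size for methods that drop missing rows, but, as noted above, it silently asserts that non-disclosers earn the average income.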
3. Building predictive models
Since we discovered an abnormal peak at $0 for the Median Income Region variable, we replace the $0 values with missing (SAS Miner uses a full stop to denote a missing value) using the Replacement node. To carry out this replacement, we change the default limit method for interval variables to None, as we do not want to replace all the interval variables. Furthermore, the replacement value is set to Missing, as we want to replace the abnormal value with a missing value.
Since we need to change the value of an interval variable (DemMedIncome), we make the following changes in the Replacement Interval Editor:
By changing the limit method of DemMedIncome to User Specified and its replacement lower limit to 1, we set all values that fall below 1 to missing (since $0 is the only value below $1).
Figure 5: Snapshot of result of Replacement node - $0 has been replaced with '.'
After carrying out this step, we divide the dataset in a 50:50 ratio between a training set and a validation set. The training set is utilized to build a set of models, while the validation set is used to select the best model. By allocating a good proportion of the dataset to training, we reduce the risk of overfitting.
Figure 5: Properties of Replacement node.
Figure 6: Replacement Interval Editor.
To predict the most likely donors, we create three predictive models:
Autonomous Decision Tree
The first model we create is a Decision Tree which has been split autonomously by the algorithm. We set the leaf size property to 25 and the split size to 50 to prevent the creation of very small leaves. The assessment method, which specifies how the best tree is selected, has been set as Average Square Error. This essentially means the tree that generates the smallest average squared error (difference between predicted and actual outcomes) will be selected.
Figure 7: Autonomous Decision Tree.
The optimal decision tree results in the creation of 5 leaves, with each end node indicating a classification rule that can distinguish likely and unlikely lapsing donors based on the target variable.
The biggest advantage of Decision Trees is that they are very easy to understand and interpret, even for people from a non-technical background. Furthermore, the model explains how it works, with each leaf node denoting a classification rule.
A few key rules/leaves of the Autonomous Decision Tree are:
Figure 6: Workspace Diagram for Donors analysis.
- Customers with a gift count of more than 2.5 or missing in the last 36 months and with a last gift amount of less than $7.5 are very likely (64%) to donate [Node 6].
- Customers whose median home value is less than $67,350 and who have a gift count of less than 2.5 in the last 36 months are unlikely (37%) to donate [Node 4].
The remaining 3 leaves can be interpreted in a similar manner.
Interactive Decision Tree
The second Decision Tree has been created interactively, with splits conducted on the basis of the logworth of the variables and domain knowledge (i.e. which splits/variables would be more relevant). Logworth is a metric that, on the basis of information gain theory, calculates the importance of a variable. Essentially, logworth indicates the variable's ability to create homogeneous or pure subgroups. The maximum number of branches has been kept at 3 to allow a three-way interval split. This will facilitate very specific insights and rules. The interval splits of some of the variables are also changed to enable the creation of more precise rules.
Figure 8: Interval splits of GiftCnt36 changed to <2.5 & missing, >=2.5 & <4.5, and >=4.5.
The assessment method for this Decision Tree is the misclassification rate. The Interactive Decision Tree is more complex compared to the Autonomous Decision Tree, with the tree achieving optimality at 21 leaves.
Figure 9: Interactive Decision Tree.
Some of the key rules obtained from the Interactive Decision Tree are:
- Customers whose promotion count in the past 12 months is less than 17.5 or missing, having a median house value of less than $67,350, a gift count of less than 2.5 or missing in the past 36 months and an average card gift amount of more than or equal to $11.5 in the last 36 months are unlikely (33%) to donate [Node 54].
- Customers with a median income of less than $54,473 or missing, with a card promotion count in the last 12 months of less than 5.5, a gift count in the last 36 months of more than or equal to 4.5 and a last gift amount of more than $7.5 or missing are very likely (91%) to donate [Node 67].
The remaining 19 leaves can be interpreted in a similar fashion.
Regression
The final model we create is a Logistic Regression (as the target variable is binary). The advantage of a Regression model is that it can express or quantify the association between the input variables and the target variable. Furthermore, Regression is an excellent tool for estimation. Since we are using Logistic Regression, the model estimates the log-odds of being a donor as a weighted sum of the attributes (input variables), which can then be converted into a probability.
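The logistic form can be sketched as follows; the intercept and coefficients here are invented for illustration and are not the report's fitted values:

```r
# Log-odds as a weighted sum of attributes, then converted to a probability
log_odds <- function(x, intercept, beta) intercept + sum(beta * x)
to_prob  <- function(lo) 1 / (1 + exp(-lo))

# Hypothetical customer with 3 card gifts and an average gift of $12
lo <- log_odds(x = c(GiftCntCard36 = 3, GiftAvg36 = 12),
               intercept = -1.2, beta = c(0.1156, 0.02))
p <- to_prob(lo)  # implied probability of donating
```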
As Regression models do not accommodate missing values (unlike Decision Trees) but rather skip those records, we add an Impute node to impute missing interval values with the variable's mean and missing nominal values with the most common nominal value of the variable. Similarly, Regression models work best with a limited but worthy set of variables, and thus we add a Variable Clustering node to group together variables that behave similarly, reducing redundancy. Each cluster is represented by the variable that has the lowest 1 − R² ratio (since Variable Selection in Variable Clustering has been set as Best Variables). The Maximum Eigenvalue is kept at 1; this specifies the largest permissible value of the second eigenvalue in each cluster, facilitating the creation of better clusters.
Figure 10: Changing the properties of Variable Clustering node.
The selection model opted for in the Regression node is Stepwise selection, while the selection criterion is set as Validation Error. With these settings, the model will initiate with no variables (i.e. at the intercept). Then, at every step, a variable is added, and at the same time the variables already in the model are verified to check that they pass the minimum selection criterion threshold (Validation Error in this case). Variables below the threshold limit are removed, and this process continues until the stay significance level is achieved.
The output of the Regression model suggested Status Category Star All Months, gift count of card gifts in the last 36 months and average gift amount over 36 months as the most crucial factors, respectively. These variables have an influence on, or correlation with, the target variable.
The output can be interpreted as follows: a unit change in GiftCntCard36, with all the other variables remaining constant, will lead to a change in the log-odds (as it is Logistic Regression) of donation by 0.1156 units. All the other variables can be expressed in a similar manner, based upon their coefficients.
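The coefficient for GiftCntCard36 can also be read on the odds scale: exponentiating a logistic coefficient gives the multiplicative change in odds per unit increase.

```r
beta <- 0.1156           # coefficient of GiftCntCard36 from the report
odds_ratio <- exp(beta)  # odds multiplier per additional card gift
# each extra card gift in the last 36 months multiplies the odds of donating
# by roughly 1.12, holding the other variables constant
```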
4. Evaluating predictive models
The three created predictive models are compared on the basis of a few prominent machine-learning metrics to evaluate which model outperforms the others. We use the Model Comparison node and Excel to compute the metrics.
Receiver Operating Characteristic (ROC) curves
The ROC curve demonstrates the relation between the True Positive rate (Sensitivity) and the True Negative rate (Specificity) of a model at various diagnostic test levels. Models whose curves are closest to the top-left corner (i.e. near the top of Sensitivity) are more accurate predictors, while models whose curves are close to the baseline can be said to be poor predictors. In other words, the more area under a model's curve, the better the model is.
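On toy labels and scores (made up here), sensitivity and specificity at a single threshold, the quantities an ROC curve traces out as the threshold varies, can be computed as:

```r
# Toy outcomes (1 = donor) and model scores
actual <- c(1, 1, 0, 0, 1, 0)
score  <- c(0.9, 0.4, 0.2, 0.7, 0.8, 0.1)

pred <- as.integer(score >= 0.5)                         # threshold at 0.5
sens <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # true-positive rate
spec <- sum(pred == 0 & actual == 0) / sum(actual == 0)  # true-negative rate
```

Sweeping the threshold from 0 to 1 and plotting sensitivity against 1 − specificity yields the ROC curve discussed above.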
Figure 11: Output of Regression model.
The models are evaluated on the basis of their performance on the validation set. The Regression node's curve is the closest to the top-left section of the chart (i.e. towards True Positive). The Autonomous Decision Tree stays marginally behind, while the Interactive Decision Tree is the closest to the baseline, indicating it to be a poor predictor.
Cumulative Lift
Lift is a measure of a model's effectiveness. It essentially compares the result obtained by the model to the result obtained without a model (i.e. randomly). A higher lift value indicates the model is that many times more effective than random selection.
Figure 12: ROC curves of the Predictive models.
Figure 13: Cumulative Lift of the Predictive models.
Based on the cumulative lift accumulated by the models, we can observe that the Autonomous Decision Tree (1.30) produced the highest lift at a depth of 15%. This was followed by a cumulative lift of 1.27 achieved by the Regression model and 1.16 by the Interactive Decision Tree at the same depth.
This can be interpreted as follows: the top 15% of customers selected by the Autonomous Decision Tree is likely to capture 1.30 times more donors than 15% of customers picked at random.
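Cumulative lift at a given depth is the response rate among the top-scored fraction of customers divided by the overall response rate; a small sketch on invented data:

```r
# Lift at depth d: response rate in the top d of scored customers,
# divided by the overall response rate
lift_at_depth <- function(score, actual, depth) {
  n_top <- ceiling(depth * length(score))
  top   <- actual[order(score, decreasing = TRUE)][1:n_top]
  mean(top) / mean(actual)
}

# toy data: 10 customers, 4 donors, 3 of them ranked in the model's top half
lift_at_depth(score  = seq(1, 0.1, by = -0.1),
              actual = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0),
              depth  = 0.5)
```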
Average Square Error (ASE)
ASE is the error arising due to the variation between the outcome predicted by the model and the actual outcome. The lower the ASE, the better the model is, as it produces fewer errors.
From the results of the Model Comparison node, we can observe that Regression and the Autonomous Decision Tree perform at an identical level in terms of the ASE generated, with the latter performing only marginally better. The Interactive Decision Tree produced the largest difference between predicted and actual outcomes.
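ASE is simply the mean of the squared differences between predicted probabilities and actual outcomes; as a one-line sketch:

```r
# Average squared error between predicted probabilities and 0/1 outcomes
ase <- function(pred, actual) mean((pred - actual)^2)

ase(pred = c(0.8, 0.3), actual = c(1, 0))  # (0.2^2 + 0.3^2) / 2 = 0.065
```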
Misclassification rate
The misclassification rate is the error produced when a model incorrectly classifies a responder as a non-responder or vice versa. In our business context, a model would be misclassifying if it predicts a donor to be a non-donor and vice versa. Naturally, we would prefer a model which makes the least amount of such errors.
The Interactive Decision Tree comes out on top on this metric, producing a misclassification rate of only 0.398, in comparison to 0.436 by Regression and 0.428 by the Autonomous Decision Tree.
Accuracy
Accuracy is the measure that indicates how accurately a model can predict (both positives and negatives) out of the total predictions that the model makes. It is computed by dividing the true predictions (True Positives and True Negatives) by the total number of predictions (i.e. the number of records/observations/customers).
Model          | False Negative | True Negative | False Positive | True Positive
Autonomous DT  | 1460           | 1804          | 617            | 962
Regression     | 1111           | 1467          | 954            | 1311
Interactive DT | 1153           | 1406          | 1015           | 1269
Figure 16: Confusion matrix obtained from Model Comparison (Validation dataset).
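The headline metrics can be recomputed directly from the confusion matrix in Figure 16; for example, for the Regression model:

```r
# Regression model's validation confusion matrix (Figure 16)
tp <- 1311; tn <- 1467; fp <- 954; fn <- 1111

accuracy <- (tp + tn) / (tp + tn + fp + fn)  # ~0.574, matching Figure 17
f1       <- 2 * tp / (2 * tp + fp + fn)      # ~0.559, matching Figure 17
```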
Figure 14: ASE produced by the Predictive models.
Figure 15: Misclassification rate produced by the
Predictive models.
Based on accuracy, the Regression model can predict the donors and non-donors more accurately compared to the other two models (Figure 17).
F1
F1 is the harmonic average of Precision (True Positives divided by total predicted positives) and Recall (the proportion of actual positives correctly identified). A score of 1 would mean the model is a perfect predictor, while a score of zero indicates a poor predictor.
Regression outperforms the other two models on F1 score as well, indicating it as the best predictor (Figure 17).
Conclusion:
The performance of the models on the basis of the above machine-learning metrics can be summarized by the following table:

Models         | Accuracy | F1    | ROC   | Lift | ASE    | Misclassification
Autonomous DT  | 0.571    | 0.481 | 0.591 | 1.30 | 0.2432 | 0.428
Interactive DT | 0.552    | 0.539 | 0.567 | 1.16 | 0.250  | 0.398
Regression     | 0.574    | 0.559 | 0.595 | 1.27 | 0.2437 | 0.436
Figure 17: Summarizing the performance of models based on various metrics.
Figure 18: Visual comparison between the models.
On this evidence, we can conclude that the Regression model is the best model in terms of performance, accuracy, effectiveness and error generation.
5. Scoring and predicting potential donors
After careful and thorough examination, we concluded that Regression is the best model in terms of its predictive capabilities, as it is more accurate and produces fewer errors. Thus, we use the Regression model to score (i.e. apply the predictions to) a new dataset of lapsing donors. The scoring is performed through the Score node.
To explore the results, we create a histogram on a new variable created by the scoring, called Predicted: Target_B=1. This new variable contains the predicted value assigned to the customers based on their probability of donating.
To visually represent the scoring, we create a histogram of customers predicted by the model as potential donors, with their attached probability of donating. The results are as follows:
Figure 19: Exploring Predicted Donors.
To enable better insights, we change the number of bins to 20. The highlighted records in the dataset are customers belonging to the selected bar in the histogram (Figure 19). The values on the X axis represent the likelihood of that customer donating in the next campaign.
The average response rate was found to be 5%. As such, customers with a predicted probability of over 0.05 can be considered as candidates for the campaign. However, to maximize the cost-effectiveness of the campaign, we could derive a probability threshold based on past information about the customer lifetime value they generate. Customers with predicted values beyond that threshold could then be targeted to generate an even better response/hit rate and margins.
A rational approach would be to solicit customers who have been assigned a predicted value of 0.55 or more. Another alternative to achieve a significantly better response would be approaching customers who belong in the top 30th percentile based on their predicted values (Figure 20).
Figure 20: Snapshot of Customers with highest predictive probability of donating.
PART B: PREDICTIVE MODELLING BASED ON R
-------------------------------------------------------------------------------------------
After creating three predictive models in SAS Miner, we build a Decision Tree in R. As is the procedure before performing any data mining activity, we explore the variables in the first step to investigate their distributions.
Exploring and transforming the data
The summary function, as well as the psych package, is used to provide a detailed descriptive summarization of the variables.
We notice that Customer ID is considered a variable by R, while a few variables had missing values in them. A histogram is created for Median Income, which reveals a disproportionate number of zero values in the variable.
To ensure a clean model, we wrangle and transform the variables:
Figure 21: Exploring the variables.
- Customer ID is rejected as it is an ID and not an input for modelling.
- Target_D is rejected as Target_B contains the collapsed data of Target_D, and inclusion of Target_D would lead to leakage.
- Since R considers variables containing $ values as categorical, we transform GiftAvgLast, GiftAvg36, GiftAvgAll, GiftAvgCard36, DemMedHomeValue and DemMedIncome into numeric variables.
Figure 22: Transforming variables into numeric variables.
- The zero values in DemMedIncome are replaced with NA, as those values were customers who did not reveal their incomes.
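These cleaning steps can be sketched as follows (the dollar-formatted strings below are invented examples of the raw values):

```r
# $-formatted columns arrive as text; strip "$" and "," and convert to numeric
df <- data.frame(DemMedIncome = c("$0", "$54,473", "$61,000"),
                 stringsAsFactors = FALSE)
df$DemMedIncome <- as.numeric(gsub("[$,]", "", df$DemMedIncome))

# $0 marks an undisclosed income, so replace it with NA
df$DemMedIncome[df$DemMedIncome == 0] <- NA
```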
Building a Decision Tree
After cleaning the data, we divide the data equally into training and validation sets. The Decision Tree is built on the training set using the rpart package. Target_B is selected as the target variable and all the non-rejected variables are selected as the inputs. Since the target variable is binary, we select Class as the method, while the complexity parameter is set at 0.001. The decision tree is then plotted using rpart.plot:
Figure 23: Decision Tree with all non-rejected variables as inputs and cp of 0.001.
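The fitting call described above can be sketched with rpart as follows; the tiny synthetic `train` data frame here is a stand-in for the donor training set, so the data and its single column are illustrative:

```r
library(rpart)

# Stand-in for the donor training set; in the report, Target_B is the binary
# target and all non-rejected variables are the inputs
set.seed(1)
train <- data.frame(GiftCnt36 = rpois(200, 2))
train$Target_B <- factor(as.integer(train$GiftCnt36 + rnorm(200) > 2))

# method = "class" because the target is binary; cp = 0.001 as in the report
fit <- rpart(Target_B ~ ., data = train, method = "class",
             control = rpart.control(cp = 0.001))
# rpart.plot::rpart.plot(fit)  # plotting step, as described in the report
```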
The decision tree that is created is very complex, with a lot of leaves. On plotting its ROC curve, we notice a huge discrepancy, as the model curves perfectly towards sensitivity. The model shows signs of overfitting as well as leakage.
Figure 24: ROC curve of the Decision Tree.
To rectify this model, we prune the decision tree based on cross-validation error. The complexity parameter at which the lowest error is produced is selected as the complexity parameter for the new model (0.0072).
Figure 25: Cross-Validation plot.
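A hedged sketch of this pruning step: fit a small tree on synthetic data, then keep the complexity parameter with the smallest cross-validation error (`xerror`) and prune to it. The report's selected value was 0.0072; the data below is synthetic, so the chosen cp will differ.

```r
library(rpart)

set.seed(1)
d <- data.frame(x = rnorm(300))
d$y <- factor(as.integer(d$x + rnorm(300) > 0))
fit <- rpart(y ~ x, data = d, method = "class",
             control = rpart.control(cp = 0.001))

# cptable holds one row per candidate cp with its cross-validation error;
# select the cp with the minimum xerror and prune the tree to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```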
The other change in the new model is the selection of variables as inputs. For the new model, we select variables that have relevance and predictive capability based on rational judgement. For example, GiftAvg36 is more relevant to the model than GiftAvgAll, thus only the former is used in the new model.
The new Decision Tree model can be seen in the plot below:
Figure 26: Pruned Decision Tree.
Comparing the model to the models created in SAS
The new model is a lot cleaner and results in 5 definite prediction rules. We plot the ROC curve and lift chart for this model (blue line for Training and red line for Validation). The results are as follows:
On comparing the ROC curve and lift chart of the Decision Tree created in R with the three models built in SAS Miner, we can observe that the Regression model would still comfortably trump all the other models based on accuracy and effectiveness. The ROC curve of the Decision Tree built in R is close to the baseline, indicating it is not a very good predictor. Similarly, the lift generated is very close to 1, which isn't an ideal value.
Scoring the data
The purpose of any model is to predict, and to conduct this prediction we import the score dataset. As done for the original data source, we transform variables with $ values in them into numeric variables. The prediction of the model is then applied to the score dataset using the predict function. A snapshot of the results of the scoring is as follows:
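The scoring step can be sketched as below; the fitted tree and the two-row score set are synthetic stand-ins for the report's model and imported score data:

```r
library(rpart)

# Fit a toy classification tree, then score new data with predict()
set.seed(2)
d <- data.frame(x = rnorm(200))
d$y <- factor(as.integer(d$x > 0))
fit <- rpart(y ~ x, data = d, method = "class")

score_data <- data.frame(x = c(-1.0, 1.0))
probs <- predict(fit, newdata = score_data, type = "prob")
# one row per customer: column "0" = P(non-donor), column "1" = P(donor)
```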
Figure 27: ROC curve (Left) and Lift chart (right) of Pruned Decision Tree.
The first column indicates the customer ID, the second column corresponds to the customer being a non-donor, while the third column corresponds to the customer being a donor. The values in the second and third columns indicate the probability of the customer falling in that classification. This can be interpreted as follows: customer ID 96362 has a 41.96% predicted probability of donating in the next campaign.
Similarly, the predicted values for all customers have been derived. To achieve maximum profitability, the organization should solicit customers who have been flagged by the model with more than a 60% probability of donating.
PART C: MARKET BASKET ANALYSIS AND ASSOCIATION RULES
-------------------------------------------------------------------------------------------
In this section of the report, we attempt to derive meaningful patterns in the purchasing behaviour of customers with reference to a range of products. The primary objective of this market-basket analysis (MBA) is to discover items that are purchased together with high confidence and have high lift. These insights can enable the retail store to expand its revenue and achieve higher profitability.
The MBA was conducted in SAS Miner, on a dataset containing information on over 400,000 transactions accumulated over the past 3 months. The properties of the variables have been set to the settings observed in Figure 28, and the type of the data source is changed from Raw to Transactions.
Figure 28: Variable Properties for MBA.
After dragging the data source into the diagram, we attach an Association node to it to conduct the analysis. Export Rule by ID is changed to Yes, as we would like to view the rule description table for the analysis. The remaining settings are kept unchanged.
The results of the Association node unearthed several insights of enormous business value, some of which are explained below:
Out of the 36 rules or combinations of products created, the highest achieved lift was found to be 3.60. Lift essentially measures the degree of association between a combination of products. For example, the rule A -> B with lift 3 would be interpreted as: a customer is thrice as likely to buy product B if he has already purchased product A, compared to the likelihood of a random customer just buying product B. Lift is derived by dividing confidence by expected confidence.
The highest lift of 3.60 was achieved by the rule Perfume -> Toothbrush. This indicates that a customer who has purchased a Perfume is 3.6 times more likely to buy a Toothbrush compared to a customer chosen at random. Since lift is symmetrical, the rule Toothbrush -> Perfume would have the same lift of 3.60.
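Support, confidence and lift for a rule can be computed from raw baskets; a sketch on five invented transactions:

```r
# Five toy baskets; rule examined: Perfume -> Toothbrush
baskets <- list(c("Perfume", "Toothbrush"), "Perfume", "Toothbrush",
                c("Perfume", "Toothbrush"), "Magazine")

has <- function(item) sapply(baskets, function(b) item %in% b)

support    <- mean(has("Perfume") & has("Toothbrush"))  # both items bought
confidence <- support / mean(has("Perfume"))            # P(Toothbrush | Perfume)
lift       <- confidence / mean(has("Toothbrush"))      # confidence / expected confidence
```

Expected confidence here is just the overall share of baskets containing Toothbrush, which is why lift is symmetrical in the two items.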
Lift is a significant metric for MBA as it denotes the relation between combinations of products. A higher lift (> 1) for a rule indicates the right-hand product is more likely to be bought in complement with the left-hand product rather than being bought just in isolation. This insight can help immensely in product placement in the aisles. In the current context, the rule Magazine & Greeting Cards -> Candy Bar has a lift of 2.68, which denotes that the likelihood of a customer buying a candy bar in combination with a Magazine and Greeting Cards is 2.68 times higher than a customer buying just the candy bar.
Based on association rules, we derived 36 rules, with each rule possessing significant value for implementation at the store. However, based on a few metrics, we would recommend the company to incorporate the following changes to facilitate higher revenue generation:
Figure 29: Tabular data of all 36 rules.
1. Placement of Products on Aisles
Since Perfume -> Toothbrush produces the highest lift but has comparatively lower support (customers purchasing both products), these two products should be placed in close proximity to each other to boost their sales. With these products in close vicinity, the purchase of Perfume will trigger the purchase of Toothbrush or vice versa, as indicated by their high lift. Similarly, products with high lift but relatively lower support should be placed close by.
Candy Bars -> Greeting Cards has the highest support (4.37%), indicating these two products are often purchased together. Thus, these two products should be placed at a distance from each other, so that customers have to walk through a range of other products in the process of buying Candy Bars and Greeting Cards. Similarly, products that are often purchased together (high support) should be placed at some distance from each other.
2. Bundling, Cross-selling and Up-selling
Figure 30: Link Graph of Products.
The network between products shows which products are linked with each other. We can observe that Pens and Photo Processing are only purchased in combination with Magazines. As such, Pens and Photo Processing should be sold as a bundle with Magazines to improve their sales numbers.
We can also observe that the Magazine is the most popular product in the store, and as such it gives an opportunity to create up-selling and cross-selling situations for other products around magazines. Less popular products can be placed near Magazines to grab more attention, as magazines are very popular. Similarly, a higher-priced alternative product can be placed close to the magazines (for example, premium Greeting Cards, Candy Bars and Toothpaste).
3. Specials
Products with high lift (Perfume -> Toothbrush, Magazine & Candy Bar -> Greeting Cards, etc.) should be on sale at different times. Since the purchase of one product is anyway likely to trigger the purchase of the other product in the rule, it is counter-productive to have a sale on both/all the items in the rule. This can save a large proportion of discounting costs for the company while boosting their sales numbers.
Some rules, like Greeting Cards & Candy Bar -> Magazine, clearly involve hot products during festive periods. Having one of the products at a discounted rate during the festive season will tempt the customer to purchase the other two non-discounted products to complete their festive shopping wish-list.