Predictive Analytics and Market Basket Analysis
BUS5PA – Assignment III
SIDDHANTH CHAURASIYA 19139507
1 | P a g e 1 9 1 3 9 5 0 7
INTRODUCTION
-------------------------------------------------------------------------------------------
The purpose of this report is to document the findings from the data mining analysis conducted on a national veterans' organization's donor dataset and the market-basket analysis on transactional data of the Health & Beauty Aids Department and the Stationery Department. The objective of the data mining activity is to achieve a better response and hit rate by targeting only those segments of customers who have been flagged as potential donors by the predictive model. Since the request for donation is also accompanied by a small gift, mailing only potentially interested customers for the upcoming campaign would substantially reduce the cost for the organization, and the saved costs could be used for other charitable activities.
This analysis was conducted in SAS Miner and R on a database of customers who had donated in the past 12 to 24 months. The report gives a detailed description of the steps, the interpretation of the models and the model comparisons (among the models and across SAS and R).
The second part of the report explores and suggests products that could be bundled or marketed together to enable the organization to maximize its revenue. This analysis was performed in SAS Miner, and relevant proposals on product bundles and placements are made based on the findings.
PART A
-------------------------------------------------------------------------------------------
1. Creating SAS Miner Project and pre-processing variables
After selecting an appropriate directory for the project and setting up the diagram and library, we adjust some of the default settings of the data source. Since there are various numeric variables with fewer than 20 levels in the dataset, we set the Class levels threshold at 2. This ensures that only binary variables are treated as nominal variables, and numeric variables with fewer than 20 distinct values continue to be treated as interval variables.
Similarly, we have one class variable (DemCluster) with over 20 distinct levels, so we set the levels count threshold to 100. The roles of the variables have been set as follows:
2. Exploration of variables
Exploring the distribution of the variables can unearth unusual patterns and behaviours in the variables. These anomalies can have a substantial effect on the modelling if not rectified.
We use the Explore window to examine the distribution of Median Income Region. The fetch size is kept at Max (20,000 records or the number of records in the variable, whichever is less) to ensure all the observations are considered in the exploration.
Figure 2: Changing the default settings of Explore.
We prepare a histogram for the variable Median Income Region to look for any abnormality in its distribution. The distribution at default settings was as follows:
Figure 1: Pre-processing variables.
Figure 3: Distribution of Median Income Region at 10 bins.
The distribution at default settings didn't look suspicious. However, each bin covered a substantial range, which might have concealed an abnormality in the distribution. Hence, we change the number of bins to 200, which creates ranges of $1,000.
Figure 4: Distribution of Median Income Region with 200 bins.
Changing the number of bins sheds light on a crucial anomaly in the variable. We observe a disproportionate spike for customers with a median income of $0.
The reason behind this abnormal spike could be that income is confidential information. Not everyone is open to or comfortable with disclosing their income to companies or organizations. Thus, people who do not report their income might have been recorded by the organization as having an income of $0. Using Median Income with such a skewed distribution could lead to a flawed model.
To rectify this anomaly, we should replace the $0 income values with missing or NA using the Replacement node. Another alternative would be to replace the $0 income values with the mean of the variable. However, replacing $0 with the mean would require prior examination, as without a proper business context (the strata of people who didn't record their income, etc.) this replacement could be inappropriate.
3. Building Predictive models
Since we discovered an abnormal peak at $0 for the Median Income Region variable, we replace the $0 value with missing (SAS Miner uses a full stop to denote a missing value) using the Replacement node. To carry out this replacement, we change the default limit method of Interval Variables to None, as we do not want to replace the other interval variables. Furthermore, Replacement Values is set to Missing, as we want to replace the abnormal value with a missing value.
Since we need to change the value of an interval variable (DemMedIncome), we make the following changes in the Replacement Interval Editor:
By changing the limit method to User Specified and the Replacement Lower Limit of DemMedIncome to 1, we set all values that fall below 1 to missing (since $0 is the only value below $1).
Figure 5: Snapshot of result of Replacement node - $0 has been replaced with '.'
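The effect of the lower-limit replacement can be sketched outside SAS Miner as well. The following is a minimal Python illustration, not the Replacement node itself; the function name and the sample income values are hypothetical, and None stands in for the missing value that SAS Miner shows as a full stop.

```python
from typing import List, Optional, Sequence


def replace_below_limit(values: Sequence[Optional[float]],
                        lower_limit: float = 1.0) -> List[Optional[float]]:
    """Return a copy in which every value below lower_limit becomes None (missing)."""
    return [None if v is not None and v < lower_limit else v for v in values]


# Hypothetical DemMedIncome values; 0 stands for an undisclosed income.
incomes = [0, 38750, 0, 71250, 54473]
cleaned = replace_below_limit(incomes)
print(cleaned)
```

Values that are already missing are left untouched, so the step can be applied more than once without changing the result.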
After carrying out this step, we divide the dataset in a 50:50 ratio between the Training set and the Validation set. The Training set is used to build a set of models, while the Validation set is used to select the best model. By allocating a good proportion of the dataset to training, we reduce the risk of overfitting.
Figure 5: Properties of Replacement node.
Figure 6: Replacement Interval Editor.
To predict the most likely donors, we create three predictive models:
• Autonomous Decision Tree
The first model we create is a Decision Tree which has been split autonomously by the algorithm. We set the leaf size as 25 and leaf size property as 50 to prevent very small leaves. The assessment method, which specifies how the best tree is selected, has been set to Average Square Error. This essentially means the tree that generates the smallest average squared error (difference between predicted and actual outcome) will be selected.
Figure 7: Autonomous Decision Tree.
The optimal decision tree results in 5 leaves, with each end node indicating a classification rule that could distinguish likely and unlikely lapsing donors based on the target variable.
The biggest advantage of Decision Trees is that they are very easy to understand and interpret, even for people from a non-technical background. Furthermore, the model explains how it works, with each leaf node denoting a classification rule.
Figure 6: Workspace Diagram for Donors analysis.
A few key rules/leaves of the Autonomous Decision Tree are:
• Customers with a gift count of more than 2.5 or missing in the last 36 months and with a last gift amount of less than $7.5 are very likely (64%) to donate [Node 6].
• Customers whose median home value is less than $67,350 and who have a gift count of less than 2.5 in the last 36 months are unlikely (37%) to donate [Node 4].
The remaining 3 leaves could be interpreted in a similar manner.
• Interactive Decision Tree
The second Decision Tree has been created interactively, with splits chosen on the basis of the Logworth of the variables and domain knowledge (i.e. which splits/variables would be more relevant). Logworth is a metric that, on the basis of information gain theory, calculates the importance of the variable. Essentially, Logworth indicates the variable's ability to create homogeneous or pure subgroups. The maximum number of branches has been kept at 3 to allow a three-way interval split. This facilitates very specific insights and rules. The interval splits of some of the variables are also changed to enable the creation of more precise rules.
Figure 8: Interval splits of GiftCnt36 changed to <2.5 & missing, >=2.5 & < 4 and >=4.5.
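To make the Logworth idea concrete, the sketch below computes it for one candidate two-way split on a binary target. It assumes the common definition of logworth as the negative base-10 logarithm of the p-value of a chi-square test of the split; the adjustments SAS Miner may apply to that p-value are left out, and the counts are hypothetical.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction.

    Rows are the two branches of the split; columns are donors and non-donors.
    Every row and column total must be non-zero.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def logworth_2x2(a: int, b: int, c: int, d: int) -> float:
    """-log10 of the chi-square p-value (1 degree of freedom) for the split."""
    chi2 = chi_square_2x2(a, b, c, d)
    # For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return -math.log10(p_value)


# Hypothetical split: left branch 300 donors / 200 non-donors,
# right branch 200 donors / 300 non-donors.
print(round(chi_square_2x2(300, 200, 200, 300), 2))
print(round(logworth_2x2(300, 200, 200, 300), 2))
```

A split that leaves both branches with the same donor share gives a p-value of 1 and therefore a logworth of 0, while splits that separate donors from non-donors more sharply score higher.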
The assessment method for this Decision Tree is Misclassification rate. The Interactive Decision Tree is more complex than the Autonomous Decision Tree, with the tree achieving optimality at 21 leaves.
Figure 9: Interactive Decision Tree.
Some of the key rules obtained from the Interactive Decision Tree are:
• Customers whose promotion count in the past 12 months is less than 17.5 or missing, with a median house value of less than $67,350, a gift count of less than 2.5 or missing in the past 36 months and an average gift card amount of $11.5 or more in the last 36 months are unlikely (33%) to donate [Node 54].
• Customers with a median income of less than $54,473 or missing, a card promotion count in the last 12 months of less than 5.5, a gift count in the last 36 months of 4.5 or more and a last gift amount of more than $7.5 or missing are very likely (91%) to donate [Node 67].
The remaining 19 leaves can be interpreted in a similar fashion.
• Regression
The final model we create is a Logistic Regression (as the target variable is binary). The advantage of a Regression model is that it can express or quantify the association between the input variables and the target variable. Furthermore, Regression is an excellent tool for estimation. Since we are using Logistic Regression, the model estimates the log odds of a customer being a donor as a weighted sum of the attributes (input variables), which can then be converted into a probability.
As Regression models do not accommodate missing values (unlike Decision Trees) but rather skip them, we add an Impute node to impute missing interval values with the variable's mean and missing nominal values with the most common nominal value of the variable. Similarly, Regression models work best with a limited but worthy set of variables, and thus we add a Variable Clustering node to group together variables that behave similarly to reduce redundancy. These clusters are represented by the variables that have the least normalized R-square value (since Variable Selection in Variable Clustering has been set to Best Variables). Maximum Eigenvalue is kept at 1; this specifies the largest permissible value of the second eigenvalue in each cluster, facilitating the creation of better clusters.
Figure 10: Changing the properties of Variable Clustering node.
The selection model chosen for the Regression model is Stepwise selection, while the selection criterion is set to Validation Error. With these settings, the model starts with no variables (i.e. at the intercept). Then at every step we add a variable, and at the same time the variables already in the model are checked against the minimum selection criterion threshold (Validation Error in this case). Variables below the threshold are removed, and this process continues until the stay significance level is achieved.
The output of the Regression model suggested Status Category Star All month, gift count of card in the last 36 months and gift amount average of 36 months as the most crucial factors, in that order. These variables have an influence on or correlation with the target variable.
The output can be interpreted as follows: a unit change in GiftCntCard36, with all the other variables remaining constant, will lead to a change in the log odds (as it is Logistic Regression) of donation of 0.1156 units. All the other variables can be interpreted in a similar manner, based upon their coefficients.
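A change in log odds is easier to read once it is converted into an odds ratio. The short Python sketch below does this for the GiftCntCard36 coefficient quoted above; the helper that maps log odds to a probability is included only for illustration and uses no other figures from the model.

```python
import math

coefficient = 0.1156  # GiftCntCard36 coefficient reported in the text

# Exponentiating a logistic regression coefficient gives the multiplicative
# change in the odds for a one-unit increase in the input.
odds_ratio = math.exp(coefficient)
print(round(odds_ratio, 4))


def log_odds_to_probability(log_odds: float) -> float:
    """Logistic function: converts log odds into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


print(log_odds_to_probability(0.0))
```

In other words, each additional card gift in the last 36 months multiplies the odds of donating by roughly 1.12, all else being equal.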
4. Evaluating Predictive models
The three predictive models are compared on the basis of a few prominent machine-learning metrics to evaluate which model outperforms the others. We use the Model Comparison node and Excel to compute the metrics.
• Receiver Operating Characteristic (ROC) curves
The ROC curve shows the trade-off between the True Positive rate (Sensitivity) and the True Negative rate (Specificity) of a model at various diagnostic test levels. Models whose curve is closest to the top-left corner (i.e. near the top of Sensitivity) are more accurate predictors, while models whose curves are close to the baseline can be said to be poor predictors. In other words, the more area under the model's curve, the better the model is.
Figure 11: Output of Regression model.
The models are evaluated on the basis of their performance on the validation set. The Regression model's curve comes closest to the top-left section of the chart (i.e. towards True Positive). The Autonomous Decision Tree stays marginally behind, while the Interactive Decision Tree is the closest to the baseline, indicating it to be a poor predictor.
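The area under the ROC curve can also be computed directly from scored records, since it equals the probability that a randomly chosen donor receives a higher score than a randomly chosen non-donor. The Python sketch below uses that pairwise definition; the scores and labels are hypothetical and are not taken from the validation set.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and actual outcomes (1 = donor).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(round(roc_auc(scores, labels), 3))
```

A value of 0.5 corresponds to the baseline (random ordering), which is why ROC values only slightly above 0.5 indicate a weak predictor.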
• Cumulative Lift
Lift is a measure of a model's effectiveness. It essentially compares the result obtained by the model to the result obtained without a model (i.e. randomly). A higher lift value indicates that the model is that many times more effective than random selection.
Figure 12: ROC curves of the Predictive models.
Figure 13: Cumulative Lift of the Predictive models.
Based on the cumulative lift achieved by the models, we observe that the Autonomous Decision Tree (1.30) produced the highest lift at the 15th depth. This was followed by a cumulative lift of 1.27 achieved by the Regression model and 1.16 by the Interactive Decision Tree at the same depth.
This can be interpreted as follows: the top 15% of customers selected by the Autonomous Decision Tree are likely to capture 1.30 times as many donors as 15% of customers picked at random.
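The cumulative lift at a given depth can be sketched as the response rate among the top-scored customers divided by the overall response rate. The Python example below is a minimal illustration with hypothetical scores and outcomes; it is not the calculation performed by the Model Comparison node.

```python
from typing import Sequence


def cumulative_lift(scores: Sequence[float], labels: Sequence[int], depth: float) -> float:
    """Response rate in the top `depth` fraction divided by the overall response rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * depth))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate


# Hypothetical scored customers (1 = donated).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(cumulative_lift(scores, labels, depth=0.2))
print(cumulative_lift(scores, labels, depth=1.0))
```

At a depth of 100% every customer is selected, so the lift is always 1; the value of a model shows in how far the lift stays above 1 at shallow depths.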
• Average Square Error (ASE)
ASE is the error arising from the difference between the outcome predicted by the model and the actual outcome. The lower the ASE, the better the model is, as it produces fewer errors.
From the results of the Model Comparison node, we observe that Regression and the Autonomous Decision Tree perform at a nearly identical level in terms of ASE, with the latter performing only marginally better. The Interactive Decision Tree produced the largest difference between predicted and actual outcomes.
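As a reference for the ASE values above, the metric can be sketched as the mean of the squared differences between the predicted probability and the actual 0/1 outcome. The predictions and outcomes in this Python example are hypothetical.

```python
from typing import Sequence


def average_squared_error(predicted: Sequence[float], actual: Sequence[int]) -> float:
    """Mean of the squared differences between predicted probability and 0/1 outcome."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predicted probabilities and actual outcomes (1 = donor).
predicted = [0.8, 0.3, 0.6, 0.1]
actual = [1, 0, 0, 0]
print(round(average_squared_error(predicted, actual), 3))

# A model that always predicts 0.5 scores 0.25 whatever the outcomes are.
print(average_squared_error([0.5, 0.5, 0.5, 0.5], actual))
```

The second call gives a useful yardstick: an ASE close to 0.25 means the predictions are barely more informative than assigning every customer a probability of 0.5.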
• Misclassification rate
Misclassification rate is the error produced when a model incorrectly classifies a responder as a non-responder or vice versa. In our business context, a model misclassifies if it predicts a donor to be a non-donor or vice versa. Naturally, we would prefer the model which makes the fewest such errors.
The Interactive Decision Tree comes out on top on this metric, producing a misclassification rate of only 0.398, in comparison to 0.436 for Regression and 0.428 for the Autonomous Decision Tree.
• Accuracy
Accuracy is the measure that indicates how often a model predicts correctly (both positive and negative) out of the total predictions that the model makes. It is computed by dividing the True Predictions (True Positive and True Negative) by the total number of predictions (i.e. the number of records/observations/customers).
Model            False Negative  True Negative  False Positive  True Positive
Autonomous DT    1460            1804           617             962
Regression       1111            1467           954             1311
Interactive DT   1153            1406           1015            1269
Figure 16: Confusion matrix obtained from Model Comparison (Validation dataset).
Figure 14: ASE produced by the Predictive models.
Figure 15: Misclassification rate produced by the Predictive models.
Based on accuracy, the Regression model predicts donors and non-donors more accurately than the other two models (Figure 17).
• F1
F1 is the harmonic mean of Precision (true positives divided by all predicted positives) and Recall (the proportion of actual positives correctly identified). A score of 1 would mean the model is a perfect predictor, while a score of zero indicates a poor predictor.
Regression outperforms the other two models on F1 score as well, indicating it as the best predictor (Figure 17).
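The Accuracy and F1 figures can be reproduced from the confusion matrix in Figure 16. The Python sketch below uses the counts from that table; the dictionary layout and function names are our own.

```python
# Confusion-matrix counts from Figure 16 (validation dataset).
confusion = {
    "Autonomous DT": {"fn": 1460, "tn": 1804, "fp": 617, "tp": 962},
    "Regression": {"fn": 1111, "tn": 1467, "fp": 954, "tp": 1311},
    "Interactive DT": {"fn": 1153, "tn": 1406, "fp": 1015, "tp": 1269},
}


def accuracy(c: dict) -> float:
    """Share of all predictions that are correct."""
    return (c["tp"] + c["tn"]) / sum(c.values())


def f1_score(c: dict) -> float:
    """Harmonic mean of precision and recall."""
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    return 2 * precision * recall / (precision + recall)


for name, counts in confusion.items():
    print(name, round(accuracy(counts), 3), round(f1_score(counts), 3))
```

Rounded to three decimals, the results match the Accuracy and F1 columns of the summary table in Figure 17.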
Conclusion:
The performance of the models on the basis of the above machine learning metrics can be summarized by the following table:
Machine Learning Metrics
Models           Accuracy  F1     ROC    Lift  ASE     Misclassification
Autonomous DT    0.571     0.481  0.591  1.30  0.2432  0.428
Interactive DT   0.552     0.539  0.567  1.16  0.250   0.398
Regression       0.574     0.559  0.595  1.27  0.2437  0.436
Figure 17: Summarizing the performance of models based on various metrics.
Figure 18: Visual comparison between the models.
On this evidence, we conclude that the Regression model is the best model overall. It leads on Accuracy, F1 and ROC and is a close second on Lift and ASE, although it has the highest Misclassification rate.
5. Scoring and Predicting Potential Donors
After careful and thorough examination, we concluded that Regression is the best model in terms of its predictive capability, as it is more accurate and produces fewer errors. Thus, we use the Regression model to score (i.e. apply the predictions to) a new dataset of lapsing donors. The scoring is performed through the Score node.
To explore the results, we create a histogram of a new variable that has been created by the scoring, called predicted target_B=1. This new variable contains the predicted value assigned to the customers based on their probability of donating.
To visually represent the scoring, we create a histogram of customers predicted by the model as potential donors with their attached probability of donating. The results are as follows:
Figure 19: Exploring Predicted Donors.
To enable better insights, we change the number of bins to 20. The highlighted records in the dataset are customers belonging to the selected bar in the histogram (Figure 19). The values on the X axis represent the likelihood of that customer donating in the next campaign.
The average response rate was found to be 5%. As such, customers with a predicted probability of over 0.05 can be considered as candidates for the campaign. However, to maximize the cost-effectiveness of the campaign, we could derive a probability threshold based on past information about the customer lifetime value they generate. Customers with predicted values beyond that threshold could then be targeted to generate an even better response/hit rate and margins.
A rational approach would be to solicit customers who have been assigned a predicted value of 0.55 or more. Another alternative to achieve a significantly better response would be approaching customers who belong in the top 30th percentile based on their predicted values (Figure 20).
Figure 20: Snapshot of Customers with highest predictive probability of donating.
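The two targeting approaches described above (a fixed probability threshold of 0.55, or the top 30% of customers by predicted value) can be sketched as follows. This is a minimal Python illustration; the customer IDs and predicted probabilities are hypothetical and the function names are our own.

```python
from typing import List, Sequence, Tuple

Scored = Tuple[str, float]  # (customer ID, predicted probability of donating)


def select_by_threshold(scored: Sequence[Scored], threshold: float) -> List[str]:
    """Customers whose predicted probability is at or above the threshold."""
    return [cid for cid, p in scored if p >= threshold]


def select_top_fraction(scored: Sequence[Scored], fraction: float) -> List[str]:
    """The highest-scored `fraction` of customers, best first."""
    ranked = sorted(scored, key=lambda row: row[1], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return [cid for cid, _ in ranked[:keep]]


# Hypothetical scored customers.
scored = [
    ("C001", 0.62), ("C002", 0.41), ("C003", 0.58), ("C004", 0.07), ("C005", 0.55),
    ("C006", 0.33), ("C007", 0.49), ("C008", 0.12), ("C009", 0.28), ("C010", 0.51),
]
print(select_by_threshold(scored, 0.55))
print(select_top_fraction(scored, 0.30))
```

A threshold fixes the minimum quality of each contact, while a top-fraction rule fixes the size of the mailing; which is preferable depends on whether the campaign budget or the expected return per letter is the binding constraint.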
PART B: PREDICTIVE MODELLING BASED ON R
-------------------------------------------------------------------------------------------
After creating three predictive models in SAS Miner, we built a Decision Tree in R. As is the procedure before performing any data mining activity, we explore the variables as the first step to investigate their distribution.
Exploring and transforming the data
The summary function as well as the psych package are used to provide a detailed descriptive summary of the variables.
We notice that CustomerID is considered as a variable by R, while a few variables had missing values in them. A histogram is created for Median Income, which reveals a disproportionate number of zero values in the variable.
Figure 21: Exploring the variables.
To ensure a clean model, we wrangle and transform the variables:
• CustomerID is rejected as it is an ID and not an input for modelling.
• Target_D is rejected as Target_B contains the collapsed data of Target_D and inclusion of Target_D will lead to leakage.
• Since R considers variables containing $ values as categorical, we transform GiftAvgLast, GiftAvg36, GiftAvgAll, GiftAvgCard36, DemMedHomeValue and DemMedIncome into numeric variables.
Figure 22: Transforming variables into numeric variables.
• The zero values in DemMedIncome are replaced with NA, as those values were customers who did not reveal their incomes.
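The currency clean-up in the list above was done in R (Figure 22). As a language-neutral illustration, the same two steps are sketched here in Python: strip the $ sign and thousands separators, then treat a zero income as missing. The function names and sample strings are hypothetical.

```python
from typing import Optional


def parse_currency(text: str) -> Optional[float]:
    """Convert a string such as '$67,350' to a number; blank strings become None."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


def zero_to_missing(value: Optional[float]) -> Optional[float]:
    """Treat an income of exactly zero as undisclosed (missing)."""
    return None if value == 0 else value


# Hypothetical raw DemMedIncome strings.
raw_incomes = ["$54,473", "$0", "$38,750", ""]
incomes = [zero_to_missing(parse_currency(text)) for text in raw_incomes]
print(incomes)
```

Parsing has to happen before the zero check, since the raw "$0" string is not equal to the number 0 until it has been converted.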
Building a Decision Tree
After cleaning the data, we divide the data equally into training and validation sets. The Decision Tree is built on the validation set using the rpart package. Target_B is selected as the target variable and all the non-rejected variables are selected as the inputs. Since the target variable is binary, we select class as the method, while the complexity parameter is set at 0.001. The script for the Decision Tree is shown below, and the model is plotted using rpart.plot:
Figure 23: Decision Tree with all non-rejected variables as inputs and cp of 0.001.
The decision tree that is created is very complex, with a lot of leaves. On plotting its ROC curve we notice a huge discrepancy, as the model curves perfectly towards sensitivity. The model shows signs of overfitting as well as leakage.
Figure 24: ROC curve of the Decision Tree.
To rectify this model, we prune the decision tree based on cross-validation error. The complexity parameter at which the lowest error is produced is selected as the complexity parameter for the new model (0.0072).
Figure 25: Cross-Validation plot.
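The selection rule described above (pick the complexity parameter with the lowest cross-validation error) can be sketched in a few lines. In the Python example below only the chosen value of 0.0072 comes from the report; the other complexity parameters and all of the error values are hypothetical placeholders standing in for the cross-validation table.

```python
# (complexity parameter, cross-validated error) pairs.
# Hypothetical values, except that 0.0072 is the cp selected in the report.
cp_table = [
    (0.0200, 1.000),
    (0.0100, 0.985),
    (0.0072, 0.962),
    (0.0030, 0.978),
    (0.0010, 1.014),
]

# Choose the row with the smallest cross-validated error.
best_cp, best_error = min(cp_table, key=lambda row: row[1])
print(best_cp, best_error)
```

Smaller complexity parameters allow larger trees, so the error typically falls at first and then rises again once the tree starts to fit noise; the minimum marks the point at which to prune.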
The other change in the new model is the selection of variables as inputs. For the new model, we select variables that have relevance and predictive capability based on rational judgement. For example, GiftAvg36 is more relevant to the model than GiftAvgAll, thus only the former is used in the new model.
The new Decision Tree model can be seen in the plot below:
Figure 26: Pruned Decision Tree.
Comparing the model to the models created in SAS
The new model is a lot cleaner and results in 5 definite prediction rules. We plot the ROC curve and lift chart for this model (blue line for Training and red line for Validation). The results are as follows:
On comparing the ROC curve and lift chart of the Decision Tree created in R with the three models built in SAS Miner, we observe that the Regression model would still comfortably outperform all the other models on accuracy and effectiveness. The ROC curve of the Decision Tree built in R is close to the baseline, indicating that it is not a very good predictor. Similarly, the lift generated is very close to 1, which isn't an ideal value.
Scoring the data
The purpose of any model is to predict, and to conduct this prediction we import the score dataset. As done for the original data source, we transform variables with $ values in them into numeric variables. The prediction of the model is then applied to the score dataset using the predict function. A snapshot of the results of the scoring is as follows:
Figure 27: ROC curve (Left) and Lift chart (right) of Pruned Decision Tree.
The first column indicates the customer ID, the second column indicates the customer being a non-donor, while the third column indicates the customer being a donor. The values in the second and third columns indicate the probability of the customer falling in that classification. For example, customer ID 96362 has a 41.96% predicted probability of donating in the next campaign.
Similarly, the predicted values for all customers have been derived. To achieve maximum profitability, the organization should solicit customers who have been flagged by the model with more than a 60% probability of donating.
PART C: MARKET BASKET ANALYSIS AND ASSOCIATION RULES
-------------------------------------------------------------------------------------------
In this section of the report, we attempt to derive meaningful patterns in the purchasing behaviour of customers with reference to a range of products. The primary objective of this market-basket analysis (MBA) is to discover items that are purchased together with high confidence and have high lift. These insights can enable the retail store to increase its revenue and achieve higher profitability.
The MBA was conducted in SAS Miner, on the dataset containing information on over 400,000 transactions accumulated over the past 3 months. The properties of the variables have been set as shown in Figure 28, and the type of data source is changed from Raw to Transactions.
Figure 28: Variable Properties for MBA.
After dragging the data source into the diagram, we attach an Association node to it to conduct the analysis. Export Rule by ID is changed to Yes as we would like to view the rule description table for the analysis. The remaining settings are kept unchanged.
The results of the Association node unearthed several insights of enormous business value, some of which are explained below:
Out of the 36 rules or combinations of products created, the highest achieved lift was found to be 3.60. Lift essentially measures the degree of association between the combination of products. For example, rule A -> B with lift 3 would be interpreted as a customer being three times as likely to buy product B if they have already purchased product A, compared to the likelihood of a random customer just buying product B. Lift is derived by dividing confidence by expected confidence.
The highest lift of 3.60 was achieved by the rule Perfume -> Toothbrush. This indicates that a customer who has purchased a Perfume is 3.6 times more likely to buy a Toothbrush compared to a customer chosen at random. Since lift is symmetrical, the rule Toothbrush -> Perfume would have the same lift of 3.60.
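Support, confidence and lift can be worked through on a handful of baskets. The Python sketch below follows the definitions given above (lift is confidence divided by expected confidence); the ten transactions are hypothetical and far smaller than the dataset analysed in SAS Miner, so the resulting numbers are illustrative only.

```python
from typing import FrozenSet, List, Tuple


def rule_metrics(transactions: List[FrozenSet[str]],
                 antecedent: FrozenSet[str],
                 consequent: FrozenSet[str]) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    with_consequent = sum(1 for t in transactions if consequent <= t)
    with_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = with_both / n                    # share of baskets containing both sides
    confidence = with_both / with_antecedent   # P(consequent | antecedent)
    expected_confidence = with_consequent / n  # P(consequent) with no rule applied
    return support, confidence, confidence / expected_confidence


# Hypothetical baskets.
baskets = [frozenset(items) for items in (
    {"Perfume", "Toothbrush"}, {"Perfume", "Toothbrush", "Magazine"},
    {"Toothbrush"}, {"Magazine"}, {"Magazine", "Candy Bar"}, {"Candy Bar"},
    {"Magazine"}, {"Greeting Cards"}, {"Candy Bar", "Greeting Cards"},
    {"Magazine", "Greeting Cards"},
)]

forward = rule_metrics(baskets, frozenset({"Perfume"}), frozenset({"Toothbrush"}))
reverse = rule_metrics(baskets, frozenset({"Toothbrush"}), frozenset({"Perfume"}))
print(forward)
print(reverse)
```

The two rules have different confidence values but the same support and lift, which illustrates the symmetry of lift noted above.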
Lift is a significant metric for MBA as it denotes the relation between a combination of products. A higher lift (> 1) of a rule indicates that the right-hand product is more likely to be bought together with the left-hand product rather than being bought in isolation. This insight can help immensely in product placement in the aisles. In the current context, the rule Magazine & Greeting cards -> Candy Bar has a lift of 2.68, which denotes that the likelihood of a customer buying a candy bar in combination with Magazine and Greeting cards is 2.68 times higher than a customer buying just the candy bar.
Based on the association analysis, we derived 36 rules, with each rule possessing significant value for implementation at the store. However, based on a few metrics, we would recommend that the company incorporate the following changes to facilitate higher revenue generation:
1. Placement of Products on Aisle
Since Perfume -> Toothbrush produces the highest lift but has comparatively lower support (customers purchasing both the products), these two products should be placed in close proximity to each other to boost their sales. With these products in close vicinity, the purchase of Perfume will trigger the purchase of Toothbrush or vice versa, as indicated by their high lift. Similarly, products with high lift but relatively lower support should be placed close by.
Figure 29: Tabular data of all 36 rules.
Candy Bars -> Greeting Cards has the highest support (4.37%), indicating these two products are often purchased together. Thus, these two products should be placed at a distance from each other so that customers have to walk through a range of other products in the process of buying Candy Bars and Greeting Cards. Similarly, products that are often purchased together (high support) should be placed at some distance from each other.
2. Bundle, Cross-selling and Up-selling
Figure 30: Link Graph of Products.
The network between products shows which products are linked with each other. We can observe that pens and photo processing are only purchased in combination with Magazine. As such, Pens and Photo processing should be sold as a bundle with Magazines to improve their sales numbers.
We can also observe that Magazine is the most popular product in the store, and as such it gives an opportunity to create up-selling and cross-selling situations for other products around magazines. Less popular products can be placed near Magazine to grab more attention. Similarly, higher-priced alternative products can be placed close to the magazines (for example, premium Greeting cards, candy bars and Toothpaste).
3. Specials
Products with high lift (Perfume -> Toothbrush, Magazine & Candy Bar -> Greeting cards, etc.) should be on sale at different times. Since the purchase of one product is anyway likely to trigger the purchase of another product in the rule, it is counter-productive to have a sale on both/all the items in the rule. This can save a large proportion of discounting costs for the company while boosting its sales numbers.
Some rules like Greeting cards & Candy Bar -> Magazine are clearly hot products during festive periods. Having one of the products at a discounted rate during the festive season will tempt the customer to purchase the other two non-discounted products to complete their festive shopping wish-list.
Classification and regression trees (cart)
 

Predictive Modelling & Market-Basket Analysis.

  • 1. Predictive Analytics and Market Basket Analysis. BUS5PA – Assignment III. SIDDHANTH CHAURASIYA 19139507
  • 2. INTRODUCTION
    -------------------------------------------------------------------------------------------
    The purpose of this report is to document the findings from the data mining analysis conducted on a national veterans' organization's donor dataset and from the market-basket analysis of transactional data from the Health & Beauty Aids department and the Stationery department. The objective of the data mining activity is to achieve a better response and hit rate by targeting only those customers whom the predictive model has flagged as potential donors. Since each request for a donation is accompanied by a small gift, mailing only potentially interested customers in the upcoming campaign would substantially reduce costs for the organization, and the savings could be used for other charitable activities.
    This analysis was conducted in SAS Miner and R on a database of customers who had donated in the past 12 to 24 months. The report gives a detailed description of the steps, an interpretation of the models and model comparisons (amongst models and across SAS and R).
    The second part of the report explores and suggests products that could be bundled or marketed together to help the organization maximize its revenue. This analysis was performed in SAS Miner, and proposals on product bundles and placements are made on the basis of the findings.
    PART A
    -------------------------------------------------------------------------------------------
    1. Creating SAS Miner Project and pre-processing variables
    After selecting an appropriate directory for the project and setting up the diagram and library, we adjust some of the default settings of the data source. Since the dataset contains various numeric variables with fewer than 20 levels, we set the Class Levels threshold to 2. This ensures that only binary variables are treated as nominal variables, while numeric variables with fewer than 20 distinct values continue to be treated as interval variables. Similarly, there is one class variable (DemCluster) with more than 20 distinct levels, so we set the levels count threshold to 100. The roles of the variables have been set as follows:
  • 3. Figure 1: Pre-processing variables.
    2. Exploration of variables
    Exploring the distribution of the variables can unearth unusual patterns and behaviours in them. These anomalies can have a substantial effect on the modelling if they are not rectified.
    We use the Explore window to examine the distribution of Median Income Region. The fetch size is kept at Max (20,000 records or the number of records in the variable, whichever is less) to ensure that all observations are considered in the exploration.
    Figure 2: Changing the default settings of Explore.
    We prepare a histogram of the variable Median Income Region to check for any abnormality in its distribution. The distribution at default settings was as follows:
  • 4. Figure 3: Distribution of Median Income Region at 10 bins.
    The distribution at default settings did not look suspicious. However, each bin covered a substantial range, which might have concealed an abnormality in the distribution. Hence, we change the number of bins to 200, which creates ranges of $1,000.
    Figure 4: Distribution of Median Income Region with 200 bins.
    Changing the number of bins sheds light on a crucial anomaly in the variable. We observe a disproportionate spike for customers with a median income of $0. The reason behind this abnormal spike could be that income is confidential information. Not everyone is open to, or comfortable with, disclosing their income to companies or organizations. Thus, people who did not report their income might have been recorded by the organization as having an income of $0. Using Median Income with such a skewed distribution could lead to a flawed model.
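The effect of the bin count described above can be illustrated with a minimal sketch. It uses only the Python standard library and synthetic, made-up incomes (not the donor dataset); the 0 to 200,000 range is an assumption inferred from 200 bins giving $1,000 ranges, and the `histogram` helper is hypothetical.

```python
import random

def histogram(values, n_bins, lo, hi):
    """Count values into n_bins equal-width bins over [lo, hi)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so a value equal to hi falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts, width

# Synthetic data (assumption): 900 plausible incomes plus 100 unreported
# incomes that were recorded as 0.
random.seed(42)
incomes = [max(1.0, min(199_999.0, random.gauss(60_000, 20_000))) for _ in range(900)]
incomes += [0.0] * 100

coarse_counts, coarse_width = histogram(incomes, 10, 0, 200_000)
fine_counts, fine_width = histogram(incomes, 200, 0, 200_000)

# With 10 wide bins the zeros are diluted among genuine low incomes;
# with 200 narrow bins the first bin stands out as a spike.
print("bin width:", coarse_width, "first bin:", coarse_counts[0], "tallest bin:", max(coarse_counts))
print("bin width:", fine_width, "first bin:", fine_counts[0], "tallest bin:", max(fine_counts))
```

With narrow bins the $0 records form the tallest bar, whereas with wide bins they blend into the lowest income band, which is the pattern the two figures illustrate.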
  • 5. To rectify this anomaly, we should replace the $0 income values with missing (NA) using the Replacement node. An alternative would be to replace the $0 income values with the mean of the variable. However, replacing $0 with the mean would require prior examination, as without proper business context (for example, which strata of people did not record their income) this replacement could be inappropriate.
    3. Building Predictive models
    Since we discovered an abnormal peak at $0 for the Median Income Region variable, we replace the $0 value with missing (SAS Miner uses a full stop to denote a missing value) using the Replacement node. To carry out this replacement, we change the default limits method of Interval Variables to None, as we do not want to replace the other interval variables. Furthermore, Replacement Values is set to Missing, as we want to replace the abnormal value with a missing value.
    Figure 5: Properties of Replacement node.
    Since we need to change the value of an interval variable (DemMedIncome), we make the following changes in the Replacement Interval Editor:
    Figure 6: Replacement Interval Editor.
    By changing the Limit Method to User Specified and the Replacement Lower Limit of DemMedIncome to 1, we set all values that fall below 1 to missing (since $0 is the only value below $1).
    Figure 5: Snapshot of result of Replacement node - $0 has been replaced with '.'
    After carrying out this step, we divide the dataset in a 50:50 ratio between a training set and a validation set. The training set is used to build a set of models, while the validation set is used to select the best model.
  • 6. By allocating a good proportion of the dataset to training, we reduce the risk of overfitting.
    Figure 6: Workspace Diagram for Donors analysis.
    To predict the most likely donors, we create three predictive models:
    - Autonomous Decision Tree
    The first model we create is a Decision Tree that has been split autonomously by the algorithm. We set the leaf size as 25 and the leaf size property as 50 to prevent very small leaves. The assessment method, which specifies how the best tree is selected, has been set to Average Square Error. This essentially means that the tree that generates the smallest average squared error (the squared difference between the predicted and the actual outcome) will be selected.
    Figure 7: Autonomous Decision Tree.
    The optimal decision tree has 5 leaves, with each end node indicating a classification rule that distinguishes likely from unlikely lapsing donors based on the target variable.
    The biggest advantage of Decision Trees is that they are very easy to understand and interpret, even for people from a non-technical background. Furthermore, the model explains how it works, with each leaf node denoting a classification rule. A few key rules/leaves of the Autonomous Decision Tree are:
  • 7. - Customers with a gift count of more than 2.5 (or missing) in the last 36 months and a last gift amount of less than $7.5 are very likely (64%) to donate [Node 6].
    - Customers whose median home value is less than $67,350 and who have a gift count of less than 2.5 in the last 36 months are unlikely (37%) to donate [Node 4].
    The remaining 3 leaves can be interpreted in a similar manner.
    - Interactive Decision Tree
    The second Decision Tree has been created interactively, with splits made on the basis of the logworth of the variables and domain knowledge (i.e. which splits/variables would be more relevant). Logworth is a metric that, on the basis of information gain theory, calculates the importance of a variable. Essentially, logworth indicates the variable's ability to create homogeneous or pure subgroups.
    The maximum number of branches has been kept at 3 to allow a three-way interval split. This facilitates very specific insights and rules. The interval splits of some of the variables are also changed to enable the creation of more precise rules.
    Figure 8: Interval splits of GiftCnt36 changed to <2.5 & missing, >=2.5 & < 4 and >=4.5.
    The assessment method for this Decision Tree is Misclassification rate. The Interactive Decision Tree is more complex than the Autonomous Decision Tree, with the tree achieving optimality at 21 leaves.
    Figure 9: Interactive Decision Tree.
    Some of the key rules obtained from the Interactive Decision Tree are:
  • 8. - Customers whose promotion count in the past 12 months is less than 17.5 or missing, with a median house value of less than $67,350, a gift count of less than 2.5 or missing in the past 36 months and an average gift card amount of $11.5 or more in the last 36 months are unlikely (33%) to donate [Node 54].
    - Customers with a median income of less than $54,473 or missing, a card promotion count of less than 5.5 in the last 12 months, a gift count of 4.5 or more in the last 36 months and a last gift amount of more than $7.5 or missing are very likely (91%) to donate [Node 67].
    The remaining 19 leaves can be interpreted in a similar fashion.
    - Regression
    The final model we create is a Logistic Regression (as the target variable is binary). The advantage of a Regression model is that it can express or quantify the association between the input variables and the target variable. Furthermore, Regression is an excellent tool for estimation. Since we are using Logistic Regression, the model estimates the log odds (and hence the probability) of a customer being a donor as a weighted sum of the attributes (input variables).
    As Regression models do not accommodate missing values (unlike Decision Trees) but rather skip the affected records, we add an Impute node to impute missing interval values with the variable's mean and missing nominal values with the variable's most common level. Similarly, Regression models work best with a limited but worthy set of variables, and thus we add a Variable Clustering node to group together variables that behave similarly and so reduce redundancy. Each cluster is represented by the variable that has the lowest normalized R-square value (since Variable Selection in Variable Clustering has been set to Best Variables). Maximum Eigenvalue is kept at 1; this specifies the largest permissible value of the second eigenvalue in each cluster, facilitating the creation of better clusters.
    Figure 10: Changing the properties of Variable Clustering node.
    The selection model chosen for the Regression model is Stepwise selection, while the selection criterion is set to Validation Error. With these settings, the model starts with no variables (i.e. at the intercept). At every step a variable is added, and at the same time the variables already in the model are checked against the minimum selection criterion threshold (Validation Error in this case). Variables below the threshold are removed, and this process continues until the stay significance level is achieved.
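The two data-preparation steps that feed the Regression model (setting values below the lower limit of 1 to missing, then imputing the mean for interval variables and the most common level for nominal variables) can be sketched outside SAS Miner as follows. This is an illustration in plain Python under stated assumptions, not the internals of the Replacement or Impute nodes: the helper names, the sample values and the `dem_gender` column are hypothetical, and `None` stands in for the SAS missing value.

```python
from collections import Counter
from statistics import mean

def replace_below(values, lower_limit):
    """Set every value below lower_limit to None (missing)."""
    return [None if v is not None and v < lower_limit else v for v in values]

def impute_mean(values):
    """Fill missing interval values with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Fill missing nominal values with the most common observed level."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

# Made-up sample rows (assumption), not taken from the donor dataset.
dem_med_income = [0, 40_000, 60_000, 0, 50_000]
dem_gender = ["F", None, "M", "F", None]  # hypothetical nominal variable

income_missing = replace_below(dem_med_income, lower_limit=1)
income_imputed = impute_mean(income_missing)
gender_imputed = impute_mode(dem_gender)

print(income_missing)
print(income_imputed)
print(gender_imputed)
```

Replacing first and imputing second matters: if the $0 values were left in place, they would drag the mean down before it is used as the fill value.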
  • 9. The output of the Regression model suggested Status Category Star All month, Gift count of card in the last 36 months and Gift amount average of 36 months as the most crucial factors, in that order. These variables have an influence on, or correlation with, the target variable.
    Figure 11: Output of Regression model.
    The output can be interpreted as follows: a unit change in GiftCntCard36, with all the other variables remaining constant, changes the log odds (as it is Logistic Regression) of donation by 0.1156 units. Equivalently, the odds of donation are multiplied by exp(0.1156), or about 1.12. All the other variables can be interpreted in a similar manner, based on their coefficients.
    4. Evaluating Predictive models
    The three predictive models are compared on the basis of a few prominent machine-learning metrics to evaluate which model outperforms the others. We use the Model Comparison node and Excel to compute the metrics.
    - Receiver Operating Characteristic (ROC) curves
    The ROC curve shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) of a model at various cutoff levels. Models whose curves are closest to the top-left corner (i.e. near the top of the sensitivity axis) are more accurate predictors, while models whose curves are close to the baseline are poor predictors. In other words, the larger the area under the model's curve, the better the model.
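To make the ROC construction concrete, the sketch below sweeps the cutoff over a handful of made-up scores and computes the area under the resulting curve with the trapezoidal rule. The labels, scores and function names are illustrative assumptions; the Model Comparison node produces its ROC chart internally and may differ in detail.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is flagged
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up validation records (assumption): 1 = donor, 0 = non-donor.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print("area under the ROC curve:", auc)
```

A model that ranks customers at random would trace the diagonal baseline with an area of 0.5, so the further a curve bows towards the top-left corner, the larger this area becomes.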
  • 10. The models are evaluated on the basis of their performance on the validation set. The Regression curve lies closest to the top-left section of the chart (i.e. towards True Positive). The Autonomous Decision Tree stays marginally behind, while the Interactive Decision Tree is the closest to the baseline, indicating that it is a poor predictor.
    Figure 12: ROC curves of the Predictive models.
    - Cumulative Lift
    Lift is a measure of a model's effectiveness. It essentially compares the result obtained with the model to the result obtained without a model (i.e. by random selection). A higher lift value indicates that the model is that many times more effective than random selection.
    Figure 13: Cumulative Lift of the Predictive models.
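Cumulative lift at a given depth can be sketched as the response rate among the top-scored fraction of customers divided by the overall response rate. The example below is a plain-Python illustration with made-up labels and scores; `cumulative_lift` is a hypothetical helper, not a SAS Miner function.

```python
def cumulative_lift(labels, scores, depth):
    """Response rate in the top `depth` fraction divided by the overall rate."""
    # Order the actual outcomes from the highest predicted score to the lowest.
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    n_top = max(1, round(len(ranked) * depth))
    top_rate = sum(ranked[:n_top]) / n_top
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate

# Made-up records (assumption): 4 donors among 10 customers.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

lift_at_20 = cumulative_lift(labels, scores, 0.2)
lift_at_50 = cumulative_lift(labels, scores, 0.5)
lift_at_100 = cumulative_lift(labels, scores, 1.0)
print(lift_at_20, lift_at_50, lift_at_100)
```

At full depth the lift is always 1, because selecting everyone is the same as selecting at random; the value of a model shows in how far above 1 the lift sits at shallow depths.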
  • 11. Based on the cumulative lift achieved by the models, we observe that the Autonomous Decision Tree (1.30) produced the highest lift at the 15th depth. It was followed by the Regression model with a cumulative lift of 1.27 and the Interactive Decision Tree with 1.16 at the same depth. This can be interpreted as follows: the top 15% of customers selected by the Autonomous Decision Tree are likely to contain 1.30 times as many donors as 15% of customers picked at random.
    - Average Square Error (ASE)
    ASE is the error arising from the difference between the outcome predicted by the model and the actual outcome. The lower the ASE, the better the model, as it produces smaller errors.
    Figure 14: ASE produced by the Predictive models.
    From the results of the Model Comparison node, we observe that Regression and the Autonomous Decision Tree perform at an almost identical level in terms of ASE, with the latter performing only marginally better. The Interactive Decision Tree produced the largest difference between predicted and actual outcomes.
    - Misclassification rate
    Misclassification rate is the error produced when a model incorrectly classifies a responder as a non-responder or vice versa. In our business context, a model misclassifies if it predicts a donor to be a non-donor or vice versa. Naturally, we would prefer the model that makes the fewest such errors.
    Figure 15: Misclassification rate produced by the Predictive models.
    The Interactive Decision Tree comes out on top on this metric, with a misclassification rate of only 0.398, compared with 0.436 for Regression and 0.428 for the Autonomous Decision Tree.
    - Accuracy
    Accuracy indicates how often a model predicts correctly (both positives and negatives) out of the total predictions it makes. It is computed by dividing the true predictions (True Positive and True Negative) by the total number of predictions (i.e. the number of records/observations/customers).

    Model            False Negative   True Negative   False Positive   True Positive
    Autonomous DT              1460            1804              617             962
    Regression                 1111            1467              954            1311
    Interactive DT             1153            1406             1015            1269

    Figure 16: Confusion matrix obtained from Model Comparison (Validation dataset).
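As a cross-check, the Accuracy values, and the F1 scores discussed on the next slide, can be recomputed directly from the confusion matrix in Figure 16. The sketch below uses the counts exactly as reported; the dictionary layout and function names are illustrative choices, and the report itself computed these metrics in Excel.

```python
# Counts copied from Figure 16 (validation dataset).
confusion = {
    "Autonomous DT": {"fn": 1460, "tn": 1804, "fp": 617, "tp": 962},
    "Regression": {"fn": 1111, "tn": 1467, "fp": 954, "tp": 1311},
    "Interactive DT": {"fn": 1153, "tn": 1406, "fp": 1015, "tp": 1269},
}

def accuracy(c):
    """Share of all predictions that are correct."""
    return (c["tp"] + c["tn"]) / (c["tp"] + c["tn"] + c["fp"] + c["fn"])

def precision(c):
    """Share of predicted donors that are actual donors."""
    return c["tp"] / (c["tp"] + c["fp"])

def recall(c):
    """Share of actual donors that the model identifies."""
    return c["tp"] / (c["tp"] + c["fn"])

def f1(c):
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r)

results = {
    model: {"accuracy": round(accuracy(c), 3), "f1": round(f1(c), 3)}
    for model, c in confusion.items()
}
for model, metrics in results.items():
    print(model, metrics)
```

Each model's four counts sum to the same 4,843 validation records, and the rounded results agree with the Accuracy and F1 columns of the summary table in Figure 17.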
  • 12. Based on accuracy, the Regression model predicts donors and non-donors more accurately than the other two models (Figure 17).
    - F1
    F1 is the harmonic mean of Precision (the proportion of predicted positives that are true positives) and Recall (the proportion of actual positives correctly identified). A score of 1 means the model is a perfect predictor, while a score of 0 indicates a poor predictor.
    Regression outperforms the other two models on F1 score as well, indicating that it is the best predictor (Figure 17).
    Conclusion:
    The performance of the models on the basis of the above machine learning metrics is summarized in the following table:

    Models           Accuracy   F1      ROC     Lift   ASE      Misclassification
    Autonomous DT    0.571      0.481   0.591   1.30   0.2432   0.428
    Interactive DT   0.552      0.539   0.567   1.16   0.250    0.398
    Regression       0.574      0.559   0.595   1.27   0.2437   0.436

    Figure 17: Summarizing the performance of models based on various metrics.
    Figure 18: Visual comparison between the models. [Bar chart "Performance comparison of models": Accuracy, F1, ROC, Lift, ASE and Misclassification for the Autonomous DT, Interactive DT and Regression models.]
    On the evidence, we conclude that the Regression model is the best model in terms of performance, accuracy, effectiveness and error generation.
    5. Scoring and Predicting Potential Donors
    After careful and thorough examination, we concluded that Regression is the best model in terms of its predictive capability, as it is more accurate and produces fewer errors. Thus, we use the Regression model to score a new dataset of lapsing donors (i.e. to apply its predictions to that dataset). The scoring is performed through the Score node.
  • 13. To explore the results, we create a histogram on a new variable created by the scoring, called predicted target_B=1. This new variable contains the predicted value assigned to each customer based on their probability of donating. To represent the scoring visually, we create a histogram of the customers predicted by the model as potential donors, together with their probability of donating. The results are as follows:

      Figure 19: Exploring Predicted Donors.

    To enable better insights, we change the number of bins to 20. The highlighted records in the dataset are the customers belonging to the selected bar in the histogram (Figure 19). The values on the X axis represent the likelihood of a customer donating in the next campaign.
    The average response rate was found to be 5%. As such, customers with a predicted probability of over 0.05 can be considered candidates for the campaign. However, to maximize the cost-effectiveness of the campaign, we could derive a probability threshold from past information about the customer lifetime value they generate. Customers with predicted values beyond that threshold could then be targeted to generate an even better response/hit rate and better margins.
    A rational approach would be to solicit customers who have been assigned a predicted value of 0.55 or more. Another way to achieve a significantly better response would be to approach customers who belong in the top 30th percentile based on their predicted values (Figure 20).
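The two targeting approaches described above (a fixed probability cut-off, or the highest-ranked share of customers) can be sketched as follows. This is a hedged Python illustration only: the customer IDs and probabilities are made up, the function names are hypothetical, and "top 30th percentile" is assumed here to mean the 30% of customers with the highest predicted values.

```python
# Hypothetical scored customers: (customer_id, predicted probability of donating).
# These values are invented for illustration and are not from the scored dataset.
scored = [
    (1, 0.71), (2, 0.58), (3, 0.54), (4, 0.52), (5, 0.49),
    (6, 0.47), (7, 0.44), (8, 0.41), (9, 0.38), (10, 0.33),
]


def select_by_threshold(scored, threshold):
    """Keep customers whose predicted probability is at least the threshold."""
    return [cid for cid, prob in scored if prob >= threshold]


def select_top_fraction(scored, fraction):
    """Keep the given fraction of customers with the highest predicted values."""
    ranked = sorted(scored, key=lambda row: row[1], reverse=True)
    n_keep = int(len(ranked) * fraction)  # rounds down to a whole customer
    return [cid for cid, _ in ranked[:n_keep]]


by_threshold = select_by_threshold(scored, 0.55)
by_rank = select_top_fraction(scored, 0.30)
print("Threshold 0.55:", by_threshold)
print("Top 30%:", by_rank)
```

The threshold rule lets the size of the mailing list vary with the score distribution, whereas the rank-based rule fixes the mailing volume in advance; which is preferable depends on the campaign budget.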
  • 14. Figure 20: Snapshot of Customers with highest predictive probability of donating.

    PART B: PREDICTIVE MODELLING BASED ON R
    -------------------------------------------------------------------------------------------
    After creating three predictive models in SAS Miner, we built a Decision Tree in R. As is the procedure before any data mining activity, we first explore the variables to investigate their distribution.

    Exploring and transforming the data
    The summary function and the psych package are used to provide a detailed descriptive summary of the variables. We notice that R treats CustomerID as a variable, and that a few variables have missing values. A histogram of median income reveals a disproportionate number of zero values in the variable.

      Figure 21: Exploring the variables.

    To ensure a clean model, we wrangle and transform the variables:
  • 15. - CustomerID is rejected, as it is an ID and not an input for modelling.
    - Target_D is rejected, as Target_B contains the collapsed data of Target_D and including Target_D would lead to leakage.
    - Since R treats variables containing $ values as categorical, we transform GiftAvgLast, GiftAvg36, GiftAvgAll, GiftAvgCard36, DemMedHomeValue and DemMedIncome into numeric variables.

      Figure 22: Transforming variables into numeric variables.

    - The zero values in DemMedIncome are replaced with NA, as those values belong to customers who did not reveal their incomes.

    Building a Decision Tree
    After cleaning the data, we divide it equally into training and validation sets. The Decision Tree is built on the validation set using the rpart package. Target_B is selected as the target variable and all the non-rejected variables are selected as inputs. Since the target variable is binary, we select class as the method, while the complexity parameter is set at 0.001. The script for the Decision Tree, and its plot produced with rpart.plot, are shown in Figure 23.

      Figure 23: Decision Tree with all non-rejected variables as inputs and cp of 0.001.
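The report performs the cleaning steps above in R (Figure 22). As a language-neutral illustration of the same logic, the following minimal Python sketch strips currency formatting and treats zero incomes as missing. The sample rows, the exact "$" and comma formatting of the raw values, and the function names are assumptions made for illustration.

```python
def parse_currency(value):
    """Convert a string such as '$38,750' to a float; an empty string becomes None."""
    cleaned = value.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


def zero_to_missing(amount):
    """Treat a zero amount as missing, mirroring the zero -> NA replacement."""
    return None if amount == 0 else amount


# Hypothetical raw rows; the real dataset's formatting may differ.
rows = [
    {"CustomerID": 1, "DemMedIncome": "$38,750"},
    {"CustomerID": 2, "DemMedIncome": "$0"},
    {"CustomerID": 3, "DemMedIncome": "$52,106"},
]

incomes = [zero_to_missing(parse_currency(row["DemMedIncome"])) for row in rows]
print(incomes)
```

Marking undisclosed incomes as missing rather than zero keeps them from being read as genuinely low incomes when the tree chooses its splits.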
  • 16. The resulting decision tree is very complex, with a lot of leaves. On plotting its ROC curve, we notice a huge discrepancy, as the model curves perfectly towards sensitivity. The model shows signs of overfitting as well as leakage.

      Figure 24: ROC curve of the Decision Tree.

    To rectify this model, we prune the decision tree based on cross-validation error. The complexity parameter at which the lowest error is produced (0.0072) is selected as the complexity parameter for the new model.

      Figure 25: Cross-Validation plot.

    The other change in the new model is the selection of input variables. For the new model, we select variables that have relevance and predictive capability, based on rational judgement. For example, GiftAvg36 is more relevant to the model than GiftAvgAll, so only the former is used in the new model. The new Decision Tree can be seen in the plot below:
  • 17. Figure 26: Pruned Decision Tree.

    Comparing the model to the models created in SAS
    The new model is much cleaner and results in 5 definite prediction rules. We plot the ROC curve and lift chart for this model (blue line for Training and red line for Validation). The results are as follows:

      Figure 27: ROC curve (left) and Lift chart (right) of Pruned Decision Tree.

    On comparing the ROC curve and lift chart of the Decision Tree created in R with the three models built in SAS Miner, we observe that the Regression would still comfortably beat all the other models on accuracy and effectiveness. The ROC curve of the Decision Tree built in R is close to the baseline, indicating that it is not a very good predictor. Similarly, the lift generated is very close to 1, which is not an ideal value.

    Scoring the data
    The purpose of any model is to predict, and to conduct this prediction we import the score dataset. As we did for the original data source, we transform the variables containing $ values into numeric variables. The model's predictions are then applied to the score dataset using the predict function. A snapshot of the scoring results is as follows:
  • 18. The first column indicates the customer ID, the second column corresponds to the customer being a non-donor, and the third column corresponds to the customer being a donor. The values in the second and third columns indicate the probability of the customer falling into that class. For example, customer ID 96362 has a 41.96% predicted probability of donating in the next campaign.
    Similarly, the predicted value has been derived for all customers. To achieve maximum profitability, the organization should solicit customers whom the model has flagged with more than a 60% probability of donating.

    PART C: MARKET BASKET ANALYSIS AND ASSOCIATION RULES
    -------------------------------------------------------------------------------------------
    In this section of the report, we attempt to derive meaningful patterns in the purchasing behaviour of customers with reference to a range of products. The primary objective of this market-basket analysis (MBA) is to discover items that are purchased together with high confidence and high lift. These insights can enable the retail store to expand its revenue and achieve higher profitability.
    The MBA was conducted in SAS Miner on a dataset containing information on over 400,000 transactions accumulated over the past 3 months. The properties of the variables have been set as shown in Figure 28, and the type of the data source is changed from Raw to Transactions.

      Figure 28: Variable Properties for MBA.

    After dragging the data source into the diagram, we attach an Association node to it to conduct the analysis. Export Rule by ID is changed to Yes, as we would like to view the rule description table for the analysis. The remaining settings are kept unchanged.
  • 19. The results of the Association node unearthed several insights of considerable business value, some of which are explained below.

      Figure 29: Tabular data of all 36 rules.

    Out of the 36 rules (combinations of products) created, the highest lift achieved was 3.60. Lift essentially measures the degree of association between the products in a combination. For example, a rule A -> B with lift 3 means that a customer who has already purchased product A is three times as likely to buy product B as a random customer. Lift is derived by dividing confidence by expected confidence.
    The highest lift of 3.60 was achieved by the rule Perfume -> Toothbrush. This indicates that a customer who has purchased a Perfume is 3.6 times as likely to buy a Toothbrush as a customer chosen at random. Since lift is symmetrical, the rule Toothbrush -> Perfume has the same lift of 3.60.
    Lift is a significant metric for MBA, as it denotes the relationship between the products in a combination. A higher lift (> 1) indicates that the right-hand product is more likely to be bought together with the left-hand product than in isolation. This insight can help immensely with product placement in the aisles. In the current context, the rule Magazine & Greeting Cards -> Candy Bar has a lift of 2.68, which denotes that a customer who buys Magazine and Greeting Cards is 2.68 times as likely to buy a Candy Bar as a customer chosen at random.
    From the association analysis we derived 36 rules, each of which has significant value for implementation at the store. However, based on a few metrics, we recommend that the company incorporate the following changes to facilitate higher revenue generation:

    1. Placement of Products on Aisle
    Since Perfume -> Toothbrush produces the highest lift but has comparatively low support (customers purchasing both products), these two products should be placed in close proximity to each other to boost their sales. With these products in close vicinity,
  • 20. the purchase of Perfume will trigger the purchase of Toothbrush, or vice versa, as indicated by their high lift. Similarly, other products with high lift but relatively low support should be placed close by.
    Candy Bars -> Greeting Cards has the highest support (4.37%), indicating that these two products are often purchased together. Thus, these two products should be placed at a distance from each other, so that customers have to walk through a range of other products in the process of buying Candy Bars and Greeting Cards. Similarly, other products that are often purchased together (high support) should be placed at some distance from each other.

    2. Bundle, Cross-selling and Up-selling

      Figure 30: Link Graph of Products.

    The network between products shows which products are linked with each other. We can observe that Pens and Photo Processing are only purchased in combination with Magazine. As such, Pens and Photo Processing should be sold as a bundle with Magazines to improve their sales.
    We can also observe that Magazine is the most popular product in the store, which gives an opportunity to create up-selling and cross-selling situations for other products around magazines. Less popular products can be placed near Magazine to grab more attention. Similarly, a higher-priced alternative product can be placed close to the magazines (for example, premium Greeting Cards, Candy Bars and Toothpaste).

    3. Specials
    Products with high lift (Perfume -> Toothbrush, Magazine & Candy Bar -> Greeting Cards, etc.) should be on sale at different times. Since the purchase of one product is likely to trigger the purchase of another product in the rule anyway, it is counter-productive to have a sale on
  • 21. both/all the items in the rule. This can save a large proportion of discounting costs for the company while boosting its sales.
    Some rules, like Greeting Cards & Candy Bar -> Magazine, clearly involve products that are popular during festive periods. Offering one of the products at a discounted rate during the festive season will tempt the customer to purchase the other two non-discounted products to complete their festive shopping wish-list.
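The lift definition used in Part C (confidence divided by expected confidence) can be illustrated with a small worked example. The following Python sketch is not the SAS Miner Association node; the ten transactions are invented for illustration and the function names are hypothetical.

```python
# Hypothetical baskets, invented for illustration (not the store's transaction data).
transactions = [
    {"Perfume", "Toothbrush"},
    {"Perfume", "Toothbrush", "Magazine"},
    {"Perfume", "Candy Bar"},
    {"Toothbrush", "Magazine"},
    {"Magazine", "Greeting Cards"},
    {"Candy Bar", "Greeting Cards"},
    {"Magazine"},
    {"Candy Bar"},
    {"Magazine", "Candy Bar"},
    {"Greeting Cards"},
]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Among baskets containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence divided by expected confidence (the plain support of `rhs`)."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


forward = lift({"Perfume"}, {"Toothbrush"}, transactions)
backward = lift({"Toothbrush"}, {"Perfume"}, transactions)
print(f"lift(Perfume -> Toothbrush) = {forward:.2f}")
print(f"lift(Toothbrush -> Perfume) = {backward:.2f}")
```

In this toy data both products appear in 3 of 10 baskets and together in 2, so the lift is (0.2 / 0.3) / 0.3, roughly 2.22, in either direction. This illustrates why lift is symmetrical while confidence generally is not.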