Tayko is a fictitious business which wants to use data mining to improve their email marketing efforts. They are trying different data-mining algorithms to get the best results.
2. Business Problem
Tayko is a software catalog firm that sells games and educational software
Want to market a new collection using e-mail marketing.
As member of an industry consortium, they can pull 2,00,000 emails address from
the central repository of the consortium.
To maximize the benefit, Tayko wants to pull records with high probability of
response and higher value of sale.
3. Analytics Problem
1. Create a classification model to groups the customer as responder or
purchasers(1) and non-responders or non-purchasers(0).
2. Create a prediction model to predict the value of sale of the responder(1).
4. Data Collection
Supervised learning techniques is to be applied as a desired output is required is
already defined.
A sample of 2000 customer is drawn form the central repository and test e-mail
marketing is done.
The 2 target variables : Purchased and Spending is recorded for the sample.
The result showed 1000 purchasers and 1000 non-purchasers
5. Data partitioning
The data set is partitioned into
Training set – 60% - 1200 records
Testing – 20% - 400 records
Validation – 20% - 400 records
7. Finding the variables with strong differentiation
power – Nominal Variables
Use of Catalog A, T, U, P show high percentage of people making a purchase
Use of Catalog O, H show high percentage of people not making a purchase
But only Catalog A & U has been used for more than 100 customers.
Catalog H for more than 50 customers & rest below 50 customers.
Distribution of catalogs were not even.
8. Other Nominal Variables
Out of other categorical variables : “Order Online” is the only one which show some
power to differentiate between customer who purchased and the non-purchasers.
9. Ordinal Variables
Number of purchase last year shows a good trend
People who have not made any purchase last year have
not made any purchase with the new catalogs also.
People who had made more than 3 purchase has surly
made a purchase this time also
10. Scale Variables
Out of the 2 scale variables “Last update to customer record” shows a significant
difference in their mean.
11. Target Variables
Purchaser and non-purchasers are equally distributed
However the sales value or the amount spend by customer follows a non-normal
distribution
13. Logistic Regression – Training
Final set of variables
1. Frequency : Number of transactions in last year at source
catalog
2. Web Order : Customer placed at least 1 order via web
3. Address is Residence : Address is a residence
4. Source_a, h or u :Source Catalog is A, U or H
21. New Calculated Variables
• High correlation between “last_update_days_ago ” and
“1st_update_days_ago ”
• New calculated variable DayDiff which is difference of the 2
variables
22. Multiple Linear Regression
Pre-processiong
Univariate analysis and transformation of Target Variable “Spend”
Outlier removal,
Filtering and
Transformation
23. Model & Performance
4 models are generated
Case 1 : None Residence Address & Not a Web-Order (R-sqr : 0.569 & Adj R-sqr : 0.566)
Spending = -15.733 + 79.11 * No of transaction last year – 47.825 * Catalog D + 30.632 * Catalog U
Case 2 : None Residence Address & Web-Order (R-sqr : 0.62 & Adj R-sqr : 0.616)
Spending = -42.285 + 115.976 * No of transaction last year + 45.506 * Catalog U -247.655 * Catalog H +
55.605 Catalog R
Case 3 : Residence Address & Not a Web-Order (R-sqr : 0.516 & Adj R-sqr : 0.507)
Spending = -26.965 + 69.218 * No of transaction last year + 66.219 * Catalog U – 113.587*Catalog H
Case 4 : Residence Address & Web-Order (R-sqr : 0.612 & Adj R-sqr : 0.592)
Spending = -4.616 + 65.114 * No of transaction last year - 111.934*Catalog H – 81.28 * Catalog R – 129.754
* Catalog C + 66.242 * Catalog A
27. Decision
Both the models are very weak in predicting the amount spent
There is high error for evaluation indicators.
One major reason for this can be the lack of scale variables and high correlation
between whatever scale variables are given.
Since most variables are of nominal type, converting the prediction problem to
classification might produce better result. But it was out of scope for the given
problem.
28. Conclusion
The classification of customer into purchasers and non-purchasers shows good
result and the elected logistic regression model is expected to show high
performance in live situation also.
However the prediction models show weak performance and a high degree of error
is expected if used in the current state.