3. “We have internal information. Getting information from outside is our challenge. There’s no way of doing that.”
– Senior Editor, Leading Media Company
7. WHAT ARE PEOPLE LOOKING FOR IN DATA ANALYTICS?
USA: data analytics jobs, data analytics tools, data analytics salary, data analytics training
India: data analytics courses, data analytics tools, data analytics jobs, data analytics companies
Themes: Jobs & Salary, Tools, Companies, Training & Courses
Source: https://google.com, https://google.co.in
8. WHAT’S THE POPULARITY OVER TIME?
“Data Analytics”
Source: https://trends.google.com/
9. WHICH CITIES HAVE INTEREST IN DATA ANALYTICS?
Source: https://trends.google.com/
Gurgaon
Pimpri-Chinchwad
Noida
Bengaluru
Hyderabad
Chennai
Singapore
Mumbai
San Francisco
Dublin
Boston
Washington
Pune
Howrah
Toronto
New York
Sydney
New Delhi
Chicago
Melbourne
11. WHO’S RECRUITING THE TEAMS?
IBM India
Accenture
JPMorgan
KPMG
Concentrix Daksh
Microsoft India
Ernst & Young
UnitedHealth Group
Shell India Markets
Amazon Dev Centre
GE India Technology
Hewlett-Packard
Deloitte
Cisco Systems
WNS
Xerox
eClerx Services
Mphasis
AIG Analytics
Sapient Consulting
#Jobs
Source: https://www.naukri.com
12. WHAT INDUSTRIES USE DATA ANALYTICS?
Software
Banking, Financial Services
Internet, Ecommerce
KPO, Research, Analytics
BPO, Call Centre, ITES
Recruitment, Staffing
Strategy Mgmt Consulting
Media & Entertainment
Advertising & PR
Accounting & Finance
Telcom, ISP
Education, Teaching & Training
Pharma, Biotech & Clinical Research
Insurance
FMCG, Foods & Beverage
Source: https://www.naukri.com
14. WHERE ARE THE DATA ANALYTICS JOBS?
Source: https://www.naukri.com
Bengaluru
Delhi NCR
Mumbai
Gurgaon
Hyderabad
Others
Pune
Noida
Chennai
Delhi
15. WHO ARE THE BIG PLAYERS IN THIS SPACE?
Source: Gartner BI Magic Quadrant
16. WHICH STARTUPS OFFER DATA ANALYTICS IN INDIA?
Source: https://angel.co/
... and more
18. CLASSES OF ANALYTICAL SOLUTIONS
Mode | Approach to data | Benefits
Proactive Action | What should I do to achieve my goal? | Data products, data validated actions, increased success rate of strategic initiatives
Proactive Decisions | What is likely to happen? | Support for strategic initiatives, forward looking decision making
Proactive Consumption | Why did it happen? | Significant improvements from status quo, data backed management
Active | What happened? | Marginal business benefits, process gap identification
19. CLASSES OF ANALYTICAL SOLUTIONS
Mode | Approach to data | Benefits
Active | Operational Reporting for measurement of efficiency & compliance | Marginal business benefits, process gap identification
21. CLASSES OF ANALYTICAL SOLUTIONS
Mode | Approach to data | Benefits
Proactive Consumption | Root Cause Analysis, Benchmarking and multi-dimensional analysis | Significant improvements from status quo, data backed management
22. DETECTING FRAUD
“We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation. Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”
– ENERGY UTILITY
23. This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.
This clearly shows collusion of some form with the customers.
Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
217 219 200 200 200 200 200 200 200 350 200 200
250 200 200 200 201 200 200 200 250 200 200 150
250 150 150 200 200 200 200 200 200 200 200 150
150 200 200 200 200 200 200 200 200 200 200 50
200 200 200 150 180 150 50 100 50 70 100 100
100 100 100 100 100 100 100 100 100 100 110 100
100 150 123 123 50 100 50 100 100 100 100 100
0 111 100 100 100 100 100 100 100 100 50 50
0 100 27 100 50 100 100 100 100 100 70 100
1 1 1 100 99 50 100 100 100 100 100 100
This happens with specific customers, not randomly. Here are such customers’ meter readings.
Section Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
Section 1 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109%
Section 2 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54%
Section 3 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34%
Section 4 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14%
Section 5 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15%
Section 6 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33%
Section 7 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14%
Section 8 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17%
Section 9 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11%
If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections and time … with some explainable anomalies. Why would these happen?
New section manager arrives … and is transferred out
Simple histograms have been applied to manage ALM compliance, fraud in corporate directorships, and collusion in schools.
24. What do children in schools know, and what can they do, at different stages of elementary education?
Have the inputs made into the elementary education system had a beneficial effect or not?
25. HAVING BOOKS IMPROVES READING ABILITY
Having more books at home improves children’s performance in reading. (But children typically have only 1-10 books at home.)
• Number of students sampled
• What is the impact? How many more marks can having more books fetch?
• Circle size indicates number of students with this response. Few students have no books.
• Is this response (“25+ books”) good or bad? Small red bars indicate low marks. Large green bars indicate high marks. Students having 25+ books tend to score high marks.
• The most common response is marked in blue. This is also the circle.
• The graphic is summarized in words
• Indicates whether the best response is the most popular. Blue means that it is not. Green means that it is. Red means that the worst level is the most popular response.
26. HAVING MORE SIBLINGS DOESN’T HELP READING
Children with 1 sibling do much better than children with many siblings
27. … BUT HELPS A LOT IN MATHEMATICS
Children with 4+ siblings do very well, children with 1 sibling fare poorly
28. TUITIONS HELP A LITTLE
… BUT NOT CHILDREN WITH 4+ SIBLINGS
29. TUITIONS HELP A LITTLE
… BUT NOT CHILDREN OF ILLITERATE PARENTS
30. CHILDREN LIKE GAMES, AND THEY’RE GOOD
… but playing daily hurts reading ability
31. CLASSES OF ANALYTICAL SOLUTIONS
Mode | Approach to data | Benefits
Proactive Decisions | Statistical Analysis through Segmentation, Decision Trees and Cause-effect Modelling | Support for strategic initiatives, forward looking decision making
33. CLASSES OF ANALYTICAL SOLUTIONS
Background & Objective
Customer churn is a well-noted problem in the telecom industry today. One of the leading telecom operators in the country wanted to predict the churn rate 2/4 weeks in advance using an analytical model.
Gramener Approach
Exploratory Analysis & influencers: Exploratory business analysis performed to identify influencers & create additional derived metrics & derived dimensions
Linear Discriminant Parameters: Using selective metrics, models were built on Linear Classification like Decision trees, Linear Discriminant Parameters
Non-Linear Models: Using selective metrics, non-linear families of models were built: Neural Networks, Random Forests & Support Vector Machines
Predictive Intervention:
• The best model was implemented & compared with a control set
• Targeted promotions for predicted set yielded ~60% reduction in churn
34. MODEL BUILDING & FINE-TUNING
Models Deployed
Pair-wise correlation
Multi-linear regression
Linear Discriminant Analysis
Decision Tree
Support Vector Machines
Neural Networks
Random Forest
Other Variability
Predict Duration
Ageing of model
Input Metrics - Customer
Incoming & Outgoing Minutes
Incoming & Outgoing Calls
Daily Mobile Usage
Closing Balance
Customer activation date
Input - Derived & Growth Metrics
Last/Average Closing balance in a month
Days since the last Outgoing Call
Days since the last Recharge
Total Decrement
Monthly Refill Amount
Total Minutes incl Incoming & Outgoing
Percentage of Incoming Minutes
Recharge Values
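A few of the derived metrics listed above can be computed directly from the raw customer inputs. The sketch below is only an illustration under stated assumptions: the function name, argument names and dates are hypothetical and not taken from the deck.

```python
from datetime import date

def derived_metrics(incoming_minutes, outgoing_minutes,
                    last_outgoing_call, last_recharge, as_of):
    """Compute a few derived metrics for one customer (hypothetical schema)."""
    total = incoming_minutes + outgoing_minutes
    return {
        # Total Minutes incl Incoming & Outgoing
        "total_minutes": total,
        # Percentage of Incoming Minutes (0 when there was no usage at all)
        "pct_incoming_minutes": 100.0 * incoming_minutes / total if total else 0.0,
        # Days since the last Outgoing Call
        "days_since_last_outgoing_call": (as_of - last_outgoing_call).days,
        # Days since the last Recharge
        "days_since_last_recharge": (as_of - last_recharge).days,
    }

# Made-up example customer, for illustration only.
metrics = derived_metrics(
    incoming_minutes=30,
    outgoing_minutes=90,
    last_outgoing_call=date(2015, 3, 1),
    last_recharge=date(2015, 2, 14),
    as_of=date(2015, 3, 15),
)
print(metrics)
```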
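35. Base MODEL
MISSED: 8.3% | WASTED: 0.0% | COST PER CUST.: 6.61 | IMPROVEMENT: 0.0%
Actual vs. Prediction:
Actual No churn, predicted No churn: OK
Actual No churn, predicted Churn: WASTED (Marketing cost Rs 40)
Actual Churn, predicted No churn: MISSED (Acquisition cost Rs 80)
Actual Churn, predicted Churn: OK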
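The cost figures on this slide can be tied together with a little arithmetic. The sketch below assumes (this is not stated on the slide) that cost per customer is the missed rate times the acquisition cost plus the wasted rate times the marketing cost, and that improvement is measured relative to the base model. With the rounded 8.3% missed rate this gives about Rs 6.64, close to the 6.61 shown, which suggests the slide used an unrounded rate.

```python
MARKETING_COST = 40    # Rs, spent on a customer wrongly predicted to churn (WASTED)
ACQUISITION_COST = 80  # Rs, spent replacing a churner the model failed to flag (MISSED)

def cost_per_customer(missed_rate, wasted_rate):
    """Expected cost per customer under the assumed cost formula."""
    return missed_rate * ACQUISITION_COST + wasted_rate * MARKETING_COST

def improvement(model_cost, base_cost):
    """Fractional cost reduction of a model relative to the base model (assumed definition)."""
    return (base_cost - model_cost) / base_cost

# Base model: predicts "no churn" for everyone, so nothing is wasted
# and every churner (8.3% of customers) is missed.
base_cost = cost_per_customer(missed_rate=0.083, wasted_rate=0.0)
base_improvement = improvement(base_cost, base_cost)
print(round(base_cost, 2), base_improvement)
```

Any candidate model can then be compared by plugging in its own missed and wasted rates.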
37. CLASSES OF ANALYTICAL SOLUTIONS
Mode | Approach to data | Benefits
Proactive Action | Data driven decision making, through advanced mathematical models and scenario planning | Data products, data validated actions, increased success rate of strategic initiatives
38. HEURISTICS
EMERGENCY
“A man is rushed to a hospital in the throes of a heart attack. The nurse needs to decide whether the victim should be admitted into emergency care. Although this decision can save or cost a life, the nurse must decide using only the available cues, and within a few seconds – preferably using some fancy statistical software package.”
42. SO, WHAT’S THE SKILL NEEDED TO CREATE THESE?
Deep Domain
Expertise
Visual Design &
Presentation
Deep
Programming
Statistics & Machine
Learning
Passion for Numbers
Domain Orientation
43. …AND WHAT ARE THE ROLES AVAILABLE?
Data Scientist
Functional Consultant
Information Designer
Data Analyst
49. THE DATA SCIENCE TOOLKIT
Alteryx
Amazon EC2
Azure ML
BigQuery
Birst
Caffe
Cassandra
Cloud Compute
Cloudera
Cognos
CouchDB
D3
Decision tree
ElasticSearch
Excel
Gephi
ggplot2
Hadoop
HP Vertica
IBM Watson
Impala
Julia
Jupyter Notebook
Kafka
Kibana
Kinesis
Lambda
Leaflet
Logstash
MapR
MapReduce
Matplotlib
Microstrategy
MongoDB
NodeXL
Pandas
Pentaho
Pivotal
PowerPoint
Power BI
Qlikview
R
R Studio
Random Forest
Redis
Redshift
Regression
Revolution R
S3
SAP Hana
SAS
Spark
Spotfire
SPSS
SQL Server
Stanford NLP
Storm
SVM
Tableau
TensorFlow
Teradata
Theano
Thrift
Torch
Weka
Word2Vec
The tool does not matter. A person’s skill with the tool does.
Pick the person. Let them pick the tool.
We did the simplest possible thing – plot the number of customers who had meter readings of 0, 1, 2, 3, etc. – all the way up to 300 and beyond. (Effectively, we drew a histogram.)
As expected, it was log-normal. Relatively few users with low meter readings, and few with high meter readings. But what was striking were the spikes – at 50 units, 100 units, 200 units and 300 units – precisely at the slab boundaries.
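The histogram and the spikes can be sketched in a few lines. This is a minimal illustration, not the analysis actually run: the spike rule (a count at least three times the average of its two neighbours) and the sample readings are assumptions made up for the example.

```python
from collections import Counter

def reading_histogram(readings):
    """Count how many customers reported each meter reading."""
    return Counter(readings)

def find_spikes(hist, ratio=3.0):
    """Return readings whose count is at least `ratio` times the mean count
    of their immediate neighbours (reading - 1 and reading + 1)."""
    spikes = []
    for value in sorted(hist):
        neighbours = (hist.get(value - 1, 0) + hist.get(value + 1, 0)) / 2
        # Use a baseline of at least 1 so sparse regions are not flagged.
        if hist[value] >= ratio * max(neighbours, 1):
            spikes.append(value)
    return spikes

# Made-up readings: an even spread from 40 to 110 plus pile-ups at 50 and 100.
sample = list(range(40, 111)) * 2 + [50] * 10 + [100] * 20
hist = reading_histogram(sample)
spikes = find_spikes(hist)
print(spikes)
```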
Given the metering system, there is a strong economic incentive to stay at or within a slab boundary. Exceeding it increases the unit rate. However, there are two ways this could happen. Either the consumer watches their meter carefully, and the instant it hits 100, stops using their lights and fans – or a certain amount of money changes hands.
It was easy to see from this that there was fraud happening, but what stumped us were the spikes at 10, 20, 30, 40, etc. Here, there’s no economic incentive. There’s no significant difference between a meter reading of 10 vs 11, so there was no incentive to commit fraud. However, we later learnt that we were looking at this the wrong way. This was not a case of fraud, but of laziness. These were the meter readings taken by staff that never visited the premises, and were cooking up numbers.
When people cook up numbers, they cook up round numbers. (An official said that he had to let go of one person who had not taken readings in a colony of houses for as long as six months. “Sir, there’s a pack of dogs in the colony” was his official statement.)
The other question is: what is the nature of this fraudulent contract? Is it monthly, with the meter reader appearing and charging a small sum to adjust the reading? Or is it an annual contract that’s paid upfront? We looked at the meter readings of some of the people who were consistently at the slab boundaries. For example, the table in the middle has the readings of 10 customers, one per row. In the first row, the readings are at 200 for 9 of the 12 months. However, there’s a spike in Jan-11 to 350 units. This indicated a monthly contract with a failure to pay in just one month.
However, we later learnt that many of the people on this list were famous personalities. In fact, the lady in the first row had an event at her place in Jan-11, and the actual reading was expected to be well over a thousand units. But since the electricity board has a policy of not often auditing those in the highest slab (above 300), a more likely explanation was collusion between the lineman and the customer to place her in the highest slab for just this month, to avoid scrutiny.
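Checking how consistently a customer sits on a slab boundary is a simple count. The sketch below uses the first two rows of the customer table from slide 23 and the boundaries named earlier in these notes (50, 100, 200 and 300); the function names are hypothetical.

```python
SLAB_BOUNDARIES = {50, 100, 200, 300}  # boundaries named in the notes

# First two rows of the customer table on slide 23 (Apr-10 to Mar-11).
customers = {
    "row 1": [217, 219, 200, 200, 200, 200, 200, 200, 200, 350, 200, 200],
    "row 2": [250, 200, 200, 200, 201, 200, 200, 200, 250, 200, 200, 150],
}

def months_at_boundary(readings, boundaries=SLAB_BOUNDARIES):
    """Number of months in which the reading sits exactly on a slab boundary."""
    return sum(1 for r in readings if r in boundaries)

def off_boundary_months(readings, boundaries=SLAB_BOUNDARIES):
    """Zero-based month indices (0 = Apr-10) where the reading is off the boundary."""
    return [i for i, r in enumerate(readings) if r not in boundaries]

summary = {name: months_at_boundary(r) for name, r in customers.items()}
row1_exceptions = off_boundary_months(customers["row 1"])
print(summary, row1_exceptions)
```

For the first row, the exceptions are months 0, 1 and 9, that is Apr-10, May-10 and the Jan-11 spike discussed above.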
Lastly, we examined the level at which fraud can be controlled. The last table above shows the extent of fraud in each section of one city, month on month. (The extent of fraud can be measured by the relative height of the spikes compared to the expected value.) Sections vary in the level of fraud, with Section 1 having significantly more fraud than Section 9. We also observe that fraud generally decreases in the winter season (Dec – Feb), when the need for cooling is less. But what’s most striking is the negative fraud in Section 5 in Jun-10. It stays low for a couple of months, and then, as if to compensate, shoots up to 82% in Sep-10.
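One way to compute the “extent of fraud” is as a percentage excess over an expected count. The notes do not say how the expected value was estimated, so the sketch below assumes it is the mean count of the five readings on either side of the boundary; the sample counts are made up. A negative result, as for Section 5 in Jun-10, means fewer readings at the boundary than its neighbours would suggest.

```python
def extent_of_fraud(hist, boundary=100, window=5):
    """Percentage excess of the count at `boundary` over the expected count.

    `hist` maps a meter reading to the number of customers with that reading.
    The expected count is estimated (an assumption) as the mean count of the
    `window` readings on either side of the boundary.
    """
    neighbours = [hist.get(boundary + d, 0)
                  for d in range(-window, window + 1) if d != 0]
    expected = sum(neighbours) / len(neighbours)
    if expected == 0:
        raise ValueError("no neighbouring readings to estimate the expected count")
    return 100.0 * (hist.get(boundary, 0) - expected) / expected

# Made-up section: 20 customers at each reading from 95 to 105, 30 at exactly 100.
section = {reading: 20 for reading in range(95, 106)}
section[100] = 30
excess = extent_of_fraud(section)

# Same section with only 15 customers at 100 gives a negative value.
section[100] = 15
deficit = extent_of_fraud(section)
print(excess, deficit)
```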
We learnt that this coincided with the appointment and transfer of a new section manager – under whose “regime”, fraud seems to have been dramatically controlled. It appears that a good organisation level to control fraud is at the 5,000 people strong section manager level, rather than the 100,000 people strong staff level.
Medical institutions rely on vital heuristics. Every diagnosis has parameters attached to it, and an initial scan must be done to identify the basic ailment before proceeding further. We asked: how can we speed this up? A person having a severe heart attack cannot wait for all the scans to be done before moving to the next stage of treatment.
Instead, a simple set of parameters can be checked quickly to decide whether to admit the patient for treatment. The decision of whether to admit the patient to a critical care unit is simplified and sped up. Visual cues can thus help the nurse make quick decisions backed by statistics rather than by instinct alone.
Measure the blood pressure using a stethoscope and sphygmomanometer. If it is more than 91, the patient has to be admitted immediately to the intensive care unit. If the pressure is not more than 91, check the age: if the age is more than 62, the chance of the patient stabilizing without intensive care is ruled out. But if the age is less than 62, check the pulse. The pulse must not be higher than 100; if it is, the patient must be taken to the emergency ward.
Thus, a predefined step-by-step process that identifies the causes and the remedies helps save lives. This simple visual cue on a dashboard not only saves many lives daily but also helps the medical workforce record and reproduce the patient history.
The decision tree model used here helps break complicated situations down into easier-to-understand scenarios.
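The three checks described above can be written as a tiny hand-coded decision tree. This sketch simply mirrors the thresholds and the order of checks as stated in these notes (91, 62 and 100); the function name, the return labels and the final fall-through case are assumptions, and nothing here is clinical guidance.

```python
def triage(blood_pressure, age, pulse):
    """Walk the three-question decision tree described in the notes."""
    if blood_pressure > 91:
        return "intensive care"   # first check: blood pressure
    if age > 62:
        return "intensive care"   # second check: age
    if pulse > 100:
        return "emergency ward"   # third check: pulse
    return "no escalation"        # assumed label: the notes do not name this case

# Made-up patients, for illustration only.
decisions = [
    triage(blood_pressure=95, age=50, pulse=80),
    triage(blood_pressure=85, age=70, pulse=80),
    triage(blood_pressure=85, age=50, pulse=110),
    triage(blood_pressure=85, age=50, pulse=80),
]
print(decisions)
```

Each patient is classified with at most three comparisons, which is what makes the rule usable within a few seconds.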