Data Mining
Introduction
Prithwis Mukerjee, Ph.D.
Prithwis
Mukerjee 2
Why Suddenly Data Mining ?
Fluid, dynamic business environment
 Markets are, or seem to be, saturated
 Customers are aggressive and disloyal
 Speed is essential
 “The quick and the dead”
Availability of data
 Vast amounts of data are generated, stored electronically
and are waiting to be processed !
Availability of tools and techniques
 Mathematical tools are available to “process” this data
 This processing is significantly different from MIS / EDP style of
data processing
 Software tools are available to implement some of
these mathematical models
The Business Environment
Customer Behaviour
 Customers have access to more channels
 Newer retail formats
 Online stores
 Customers have access to more suppliers
 Increased commoditisation
 Customer loyalty is not assured any more
Market Saturation
 Multiple suppliers operating in each market
 Niche market based on demographics and preferences
 Competition has intensified
Need for speed
 Product life cycles are getting shorter
 Time to market
New way of looking at customer
Customer Relationship Management
 Intimacy, collaboration, one-to-one partnerships are
necessary
Need to ask ...
 What classes of customers do we have ? Are there
subclasses in terms of behaviour ?
 How can we sell more to existing customers ? What
exactly are they buying now ?
 Is there a pattern in the way our customers behave ?
 Who are my good customers ?
 To whom should we sell more ?
 Who are my bad customers ?
 Who are likely to default or defraud ?
Availability of vast amounts of data
ERP and OLTP systems
 With their centralised RDBMSs are huge pools of
firmwide data that can overwhelm even the most
dedicated manager
Datawarehouse
 Technology has resulted in equally huge pools of
historical data
Storage Capacity
 Inexpensive
 Ultra high capacity
Availability of vast amounts of data
Cards
 Credit & Debit Cards
 Loyalty Card
 Result in capture of huge pools of data
Transactional Data Capture
 Point of sales systems, Bar code readers
 Capture vast amount of transaction data at increasing
levels of granularity
 What was sold ? Product, SKU
 When was it sold ? Date, time
 How was it sold ? Discount, Promotions
 Beyond simple sales
 Telephone calls, frequent flyer data
Trends leading to Data Flood
More data is
generated:
 Web, text, images …
 Business transactions,
calls, ...
 Scientific data:
astronomy, biology, etc
More data is
captured:
 Storage technology
faster and cheaper
 DBMS can handle
bigger DB
Data Growth
In 2 years (2003 to 2005),
the size of the largest database TRIPLED!
Coming to the point ....
Data mining
 Is the process of extracting unknown, valid and
actionable information from large databases and
then using this information to make crucial business
decisions.
Database -> Data Mining Tools -> Data Presentation and Visualisation Tools -> Decisions
Knowledge Discovery Definition
Knowledge Discovery in Data is the
non-trivial process of identifying
 valid
 novel
 potentially useful
 and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and
Data Mining, Fayyad, Piatetsky-Shapiro, Smyth,
and Uthurusamy, (Chapter 1), AAAI/MIT Press
1996
Old Wine in new bottles ?
Are these the same as data mining ?
 SQL queries against large databases
 Multidimensional database analysis
 Online analytical processing
 Sophisticated graphic visualisation
 “Classical” statistical analysis
 ANOVA ? Regression ? Correlation ?
What is missing ?
 Discovery of information without a previously
articulated or formulated hypothesis
Related Fields
Statistics
Machine Learning
Databases
Visualization
Data Mining and Knowledge Discovery
Statistics, Machine Learning, Data
Mining
Statistics:
 more theory-based
 more focused on testing hypotheses
Machine learning
 more heuristic
 focused on improving performance of a learning agent
 also looks at real-time learning and robotics – areas not part of
data mining
Data Mining and Knowledge Discovery
 integrates theory and heuristics
 focus on the entire process of knowledge discovery, including
data cleaning, learning, and integration and visualization of
results
Distinctions are fuzzy
Data Mining Tasks
Some Definitions
Instance (also Item or Record):
 an example, described by a number of attributes,
 e.g. a day can be described by temperature,
humidity and cloud status
Attribute or Field
 measuring aspects of the Instance, e.g. temperature
Class (Label)
 grouping of instances, e.g. days good for playing
Major Data Mining Tasks
Classification
 predicting an item class
Clustering
 finding clusters in data
Associations
 e.g. A & B & C occur
frequently
Visualization
 to facilitate human
discovery
Summarization
 describing a group
Deviation Detection
 finding changes
Estimation
 predicting a continuous
value
Link Analysis
 finding relationships
And
 So on ...
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
Clustering
Find “natural” grouping of
instances given un-labeled data
Association Rules & Frequent Itemsets
Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL
Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…
Rules:
Milk => Bread (66%)
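The counts and the rule percentage above can be checked with a few lines of Python. This is a minimal sketch, not a full frequent-itemset miner: the helper names `support_count` and `confidence` are illustrative, and the data is simply the nine transactions listed on the slide.

```python
# The nine market-basket transactions from the slide (TID 1 to 9).
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]


def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for basket in transactions if wanted <= basket)


def confidence(antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support_count(both) / support_count(antecedent)


for items in ({"MILK", "BREAD"}, {"BREAD", "CEREAL"}, {"MILK", "BREAD", "CEREAL"}):
    print(sorted(items), support_count(items))

# Confidence of the rule Milk => Bread, as a percentage.
print(f"Milk => Bread: {100 * confidence({'MILK'}, {'BREAD'}):.1f}%")
```

Milk appears in six baskets and four of those also contain bread, which is where the roughly 66% figure for the rule comes from.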
Visualization & Data Mining
Visualizing the data to
facilitate human
discovery
Presenting the
discovered results in a
visually "nice" way
Summarization
Describe features of the
selected group
Use natural language
and graphics
Usually in Combination
with Deviation detection
or other methods
Average length of stay in this study area rose 45.7 percent,
from 4.3 days to 6.2 days, because ...
Data Mining Central Quest
Find true patterns
and avoid overfitting
(finding seemingly significant
but really random patterns due
to searching too many possibilities)
Classification Methods
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...
Given a set of points from classes
what is the class of new point ?
Classification: Linear Regression
 Linear Regression
w0 + w1 x + w2 y >= 0
 Regression computes
wi from data to
minimize squared error
to ‘fit’ the data
 Not flexible enough
Regression for Classification
 Any regression technique can be used for
classification
 Training: perform a regression for each class, setting the
output to 1 for training instances that belong to class, and 0 for
those that don’t
 Prediction: predict class corresponding to model with largest
output value (membership value)
 For linear regression this is known as multi-
response linear regression
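The training and prediction steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: each instance has a single numeric attribute, the fit is ordinary least squares for a line, and the data points and the names `fit_line`, `train` and `predict` are made up for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x; returns (w0, w1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def train(xs, labels):
    """One regression per class: target 1 for members of the class, 0 otherwise."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1.0 if label == cls else 0.0 for label in labels]
        models[cls] = fit_line(xs, targets)
    return models


def predict(models, x):
    """Pick the class whose model gives the largest output (membership value)."""
    return max(models, key=lambda cls: models[cls][0] + models[cls][1] * x)


# Hypothetical one-attribute training set.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["green", "green", "green", "blue", "blue", "blue"]

models = train(xs, labels)
print(predict(models, 2.5), predict(models, 8.5))
```

With more than one attribute the same idea applies, but each per-class fit becomes a multiple regression instead of a single line.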
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: points plotted on X and Y axes, with splits at X = 2, X = 5 and Y = 3]
DECISION TREE
 An internal node is a test on an attribute.
 A branch represents an outcome of the test,
e.g., Color=red.
 A leaf node represents a class label or class
label distribution.
 At each node, one attribute is chosen to split
training examples into distinct classes as much
as possible
 A new instance is classified by following a
matching path to a leaf node.
Weather Data: Play or not Play?
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No
Note: Outlook is the forecast, no relation to the Microsoft email program
Example Tree for “Play?”
Outlook
  sunny: Humidity
    high: No
    normal: Yes
  overcast: Yes
  rain: Windy
    true: No
    false: Yes
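As a quick check, the tree can be written as a small function and run over the 14 weather instances. This is a sketch under one reading of the tree: sunny days split on Humidity, rainy days split on Windy, and overcast days are always "Yes". The function name `play` and the tuple layout are illustrative choices.

```python
# The 14 weather instances: (outlook, temperature, humidity, windy, play).
WEATHER = [
    ("sunny", "hot", "high", False, "No"),
    ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"),
    ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"),
    ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"),
    ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"),
    ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),
    ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"),
    ("rain", "mild", "high", True, "No"),
]


def play(outlook, humidity, windy):
    """Follow the tree from the root (Outlook) down to a leaf."""
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        return "Yes" if humidity == "normal" else "No"
    # Remaining branch: rain, decided by Windy.
    return "No" if windy else "Yes"


correct = sum(
    play(outlook, humidity, windy) == label
    for outlook, _temperature, humidity, windy, label in WEATHER
)
print(f"{correct} of {len(WEATHER)} training instances classified correctly")
```

Temperature is never tested, which shows that a tree only needs the attributes that help separate the classes. A perfect score on the training data says nothing yet about accuracy on new days.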
Classification: Neural Nets
 Can select more
complex regions
 Can be more accurate
 Also can overfit the
data – find patterns in
random noise
Classification: other approaches
 Naïve Bayes
 Rules
 Support Vector Machines
 Genetic Algorithms
 …
See www.KDnuggets.com/software/
Direct Marketing Paradigm
 Find most likely prospects to contact
 Not everybody needs to be contacted
 Number of targets is usually much smaller than
number of prospects
 Typical Applications
 retailers, catalogues, direct mail (and e-mail)
 customer acquisition, cross-sell, attrition prediction
 ...
Direct Marketing Evaluation
 Accuracy on the entire dataset is not the right
measure
 Approach
 develop a target model
 score all prospects and rank them by decreasing
score
 select top P% of prospects for action
 How do we decide what is the best subset of
prospects ?
Model-Sorted List
No   Score  Target  CustID  Age
1    0.97   Y       1746    …
2    0.95   N       1024    …
3    0.94   Y       2478    …
4    0.93   Y       3820    …
5    0.92   N       4897    …
…    …      …       …       …
99   0.11   N       2734    …
100  0.06   N       2422    …
Use a model to assign score to each customer
Sort customers by decreasing score
Expect more targets (hits) near the top of the list
3 hits in the top 5% of the list
If there are 15 targets overall, then the top 5
has 3/15 = 20% of the targets
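The arithmetic behind this evaluation can be sketched as follows. The scores and target flags are the top five rows of the sorted list, and the list size of 100 and the total of 15 targets are the figures used on the slide. The function name `gains` is an illustrative choice.

```python
# Top five rows of the model-sorted list: (rank, score, is_target).
top_rows = [
    (1, 0.97, True),
    (2, 0.95, False),
    (3, 0.94, True),
    (4, 0.93, True),
    (5, 0.92, False),
]
LIST_SIZE = 100      # customers scored in total
TOTAL_TARGETS = 15   # targets in the whole list, as assumed on the slide


def gains(rows, list_size, total_targets):
    """Return (hits, share of list contacted, share of all targets captured)."""
    hits = sum(1 for _rank, _score, is_target in rows if is_target)
    return hits, len(rows) / list_size, hits / total_targets


hits, share_of_list, share_of_targets = gains(top_rows, LIST_SIZE, TOTAL_TARGETS)
print(f"{hits} hits in the top {share_of_list:.0%} of the list")
print(f"{share_of_targets:.0%} of all targets captured")
```

Contacting 5% of the list reaches 20% of the targets. Repeating the calculation for larger cut-offs gives one way to compare candidate subsets of prospects.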
Data Mining Applications
Data Mining Applications
Science: Chemistry,
Physics, Medicine
 Biochemical analysis
 Remote sensors on a
satellite
 Telescopes – star
galaxy classification
 Medical Image
analysis
Bioscience
 Sequence-based
analysis
 Protein structure and
function prediction
 Protein family
classification
 Microarray gene
expression
Microarrays: Classifying Leukemia
 Leukemia: Acute Lymphoblastic (ALL) vs Acute
Myeloid (AML), Golub et al, Science, v.286,
1999
 72 examples (38 train, 34 test), about 7,000 genes
ALL and AML samples: visually similar, but genetically very different
Best Model: 97% accuracy,
1 error (sample suspected mislabelled)
Microarray Potential Applications
 New and better molecular diagnostics
 Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip,
based on Affymetrix technology
 New molecular targets for therapy
 few new drugs, large pipeline, …
 Improved treatment outcome
 Partially depends on genetic signature
 Fundamental Biological Discovery
 finding and refining biological pathways
 Personalized medicine ?!
Data Mining Applications
Pharmaceutical companies, Insurance and Health care, Medicine
 Drug development
 Identify successful medical therapies
 Claims analysis, fraudulent behavior
 Medical diagnostic tools
 Predict office visits
Financial Industry, Banks, Businesses, E-commerce
 Stock and investment analysis
 Identify loyal customers vs. risky customers
 Predict customer spending
 Risk management
 Sales forecasting
Data Mining Applications
Retail and Marketing
 Customer buying patterns/demographic characteristics
 Mailing campaigns
 Market basket analysis
 Trend analysis
Application: Direct Marketing and CRM
 Most major direct marketing companies are
using modeling and data mining
 Most financial companies are using customer
modeling
 Modeling is easier than changing customer
behaviour
 Example
 Verizon Wireless reduced customer attrition rate from
2% to 1.5%, saving many millions of $
Application: Security and Fraud
Detection
 Credit Card Fraud Detection
 over 20 Million credit cards protected by Neural
networks (Fair, Isaac)
 Securities Fraud Detection
 NASDAQ KDD system
 Phone fraud detection
 AT&T, Bell Atlantic, British Telecom/MCI
Fraud Detection and Management (1)
Applications
 widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
 use historical data to build models of fraudulent
behavior and use data mining to help identify similar
instances
Examples
 auto insurance: detect a group of people who stage
accidents to collect on insurance
 money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
 medical insurance: detect professional patients, rings of
doctors and rings of references
Fraud Detection and Management (2)
Detecting inappropriate medical treatment
 Australian Health Insurance Commission identified
that in many cases blanket screening tests were
requested (saving Australian $1m/yr).
Detecting telephone fraud
 Telephone call model: destination of the call, duration,
time of day or week. Analyze patterns that deviate
from an expected norm.
 British Telecom identified discrete groups of callers
with frequent intra-group calls, especially mobile
phones, and broke a multimillion dollar fraud.
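One simple way to flag calls that "deviate from an expected norm" is a z-score test on a single attribute such as call duration. The sketch below is an illustration only: the slide does not name a technique, so the z-score rule, the threshold of 3 standard deviations, the function name `unusual_calls` and the call history are all assumptions.

```python
from statistics import mean, stdev


def unusual_calls(history, new_calls, threshold=3.0):
    """Return the new call durations lying more than `threshold`
    standard deviations from the mean of the caller's history."""
    mu = mean(history)
    sigma = stdev(history)
    return [call for call in new_calls if abs(call - mu) / sigma > threshold]


# Hypothetical call durations in minutes for one subscriber.
history = [3, 5, 4, 6, 5, 4, 5, 6, 4, 5]
print(unusual_calls(history, [5, 7, 42]))
```

A real call model would combine several attributes (destination, duration, time of day or week) instead of testing one at a time.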
Retail
 Analysts estimate that 38% of retail shrink is due to
dishonest employees.
Application: e-Commerce
 Amazon.com recommendations
 if you bought (viewed) X, you are likely to buy Y
 Netflix
 If you liked "Monty Python and the Holy Grail",
you get a recommendation for "This is Spinal Tap"
 Comparison shopping
 Froogle, mySimon, Yahoo Shopping, …
Example : Processing Loan Applications
Given: questionnaire with financial and
personal information
Problem: should money be lent?
Borderline cases referred to loan officers
But: 50% of accepted borderline cases
defaulted!
Solution:
 reject all borderline cases?
Borderline cases are most active customers!
Enter Machine Learning
Given:
 1000 training examples of borderline cases
20 attributes:
 age, years with current employer, years at current
address, years with the bank, years at current job,
other credit cards
Learned rules predicted 2/3 of borderline
cases correctly!
Rules could be used to explain decisions to
customers
Case study 2: Screening images
Given:
 radar satellite images of coastal waters
Problem:
 detecting oil slicks in those images
Oil slicks = dark regions with changing size and
shape
Look-alike dark regions can be caused by
weather conditions (e.g. high wind)
Expensive process requiring highly trained
personnel
Enter Machine Learning
Dark regions extracted from normalized image
Attributes:
 size of region, shape, area, intensity, sharpness and
jaggedness of boundaries, proximity of other
regions, info about background
Constraints:
 Scarcity of training examples (oil slicks are rare!)
 Unbalanced data: most dark regions aren’t oil slicks
 Regions from same image form a batch
 Requirement is adjustable false-alarm rate
Data Mining Applications ..
Prediction & Description
 Would this customer buy this product ?
 Is this customer likely to leave ?
Relationship Marketing
 What kind of products have been bought by this
customer ?
 What kind of marketing strategy has this customer
responded to ?
Outlier identification and Fraud detection
 Locating unusual cases and behaviours
Customer Profiling & Segmentation
 Is the bottomline that we are all looking at ...
Data Mining Challenges
Computationally expensive to investigate all
possibilities
Dealing with noise/missing information and
errors in data
Choosing appropriate attributes/input
representation
Finding the minimal attribute space
Finding adequate evaluation function(s)
Extracting meaningful information
Not overfitting
Are All “Discovered” Patterns Interesting?
Interestingness measures:
 A pattern is interesting if
 it is easily understood by humans,
 valid on new or test data with some degree of certainty,
 potentially useful,
 novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective measures:
 Objective: based on statistics and structures of
patterns
 support and confidence
 Subjective: based on user’s belief in the data
 unexpectedness, novelty, actionability, etc.
Completeness - Find all the interesting patterns
 Can a data mining system find all the interesting patterns?
 Association vs. classification vs. clustering
Privacy Issues
Data Mining, Privacy, and Security
TIA: Terrorism (formerly Total) Information
Awareness Program
 TIA program closed by Congress in 2003 because of
privacy concerns
However, in 2006 we learn that NSA is
analyzing US domestic call info to find potential
terrorists
 Invasion of Privacy or Needed Intelligence?
Criticism of Analytic Approaches to
Threat Detection:
Data Mining will
 be ineffective - generate millions of false positives
 and invade privacy
First, can data mining be effective?
Can Data Mining and Statistics be Effective
for Threat Detection?
Criticism: Databases have 5% errors, so
analyzing 100 million suspects will generate 5
million false positives
Reality: Analytical models correlate many items
of information to reduce false positives.
Example: Identify one biased coin from 1,000.
 After one throw of each coin, we cannot tell which one is biased
 After 30 throws, one biased coin will stand out with
high probability.
 Can identify 19 biased coins out of 100 million with
a sufficient number of throws
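The coin argument can be made concrete with a short calculation. The slide does not say how biased the coin is, so this sketch assumes the extreme case of a coin that always lands heads; a milder bias would need more throws. The function name `expected_fair_lookalikes` is illustrative.

```python
from fractions import Fraction


def expected_fair_lookalikes(n_fair, throws):
    """Expected number of fair coins that show heads on every one of
    `throws` throws, and so look just like the always-heads coin."""
    return n_fair * Fraction(1, 2) ** throws


# The criticism: a flat 5% error rate applied to 100 million records.
naive_false_positives = 100_000_000 * 5 // 100
print(naive_false_positives)

# One biased coin among 1,000, so 999 fair coins (bias = always heads, an assumption).
print(float(expected_fair_lookalikes(999, 1)))    # about half look biased after one throw
print(float(expected_fair_lookalikes(999, 30)))   # almost none do after 30 throws
```

Each extra throw is another piece of evidence. Combining many of them cuts the expected number of false positives by half per throw, instead of leaving it at a fixed share of the population.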
Another Approach: Link Analysis
Can find unusual patterns in the network structure
Analytic technology can be effective
Data Mining is just one additional tool to help
analysts
Combining multiple models and link analysis
can reduce false positives
Today there are millions of false positives with
manual analysis
Analytic technology has the potential to reduce
the current high rate of false positives
Data Mining with Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
 Replacing sensitive personal data with anonymous IDs
 Give randomized outputs
 Multi-party computation – distributed data
 …
Bayardo & Srikant, Technological Solutions for
Protecting Privacy, IEEE Computer, Sep 2003
Summary
 Data Mining and Knowledge Discovery are
needed to deal with the flood of data
 Knowledge Discovery is a process !
 Avoid overfitting (finding random patterns by
searching too many possibilities)
Additional Resources
www.KDnuggets.com
data mining software, jobs, courses, etc
www.acm.org/sigkdd
ACM SIGKDD – the professional society for
data mining

Business analytics and data mining
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
 

Más de Prithwis Mukerjee

Más de Prithwis Mukerjee (20)

Thought controlled devices
Thought controlled devicesThought controlled devices
Thought controlled devices
 
Cloudcasting
CloudcastingCloudcasting
Cloudcasting
 
Currency, Commodity and Bitcoins
Currency, Commodity and BitcoinsCurrency, Commodity and Bitcoins
Currency, Commodity and Bitcoins
 
Data Science
Data ScienceData Science
Data Science
 
05 OLAP v6 weekend
05 OLAP  v6 weekend05 OLAP  v6 weekend
05 OLAP v6 weekend
 
04 Dimensional Analysis - v6
04 Dimensional Analysis - v604 Dimensional Analysis - v6
04 Dimensional Analysis - v6
 
Thought control
Thought controlThought control
Thought control
 
World of data @ praxis 2013 v2
World of data   @ praxis 2013  v2World of data   @ praxis 2013  v2
World of data @ praxis 2013 v2
 
BIS 08a - Application Development - II Version 2
BIS 08a - Application Development - II Version 2BIS 08a - Application Development - II Version 2
BIS 08a - Application Development - II Version 2
 
Lecture02 - Data Mining & Analytics
Lecture02 - Data Mining & AnalyticsLecture02 - Data Mining & Analytics
Lecture02 - Data Mining & Analytics
 
Data mining clustering-2009-v0
Data mining clustering-2009-v0Data mining clustering-2009-v0
Data mining clustering-2009-v0
 
PPM Lite
PPM LitePPM Lite
PPM Lite
 
OLAP Cubes in Datawarehousing
OLAP Cubes in DatawarehousingOLAP Cubes in Datawarehousing
OLAP Cubes in Datawarehousing
 
Dimensional Modelling
Dimensional ModellingDimensional Modelling
Dimensional Modelling
 
Datawarehousing and Business Intelligence
Datawarehousing and Business IntelligenceDatawarehousing and Business Intelligence
Datawarehousing and Business Intelligence
 
Business Models for Web 2.0
Business Models for Web 2.0Business Models for Web 2.0
Business Models for Web 2.0
 
BIS01 Living On the Web
BIS01 Living On the WebBIS01 Living On the Web
BIS01 Living On the Web
 
BIS03 Data Modelling - I
BIS03 Data Modelling - IBIS03 Data Modelling - I
BIS03 Data Modelling - I
 
BIS04 Data Modelling - II
BIS04 Data Modelling  - IIBIS04 Data Modelling  - II
BIS04 Data Modelling - II
 
BIS06 Physical Database Models
BIS06 Physical Database ModelsBIS06 Physical Database Models
BIS06 Physical Database Models
 

Último

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Último (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 

Data mining intro-2009-v2

  • 2. Prithwis Mukerjee 2 Why Suddenly Data Mining ? Fluid, dynamic business environment  Markets are, or seem to be, saturated  Customers are aggressive and disloyal  Speed is essential  “The quick and the dead” Availability of data  Vast amounts of data are generated, stored electronically and are waiting to be processed ! Availability of tools and techniques  Mathematical tools are available to “process” this data  This processing is significantly different from MIS / EDP style of data processing  Software tools are available to implement some of these mathematical models
  • 3. Prithwis Mukerjee 3 The Business Environment Customer Behaviour  Customers have access to more channels  Newer retail formats  Online stores  Customers have access to more suppliers  Increased commoditisation  Customer loyalty is not assured any more Market Saturation  Multiple suppliers operating in each market  Niche market based on demographics and preferences  Competition has intensified Need for speed  Product life cycles are getting shorter  Time to market
  • 4. Prithwis Mukerjee 4 New way of looking at customer Customer Relationship Management  Intimacy, collaboration, one-to-one partnerships are necessary Need to ask ...  What classes of customers do we have ? Are there subclasses in terms of behaviour ?  How can we sell more to existing customers ? What exactly are they buying now ?  Is there a pattern in the way our customers behave ?  Who are my good customers ?  To whom should we sell more ?  Who are my bad customers ?  Who are likely to default or defraud ?
  • 5. Prithwis Mukerjee 5 Availability of vast amounts of data ERP and OLTP systems  With their centralised RDBMSs are huge pools of firmwide data that can overwhelm even the most dedicated manager Datawarehouse  Technology has resulted in equally huge pools of historical data Storage Capacity  Inexpensive  Ultra high capacity
  • 6. Prithwis Mukerjee 6 Availability of vast amounts of data Cards  Credit & Debit Cards  Loyalty Card  Result in capture of huge pools of data Transactional Data Capture  Point of sales systems, Bar code readers  Capture vast amount of transaction data at increasing levels of granularity  What was sold ? Product, SKU  When was it sold ? Date, time  How was it sold ? Discount, Promotions  Beyond simple sales  Telephone calls, frequent flyer data
  • 7. Prithwis Mukerjee 7 Trends leading to Data Flood More data is generated:  Web, text, images …  Business transactions, calls, ...  Scientific data: astronomy, biology, etc More data is captured:  Storage technology faster and cheaper  DBMS can handle bigger DB
  • 8. Prithwis Mukerjee 8 Data Growth In 2 years (2003 to 2005), the size of the largest database TRIPLED!
  • 9. Prithwis Mukerjee 9 Coming to the point .... Data mining  Is the process of extracting unknown, valid and actionable information from large databases and then using this information to make crucial business decisions. Database Data Mining Tools Data Presentation and Visualisation Tools Decisions
  • 10. Prithwis Mukerjee 10 Knowledge Discovery Definition Knowledge Discovery in Data is the non-trivial process of identifying  valid  novel  potentially useful  and ultimately understandable patterns in data. from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
  • 11. Prithwis Mukerjee 11 Old Wine in new bottles ? Are these the same as data mining ?  SQL queries against large databases  Multidimensional database analysis  Online analytical processing  Sophisticated graphic visualisation  “Classical” statistical analysis  ANOVA ? Regression ? Correlation ? What is missing ?  Discovery of information without a previously articulated or formulated hypothesis
  • 13. Prithwis Mukerjee 13 Statistics, Machine Learning, Data Mining Statistics:  more theory-based  more focused on testing hypotheses Machine learning  more heuristic  focused on improving performance of a learning agent  also looks at real-time learning and robotics – areas not part of data mining Data Mining and Knowledge Discovery  integrates theory and heuristics  focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results Distinctions are fuzzy
  • 15. Prithwis Mukerjee 15 Some Definitions Instance (also Item or Record):  an example, described by a number of attributes,  e.g. a day can be described by temperature, humidity and cloud status Attribute or Field  measuring aspects of the Instance, e.g. temperature Class (Label)  grouping of instances, e.g. days good for playing
  • 16. Prithwis Mukerjee 16 Major Data Mining Tasks Classification  predicting an item class Clustering  finding clusters in data Associations  e.g. A & B & C occur frequently Visualization  to facilitate human discovery Summarization  describing a group Deviation Detection  finding changes Estimation  predicting a continuous value Link Analysis  finding relationships And  So on ...
  • 17. Prithwis Mukerjee 17 Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks, ...
  • 18. Prithwis Mukerjee 18 Clustering Find “natural” grouping of instances given un-labeled data
  • 19. Prithwis Mukerjee 19 Association Rules & Frequent Itemsets
    Transactions:
      TID  Produce
      1    MILK, BREAD, EGGS
      2    BREAD, SUGAR
      3    BREAD, CEREAL
      4    MILK, BREAD, SUGAR
      5    MILK, CEREAL
      6    BREAD, CEREAL
      7    MILK, CEREAL
      8    MILK, BREAD, CEREAL, EGGS
      9    MILK, BREAD, CEREAL
    Frequent Itemsets: Milk, Bread (4); Bread, Cereal (4); Milk, Bread, Cereal (2); …
    Rules: Milk => Bread (66%)
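The counts and the rule confidence on this slide can be re-derived directly from the nine transactions. The sketch below is a brute-force illustration only: it enumerates every itemset of up to three items, which is fine for nine baskets but is not how a real frequent-itemset miner such as Apriori prunes its search. The function names `support_counts` and `confidence` are made up for this example.

```python
from collections import Counter
from itertools import combinations

# The nine transactions from the slide, keyed by TID.
transactions = {
    1: {"MILK", "BREAD", "EGGS"},
    2: {"BREAD", "SUGAR"},
    3: {"BREAD", "CEREAL"},
    4: {"MILK", "BREAD", "SUGAR"},
    5: {"MILK", "CEREAL"},
    6: {"BREAD", "CEREAL"},
    7: {"MILK", "CEREAL"},
    8: {"MILK", "BREAD", "CEREAL", "EGGS"},
    9: {"MILK", "BREAD", "CEREAL"},
}


def support_counts(baskets, max_size=3):
    """Count how many baskets contain each itemset of up to max_size items."""
    counts = Counter()
    for items in baskets:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[frozenset(itemset)] += 1
    return counts


def confidence(counts, antecedent, consequent):
    """Confidence of antecedent => consequent: support(A and C) / support(A)."""
    a = frozenset(antecedent)
    return counts[a | frozenset(consequent)] / counts[a]


counts = support_counts(transactions.values())
for itemset in ({"MILK", "BREAD"}, {"BREAD", "CEREAL"}, {"MILK", "BREAD", "CEREAL"}):
    print(sorted(itemset), counts[frozenset(itemset)])
print("Milk => Bread confidence:", confidence(counts, {"MILK"}, {"BREAD"}))
```

Milk appears in six baskets and four of those also contain Bread, which is where the 66% figure for Milk => Bread comes from.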
  • 20. Prithwis Mukerjee 20 Visualization & Data Mining Visualizing the data to facilitate human discovery Presenting the discovered results in a visually "nice" way
  • 21. Prithwis Mukerjee 21 Summarization Describe features of the selected group Use natural language and graphics Usually in Combination with Deviation detection or other methods Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...
  • 22. Prithwis Mukerjee 22 Data Mining Central Quest Find true patterns and avoid overfitting (finding seemingly significant but really random patterns due to searching too many possibilities)
  • 24. Prithwis Mukerjee 24 Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ... Given a set of points from classes what is the class of new point ?
  • 25. Prithwis Mukerjee 25 Classification: Linear Regression  Linear Regression w0 + w1 x + w2 y >= 0  Regression computes wi from data to minimize squared error to ‘fit’ the data  Not flexible enough
  • 26. Prithwis Mukerjee 26 Regression for Classification  Any regression technique can be used for classification  Training: perform a regression for each class, setting the output to 1 for training instances that belong to class, and 0 for those that don’t  Prediction: predict class corresponding to model with largest output value (membership value)  For linear regression this is known as multi- response linear regression
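The training and prediction steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular tool: it uses a single numeric attribute so that least squares has a simple closed form, the training data is invented, and the helper names `fit_line`, `train` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one attribute)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def train(xs, labels):
    """Fit one regression per class: target 1 for members, 0 for the rest."""
    models = {}
    for cls in sorted(set(labels)):
        indicator = [1.0 if label == cls else 0.0 for label in labels]
        models[cls] = fit_line(xs, indicator)
    return models


def predict(models, x):
    """Return the class whose regression gives the largest output for x."""
    return max(models, key=lambda cls: models[cls][0] + models[cls][1] * x)


# Invented training data: one numeric attribute, two classes.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["a", "a", "a", "b", "b", "b"]
models = train(xs, labels)
print(predict(models, 2.5), predict(models, 7.5))
```

With more attributes the same idea applies, but each per-class fit becomes a multivariate least-squares problem.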
  • 27. Prithwis Mukerjee 27 Classification: Decision Trees [Figure: points plotted on X and Y axes, with axis ticks at X = 2, X = 5 and Y = 3] if X > 5 then blue else if Y > 3 then blue else if X > 2 then green else blue
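The rule on this slide translates almost word for word into code. A minimal sketch, with the function name `classify_point` chosen here for illustration:

```python
def classify_point(x, y):
    """Apply the slide's nested tests in order and return the colour class."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"


print(classify_point(6, 1), classify_point(3, 1), classify_point(1, 1))
```

Only points with 2 < X <= 5 and Y <= 3 come out green; every other region is blue.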
  • 28. Prithwis Mukerjee 28 DECISION TREE  An internal node is a test on an attribute.  A branch represents an outcome of the test, e.g., Color=red.  A leaf node represents a class label or class label distribution.  At each node, one attribute is chosen to split training examples into distinct classes as much as possible  A new instance is classified by following a matching path to a leaf node.
  • 29. Prithwis Mukerjee 29 Weather Data: Play or not Play?
      Outlook   Temperature  Humidity  Windy  Play?
      sunny     hot          high      false  No
      sunny     hot          high      true   No
      overcast  hot          high      false  Yes
      rain      mild         high      false  Yes
      rain      cool         normal    false  Yes
      rain      cool         normal    true   No
      overcast  cool         normal    true   Yes
      sunny     mild         high      false  No
      sunny     cool         normal    false  Yes
      rain      mild         normal    false  Yes
      sunny     mild         normal    true   Yes
      overcast  mild         high      true   Yes
      overcast  hot          normal    false  Yes
      rain      mild         high      true   No
    Note: Outlook is the Forecast, no relation to Microsoft email program
  • 30. Prithwis Mukerjee 30 Example Tree for “Play?”
      Outlook = sunny: test Humidity
          Humidity = high: No
          Humidity = normal: Yes
      Outlook = overcast: Yes
      Outlook = rain: test Windy
          Windy = true: No
          Windy = false: Yes
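One way to check a tree like this is to write it as a function and run it over the training rows. The sketch below assumes the usual reading of the diagram (sunny days split on Humidity, rainy days split on Windy, overcast days always Yes); the function name `play` and the tuple layout are choices made for this example.

```python
# The 14 weather rows as (outlook, temperature, humidity, windy, play).
WEATHER = [
    ("sunny", "hot", "high", False, "No"),
    ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"),
    ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"),
    ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"),
    ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"),
    ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),
    ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"),
    ("rain", "mild", "high", True, "No"),
]


def play(outlook, humidity, windy):
    """Follow the tree from the root (Outlook) down to a leaf label."""
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        return "Yes" if humidity == "normal" else "No"
    # Remaining outlook value is "rain": the leaf depends on Windy.
    return "No" if windy else "Yes"


correct = sum(play(o, h, w) == label for o, _temp, h, w, label in WEATHER)
print(f"{correct} of {len(WEATHER)} training rows classified correctly")
```

Temperature is never tested, which shows that a tree only uses the attributes that help separate the classes. Agreement with the training rows says nothing about accuracy on new days.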
  • 31. Prithwis Mukerjee 31 Classification: Neural Nets  Can select more complex regions  Can be more accurate  Also can overfit the data – find patterns in random noise
  • 32. Prithwis Mukerjee 32 Classification: other approaches  Naïve Bayes  Rules  Support Vector Machines  Genetic Algorithms  … See www.KDnuggets.com/software/
  • 33. Prithwis Mukerjee 33 Direct Marketing Paradigm  Find most likely prospects to contact  Not everybody needs to be contacted  Number of targets is usually much smaller than number of prospects  Typical Applications  retailers, catalogues, direct mail (and e-mail)  customer acquisition, cross-sell, attrition prediction  ...
  • 34. Prithwis Mukerjee 34 Direct Marketing Evaluation  Accuracy on the entire dataset is not the right measure  Approach  develop a target model  score all prospects and rank them by decreasing score  select top P% of prospects for action  How do we decide what is the best subset of prospects ?
  • 35. Prithwis Mukerjee 35 Model-Sorted List
      No   Score  Target  CustID  Age
      1    0.97   Y       1746    …
      2    0.95   N       1024    …
      3    0.94   Y       2478    …
      4    0.93   Y       3820    …
      5    0.92   N       4897    …
      …    …      …       …       …
      99   0.11   N       2734    …
      100  0.06   N       2422    …
    Use a model to assign a score to each customer. Sort customers by decreasing score. Expect more targets (hits) near the top of the list. 3 hits in top 5% of the list. If there are 15 targets overall, then the top 5 has 3/15 = 20% of targets.
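The arithmetic on this slide is a cumulative-gain calculation. A minimal sketch, assuming the list holds 100 customers (so the top 5 rows are the top 5%) and using the Y/N flags of the five highest scores; the function name `cumulative_gain` is made up for this example, and the ratio computed at the end is what is often called lift.

```python
def cumulative_gain(sorted_targets, total_targets, top_n):
    """Fraction of all targets found in the first top_n rows of the sorted list."""
    hits = sum(1 for flag in sorted_targets[:top_n] if flag == "Y")
    return hits / total_targets


# Target flags of the five highest-scoring customers (scores 0.97 down to 0.92).
top_five = ["Y", "N", "Y", "Y", "N"]

gain = cumulative_gain(top_five, total_targets=15, top_n=5)
lift = gain / (5 / 100)  # share of targets captured vs. share of list contacted
print(f"gain = {gain:.0%}, lift = {lift:.1f}")
```

Contacting 5% of the list reaches 20% of the targets, four times what a random 5% sample would be expected to reach.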
  • 37. Prithwis Mukerjee 37 Data Mining Applications Science: Chemistry, Physics, Medicine  Biochemical analysis  Remote sensors on a satellite  Telescopes – star galaxy classification  Medical Image analysis Bioscience  Sequence-based analysis  Protein structure and function prediction  Protein family classification  Microarray gene expression
  • 38. Prithwis Mukerjee 38 Microarrays: Classifying Leukemia  Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999  72 examples (38 train, 34 test), about 7,000 genes ALL AML Visually similar, but genetically very different Best Model: 97% accuracy, 1 error (sample suspected mislabelled)
  • 39. Prithwis Mukerjee 39 Microarray Potential Applications  New and better molecular diagnostics  Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip, based on Affymetrix technology  New molecular targets for therapy  few new drugs, large pipeline, …  Improved treatment outcome  Partially depends on genetic signature  Fundamental Biological Discovery  finding and refining biological pathways  Personalized medicine ?!
  • 40. Prithwis Mukerjee 40 Pharmaceutical companies, Insurance and Health care, Medicine  Drug development  Identify successful medical therapies  Claims analysis, fraudulent behavior  Medical diagnostic tools  Predict office visits Data Mining Applications Financial Industry, Banks, Businesses, E- commerce  Stock and investment analysis  Identify loyal customers vs. risky customer  Predict customer spending  Risk management  Sales forecasting
  • 41. Prithwis Mukerjee 41 Retail and Marketing  Customer buying patterns/demographic characteristics  Mailing campaigns  Market basket analysis  Trend analysis Data Mining Applications
  • 42. Prithwis Mukerjee 42 Application: Direct Marketing and CRM  Most major direct marketing companies are using modeling and data mining  Most financial companies are using customer modeling  Modeling is easier than changing customer behaviour  Example  Verizon Wireless reduced customer attrition rate from 2% to 1.5%, saving many millions of $
  • 43. Prithwis Mukerjee 43 Application: Security and Fraud Detection  Credit Card Fraud Detection  over 20 Million credit cards protected by Neural networks (Fair, Isaac)  Securities Fraud Detection  NASDAQ KDD system  Phone fraud detection  AT&T, Bell Atlantic, British Telecom/MCI
  • 44. Prithwis Mukerjee 44 Fraud Detection and Management (1) Applications  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. Approach  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances Examples  auto insurance: detect a group of people who stage accidents to collect on insurance  money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)  medical insurance: detect professional patients and ring of doctors and ring of references
  • 45. Prithwis Mukerjee 45 Fraud Detection and Management (2) Detecting inappropriate medical treatment  Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr). Detecting telephone fraud  Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.  British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. Retail  Analysts estimate that 38% of retail shrink is due to dishonest employees.
  • 46. Prithwis Mukerjee 46 Application: e-Commerce  Amazon.com recommendations  if you bought (viewed) X, you are likely to buy Y  Netflix  If you liked "Monty Python and the Holy Grail", you get a recommendation for "This is Spinal Tap"  Comparison shopping  Froogle, mySimon, Yahoo Shopping, …
  • 47. Prithwis Mukerjee 47 Example : Processing Loan Applications Given: questionnaire with financial and personal information Problem: should money be lent? Borderline cases referred to loan officers But: 50% of accepted borderline cases defaulted! Solution:  reject all borderline cases? Borderline cases are most active customers!
  • 48. Prithwis Mukerjee 48 Enter Machine Learning Given:  1000 training examples of borderline cases 20 attributes:  age, years with current employer, years at current address, years with the bank, years at current job, other credit cards Learned rules predicted 2/3 of borderline cases correctly! Rules could be used to explain decisions to customers
  • 49. Prithwis Mukerjee 49 Case study 2:Screening images Given:  radar satellite images of coastal waters Problem:  detecting oil slicks in those images Oil slicks = dark regions with changing size and shape Look-alike dark regions can be caused by weather conditions (e.g. high wind) Expensive process requiring highly trained personnel
  • 50. Prithwis Mukerjee 50 Dark regions extracted from normalized image Attributes:  size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background Constraints:  Scarcity of training examples (oil slicks are rare!)  Unbalanced data: most dark regions aren’t oil slicks  Regions from same image form a batch  Requirement is adjustable false-alarm rate Enter Machine Learning
  • 51. Prithwis Mukerjee 51 Data Mining Applications .. Prediction & Description  Would this customer buy this product ?  Is this customer likely to leave ? Relationship Marketing  What kind of products have been bought by this customer ?  What kind of marketing strategy has this customer responded to ? Outlier identification and Fraud detection  Locating unusual cases and behaviours Customer Profiling & Segmentation  Is the bottomline that we are all looking at ...
  • 52. Prithwis Mukerjee 52 Data Mining Challenges Computationally expensive to investigate all possibilities Dealing with noise/missing information and errors in data Choosing appropriate attributes/input representation Finding the minimal attribute space Finding adequate evaluation function(s) Extracting meaningful information Not overfitting
  • 53. Prithwis Mukerjee 53 Are All “Discovered” Patterns Interesting? Interestingness measures:  A pattern is interesting if  it is easily understood by humans,  valid on new or test data with some degree of certainty,  potentially useful,  novel, or validates some hypothesis that a user Objective vs. subjective measures:  Objective: based on statistics and structures of patterns  support and confidence  Subjective: based on user’s belief in the data  unexpectedness, novelty, actionability, etc. Completeness - Find all the interesting patterns  Can a data mining system find all the interesting patterns?  Association vs. classification vs. clustering
  • 55. Prithwis Mukerjee 55 Data Mining, Privacy, and Security TIA: Terrorism (formerly Total) Information Awareness Program –  TIA program closed by Congress in 2003 because of privacy concerns However, in 2006 we learn that NSA is analyzing US domestic call info to find potential terrorists  Invasion of Privacy or Needed Intelligence?
  • 56. Prithwis Mukerjee 56 Criticism of Analytic Approaches to Threat Detection: Data Mining will  be ineffective - generate millions of false positives  and invade privacy First, can data mining be effective?
  • 57. Prithwis Mukerjee 57 Can Data Mining and Statistics be Effective for Threat Detection? Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives Reality: Analytical models correlate many items of information to reduce false positives. Example: Identify one biased coin from 1,000.  After one throw of each coin, we cannot  After 30 throws, one biased coin will stand out with high probability.  Can identify 19 biased coins out of 100 million with sufficient number of throws
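The coin example can be made concrete with a small calculation. The slide does not say how strongly the coin is biased, so this sketch assumes the extreme case of a coin that always lands heads, and asks how many of the 999 fair coins would be expected to show heads on every throw as well. The function name `expected_false_positives` is hypothetical.

```python
def expected_false_positives(n_fair_coins, n_throws):
    """Expected number of fair coins that show heads on every one of n_throws."""
    return n_fair_coins * 0.5 ** n_throws


# Assumption: the single biased coin always lands heads, so any fair coin
# that also shows all heads is a false positive.
for throws in (1, 10, 30):
    print(throws, expected_false_positives(999, throws))
```

After one throw about half of the fair coins are indistinguishable from the biased one; after 30 throws the expected number of all-heads fair coins is far below one. A weaker bias would need more throws, but the principle of combining many observations is the same.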
  • 58. Prithwis Mukerjee 58 Another Approach: Link Analysis Can find unusual patterns in the network structure
  • 59. Prithwis Mukerjee 59 Analytic technology can be effective Data Mining is just one additional tool to help analysts Combining multiple models and link analysis can reduce false positives Today there are millions of false positives with manual analysis Analytic technology has the potential to reduce the current high rate of false positives
  • 60. Prithwis Mukerjee 60 Data Mining with Privacy Data Mining looks for patterns, not people! Technical solutions can limit privacy invasion  Replacing sensitive personal data with anon. ID  Give randomized outputs  Multi-party computation – distributed data  … Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
  • 61. Prithwis Mukerjee 61 Summary  Data Mining and Knowledge Discovery are needed to deal with the flood of data  Knowledge Discovery is a process !  Avoid overfitting (finding random patterns by searching too many possibilities)
  • 62. Prithwis Mukerjee 62 Additional Resources www.KDnuggets.com data mining software, jobs, courses, etc www.acm.org/sigkdd ACM SIGKDD – the professional society for data mining