5. Samsung
[Diagram: branch databases in Mumbai, Delhi, Chennai and Bangalore; the Sales Manager needs sales per item type per branch for the first quarter.]
6. • The sales manager now wants to know the
sales for the first quarter.
• Solution
– Extract information from each database, store it in
a single place, and process it using operational
systems.
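A minimal sketch of the consolidation idea above, assuming each branch exposes its sales as simple records. The branch names come from the slides; the item names, row layout and figures are made-up illustration data, and `consolidate` and `sales_per_item_per_branch` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical per-branch "databases": each row is (item_type, month, units_sold).
BRANCH_DATABASES = {
    "Mumbai": [("TV", 1, 10), ("Phone", 2, 25), ("TV", 3, 5)],
    "Delhi": [("Phone", 1, 30), ("TV", 4, 8)],  # month 4 falls outside Q1
    "Chennai": [("Phone", 3, 12)],
    "Bangalore": [("TV", 2, 7), ("TV", 3, 3)],
}


def consolidate(branch_databases):
    """Copy every branch's rows into one central list, tagging each row with its branch."""
    central = []
    for branch, rows in branch_databases.items():
        for item_type, month, units in rows:
            central.append(
                {"branch": branch, "item": item_type, "month": month, "units": units}
            )
    return central


def sales_per_item_per_branch(central, months=(1, 2, 3)):
    """Total units per (branch, item type) for the given months (default: first quarter)."""
    totals = defaultdict(int)
    for row in central:
        if row["month"] in months:
            totals[(row["branch"], row["item"])] += row["units"]
    return dict(totals)


if __name__ == "__main__":
    report = sales_per_item_per_branch(consolidate(BRANCH_DATABASES))
    for (branch, item), units in sorted(report.items()):
        print(f"{branch:10s} {item:6s} {units}")
```

Once the rows sit in one place, the manager's question becomes a single aggregation instead of four separate queries.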
7. Solution
[Diagram: data from the Mumbai, Delhi, Chennai and Bangalore branches is loaded into a Data Warehouse; Query & Analysis tools produce a Report for the Sales Manager.]
8. Operational Systems
• Run the business in real time
• Handle routine tasks
• Decision Support Systems (DSS)
– Help in taking actions
• Used by people who deal with customers and
products
• Increasingly used by customers themselves
9. Data Warehouse
• A single, complete and consistent store of
data obtained from a variety of different
sources, made available to end users in a way
they can understand and use in a business
context.
• A process of transforming data into
information and making it available to users in
a timely enough manner to make a difference.
12. [Architecture diagram: Source Data (Production, Internal, Archived, External) → Data Staging → Data Storage (Data Warehouse DBMS, Data Marts, MDDB) → Information Delivery (Report/Query, Data Mining), with Metadata and Management & Control alongside.]
13. Components
• Source Data
• Data Staging (data extraction, cleaning and loading)
– Example: Talend, an open source ETL tool
• Data Storage
• Information Delivery (EIS)
• Management and Control
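The data staging component (extraction, cleaning, loading) can be sketched in a few lines. This is an illustrative outline only, not how Talend or any other ETL tool works internally; the field names (`order_id`, `city`, `item`, `amount`), the sample rows and the cleaning rules are all assumptions made for the example.

```python
# Hypothetical raw rows from two of the source-data categories.
RAW_SOURCES = {
    "production": [
        {"order_id": 1, "city": " mumbai ", "item": "tv", "amount": "100.5"},
        {"order_id": 2, "city": "Delhi", "item": "Phone", "amount": "n/a"},
    ],
    "archived": [
        {"order_id": 1, "city": "Mumbai", "item": "TV", "amount": "100.5"},
        {"order_id": 3, "city": "chennai", "item": "phone", "amount": 80},
    ],
}


def extract(raw_sources):
    """Extraction: gather raw rows from every source into one list."""
    return [row for rows in raw_sources.values() for row in rows]


def clean(rows):
    """Cleaning: normalise text, reject unusable rows, drop repeated order ids."""
    seen, cleaned = set(), []
    for row in rows:
        city = str(row.get("city", "")).strip().title()
        item = str(row.get("item", "")).strip().upper()
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            continue  # amount missing or not numeric: reject the row
        if not city or not item or amount < 0:
            continue
        if row.get("order_id") in seen:
            continue  # same order already staged from another source
        seen.add(row.get("order_id"))
        cleaned.append(
            {"order_id": row.get("order_id"), "city": city, "item": item, "amount": amount}
        )
    return cleaned


def load(rows, warehouse):
    """Loading: append the cleaned rows to the target store and return the count."""
    warehouse.extend(rows)
    return len(rows)


if __name__ == "__main__":
    warehouse = []
    loaded = load(clean(extract(RAW_SOURCES)), warehouse)
    print(loaded, warehouse)
```

Keeping the three steps as separate functions mirrors the staging area's role: data is only loaded after it has been made consistent.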
14. OLAP
• Online Analytical Processing tools
• DSS tools that use multidimensional data
analysis techniques
– Support for a DSS data store
– Data extraction and integration filter
– Specialized presentation interface
• Example: Oracle OLAP 11g
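To make "multidimensional data analysis" concrete, here is a small sketch of two common OLAP operations, roll-up and slice, over a toy fact table. It does not use or imitate the Oracle OLAP API; the dimensions (`branch`, `item`, `month`), the figures and the helper names `roll_up` and `slice_facts` are assumptions for illustration.

```python
from collections import defaultdict

DIMENSIONS = ("branch", "item", "month")

# Hypothetical fact rows: one value per dimension, followed by the measure (units sold).
FACTS = [
    ("Mumbai", "TV", 1, 10),
    ("Mumbai", "TV", 2, 4),
    ("Mumbai", "Phone", 1, 6),
    ("Delhi", "TV", 1, 3),
    ("Delhi", "Phone", 2, 9),
]


def roll_up(facts, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[p] for p in positions)
        totals[key] += fact[-1]
    return dict(totals)


def slice_facts(facts, dimension, value):
    """Keep only the facts where `dimension` equals `value`."""
    position = DIMENSIONS.index(dimension)
    return [fact for fact in facts if fact[position] == value]


if __name__ == "__main__":
    print(roll_up(FACTS, ("branch",)))          # totals per branch
    print(roll_up(FACTS, ("branch", "item")))   # totals per branch and item
    print(roll_up(slice_facts(FACTS, "month", 1), ("item",)))  # January only, per item
```

The same fact table answers questions at several levels of detail simply by changing which dimensions are kept.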
17. 12 Rules of Data Warehouse
1. Data Warehouse and Operational
Environments are Separated
2. Data is integrated
3. Contains historical data over a long period of
time
4. Data is snapshot data captured at a given
point in time
5. Data is subject-oriented
18. 12 Rules of Data Warehouse (contd.)
6. Mainly read-only with periodic batch updates
7. Development life cycle has a data-driven
approach versus the traditional process-driven
approach
8. Data contains several levels of detail
– Current, Old, Lightly Summarized, Highly
Summarized
19. 12 Rules of Data Warehouse (contd.)
9. Environment is characterized by read-only
transactions to very large data sets
10. System that traces data sources, transformations,
and storage
11. Metadata is a critical component
– Source, transformation, integration, storage, relationships,
history, etc.
12. Contains a chargeback mechanism for resource
usage that enforces optimal use of data by end users
20. OLTP v/s Data Warehousing
OLTP                                  | Data Warehousing
• Application oriented                | • Subject oriented
• Used to run business                | • Used to analyze business
• Detailed data                       | • Summarized and refined
• Current, up to date                 | • Snapshot data
• Isolated data                       | • Integrated data
• Repetitive access                   | • Ad-hoc access
• Performance sensitive               | • Performance relaxed
• Few records accessed at a time      | • Large volume accessed at a time
• Read/update access                  | • Mostly read
21. Data Warehouse summary
• Integrated platform for OLAP and DSS
• Helps optimize business operations
• Easy access to multidimensional data
23. Why Data Mining?
• Wealth generation
• Analyzing trends
• Strategic decision making
• Security
24. Data Mining
• Look for hidden patterns and trends in data
that are not immediately apparent from
summarizing the data
• No query…
• …but “interestingness criteria”
25. Data Mining
Data + Interestingness criteria = Hidden patterns
26. Data Mining
Data + Interestingness criteria = Hidden patterns (type of patterns)
27. Data Mining
Data (type of data) + Interestingness criteria (type of interestingness criteria) = Hidden patterns
28. Type of Data
• Tabular (Ex: Transaction data)
– Relational
– Multi-dimensional
• Tree (Ex: XML data)
• Graphs
• Sequence (Ex: DNA, activity logs)
• Text, Multimedia …
29. Type of Interestingness
• Frequency
• Rarity
• Correlation
• Length of occurrence (for sequence and temporal data)
• Consistency
• Repeating / periodicity
• “Abnormal” behavior
• Other patterns of interestingness…
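Two of the criteria above, frequency and rarity, can be expressed as simple thresholds on how often an item occurs. The sketch below is one possible reading, assuming "frequency" means the share of transactions that contain an item; the basket contents and threshold values are made up, and the other criteria (correlation, periodicity and so on) are not covered.

```python
from collections import Counter

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "caviar"},
    {"milk"},
]


def item_support(transactions):
    """Fraction of transactions that contain each item."""
    total = len(transactions)
    if total == 0:
        return {}
    counts = Counter(item for basket in transactions for item in set(basket))
    return {item: count / total for item, count in counts.items()}


def frequent_items(transactions, min_support):
    """Frequency criterion: items present in at least `min_support` of transactions."""
    return {i for i, s in item_support(transactions).items() if s >= min_support}


def rare_items(transactions, max_support):
    """Rarity criterion: items present in at most `max_support` of transactions."""
    return {i for i, s in item_support(transactions).items() if s <= max_support}


if __name__ == "__main__":
    print(sorted(frequent_items(TRANSACTIONS, 0.5)))
    print(sorted(rare_items(TRANSACTIONS, 0.25)))
```

Note that no query names the items in advance: the criterion and its threshold decide what is reported, which is the contrast the slides draw with ordinary querying.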
30. Data Mining vs Statistical Inference
Statistics: Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof” (Validation of Hypothesis)
31. Data Mining vs Statistical Inference
Data mining: Data → Mining Algorithm (based on Interestingness) → Pattern (model, rule, hypothesis) discovery
32. Used for..
• Data mining is used for
– Frequent Item-sets
– Associations
– Classifications
– Clustering
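Frequent item-sets and associations, the first two uses listed above, can be illustrated with a level-wise Apriori search. This is a plain textbook-style sketch, not an optimized implementation; the transactions, the support threshold and the helper names `apriori` and `confidence` are assumptions for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / total

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({c: support(c) for c in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, from stored supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


if __name__ == "__main__":
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    found = apriori(baskets, min_support=0.5)
    for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), s)
    print(confidence(found, {"butter"}, {"bread"}))
```

The prune step is what keeps the search tractable: an item-set is only counted if all of its smaller subsets were already found frequent.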
33. Techniques
• Algorithms
– Apriori algorithm
– Decision tree
• SLIQ
– Supervised Learning in QUEST
– IBM
• “GROUP BY”
mysql> select sum(sal),deptno from emp group by deptno;
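The GROUP BY query above can be tried without a MySQL server. The sketch below runs the same statement against an in-memory SQLite database through Python's standard `sqlite3` module, so it is a stand-in rather than the MySQL session from the slide. The `sal` and `deptno` columns come from the query; the `ename` column and the sample rows are assumptions, and an `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3


def salary_by_department(rows):
    """Load (ename, sal, deptno) rows into a temporary emp table and total sal per deptno."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE emp (ename TEXT, sal REAL, deptno INTEGER)")
        conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT SUM(sal), deptno FROM emp GROUP BY deptno ORDER BY deptno"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical employees: (name, salary, department number).
    sample = [("A", 1000, 10), ("B", 1500, 10), ("C", 2000, 20)]
    for total, deptno in salary_by_department(sample):
        print(deptno, total)
```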
34. Data Mining Summary
• Helps in pattern analysis and thus in taking
actions, both real-time and future-oriented.
• Analyzes trends and clusters in business
operations.