Dbm630_lecture01
1. DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Lecture 1
Introduction to
Data Mining and Data Warehousing
Text: Data Mining: Concepts and Techniques, By Jiawei Han
and Micheline Kamber, Morgan Kaufmann Publishers (2006).
ISBN: 978-1558609013
2. Administrative Matters
Course Syllabus
Lecture Notes & Assignments & Quizzes
Course Communication
Announcements, discussion, lecture notes, etc.
Page: http://www.facebook.com/pages/Data-mining-MSIT-RSU/
2 Data Mining and Data Warehousing by Kritsada Sriphaew
3. How will we be evaluated?
Assessment Tasks
Tasks                                    % Scores
Quizzes (approx. 2 times)                20
Assignment (Discussion/Demonstration)    20
Final                                    60
To Pass
At least 60% of the overall scores.
4. Text Books
Mandatory Book
Data Mining: Concepts and Techniques
By Jiawei Han and Micheline Kamber
Morgan Kaufmann Publishers (2006), Second Edition,
ISBN-10: 1558609016, ISBN-13: 978-1558609013
Supplementary Book
Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations
By Ian H. Witten and Eibe Frank
Morgan Kaufmann Publishers (2005), 2nd Edition
ISBN-10: 0120884070, ISBN-13: 978-0120884070
5. Course Description (What will we learn?)
Introduction to data warehousing: characteristics of data warehousing, drawbacks
and benefits of data warehousing, architecture of data warehousing, internal data
structure for data warehousing, data integration, creating high-quality data, data
marts, and online analytical processing (OLAP). Introduction to data mining: types
of data for mining, architecture of a typical data mining system, data preprocessing,
association rule mining, classification and prediction, clustering, mining complex
data, data mining applications, current trends in data mining, text mining, and web
mining, including tools for data mining analysis such as WEKA and SAS.
6. Course Schedule (tentative)
Week  Date    Topics
1     8 JAN   Introduction to Data Mining and Data Warehousing
2     15 JAN  Data Warehouse and OLAP Technology – I
3     22 JAN  Data Warehouse and OLAP Technology – II
4     29 JAN  Data Mining Concepts and Data Preparation
5     5 FEB   Association Rule Mining
6     12 FEB  Classification Model: Decision Tree, Classification Rules
7     19 FEB  Classification Model: Naïve Bayes
8     26 FEB  Prediction Model: Regression
9     4 MAR   Clustering
10    11 MAR  Data Mining Application: Text Mining, Web Mining, Social Network Analysis
11    18 MAR  Introduction to Data Mining Tool: WEKA
12    25 MAR  Tutorials
Final
7. Prerequisites
Basic Database Concepts
Basic Statistics:
Probability, Sampling, Logic, Linear Regression, …
Algorithms:
Basic Data Structures, Dynamic Programming, ...
We will provide some background, but the class will move
faster if you already have some of these basics.
8. Introduction
Motivation: Why mine data?
KDD: Knowledge Discovery in Databases
What is Data Mining?
Data Mining: on What kind of Data?
Data Mining Tasks
Data Mining Applications
9. Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network
DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational,
OO, deductive, etc.) and application-oriented DBMS
(spatial, scientific, engineering, etc.)
1990s—2000s:
Data mining and data warehousing, multimedia
databases, and Web databases
10. Large Data Sets: A Motivation
There is often information “hidden” in the data that
is not readily evident.
Human analysts take weeks to discover useful
information.
Much of the data is never analyzed at all.
How do you explore millions of
records, tens or hundreds of
fields, and find patterns?
11. KDD Process
(Knowledge Discovery in Databases)
Data
  -> (Selection) -> Target Data
  -> (Preprocessing) -> Preprocessed Data
  -> (Data Mining) -> Patterns
  -> (Interpretation/Evaluation) -> Knowledge
adapted from:
U. Fayyad, et al. (1996), “From Data Mining to Knowledge Discovery: An
Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et
al. (Eds.), AAAI/MIT Press
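The KDD stages (Selection, Preprocessing, Data Mining, Interpretation/Evaluation) can be read as a chain of functions, each turning one kind of data into the next. The following is only a minimal Python sketch of that idea: the records, the field names, the store filter, and the pair-counting "mining" step are hypothetical toy stand-ins, not material from the lecture.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data: one record per purchased item.
RAW = [
    {"customer": "c1", "item": "Bread", "store": "A"},
    {"customer": "c1", "item": "milk ", "store": "A"},
    {"customer": "c2", "item": "bread", "store": "A"},
    {"customer": "c2", "item": "MILK", "store": "A"},
    {"customer": "c3", "item": "bread", "store": "A"},
    {"customer": "c3", "item": None, "store": "A"},
    {"customer": "c4", "item": "soda", "store": "B"},
]

def select(data, store):
    """Selection: Data -> Target Data (keep only the records of interest)."""
    return [r for r in data if r["store"] == store]

def preprocess(target):
    """Preprocessing: drop missing values and normalize item names."""
    return [
        (r["customer"], r["item"].strip().lower())
        for r in target
        if r["item"] is not None
    ]

def mine(preprocessed):
    """Data Mining: count how many customers bought each pair of items."""
    baskets = {}
    for customer, item in preprocessed:
        baskets.setdefault(customer, set()).add(item)
    pairs = Counter()
    for items in baskets.values():
        pairs.update(combinations(sorted(items), 2))
    return pairs

def interpret(patterns, min_count):
    """Interpretation/Evaluation: keep only patterns seen often enough."""
    return {pair: n for pair, n in patterns.items() if n >= min_count}

def kdd(data):
    target = select(data, store="A")
    preprocessed = preprocess(target)
    patterns = mine(preprocessed)
    return interpret(patterns, min_count=2)

knowledge = kdd(RAW)
print(knowledge)
```

In this toy run only the pair (bread, milk) survives the evaluation step, because it is the only pair bought by at least two customers. In practice each stage is far more involved, and the process is usually iterative rather than a single pass.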
13. Business Intelligence (BI) vs. Data Mining
Business Intelligence (BI) is a term for the processes, techniques, and tools
that support business decisions using information technology.
Layers, from highest to lowest potential to support business decisions:
  Making Decisions (End User)
  Data Presentation: Visualization Techniques (Business Analyst)
  Data Mining: Knowledge Discovery (Data Analyst)
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP (DBA)
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
14. Terminology
Data Mining
A step in the knowledge discovery process consisting of
particular algorithms (methods) that under some
acceptable objective, produces a particular enumeration
of patterns (models) over the data.
Knowledge Discovery Process
The process of using data mining methods (algorithms)
to extract (identify) what is deemed knowledge according
to the specifications of measures and thresholds, using a
database along with any necessary preprocessing or
transformations.
15. Other definitions of Data Mining
Non‐trivial extraction of implicit, previously unknown
and useful information from data
Automatic or semi-automatic process for analyzing
large databases to find patterns that are:
valid: hold on new data with some certainty
novel: non‐obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to interpret the
pattern
16. Origins of Data Mining
Overlaps with various fields, but focuses on:
  Scalability
  Algorithms and architecture
  Automation to handle large data
17. Data Mining: on What kind of Data?
Relational Databases
Data Warehouses
Transactional Databases
Advanced Database Systems
  Object-Relational
  Spatial and Temporal
  Time-Series
  Multimedia
  Text
  Heterogeneous, Legacy, and Distributed
  WWW
[Figure: example data types. Structure (3D anatomy), Function (1D signal), Metadata (annotation), and a GeneFilter Comparison Report table of gene intensities]
20. Ex: Market Basket Analysis
Where should detergents be placed in the
store to maximize their sales?
Are window cleaning products purchased
when detergents and orange juice are
bought together?
Is soda typically purchased with bananas?
Does the brand of soda make a difference?
How are the demographics of the
neighborhood affecting what customers
are buying?
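Questions like these are usually answered by measuring how often items occur together. As a rough illustration, the Python sketch below computes the standard support, confidence, and lift measures for a rule "antecedent -> consequent". The transactions are invented for this example and the function names are hypothetical; this is a sketch under those assumptions, not the method used in the course.

```python
# Hypothetical shopping baskets, one set of items per transaction.
TRANSACTIONS = [
    {"detergent", "orange juice", "window cleaner"},
    {"detergent", "orange juice"},
    {"detergent", "window cleaner"},
    {"soda", "bananas"},
    {"soda", "bananas", "orange juice"},
    {"bananas"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent. Assumes the antecedent occurs at least once."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to how common the consequent is overall.
    Values above 1 suggest the items occur together more than by chance."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

# Is soda typically purchased with bananas?
soda_conf = confidence({"soda"}, {"bananas"}, TRANSACTIONS)
soda_lift = lift({"soda"}, {"bananas"}, TRANSACTIONS)

# Are window cleaners purchased when detergent and orange juice are bought together?
cleaner_conf = confidence({"detergent", "orange juice"}, {"window cleaner"}, TRANSACTIONS)

print(soda_conf, soda_lift, cleaner_conf)
```

With this toy data, every basket containing soda also contains bananas, while only half of the baskets with both detergent and orange juice contain a window cleaner. Real market basket analysis searches over many candidate rules, which is the subject of association rule mining later in the course.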
21. Ex: Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
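One simple way to flag "significant deviations from normal behavior" is to measure how far each value lies from the median, scaled by the median absolute deviation (MAD). The Python sketch below assumes a single numeric feature (hypothetical card transaction amounts) and uses the commonly cited modified z-score convention (constant 0.6745, cutoff 3.5). Real fraud and intrusion detection systems use many features and more elaborate models, so treat this only as an illustration.

```python
from statistics import median

def find_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds threshold.

    Modified z-score = 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: this sketch reports nothing
        # rather than dividing by zero.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transaction amounts for one card; 950 is far from the usual range.
amounts = [25, 30, 22, 28, 27, 31, 24, 950]
print(find_anomalies(amounts))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value inflates the standard deviation enough to hide itself in a small sample.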
22. Some Success Stories
Network intrusion detection using a combination of sequential
rule discovery and classification trees on 4 GB of DARPA data
  Outperformed the (manual) knowledge engineering approach
  http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good,
  detailed description of the entire process
Major US bank: customer attrition prediction
  Segmented customers based on financial behavior: 3 segments
  Built attrition models for each of the 3 segments
  40-50% of attritions were predicted, a factor-of-18 increase
Targeted credit marketing: major US banks
  Found customer segments based on 13 months of credit balances
  Built another response model based on surveys
  Increased the response rate 4 times (to 2%)
23. How You'll Benefit
Confidently discuss the role and applicability of data
warehousing and data mining to
business/organization problems
Gain background knowledge to explore further in
your thesis, independent study, or career
projects, since data mining methods (for extracting
knowledge from data) are useful in nearly every
field.
24. Assignment
Assignments will aim to test your detailed knowledge
and understanding of the topics, as well as your
critical thinking and research ability. Assignments may
include tasks involving: writing detailed designs;
reading research papers; learning and using specialist
software/hardware.
Assessment: the assignment will be worth 20% of the
total course assessment.
25. PreTest
1. Select only one of the following items to fill in the blanks.
(a) Characterization/Discrimination
(b) Classification
(c) Numeric Prediction
(d) Clustering
(e) Association Analysis
(f) Trend Analysis
Which function matches each of the following tasks?
______(1) To estimate the price of stock A next month
______(2) To display the proportion of products sold, according to their types
______(3) To know which products are likely to be sold together with which other products
______(4) To group customers into sets of similar customers based on their features
______(5) To find the value of an experiment when a substance is tested
______(6) To predict whether a customer tends to be a good customer or not
2. Assume that we want to design a model to forecast tomorrow’s SET index.
Please suggest the details of the model that we should construct and
recommend the inputs and outputs of the model.