A Brief Presentation on Data Mining
Jason Rodrigues
Data Mining
Agenda
• Introduction
• What is Data Mining?
• Challenges and Trends
• The origin of DM
• Data Mining Tasks
• Types of Data
• Data Quality
• Takeaways
What is Data Mining?
Definition:

Data mining is the process of extracting patterns from
data. (Wikipedia)

Process of semi-automatically analyzing large databases
to find patterns that are:

valid: hold on new data with some certainty

novel: non-obvious to the system

useful: should be possible to act on the item

understandable: humans should be able to interpret the pattern
Other Definitions:

Non-trivial extraction of implicit, previously unknown and
potentially useful information from data

Exploration and analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns
What is Data Mining? (Other Definitions)

Extracting or “mining” knowledge from large amounts of
data

Data-driven discovery and modeling of hidden patterns
(that we never knew existed) in large volumes of data

Extraction of implicit, previously unknown and
unexpected, potentially extremely useful information
from data
What is Data Mining?
– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine
according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
Moore's Law:

The law is named after Intel co-founder Gordon E. Moore, who described the
trend in his 1965 paper. The paper noted that the number of components in
integrated circuits had doubled every year from the invention of the integrated
circuit in 1958 until 1965, and predicted that the trend would continue "for at least
ten years".

The number of transistors that can be placed inexpensively on an integrated
circuit has doubled approximately every two years. The trend has continued for
more than half a century and is not expected to stop until 2015 or later.
What is Data Mining? History

Data Analysis?

Data Warehouse?

Data Mining and Statistics
− Standard Deviation

Data Mining and AI and Machine Learning
− Identify Possible Heart Attack cases
Timeline: 1980s, 1990s, 2000s, 2010s
Data Mining Challenges and Trends

Commercial Viewpoint
Data Mining Challenges and Trends

Scientific Viewpoint
Data Mining Challenges and Trends
• Computationally expensive to investigate all possibilities
• Dealing with noise, missing information, and errors in data
• Choosing appropriate attributes/input representation
• Finding the minimal attribute space
• Finding adequate evaluation function(s)
• Extracting meaningful information
• Not overfitting
Data Mining Challenges and Trends
[Figure: Predictive Analysis. Axes: Role of Software (Passive, Interactive, Proactive) against Business Insight. Stages: Presentation, Exploration, Discovery. Tools shown: canned reporting, ad-hoc reporting, online analytical processing, data mining.]
Data Mining Tasks

Prediction Methods
− Use some variables to predict unknown or
future values of other variables.

Description Methods
− Find human-interpretable patterns that
describe the data.
Data Mining Tasks

Classification [Predictive]

Clustering [Descriptive]

Association Rule Discovery [Descriptive]

Sequential Pattern Discovery [Descriptive]

Regression [Predictive]

Deviation Detection [Predictive]
Data Mining Tasks
• Concept/class description: characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Association (correlation and causality)
– Multi-dimensional or single-dimensional association
age(X, “20-29”) ∧ income(X, “60-90K”) → buys(X, “TV”)
Data Mining Tasks
• Classification and Prediction
– Finding models (functions) that describe and distinguish classes or concepts for future prediction
– Example: classify countries based on climate, or classify cars based on gas mileage
– Presentation: IF-THEN rules, decision tree, classification rule, neural network
– Prediction: predict some unknown or missing numerical values
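As a minimal sketch of the IF-THEN presentation, the hand-written rules below classify the ten tax records from the record-data table used elsewhere in these slides. The rule set, the function name predict_cheat, and the 80K income cut-off are illustrative assumptions chosen to fit those ten rows; they are not a model given in the deck.

```python
# Records copied from the deck's example table:
# (Tid, Refund, Marital Status, Taxable Income in thousands, Cheat)
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def predict_cheat(refund, marital_status, income_k):
    """Hypothetical IF-THEN rules; the 80K cut-off is an assumption."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    if income_k < 80:
        return "No"
    return "Yes"


# Compare the rule output with the known class label of every record.
predictions = [predict_cheat(r, m, inc) for _, r, m, inc, _ in RECORDS]
labels = [cheat for *_, cheat in RECORDS]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(RECORDS)
print(f"accuracy on the ten known records: {accuracy:.2f}")
```

The accuracy here is measured on the same rows the rules were written for, so it says nothing about how the rules would hold on new data.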
Data Mining Tasks

Cluster analysis
− Class label is unknown: group data to form
new classes

Example: cluster houses to find distribution
patterns
− Clustering based on the principle:
maximizing the intra-class similarity and
minimizing the interclass similarity
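The clustering principle can be checked numerically on a toy example. The sketch below assumes two hypothetical, already-formed clusters of one-dimensional values and uses absolute difference as the distance, so a small distance stands for high similarity. The data and helper names are made up for illustration.

```python
from itertools import combinations, product
from statistics import mean

# Hypothetical 1-D data (for example house prices in thousands),
# already split into two clusters.
cluster_a = [100, 105, 110]
cluster_b = [300, 310, 320]


def mean_intra_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    return mean(abs(x - y) for x, y in combinations(cluster, 2))


def mean_inter_distance(c1, c2):
    """Average distance between points taken from two different clusters."""
    return mean(abs(x - y) for x, y in product(c1, c2))


intra = mean([mean_intra_distance(cluster_a), mean_intra_distance(cluster_b)])
inter = mean_inter_distance(cluster_a, cluster_b)

# A good clustering has a small intra-cluster distance (high similarity
# within a class) and a large inter-cluster distance (low similarity
# between classes).
print(f"intra = {intra:.1f}, inter = {inter:.1f}")
```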
Data Mining Tasks
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Mostly considered noise or an exception, but quite useful in fraud detection and rare-event analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
Takeaways
What is Data Mining?
Challenges and Trends
The origin of DM
Data Mining Tasks
What is Data?

Collection of data objects and
their attributes

An attribute is a property or
characteristic of an object
− Examples: eye color of a
person, temperature, etc.
− Attribute is also known as
variable, field, characteristic,
or feature

A collection of attributes
describes an object
− Object is also known as
record, point, case, sample,
entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Columns are attributes; rows are objects.
Attribute Values

Attribute values are numbers or symbols assigned to an
attribute

Distinction between attributes and attribute values
− Same attribute can be mapped to different attribute values

Example: height can be measured in feet or meters
− Different attributes can be mapped to the same set of
values

Example: Attribute values for ID and age are integers

But properties of attribute values can be different
− ID has no limit but age has a maximum and
minimum value
Types of Attribute

There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values

The type of an attribute depends on which of
the following properties it possesses:
− Distinctness: = ≠
− Order: < >
− Addition: + -
− Multiplication: * /
− Nominal attribute: distinctness
− Ordinal attribute: distinctness & order
− Interval attribute: distinctness, order & addition
− Ratio attribute: all 4 properties
Attribute type: Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test

Attribute type: Ordinal
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, grades, street numbers
Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute type: Interval
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Operations: mean, standard deviation, Pearson's correlation, t and F tests

Attribute type: Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Operations: geometric mean, harmonic mean, percent variation
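To make the link between attribute type and permissible operations concrete, the sketch below computes one summary per type using Python's standard statistics module. All sample values and variable names are hypothetical.

```python
from statistics import geometric_mean, mean, mode

# Hypothetical sample values for each attribute type.
eye_color = ["brown", "blue", "brown", "green"]         # nominal
grade_order = ["poor", "fair", "good", "excellent"]     # defines the ordering
grades = ["good", "fair", "excellent", "good", "poor"]  # ordinal
temps_celsius = [18.0, 21.0, 24.0]                      # interval
masses_kg = [2.0, 8.0]                                  # ratio

# Nominal: only equality tests are needed to find the most common value.
nominal_summary = mode(eye_color)

# Ordinal: sort by rank and take the middle value; no arithmetic on labels.
ranked = sorted(grades, key=grade_order.index)
ordinal_summary = ranked[(len(ranked) - 1) // 2]

# Interval: differences are meaningful, so the arithmetic mean is allowed.
interval_summary = mean(temps_celsius)

# Ratio: ratios are meaningful, so the geometric mean is allowed as well.
ratio_summary = geometric_mean(masses_kg)

print(nominal_summary, ordinal_summary, interval_summary, round(ratio_summary, 6))
```

Computing a geometric mean of the Celsius temperatures would run without error but would not be meaningful, because Celsius has an arbitrary zero point.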
Discrete and Continuous Attributes

Discrete Attribute
− Has only a finite or countably infinite set of values
− Examples: zip codes, counts, or the set of words in a collection
of documents
− Often represented as integer variables.
− Note: binary attributes are a special case of discrete attributes

Continuous Attribute
− Has real numbers as attribute values
− Examples: temperature, height, or weight.
− Practically, real values can only be measured and represented
using a finite number of digits.
− Continuous attributes are typically represented as floating-point
variables.
Types of Data Sets
Record
− Data Matrix
− Document Data
− Transaction Data
Graph
− World Wide Web
− Molecular Structures
Ordered
− Spatial Data
− Temporal Data
− Sequential Data
− Genetic Sequence Data
Record Data

Data that consists of a collection of
records, each of which consists of a fixed
set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix

If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

Such a data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
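A minimal sketch of this representation, assuming three hypothetical objects described by two numeric attributes (height and weight). Each row is one object and can be treated as a point in n-dimensional space, which is what makes distance computations possible.

```python
import math

# Hypothetical data matrix: m = 3 objects (rows), n = 2 attributes (columns).
# Columns: height in metres, weight in kilograms.
data_matrix = [
    [1.70, 65.0],
    [1.80, 80.0],
    [1.60, 50.0],
]

m = len(data_matrix)     # number of objects
n = len(data_matrix[0])  # number of attributes


def euclidean(p, q):
    """Distance between two objects viewed as points in n-dimensional space."""
    return math.dist(p, q)


d01 = euclidean(data_matrix[0], data_matrix[1])
print(f"{m} x {n} matrix, distance between object 0 and object 1: {d01:.3f}")
```

In this toy matrix the weight column dominates the distance because its values are much larger than the heights. That effect is one reason numeric attributes are often rescaled before mining.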
Document Data

Each document becomes a “term” vector,
− each term is a component (attribute) of the
vector,
− the value of each component is the number of
times the corresponding term occurs in the
document.
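The term-vector idea can be sketched in a few lines. The two documents below are hypothetical, and splitting on whitespace is a simplifying assumption; real text would need proper tokenization.

```python
from collections import Counter

# Two hypothetical documents.
documents = [
    "the team won the game",
    "the coach praised the team and the game plan",
]

# The vocabulary fixes which term each vector component stands for.
vocabulary = sorted({word for doc in documents for word in doc.split()})


def term_vector(doc):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


vectors = [term_vector(doc) for doc in documents]

print(vocabulary)
for vec in vectors:
    print(vec)
```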
Transaction Data

A special type of record data, where
− each record (transaction) involves a set of
items.
− For example, consider a grocery store. The
set of products purchased by a customer
during one shopping trip constitutes a
transaction, while the individual products that
were purchased are the items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
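The five transactions above map naturally onto sets of items. The sketch below stores them that way and evaluates one association rule, {Diaper, Milk} → {Beer}, using support and confidence. These two measures are the standard ones for association rules but are not defined in the slides, and the choice of rule is arbitrary.

```python
# Transactions copied from the slide's table (TID -> set of items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the share that also has the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"Diaper", "Milk", "Beer"})
rule_confidence = confidence({"Diaper", "Milk"}, {"Beer"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```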
Graph Data

[Figure: a generic graph]
Chemical Data
• Benzene molecule: C6H6
Ordered Data

Sequences of transactions
[Figure: a sequence of transactions, with labels "An element of the sequence" and "Items/Events"]
Ordered Data

Genomic Sequences
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data

Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Data Quality

What kinds of data quality problems?

How can we detect problems with the
data?

What can we do about these problems?

Examples of data quality problems:
− Noise and outliers
− Missing values
− Duplicate data
Noise

Noise refers to the modification of original
values
− Examples: distortion of a person’s voice when
talking on a poor phone line, and “snow” on a
television screen
[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]
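The two-panel figure can be approximated in code. The sketch below builds the sum of two sine waves and then adds random noise to every sample. The frequencies (1 Hz and 3 Hz), the Gaussian noise model, and its standard deviation of 0.3 are assumptions made for illustration, since the slide does not state them.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def two_sine_waves(t):
    """Sum of two sine waves; the frequencies are arbitrary choices."""
    return math.sin(2 * math.pi * 1 * t) + math.sin(2 * math.pi * 3 * t)


# Sample one second of signal at 100 points.
times = [i / 100 for i in range(100)]
clean = [two_sine_waves(t) for t in times]

# Noise modifies the original values: add a zero-mean Gaussian disturbance.
noisy = [x + random.gauss(0, 0.3) for x in clean]

max_distortion = max(abs(n - c) for n, c in zip(noisy, clean))
print(f"largest change to any original value: {max_distortion:.3f}")
```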
Outliers

Outliers are data objects with
characteristics that are considerably
different than most of the other data
objects in the data set
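One common way to turn "considerably different" into a test is the 1.5 × IQR rule, sketched below on the taxable incomes from the deck's example table. The rule is a widely used heuristic rather than something the slides prescribe, and the 1.5 multiplier is a convention.

```python
from statistics import quantiles

# Taxable incomes (in thousands) from the deck's example table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Quartiles using the statistics module's default ("exclusive") method.
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1

# Values far outside the middle half of the data are flagged.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in incomes if x < low or x > high]

print(f"fences: [{low}, {high}], outliers: {outliers}")
```

Whether a flagged value is an error or a genuinely interesting case still has to be judged in context, as the fraud-detection example in the outlier-analysis slide suggests.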
Missing Values

Reasons for missing values
− Information is not collected
(e.g., people decline to give their age and
weight)
− Attributes may not be applicable to all cases
(e.g., annual income is not applicable to
children)

Handling missing values
− Eliminate Data Objects
− Estimate Missing Values
− Ignore the Missing Value During Analysis
− Replace with all possible values (weighted by
their probabilities)
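The first three handling strategies can be sketched on a small hypothetical data set in which None marks a value that was not collected. Using the mean as the estimate is only one possible choice.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
people = [
    {"name": "A", "age": 34, "weight": 70.0},
    {"name": "B", "age": None, "weight": 82.5},
    {"name": "C", "age": 51, "weight": None},
    {"name": "D", "age": 29, "weight": 64.0},
]

# 1. Eliminate data objects that have any missing value.
complete = [p for p in people if None not in p.values()]


# 2. Estimate missing values, here with the mean of the observed values.
def impute_mean(records, field):
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


imputed = impute_mean(people, "age")

# 3. Ignore the missing value during analysis: average only what is present.
mean_weight = mean(p["weight"] for p in people if p["weight"] is not None)

print(len(complete), imputed[1]["age"], round(mean_weight, 2))
```

Eliminating objects is the simplest option but here it discards half of the records, which shows why estimation or ignoring is often preferred when data is scarce.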
Duplicate Data

A data set may include data objects that are
duplicates, or almost duplicates, of one
another
− Major issue when merging data from
heterogeneous sources

Examples:
− Same person with multiple email addresses

Data cleaning
− Process of dealing with duplicate data issues
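As a rough sketch of cleaning the "same person with multiple email addresses" case, the code below merges hypothetical contact records whose names match after normalization. Matching on the name alone is an assumption made to keep the example short; it would wrongly merge two different people who share a name.

```python
# Hypothetical contact records merged from two sources.
records = [
    {"name": "Ana Silva", "email": "ana@example.com"},
    {"name": "ana  silva", "email": "a.silva@example.org"},
    {"name": "Raj Patel", "email": "raj@example.com"},
]


def normalize(name):
    """Lower-case and collapse whitespace so near-identical names compare equal."""
    return " ".join(name.lower().split())


# Group all email addresses under one normalized name.
merged = {}
for rec in records:
    key = normalize(rec["name"])
    merged.setdefault(key, set()).add(rec["email"])

for name, emails in merged.items():
    print(name, sorted(emails))
```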
Six Rules of Data Quality
1. Data that is not used cannot be correct for very long
2. Data Quality in an information system is a function of
its use, not its collection
3. Data quality will ultimately be no better than its most
stringent use
4. Data quality problems tend to become worse with
the age of the system
5. The less likely it is that some data element will change,
the more traumatic it will be when it finally does change.
6. Information overload affects data quality
Takeaways
What is Data Mining?
Challenges and Trends
The origin of DM
Data Mining Tasks
Data Types
Data Quality

Más contenido relacionado

La actualidad más candente

Preprocessing
PreprocessingPreprocessing
Preprocessingmmuthuraj
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalitiesRajendran
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
An efficient data preprocessing method for mining
An efficient data preprocessing method for miningAn efficient data preprocessing method for mining
An efficient data preprocessing method for miningKamesh Waran
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unitbhagathk
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Introduction to Data Mining for Newbies
Introduction to Data Mining for NewbiesIntroduction to Data Mining for Newbies
Introduction to Data Mining for NewbiesEunjeong (Lucy) Park
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data MiningIffat Firozy
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data miningDevakumar Jain
 
Datamining - On What Kind of Data
Datamining - On What Kind of DataDatamining - On What Kind of Data
Datamining - On What Kind of Datawina wulansari
 

La actualidad más candente (16)

Preprocessing
PreprocessingPreprocessing
Preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
An efficient data preprocessing method for mining
An efficient data preprocessing method for miningAn efficient data preprocessing method for mining
An efficient data preprocessing method for mining
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
 
02 Data Mining
02 Data Mining02 Data Mining
02 Data Mining
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Introduction to Data Mining for Newbies
Introduction to Data Mining for NewbiesIntroduction to Data Mining for Newbies
Introduction to Data Mining for Newbies
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data mining
 
Datamining - On What Kind of Data
Datamining - On What Kind of DataDatamining - On What Kind of Data
Datamining - On What Kind of Data
 
Data Mining
Data MiningData Mining
Data Mining
 

Similar a Its all about data mining

Wk. 3. Data [12-05-2021] (2).ppt
Wk. 3.  Data [12-05-2021] (2).pptWk. 3.  Data [12-05-2021] (2).ppt
Wk. 3. Data [12-05-2021] (2).pptMdZahidHasan55
 
Data Mining - Introduction and Data
Data Mining - Introduction and DataData Mining - Introduction and Data
Data Mining - Introduction and DataDarío Garigliotti
 
Data mining Basics and complete description
Data mining Basics and complete description Data mining Basics and complete description
Data mining Basics and complete description Sulman Ahmed
 
Data types and Attributes1 (1).pptx
Data types and Attributes1 (1).pptxData types and Attributes1 (1).pptx
Data types and Attributes1 (1).pptxRupaRaj6
 
Data Mining DataLecture Notes for Chapter 2Introduc.docx
Data Mining DataLecture Notes for Chapter 2Introduc.docxData Mining DataLecture Notes for Chapter 2Introduc.docx
Data Mining DataLecture Notes for Chapter 2Introduc.docxwhittemorelucilla
 
Data Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2IntroducData Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2IntroducOllieShoresna
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the studyanjanishah774
 
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptlect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptDEEPAK948083
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your datahktripathy
 
omama munir 58.pptx
omama munir 58.pptxomama munir 58.pptx
omama munir 58.pptxOmamaNoor2
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Data Science Chapter 2.pdf
Data Science Chapter 2.pdfData Science Chapter 2.pdf
Data Science Chapter 2.pdfMpumelelo Ndlovu
 
•  Compareandcontrastthefourartworksprovided(KongoCr.docx
•  Compareandcontrastthefourartworksprovided(KongoCr.docx•  Compareandcontrastthefourartworksprovided(KongoCr.docx
•  Compareandcontrastthefourartworksprovided(KongoCr.docxdaynamckernon
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesDeepaR42
 
Data Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & NormalityData Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & NormalityIkbal Ahmed
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptxImXaib
 

Similar a Its all about data mining (20)

Wk. 3. Data [12-05-2021] (2).ppt
Wk. 3.  Data [12-05-2021] (2).pptWk. 3.  Data [12-05-2021] (2).ppt
Wk. 3. Data [12-05-2021] (2).ppt
 
Data Mining - Introduction and Data
Data Mining - Introduction and DataData Mining - Introduction and Data
Data Mining - Introduction and Data
 
Data mining Basics and complete description
Data mining Basics and complete description Data mining Basics and complete description
Data mining Basics and complete description
 
Data types and Attributes1 (1).pptx
Data types and Attributes1 (1).pptxData types and Attributes1 (1).pptx
Data types and Attributes1 (1).pptx
 
Data Mining DataLecture Notes for Chapter 2Introduc.docx
Data Mining DataLecture Notes for Chapter 2Introduc.docxData Mining DataLecture Notes for Chapter 2Introduc.docx
Data Mining DataLecture Notes for Chapter 2Introduc.docx
 
Data Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2IntroducData Mining DataLecture Notes for Chapter 2Introduc
Data Mining DataLecture Notes for Chapter 2Introduc
 
chap2_data.ppt
chap2_data.pptchap2_data.ppt
chap2_data.ppt
 
chap2_data.ppt
chap2_data.pptchap2_data.ppt
chap2_data.ppt
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the study
 
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptlect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
 
lect1.ppt
lect1.pptlect1.ppt
lect1.ppt
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your data
 
Pelatihan Data Analitik
Pelatihan Data AnalitikPelatihan Data Analitik
Pelatihan Data Analitik
 
omama munir 58.pptx
omama munir 58.pptxomama munir 58.pptx
omama munir 58.pptx
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Data Science Chapter 2.pdf
Data Science Chapter 2.pdfData Science Chapter 2.pdf
Data Science Chapter 2.pdf
 
•  Compareandcontrastthefourartworksprovided(KongoCr.docx
•  Compareandcontrastthefourartworksprovided(KongoCr.docx•  Compareandcontrastthefourartworksprovided(KongoCr.docx
•  Compareandcontrastthefourartworksprovided(KongoCr.docx
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and Techniques
 
Data Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & NormalityData Analysis in Research: Descriptive Statistics & Normality
Data Analysis in Research: Descriptive Statistics & Normality
 
Classification & Clustering.pptx
Classification & Clustering.pptxClassification & Clustering.pptx
Classification & Clustering.pptx
 

Más de Jason Rodrigues

Johari WIndow in PPT.pptx
Johari WIndow in PPT.pptxJohari WIndow in PPT.pptx
Johari WIndow in PPT.pptxJason Rodrigues
 
Paris Conference on Applied Psychology
Paris Conference on Applied PsychologyParis Conference on Applied Psychology
Paris Conference on Applied PsychologyJason Rodrigues
 
Design and documentation of software architectures
Design and documentation of software architecturesDesign and documentation of software architectures
Design and documentation of software architecturesJason Rodrigues
 
A Sales Approach For Cloud Computing
A Sales Approach For Cloud ComputingA Sales Approach For Cloud Computing
A Sales Approach For Cloud ComputingJason Rodrigues
 

Más de Jason Rodrigues (9)

Johari WIndow in PPT.pptx
Johari WIndow in PPT.pptxJohari WIndow in PPT.pptx
Johari WIndow in PPT.pptx
 
Startup and incubation
Startup and incubationStartup and incubation
Startup and incubation
 
Paris Conference on Applied Psychology
Paris Conference on Applied PsychologyParis Conference on Applied Psychology
Paris Conference on Applied Psychology
 
Rodrigues
RodriguesRodrigues
Rodrigues
 
Safety Presentation
Safety PresentationSafety Presentation
Safety Presentation
 
Design and documentation of software architectures
Design and documentation of software architecturesDesign and documentation of software architectures
Design and documentation of software architectures
 
Wrap up
Wrap upWrap up
Wrap up
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
A Sales Approach For Cloud Computing
A Sales Approach For Cloud ComputingA Sales Approach For Cloud Computing
A Sales Approach For Cloud Computing
 

Último

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Último (20)

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
The State of Passkeys with FIDO Alliance.pptx

It's all about data mining

  • 1. A Brief Presentation on Data Mining Jason Rodrigues Data Mining
  • 2. • Introduction • What is Data Mining? • Challenges and Trends • The origin of DM • Data Mining Tasks • Types of Data • Data Quality • Takeaways Agenda
  • 3.
  • 4. What is Data Mining? Definition: • Data mining is the process of extracting patterns from data. (Wikipedia) • Process of semi-automatically analyzing large databases to find patterns that are: • valid: hold on new data with some certainty • novel: non-obvious to the system • useful: should be possible to act on the item • understandable: humans should be able to interpret the pattern Other Definitions: • Non-trivial extraction of implicit, previously unknown and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
  • 5. What is Data Mining? (Other Definitions) • Extracting or “mining” knowledge from large amounts of data • Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data
  • 6. What is Data Mining? – Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area) – Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com) What is not Data Mining? – Look up phone number in phone directory – Query a Web search engine for information about “Amazon”
  • 7. What is Data Mining? Moore's Law: • The law is named after Intel co-founder Gordon E. Moore, who described the trend in his 1965 paper. The paper noted that the number of components in integrated circuits had doubled every year from the invention of the integrated circuit in 1958 until 1965, and predicted that the trend would continue "for at least ten years". • The number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years. The trend has continued for more than half a century and is not expected to stop until 2015 or later.
  • 8. Moore's Law: What is Data Mining?
  • 9. What is Data Mining? History • Data Analysis? • Data Warehouse? • Data Mining and Statistics − Standard Deviation • Data Mining and AI and Machine Learning − Identify Possible Heart Attack cases Timeline: 1980's, 1990's, 2000's, 2010's
  • 10. Data Mining Challenges and Trends • Commercial Viewpoint
  • 11. Data Mining Challenges and Trends • Scientific Viewpoint
  • 12. Data Mining Challenges and Trends • Computationally expensive to investigate all possibilities • Dealing with noise/missing information and errors in data • Choosing appropriate attributes/input representation • Finding the minimal attribute space • Finding adequate evaluation function(s) • Extracting meaningful information • Not overfitting
  • 13. Data Mining Challenges and Trends [Figure: Predictive Analysis. Labels: Presentation, Exploration, Discovery; Passive, Interactive, Proactive; Role of Software; Business Insight; Canned reporting, Ad-hoc reporting, Online Analytical Processing, Data mining]
  • 14. Data Mining Tasks • Prediction Methods − Use some variables to predict unknown or future values of other variables. • Description Methods − Find human-interpretable patterns that describe the data.
  • 15. Data Mining Tasks • Classification [Predictive] • Clustering [Descriptive] • Association Rule Discovery [Descriptive] • Sequential Pattern Discovery [Descriptive] • Regression [Predictive] • Deviation Detection [Predictive]
  • 16. Data Mining Tasks • Concept/Class description: Characterization and discrimination • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions • Association (correlation and causality) • Multi-dimensional or single-dimensional association: age(X, “20-29”) ∧ income(X, “60-90K”) → buys(X, “TV”)
  • 17. Data Mining Tasks • Classification and Prediction • Finding models (functions) that describe and distinguish classes or concepts for future prediction • Example: classify countries based on climate, or classify cars based on gas mileage • Presentation: IF-THEN rules, decision tree, classification rule, neural network • Prediction: Predict some unknown or missing numerical values
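
To make the IF-THEN presentation of a classifier concrete, here is a minimal sketch in Python. It applies a small hand-written rule set to the ten records of the tax table on slide 21. The rules, the function name predict_cheat and the 80K income threshold are assumptions chosen by eye for illustration; they are not learned from data and are not taken from the slides.

```python
# Records from the tax table on slide 21:
# (refund, marital status, taxable income in thousands, cheat label).
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def predict_cheat(refund, marital_status, income_k):
    """Hand-written IF-THEN rules (an assumption, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Hypothetical threshold picked by eye for this small table.
    return "Yes" if income_k >= 80 else "No"


# Count how many records the rules label the same way as the table does.
correct = sum(
    predict_cheat(refund, status, income) == label
    for refund, status, income, label in RECORDS
)
accuracy = correct / len(RECORDS)
print(f"{correct}/{len(RECORDS)} records classified correctly")
```

Because the rules were fitted by hand to these same ten records, the score says nothing about new data; a real evaluation would hold out records the rules have not seen.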
  • 18. Data Mining Tasks • Cluster analysis − Class label is unknown: group data to form new classes • Example: cluster houses to find distribution patterns − Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
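
The clustering principle can be sketched with a tiny one-dimensional k-means loop. The slides do not name an algorithm, so k-means is a swapped-in technique here, and the house prices and starting centers are made-up values for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then
    move each center to the mean of the values assigned to it."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical house prices (in thousands) and hypothetical starting centers.
house_prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(house_prices, centers=[100, 320])
print(centers, clusters)
```

Values inside each resulting group are close to one another (high intra-class similarity), while the two group centers are far apart (low inter-class similarity).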
  • 19. Data Mining Tasks • Outlier analysis • Outlier: a data object that does not comply with the general behavior of the data • Mostly considered as noise or exception, but is quite useful in fraud detection, rare events analysis • Trend and evolution analysis • Trend and deviation: regression analysis • Sequential pattern mining, periodicity analysis
  • 20. Takeaways What is Data Mining? Challenges and Trends The origin of DM Data Mining Tasks
  • 21. What is Data? • Collection of data objects and their attributes • An attribute is a property or characteristic of an object − Examples: eye color of a person, temperature, etc. − Attribute is also known as variable, field, characteristic, or feature • A collection of attributes describes an object − Object is also known as record, point, case, sample, entity, or instance
        Tid  Refund  Marital Status  Taxable Income  Cheat
        1    Yes     Single          125K            No
        2    No      Married         100K            No
        3    No      Single          70K             No
        4    Yes     Married         120K            No
        5    No      Divorced        95K             Yes
        6    No      Married         60K             No
        7    Yes     Divorced        220K            No
        8    No      Single          85K             Yes
        9    No      Married         75K             No
        10   No      Single          90K             Yes
        (Columns are attributes; rows are objects.)
  • 22. Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values − Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters − Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers • But properties of attribute values can be different − ID has no limit but age has a maximum and minimum value
  • 23. Types of Attribute • There are different types of attributes – Nominal. Examples: ID numbers, eye color, zip codes – Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} – Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit – Ratio. Examples: temperature in Kelvin, length, time, counts
  • 24. Properties of Attribute Values • The type of an attribute depends on which of the following properties it possesses: − Distinctness: = ≠ − Order: < > − Addition: + - − Multiplication: * / − Nominal attribute: distinctness − Ordinal attribute: distinctness & order − Interval attribute: distinctness, order & addition − Ratio attribute: all 4 properties
  • 25. Attribute types: description, examples, operations
        Nominal
          Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
          Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
          Operations: mode, entropy, contingency correlation, χ² test
        Ordinal
          Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
          Examples: hardness of minerals, {good, better, best}, grades, street numbers
          Operations: median, percentiles, rank correlation, run tests, sign tests
        Interval
          Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
          Examples: calendar dates, temperature in Celsius or Fahrenheit
          Operations: mean, standard deviation, Pearson's correlation, t and F tests
        Ratio
          Description: For ratio variables, both differences and ratios are meaningful. (*, /)
          Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
          Operations: geometric mean, harmonic mean, percent variation
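
As a rough illustration of how the attribute type limits the summary statistics that make sense, the sketch below computes one representative statistic per type with Python's standard statistics module. All sample values are invented for illustration.

```python
import statistics

# Nominal: only equality is meaningful, so the mode is a valid summary.
eye_colors = ["brown", "blue", "brown", "green"]
most_common_color = statistics.mode(eye_colors)

# Ordinal: values can be ordered, so a median is meaningful.
# The order list below is an assumed encoding of {short, medium, tall}.
order = ["short", "medium", "tall"]
heights = ["tall", "short", "medium", "medium", "tall"]
ranks = [order.index(h) for h in heights]
median_height = order[statistics.median_low(ranks)]

# Interval: differences are meaningful, so the mean is valid.
celsius = [10.0, 20.0, 30.0]
mean_celsius = statistics.mean(celsius)

# Ratio: ratios are meaningful too, so a geometric mean is valid.
masses_kg = [1.0, 4.0, 16.0]
geo_mean_mass = statistics.geometric_mean(masses_kg)

print(most_common_color, median_height, mean_celsius, geo_mean_mass)
```

Note that median_low is used for the ordinal case so the result is always one of the observed categories rather than an average of two ranks.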
  • 26. Discrete and Continuous Attributes • Discrete Attribute − Has only a finite or countably infinite set of values − Examples: zip codes, counts, or the set of words in a collection of documents − Often represented as integer variables. − Note: binary attributes are a special case of discrete attributes • Continuous Attribute − Has real numbers as attribute values − Examples: temperature, height, or weight. − Practically, real values can only be measured and represented using a finite number of digits. − Continuous attributes are typically represented as floating-point variables.
  • 27. Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values − Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters − Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers • But properties of attribute values can be different − ID has no limit but age has a maximum and minimum value
  • 28. Types of Record Sets • Record − Data Matrix − Document Data − Transaction Data • Graph − World Wide Web − Molecular Structures • Ordered − Spatial Data − Temporal Data − Sequential Data − Genetic Sequence Data
  • 29. Record Sets • Data that consists of a collection of records, each of which consists of a fixed set of attributes
        Tid  Refund  Marital Status  Taxable Income  Cheat
        1    Yes     Single          125K            No
        2    No      Married         100K            No
        3    No      Single          70K             No
        4    Yes     Married         120K            No
        5    No      Divorced        95K             Yes
        6    No      Married         60K             No
        7    Yes     Divorced        220K            No
        8    No      Single          85K             Yes
        9    No      Married         75K             No
        10   No      Single          90K             Yes
  • 30. Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
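
A minimal sketch of the m by n representation, using a plain list of lists. The numbers are invented; treating objects as points means a distance between two rows is well defined, which is shown with math.dist (Python 3.8 or later).

```python
import math

# Hypothetical data matrix: m = 3 objects (rows), n = 2 numeric attributes (columns).
data = [
    [1.0, 2.0],
    [4.0, 6.0],
    [1.0, 2.5],
]
m, n = len(data), len(data[0])

# Each row is a point in n-dimensional space, so Euclidean distance applies.
distance_0_1 = math.dist(data[0], data[1])
distance_0_2 = math.dist(data[0], data[2])
print(m, n, distance_0_1, distance_0_2)
```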
  • 31. Document Data • Each document becomes a “term” vector, − each term is a component (attribute) of the vector, − the value of each component is the number of times the corresponding term occurs in the document.
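
The term-vector idea can be sketched with collections.Counter. The two documents are made up, and splitting on whitespace is a simplifying assumption; real tokenization is usually more involved.

```python
from collections import Counter

# Two hypothetical documents.
docs = ["data mining finds patterns in data", "mining gold"]

# Count term occurrences per document (naive whitespace tokenization).
counts = [Counter(doc.split()) for doc in docs]

# The shared vocabulary fixes the order of the vector components.
vocab = sorted(set().union(*counts))

# One vector per document: component i is the count of vocab[i] in that document.
vectors = [[c[term] for term in vocab] for c in counts]

print(vocab)
print(vectors)
```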
  • 32. Transaction Data • A special type of record data, where − each record (transaction) involves a set of items. − For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
        TID  Items
        1    Bread, Coke, Milk
        2    Beer, Bread
        3    Beer, Coke, Diaper, Milk
        4    Beer, Bread, Diaper, Milk
        5    Coke, Diaper, Milk
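
Transaction data of this kind is the usual input to association rule discovery. As a sketch, the code below computes support and confidence for one candidate rule, {Milk, Diaper} → {Beer}, over the five transactions in the table above. The choice of rule and the helper name support_count are illustrative assumptions.

```python
# The five transactions from the table above, as sets of items.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)


antecedent = {"Milk", "Diaper"}
consequent = {"Beer"}

both = support_count(antecedent | consequent)
# Support: fraction of all transactions containing antecedent and consequent.
support = both / len(transactions)
# Confidence: of the transactions containing the antecedent, the fraction
# that also contain the consequent.
confidence = both / support_count(antecedent)

print(f"support={support:.2f} confidence={confidence:.2f}")
```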
  • 33. Graph Data • A Generic Graph [Figure: generic graph]
  • 34. Chemical Data • Benzene Molecule: C6H6 [Figure: benzene molecule]
  • 35. Ordered Data • Sequences of transactions [Figure labels: an element of the sequence; items/events]
  • 37. Ordered Data • Spatio-Temporal Data [Figure: average monthly temperature of land and ocean]
  • 38. Data Quality • What kinds of data quality problems? • How can we detect problems with the data? • What can we do about these problems? • Examples of data quality problems: − Noise and outliers − Missing values − Duplicate data
  • 39. Noise • Noise refers to modification of original values − Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen [Figures: Two Sine Waves; Two Sine Waves + Noise]
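
The two-sine-waves figure can be approximated numerically. This is a sketch under assumptions: the frequencies (1 Hz and 3 Hz), the sampling grid, and the Gaussian noise level of 0.2 are invented, since the slide gives no values.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Sample two seconds at 100 samples per second (assumed grid).
times = [i / 100 for i in range(200)]

# Clean signal: the sum of two sine waves with assumed frequencies.
clean = [math.sin(2 * math.pi * 1 * t) + math.sin(2 * math.pi * 3 * t) for t in times]

# Noisy signal: each original value is modified by random Gaussian noise.
noisy = [value + random.gauss(0, 0.2) for value in clean]

# Mean absolute modification introduced by the noise.
mean_abs_error = sum(abs(n - c) for n, c in zip(noisy, clean)) / len(clean)
print(f"mean absolute deviation from the clean signal: {mean_abs_error:.3f}")
```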
  • 40. Outliers • Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
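
One simple way to flag such objects is a z-score test, sketched below on the taxable incomes from the table on slide 21. The z-score method and the cutoff of 2 standard deviations are assumptions for illustration; the slides do not prescribe a detection method.

```python
import statistics

# Taxable incomes (in thousands) from the table on slide 21.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

mean = statistics.mean(incomes)
stdev = statistics.pstdev(incomes)  # population standard deviation

# Assumed cutoff: flag values more than 2 standard deviations from the mean.
THRESHOLD = 2.0
outliers = [x for x in incomes if abs(x - mean) / stdev > THRESHOLD]

print(mean, round(stdev, 2), outliers)
```

On such a small sample a single extreme value also inflates the mean and standard deviation, so more robust rules (for example, ones based on the median) may behave differently.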
  • 41. Missing Values • Reasons for missing values − Information is not collected (e.g., people decline to give their age and weight) − Attributes may not be applicable to all cases (e.g., annual income is not applicable to children) • Handling missing values − Eliminate Data Objects − Estimate Missing Values − Ignore the Missing Value During Analysis − Replace with all possible values (weighted by their probabilities)
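
Two of the handling strategies listed above can be sketched in a few lines: eliminating objects with a missing value, and estimating the missing value. Mean imputation is used as the estimate here, which is one assumption among many possible estimators; the ages are invented.

```python
# Hypothetical ages; None marks a value that was not collected.
ages = [25, None, 31, 40, None]

# Strategy 1: eliminate data objects that have a missing value.
observed = [a for a in ages if a is not None]

# Strategy 2: estimate missing values, here with the mean of the observed ones.
mean_age = sum(observed) / len(observed)
imputed = [a if a is not None else mean_age for a in ages]

print(observed)
print(imputed)
```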
  • 42. Duplicate Data • Data set may include data objects that are duplicates, or almost duplicates, of one another − Major issue when merging data from heterogeneous sources • Examples: − Same person with multiple email addresses • Data cleaning − Process of dealing with duplicate data issues
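
A minimal sketch of the "same person with multiple email addresses" case. It assumes, naively, that a normalized name identifies a person; real data cleaning needs stronger matching. The records and field names are hypothetical.

```python
# Hypothetical records merged from two sources.
records = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann  lee ", "email": "a.lee@example.org"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]

merged = {}
for record in records:
    # Naive matching key (an assumption): lowercase name, whitespace collapsed.
    key = " ".join(record["name"].lower().split())
    person = merged.setdefault(key, {"name": key, "emails": []})
    if record["email"] not in person["emails"]:
        person["emails"].append(record["email"])

for person in merged.values():
    print(person)
```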
  • 43. Six Rules of Data Quality 1. Data that is not used cannot be correct for very long 2. Data quality in an information system is a function of its use, not its collection 3. Data quality will ultimately be no better than its most stringent use 4. Data quality problems tend to become worse with the age of the system 5. The less likely it is that some data element will change, the more traumatic it will be when it finally does change 6. Information overload affects data quality
  • 44. Takeaways What is Data Mining? Challenges and Trends The origin of DM Data Mining Tasks Data Types Data Quality