SlideShare una empresa de Scribd logo
1 de 28
Exploring Data A preliminary exploration of the data to better understand its characteristics Key motivations of data exploration include Helping to select the right tool for preprocessing or analysis Making use of humans’ abilities to recognize pattern  People can recognize patterns not captured by data analysis tools  Related to the area of Exploratory Data Analysis (EDA) Created by statistician John Tukey
Contd… In EDA, as originally defined by Tukey The focus was on visualization Clustering and anomaly detection were viewed as exploratory techniques In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory In our discussion of data exploration, we focus on Summary statistics Visualization Online Analytical Processing (OLAP)
Iris Sample Data Set  Many of the exploratory data techniques are illustrated with the Iris Plant data set. Three flower types (classes): Setosa Virginica Versicolour Four (non-class) attributes  Sepal width and length  Petal width and length
Summary Statistics Summary statistics  are numbers that summarize properties of the data Summarized properties include frequency, location and spread  Examples: 	location - mean                   	spread - standard deviation
Frequency and Mode The frequency of an attribute value is the percentage of time the value occurs in the data set  For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time. The mode of a an attribute is the most frequent attribute value    The notions of frequency and mode are typically used with categorical data
Percentiles For continuous data, the notion of a percentile is more useful.  Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value     of x such that p% of the observed values of x are less than    .  For instance, the 50th percentile is the value      such that 50% of all values of x are less than
Measures of Location: Mean and Median The mean is the most common measure of the location of a set of points.   However, the mean is very sensitive to outliers.    Thus, the median or a trimmed mean is also commonly used
Measures of Spread: Range and Variance Range is the difference between the max and min The variance or standard deviation is the most common measure of the spread of a set of points
Visualization Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. Humans have a well developed ability to analyze large amounts of information that is presented visually Can detect general patterns and trends Can detect outliers and unusual patterns
Visualization techniques-Histogram Histogram  Usually shows the distribution of values of a single variable Divide the values into bins and show a bar plot of the number of objects in each bin.  The height of each bar indicates the number of objects Shape of histogram depends on the number of bins
Contd…. Ex: petal width 10 bins
Visualization Techniques: Box Plots Invented by J. Tukey Another way of displaying the distribution of data  Following figure shows the basic part of a box plot outlier 75th percentile 50th percentile 25th percentile 10th percentile 10th percentile
Visualization Techniques: Scatter Plots Scatter plots  Attributes values determine the position Two-dimensional scatter plots most common, but can have three-dimensional scatter plots Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects  It is useful to have arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
Contd…
Visualization Techniques: Contour Plots Contour plots  Useful when a continuous attribute is measured on a spatial grid They partition the plane into regions of similar values The contour lines that form the boundaries of these regions connect points with equal values	 The most common example is contour maps of elevation
Contour Plot Example: SST Dec, 1998
Visualization Techniques: Matrix Plots Matrix plots  Can plot the data matrix This can be useful when objects are sorted according to class Typically, the attributes are normalized to prevent one attribute from dominating the plot	 Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects Examples of matrix plots are p
Visualization of the Iris Data Matrix
Visualization Techniques: Parallel Coordinates Parallel Coordinates  Used to plot the attribute values of high-dimensional data Instead of using perpendicular axes, use a set of parallel axes  The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line	 Thus, each object is represented as a line
Other Visualization Techniques Star Plots  Similar approach to parallel coordinates, but axes radiate from a central point The line connecting the values of an object is a polygon
Contd… Chernoff Faces Approach created by Herman Chernoff This approach associates each attribute with a characteristic of a face The values of each attribute determine the appearance of the corresponding facial characteristic	 Each object becomes a separate face Relies on human’s ability to distinguish faces
OLAP On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database. Relational databases put data into tables, while OLAP uses a multidimensional array representation.  Such representations of data previously existed in statistics and other fields There are a number of data analysis and data exploration operations that are easier with such a data representation.
OLAP Operations: Data Cube The key operation of a OLAP is the formation of a data cube A data cube is a multidimensional representation of data, together with all possible aggregates. By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions. For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional entry with three entries, each of which gives the number of flowers of each type.
OLAP Operations: Slicing and Dicing Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.  Dicing involves selecting a subset of cells by specifying a range of attribute values.  This is equivalent to defining a subarray from the complete array.  In practice, both operations can also be accompanied by aggregation over some dimensions.
OLAP Operations: Roll-up and Drill-down Attribute values often have a hierarchical structure. Each date is associated with a year, month, and week. A location is associated with a continent, country, state (province, etc.), and city.  Products can be divided into various categories, such as clothing, electronics, and furniture.
Contd… Note that these categories often nest and form a tree or lattice A year contains months which contains day A country contains a state which contains a city
Contd… This hierarchical structure gives rise to the roll-up and drill-down operations. For sales data, we can aggregate (roll up) the sales across all the dates in a month.  Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals. Likewise, we can drill down or roll up on the location or product ID attributes.
Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net

Más contenido relacionado

La actualidad más candente

Basics of Educational Statistics (Graphs & its Types)
Basics of Educational Statistics (Graphs & its Types)Basics of Educational Statistics (Graphs & its Types)
Basics of Educational Statistics (Graphs & its Types)HennaAnsari
 
Areas In Statistics
Areas In StatisticsAreas In Statistics
Areas In Statisticsguestc94d8c
 
Types of data and graphical representation
Types of data and graphical representationTypes of data and graphical representation
Types of data and graphical representationReena Titoria
 
Data Presentation
Data PresentationData Presentation
Data Presentationcheergalsal
 
2.1-2.2 Organizing Data
2.1-2.2 Organizing Data2.1-2.2 Organizing Data
2.1-2.2 Organizing Datamlong24
 
Understanding the graphical representation of data in research
Understanding the graphical representation of data in researchUnderstanding the graphical representation of data in research
Understanding the graphical representation of data in researchDrShalooSaini
 
Numerical & graphical presentation of data
Numerical & graphical presentation of dataNumerical & graphical presentation of data
Numerical & graphical presentation of dataSarfraz Ahmad
 
Introduction to Statistics - Basic concepts
Introduction to Statistics - Basic conceptsIntroduction to Statistics - Basic concepts
Introduction to Statistics - Basic conceptsDocIbrahimAbdelmonaem
 
The uses of Tables & graphs
The uses of Tables & graphsThe uses of Tables & graphs
The uses of Tables & graphsFranco Jesús
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsSarfraz Ahmad
 
Diowane2003
Diowane2003Diowane2003
Diowane2003SFYC
 
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTIONLINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTIONruhila bhat
 
Statistic and probability 2
Statistic and probability 2Statistic and probability 2
Statistic and probability 2Irfan Yaqoob
 

La actualidad más candente (20)

Organizing data
Organizing dataOrganizing data
Organizing data
 
Histogram
HistogramHistogram
Histogram
 
Basics of Educational Statistics (Graphs & its Types)
Basics of Educational Statistics (Graphs & its Types)Basics of Educational Statistics (Graphs & its Types)
Basics of Educational Statistics (Graphs & its Types)
 
Areas In Statistics
Areas In StatisticsAreas In Statistics
Areas In Statistics
 
Types of data and graphical representation
Types of data and graphical representationTypes of data and graphical representation
Types of data and graphical representation
 
Day 3 descriptive statistics
Day 3  descriptive statisticsDay 3  descriptive statistics
Day 3 descriptive statistics
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Data Presentation
Data PresentationData Presentation
Data Presentation
 
2.1-2.2 Organizing Data
2.1-2.2 Organizing Data2.1-2.2 Organizing Data
2.1-2.2 Organizing Data
 
Understanding the graphical representation of data in research
Understanding the graphical representation of data in researchUnderstanding the graphical representation of data in research
Understanding the graphical representation of data in research
 
Numerical & graphical presentation of data
Numerical & graphical presentation of dataNumerical & graphical presentation of data
Numerical & graphical presentation of data
 
Introduction to Statistics - Basic concepts
Introduction to Statistics - Basic conceptsIntroduction to Statistics - Basic concepts
Introduction to Statistics - Basic concepts
 
Histogram
HistogramHistogram
Histogram
 
Presentation of data
Presentation of dataPresentation of data
Presentation of data
 
The uses of Tables & graphs
The uses of Tables & graphsThe uses of Tables & graphs
The uses of Tables & graphs
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Diowane2003
Diowane2003Diowane2003
Diowane2003
 
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTIONLINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
 
Statistical table
Statistical tableStatistical table
Statistical table
 
Statistic and probability 2
Statistic and probability 2Statistic and probability 2
Statistic and probability 2
 

Destacado (20)

Data input and transformation
Data input and transformationData input and transformation
Data input and transformation
 
XL-MINER: Associations
XL-MINER: AssociationsXL-MINER: Associations
XL-MINER: Associations
 
XL-MINER: Data Exploration
XL-MINER: Data ExplorationXL-MINER: Data Exploration
XL-MINER: Data Exploration
 
Sampling Distributions
Sampling DistributionsSampling Distributions
Sampling Distributions
 
Xlminer demo
Xlminer demoXlminer demo
Xlminer demo
 
Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data Analysis
 
Exploratory data analysis v1.0
Exploratory data analysis v1.0Exploratory data analysis v1.0
Exploratory data analysis v1.0
 
Exploratory data analysis coursera
Exploratory data analysis courseraExploratory data analysis coursera
Exploratory data analysis coursera
 
XL-MINER:Prediction
XL-MINER:PredictionXL-MINER:Prediction
XL-MINER:Prediction
 
Gis
GisGis
Gis
 
Spatial analysis and Analysis Tools ( GIS )
Spatial analysis and Analysis Tools ( GIS )Spatial analysis and Analysis Tools ( GIS )
Spatial analysis and Analysis Tools ( GIS )
 
Spatial interpolation techniques
Spatial interpolation techniquesSpatial interpolation techniques
Spatial interpolation techniques
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Spatial analysis and modeling
Spatial analysis and modelingSpatial analysis and modeling
Spatial analysis and modeling
 
Powerpoint paragraaf 5.3/5.4
Powerpoint paragraaf 5.3/5.4 Powerpoint paragraaf 5.3/5.4
Powerpoint paragraaf 5.3/5.4
 
Data
DataData
Data
 
Miedo Jajjjajajja
Miedo JajjjajajjaMiedo Jajjjajajja
Miedo Jajjjajajja
 
Quick Look At Clustering
Quick Look At ClusteringQuick Look At Clustering
Quick Look At Clustering
 
Oracle: DML
Oracle: DMLOracle: DML
Oracle: DML
 
Classification
ClassificationClassification
Classification
 

Similar a Exploring Data

Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3OllieShoresna
 
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdfGraphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdfHimakshi7
 
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...deboshreechatterjee2
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docxscet315
 
Exploring Data (1).pptx
Exploring Data (1).pptxExploring Data (1).pptx
Exploring Data (1).pptxgina458018
 
Statistics Class 10 CBSE
Statistics Class 10 CBSE Statistics Class 10 CBSE
Statistics Class 10 CBSE Smitha Sumod
 
Research methodology-Research Report
Research methodology-Research ReportResearch methodology-Research Report
Research methodology-Research ReportDrMAlagupriyasafiq
 
Research Methodology-Data Processing
Research Methodology-Data ProcessingResearch Methodology-Data Processing
Research Methodology-Data ProcessingDrMAlagupriyasafiq
 
Exploratory Data Analysis.pptx for Data Analytics
Exploratory Data Analysis.pptx for Data AnalyticsExploratory Data Analysis.pptx for Data Analytics
Exploratory Data Analysis.pptx for Data Analyticsharshrnotaria
 
Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg Girls High
 
Visualizations in Exploratory Data Analysis
Visualizations in Exploratory Data AnalysisVisualizations in Exploratory Data Analysis
Visualizations in Exploratory Data AnalysisOluwatobiAdefami
 
Unit 2_ Descriptive Analytics for MBA .pptx
Unit 2_ Descriptive Analytics for MBA .pptxUnit 2_ Descriptive Analytics for MBA .pptx
Unit 2_ Descriptive Analytics for MBA .pptxJANNU VINAY
 
DATA VISUALIZATION.pptx
DATA VISUALIZATION.pptxDATA VISUALIZATION.pptx
DATA VISUALIZATION.pptxPraneethBhai1
 

Similar a Exploring Data (20)

Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3Data Mining Exploring DataLecture Notes for Chapter 3
Data Mining Exploring DataLecture Notes for Chapter 3
 
17329274.ppt
17329274.ppt17329274.ppt
17329274.ppt
 
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdfGraphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
 
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...Artificial Intelligence - Data Analysis, Creative & Critical Thinking and  AI...
Artificial Intelligence - Data Analysis, Creative & Critical Thinking and AI...
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docx
 
Exploring Data (1).pptx
Exploring Data (1).pptxExploring Data (1).pptx
Exploring Data (1).pptx
 
Statistics Class 10 CBSE
Statistics Class 10 CBSE Statistics Class 10 CBSE
Statistics Class 10 CBSE
 
Research methodology-Research Report
Research methodology-Research ReportResearch methodology-Research Report
Research methodology-Research Report
 
Research Methodology-Data Processing
Research Methodology-Data ProcessingResearch Methodology-Data Processing
Research Methodology-Data Processing
 
Exploratory Data Analysis.pptx for Data Analytics
Exploratory Data Analysis.pptx for Data AnalyticsExploratory Data Analysis.pptx for Data Analytics
Exploratory Data Analysis.pptx for Data Analytics
 
Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statistics
 
Edited economic statistics note
Edited economic statistics noteEdited economic statistics note
Edited economic statistics note
 
Visualizations in Exploratory Data Analysis
Visualizations in Exploratory Data AnalysisVisualizations in Exploratory Data Analysis
Visualizations in Exploratory Data Analysis
 
Introduction to Descriptive Statistics
Introduction to Descriptive StatisticsIntroduction to Descriptive Statistics
Introduction to Descriptive Statistics
 
Unit 2_ Descriptive Analytics for MBA .pptx
Unit 2_ Descriptive Analytics for MBA .pptxUnit 2_ Descriptive Analytics for MBA .pptx
Unit 2_ Descriptive Analytics for MBA .pptx
 
QQ Plot.pptx
QQ Plot.pptxQQ Plot.pptx
QQ Plot.pptx
 
Bba 2001
Bba 2001Bba 2001
Bba 2001
 
Survey data & sampling
Survey data & samplingSurvey data & sampling
Survey data & sampling
 
DATA VISUALIZATION.pptx
DATA VISUALIZATION.pptxDATA VISUALIZATION.pptx
DATA VISUALIZATION.pptx
 
Stat-Lesson.pptx
Stat-Lesson.pptxStat-Lesson.pptx
Stat-Lesson.pptx
 

Más de DataminingTools Inc

AI: Introduction to artificial intelligence
AI: Introduction to artificial intelligenceAI: Introduction to artificial intelligence
AI: Introduction to artificial intelligenceDataminingTools Inc
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web miningDataminingTools Inc
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataDataminingTools Inc
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsDataminingTools Inc
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisDataminingTools Inc
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technologyDataminingTools Inc
 

Más de DataminingTools Inc (20)

Terminology Machine Learning
Terminology Machine LearningTerminology Machine Learning
Terminology Machine Learning
 
Techniques Machine Learning
Techniques Machine LearningTechniques Machine Learning
Techniques Machine Learning
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning Introduction
 
Areas of machine leanring
Areas of machine leanringAreas of machine leanring
Areas of machine leanring
 
AI: Planning and AI
AI: Planning and AIAI: Planning and AI
AI: Planning and AI
 
AI: Logic in AI 2
AI: Logic in AI 2AI: Logic in AI 2
AI: Logic in AI 2
 
AI: Logic in AI
AI: Logic in AIAI: Logic in AI
AI: Logic in AI
 
AI: Learning in AI 2
AI: Learning in AI 2AI: Learning in AI 2
AI: Learning in AI 2
 
AI: Learning in AI
AI: Learning in AI AI: Learning in AI
AI: Learning in AI
 
AI: Introduction to artificial intelligence
AI: Introduction to artificial intelligenceAI: Introduction to artificial intelligence
AI: Introduction to artificial intelligence
 
AI: Belief Networks
AI: Belief NetworksAI: Belief Networks
AI: Belief Networks
 
AI: AI & Searching
AI: AI & SearchingAI: AI & Searching
AI: AI & Searching
 
AI: AI & Problem Solving
AI: AI & Problem SolvingAI: AI & Problem Solving
AI: AI & Problem Solving
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
Data warehouse and olap technology
Data warehouse and olap technologyData warehouse and olap technology
Data warehouse and olap technology
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Último (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Exploring Data

  • 1. Exploring Data A preliminary exploration of the data to better understand its characteristics Key motivations of data exploration include Helping to select the right tool for preprocessing or analysis Making use of humans’ abilities to recognize pattern People can recognize patterns not captured by data analysis tools Related to the area of Exploratory Data Analysis (EDA) Created by statistician John Tukey
  • 2. Contd… In EDA, as originally defined by Tukey The focus was on visualization Clustering and anomaly detection were viewed as exploratory techniques In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory In our discussion of data exploration, we focus on Summary statistics Visualization Online Analytical Processing (OLAP)
  • 3. Iris Sample Data Set Many of the exploratory data techniques are illustrated with the Iris Plant data set. Three flower types (classes): Setosa Virginica Versicolour Four (non-class) attributes Sepal width and length Petal width and length
  • 4. Summary Statistics Summary statistics are numbers that summarize properties of the data Summarized properties include frequency, location and spread Examples: location - mean spread - standard deviation
  • 5. Frequency and Mode The frequency of an attribute value is the percentage of time the value occurs in the data set For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time. The mode of a an attribute is the most frequent attribute value The notions of frequency and mode are typically used with categorical data
  • 6. Percentiles For continuous data, the notion of a percentile is more useful. Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value of x such that p% of the observed values of x are less than . For instance, the 50th percentile is the value such that 50% of all values of x are less than
  • 7. Measures of Location: Mean and Median The mean is the most common measure of the location of a set of points. However, the mean is very sensitive to outliers. Thus, the median or a trimmed mean is also commonly used
  • 8. Measures of Spread: Range and Variance Range is the difference between the max and min The variance or standard deviation is the most common measure of the spread of a set of points
  • 9. Visualization Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. Humans have a well developed ability to analyze large amounts of information that is presented visually Can detect general patterns and trends Can detect outliers and unusual patterns
  • 10. Visualization techniques-Histogram Histogram Usually shows the distribution of values of a single variable Divide the values into bins and show a bar plot of the number of objects in each bin. The height of each bar indicates the number of objects Shape of histogram depends on the number of bins
  • 11. Contd…. Ex: petal width 10 bins
  • 12. Visualization Techniques: Box Plots Invented by J. Tukey Another way of displaying the distribution of data Following figure shows the basic part of a box plot outlier 75th percentile 50th percentile 25th percentile 10th percentile 10th percentile
  • 13. Visualization Techniques: Scatter Plots Scatter plots Attributes values determine the position Two-dimensional scatter plots most common, but can have three-dimensional scatter plots Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects It is useful to have arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
  • 15. Visualization Techniques: Contour Plots Contour plots Useful when a continuous attribute is measured on a spatial grid They partition the plane into regions of similar values The contour lines that form the boundaries of these regions connect points with equal values The most common example is contour maps of elevation
  • 16. Contour Plot Example: SST Dec, 1998
  • 17. Visualization Techniques: Matrix Plots Matrix plots Can plot the data matrix This can be useful when objects are sorted according to class Typically, the attributes are normalized to prevent one attribute from dominating the plot Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects Examples of matrix plots are p
  • 18. Visualization of the Iris Data Matrix
  • 19. Visualization Techniques: Parallel Coordinates Parallel Coordinates Used to plot the attribute values of high-dimensional data Instead of using perpendicular axes, use a set of parallel axes The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line Thus, each object is represented as a line
  • 20. Other Visualization Techniques Star Plots Similar approach to parallel coordinates, but axes radiate from a central point The line connecting the values of an object is a polygon
  • 21. Contd… Chernoff Faces Approach created by Herman Chernoff This approach associates each attribute with a characteristic of a face The values of each attribute determine the appearance of the corresponding facial characteristic Each object becomes a separate face Relies on human’s ability to distinguish faces
  • 22. OLAP On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database. Relational databases put data into tables, while OLAP uses a multidimensional array representation. Such representations of data previously existed in statistics and other fields There are a number of data analysis and data exploration operations that are easier with such a data representation.
  • 23. OLAP Operations: Data Cube The key operation of a OLAP is the formation of a data cube A data cube is a multidimensional representation of data, together with all possible aggregates. By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions. For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional entry with three entries, each of which gives the number of flowers of each type.
  • 24. OLAP Operations: Slicing and Dicing Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions. Dicing involves selecting a subset of cells by specifying a range of attribute values. This is equivalent to defining a subarray from the complete array. In practice, both operations can also be accompanied by aggregation over some dimensions.
  • 25. OLAP Operations: Roll-up and Drill-down Attribute values often have a hierarchical structure. Each date is associated with a year, month, and week. A location is associated with a continent, country, state (province, etc.), and city. Products can be divided into various categories, such as clothing, electronics, and furniture.
  • 26. Contd… Note that these categories often nest and form a tree or lattice A year contains months which contains day A country contains a state which contains a city
  • 27. Contd… This hierarchical structure gives rise to the roll-up and drill-down operations. For sales data, we can aggregate (roll up) the sales across all the dates in a month. Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals. Likewise, we can drill down or roll up on the location or product ID attributes.
  • 28. Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net