David’s Perspective
How Do Data Scientists Make Reliable Decisions with Data?
David Huang
MSc. in Stat, NTU
David’s Perspective | 1
A new data-driven procedure allows stakeholders to make informed
decisions and to improve those decisions iteratively.
1. Define Business Problem & Goal
2. Design and Collect Data
3. Explore and Clean Data
4. Determine Data Analysis Task
5. Data Model Building
6. Model Selection and Evaluation
7. Derive Insight & Implication
8. Deployment and Presentation
Phases: Information-in, Information-process, Information-out
Diagram annotations: 90% time and resources; 90% data analysis knowledge; 90% business expertise
David’s Perspective | 2
Before analyzing data, we should correctly identify the data analytics
goal and its corresponding modeling techniques.
Descriptive Modeling
Objective
▪ Summarize and present data structure
▪ Performance review and monitoring
Strength
▪ Research guided by business intuition
▪ Fast and easy to do
Weakness
▪ Not many “insights”
▪ Not quite reproducible
Statistical Modeling
Objective
▪ Find causal relationships and test hypotheses
▪ Find hidden information among variables
Strength
▪ Differentiate real signals from noise
▪ Scientifically proved
Weakness
▪ Requires reliable data
▪ Requires advanced knowledge
Predictive Modeling
Objective
▪ Predict the output for each individual
▪ Forecast with time-series structured data
Strength
▪ Predict automatically and accurately
▪ Scalable and flexible
Weakness
▪ Cannot explain
▪ Requires advanced knowledge
David’s Perspective | 3
The job of data scientists is to recover the deterministic function by
analyzing data that contain randomness.
[Diagram: the deterministic constructs and the deterministic function that links them are unobserved; the input variable and output variable are measurable, and the data relationship connects them.]
David’s Perspective | 4
Data scientists always contend with both bias and variance when
approximating the true input-output relationship.
▪ Bad Model: Bias – Large; Variance – Small
▪ Bad Model: Bias – Acceptable; Variance – Large
▪ Explanatory Model: Bias – Zero; Variance – Acceptable
▪ Predictive Model: Bias – Small; Variance – Small
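The bias and variance of a model can be estimated by simulation. The following is a minimal sketch, assuming a made-up true relationship `true_f` and noise level (neither comes from the slides). It compares a rigid model (predict the overall mean) with a flexible one (copy the nearest neighbour) at a single point `x0`, using only the Python standard library.

```python
import random
import statistics

def true_f(x):
    # Assumed "true" input-output relationship, chosen only for illustration.
    return x * x

def sample(n, rng, noise_sd=0.3):
    # Draw n noisy observations of true_f on [0, 1).
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    # Rigid model: ignores x and always predicts the sample mean of y.
    return statistics.fmean(ys)

def predict_1nn(xs, ys, x0):
    # Flexible model: returns the y of the training point nearest to x0.
    i = min(range(len(xs)), key=lambda k: abs(xs[k] - x0))
    return ys[i]

def bias_and_variance(predict, x0, n=30, reps=2000, seed=0):
    # Refit on many fresh samples and study the spread of predictions at x0.
    rng = random.Random(seed)
    preds = [predict(*sample(n, rng), x0) for _ in range(reps)]
    bias = statistics.fmean(preds) - true_f(x0)
    variance = statistics.pvariance(preds)
    return bias, variance

if __name__ == "__main__":
    for name, model in [("mean", predict_mean), ("1-NN", predict_1nn)]:
        b, v = bias_and_variance(model, x0=0.9)
        print(f"{name}: bias={b:.3f} variance={v:.3f}")
```

Under these assumptions the mean model should show the larger bias and the nearest-neighbour model the larger variance, which mirrors the two "bad model" cases listed above.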
David’s Perspective | 5
Typically, we have 6 steps when analyzing a data set (1)
SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
1. Import → 2. Tidy → 3. Transform → 4. Visualize → 5. Model → 6. Communicate
(1) Import Data in R
Take data stored in a file,
database, or web API, and
load it into a data frame in R.
(2) Tidy Format in R
In brief, when your data is tidy,
each column is a variable, and
each row is an observation.
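The import and tidy steps can be sketched as follows. The slides describe these steps in R; this sketch uses Python's standard library instead, and the small `RAW` table, the column names, and the helper names `import_rows` and `tidy` are all hypothetical. The wide table stores years as column headers, so tidying turns each (country, year) pair into its own row.

```python
import csv
import io

# Hypothetical raw file contents: one column per year (not tidy).
RAW = """country,2019,2020
A,10,12
B,7,9
"""

def import_rows(text):
    # Import: parse CSV text into a list of dicts, one per row.
    return list(csv.DictReader(io.StringIO(text)))

def tidy(rows, id_col="country", var_name="year", value_name="cases"):
    # Tidy: each output row is one observation, each key is one variable.
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col == id_col:
                continue
            long_rows.append(
                {id_col: row[id_col], var_name: int(col), value_name: int(val)}
            )
    return long_rows

wide = import_rows(RAW)
long_rows = tidy(wide)

if __name__ == "__main__":
    for r in long_rows:
        print(r)
```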
David’s Perspective | 6
Typically, we have 6 steps when analyzing a data set (2)
(3) Transform
Narrow in on observations of
interest, create new variables from
existing variables, and calculate a
set of summary statistics.
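The three transform operations named above (narrow in, create new variables, summarize) can be sketched in a few lines. This is an illustration in plain Python rather than R, and the `orders` records and field names are invented for the example.

```python
from statistics import fmean

# Hypothetical tidy data: one dict per observation.
orders = [
    {"region": "north", "units": 3, "unit_price": 4.0},
    {"region": "south", "units": 2, "unit_price": 5.0},
    {"region": "north", "units": 1, "unit_price": 10.0},
]

# Narrow in on observations of interest.
north = [o for o in orders if o["region"] == "north"]

# Create a new variable from existing variables.
with_revenue = [{**o, "revenue": o["units"] * o["unit_price"]} for o in north]

# Calculate a set of summary statistics.
summary = {
    "n": len(with_revenue),
    "mean_revenue": fmean([o["revenue"] for o in with_revenue]),
}

if __name__ == "__main__":
    print(summary)
```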
(4) Visualize
A good visualization can:
(a) show you things you did not expect
(b) raise new questions
(c) hint that you are asking the wrong questions
(d) suggest that you need to collect different data
SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
David’s Perspective | 7
Typically, we have 6 steps when analyzing a data set (3)
(5) Model
Once you have made your questions
sufficiently precise, you can use a
model (computational or statistical
methods) to answer them.
(6) Communicate
It doesn’t matter how well your models
and visualization have led you to
understand the data unless you can also
communicate your results to others.
SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
David’s Perspective | 8
The InfoQ framework helps you build a coherent analysis flow.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
1. Analysis Goal, g
• Explain, Predict, Describe
• Enumerative, Analytic
• Exploratory, Confirmatory
2. Data, X
• Data Size and Dimension
• Data Source
• Data Type & Relationship
3. Empirical Model, f
• Statistical Model
• Operations Research
• Machine Learning
4. Utility Measure, U
• Analysis Utility
• Domain Utility
• Conversion Utility
InfoQ(f, X, g) = U(f(X | g))
David’s Perspective | 9
Online auction example:
Effect of a reserve price on the final auction price
Stage 1: Analysis Goal, g
• Identify the effect of using a secret versus public reserve price on the final
price of an auction.
• Quantify the average effect of using a secret versus public reserve.
Stage 2: Data, X
• Conduct a ‘field experiment’ by selling 25 identical pairs of Pokémon cards on
eBay during a 2-week period in April 2000.
• Each card auctioned twice: public reserve vs secret reserve price.
Stage 3: Empirical Model, f
• Use linear regression to test for the effect of a secret or public reserve price
on the final auction price and to quantify it.
Stage 4: Utility Measure, U
• Statistical significance (or p-value) of the regression coefficient.
• Coefficient for quantifying the magnitude of the effect (a secret-reserve
auction will generate a price $0.63 lower on average)
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
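The regression step of the auction example can be illustrated with a small simulation. Everything below is an assumption made for the sketch: the prices are simulated, not the study's data, the effect `PLANTED_EFFECT` is planted to mirror the magnitude quoted on the slide, and the model has a single secret-reserve dummy. It shows that, in a balanced paired design, the regression coefficient on the dummy equals the mean within-pair price difference.

```python
import random
from statistics import fmean

def ols_slope_intercept(x, y):
    # Ordinary least squares for one regressor: y = intercept + slope * x.
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(1)
PLANTED_EFFECT = -0.63  # assumption: planted in the simulation, not estimated

secret, price, diffs = [], [], []
for _ in range(25):  # 25 pairs of identical cards
    base = rng.uniform(5, 10)  # hypothetical underlying card value
    p_public = base + rng.gauss(0, 0.5)
    p_secret = base + PLANTED_EFFECT + rng.gauss(0, 0.5)
    secret += [0, 1]  # dummy: 1 if the reserve price was secret
    price += [p_public, p_secret]
    diffs.append(p_secret - p_public)

slope, intercept = ols_slope_intercept(secret, price)
mean_diff = fmean(diffs)

if __name__ == "__main__":
    print(f"coefficient on secret dummy: {slope:.3f}")
    print(f"mean paired difference:      {mean_diff:.3f}")
```

A fuller analysis would also report the standard error or p-value of the coefficient, which is the first utility measure listed in the example.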
David’s Perspective | 10
Data resolution refers to the measurement scale and aggregation
level of the data.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Questions to ask:
▪ Is the data scale used aligned with the stated goal of the study?
▪ How reliable and precise are the data sources and data-collection
instruments used in the study?
▪ Is the data analysis suitable for the data aggregation level?
When you are not cautious:
Failure of Google Flu Trends: day-to-day search queries were used to predict
the weekly CDC % ILI, and the predictions diverged from the CDC figures
in 2012 and 2013.
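One routine resolution check is to bring inputs and target to the same aggregation level before modeling. The sketch below is a generic illustration, not a reconstruction of how Google Flu Trends worked: it rolls hypothetical daily counts up to ISO weeks so they can be compared with a weekly target. The function name `to_weekly` and the counts are made up.

```python
from collections import defaultdict
from datetime import date, timedelta

def to_weekly(daily):
    # daily: dict mapping date -> count.
    # Returns a dict mapping (ISO year, ISO week) -> total count.
    weekly = defaultdict(int)
    for day, count in daily.items():
        iso = day.isocalendar()
        weekly[(iso[0], iso[1])] += count
    return dict(weekly)

# Hypothetical daily counts for two full weeks, starting on a Monday.
start = date(2013, 1, 7)
daily = {start + timedelta(days=i): i + 1 for i in range(14)}
weekly = to_weekly(daily)

if __name__ == "__main__":
    for key in sorted(weekly):
        print(key, weekly[key])
```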
David’s Perspective | 11
Data structure relates to the type(s) of data and data characteristics.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Common Types and Explanation
▪ Cross-Sectional Data: data collected from a population, or a representative
subset, at a specific point in time.
▪ Time Series Data: a series of data points indexed (or listed or
graphed) in time order.
▪ Panel Data: observations of multiple units over multiple time periods;
a time series is the special case of a panel with a single unit.
▪ Network Data: a finite set of vertices (nodes) and the edges that connect
them, possibly with weights.
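The relationship between these data types can be made concrete with plain Python containers. This is only an illustrative sketch; the units, years, values, and helper names are hypothetical. A panel keyed by (unit, time) yields a cross section when time is fixed and a time series when the unit is fixed, while a network is stored as a weighted adjacency mapping.

```python
# Hypothetical panel: (unit, year) -> measured value.
panel = {
    ("A", 2019): 10, ("A", 2020): 12,
    ("B", 2019): 7, ("B", 2020): 9,
}

def cross_section(panel, t):
    # All units observed at one point in time.
    return {unit: v for (unit, tt), v in panel.items() if tt == t}

def time_series(panel, unit):
    # One unit observed in time order.
    return [(tt, v) for (u, tt), v in sorted(panel.items()) if u == unit]

# Hypothetical network: vertex -> {neighbour: edge weight}.
network = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 0.5},
    "C": {"B": 0.5},
}

if __name__ == "__main__":
    print(cross_section(panel, 2019))
    print(time_series(panel, "A"))
    print(sorted(network["B"]))
```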
David’s Perspective | 12
Data integration of multiple data sources and/or types often creates
new knowledge regarding the goal at hand.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
[Diagram: recommendation system example. Data sources: Drama and Actor Information; User Watching History. Processing: User Behavior Clustering; Video Series Clustering; User Implicit Score. Output: Final List of Recommendation.]
David’s Perspective | 13
Temporal gaps among data collection, data analysis, and study
deployment will affect the information quality.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
[Timeline: 1. Data Collection → (structural break?) → 2. Data Analysis → (structural break?) → 3. Study Deployment]
David’s Perspective | 14
The choice of variables to collect, the temporal relationship between
them, and their meaning in the context of the goal critically affect
information quality.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
True Model
Y_t = b_0 + b_1 X_{1,t} + b_2 X_{2,t} - b_3 X_{3,t}
Explanatory Modeling
Omitting the variable X_{3,t} leads to a
biased estimation of b_1 and b_2.
Predictive Modeling
Omitting the variable X_{3,t} may give a
higher predictive accuracy of Y_t.
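The explanatory half of this claim can be checked with a small simulation. The sketch below makes several simplifying assumptions that are not in the slides: it keeps only one retained regressor (X1) besides the omitted X3, adds a Gaussian noise term, and uses invented coefficient values `B0`, `B1`, `B3` and a correlation parameter `RHO`. When X3 is correlated with X1 and is left out, the estimated slope on X1 should drift toward B1 - B3 * RHO instead of B1. It does not demonstrate the predictive half of the claim.

```python
import random
from statistics import fmean

def ols_slope(x, y):
    # Slope of the least-squares line of y on a single regressor x.
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical parameter values, chosen only for illustration.
B0, B1, B3 = 1.0, 2.0, 1.5
RHO = 0.8  # strength of the dependence of X3 on X1

rng = random.Random(42)
x1 = [rng.gauss(0, 1) for _ in range(5000)]
x3 = [RHO * a + rng.gauss(0, 1) for a in x1]
# Same sign convention as the true model above: X3 enters with a minus sign.
y = [B0 + B1 * a - B3 * c + rng.gauss(0, 1) for a, c in zip(x1, x3)]

short_slope = ols_slope(x1, y)  # regression that omits X3
expected_short = B1 - B3 * RHO  # what the short regression converges to

if __name__ == "__main__":
    print(f"true b1: {B1}")
    print(f"slope with X3 omitted: {short_slope:.3f}")
    print(f"large-sample value B1 - B3*RHO: {expected_short:.3f}")
```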
