11. Statistics
• Traditionally concerned with analyzing primary (e.g., experimental) data collected for checking specific hypotheses (ideas)
• Primary data analysis or top-down (confirmatory) analysis
• Hypothesis evaluation or testing
Data Science
• Typically concerned with analyzing secondary (e.g., observational) data collected for other reasons
• Secondary data analysis or bottom-up (exploratory) analysis
• Hypothesis generation
• Knowledge discovery
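As a rough illustration of the confirmatory (top-down) side, the sketch below tests a hypothesis that was stated before looking at the data: that a treated group differs from a control group in its mean. It uses a permutation test, chosen here only because it needs nothing beyond the Python standard library; other tests would serve equally well. The function name `permutation_test` and the measurements are made up for the example.

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical experimental (primary) data.
control = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1]
treated = [5.9, 6.1, 6.0, 5.8, 6.2, 6.0]
p_value = permutation_test(control, treated)
print(f"estimated p-value: {p_value:.4f}")
```

An exploratory (bottom-up) analysis would instead start from data that was already collected for other reasons and look for patterns worth turning into hypotheses.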
Data Science with Dr Shahid
12. Data science is an interdisciplinary field
Uses computing tools and statistical methods to extract knowledge from data
Multiple definitions exist because creating value requires cross-disciplinary skills
The holy grail of data science can be pictured with Venn diagrams, e.g., Drew Conway’s diagram
13. Data science as portrayed by Drew Conway
17. Gregory Piatetsky-Shapiro, Ph.D.
Knowledge Discovery to Data Mining to Predictive Analytics and now to Data Science
Essence is always: discovery of what is true and useful
21. Business Understanding
Goals
• Specify key variables (model targets, metrics of success)
• Identify relevant data sources
How?
• Define objectives (business problems, stakeholders)
• Define SMART metrics
• Find the data
Artifacts
• Project charter (iterated)
• Data sources
• Data dictionaries
22. Objectives
How much/many: Regression
Which category: Classification
Which group: Clustering
Is it weird: Anomaly Detection
Which option: Recommendation
SMART metrics: Specific, Measurable, Achievable, Relevant, Time-bound
24. Data Acquisition
Goals
• Clean, high-quality data
• Architecture of the data pipeline (refresh & score)
How?
• Data ingestion
• Explore the data (quality, EDA)
• Set up a data pipeline (batch-based, streaming or real-time, or a hybrid)
Artifacts
• Data quality report
• Solution architecture
• Checkpoint decision (re-evaluate before full feature engineering and model building)
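To make the "data quality report" artifact more concrete, here is a minimal sketch of a per-column summary. It assumes records arrive as a list of dictionaries. The function name `data_quality_report`, the list of markers treated as missing, and the sample records are all assumptions made for illustration, not part of any particular tool.

```python
from collections import Counter

# Assumption: these markers count as "missing"; adjust per data source.
MISSING_MARKERS = (None, "", "NA", "N/A", "null")

def data_quality_report(rows):
    """Summarize missing values and distinct values for every column."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        # A key that is absent from a record is treated as missing (None).
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in MISSING_MARKERS]
        n_missing = len(values) - len(present)
        report[col] = {
            "missing": n_missing,
            "missing_pct": 100.0 * n_missing / len(rows),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1)[0][0] if present else None,
        }
    return report

# Hypothetical ingested records.
rows = [
    {"id": 1, "age": 34, "city": "Oslo"},
    {"id": 2, "age": None, "city": "Lima"},
    {"id": 3, "age": 29, "city": ""},
    {"id": 4, "age": 34},
]
report = data_quality_report(rows)
for column, stats in report.items():
    print(column, stats)
```

A summary like this can feed the checkpoint decision: if key columns are mostly missing, it may be better to revisit the data sources before investing in feature engineering.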
26. Modeling
Goals
• Optimal features
• Informative model
• Production-ready model
How?
• Feature engineering
• Model training
• Production ready?
Artifacts
• Feature sets
• Model report
• Checkpoint decision (evaluate for production)
27. Model Training
Raw data → Features (starting data)
• Training split (70-80%): model gets trained
• Validation split (10-15%): hyperparameter tuning
• Test split (10-15%): model gets evaluated
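The three-way split above can be sketched in plain Python. The example below is only an illustration under stated assumptions: the data is synthetic, the model is a tiny one-dimensional k-nearest-neighbours classifier picked for brevity, and the helper names (`train_val_test_split`, `knn_predict`, `accuracy`) are hypothetical. The 70/15/15 proportions fall inside the ranges given on the slide.

```python
import random

def train_val_test_split(rows, train=0.7, val=0.15, seed=42):
    """Shuffle a copy of rows and cut it into train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def knn_predict(train_rows, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train_rows, key=lambda r: abs(r[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train_rows, eval_rows, k):
    """Fraction of eval_rows whose label the k-NN model predicts correctly."""
    correct = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return correct / len(eval_rows)

# Synthetic data: label is 1 when x > 0.5, with about 10% of labels flipped.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

train_rows, val_rows, test_rows = train_val_test_split(data)

# Hyperparameter tuning: choose k using the validation split only.
candidate_ks = [1, 3, 5, 9, 15]
best_k = max(candidate_ks, key=lambda k: accuracy(train_rows, val_rows, k))

# Evaluation: the test split is used once, after k has been fixed.
test_accuracy = accuracy(train_rows, test_rows, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

Keeping the test split out of the tuning loop is the point of the design: its score then estimates performance on data the model has never influenced.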
29. Deployment
Goals
• Deploy models with a data pipeline to a production environment
How?
• Operationalize the model
Artifacts
• Status dashboard (system health & KPIs)
• Final modeling report
• Final solution architecture document
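One way to picture "operationalizing" a model together with a status dashboard is a thin wrapper that scores requests and records basic health numbers. The sketch below is an assumption-laden illustration: the class name `ScoringService`, the metrics it tracks, and the stand-in model (a simple lambda) are all hypothetical, and a real deployment would sit behind a serving framework and a monitoring system.

```python
import time
from statistics import mean

class ScoringService:
    """Wraps a model callable and records simple health metrics."""

    def __init__(self, model):
        self.model = model
        self.latencies = []   # seconds per call, including failed calls
        self.errors = 0

    def score(self, features):
        start = time.perf_counter()
        try:
            return self.model(features)
        except Exception:
            # Count the failure, then let the caller see the original error.
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.perf_counter() - start)

    def status(self):
        """Numbers a status dashboard could display."""
        calls = len(self.latencies)
        return {
            "calls": calls,
            "errors": self.errors,
            "error_rate": self.errors / calls if calls else 0.0,
            "mean_latency_s": mean(self.latencies) if calls else 0.0,
        }

# Hypothetical stand-in for a trained model.
service = ScoringService(lambda row: 2.0 * row["size"] + 1.0)
prediction = service.score({"size": 3.0})
try:
    service.score({})          # malformed request: missing "size"
except KeyError:
    pass
status_report = service.status()
print(prediction, status_report)
```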
30. Customer Acceptance
Goals
• Finalize project deliverables: confirm that the pipeline, the model, and their deployment in a production environment satisfy the customer's objectives
How?
• System validation
• Project hand-off
Artifacts
• Exit report of the project for the customer
34. Data Science with Dr Shahid
Math/Statistics
• Linear algebra, calculus
• Probability theory, graph theory
• Distributions, summary statistics, hypothesis testing
Machine learning
• Supervised learning
• Unsupervised learning
• Validation, model comparison
Software engineering
• Algorithms and data structures
• Data visualization
• Data processing
35. Data Science with Dr Shahid
Data Scientists
Data Analyst
ML engineer
Data engineer
Data Architect
BI developer
38. Data Science with Dr Shahid
https://www.facebook.com/drshahid.phd
https://www.linkedin.com/in/muhammad-shahid-67876212
muhammad.shahid@ieee.org
Thank You!