1 © Hortonworks Inc. 2011–2018. All rights reserved
DWS Barcelona 2019
Robert Hryniewicz
@robhryniewicz
Data Science Crash Course
What is Machine Learning?
Machine Learning is programming with data (as opposed to programming with code).
Machine Learning is a way to use data to draw meaningful conclusions, including identifying patterns, anomalies and trends that may not be obvious to humans.
Machine learning is math, at scale.
Machine learning is learning patterns from data, labelled or not.
Machine learning is when I explain my challenge to the computer and it finds a way to solve it.
Machine Learning allows for emotional decisions to become objective.
Examples where Machine Learning can be applied
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
Insurance
• Risk assessment
• Customer insights/experience
• Financial real-time analysis
Life sciences
• Genome sequencing
• Drug development
• Sensor data
Machine Learning – Major Types
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
[Diagram: several inputs feed a model that produces Output 1, Output 2, …, Output n]
Use labeled (training) datasets to learn the relationship of given inputs to outputs.
Once the model is trained, use it to predict outputs on new input data.
Unsupervised Learning
Explore, group & find patterns in the input data without specifying the expected output (no labels).
Reinforcement Learning
[Diagram: the Algorithm sends an Action to the Environment, which returns a Reward and a State]
Algorithm learns to
maximize rewards it
receives for its actions
(e.g. maximizes points for
investment returns).
Use when you don’t have
lots of training data, you
can’t clearly define ideal
end-state, or the only way
to learn is by interacting
with the environment.
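As a minimal sketch of the reward-maximizing loop described above, the following epsilon-greedy agent learns which of three actions pays best purely by interacting with a toy environment. The payouts, the noise range and all names (`PAYOUTS`, `pull`, `run_bandit`) are assumptions made for illustration; they are not from the slides.

```python
import random

# Hypothetical environment: three actions whose average payouts are unknown to the agent.
PAYOUTS = [1.0, 5.0, 2.0]

def pull(action, rng):
    """Environment step: return a noisy reward for the chosen action."""
    return PAYOUTS[action] + rng.uniform(-0.5, 0.5)

def run_bandit(steps=500, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    n_actions = len(PAYOUTS)
    counts = [0] * n_actions
    values = [0.0] * n_actions  # running average reward per action

    for t in range(steps):
        if t < n_actions:
            action = t  # try every action once before trusting the estimates
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore
        else:
            action = max(range(n_actions), key=lambda a: values[a])  # exploit
        reward = pull(action, rng)
        counts[action] += 1
        # Incremental update of the running average for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

values, counts = run_bandit()
best_action = max(range(len(values)), key=lambda a: values[a])
print("estimated values:", [round(v, 2) for v in values])
print("best action:", best_action)
```

No labeled training set is involved: the only feedback is the reward returned after each action, which is the situation the slide describes.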
Regression
• Linear Regression
Classification
• Logistic Regression
• Support Vector Machines (SVM)
• Random Forest (RF)
• Naïve Bayes
Recommender Systems / Collaborative Filtering
• Alternating Least Squares (ALS)
Clustering
• K-Means, LDA
Dimensionality Reduction
• Principal Component Analysis (PCA)
Deep Learning
• Fully Connected Neural Nets → Tabular or Recommender Systems
• Convolutional Neural Nets (CNNs) → Images
• Recurrent Neural Nets (RNNs) → Natural Language Processing (NLP) / Text
REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression
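To make the house-price example concrete, here is a small sketch of ordinary least squares with a single feature. It uses square footage only (the slide also mentions bedrooms), and the data points are made up so that the fitted line is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: square footage -> price in thousands.
square_feet = [800, 1000, 1200, 1500, 2000]
prices = [160, 200, 240, 300, 400]

slope, intercept = fit_line(square_feet, prices)
predicted = slope * 1800 + intercept  # continuous-valued prediction for a new house
print(f"predicted price for 1800 sq ft: {predicted:.1f}k")
```

With more than one feature the same idea applies, but the closed form becomes a small linear system, which is usually left to a library.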
CLASSIFICATION
Identifying which category an object belongs to
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
• Logistic Regression
  • Fast training (linear model)
  • Classes expressed in probabilities
  • Less overfitting [+]
  • Can underfit (lower accuracy) [-]
• Support Vector Machines (SVM)
  • “Best” supervised learning algorithm, effective
  • State of the art prior to Deep Learning
  • More robust to outliers than Logistic Regression
  • Handles non-linearity (with kernels)
• Random Forest (ensemble of Decision Trees)
  • Fast training
  • Handles categorical features
  • Does not require feature scaling
  • Captures non-linearity and feature interaction
    • i.e. performs a form of feature selection implicitly
• Naïve Bayes
  • Good for text classification
  • Assumes independent variables / words
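The "classes expressed in probabilities" point for Logistic Regression can be sketched with one feature and plain gradient descent. This is an illustrative toy, not a production trainer: the data, learning rate and epoch count are assumptions chosen so the example converges quickly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up one-feature data with binary labels.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(xs, ys)
p_low = sigmoid(w * 0 + b)   # probability of class 1 for x = 0
p_high = sigmoid(w * 5 + b)  # probability of class 1 for x = 5
print(round(p_low, 3), round(p_high, 3))
```

The model outputs a probability per example; a class label comes from thresholding it, typically at 0.5.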
CLASSIFICATION
Visual Intro to Decision Trees
• http://www.r2d3.us/visual-intro-to-machine-learning-part-1
CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – automatically group customers into different market segments
Algorithms: K-means, LDA
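A minimal K-means sketch for the market-segmentation example follows. The customer features and values are invented, and seeding the centroids with the first k points is a simplification (libraries usually pick smarter starting points).

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids

# Made-up customers: (visits per month, average basket size).
customers = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = kmeans(customers, k=2)
print(labels)
```

No labels are supplied; the two segments emerge from the distances between customers alone.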
COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)
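The "fill in the missing entries" idea can be sketched with a deliberately simplified rank-1 version of ALS: each user and each item gets a single number, and the two sets are updated in turn while the other is held fixed. Real implementations use several latent factors per user and item; the ratings matrix, the regularization value and the function name `als_rank1` are assumptions for this illustration.

```python
def als_rank1(ratings, iterations=100, reg=0.01):
    """Rank-1 ALS: approximate ratings[i][j] by user[i] * item[j].

    Missing ratings are None and are skipped in every update.
    """
    n_users, n_items = len(ratings), len(ratings[0])
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve a 1-D least-squares problem per user.
        for i in range(n_users):
            seen = [j for j in range(n_items) if ratings[i][j] is not None]
            num = sum(ratings[i][j] * item[j] for j in seen)
            den = sum(item[j] ** 2 for j in seen) + reg
            user[i] = num / den
        # Fix the user factors and solve for each item the same way.
        for j in range(n_items):
            seen = [i for i in range(n_users) if ratings[i][j] is not None]
            num = sum(ratings[i][j] * user[i] for i in seen)
            den = sum(user[i] ** 2 for i in seen) + reg
            item[j] = num / den
    return user, item

# Made-up user-item matrix; rows are users, columns are items.
ratings = [
    [1, 2, 4],
    [2, 4, 8],
    [3, 6, None],  # the entry to fill in
]
user, item = als_rank1(ratings)
filled = user[2] * item[2]
print(round(filled, 2))
```

Because the observed entries follow an exact multiplicative pattern, the filled value lands close to 12, the number that pattern implies.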
DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
• Removing noise in images by selecting only
“important” features
• Removing redundant features, e.g. MPH & KPH are
linearly dependent
Algorithms: Principal Component Analysis (PCA)
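The MPH and KPH example can be checked numerically. The sketch below finds the first principal component of two-column data with power iteration on the covariance matrix; the speed values are made up, and the helper name is hypothetical. Since KPH is just MPH times a constant, one component should capture essentially all of the variance.

```python
import math

def first_principal_component(rows, iterations=50):
    """Return (direction, explained_variance_ratio) for 2-column data."""
    n = len(rows)
    means = [sum(r[c] for r in rows) / n for c in range(2)]
    centered = [[r[0] - means[0], r[1] - means[1]] for r in rows]
    # 2x2 sample covariance matrix.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(2)]
           for a in range(2)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0, 1.0]
    for _ in range(iterations):
        w = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(2) for b in range(2))
    total_variance = cov[0][0] + cov[1][1]
    return v, eigenvalue / total_variance

mph = [10, 20, 30, 40, 60]
rows = [[s, s * 1.609344] for s in mph]  # second column: the same speed in KPH
direction, explained = first_principal_component(rows)
print(direction, explained)
```

Projecting each row onto `direction` replaces the two redundant columns with a single one without losing information in this case.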
Simple/shallow vs Deep Neural Net
Popular Neural Net Architectures
• Convolutional Neural Nets (CNNs) ← Images
• Recurrent Neural Nets (RNNs) and Long Short-Term Memory (LSTM) ← Text / Language (NLP) & Time Series
Number Probability
0 0.03
1 0.01
2 0.04
3 0.08
4 0.05
5 0.08
6 0.07
7 0.02
8 0.54
9 0.08
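Reading a prediction off a table like the one above amounts to picking the class with the highest probability. The sketch below does that for the listed values, and also shows softmax, a common way to turn raw network scores into such probabilities. That these particular numbers came from a softmax layer is an assumption; the slide does not say.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Probabilities copied from the table (list index = digit).
probabilities = [0.03, 0.01, 0.04, 0.08, 0.05, 0.08, 0.07, 0.02, 0.54, 0.08]

predicted_digit = max(range(len(probabilities)), key=lambda d: probabilities[d])
confidence = probabilities[predicted_digit]
print(predicted_digit, confidence)

# Hypothetical raw scores, only to show the shape of a softmax output.
example = softmax([1.0, 2.0, 3.0])
```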
Quickly Training Deep Learning Models
with Transfer Learning
How to Build a Deep Learning Image Recognition System?
African Bush Elephant Indian Elephant Sri Lankan Elephant Borneo Pygmy Elephant
Step 1: Download examples to train the model with
How to Build a Deep Learning Image Recognition System?
Step 2: Augment dataset to enrich training data
→ Adds 5-10x more training examples
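Augmentation means generating altered copies of each training image. As a rough sketch, the functions below apply three simple geometric transforms to an image stored as a nested list of pixel values; real pipelines typically add crops, zooms and color shifts to reach the larger multiples mentioned above. The function names and the tiny 2x2 "image" are illustrative assumptions.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Reverse the order of the rows."""
    return [row[:] for row in image[::-1]]

def rotate_90_clockwise(image):
    """Rotate by reversing the rows, then transposing."""
    return [list(col) for col in zip(*image[::-1])]

def augment(image):
    """Return the original plus three transformed copies."""
    return [image, flip_horizontal(image), flip_vertical(image), rotate_90_clockwise(image)]

image = [[1, 2],
         [3, 4]]  # stand-in for a tiny grayscale image
variants = augment(image)
for v in variants:
    print(v)
```

Each variant keeps the original label, so one labeled photo yields several training examples.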
How to Build a Deep Learning Image Recognition System?
Step 3: Select and download a pre-trained model
dawn.cs.stanford.edu/benchmark
Sample Architecture of a CNN
[Figure: CNN architecture with layers labeled “Pretrained Parameters” and “Random Parameters”]
How to Build a Deep Learning Image Recognition System?
Step 4: Apply transfer learning
[Diagram: INPUT → Pretrained Network (millions of parameters) → Random Parameters → OUTPUT (Borneo Pygmy Elephant, Indian Elephant)]
• Step A: Train Parameters
• Step B: Adjust Parameters
How to Build a Deep Learning Image Recognition System?
Step 5: Host a trained model on a server and make it accessible via a web app
[Diagram: user uploads an image → web app returns “Borneo Pygmy Elephant”]
Data Science Journey
What is data science?
The scientific exploration of data to extract meaning or
insight, using statistics and mathematical models with
the end goal of making smarter, quicker decisions.
Start by Asking Relevant Questions
• Specific (can you think of a clear answer?)
• Measurable (quantifiable? data driven?)
• Actionable (if you had an answer, could you do something with it?)
• Realistic (can you get an answer with data you have?)
• Timely (answer in reasonable timeframe?)
Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate the sequence of prep operations)
3. Validation (correctness evaluated against a representative sample dataset)
4. Transformation (actual prep process takes place)
5. Backflow of cleaned data (replace original dirty data)
Approx. 80% of a Data Analyst’s job is Data Preparation!
Example of multiple values used for one U.S. state → California, CA, Cal., Cal
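The state-name example maps naturally onto a lookup table. The sketch below normalizes the four spellings from the slide to one code and flags anything it does not recognize for review, which corresponds to the audit step in the list. The table, the choice of "CA" as the canonical form and the misspelled test value are assumptions for illustration.

```python
# Hypothetical lookup table: every spelling seen in the raw data -> one canonical code.
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal.": "CA",
    "cal": "CA",
}

def normalize_state(raw):
    """Map a raw state string to its canonical code, or None if unknown."""
    key = raw.strip().lower()
    return STATE_ALIASES.get(key)

raw_values = ["California", "CA", "Cal.", " Cal ", "Kalifornia"]
cleaned = [normalize_state(v) for v in raw_values]

# Values that could not be mapped are kept aside instead of being guessed.
unknown = [v for v, c in zip(raw_values, cleaned) if c is None]
print(cleaned)
print(unknown)
```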
Feature Selection
• Also known as variable or attribute selection
• Why important?
  • simplification of models → easier to interpret by researchers/users
  • shorter training times
  • enhanced generalization by reducing overfitting
• Dimensionality reduction vs feature selection
  • Dimensionality reduction: create new combinations of attributes
  • Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?
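One simple way to approach that question is a filter: score each candidate feature against the target and keep those that pass a threshold. The sketch below uses the absolute Pearson correlation as the score. It is only one of many selection techniques and is not necessarily the one the deck has in mind; the data, the threshold and the column names are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no signal
    return cov / (sx * sy)

def select_features(columns, target, threshold=0.5):
    """Keep the names of columns whose |correlation| with the target passes the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Made-up data: three candidate features and the value to predict.
columns = {
    "square_feet": [800, 1000, 1200, 1500, 2000],
    "bedrooms": [1, 2, 2, 3, 4],
    "house_number": [7, 7, 7, 7, 7],  # constant, so it cannot help
}
price = [160, 200, 240, 300, 400]
selected = select_features(columns, price)
print(selected)
```

The kept columns are passed on unchanged, which is the distinction from dimensionality reduction drawn in the list above.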
Hyperparameters
• Define higher-level model properties, e.g. complexity or learning rate
• Cannot be learned during training → need to be predefined
• Can be decided by
  • setting different values
  • training different models
  • choosing the values that score better on held-out data
• Hyperparameter examples
  • Number of leaves or depth of a tree
  • Number of latent factors in a matrix factorization
  • Learning rate (in many models)
  • Number of hidden layers in a deep neural network
  • Number of clusters in a k-means clustering
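The "set different values, train different models, choose the best" procedure can be sketched as a small search loop. The example uses k-nearest-neighbors regression as a stand-in model, with k (the number of neighbors) as the hyperparameter; k-NN is not among the examples listed above and is chosen only because it fits in a few lines. The data and the hand-made train/validation split are invented.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def validation_error(train, valid, k):
    """Mean squared error of k-NN predictions on the held-out points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)

# Made-up (x, y) pairs, split by hand into training and validation sets.
train = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 6.0), (7, 6.8), (8, 8.1)]
valid = [(2.5, 2.4), (4.5, 4.6), (6.5, 6.4)]

candidates = [1, 3, 5]  # hyperparameter values to try
errors = {k: validation_error(train, valid, k) for k in candidates}
best_k = min(errors, key=errors.get)
print(errors)
print("best k:", best_k)
```

The validation points are never used for fitting, so the comparison reflects how each setting behaves on data the model has not seen.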
• Residuals
  • the residual of an observed value is the difference between the observed value and the estimated value
• R² (R Squared) – Coefficient of Determination
  • indicates goodness of fit
  • R² of 1 means the regression line perfectly fits the data
• RMSE (Root Mean Square Error)
  • measure of the differences between values predicted by a model and values actually observed
  • good measure of accuracy, but only for comparing the errors of different models on the same variable, because it is scale-dependent
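The three quantities above are short enough to compute by hand. The following sketch implements them directly from their definitions on a made-up set of three observations.

```python
import math

def residuals(observed, predicted):
    """Observed value minus estimated value, per data point."""
    return [o - p for o, p in zip(observed, predicted)]

def r_squared(observed, predicted):
    """1 - SS_res / SS_tot; equals 1 when every prediction is exact."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum(r ** 2 for r in residuals(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def rmse(observed, predicted):
    """Square root of the mean squared residual, in the units of the target."""
    res = residuals(observed, predicted)
    return math.sqrt(sum(r ** 2 for r in res) / len(res))

observed = [3, 5, 7]
predicted = [2, 5, 8]
print(residuals(observed, predicted))
print(r_squared(observed, predicted))
print(rmse(observed, predicted))
```

Because RMSE is expressed in the units of the target, rescaling the target (say, dollars to thousands of dollars) changes it, while R² stays the same.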
With that in mind…
• No simple formula for “good questions”, only general guidelines
• The right data is better than lots of data
• Understanding relationships matters
Enterprise Data Science @ Scale
• Enterprise-Grade: Leverage enterprise-grade security, governance and operations
• Tools: Enhance productivity by enabling data scientists to use their favorite tools, technologies and libraries
• Deployment: Compress the time to insight by deploying models into production faster
• Data: Build more robust models by using all the data in the data lake
Thanks!
Robert Hryniewicz
@robhryniewicz

 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Data Science Crash Course

  • 1. Data Science Crash Course. Robert Hryniewicz (@robhryniewicz), DWS Barcelona 2019. © Hortonworks Inc. 2011–2018. All rights reserved.
  • 2. What is Machine Learning?
      • Machine learning is programming with data (as opposed to programming with code).
      • Machine learning is a way to use data to draw meaningful conclusions, including identifying patterns, anomalies and trends that may not be obvious to humans.
      • Machine learning is math, at scale.
      • Machine learning is learning patterns from data, labelled or not.
      • Machine learning is when I explain my challenge to the computer and it finds a way to solve it.
      • Machine learning allows emotional decisions to become objective.
  • 3. Examples where Machine Learning can be applied
      • Healthcare: predict diagnosis; prioritize screenings; reduce re-admittance rates
      • Financial services: fraud detection/prevention; predict underwriting risk; new account risk screens
      • Public sector: analyze public sentiment; optimize resource allocation; law enforcement & security
      • Retail: product recommendation; inventory management; price optimization
      • Telco/mobile: predict customer churn; predict equipment failure; customer behavior analysis
      • Oil & gas: predictive maintenance; seismic data management; predict well production levels
      • Insurance: risk assessment; customer insights/experience; finance real-time analysis
      • Life sciences: genome sequencing; drug development; sensor data
  • 4. Machine Learning – Major Types: Supervised Learning, Unsupervised Learning, Reinforcement Learning
  • 5. Supervised Learning: Use labeled (training) datasets to learn the relationship of given inputs to outputs. Once the model is trained, use it to predict outputs on new input data. (Diagram: several inputs mapped to Output 1, Output 2, …, Output n.)
  • 6. Unsupervised Learning: Explore, classify & find patterns in the input data without being explicit about the output.
  • 7. Reinforcement Learning: The algorithm learns to maximize the rewards it receives for its actions (e.g. maximizes points for investment returns). Use it when you don’t have lots of training data, you can’t clearly define the ideal end-state, or the only way to learn is by interacting with the environment. (Diagram labels: Algorithm, Environment, Action, Reward, State.)
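The reward-maximizing loop on this slide can be sketched with a small epsilon-greedy bandit. This is a simplification that is not taken from the deck: there is no state, the two actions and their rewards (0.3 and 1.0) are made up, and `run_bandit`, `epsilon` and `seed` are hypothetical names.

```python
import random

def run_bandit(true_rewards, steps=1000, epsilon=0.1, seed=0):
    """Learn which action pays best purely by trying actions and observing rewards."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_rewards)  # learned value of each action
    counts = [0] * len(true_rewards)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(true_rewards))  # explore a random action
        else:
            action = max(range(len(estimates)), key=estimates.__getitem__)  # exploit
        reward = true_rewards[action]  # the environment responds with a reward
        counts[action] += 1
        # Incremental average of the rewards seen so far for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Made-up environment: action 1 always pays more than action 0
estimates, counts = run_bandit([0.3, 1.0])
best_action = max(range(len(estimates)), key=estimates.__getitem__)
```

The algorithm never sees `true_rewards` directly; it only learns from the rewards returned for the actions it takes.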
  • 8. Overview of algorithm families
      • Regression: Linear Regression
      • Classification: Logistic Regression; Support Vector Machines (SVM); Random Forest (RF); Naïve Bayes
      • Recommender Systems / Collaborative Filtering: Alternating Least Squares (ALS)
      • Clustering: K-Means, LDA
      • Dimensionality Reduction: Principal Component Analysis (PCA)
      • Deep Learning
          • Fully Connected Neural Nets → Tabular or Recommender Systems
          • Convolutional Neural Nets (CNNs) → Images
          • Recurrent Neural Nets (RNNs) → Natural Language Processing (NLP) / Text
  • 9. REGRESSION: Predicting a continuous-valued output. Example: predicting house prices based on number of bedrooms and square footage. Algorithms: Linear Regression
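As a minimal sketch of the house-price example, the snippet below fits a one-feature linear regression by ordinary least squares. It assumes square footage is the only feature (the slide also mentions bedrooms), and the data points and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data that lies exactly on price = 100 * sqft + 50_000
square_feet = [1000, 1500, 2000, 2500]
prices = [150_000, 200_000, 250_000, 300_000]

slope, intercept = fit_line(square_feet, prices)
predicted_price = slope * 1800 + intercept  # continuous-valued prediction for a new house
```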
  • 10. CLASSIFICATION: Identifying which category an object belongs to. Examples: spam detection, diabetes diagnosis, text labeling. Algorithms:
      • Logistic Regression
          • Fast training (linear model)
          • Classes expressed in probabilities
          • Less overfitting [+]
          • Less fitting (accuracy) [-]
      • Support Vector Machines (SVM)
          • “Best” supervised learning algorithm, effective
          • State of the art prior to Deep Learning
          • More robust to outliers than Logistic Regression
          • Handles non-linearity
      • Random Forest (ensemble of Decision Trees)
          • Fast training
          • Handles categorical features
          • Does not require feature scaling
          • Captures non-linearity and feature interaction, i.e. performs feature selection / PCA implicitly
      • Naïve Bayes
          • Good for text classification
          • Assumes independent variables / words
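To illustrate "classes expressed in probabilities", here is a rough sketch of logistic regression trained with plain gradient descent. The single feature, its values, the labels and the hyperparameters (`learning_rate`, `epochs`) are all made up; the feature is centered before training, which is a choice made for this example rather than something the slide prescribes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Made-up data: low feature values belong to class 0, high values to class 1
raw = [1, 2, 3, 4, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
center = sum(raw) / len(raw)  # center the feature so training behaves well
w, b = train_logistic([x - center for x in raw], labels)

def predict_probability(x):
    """Probability that a new observation x belongs to class 1."""
    return sigmoid(w * (x - center) + b)
```

Calling `predict_probability(2)` gives a value below 0.5 and `predict_probability(8)` a value above it; thresholding at 0.5 turns the probability into a class label.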
  • 11. CLASSIFICATION: Visual Intro to Decision Trees: http://www.r2d3.us/visual-intro-to-machine-learning-part-1
  • 12. CLUSTERING: Automatic grouping of similar objects into sets (clusters). Example: market segmentation, i.e. automatically grouping customers into different market segments. Algorithms: K-means, LDA
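A minimal K-means sketch for the market-segmentation example follows. The customer data (visits per month, average spend) is made up, and seeding the centroids with the first `k` points is an assumption chosen to keep the example deterministic; practical implementations usually use random or k-means++ initialization.

```python
def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = list(points[:k])  # assumption: deterministic init with the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up customers: (visits per month, average spend)
customers = [(1, 10), (2, 12), (1, 11), (10, 100), (11, 105), (9, 98)]
centroids, clusters = kmeans(customers, k=2)
```

With this data the two clusters separate the occasional low spenders from the frequent high spenders, without any labels being provided.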
  • 13. COLLABORATIVE FILTERING: Fill in the missing entries of a user-item association matrix. Applications: product/movie recommendation. Algorithms: Alternating Least Squares (ALS)
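The idea of filling in a missing entry can be sketched with a stripped-down ALS that uses a single latent factor per user and per item and no regularization. Both simplifications are assumptions made for brevity (production ALS implementations use several factors plus regularization), and the ratings matrix is made up.

```python
def als_rank1(ratings, iterations=30):
    """Rank-1 alternating least squares; None marks a missing rating."""
    n_users, n_items = len(ratings), len(ratings[0])
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed and solve the least-squares problem for each user
        for u in range(n_users):
            seen = [i for i in range(n_items) if ratings[u][i] is not None]
            user[u] = (sum(ratings[u][i] * item[i] for i in seen)
                       / sum(item[i] ** 2 for i in seen))
        # Hold user factors fixed and solve for each item
        for i in range(n_items):
            seen = [u for u in range(n_users) if ratings[u][i] is not None]
            item[i] = (sum(ratings[u][i] * user[u] for u in seen)
                       / sum(user[u] ** 2 for u in seen))
    return user, item

# Made-up user-item matrix; user 0 has not rated item 2
ratings = [
    [2, 4, None],
    [1, 2, 3],
    [2, 4, 6],
]
user, item = als_rank1(ratings)
predicted = user[0] * item[2]  # estimate for the missing entry
```

Because user 0 rates the first two items exactly like user 2, the estimate for the missing entry ends up close to user 2's rating of 6.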
  • 14. DIMENSIONALITY REDUCTION: Reducing the number of redundant features/variables. Applications:
      • Removing noise in images by selecting only “important” features
      • Removing redundant features, e.g. MPH & KPH are linearly dependent
    Algorithms: Principal Component Analysis (PCA)
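The MPH/KPH example can be sketched with a two-feature PCA that uses the closed-form eigen-decomposition of a 2x2 covariance matrix. The speed values are made up and `pca_2d` is a hypothetical helper that only handles two correlated features; it is meant to show the idea, not to replace a library implementation.

```python
import math

def pca_2d(rows):
    """Return (explained variance ratio, direction, 1-D scores) of the first component."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    centered = [(x - mean_x, y - mean_y) for x, y in rows]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Largest eigenvalue of the symmetric covariance matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2
    gap = math.sqrt(max(half_trace ** 2 - (sxx * syy - sxy ** 2), 0.0))
    lam1 = half_trace + gap
    # Matching eigenvector (assumes sxy != 0, i.e. the two features are correlated)
    vx, vy = sxy, lam1 - sxx
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return lam1 / (sxx + syy), direction, scores

KPH_PER_MPH = 1.609344
speeds = [(mph, mph * KPH_PER_MPH) for mph in (30, 45, 60, 75)]  # made-up readings
explained, direction, scores = pca_2d(speeds)
```

Since KPH is just a rescaled copy of MPH, the first component explains essentially all of the variance, so the two columns can be replaced by the single column `scores`.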
  • 15.
  • 16. Simple/shallow vs Deep Neural Net
  • 17. Popular Neural Net Architectures
      • Convolutional Neural Nets (CNNs) ← Images
      • Recurrent Neural Nets (RNNs), Long Short-Term Memory (LSTM) ← Text / Language (NLP) & Time Series
  • 18. Number → Probability: 0 → 0.03, 1 → 0.01, 2 → 0.04, 3 → 0.08, 4 → 0.05, 5 → 0.08, 6 → 0.07, 7 → 0.02, 8 → 0.54, 9 → 0.08
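The slide has no caption, so the following reads the table as the class probabilities a classifier outputs for one input (an assumption), and shows how such an output is usually turned into a prediction: pick the class with the highest probability.

```python
# Probabilities copied from the slide; index = number, value = probability
probabilities = [0.03, 0.01, 0.04, 0.08, 0.05, 0.08, 0.07, 0.02, 0.54, 0.08]

# The predicted class is the index with the highest probability (argmax)
predicted_number = max(range(len(probabilities)), key=probabilities.__getitem__)
confidence = probabilities[predicted_number]
total = sum(probabilities)  # a probability distribution should sum to about 1
```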
  • 19. Quickly Training Deep Learning Models with Transfer Learning
  • 20. How to Build a Deep Learning Image Recognition System? Step 1: Download examples to train the model with (African Bush Elephant, Indian Elephant, Sri Lankan Elephant, Borneo Pygmy Elephant)
  • 21. How to Build a Deep Learning Image Recognition System? Step 2: Augment dataset to enrich training data → Adds 5-10x more training examples
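Step 2 can be sketched with simple geometric augmentations. The example below treats an image as a nested list of pixel values and generates flipped and rotated copies; the tiny 2x2 "image" and the helper names are made up, and real pipelines typically also use crops, zooms and color changes.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the rows top to bottom."""
    return [row[:] for row in image[::-1]]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Return the original image plus five transformed copies."""
    variants = [image, flip_horizontal(image), flip_vertical(image)]
    rotated = image
    for _ in range(3):  # 90, 180 and 270 degrees
        rotated = rotate_90(rotated)
        variants.append(rotated)
    return variants

image = [[1, 2],
         [3, 4]]  # made-up 2x2 grayscale image
augmented = augment(image)
```

Each source image yields six training examples here, in line with the 5-10x growth mentioned on the slide.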
  • 22. How to Build a Deep Learning Image Recognition System? Step 3: Select and download a pre-trained model (dawn.cs.stanford.edu/benchmark)
  • 23. Sample Architecture of a CNN (diagram labels: Pretrained Parameters, Random Parameters)
  • 24. How to Build a Deep Learning Image Recognition System? Step 4: Apply transfer learning. Diagram: INPUT → Pretrained Network (millions of parameters) → Random Parameters → OUTPUT (e.g. Borneo Pygmy Elephant, Indian Elephant). Step A: Train Parameters. Step B: Adjust Parameters.
  • 25. How to Build a Deep Learning Image Recognition System? Step 5: Host a trained model on a server and make it accessible via a web app. (Diagram: user uploads an image; web app returns “Borneo Pygmy Elephant”.)
  • 26. Data Science Journey
  • 27. What is data science? The scientific exploration of data to extract meaning or insight, using statistics and mathematical models with the end goal of making smarter, quicker decisions.
  • 28.
  • 29. Start by Asking Relevant Questions
      • Specific (can you think of a clear answer?)
      • Measurable (quantifiable? data driven?)
      • Actionable (if you had an answer, could you do something with it?)
      • Realistic (can you get an answer with data you have?)
      • Timely (answer in reasonable timeframe?)
  • 30. Data Preparation
      1. Data analysis (audit for anomalies/errors)
      2. Creating an intuitive workflow (formulate sequence of prep operations)
      3. Validation (correctness evaluated against sample representative dataset)
      4. Transformation (actual prep process takes place)
      5. Backflow of cleaned data (replace original dirty data)
    Approx. 80% of a Data Analyst’s job is Data Preparation!
    Example of multiple values used for one U.S. state → California, CA, Cal., Cal
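The state-name example above is a typical transformation step. A minimal sketch, assuming a hand-written alias table (`STATE_ALIASES` is hypothetical and covers only the values from the slide):

```python
# Hypothetical alias table; a real one would cover every state and known variant
STATE_ALIASES = {"california": "CA", "ca": "CA", "cal": "CA"}

def normalize_state(raw):
    """Map the different spellings of a state to one canonical code."""
    cleaned = raw.strip()
    key = cleaned.rstrip(".").lower()
    # Unknown values pass through unchanged so they can be audited later
    return STATE_ALIASES.get(key, cleaned)

dirty = ["California", "CA", "Cal.", "Cal"]
clean = [normalize_state(value) for value in dirty]
```

After this step all four spellings collapse to the single value `CA`, so grouping or joining on the state column behaves as expected.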
  • 31. Feature Selection
      • Q: Which features should you use to create a predictive model?
      • Also known as variable or attribute selection
      • Why important?
          • simplification of models → easier to interpret by researchers/users
          • shorter training times
          • enhanced generalization by reducing overfitting
      • Dimensionality reduction vs feature selection
          • Dimensionality reduction: create new combinations of attributes
          • Feature selection: include/exclude attributes in data without changing them
  • 32. Hyperparameters
      • Define higher-level model properties, e.g. complexity or learning rate
      • Cannot be learned during training → need to be predefined
      • Can be decided by
          • setting different values
          • training different models
          • choosing the values that test better
      • Hyperparameter examples
          • Number of leaves or depth of a tree
          • Number of latent factors in a matrix factorization
          • Learning rate (in many models)
          • Number of hidden layers in a deep neural network
          • Number of clusters in a k-means clustering
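The "set different values, train different models, choose the values that test better" recipe can be sketched as a small search over one hyperparameter. The example uses the number of neighbours `k` in a k-nearest-neighbours classifier, which is not covered in the deck and is used here only as a compact stand-in; the training and validation data are made up, with one deliberately noisy label.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0 or 1)."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0

def accuracy(train, validation, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

# Made-up 1-D data; the point (3, 1) is a noisy label inside the class-0 region
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
         (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
validation = [(2.9, 0), (3.2, 0), (5.4, 0), (6.6, 1), (8.5, 1)]

# Try several hyperparameter values and keep the one that scores best on held-out data
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

Here `k=1` overfits the noisy point and `k=5` blurs the class boundary, so the validation set favours `k=3`. Scoring on held-out data rather than on the training set is what makes the comparison meaningful.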
  • 33. Model evaluation metrics
      • Residuals
          • the residual of an observed value is the difference between the observed value and the estimated value
      • R² (R Squared), the Coefficient of Determination
          • indicates goodness of fit
          • R² of 1 means the regression line perfectly fits the data
      • RMSE (Root Mean Square Error)
          • measure of differences between values predicted by a model and values actually observed
          • good measure of accuracy, but only to compare forecasting errors of different models (individual variables are scale-dependent)
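The three metrics on this slide can be computed in a few lines. The observed and predicted values below are made up, and `regression_metrics` is a hypothetical helper name.

```python
import math

def regression_metrics(observed, predicted):
    """Return (residuals, R squared, RMSE) for paired observed/predicted values."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_observed = sum(observed) / len(observed)
    ss_res = sum(r ** 2 for r in residuals)                   # unexplained variation
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)  # total variation
    r_squared = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(observed))
    return residuals, r_squared, rmse

observed = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]
residuals, r_squared, rmse = regression_metrics(observed, predicted)
```

RMSE is expressed in the units of the target variable, which is why it is only comparable between models that predict the same quantity.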
  • 34. With that in mind…
      • No simple formula for “good questions”, only general guidelines
      • The right data is better than lots of data
      • Understanding relationships matters
  • 35. Enterprise Data Science @ Scale
      • Enterprise-Grade: Leverage enterprise-grade security, governance and operations
      • Tools: Enhance productivity by enabling data scientists to use their favorite tools, technologies and libraries
      • Deployment: Compress the time to insight by deploying models into production faster
      • Data: Build more robust models by using all the data in the data lake
  • 36. Thanks! Robert Hryniewicz @robhryniewicz