Building Data Science Teams: A Moneyball Approach
Josh Wills | Senior Director of Data Science
© Cloudera, Inc. All rights reserved.
About Me
A Team Building Exercise
Data Scientist Supply vs. Data Scientist Demand
Recruiting Techniques
Moneyball and Data Science
Choosing The Right Metrics
1. Analyzing “Unstructured” Data Sources
2. Building Machine Learning Models
3. Turn Static Reports Into Analytical Applications
Answering More Questions in Less Time
How To Answer Questions Like A Data Scientist
The Life of a Data Processing Job
1. Read and deserialize input data.
2. Project/filter input records.
3. Shuffle: serialize it, send over the network, deserialize it.
4. Apply aggregation logic.
5. Serialize output data.
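The five steps of a data processing job can be sketched in a few lines of plain Python. This is only an illustration, not code from the talk: the record fields (`user`, `clicks`), the function name `run_job`, and the use of JSON as the serialization format are all assumptions, and the "network" is simulated with in-memory lists.

```python
import json
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    """Hypothetical single-process walk through the five job steps."""
    # 1. Read and deserialize input data.
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter input records: keep two fields, drop zero-click rows.
    projected = [(r["user"], r["clicks"]) for r in records if r["clicks"] > 0]

    # 3. Shuffle: serialize each pair and route it to a partition by key.
    #    A real engine would send these bytes over the network.
    partitions = defaultdict(list)
    for key, value in projected:
        wire_bytes = json.dumps([key, value]).encode("utf-8")
        target = sum(key.encode("utf-8")) % num_partitions
        partitions[target].append(wire_bytes)

    # 4. Apply aggregation logic (after deserializing on the receiving side).
    totals = {}
    for payloads in partitions.values():
        for payload in payloads:
            key, value = json.loads(payload.decode("utf-8"))
            totals[key] = totals.get(key, 0) + value

    # 5. Serialize output data.
    return [
        json.dumps({"user": user, "total_clicks": total})
        for user, total in sorted(totals.items())
    ]


if __name__ == "__main__":
    sample = [
        '{"user": "a", "clicks": 2}',
        '{"user": "b", "clicks": 0}',
        '{"user": "a", "clicks": 3}',
    ]
    for line in run_job(sample):
        print(line)
```

Note how often the data crosses a serialization boundary: once on read, twice in the shuffle, and once on write. Only steps 2 and 4 contain logic specific to the question being asked.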
Handling the Cost of Serialization
The Traditional RDBMS Approach
The Cost of The Traditional RDBMS Approach
Query Scheduling and Exploratory Data Analysis
The Spark Approach
The Cost of the Spark Approach
The MapReduce Approach
MapReduce In The Hands of a Data Scientist
Example: Hive Multi-Insert
Our Goal: Public Transit for Questions
Data Modeling for Data Scientists
Motivating Example: Spelling Correction
Event Series Analytics
A Simple Star Schema for Spell Correction
The Combinatorial Explosion
Designing the Spell Correction Data Product
• What parameters does this model need…
  • during the analysis phase?
  • during deployment?
• Some Candidates
  • Lag time between events
  • Similarity of queries
  • What else?
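The two named candidates, lag time between events and similarity of queries, can be computed from a time-ordered series of search events. The sketch below is an assumption-laden illustration rather than the model from the talk: the `(timestamp_seconds, query)` event format and the function name `candidate_features` are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure a real spell corrector would use.

```python
from difflib import SequenceMatcher


def candidate_features(events):
    """Compute per-pair features for consecutive search events.

    events: list of (timestamp_seconds, query) tuples, sorted by time.
    This input shape is an assumption made for the example.
    """
    features = []
    for (t_prev, q_prev), (t_next, q_next) in zip(events, events[1:]):
        features.append({
            "first_query": q_prev,
            "second_query": q_next,
            # Lag time between the two events, in seconds.
            "lag_seconds": t_next - t_prev,
            # Similarity in [0, 1]; 1.0 means identical strings.
            "similarity": SequenceMatcher(None, q_prev, q_next).ratio(),
        })
    return features


if __name__ == "__main__":
    session = [(0, "moneybal"), (4, "moneyball"), (60, "spark")]
    for row in candidate_features(session):
        print(row)
```

A short lag combined with a high similarity suggests the second query may be a correction of the first; the thresholds for "short" and "high" are exactly the parameters the slide asks about.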
A Supernova Schema for Search
Spell Correction in SQL
Exhibit: http://github.com/jwills/exhibit
Querying Nested Types with Impala
Trading Up: From Data Analyst to Data Scientist
• Core Metric: # Outputs / # Jobs
• Measure on both an individual and aggregate level
• Drive the marginal cost of asking one additional question towards zero
• Point business analysts at output tables for interactive analysis with Impala
• Self-serve BI frees up resources (compute + data science time)
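The core metric, outputs per job, is simple enough to compute from a job log at both the individual and the aggregate level. The following is a minimal sketch under stated assumptions: the log format (one `(analyst, num_outputs)` tuple per job run) and the function name `outputs_per_job` are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def outputs_per_job(job_log):
    """Return (per-analyst ratio, overall ratio) of outputs to jobs.

    job_log: list of (analyst, num_outputs) tuples, one per job run.
    This log shape is an assumption made for the example.
    """
    jobs = defaultdict(int)
    outputs = defaultdict(int)
    for analyst, num_outputs in job_log:
        jobs[analyst] += 1
        outputs[analyst] += num_outputs

    # Individual level: outputs per job for each analyst.
    per_person = {analyst: outputs[analyst] / jobs[analyst] for analyst in jobs}
    # Aggregate level: total outputs over total jobs.
    overall = sum(outputs.values()) / sum(jobs.values()) if jobs else 0.0
    return per_person, overall


if __name__ == "__main__":
    log = [("ana", 1), ("ana", 3), ("raj", 6)]
    per_person, overall = outputs_per_job(log)
    print(per_person, round(overall, 2))
```

A ratio near 1 means each job answers a single question; a higher ratio means one pass over the data is answering several questions at once, which is what drives the marginal cost of an additional question down.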
Thanks!
@josh_wills

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Building Data Science Teams: A Moneyball Approach

  • 1. Building Data Science Teams: A Moneyball Approach. Josh Wills | Senior Director of Data Science. © Cloudera, Inc. All rights reserved.
  • 2. About Me
  • 3. A Team Building Exercise
  • 4. Data Scientist Supply vs. Data Scientist Demand
  • 5. Recruiting Techniques
  • 6. Moneyball and Data Science
  • 7. Choosing The Right Metrics
  • 8. 1. Analyzing “Unstructured” Data Sources
  • 9. 2. Building Machine Learning Models
  • 10. 3. Turn Static Reports Into Analytical Applications
  • 11. Answering More Questions in Less Time
  • 12. How To Answer Questions Like A Data Scientist
  • 13. The Life of a Data Processing Job
      1. Read and deserialize input data.
      2. Project/filter input records.
      3. Shuffle: serialize it, send it over the network, deserialize it.
      4. Apply aggregation logic.
      5. Serialize output data.
  • 14. Handling the Cost of Serialization
  • 15. The Traditional RDBMS Approach
  • 16. The Cost of The Traditional RDBMS Approach
  • 17. Query Scheduling and Exploratory Data Analysis
  • 18. The Spark Approach
  • 19. The Cost of the Spark Approach
  • 20. The MapReduce Approach
  • 21. MapReduce In The Hands of a Data Scientist
  • 22. Example: Hive Multi-Insert
  • 23. Our Goal: Public Transit for Questions
  • 24. Data Modeling for Data Scientists
  • 25. Motivating Example: Spelling Correction
  • 26. Event Series Analytics
  • 27. A Simple Star Schema for Spell Correction
  • 28. The Combinatorial Explosion
  • 29. Designing the Spell Correction Data Product
      • What parameters does this model need…
          • during the analysis phase?
          • during deployment?
      • Some candidates
          • Lag time between events
          • Similarity of queries
          • What else?
  • 30. A Supernova Schema for Search
  • 31. Spell Correction in SQL
  • 32. Exhibit: http://github.com/jwills/exhibit
  • 33. Querying Nested Types with Impala
  • 34. Trading Up: From Data Analyst to Data Scientist
      • Core metric: # outputs / # jobs
      • Measure on both an individual and an aggregate level
      • Drive the marginal cost of asking one additional question towards zero
      • Point business analysts at output tables for interactive analysis with Impala
      • Self-serve BI frees up resources (compute + data science time)
  • 35. Thanks! @josh_wills
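
The five steps listed on slide 13 ("The Life of a Data Processing Job") can be sketched in a few lines of Python. This is only an illustrative, single-process sketch, not code from the talk: the function name `run_job`, the record fields `key` and `value`, the choice of JSON as the serialization format, and the toy partitioning rule are all assumptions made for the example.

```python
import json
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    """Toy single-process walk through the five steps of a data processing job."""
    # 1. Read and deserialize input data (JSON is an assumption of this sketch).
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter input records: keep two fields, drop negative values.
    projected = [(r["key"], r["value"]) for r in records if r["value"] >= 0]

    # 3. Shuffle: serialize each record and route it to a partition by key.
    #    In a real cluster this is where data crosses the network.
    partitions = defaultdict(list)
    for key, value in projected:
        payload = json.dumps([key, value])
        partitions[sum(key.encode()) % num_partitions].append(payload)

    # 4. Deserialize on the receiving side and apply aggregation logic (a sum per key).
    totals = {}
    for payloads in partitions.values():
        for payload in payloads:
            key, value = json.loads(payload)
            totals[key] = totals.get(key, 0) + value

    # 5. Serialize output data.
    return [json.dumps({"key": k, "total": totals[k]}) for k in sorted(totals)]
```

Even in this toy version, the data is serialized or deserialized at steps 1, 3, 4 and 5, which is the cost the following slides are concerned with.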

Editor's Notes

  1. Expand on this definition here.
  2. Companies are trying to acquire data scientists. What they should be trying to acquire is insights. How do data scientists leverage their programming skills to create more insights than an equivalently knowledgeable data analyst?
  3. ML models: fraud/risk, ad clicks, next best action/recommenders, etc., etc.
  4. ML models: fraud/risk, ad clicks, next best action/recommenders, etc., etc.
  5. SUM MAX EXAMPLE
  6. Discuss traffic congestion and the problem of induced demand.
  7. Discuss scheduling and resource management (i.e., you’re only allowed to drive your Ferrari between midnight and 6 AM.)
  8. Data scientists know how to structure data so that a single MapReduce (MR) job answers as many questions as possible.
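
One way to read notes 5 and 8 together is that several aggregates (for example a SUM and a MAX) can be computed in the same pass over the data instead of running one job per question. The sketch below illustrates that idea in plain Python under stated assumptions: `answer_many` and `QUESTIONS` are hypothetical names, and the example stands in for the idea only, not for the Hive or MapReduce code shown in the talk.

```python
def answer_many(records, questions):
    """Fold every question's aggregate over the records in one scan.

    `questions` maps a name to an (initial value, fold function) pair.
    """
    state = {name: initial for name, (initial, _) in questions.items()}
    for record in records:  # the input is read only once
        for name, (_, fold) in questions.items():
            state[name] = fold(state[name], record)
    return state


# Two questions answered by the same pass: a running sum and a running max.
QUESTIONS = {
    "sum": (0, lambda acc, x: acc + x),
    "max": (float("-inf"), lambda acc, x: max(acc, x)),
}
```

Calling `answer_many([3, 1, 4], QUESTIONS)` yields both the sum and the maximum from a single scan, so in terms of the core metric on slide 34 this is two outputs for one job.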