June 2013
BIG DATA SCIENCE: A PATH FORWARD
CONFIDENTIAL | 2
linkedin.com/in/danmallinger/
@danmallinger
www.thinkbiganalytics.com
• Data Science Lead @ Think Big
• Product/Brand Obsessive
• Teacher
• Occasional Engineer
TODAY
• A high-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
INFRASTRUCTURE, TALENT, & CAPABILITIES
• Understand our organizational needs for data science
• Infrastructure: Technological tools and platforms.
• Talent: Staff hired and trained.
• Capabilities: Data science techniques utilized.
Infrastructure: Hadoop | NoSQL | Analytics | SQL/MPP | Real Time
Talent: Scripting | MapReduce | Data Exploration | Basic Modeling | PhD Math
Capabilities: Visualization | Clustering | Categorization | Continuous Models | Text Analysis
ANALYTICS TOOLS
• Boxed Solutions: Mahout & Platform
• Toolkits: RHadoop, Scikit, etc.
• You will need toolkits to solve unique problems, but smart techniques make that easier.
• Boxed solutions are limited, but can be a good source of early velocity.
DATA
Presenter
• Gigabytes from Stack Overflow
• Questions from users
• With metadata
• Users have reputations
• Questions open or closed
Audience
• Follow along
• Thinking about your data
• To learn in a familiar context and plan
STEP 1: EXPLORE
-- Summary statistics over the questions table (Hive)
select count(1) as total
     , sum(has_code)
     , avg(body_count)
     , stddev_samp(body_count)
     , corr(reputation, owner_questions)
     , histogram_numeric(body_count, 10)
from questions
;
[Figures: patterns through Hive; patterns through Tableau]
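For readers who want to try the same exploration on a small local sample first, the statistics in the Hive query can be sketched in plain Python. This is only an illustration: the column names are taken from the query, the three sample rows are made up, and the histogram uses equal-width bins for simplicity, whereas Hive's histogram_numeric uses non-uniformly spaced bins.

```python
import math

def pearson(xs, ys):
    """Pearson correlation, the statistic Hive's corr() reports."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def summarize(rows, bins=10):
    """Compute the same summary columns as the query, over a list of dicts."""
    body = [r["body_count"] for r in rows]
    n = len(body)
    mean = sum(body) / n
    # Sample standard deviation (n - 1 denominator), as in stddev_samp.
    stddev = math.sqrt(sum((b - mean) ** 2 for b in body) / (n - 1))
    # Equal-width bins: an assumption made to keep the sketch short.
    lo, hi = min(body), max(body)
    width = (hi - lo) / bins or 1.0
    histogram = [0] * bins
    for b in body:
        histogram[min(int((b - lo) / width), bins - 1)] += 1
    return {
        "total": n,
        "with_code": sum(r["has_code"] for r in rows),
        "avg_body": mean,
        "stddev_body": stddev,
        "corr": pearson([r["reputation"] for r in rows],
                        [r["owner_questions"] for r in rows]),
        "histogram": histogram,
    }

# Made-up sample rows, for illustration only.
rows = [
    {"has_code": 1, "body_count": 100, "reputation": 10, "owner_questions": 1},
    {"has_code": 0, "body_count": 200, "reputation": 20, "owner_questions": 2},
    {"has_code": 1, "body_count": 300, "reputation": 30, "owner_questions": 3},
]
summary = summarize(rows)
```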
STEP 2: FEATURE BUILDING
• Summaries of unstructured data
• Time-since metrics
    select transform(…)
    using 'python …'
• Clustering: Browsing cohorts
    /bin/mahout canopy
SQL Windowing Cross-Record Features
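A time-since metric is a typical cross-record feature: each row needs to know about the previous row for the same user. The sketch below shows the idea in plain Python, roughly what a streaming script or a SQL window function would compute. The field names (user_id, created_at) and the sample records are hypothetical and not taken from the Stack Overflow data.

```python
from collections import defaultdict
from datetime import datetime

def time_since_previous(records):
    """Return (user_id, created_at, seconds_since_previous) per record.

    seconds_since_previous is None for a user's first question.
    """
    by_user = defaultdict(list)
    for user_id, created_at in records:
        by_user[user_id].append(created_at)

    features = []
    for user_id, stamps in by_user.items():
        stamps.sort()  # order each user's questions by time
        previous = None
        for stamp in stamps:
            gap = None if previous is None else (stamp - previous).total_seconds()
            features.append((user_id, stamp, gap))
            previous = stamp
    return features

# Hypothetical records: (user_id, created_at)
records = [
    ("u1", datetime(2013, 6, 1, 12, 0)),
    ("u2", datetime(2013, 6, 1, 9, 0)),
    ("u1", datetime(2013, 6, 1, 12, 30)),
]
features = time_since_previous(records)
gaps = {(user, stamp): gap for user, stamp, gap in features}
```

Grouping by user before computing the gap mirrors what partitioning by user and ordering by time would do in a windowed query.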
PARALLEL MODELS IN HADOOP
• Sample (don’t parallelize)
• Naturally parallel
  • SVD
  • Random Forests
• Estimators and Ensembles
  • Bootstrapping
  • Localizing
• Advanced Parallelization
  • Linear models with SGD
  • Neural networks
STEP 3: STRUCTURED MODEL (BAGGING)
• Single R model run many times over samples and aggregated
    m <- C5.0(status ~ …)
Bagging a Model
Mapper 1:
  Define n reducer keys
  Send any record to reducer i with probability p
Reducer 1:
  Key: Id of sample
  Value: List of records
  Perform analysis over records
Reducer 2:
  Key: One
  Value: List of models
  Aggregate the models (e.g. average)
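The bagging flow above can be simulated on one machine to see how the pieces fit. The sketch below is a minimal illustration, not the deck's implementation: it swaps the C5.0 tree for a simple least-squares line so that "average the models" has an obvious meaning, and the function names, seed, and toy data are all assumptions.

```python
import random
from collections import defaultdict

def mapper(records, n_samples, p, rng):
    """Mapper 1: each record joins each of the n samples with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record

def fit_line(records):
    """Reducer 1: least-squares fit of y = slope * x + intercept on one sample."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def aggregate(models):
    """Reducer 2: average the per-sample coefficients."""
    slopes, intercepts = zip(*models)
    return sum(slopes) / len(slopes), sum(intercepts) / len(intercepts)

def bag(records, n_samples=10, p=0.5, seed=0):
    rng = random.Random(seed)
    samples = defaultdict(list)  # stands in for the shuffle phase
    for sample_id, record in mapper(records, n_samples, p, rng):
        samples[sample_id].append(record)
    # Skip degenerate samples that cannot determine a line.
    models = [fit_line(rs) for rs in samples.values()
              if len({x for x, _ in rs}) >= 2]
    return aggregate(models)

# Toy data lying exactly on y = 2x + 1, so every sample recovers the same line.
records = [(float(x), 2.0 * x + 1.0) for x in range(200)]
slope, intercept = bag(records)
```

Sampling each record independently with probability p, as the mapper above does, is not the same as a classic bootstrap with replacement, but it needs no coordination between mappers.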
WHERE ARE WE?
• We’ve created a structured model, using Big Data, to flag questions that won’t be closed.
• But we haven’t used unstructured data.
TEXT ANALYSIS
• Is “the big dog” really different from “dog is big”?
• How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?
• Language has lexical and syntactical features
• Different techniques leverage these in different ways
  • Bag of Words: Structure doesn’t matter
  • n-gram: Structure matters (but not that much)
  • Feature Extraction: BACON! BACON! BACON!
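The eggs-and-tofu pair makes the difference between the two representations concrete. In the sketch below (whitespace tokenization and lowercasing are simplifying assumptions), the two sentences have identical bags of words but different bigrams.

```python
from collections import Counter

def bag_of_words(text):
    """Token counts only; word order is discarded."""
    return Counter(text.lower().split())

def ngrams(text, n=2):
    """Counts of n consecutive tokens; local word order is kept."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)   # same words, same counts
same_bigrams = ngrams(a) == ngrams(b)           # ("like", "eggs") vs ("hate", "eggs")
```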
STEP 4: UNSTRUCTURED MODEL
• Similar to Hadoop’s Word Count
• Create counts for token/category pairs
• Use counts to calculate Information Gain
Information Gain
MR Job 1:
  Calculate information gain (IG) for all tokens.
MR Job 2:
  Select tokens with largest IG.
  Create structured data for record, tokens:
  question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3:
  Build a classifier over the newly structured data (prior slides)
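The first two jobs can be illustrated on a handful of documents. The sketch below assumes the usual entropy-based definition of information gain for a token's presence or absence; the documents, tokens, and category labels are made up for the example.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a collection of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, token):
    """IG of a token's presence/absence; docs is a list of (token_set, category)."""
    labels = Counter(cat for _, cat in docs)
    with_token = Counter(cat for toks, cat in docs if token in toks)
    without_token = labels - with_token  # token/category counts by subtraction
    n = len(docs)
    conditional = sum(
        sum(part.values()) / n * entropy(part.values())
        for part in (with_token, without_token) if part
    )
    return entropy(labels.values()) - conditional

# Made-up documents, for illustration only.
docs = [
    ({"crash", "java"}, "open"),
    ({"homework", "plz"}, "closed"),
    ({"java", "plz"}, "closed"),
    ({"crash", "python"}, "open"),
]

# Job 1: score every token.  Job 2: keep the best ones and build 0/1 rows.
vocab = sorted({tok for toks, _ in docs for tok in toks})
ranked = sorted(vocab, key=lambda tok: information_gain(docs, tok), reverse=True)
selected = ranked[:2]
rows = [[int(tok in toks) for tok in selected] for toks, _ in docs]
```

Each entry in rows has the same shape as the "question #4 | 0 | 1 | ..." record above: one indicator per selected token, ready for the structured classifier.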
WHERE ARE WE?
• We’ve created two models: one structured, one unstructured.
• But they don’t work together.
STEP 5: ENSEMBLE MODEL
• Join many models together by using their output as input to an ensemble model.
• Best when models perform differently
• Exploit differences with nonlinearities, like interaction effects.
Ensembling
Mapper 1:
  Load multiple models
  Score the models per record and output
Reducer 1:
  Key: Id of record
  Value: List of model outputs
  Join model outputs to make new records
MR Job 2:
  Build a model over the output data as if it were raw data.
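The ensembling flow can also be walked through locally. In the sketch below, the two base models, their thresholds, and the sample records are hypothetical stand-ins for the structured and text models. The second-stage model is a simple lookup table over combinations of base outputs, chosen as an assumption because it can capture interaction effects that a plain average would miss.

```python
from collections import Counter, defaultdict

# Hypothetical base models: each maps a record to a 0/1 prediction.
def structured_model(record):
    return int(record["reputation"] >= 100)

def text_model(record):
    return int("plz" not in record["body"])

MODELS = {"structured": structured_model, "text": text_model}

def mapper(records):
    """Mapper 1: score every model on every record, keyed by record id."""
    for record in records:
        for name, model in MODELS.items():
            yield record["id"], (name, model(record))

def reducer(pairs):
    """Reducer 1: join the model outputs for each record id into a new record."""
    joined = defaultdict(dict)
    for record_id, (name, score) in pairs:
        joined[record_id][name] = score
    return dict(joined)

def fit_meta_model(joined, labels):
    """MR Job 2: learn the majority label for each combination of outputs."""
    votes = defaultdict(Counter)
    for record_id, outputs in joined.items():
        key = tuple(outputs[name] for name in sorted(MODELS))
        votes[key][labels[record_id]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

# Made-up records and labels, for illustration only.
records = [
    {"id": 1, "reputation": 500, "body": "stack trace attached"},
    {"id": 2, "reputation": 5, "body": "fix my code plz"},
    {"id": 3, "reputation": 5, "body": "minimal example included"},
    {"id": 4, "reputation": 900, "body": "do my homework plz"},
]
labels = {1: "open", 2: "closed", 3: "open", 4: "closed"}

joined = reducer(mapper(records))
meta_model = fit_meta_model(joined, labels)
```

Here the keys of meta_model are (structured, text) output pairs, so the second stage treats the base models' outputs exactly as if they were raw features.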
WHERE ARE WE?
• We’ve created two models: one structured, one unstructured,
• and have ensembled them to create a single, powerful model
• and solve a practical business problem.
HOW DID WE GET HERE?
• This required simple infrastructure,
• a blend of analysis and scripting skills,
• an understanding of BIG data science techniques,
• but not a team of PhDs or a billion dollars.
Questions?
www.thinkbiganalytics.com
@danmallinger

 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

BIG Data Science: A Path Forward

  • 1. June 2013 BIG DATA SCIENCE: A PATH FORWARD
  • 2. CONFIDENTIAL | 2 linkedin.com/in/danmallinger/ | @danmallinger | www.thinkbiganalytics.com. Data Science Lead @ Think Big; Product/Brand Obsessive; Teacher; Occasional Engineer.
  • 3. TODAY: A high-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
  • 4. INFRASTRUCTURE, TALENT, & CAPABILITIES: Understand our organizational needs for data science. Infrastructure: technological tools and platforms (Hadoop, NoSQL, Analytics, SQL/MPP, Real Time). Talent: staff hired and trained (Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math). Capabilities: data science techniques utilized (Visualization, Clustering, Categorization, Continuous Models, Text Analysis).
  • 5. ANALYTICS TOOLS: Boxed solutions: Mahout & Platform. Toolkits: RHadoop, Scikit, etc. You will need toolkits to solve unique problems, but smart techniques make that easier. Boxed solutions are limited, but can be a good source of early velocity.
  • 6. DATA: Presenter: gigabytes from Stack Overflow; questions from users, with metadata; users have reputations; questions are open or closed. Audience: follow along, thinking about your data, to learn in a familiar context and plan.
  • 7. STEP 1: EXPLORE: select count(1) as total, sum(has_code), avg(body_count), stddev_samp(body_count), corr(reputation, owner_questions), histogram_numeric(body_count, 10) from questions; Figures: Patterns through Hive; Patterns through Tableau.
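For readers without a Hive cluster at hand, the exploration query on slide 7 can be approximated locally. The sketch below is only an illustration: the sample rows are made up, the column order is an assumption, and the equal-width histogram is a simplification (Hive's `histogram_numeric` builds an approximate histogram with non-uniformly spaced bins).

```python
import math
from statistics import mean, stdev

# Hypothetical sample rows: (has_code, body_count, reputation, owner_questions)
rows = [
    (1, 120, 50, 3),
    (0, 40, 10, 1),
    (1, 300, 900, 20),
    (0, 80, 15, 2),
    (1, 200, 400, 9),
]

def pearson(xs, ys):
    """Pearson correlation coefficient, the statistic corr() reports."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def equal_width_histogram(values, bins):
    """Count values per equal-width bin (assumes values are not all equal)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin rather than overflowing.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

body = [r[1] for r in rows]
summary = {
    "total": len(rows),                        # count(1)
    "with_code": sum(r[0] for r in rows),      # sum(has_code)
    "avg_body": mean(body),                    # avg(body_count)
    "sd_body": stdev(body),                    # stddev_samp(body_count)
    "corr": pearson([r[2] for r in rows], [r[3] for r in rows]),
    "hist": equal_width_histogram(body, 4),
}
```

On a sample this small the numbers mean little; the point is that each aggregate in the query maps to a one-line computation that can be checked by hand before running at scale.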
  • 8. STEP 2: FEATURE BUILDING: Summaries of unstructured data; time-since metrics: select transform(…) using 'python …'. Clustering: browsing cohorts: /bin/mahout canopy. SQL windowing: cross-record features.
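A "time-since" metric is a typical cross-record feature: for each question, how long since the same user last asked one. The sketch below shows the idea in plain Python; the record layout and the function name `time_since_previous` are assumptions for illustration, not the script elided on the slide.

```python
from collections import defaultdict

# Hypothetical records: (question_id, owner_id, posted_at as Unix seconds)
questions = [
    (1, "alice", 1000),
    (2, "bob", 1500),
    (3, "alice", 4600),
    (4, "alice", 4900),
    (5, "bob", 9000),
]

def time_since_previous(records):
    """Return {question_id: seconds since the owner's previous question}.

    The first question of each owner has no predecessor and maps to None,
    much like a SQL LAG() window function returning NULL.
    """
    by_owner = defaultdict(list)
    for qid, owner, ts in records:
        by_owner[owner].append((ts, qid))

    features = {}
    for items in by_owner.values():
        items.sort()  # order each owner's questions by timestamp
        prev_ts = None
        for ts, qid in items:
            features[qid] = None if prev_ts is None else ts - prev_ts
            prev_ts = ts
    return features

features = time_since_previous(questions)
```

Grouping by owner and sorting by time is the same partition-and-order step a SQL window clause performs, which is why the slide lists windowing alongside scripted transforms.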
  • 9. PARALLEL MODELS IN HADOOP: Sample (don't parallelize); Naturally parallel; SVD; Random Forests; Estimators and Ensembles; Bootstrapping; Localizing; Advanced Parallelization; Linear models with SGD; Neural networks.
  • 10. STEP 3: STRUCTURED MODEL (BAGGING): A single R model, run many times over samples and aggregated: m <- C5.0(status ~ …). Bagging a Model: Mapper 1: define n reducer keys; send any record to reducer i with probability p. Reducer 1: key: id of sample; value: list of records; perform analysis over records. Reducer 2: key: one; value: list of models; aggregate the models (e.g. average).
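The bagging recipe on slide 10 can be simulated in a single process to see how the pieces fit. This is a minimal sketch under stated assumptions: a least-squares line stands in for the C5.0 model, the shuffle phase is a dictionary, and the names `mapper`, `fit_line`, and `bag` are hypothetical.

```python
import random
from collections import defaultdict

def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each of n sample keys with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record

def fit_line(records):
    """Reducer 1 analysis: least-squares fit of y = slope * x + intercept.

    A stand-in for the C5.0 model on the slide, chosen because its
    coefficients are easy to average.
    """
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bag(records, n_samples=10, p=0.5, seed=0):
    rng = random.Random(seed)
    groups = defaultdict(list)  # simulated shuffle: group records by sample id
    for sample_id, record in mapper(records, n_samples, p, rng):
        groups[sample_id].append(record)
    # Reducer 1: one model per sample; skip samples too small to fit a line.
    models = [fit_line(g) for g in groups.values() if len({x for x, _ in g}) >= 2]
    # Reducer 2: aggregate the models by averaging their coefficients.
    slope = sum(m[0] for m in models) / len(models)
    intercept = sum(m[1] for m in models) / len(models)
    return slope, intercept, len(models)

# Hypothetical noise-free data on the line y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(50)]
slope, intercept, n_models = bag(data)
```

Averaging coefficients only makes sense for models that share a parameterization; for tree models such as C5.0, aggregation would more plausibly average or vote over predictions.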
  • 11. WHERE ARE WE? We've created a structured model to flag questions that won't be closed, using Big Data. But we haven't used unstructured data.
  • 12. TEXT ANALYSIS: Is "the big dog" really different from "dog is big"? How about "I like eggs but hate tofu" and "I hate eggs but like tofu"? Language has lexical and syntactical features. Different techniques leverage these in different ways. Bag of Words: structure doesn't matter. n-gram: structure matters (but not that much). Feature Extraction: BACON! BACON! BACON!
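The eggs-and-tofu pair from slide 12 makes the difference between the two representations concrete. A small sketch, assuming whitespace tokenization and lowercase folding (both simplifications):

```python
from collections import Counter

def tokens(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()

def bag_of_words(text):
    """Token counts with word order discarded."""
    return Counter(tokens(text))

def ngrams(text, n=2):
    """Counts of consecutive n-token windows, which keep local word order."""
    words = tokens(text)
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)   # True: same words, same counts
same_bigrams = ngrams(a) == ngrams(b)           # False: "like eggs" vs "hate eggs"
```

The two sentences have opposite meanings yet identical bags of words; bigrams separate them because they retain which verb sits next to which noun.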
  • 13. STEP 4: UNSTRUCTURED MODEL: Information Gain: similar to Hadoop's Word Count; create counts for token/category pairs; use counts to calculate Information Gain. MR Job 1: Calculate information gain (IG) for all tokens. MR Job 2: Select tokens with largest IG. Create structured data for record, tokens: question #4 | 0 | 1 | 0 | 1 | 1. MR Job 3: Build a classifier over the newly structured data (prior slides).
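The first two jobs on slide 13 can be sketched in a few lines of Python. The four documents and their labels are made up, presence or absence of a token per document is assumed as the feature, and ties in information gain are broken alphabetically purely to keep the output deterministic.

```python
import math
from collections import Counter

# Hypothetical labeled questions: (text, category)
docs = [
    ("parse json in python", "open"),
    ("do my homework", "closed"),
    ("parse xml in java", "open"),
    ("do my homework in java", "closed"),
]

def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs):
    """Job 1: IG(token) = H(class) - H(class | token present or absent)."""
    labels = sorted({label for _, label in docs})
    class_counts = Counter(label for _, label in docs)
    pair_counts = Counter()  # document counts for (token, category) pairs
    for text, label in docs:
        for token in set(text.split()):
            pair_counts[(token, label)] += 1

    base = entropy([class_counts[l] for l in labels])
    n = len(docs)
    gains = {}
    for token in {t for t, _ in pair_counts}:
        present = [pair_counts[(token, l)] for l in labels]
        absent = [class_counts[l] - c for l, c in zip(labels, present)]
        conditional = (sum(present) / n) * entropy(present) \
            + (sum(absent) / n) * entropy(absent)
        gains[token] = base - conditional
    return gains

gains = information_gain(docs)

# Job 2: keep the highest-IG tokens and turn each record into a 0/1 vector.
selected = sorted(gains, key=lambda t: (-round(gains[t], 9), t))[:4]
vectors = [[int(tok in text.split()) for tok in selected] for text, _ in docs]
```

Each row of `vectors` has the same shape as the `question #4 | 0 | 1 | 0 | 1 | 1` example on the slide, so a structured classifier can be trained on it directly.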
  • 14. WHERE ARE WE? We've created two models: one structured, one unstructured. But they don't work together.
  • 15. STEP 5: ENSEMBLE MODEL: Join many models together by using their output as input to an ensemble model. Best when models perform differently: exploit differences with nonlinearities, like interaction effects. Ensembling: Mapper 1: load multiple models; score the models per record and output. Reducer 1: key: id of record; value: list of model outputs; join model outputs to make new records. MR Job 2: Build a model over the output data as if it was raw data.
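The ensembling flow on slide 15 can be sketched in one process as follows. Everything named here is hypothetical: the two scoring functions are toy stand-ins for the structured and text models, and the second-stage "model" is a simple majority-label lookup table, chosen because it captures an interaction between the two outputs.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for the structured and unstructured models.
def structured_model(record):
    return 1 if record["reputation"] >= 100 else 0

def text_model(record):
    return 0 if "homework" in record["body"] else 1

MODELS = {"structured": structured_model, "text": text_model}

def mapper(records):
    """Mapper 1: score every model per record; emit (record_id, (name, score))."""
    for record in records:
        for name, model in MODELS.items():
            yield record["id"], (name, model(record))

def reducer(pairs):
    """Reducer 1: join the model outputs by record id into one new row each."""
    rows = defaultdict(dict)
    for record_id, (name, score) in pairs:
        rows[record_id][name] = score
    return rows

def fit_table(rows, labels):
    """Job 2: learn the majority label for each combination of model outputs."""
    votes = defaultdict(Counter)
    for record_id, row in rows.items():
        votes[(row["structured"], row["text"])][labels[record_id]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

# Made-up training records.
records = [
    {"id": 1, "reputation": 500, "body": "parse json", "label": "open"},
    {"id": 2, "reputation": 5, "body": "do my homework", "label": "closed"},
    {"id": 3, "reputation": 300, "body": "homework help", "label": "closed"},
    {"id": 4, "reputation": 20, "body": "sort a list", "label": "closed"},
]
labels = {r["id"]: r["label"] for r in records}
ensemble = fit_table(reducer(mapper(records)), labels)
```

In this toy data the ensemble predicts "open" only when both base models agree, which neither model expresses on its own; that is the kind of interaction effect the slide refers to.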
  • 16. WHERE ARE WE? We've created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.
  • 17. HOW DID WE GET HERE? This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.