Advanced Analytics in Hadoop
Thomas W. Dinsmore
Advanced Analytics in Hadoop
• Use cases
• Architectures
• Current Options:
• Open Source
• Commercial
Analytics
Ad Hoc Queries
Reports
Data Access
Visualization
Data Manipulation
OLAP/ROLAP etc
Advanced Discovery
Predictive Analytics
Optimization
Simulation
Text Analytics
Geospatial Analytics
Econometrics
Dashboards
Scorecards
Streaming Analytics
Computational Complexity
Advanced Analytics
Advanced Discovery
Predictive Analytics
Optimization
Simulation
Text Analytics
Geospatial Analytics
Econometrics
Streaming Analytics
Computational Complexity
Advanced Analytics
Advanced Discovery
Predictive Analytics
Optimization
Simulation
Text Analytics
Geospatial Analytics
Econometrics
Streaming Analytics
Feature Extraction
Dimension Reduction
Analytics Platform
For some use cases, you must use all of the data.
• Anomaly Detection
• Affinity Analysis
• Clustering
• Social Network Analysis
• Collaborative Filtering
For others, using all of the data is worth it.
• Catastrophic Risk Modeling
• Modeling with Fine-grained Behavioral Data
Your Options (2013)
1. Apache Mahout
2. Code it yourself.
3. …
Architecture
Legacy Alongside
HDFS HDFS HDFS HDFS HDFS HDFS
Data
Legacy Pass-Through
HDFS HDFS HDFS HDFS HDFS HDFS
MapReduce
Data
MapReduce Push-Down
HDFS HDFS HDFS HDFS HDFS HDFS
MapReduce
Advantages
• Co-exists w/ other applications
• Integrated workload management
• Simplified administration
Disadvantages
• MapReduce latency
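As a rough illustration of the push-down idea, the sketch below computes a mean and variance by summarizing each data partition in a map step and merging the small summaries in a reduce step, so only a few numbers leave each node. It is plain Python, not Hadoop code: the `partitions` list is a hypothetical stand-in for blocks stored on separate HDFS nodes, and the function names are made up for this example.

```python
from functools import reduce

# Hypothetical stand-in for data blocks stored on separate HDFS nodes.
partitions = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0],
]


def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(parts):
    """Run the map step per partition, then reduce to global statistics."""
    n, total, total_sq = reduce(combine, map(map_partition, parts))
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance


mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

The raw rows never move between partitions; only the three-number summaries do. Each such pass still pays the job start-up cost that the slide lists as MapReduce latency.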
Co-Located In-Memory (Asymmetric)
[Diagram: YARN with six nodes, each labeled HDFS and MapReduce]
Advantages
• Easy to adapt legacy apps
• Isolates analytic workload
Disadvantages
• Data moves within the cluster
• Requires YARN
Co-Located In-Memory (Symmetric)
[Diagram: YARN with seven nodes, each labeled HDFS and MapReduce]
Advantages
• Lowest latency
Disadvantages
• Upgrade every node
• Requires YARN
Summary: Architecture
• MapReduce Push-Down is current “champion”
• Stable
• Co-exists well with Hadoop ecosystem
• MR 1.0 penalizes performance
• Required: persistent in-memory processing
• YARN enables co-location
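One way to see why persistent in-memory processing matters for iterative analytics is to count storage reads. The toy sketch below runs the same iterative estimate two ways: re-reading the data on every iteration (as a chain of MapReduce jobs would) and reading it once into memory. `CountingSource`, `refine` and both loop functions are hypothetical names invented for this illustration; no Hadoop or Spark API is involved.

```python
class CountingSource:
    """Hypothetical storage layer that counts full reads of the dataset."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def refine(estimate, data, step=0.5):
    """One iteration of a toy algorithm: move the estimate toward the mean."""
    mean = sum(data) / len(data)
    return estimate + step * (mean - estimate)


def run_reloading(source, iterations):
    """MapReduce-style loop: every iteration reads the data from storage."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, source.load())
    return estimate


def run_in_memory(source, iterations):
    """In-memory loop: read once, then iterate on the cached copy."""
    cached = source.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, cached)
    return estimate


disk = CountingSource([2.0, 4.0, 6.0])
result_reloading = run_reloading(disk, iterations=10)
reads_reloading = disk.reads

memory = CountingSource([2.0, 4.0, 6.0])
result_in_memory = run_in_memory(memory, iterations=10)
reads_in_memory = memory.reads

print(reads_reloading, reads_in_memory)
```

Both loops return the same estimate; the difference is ten reads of the dataset against one.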
Open Source Projects
Apache Mahout
• Apache incubator project (2007)
• Machine learning library
• Included in most distributions
• Thin acceptance, few contributors
• Diverse architecture
• Single-node
• MapReduce
• New algos run on Spark
• Recently cleaned up
Apache Giraph
• Apache top-level project
• Runs in MapReduce
• Dedicated graph engine
• Used by Facebook, few others
• Dead in the water
• No presence in leading distros
• No significant commercial support
• No releases in 13 months
• No recent code commits on Git
GraphLab
• Carnegie Mellon project (2009)
• Distributed in-memory engine:
• Primarily graph analysis
• Selected machine learning algos
• Interface from Java, JavaScript, Python
• GraphLab Inc provides commercial support (2013, $6.75MM)
• Independent distribution, or through Pivotal
0xdata H2O
• Vendor-driven open source project
• 0xdata sells support, customization
• Distributed in-memory prediction engine
• Multiple deployment options:
• Standalone (with HDFS)
• Over YARN
• In MapReduce
• Claims 2,000+ users
• 4 public references
• Used by a leading P&C insurer
• Java, R, Python and Scala interfaces
Apache Spark
• Top-level Apache project (2/14)
• Release 1.0 (5/14)
• Distributed in-memory analytics
• Machine learning
• Graph analytics
• Streaming analytics
• Fast SQL
• Compatible with Hadoop storage
• Integrated with YARN
• Scala, Python, Java interfaces (+SparkR)
• Growing ecosystem
• Supported in leading Hadoop distributions
Apache Spark: Hadoop Distributions
Spark Components: MLlib, GraphX, Spark Streaming, Spark SQL, Shark
Cloudera Yes Yes Yes Yes (Impala)
Hortonworks Yes (Storm) (Stinger)
MapR Yes Yes Yes Yes Yes
Pivotal Yes Yes Yes Yes Yes
IBM BigInsights
Summary: Open Source Projects
| | 0xdata H2O 2.2 | Apache Giraph 1.1 | Apache Mahout 0.9 | Apache Spark 1.0 | GraphLab 2.2 |
| Status | Independent | Top-Level | Top-Level | Top-Level | Independent |
| Architecture | Co-Located Memory-Centric | MapReduce | MapReduce | Co-Located Memory-Centric | Co-Located Memory-Centric |
| Interfaces | Java, Python, R, Scala | Java | Java | Java, Python, Scala (SparkR) | Python |
| Commercial Support | 0xdata | - | - | Databricks | GraphLab, Inc. |
| Distribution | Independent | Independent | All Hadoop Distributions | Cloudera, Hortonworks, MapR, Pivotal | Independent |
Analytic Features
Columns: 0xdata H2O 2.2, Apache Giraph 1.1, Apache Mahout 0.9, Apache Spark 1.0, GraphLab 2.2
Prediction +++ + +++
Dimension Reduction + +++ + +
Clustering + +++ + +++
Collaborative Filtering +++ + +++
Text Analytics +++ +++
Matrix Operations + +++ +
Graph Analysis + + +++
Analytic Features: Prediction
Mahout 0.9 Spark 1.0 H2O 2.2
Linear Regression +
Logistic Regression +
Generalized Linear Models +
Naive Bayes + + +
Decision Tree +
Gradient Boosted Trees +
Random Forests + +
Linear Support Vector Machine +
Deep Learning (Backprop MLP) +
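For readers unfamiliar with the techniques in this table, here is a minimal sketch of Naive Bayes. It is a generic categorical Naive Bayes with add-one smoothing written in plain Python, not the implementation found in Mahout, Spark or H2O. The weather-style training rows and the function names are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    vocab = defaultdict(set)  # distinct values seen per feature index
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
            vocab[i].add(value)
    return class_counts, feature_counts, vocab


def predict(model, row):
    """Return the label with the highest log-posterior for one row."""
    class_counts, feature_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = feature_counts[label][i][value] + 1
            denominator = count + len(vocab[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))
```

Training only needs counts, which is one reason the method distributes easily: each node can count its own rows and the counts can be summed afterwards.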
Analytic Features: Dimension Reduction
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
Singular Value Decomposition + +
Lanczos Algorithm + +
Stochastic SVD +
Principal Components Analysis + + +
Analytic Features: Clustering
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
k-Means + + + +
Fuzzy k-Means +
Streaming k-Means +
Spectral Clustering + +
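k-Means is the one clustering method listed for all four projects. The sketch below shows the basic algorithm (Lloyd's iteration) on a single machine in plain Python. It is a generic illustration, not any project's implementation. The sample points are invented, and using the first k points as starting centroids is a simplification chosen to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments


# Hypothetical data: two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

The distributed versions listed above follow the same two steps: the assignment step runs independently on each data partition, and the update step only needs per-cluster sums and counts.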
Analytic Features: Collaborative Filtering
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
Item-Based +
Matrix Factorization with ALS + + +
Matrix Factorization with ALS, Implicit Feedback +
ALS with Parallel Coordinate Descent +
Weighted ALS +
Sparse ALS +
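To make the "Item-Based" row concrete, the sketch below scores item-to-item similarity with cosine similarity over user ratings, which is the core of item-based collaborative filtering. It is a generic plain-Python illustration, not the implementation of any project in the table. The ratings, user names and function names are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ann": {"A": 5.0, "B": 4.0},
    "bob": {"A": 4.0, "B": 5.0, "C": 1.0},
    "cat": {"B": 1.0, "C": 5.0},
}


def item_vectors(user_ratings):
    """Invert user -> item ratings into item -> {user: rating} vectors."""
    items = {}
    for user, rated in user_ratings.items():
        for item, rating in rated.items():
            items.setdefault(item, {})[user] = rating
    return items


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[key] * v[key] for key in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(item, user_ratings):
    """Return the other item whose rating vector is closest to `item`."""
    vectors = item_vectors(user_ratings)
    scores = {
        other: cosine(vectors[item], vec)
        for other, vec in vectors.items()
        if other != item
    }
    return max(scores, key=scores.get)


print(most_similar("A", ratings))
```

A recommender would then suggest, for each item a user liked, the most similar items that user has not yet rated.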
Analytic Features: Text Analytics
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
Latent Dirichlet Allocation + +
Frequent Pattern Mining +
Collocations +
Matrix Operations
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
Stochastic Gradient Descent + +
Limited-Memory BFGS +
RowSimilarityJob +
ConcatMatrices +
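Stochastic Gradient Descent underlies many of the prediction methods surveyed earlier. As a minimal sketch, the code below fits a one-variable linear model by updating the weights after every single example. It is plain Python with invented data, a fixed example order and hand-picked hyperparameters, all of which are assumptions for illustration rather than any library's defaults.

```python
def sgd_linear_regression(data, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            error = (w * x + b) - y  # prediction error for this one example
            w -= learning_rate * error * x  # gradient step for the slope
            b -= learning_rate * error      # gradient step for the intercept
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = sgd_linear_regression(data)
print(round(w, 4), round(b, 4))
```

Because each update touches one example, the method suits data that is too large to process in a single batch. Production implementations usually shuffle the examples and decay the learning rate, which this sketch omits.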
Summary: Open Source
• Giraph is toast
• Mahout may be recovering from roadkill status
• GraphLab outperforms Spark GraphX today in graph analytics
• 0xdata H2O outperforms Spark MLlib today in machine learning
• Spark catching up fast
• More resources and distribution
• Integrated platform for ML and graph analysis
Commercial Software
Alpine
• Business user interface
• Collaboration environment
• Broad library of techniques
• Strong cloud offering
• Leverages Hadoop (multiple distros), HAWQ or Pivotal Greenplum
• Push-down MapReduce
• Certified on Spark
• Small but growing customer base
IBM SPSS Analytics Server
• Introduced 2013
• Serves as “back end” for SPSS Modeler
• Uses push-down MR
• Limited analytic feature set
• IBM supports on multiple Hadoop distros
• Customer acceptance unknown
Revolution Analytics ScaleR
• ScaleR library of distributed statistics, machine learning functions
• Tools to distribute arbitrary R functions
• Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC
• Hadoop edition uses MR push-down
• Tools simplify installation in large clusters
• R interface
• Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces
Skytree Server
• Georgia Tech’s FastLab project, repurposed as commercial software
• Distributed machine learning platform
• Very opaque about technical details
• User interface is an API
• Co-located in Hadoop under YARN
• Just certified by Hortonworks
• Customer acceptance unknown
• No new public references in a year
• Used by leading credit card company
SAS High-Performance Analytics
• Distributed in-memory analytics
• Designed to run in special-purpose appliances (2011)
• Repurposed to run in Hadoop (2013)
• Co-exists poorly: cannot run SAS and MapReduce at the same time
• Reads entire dataset into memory
• Uses MPI to communicate among nodes
• Requires upgrades from standard Hadoop infrastructure
• Customer acceptance unknown
• No public references
• Generic success stories missing from Strata presos
SAS LASR Server
• SAS’ “other” distributed in-memory platform
• Back end for several end-user products
• SAS Visual Analytics (2012)
• SAS Visual Statistics (New)
• SAS In-Memory Statistics for Hadoop (New)
• Recently added statistics and machine learning
• Does not read raw HDFS; must be transformed to proprietary SASHDAT
• Like HPA, reads entire dataset into memory.
• 16 Core 256GB node can load 75GB table
• Runs DS2 programs, not Legacy SAS programs
• Fast, but with limited feature set
• SAS claims 1,400 “sites” for Visual Analytics
• Many of those are standalone boxes
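The sizing figure above (a 256GB node loading a 75GB table) implies that well under a third of node memory is available for data. The short calculation below makes that explicit and gives a back-of-the-envelope node count for larger tables. The linear scaling across nodes is an assumption made for this sketch, not a figure from SAS, and the function names are hypothetical.

```python
import math

NODE_MEMORY_GB = 256     # node memory, from the slide
LOADABLE_TABLE_GB = 75   # largest table that node can load, from the slide


def usable_fraction():
    """Share of node memory that ends up holding table data."""
    return LOADABLE_TABLE_GB / NODE_MEMORY_GB


def nodes_needed(table_gb):
    """Rough node count, ASSUMING capacity scales linearly with nodes."""
    return math.ceil(table_gb / LOADABLE_TABLE_GB)


print(round(usable_fraction(), 3))
print(nodes_needed(1000))
```

Under that assumption, a 1TB table would need about 14 such nodes, since the whole dataset must fit in memory at once.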
Summary: Commercial
• Alpine’s interface is compelling to business users
• IBM Analytics Server is a good first release
• RRE ScaleR appeals to R users, plays well in Hadoop sandbox
• Skytree Server: strong in prediction
• SAS: why two competing memory-centric architectures?
Progress
• Spark: blindingly fast maturity
• Rapidly expanding library of analytic features
• Growing developer community, ecosystem
• Commercial: from zero to many
Interesting Questions
• Will Mahout get a second wind?
• Will Spark MLlib displace 0xdata?
• Will Spark GraphX catch up to GraphLab?
• Can Spark Streaming compete with Storm and commercial entrants?
• How quickly will customers adopt memory-centric architecture for analytics?
• What will Alpine and MicroStrategy do with Spark?
• Will IBM distribute Spark in BigInsights?
• When will SAS announce a reference customer for HPA/LASR in Hadoop?
43
Advanced Analytics in Hadoop
Thomas W. Dinsmore
44

Más contenido relacionado

La actualidad más candente

Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Big Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache KylinBig Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache Kylininovex GmbH
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCPrecisely
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBMapR Technologies
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaDataWorks Summit
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...DataWorks Summit/Hadoop Summit
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, HortonworksHortonworks
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkLi Jin
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Databricks
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingPeter Haase
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering PrinciplesXu Jiang
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...Databricks
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureVinod Kumar Vavilapalli
 

La actualidad más candente (20)

Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Big Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache KylinBig Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache Kylin
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDC
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Spark mhug2
Spark mhug2Spark mhug2
Spark mhug2
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DB
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at Alibaba
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering Principles
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
 

Destacado

Distributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningLi Miao
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersRevolution Analytics
 
Advanced analytics
Advanced analyticsAdvanced analytics
Advanced analyticsShankar R
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Databricks
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
The future of business intelligence
The future of business intelligence The future of business intelligence
The future of business intelligence Phocas Software
 
Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
Business Applications of Predictive Modeling at Scale - KDD 2016 TutorialBusiness Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
Business Applications of Predictive Modeling at Scale - KDD 2016 TutorialQiang Zhu
 
IBM SPSS Overview Text Analytics Brief
IBM SPSS Overview Text Analytics BriefIBM SPSS Overview Text Analytics Brief
IBM SPSS Overview Text Analytics BriefIan Balina
 
A Practical Guide: Building your Business Intelligence Business Case for 2017
A Practical Guide: Building your Business Intelligence Business Case for 2017A Practical Guide: Building your Business Intelligence Business Case for 2017
A Practical Guide: Building your Business Intelligence Business Case for 2017Sisense
 
What's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSSWhat's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSSVirginia Fernandez
 
White Paper - The Business Case For Business Intelligence
White Paper -  The Business Case For Business IntelligenceWhite Paper -  The Business Case For Business Intelligence
White Paper - The Business Case For Business IntelligenceDavid Walker
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics ArchitectureArvind Sathi
 
Tableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data VisualizationTableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data Visualizationlesterathayde
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 

Destacado (20)

Distributed Processing of Stream Text Mining
Distributed Processing of Stream Text MiningDistributed Processing of Stream Text Mining
Distributed Processing of Stream Text Mining
 
Science in text mining
Science in text miningScience in text mining
Science in text mining
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
 
Advanced analytics
Advanced analyticsAdvanced analytics
Advanced analytics
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 
Get your data analytics strategy right!
Get your data analytics strategy right!Get your data analytics strategy right!
Get your data analytics strategy right!
 
Are API Services Taking Over All the Interesting Data Science Problems?
Are API Services Taking Over All the Interesting Data Science Problems?Are API Services Taking Over All the Interesting Data Science Problems?
Are API Services Taking Over All the Interesting Data Science Problems?
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
The future of business intelligence
The future of business intelligence The future of business intelligence
The future of business intelligence
 
Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
Business Applications of Predictive Modeling at Scale - KDD 2016 TutorialBusiness Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
Business Applications of Predictive Modeling at Scale - KDD 2016 Tutorial
 
IBM SPSS Overview Text Analytics Brief
IBM SPSS Overview Text Analytics BriefIBM SPSS Overview Text Analytics Brief
IBM SPSS Overview Text Analytics Brief
 
A Practical Guide: Building your Business Intelligence Business Case for 2017
A Practical Guide: Building your Business Intelligence Business Case for 2017A Practical Guide: Building your Business Intelligence Business Case for 2017
A Practical Guide: Building your Business Intelligence Business Case for 2017
 
What's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSSWhat's New in Predictive Analytics IBM SPSS
What's New in Predictive Analytics IBM SPSS
 
White Paper - The Business Case For Business Intelligence
White Paper -  The Business Case For Business IntelligenceWhite Paper -  The Business Case For Business Intelligence
White Paper - The Business Case For Business Intelligence
 
SAS Institute: Big data and smarter analytics
SAS Institute: Big data and smarter analyticsSAS Institute: Big data and smarter analytics
SAS Institute: Big data and smarter analytics
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
 
Tableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data VisualizationTableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data Visualization
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Big Data and Advanced Analytics
Big Data and Advanced AnalyticsBig Data and Advanced Analytics
Big Data and Advanced Analytics
 

Similar a Advanced Analytics in Hadoop

Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Thomas W. Dinsmore
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApachePivotalOpenSourceHub
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectHadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectSoftServe
 
Pacemaker hadoop infrastructure and soft serve experience
Pacemaker   hadoop infrastructure and soft serve experiencePacemaker   hadoop infrastructure and soft serve experience
Pacemaker hadoop infrastructure and soft serve experienceVitaliy Bashun
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL ServerŁukasz Grala
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practiceDarko Marjanovic
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014cdmaxime
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)DataPad Inc.
 

Advanced Analytics in Hadoop

  • 1. Advanced Analytics in Hadoop. Thomas W. Dinsmore
  • 2. Advanced Analytics in Hadoop
      • Use cases
      • Architectures
      • Current Options:
          • Open Source
          • Commercial
  • 3. Analytics: Ad Hoc Queries, Reports, Data Access, Visualization, Data Manipulation, OLAP/ROLAP etc., Advanced Discovery, Predictive Analytics, Optimization, Simulation, Text Analytics, Geospatial Analytics, Econometrics, Dashboards, Scorecards, Streaming Analytics; Computational Complexity
  • 4. Advanced Analytics: Advanced Discovery, Predictive Analytics, Optimization, Simulation, Text Analytics, Geospatial Analytics, Econometrics, Streaming Analytics; Computational Complexity
  • 5. Advanced Analytics: Advanced Discovery, Predictive Analytics, Optimization, Simulation, Text Analytics, Geospatial Analytics, Econometrics, Streaming Analytics; Feature Extraction, Dimension Reduction
  • 8. For some use cases, you must use all of the data: Anomaly Detection, Affinity Analysis, Clustering, Social Network Analysis, Collaborative Filtering
  • 9. For others, using all of the data is worth it: Catastrophic Risk Modeling, Modeling with Fine-grained Behavioral Data
  • 10. Your Options (2013)
      1. Apache Mahout
      2. Code it yourself.
      3. …
  • 12. Legacy Alongside (diagram: six HDFS nodes, Data)
  • 13. Legacy Pass-Through (diagram: six HDFS nodes, MapReduce, Data)
  • 14. MapReduce Push-Down (diagram: six HDFS nodes, MapReduce)
      • Advantages:
          • Co-exists with other applications
          • Integrated workload management
          • Simplified administration
      • Disadvantages:
          • MapReduce latency
  • 15. Co-Located In-Memory (Asymmetric) (diagram: YARN; six nodes, each with HDFS and MapReduce)
      • Advantages:
          • Easy to adapt legacy apps
          • Isolates analytic workload
      • Disadvantages:
          • Data moves within the cluster
          • Requires YARN
  • 17. Summary: Architecture
      • MapReduce Push-Down is current “champion”
      • Stable
      • Co-exists well with Hadoop ecosystem
      • MR 1.0 penalizes performance
      • Required: persistent in-memory processing
      • YARN enables co-location
  • 19. Apache Mahout
      • Apache incubator project (2007)
      • Machine learning library
      • Included in most distributions
      • Thin acceptance, few contributors
      • Diverse architecture
          • Single-node
          • MapReduce
          • New algos run on Spark
      • Recently cleaned up
  • 20. Apache Giraph
      • Apache top-level project
      • Runs in MapReduce
      • Dedicated graph engine
      • Used by Facebook, few others
      • Dead in the water
      • No presence in leading distros
      • No significant commercial support
      • No releases in 13 months
      • No recent code commits on Git
  • 21. GraphLab
      • Carnegie Mellon project (2009)
      • Distributed in-memory engine:
          • Primarily graph analysis
          • Selected machine learning algos
      • Interface from Java, JavaScript, Python
      • GraphLab Inc provides commercial support (2013, $6.75MM)
      • Independent distribution, or through Pivotal
  • 22. 0xdata H2O
      • Vendor-driven open source project
      • 0xdata sells support, customization
      • Distributed in-memory prediction engine
      • Multiple deployment options:
          • Standalone (with HDFS)
          • Over YARN
          • In MapReduce
      • Claims 2,000+ users
      • 4 public references
      • Used by a leading P&C insurer
      • Java, R, Python and Scala interfaces
  • 23. Apache Spark
      • Top-level Apache project (2/14)
      • Release 1.0 (5/14)
      • Distributed in-memory analytics
          • Machine learning
          • Graph analytics
          • Streaming analytics
          • Fast SQL
      • Compatible with Hadoop storage
      • Integrated with YARN
      • Scala, Python, Java interfaces (+SparkR)
      • Growing ecosystem
      • Supported in leading Hadoop distributions
  • 24. Apache Spark: Hadoop Distributions
      • Spark components (columns): MLlib, GraphX, Spark Streaming, Spark SQL, Shark
      • Entries per distribution, as listed left to right (empty cells are not marked):
          • Cloudera: Yes, Yes, Yes, Yes, (Impala)
          • Hortonworks: Yes, (Storm), (Stinger)
          • MapR: Yes, Yes, Yes, Yes, Yes
          • Pivotal: Yes, Yes, Yes, Yes, Yes
          • IBM BigInsights: no entries
  • 25. Summary: Open Source Projects
      • 0xdata H2O 2.2: status Independent; architecture Co-Located Memory-Centric; interfaces Java, Python, R, Scala; commercial support 0xdata; distribution Independent
      • Apache Giraph 1.1: status Top-Level; architecture MapReduce; interfaces Java; distribution Independent
      • Apache Mahout 0.9: status Top-Level; architecture MapReduce; interfaces Java; distribution All Hadoop Distributions
      • Apache Spark 1.0: status Top-Level; architecture Co-Located Memory-Centric; interfaces Java, Python, Scala (SparkR); commercial support Databricks; distribution Cloudera, Hortonworks, MapR, Pivotal
      • GraphLab 2.2: status Independent; architecture Co-Located Memory-Centric; interfaces Python; commercial support GraphLab, Inc.; distribution Independent
  • 26. Analytic Features
      • Columns: 0xdata H2O 2.2, Apache Giraph 1.1, Apache Mahout 0.9, Apache Spark 1.0, GraphLab 2.2
      • Ratings per row, as listed left to right (empty cells are not marked):
          • Prediction: +++, +, +++
          • Dimension Reduction: +, +++, +, +
          • Clustering: +, +++, +, +++
          • Collaborative Filtering: +++, +, +++
          • Text Analytics: +++, +++
          • Matrix Operations: +, +++, +
          • Graph Analysis: +, +, +++
  • 27. Analytic Features: Prediction
      • Columns: Mahout 0.9, Spark 1.0, H2O 2.2
      • Marks per row, as listed left to right (empty cells are not marked):
          • Linear Regression: +
          • Logistic Regression: +
          • Generalized Linear Models: +
          • Naive Bayes: +, +, +
          • Decision Tree: +
          • Gradient Boosted Trees: +
          • Random Forests: +, +
          • Linear Support Vector Machine: +
          • Deep Learning (Backprop MLP): +
  • 28. Analytic Features: Dimension Reduction
      • Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2
      • Marks per row, as listed left to right (empty cells are not marked):
          • Singular Value Decomposition: +, +
          • Lanczos Algorithm: +, +
          • Stochastic SVD: +
          • Principal Components Analysis: +, +, +
  • 29. Analytic Features: Clustering
      • Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2
      • Marks per row, as listed left to right (empty cells are not marked):
          • k-Means: +, +, +, +
          • Fuzzy k-Means: +
          • Streaming k-Means: +
          • Spectral Clustering: +, +
  • 30. Analytic Features: Collaborative Filtering
      • Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2
      • Marks per row, as listed left to right (empty cells are not marked):
          • Item-Based: +
          • Matrix Factorization with ALS: +, +, +
          • Matrix Factorization with ALS, Implicit Feedback: +
          • ALS with Parallel Coordinate Descent: +
          • Weighted ALS: +
          • Sparse ALS: +
  • 31. Analytic Features: Text Analytics
      • Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2
      • Marks per row, as listed left to right (empty cells are not marked):
          • Latent Dirichlet Allocation: +, +
          • Frequent Pattern Mining: +
          • Collocations: +
  • 32. Matrix Operations
      • Columns: Mahout 0.9, Spark 1.0, H2O 2.2, GraphLab 2.2
      • Marks per row, as listed left to right (empty cells are not marked):
          • Stochastic Gradient Descent: +, +
          • Limited-Memory BFGS: +
          • RowSimilarityJob: +
          • ConcatMatrices: +
  • 33. Summary: Open Source
      • Giraph is toast
      • Mahout may be recovering from roadkill status
      • GraphLab outperforms Spark GraphX today in graph analytics
      • 0xdata H2O outperforms Spark MLlib today in machine learning
      • Spark catching up fast
      • More resources and distribution
      • Integrated platform for ML and graph analysis
  • 35. Alpine
      • Business user interface
      • Collaboration environment
      • Broad library of techniques
      • Strong cloud offering
      • Leverages Hadoop (multiple distros), HAWQ or Pivotal Greenplum
      • Push-down MapReduce
      • Certified on Spark
      • Small but growing customer base
  • 36. IBM SPSS Analytics Server
      • Introduced 2013
      • Serves as “back end” for SPSS Modeler
      • Uses push-down MR
      • Limited analytic feature set
      • IBM supports on multiple Hadoop distros
      • Customer acceptance unknown
  • 37. Revolution Analytics ScaleR
      • ScaleR library of distributed statistics, machine learning functions
      • Tools to distribute arbitrary R functions
      • Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC
      • Hadoop edition uses MR push-down
      • Tools simplify installation in large clusters
      • R interface
      • Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces
  • 38. Skytree Server
      • Georgia Tech’s FastLab project, repurposed as commercial software
      • Distributed machine learning platform
      • Very opaque about technical details
      • User interface is an API
      • Co-located in Hadoop under YARN
      • Just certified by Hortonworks
      • Customer acceptance unknown
      • No new public references in a year
      • Used by leading credit card company
  • 39. SAS High-Performance Analytics
      • Distributed in-memory analytics
      • Designed to run in special-purpose appliances (2011)
      • Repurposed to run in Hadoop (2013)
      • Co-exists poorly: cannot run SAS and MapReduce at the same time
      • Reads entire dataset into memory
      • Uses MPI to communicate among nodes
      • Requires upgrades from standard Hadoop infrastructure
      • Customer acceptance unknown
      • No public references
      • Generic success stories missing from Strata presos
  • 40. SAS LASR Server
      • SAS’ “other” distributed in-memory platform
      • Back end for several end-user products
          • SAS Visual Analytics (2012)
          • SAS Visual Statistics (New)
          • SAS In-Memory Statistics for Hadoop (New)
      • Recently added statistics and machine learning
      • Does not read raw HDFS; data must be transformed to proprietary SASHDAT
      • Like HPA, reads entire dataset into memory
      • 16-core, 256GB node can load 75GB table
      • Runs DS2 programs, not legacy SAS programs
      • Fast, but with limited feature set
      • SAS claims 1,400 “sites” for Visual Analytics
      • Many of those are standalone boxes
  • 41. Summary: Commercial
      • Alpine’s interface is compelling to business users
      • IBM Analytics Server is a good first release
      • RRE ScaleR appeals to R users, plays well in Hadoop sandbox
      • Skytree Server: strong in prediction
      • SAS: why two competing memory-centric architectures?
  • 42. Progress
      • Spark: blindingly fast maturity
      • Rapidly expanding library of analytic features
      • Growing developer community, ecosystem
      • Commercial: from zero to many
  • 43. Interesting Questions
      • Will Mahout get a second wind?
      • Will Spark MLlib displace 0xdata?
      • Will Spark GraphX catch up to GraphLab?
      • Can Spark Streaming compete with Storm and commercial entrants?
      • How quickly will customers adopt memory-centric architecture for analytics?
      • What will Alpine and MicroStrategy do with Spark?
      • Will IBM distribute Spark in BigInsights?
      • When will SAS announce a reference customer for HPA/LASR in Hadoop?
  • 44. Advanced Analytics in Hadoop. Thomas W. Dinsmore
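
The MapReduce push-down architecture on slide 14 keeps the data where it is stored and sends the computation to it. As a rough illustration of that idea (not code from the deck or from any product it names), the sketch below uses plain Python lists to stand in for HDFS blocks. The function names `map_partition`, `combine`, and `mean_and_variance` are hypothetical, and a real job would run the map step on each data node rather than in one process.

```python
from functools import reduce

def map_partition(rows):
    """Map step: reduce one partition to (count, sum, sum of squares)."""
    return (len(rows), sum(rows), sum(x * x for x in rows))

def combine(a, b):
    """Reduce step: merge two partial summaries element by element."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """Combine per-partition summaries into a global mean and population variance."""
    summaries = (map_partition(p) for p in partitions)
    n, total, total_sq = reduce(combine, summaries, (0, 0.0, 0.0))
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance

# Three in-memory lists play the role of blocks stored on separate nodes (an assumption).
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

Only the three-number summaries cross partition boundaries, which is why this pattern co-exists well with other workloads. The cost noted on the slide still applies: an iterative algorithm would pay the MapReduce start-up latency on every pass.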