In-Hadoop, In-Database,
and In-Memory Processing
for Predictive Analytics
Predict to Act.
Stop Looking at the Rear View Mirror
2
From BI…
Business
Intelligence can
only show you what
has already
happened.
This is like driving a
car by only looking
into the rear view
mirror.
Do you really want
to drive your
business like that?
Stop Looking at the Rear View Mirror
3
CONTINUED
…to…
Data Discovery and
Real-time Analytics
offer a view
through the
windscreen.
You can see what is
happening right
now but you still
cannot identify
upcoming opportunities
or threats.
Stop Looking at the Rear View Mirror
4
CONTINUED
Predictions.
Predictive Analytics
delivers future
outcomes and
provides a look-
ahead view.
This is like
projecting what lies
beyond the next
curve onto your
windscreen.
Predictive Insights
will show you that
there is an accident.
Prediction-based
actions will then
automatically slow
the car down.
Warning:
Accident
Ahead!
Value is Higher for Prediction-based Actions
5
Inform.
Aggregation of micro-predictions will show you what can be expected
Allow for better decision making
Useful for supporting strategic decisions
Will not disrupt business processes
Limited & unspecified total value
Operationalize.
Millions of micro-predictions
Each Predictive Action is embedded into your business process
You will know how often you will be right and what your total gain
will be
Brings your business processes to a new, pro-active level
Huge total value
Predictive
Insights
Predactions*
*Predactions are Prediction-based Actions. You will predict what is going to happen. And then you will predact on this.
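The statement that you will know how often you will be right and what your total gain will be comes down to simple expected-value arithmetic. The sketch below is illustrative only: the function name and all numbers in the usage note (accuracy, gain per correct action, cost per wrong action, number of micro-predictions) are hypothetical and do not come from this deck.

```python
def expected_total_gain(n_predictions, accuracy, gain_if_right, cost_if_wrong):
    """Expected total gain of acting on n micro-predictions.

    accuracy      -- fraction of predictions that turn out right (0..1)
    gain_if_right -- value gained per action taken on a correct prediction
    cost_if_wrong -- value lost per action taken on a wrong prediction
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Expected value of a single prediction-based action.
    per_action = accuracy * gain_if_right - (1.0 - accuracy) * cost_if_wrong
    # Expectation is linear, so the total is just n times the per-action value.
    return n_predictions * per_action
```

For example, one million actions at an assumed 80% accuracy, a gain of 2.0 per correct action and a cost of 5.0 per wrong one give an expected total of about 600,000 value units.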
Science?
• Predictive analytics is complex
• Hadoop is complex
• Proposed solution: Let’s create more data scientists!
• But this approach has flaws:
– Scientists are supposed to create new things, yet data scientists spend
95% of their time integrating and transforming data.
– A shortage of data scientists is predicted (McKinsey report)
– Being a hardcore programmer, having a PhD in statistics, and being
able to understand business problems is a rare skill mix…
6
What else can we do?
7
Radoop: RapidMiner on Hadoop
• We do this with RapidMiner + Hadoop = Radoop
– Hadoop is primarily used for batch analytics workloads (ad-hoc
reporting, machine learning, etc.)
– Hadoop only provides programming APIs and command-line tools
– Radoop is a RapidMiner partner that brought the simplicity of
RapidMiner for advanced analytics to Hadoop clusters
– Radoop has been in development since 2010
8
We need to empower collaborative teams with
different backgrounds to analyze data in Hadoop –
one team member might be the data scientist.
RapidMiner for Prediction-based Actions
9
Empower business users:
Easy-to-use GUI for the
design of processes.
Predictive insights shown to
improve decision making.
Business analysts in the
driver’s seat: Let your
analysts transform business
problems into Prediction-
based Actions. Create
millions of micro-predictions
and automate everyday
decision making.
Facilitates Collaboration
among business users,
business analysts, data
scientists, and IT
professionals.
Radoop: RapidMiner on Hadoop
10
• RapidMiner Data Flow Interface:
Simple design, execution and
maintenance of analytics processes
– Focus: ad-hoc reporting and
machine learning
– Also supports data
import/export, data
transformations, ETL
workloads, visualization
• Combines distributed and
in-memory analytics
Supported Hadoop Distributions
11
Client- or Server-based Architecture
12
Client-based Architecture Server-based Architecture
Segment Users based on Service Usage (ex.)
• Task: Define K user segments and assign users to segments
• Solution with Hadoop + Mahout:
– CREATE TABLE: define a schema for the service usage log file by
manually listing columns, types, defining separator character, etc.
– Write HiveQL queries (or Pig scripts or…) to aggregate service logs for
each user and calculate user attributes describing them
– Implement and execute a custom MapReduce job to convert data to
Mahout’s input format
– Run the Mahout K-Means algorithm with proper parameters
– Implement and execute a custom MapReduce job to convert the result
back into a delimited format
– Export the result from HDFS and import it into an RDBMS (or whatever
system makes use of the “predactions”…)
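To make the logic of this segmentation task concrete, here is a minimal single-machine sketch of the two core steps: aggregating the usage log per user, then clustering users with plain K-Means (Lloyd's algorithm). It is not Mahout's or Radoop's implementation and runs entirely in memory. The log layout (user, service, amount) and all function names are assumptions made for illustration.

```python
import random
from collections import defaultdict


def aggregate_usage(log_rows):
    """Sum usage per user and service; return {user: feature vector}."""
    totals = defaultdict(lambda: defaultdict(float))
    services = set()
    for user, service, amount in log_rows:
        totals[user][service] += amount
        services.add(service)
    columns = sorted(services)  # fixed column order for every user
    return {user: [per[s] for s in columns] for user, per in totals.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def segment_users(log_rows, k):
    """Return {user: segment index} for the given usage log."""
    features = aggregate_usage(log_rows)
    users = sorted(features)
    assignment = k_means([features[u] for u in users], k)
    return dict(zip(users, assignment))
```

Calling `segment_users(rows, 2)` on a log where some users mostly use one service and others mostly another should place the two groups in different segments. On a cluster, the same two steps are what the Hive aggregation and the Mahout K-Means job carry out at scale.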
13
Segment Users based on Service Usage (ex.)
• Task: Define K user segments and assign users to segments
• Solution with Radoop:
14
Radoop: Data Management
15
Radoop: Process Management
16
Radoop: Supported Functions
• Import/Export data to/from Hadoop
– Read CSV
– Read Database
– Write CSV
– Write Database
– Retrieve/Store/Append to Hive
• Data Transformations
– Select Attributes
– Filter Examples
– Generate Attributes
– Generate ID
– Aggregate
– Join
– Sort
– Normalize
– Replace
– Replace/Declare Missing Values
– Hive/Pig Script
• Machine learning & Statistical modeling
– Clustering: K-Means, Fuzzy K-Means,
Dirichlet, Canopy
– Model learning: Naive Bayes
– Model scoring: Naive Bayes, Decision
Tree, Logistic Regression, Linear
Regression
– Evaluation: Performance
– …and more…
17
Production Use at…
18
Engine Comparison
• In-Memory:
– In-memory analytics is always the fastest way to build analytical models
– Data set size is restricted by hardware (memory)
– Data set size: On decent hardware, up to ca. 100 million data points
• In-Database:
– Not applicable for all analysis tasks
– Runtime depends on the power of the database server
– Data set size: Unlimited (the limit is the external storage capacity)
• In-Hadoop:
– Not applicable for all analysis tasks
– Runtime depends on the power of the Hadoop cluster
– Because Hadoop introduces massive overhead, it is not
recommended for smaller data sets
– Data set size: Unlimited (the limit is the external storage capacity)
19
Runtime Comparison for Naïve Bayes (20 nodes)
20
Runtime Comparison for Number of Nodes
21
Conclusion
• Predictive Analytics on Hadoop for Everyone:
– RapidMiner + Radoop is an easy-to-use & efficient alternative that supports
collaboration between different team members
– Everyone can create not only Predictive Intelligence but also Prediction-based Actions on top
of Hadoop clusters
• Runtimes:
– The runtimes of analytical algorithms show that limitations in
terms of data set size have vanished today, but at the price of longer runtimes
– Running predictive analytics on Hadoop clusters is prohibitively slow for small data sets
and in many cases also for interactive real-time reports
– Depending on the data itself, the number of nodes, and the selected predictive analytics
algorithm, Hadoop clusters can beat the other engines already at ca. 10M to 25M data points
– In general, we recommend staying in-memory for up to 100M data points and investing in
hardware before switching to in-database (up to 500M data points) and then to
Hadoop clusters for data sets beyond this size
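The general recommendation above can be written down as a small rule of thumb. The sketch below only encodes the two thresholds quoted on this slide (100M and 500M data points). The function and constant names are hypothetical, and real decisions also depend on the data, the algorithm, and the number of nodes, as noted above.

```python
IN_MEMORY_LIMIT = 100_000_000    # "up to 100M data points" from the slide
IN_DATABASE_LIMIT = 500_000_000  # "up to 500M data points" from the slide


def recommend_engine(n_data_points):
    """Return the engine suggested by the slide's general size guideline."""
    if n_data_points < 0:
        raise ValueError("n_data_points must be non-negative")
    if n_data_points <= IN_MEMORY_LIMIT:
        return "in-memory"
    if n_data_points <= IN_DATABASE_LIMIT:
        return "in-database"
    return "in-hadoop"
```

For instance, `recommend_engine(250_000_000)` falls between the two thresholds and returns `"in-database"`.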
22
RapidMiner USA
RapidMiner, Inc. (Headquarters)
10 Fawcett St
Cambridge, MA 02138
United States
E-mail contact-us@rapidminer.com
Phone +1 - 617 - 401 - 7708
Fax +1 - 617 - 401 - 7709
CONTACT US
23
RapidMiner Germany
RapidMiner GmbH
Stockumer Str. 475
44227 Dortmund
Germany
E-mail contact-de@rapidminer.com
Phone +49 - 231 - 425 786 9-0
Fax +49 - 231 - 425 786 9-9
RapidMiner UK
RapidMiner Ltd.
Quatro House, Frimley Road
Camberley GU16 7ER
United Kingdom
E-mail contact-uk@rapidminer.com
Phone +44 1276 804 426
www.rapidminer.com

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

In-Hadoop, In-Database and In-Memory Processing for Predictive Analytics

  • 1. In-Hadoop, In-Database, and In-Memory Processing for Predictive Analytics. Predict to Act.
  • 2. Stop Looking at the Rear View Mirror. From BI… Business Intelligence can only show you what has already happened. This is like driving a car while looking only into the rear-view mirror. Do you really want to drive your business like that?
  • 3. Stop Looking at the Rear View Mirror (continued). …to… Data Discovery and Real-time Analytics offer a view through the windscreen. You can see what is happening right now, but you still cannot identify upcoming opportunities or threats.
  • 4. Stop Looking at the Rear View Mirror (continued). Predictions. Predictive Analytics delivers future outcomes and provides a look-ahead view. This is like projecting what lies beyond the next curve onto your windscreen. Predictive Insights will show you that there is an accident ahead. Prediction-based Actions will automatically slow the car down. Warning: Accident Ahead!
  • 5. Value is Higher for Prediction-based Actions
      • Inform (Predictive Insights):
          – Aggregation of micro-predictions will show you what can be expected
          – Allows for better decision making
          – Useful for supporting strategic decisions
          – Will not disrupt business processes
          – Limited and unspecified total value
      • Operationalize (Predactions*):
          – Millions of micro-predictions
          – Each Predictive Action is embedded into your business process
          – You will know how often you will be right and what your total gain will be
          – Brings your business processes to a new, proactive level
          – Huge total value
      *Predactions are Prediction-based Actions: you predict what is going to happen, and then you "predact" on it.
  • 6. Science?
      • Predictive analytics is complex
      • Hadoop is complex
      • Proposed solution: Let's create more Data Scientists!
      • But there are flaws with this approach:
          – Scientists are supposed to create new things, yet data scientists spend 95% of their time on integrating and transforming data.
          – A shortage of data scientists is predicted (McKinsey report)
          – Being a hardcore programmer, having a PhD in Statistics, and being able to understand business problems is a rare skill mix…
  • 7. What else can we do?
  • 8. Radoop: RapidMiner on Hadoop
      • We need to empower collaborative teams with different backgrounds to analyze data in Hadoop; one team member might be the data scientist.
      • We do this with RapidMiner + Hadoop = Radoop
          – Hadoop is primarily used for batch analytics workloads (ad-hoc reporting, machine learning, etc.)
          – Hadoop itself only provides programming APIs and command-line tools
          – Radoop is a RapidMiner partner that brought RapidMiner's simplicity for advanced analytics to Hadoop clusters
          – Radoop has been in development since 2010
  • 9. RapidMiner for Prediction-based Actions
      • Empower business users: Easy-to-use GUI for the design of processes. Predictive insights are shown to improve decision making.
      • Business analysts in the driver's seat: Let your analysts transform business problems into Prediction-based Actions. Create millions of micro-predictions and automate everyday decision making.
      • Facilitates collaboration among business users, business analysts, data scientists, and IT professionals.
  • 10. Radoop: RapidMiner on Hadoop
      • RapidMiner Data Flow Interface: Simple design, execution and maintenance of analytics processes
          – Focus: ad-hoc reporting and machine learning
          – Also supports data import/export, data transformations, ETL workloads, visualization
      • Combines distributed and in-memory analytics
  • 12. Client- or Server-based Architecture (diagrams: Client-based Architecture, Server-based Architecture)
  • 13. Segment Users based on Service Usage (example)
      • Task: Define K user segments and assign users to segments
      • Solution with Hadoop + Mahout:
          – CREATE TABLE: define a schema for the service usage log file by manually listing columns and types, defining the separator character, etc.
          – Write HiveQL queries (or Pig scripts or…) to aggregate service logs for each user and calculate the attributes describing each user
          – Implement and execute a custom MapReduce job to convert the data to Mahout's input format
          – Run the Mahout K-Means algorithm with proper parameters
          – Implement and execute a custom MapReduce job to convert the result back into a delimited format
          – Export the result from HDFS and import it into an RDBMS (or whatever system makes use of the "predactions"…)
  • 14. Segment Users based on Service Usage (example)
      • Task: Define K user segments and assign users to segments
      • Solution with Radoop:
  • 17. Radoop: Supported Functions
      • Import/export data to/from Hadoop: Read CSV, Read Database, Write CSV, Write Database, Retrieve/Store/Append to Hive
      • Data transformations: Select Attributes, Filter Examples, Generate Attributes, Generate ID, Aggregate, Join, Sort, Normalize, Replace, Replace/Declare Missing Values, Hive/Pig Script
      • Machine learning and statistical modeling:
          – Clustering: K-Means, Fuzzy K-Means, Dirichlet, Canopy
          – Model learning: Naive Bayes
          – Model scoring: Naive Bayes, Decision Tree, Logistic Regression, Linear Regression
          – Evaluation: Performance
          – …and more…
  • 19. Engine Comparison
      • In-Memory:
          – In-memory analytics is always the fastest way to build analytical models
          – Data set size is restricted by hardware (memory)
          – Data set size: on decent hardware, up to ca. 100 million data points
      • In-Database:
          – Not applicable for all analysis tasks
          – Runtime depends on the power of the database server
          – Data set size: unlimited (the limit is the external storage capacity)
      • In-Hadoop:
          – Not applicable for all analysis tasks
          – Runtime depends on the power of the Hadoop cluster
          – Due to the massive overhead introduced by Hadoop, using Hadoop is not recommended for smaller data sets
          – Data set size: unlimited (the limit is the external storage capacity)
  • 20. Runtime Comparison for Naïve Bayes (20 nodes)
  • 21. Runtime Comparison for Number of Nodes
  • 22. Conclusion
      • Predictive Analytics on Hadoop for Everyone:
          – RapidMiner + Radoop is an easy-to-use and efficient alternative that supports collaboration between different team members
          – Not only Predictive Intelligence but also Prediction-based Actions can be created on top of Hadoop clusters by everyone
      • Runtimes:
          – The runtimes for analytical algorithms show that limitations on data set size have vanished today, but at the price of longer runtimes
          – Running predictive analytics on Hadoop clusters is prohibitively slow for small data sets and in many cases also for interactive real-time reports
          – Depending on the data itself, the number of nodes, and the selected predictive analytics algorithm, Hadoop clusters can beat the other engines already at ca. 10M to 25M data points
          – In general, we recommend staying in-memory for up to 100M data points and investing in hardware before switching to in-database (up to 500M data points), and then to Hadoop clusters for data sets beyond this size
  • 23. Contact Us (www.rapidminer.com)
      • RapidMiner USA: RapidMiner, Inc. (Headquarters), 10 Fawcett St, Cambridge, MA 02138, United States. E-mail: contact-us@rapidminer.com. Phone: +1 - 617 - 401 - 7708. Fax: +1 - 617 - 401 - 7709.
      • RapidMiner Germany: RapidMiner GmbH, Stockumer Str. 475, 44227 Dortmund, Germany. E-mail: contact-de@rapidminer.com. Phone: +49 - 231 - 425 786 9-0. Fax: +49 - 231 - 425 786 9-9.
      • RapidMiner UK: RapidMiner Ltd., Quatro House, Frimley Road, Camberley GU16 7ER, United Kingdom. E-mail: contact-uk@rapidminer.com. Phone: +44 1276 804 426.
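
The segmentation task on slide 13 (aggregate service logs per user, then run K-Means to assign each user to one of K segments) can be illustrated on a single machine with a minimal sketch. This is not Mahout's or Radoop's implementation: the function names, the `(user, service, amount)` row layout, and the choice to start the centroids at the first K points are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def aggregate_usage(log_rows):
    """Sum usage per (user, service) into one feature vector per user.

    log_rows is a list of (user, service, amount) tuples (hypothetical layout).
    """
    services = sorted({service for _, service, _ in log_rows})
    totals = defaultdict(lambda: [0.0] * len(services))
    for user, service, amount in log_rows:
        totals[user][services.index(service)] += amount
    return dict(totals), services


def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (assumption)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def segment_users(log_rows, k):
    """Return a {user: segment_index} mapping for the given usage log."""
    features, _ = aggregate_usage(log_rows)
    users = sorted(features)
    assignment, _ = k_means([features[u] for u in users], k)
    return dict(zip(users, assignment))


if __name__ == "__main__":
    rows = [
        ("alice", "voice", 100), ("alice", "data", 1),
        ("bob", "voice", 90), ("bob", "data", 2),
        ("carol", "voice", 1), ("carol", "data", 80),
        ("dave", "voice", 2), ("dave", "data", 95),
    ]
    print(segment_users(rows, k=2))
```

On a Hadoop cluster the aggregation step would correspond to the HiveQL or Pig stage and the clustering step to the Mahout job; the sketch only shows the logic those stages carry out.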
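
The sizing recommendation in the conclusion (slide 22) amounts to a simple rule of thumb, which could be written down as follows. The thresholds are the ones stated in the deck; the function and constant names are hypothetical, and the deck itself notes that the real break-even point depends on the data, the number of nodes, and the algorithm.

```python
# Thresholds taken from the deck's conclusion; names are illustrative only.
IN_MEMORY_MAX_POINTS = 100_000_000    # "stay in-memory for up to 100M data points"
IN_DATABASE_MAX_POINTS = 500_000_000  # "in-database (up to 500M data points)"


def recommend_engine(data_points: int) -> str:
    """Map a data set size to the engine the deck recommends in general."""
    if data_points < 0:
        raise ValueError("data_points must be non-negative")
    if data_points <= IN_MEMORY_MAX_POINTS:
        return "in-memory"
    if data_points <= IN_DATABASE_MAX_POINTS:
        return "in-database"
    return "in-hadoop"


if __name__ == "__main__":
    for size in (5_000_000, 250_000_000, 2_000_000_000):
        print(f"{size:>13,} data points: {recommend_engine(size)}")
```

A real decision would also weigh the points from the engine comparison on slide 19, for example whether the analysis task is supported in-database or in-Hadoop at all.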