1
Hybrid Transactional/Analytics Processing
with Spark and In-Memory Data Grids
Copyright © GigaSpaces 2017. All rights reserved.
Ali Hodroj
VP, Products and Strategy @ahodroj
2
GigaSpaces
Ultra-Low Latency / High Throughput Middleware
Direct customers
500+
Headquarters
New York, NY
Established
2001
3
How we got HERE
4
We’re seeing more in our customer base
5
…a shift towards real-time
BI
Big Data
Fast Data
6
Sample Customer Use Cases
Internet of Things, Omni-Channel, Operational Intelligence
• Operational Analytics: Fraud Detection, Supply chain optimization
• Predictive Analytics: Personalization, Recommendation
• Edge Analytics: Operational Intelligence, Predictive Maintenance, Spatial Analytics
7
In-Memory Computing
(not a new thing)
Rapid decline in RAM prices drives advanced data processing innovations
• Transactional (2001-present)
– In-Memory Databases
– In-Memory Data Grids
• Analytics (2012-present)
– In-Memory Data Processing Frameworks (Spark)
– In-Memory File Systems (Tachyon)
8
In-Memory Data Processing: Apache Spark
9
In-Memory Data Grid:
Online Transaction Processing at Low-Latency and High Throughput
A data grid is a cluster of machines that work together to create a resilient shared data fabric for low-latency data access and extreme transaction processing
http://xap.github.io
10
In-Memory Data Grid 101
Feeder
Virtual Machine Virtual Machine Virtual Machine
Partitioned Data
11
In-Memory Data Grid 101: Execution Models
Write
Event-Driven / Reactive
RPC / Master-Worker
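The two execution models named on this slide can be illustrated with a toy sketch. It is not the grid's API: `ToyPartition`, its `listeners` list, and the threshold of 100 are hypothetical names and values chosen only to show a write triggering a listener (event-driven) and a caller fanning a task out to every partition and reducing the results (RPC / master-worker).

```python
from typing import Callable, Dict, List


class ToyPartition:
    """Hypothetical stand-in for one grid partition."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        self.listeners: List[Callable[[str, int], None]] = []

    def write(self, key: str, value: int) -> None:
        self.data[key] = value
        # Event-driven / reactive: every write notifies the registered listeners.
        for listener in self.listeners:
            listener(key, value)


partitions = [ToyPartition(), ToyPartition()]
alerts: List[str] = []


def alert_on_large_value(key: str, value: int) -> None:
    # Listener logic runs next to the data, as soon as the write lands.
    if value > 100:
        alerts.append(key)


for partition in partitions:
    partition.listeners.append(alert_on_large_value)

partitions[0].write("a", 50)
partitions[1].write("b", 150)

# RPC / master-worker: the caller sends the same task to every partition
# (a local sum here) and then reduces the partial results.
partial_sums = [sum(p.data.values()) for p in partitions]
total = sum(partial_sums)
```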
12
13
In-Memory Data Grid 101: Typical Deployment
[Diagram: HTML and REST clients connect over HTTP/S through a hardware load balancer (HW LB) to two HTTPD load balancers, each with an LB agent and a GSA. Processing units host Primary Sets 1-6 and Backup Sets 1-6 on GSA-managed machines in a private or public cloud. A Mirror Service with its own GSA writes asynchronously to the DB.]
14
Host: Cisco UCS Server
CPU: Intel 16-core 2.9GHz
Concurrent Threads: 2
Throughput: 200, 400, 800 ops/sec
15
16
Hybrid Transactional/Analytics Processing at Scale
Requirements and Challenges
1. Provide closed-loop analytics pipeline. Data, insight, to action at sub-second latency
   Bi-directional integration between transactional and analytical data stores
2. IoT and Omni-channel require the convergence of many different data types
   Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API
3. Blend of both real-time and historical data
   Unified and scale-out real-time and historical data store
17
Our road towards HTAP:
SPARK + MICROSERVICES
18
What’s needed
• Large-scale distributed analytics framework
• Unified, scale-out, low-latency data store
• Transactional capabilities: ACID, Event-Driven, Rich Data modeling
• Microservices
19
Our approach to HTAP
Low-latency Scale-Out In-Memory Data Grid + Large-scale distributed analytics framework
20
Why Spark?
• Unified & Concise API
• Highly Flexible Data Store Integration
• Massive Community and Adoption
21
1
Bi-directional integration between transactional and analytical data stores
Provide closed-loop analytics pipeline. Data, insight, to action at scale (at sub-seconds)
22
23
Analytics Tier: Spark, Spark SQL, Spark Streaming, Machine Learning
Transactional Tier: In-Memory Data Grid (ACID-compliant, Strong Consistency)
Storage: In-Memory Store (RAM); Flash, SSD, Off-Heap Store
High availability
Security & Management
24
Build a connector: Spark to IMDG
• Get Partitions: An array of partitions that a dataset is divided into. In the IMDG: a distributed query to get partitions and their hosts
• Compute: A compute function to do a computation on partitions. In the IMDG: an iterator over a portion of the data
• Get Preferred Location: Optional preferred locations, i.e. hosts for a partition where the data will be loaded. In the IMDG: hosts from the distributed query
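The three parts of the contract above (get partitions, compute, preferred locations) can be sketched against a toy in-memory grid. This is a minimal illustration in plain Python, not Spark's RDD class and not the actual connector: `ToyGrid`, `GridPartition`, `GridRDD`, and the host names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class GridPartition:
    """One grid partition and the host that holds it (toy model)."""
    index: int
    host: str


class ToyGrid:
    """Hypothetical stand-in for an IMDG: partition index -> (host, rows)."""

    def __init__(self, layout: Dict[int, Tuple[str, List[dict]]]) -> None:
        self._layout = layout

    def distributed_query(self) -> List[GridPartition]:
        # Ask the grid which partitions exist and which host owns each one.
        return [GridPartition(i, host) for i, (host, _) in sorted(self._layout.items())]

    def scan(self, index: int) -> Iterator[dict]:
        # Iterate over the portion of the data held by a single partition.
        return iter(self._layout[index][1])


class GridRDD:
    """Mirrors the three-part contract listed on the slide."""

    def __init__(self, grid: ToyGrid) -> None:
        self._grid = grid

    def get_partitions(self) -> List[GridPartition]:
        return self._grid.distributed_query()

    def compute(self, partition: GridPartition) -> Iterator[dict]:
        return self._grid.scan(partition.index)

    def get_preferred_locations(self, partition: GridPartition) -> List[str]:
        return [partition.host]


grid = ToyGrid({
    0: ("node2", [{"id": 1, "city": "NY"}]),
    1: ("node3", [{"id": 2, "city": "SF"}, {"id": 3, "city": "NY"}]),
})
rdd = GridRDD(grid)
parts = rdd.get_partitions()
rows = [row for p in parts for row in rdd.compute(p)]
hosts = [rdd.get_preferred_locations(p) for p in parts]
```

The point of the design is that the grid's own partition map drives both how the dataset is split and where each split should be read.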
25
Pattern #1: Data Locality (machine-level)
node 1: Spark master, Grid master
node 2: Spark worker, Grid Partition
node 3: Spark worker, Grid Partition
NoSQL Storage
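Machine-level data locality can be sketched as a placement rule: run each task on the worker that shares a host with the grid partition it reads. The function below is a toy illustration under that assumption; `assign_tasks` and the node names are hypothetical, and a real scheduler weighs more factors than this.

```python
from typing import Dict, List


def assign_tasks(partition_hosts: Dict[int, str], workers: List[str]) -> Dict[int, str]:
    """Place each task on the worker co-located with its grid partition.

    Falls back to round-robin when no worker runs on the partition's host.
    """
    assignment: Dict[int, str] = {}
    for i, (partition, host) in enumerate(sorted(partition_hosts.items())):
        if host in workers:
            assignment[partition] = host  # local read, no network hop
        else:
            assignment[partition] = workers[i % len(workers)]  # remote read
    return assignment


placement = assign_tasks(
    {1: "node2", 2: "node3", 3: "node9"},
    workers=["node2", "node3"],
)
```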
26
Pattern #2: Pushdown Predicates (Grid-side processing)
SELECT SUM(amount)
FROM order
WHERE city = 'NY' AND year > 2012
Aggregation in Spark
Filtering and column pruning in Data Grid
Spark SQL architecture:
• Pushing down predicates to Data Grid
• Leveraging indexes
• Transparent to user
• Enabling support for other languages - Python/R
Implementing DataSource API
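The split of work in this pattern can be sketched for the query on the slide: the grid applies the WHERE clause (using an index on `city`) and ships only the `amount` column, and the Spark side only sums. This is a plain-Python simulation, not the DataSource API; `GridTable`, its `scan` signature, and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List

ORDERS = [
    {"id": 1, "city": "NY", "year": 2013, "amount": 10.0, "note": "a"},
    {"id": 2, "city": "NY", "year": 2011, "amount": 99.0, "note": "b"},
    {"id": 3, "city": "SF", "year": 2014, "amount": 5.0, "note": "c"},
    {"id": 4, "city": "NY", "year": 2015, "amount": 2.5, "note": "d"},
]


class GridTable:
    """Toy grid-side table with a hash index on one column."""

    def __init__(self, rows: List[dict], indexed_column: str) -> None:
        self._indexed_column = indexed_column
        self._index: Dict[object, List[dict]] = defaultdict(list)
        for row in rows:
            self._index[row[indexed_column]].append(row)

    def scan(self, equals: Dict[str, object], greater_than: Dict[str, float],
             columns: List[str]) -> Iterator[dict]:
        # Use the index for the equality predicate when it covers the column.
        if self._indexed_column in equals:
            candidates = self._index.get(equals[self._indexed_column], [])
        else:
            candidates = [r for rows in self._index.values() for r in rows]
        for row in candidates:
            if all(row[c] == v for c, v in equals.items()) and \
               all(row[c] > v for c, v in greater_than.items()):
                # Column pruning: ship only the requested columns.
                yield {c: row[c] for c in columns}


grid = GridTable(ORDERS, indexed_column="city")
# Grid side: WHERE city = 'NY' AND year > 2012, keeping only `amount`.
shipped = list(grid.scan({"city": "NY"}, {"year": 2012}, ["amount"]))
# Spark side (simulated): SUM(amount) over the rows that were shipped.
total = sum(row["amount"] for row in shipped)
```

Only two single-column rows cross the boundary here instead of four full rows, which is the effect the pattern is after.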
27
Pattern #3: Decouple Data Processing from Data Storage
node 1: Spark master, Grid master
node 2: Spark worker, Grid Partition
node 3: Spark worker, Grid Partition
Spark workers: Lightweight workers, small JVMs
Grid partitions: Large JVMs, Fast indexing
NoSQL Storage
28
Push-down Predicates performance
Traditional Spark filtering of 7MM records: 31 sec
vs
Grid-side + Spark filtering of 7MM records: 800 ms
29
2
Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API
IoT and Omni-channel require the convergence of many different data types
In-Memory Data Grid + Spark Convergence
Geo-Spatial Full Text
Simple K/V to RDD Mapping
POJO Domain Model to Spark
POJO Domain Model to Spark (Event-Driven)
JSON Domain Model to Spark
Geo-Spatial Data Frames
Geo-Spatial
Full Text Indexes + Lucene Analyzers
Full Text
37
3
Unified and scale-out real-time and historical data store
Blend of both real-time and historical data
38
In-Memory Data Grid Partitioning
hash(key) % #nodes
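The routing rule hash(key) % #nodes can be sketched in a few lines. CRC32 is used here only as an assumed stable hash so that the same key maps to the same partition across processes (Python's built-in `hash` for strings is randomized per process); the hash function a given grid uses may differ, and `partition_for` is a hypothetical name.

```python
import zlib


def partition_for(key: str, num_nodes: int) -> int:
    """Route a key with hash(key) % #nodes, using CRC32 as a stable hash."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return zlib.crc32(key.encode("utf-8")) % num_nodes


keys = ["order-1", "order-2", "order-3", "order-4"]
routing = {key: partition_for(key, 3) for key in keys}
```

Every writer and reader that applies the same function reaches the same partition without a lookup service.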
39
In-Memory Data Grid Partitioning – With HA
hash(key) % #nodes
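High availability adds a backup copy of each partition on a different node. The sketch below assumes one simple placement rule (backup on the next node in the list) and one simple failover rule (promote the backup); both rules and the names `place_with_backups` and `fail_node` are illustrative assumptions, not the grid's actual policy.

```python
from typing import Dict, List, Optional

Layout = Dict[int, Dict[str, Optional[str]]]


def place_with_backups(num_partitions: int, nodes: List[str]) -> Layout:
    """Put each primary on one node and its backup on the next node."""
    if len(nodes) < 2:
        raise ValueError("high availability needs at least two nodes")
    layout: Layout = {}
    for p in range(num_partitions):
        layout[p] = {
            "primary": nodes[p % len(nodes)],
            "backup": nodes[(p + 1) % len(nodes)],
        }
    return layout


def fail_node(layout: Layout, failed: str) -> Layout:
    """Promote the backup of every partition whose primary was on the failed node."""
    result: Layout = {}
    for p, roles in layout.items():
        if roles["primary"] == failed:
            result[p] = {"primary": roles["backup"], "backup": None}
        elif roles["backup"] == failed:
            result[p] = {"primary": roles["primary"], "backup": None}
        else:
            result[p] = dict(roles)
    return result


layout = place_with_backups(3, ["node1", "node2", "node3"])
after = fail_node(layout, "node2")
```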
40
Spark to Data Grid Partition Cardinality
node 1: Spark executor, Spark Partition #1, direct connection to Grid Partition #1
node 2: Spark executor, Spark Partition #2, direct connection to Grid Partition #2
node 3: Spark executor, Spark Partition #3, direct connection to Grid Partition #3
Simple, but not enough parallelism for Spark
41
Pattern #4: Grid bucketing for higher throughput
node 1: Spark Executor, Grid Primary #1
Buckets 0, 1, 2, 3, 4, 5, ..., 1023 of Grid Primary #1 are grouped into Spark partitions (Spark Partition #1, Spark Partition #2)
1 Spark partition = M grid buckets
1 Grid partition = N Spark partitions
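The bucket arithmetic can be sketched as follows: with buckets numbered 0 to 1023 in one grid partition, choosing N Spark partitions gives each of them roughly M = 1024 / N buckets. The contiguous-range split and the name `bucket_ranges` are assumptions made for illustration; the slide does not say how buckets are grouped.

```python
from typing import List

BUCKETS = 1024  # the slide numbers buckets 0..1023


def bucket_ranges(spark_partitions_per_grid_partition: int) -> List[range]:
    """Split buckets 0..1023 into N contiguous ranges, one per Spark partition."""
    n = spark_partitions_per_grid_partition
    if not 1 <= n <= BUCKETS:
        raise ValueError("N must be between 1 and 1024")
    base, extra = divmod(BUCKETS, n)
    ranges: List[range] = []
    start = 0
    for i in range(n):
        # The first `extra` ranges take one more bucket when 1024 % N != 0.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


ranges = bucket_ranges(4)
uneven = bucket_ranges(3)
```

Raising N increases read parallelism against a single grid partition without changing the grid's own partition count.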
42
Eventually, we productized this as
an open source Spark distribution
@InsightEdgeIO http://insightedge.io
Apache 2 License
http://insightedge.io/docs
http://insightedge.io/blog
http://github.com/InsightEdge
GigaSpaces InsightEdge
http://insightedge.io
High Performance Spark with OLTP Capabilities
upcoming: Spark RDD/DF native read/save on Off-Heap
(SSD/Flash/Direct Buffers)
[Diagram: Application, Processing, Primary instances, Backup instances, Sync Replication, two Storage Arrays, In Memory Data Grid, two Spark workers]
• Significant RAM TCO reduction in Spark clusters
• Direct RDD/DataFrame read/write from SSD/Flash device
• Avoid Filesystem hops and write amplification
46
REFERENCE ARCHITECTURES
47
In-Process HTAP
48
Point-of-Decision HTAP
In-Memory Data Grid
Transactions, Analytics
Realtime Replication
• Scoring models
• Trigger actions
• Events
XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication
49
Transportation / IoT: Connected Cars / Fleet Geo-Analytics
Challenge
• Stream data from 1,000s of Taxis
• Actively monitor and generate real-time notifications
• Real-time Route Optimization and Geo-Fencing
Solution
• Leverage unified in-memory data fabric as middleware for geo-spatial analytics
• Elastically scale stream processing and transactional apps together
• Location-based tracking, Geo-fencing
Edge components
Data Sources
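A geo-fence check of the kind this use case mentions can be sketched with a great-circle distance test. The circular fence, the haversine formula, the mean Earth radius of 6371 km, and the sample coordinates are assumptions for illustration; the deck does not describe how the fences were modeled.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def inside_geofence(lat: float, lon: float,
                    center_lat: float, center_lon: float, radius_km: float) -> bool:
    """True when the point lies within radius_km of the fence center."""
    return haversine_km(lat, lon, center_lat, center_lon) <= radius_km


# Hypothetical fence: 5 km around a point in midtown Manhattan.
FENCE = (40.7580, -73.9855, 5.0)
taxi_in_midtown = inside_geofence(40.7614, -73.9776, *FENCE)
taxi_at_airport = inside_geofence(40.6413, -73.7781, *FENCE)
```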
50
THANK YOU!
QUESTIONS?

 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Lecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptLecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptesrabilgic2
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 

Último (20)

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Lecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).pptLecture # 8 software design and architecture (SDA).ppt
Lecture # 8 software design and architecture (SDA).ppt
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 

Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids

  • 1. Hybrid Transactional/Analytics Processing with Spark and In-Memory Data Grids. Ali Hodroj, VP, Products and Strategy (@ahodroj). Copyright © GigaSpaces 2017. All rights reserved.
  • 2. GigaSpaces: Ultra-Low Latency / High Throughput Middleware. Direct customers: 500+. Headquarters: New York, NY. Established: 2001.
  • 4. We’re seeing more in our customer base
  • 5. …a shift towards real-time: BI, Big Data, Fast Data
  • 6. Sample Customer Use Cases: Internet of Things, Omni-Channel, Operational Intelligence; Operational Analytics, Predictive Analytics; Fraud Detection, Supply chain optimization; Personalization, Recommendation; Edge Analytics; Operational Intelligence, Predictive Maintenance, Spatial Analytics
  • 7. In-Memory Computing (not a new thing). A rapid decline in RAM prices drives advanced data processing innovations. Transactional (2001-present): In-Memory Databases, In-Memory Data Grids. Analytics (2012-present): In-Memory Data Processing Frameworks (Spark), In-Memory File Systems (Tachyon).
  • 9. In-Memory Data Grid: Online Transaction Processing at Low-Latency and High Throughput. A data grid is a cluster of machines that work together to create a resilient shared data fabric for low-latency data access and extreme transaction processing. http://xap.github.io
  • 10. In-Memory Data Grid 101: Feeder; Virtual Machine, Virtual Machine, Virtual Machine; Partitioned Data
  • 11. In-Memory Data Grid 101: Execution Models. Write; Event-Driven / Reactive; RPC / Master-Worker
  • 12. In-Memory Data Grid 101: Execution Models. Write; Event-Driven / Reactive; RPC / Master-Worker
  • 13. In-Memory Data Grid 101: Typical Deployment. [Diagram labels: HTML and REST clients over HTTP/S; HW LB; HTTPD Load Balancer with LB Agent; GSA; Processing units; Primary Sets 1-6; Backup Sets 1-6; Mirror Service; DB (Async); Private or Public Cloud]
  • 14. Host: Cisco UCS Server. CPU: Intel 16-core 2.9GHz. Concurrent Threads: 2. Throughput: 200, 400, 800 ops/sec.
  • 16. Hybrid Transactional/Analytics Processing at Scale. Requirements and Challenges: (1) Requirement: provide closed-loop analytics pipeline; data, insight, to action at sub-second latency. Challenge: bi-directional integration between transactional and analytical data stores. (2) Requirement: IoT and Omni-channel require the convergence of many different data types. Challenge: ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API. (3) Requirement: blend of both real-time and historical data. Challenge: unified and scale-out real-time and historical data store.
  • 18. What’s needed: Large-scale distributed analytics framework; Unified, scale-out, low-latency data store; Transactional capabilities: ACID, Event-Driven, Rich Data modeling; Microservices
  • 19. Our approach to HTAP: Low-latency Scale-Out In-Memory Data Grid + Large-scale distributed analytics framework
  • 20. Why Spark? Unified & Concise API; Highly Flexible Data Store Integration; Massive Community and Adoption
  • 21. (1) Bi-directional integration between transactional and analytical data stores. Provide closed-loop analytics pipeline: data, insight, to action at scale (at sub-seconds)
  • 23. Analytics Tier: Spark (Spark SQL, Spark Streaming, Machine Learning). Transactional Tier (ACID-compliant, Strong Consistency): In-Memory Data Grid (In-Memory Store (RAM); Flash, SSD, Off-Heap Store). High availability; Security & Management.
  • 24. Build a connector: Spark to IMDG. Get Partitions (an array of partitions that a dataset is divided into): IMDG distributed query to get partitions and their hosts. Compute (a compute function to do a computation on partitions): iterator over a portion of data. Get Preferred Location (optional preferred locations, i.e. hosts for a partition where the data will be loaded): hosts from the distributed query.
  • 25. Pattern #1: Data Locality (machine-level). node 1: Spark master, Grid master; node 2: Spark worker, Grid Partition; node 3: Spark worker, Grid Partition. NoSQL Storage.
  • 26. Pattern #2: Pushdown Predicates (Grid-side processing). SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012: aggregation in Spark; filtering and column pruning in Data Grid. Spark SQL architecture, implementing the DataSource API: pushing down predicates to Data Grid; leveraging indexes; transparent to user; enabling support for other languages (Python/R).
  • 27. Pattern #3: Decouple Data Processing from Data Storage. node 1: Spark master, Grid master; node 2: Spark worker, Grid Partition; node 3: Spark worker, Grid Partition. Lightweight workers, small JVMs; Large JVMs, Fast indexing. NoSQL Storage.
  • 28. Push-down Predicates performance: Traditional Spark filtering of 7MM records (31 sec) vs Grid-side + Spark filtering of 7MM records (800 ms)
  • 29. (2) Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API. IoT and Omni-channel require the convergence of many different data types
  • 30. In-Memory Data Grid + Spark Convergence: Geo-Spatial, Full Text
  • 31. Simple K/V to RDD Mapping
  • 32. POJO Domain Model to Spark
  • 33. POJO Domain Model to Spark (Event-Driven)
  • 34. JSON Domain Model to Spark
  • 36. Full Text: Full Text Indexes + Lucene Analyzers
  • 37. (3) Unified and scale-out real-time and historical data store. Blend of both real-time and historical data
  • 38. In-Memory Data Grid Partitioning: hash(key) % #nodes
  • 39. In-Memory Data Grid Partitioning – With HA: hash(key) % #nodes
  • 40. Spark to Data Grid Partition Cardinality. node 1: Spark executor, Spark Partition #1, Grid Partition #1; node 2: Spark executor, Spark Partition #2, Grid Partition #2; node 3: Spark executor, Spark Partition #3, Grid Partition #3. Direct connection: simple, but not enough parallelism for Spark
  • 41. Pattern #4: Grid bucketing for higher throughput. node 1: Spark Executor; Grid Primary #1 with buckets 0, 1, 2, 3, 4, 5, ..., 1023; Spark Partition #1, Spark Partition #2. 1 Spark partition = M grid buckets; 1 Grid partition = N Spark partitions
  • 42. Eventually, we productized this as an open source Spark distribution
  • 43. @InsightEdgeIO; http://insightedge.io; Apache 2 License; http://insightedge.io/docs; http://insightedge.io/blog; http://github.com/InsightEdge
  • 45. Upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers). [Diagram labels: Application; Processing; Primary instances; Backup instances; Sync Replication; Storage Array; In Memory Data Grid; Spark worker] Significant RAM TCO reduction in Spark clusters; direct RDD/DataFrame read/write from SSD/Flash device; avoid filesystem hops and write amplification
  • 48. Point-of-Decision HTAP: XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication. [Diagram labels: In-Memory Data Grid; Realtime Replication; Scoring models; Trigger actions; Events; Transactions; Analytics]
  • 49. Transportation / IoT: Connected Cars / Fleet Geo-Analytics. Challenge: stream data from 1,000s of taxis; actively monitor and generate real-time notifications; real-time route optimization and geo-fencing. Solution: leverage unified in-memory data fabric as middleware for geo-spatial analytics; elastically scale stream processing and transactional apps together; location-based tracking, geo-fencing. [Diagram labels: Edge components; Data Sources]
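
Slide 24 lists three hooks a Spark-to-IMDG connector has to provide: get partitions, compute, and get preferred location. The following is a minimal plain-Python sketch of that contract, with no Spark or grid dependency. `GridPartition`, `GridConnector`, the host names, and the in-memory `rows` are hypothetical stand-ins for illustration, not the InsightEdge or XAP API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class GridPartition:
    """Hypothetical descriptor, as a grid 'distributed query' might return it."""
    index: int
    host: str
    rows: List[dict] = field(default_factory=list)


class GridConnector:
    """Mirrors the three hooks named on slide 24 (names are illustrative)."""

    def __init__(self, partitions: List[GridPartition]):
        self._partitions = partitions

    def get_partitions(self) -> List[GridPartition]:
        # One Spark-side partition per grid partition.
        return list(self._partitions)

    def compute(self, partition: GridPartition) -> Iterator[dict]:
        # Iterate only over the portion of data held by this partition.
        return iter(partition.rows)

    def get_preferred_locations(self, partition: GridPartition) -> List[str]:
        # Scheduler hint: run the task on the host that owns the data.
        return [partition.host]


connector = GridConnector([
    GridPartition(0, "node2", [{"id": 1}, {"id": 2}]),
    GridPartition(1, "node3", [{"id": 3}]),
])
hosts = [connector.get_preferred_locations(p) for p in connector.get_partitions()]
total_rows = sum(len(list(connector.compute(p))) for p in connector.get_partitions())
```

Returning the owning host as the preferred location is what makes the machine-level data locality of Pattern #1 possible: a scheduler that honors the hint reads from a grid partition on the same node.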
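
Slide 26 splits the example query so that filtering and column pruning happen in the data grid while the aggregation runs in Spark. A rough sketch of that division of labor in plain Python follows. `grid_scan` and the sample `orders` rows are made up for illustration, and the real connector would go through Spark's DataSource API rather than these functions.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def grid_scan(rows: Iterable[Row],
              predicates: List[Callable[[Row], bool]],
              columns: List[str]) -> Iterator[Row]:
    """Grid side (hypothetical): apply pushed-down filters, then prune columns."""
    for row in rows:
        if all(pred(row) for pred in predicates):
            yield {col: row[col] for col in columns}


# Made-up sample data standing in for the `order` table.
orders = [
    {"id": 1, "city": "NY", "year": 2013, "amount": 10.0},
    {"id": 2, "city": "NY", "year": 2011, "amount": 99.0},
    {"id": 3, "city": "SF", "year": 2014, "amount": 5.0},
    {"id": 4, "city": "NY", "year": 2015, "amount": 2.5},
]

# WHERE city = 'NY' AND year > 2012, keeping only the `amount` column.
pruned = list(grid_scan(
    orders,
    predicates=[lambda r: r["city"] == "NY", lambda r: r["year"] > 2012],
    columns=["amount"],
))

# Analytics side: SUM(amount) over the already-reduced rows.
total = sum(r["amount"] for r in pruned)
```

Only two single-column rows cross from the storage side to the aggregation side here, instead of four full rows, which is the effect the slide 28 comparison measures at larger scale.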
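
Slides 38 and 39 give the grid's routing rule as hash(key) % #nodes, with a high-availability variant. Below is a small sketch of that rule, under two assumptions that are not in the slides: `zlib.crc32` is used as a stable hash (Python's built-in `hash` is randomized per process for strings), and the backup copy is placed on the next node, which is one simple way to keep primary and backup apart.

```python
import zlib


def partition_for(key: str, num_nodes: int) -> int:
    """hash(key) % #nodes, using CRC32 as a deterministic hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def backup_node_for(primary: int, num_nodes: int) -> int:
    """Illustrative HA placement: the backup lives on the next node."""
    return (primary + 1) % num_nodes


num_nodes = 3
primary = partition_for("order-42", num_nodes)
backup = backup_node_for(primary, num_nodes)
```

Because the routing depends only on the key and the node count, any client can locate a record's primary without a lookup service.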
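
Slide 41 raises parallelism by dividing one grid partition's buckets (numbered 0 to 1023 on the slide) among N Spark partitions, each covering M buckets. One possible way to compute the split is sketched here. `split_buckets` is a hypothetical helper, and contiguous ranges are an assumption, since the slide does not say how buckets are assigned.

```python
from typing import List


def split_buckets(num_buckets: int, num_spark_partitions: int) -> List[range]:
    """Divide buckets 0..num_buckets-1 into contiguous, near-equal ranges."""
    base, extra = divmod(num_buckets, num_spark_partitions)
    ranges = []
    start = 0
    for i in range(num_spark_partitions):
        # The first `extra` ranges take one additional bucket.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


even_split = split_buckets(1024, 4)
uneven_split = split_buckets(1024, 3)
```

Each range can then back one Spark partition, so a single grid partition is read by several Spark tasks at once instead of one.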