First, the industry has embraced polyglot persistence – accepting that data should be stored and managed in a “fit for purpose” way, optimized for the data at hand.
Second, we can parallelize our MPP data foundation for both speed and size; this is crucial for next-gen data services and analytics that must scale to any latency and size requirements.
Third, MPP data pipelines allow us to treat data events in moving time windows at variable latencies; in the long run this will change how we do ETL for most use cases.
The value of applying a fast data strategy as an end-to-end solution has become more apparent across many industry segments; it’s also becoming more mainstream now. We’re seeing many Oracle customers adopting solutions that address these fast data requirements; for example:
Transportation: Monitoring airline operational events to eliminate flight delays
Retail: Customer service centers using fast data for click-stream analysis and customer experience management
Telco: Location-based offers and intelligent network management to drive new services and lower costs
Healthcare: Monitoring medical device data to help save lives
Manufacturing: Real-time corrective action to reduce maintenance costs and the risk of outages
Oracle Data Integration (Oracle GoldenGate and Oracle Data Integrator) provides best-in-class real-time capture and big data transformation:
Capture changed data and events in real time
Apply basic filtering and transformation at capture point
Transform and load structured or unstructured data for analysis
Improve the quality and trustworthiness of information
In-memory analytics to provide actionable insight from large amounts of data - fast.
Oracle TimesTen In-Memory Database for Oracle Exalytics has also been certified to work with both Oracle GoldenGate real-time replication technology and Oracle Data Integrator, giving customers more flexibility to report on events as they happen. With these certifications, Oracle Exalytics data can be updated either via replication or via incremental updates, making refreshes quicker and more efficient.
Trust that your data model is current and accurate: customers benefit from a common enterprise information model that provides centralized metadata management, common query generation and data access, and a rich spectrum of visualization, collaboration, and search features.
Quickly organize and explore diverse and unstructured big data from inside and outside your organization – Endeca allows business users to freely explore and discover meaningful new insights from both structured and unstructured sources, helping to identify root causes and new associations.
In general, I see three main categories: those based on the HDFS ecosystem, those built on other ecosystems (or a mix of both), and composite systems.
HDFS based does not mean that it *must* be HDFS based; Spark and H2O can both run outside of HDFS. However, wherever you find them, HDFS is usually running nearby as well.
An effort to make R both more scalable and faster.
Can run on top of HDFS, but on other platforms as well
YARN-based processing
Includes Malhar, a “lego box” of pre-made operators and widgets to speed adoption
An in-memory OLAP store, designed to ingest event/log data, chunking and compressing that data into column-based, queryable segments.
Data Ingestion
Data is ingested by Druid directly through its real-time nodes, or batch-loaded into historical nodes from a deep storage facility. Real-time nodes accept JSON-formatted data from a streaming datasource. Batch-loaded data formats can be JSON, CSV, or TSV. Real-time nodes temporarily store and serve data in real time, but eventually push the data to the deep storage facility, from which it is loaded into historical nodes. Historical nodes hold the bulk of data in the cluster.
Real-time nodes chunk data into segments, and are designed to frequently move these segments out to deep storage. To maintain cluster awareness of the location of data, these nodes must interact with MySQL to update metadata about the segments, and with Apache ZooKeeper to monitor their transfer.
Query Management
Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
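To make that query path concrete, here is a hedged sketch in Java that POSTs a native Druid timeseries query to a broker node; the broker then fans the query out to the historical and real-time nodes holding the relevant segments and merges their partial results. The dataSource name, interval, and broker address/port are illustrative assumptions, not values prescribed by Druid.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class DruidTimeseriesQuery {
    public static void main(String[] args) throws Exception {
        // A native Druid timeseries query; the dataSource name and interval are made up.
        String query = "{"
            + "\"queryType\": \"timeseries\","
            + "\"dataSource\": \"events\","
            + "\"granularity\": \"minute\","
            + "\"intervals\": [\"2015-09-01/2015-09-02\"],"
            + "\"aggregations\": [{\"type\": \"count\", \"name\": \"rows\"}]"
            + "}";

        // The query goes to a broker node, which locates the segments on historical
        // and real-time nodes, merges their partial results, and returns the answer.
        URL broker = new URL("http://localhost:8082/druid/v2/?pretty");
        HttpURLConnection conn = (HttpURLConnection) broker.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(query.getBytes(StandardCharsets.UTF_8));
        }
        try (Scanner in = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}
```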
Cluster Management
Operations relating to data management in historical nodes are overseen by coordinator nodes, which are the prime users of the MySQL metadata tables. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
Features
Low latency (real-time) data ingestion
Arbitrary slice and dice data exploration
Sub-second analytic queries
Approximate and exact computations
Flink is intended to be a framework for unified stream and batch processing (Kappa architecture). Flink can also handle backpressure on queues more gracefully than other platforms (::cough:: Storm).
Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.
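As a rough illustration of that dataflow model, the sketch below uses Flink's DataStream API to count words arriving on a socket; the host/port source and the job name are assumptions made for the example, and a real pipeline would typically read from Kafka or files instead.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Text lines arriving on a local socket stand in for a real stream source.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines
            // Split each line into (word, 1) pairs.
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        out.collect(Tuple2.of(word, 1));
                    }
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            // Partition by word and keep a running sum of the counts.
            .keyBy(pair -> pair.f0)
            .sum(1)
            .print();

        // Builds the dataflow graph above and hands it to Flink's pipelined runtime.
        env.execute("Streaming WordCount");
    }
}
```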
Storm is a stream processor only; topologies are wired together from spouts (stream sources) and bolts (processing steps).
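For comparison, here is a minimal, hedged sketch of how a Storm topology wires spouts to bolts. It assumes the org.apache.storm package layout (Storm 1.x and later) and uses TestWordSpout, a sample spout bundled with storm-core; the bolt, component names, and topology name are made up for illustration.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordLengthTopology {

    // A bolt consumes tuples emitted by spouts (or other bolts) and emits new tuples.
    public static class LengthBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            collector.emit(new Values(word, word.length()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "length"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Spouts are the sources of streams; bolts are the processing steps.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 1);
        builder.setBolt("lengths", new LengthBolt(), 2).shuffleGrouping("words");

        // Run in-process for demonstration; a real deployment would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-length", new Config(), builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```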
Samza’s goal is to provide a lightweight framework for continuous data processing.
Samza continuously computes results as data arrives which makes sub-second response times possible.
It’s unlike batch processing systems such as Hadoop, which typically have high-latency responses that can sometimes take hours.
Samza might help you to update databases, compute counts or other aggregations, transform messages, or any number of other operations. It’s been in production at LinkedIn for several years and currently runs on hundreds of machines across multiple data centers.
LinkedIn’s largest Samza job processes more than one million messages per second during peak traffic hours.
Architecture & Concepts
Streams and Jobs are the building blocks of a Samza application:
A stream is composed of immutable sequences of messages of a similar type or category. In order to scale the system to handle large-scale data, we break down each stream into partitions. Within each partition, the sequence of messages is totally ordered and each message’s position is uniquely identified by its offset. At LinkedIn, streams are provided by Apache Kafka.
A job is the code that consumes and processes a set of input streams. In order to scale the throughput of the stream processor, jobs are broken into smaller units of execution called tasks. Each task consumes data from one or more partitions of each of the job’s input streams. Since there is no defined ordering of messages across partitions, tasks can operate independently.
Samza assigns groups of tasks to be executed inside one or more containers – UNIX processes running a JVM that execute a set of Samza tasks for a single job. Samza’s container code is single threaded (when one task is processing a message, no other task in the container is active), and is responsible for managing the startup, execution, and shutdown of one or more tasks.
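As a rough sketch of what one of those tasks looks like, the class below implements Samza's low-level StreamTask API and keeps a running count for the partitions assigned to that task instance. The class name, the "kafka" system name, and the output stream name are illustrative assumptions; in a real job they come from the job's configuration, not the code.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class PageViewCounterTask implements StreamTask {
    // Output destination: the "kafka" system and stream name are illustrative only.
    private static final SystemStream OUTPUT = new SystemStream("kafka", "page-view-counts");

    private int count = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Each task instance consumes one or more partitions of the input stream,
        // so this counter covers only the partitions assigned to this task.
        count++;
        collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getKey(), count));
    }
}
```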
SPARK – fast, general-purpose engine for distributed processing
MESOS – cluster management and resource isolation
AKKA – runtime for highly concurrent, distributed, message-driven applications
CASSANDRA – distributed NoSQL store optimized for reads
KAFKA – high-throughput, low-latency pub/sub messaging system