SlideShare a Scribd company logo
1 of 56
Enabling Operational Intelligence:
Making Hadoop Real-Time
Copyright © 2014 by ScaleOut Software, Inc.
Los Angeles Big Data Users Group
August 6, 2014
Bill Bain, CEO (wbain@scaleoutsoftware.com)
2 ScaleOut Software, Inc.
• Operational Intelligence vs. Business Intelligence
• Data-Parallel Computation
• Implementing Using an In-Memory Data Grid:
• Distributing the Data Across a Cluster
• Running the Computation using “Parallel Method Invocation”
• An Example in Financial Services
• Implementing In-Memory Hadoop MapReduce
• Video Demo
• Examples of Applications in Operational Intelligence
Agenda
3 ScaleOut Software, Inc.
• Develops and markets In-Memory Data Grids,
software middleware for:
• Scaling application performance and
• Providing operational intelligence using
• In-memory data storage and computing
• Dr. William Bain, Founder & CEO
• Career focused on parallel computing – Bell Labs, Intel, Microsoft
• 3 prior start-ups, last acquired by Microsoft and product now ships as
Network Load Balancing in Windows Server
• Eight years in the market; 400 customers, 10,000 servers
• Sample customers:
About ScaleOut Software
4 ScaleOut Software, Inc.
Goal: Provide immediate feedback to a system handling live data.
A few examples:
• Equity trading: to minimize risk during a trading day
• Ecommerce: for real-time recommendations
• Reservations systems: to identify issues, reroute, etc.
• Credit cards & wire transfers: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues
Online Systems Need Operational
Intelligence
5 ScaleOut Software, Inc.
Big Data Analytics
Real-Time vs. Batch Analytics
Static data sets
Petabytes
Disk storage
Hours to minutes
Best uses:
• Analyzing
warehoused data
• Mining for long-
term trends
Live data sets
Gigabytes to terabytes
In-memory storage
Minutes to seconds
Best uses:
• Tracking live data
• Immediately
identifying trends
and capturing
opportunities
• Providing immediate
feedback
Analytics
Server
hServer
Hadoop
IBM
Teradata
SAS
SAP
Real-Time Batch
Real-time
“Operational Intelligence”
Batch
“Business Intelligence”
6 ScaleOut Software, Inc.
• Operational intelligence can co-exist with business intelligence:
• Processes streaming data close to its sources.
• Provides real-time, “tactical” feedback (e.g., recommendations, alerts).
• Translates data for storage in the data warehouse (ETL).
• Data warehouse provides “strategic” guidance.
• Using the same tool set (e.g., Hadoop MapReduce) lowers TCO:
• Leverages common skill set.
• Simplifies design (e.g., loading data into HDFS).
Integrated View of Analytics
7 ScaleOut Software, Inc.
• To keep up with fast
growing “live” workloads &
maintain fast response times:
• Ex.: Handle incoming data
streams in real time.
• Ex. Process updates to data
set based on incoming data.
• To identify and respond to
trends in fast-changing data:
• Ex. Evaluate data set changes in
real time.
• Ex. Respond to identified
patterns within seconds.
Challenges for Operational Intelligence
0
50
100
150
200
250
300
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
Millions
Growth in Web Servers
Source:
Netcraft
0
500
1000
1500
2000
2500
3000
3500
4000
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
Exebytes
Growth in “Big Data”
“More data has been
created in the past three
years than in the past
40,000.”
8 ScaleOut Software, Inc.
The Solution: Data-Parallel Computing
• Straightforward, well understood model of parallel computation
• An alternative to task-parallel computation (e.g., Storm)
• Simple: runs the same code on multiple, in-memory data items.
• Powerful: maintains a “live,” in-memory model of a real-world system
• Fast: avoids data motion which lowers speedup.
Analyze Data (Eval)
Combine Results
(Merge)
Server Cluster
9 ScaleOut Software, Inc.
Track “live” data and analyze in real-time:
Implementing OI
In-Memory
State
NoSQL
Storage
Real-Time
Data Parallel
Analysis
10 ScaleOut Software, Inc.
• Storm implements pipelined execution of tasks by “bolts” on
incoming data streams.
• Streams can be distributed to bolts with configurable mappings.
• Developer controls the number of tasks per bolt.
• Storm uses a centralized master node and Zookeeper for fault-
tolerance.
• Key strength: continuous
processing of input
streams
• Issues:
• Complexity / tuning
• Minimizing data motion
• Managing global state
Quick Comparison to Storm
11 ScaleOut Software, Inc.
Data-Parallel Enables Linear Speedup
Avoids data motion (network or disk I/O) which limits throughput:
12 ScaleOut Software, Inc.
Data-Parallel Computing Is Not New
• 1980’s: Special Purpose Hardware: “SIMD”
Thinking Machines
Connection Machine 5
• 1990’s: General Purpose Parallel Supercomputers:
“Domain Decomposition”, “SPMD”
Intel
IPSC-2
IBM
SP1
13 ScaleOut Software, Inc.
Data-Parallel Computing Is Not New
• 1990’s – early 2000’s: HPC on Clusters: “MPI”
• Since 2003: Clusters, the Cloud, and IMDGs: “MapReduce”
HP
Blade
Servers
Amazon EC2,
Windows Azure
14 ScaleOut Software, Inc.
• In-memory data grid
(IMDG) holds active
entities undergoing
state changes in
memory.
• Backing store
optionally holds large
population of entities.
• IMDG processes
incoming stream of
state changes.
• Analytics engine
examines entities in real
time and generates
alerts within seconds
as needed.
A Data-Parallel Architecture for
Operational Intelligence
15 ScaleOut Software, Inc.
In-Memory Data Grid (IMDG) stores “live” data in a cluster:
• Fits in the business logic layer:
• Follows object-oriented view of data
(vs. relational view).
• Stores collections of Java/.NET
objects shared by multiple clients.
• Uses create/read/update/delete
and query APIs to access data.
• Implemented across a cluster of
servers or VMs:
• Scales storage and throughput
by adding servers.
• Provides high availability
in case a server fails.
In-Memory Data Grid for Live Data
16 ScaleOut Software, Inc.
• IMDG’s collections of objects act like in-
memory collections:
• Unstructured, typically instances of a class
(stored as serialized blobs)
• Individually accessible / update-able
• IMDG adds attributes:
• Accessible by global key
• Query-able by properties
• Highly available
• Optional timeouts
• Distributed locking
• Integration with a backing store
• Optional dependency relationships
• Asynchronous event handling
IMDGs Store “Live” Data
Basic “CRUD” APIs:
• Create(key, obj, tout)
• Read(key)
• Update(key, obj)
• Delete(key)
and…
• Lock(key)
• Unlock(key)
Object
key
17 ScaleOut Software, Inc.
Spark / Spark Streaming from U.C.
Berkeley amplab:
• In-memory computing to accelerate and
extend Hadoop MapReduce using data-
parallel operators in Scala.
• Stores data as “resilient
distributed datasets” (RDDs):
• Distributed across cluster
• Immutable
• Hold data from/output to HDFS.
• Store data stream as a sequence of RDDs.
• Comparison to IMDG:
• Not designed for “live” data:
• Lacks CRUD on individual objects.
• Lacks high availability.
• Designed for “data parallel” operators.
Quick Comparison to Spark
18 ScaleOut Software, Inc.
Data-Parallel Computing Using PMI
“Parallel Method Invocation” (PMI): an object-oriented version of data-
parallel computing from the HPC community:
• Serves as a platform for MapReduce and other data-parallel operators.
• Selects objects using a parallel query on data hosted in the IMDG.
• Runs user-defined methods in parallel across the cluster.
Analyze Data (Eval)
Combine Results
(Merge)
In-Memory Data Grid Runs
Data-Parallel Computation.
19 ScaleOut Software, Inc.
Integrate analysis into a stock trading platform:
• The IMDG holds market data and hedging strategies.
• Updates to market data
continuously flow through
the IMDG.
• The IMDG performs
repeated data-parallel
analysis on hedging
strategies and alerts
traders in real time.
• IMDG automatically and dynamically
scales its throughput to handle new
hedging strategies by adding servers.
Example in Financial Services
20 ScaleOut Software, Inc.
Selects all relevant objects in a distributed collection.
• Query spec matches data’s object-oriented properties.
• Selected objects are fed to the analysis engine on the local server.
Step 1: Select with Parallel Query
21 ScaleOut Software, Inc.
Java Example: Parallel Query
public class Portfolio {
private long id;
private Set<Stock> longPositions;
private Set<Stock> shortPositions;
private double totalValue;
private Region region;
private boolean alerted; // alert for trading
@SossIndexAttribute // query-able property
public double getTotalValue() {…}
@SossIndexAttribute // query-able property
public Region getRegion() {…}
public Set<Long> evalPositions(MarketSnapshot ms) {…};
}
NamedCache pset = CacheFactory.getCache(“portfolios");
Set<Portfolio> res = pset.queryObjects(Portfolio.class,
and(greaterThan(“totalValue”, 1000000),
equals(“region”, Region.US)));
22 ScaleOut Software, Inc.
• Create method to analyze a queried portfolio and another method to
pair-wise merge the result sets of alerted portfolios:
Java Example: Parallel Method Invocation
public class PortfolioAnalysis implements
Invokable<Portfolio, MarketSnapshot, Set<Long>>
{
public Set<Long> eval(Portfolio p, MarketSnapshot ms)
throws InvokeException {
// update portfolio and return id if alerted:
return p.evalPositions(ms);
}
public Set<Long> merge(Set<Long> set1, Set<Long> set2)
throws InvokeException {
set1.addAll(set2);
return set1; // merged set of alerted portfolio ids
}}
23 ScaleOut Software, Inc.
• Run a parallel method invocation on a queried set of portfolios and
return set of ids for alerted portfolios:
Java Example: Parallel Method Invocation
NamedCache pset = CacheFactory.getCache(“portfolios");
InvokeResult alertedPortolios = pset.invoke(
PortfolioAnalysis.class,
Portfolio.class,
and(greaterThan(“totalValue”, 1000000), // query spec
equals(“region”, Region.US)),
marketSnapshot, // parameters
...
);
System.out.println("The alerted portfolios are" +
alertedPortfolios.getResult());
24 ScaleOut Software, Inc.
• IMDG ships user’s code and libraries to its servers.
• IMDG automatically schedules analysis operations across all grid
servers and cores:
• The analysis runs on all objects selected
by the parallel query.
• Each grid server analyzes its locally stored
objects to minimize data motion.
• Parallel execution ensures fast
completion time:
• IMDG automatically distributes
workload across servers/cores.
• Scaling the IMDG automatically
handles larger data sets.
Running the Analysis
25 ScaleOut Software, Inc.
• The IMDG automatically merges all analysis results:
• The IMDG first merges all results within each grid server in parallel.
• It then merges results across all grid servers to create one combined
result.
• Efficient parallel merge
minimizes the delay in
combining all results.
• The IMDG delivers the
combined result to the
invoking application as
one object.
Merging the Results
26 ScaleOut Software, Inc.
• Measured a similar financial services application (back testing stock
trading strategies on stock histories)
• Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock
history data in memory
• IMDG handled a continuous stream of updates (1.1 GB/s)
• Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling
Sample Performance Results for PMI
27 ScaleOut Software, Inc.
Benefits:
• Enables use of Hadoop MapReduce for operational intelligence.
• Accelerates data access by holding data in memory.
• Analyzes and updates “live” data.
• Reduces overheads of standard
Hadoop distributions:
• Batch scheduling
• Disk access
• Data shuffling
• Mandatory key sorting
• Enables new features, e.g.:
• Global combining, optional sorting
Using PMI to Implement
“In-Memory” Hadoop MapReduce
28 ScaleOut Software, Inc.
• A Hadoop distribution does not have to be installed unless HDFS is used.
• The developer starts MapReduce applications from a remote workstation.
• The IMDG automatically builds a reusable “invocation grid” of JVMs on the
grid’s servers for PMI and ships the application’s jars.
• Results are stored in the IMDG, HDFS, or optionally globally merged and
returned to the remote workstation.
Running MapReduce on the IMDG
29 ScaleOut Software, Inc.
Run In-Memory MR with YARN
• YARN, transparently integrates batch and in-memory MapReduce into
a single execution framework with shared access to HDFS.
• For example, hServer can transparently run Apache Hive in-memory.
Example of ScaleOut hServer with Hortonworks
Example of Hive
Running on hServer
30 ScaleOut Software, Inc.
Run MapReduce as two PMI
phases:
• Data can be input from either the
IMDG or an external data source.
• Works with any input/output format
compatible with the Apache
distribution.
• IMDG uses its data-parallel
execution engine (PMI) to invoke
the mappers and the reducers.
• Eliminates batch scheduling
overhead.
• Intermediate results are stored
within the IMDG.
• Minimizes data motion between the
mappers and reducers.
• Allows optional sorting.
• Output of a single reducer/combiner
optionally can be globally merged.
Implementing MapReduce
31 ScaleOut Software, Inc.
• IMDG adds grid input format for
accessing key/value pairs held in
the IMDG.
• MapReduce programs optionally
can output results to IMDG with
grid output format.
• Grid Record Reader optimizes
access to key/value pairs to
eliminate network overhead.
• Applications can access and
update key/value pairs as
operational data during analysis.
Accessing IMDG Data for M/R
32 ScaleOut Software, Inc.
• IMDG adds Dataset Record Reader (wrapper) to cache HDFS
data during program execution.
• Hadoop automatically retrieves data from ScaleOut IMDG on
subsequent runs.
• Dataset Record Reader
stores and retrieves data
with minimum network
and memory overheads.
• Tests with Terasort
benchmark have
demonstrated 11X
faster access latency
over HDFS without IMDG.
Optional Caching of HDFS Data
33 ScaleOut Software, Inc.
IMDG needs multiple in-memory
storage models:
• Named cache, optimized for
rich semantics on large
objects:
• Property-based query
• Distributed locking
• Access from remote grids
• Named map, optimized for
efficient storage and bulk
analysis (e.g., MapReduce):
• Highly efficient object storage
• Pipelined, bulk-access
mechanisms
Optimized In-Memory Storage
34 ScaleOut Software, Inc.
In-Memory Named Map:
• Stores key/value pairs in chunks.
• Allows CRUD operations on kvps.
• Automatically organizes chunks into
splits.
• Uses per-split hash table to access
keys and manage multi-valued
keys.
• Stores shuffled data set between
mappers and reducers.
• Pipelines chunks to mappers and
from reducers.
• Optionally uses memory mapped
files to reduce access latency.
• Provides support for sorting keys.
Named Map Optimizations
35 ScaleOut Software, Inc.
• Measured performance:
• Startup times reduced to a few milliseconds
• Word count benchmark shows 20X speedup.
• Real-world example shows >40X speedup.
• MapReduce optimizations:
• Optional sorting
• Optional multicast of parameters to mappers
• Optional O(logN) global combining (avoids
single reducer)
• Optional HDFS caching
• Optional reuse of JVMs across jobs
• Current limitations:
• No specific security for multi-tenancy
• Intermediate data must fit in the IMDG
Performance & Optimizations
36 ScaleOut Software, Inc.
• Invocation grids can be re-used across MapReduce jobs:
Accelerating Start-Up Times
public static void main(String argv[]) throws Exception {
//Configure and load the invocation grid
InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid").
// Add JAR files as IG dependencies
addJar("main-job.jar"). addJar("first-library.jar").
// Add classes as IG dependencies
addClass(MyMapper.class). addClass(MyReducer.class).
// Define custom JVM parameters
setJVMParameters("-Xms512M -Xmx1024M").
load();
//Run 10 jobs on the same invocation grid
for(int i=0; i<10; i++) {
Configuration conf = new Configuration();
//The preloaded invocation grid is passed as the parameter to the job
Job job = new HServerJob(conf, "Job number "+i, false, grid);
//......Configure the job here.........
//Run the job
job.waitForCompletion(true);
}
//Unload the invocation grid when we are done
grid.unload();
}
37 ScaleOut Software, Inc.
• IMDG can run Apache Hive distribution
unchanged.
• Accelerates queries for datasets hosted in
HDFS or the IMDG:
• Intermediate data must fit within the IMDG.
• Challenges we faced:
• Requires YARN to transparently invoke
MapReduce on IMDG.
• IMDG must use multiple JVMs per server
since Hive tasks are not thread-safe.
• IMDG must support Hadoop’s distributed
cache (required by Hive).
Running Hive on In-Memory Data
38 ScaleOut Software, Inc.
• Assume we have a named map called “customers” of
customer objects:
Example: Querying a Named Map
public class Customer implements Serializable
{
private int customerId;
private String firstName;
private String lastName;
private String login;
public int getCustomerId() {
return customerId;}
public String getFirstName() {
return firstName;}
...
}
39 ScaleOut Software, Inc.
• Create a table view of a named map:
• Associates class properties with columns.
• Allows properties to be omitted.
• Allows use of custom serialization.
Example: Querying a Named Map
public hive> CREATE TABLE
customers (customerid int, firstname string, lastname
string, login string)
STORED BY
'com.scaleoutsoftware.soss.hserver.hive.HServerHiveStorage
Handler'
TBLPROPERTIES ("hserver.map.name" = "customers");
OK
Time taken: 0.508 seconds
40 ScaleOut Software, Inc.
• Now query the named map:
Example: Querying a Named Map
hive> SELECT * FROM customers;
..............................
1 Eduardo Hazelrigg ehazelrigg
13 Serena Sadberry ssadberry
9 Ermelinda Manganaro emanganaro
5 Edda Speir espeir
17 Tomeka Stovall tstovall
21 Luciano Perkinson lperkinson
25 Jacob Garrow jgarrow
33 Quincy Kreutzer qkreutzer
37 Iona Speir ispeir
41 Ermelinda Thielen ethielen
Time taken: 0.475 seconds, Fetched: 100 row(s)
41 ScaleOut Software, Inc.
The Challenge: Operational intelligence to quickly evaluate
and respond to sub-second market changes:
• Hedge fund tracks a set of hedging strategies:
• Strategies can cover various market
sectors, such as high-tech, automotive,
energy, consumer, real estate, etc.
• Each strategy contains list of holdings
and rules for managing the holdings
(such as target allocations).
• Updates to market data
continuously arrive during
the trading day.
• Challenge: The hedge fund must be able to quickly update and
analyze its hedging strategies and provide alerts to traders.
Demo of the Finserv Application
42 ScaleOut Software, Inc.
• Delivers a stream of alerts to traders
within a few seconds.
• Enables the trader to examine strategy details in real time:
Output: Real-Time Alerts
43 ScaleOut Software, Inc.
• Video Link
Video
44 ScaleOut Software, Inc.
Fast map/reduce reconciles inventory and order systems
for an online retailer:
• Challenge: Inventory and online
order management are handled
by different applications.
• Reconciled once per day.
• Inaccurate orders reduces margins.
• Solution:
• Host SKUs in IMDG updated in real
time by order & inventory systems.
• Use MapReduce to reconcile in two minutes.
• Results: Real-time reconciliation ensures accurate orders.
Example in Ecommerce: Inventory
Management
45 ScaleOut Software, Inc.
• IMDG holds customer
information for active
Web users.
• IMDG saves/retrieves
customer information
from backing store.
• Web browsers send
activity information to
analytics engine.
• IMDG updates customer history and
preferences.
• Analytics engine identifies browsing and
buying patterns.
• Analytics engine makes suggestions in
real-time. Also sends email follow-ups.
Example: Web Shopping
46 ScaleOut Software, Inc.
• Track
connectivity
issues.
• Obtain time-
sensitive
business data.
• Offer enhanced
services.
• Increase
security.
Example: Telecommunications
Optimize Operations
Customer Experience
Historical queries
for real-time data
enrichment
Stream
persistence for
future analysis
Network
Elements
47 ScaleOut Software, Inc.
• Online systems need operational
intelligence on “live” data for
immediate feedback.
• Operational intelligence can be
implemented using standard
data-parallel computing
techniques, such as M/R.
• In-memory data grids provide
an excellent platform for
operational intelligence:
• Host and update “live” data.
• Implement high availability.
• Offer fast, data-parallel
computation for immediate
feedback.
Recap
Additional Information
48
49 ScaleOut Software, Inc.
• ScaleOut StateServer®
• In-Memory Data Grid for Windows and
Linux
• Scales application performance.
• Industry-leading performance and ease of use
• ScaleOut GeoServer® adds
• WAN based data replication for DR
• Breakthrough technology for global
data access
• ScaleOut Analytics Server® adds
• Real-time data analysis for “live” data
• Comprehensive management tools
• ScaleOut hServer®
• Full Hadoop Map/Reduce engine (>20X faster*)
• Hadoop Map/Reduce on live, in-memory data
ScaleOut Software Products
ScaleOut StateServer In-Memory Data Grid
Grid
Service
Grid
Service
Grid
Service
Grid
Service
*in benchmark testing
50 ScaleOut Software, Inc.
Many Use Cases:
• Authorizations / Payment
Processing / Mobile Payments
• Service Activation
• Inventory Management
• Sensor Data / SCADA
• Real Time Tracking
• Fraud Detection
• Situational Awareness
• Churn Management
• Market Feed / Event Handlers
• Execution Rules
• Financial: Risk, P&L, Pricing
• Operational Risk Compliance
The Need for Real-Time Analytics
Across Key Industries:
• CPG
• Financial
• Telco
• Retail
• Utilities
• Manufacturing
• Logistics
• IC / DoD
• Life Sciences
• Government
• Health Care
• Law enforcement
51 ScaleOut Software, Inc.
• Brick and mortar stores need to compete with online experience.
• Point-of-sale identifies opt-in customers to analytics engine.
• RFID tags identify product selection and availability in showroom.
• Analytics engine sends real-time advisories to sales staff via tablet.
Example: Retail Shopping
52 ScaleOut Software, Inc.
• Typically used for very large, static, offline datasets
• Data must be copied from disk-based storage (e.g., HDFS) into
memory for analysis.
• Hadoop Map/Reduce adds lengthy batch scheduling and data
shuffling overhead.
Problem: Hadoop Cannot Efficiently
Perform Real-Time Analytics
53 ScaleOut Software, Inc.
// This job will run using the Hadoop
// job tracker:
public static void main(String[] args)
throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(
TextInputFormat.class);
job.setOutputFormatClass(
TextOutputFormat.class);
FileInputFormat.addInputPath(
job, new Path(args[0]));
FileOutputFormat.setOutputPath(
job, new Path(args[1]));
job.waitForCompletion(true);
}
// This job will run using ScaleOut hServer:
public static void main(String[] args)
throws Exception {
Configuration conf = new Configuration();
Job job = new HServerJob(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(
TextInputFormat.class);
job.setOutputFormatClass(
TextOutputFormat.class);
FileInputFormat.addInputPath(
job, new Path(args[0]));
FileOutputFormat.setOutputPath(
job, new Path(args[1]));
job.waitForCompletion(true);
}
Configuring MapReduce for the IMDG
• Without YARN, subclass the Hadoop Job class with a one-line change (below).
• With YARN, just replace the MapReduce execution framework.
54 ScaleOut Software, Inc.
• Mark class properties as indexes for query:
• Define a query using these properties:
Parallel Query Example (C#)
class Stock {
[SossIndex]
public string Ticker { get; set; }
public decimal TotalShares { get; set; }
public decimal Price { get; set; }}
NamedCache cache = CacheFactory.GetCache("Stocks");
var q = from s in cache.QueryObjects<Stock>()
where s.Ticker == "GOOG" || s.Ticker == "ORCL"
select s;
Console.WriteLine("{0} Stocks found", q.Count());
55 ScaleOut Software, Inc.
• Create method to analyze each queried stock object:
• Create method to pair-wise merge the results:
Example of Analysis Code (C#)
static decimal eval(Stock stock, StockCalcParams params)
{
return stock.Price * stock.TotalShares;
}
static decimal merge(decimal r1, decimal r2)
{
return r1 + r2;
}
56 ScaleOut Software, Inc.
• Run a parallel method invocation:
Invoking the Parallel Analysis (C#)
NamedCache cache = CacheFactory.GetCache("Stocks");
decimal valueOfSelectedStocks =
(from s in cache.QueryObjects<Stock>()
where s.Ticker == "GOOG" || s.Ticker == "ORCL"
select s)
.Invoke(new StockCalcParams(…),
new Func<Stock, StockCalcParams, decimal>(eval))
.Merge(new Func<decimal, decimal, decimal>(merge));
Console.WriteLine(“The value of selected stocks is {0}",
valueOfSelectedStocks);

More Related Content

What's hot

Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreCloudera, Inc.
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduGrant Henke
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
 
Machine Learning for z/OS
Machine Learning for z/OSMachine Learning for z/OS
Machine Learning for z/OSCuneyt Goksu
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...Databricks
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
DataStax | Adversarial Modeling: Graph, ML, and Analytics for Identity Fraud ...
DataStax | Adversarial Modeling: Graph, ML, and Analytics for Identity Fraud ...DataStax | Adversarial Modeling: Graph, ML, and Analytics for Identity Fraud ...
DataStax | Adversarial Modeling: Graph, ML, and Analytics for Identity Fraud ...DataStax
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...Cloudera, Inc.
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Migrating legacy ERP data into Hadoop
Migrating legacy ERP data into HadoopMigrating legacy ERP data into Hadoop
Migrating legacy ERP data into HadoopDataWorks Summit
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and DeploymentCisco Canada
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudDataWorks Summit
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Data Con LA
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseGwen (Chen) Shapira
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudJames Serra
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHortonworks
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 

What's hot (20)

Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
IBM Power8 announce
IBM Power8 announceIBM Power8 announce
IBM Power8 announce
 
Machine Learning for z/OS
Machine Learning for z/OSMachine Learning for z/OS
Machine Learning for z/OS
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
DataStax | Adversarial Modeling: Graph, ML, and Analytics for Identity Fraud ...
DataStax | Adversarial Modeling: Graph, ML, and Analytics for Identity Fraud ...DataStax | Adversarial Modeling: Graph, ML, and Analytics for Identity Fraud ...
DataStax | Adversarial Modeling: Graph, ML, and Analytics for Identity Fraud ...
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Migrating legacy ERP data into Hadoop
Migrating legacy ERP data into HadoopMigrating legacy ERP data into Hadoop
Migrating legacy ERP data into Hadoop
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloud
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - Jaspersoft
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 

Similar to Making Hadoop Realtime by Dr. William Bain of Scaleout Software

New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...Big Data Spain
 
Operational Intelligence Using Hadoop
Operational Intelligence Using HadoopOperational Intelligence Using Hadoop
Operational Intelligence Using HadoopDataWorks Summit
 
Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013ScaleOut Software
 
November 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory gridNovember 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory gridYahoo Developer Network
 
Webinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-ServiceWebinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-ServiceMongoDB
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Michael Hiskey
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarRTTS
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...AboutYouGmbH
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Dataconomy Media
 
InfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceWilfried Hoge
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 

Similar to Making Hadoop Realtime by Dr. William Bain of Scaleout Software (20)

New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 
Operational Intelligence Using Hadoop
Operational Intelligence Using HadoopOperational Intelligence Using Hadoop
Operational Intelligence Using Hadoop
 
Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013
 
November 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory gridNovember 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory grid
 
Webinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-ServiceWebinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-Service
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
InfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experience
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Ibm db2update2019 icp4 data
Ibm db2update2019   icp4 dataIbm db2update2019   icp4 data
Ibm db2update2019 icp4 data
 

More from Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

More from Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Recently uploaded

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Making Hadoop Realtime by Dr. William Bain of Scaleout Software

  • 1. Enabling Operational Intelligence: Making Hadoop Real-Time Copyright © 2014 by ScaleOut Software, Inc. Los Angeles Big Data Users Group August 6, 2014 Bill Bain, CEO (wbain@scaleoutsoftware.com)
  • 2. 2 ScaleOut Software, Inc. • Operational Intelligence vs. Business Intelligence • Data-Parallel Computation • Implementing Using an In-Memory Data Grid: • Distributing the Data Across a Cluster • Running the Computation using “Parallel Method Invocation” • An Example in Financial Services • Implementing In-Memory Hadoop MapReduce • Video Demo • Examples of Applications in Operational Intelligence Agenda
  • 3. 3 ScaleOut Software, Inc. • Develops and markets In-Memory Data Grids, software middleware for: • Scaling application performance and • Providing operational intelligence using • In-memory data storage and computing • Dr. William Bain, Founder & CEO • Career focused on parallel computing – Bell Labs, Intel, Microsoft • 3 prior start-ups, last acquired by Microsoft and product now ships as Network Load Balancing in Windows Server • Eight years in the market; 400 customers, 10,000 servers • Sample customers: About ScaleOut Software
  • 4. 4 ScaleOut Software, Inc. Goal: Provide immediate feedback to a system handling live data. A few examples: • Equity trading: to minimize risk during a trading day • Ecommerce: for real-time recommendations • Reservations systems: to identify issues, reroute, etc. • Credit cards & wire transfers: to detect fraud in real time • Smart grids: to optimize power distribution & detect issues Online Systems Need Operational Intelligence
  • 5. 5 ScaleOut Software, Inc. Big Data Analytics Real-Time vs. Batch Analytics Static data sets Petabytes Disk storage Hours to minutes Best uses: • Analyzing warehoused data • Mining for long- term trends Live data sets Gigabytes to terabytes In-memory storage Minutes to seconds Best uses: • Tracking live data • Immediately identifying trends and capturing opportunities • Providing immediate feedback Analytics Server hServer Hadoop IBM Teradata SAS SAP Real-Time Batch Real-time “Operational Intelligence” Batch “Business Intelligence”
  • 6. 6 ScaleOut Software, Inc. • Operational intelligence can co-exist with business intelligence: • Processes streaming data close to its sources. • Provides real-time, “tactical” feedback (e.g., recommendations, alerts). • Translates data for storage in the data warehouse (ETL). • Data warehouse provides “strategic” guidance. • Using the same tool set (e.g., Hadoop MapReduce) lowers TCO: • Leverages common skill set. • Simplifies design (e.g., loading data into HDFS). Integrated View of Analytics
  • 7. 7 ScaleOut Software, Inc. • To keep up with fast growing “live” workloads & maintain fast response times: • Ex.: Handle incoming data streams in real time. • Ex. Process updates to data set based on incoming data. • To identify and respond to trends in fast-changing data: • Ex. Evaluate data set changes in real time. • Ex. Respond to identified patterns within seconds. Challenges for Operational Intelligence 0 50 100 150 200 250 300 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 Millions Growth in Web Servers Source: Netcraft 0 500 1000 1500 2000 2500 3000 3500 4000 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Exebytes Growth in “Big Data” “More data has been created in the past three years than in the past 40,000.”
  • 8. 8 ScaleOut Software, Inc. The Solution: Data-Parallel Computing • Straightforward, well understood model of parallel computation • An alternative to task-parallel computation (e.g., Storm) • Simple: runs the same code on multiple, in-memory data items. • Powerful: maintains a “live,” in-memory model of a real-world system • Fast: avoids data motion which lowers speedup. Analyze Data (Eval) Combine Results (Merge) Server Cluster
  • 9. 9 ScaleOut Software, Inc. Track “live” data and analyze in real-time: Implementing OI In-Memory State NoSQL Storage Real-Time Data Parallel Analysis
  • 10. 10 ScaleOut Software, Inc. • Storm implements pipelined execution of tasks by “bolts” on incoming data streams. • Streams can be distributed to bolts with configurable mappings. • Developer controls the number of tasks per bolt. • Storm uses a centralized master node and Zookeeper for fault- tolerance. • Key strength: continuous processing of input streams • Issues: • Complexity / tuning • Minimizing data motion • Managing global state Quick Comparison to Storm
  • 11. 11 ScaleOut Software, Inc. Data-Parallel Enables Linear Speedup Avoids data motion (network or disk I/O) which limits throughput:
  • 12. 12 ScaleOut Software, Inc. Data-Parallel Computing Is Not New • 1980’s: Special Purpose Hardware: “SIMD” Thinking Machines Connection Machine 5 • 1990’s: General Purpose Parallel Supercomputers: “Domain Decomposition”, “SPMD” Intel IPSC-2 IBM SP1
  • 13. 13 ScaleOut Software, Inc. Data-Parallel Computing Is Not New • 1990’s – early 2000’s: HPC on Clusters: “MPI” • Since 2003: Clusters, the Cloud, and IMDGs: “MapReduce” HP Blade Servers Amazon EC2, Windows Azure
  • 14. 14 ScaleOut Software, Inc. • In-memory data grid (IMDG) holds active entities undergoing state changes in memory. • Backing store optionally holds large population of entities. • IMDG processes incoming stream of state changes. • Analytics engine examines entities in real time and generates alerts within seconds as needed. A Data-Parallel Architecture for Operational Intelligence
  • 15. 15 ScaleOut Software, Inc. In-Memory Data Grid (IMDG) stores “live” data in a cluster: • Fits in the business logic layer: • Follows object-oriented view of data (vs. relational view). • Stores collections of Java/.NET objects shared by multiple clients. • Uses create/read/update/delete and query APIs to access data. • Implemented across a cluster of servers or VMs: • Scales storage and throughput by adding servers. • Provides high availability in case a server fails. In-Memory Data Grid for Live Data
  • 16. 16 ScaleOut Software, Inc. • IMDG’s collections of objects act like in- memory collections: • Unstructured, typically instances of a class (stored as serialized blobs) • Individually accessible / update-able • IMDG adds attributes: • Accessible by global key • Query-able by properties • Highly available • Optional timeouts • Distributed locking • Integration with a backing store • Optional dependency relationships • Asynchronous event handling IMDGs Store “Live” Data Basic “CRUD” APIs: • Create(key, obj, tout) • Read(key) • Update(key, obj) • Delete(key) and… • Lock(key) • Unlock(key) Object key
  • 17. 17 ScaleOut Software, Inc. Spark / Spark Streaming from U.C. Berkeley amplab: • In-memory computing to accelerate and extend Hadoop MapReduce using data- parallel operators in Scala. • Stores data as “resilient distributed datasets” (RDDs): • Distributed across cluster • Immutable • Hold data from/output to HDFS. • Store data stream as a sequence of RDDs. • Comparison to IMDG: • Not designed for “live” data: • Lacks CRUD on individual objects. • Lacks high availability. • Designed for “data parallel” operators. Quick Comparison to Spark
  • 18. 18 ScaleOut Software, Inc. Data-Parallel Computing Using PMI “Parallel Method Invocation” (PMI): an object-oriented version of data- parallel computing from the HPC community: • Serves as a platform for MapReduce and other data-parallel operators. • Selects objects using a parallel query on data hosted in the IMDG. • Runs user-defined methods in parallel across the cluster. Analyze Data (Eval) Combine Results (Merge) In-Memory Data Grid Runs Data-Parallel Computation.
  • 19. 19 ScaleOut Software, Inc. Integrate analysis into a stock trading platform: • The IMDG holds market data and hedging strategies. • Updates to market data continuously flow through the IMDG. • The IMDG performs repeated data-parallel analysis on hedging strategies and alerts traders in real time. • IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers. Example in Financial Services
  • 20. 20 ScaleOut Software, Inc. Selects all relevant objects in a distributed collection. • Query spec matches data’s object-oriented properties. • Selected objects are fed to the analysis engine on the local server. Step 1: Select with Parallel Query
  • 21. 21 ScaleOut Software, Inc. Java Example: Parallel Query public class Portfolio { private long id; private Set<Stock> longPositions; private Set<Stock> shortPositions; private double totalValue; private Region region; private boolean alerted; // alert for trading @SossIndexAttribute // query-able property public double getTotalValue() {…} @SossIndexAttribute // query-able property public Region getRegion() {…} public Set<Long> evalPositions(MarketSnapshot ms) {…}; } NamedCache pset = CacheFactory.getCache(“portfolios"); Set<Portfolio> res = pset.queryObjects(Portfolio.class, and(greaterThan(“totalValue”, 1000000), equals(“region”, Region.US)));
  • 22. 22 ScaleOut Software, Inc. • Create method to analyze a queried portfolio and another method to pair-wise merge the result sets of alerted portfolios: Java Example: Parallel Method Invocation public class PortfolioAnalysis implements Invokable<Portfolio, MarketSnapshot, Set<Long>> { public Set<Long> eval(Portfolio p, MarketSnapshot ms) throws InvokeException { // update portfolio and return id if alerted: return p.evalPositions(ms); } public Set<Long> merge(Set<Long> set1, Set<Long> set2) throws InvokeException { set1.addAll(set2); return set1; // merged set of alerted portfolio ids }}
  • 23. 23 ScaleOut Software, Inc. • Run a parallel method invocation on a queried set of portfolios and return set of ids for alerted portfolios: Java Example: Parallel Method Invocation NamedCache pset = CacheFactory.getCache(“portfolios"); InvokeResult alertedPortolios = pset.invoke( PortfolioAnalysis.class, Portfolio.class, and(greaterThan(“totalValue”, 1000000), // query spec equals(“region”, Region.US)), marketSnapshot, // parameters ... ); System.out.println("The alerted portfolios are" + alertedPortfolios.getResult());
  • 24. 24 ScaleOut Software, Inc. • IMDG ships user’s code and libraries to its servers. • IMDG automatically schedules analysis operations across all grid servers and cores: • The analysis runs on all objects selected by the parallel query. • Each grid server analyzes its locally stored objects to minimize data motion. • Parallel execution ensures fast completion time: • IMDG automatically distributes workload across servers/cores. • Scaling the IMDG automatically handles larger data sets. Running the Analysis
  • 25. 25 ScaleOut Software, Inc. • The IMDG automatically merges all analysis results: • The IMDG first merges all results within each grid server in parallel. • It then merges results across all grid servers to create one combined result. • Efficient parallel merge minimizes the delay in combining all results. • The IMDG delivers the combined result to the invoking application as one object. Merging the Results
  • 26. 26 ScaleOut Software, Inc. • Measured a similar financial services application (back testing stock trading strategies on stock histories) • Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock history data in memory • IMDG handled a continuous stream of updates (1.1 GB/s) • Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling Sample Performance Results for PMI
  • 27. 27 ScaleOut Software, Inc. Benefits: • Enables use of Hadoop MapReduce for operational intelligence. • Accelerates data access by holding data in memory. • Analyzes and updates “live” data. • Reduces overheads of standard Hadoop distributions: • Batch scheduling • Disk access • Data shuffling • Mandatory key sorting • Enables new features, e.g.: • Global combining, optional sorting Using PMI to Implement “In-Memory” Hadoop MapReduce
  • 28. 28 ScaleOut Software, Inc. • A Hadoop distribution does not have to be installed unless HDFS is used. • The developer starts MapReduce applications from a remote workstation. • The IMDG automatically builds a reusable “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s jars. • Results are stored in the IMDG, HDFS, or optionally globally merged and returned to the remote workstation. Running MapReduce on the IMDG
  • 29. 29 ScaleOut Software, Inc. Run In-Memory MR with YARN • YARN, transparently integrates batch and in-memory MapReduce into a single execution framework with shared access to HDFS. • For example, hServer can transparently run Apache Hive in-memory. Example of ScaleOut hServer with Hortonworks Example of Hive Running on hServer
  • 30. 30 ScaleOut Software, Inc. Run MapReduce as two PMI phases: • Data can be input from either the IMDG or an external data source. • Works with any input/output format compatible with the Apache distribution. • IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers. • Eliminates batch scheduling overhead. • Intermediate results are stored within the IMDG. • Minimizes data motion between the mappers and reducers. • Allows optional sorting. • Output of a single reducer/combiner optionally can be globally merged. Implementing MapReduce
  • 31. 31 ScaleOut Software, Inc. • IMDG adds grid input format for accessing key/value pairs held in the IMDG. • MapReduce programs optionally can output results to IMDG with grid output format. • Grid Record Reader optimizes access to key/value pairs to eliminate network overhead. • Applications can access and update key/value pairs as operational data during analysis. Accessing IMDG Data for M/R
  • 32. 32 ScaleOut Software, Inc. • IMDG adds Dataset Record Reader (wrapper) to cache HDFS data during program execution. • Hadoop automatically retrieves data from ScaleOut IMDG on subsequent runs. • Dataset Record Reader stores and retrieves data with minimum network and memory overheads. • Tests with Terasort benchmark have demonstrated 11X faster access latency over HDFS without IMDG. Optional Caching of HDFS Data
  • 33. 33 ScaleOut Software, Inc. IMDG needs multiple in-memory storage models: • Named cache, optimized for rich semantics on large objects: • Property-based query • Distributed locking • Access from remote grids • Named map, optimized for efficient storage and bulk analysis (e.g., MapReduce): • Highly efficient object storage • Pipelined, bulk-access mechanisms Optimized In-Memory Storage
  • 34. 34 ScaleOut Software, Inc. In-Memory Named Map: • Stores key/value pairs in chunks. • Allows CRUD operations on kvps. • Automatically organizes chunks into splits. • Uses per-split hash table to access keys and manage multi-valued keys. • Stores shuffled data set between mappers and reducers. • Pipelines chunks to mappers and from reducers. • Optionally uses memory mapped files to reduce access latency. • Provides support for sorting keys. Named Map Optimizations
  • 35. 35 ScaleOut Software, Inc. • Measured performance: • Startup times reduced to a few milliseconds • Word count benchmark shows 20X speedup. • Real-world example shows >40X speedup. • MapReduce optimizations: • Optional sorting • Optional multicast of parameters to mappers • Optional O(logN) global combining (avoids single reducer) • Optional HDFS caching • Optional reuse of JVMs across jobs • Current limitations: • No specific security for multi-tenancy • Intermediate data must fit in the IMDG Performance & Optimizations
  • 36. 36 ScaleOut Software, Inc. • Invocation grids can be re-used across MapReduce jobs: Accelerating Start-Up Times public static void main(String argv[]) throws Exception { //Configure and load the invocation grid InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid"). // Add JAR files as IG dependencies addJar("main-job.jar"). addJar("first-library.jar"). // Add classes as IG dependencies addClass(MyMapper.class). addClass(MyReducer.class). // Define custom JVM parameters setJVMParameters("-Xms512M -Xmx1024M"). load(); //Run 10 jobs on the same invocation grid for(int i=0; i<10; i++) { Configuration conf = new Configuration(); //The preloaded invocation grid is passed as the parameter to the job Job job = new HServerJob(conf, "Job number "+i, false, grid); //......Configure the job here......... //Run the job job.waitForCompletion(true); } //Unload the invocation grid when we are done grid.unload(); }
  • 37. 37 ScaleOut Software, Inc. • IMDG can run Apache Hive distribution unchanged. • Accelerates queries for datasets hosted in HDFS or the IMDG: • Intermediate data must fit within the IMDG. • Challenges we faced: • Requires YARN to transparently invoke MapReduce on IMDG. • IMDG must use multiple JVMs per server since Hive tasks are not thread-safe. • IMDG must support Hadoop’s distributed cache (required by Hive). Running Hive on In-Memory Data
  • 38. 38 ScaleOut Software, Inc. • Assume we have a named map called “customers” of customer objects: Example: Querying a Named Map public class Customer implements Serializable { private int customerId; private String firstName; private String lastName; private String login; public int getCustomerId() { return customerId;} public String getFirstName() { return firstName;} ... }
  • 39. 39 ScaleOut Software, Inc. • Create a table view of a named map: • Associates class properties with columns. • Allows properties to be omitted. • Allows use of custom serialization. Example: Querying a Named Map public hive> CREATE TABLE customers (customerid int, firstname string, lastname string, login string) STORED BY 'com.scaleoutsoftware.soss.hserver.hive.HServerHiveStorage Handler' TBLPROPERTIES ("hserver.map.name" = "customers"); OK Time taken: 0.508 seconds
  • 40. 40 ScaleOut Software, Inc. • Now query the named map: Example: Querying a Named Map hive> SELECT * FROM customers; .............................. 1 Eduardo Hazelrigg ehazelrigg 13 Serena Sadberry ssadberry 9 Ermelinda Manganaro emanganaro 5 Edda Speir espeir 17 Tomeka Stovall tstovall 21 Luciano Perkinson lperkinson 25 Jacob Garrow jgarrow 33 Quincy Kreutzer qkreutzer 37 Iona Speir ispeir 41 Ermelinda Thielen ethielen Time taken: 0.475 seconds, Fetched: 100 row(s)
  • 41. 41 ScaleOut Software, Inc. The Challenge: Operational intelligence to quickly evaluate and respond to sub-second market changes: • Hedge fund tracks a set of hedging strategies: • Strategies can cover various market sectors, such as high-tech, automotive, energy, consumer, real estate, etc. • Each strategy contains list of holdings and rules for managing the holdings (such as target allocations). • Updates to market data continuously arrive during the trading day. • Challenge: The hedge fund must be able to quickly update and analyze its hedging strategies and provide alerts to traders. Demo of the Finserv Application
  • 42. 42 ScaleOut Software, Inc. • Delivers a stream of alerts to traders within a few seconds. • Enables the trader to examine strategy details in real time: Output: Real-Time Alerts
  • 43. 43 ScaleOut Software, Inc. • Video Link Video
  • 44. 44 ScaleOut Software, Inc. Fast map/reduce reconciles inventory and order systems for an online retailer: • Challenge: Inventory and online order management are handled by different applications. • Reconciled once per day. • Inaccurate orders reduces margins. • Solution: • Host SKUs in IMDG updated in real time by order & inventory systems. • Use MapReduce to reconcile in two minutes. • Results: Real-time reconciliation ensures accurate orders. Example in Ecommerce: Inventory Management
  • 45. 45 ScaleOut Software, Inc. • IMDG holds customer information for active Web users. • IMDG saves/retrieves customer information from backing store. • Web browsers send activity information to analytics engine. • IMDG updates customer history and preferences. • Analytics engine identifies browsing and buying patterns. • Analytics engine makes suggestions in real-time. Also sends email follow-ups. Example: Web Shopping
  • 46. 46 ScaleOut Software, Inc. • Track connectivity issues. • Obtain time- sensitive business data. • Offer enhanced services. • Increase security. Example: Telecommunications Optimize Operations Customer Experience Historical queries for real-time data enrichment Stream persistence for future analysis Network Elements
  • 47. 47 ScaleOut Software, Inc. • Online systems need operational intelligence on “live” data for immediate feedback. • Operational intelligence can be implemented using standard data-parallel computing techniques, such as M/R. • In-memory data grids provide an excellent platform for operational intelligence: • Host and update “live” data. • Implement high availability. • Offer fast, data-parallel computation for immediate feedback. Recap
  • 49. 49 ScaleOut Software, Inc. • ScaleOut StateServer® • In-Memory Data Grid for Windows and Linux • Scales application performance. • Industry-leading performance and ease of use • ScaleOut GeoServer® adds • WAN based data replication for DR • Breakthrough technology for global data access • ScaleOut Analytics Server® adds • Real-time data analysis for “live” data • Comprehensive management tools • ScaleOut hServer® • Full Hadoop Map/Reduce engine (>20X faster*) • Hadoop Map/Reduce on live, in-memory data ScaleOut Software Products ScaleOut StateServer In-Memory Data Grid Grid Service Grid Service Grid Service Grid Service *in benchmark testing
  • 50. 50 ScaleOut Software, Inc. Many Use Cases: • Authorizations / Payment Processing / Mobile Payments • Service Activation • Inventory Management • Sensor Data / SCADA • Real Time Tracking • Fraud Detection • Situational Awareness • Churn Management • Market Feed / Event Handlers • Execution Rules • Financial: Risk, P&L, Pricing • Operational Risk Compliance The Need for Real-Time Analytics Across Key Industries: • CPG • Financial • Telco • Retail • Utilities • Manufacturing • Logistics • IC / DoD • Life Sciences • Government • Health Care • Law enforcement
  • 51. 51 ScaleOut Software, Inc. • Brick and mortar stores need to compete with online experience. • Point-of-sale identifies opt-in customers to analytics engine. • RFID tags identify product selection and availability in showroom. • Analytics engine sends real-time advisories to sales staff via tablet. Example: Retail Shopping
  • 52. 52 ScaleOut Software, Inc. • Typically used for very large, static, offline datasets • Data must be copied from disk-based storage (e.g., HDFS) into memory for analysis. • Hadoop Map/Reduce adds lengthy batch scheduling and data shuffling overhead. Problem: Hadoop Cannot Efficiently Perform Real-Time Analytics
  • 53. 53 ScaleOut Software, Inc. // This job will run using the Hadoop // job tracker: public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass( TextInputFormat.class); job.setOutputFormatClass( TextOutputFormat.class); FileInputFormat.addInputPath( job, new Path(args[0])); FileOutputFormat.setOutputPath( job, new Path(args[1])); job.waitForCompletion(true); } // This job will run using ScaleOut hServer: public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new HServerJob(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass( TextInputFormat.class); job.setOutputFormatClass( TextOutputFormat.class); FileInputFormat.addInputPath( job, new Path(args[0])); FileOutputFormat.setOutputPath( job, new Path(args[1])); job.waitForCompletion(true); } Configuring MapReduce for the IMDG • Without YARN, subclass the Hadoop Job class with a one-line change (below). • With YARN, just replace the MapReduce execution framework.
  • 54. 54 ScaleOut Software, Inc. • Mark class properties as indexes for query: • Define a query using these properties: Parallel Query Example (C#) class Stock { [SossIndex] public string Ticker { get; set; } public decimal TotalShares { get; set; } public decimal Price { get; set; }} NamedCache cache = CacheFactory.GetCache("Stocks"); var q = from s in cache.QueryObjects<Stock>() where s.Ticker == "GOOG" || s.Ticker == "ORCL" select s; Console.WriteLine("{0} Stocks found", q.Count());
  • 55. 55 ScaleOut Software, Inc. • Create method to analyze each queried stock object: • Create method to pair-wise merge the results: Example of Analysis Code (C#) static decimal eval(Stock stock, StockCalcParams params) { return stock.Price * stock.TotalShares; } static decimal merge(decimal r1, decimal r2) { return r1 + r2; }
  • 56. 56 ScaleOut Software, Inc. • Run a parallel method invocation: Invoking the Parallel Analysis (C#) NamedCache cache = CacheFactory.GetCache("Stocks"); decimal valueOfSelectedStocks = (from s in cache.QueryObjects<Stock>() where s.Ticker == "GOOG" || s.Ticker == "ORCL" select s) .Invoke(new StockCalcParams(…), new Func<Stock, StockCalcParams, decimal>(eval)) .Merge(new Func<decimal, decimal, decimal>(merge)); Console.WriteLine(“The value of selected stocks is {0}", valueOfSelectedStocks);