SlideShare una empresa de Scribd logo
1 de 41
Descargar para leer sin conexión
This is not a contribution
Evan Chan June 2016
•
700 Updatable Queries Per Second:
•
Spark as a Real-Time Web Service
This is not a contribution
Who Am I?
User and contributor to Spark since 0.9, Cassandra
since 0.6
Datastax Cassandra MVP
Created Spark Job Server and FiloDB
Talks at Spark Summit, Cassandra Summit, Strata, Scala
Days, etc.
This is not a contribution
Apache Spark
•
Usually used for rich analytics, not time-critical.
Machine learning: generating models, predictions, etc.
SQL Queries seconds to minutes, low concurrency
Stream processing
•
What about for low-latency, highly concurrent queries?
Dashboards?
This is not a contribution
Low-Latency Web Queries
•
Why is it important?
Dashboards
Interactive analytics
Real-time data processing
•
Why not use the Spark stack for this?
This is not a contribution
Web Query Stack
Web Client / JS
App Server
RDBMS
This is not a contribution
Spark-based Low-Latency Stack
Web Client / JS
???
Spark
This is not a contribution
Creating a new SparkContext is S-L-O-W
Start up HTTP/BitTorrent File Server
Start up UI
Start up executor processes and wait for
confirmation
•
The bigger the cluster, the slower!
This is not a contribution
Using a Persistent Context for Low Latency
Avoid high overhead of Spark application launch
Standard pattern:
Spark Job Server
Hive Thrift Server
Accept queries and run them in context
Usually means fixed resources - great for SLA
predictability
This is not a contribution
FAIR Scheduling
FIFO vs FAIR Scheduling
FAIR scheduler can co-schedule concurrent Spark
jobs even if they take up lots of resources
Scheduler pools with individual policies
Higher concurrency
FIFO allows concurrency if tasks do not use up all
threads
In Mesos, use coarse-grained mode to avoid
launching executors on every Spark task
This is not a contribution
Low-Latency Game Plan
Start a persistent Spark Context (ex. the Hive
ThriftServer - we’ll get to that below)
Run it in FAIR scheduler mode
Use fast in-memory storage
Maximize concurrency by using as few partitions/
threads as possible
Host the data and run it on a single node - avoid
expensive network shuffles
This is not a contribution
In-Memory Storage
Is it really faster than on disk files? With OS Caching
It's about consistency of performance - not just hot
data in the page cache, but ALL data.
Fast random access
Making different tradeoffs as new memory
technologies emerge (NVRAM etc.)
Higher IO -> less need for compression
Apache Arrow
This is not a contribution
•
So, let’s talk about Spark storage in
detail…
This is not a contribution
HDFS? Parquet Files?
Column pruning speeds up I/O significantly
Still have to scan lots of files
File organization not the easiest for filtering
For low-latency, need much more fine-grained
indexing
This is not a contribution
Cached RDDs
•
Let's say you have an RDD[T], where each item is of type T.
Bytes are saved on JVM heap, or optionally heap + disk
Spark optionally serializes it, using by default Java
serialization, so it (hopefully) takes up less space
Pros: easy (myRdd.cache())
Cons: have to iterate over every item, no column
pruning, slow if need to deserialize, memory hungry,
cannot update
This is not a contribution
Cached DataFrames
•
Works on a DataFrame (RDD[Row] with a schema)
•
sqlContext.cacheTable(tableA)
Uses columnar storage for very efficient storage
Columnar pruning for faster querying
Pros: easy, efficient memory footprint, fast!
Cons: no filtering, cannot update
This is not a contribution
Why are Updates Important?
Appends
Streaming workloads. Add new data continuously.
Real data is *always* changing. Queries on live
real-time data has business benefits.
Updates
Idempotency = really simple ingestion pipelines
Simpler streaming later
update late events (See Spark 2.0 Structured
Streaming)
This is not a contribution
Advantages of Filtering
Two methods to lower query latency:
Scan data faster (in-memory)
Scan less data (filtering)
RDDs and cached DFs - prune by partition
Dynamo/BigTable - 2D Filtering
Filter by partition
Filter within partitions
This is not a contribution
Workarounds - Updating RDDs
Union(oldRDD, newRDD)
Creates a tree of RDDs - slows down queries
significantly
IndexedRDD
This is not a contribution
•
Introducing FiloDB.
•
A distributed, versioned, columnar
analytics database.
•
Built for streaming.
This is not a contribution
Fast Analytics Storage
Scan speeds competitive with Apache Parquet
In-memory version significantly faster
Flexible filtering along two dimensions
Much more efficient and flexible partition key
filtering
Efficient columnar storage using dictionary encoding
and other techniques
Updatable
This is not a contribution
Comparing Storage Costs and Query Speeds
•
https://www.oreilly.com/ideas/apache-cassandra-for-
analytics-a-performance-and-storage-analysis
This is not a contribution
Robust Distributed Storage
•
In-memory storage engine, or
•
Apache Cassandra as the rock-solid storage engine.
This is not a contribution
Cassandra-like Data Model
partition keys - distributes data around a cluster, and
allows for fine grained and flexible filtering
segment keys - do range scans within a partition, e.g. by
time slice
primary key based ingestion and updates
Column A Column B
Partition Key 1 Segment 1 Segment 2 Segment 1 Segment 2
Partition Key 2 Segment 1 Segment 2 Segment 1 Segment 2
This is not a contribution
Very Flexible Filtering
•
Unlike Cassandra, FiloDB offers very flexible and
efficient filtering on partition keys. Partial key matches,
fast IN queries on any part of the partition key.
•
No need to write multiple tables to work around
answering different queries.
This is not a contribution
Spark SQL Queries!
•
- Read to and write from Spark Dataframes
•
- Append/merge to FiloDB table from Spark Streaming
•
- Use Tableau or any other JDBC tool
CREATE TABLE gdelt USING filodb.spark OPTIONS (dataset "gdelt");
SELECT Actor1Name, Actor2Name, AvgTone FROM gdelt ORDER BY AvgTone
DESC LIMIT 15;
INSERT INTO gdelt SELECT * FROM NewMonthData;
This is not a contribution
What’s in the Name?
•
Rich, sweet layers of distributed, versioned database
goodness
This is not a contribution
Message
Queue
Events
Spark
Streaming
Short term
storage, K-V
Adhoc,
SQL, ML
Cassandra
FiloDB: Events,
ad-hoc, batch
Spark
Dashboa
rds,
maps
This is not a contribution
SMACK stack for all your analytics
Regular Cassandra tables for highly concurrent,
aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event
storage
Ad hoc / SQL / BI
Data source for MLLib / building models
Data storage for classified / predicted / scored data
This is not a contribution
Message
Queue
Events
Spark
Streaming
Models
Cassandra
FiloDB: Long term event storage
Spark Learned
Data
This is not a contribution
•
Fast SQL Server in Spark
This is not a contribution
Data:The New York City Taxi Dataset
•
The public NYC Taxi Dataset contains telemetry (pickup, dropoff locations, times)
info on millions of taxi rides in NYC.
Partition key - :stringPrefix medallion 2 - hash multiple drivers trips into
~300 partitions
Segment key - :timeslice pickup_datetime 6d
Row key - hack_license, pickup_datetime
•
Allows for easy filtering by individual drivers, and slicing by time.
Medallion Prefix 1/1 - 1/6 1/7 - 1/12
AA records records
AB records records
This is not a contribution
collectAsync
•
To support running concurrent queries better, we rely on a
relatively unknown feature of Spark's RDD API, collectAync:
•
sqlContext.sql(queryString).rdd.collectAsync
•
This returns a Scala Future, which can easily be composed
using Future.sequence to launch a whole series of
asynchronous RDD operations. They will be executed with
the help of a separate ForkJoin thread pool.
This is not a contribution
Initial Results
Run lots of queries concurrently using collectAsync
Spark local[*] mode
SQL queries on first million rows of NYC Taxi dataset
50 Queries per Second
Most of time not running queries but parsing SQL !
This is not a contribution
Some Observations
1. Starting up a Spark task is actually pretty low
latency - milliseconds
2. One huge benefit to filtering is reduced thread/CPU
usage. Most of the queries ended up being single
partition / single thread.
This is not a contribution
Lessons
1. Cache the SQL to DataFrame/LogicalPlan parsing.
This saves ~20ms per parse, which is not
insignificant for low-latency apps
2. Distribute the SQL parsing away from the main
thread so it's not gated by one thread
This is not a contribution
SQL Plan Caching
•
Cache the `DataFrame` containing the logical plan translated from
parsing SQL.
•
Now - **700 QPS**!!
val cachedDF = new collection.mutable.HashMap[String, DataFrame]
def getCachedDF(query: String): DataFrame =
cachedDF.getOrElseUpdate(query, sql.sql(query))
This is not a contribution
Scaling with More Data
•
15 million rows of NYC Taxi data - **still 700 QPS**!
•
This makes sense due to the efficiency of querying.
This is not a contribution
Fast Spark Query Stack
Run Spark context on heap with `local[*]`
Load FiloDB-Spark connector, load data in memory
Very fast queries all in process
Front end app
FiloDB-Spark
SparkContext
InMemoryColumnStore
This is not a contribution
Fast Spark Query Stack II
HTTP/REST using Spark Job Server
JS app
Spark Job Server
FiloDB-Spark
SparkContext
InMemoryColumnStoreHTTP / REST
This is not a contribution
Slower: Hive Thrift Server Stack
BI Client
Hive Thrift Server
Spark
JDBC
FiloDB-Spark
SQLContext
Hive MetaStore
This is not a contribution
Your Contributions Welcome!
•
http://github.com/tuplejump/FiloDB

Más contenido relacionado

La actualidad más candente

Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaSpark Summit
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiSpark Summit
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceSachin Aggarwal
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and SparkEvan Chan
 
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Anya Bida
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark StreamingGerard Maas
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 

La actualidad más candente (19)

Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika Technologies
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 

Destacado

Postgres Open 2014 - A Performance Characterization of Postgres on Different ...
Postgres Open 2014 - A Performance Characterization of Postgres on Different ...Postgres Open 2014 - A Performance Characterization of Postgres on Different ...
Postgres Open 2014 - A Performance Characterization of Postgres on Different ...Faisal Akber
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...DataStax
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataVictor Coustenoble
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
Infinit: Modern Storage Platform for Container Environments
Infinit: Modern Storage Platform for Container EnvironmentsInfinit: Modern Storage Platform for Container Environments
Infinit: Modern Storage Platform for Container EnvironmentsDocker, Inc.
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 

Destacado (8)

Postgres Open 2014 - A Performance Characterization of Postgres on Different ...
Postgres Open 2014 - A Performance Characterization of Postgres on Different ...Postgres Open 2014 - A Performance Characterization of Postgres on Different ...
Postgres Open 2014 - A Performance Characterization of Postgres on Different ...
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Infinit: Modern Storage Platform for Container Environments
Infinit: Modern Storage Platform for Container EnvironmentsInfinit: Modern Storage Platform for Container Environments
Infinit: Modern Storage Platform for Container Environments
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 

Similar a 700 Updatable Queries Per Second: Spark as a Real-Time Web Service

TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkDataStax Academy
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & pythonMaloy Manna, PMP®
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestDatabricks
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkdatamantra
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersDatabricks
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupHyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentationRamesh Mudunuri
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 

Similar a 700 Updatable Queries Per Second: Spark as a Real-Time Web Service (20)

TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 

Más de Evan Chan

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustEvan Chan
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesEvan Chan
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Evan Chan
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureEvan Chan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 

Más de Evan Chan (7)

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 

Último

Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Christo Ananth
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spaintimesproduction05
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 

Último (20)

Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 

700 Updatable Queries Per Second: Spark as a Real-Time Web Service

  • 1. This is not a contribution Evan Chan June 2016 • 700 Updatable Queries Per Second: • Spark as a Real-Time Web Service
  • 2. This is not a contribution Who Am I? User and contributor to Spark since 0.9, Cassandra since 0.6 Datastax Cassandra MVP Created Spark Job Server and FiloDB Talks at Spark Summit, Cassandra Summit, Strata, Scala Days, etc.
  • 3. This is not a contribution Apache Spark • Usually used for rich analytics, not time-critical. Machine learning: generating models, predictions, etc. SQL Queries seconds to minutes, low concurrency Stream processing • What about for low-latency, highly concurrent queries? Dashboards?
  • 4. This is not a contribution Low-Latency Web Queries • Why is it important? Dashboards Interactive analytics Real-time data processing • Why not use the Spark stack for this?
  • 5. This is not a contribution Web Query Stack Web Client / JS App Server RDBMS
  • 6. This is not a contribution Spark-based Low-Latency Stack Web Client / JS ??? Spark
  • 7. This is not a contribution Creating a new SparkContext is S-L-O-W Start up HTTP/BitTorrent File Server Start up UI Start up executor processes and wait for confirmation • The bigger the cluster, the slower!
  • 8. This is not a contribution Using a Persistent Context for Low Latency Avoid high overhead of Spark application launch Standard pattern: Spark Job Server Hive Thrift Server Accept queries and run them in context Usually means fixed resources - great for SLA predictability
  • 9. This is not a contribution FAIR Scheduling FIFO vs FAIR Scheduling FAIR scheduler can co-schedule concurrent Spark jobs even if they take up lots of resources Scheduler pools with individual policies Higher concurrency FIFO allows concurrency if tasks do not use up all threads In Mesos, use coarse-grained mode to avoid launching executors on every Spark task
  • 10. This is not a contribution Low-Latency Game Plan Start a persistent Spark Context (ex. the Hive ThriftServer - we’ll get to that below) Run it in FAIR scheduler mode Use fast in-memory storage Maximize concurrency by using as few partitions/ threads as possible Host the data and run it on a single node - avoid expensive network shuffles
  • 11. This is not a contribution In-Memory Storage Is it really faster than on disk files? With OS Caching It's about consistency of performance - not just hot data in the page cache, but ALL data. Fast random access Making different tradeoffs as new memory technologies emerge (NVRAM etc.) Higher IO -> less need for compression Apache Arrow
  • 12. This is not a contribution • So, let’s talk about Spark storage in detail…
  • 13. This is not a contribution HDFS? Parquet Files? Column pruning speeds up I/O significantly Still have to scan lots of files File organization not the easiest for filtering For low-latency, need much more fine-grained indexing
  • 14. This is not a contribution Cached RDDs • Let's say you have an RDD[T], where each item is of type T. Bytes are saved on JVM heap, or optionally heap + disk Spark optionally serializes it, using by default Java serialization, so it (hopefully) takes up less space Pros: easy (myRdd.cache()) Cons: have to iterate over every item, no column pruning, slow if need to deserialize, memory hungry, cannot update
  • 15. This is not a contribution Cached DataFrames • Works on a DataFrame (RDD[Row] with a schema) • sqlContext.cacheTable(tableA) Uses columnar storage for very efficient storage Columnar pruning for faster querying Pros: easy, efficient memory footprint, fast! Cons: no filtering, cannot update
  • 16. This is not a contribution Why are Updates Important? Appends Streaming workloads. Add new data continuously. Real data is *always* changing. Queries on live real-time data has business benefits. Updates Idempotency = really simple ingestion pipelines Simpler streaming later update late events (See Spark 2.0 Structured Streaming)
  • 17. This is not a contribution Advantages of Filtering Two methods to lower query latency: Scan data faster (in-memory) Scan less data (filtering) RDDs and cached DFs - prune by partition Dynamo/BigTable - 2D Filtering Filter by partition Filter within partitions
  • 18. This is not a contribution Workarounds - Updating RDDs Union(oldRDD, newRDD) Creates a tree of RDDs - slows down queries significantly IndexedRDD
  • 19. This is not a contribution • Introducing FiloDB. • A distributed, versioned, columnar analytics database. • Built for streaming.
  • 20. This is not a contribution Fast Analytics Storage Scan speeds competitive with Apache Parquet In-memory version significantly faster Flexible filtering along two dimensions Much more efficient and flexible partition key filtering Efficient columnar storage using dictionary encoding and other techniques Updatable
  • 21. This is not a contribution Comparing Storage Costs and Query Speeds • https://www.oreilly.com/ideas/apache-cassandra-for- analytics-a-performance-and-storage-analysis
  • 22. This is not a contribution Robust Distributed Storage • In-memory storage engine, or • Apache Cassandra as the rock-solid storage engine.
  • 23. This is not a contribution Cassandra-like Data Model partition keys - distributes data around a cluster, and allows for fine grained and flexible filtering segment keys - do range scans within a partition, e.g. by time slice primary key based ingestion and updates Column A Column B Partition Key 1 Segment 1 Segment 2 Segment 1 Segment 2 Partition Key 2 Segment 1 Segment 2 Segment 1 Segment 2
  • 24. This is not a contribution Very Flexible Filtering • Unlike Cassandra, FiloDB offers very flexible and efficient filtering on partition keys. Partial key matches, fast IN queries on any part of the partition key. • No need to write multiple tables to work around answering different queries.
  • 25. This is not a contribution Spark SQL Queries! • - Read to and write from Spark Dataframes • - Append/merge to FiloDB table from Spark Streaming • - Use Tableau or any other JDBC tool CREATE TABLE gdelt USING filodb.spark OPTIONS (dataset "gdelt"); SELECT Actor1Name, Actor2Name, AvgTone FROM gdelt ORDER BY AvgTone DESC LIMIT 15; INSERT INTO gdelt SELECT * FROM NewMonthData;
  • 26. This is not a contribution What’s in the Name? • Rich, sweet layers of distributed, versioned database goodness
  • 27. This is not a contribution Message Queue Events Spark Streaming Short term storage, K-V Adhoc, SQL, ML Cassandra FiloDB: Events, ad-hoc, batch Spark Dashboa rds, maps
  • 28. This is not a contribution SMACK stack for all your analytics Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards) FiloDB + C* + Spark for efficient long term event storage Ad hoc / SQL / BI Data source for MLLib / building models Data storage for classified / predicted / scored data
  • 29. This is not a contribution Message Queue Events Spark Streaming Models Cassandra FiloDB: Long term event storage Spark Learned Data
  • 30. This is not a contribution • Fast SQL Server in Spark
  • 31. This is not a contribution Data:The New York City Taxi Dataset • The public NYC Taxi Dataset contains telemetry (pickup, dropoff locations, times) info on millions of taxi rides in NYC. Partition key - :stringPrefix medallion 2 - hash multiple drivers trips into ~300 partitions Segment key - :timeslice pickup_datetime 6d Row key - hack_license, pickup_datetime • Allows for easy filtering by individual drivers, and slicing by time. Medallion Prefix 1/1 - 1/6 1/7 - 1/12 AA records records AB records records
  • 32. This is not a contribution collectAsync • To support running concurrent queries better, we rely on a relatively unknown feature of Spark's RDD API, collectAync: • sqlContext.sql(queryString).rdd.collectAsync • This returns a Scala Future, which can easily be composed using Future.sequence to launch a whole series of asynchronous RDD operations. They will be executed with the help of a separate ForkJoin thread pool.
  • 33. This is not a contribution Initial Results Run lots of queries concurrently using collectAsync Spark local[*] mode SQL queries on first million rows of NYC Taxi dataset 50 Queries per Second Most of time not running queries but parsing SQL !
  • 34. This is not a contribution Some Observations 1. Starting up a Spark task is actually pretty low latency - milliseconds 2. One huge benefit to filtering is reduced thread/CPU usage. Most of the queries ended up being single partition / single thread.
  • 35. This is not a contribution Lessons 1. Cache the SQL to DataFrame/LogicalPlan parsing. This saves ~20ms per parse, which is not insignificant for low-latency apps 2. Distribute the SQL parsing away from the main thread so it's not gated by one thread
  • 36. This is not a contribution SQL Plan Caching • Cache the `DataFrame` containing the logical plan translated from parsing SQL. • Now - **700 QPS**!! val cachedDF = new collection.mutable.HashMap[String, DataFrame] def getCachedDF(query: String): DataFrame = cachedDF.getOrElseUpdate(query, sql.sql(query))
  • 37. This is not a contribution Scaling with More Data • 15 million rows of NYC Taxi data - **still 700 QPS**! • This makes sense due to the efficiency of querying.
  • 38. This is not a contribution Fast Spark Query Stack Run Spark context on heap with `local[*]` Load FiloDB-Spark connector, load data in memory Very fast queries all in process Front end app FiloDB-Spark SparkContext InMemoryColumnStore
  • 39. This is not a contribution Fast Spark Query Stack II HTTP/REST using Spark Job Server JS app Spark Job Server FiloDB-Spark SparkContext InMemoryColumnStoreHTTP / REST
  • 40. This is not a contribution Slower: Hive Thrift Server Stack BI Client Hive Thrift Server Spark JDBC FiloDB-Spark SQLContext Hive MetaStore
  • 41. This is not a contribution Your Contributions Welcome! • http://github.com/tuplejump/FiloDB