Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™
Burak Yavuz
Software Engineer, Databricks

Burak Yavuz
● Software Engineer - Databricks
  - "We make your streams come true"
● Apache Spark Committer as of Feb 2017
● MS in Management Science & Engineering - Stanford University
● BS in Mechanical Engineering - Bogazici University, Istanbul

About
TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple

Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos

The simplest way to perform streaming analytics
is not having to reason about streaming at all

New Model
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input
  usual map/filter/reduce
  new window, session ops
[Figure: timeline with a trigger every 1 sec; at times 1, 2, 3 the Input table holds data up to 1, 2, 3 and the Query runs on it]
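
The input/trigger/query model above can be mimicked in a few lines of plain Scala, with no Spark involved. This is only a sketch: `AppendOnlyModel`, `query`, and `run` are hypothetical names, and re-running the query over the whole table on every trigger is the conceptual model, not a description of how Spark executes it.

```scala
// Plain-Scala model of the "append-only input table" idea (no Spark required).
object AppendOnlyModel {
  // The query is an ordinary function over the whole input table.
  // Hypothetical example query: sum of the even values.
  def query(input: Vector[Int]): Int = input.filter(_ % 2 == 0).sum

  // Each trigger appends the newly arrived rows to the input table and
  // re-evaluates the query. Returns one result per trigger.
  def run(arrivals: Seq[Seq[Int]]): Seq[Int] =
    arrivals.scanLeft(Vector.empty[Int])(_ ++ _).tail.map(query)
}
```

Calling `AppendOnlyModel.run(Seq(Seq(1, 2), Seq(3, 4), Seq(6)))` gives one result per trigger, each computed over all data seen so far.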

New Model
Result: final operated table updated every trigger interval
Output: what part of result to write to data sink after every trigger
Complete output: Write full result table every time
[Figure: trigger every 1 sec; at times 1, 2, 3 the Query turns Input (data up to 1, 2, 3) into Result (result for data up to 1, 2, 3). Output in complete mode: output all the rows in the result table]

New Model
Result: final operated table updated every trigger interval
Output: what part of result to write to data sink after every trigger
Complete output: Write full result table every time
Append output: Write only new rows that got added to result table since previous batch
*Not all output modes are feasible with all queries
[Figure: same timeline as before. Output in append mode: output only new rows since last trigger]
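
A small plain-Scala sketch of the difference between complete and append output. Everything here is hypothetical (the `CompleteVsAppend` object, the error-filter query, the sample rows), and it only models what gets written after each trigger. As the footnote above says, a real query does not support every mode.

```scala
object CompleteVsAppend {
  // Hypothetical query: keep only error lines. Its result table only ever grows.
  def resultTable(input: Vector[String]): Vector[String] =
    input.filter(_.startsWith("ERROR"))

  // Contents of the input table after each trigger.
  def inputs(arrivals: Seq[Seq[String]]): Seq[Vector[String]] =
    arrivals.scanLeft(Vector.empty[String])(_ ++ _).tail

  // Complete mode: the full result table is written on every trigger.
  def completeOutput(arrivals: Seq[Seq[String]]): Seq[Vector[String]] =
    inputs(arrivals).map(resultTable)

  // Append mode: only rows added since the previous trigger are written.
  def appendOutput(arrivals: Seq[Seq[String]]): Seq[Vector[String]] = {
    val tables = completeOutput(arrivals)
    val previous = Vector.empty[String] +: tables
    tables.zip(previous).map { case (now, before) => now.drop(before.length) }
  }
}
```

With three triggers where only the first two bring a new error row, complete output repeats earlier rows every time, while append output writes each row once and writes nothing on the third trigger.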

Output Modes
▪ Append mode (default) - New rows added to the Result Table since the last trigger will be outputted to the sink. Rows will be output only once, and cannot be rescinded.
Example use cases: ETL

Output Modes
▪ Complete mode - The whole Result Table will be outputted to the sink after every trigger. This is supported for aggregation queries.
Example use cases: Monitoring

Output Modes
▪ Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger will be outputted to the sink.
Example use cases: Alerting, Sessionization
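
Update mode is easiest to see on a running aggregation. The sketch below is plain Scala with hypothetical names (`UpdateMode`, `step`, `updateOutput`): it keeps a running count per key as the result table and, after each trigger, emits only the rows whose value changed.

```scala
object UpdateMode {
  type Counts = Map[String, Long]

  // Fold one trigger's keys into the running counts (the result table).
  def step(counts: Counts, batch: Seq[String]): Counts =
    batch.foldLeft(counts)((acc, k) => acc.updated(k, acc.getOrElse(k, 0L) + 1L))

  // Update mode: per trigger, emit only the rows that are new or changed.
  def updateOutput(batches: Seq[Seq[String]]): Seq[Counts] = {
    val tables = batches.scanLeft(Map.empty[String, Long])(step)
    tables.zip(tables.tail).map { case (before, now) =>
      now.filter { case (k, v) => !before.get(k).contains(v) }
    }
  }
}
```

In this model, complete mode would write every entry of the running table on each trigger, while update mode skips keys that received no data.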

Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos

Event time Aggregations
Many use cases require aggregate statistics by event time
  E.g. what's the #errors in each system in 1 hour windows?
Many challenges
  Extracting event time from data, handling late, out-of-order data
DStream APIs were insufficient for event time operations

Event time Aggregations
Windowing is just another type of grouping in Structured Streaming

    import org.apache.spark.sql.functions.{col, window}

    // number of records every hour
    parsedData
      .groupBy(window(col("timestamp"), "1 hour"))
      .count()

    // avg signal strength of each device every 10 mins
    parsedData
      .groupBy(col("device"), window(col("timestamp"), "10 mins"))
      .avg("signal")

Use built-in functions to extract event-time
No need for separate extractors
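
To make "windowing is just another type of grouping" concrete, here is a plain-Scala sketch of tumbling-window assignment followed by an ordinary group-by. `Reading`, `windowStart`, and `avgSignal` are hypothetical names, timestamps are assumed to be non-negative epoch seconds, and this mirrors the idea rather than Spark's `window` function itself.

```scala
object TumblingWindows {
  final case class Reading(device: String, timestampSec: Long, signal: Double)

  // Start of the fixed-size (tumbling) window that contains the timestamp.
  def windowStart(timestampSec: Long, sizeSec: Long): Long =
    timestampSec - (timestampSec % sizeSec)

  // Average signal per (device, window start): the window is just part of the grouping key.
  def avgSignal(readings: Seq[Reading], sizeSec: Long): Map[(String, Long), Double] =
    readings
      .groupBy(r => (r.device, windowStart(r.timestampSec, sizeSec)))
      .map { case (key, rs) => key -> (rs.map(_.signal).sum / rs.size) }
}
```

With `sizeSec = 600` this corresponds to the 10-minute grouping in the second query above.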

Advanced Aggregations
Powerful built-in aggregations
    variance, stddev, kurtosis, stddev_samp, collect_list,
    collect_set, corr, approx_count_distinct, ...
Multiple simultaneous aggregations
    parsedData
      .groupBy(window(col("timestamp"), "1 hour"))
      .agg(avg("signal"), stddev("signal"), max("signal"))
Custom aggs using reduceGroups, UDAFs
    // Compute histogram of signal by device type.
    // Assumes DeviceData has a deviceType field and an Int signal in 0..99.
    val hist = ds.groupByKey(_.deviceType).mapGroups {
      case (deviceType, data: Iterator[DeviceData]) =>
        val buckets = new Array[Int](10)
        data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
        (deviceType, buckets)
    }
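
For readers who want to check what several simultaneous aggregations return for one group, the sketch below computes mean, standard deviation, and max in plain Scala. `MultiAgg` and `Stats` are hypothetical names. The standard deviation here is the sample version (dividing by n - 1), which is what Spark's `stddev` computes as an alias of `stddev_samp`. The input is assumed to contain at least two values.

```scala
object MultiAgg {
  final case class Stats(avg: Double, stddev: Double, max: Double)

  // Mean, sample standard deviation, and max of one group of values.
  // Assumes xs has at least two elements.
  def stats(xs: Seq[Double]): Stats = {
    val n = xs.size
    val mean = xs.sum / n
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    Stats(mean, math.sqrt(variance), xs.max)
  }
}
```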

Stateful Processing for Aggregations
In-memory, streaming state maintained for aggregations
Keeping state allows late data to update counts of old windows
But size of the state increases indefinitely if old windows not dropped
[Figure: per-window counts (12:00 - 13:00 through 16:00 - 17:00) held in state at each trigger from 13:00 to 17:00; red = state updated with late data]
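
The two points above (late data can update old windows, and state never shrinks) show up directly in a minimal plain-Scala model of the aggregation state. `WindowState.update` is a hypothetical helper, and times are non-negative seconds.

```scala
object WindowState {
  // State: event count per window start (in seconds). Nothing is ever evicted here.
  def update(state: Map[Long, Long], eventTimesSec: Seq[Long], sizeSec: Long): Map[Long, Long] =
    eventTimesSec.foldLeft(state) { (acc, t) =>
      val start = t - (t % sizeSec)
      acc.updated(start, acc.getOrElse(start, 0L) + 1L)
    }
}
```

Feeding a second batch that contains an event from the first hour bumps the old window's count, and every window ever seen stays in the map.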

Watermarking and Late Data
Watermark [Spark 2.1] - a moving threshold that trails behind the max seen event time
Trailing gap defines how late data is expected to be
[Figure: event-time axis with max event time at 12:30 PM and watermark at 12:20 PM, a trailing gap of 10 mins; data older than watermark not expected]
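
The watermark arithmetic in the figure is small enough to write down. A plain-Scala sketch using `java.time` follows; `WatermarkMath` and its methods are hypothetical names, and the event times are assumed to fall within one day.

```scala
import java.time.{Duration, LocalTime}

object WatermarkMath {
  // Largest event time seen so far.
  def maxEventTime(eventTimes: Seq[LocalTime]): LocalTime =
    eventTimes.maxBy(_.toSecondOfDay)

  // Watermark = max event time seen so far minus the trailing gap.
  def watermark(maxSeen: LocalTime, gap: Duration): LocalTime =
    maxSeen.minus(gap)

  // Data older than the watermark is not expected any more.
  def isTooLate(eventTime: LocalTime, wm: LocalTime): Boolean =
    eventTime.isBefore(wm)
}
```

With a max event time of 12:30 and a gap of 10 minutes, the watermark lands at 12:20, matching the figure.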

Watermarking and Late Data
Data newer than watermark may be late, but allowed to aggregate
Data older than watermark is "too late" and dropped
State older than watermark automatically deleted to limit the amount of intermediate state
[Figure: event-time axis with max event time and watermark; late data newer than the watermark is allowed to aggregate, data older than the watermark is too late and dropped]

Watermarking and Late Data
Control the tradeoff between state size and lateness requirements
  Handle more late data → keep more state
  Reduce state → handle less lateness
    parsedData
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "5 minutes"))
      .count()
[Figure: event-time axis with max event time and watermark, allowed lateness of 10 mins; late data newer than the watermark is allowed to aggregate, data older than the watermark is too late and dropped]

Watermarking to Limit State [Spark 2.1]
    parsedData
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "5 minutes"))
      .count()
[Figure: event time plotted against processing time, with triggers at 12:00, 12:05, 12:10, 12:15. The system tracks the max observed event time (12:14). The watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state < 12:04 is deleted. An event at 12:08 is late, but considered in counts. Data older than the watermark is too late: ignored in counts, state dropped.]
More details in blog post!
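
The trigger-by-trigger behaviour in the figure can be replayed with a small plain-Scala simulation. This is a simplified model under stated assumptions, not Spark's implementation: times are minutes since midnight, an event is dropped when its timestamp is older than the current watermark, and a window is evicted once it ends at or before the next watermark. `WatermarkedCounts`, `State`, and `trigger` are hypothetical names, and the event times are illustrative.

```scala
object WatermarkedCounts {
  // All times are minutes since midnight; windows are tumbling.
  final case class State(counts: Map[Int, Long], maxEventTime: Int, watermark: Int)

  val initial: State = State(Map.empty, 0, 0)

  def trigger(s: State, batch: Seq[Int], windowMin: Int, gapMin: Int): State = {
    // Events older than the current watermark are too late and are ignored.
    val accepted = batch.filter(_ >= s.watermark)
    val counted = accepted.foldLeft(s.counts) { (acc, t) =>
      val start = t - (t % windowMin)
      acc.updated(start, acc.getOrElse(start, 0L) + 1L)
    }
    // The watermark for the NEXT trigger trails the max event time seen so far.
    val maxSeen = (s.maxEventTime +: batch).max
    val nextWatermark = math.max(s.watermark, maxSeen - gapMin)
    // Windows ending at or before the new watermark can no longer change: drop them.
    val kept = counted.filter { case (start, _) => start + windowMin > nextWatermark }
    State(kept, maxSeen, nextWatermark)
  }
}
```

For example, after a first trigger whose latest event is 12:14 (minute 734) the watermark becomes 12:04 (minute 724). In the next trigger an event at 12:08 is still counted, while an event at 12:03 is ignored.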

Working With Time
    df.withWatermark("timestampColumn", "5 hours")
      .groupBy(window(col("timestampColumn"), "1 minute"))
      .count()
      .writeStream
      .trigger(Trigger.ProcessingTime("10 seconds"))
Separate processing details (output rate, late data tolerance) from query semantics.

Working With Time
.groupBy(window(...)): How to group data by time. Same in streaming & batch.

Working With Time
.withWatermark(...): How late data can be.

Working With Time
.trigger(...): How often to emit updates.

Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState allows any user-defined stateful operations on a user-defined state
Direct support for per-key timeouts in event-time or processing-time
Supports Scala and Java

    ds.groupByKey(groupingFunc)
      .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

    def mappingWithStateFunc(
        key: K,
        values: Iterator[V],
        state: GroupState[S]): U = {
      // update or remove state
      // set timeouts
      // return mapped value
    }
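
To see the calling pattern without a cluster, the contract can be modelled in plain Scala. Everything below is a hypothetical stand-in: `SimpleState` imitates a small part of Spark's `GroupState` (no timeouts), and `runTrigger` imitates one trigger by calling the function once per key that has data and carrying the state forward.

```scala
object MapGroupsModel {
  // Hypothetical stand-in for GroupState[S]: a mutable holder of optional per-key state.
  final class SimpleState[S](private var value: Option[S]) {
    def exists: Boolean = value.isDefined
    def getOption: Option[S] = value
    def update(s: S): Unit = { value = Some(s) }
    def remove(): Unit = { value = None }
  }

  // One trigger: call func once per key with data, then persist the state it left behind.
  def runTrigger[K, V, S, U](batch: Seq[(K, V)], store: Map[K, S])(
      func: (K, Iterator[V], SimpleState[S]) => U): (Seq[U], Map[K, S]) = {
    val groups = batch.groupBy(_._1).toSeq
    groups.foldLeft((Vector.empty[U]: Seq[U], store)) {
      case ((out, st), (key, rows)) =>
        val state = new SimpleState[S](st.get(key))
        val result = func(key, rows.iterator.map(_._2), state)
        val nextStore = state.getOption.fold(st - key)(s => st.updated(key, s))
        (out :+ result, nextStore)
    }
  }

  // Example function: running count of events per key.
  def runningCount(key: String, values: Iterator[Int], state: SimpleState[Long]): (String, Long) = {
    val total = state.getOption.getOrElse(0L) + values.size
    state.update(total)
    (key, total)
  }
}
```

Running two triggers in a row, passing the returned store into the second call, shows the count for a key continuing from where the previous trigger left it. Keys without new data are not invoked and keep their state.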

flatMapGroupsWithState
▪ Applies the given function to each group of data, while maintaining a user-defined per-group state
▪ Invoked once per group in batch
▪ Invoked each trigger (with the existence of data) per group in streaming
▪ Requires user to provide an output mode for the function

flatMapGroupsWithState
▪ mapGroupsWithState is a special case with
  o Output mode: Update
  o Output size: 1 row per group
▪ Supports both Processing Time and Event Time timeouts
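
The "special case" relationship can be expressed as a one-line lifting. In the plain-Scala sketch below, state is modelled functionally as an `Option[S]` passed in and returned, which is a simplification of Spark's mutable `GroupState`. `FlatMapVsMap`, `MapFunc`, `FlatMapFunc`, `lift`, and `countEvents` are hypothetical names.

```scala
object FlatMapVsMap {
  // Model of a flatMapGroupsWithState-style function: zero or more output rows per group.
  type FlatMapFunc[K, V, S, U] = (K, Iterator[V], Option[S]) => (Iterator[U], Option[S])
  // Model of a mapGroupsWithState-style function: exactly one output row per group.
  type MapFunc[K, V, S, U] = (K, Iterator[V], Option[S]) => (U, Option[S])

  // Any map-style function can be lifted to a flatMap-style one that emits a single row.
  def lift[K, V, S, U](f: MapFunc[K, V, S, U]): FlatMapFunc[K, V, S, U] =
    (k, vs, s) => {
      val (row, next) = f(k, vs, s)
      (Iterator.single(row), next)
    }

  // Example map-style function: running count per key.
  val countEvents: MapFunc[String, Int, Long, Long] = (_, values, state) => {
    val total = state.getOrElse(0L) + values.size
    (total, Some(total))
  }
}
```

The reverse direction does not hold in general, since a flatMap-style function may emit no rows or several rows for a group.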

Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos

Alerting
    val monitoring = stream
      .as[Event]
      .groupByKey(_.id)
      .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout) {
        (id: Int, events: Iterator[Event], state: GroupState[…]) =>
          ...
      }
      .writeStream
      .queryName("alerts")
      .foreach(new PagerdutySink(credentials))
Monitor a stream using custom stateful logic with timeouts.

Alerting
▪ Save your state to Scylla to power dashboards
▪ Have the stream trigger alerts ASAP
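
One possible shape for the stateful logic behind such alerts is a heartbeat check: remember when each id was last seen and raise an alert once it has been silent for longer than a timeout. The plain-Scala sketch below is an assumption about what the elided function body might do, not the speaker's code. `HeartbeatAlerts` and `trigger` are hypothetical names, and times are processing-time milliseconds.

```scala
object HeartbeatAlerts {
  // State: last processing time (ms) at which each id was seen.
  type LastSeen = Map[Int, Long]

  // One trigger at processing time nowMs. Returns (alerts, new state).
  def trigger(state: LastSeen, seenIds: Seq[Int], nowMs: Long, timeoutMs: Long): (Seq[String], LastSeen) = {
    // Refresh the timestamp of every id that reported in this trigger.
    val refreshed = seenIds.foldLeft(state)((acc, id) => acc.updated(id, nowMs))
    // Ids silent for at least timeoutMs time out: alert once and drop their state.
    val (expired, alive) = refreshed.partition { case (_, last) => nowMs - last >= timeoutMs }
    val alerts = expired.keys.toSeq.sorted.map(id => s"ALERT: id $id silent for $timeoutMs ms")
    (alerts, alive)
  }
}
```

Because each alert row is emitted once and never revised, this kind of output fits the Append mode used in the query above.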

Sessionization
    val monitoring = stream
      .as[Event]
      .groupByKey(_.session_id)
      .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) {
        (id: Int, events: Iterator[Event], state: GroupState[…]) =>
          ...
      }
      .writeStream
      .scylla("trips")
Analyze sessions of user/system behavior

Sessionization
▪ Update sessions in your stream
▪ Save it to a NoSQL store like Scylla!
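
As a reference point for what "a session" means, here is a batch-style plain-Scala sketch that splits one user's event times into sessions separated by a gap. `Sessions`, `Session`, and `sessionize` are hypothetical names, and the gap rule is an assumption. In a streaming job the open session would live in per-key state and be closed by a timeout instead of being computed in one pass.

```scala
object Sessions {
  final case class Session(start: Long, end: Long, events: Int)

  // Split event times (seconds) into sessions: a gap larger than gapSec starts a new session.
  def sessionize(eventTimesSec: Seq[Long], gapSec: Long): List[Session] =
    eventTimesSec.sorted.foldLeft(List.empty[Session]) {
      // Close enough to the current session: extend it.
      case (current :: done, t) if t - current.end <= gapSec =>
        current.copy(end = t, events = current.events + 1) :: done
      // First event, or gap exceeded: open a new session.
      case (sessions, t) =>
        Session(t, t, 1) :: sessions
    }.reverse
}
```

Each resulting `Session` is a natural row to upsert into a key-value store keyed by user and session start.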
Demo
Try Spark 2.2 on Community Edition today!
https://databricks.com/try-databricks
Apache Spark’s Structured Streaming at Scale Series
https://databricks.com/blog/category/engineering
Twitter: @databricks
We are hiring!
https://databricks.com/company/careers

THANK YOU
burak@databricks.com
“Does anyone have any questions for my answers?”
- Henry Kissinger

ICT role in 21st century education and its challenges
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

  • 1. Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™
       Burak Yavuz, Software Engineer, Databricks
  • 2. Burak Yavuz
       ● Software Engineer, Databricks: “We make your streams come true”
       ● Apache Spark Committer as of Feb 2017
       ● MS in Management Science & Engineering, Stanford University
       ● BS in Mechanical Engineering, Bogazici University, Istanbul
  • 3. About
       TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
       PRODUCT: Unified Analytics Platform
       MISSION: Making Big Data Simple
  • 4. Outline
       o Structured Streaming Concepts
       o Stateful Processing in Structured Streaming
       o Use Cases and How NoSQL Stores Fit In
       o Demos
  • 5. The simplest way to perform streaming analytics is not having to reason about streaming at all
  • 7. New Model
       Input: data from source as an append-only table
       Trigger: how frequently to check input for new data
       Query: operations on input (usual map/filter/reduce; new window, session ops)
       [Figure: timeline with a trigger every 1 sec; at times 1, 2, 3 the query runs over input data up to 1, up to 2, up to 3]
  • 8. New Model
       Result: final operated table, updated every trigger interval
       Output: what part of the result to write to the data sink after every trigger
       Complete output: write the full result table every time
       [Figure: same timeline; each trigger yields the result for data up to 1, 2, 3; complete mode outputs all the rows in the result table]
  • 9. New Model
       Append output: write only the new rows that got added to the result table since the previous batch
       * Not all output modes are feasible with all queries
       [Figure: same timeline; append mode outputs only new rows since the last trigger]
  • 11. Output Modes
        ▪ Append mode (default): new rows added to the Result Table since the last trigger are output to the sink. Rows are output only once and cannot be rescinded.
        Example use cases: ETL
  • 12. Output Modes
        ▪ Complete mode: the whole Result Table is output to the sink after every trigger. This is supported for aggregation queries.
        Example use cases: Monitoring
  • 13. Output Modes
        ▪ Update mode (available since Spark 2.1.1): only the rows in the Result Table that were updated since the last trigger are output to the sink.
        Example use cases: Alerting, Sessionization
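The three output modes can be illustrated with a toy model in plain Scala. This is only a sketch of the definitions on the slides, not how Spark implements them: the result table is assumed to be a `Map` from window label to count, and the names `OutputModesSketch`, `complete`, `append`, and `update` are made up for illustration. In real Spark, append mode over an aggregation additionally needs a watermark so that a row is emitted only once it is final.

```scala
object OutputModesSketch {
  // Assumption: the result table is modelled as window label -> count.
  type ResultTable = Map[String, Long]

  // Complete mode: emit every row of the current result table.
  def complete(prev: ResultTable, curr: ResultTable): ResultTable = curr

  // Append mode (toy rule): emit only rows whose key did not exist at the previous trigger.
  def append(prev: ResultTable, curr: ResultTable): ResultTable =
    curr.filter { case (key, _) => !prev.contains(key) }

  // Update mode: emit rows that are new or whose value changed since the previous trigger.
  def update(prev: ResultTable, curr: ResultTable): ResultTable =
    curr.filter { case (key, value) => !prev.get(key).contains(value) }

  def main(args: Array[String]): Unit = {
    val afterTrigger1: ResultTable = Map("12:00-13:00" -> 1L)
    val afterTrigger2: ResultTable = Map("12:00-13:00" -> 3L, "13:00-14:00" -> 1L)

    println(s"complete: ${complete(afterTrigger1, afterTrigger2)}")
    println(s"append:   ${append(afterTrigger1, afterTrigger2)}")
    println(s"update:   ${update(afterTrigger1, afterTrigger2)}")
  }
}
```

In this example, complete and update both emit the two rows (one changed, one new), while append emits only the new 13:00-14:00 row.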
  • 14. Outline
        o Structured Streaming Concepts
        o Stateful Processing in Structured Streaming
        o Use Cases and How NoSQL Stores Fit In
        o Demos
  • 15. Event time Aggregations
        Many use cases require aggregate statistics by event time
          E.g. what's the #errors in each system in 1 hour windows?
        Many challenges
          Extracting event time from data, handling late, out-of-order data
        DStream APIs were insufficient for event time operations
  • 16. Event time Aggregations
        Windowing is just another type of grouping in Structured Streaming
        Number of records every hour:
          parsedData
            .groupBy(window($"timestamp", "1 hour"))
            .count()
        Average signal strength of each device every 10 mins:
          parsedData
            .groupBy($"device", window($"timestamp", "10 mins"))
            .avg("signal")
        Use built-in functions to extract event time; no need for separate extractors
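To see why windowing is "just another type of grouping", here is a minimal sketch in plain Scala, without Spark. It assumes epoch-aligned tumbling windows and non-negative millisecond timestamps; the `Event` type and the function names are hypothetical and only serve the illustration.

```scala
object TumblingWindowSketch {
  // Hypothetical event type for this sketch.
  final case class Event(device: String, timestampMs: Long)

  // Start of the epoch-aligned tumbling window containing the timestamp.
  // Assumes timestampMs >= 0 and widthMs > 0.
  def windowStart(timestampMs: Long, widthMs: Long): Long =
    timestampMs - (timestampMs % widthMs)

  // Grouping by the computed window start, then counting: the collection
  // analogue of groupBy(window(...)).count().
  def countPerWindow(events: Seq[Event], widthMs: Long): Map[Long, Int] =
    events
      .groupBy(e => windowStart(e.timestampMs, widthMs))
      .map { case (start, inWindow) => start -> inWindow.size }

  def main(args: Array[String]): Unit = {
    val hourMs = 60L * 60 * 1000
    val events = Seq(Event("a", 10L), Event("b", hourMs - 1), Event("a", hourMs + 5))
    println(countPerWindow(events, hourMs))
  }
}
```

The window key is derived from the event's own timestamp, so the grouping does not depend on when the event arrives.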
  • 17. Advanced Aggregations
        Powerful built-in aggregations: variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ...
        Multiple simultaneous aggregations:
          parsedData
            .groupBy(window($"timestamp", "1 hour"))
            .agg(avg("signal"), stddev("signal"), max("signal"))
        Custom aggs using reduceGroups, UDAFs:
          // Compute histogram of signal by device type.
          val hist = ds.groupByKey(_.`type`).mapGroups {
            case (deviceType, data: Iterator[DeviceData]) =>
              val buckets = new Array[Int](10)
              data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
              (deviceType, buckets)
          }
  • 18. Stateful Processing for Aggregations
        In-memory, streaming state maintained for aggregations
        Keeping state allows late data to update counts of old windows
        But size of the state increases indefinitely if old windows are not dropped
        [Figure: window counts held in state after each trigger; red = state updated with late data]
          13:00  12:00-13:00: 1
          14:00  12:00-13:00: 3, 13:00-14:00: 1
          15:00  12:00-13:00: 3, 13:00-14:00: 2, 14:00-15:00: 5
          16:00  12:00-13:00: 5, 13:00-14:00: 2, 14:00-15:00: 5, 15:00-16:00: 4
          17:00  12:00-13:00: 3, 13:00-14:00: 2, 14:00-15:00: 6, 15:00-16:00: 4, 16:00-17:00: 3
  • 20. Watermarking and Late Data
        Watermark [Spark 2.1]: a moving threshold that trails behind the max seen event time
        Trailing gap defines how late data is expected to be
        [Figure: event-time axis; max event time 12:30 PM, watermark 12:20 PM, trailing gap of 10 mins; data older than the watermark is not expected]
  • 21. Watermarking and Late Data
        Data newer than the watermark may be late, but is allowed to aggregate
        Data older than the watermark is "too late" and dropped
        State older than the watermark is automatically deleted to limit the amount of intermediate state
        [Figure: event-time axis with max event time and watermark; late data allowed to aggregate; data too late, dropped]
  • 22. Watermarking and Late Data
        Control the tradeoff between state size and lateness requirements
          Handle more lateness → keep more state
          Reduce state → handle less lateness
          parsedData
            .withWatermark("timestamp", "10 minutes")
            .groupBy(window($"timestamp", "5 minutes"))
            .count()
        [Figure: allowed lateness of 10 mins behind the max event time; late data allowed to aggregate; data too late, dropped]
  • 23. Watermarking to Limit State [Spark 2.1]
          parsedData
            .withWatermark("timestamp", "10 minutes")
            .groupBy(window($"timestamp", "5 minutes"))
            .count()
        [Figure: event time plotted against processing time (12:00 to 12:15). The system tracks the max observed event time. When it reaches 12:14, the watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state < 12:04 is deleted. Data that is late but newer than the watermark (e.g. 12:08) is considered in counts; data that is too late is ignored in counts and its state dropped.]
        More details in blog post!
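The watermark arithmetic from the slide (12:14 - 10m = 12:04) can be checked with a few lines of plain Scala. This is a simplified sketch: it compares a single event time against the watermark, whereas Spark applies the watermark to window state once per trigger. The names `WatermarkSketch`, `watermark`, and `isTooLate` are illustrative, and the 12:03 event is a made-up example.

```scala
import java.time.{Duration, LocalTime}

object WatermarkSketch {
  // The watermark trails the max event time seen so far by a fixed delay.
  def watermark(maxEventTime: LocalTime, delay: Duration): LocalTime =
    maxEventTime.minus(delay)

  // Toy rule: an event is "too late" if its event time is before the watermark.
  def isTooLate(eventTime: LocalTime, wm: LocalTime): Boolean =
    eventTime.isBefore(wm)

  def main(args: Array[String]): Unit = {
    val wm = watermark(LocalTime.of(12, 14), Duration.ofMinutes(10))
    println(wm)                                  // 12:04
    println(isTooLate(LocalTime.of(12, 8), wm))  // false: late, but still counted
    println(isTooLate(LocalTime.of(12, 3), wm))  // true: hypothetical event, dropped
  }
}
```

A larger delay moves the watermark further back, so more late events are accepted at the cost of keeping more window state.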
  • 25. Working With Time
          df.withWatermark("timestampColumn", "5 hours")
            .groupBy(window($"timestampColumn", "1 minute"))
            .count()
            .writeStream
            .trigger(Trigger.ProcessingTime("10 seconds"))
        Separate processing details (output rate, late data tolerance) from query semantics.
  • 26. Working With Time
        .groupBy(window(...)): how to group data by time; same in streaming & batch
  • 27. Working With Time
        .withWatermark(...): how late data can be
  • 28. Working With Time
        .trigger(...): how often to emit updates
  • 29. Arbitrary Stateful Operations [Spark 2.2]
        mapGroupsWithState allows any user-defined stateful ops to a user-defined state
        Direct support for per-key timeouts in event-time or processing-time
        Supports Scala and Java
          ds.groupByKey(groupingFunc)
            .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

          def mappingWithStateFunc(
              key: K,
              values: Iterator[V],
              state: GroupState[S]): U = {
            // update or remove state
            // set timeouts
            // return mapped value
          }
  • 30. flatMapGroupsWithState
        ▪ Applies the given function to each group of data, while maintaining a user-defined per-group state
        ▪ Invoked once per group in batch
        ▪ Invoked each trigger (with the existence of data) per group in streaming
        ▪ Requires user to provide an output mode for the function
  • 31. flatMapGroupsWithState
        ▪ mapGroupsWithState is a special case with
          o Output mode: Update
          o Output size: 1 row per group
        ▪ Supports both Processing Time and Event Time timeouts
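As a concrete instance of the `mappingWithStateFunc` shape above, here is a minimal sketch of a per-key running count with a processing-time timeout. Several things are assumptions and not from the talk: the `Event` case class, the one-minute timeout, the local master, the console sink, and the built-in `rate` test source (which, as far as I know, requires Spark 2.3 or later). Treat it as an illustration of the API shape, not as the speaker's demo code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical event type for this sketch; not from the talk.
case class Event(id: Long, ts: java.sql.Timestamp)

object RunningCountSketch {
  // Pure helper: previous count (if any) plus the number of new events.
  def nextCount(previous: Option[Long], newEvents: Int): Long =
    previous.getOrElse(0L) + newEvents

  // Same (key, values, state) => output shape as mappingWithStateFunc.
  def updateCount(id: Long, events: Iterator[Event], state: GroupState[Long]): (Long, Long) =
    if (state.hasTimedOut) {
      // No data arrived for this key within the timeout: emit the last count and drop the state.
      val last = state.get
      state.remove()
      (id, last)
    } else {
      val count = nextCount(state.getOption, events.size)
      state.update(count)
      state.setTimeoutDuration("1 minute") // assumed value; only valid with ProcessingTimeTimeout
      (id, count)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("running-count-sketch").getOrCreate()
    import spark.implicits._

    // The rate source emits (timestamp, value) rows; fold value into three keys.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .selectExpr("value % 3 AS id", "timestamp AS ts")
      .as[Event]

    val counts = events
      .groupByKey(_.id)
      .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateCount _)

    // mapGroupsWithState emits one row per group and requires Update output mode.
    counts.writeStream
      .outputMode(OutputMode.Update())
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

Keeping the state transition in the small pure function `nextCount` makes it possible to unit-test the logic without starting a streaming query.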
  • 32. Outline
        o Structured Streaming Concepts
        o Stateful Processing in Structured Streaming
        o Use Cases and How NoSQL Stores Fit In
        o Demos
  • 33. Alerting
          val monitoring = stream
            .as[Event]
            .groupByKey(_.id)
            .flatMapGroupsWithState(Append, GST.ProcessingTimeTimeout) {
              (id: Int, events: Iterator[Event], state: GroupState[…]) =>
                ...
            }
            .writeStream
            .queryName("alerts")
            .foreach(new PagerdutySink(credentials))
        Monitor a stream using custom stateful logic with timeouts.
  • 34. Alerting
        ▪ Save your state to Scylla to power dashboards
        ▪ Have the stream trigger alerts ASAP
  • 35. Sessionization
          val monitoring = stream
            .as[Event]
            .groupByKey(_.session_id)
            .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) {
              (id: Int, events: Iterator[Event], state: GroupState[…]) =>
                ...
            }
            .writeStream
            .scylla("trips")
        Analyze sessions of user/system behavior
  • 36. Sessionization
        ▪ Update sessions in your stream
        ▪ Save it to a NoSQL store like Scylla!
  • 37. Demo
  • 38. Try Spark 2.2 on Community Edition today! https://databricks.com/try-databricks
  • 39. Apache Spark's Structured Streaming at Scale Series: https://databricks.com/blog/category/engineering Twitter: @databricks
  • 40. We are hiring! https://databricks.com/company/careers
  • 41. THANK YOU burak@databricks.com “Does anyone have any questions for my answers?” - Henry Kissinger