2017 High Performance Database with Scala, Akka, Spark

Building a High-Performance Database with
Scala, Akka, and Spark
Evan Chan
November 2017

Who am I
User and contributor to Spark since 0.9,
Cassandra since 0.6
Created Spark Job Server and FiloDB
Talks at Spark Summit, Cassandra Summit, Strata,
Scala Days, etc.
http://velvia.github.io/

Why Build a New
Streaming Database?

Needs
• Ingest HUGE streams of events — IoT etc.
• Real-time, low latency, and somewhat flexible queries
• Dashboards, quick answers on new data
• Flexible schemas and query patterns
• Keep your streaming pipeline super simple
• Streaming = hardest to debug. Simplicity rules!

[Diagram labels: Message Queue; Events; Stream Processing Layer; State / Database; Happy Users]

Spark + HDFS Streaming
[Diagram: Kafka -> Spark Streaming -> many small files (microbatches) -> dedup/consolidate job -> larger efficient files]
• High latency
• Big impedance mismatch between streaming
systems and a file system designed for big blobs
of data

Cassandra?
• Ingest HUGE streams of events — IoT etc.
• C* is not efficient for writing raw events
• Real-time, low latency, and somewhat flexible queries
• C* is real-time, but only low latency for simple
lookups. Add Spark => much higher latency
• Flexible schemas and query patterns
• C* only handles simple lookups

Introducing FiloDB
A distributed, columnar time-series/event database.
Built for streaming.
http://www.github.com/filodb/FiloDB

[Diagram labels: Message Queue; Events; Spark Streaming; Short term storage, K-V; Ad hoc, SQL, ML; Cassandra; FiloDB: events, ad-hoc, batch; Spark; Dashboards, maps]
100% Reactive
• Scala
• Akka Cluster
• Spark
• Monix / Reactive Streams
• Typesafe Config for all configuration
• Scodec, Ficus, Enumeratum, Scalactic, etc.
• Even most of the performance critical parts are written in Scala
:)

Scala, Akka, and
Spark for Database

Why use Scala and Akka?
• Akka Cluster!
• Just the right abstractions - streams, futures,
Akka, type safety….
• Failure handling and supervision are critical for
databases
• All the pattern matching and immutable goodness
:)

Scala Big Data Projects
• Spark
• GeoMesa
• Khronus - Akka time-series DB
• Sirius - Akka distributed KV Store
• FiloDB!

Actors vs Futures vs
Observables

One FiloDB Node
[Diagram labels: NodeCoordinatorActor (NCA); DatasetCoordinatorActor (DsCA), two instances; Active MemTable; Flushing MemTable; Reprojector; ColumnStore; Data, commands]

Akka vs Futures
[Diagram labels: NodeCoordinatorActor (NCA); two DsCAs; Active MemTable; Flushing MemTable; Reprojector; ColumnStore; Data, commands. Annotations: Akka = control flow; Core I/O = Futures/Observables]

Akka vs Futures
• Akka Actors:
• External FiloDB node API (remote + cluster)
• Async messaging with clients
• Cluster/distributed state management
• Futures and Observables:
• Core I/O
• Columnar data processing / ingestion
• Type-safe processing stages

Futures for Single Actions
/**
* Clears all data from the column store for that given projection, for all versions.
* More like a truncation, not a drop.
* NOTE: please make sure there are no reprojections or writes going on before calling this
*/
def clearProjectionData(projection: Projection): Future[Response]
/**
* Completely and permanently drops the dataset from the column store.
* @param dataset the DatasetRef for the dataset to drop.
*/
def dropDataset(dataset: DatasetRef): Future[Response]
/**
* Appends the ChunkSets and incremental indices in the segment to the column store.
* @param segment the ChunkSetSegment to write / merge to the columnar store
* @param version the version # to write the segment to
* @return Success. Future.failure(exception) otherwise.
*/
def appendSegment(projection: RichProjection,
segment: ChunkSetSegment,
version: Int): Future[Response]
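
To illustrate why one Future per action composes well, here is a minimal sketch using only the Scala standard library. ToyColumnStore, its String dataset names, and the Success / NotApplied objects are hypothetical stand-ins, not FiloDB's actual classes; the point is only that single-action Futures chain naturally in a for-comprehension.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical response types, loosely modeled on the slide's Future[Response].
sealed trait Response
case object Success extends Response
case object NotApplied extends Response

// Toy in-memory store exposing one Future per action (an assumption for illustration).
final class ToyColumnStore {
  private val data = scala.collection.concurrent.TrieMap.empty[String, Vector[Int]]

  def appendSegment(dataset: String, chunk: Vector[Int]): Future[Response] = Future {
    if (chunk.isEmpty) NotApplied
    else {
      data.put(dataset, data.getOrElse(dataset, Vector.empty) ++ chunk)
      Success
    }
  }

  def clearProjectionData(dataset: String): Future[Response] = Future {
    data.put(dataset, Vector.empty)
    Success
  }
}

object FuturesDemo {
  // Each step starts only after the previous Future has completed.
  def run(): (Response, Response, Response) = {
    val store = new ToyColumnStore
    val pipeline = for {
      appended <- store.appendSegment("events", Vector(1, 2, 3))
      skipped  <- store.appendSegment("events", Vector.empty)
      cleared  <- store.clearProjectionData("events")
    } yield (appended, skipped, cleared)
    // Blocking is only for the demo; real callers would keep composing Futures.
    Await.result(pipeline, 5.seconds)
  }
}
```

Calling FuturesDemo.run() runs the three actions in sequence and returns their responses as a tuple.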

Monix / Reactive Streams
• http://monix.io
• “observable sequences that are exposed as
asynchronous streams, expanding on the
observer pattern, strongly inspired by ReactiveX
and by Scalaz, but designed from the ground up
for back-pressure and made to cleanly interact
with Scala’s standard library, compatible out-of-
the-box with the Reactive Streams protocol”
• Much better than Future[Iterator[_]]

Monix / Reactive Streams
def readChunks(projection: RichProjection,
               columns: Seq[Column],
               version: Int,
               partMethod: PartitionScanMethod,
               chunkMethod: ChunkScanMethod = AllChunkScan): Observable[ChunkSetReader] = {
  scanPartitions(projection, version, partMethod)
    // Partitions to pipeline of single chunks
    .flatMap { partIndex =>
      stats.incrReadPartitions(1)
      readPartitionChunks(projection.datasetRef, version, columns, partIndex, chunkMethod)
    // Collate single chunks to ChunkSetReaders
    }.scan(new ChunkSetReaderAggregator(columns, stats)) { _ add _ }
    .collect { case agg: ChunkSetReaderAggregator if agg.canEmit => agg.emit() }
}

Functional Reactive Stream
Processing
• Ingest stream merged with flush commands
• Built in async/parallel tasks via mapAsync
• Notify on end of stream, errors
val combinedStream = Observable.merge(stream.map(SomeData), flushStream)
combinedStream.map {
  case SomeData(records) =>
    shard.ingest(records)
    None
  case FlushCommand(group) =>
    shard.switchGroupBuffers(group)
    Some(FlushGroup(shard.shardNum, group, shard.latestOffset))
}.collect { case Some(flushGroup) => flushGroup }
  .mapAsync(numParallelFlushes)(shard.createFlushTask _)
  .foreach { x => }
  .recover { case ex: Exception => errHandler(ex) }
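
The dispatch logic of the merged stream can be tried out without Monix. The sketch below swaps the Observable for a plain, synchronous Iterator, so it shows only the pattern match and the collect step, not back-pressure or mapAsync. ToyShard and the two-field FlushGroup are simplified, hypothetical versions of the types named on the slide.

```scala
// Hypothetical event types, loosely modeled on the slide's SomeData / FlushCommand.
sealed trait Event
final case class SomeData(records: Seq[String]) extends Event
final case class FlushCommand(group: Int) extends Event

final case class FlushGroup(group: Int, latestOffset: Long)

// Toy shard: the "offset" is simply the number of records ingested so far.
final class ToyShard {
  private var offset = 0L
  def ingest(records: Seq[String]): Unit = offset += records.size
  def latestOffset: Long = offset
}

object MergeDemo {
  // Data events emit nothing downstream; each flush command emits a FlushGroup
  // carrying the offset reached at the moment the command arrived.
  def flushGroups(events: Iterator[Event], shard: ToyShard): List[FlushGroup] =
    events.map {
      case SomeData(records) =>
        shard.ingest(records)
        None
      case FlushCommand(group) =>
        Some(FlushGroup(group, shard.latestOffset))
    }.collect { case Some(flushGroup) => flushGroup }.toList
}
```

Because data and commands travel in one ordered stream, a flush always sees exactly the records that preceded it, with no locking between ingest and flush.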

Spark/Akka Cluster Setup
[Diagram: Driver (NodeClusterActor, Client); Executor (NCA; DsCA1, DsCA2); Executor (NCA; DsCA1, DsCA2)]

Adding one executor
[Diagram: Driver (NodeClusterActor, Client); executor1 (NCA; DsCA1, DsCA2). State: Executors -> (executor1). Arrow labels: MemberUp, ActorSelection, ActorRef]

Adding second executor
[Diagram: Driver (NodeClusterActor, Client); executor1 (NCA; DsCA1, DsCA2); executor2 (NCA; DsCA1, DsCA2). State: Executors -> (executor1, executor2). Arrow labels: MemberUp, ActorSelection, ActorRef]

Sending a command
[Diagram: Driver (NodeClusterActor, Client); Executor (NCA; DsCA1, DsCA2); Executor (NCA; DsCA1, DsCA2). Message shown: Flush()]

Yes, Akka in Spark
• Columnar ingestion is stateful - need stickiness of
state. This is inherently difficult in Spark.
• Akka (cluster) gives us a separate, asynchronous
control channel to talk to FiloDB ingestors
• Spark only gives data flow primitives, not async
messaging
• We need to route incoming records to the correct
ingestion node. Sorting data is inefficient and forces
all nodes to wait for sorting to be done.
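
As a rough illustration of routing without sorting, the sketch below assigns each record to a node by hashing its partition key against a partition map. The PartitionMap class, the node names, and the hash-modulo scheme are all assumptions made for the example; the slides do not say how FiloDB maps partitions to nodes.

```scala
// Hypothetical partition map: a fixed list of shards, each owned by one node.
final case class PartitionMap(shardOwners: Vector[String]) {
  require(shardOwners.nonEmpty, "need at least one shard")

  // floorMod keeps the shard index non-negative even for negative hash codes.
  def shardFor(partitionKey: String): Int =
    Math.floorMod(partitionKey.hashCode, shardOwners.size)

  def nodeFor(partitionKey: String): String = shardOwners(shardFor(partitionKey))
}

object RoutingDemo {
  // Group record keys by owning node in a single pass; no global sort is needed,
  // and the same key always lands on the same node.
  def route(keys: Seq[String], map: PartitionMap): Map[String, Seq[String]] =
    keys.groupBy(map.nodeFor)
}
```

Each task can apply this lookup to its own records independently, which is what lets ingestion proceed without waiting on other nodes.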

Data Ingestion Setup
[Diagram: two Executors, each with an NCA, DsCA1, DsCA2, and tasks task0 and task1, each task with a Row Source Actor; a Node Cluster Actor holding the Partition Map]

FiloDB separate nodes
[Diagram: two FiloDB Nodes (NCA; DsCA1, DsCA2); two Executors, each with tasks task0 and task1, each task with a Row Source Actor; a Node Cluster Actor holding the Partition Map]

Testing Akka Cluster
• MultiNodeSpec / sbt-multi-jvm
• NodeClusterSpec
• Tests joining of different cluster nodes and
partition map updates
• Is partition map updated properly if a cluster
node goes down — inject network failures
• Lessons

Kamon Tracing
• http://kamon.io
• One trace can encapsulate multiple Future steps
all executing on different threads
• Tunable tracing levels
• Summary stats and histograms for segments
• Super useful for production debugging of reactive
stack

Kamon Tracing
def appendSegment(projection: RichProjection,
                  segment: ChunkSetSegment,
                  version: Int): Future[Response] = Tracer.withNewContext("append-segment") {
  val ctx = Tracer.currentContext
  stats.segmentAppend()
  if (segment.chunkSets.isEmpty) {
    stats.segmentEmpty()
    return(Future.successful(NotApplied))
  }
  for { writeChunksResp <- writeChunks(projection.datasetRef, version, segment, ctx)
        writeIndexResp  <- writeIndices(projection, version, segment, ctx)
                           if writeChunksResp == Success
  } yield {
    ctx.finish()
    writeIndexResp
  }
}
private def writeChunks(dataset: DatasetRef,
                        version: Int,
                        segment: ChunkSetSegment,
                        ctx: TraceContext): Future[Response] = {
  asyncSubtrace(ctx, "write-chunks", "ingestion") {
    val binPartition = segment.binaryPartition
    val segmentId = segment.segmentId
    val chunkTable = getOrCreateChunkTable(dataset)
    Future.traverse(segment.chunkSets) { chunkSet =>
      chunkTable.writeChunks(binPartition, version, segmentId, chunkSet.info.id, chunkSet.chunks, stats)
    }.map { responses => responses.head }
  }
}

Kamon Metrics
• Uses HDRHistogram for much finer and more
accurate buckets
• Built-in metrics for Akka actors, Spray, Akka-Http,
Play, etc. etc.
KAMON trace name=append-segment n=2863 min=765952 p50=2113536 p90=3211264 p95=3981312
p99=9895936 p999=16121856 max=19529728
KAMON trace-segment name=write-chunks n=2864 min=436224 p50=1597440 p90=2637824
p95=3424256 p99=9109504 p999=15335424 max=18874368
KAMON trace-segment name=write-index n=2863 min=278528 p50=432128 p90=544768 p95=598016
p99=888832 p999=2260992 max=8355840

Validation: Scalactic
private def getColumnsFromNames(allColumns: Seq[Column],
columnNames: Seq[String]): Seq[Column] Or BadSchema = {
if (columnNames.isEmpty) {
Good(allColumns)
} else {
val columnMap = allColumns.map { c => c.name -> c }.toMap
val missing = columnNames.toSet -- columnMap.keySet
if (missing.nonEmpty) { Bad(MissingColumnNames(missing.toSeq, "projection")) }
else { Good(columnNames.map(columnMap)) }
}
}
for { computedColumns <- getComputedColumns(dataset.name, allColIds, columns)
dataColumns <- getColumnsFromNames(columns, normProjection.columns)
richColumns = dataColumns ++ computedColumns
// scalac has problems dealing with (a, b, c) <- getColIndicesAndType... apparently
segStuff <- getColIndicesAndType(richColumns, Seq(normProjection.segmentColId), "segment")
keyStuff <- getColIndicesAndType(richColumns, normProjection.keyColIds, "row")
partStuff <- getColIndicesAndType(richColumns, dataset.partitionColumns, "partition") }
yield {
• Notice how multiple validations compose!
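
The same composition pattern can be sketched with the standard library's Either in place of Scalactic's Or (Right playing the role of Good, Left of Bad). The Column, BadSchema, and MissingColumnNames definitions below are simplified, hypothetical versions of the types used on the slide.

```scala
// Hypothetical, simplified schema types for illustration only.
sealed trait BadSchema
final case class MissingColumnNames(missing: Seq[String], kind: String) extends BadSchema
final case class Column(name: String, colType: String)

object ValidationDemo {
  // Resolve names to columns, or report every name that could not be found.
  def columnsFromNames(all: Seq[Column],
                       names: Seq[String],
                       kind: String): Either[BadSchema, Seq[Column]] = {
    val byName = all.map(c => c.name -> c).toMap
    val missing = names.filterNot(byName.contains)
    if (missing.nonEmpty) Left(MissingColumnNames(missing, kind))
    else Right(names.map(byName))
  }

  // The first Left short-circuits the remaining steps of the for-comprehension.
  def validate(all: Seq[Column],
               keys: Seq[String],
               parts: Seq[String]): Either[BadSchema, (Seq[Column], Seq[Column])] =
    for {
      keyCols  <- columnsFromNames(all, keys, "row")
      partCols <- columnsFromNames(all, parts, "partition")
    } yield (keyCols, partCols)
}
```

Each validation stays a small, independently testable function, and the caller gets either the fully validated result or the first error, with no exceptions involved.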

How do you go REALLY fast?
• Don’t serialize
• Don’t allocate
• Don’t copy

Filo fast
• Filo binary vectors - 2 billion records/sec
• Spark InMemoryColumnStore - 125 million
records/sec
• Spark CassandraColumnStore - 25 million
records/sec

Filo: High Performance
Binary Vectors
• Designed for NoSQL, not a file format
• random or linear access
• on or off heap
• missing value support
• Scala only, but cross-platform support possible
http://github.com/velvia/filo is a binary data vector library designed
for extreme read performance with minimal deserialization costs.

Billions of Ops / Sec
• JMH benchmark: 0.5ns per FiloVector element access / add
• 2 Billion adds per second - single threaded
• Who said Scala cannot be fast?
• Spark API (row-based) limits performance significantly
val randomInts = (0 until numValues).map(i => util.Random.nextInt)
val randomIntsAray = randomInts.toArray
val filoBuffer = VectorBuilder(randomInts).toFiloBuffer
val sc = FiloVector[Int](filoBuffer)
@Benchmark
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
def sumAllIntsFiloApply(): Int = {
var total = 0
for { i <- 0 until numValues optimized } {
total += sc(i)
}
total
}

JVM Inlining
• Very small methods can be inlined by the JVM
• final def avoids virtual method dispatch.
• Thus methods in traits and abstract classes are not inlinable
0.5ns/read is achieved through a stack of very small methods:
val base = baseReader.readInt(0)
final def apply(i: Int): Int = base + dataReader.read(i)
case (32, _) => new TypedBufferReader[Int] {
  final def read(i: Int): Int = reader.readInt(i)
}
final def readInt(i: Int): Int = unsafe.getInt(byteArray, (offset + i * 4).toLong)
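
A self-contained sketch of this layering is shown below. It uses java.nio.ByteBuffer rather than Unsafe, and the IntReader and DeltaIntVector classes are hypothetical names invented for the example, so it illustrates the shape of the technique (tiny final methods stacked on each other) rather than Filo's actual code or its measured speed.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Lowest layer: one tiny final method that reads a 4-byte int at element index i.
final class IntReader(buf: ByteBuffer) {
  final def readInt(i: Int): Int = buf.getInt(i * 4)
}

// Delta-style vector: every element is stored as an offset from a base value.
final class DeltaIntVector(base: Int, deltas: IntReader, val length: Int) {
  final def apply(i: Int): Int = base + deltas.readInt(i)
}

object InlineDemo {
  // Encode values as deltas from their minimum (an assumed encoding for the demo).
  def build(values: Array[Int]): DeltaIntVector = {
    require(values.nonEmpty, "need at least one value")
    val base = values.min
    val buf = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN)
    values.foreach(v => buf.putInt(v - base))
    new DeltaIntVector(base, new IntReader(buf), values.length)
  }

  // A plain while loop avoids closure and boxing overhead in the hot path.
  def sum(v: DeltaIntVector): Long = {
    var total = 0L
    var i = 0
    while (i < v.length) { total += v(i); i += 1 }
    total
  }
}
```

Since both classes and their methods are final, each call site has exactly one possible target, which is the property the slide relies on for inlining.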

BinaryRecord
• Tough problem: FiloDB must handle many
different datasets, each with different schemas
• Cannot rely on static types and standard
serialization mechanisms - case classes,
Protobuf, etc.
• Serialization very costly, especially strings
• Solution: BinaryRecord

BinaryRecord II
• BinaryRecord is a binary (i.e. transport-ready) record
class that supports any schema or mix of column
types
• Values can be extracted or written with no serialization
cost
• UTF8-encoded string class
• String compare as fast as native Java strings
• Immutable API once built

Use Case: Sorting
• Regular sorting: deserialize record, create sort
key, compare sort key
• BinaryRecord sorting: binary compare fields
directly, with no deserialization and no object allocations
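
To make the idea concrete, here is a minimal sketch of comparing records in their binary form. The record layout (a 4-byte age, a 4-byte name length, then UTF-8 name bytes) and every name in the code are assumptions chosen for the example; BinaryRecord's real layout is not described on the slides.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical layout: [age: Int, 4 bytes][name length: Int, 4 bytes][name: UTF-8 bytes].
object BinarySortDemo {
  def encode(name: String, age: Int): Array[Byte] = {
    val nameBytes = name.getBytes(UTF_8)
    ByteBuffer.allocate(8 + nameBytes.length)
      .putInt(age).putInt(nameBytes.length).put(nameBytes).array()
  }

  // Read a big-endian int straight from the array (ByteBuffer's default byte order).
  private def readInt(a: Array[Byte], off: Int): Int =
    ((a(off) & 0xff) << 24) | ((a(off + 1) & 0xff) << 16) |
      ((a(off + 2) & 0xff) << 8) | (a(off + 3) & 0xff)

  // Order by age, then by name bytes, reading fields in place:
  // nothing is deserialized and no objects are allocated per comparison.
  val byAgeThenName: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
    def compare(a: Array[Byte], b: Array[Byte]): Int = {
      val byAge = Integer.compare(readInt(a, 0), readInt(b, 0))
      if (byAge != 0) byAge
      else {
        val la = readInt(a, 4)
        val lb = readInt(b, 4)
        val n = math.min(la, lb)
        var i = 0
        var d = 0
        while (d == 0 && i < n) {
          d = (a(8 + i) & 0xff) - (b(8 + i) & 0xff) // unsigned byte comparison
          i += 1
        }
        if (d != 0) d else Integer.compare(la, lb)
      }
    }
  }

  // Decoding is needed only to display results, never while sorting.
  def decodeName(rec: Array[Byte]): String = new String(rec, 8, readInt(rec, 4), UTF_8)
}
```

Sorting a collection with `records.sorted(BinarySortDemo.byAgeThenName)` then touches only the raw bytes; comparing UTF-8 bytes as unsigned values gives the same order as comparing Unicode code points.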

Regular Sorting
[Diagram: each of two Protobuf/Avro etc. records -> deserialized instance -> sort key; the two sort keys are then compared (Cmp)]

BinaryRecord Sorting
• BinaryRecord sorting: binary compare fields
directly, with no deserialization and no object allocations
[Diagram: two records with fields name: Str, age: Int, lastTimestamp: Long, group: Str, compared field by field]

SBT-JMH
• Super useful tool to leverage JMH, the best micro
benchmarking harness
• JMH is written by the JDK folks

In Summary
• Scala, Akka, reactive can give you both awesome
abstractions AND performance
• Use Akka for distribution, state, protocols
• Use reactive/Monix for functional, concurrent
stream processing
• Build (or use FiloDB’s) fast low-level abstractions
with good APIs
