Apache Samza is a distributed stream processing framework that uses Kafka for messaging and YARN to provide fault tolerance, processor isolation, security, and resource management.
2. Samza overview
An open-source distributed stream processing framework created by LinkedIn
- sub-second latency
- handles large amounts of state
- fault tolerance
- no messages are ever lost
- partitioned and distributed at every level
- processor isolation
- pluggable
4. Kafka
- The stream may be sharded into one or more partitions.
- Each partition is independent from the others, and is replicated
across multiple machines.
- Each partition consists of a sequence of messages in a fixed order.
- Each message has an offset, which indicates its position in that
sequence.
- A Samza job can start consuming the sequence of messages from
any starting offset.
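
The partition and offset model above can be illustrated with a small in-memory sketch. This is not Kafka's API: the `Partition` and `Stream` classes and the hash-modulo partitioner are assumptions made for illustration, and replication is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only, ordered sequence of messages (illustrative only)."""
    messages: list = field(default_factory=list)

    def append(self, message):
        # The offset is the message's position in the sequence.
        self.messages.append(message)
        return len(self.messages) - 1

    def read_from(self, offset):
        # Yield (offset, message) pairs starting at the requested offset.
        for position in range(offset, len(self.messages)):
            yield position, self.messages[position]


class Stream:
    """A stream sharded into independent partitions (illustrative only)."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def send(self, key, message):
        # Hypothetical partitioner: hash of the key modulo partition count.
        index = hash(key) % len(self.partitions)
        return index, self.partitions[index].append(message)


stream = Stream(num_partitions=2)
for n in range(6):
    stream.send(n, f"event-{n}")

# A consumer may start from any offset, here offset 1 of partition 0.
replayed = list(stream.partitions[0].read_from(1))
```

Integer keys are used so that the partition assignment is stable between runs; Python randomizes string hashes per process.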
7. Streams
A stream is composed of immutable messages of a similar type or
category
- when a job consumes more than one stream, the next message is chosen
round-robin by default, but this can be overridden
- streams can be prioritised through configuration
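
A minimal sketch of the two selection policies mentioned above, round-robin and priority-based. The class names, the `update`/`choose` methods, and the numeric priorities are hypothetical and only illustrate the idea; they are not Samza's chooser interface.

```python
from collections import deque


class RoundRobinChooser:
    """Serve streams in turn, skipping streams with nothing buffered."""

    def __init__(self, stream_names):
        self.queues = {name: deque() for name in stream_names}
        self.order = deque(stream_names)

    def update(self, stream, message):
        # Buffer a message that arrived on the given stream.
        self.queues[stream].append(message)

    def choose(self):
        # Visit each stream at most once, starting after the last one served.
        for _ in range(len(self.order)):
            stream = self.order[0]
            self.order.rotate(-1)
            if self.queues[stream]:
                return stream, self.queues[stream].popleft()
        return None


class PriorityChooser(RoundRobinChooser):
    """Always serve the highest-priority stream that has a message."""

    def __init__(self, priorities):
        super().__init__(list(priorities))
        self.priorities = priorities

    def choose(self):
        ready = [name for name, queue in self.queues.items() if queue]
        if not ready:
            return None
        best = max(ready, key=lambda name: self.priorities[name])
        return best, self.queues[best].popleft()


rr = RoundRobinChooser(["a", "b"])
rr.update("a", "a1")
rr.update("a", "a2")
rr.update("b", "b1")
rr_order = [rr.choose() for _ in range(4)]

prio = PriorityChooser({"low": 0, "high": 1})
prio.update("low", "l1")
prio.update("high", "h1")
prio_order = [prio.choose() for _ in range(2)]
```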
8. Job
A job is code that performs a logical
transformation on a set of input
streams to append output messages to
a set of output streams.
9. Partitions
Each stream is broken into one or
more partitions. Each partition in
the stream is a totally ordered
sequence of messages.
10. Task
A job is scaled by breaking it into
multiple tasks. The task is the unit of
parallelism of the job, just as the
partition is to the stream. Each task
consumes data from one partition for
each of the job’s input streams.
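
The mapping of partitions to tasks can be sketched as follows. The function name is hypothetical, and the sketch assumes that the number of tasks equals the largest partition count among the input streams and that task `i` reads partition `i` of every stream that has one.

```python
def assign_partitions_to_tasks(streams):
    """Map task id -> list of (stream name, partition id) pairs.

    `streams` is a dict of stream name -> partition count.
    """
    num_tasks = max(streams.values())
    tasks = {}
    for task_id in range(num_tasks):
        # Task i takes partition i from each input stream that has it.
        tasks[task_id] = [
            (name, task_id)
            for name, count in sorted(streams.items())
            if task_id < count
        ]
    return tasks


assignment = assign_partitions_to_tasks({"clicks": 2, "views": 2})
```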
11. Containers
Containers are the unit of physical
parallelism, and a container is
essentially a Unix process (or Linux
cgroup). Each container runs one or
more tasks.
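
As a rough illustration of containers hosting one or more tasks, the sketch below spreads task ids across a fixed number of containers. The round-robin distribution and the function name are assumptions made for this example, not a description of how Samza groups tasks.

```python
def group_tasks_into_containers(task_ids, container_count):
    """Distribute task ids across containers in round-robin order."""
    containers = [[] for _ in range(container_count)]
    for position, task_id in enumerate(task_ids):
        containers[position % container_count].append(task_id)
    return containers


layout = group_tasks_into_containers([0, 1, 2, 3, 4], container_count=2)
```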
12. SamzaContainer startup steps
1 - Get last checkpointed offset for each input stream partition
2 - Create a “reader” thread for every input stream partition
3 - Start metrics reporters to report metrics
4 - Start a checkpoint timer to save your task’s input stream offsets
every so often
13. SamzaContainer startup steps (continued)
5 - Start a window timer to trigger your task’s window method, if it is
defined
6 - Instantiate and initialize your StreamTask once for each input
stream partition
7 - Start an event loop that takes messages from the input stream reader
threads, and gives them to your StreamTasks
8 - Notify lifecycle listeners during each one of these steps
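
Steps 1, 4, 6 and 7 can be sketched as a single-threaded simulation. This is a simplification under stated assumptions: reader threads, metrics reporters, the window timer and lifecycle listeners are omitted, checkpointing is triggered by message count rather than by a timer, and `CountingTask` and `run_container` are hypothetical names.

```python
class CountingTask:
    """Hypothetical stand-in for a user's StreamTask."""

    def __init__(self):
        self.seen = []

    def process(self, message):
        self.seen.append(message)


def run_container(partitions, checkpoints, task_factory, checkpoint_every=2):
    """Drain all partitions, resuming from the checkpointed offsets.

    `partitions` maps partition id -> list of messages.
    `checkpoints` maps partition id -> next offset to read; it is updated
    in place.
    """
    # Step 1: resume each partition from its last checkpointed offset.
    positions = {p: checkpoints.get(p, 0) for p in partitions}
    # Step 6: one task instance per input stream partition.
    tasks = {p: task_factory() for p in partitions}
    processed = 0
    # Step 7: event loop handing each message to the task owning its partition.
    while True:
        pending = [p for p in sorted(partitions)
                   if positions[p] < len(partitions[p])]
        if not pending:
            break
        for p in pending:
            tasks[p].process(partitions[p][positions[p]])
            positions[p] += 1
            processed += 1
            # Step 4 (simplified): save offsets every few messages.
            if processed % checkpoint_every == 0:
                checkpoints.update(positions)
    # Sketch-only choice: write a final checkpoint once the input is drained.
    checkpoints.update(positions)
    return tasks


saved = {0: 1}  # partition 0 was already processed up to offset 1
tasks = run_container({0: ["a", "b", "c"], 1: ["x", "y"]}, saved, CountingTask)
```

Because partition 0 resumes from offset 1, its task never sees message "a" again.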
15. State Management
- fast state access using a local database
- fault tolerance by sending a local store’s
writes to a replicated changelog, combined
with checkpointing
- out-of-the-box support for RocksDB
(key-value)
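
The changelog idea can be sketched with a dictionary standing in for the local database and a list standing in for the replicated changelog stream. The class name and the use of `None` as a delete marker are assumptions for illustration; RocksDB and Kafka are not involved here.

```python
class ChangelogBackedStore:
    """Local key-value store whose writes are mirrored to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stands in for a replicated stream
        self.local = {}             # stands in for the local database
        # Restore: replay every logged write to rebuild local state.
        for key, value in changelog:
            self._apply(key, value)

    def _apply(self, key, value):
        if value is None:
            self.local.pop(key, None)  # None marks a deletion
        else:
            self.local[key] = value

    def put(self, key, value):
        self._apply(key, value)
        self.changelog.append((key, value))

    def delete(self, key):
        self.put(key, None)

    def get(self, key):
        # Reads are served from local state only.
        return self.local.get(key)


changelog = []
store = ChangelogBackedStore(changelog)
store.put("a", 1)
store.put("b", 2)
store.delete("a")

# Simulate a restart on another machine: rebuild state from the changelog.
restored = ChangelogBackedStore(changelog)
```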
16. Event Loop
- synchronous tasks run on a single thread by default, but you can
configure a thread pool to run them in parallel
- asynchronous tasks will always be invoked in a single thread, while
callbacks can be triggered from a different thread.
Samza will make sure that checkpointing is automatically performed
only after the async calls have completed.
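
The rule that checkpointing waits for outstanding asynchronous calls can be sketched with a counter of in-flight messages. The `AsyncLoop` class is hypothetical and uses only Python's standard library; it illustrates the ordering guarantee, not Samza's implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AsyncLoop:
    """Dispatch messages from one thread; callbacks may fire on others."""

    def __init__(self):
        self.lock = threading.Lock()
        self.all_done = threading.Condition(self.lock)
        self.in_flight = 0
        self.completed = []
        self.checkpointed_offset = None

    def _complete(self, offset):
        # Callback: may run on a worker thread, so guard shared state.
        with self.all_done:
            self.completed.append(offset)
            self.in_flight -= 1
            if self.in_flight == 0:
                self.all_done.notify_all()

    def run(self, messages, pool):
        # Messages are always dispatched from this single thread.
        for offset, _message in enumerate(messages):
            with self.lock:
                self.in_flight += 1
            pool.submit(self._complete, offset)
        # Checkpoint only once every callback has fired.
        with self.all_done:
            self.all_done.wait_for(lambda: self.in_flight == 0)
            self.checkpointed_offset = len(messages) - 1


loop = AsyncLoop()
with ThreadPoolExecutor(max_workers=4) as pool:
    loop.run(["a", "b", "c"], pool)
```

Callbacks may complete in any order, which is why the checkpoint waits for the in-flight count to reach zero instead of tracking the last callback seen.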
17. Metrics
Samza has its own library to expose metrics, with counters, gauges and
timers.
Metrics can be exposed through JMX, a Kafka topic, and so on.
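
To make the three metric types concrete, here is a minimal registry sketch. All class and method names are hypothetical and do not reproduce Samza's metrics library; the `snapshot` method stands in for whatever a reporter (JMX, a Kafka topic) would read.

```python
class Counter:
    """A value that only moves by increments."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


class Gauge:
    """A value that is set to the latest observation."""

    def __init__(self, value=None):
        self.value = value

    def set(self, value):
        self.value = value


class Timer:
    """Records durations and reports their mean."""

    def __init__(self):
        self.durations_ms = []

    def update(self, duration_ms):
        self.durations_ms.append(duration_ms)

    @property
    def value(self):
        # Mean duration; 0 when nothing was recorded.
        if not self.durations_ms:
            return 0
        return sum(self.durations_ms) / len(self.durations_ms)


class MetricsRegistry:
    def __init__(self):
        self._metrics = {}

    def _get_or_create(self, name, kind):
        if name not in self._metrics:
            self._metrics[name] = kind()
        return self._metrics[name]

    def counter(self, name):
        return self._get_or_create(name, Counter)

    def gauge(self, name):
        return self._get_or_create(name, Gauge)

    def timer(self, name):
        return self._get_or_create(name, Timer)

    def snapshot(self):
        # What a reporter would publish: metric name -> current value.
        return {name: metric.value for name, metric in self._metrics.items()}


registry = MetricsRegistry()
registry.counter("messages-processed").inc()
registry.counter("messages-processed").inc(2)
registry.gauge("queue-size").set(7)
registry.timer("process-ms").update(10)
registry.timer("process-ms").update(30)
snapshot = registry.snapshot()
```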
18. Security
Samza provides no security.
All security is implemented in the stream system, or in the environment
in which Samza containers run.