Distributed Stream Processing in the real [Perl] world

YAPC::Asia 2012 Day 1 (2012/09/28)

TAGOMORI Satoshi (@tagomoris)

NHN Japan

Saturday, September 29, 2012

tagomoris

• TAGOMORI Satoshi ( @tagomoris )

• Working at NHN Japan


What this talk contains

• What "Stream Processing" is

• Why we want "Stream Processing"

• What features we should write for "Stream Processing"

• Frameworks and tools for "Distributed Stream Processing"

• Implementations in the Perl world


What "Stream Processing" is


Stream


Stream ?

•Continuously increasing data

•access logs, trace logs, sales checks, ...

•typically written to a file line by line

tail -f


Stream Processing

•Convert, select, aggregate passed data

•Does NOT wait for EOF (in many cases)

tail -f|grep ^hit|sed -es/hit/miss/g
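
The same convert/select idea can be sketched outside the shell. This is a minimal illustration, not code from the talk: the function names and the sample lines are made up, and each stage mirrors one command of the pipeline above.

```python
from typing import Iterable, Iterator


def select_hits(lines: Iterable[str]) -> Iterator[str]:
    """Like `grep ^hit`: pass only lines that start with "hit"."""
    for line in lines:
        if line.startswith("hit"):
            yield line


def hit_to_miss(lines: Iterable[str]) -> Iterator[str]:
    """Like `sed -e s/hit/miss/g`: replace every "hit" in each line."""
    for line in lines:
        yield line.replace("hit", "miss")


def pipeline(lines: Iterable[str]) -> Iterator[str]:
    # Generators are lazy: a line passes through both stages as soon as it
    # is read, so nothing waits for the end of the input.
    return hit_to_miss(select_hits(lines))


# Hypothetical sample input; a real run would iterate over sys.stdin.
sample = ["hit /index.html", "miss /favicon.ico", "hit /hit-counter"]
result = list(pipeline(sample))
```

Feeding `sys.stdin` to `pipeline` instead of the sample list would let the script sit at the end of a `tail -f` pipe and emit each line as it arrives.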


Stream Processing over network

•Data are collected from many nodes

•to search/query/store

•Separate heavy processes from edge nodes

edge: tail -f|nc
backend: nc -l|grep|sed|tee|...


Why we want "Stream Processing"


Batch file copy & convert

[Diagram: access.0928.16.log collects lines from 16:00 to 16:59 (60 min.) before it is copied and converted]

Latency for a 16:00 log line: 62+ minutes

•60 min. until the hourly file is complete

•3 min. flush wait

•? min. to copy over the network

•? min. to convert into a query-friendly structure


Stream data copy & convert

[Diagram: each line of access.0928.16.log, from 16:00 to 16:59, is copied over the network and converted one after another in real time]

Very low latency for each log line
(if traffic is not larger than capacity)


Case of data size explosion (batch)

[Diagram: batches from serviceA, serviceB, serviceC and serviceD; one of them needs a long transfer time]

Casual batch over multiple nodes/services may be blocked by unbalanced data size

Asynchronous batch is a very good problem...


Case of data size explosion (stream)

[Diagram: streams from serviceA, serviceB, serviceC and serviceD are mixed together; one of them carries heavy traffic]

Streams are mixed and not blocked by heavy traffic
(if traffic is not larger than capacity)


What features we should write for
"Stream Processing"


One-by-one input/process/output


[Diagram: one record → convert format / select → one record (or none)]

•Basic feature

•I/O call overhead is relatively heavy


Burst transfer/read/write and process


[Diagram: read many records from input and store them temporarily → convert format / select → store many records (or few or none) temporarily and write them to output]

•less input/output calls

•better performance with async I/O and multiple processes
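
A small sketch of the burst idea, under assumptions: the chunk size, the helper names and the `upper_or_drop` conversion are all invented for illustration. Records are grouped into chunks, each chunk is converted/selected in one pass, and the output is written once per chunk instead of once per record.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def read_chunks(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group an incoming stream into lists of up to chunk_size records."""
    chunk: List[str] = []
    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly short, chunk


def process_chunk(chunk: List[str],
                  convert: Callable[[str], Optional[str]]) -> List[str]:
    """Convert every record; a convert() result of None drops the record."""
    converted = (convert(record) for record in chunk)
    return [record for record in converted if record is not None]


def upper_or_drop(record: str) -> Optional[str]:
    # Hypothetical conversion: drop empty records, upper-case the rest.
    return record.upper() if record else None


writes: List[List[str]] = []  # stands in for output calls
for chunk in read_chunks(["a", "", "b", "c", "d"], chunk_size=2):
    out = process_chunk(chunk, upper_or_drop)
    if out:
        writes.append(out)  # one output call per chunk
```

Here five input records cause three output calls; with larger chunks the ratio improves, at the price of holding records longer before they move on.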


Control buffer flush intervals


[Diagram: read inputs from input → buffer: read and store many records temporarily → buffer: read and store many records (or few or none) temporarily → write records to output]

Flush interval: 0.5 sec? 1 sec? 3 sec? 30 sec?

•Control flushing by buffer size and latency

•(Semi-)real-time control flow arguments

•Max size of lost data when the process crashes
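
One way to picture the size/latency trade-off is a buffer that flushes when it holds too many records or when its oldest record has waited too long. This is a sketch under assumptions, not how any particular tool implements it: `FlushingBuffer`, `max_records` and `max_age` are illustrative names, and the clock is injectable only so the example can run without sleeping.

```python
import time
from typing import Callable, List, Optional


class FlushingBuffer:
    """Flush when the buffer is too big or its oldest record is too old."""

    def __init__(self, flush: Callable[[List[str]], None], max_records: int,
                 max_age: float, clock: Callable[[], float] = time.monotonic):
        self._flush = flush
        self._max_records = max_records
        self._max_age = max_age  # seconds the oldest record may wait
        self._clock = clock
        self._records: List[str] = []
        self._oldest: Optional[float] = None

    def append(self, record: str) -> None:
        if not self._records:
            self._oldest = self._clock()  # first record starts the timer
        self._records.append(record)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Call on every append and also periodically from a timer loop."""
        if not self._records:
            return
        too_big = len(self._records) >= self._max_records
        too_old = self._clock() - self._oldest >= self._max_age
        if too_big or too_old:
            self._flush(self._records)
            self._records = []


now = [0.0]  # fake clock so the example is deterministic
flushed: List[List[str]] = []
buf = FlushingBuffer(flushed.append, max_records=3, max_age=1.0,
                     clock=lambda: now[0])
buf.append("a")
buf.append("b")
now[0] = 1.5
buf.maybe_flush()  # flushed by age
for record in ["c", "d", "e"]:
    buf.append(record)  # flushed by size on "e"
```

With these two limits, a crash loses at most `max_records` records or `max_age` seconds of input, whichever is reached first.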


Buffering/Queueing


[Diagram, four steps: (1) the next node STOPs while records keep arriving in the output buffer; (2) buffers queue up while sending to the next node is blocked; (3) the next node recovers and the queued buffers are sent; (4) streaming resumes with a single output buffer]
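
The stop/recover behavior can be sketched as a queue of buffer chunks that are only removed after a successful send. The class name, the `ConnectionError` convention and the fake sender below are assumptions made for this illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class OutputQueue:
    """Keep chunks queued until the next node has accepted them."""

    def __init__(self, send: Callable[[List[str]], None]):
        self._send = send
        self._chunks: Deque[List[str]] = deque()

    def enqueue(self, chunk: List[str]) -> None:
        self._chunks.append(chunk)

    def try_flush(self) -> int:
        """Send queued chunks in order; stop at the first failure."""
        sent = 0
        while self._chunks:
            try:
                self._send(self._chunks[0])
            except ConnectionError:
                break  # keep the chunk and retry on the next call
            self._chunks.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._chunks)


state: Dict[str, bool] = {"next_node_up": True}
received: List[List[str]] = []


def fake_send(chunk: List[str]) -> None:
    # Stand-in for a network send to the next node.
    if not state["next_node_up"]:
        raise ConnectionError("next node is down")
    received.append(chunk)


queue = OutputQueue(fake_send)
queue.enqueue(["r1"])
queue.try_flush()              # streaming: sent at once
state["next_node_up"] = False  # next node STOPs
queue.enqueue(["r2"])
queue.try_flush()
queue.enqueue(["r3"])
queue.try_flush()
queued_while_down = len(queue)
state["next_node_up"] = True   # next node recovers
sent_on_recovery = queue.try_flush()
```

Order is preserved because a chunk is never skipped: the queue stops at the first chunk it cannot deliver.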


Connection keepalive
Connection pooling


[Diagram: node A, connected to node B, node C and node D]

•Keep connections and select one to use

•TCP connection establishment is costly

•manage node status (alive/down) at the same time

•not only inter-node, but also inter-process connections
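
A minimal sketch of a pool that reuses connections and remembers which nodes are down. Everything here is illustrative: `ConnectionPool` and `fake_connect` are made-up names, and the fake connector returns a string where real code would return a socket.

```python
from typing import Callable, Dict, List, Optional, Set


class ConnectionPool:
    """Reuse one connection per node and track alive/down status."""

    def __init__(self, connect: Callable[[str], object]):
        self._connect = connect
        self._connections: Dict[str, object] = {}
        self._down: Set[str] = set()

    def get(self, node: str) -> Optional[object]:
        if node in self._down:
            return None  # do not pay for another failed connect
        if node not in self._connections:
            try:
                self._connections[node] = self._connect(node)
            except OSError:
                self._down.add(node)
                return None
        return self._connections[node]

    def mark_alive(self, node: str) -> None:
        """Call after a health check succeeds, to allow reconnecting."""
        self._down.discard(node)


connect_calls: List[str] = []


def fake_connect(node: str) -> object:
    # Stand-in for an expensive TCP connect; "node C" refuses connections.
    connect_calls.append(node)
    if node == "node C":
        raise ConnectionRefusedError(node)
    return "connection to " + node


pool = ConnectionPool(fake_connect)
first = pool.get("node B")
second = pool.get("node B")   # reused, no second connect
refused = pool.get("node C")  # connect fails, node marked down
skipped = pool.get("node C")  # skipped without connecting again
```

In real use, `connect` could wrap something like `socket.create_connection((host, port))`, and a periodic health check would call `mark_alive` for nodes that come back.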


Distribution


Distribution: Load balancing (cpu/node)

[Diagram: records → load balancer → three processors, each sending to the next node]

•Distribute large scale data to many nodes
•nodes: servers, or processor processes
•to make total performance high
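
Round robin is probably the simplest way to spread records over processors. The sketch below assumes hypothetical processor names `p1` to `p3`; it is one possible balancing policy, not the one used by any specific tool.

```python
import itertools
from collections import Counter
from typing import Iterable, Iterator, Sequence, Tuple


def round_robin(records: Iterable[str],
                processors: Sequence[str]) -> Iterator[Tuple[str, str]]:
    """Pair each record with the next processor in rotation."""
    targets = itertools.cycle(processors)
    for record in records:
        yield next(targets), record


assignments = list(round_robin(["r1", "r2", "r3", "r4", "r5", "r6"],
                               ["p1", "p2", "p3"]))
load = Counter(processor for processor, _ in assignments)
```

Each processor ends up with the same number of records, which is fair only when records cost about the same to process.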


Distribution: High availability (process/node)

[Diagram: records → load balancer → three processors, each sending to the next node]

•Distribute large scale data to N+1 (or 2 or more) nodes
•to make system tolerant of node trouble
•without any failover (and takeback) operations
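
The "no failover operation" point can be illustrated with a sender that simply tries the next node when one refuses. This is a sketch with made-up names (`send_with_failover`, `fake_send`, `node1`, `node2`); the only assumption about the transport is that a failed send raises `ConnectionError`.

```python
from typing import Callable, List, Sequence, Set, Tuple


def send_with_failover(record: str, nodes: Sequence[str],
                       send: Callable[[str, str], None]) -> str:
    """Try nodes in order and return the one that accepted the record."""
    for node in nodes:
        try:
            send(node, record)
            return node
        except ConnectionError:
            continue  # node trouble: just try the next one
    raise ConnectionError("no node accepted the record")


down: Set[str] = {"node1"}
delivered: List[Tuple[str, str]] = []


def fake_send(node: str, record: str) -> None:
    # Stand-in for a network send.
    if node in down:
        raise ConnectionError(node + " is down")
    delivered.append((node, record))


nodes = ["node1", "node2"]
used_during_trouble = send_with_failover("r1", nodes, fake_send)
down.clear()  # node1 comes back
used_after_recovery = send_with_failover("r2", nodes, fake_send)
```

Because every send re-evaluates the node list, traffic returns to `node1` by itself once it is healthy; nobody has to run a takeback step.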


Routing

[Diagram: records → router → process A / process B → router → output A (records for service A), output B (records for service B), output C (records for service C)]


TOO MANY FEATURES TO
IMPLEMENT !!!!!


Frameworks and tools for
"Distributed Stream Processing"

Frameworks and tools

•Apache Kafka

•written in Scala (... with JVM!)

•Twitter Storm

•written in Clojure (...with JVM!)

•Fluentd


Fluentd


•Mainly written by @frsyuki at Treasure Data

•Apache License v2 software on GitHub

•Log read/transfer/write daemon based on MessagePack

•structured data (Hash: key:value pairs)

•Plugin mechanism for input/output/buffer features

•now many plugins are published

Fluentd features: input/output

•File tailing, network, and other input plugins

•tail and parse line-by-line

•receive records from app logger or other fluentd

•in_syslog, in_exec, in_dstat, .....

•Output to many storage systems

•other fluentd, file, S3, mongodb, mysql, HDFS, .....


Fluentd features: buffers

•Pluggable buffers

•output plugin buffers are swappable (by configuration)

•In-memory buffers: fast, but lost when fluentd goes down

•File buffers: slow, but always saved

•Buffer plugins can also be added by users

•No public buffer plugin exists yet....


Fluentd features: routing

•Tag based routing

•all records have tag and time

•Fluentd uses the tag to decide which plugin a record is sent to next

•configurations are:

•tag matcher pattern + plugin configuration
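
Tag-based routing can be sketched as an ordered list of (pattern, output) pairs where the first matching pattern wins. The matcher below is a simplification written for this note, not Fluentd's code: it handles only `*` (exactly one tag part) and `**` (zero or more parts), and the tags and output names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def _match(pattern: Sequence[str], parts: Sequence[str]) -> bool:
    if not pattern:
        return not parts
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # "**" may swallow zero or more tag parts.
        return any(_match(rest, parts[i:]) for i in range(len(parts) + 1))
    if not parts:
        return False
    if head == "*" or head == parts[0]:
        return _match(rest, parts[1:])
    return False


def tag_matches(pattern: str, tag: str) -> bool:
    """Match a dot-separated tag against a dot-separated pattern."""
    return _match(pattern.split("."), tag.split("."))


def pick_output(routes: List[Tuple[str, str]], tag: str) -> Optional[str]:
    """Return the output of the first route whose pattern matches the tag."""
    for pattern, output in routes:
        if tag_matches(pattern, tag):
            return output
    return None


# Hypothetical routing table: order matters, "**" is the catch-all.
routes = [
    ("access.*", "to_hdfs"),
    ("app.**", "to_mongodb"),
    ("**", "to_file"),
]
chosen = [pick_output(routes, tag)
          for tag in ["access.serviceA", "app.serviceB.error", "syslog"]]
```

Putting the catch-all pattern last matters: with first-match-wins ordering, a `**` route placed first would capture every record.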


Fluentd features: exec_filter

•Output records to a specified (and forked) command

•And get records from the command's STDOUT

•We can specify our stream processor as the command
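
A stream processor used this way is just a program that reads records on STDIN and writes records on STDOUT. The sketch below assumes the two sides exchange one JSON object per line (the format actually used depends on how exec_filter is configured), and the `status`, `path` and `level` fields are invented for the example.

```python
import io
import json
from typing import TextIO


def run_filter(infile: TextIO, outfile: TextIO) -> None:
    """Read one JSON record per line, select and convert, write JSON lines."""
    for line in infile:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("status", 0) < 500:
            continue  # select: keep server errors only
        record["level"] = "error"  # convert: add an attribute
        outfile.write(json.dumps(record) + "\n")
        # Flush per record so the parent process sees output immediately
        # instead of waiting for a full stdio buffer.
        outfile.flush()


# In-memory demonstration; as a command this would be
# run_filter(sys.stdin, sys.stdout).
source = io.StringIO(
    '{"status": 200, "path": "/a"}\n'
    '{"status": 503, "path": "/b"}\n'
)
sink = io.StringIO()
run_filter(source, sink)
output_records = [json.loads(line) for line in sink.getvalue().splitlines()]
```

The explicit flush is the part that is easy to forget: a filter that buffers its output behaves like a batch job even though its input is a stream.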


I'm very sorry that....


Fluentd is written in Ruby
Fluentd plugins are released as RubyGems


Problems about Fluentd (for stream processing)

•Eager buffering

•Eager default buffering config: does not flush under 8MB

•Performance

•Many features for data protection hurt performance

•Doesn't work on Windows


Implementations in the Perl world


fluent-agent-lite (Fluent::AgentLite)

•Log collection agent tools (in perl) by tagomoris

•fast and low load

•gets logs from file/STDIN, and sends to other nodes

•minimal features for log collector agent

•doesn't parse log lines (send 1 attribute with whole log line)

•supports load balancing and failover of destination
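
"Send 1 attribute with whole log line" can be pictured as wrapping each unparsed line into an event of tag, time and a one-key record. This sketch is an illustration only: the attribute name `message`, the tag and the helper name are assumptions, and the MessagePack encoding and network send are left out.

```python
import time
from typing import Any, List, Optional


def line_to_event(tag: str, line: str, field: str = "message",
                  now: Optional[float] = None) -> List[Any]:
    """Wrap one raw log line, unparsed, into [tag, time, record]."""
    timestamp = int(time.time() if now is None else now)
    return [tag, timestamp, {field: line.rstrip("\n")}]


# Hypothetical tag and log line; `now` is fixed to keep the example stable.
event = line_to_event("access.serviceA",
                      '127.0.0.1 - - "GET / HTTP/1.1" 200\n',
                      now=1348794000)
```

Skipping the parse keeps the agent cheap on the service node; parsing can happen later on a processor node.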

fluent-agent (Fluent::Agent)

•Fluentd feature subset tools by tagomoris

•written in Perl

•libuv and UV module for async I/O lib (for Windows)

•Goal: simple, fast and easy deployment

•UNDER CONSTRUCTION

•about 60% of features done, many bugs, not on CPAN yet


Features of Fluent::Agent

•1 input, 1 output and 0/1 filter

•Network I/O: protocol compatible with Fluentd
•and simple load balancing/failover feature

•File input/output: superset features of Fluentd (in plan)

•Filter with any command: compatible with Fluentd's exec_filter

[Diagram: data/records → input → filter (any program you want) → output → data/records]

Pros of Fluent::Agent (in plan)

•Simple and fast software for stream processing

•Stateless nodes

•fluent-agent works without any configuration files

•fluent-agent works with only commandline options

•Simple buffering and load balance

•less memory usage


Cons of Fluent::Agent (in fact)

•Poor input/output methods

•fluent-agent doesn't have a plugin architecture (currently)

•in future, CPAN based plugin system?

•Lack of data protection for death of process

•fluent-agent has only a memory buffer


Fluentd and fluent-agent


Fluentd and fluent-agent and fluent-agent-lite

[Diagram: services and service nodes (running fluent-agent-lite) send to layers of fluent-agent, which forward to fluentd; the stages are labeled deliver, processor / aggregator, and writer for storages]


Conclusion

•Distributed Stream Processing:

•provides more power to our applications

•is a very hard (and interesting) problem

•has some supporting frameworks/tools like Fluentd and/or fluent-agent


Let's try to improve your application

with stream processing

instead of many many batches

Thanks!

CAST: crouton & luke & chacha
Thanks to @kbysmnr
