Distributed Stream Processing in the real [Perl] world
1. Distributed Stream Processing
in the real [Perl] world.
YAPC::Asia 2012 Day 1 (2012/09/28)
TAGOMORI Satoshi (@tagomoris)
NHN Japan
Saturday, September 29, 2012
2. tagomoris
• TAGOMORI Satoshi ( @tagomoris )
• Working at NHN Japan
3. What this talk contains
• What "Stream Processing" is
• Why we want "Stream Processing"
• What features we should write for "Stream Processing"
• Frameworks and tools for "Distributed Stream Processing"
• Implementations in the Perl world
6. Stream ?
•Continuously increasing data
•access logs, trace logs, sales checks, ...
•typically written to a file line by line
tail -f
7. Stream Processing
•Convert, select, and aggregate data as it passes through
•Does NOT wait for EOF (in many cases)
tail -f|grep ^hit|sed -es/hit/miss/g
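The same pipeline can be sketched as chained generators. This is a minimal Python illustration (the function names are hypothetical, not from the talk): each stage handles a line as soon as it arrives instead of waiting for EOF.

```python
from typing import Iterable, Iterator


def select_hits(lines: Iterable[str]) -> Iterator[str]:
    """Like `grep ^hit`: pass through only lines that start with 'hit'."""
    for line in lines:
        if line.startswith("hit"):
            yield line


def hit_to_miss(lines: Iterable[str]) -> Iterator[str]:
    """Like `sed -e s/hit/miss/g`: replace every 'hit' in each line."""
    for line in lines:
        yield line.replace("hit", "miss")


def pipeline(lines: Iterable[str]) -> Iterator[str]:
    """Chain the stages; nothing is read until a consumer asks for a line."""
    return hit_to_miss(select_hits(lines))


sample = ["hit /index.html", "miss /about.html", "hit /hit-counter"]
result = list(pipeline(sample))
```

Because the stages are lazy, the source may be an endless iterator (for example a file being tailed) and the first output is available as soon as the first matching line is read.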
8. Stream Processing over network
•Data are collected from many nodes
•to search/query/store
•Separate heavy processes from edge nodes
edge: tail -f|nc
backend: nc -l|grep|sed|tee|...
11. Stream data copy & convert
[Figure: lines appended to access.0928.16.log from 16:00 to 16:59 are copied over the network and converted one after another in real time]
Very low latency for each log line
(if traffic does not exceed capacity)
12. Case of data size explosion (batch)
[Figure: batch transfers from serviceA, serviceB, serviceC, and serviceD]
Casual batches over multiple nodes/services may be blocked by unbalanced data sizes:
a service with much more data needs a long transfer time
Asynchronous batching is a very good problem...
13. Case of data size explosion (stream)
[Figure: streams from serviceA, serviceB, serviceC, and serviceD, one of them carrying heavy traffic]
Streams are mixed and not blocked by heavy traffic
(if traffic does not exceed capacity)
14. What features we should write for "Stream Processing"
16. One-by-one input/process/output
[Figure: one record -> convert format / select -> one record (or none)]
•Basic feature
•I/O call overhead is relatively heavy
18. Burst transfer/read/write and process
[Figure: read many records from the input and store them temporarily -> convert format / select -> many records (or few or none) -> store temporarily and write to the output]
•fewer input/output calls
•higher performance with async I/O and multiple processes
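The difference from one-by-one processing can be sketched in a few lines. This is a hedged Python illustration, not code from the talk: `batches`, `process_batch`, and the sample convert rule are hypothetical names, and a list append stands in for one write call.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def batches(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group a record stream into lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_batch(chunk: List[str],
                  convert: Callable[[str], Optional[str]]) -> List[str]:
    """Convert/select a whole batch; `convert` returns None to drop a record."""
    out = []
    for record in chunk:
        converted = convert(record)
        if converted is not None:
            out.append(converted)
    return out


def upper_unless_debug(record: str) -> Optional[str]:
    """Hypothetical rule: drop DEBUG records, upper-case the rest."""
    return None if record.startswith("DEBUG") else record.upper()


writes = []  # each append stands in for one output call
for chunk in batches(["a", "DEBUG b", "c", "d", "e"], size=2):
    out = process_batch(chunk, upper_unless_debug)
    if out:
        writes.append(out)
```

Five input records lead to three output calls here; with larger batch sizes the number of I/O calls drops further, at the cost of holding more records in memory.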
20. Control buffer flush intervals
[Figure: records read from the input are stored temporarily in a buffer, processed into many (or few or no) records, stored temporarily in a second buffer, and written to the output. Flush interval: 0.5sec? 1sec? 3sec? 30sec?]
•Control flushing by buffer size and latency
•Parameters for (semi-)real-time flow control
•Maximum amount of data lost when the process crashes
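A buffer that flushes on either a size limit or a time limit might look like the following. This is a minimal sketch under assumptions: the class name, parameter names, and default values are hypothetical, and a real daemon would also call `flush_if_due` from a timer so that a quiet stream still gets flushed.

```python
import time
from typing import Callable, List


class FlushingBuffer:
    """Collect records and flush when the buffer is too big or too old."""

    def __init__(self, flush: Callable[[List[str]], None],
                 max_records: int = 1000, max_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self._flush = flush
        self._max_records = max_records    # bounds data lost on a crash
        self._max_interval = max_interval  # bounds added latency
        self._clock = clock
        self._records: List[str] = []
        self._last_flush = clock()

    def append(self, record: str) -> None:
        self._records.append(record)
        self.flush_if_due()

    def flush_if_due(self) -> None:
        too_big = len(self._records) >= self._max_records
        too_old = self._clock() - self._last_flush >= self._max_interval
        if self._records and (too_big or too_old):
            self.flush()

    def flush(self) -> None:
        chunk, self._records = self._records, []
        self._last_flush = self._clock()
        if chunk:
            self._flush(chunk)


# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
flushed: List[List[str]] = []
buf = FlushingBuffer(flushed.append, max_records=3, max_interval=1.0,
                     clock=lambda: now[0])
for record in ["a", "b", "c"]:
    buf.append(record)   # the third append reaches max_records and flushes
buf.append("d")          # stays buffered: small and still fresh
now[0] = 1.2
buf.flush_if_due()       # interval elapsed, so "d" is flushed on its own
```

The two limits trade off against each other: a short interval keeps latency low, while a small record limit caps how much is lost if the process dies with an in-memory buffer.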
22. Buffering/Queueing
[Figure: four stages of an output buffer. Normal: records pass through the output buffer and are sent to the next node. STOP: the next node stops and records pile up in the buffer. Recover: the next node comes back and the buffered records are sent. Streaming: records flow through to the next node again.]
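The stop/recover behaviour of an output queue can be sketched as follows. This is an illustrative Python example under assumptions: `QueuedOutput`, `FakeNode`, and the queue limit are hypothetical, and the sender is assumed to raise `ConnectionError` while the next node is down.

```python
from collections import deque
from typing import Callable, Deque, List


class QueuedOutput:
    """Keep chunks in order while the next node is down; drain on recovery."""

    def __init__(self, send: Callable[[List[str]], None], limit: int = 10000):
        self._send = send      # assumed to raise ConnectionError on failure
        self._limit = limit    # maximum number of queued chunks
        self._queue: Deque[List[str]] = deque()

    def emit(self, chunk: List[str]) -> None:
        if len(self._queue) >= self._limit:
            raise OverflowError("output queue is full")
        self._queue.append(chunk)
        self.try_flush()

    def try_flush(self) -> bool:
        """Send queued chunks oldest first; stop at the first failure."""
        while self._queue:
            try:
                self._send(self._queue[0])
            except ConnectionError:
                return False   # keep the chunk for a later retry
            self._queue.popleft()
        return True

    def pending(self) -> int:
        return len(self._queue)


class FakeNode:
    """Stand-in for the next node, which can be switched up or down."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[List[str]] = []

    def send(self, chunk: List[str]) -> None:
        if not self.up:
            raise ConnectionError("next node is down")
        self.received.append(chunk)


node = FakeNode()
out = QueuedOutput(node.send)
out.emit(["r1"])                 # delivered immediately
node.up = False                  # STOP: the next node goes away
out.emit(["r2"])
out.emit(["r3"])
pending_while_down = out.pending()
node.up = True                   # recover: queued chunks are sent in order
out.try_flush()
```

A chunk is removed from the queue only after the send succeeds, so a failure in the middle of draining leaves the remaining chunks in order for the next retry.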
24. Connection keepalive / connection pooling
[Figure: connections kept open among node A, node B, node C, and node D]
•Keep connections open and select one to use
•Establishing a TCP connection is costly
•Manage node status (alive/down) at the same time
•Applies not only to inter-node but also to inter-process connections
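Connection reuse combined with alive/down tracking can be sketched like this. It is a hedged Python example: `ConnectionPool`, `retry_after`, and the injected `connect` callable are hypothetical names chosen for illustration, and `connect` is assumed to raise `OSError` when a node cannot be reached.

```python
import time
from typing import Any, Callable, Dict, Hashable, List, Tuple


class ConnectionPool:
    """Reuse open connections and skip nodes that were recently down."""

    def __init__(self, nodes: List[Hashable],
                 connect: Callable[[Hashable], Any],
                 retry_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._nodes = list(nodes)
        self._connect = connect          # assumed to raise OSError on failure
        self._retry_after = retry_after  # seconds before retrying a down node
        self._clock = clock
        self._conns: Dict[Hashable, Any] = {}
        self._down_until: Dict[Hashable, float] = {}

    def get(self) -> Tuple[Hashable, Any]:
        """Return (node, connection) for the first node considered alive."""
        now = self._clock()
        for node in self._nodes:
            if self._down_until.get(node, 0.0) > now:
                continue                 # still marked down
            conn = self._conns.get(node)
            if conn is None:
                try:
                    conn = self._connect(node)
                except OSError:
                    self._down_until[node] = now + self._retry_after
                    continue
                self._conns[node] = conn
            return node, conn
        raise ConnectionError("no alive node")

    def mark_down(self, node: Hashable) -> None:
        """Call after a send failure: drop the connection, pause the node."""
        conn = self._conns.pop(node, None)
        close = getattr(conn, "close", None)
        if close is not None:
            close()
        self._down_until[node] = self._clock() + self._retry_after


# Demo with a fake connect function; nodeB always refuses connections.
calls: List[str] = []

def fake_connect(node: str) -> str:
    calls.append(node)
    if node == "nodeB":
        raise OSError("connection refused")
    return "conn-to-" + node

now = [0.0]
pool = ConnectionPool(["nodeB", "nodeC", "nodeD"], fake_connect,
                      retry_after=30.0, clock=lambda: now[0])
first = pool.get()    # nodeB fails and is marked down, nodeC is connected
second = pool.get()   # nodeC's open connection is reused, no new connect
```

With real sockets, `connect` could be something like `lambda node: socket.create_connection(node)` where each node is a `(host, port)` tuple.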
26. Distribution: Load balancing (cpu/node)
[Figure: records -> load balancer -> three processors, each sending to the next node]
•Distribute large-scale data to many nodes
•nodes: servers, or processor processes
•to raise total performance
27. Distribution: High availability (process/node)
[Figure: records -> load balancer -> three processors, each sending to the next node]
•Distribute large-scale data to N+1 (or N+2 or more) nodes
•to make the system tolerant of node failures
•without any failover (and failback) operations
28. Routing
[Figure: records -> router -> process A / process B -> router -> records for service A, service B, and service C -> output A, output B, and output C]
33. Fluentd
•Mainly written by @frsyuki at Treasure Data
•Apache License 2.0 software on GitHub
•Log read/transfer/write daemon based on MessagePack
•structured data (Hash: key:value pairs)
•Plugin mechanism for input/output/buffer features
•now many plugins are published
34. Fluentd features: input/output
•File tailing, network, and other input plugins
•tail and parse line-by-line
•receive records from app logger or other fluentd
•in_syslog, in_exec, in_dstat, .....
•Output to many storage systems
•other fluentd, file, S3, mongodb, mysql, HDFS, .....
35. Fluentd features: buffers
•Pluggable buffers
•output plugin buffers are swappable (by configuration)
•In-memory buffers: fast, but lost when fluentd goes down
•File buffers: slow, but always saved
•Buffer plugins can also be added by users
•No public buffer plugin exists yet....
36. Fluentd features: routing
•Tag based routing
•all records have a tag and a time
•Fluentd uses the tag to decide which plugin a record is sent to next
•configurations are:
•tag matcher pattern + plugin configuration
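Tag-based routing can be approximated in a few lines. The Python sketch below is a simplification, not Fluentd's actual matcher: it assumes `*` matches exactly one dot-separated tag part, `**` matches any remainder, and the first matching pattern wins. The route table and plugin names are hypothetical.

```python
import re
from typing import List, Optional, Pattern, Tuple


def compile_pattern(pattern: str) -> Pattern[str]:
    """Translate a tag pattern into a regex (simplified semantics)."""
    parts = []
    for piece in re.split(r"(\*\*|\*)", pattern):
        if piece == "**":
            parts.append(r".*")       # any number of tag parts
        elif piece == "*":
            parts.append(r"[^.]*")    # exactly one tag part
        else:
            parts.append(re.escape(piece))
    return re.compile("^" + "".join(parts) + "$")


class Router:
    """Pick the plugin for a record from its tag; first match wins."""

    def __init__(self, routes: List[Tuple[str, str]]):
        self._routes = [(compile_pattern(p), plugin) for p, plugin in routes]

    def route(self, tag: str) -> Optional[str]:
        for regex, plugin in self._routes:
            if regex.match(tag):
                return plugin
        return None


# Hypothetical route table: pattern -> name of the plugin to hand the record to.
router = Router([
    ("access.*", "store_access"),
    ("app.**", "aggregate"),
    ("**", "fallback"),
])
```

For example, `router.route("access.web")` selects `store_access`, while `router.route("access.web.error")` falls through to `fallback` because `access.*` covers only one tag part in this sketch.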
37. Fluentd features: exec_filter
•Output records to a specified (and forked) command
•And get records back from the command's STDOUT
•We can specify our own stream processor as the command
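A stream processor suitable for this kind of use reads records from its input and writes converted records to its output, one at a time. The Python sketch below assumes that records are exchanged as one JSON object per line; the field names `path`, `status`, and `is_error` are hypothetical.

```python
import io
import json
from typing import TextIO


def run_filter(source: TextIO, sink: TextIO) -> None:
    """Read one JSON record per line, select and convert, write it back out."""
    for line in source:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "path" not in record:      # select: drop records without a path
            continue
        # convert: add a derived field
        record["is_error"] = record.get("status", 0) >= 500
        sink.write(json.dumps(record) + "\n")
        sink.flush()                  # do not hold records back in our own buffer


# Demo on in-memory streams.
source = io.StringIO(
    '{"path": "/", "status": 200}\n'
    '{"status": 500}\n'
    '{"path": "/x", "status": 503}\n'
)
sink = io.StringIO()
run_filter(source, sink)
output_records = [json.loads(line) for line in sink.getvalue().splitlines()]
```

To use it as a command, the script would end with `run_filter(sys.stdin, sys.stdout)` after `import sys`. Flushing after every record matters because the parent process is waiting on the pipe.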
39. Fluentd is written in Ruby
Fluentd plugins are released as RubyGems
40. Problems about Fluentd (for stream processing)
•Eager buffering
•Eager default buffering config: does not flush below 8MB
•Performance
•The many data protection features hurt performance
•Doesn't work on Windows
42. fluent-agent-lite (Fluent::AgentLite)
•Log collection agent tool (in Perl) by tagomoris
•fast and low load
•gets logs from file/STDIN, and sends them to other nodes
•minimal features for a log collector agent
•doesn't parse log lines (sends one attribute containing the whole log line)
•supports load balancing and failover of destinations
43. fluent-agent (Fluent::Agent)
•A tool implementing a subset of Fluentd features, by tagomoris
•written in Perl
•uses libuv and the UV module as the async I/O library (for Windows support)
•Goal: simple, fast and easy deployment
•UNDER CONSTRUCTION
•about 60% of the features, many bugs, not on CPAN yet
44. Features of Fluent::Agent
•1 input, 1 output and 0/1 filter
•Network I/O: protocol compatible with Fluentd
•and a simple load balancing/failover feature
•File input/output: a superset of Fluentd's features (planned)
•Filter with any command: compatible with Fluentd's exec_filter
[Figure: data/records -> input -> filter (any program you want) -> output -> data/records]
45. Pros of Fluent::Agent (in plan)
•Simple and fast software for stream processing
•Stateless nodes
•fluent-agent works without any configuration files
•fluent-agent works with only commandline options
•Simple buffering and load balancing
•lower memory usage
46. Cons of Fluent::Agent (in fact)
•Limited input/output methods
•fluent-agent doesn't have a plugin architecture (currently)
•in the future, a CPAN-based plugin system?
•No data protection when the process dies
•fluent-agent has only a memory buffer
48. Fluentd and fluent-agent and fluent-agent-lite
[Figure: data flow across tiers. deliver: fluent-agent-lite on each service node -> processor: fluent-agent -> aggregator: fluent-agent -> writer for storages: fluentd]
49. Conclusion
•Distributed Stream Processing:
•provides more power to our applications
•is a very hard (and interesting) problem
•has supporting frameworks/tools such as Fluentd and/or fluent-agent
50. Let's try to improve your application
with stream processing
instead of many many batches
Thanks!
CAST: crouton & luke & chacha
Thanks to @kbysmnr