Retrospection / prospection and schema

TAGOMORI Satoshi (@tagomoris)
LINE Corp.
2014/01/31 (Fri) at University of Tsukuba
the 1st half


TAGOMORI Satoshi (@tagomoris)
LINE Corp.
Development Support Team

Logs
Service metrics (Users, PageViews, ...)
UX/UI metrics (Access path, Taps/views, ...)
Monitoring metrics (Traffic Gbps, TBytes/day, ...)
System monitoring (Error rates, Response time, ...)


Software for Logging
Collection: Fluentd, Scribed, Flume, LogStash, ...
Storage: RDBMS, Hadoop HDFS, NoSQLs, Elasticsearch, ...
Processing: SQL, Hadoop MapReduce(Hive), Presto, Impala, ...
Stream-Processing: Storm, Kafka, Norikra, ...
Visualization: Kibana, Tableau, Fnordmetric, GrowthForecast, Focuslight, ...
Appliance: DWH + BI Tools
Services: Google BigQuery, Treasure Data, ...


How to inspect logs
Retrospection (reactive search)
Store data, and search
Prospection (proactive search)
Define what should be processed, and store data
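
The two approaches can be contrasted with a minimal Python sketch. The events and field names (`path`, `status`) are hypothetical and not from the talk; the point is only where the question is fixed relative to when the data arrives.

```python
from collections import Counter

# Hypothetical log events; the field names are illustrative only.
events = [
    {"path": "/top", "status": 200},
    {"path": "/login", "status": 500},
    {"path": "/top", "status": 200},
    {"path": "/login", "status": 200},
]

# Retrospection: store every event first, decide what to ask later.
store = []
for event in events:
    store.append(event)  # the full event is kept
errors_retro = Counter(e["path"] for e in store if e["status"] >= 500)

# Prospection: the question is fixed before data arrives,
# so only the running result is kept, not the events themselves.
errors_pro = Counter()
for event in events:
    if event["status"] >= 500:
        errors_pro[event["path"]] += 1
```

Both paths give the same error counts here. The difference is that the retrospective store can still answer new questions later, while the prospective counter holds only what was asked for in advance.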


What logs are inspected
Schema-full data:
strict schema: predefined fields w/ types (or reject)
schema on read: try to read known fields (or ignore)
Schema-less data:
any fields (or ignore), any types (implicit/explicit conversion)
fits services in development (all internet services!)
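
As a rough illustration of the three handling modes above, here is a small Python sketch. The `SCHEMA` mapping and the function names are assumptions made for this example, not the behavior of any specific product.

```python
# Hypothetical schema: field name -> required Python type.
SCHEMA = {"user_id": int, "path": str}

def strict(record):
    """Strict schema: reject records that do not match exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError("unexpected or missing fields")
    for name, typ in SCHEMA.items():
        if not isinstance(record[name], typ):
            raise ValueError(f"bad type for {name}")
    return record

def schema_on_read(record):
    """Schema on read: keep the known fields, ignore the rest."""
    return {name: record.get(name) for name in SCHEMA}

def schema_less(record):
    """Schema-less: accept any fields and any types as they are."""
    return dict(record)

# A service under development starts sending a new field.
record = {"user_id": 42, "path": "/top", "referrer": "/login"}
```

With this record, `strict` raises because of the new `referrer` field, `schema_on_read` silently drops it, and `schema_less` keeps it.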

How/what

            | Schema-full                      | Schema-less
Retrospect  | RDBMS, Hive, BigQuery,           | MongoDB, Hive(SerDe), TD,
            | Cassandra, HBase, ...            | Plain text file, ...
Prospect    | Esper, many of stream CEPs, ...  | Norikra, ...


Data size: schema & index
Logs: size is always important (xTB - xPB)
Schema:
size optimization
access optimization on memory/disk
Index:
access optimization on memory/disk
more memory/disk required
hard to distribute


Query response improvements of retrospection
Schema-full + indexed (RDBMS)
Query plan optimization
Schema on read
I/O and Task size optimization & scale out
Schema-less + indexed (Mongo)
mmap-ed index & data (!)


Query response improvements of prospection

Time window + incremental calculation
Stream processing engines
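
A minimal sketch of a time window with incremental calculation, in Python. It assumes tumbling (non-overlapping) windows of a fixed length and events that arrive in time order; `WINDOW_SECONDS` and `windowed_averages` are names invented for this example.

```python
WINDOW_SECONDS = 60  # assumed window length

def windowed_averages(events):
    """Yield (window_start, average) for each tumbling time window.

    events: iterable of (timestamp_seconds, value), in time order.
    Only a count and a running total are kept, never the events.
    """
    current = None
    count = 0
    total = 0.0
    for ts, value in events:
        window = ts - ts % WINDOW_SECONDS
        if current is not None and window != current:
            # The previous window is complete: emit its result and reset.
            yield current, total / count
            count, total = 0, 0.0
        current = window
        count += 1       # incremental update, O(1) per event
        total += value
    if count:
        yield current, total / count
```

For example, `list(windowed_averages([(0, 10), (30, 20), (61, 40)]))` yields one average per one-minute window, computed without storing the individual events.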


Stream processing and data size
No disks: reduction of failure points
Less memory:
size of just processing and I/O buffers
aggregation results
Easy to distribute:
stream duplication
stream splitting by aggregation key
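
Stream splitting by aggregation key could look like the following Python sketch. The worker count, the CRC32-based routing, and the `path` field are assumptions for illustration; real engines choose their own partitioning scheme.

```python
import zlib
from collections import Counter

NUM_WORKERS = 3  # assumed number of worker processes or nodes

def worker_for(key):
    """Pick a worker deterministically from the aggregation key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS

# One independent partial aggregate per worker.
workers = [Counter() for _ in range(NUM_WORKERS)]

events = [{"path": "/top"}, {"path": "/login"}, {"path": "/top"}]
for event in events:
    key = event["path"]
    workers[worker_for(key)][key] += 1

# Each key lives on exactly one worker, so merging is a plain union.
merged = Counter()
for partial in workers:
    merged.update(partial)
```

Because every event with the same key goes to the same worker, no worker needs to coordinate with another while counting.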


Stream processing and schema
Stream processing: query -> data
Prospective schema by queries:
Queries know the required fields and their types
Unused fields can be ignored
Implicit type conversion available
Schema-less data + schema-full queries
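
The idea of a schema derived from the query can be sketched in Python as follows. `QUERY_FIELDS` and `project` are hypothetical names, and this is not how Norikra or any other engine is implemented; it only shows a query-side schema applied to schema-less events.

```python
# Hypothetical query definition: the fields it needs and their types.
QUERY_FIELDS = {"path": str, "response_ms": int}

def project(event):
    """Return the typed fields the query needs, or None if the event does not fit."""
    row = {}
    for name, typ in QUERY_FIELDS.items():
        if name not in event:
            return None
        try:
            row[name] = typ(event[name])  # implicit type conversion
        except (TypeError, ValueError):
            return None
    return row

events = [
    {"path": "/top", "response_ms": "120", "agent": "curl"},   # number as string, extra field
    {"path": "/top"},                                          # required field missing
    {"path": "/login", "response_ms": 80, "user": {"id": 1}},  # nested extra field
]
rows = [r for r in map(project, events) if r is not None]
```

The events themselves carry no schema. The query supplies it: unused fields such as `agent` are dropped, `"120"` is converted to an integer, and the event without `response_ms` is skipped.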


My goal:
Schema-less data stream
+ schema-full queries

It’s Norikra!

