Data Analytics Service Company and Its Ruby Usage

EuRuKo 2015 (Oct 17, 2015)
Satoshi Tagomori (@tagomoris)

Satoshi "Moris" Tagomori (@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, ...
Treasure Data, Inc.

Data Analytics Platform
Data Analytics Service

Data Analytics Flow
[Diagram] Data source → Collect → Store → Process → Visualize (Reporting, Monitoring)

• Data collection, storage
• Console & API endpoints
• Schema management
• Processing (batch, query, ...)
• Queuing & Scheduling
• Data connector/exporter

• Data collection, storage: Ruby(OSS), Java/JRuby(OSS)
• Console & API endpoints: Ruby(RoR)
• Schema management: Ruby/Java (MessagePack)
• Processing (batch, query, ...): Java(Hadoop,Presto)
• Queuing & Scheduling: Ruby(OSS)
• Data connector/exporter: Java, Java/JRuby(OSS)

Treasure Data Architecture: Overview
[Diagram] Components: Console, API, EventCollector, PlazmaDB, Worker, Scheduler, Hadoop Cluster, Presto Cluster, DataConnector
[Diagram] External: USERS, TD SDKs, SERVERS, CUSTOMER's SYSTEMS

OSS products
• To make logging easier and simpler than ever!
• Plugin system
• Open development
• For various environments and usages
• Fluentd, Fluent-Bit, Embulk
• Fluent-Bit: Data collector for Embedded Linux
http://fluentbit.io/

http://www.fluentd.org/
Fluentd
Unified Logging Layer
For Stream Data
Written in CRuby
http://www.slideshare.net/treasure-data/the-basics-of-fluentd-35681111

Bulk Data Loader
High Throughput&Reliability
Embulk
Written in Java/JRuby
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
http://www.embulk.org/

[Diagram] Embulk bulk-loads data between systems through input and output plugins: HDFS, MySQL, Amazon S3, CSV Files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Idempotent retrying
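
The "parallel execution" and "idempotent retrying" points above can be sketched in a few lines of Ruby. This is not Embulk's API: `run_tasks`, `MAX_RETRIES`, and the task block are hypothetical names chosen for illustration. The idea shown is that each task writes to a slot keyed by its task index, so a retried task overwrites its own earlier attempt instead of appending a duplicate.

```ruby
# Hypothetical sketch, not Embulk's real API: parallel tasks with
# idempotent retries. Each task owns one output slot keyed by its index.

MAX_RETRIES = 3  # assumed retry budget per task

# Run the given block once per input chunk, each in its own thread,
# retrying a failing task up to MAX_RETRIES attempts.
def run_tasks(chunks)
  outputs = {}
  lock = Mutex.new
  threads = chunks.each_with_index.map do |chunk, index|
    Thread.new do
      attempts = 0
      begin
        attempts += 1
        result = yield(chunk, attempts)
        # Keyed write: a retry replaces the slot, it never appends.
        lock.synchronize { outputs[index] = result }
      rescue StandardError
        retry if attempts < MAX_RETRIES
        raise
      end
    end
  end
  threads.each(&:join)
  # Return results in input order once every task has succeeded.
  chunks.each_index.map { |i| outputs.fetch(i) }
end

# Usage: the task for the second chunk fails once, then succeeds on retry.
loaded = run_tasks([[1, 2], [3, 4], [5]]) do |chunk, attempt|
  raise "transient error" if chunk == [3, 4] && attempt == 1
  chunk.map { |v| v * 10 }
end
```

Because the output is keyed by task index, the result is the same no matter how many times a task had to be retried.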

Console/API
• RoR + AWS RDS + AngularJS
• on EC2 (API) and Heroku (Console)
• Operation, Configuration & Managing Data

Collecting Data
• Import over Console/API
• From browsers and CLI (TD toolbelt)
• Treasure Agent (rpm/deb)
• Fluentd packaged by Treasure Data
• Post from JavaScript/iOS/Android SDK
• To EventCollector (HTTP endpoint for SDKs, impl. w/ Fluentd)

DataConnector
• Data bulk loader for various data sources
• Load customers' data to Treasure Data
• S3, Redshift, MySQL, PostgreSQL, Salesforce, ...
• Hosted Embulk
• Requires a lot of computing resources
• Distributed execution on Hadoop MapReduce

Hadoop, Presto clusters
• Several Hadoop/Presto clusters
• We use the OSS products themselves, not customized versions
• with minimal patches for storage I/O

Queue/Worker, Scheduler
• Treasure Data: multi-tenant data analytics service
• executes many jobs in shared clusters (queries, imports, ...)
• CORE: queues-workers & schedulers
• Clusters have their own queues/schedulers, but they are not enough
• resource limits for each price plan
• priority queues for job types
• and many others
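
As a rough illustration of why a service-level queue is needed on top of the clusters' own schedulers, the sketch below picks the next job under per-plan concurrency limits and per-type priorities. The plan names, limits, priority order, and the `next_job` helper are all invented for this example and do not describe Treasure Data's actual rules.

```ruby
# Hypothetical sketch: choose the next job to run under per-plan
# concurrency limits and per-type priorities. Plan names, limits, and
# priorities are invented for illustration.

PLAN_LIMITS   = { "basic" => 1, "premium" => 3 }.freeze  # max running jobs per account
TYPE_PRIORITY = { "import" => 0, "query" => 1 }.freeze   # lower value runs first

Job = Struct.new(:id, :account, :plan, :type)

# Returns the first runnable job in priority order (ties keep arrival
# order), or nil if every waiting job's account is already at its limit.
def next_job(waiting, running)
  running_per_account = running.group_by(&:account).transform_values(&:size)
  waiting
    .sort_by.with_index { |job, i| [TYPE_PRIORITY.fetch(job.type), i] }
    .find { |job| running_per_account.fetch(job.account, 0) < PLAN_LIMITS.fetch(job.plan) }
end

# Usage: "acme" is on the basic plan and already runs one job.
running = [Job.new(1, "acme", "basic", "query")]
waiting = [
  Job.new(2, "acme", "basic", "import"),     # blocked by the plan limit
  Job.new(3, "globex", "premium", "query"),
  Job.new(4, "globex", "premium", "import")  # highest-priority runnable job
]
picked = next_job(waiting, running)
```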

PerfectQueue
https://github.com/treasure-data/perfectqueue

PerfectQueue
• Highly available distributed queue using RDBMS
• Written in CRuby
• Enqueue by INSERT INTO
• Dequeue/Commit by UPDATE
• Flexible scheduling rather than scalability
• Using Amazon RDS (MySQL) internally
• + Workers on EC2
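
The "enqueue by INSERT, dequeue/commit by UPDATE" idea can be modeled in memory. The sketch below is not PerfectQueue's code or schema: the `TableQueue` class, the column names, the lease length, and the SQL in the comments are assumptions. A Hash stands in for the table and a Mutex stands in for the database's row locking.

```ruby
# Hypothetical in-memory model of an RDBMS-backed queue. The "table" is a
# Hash keyed by task id; column names and SQL in comments are assumptions.

class TableQueue
  LEASE_SECONDS = 300  # assumed lease length before a task is redelivered

  def initialize
    @rows = {}         # stands in for a table with a unique key on id
    @lock = Mutex.new  # stands in for the database's row locking
  end

  # INSERT INTO tasks (id, data, timeout) VALUES (?, ?, ?)
  # A duplicate id is rejected, as a unique-key violation would be.
  def enqueue(id, data, now)
    @lock.synchronize do
      return false if @rows.key?(id)
      @rows[id] = { data: data, timeout: now, finished: false }
      true
    end
  end

  # UPDATE tasks SET timeout = ? WHERE id = ?  (a row with timeout <= now)
  # Pushing the timeout forward hides the task from other workers.
  def dequeue(now)
    @lock.synchronize do
      id, row = @rows.find { |_, r| !r[:finished] && r[:timeout] <= now }
      return nil unless row
      row[:timeout] = now + LEASE_SECONDS
      [id, row[:data]]
    end
  end

  # UPDATE tasks SET finished = true WHERE id = ?
  def commit(id)
    @lock.synchronize { @rows.fetch(id)[:finished] = true }
  end
end

# Usage, with integer timestamps in seconds.
queue = TableQueue.new
queue.enqueue("job-1", "SELECT 1", 0)
first  = queue.dequeue(0)     # a worker leases the task
second = queue.dequeue(10)    # hidden while the lease is active
third  = queue.dequeue(301)   # lease expired without commit: delivered again
queue.commit("job-1")
fourth = queue.dequeue(1000)  # finished tasks are not delivered
```

A worker that dies without committing simply lets its lease expire, so another worker picks the task up again. This favors flexible scheduling logic over raw throughput, in line with the trade-off named on the slide.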

PerfectSched
https://github.com/treasure-data/perfectsched

PerfectSched
• Highly available distributed scheduler using RDBMS
• Written in CRuby
• At-least-once semantics
• PerfectSched enqueues jobs into PerfectQueue
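
A scheduler that guarantees each run is enqueued at least once may enqueue the same run twice, for example when it fails after enqueuing but before recording progress. One common way to make that safe is to give every run a deterministic key and let the queue reject duplicates. The sketch below illustrates that pattern only; `DedupQueue`, `Schedule`, the key format, and the `crash_before_advance` flag are hypothetical and are not PerfectSched's API.

```ruby
# Hypothetical sketch of at-least-once scheduling. The scheduler may
# enqueue the same run twice, so each run gets a deterministic key and
# the queue drops duplicates.

class DedupQueue
  attr_reader :keys

  def initialize
    @keys = []
  end

  # Returns false when the key is already present (duplicate submit).
  def submit(key)
    return false if @keys.include?(key)
    @keys << key
    true
  end
end

class Schedule
  attr_reader :next_time

  def initialize(name, interval, start)
    @name = name
    @interval = interval
    @next_time = start
  end

  # Enqueue every run that is due at `now`. When `crash_before_advance`
  # is true, simulate a failure after the enqueue but before the schedule
  # is advanced, which is the case that causes a repeated enqueue.
  def tick(queue, now, crash_before_advance: false)
    while @next_time <= now
      queue.submit("#{@name}.#{@next_time}")
      return if crash_before_advance
      @next_time += @interval
    end
  end
end

# Usage, with integer timestamps in seconds.
queue = DedupQueue.new
report = Schedule.new("hourly_report", 3600, 0)
report.tick(queue, 0, crash_before_advance: true)  # enqueued, not advanced
report.tick(queue, 3600)                           # re-submits run 0, then run 3600
```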

Storage, Schema
• Another core technology for Treasure Data service
• High performance, schema on read, less cost
• columnar file format
• high throughput & high concurrency
• compression
• Less schema management
• for customers

PlazmaDB
http://www.slideshare.net/treasure-data/td-techplazma

PlazmaDB
• Distributed database using RDBMS & Distributed FS
• metadata on RDBMS, data chunks on DFS
• Amazon RDS(PostgreSQL) + Amazon S3 / Riak CS
• High throughput & high availability by S3
• Columnar format based on MessagePack
• time based chunking for time series data
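
Time-based chunking with a columnar layout can be sketched as follows. This is an illustration only, not PlazmaDB's format: JSON stands in for MessagePack so the example needs nothing beyond the standard library, and the one-hour chunk width, the chunk key format, and the helper names are assumptions. The `key` and `start` fields play the role of the metadata kept in the RDBMS, and `body` plays the role of the chunk stored on the distributed file system.

```ruby
require "json"

# Hypothetical sketch of time-based chunking with a columnar layout.
# JSON stands in for MessagePack; the chunk key format is invented.

CHUNK_SECONDS = 3600  # assumed chunk width: one hour

# Group row-oriented records by hour, then pivot each group into columns.
def build_chunks(records)
  records.group_by { |r| r.fetch("time") / CHUNK_SECONDS * CHUNK_SECONDS }
         .map do |chunk_start, rows|
    names = rows.flat_map(&:keys).uniq
    # Schema on read: a column missing from a row is stored as nil.
    columns = names.to_h { |name| [name, rows.map { |row| row[name] }] }
    { key: "chunk-#{chunk_start}", start: chunk_start, body: JSON.generate(columns) }
  end
end

# Read one column for the time range [from, to), skipping chunks whose
# time span does not overlap the range.
def read_column(chunks, name, from, to)
  chunks
    .select { |c| c[:start] < to && c[:start] + CHUNK_SECONDS > from }
    .flat_map do |c|
      cols = JSON.parse(c[:body])
      cols.fetch("time").zip(cols.fetch(name, []))
          .select { |t, _| t >= from && t < to }
          .map { |_, v| v }
    end
end

# Usage: three records with different fields, spread over two hours.
records = [
  { "time" => 10,   "path" => "/a" },
  { "time" => 20,   "path" => "/b", "code" => 200 },
  { "time" => 3700, "path" => "/c", "code" => 404 }
]
chunks = build_chunks(records)
codes  = read_column(chunks, "code", 0, 3600)
```

A query over a time range only needs to open the chunks whose time span overlaps it, and only the requested column has to be extracted from each chunk.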

Monitoring
• Using DataDog for internal operations
• Monitoring required for our customers:
• How many records are they importing?
• How many jobs are they executing?
• How many threads/processes is a job consuming?

PerfectMonitor
• Is still under construction :P
• Fluentd based metrics collection
• Detailed metrics for real time, summarized metrics for the past
• Real-time metric storage using InfluxDB
• Historic metric storage using Treasure Data
• Real-time data series are disposable :D
• Potential next OSS product from Treasure Data
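
The "detailed for real time, summarized for the past" split can be illustrated with a small rollup. PerfectMonitor is described as unfinished, so this is not its design: the `Point` struct, the helper names, and the window sizes are assumptions.

```ruby
# Hypothetical sketch: keep detailed points for recent data and roll
# older points up into per-bucket summaries. Window sizes are assumptions.

Point = Struct.new(:time, :value)

# Collapse points into one summary per `bucket` seconds.
def summarize(points, bucket)
  points.group_by { |pt| pt.time / bucket * bucket }.map do |start, group|
    values = group.map(&:value)
    { time: start, count: values.size, min: values.min, max: values.max,
      avg: values.sum.to_f / values.size }
  end
end

# Split points into a detailed (recent) set and a summarized (older) set.
def split_by_age(points, now, keep_detailed_for:, bucket:)
  recent, old = points.partition { |pt| pt.time > now - keep_detailed_for }
  { realtime: recent, historic: summarize(old, bucket) }
end

# Usage: keep the last 60 seconds detailed, summarize the rest per minute.
points = [Point.new(0, 2), Point.new(30, 4), Point.new(60, 10), Point.new(290, 7)]
result = split_by_age(points, 300, keep_detailed_for: 60, bucket: 60)
```

Once the summaries are stored, the detailed series can be discarded, which matches the "disposable" real-time data mentioned above.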

For Further improvement
• More performance for more customers
• Dynamic scaling for better performance and lower cost
• New analytics features for brand new experience

"Done is better than Perfect."

We'll improve our code step by step, along with improvements in Ruby and the developer community <3
Thanks!
