17. Data Analytics Platform
• Data collection, storage: Ruby(OSS), Java/JRuby(OSS)
• Console & API endpoints: Ruby(RoR)
• Schema management: Ruby/Java (MessagePack)
• Processing (batch, query, ...): Java(Hadoop,Presto)
• Queuing & Scheduling: Ruby(OSS)
• Data connector/exporter: Java, Java/JRuby(OSS)
18. Treasure Data Architecture: Overview
[Diagram] Components: Console, API, EventCollector, PlazmaDB, Worker, Scheduler, Hadoop Cluster, Presto Cluster, DataConnector. External: USERS, TD SDKs, SERVERS, CUSTOMER's SYSTEMS.
19. OSS products
• To make logging easier and simpler than ever!
• Plugin system
• Open development
• For various environments and use cases
• Fluentd, Fluent-Bit, Embulk
• Fluent-Bit: Data collector for Embedded Linux
http://fluentbit.io/
22. Bulk Data Loader
High Throughput & Reliability
Embulk
Written in Java/JRuby
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
http://www.embulk.org/
25. Console/API
• RoR + AWS RDS + AngularJS
• on EC2 (API) and Heroku (Console)
• Operation, Configuration & Managing Data
27. Collecting Data
• Import over Console/API
• From browsers and CLI (TD toolbelt)
• Treasure Agent (rpm/deb)
• Fluentd packaged by Treasure Data
• Post from JavaScript/iOS/Android SDK
• To EventCollector (HTTP endpoint for SDKs, implemented with Fluentd)
29. DataConnector
• Bulk data loader for various data sources
• Loads customers' data into Treasure Data
• S3, Redshift, MySQL, PostgreSQL, Salesforce, ...
• Hosted Embulk
• Needs a lot of computing resources
• Distributed execution on Hadoop MapReduce
31. Hadoop, Presto clusters
• Several Hadoop/Presto clusters
• We use the OSS products themselves, not customized versions
• with minimal patches for storage I/O
33. Queue/Worker, Scheduler
• Treasure Data: multi-tenant data analytics service
• executes many jobs in shared clusters (queries, imports, ...)
• CORE: queues-workers & schedulers
• Clusters have their own queues/schedulers, but these are not enough
• resource limits for each price plan
• priority queues for job types
• and many others
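The two policies named above (per-plan resource limits and priorities by job type) can be sketched as a small job-selection function. This is only an illustration: the plan names, limit values, priority order, and the `Job` and `pick_next` names are all hypothetical and do not come from Treasure Data's implementation.

```python
from dataclasses import dataclass

# Hypothetical plan limits and job-type priorities, for illustration only.
PLAN_MAX_RUNNING = {"free": 1, "premium": 4}
TYPE_PRIORITY = {"query": 0, "import": 1}  # lower value is served first


@dataclass
class Job:
    job_id: int
    account: str
    plan: str
    job_type: str


def pick_next(pending, running_per_account):
    """Return the next runnable job, or None if every account is at its limit."""
    # Order by job-type priority first, then by submission order (job_id).
    ordered = sorted(pending, key=lambda j: (TYPE_PRIORITY[j.job_type], j.job_id))
    for job in ordered:
        running = running_per_account.get(job.account, 0)
        if running < PLAN_MAX_RUNNING[job.plan]:
            return job
    return None


pending = [
    Job(1, "acme", "free", "import"),
    Job(2, "acme", "free", "query"),
    Job(3, "globex", "premium", "import"),
]
# "acme" already runs one job, which is the limit of its (hypothetical) free plan,
# so its higher-priority query is skipped in favor of the "globex" import.
chosen = pick_next(pending, {"acme": 1})
```

A cluster-level scheduler typically knows nothing about accounts or price plans, which is one plausible reason this kind of logic has to live in a layer in front of the clusters.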
35. PerfectQueue
• Highly available distributed queue using RDBMS
• Written in CRuby
• Enqueue by INSERT INTO
• Dequeue/Commit by UPDATE
• Flexible scheduling rather than scalability
• Using Amazon RDS (MySQL) internally
• + Workers on EC2
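The "enqueue by INSERT, dequeue/commit by UPDATE" idea can be sketched with an in-memory SQLite database. This is a minimal sketch of the general technique, not PerfectQueue's actual schema or API: the `tasks` table, its columns, the lease length, and the function names are assumptions made for this example.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tasks (
        id       INTEGER PRIMARY KEY,
        payload  TEXT NOT NULL,
        owner    TEXT,                       -- worker currently holding the task
        timeout  REAL NOT NULL DEFAULT 0,    -- time after which the task may be retried
        finished INTEGER NOT NULL DEFAULT 0
    )
    """
)


def enqueue(payload):
    """Enqueue is a plain INSERT."""
    with conn:
        cur = conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def dequeue(worker, now=None, lease_seconds=30.0):
    """Claim one available task with a conditional UPDATE; return (id, payload) or None."""
    now = time.time() if now is None else now
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM tasks "
            "WHERE finished = 0 AND timeout <= ? ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        cur = conn.execute(
            "UPDATE tasks SET owner = ?, timeout = ? "
            "WHERE id = ? AND finished = 0 AND timeout <= ?",
            (worker, now + lease_seconds, row[0], now),
        )
        # rowcount == 0 would mean another worker claimed the task first.
        return row if cur.rowcount == 1 else None


def commit(task_id, worker):
    """Commit is an UPDATE that only succeeds for the current owner."""
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET finished = 1 WHERE id = ? AND owner = ?",
            (task_id, worker),
        )
    return cur.rowcount == 1


task_id = enqueue("SELECT COUNT(*) FROM www_access")
first = dequeue("worker-1", now=100.0)    # worker-1 claims the task
second = dequeue("worker-2", now=101.0)   # nothing available: the lease is still held
retried = dequeue("worker-2", now=200.0)  # lease expired, so the task is handed out again
done = commit(task_id, "worker-2")
```

Because availability is decided by ordinary SQL conditions, extra scheduling rules can be expressed as additional WHERE clauses, which fits the "flexible scheduling rather than scalability" trade-off.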
37. PerfectSched
• Highly available distributed scheduler using RDBMS
• Written in CRuby
• At-least-once semantics
• PerfectSched enqueues jobs into PerfectQueue
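One common way to get this delivery guarantee is to enqueue first and advance the schedule only afterwards, so that a crash in between causes a repeat rather than a loss. The sketch below illustrates that ordering under stated assumptions: the schedule dictionary layout, the in-memory list standing in for the queue, and the `fail_before_advance` switch are hypothetical and are not PerfectSched's implementation.

```python
def run_due_schedules(schedules, queue, now, fail_before_advance=False):
    """Enqueue every due run, advancing each schedule only after the enqueue."""
    for sched in schedules:
        while sched["next_time"] <= now:
            # 1) Enqueue first. The (name, scheduled time) pair acts as a job key.
            queue.append((sched["name"], sched["next_time"]))
            if fail_before_advance:
                raise RuntimeError("simulated crash between enqueue and advance")
            # 2) Advance afterwards. A crash before this line repeats the enqueue.
            sched["next_time"] += sched["interval"]


schedules = [{"name": "hourly_report", "next_time": 3600, "interval": 3600}]
queue = []

try:
    run_due_schedules(schedules, queue, now=3600, fail_before_advance=True)
except RuntimeError:
    pass  # the scheduler "restarts" and retries the same run below

run_due_schedules(schedules, queue, now=3600)
# The same run now appears twice in the queue: delivered at least once, never zero times.
```

With this ordering the consumer has to tolerate duplicates, for example by treating the job key as unique on the queue side.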
38. Storage, Schema
• Another core technology for Treasure Data service
• High performance, schema on read, lower cost
• columnar file format
• high throughput & high concurrency
• compression
• Less schema management
• for customers
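As a rough illustration of "schema on read" combined with a columnar layout, the sketch below stores schemaless records column by column and applies a type only when a column is read. The function names and sample records are hypothetical, and plain Python lists stand in for the real file format.

```python
def to_columns(records):
    """Pivot schemaless row records into a column-oriented layout."""
    names = sorted({key for rec in records for key in rec})
    # Rows that lack a column simply get None in that column.
    return {name: [rec.get(name) for rec in records] for name in names}


def read_column(columns, name, cast):
    """Apply a type at read time; missing or uncastable values become None."""
    out = []
    for value in columns.get(name, []):
        try:
            out.append(cast(value) if value is not None else None)
        except (TypeError, ValueError):
            out.append(None)
    return out


records = [
    {"time": 1, "path": "/", "status": "200"},
    {"time": 2, "path": "/login", "status": 404, "user": "alice"},
]
columns = to_columns(records)
statuses = read_column(columns, "status", int)
```

Nothing had to be declared before the second record introduced a `user` field, which is the sense in which customers get less schema management.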
41. PlazmaDB
• Distributed database using RDBMS & Distributed FS
• metadata on RDBMS, data chunks on DFS
• Amazon RDS(PostgreSQL) + Amazon S3 / Riak CS
• High throughput & high availability by S3
• Columnar format based on MessagePack
• time based chunking for time series data
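The split between metadata and data chunks, together with time-based chunking, can be sketched as follows. This is a toy model under stated assumptions: a list of dictionaries stands in for the RDBMS metadata table, a dictionary stands in for the distributed file system or object store, and the one-hour chunk width and key naming are hypothetical.

```python
from collections import defaultdict

CHUNK_SECONDS = 3600  # hypothetical chunk width: one hour


def build_chunks(records):
    """Group records into hourly chunks; return (metadata, storage)."""
    grouped = defaultdict(list)
    for rec in records:
        start = rec["time"] - rec["time"] % CHUNK_SECONDS
        grouped[start].append(rec)
    storage = {}   # stands in for the distributed file system / object store
    metadata = []  # stands in for the RDBMS table that indexes the chunks
    for start in sorted(grouped):
        key = f"chunk-{start}"
        storage[key] = grouped[start]
        metadata.append({"key": key, "start": start, "end": start + CHUNK_SECONDS})
    return metadata, storage


def scan(metadata, storage, time_from, time_to):
    """Read only chunks overlapping [time_from, time_to), then filter rows."""
    rows = []
    for meta in metadata:
        if meta["start"] < time_to and meta["end"] > time_from:
            chunk = storage[meta["key"]]
            rows.extend(r for r in chunk if time_from <= r["time"] < time_to)
    return rows


records = [{"time": 10, "v": "a"}, {"time": 3700, "v": "b"}, {"time": 7300, "v": "c"}]
metadata, storage = build_chunks(records)
hits = scan(metadata, storage, 3600, 7200)
```

A query with a time range consults the small metadata index first and touches only the matching chunks, which is why time-based chunking suits time series data.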
42. Monitoring
• Using DataDog for internal operations
• Per-customer monitoring is also required:
• How many records are they importing?
• How many jobs are they executing?
• How many threads/processes is a job consuming?
44. PerfectMonitor
• Is still under construction :P
• Fluentd based metrics collection
• Detailed metrics for real-time data, summarized metrics for past data
• Real-time metric storage using InfluxDB
• Historic metric storage using Treasure Data
• Real-time data series are disposable :D
• Potential next OSS product from Treasure Data
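The "detailed for real-time, summarized for past" split amounts to rolling fine-grained points up into coarser buckets before the detailed series is discarded. The sketch below shows one such rollup, assuming points are `(unix_timestamp, value)` pairs; the function name, bucket width, and summary fields are hypothetical and no InfluxDB or Treasure Data API is used.

```python
from collections import defaultdict


def summarize(points, bucket_seconds=3600):
    """Roll detailed (timestamp, value) points up into per-bucket summaries."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "count": len(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Detailed real-time points: two in the first hour, one in the second.
points = [(0, 2.0), (60, 4.0), (3600, 10.0)]
hourly = summarize(points)
```

Once the summaries are stored, the detailed series can be dropped, which matches the remark that real-time data series are disposable.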
45. For further improvement
• More performance for more customers
• Dynamic scaling for better performance and lower cost
• New analytics features for brand new experience