4. Company Overview
Silicon Valley-based Company
• All Founders are Japanese
• Hironobu Yoshikawa
• Kazuki Ohta
• Sadayuki Furuhashi
OSS Enthusiasts
• MessagePack, Fluentd, etc.
Friday, April 5, 13
5. Investors
Bill Tai
Naren Gupta - Nexus Ventures, Director of Red Hat, TIBCO
Othman Laraki - Former VP Growth at Twitter
James Lindenbaum, Adam Wiggins, Orion Henry - Heroku Founders
Anand Babu Periasamy, Hitesh Chellani - Gluster Founders
Yukihiro “Matz” Matsumoto - Creator of Ruby
Dan Scheinman - Director of Arista Networks
Jerry Yang - Founder of Yahoo!
• and.... + 10 more people
7. Why Cloud? ‘Time’ is Money
[Chart: Value over Time, starting at sign-up or PO]
• Ideal: Customer Expectation
• Reality (On-Premise): HW/SW Selection, PoC, Deploy..., then Obsolete over time until Upgrade
8. Big Data Adoption Stages
Intelligence Sophistication (highest first)
Analytics
• Optimization: What’s the best?
• Predictive Analysis: What’s a trend?
• Statistical Analysis: Why?
Reporting (Treasure Data’s FOCUS, 80% of needs)
• Alerts: Error?
• Drill Down Query: Where exactly?
• Ad-hoc Reports: Where?
• Standard Reports: What happened?
9. Full Stack Support for Big Data Reporting
• Our best-in-class architecture and operations team ensure the integrity and availability of your data.
• Data from almost any source can be securely and reliably uploaded using td-agent in streaming or batch mode.
• Our SQL, REST, JDBC, ODBC and command-line interfaces support all major query tools and approaches.
• You can store gigabytes to petabytes of data efficiently and securely in our cloud-based columnar datastore.
13. Treasure Data = Collect + Store + Query
14. Example in AdTech: MobFox
1. Europe’s largest independent mobile ad exchange.
2. 20 billion imps/month (circa Jan. 2013)
3. Serving ads for 15,000+ mobile apps (circa Jan. 2013)
4. Needed Big Data Analytics infrastructure ASAP.
16. Used AWS Products (1)
RDS
• Store user information, job status, etc...
• Store metadata of our columnar database
• Worker queue (perfectqueue / perfectsched)
EC2
• API servers
• Hadoop clusters
• Job workers
• Using Chef to deploy
17. Used AWS Products (2)
ELB
• Load balancing of API servers
• Load balancing of td-agents
S3
• Columnar storage built on top of S3
• MessagePack columnar format
• realtime / archive storage
• Our Result feature supports S3 output.
No EMR, SQS, or other products!
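The idea behind a columnar layout can be sketched in a few lines. This is only an illustration of pivoting row records into per-column lists; it is not Treasure Data's actual storage format, the MessagePack serialization and S3 upload steps are omitted, and the function name `rows_to_columns` and the sample records are hypothetical.

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists (illustrative only)."""
    # Collect column names in first-seen order.
    keys = []
    for row in rows:
        for key in row:
            if key not in keys:
                keys.append(key)
    # Build one list per column; None marks a field missing from a row.
    columns = {key: [] for key in keys}
    for row in rows:
        for key in keys:
            columns[key].append(row.get(key))
    return columns


# Hypothetical sample records; the second one has no "value" field.
rows = [
    {"user": 54, "name": "test", "value": 120},
    {"user": 55, "name": "demo"},
]
columns = rows_to_columns(rows)
```

Storing each column contiguously is what lets a query that touches only `user` skip the bytes for `name` and `value`.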
18. Architecture Breakdown
Data Collection
• Increasing variety of data sources
• No single data schema
• Lack of streaming data collection method
• 60% of Big Data project resource consumed
Data Store/Analytics
• Remaining complexity in both traditional DWH and Hadoop (very slow time to market)
• Challenges in scaling data volume and expanding cost.
Connectivity
• Required to ensure connectivity with existing BI/visualization/apps by JDBC, REST and ODBC.
• Output to other services, e.g. S3, RDBMS, etc.
19. 1) Data Collection
60% of BI project resources are consumed here
Most ‘underestimated’ and ‘unsexy’ but MOST important
Fluentd: OSS lightweight but robust Log Collector
• http://fluentd.org/
20. Fluentd
the missing log collector
fluentd.org
21. In short
Open sourced log collector written in Ruby
Using rubygems ecosystem for plugins
It’s like syslogd, but uses JSON for log messages
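To make the JSON point concrete, here is a minimal sketch of a structured log event. Fluentd events consist of a tag, a time, and a record; the helper `to_event`, the tag `apache.access`, and the field values below are hypothetical and chosen only for illustration.

```python
import json


def to_event(tag, timestamp, record):
    """Bundle a tag, a Unix timestamp, and a JSON-serializable record."""
    return {"tag": tag, "time": timestamp, "record": record}


# Hypothetical access-log entry expressed as structured data
# instead of a free-form syslog text line.
event = to_event(
    "apache.access",
    1365120000,  # example Unix timestamp
    {"method": "GET", "path": "/index.html", "code": 200},
)
line = json.dumps(event, sort_keys=True)
```

Because the record is already structured, a downstream consumer can read `code` directly rather than parsing it out of a text line with a regular expression.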
24. Before Fluentd
[Diagram: Applications on Server1, Server2, Server3, ・・・ send their logs to a Log Server. High Latency! must wait for a day...]
25. After Fluentd
[Diagram: each Application on Server1, Server2, Server3, ・・・ has a local Fluentd, which forwards logs in streaming to downstream Fluentd nodes.]
27. td-agent
Open sourced distribution package of fluentd
ETL part of Treasure Data
Including useful components
• ruby, jemalloc, fluentd
• 3rd party gems: td, mongo, webhdfs, etc...
• td plugin is for Treasure Data
http://packages.treasure-data.com/
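As a rough sketch of how an application might hand an event to a local td-agent, the snippet below builds an HTTP POST in the form accepted by Fluentd's HTTP input plugin (the tag in the URL path, the record in a `json` form field). It assumes that plugin is enabled in the agent's configuration; the host, port 8888, and the tag `td.testdb.www_access` are assumptions for illustration, and the request is only built here, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_event_request(record, tag, host="localhost", port=8888):
    """Build (but do not send) a POST carrying one event for an HTTP input.

    host, port, and tag are assumptions; adjust them to the agent's config.
    """
    body = urllib.parse.urlencode({"json": json.dumps(record)}).encode("utf-8")
    url = "http://{}:{}/{}".format(host, port, tag)
    return urllib.request.Request(url, data=body, method="POST")


# Hypothetical record and tag.
request = build_event_request({"action": "login", "user": 2}, "td.testdb.www_access")
# urllib.request.urlopen(request)  # uncomment to send to a running agent
```

Keeping request construction separate from sending makes the sketch easy to inspect without a running agent.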
28. Treasure Data Service Architecture
[Diagram]
• Sources: Apache, App, App, RDBMS, other data sources
• td-agent (This!) uploads to the Treasure Data columnar data warehouse
• Query Processing Cluster: MAPREDUCE JOBS, HIVE, PIG (to be supported)
• Query API: td-command, JDBC, REST
• Consumers: User, BI apps
30. 2) Data Store / Analytics - Columnar Storage
31. Treasure Data Service Processing Flow
[Diagram]
• Frontend, Job Queue, Worker, Hadoop: applications push metrics to Fluentd (via local Fluentd)
• Fluentd: sums up data minutes (partial aggregation)
• Treasure Data: for historical analysis
• Librato Metrics: for realtime analysis
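The partial aggregation step can be sketched as follows. This is a minimal illustration that assumes the aggregation is a per-minute sum of metric values; the function `aggregate_by_minute`, the metric name, and the sample events are hypothetical and do not reflect the actual Fluentd plugin used.

```python
from collections import defaultdict


def aggregate_by_minute(events):
    """Sum metric values per (minute start, metric name) bucket."""
    totals = defaultdict(int)
    for timestamp, name, value in events:
        minute = timestamp - (timestamp % 60)  # floor to the start of the minute
        totals[(minute, name)] += value
    return dict(totals)


# Hypothetical (unix_time, metric, value) tuples pushed by applications.
events = [
    (1365120001, "jobs.submitted", 2),
    (1365120030, "jobs.submitted", 3),
    (1365120065, "jobs.submitted", 1),
]
totals = aggregate_by_minute(events)
```

Aggregating before forwarding reduces the number of data points sent downstream, which suits a realtime metrics service, while the raw events can still go to long-term storage for historical analysis.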
39. Data first, Schema later
SELECT:           54 (int)   "test" (string)   120 (int)   NULL
Schema:           user:int   name:string       value:int   host:int
Raw data (JSON):  {"user":54, "name":"test", "value":"120", "host":"local"}
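The slide's example can be reproduced with a small schema-on-read sketch: raw JSON is stored untouched, and the schema is applied only when the data is read. The casting rules below (convert when possible, otherwise NULL) are an assumption inferred from the slide, and `apply_schema` and `CASTS` are hypothetical names.

```python
import json

# Assumed mapping from schema type names to Python casts.
CASTS = {"int": int, "string": str}


def apply_schema(raw_json, schema):
    """Cast each column at read time; values that cannot be cast become None."""
    record = json.loads(raw_json)
    row = []
    for column, type_name in schema:
        cast = CASTS[type_name]
        try:
            row.append(cast(record[column]))
        except (KeyError, ValueError, TypeError):
            row.append(None)  # plays the role of SQL NULL
    return row


raw = '{"user":54, "name":"test", "value":"120", "host":"local"}'
schema = [("user", "int"), ("name", "string"), ("value", "int"), ("host", "int")]
row = apply_schema(raw, schema)
```

Here the string `"120"` is read as the integer 120, while `"local"` cannot be read as an integer and comes back as NULL, matching the SELECT row on the slide.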
40. 3) Connectivity
[Diagram]
• Clients: REST API, td-command, JDBC, ODBC Driver (BI apps)
• Queries go through the Query API to the Query Processing Cluster
• Data is read from Treasure Data Columnar Storage
• Result output: Web App, MySQL, S3, …
41. Multi-Tenancy
All customers share the Hadoop clusters (Multi Data Centers)
Resource Sharing (Burst Cores), Rapid Improvement, Ease of Upgrade
[Diagram]
• Job Submission + Plan Change go to the Global Scheduler
• The Global Scheduler dispatches to a Local FairScheduler in each of datacenter A, datacenter B, datacenter C, and datacenter D
• On-Demand Resource Allocation
42. Conclusion
Treasure Data
• Cloud-based Big Data analytics platform
• Provides a Machete for Big Data reporting
Big Data processing
• Collect / Store / Analytics / Visualization
Our focus!
AWS products we use
• EC2, S3, RDS, ELB
• Building Treasure Data specific systems on AWS
43. Big Data for the Rest of Us
www.treasure-data.com | @TreasureData