1. Collecting app metrics
in decentralized systems
Decision making based on facts
Sadayuki Furuhashi
Treasure Data, Inc.
Founder & Software Architect
Fluentd meetup #3
5. Treasure Data Service Architecture
[Architecture diagram: apps and other data sources (RDBMS, etc.) send data through the open-sourced td-agent into Treasure Data's columnar data warehouse; query processing runs as Apache Hadoop MapReduce jobs via Hive (Pig to be supported); users reach the cluster through the td-command tool, the REST API, JDBC, and BI apps]
6. Example Use Case – MySQL to TD
[Diagram: hundreds of app servers run Rails apps that write logs to text files; a nightly batch INSERTs the logs into MySQL; daily/hourly jobs compute KPIs and feed a Google Spreadsheet for rankings and visualization feedback]
- Limited scalability
- Fixed schema
- Not realtime
- Unexpected INSERT latency
7. Example Use Case – MySQL to TD
[Diagram: hundreds of app servers run Rails apps with a local td-agent that sends event logs to Treasure Data, where logs are available for querying after several minutes; daily/hourly batch jobs feed MySQL and a Google Spreadsheet for KPI rankings and visualization feedback]
- Unlimited scalability
- Flexible schema
- Realtime
- Less performance impact
8. What’s Treasure Data?
Key differentiators:
> TD delivers BigData analytics
> in days, not months
> without specialists or IT resources
> for 1/10th the cost of the alternatives
Why? Because it’s a multi-tenant service.
9. Problem 1:
investigating problems took time
Customers need support...
> “I uploaded data but can’t query it”
> “Downloading query results takes time”
> “Our queries have been taking longer recently”
10. Problem 1:
investigating problems took time
Investigating these problems took time
because:
doubts.count.times {
  servers.count.times {
    # ssh to a server
    # grep logs
  }
}
11. The actual facts
> Data were actually not uploaded
(the clients had a problem: disk full)
We should have monitored uploading so that we would immediately know
we’re not getting data from a user.
> Our servers were getting slower because of increasing
load
We should have noticed it and added servers before the problem hit.
> There was a bug that occurred only under a specific
condition
We should have collected unexpected errors and fixed them as soon as
possible, to save both our time and our users’ time.
12. Problem 2:
many tasks to do but hard to prioritize
We want to do...
> fix bugs
> improve performance
> increase the number of sign-ups
> increase the number of queries by customers
> increase the number of periodic queries
What’s the “bottleneck” which should be
solved first?
13. Problem 2:
many tasks to do but hard to prioritize
We need data to make decisions.
data: Performance is getting worse.
decision: Let’s add servers.
data: Many customers upload data but few customers issue queries.
decision: Let’s improve the documentation.
data: A customer stopped uploading data.
decision: They might have a problem on the client side.
14. How did we solve it?
We collected application metrics.
16. Solution v1:
[Diagram: a central Fluentd pulls metrics from the Frontend, Workers, Job Queue, and Hadoop every minute via the in_exec plugin, then forwards them to Treasure Data for historical analysis and to Librato Metrics for realtime analysis]
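A minimal Fluentd source for this kind of pull-based collection might look like the following; the command path and its tab-separated output format are illustrative assumptions, not the actual configuration:

```
<source>
  type exec
  # hypothetical script that prints one "metric<TAB>value" pair per line
  command /opt/td/bin/collect_metrics.sh
  format tsv
  keys metric,value
  tag metrics.minute
  run_interval 1m
</source>
```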
18. What’s solved
We can monitor the overall behavior of the servers.
We can notice performance degradation.
We can get alerts when a problem occurs.
19. What’s not solved
We can’t get detailed information.
> how much data is “this user” uploading?
The configuration file is complicated.
> we need to add lines to declare new metrics
The monitoring server is a SPOF (single point of failure).
20. Solution v2:
[Diagram: the Frontend, Workers, Job Queue, and Hadoop push metrics to a local Fluentd; Fluentd sums up the data every minute (partial aggregation) and forwards it to Treasure Data for historical analysis and to Librato Metrics for realtime analysis]
21. What’s solved by v2
We can get detailed information directly from
the applications
> graphs for each customer
DRY - we can keep configuration files simple
> Just add one line to apps
> No need to update fluentd.conf
Decentralized streaming aggregation
> partial aggregation on fluentd,
total aggregation on Librato Metrics
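The partial/total aggregation above can be sketched in plain Ruby; the sample format and metric names are illustrative, not the actual MetricSense code:

```ruby
# Sketch of decentralized streaming aggregation: each node pre-sums its
# own samples per (metric, minute), and the central store only adds up
# the per-node partial sums instead of every raw sample.

def partial_aggregate(samples)
  # samples: [[metric_name, minute, value], ...] seen by one node
  samples.each_with_object(Hash.new(0)) do |(metric, minute, value), sums|
    sums[[metric, minute]] += value
  end
end

def total_aggregate(partials)
  # partials: one Hash per node, produced by partial_aggregate
  partials.each_with_object(Hash.new(0)) do |node_sums, totals|
    node_sums.each { |key, sum| totals[key] += sum }
  end
end

node_a = partial_aggregate([["import.size", 7, 10], ["import.size", 7, 15]])
node_b = partial_aggregate([["import.size", 7, 5]])
total_aggregate([node_a, node_b])  # => {["import.size", 7] => 30}
```

Each node ships only one number per metric per minute, so the central aggregator's load grows with the number of nodes, not the number of events.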
24. What did we learn?
> We always have lots of tasks
> we need data to prioritize them.
> Problems are usually complicated
> we need data to save time.
> Adding metrics should be DRY
> otherwise it feels tedious and you stop adding metrics.
> Realtime analysis is useful,
but we still need batch analysis.
> “who is not issuing queries, despite storing data last month?”
> “which pages did users look at before signing up?”
> “which pages did users not look at before running into trouble?”
25. We open sourced
MetricSense
https://github.com/treasure-data/metricsense
26. Components of MetricSense
metricsense.gem
> client library for Ruby to send metrics
fluent-plugin-metricsense
> plugin for Fluentd to collect metrics
> pluggable backends:
> Librato Metrics backend
> RDBMS backend
27. RDB backend for MetricSense
Aggregates metrics on an RDBMS in a form optimized
for time-series data.
> Borrows concepts from OpenTSDB and the
OLAP cube.

metric_tags:
  metric_id  metric_name    segment_name
  1          “import.size”  NULL
  2          “import.size”  “account”

segment_values:
  segment_id  name
  5           “a001”
  6           “a002”

data:
  base_time  metric_id  segment_id  m0  m1  m2  ...  m59
  19:00      1          5           25  31  19  ...  21
  21:00      2          5           75  94  68  ...  72
  21:00      2          6           63  82  55  ...  63
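The mapping from a raw timestamp into this layout (one row per hour, one column per minute) can be sketched as follows; `row_and_column` is a hypothetical helper for illustration, not part of MetricSense:

```ruby
require 'time'

# Map a sample's timestamp to the time-series layout above:
# one row per (base hour, metric, segment), with columns m0..m59
# holding one aggregated value per minute of that hour.
def row_and_column(timestamp)
  epoch     = timestamp.to_i
  base_time = Time.at(epoch - epoch % 3600).utc  # truncate to the hour
  column    = "m#{timestamp.min}"                # minute within the hour
  [base_time, column]
end

t = Time.parse("2012-11-20 21:07:30 UTC")
base, col = row_and_column(t)
# base is 21:00:00 UTC; col is "m7"
```

Packing sixty minute-values into one row keeps an hour of a metric physically contiguous, which is the same idea OpenTSDB uses for its row keys.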
28. Solution v3 (future work):
Alerting using historical data
> simple machine learning to adjust threshold
values
[Graph: metric values plotted against the historical average; an alert fires when a value exceeds the adjusted threshold]
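A minimal sketch of such alerting, assuming the recent samples of a metric are available; a simple 3-sigma threshold stands in for the adaptive thresholds the slide mentions:

```ruby
# Alert when the current value exceeds the historical average by more
# than `sigmas` standard deviations. A stand-in for the adaptive
# thresholds described above; names and numbers are illustrative.
def alert?(history, current, sigmas: 3.0)
  mean      = history.sum.to_f / history.size
  variance  = history.map { |x| (x - mean)**2 }.sum / history.size
  threshold = mean + sigmas * Math.sqrt(variance)
  current > threshold
end

history = [20, 22, 19, 21, 23, 20, 22, 21]
alert?(history, 21)   # => false (normal load)
alert?(history, 80)   # => true  (spike above historical average)
```

Deriving the threshold from the data itself means it tracks each metric's normal level instead of relying on one hand-tuned constant.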
31. Sales Engineer
Evangelize TD/Fluentd. Get everyone excited!
Help customers deploy and maintain TD successfully.
Preferred experience: OS, DB, BI, statistics and data
science
DevOps Engineer
Development, operation and monitoring of our
large-scale, multi-tenant system
Preferred experience: large-scale system development
and management
32. Competitive salary + equity package
Who we want
STRONG business and customer support DNA
Everyone is equally responsible for customer support
Customer success = our success
Self-disciplined and responsible
Be your own manager
Team player with excellent communication skills
Distributed team and global customer base
Contact me: sf@treasure-data.com