Volta, our Logging, Metrics and Monitoring as a Service, provides a scalable logging and metrics service for applications and services across the stack: from low-level networks and core OpenStack services, through platform services, to Symantec products. Volta integrates with Keystone to provide secure authentication and multi-tenancy, which limits the visibility of logs and metrics to specific users or tenants, or to specific services (e.g., only nova or only swift). Volta also supports setting up alerts on log and metric events.
In this session, we will share how we built Volta using battle-tested open source / OpenStack components such as Keystone, Kafka, Storm, ElasticSearch, InfluxDB, Logstash, Kibana, and Grafana. We will also present our Keystone-based authentication and multi-tenancy model and how it is implemented to limit the visibility of logs and metrics for queries and alerts.
Logging, Metrics and Monitoring as a Service Architecture
1. Volta: Logging, Metrics and Monitoring as a Service
LN Renganarayana
Technical Director / Architect
Cloud Platform Engineering
ln_renganarayana@symantec.com
twitter: @lrengan
Jan 7, 2015, Volta / Cloud Platform Engineering, Symantec
2. Outline
• Motivation: data and events are the foundation of business
• Why build a (new) Service?
• What have we built: a (near) real-time data analytics pipeline
• The journey and lessons learned
• Looking ahead: Volta next gen
3. Data and events : the foundation
Picture: “Devops with S for sharing”, Patrick Debois
• which features to build?
• what is a good pricing model?
• how fast can I build?
• what is the perf of my code?
• how is the service?
• what is my capacity?
• what is my current usage?
4. Why build a (new) service?
5. Why build a service?
Picture: Jim Nisbet & Philip O’Toole, AWS re:Invent 2013 Loggly presentation
6. Single place for events across the stack
• Stack layers: Bare Metal, IaaS (OpenStack), Platform Services (BP, SP, KV, OBS), Symantec Services & Apps
• Common Services: Volta, Identity Manager, CI / CD
7. Volta : Design Goals
• Design for both Developers and Ops
– make it extremely simple to capture events
– provide powerful search and visualization tools
• Secure, multi-tenant: we are Symantec, so security comes first
• Scalable: elastically scale with load
• Highly available: Volta is the eyes & ears of Operations
• One system for logs, metrics, monitoring & other events
• Build using open source tools and for open sourcing
8. What we have built ...
A (near) real-time data analytics pipeline
9. Volta Client View
Diagram: clients sending events to Volta
• App / Platform Services: write app metrics directly (push metrics)
• Infrastructure: SNMP, Vars and JMX expose metrics (pull metrics)
• Volta Shipper on the VM: ships logs
• Volta: receives metrics and log events; provides the Alerts & Config UI
• Push: StatsD, metrics extension for OpenStack
• Pull: CollectD
• Shipper: Logstash, moving to Heka
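As an illustration of the push path, the following is a minimal sketch of an application emitting a StatsD counter over UDP. The metric name, host and port are placeholders chosen for this example, not values from the Volta deployment; only the StatsD line format (name:value|type) is standard.

```python
import socket


def format_statsd(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD line: <name>:<value>|<type> (c=counter, g=gauge, ms=timer)."""
    if metric_type not in {"c", "g", "ms"}:
        raise ValueError(f"unsupported StatsD type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; host and port are hypothetical defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Only formats the line; call send_metric(...) to actually push it.
    print(format_statsd("myapp.requests", 1, "c"))
```

Because UDP is connectionless, a push like this does not block the application if the collector is down, which is one reason StatsD-style pushes are cheap to add to application code.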
11. The Ingest Pipeline
Diagram: Client App / Service → log & metrics shipper → VIP → Kafka cluster (knode1, knode2, knode3, ... knodeN), carrying log, metric & alert events
• Kafka: replicated, fault-tolerant, persistent message queue
• LogTopic, MetricTopic, AlertTopic
• each topic is split into partitions
• per-topic retention policy
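The topic and partition layout above can be sketched as a small routing function. This is an assumption-laden illustration: the slides name the three topics but do not say how partition keys are chosen, so keying by tenant id and hashing with CRC32 are both hypothetical stand-ins for whatever the real producer does.

```python
import zlib
from typing import Tuple

# Topic names come from the slide; the event-type keys are illustrative.
TOPICS = {"log": "LogTopic", "metric": "MetricTopic", "alert": "AlertTopic"}


def route(event_type: str, tenant_id: str, num_partitions: int) -> Tuple[str, int]:
    """Pick a topic by event type and a partition by hashing the tenant id."""
    topic = TOPICS[event_type]
    # Assumption: tenant id is the partition key; CRC32 gives a stable hash.
    partition = zlib.crc32(tenant_id.encode("utf-8")) % num_partitions
    return topic, partition
```

Keying by tenant would keep one tenant's events ordered within a partition, at the cost of uneven load if a few tenants dominate traffic.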
12. Event processing and storage
Diagram: log, metric & alert events flow into the Storm cluster (Authentication, Validation, Alerts Processing; Quota & Policy)
• Redis: alert rules; [tenantid, apikey] pairs
• Alerts: email & callbacks
• LogStore (ElasticSearch cluster)
– Per tenant per day index
– Index typed fields
– Quota and retention policy
• MetricsStore (InfluxDB cluster)
– Tenant id prefixed time series names
– Continuous queries do rollups
– Retention policy through rollups
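The two storage conventions above (one log index per tenant per day, and tenant-prefixed metric series names) can be sketched as follows. The exact separators and date format are assumptions made for illustration; the slides state only the convention, not the literal naming.

```python
from datetime import datetime, timezone


def log_index_name(tenant_id: str, timestamp: str) -> str:
    """Hypothetical per-tenant, per-day index name: <tenant>-YYYY.MM.DD."""
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return f"{tenant_id}-{ts:%Y.%m.%d}"


def tenant_series_name(tenant_id: str, metric: str) -> str:
    """Hypothetical tenant-prefixed series name: <tenant>.<metric>."""
    return f"{tenant_id}.{metric}"
```

With daily indices, retention and quota enforcement can work by dropping whole indices rather than deleting individual documents.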
13. Multi-tenancy Proxy & UI
Diagram: Load Balancer → Front End Cluster (s1, s2, s3, s4, ... sn): multi-tenancy and Kibana, Grafana proxies, backed by Keystone, Redis, the LogStore (ElasticSearch cluster) and the MetricsStore (InfluxDB cluster)
• Intercepts and rewrites queries to ES and InfluxDB
• Enforces multi-tenancy (visibility of events to users)
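A minimal sketch of the kind of query rewriting a multi-tenancy proxy could perform on a search request body before forwarding it. This is not Volta's implementation: the function name is hypothetical, and the bool/filter wrapper follows current Elasticsearch query DSL rather than whatever version was deployed at the time.

```python
import copy
from typing import Any, Dict, List


def restrict_to_tenants(query_body: Dict[str, Any], tenant_ids: List[str]) -> Dict[str, Any]:
    """Wrap the caller's query so it can only match events of visible tenants."""
    if not tenant_ids:
        raise PermissionError("user has no visible tenants")
    rewritten = copy.deepcopy(query_body)  # never mutate the caller's request
    original = rewritten.get("query", {"match_all": {}})
    rewritten["query"] = {
        "bool": {
            "must": [original],
            # Assumes log events carry a tenant_id field, as the slides describe.
            "filter": [{"terms": {"tenant_id": sorted(tenant_ids)}}],
        }
    }
    return rewritten
```

Because the filter is added by the proxy rather than the client, a user cannot widen visibility by crafting the request; everything outside the "query" key (size, sort, aggregations) passes through unchanged.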
14. Security and Multi-tenancy model
• Authentication with Keystone backed by LDAP
– user authentication for Query API and UI
• Multi tenancy with users and groups
– Events have tenant id and apikey
• Cross tenant correlation
– group membership used for cross-tenant event visibility / correlation
• Dashboard sharing
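One way to picture group-based cross-tenant visibility is as a set union: a user sees their own tenant plus every tenant shared with a group they belong to. The data structures below are hypothetical; the slides say only that Keystone group membership drives cross-tenant visibility.

```python
from typing import Dict, Set


def visible_tenants(own_tenant: str, user_groups: Set[str],
                    group_tenants: Dict[str, Set[str]]) -> Set[str]:
    """Tenants whose events a user may query (illustrative model only)."""
    visible = {own_tenant}
    for group in user_groups:
        # Unknown groups grant nothing.
        visible |= group_tenants.get(group, set())
    return visible
```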
15. Retention Policy : Log Events
• ElasticSearch allows powerful querying, but it comes at a cost
– Store only logs that help you operate and troubleshoot
– Use appropriate debug levels (not INFO)
• Fixed quota: 350 GB or 500 GB
• When a tenant reaches its quota limit, Volta deletes 20% of old logs to free up space
• Through wise use of quota you can retain logs for many days
• Volta can retain logs for a longer duration for special tenants who need to store them for compliance / audit
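The quota rule can be sketched as a selection over per-day indices, oldest first. This is one reading of "delete 20% of old logs": it assumes whole daily indices are dropped until at least 20% of the tenant's stored volume is freed, and it assumes index names sort chronologically. Neither detail is confirmed by the slides.

```python
from typing import Dict, List


def indices_to_delete(index_sizes_gb: Dict[str, float], quota_gb: float,
                      free_fraction: float = 0.20) -> List[str]:
    """Return the oldest indices to drop once a tenant has hit its quota."""
    used = sum(index_sizes_gb.values())
    if used < quota_gb:
        return []  # still under quota, nothing to delete
    target = free_fraction * used
    freed = 0.0
    doomed: List[str] = []
    for name in sorted(index_sizes_gb):  # date-stamped names sort oldest first
        if freed >= target:
            break
        doomed.append(name)
        freed += index_sizes_gb[name]
    return doomed
```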
16. Metric Events: Retention Policy and Rollups
Naming scheme & retention policies
Naming scheme:
host + "." + name + "." + type_if_avail + "." + retention_period
Retention periods: 1 day, 1 week, 1 month, and 3 months
Names for the example:
● default 1 day: lmm-dev-bastion.memory.used_
● 1 week: lmm-dev-bastion.memory.used_1w
● 1 month: lmm-dev-bastion.memory.used_1m
● 3 months: lmm-dev-bastion.memory.used_3m
Rollup precision:
● default 1 day: user defined (highest)
● 1 week: metrics aggregated to 1 minute
● 1 month: metrics aggregated to 5 minutes
● 3 months: metrics aggregated to 1 hour
Sample for metric from collectd:
{
"@version": "1",
"@timestamp": "2014-08-06T19:17:43.000Z",
"host": "lmm-dev-bastion",
"name": "memory",
"collectd_type": "memory",
"type_instance": "used",
"value": 341884928,
"tenant_id": "db5ca8e4c8514fad9f98dbc4d648ee87",
"apikey": "26d85ae3-1e10-4ce4-837a-7a1c8dfc67fb"
}
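The naming and rollup rules can be sketched in a few lines. Two caveats: the written scheme joins the retention period with a dot while the example names use an underscore suffix, and this sketch follows the example names; and the slides do not say which aggregate the rollups compute, so the mean used here is an assumption. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Suffix per retention period, following the slide's example names.
SUFFIX: Dict[str, str] = {"1d": "", "1w": "1w", "1m": "1m", "3m": "3m"}
# Rollup bucket width in seconds: 1 minute, 5 minutes, 1 hour.
BUCKET_SECONDS: Dict[str, int] = {"1w": 60, "1m": 300, "3m": 3600}


def rollup_series_name(host: str, name: str, type_instance: Optional[str],
                       retention: str) -> str:
    """Build a series name such as lmm-dev-bastion.memory.used_1w."""
    parts = [host, name] + ([type_instance] if type_instance else [])
    return ".".join(parts) + "_" + SUFFIX[retention]


def rollup(points: List[Tuple[int, float]], retention: str) -> List[Tuple[int, float]]:
    """Average (epoch_seconds, value) points into fixed-width buckets."""
    width = BUCKET_SECONDS[retention]
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % width].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

The coarser the bucket, the fewer points are stored per series, which is how longer retention stays affordable.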
17. Alerts : Email and Callbacks
• Alerts can be set using the Alert UI or the REST API
• Alerts can be sent by email or posted to a Webhook (REST endpoint)
• Webhook provides a good mechanism for integration with external automation and UIs
• Alerts on Log events
– User specifies an alert template using regular expression to match
– Can match one or more fields from a Log event
– Simple and complex expressions
• Alerts on Metric events
– User specifies an alert template using comparison operators
– Can match one or more fields from the Metric event
– Simple and complex expressions
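The two alert styles can be sketched as small matchers: regular expressions over log fields, and comparison operators over metric fields. The template shapes here are hypothetical, since the slides do not show Volta's actual alert template format.

```python
import operator
import re
from typing import Any, Dict

COMPARATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq,
}


def log_alert_matches(template: Dict[str, str], event: Dict[str, Any]) -> bool:
    """True when every field's regex in the template matches the log event."""
    return all(
        re.search(pattern, str(event.get(field, ""))) is not None
        for field, pattern in template.items()
    )


def metric_alert_matches(field: str, op: str, threshold: float,
                         event: Dict[str, Any]) -> bool:
    """True when the metric field satisfies the comparison against the threshold."""
    return field in event and COMPARATORS[op](event[field], threshold)
```

A comparison on a numeric field is a constant-time check, whereas a regex must scan the message text, which is consistent with the later advice to prefer metrics-based alerts.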
18. Current deployment
• Multiple deployments : on bare KVM nodes, on OpenStack VMs
– On KVM nodes: 40+ VMs, 80+ TB storage, many large memory nodes
– Components are deployed in clustered mode for HA
– Some with active/active replication, some with active/passive
• Used by Platform and Infrastructure Services
– Tens of thousands of events per second (seen around 160 K events /sec)
– Hundreds of GBs of data collected and indexed per day
– Queries are currently coming from Kibana and Grafana, in future from APIs
19. The Journey and Lessons ...
20. Log, metrics and alerts
• log events
– insist on good severity levels
– enforce quota: it induces behavior change
– watch out for large messages (zip lines from stdout/stderr)
• metric events
– keep users aware of rollups (granularity)
• alerts
– watch out for too-simple ones: they cause alert floods
– watch out for complex regexes: they are performance / memory suckers
– encourage metrics-based alerts: this is what scales
21. Kafka, ES and Storm
• Kafka
– retention policy vs storage space: do the math with ingest & processing rate
– if you are not using auto-rebalance of leaders, keep an eye on the leaders
• Storm
– smaller topologies: easy to update and optimize
– match consumer parallelism (Kafka spouts) to the number of partitions
– tune the number of executor threads for optimal performance
• ElasticSearch
– aggregate your writes
– heap size <= 32 GB, turn off swap
– benefits hugely from high IOPS: use SSDs if you can
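"Doing the math" for Kafka retention is simple arithmetic: ingest rate times event size times retention window times replication factor. In the worked example below, only the 160 K events/sec peak comes from the slides; the 500-byte average event, 24-hour retention and replication factor of 3 are assumptions picked for illustration.

```python
def kafka_storage_bytes(events_per_sec: int, avg_event_bytes: int,
                        retention_hours: int, replication_factor: int) -> int:
    """Disk needed to hold one topic's retention window across all replicas."""
    return events_per_sec * avg_event_bytes * retention_hours * 3600 * replication_factor


if __name__ == "__main__":
    needed = kafka_storage_bytes(160_000, 500, 24, 3)
    print(f"{needed / 1e12:.1f} TB")  # 20.7 TB
```

If downstream processing falls behind, the retention window is also the time budget for catching up before unprocessed events are discarded.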
22. Using Open Source Software : Joy and Frustrations
• Be ready for constant upgrades
– for bug fixes
– to get cool new features: Grafana, Kibana
– for stability, cool stats and visualization: Storm
• InfluxDB clustering is still maturing
– temporary HA solution (write to 2+ influxDBs)
– waiting for 0.9 release with better clustering
23. Eat your own Dog Food
• Volta was a cobbler’s child for a while …
– did not use any system to aggregate logs and metrics!
• Now we are using Volta to collect its logs and metrics
– send logs and metrics from one Volta instance to another
– sending to the same instance is an interesting one!
• Important metrics:
– ingest rate, Storm processing rate, ES / Influx Write latency
– end to end latency of events
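End-to-end latency can be derived from the @timestamp an event carries and the time it became available in the store. A minimal sketch, assuming the timestamp format shown in the collectd samples and that the storage time is recorded by some other component:

```python
from datetime import datetime, timezone
from typing import Any, Dict


def end_to_end_latency_seconds(event: Dict[str, Any], stored_at: datetime) -> float:
    """Seconds between the event's @timestamp and the moment it was stored."""
    emitted = datetime.strptime(
        event["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return (stored_at - emitted).total_seconds()
```

This measurement is only as good as the clock synchronization between the emitting host and the storing host.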
24. Synthetic Transactions and Tracking SLAs
• Goal: track Service level metrics
– availability to users / business
– latency for operations to users
• Use Synthetic Transactions that exercise a sequence of APIs
– measure success / failure rates
– measure end to end latency
– collect, trend and alert on these
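A synthetic transaction can be sketched as a runner that executes a sequence of API calls, stops at the first failure, and reports success and total latency. The step callables are placeholders for real API calls; the result keys are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


def run_synthetic_transaction(
    steps: List[Tuple[str, Callable[[], None]]]
) -> Dict[str, object]:
    """Run named steps in order; any exception marks the transaction failed."""
    start = time.perf_counter()
    failed_step: Optional[str] = None
    for name, step in steps:
        try:
            step()
        except Exception:
            failed_step = name
            break  # later steps depend on earlier ones, so stop here
    return {
        "success": failed_step is None,
        "failed_step": failed_step,
        "latency_seconds": time.perf_counter() - start,
    }
```

The returned record can itself be shipped as a metric event, so that availability and latency are trended and alerted on like any other metric.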
25. Deployment & Ops : automate, automate, automate …
• Volta is a collection of services
– use separate repos, deploy small changes
• Lots of configuration parameters : manage consistency
– performance very sensitive to values
– e.g., Heap, number of workers, etc.
• Performance benchmarking
– needs to be done for each environment
• CI and Deployment pipeline
26. Volta next gen
27. Volta Next Gen
• OpenSource Volta
• Refactor Storm
– Split into separate metric and log topologies and batch writes
• Move ES and InfluxDB to higher iops storage (SSDs?)
• Multi-DC support via stream duplication
• Archival into Swift / HDFS
• Anomaly detection using CEP / Storm
• HTTP REST API in front of Kafka
• Deployment automation using OpenStack Murano
28. Thank you!
Questions, Comments, Suggestions?
We are interested in Open Sourcing & Collaborating on Volta.
Interested?
And, we are hiring …. interested?
ln_renganarayana@symantec.com
twitter: @lrengan
30. LMM Metrics Data Model
Mandatory fields:
● name : name of the metric. LMM uses this to store the metrics, and you will use it in queries: select "value" from "load"
● value : value of the metric at a given time
● @timestamp : time stamp
● host : host name or any other id
● tenant_id : tenant id (keystone)
● apikey : LMM apikey
Sample for metric from collectd:
{
"@version": "1",
"@timestamp": "2014-07-30T00:16:59.000Z",
"name": "cpu",
"host": "demo.symcpe.net",
"plugin_instance": "0",
"collectd_type": "cpu",
"type_instance": "interrupt",
"value": 0,
"tenant_id":"db5ca8e4c8514fad9f98dbc4d648ee87",
"apikey": "26d85ae3-1e10-4ce4-837a-7a1c8dfc67fb"
}
Collectd : the name of the plugin becomes the name of the metric. E.g.: cpu or memory
StatsD : the user's metric name concatenated with the metric type by a dot. E.g.: myapp.counter or myapp.gauge
Reserved fields: time, sequence_number
Special field: type_instance
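A client-side check against this data model might look like the sketch below. The field lists come from the slide; treating a reserved field in an incoming event as an error is an assumption, as is the function name.

```python
from typing import Any, Dict, List

MANDATORY_FIELDS = ("name", "value", "@timestamp", "host", "tenant_id", "apikey")
RESERVED_FIELDS = ("time", "sequence_number")


def validate_metric(event: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event looks valid."""
    problems = [f"missing mandatory field: {f}" for f in MANDATORY_FIELDS if f not in event]
    # Assumption: reserved names must not be supplied by the sender.
    problems += [f"reserved field used: {f}" for f in RESERVED_FIELDS if f in event]
    return problems
```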
Editor's notes
data driven/informed development, ops, choice : OODA loop
what users like? which features to build? how fast can I build it? how is my service running?
What are the use cases? Why build a new service?
- as a service: how can I make it someone else's problem?
- consumed by services across the stack
- scalable and elastic
- secure, multi-tenant
- Splunk was too expensive; new competing open source tech was emerging
Everyone starts with...
– A bunch of log files (syslog, application specific)
– On a bunch of machines
• Management consists of doing the simple stuff
– Rotate files, compress and delete
– Information is there but awkward to find specific events
– Weird log retention policies evolve over time
User authentication with Keystone for Query API & UI
Tenant id and API key used for events sent to LMM
Tenant ids from Keystone and API keys generated by LMM
Every event is tagged with a tenant id
Log events: tenant id as a field
Metric events: tenant id prefixed to the metric name
Keystone group membership used for sophisticated cross-tenant event visibility / correlation