Logging, Metrics and Monitoring as a Service Architecture

Volta: Logging, Metrics and Monitoring as a Service
LN Renganarayana
Technical Director / Architect
Cloud Platform Engineering
ln_renganarayana@symantec.com
twitter: @lrengan
Jan 7, 2015 | Volta / Cloud Platform Engineering, Symantec

Outline
• Motivation: data and events are the foundation of business
• Why build a (new) Service?
• What have we built: a (near) real-time data analytics pipeline
• The journey and lessons learned
• Looking ahead: Volta next gen

Data and events : the foundation
Picture: “Devops with S for sharing”, Patrick Debois
Questions answered from data and events:
• which features to build?
• what is a good pricing model?
• how fast can I build?
• what is the perf of my code?
• how is the service?
• what is my capacity?
• what is my current usage?

Why build a (new) service?

Why build a service?
Picture: Jim Nisbet & Philip O’Toole
AWS re:Invent 2013 Loggly presentation

Single place for events across the stack
Diagram: layers of the stack
• Bare Metal
• IaaS (OpenStack)
• Platform Services: BP, SP, KV, OBS
• Symantec Services & Apps
• Common Services: Volta, Identity Manager, CI / CD

Volta : Design Goals
• Design for both Developers and Ops
– Make it extremely simple to capture events
– Provide powerful search and visualization tools
• Secure, Multi Tenant : well we are Symantec, so Security comes first
• Scalable : elastically scale with load
• Highly Available: Volta is the eyes & ears for Operations
• One system for logs, metrics, monitoring & other events
• Build using open source tools and for open sourcing

What we have built ...
A (near) real-time data analytics pipeline

Volta Client View
Diagram: how clients send events to Volta
• App and Platform Services: write app metrics directly (push metrics)
• Infrastructure: SNMP, Vars and JMX expose metrics (pull metrics)
• VM logs: sent by the Volta Shipper
• Volta: receives metrics and log events; Alerts & Config UI
Push: StatsD, metrics extension for OpenStack
Pull: CollectD. Shipper: Logstash, moving to Heka
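As an illustration of the push path, the sketch below formats a metric in the StatsD line protocol (`<name>:<value>|<type>`) and sends it over UDP. The host, port and metric names are placeholders chosen for the example, not Volta endpoints.

```python
import socket

# StatsD type codes for the three most common metric kinds.
TYPE_CODES = {"counter": "c", "gauge": "g", "timer": "ms"}


def statsd_line(name: str, value, metric_type: str) -> str:
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{TYPE_CODES[metric_type]}"


def push_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one StatsD line over UDP (fire and forget).

    host and port are assumptions: 8125 is the conventional StatsD port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    push_metric(statsd_line("myapp.requests", 1, "counter"))
```

UDP keeps the client simple: the application never blocks on the metrics backend, at the cost of possible packet loss.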

Volta Under the Hood
Diagram: end-to-end architecture
• Client App / Service: log & metrics shipper sends log, metric & alert events
• Load Balancer
• Kafka cluster (knode1, knode2, knode3 ... knodeN): log, metric, alert events
• Storm cluster: Authentication, Validation, Alerts Processing
– ElasticSearch and Redis: Quota & Policy
– Alerts: email & callbacks
• MetricsStore: InfluxDB cluster
• LogStore: ElasticSearch cluster
• Front End Cluster (s1, s2, s3, s4 ... sn): Multi-tenancy and Kibana, Grafana proxies
• Keystone

The Ingest Pipeline
Diagram: Client App / Service (log & metrics shipper) → VIP → Kafka cluster (knode1, knode2, knode3 ... knodeN), carrying log, metric, alert events
• Kafka: replicated, fault-tolerant, persistent message queue
• LogTopic, MetricTopic, AlertTopic
• each topic is split into partitions
• per-topic retention policy

Event processing and storage
Diagram: Storm cluster (Authentication, Validation, Alerts Processing) writing to the stores and sending alerts by email & callbacks
• ElasticSearch and Redis (Quota & Policy)
– alert rules
– [tenantid, apikey] pairs
• LogStore (ElasticSearch)
– Per tenant per day index
– Index typed fields
– Quota and retention policy
• MetricsStore (InfluxDB)
– Tenant id prefixed time series names
– Continuous queries do rollups
– Retention policy through rollups
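The per-tenant naming rules can be sketched as two small helpers. The exact formats (`<tenant>-YYYY.MM.DD` for the log index, `<tenant>.<series>` for the time series) are assumptions made for illustration; the slides only state that indices are per tenant per day and that series names are prefixed with the tenant id.

```python
from datetime import datetime, timezone


def log_index_name(tenant_id: str, timestamp: str) -> str:
    """Return the daily log index for an event (hypothetical name format)."""
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    ts = ts.replace(tzinfo=timezone.utc)
    return f"{tenant_id}-{ts:%Y.%m.%d}"


def tenant_series_name(tenant_id: str, series: str) -> str:
    """Prefix a time series name with the tenant id (hypothetical separator)."""
    return f"{tenant_id}.{series}"


if __name__ == "__main__":
    tenant = "db5ca8e4c8514fad9f98dbc4d648ee87"
    print(log_index_name(tenant, "2014-08-06T19:17:43.000Z"))
    print(tenant_series_name(tenant, "lmm-dev-bastion.memory.used_1w"))
```

One index per tenant per day makes quota accounting and retention cheap: old data is dropped by deleting whole indices rather than individual documents.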

Multi-tenancy Proxy & UI
Diagram: Load Balancer → Front End Cluster (s1, s2, s3, s4 ... sn): Multi-tenancy and Kibana, Grafana proxies, with Keystone, ElasticSearch / Redis, the MetricsStore (InfluxDB) and the LogStore (ElasticSearch) behind it
• Intercepts and rewrites queries to ES and InfluxDB
• Enforces multi-tenancy (visibility of events to users)
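A minimal sketch of the rewrite step for an Elasticsearch request, assuming events carry a `tenant_id` field (as in the sample events) and that the proxy already knows which tenants the authenticated user may see. The function name and the use of a `bool` query with a `terms` filter are illustrative choices, not Volta's actual implementation.

```python
def scope_request(body: dict, visible_tenants: set[str]) -> dict:
    """Wrap the user's query so it only matches events of visible tenants.

    body is an Elasticsearch search request; the original dict is not modified.
    """
    user_query = body.get("query", {"match_all": {}})
    scoped = dict(body)
    scoped["query"] = {
        "bool": {
            "must": [user_query],
            # The filter is added by the proxy and cannot be removed by the user.
            "filter": [{"terms": {"tenant_id": sorted(visible_tenants)}}],
        }
    }
    return scoped


if __name__ == "__main__":
    request = {"query": {"match": {"message": "error"}}, "size": 10}
    print(scope_request(request, {"tenant-a", "tenant-b"}))
```

Because the filter is applied on the server side of the proxy, dashboards can issue arbitrary queries without being able to see other tenants' events.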

Security and Multi-tenancy model
• Authentication with Keystone backed by LDAP
– user authentication for Query API and UI
• Multi tenancy with users and groups
– Events have tenant id and apikey
• Cross tenant correlation
– group membership used for cross-tenant event visibility / correlation
• Dashboard sharing

Retention Policy : Log Events
• ElasticSearch allows powerful querying, but comes at a cost
– Store only logs that would help better operate and troubleshoot
– Use appropriate debug levels (not INFO)
• Fixed quota : 350 GB or 500 GB
• When a tenant reaches the quota limit, Volta will delete 20 % of old logs to free up space
• Through wise use of quota you can retain logs for many days
• Volta can retain logs for a longer duration, for special tenants who need to store them for compliance / audit
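The quota rule can be sketched as follows, assuming logs are stored in one index per day and that "20 %" means 20 % of the tenant's current usage, freed by dropping whole daily indices, oldest first. Both points are assumptions; the slide does not specify them.

```python
def indices_to_delete(index_sizes: dict[str, int], quota_bytes: int,
                      fraction: float = 0.20) -> list[str]:
    """Pick the oldest daily indices to delete once the quota is reached.

    index_sizes maps an ISO date (YYYY-MM-DD) to the index size in bytes.
    Returns an empty list while the tenant is under quota.
    """
    used = sum(index_sizes.values())
    if used < quota_bytes:
        return []
    target = fraction * used
    freed = 0
    doomed = []
    for day in sorted(index_sizes):  # ISO dates sort chronologically
        if freed >= target:
            break
        doomed.append(day)
        freed += index_sizes[day]
    return doomed


if __name__ == "__main__":
    sizes = {"2015-01-01": 150, "2015-01-02": 100, "2015-01-03": 250}
    print(indices_to_delete(sizes, quota_bytes=500))
```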

Metric Events: Retention Policy and Rollups
Naming scheme & retention policies
Naming scheme:
host + “.” + name + “.” + type_if_avail + “.” + retention_period
Retention periods: 1 day, 1 week, 1 month, and 3 months
Names for the example:
● default 1 day: lmm-dev-bastion.memory.used_
● 1 week: lmm-dev-bastion.memory.used_1w
● 1 month: lmm-dev-bastion.memory.used_1m
● 3 months: lmm-dev-bastion.memory.used_3m
Rollup precision:
● default 1 day: user defined (highest)
● 1 week: metrics aggregated to 1 minute
● 1 month: metrics aggregated to 5 minutes
● 3 months: metrics aggregated to 1 hour
Sample metric from collectd:
{
"@version": "1",
"@timestamp": "2014-08-06T19:17:43.000Z",
"host": "lmm-dev-bastion",
"name": "memory",
"collectd_type": "memory",
"type_instance": "used",
"value": 341884928,
"tenant_id": "db5ca8e4c8514fad9f98dbc4d648ee87",
"apikey": "26d85ae3-1e10-4ce4-837a-7a1c8dfc67fb"
}
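A rough sketch of how the series names and rollups fit together. It follows the example names on the slide (an underscore before the retention suffix, empty suffix for the default series) rather than the dotted scheme, and it assumes rollups average the raw points in each bucket; the slide only says "aggregated". In Volta the rollups are done by InfluxDB continuous queries, so the `rollup` function here only illustrates the arithmetic.

```python
from collections import defaultdict
from statistics import mean

# Retention suffix -> rollup bucket in seconds (None = raw, user-defined precision).
ROLLUPS = {"": None, "1w": 60, "1m": 300, "3m": 3600}


def series_names(event: dict) -> dict[str, str]:
    """Build one series name per retention period from a metric event."""
    parts = [event["host"], event["name"]]
    if event.get("type_instance"):
        parts.append(event["type_instance"])
    base = ".".join(parts)
    return {suffix: f"{base}_{suffix}" for suffix in ROLLUPS}


def rollup(points: list[tuple[int, float]], bucket_seconds: int) -> list[tuple[int, float]]:
    """Average (timestamp, value) points into fixed-size time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    event = {"host": "lmm-dev-bastion", "name": "memory", "type_instance": "used"}
    print(series_names(event))
    print(rollup([(0, 1.0), (30, 3.0), (60, 5.0)], ROLLUPS["1w"]))
```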

Alerts : Email and Callbacks
• Alerts can be set using the Alert UI or the REST API
• Alerts can be sent by email or posted to a webhook (REST endpoint)
• Webhook provides a good mechanism for integration with external automation and UIs
• Alerts on Log events
– User specifies an alert template using a regular expression to match
– Can match one or more fields from a Log event
– Simple and complex expressions
• Alerts on Metric events
– User specifies an alert template using comparison operators
– Can match one or more fields from the Metric event
– Simple and complex expressions
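The two alert styles can be sketched as simple matchers. The template shapes used here (a field-to-regex mapping for logs, a list of `(field, operator, threshold)` triples for metrics) are hypothetical; the slides do not describe Volta's actual template format.

```python
import operator
import re

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


def log_alert_matches(event: dict, patterns: dict[str, str]) -> bool:
    """True when every listed field of the log event matches its regex."""
    return all(
        re.search(pattern, str(event.get(field, ""))) is not None
        for field, pattern in patterns.items()
    )


def metric_alert_matches(event: dict, conditions: list[tuple]) -> bool:
    """True when every (field, operator, threshold) comparison holds."""
    return all(
        field in event and OPS[op](event[field], threshold)
        for field, op, threshold in conditions
    )


if __name__ == "__main__":
    log_event = {"severity": "ERROR", "message": "disk failure on /dev/sda"}
    print(log_alert_matches(log_event, {"severity": r"^(ERROR|FATAL)$",
                                        "message": r"disk .*fail"}))
    metric_event = {"name": "memory", "type_instance": "used", "value": 341884928}
    print(metric_alert_matches(metric_event, [("name", "==", "memory"),
                                              ("value", ">", 300_000_000)]))
```

Comparisons on numeric fields are cheap and predictable, while regular expressions run against every log line; this is one reason metric-based alerts tend to scale better.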

Current deployment
• Multiple deployments : on bare KVM nodes, on OpenStack VMs
– On KVM nodes: 40+ VMs, 80+ TB storage, many large memory nodes
– Components are deployed in clustered mode for HA
– Some with active/active replication, some with active/passive
• Used by Platform and Infrastructure Services
– Tens of thousands of events per second (around 160 K events/sec observed)
– Hundreds of GBs of data collected and indexed per day
– Queries currently come from Kibana and Grafana; in the future also from APIs

The Journey and Lessons ...

Log, metrics and alerts
• log events
– insist on good severity levels
– enforce quota → induce behavior change
– watch out for large messages (zip lines from stdout/stderr)
• metric events
– keep users aware of rollups (granularity)
• alerts
– watch out for too simple ones → alert floods
– watch out for complex regex → performance / memory suckers
– encourage metrics based alerts → this is what scales

Kafka, ES and Storm
• Kafka
– retention policy vs storage space: do the math with ingest & processing rate
– if you are not using auto-rebalance of leaders, keep an eye on the leaders
• Storm
– smaller topologies: easy to update and optimize
– match Kafka spout parallelism to the number of topic partitions
– tune the number of executor threads for optimal performance
• ElasticSearch:
– aggregate your writes
– heap size <= 32 GB, turn off swap
– benefits hugely from high iops → use SSDs if you can
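"Do the math" for Kafka retention can be sketched as two back-of-the-envelope calculations: the disk needed to hold the retention window, and how long a slow consumer can fall behind before unread events age out. The average event size, retention window and replication factor in the example are assumptions; only the 160 K events/sec figure comes from the deployment slide.

```python
def kafka_disk_bytes(events_per_sec: int, avg_event_bytes: int,
                     retention_hours: int, replication_factor: int) -> int:
    """Disk needed across the cluster to hold the full retention window."""
    return events_per_sec * avg_event_bytes * retention_hours * 3600 * replication_factor


def hours_until_data_loss(ingest_rate: float, processing_rate: float,
                          retention_hours: float) -> float:
    """Hours until a lagging consumer starts missing events.

    After t hours the consumer lags by t * (1 - processing/ingest) hours;
    events are lost once that lag exceeds the retention window.
    """
    if processing_rate >= ingest_rate:
        return float("inf")  # the consumer keeps up, nothing ages out unread
    return retention_hours / (1 - processing_rate / ingest_rate)


if __name__ == "__main__":
    # Assumed: 500-byte events, 24 h retention, replication factor 3.
    total = kafka_disk_bytes(160_000, 500, 24, 3)
    print(f"{total / 1e12:.3f} TB")
    print(hours_until_data_loss(ingest_rate=100, processing_rate=50, retention_hours=24))
```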

Using Open Source Software : Joy and Frustrations
• Be ready for constant upgrades
– for bug fixes
– to get cool new features: Grafana, Kibana
– for stability, cool stats and visualization: Storm
• InfluxDB clustering is still maturing
– temporary HA solution (write to 2+ InfluxDB instances)
– waiting for the 0.9 release with better clustering

Eat your own Dog Food
• Volta was a cobbler’s child for a while …
– did not use any system to aggregate logs and metrics!
• Now we are using Volta to collect its logs and metrics
– send logs and metrics from one Volta instance to another
– sending to the same instance is an interesting one!
• Important metrics:
– ingest rate, Storm processing rate, ES / Influx Write latency
– end to end latency of events

Synthetic Transactions and Tracking SLAs
• Goal: track Service level metrics
– availability to users / business
– latency for operations to users
• Use Synthetic Transactions that exercise a sequence of APIs
– measure success / failure rates
– measure end to end latency
– collect, trend and alert on these
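A minimal sketch of a synthetic transaction runner: it executes a sequence of steps (each step standing in for one API call), records success or failure and the end-to-end latency, and summarizes a batch of runs. The step functions in the example are placeholders; a real transaction would call the service's APIs.

```python
import time
from typing import Callable, Sequence


def run_synthetic_transaction(steps: Sequence[Callable[[], None]]) -> tuple[bool, float]:
    """Run the steps in order; return (succeeded, end-to-end latency in seconds).

    A step signals failure by raising an exception, which aborts the transaction.
    """
    start = time.perf_counter()
    try:
        for step in steps:
            step()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def summarize(results: Sequence[tuple[bool, float]]) -> dict:
    """Success rate and worst latency over a non-empty batch of runs."""
    if not results:
        raise ValueError("no results to summarize")
    successes = sum(1 for ok, _ in results if ok)
    return {
        "success_rate": successes / len(results),
        "max_latency_s": max(latency for _, latency in results),
    }


if __name__ == "__main__":
    def send_event() -> None:  # placeholder for "write an event"
        time.sleep(0.01)

    def query_event() -> None:  # placeholder for "query it back"
        time.sleep(0.01)

    runs = [run_synthetic_transaction([send_event, query_event]) for _ in range(3)]
    print(summarize(runs))
```

The summary values are themselves metrics, so they can be fed back into the pipeline to trend and alert on.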

Deployment & Ops : automate, automate, automate …
• Volta is a collection of services
– use separate repos, deploy small changes
• Lots of configuration parameters : manage consistency
– performance very sensitive to values
– e.g., Heap, number of workers, etc.
• Performance benchmarking
– needs to be done for each environment
• CI and Deployment pipeline

Volta next gen

Volta Next Gen
• Open source Volta
• Refactor Storm
– Split into separate metric and log topologies and batch writes
• Move ES and InfluxDB to higher iops storage (SSDs?)
• Multi-DC support via stream duplication
• Archival into Swift / HDFS
• Anomaly detection using CEP / Storm
• HTTP REST API in front of Kafka
• Deployment automation using OpenStack Murano

Thank you!
Questions, Comments, Suggestions?
We are interested in Open Sourcing & Collaborating on Volta.
Interested?
And we are hiring ... interested?
ln_renganarayana@symantec.com
twitter: @lrengan

Backup Slides

LMM Metrics Data Model
Mandatory fields:
● name : name of the metric. LMM uses this to store the metrics, and you will use it in queries: select “value” from “load”
● value : value of the metric at a given time
● @timestamp : time stamp
● host : host name or any other id
● tenant_id : tenant id (Keystone)
● apikey : LMM apikey
Sample metric from collectd:
{
"@version": "1",
"@timestamp": "2014-07-30T00:16:59.000Z",
"name": "cpu",
"host": "demo.symcpe.net",
"plugin_instance": "0",
"collectd_type": "cpu",
"type_instance": "interrupt",
"value": 0,
"tenant_id":"db5ca8e4c8514fad9f98dbc4d648ee87",
"apikey": "26d85ae3-1e10-4ce4-837a-7a1c8dfc67fb"
}
Collectd : name of plugin becomes name of metric. E.g.: cpu or memory
StatsD : user's metric name concatenated with metric type by a dot. E.g.: myapp.counter or myapp.gauge
Reserved fields: time, sequence_number
Special field: type_instance
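The data model can be checked with a small validator, sketched below. The field lists come from this slide; treating a reserved field in an incoming event as an error is an assumption, since the slide does not say how Volta handles it.

```python
MANDATORY_FIELDS = ("name", "value", "@timestamp", "host", "tenant_id", "apikey")
RESERVED_FIELDS = ("time", "sequence_number")


def validation_errors(event: dict) -> list[str]:
    """Return a list of problems with a metric event (empty when it is valid)."""
    errors = [f"missing mandatory field: {field}"
              for field in MANDATORY_FIELDS if field not in event]
    # Assumption: events must not set fields reserved by the metrics store.
    errors += [f"reserved field used: {field}"
               for field in RESERVED_FIELDS if field in event]
    return errors


def statsd_metric_name(user_name: str, metric_type: str) -> str:
    """StatsD naming rule: the user's metric name, a dot, then the metric type."""
    return f"{user_name}.{metric_type}"


if __name__ == "__main__":
    event = {
        "@version": "1",
        "@timestamp": "2014-07-30T00:16:59.000Z",
        "name": "cpu",
        "host": "demo.symcpe.net",
        "value": 0,
        "tenant_id": "db5ca8e4c8514fad9f98dbc4d648ee87",
        "apikey": "26d85ae3-1e10-4ce4-837a-7a1c8dfc67fb",
    }
    print(validation_errors(event))
    print(statsd_metric_name("myapp", "counter"))
```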
