Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splunk .conf2012

Splunk All the Things:
Our First 3 Months Monitoring
Web Service APIs
Dan Cundiff (@pmotch) and Eric Helgeson (@nulleric)
Target Corporation

Copyright © 2012 Splunk Inc.

Agenda

Context

Problem

Solution

Examples

In progress and future stuff

Lessons and challenges

3

Context: Enterprise Services @ Target
Data and transactional APIs for all the domains in our business
– Products (inventory, price, description, etc)
– Locations
– Coupons
– etc
APIs exposed inside and outside
Mostly RESTful APIs, some pub sub/messaging
Used by mobile devices, applications, partners on the outside, etc.
Constantly evolving, rapidly improving, all the time

4

Problem
First API go-live:
– Millions of log events per day (grep/cut/sed/awk not cutting it)
– Logs scattered everywhere
– Limited access to logs
– Needed end to end visibility of web services
– Needed ability to discover information in logs
– Can we be pro-active? Faster reactive?
Looming horizon:
– BILLIONS of log events coming
– Questions changing everyday from business, support, execs, developers

5

Solution: Gave Splunk a try
Installed Splunk on a lab server
Hooked up Splunk to the logs
Quickly created 15+ searches and reports
Generated a dashboard for visibility and trending
Total time to do all this in Splunk:

~4 hours
6

Why Splunk
Understanding what’s “normal”
– Identify tolerances
– Identify actionable events vs. anomalies
You don’t know what you don’t know
– …but Splunk can tell you what you don’t know

7

Why Splunk, part 2
Indicators when are things trending badly
– Proactive monitoring and recovery
– Standard deviations, percentage changes over time, outliers
Full stack visibility
– API gateway
– Network (load balancers, firewalls)
– Web/app
– OS

8

Why Splunk, part 3
Quick and flexible dashboards
Drill down
Community (Splunkbase, blogs, etc)
Google-able™
App store!

9

What is “normal”?
Volume

11

What is “normal”?, part 2
API response time SLAs

12

What is “normal”?, part 3
Errors happen, but what is acceptable?

13

404s
~1700 errors once a day every week
404s for stores that don’t exist
Bot?
– Who are they?
– Malicious? Competitor? Individual?
– Reach out to understand why

14

Understanding consumers
Who and how is it being used?
What’s their experience?

15

Understanding consumers, part 2
Load testing in production?

16

Understanding infrastructure
Expected design vs actual implementation
Not balancing workload as expected

17

Understanding providers
How are providers responding?
Is overhead added to the API response?

18

Requirements feedback loop
Requirement: 200 tps
Actual: ~20 tps

19

Business intelligence from APIs
Where are people searching?
Where should we build our next store?
How far are people traveling?
What time of day?
Mobile vs website?
iOS vs Android?
International?

20

Metrics for APIs
(source: http://blog.programmableweb.com/2012/08/02/the-api-measurement-secret-know-what-metrics-matter/)
Traffic Metrics Service Metrics Support Metrics
– Total calls – Performance – Support tickets
– Top methods – Availability – Response time
– Call chains – Error rates – Community metrics
– Quota faults – Code defects
Business Metrics
Developer Metrics Marketing Metrics – Direct revenue
– Total developer count – Developer registrations – Indirect revenue
– Number – Developer portal funnel – Market share
of active developers – Traffic sources – Costs
– Top developers – Event metrics
– Trending apps
– Retention
21

Splunk all the things
Consumer apps
Provider systems
OS, firewalls, proxies
External API gateway logs
Anything in between (middleware, integrations, etc)
Correlate with logs from apps degrees away (e.g. .com web logs)

Development (perf test results, git, Jenkins/CI, wiki, etc)

Dashboards
Global dashboard summarizing all APIs
BI dashboards
Executive dashboards

24

Dashboards, part 2
Environment dashboards for each API
– CI
– Test
– Stage
– Prod

25

Dashboards, part 3
Alert trending dashboards for each API

26

Splunking Continuous Integration
Drill down into CI results linked straight from Jenkins
– Filtered by date OR transaction GUID

27

Splunking Continuous Integration, part 2
We practice code as documentation
Every commit, Jenkins runs, extracts documentation from code, puts it
in the respective wiki pages (pretty cool! – automated / no humans)
Splunk monitors wiki changes using the MediaWiki API
Monitor CI + human wiki changes

https://github.com/pmotch/wikislurp

28

Common Logging Service
CLS is our strategy for getting logs from all places into Splunk
How
– Use UFs on end points everywhere
– Else, consolidate and mount Splunk
– Else, use CLS RESTful API
Enables end-to-end visibility
– Insert GUIDs across all the hops in the transaction
Use out of the box log formats (e.g. Log4j)

29

Lessons
RTFM
– Keep logs flat
– Keep timestamp (ISO8601) at the beginning
– k=v
Iterate quick, push to prod; minimal tweaks to Splunk
Flatten out of box audit events (XML)
– Toggle at runtime
Don’t re-invent the wheel, use what your system provides, Splunk can
handle it!

31

Lessons, part 2
Don’t pre-optimize up front
– Governance
– Standards
– Alerting
– Access controls
Optimize as needed

32

Lessons, part 3
Create a community

33

Lessons, part 4
Create best practices, standards, etc in a wiki

34

Challenges: Organizational
“Stop. We already have tools that do this. Use those.”
– tgtMAKE saves the day
– tgtMAKE = R&D
– R&D = $, servers, flak shelter, people network

Make it real strategy
– Demo to as many key players as possible
– Drum up interested
– Show actual value

35

Challenges: Organizational, part 2

http://knowyourmeme.com/photos/361379-shut-up-and-take-my-money

36

Challenges: Organizational, part 3
The data can’t be trusted?

37

Challenges: OS
RHEL 6
SELinux
Ipfw
Install notes: http://nulleric.tumblr.com/post/13855621770/splunk-on-
redhat-6-install-notes

38

Challenges: Infrastructure
VM requirement
Adhering to MDHA requirements
Universal Forwarder skepticism

39

Challenges: Logs on the outside
Universal Forwarders on servers that we don’t manage
Firewalls
Multi-layered DMZs

40

Challenges: Splunk
…

41

Challenges: Splunk (err, improvements)
Index improvements
– Cheap servers, can fail, can expand
– Replication, N=3
– Replicas on N-1 subsequent nodes
– Data is always available, smooth out across servers if they go down or expand
– Multi-tenant
– Think OpenStack Swift “Ring” concept or Cassandra
– There’s that CAP Theorem thing; they say it’s a big deal.
GUI for deployment client configurations (lazy and for n00bs, we know)
Ability to extend charts with other libraries (like D3 or something)

42

Recap

Be bold. Tooling matters. Sell it.
Splunk all the things!
Iterate, adapt, change quickly.

43

We’re hiring
(come talk to us)

44

Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splunk .conf2012

Recomendados

Recomendados

Más contenido relacionado

Similar a Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splunk .conf2012

Similar a Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splunk .conf2012 (20)

Más de Dan Cundiff

Más de Dan Cundiff (7)

Último

Último (20)

Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splunk .conf2012

Notas del editor