2. About MetroPCS
• Provider of unlimited wireless communications
service for a flat rate with no annual contract
• Fifth largest facilities-based wireless carrier in the
United States
• Approximately 9 million subscribers
3. Gregg Woodcock
• Extensive telecom experience creating and supporting Nortel's
wireless products for 18+ years
• My team designs/creates/deploys or evaluates/integrates
software and hardware to support the resiliency and
expansion of our private mobile telephone network
• Anything with a data connection flows through my group
(GPS, Push-to-Talk, browser, VOIP)
• Previously with Sabre, Mobile Metrics and Nortel
4. The Challenge at MetroPCS
Launching New Products and Services
Speeding application de-bugging means bringing new products to market faster
Tariff Justification/Optimization
Splunk statistical analysis uses RDBMS lookup to calculate cost per call
Call Detail Record Visibility
Splunk’s ability to ingest any format without parsers or adapters speeds deployment and time to value
Detecting Abuse
Reports and dashboards highlight possible abusers, a key indicator of Terms of Service abusers
5. Speeding De-bugging, Speeding Time to Market
Launching a new handset every month!
We’ve gone from hours to minutes for troubleshooting issues
Self-service, secure access
No more bouncing issues from group to group
Same types of bugs happen each time—so we’ve automated
searches and alerts for these known patterns
“Splunk helps us uncover most bugs before we go into production, improves
user experience and gets us to market much more quickly.”
6. Speed of Implementation
• Splunk was up and running on commodity hardware in 2 weeks
• Incumbent product would have required an 8-month services engagement
7. Unexpected Benefits
• Open-source-style de-bugging (everyone can play: teach a man to fish)!
• Overall understanding of baseline system health
• Fast subpoena compliance for Law Enforcement
inquiries – CDR analysis
“Better able to close the door on all kinds of leaks!”
8. Lowest Cost Call Routing / Revenue Optimization
High Priced Tariffs with no Visibility
Optimal call routes difficult to track or understand
Manual mediation of tariff information was a 3+ month exercise, often still
without the desired results, resulting in higher-than-necessary fees
• Splunk helps us understand actual partner costs by looking at
partner tariffs from an external RDBMS and calculating actual
charges based on call duration
• We now have the ability to optimize call routing
• Lowest cost routing has a direct impact on the bottom line, saving
hundreds of thousands of dollars
11. Interesting Splunk Story
When an earthquake hit Trinidad and Tobago we
knew about it before the news broke (“What’s
up with Trinidad and Tobago?”). When the
incoming lines went down, our searches watching
for bad Answer/Seizure Ratios (ASR) showed this
route’s ASR drop through the floor.
12. Growing Business Means Growing Data Volumes
Business growth is good—but growth without understanding
can be crippling
CDR data alone is >1 TB/day
Correlating with other network and external data more challenging still
• Splunk allows us to link the rich data in CDRs with
external RDBMSs, systems and networks
• New visibility has highlighted new business opportunity
and exposed abuse
14. Android Smartphone Launch
• Data usage skyrockets overnight, beyond expectations
– What handset types were being used?
– Where and when is usage happening?
– Track usage broken down day-to-day, hour-by-hour
• Available data source: RADIUS accounting records, which
contain many key details:
• MDN, Realm, BSID (SID/NID/Cell/Sector encoded), PCF IP Address (32-bit
hex), Service Option (radio technology, e.g.
1xRTT, EVDO), BytesIn, BytesOut
15. Lookup CSV
• RADIUS user = MDN@Realm
• Realm is a 1-to-1 mapping to device type
• Derive fields to show handset type:
LOOKUP-realm2device_lookup = realm2device_lookup realm OUTPUT UImake,UImodel,UIOS
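The realm-to-device lookup can be sketched outside Splunk as a plain dictionary lookup. This is a hypothetical illustration: the realm strings and device rows below are invented, and the real mapping lives in a Splunk CSV lookup table.

```python
# Stand-in for the realm2device CSV lookup table (invented example rows).
REALM2DEVICE = {
    "android.example.net": ("Samsung", "SCH-R910", "Android"),  # hypothetical
    "brew.example.net": ("Kyocera", "M6000", "BREW"),           # hypothetical
}

def enrich(radius_user):
    """Split a RADIUS user 'MDN@Realm' and derive UImake/UImodel/UIOS
    from the realm, mirroring the LOOKUP- stanza's OUTPUT fields."""
    mdn, _, realm = radius_user.partition("@")
    make, model, uios = REALM2DEVICE.get(realm, ("unknown",) * 3)
    return {"MDN": mdn, "UImake": make, "UImodel": model, "UIOS": uios}
```

Because the realm identifies the device type 1-to-1, a single dictionary (or CSV) lookup is all the enrichment needed.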
16. Field Extraction
• Convert BSID field to 4 other derived fields
• [bsid_to_sid_nid_cell_sector]
• SOURCE_KEY = BSID
• REGEX = ^(?<SID>.{4})(?<NID>.{4})(?<Cell>.{3})(?<Sector>.{1})
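The fixed-width BSID split that this extraction performs can be mirrored in Python. The regex is the same one from the stanza, just in Python's named-group syntax; the sample BSID value below is made up.

```python
import re

# BSID is fixed-width: 4 chars SID, 4 chars NID, 3 chars Cell, 1 char Sector.
BSID_RE = re.compile(r"^(?P<SID>.{4})(?P<NID>.{4})(?P<Cell>.{3})(?P<Sector>.{1})")

def split_bsid(bsid):
    """Derive SID/NID/Cell/Sector fields from a BSID string."""
    m = BSID_RE.match(bsid)
    return m.groupdict() if m else {}
```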
17. Human-readable market name
• Convert SID number into a human-readable market name
(e.g. DFW)
• LOOKUP-SID2market_lookup = SID2market_lookup SID OUTPUT MKTfromSID
• LOOKUP-service_option_lookup = service_option_lookup SO OUTPUT SO_
18. My First Macro!
• Convert 32-bit hex PCF IP Address into dotted-quad
format
[32bit2dottedquad(2)]
args = IP32bit, nameIPdottedquad
definition = eval ip0=floor($IP32bit$/16777216) | eval ipx=$IP32bit$%16777216 | eval ip1=floor(ipx/65536) | eval ipx=ipx%65536 | eval ip2=floor(ipx/256) | eval ip3=ipx%256 | strcat ip0 "." ip1 "." ip2 "." ip3 $nameIPdottedquad$
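The macro's floor/modulo arithmetic is a straight base-256 decomposition of the 32-bit address, which can be checked with a small Python equivalent:

```python
def dotted_quad(ip32bit):
    """Convert a 32-bit integer IP (e.g. decoded from the hex PCF field)
    to dotted-quad, using the same floor/modulo steps as the macro."""
    ip0 = ip32bit // 16777216   # 2^24: first octet
    ipx = ip32bit % 16777216
    ip1 = ipx // 65536          # 2^16: second octet
    ipx = ipx % 65536
    ip2 = ipx // 256            # third octet
    ip3 = ipx % 256             # fourth octet
    return f"{ip0}.{ip1}.{ip2}.{ip3}"
```

For example, `dotted_quad(0xC0A80001)` yields `192.168.0.1`.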
19. Event Typing for Carrier Discrimination
• Every data session has 2 attributes corresponding with 2 event
types:
– Subscriber’s home carrier (“SUB”)
– Data session’s service carrier (“SVC”)
SPRINT_SUB
MPCS_SVC
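The tagging scheme can be sketched as follows (a hypothetical helper, not MetroPCS code): each data session yields one `_SUB` and one `_SVC` eventtype from its two carrier attributes, so a Sprint roamer on the MetroPCS network gets `SPRINT_SUB` and `MPCS_SVC`.

```python
def session_eventtypes(sub_carrier, svc_carrier):
    """Tag a data session with its two eventtypes: the subscriber's home
    carrier (_SUB) and the carrier whose network served the session (_SVC)."""
    return {f"{sub_carrier.upper()}_SUB", f"{svc_carrier.upper()}_SVC"}
```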
20. Create Any Usage Report
• Count users, sessions, bytes-in, bytes-out, bytes-total
• Break down by user (MDN), carrier (MPCS, Sprint, etc.),
handset (UIMake, UIModel), OS (UIOS: Android, RIM,
BREW, Windows), region/market (MKTfromSID),
PCF (PCFIP_32bitAddr or PCFIP_dottedQuad), Cell (Cell),
Sector (Cell, Sector), radio technology
(ServiceOption=1xRTT/EVDO/LTE), or any combination thereof.
21. Splunk Saves Vendor’s Launch Date!
• Desire to accommodate a vendor’s beta software that did
not have the SNMP alerting portion updated
• This service generated exploitable error logs that we were
already “Splunking”
• Enter SplunkBase
• Modified a “send SNMP” script we found and created
scheduled searches that automatically
raise alarms in our NOC
22. Search for a Naming Convention
[SNMP: PGW: blade3a: MAJOR: BWS Oracle
Sequence Number exhaustion]
action.email = 1
action.email.sendresults = 1
action.email.to = Gwoodcock@metropcs.com
action.script = 1
action.script.filename = sendSNMPtrap.pl
counttype = number of events
23. Search for a Naming Convention (cont’d)
cron_schedule = 0 * * * *
description = If this sequence number "tops out" the PGW will
fail all transactions!
dispatch.earliest_time = -1h@h
dispatch.latest_time = now
enableSched = 1
quantity = 0
relation = greater than
search = <REDACTED>
24. The Space-Saving Transaction Command
• A large amount of duplicates went undetected
• Needed to identify duplicate (but not identical) or
unmatched events
• Created a field called “lastChars” that helped me determine
duplicates from different sources
index=xxx | rex field=_raw ".*?\[(?<lastChars>.*)$" | transaction lastChars maxpause=0 maxspan=0 keepevicted=true | where mvcount(source) > 1
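The same duplicate detection can be sketched in Python: group events by everything after the first `[` (the "lastChars" field) and flag keys seen in more than one source. The event strings in the test are invented for illustration.

```python
from collections import defaultdict

def find_cross_source_duplicates(events):
    """events: iterable of (source, raw_text) pairs.
    Returns the set of 'lastChars' keys (text after the first '[')
    that appear in events from more than one source."""
    by_key = defaultdict(set)
    for source, raw in events:
        _, sep, last_chars = raw.partition("[")
        if sep:  # only events that actually contain a '['
            by_key[last_chars].add(source)
    return {k for k, sources in by_key.items() if len(sources) > 1}
```

This mirrors the search's logic: ignore the varying lead characters, compare only the tail, and keep groups whose `source` count exceeds 1.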
25. _internal Index Automation
• Automated data retention
• Reveals every index that had data purged
• Threshold based alerts
index=_internal sourcetype=splunkd bucketmover "will attempt to freeze" | rex field=_raw "/splunk_data/[^/]*/(?<indexname>[^/]*)/db_(?<newestTime>[^_]*)_(?<oldestTime>[^_]*)_.*" | dedup indexname | eval retentionDays=(now()-oldestTime)/(60*60*24) | stats values(retentionDays) as retentionDays by indexname
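The retentionDays arithmetic in this search can be mirrored in Python. The regex is the one from the search; the bucket path used in the example is made up.

```python
import re
import time

# Bucket directories are named db_<newestTime>_<oldestTime>_<id>.
BUCKET_RE = re.compile(
    r"/splunk_data/[^/]*/(?P<indexname>[^/]*)/db_(?P<newestTime>[^_]*)_(?P<oldestTime>[^_]*)_"
)

def retention_days(log_line, now=None):
    """Parse a bucketmover 'will attempt to freeze' line and compute how
    many days of data the index retained before the purge."""
    m = BUCKET_RE.search(log_line)
    if not m:
        return None
    now = now if now is not None else time.time()
    return (now - int(m.group("oldestTime"))) / (60 * 60 * 24)
```

A wrapper script can compare each index's value against its retention target and alert when a threshold is crossed, as described above.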
26. Lessons Learned
• Do disk partitioning for indices at virtual/software/conf
layer, not LUN/hardware layer
• Always have way more disk than you think you’ll need
• Always have more indexers than you think you’ll need
• PUT THE DEPLOYMENT SERVER IN FIRST (some pain to
wedge in later and you WILL have to do this eventually)
• Keep up with upgrades (many bug fixes)
• Convert discoveries into scheduled searches (don’t have the
same “surprise” twice)
What is MetroPCS, what does it entail? What does the playing field look like?
Lessons learned, share best practices with the audience....things they need to look out for.
At search time, Splunk looks up the current partner tariffs from an external RDBMS, integrating this data and calculating actual charges based on the call duration in the CDR. Splunk also calculates the best rate and telecommunications partner, which makes it easy to determine whether lowest-cost partners are being used in all cases. Splunk performs the external tariff lookup and calculation at a rate of over 10,000 CDRs/second on a single commodity x86 server.
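A toy version of the cost-per-call and lowest-cost-partner calculation described above (the tariff rates and the bill-per-started-minute rule are assumptions for illustration; the real rates come from the external RDBMS):

```python
import math

# Hypothetical per-minute tariffs standing in for the external RDBMS table.
TARIFFS = {"partnerA": 0.010, "partnerB": 0.014}  # USD/minute (assumed rates)

def call_cost(partner, duration_seconds):
    """Charge for each started minute at the partner's tariff rate
    (billing-increment rule assumed for this sketch)."""
    minutes = math.ceil(duration_seconds / 60)
    return round(minutes * TARIFFS[partner], 4)

def cheapest_partner(duration_seconds):
    """Pick the partner with the lowest cost for a call of this duration."""
    return min(TARIFFS, key=lambda p: call_cost(p, duration_seconds))
```

Joining each CDR's duration against the tariff table this way is what lets the reports show whether the lowest-cost partner was actually used on every route.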
Could also use Splunk to enforce SLAs with the MSSP
Every data session has 2 attributes: the home subscriber carrier (“SUB”) and the current service carrier (“SVC”), and these correspond to 2 categories of eventtypes. If, for example, we have a data session that is a Sprint roamer using our network, the accounting record will have eventtypes SPRINT_SUB and MPCS_SVC. This allows us to very easily discriminate different categories of traffic without having to directly examine any IP addresses or SIDs, and to maintain this (ever-changing) data in a single place, OUTSIDE of the searches.
The naming convention is important because the only variable a designer may change that gets passed to a script from an automated search is the search's name, so all data must be encoded into the search name. In our case, any search that calls our “sendSNMPtrap.pl” script and starts with the string “SNMP: ” will derive the following fields from the search name:
* <PRODUCT> is the name of the service that has the problem (e.g. PGW)
* <HOST> is the name/IP address of the server that has the problem
* <SEVERITY> is one of INFO, MINOR, MAJOR, CRITICAL
* <DESCRIPTION> is any plain text, fewer than 100 characters total; DO NOT USE THE COLON CHARACTER or the description will be truncated after the first one!
Obviously, each search must be limited to the one particular <HOST> listed in the <HOST> portion of the search's name, and a clone of each search must be scheduled for each host.
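The field-derivation rule described here can be sketched as a small parser (a hypothetical helper, not the actual sendSNMPtrap.pl script):

```python
def parse_search_name(name):
    """Decode PRODUCT/HOST/SEVERITY/DESCRIPTION from a search name
    following the 'SNMP: <PRODUCT>: <HOST>: <SEVERITY>: <DESCRIPTION>'
    convention. maxsplit=4 keeps any colons inside the description intact
    (which is why the convention forbids colons elsewhere)."""
    parts = [p.strip() for p in name.split(":", 4)]
    if len(parts) != 5 or parts[0] != "SNMP":
        return None  # not an SNMP-alerting search name
    _, product, host, severity, description = parts
    return {"product": product, "host": host,
            "severity": severity, "description": description}
```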
The reason it went undetected for so long was that the process writing the data put different lead characters in each event because it was writing to 2 different places. I had to ignore the first portion of each event and compare only the last portion. Here is the transaction command I used, which creates a field called “lastChars” consisting of everything beyond the first “[” character and shows me any duplicate events that come from different sources:

index=xxx | rex field=_raw ".*?\[(?<lastChars>.*)$" | transaction lastChars maxpause=0 maxspan=0 keepevicted=true | where mvcount(source) > 1

In the latter use, we have many operations that should have a “start” event and an “end” event, and sometimes we are missing one or the other in our data. Here is a search that will show these events:

index=xxx | transaction SomeField maxpause=1s | where linecount > 2 | stats count

Don't forget that when you run any search you can click on the icon of horizontal lines just underneath the “X results in the last Y <timeframe>” text and see the individual events that made up the search results data.
We like to monitor our data retention and were able to use the _internal index to automate this. We run the following search every day for the previous day and it shows every index that has had data purged and the timespan of the remaining data. We run this from within a script which parses the “retentionDays” values returned and compares them with our retention targets for each index and sends alerts when we cross particular thresholds.