Explore, Analyze and Visualize Data in Hadoop and NoSQL. Make massive quantities of machine data accessible, usable and valuable for the people who need it, at the speed they need it. Use Hunk to turn underutilized data into valuable insights in minutes, not weeks or months.
2. Splunk: Disruptive Approach to Unstructured Data
Structured (1980-2010): RDBMS, SQL search, ETL, schema at write
Unstructured (2010+): universal indexing, schema at read
Volume | Velocity | Variety
3. SPLUNK TODAY: Platform for Machine Data
Data sources: forwarders, syslog, TCP, other protocols; mainframe data; mobile devices; sensors and control systems; DB Connect; Stream (wire data)
Apps: Exchange, PCI Security, VMware, and a 600+ ecosystem of apps
5. Splunk - Big Data Technologies
Relational database (highly structured), SQL & MapReduce: Oracle, MySQL, IBM DB2, Teradata
Distributed file system (semi-structured), HDFS storage + MapReduce: Hadoop
Key/value, columnar or other (semi-structured), NoSQL: Cassandra, Accumulo, MongoDB
Real-time indexing of temporal, unstructured, heterogeneous data: Splunk
6. Massive Linear Scalability to Tens of TBs/Day
Send data from 1000s of servers using a combination of Splunk Forwarders, syslog, WMI, message queues, or other remote protocols
Auto load-balanced forwarding to as many Splunk Indexers as you need to index terabytes/day
Offload search load to Splunk Search Heads
Automatic load balancing linearly scales indexing
Distributed search and MapReduce linearly scale search and reporting
7. Splunk Real-Time Analytics
Inputs (monitor, TCP/UDP, scripted) feed data into the parsing queue
Parsing pipeline: source and event typing, character set normalization, line breaking, timestamp identification, regex transforms
The index queue feeds the indexing pipeline, which writes raw data and index files to the Splunk index
A real-time buffer feeds the real-time search process
8. Search Head Clustering
Ability to group search heads into a cluster in order to provide highly available and scalable search services to thousands of users
Mission-critical, enterprise-grade
9. Splunk Index Replication – High Availability
1. Master auto-detects that a peer is down
2. Master asks the redundant peer to act as primary
3. Peers copy the search files / index files / raw data
• Default is 3X replication
11. Splunk and Hadoop
Hunk – main use case: analyze Hadoop data using Hadoop processing
Splunk Hadoop Connect – main use case: real-time export of data from Splunk to Hadoop
Hunk Archive – main use case: archive Splunk indexes to Hadoop
Splunk HadoopOps – main use case: monitor Hadoop
13. Hunk – Unique
1. Runs natively in Hadoop: uses Hadoop MapReduce
2. Mixed mode: allows for data preview
3. Auto-deploys splunkd to DataNodes: on-the-fly indexing
4. Access control: supports many users, many Hadoop directories, and Kerberos
5. Schema on the fly
14. Run Natively in Hadoop
[Diagram: the Hunk search head submits MapReduce jobs to an external resource (e.g. hadoop.prod); the NameNode and JobTracker (YARN) schedule tasks on DataNode/TaskTracker machines, which index data on the data nodes and write to a working directory in HDFS]
15. Mixed-mode Search
[Diagram: over time, previews stream from the Splunk index until the switch-over point, after which Hadoop MR results take over]
• Data preview
• Allows users to search interactively by pausing and refining queries
16. Indexing On the Fly – Hunk Data Processing
[Diagram: the search head's ERP search process submits MapReduce jobs; a search process on each TaskTracker preprocesses raw data from HDFS and streams remote results back to the search head, which merges them into the final search results]
17. Role-based Security for Shared Clusters
Pass-through authentication:
• Provides role-based security for Hadoop clusters
• Access Hadoop resources under security and compliance constraints
• Integrates with Kerberos for Hadoop security
[Diagram: each role maps to its own queue – Business Analyst to the Biz Analytics queue, Marketing Analyst to the Marketing queue, Sys Admin to the Prod queue]
18. Managed Archiving: Splunk Enterprise to Hunk-HDFS
• Archive buckets to Hadoop (HDFS) instead of freezing buckets or throwing data away
• Store old data at up to 1/10 the cost in cheap Hadoop batch storage instead of SANs
• Optimize Splunk Enterprise search head performance for real-time monitoring, alerting and dashboarding with short-term historical context
• Use Hunk to search, analyze and visualize months or years of historical data in Hadoop
• Run federated queries and dashboards across Splunk Enterprise and Hunk
[Diagram: warm, cold and frozen buckets flow to Hadoop clusters]
20. Yahoo - Visualizing Hadoop
New Search (last 7 days; 1,175,726 events from 5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM):
index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
[Visualization: gb_hours over _time (Wed May 21 through Mon May 26, 2014) by queue, including OTHER, apg_dailyhigh_p3, apg_dailymedium_p5, apg_hourlyhigh_p1, apg_hourlylow_p4, apg_hourlymedium_p2, apg_p7, curveball_large, curveball_med, slingshot, slingstone]
• 600PB of data
• Very large clusters used by many groups across the enterprise
• 35,000 individual DataNodes
• Hadoop is provided as a self-service
21. Vantrix: Mobile Media Optimization
144 Hadoop nodes, 69 TB SSD storage
Analytics application: 10 million subscribers generate
• 80GB of raw session log data per day
• 26 million video data session records
Hunk query: 20 sec to search through 27M events, returning 4.7M events
Hunk as indexer: automatically indexed and counted field value occurrences
Hunk as self-service: proved invaluable for identifying and exploring use cases
Hunk business value: helps identify when subscribers abandon video
But listening to your machine data isn’t as easy as it sounds.
Machine data is different:
It is voluminous, unstructured, time-series data with no predefined schema.
It is generated by all IT systems, from servers and applications to RFIDs and wire data.
It is non-standard data, characterized by many unpredictable and changing formats.
Because of this, machine data cannot be managed using traditional approaches.
Traditional approaches require you to transform your data and force-fit it into a brittle schema; they aren't designed to handle inconsistent machine data formats.
Traditional approaches are designed with specific use cases and queries in mind; they limit the problems that you can solve.
Traditional approaches rely on siloed tools that are designed for structured data and legacy computing environments; they are inherently limited in their ability to scale.
To listen to your machine data, you need a solution with no limits:
No limits on the formats of data
No limits on where you can collect the data from
No limits on the questions that you can ask and the use cases you can solve.
And no limits on scale.
You need a solution that can keep up with Machine Data.
Since then, Splunk has invested significantly to expand from a search tool into a mission-critical platform. The platform supports hundreds of data types and can scale to massive volumes.
Today, it's more than Splunk Enterprise: we've added Splunk Cloud, Hunk, and Splunk MINT for mobile intelligence, and we have more than 600 apps.
Machine data is more than logs! It’s wire data, mainframe data, mobile device data, sensor data, metrics
Your use cases have evolved well beyond troubleshooting so we’re investing in solutions that leverage the power of Splunk Enterprise to provide you with packaged views into your data for faster, deeper insights.
Our most well-known solution is Splunk Enterprise Security and if you aren’t using it yet, we encourage you to find out why it’s turning the traditional SIEM market upside down.
How has big data evolved over time? For a long time, "big data" was simply a large database.
The database industry, in order to handle large data, moved to many smaller databases. Horizontal partitioning (also known as sharding) is a database design principle whereby rows of a database table are held separately (for example, A-D in one database, E-H in a second database, and so on).
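To make the idea concrete, here is a minimal sketch of range-based sharding like the A-D / E-H split described above. The shard names and boundaries are invented for illustration, not taken from any real system.

```python
# Hypothetical range-based horizontal partitioning (sharding): rows are
# routed to one of several databases by the first letter of their key.
# Shard names and boundaries are illustrative only.

SHARDS = {
    "shard_ad": ("A", "D"),   # keys starting with A-D
    "shard_eh": ("E", "H"),   # keys starting with E-H
    "shard_iz": ("I", "Z"),   # everything else
}

def route(key: str) -> str:
    """Return the shard that should hold a row with this key."""
    first = key[0].upper()
    for shard, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return shard
    raise ValueError(f"no shard for key {key!r}")

print(route("Alice"))   # shard_ad
print(route("Frank"))   # shard_eh
```

In practice the routing function lives in a middleware layer or client library, so applications query the right database without knowing the partition layout.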
Hadoop, inspired by Google's MapReduce and GFS papers, has been adopted as the de-facto big data system. Hadoop is an open source project from Apache that has evolved rapidly into a major technology movement. It has emerged as a popular way to handle massive amounts of data, including structured and complex unstructured data. Its popularity is due in part to its ability to store and process large amounts of data effectively across clusters of commodity hardware. Apache Hadoop is not actually a single product but instead a collection of several components. For the most part, Hadoop is a batch-oriented system.
** Teradata Aster Data and SQL-on-Hadoop are SQL interface systems that can talk to Hadoop.
** Cassandra and HBase are NoSQL databases that can process data using key/value access in real time.
Splunk = Temporal, Unstructured, Heterogeneous, real-time analytics platform.
Splunk allows you to divide the work of search and indexing across as many servers as you need to achieve the performance and scale you require. Using work-dividing techniques such as MapReduce, Splunk can take a single search and query as many indexers as you need to complete the job, allowing you to use inexpensive commodity hardware in massively parallel clusters.
For example, if you had 1 million events to search, one indexer could easily complete that search, but it would take a little time, say 30 seconds. However, if the same million events were spread across 10 indexers, the same search would complete in 3 seconds. How fast and how large you want your searches to be is yours to control by adding indexers as desired.
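The scatter/gather pattern behind this scaling can be sketched in a few lines: a query fans out to N simulated indexers in parallel, each scans only its own slice of the data, and the partial results are merged. The event data and indexer count are made up for the example.

```python
# Illustrative scatter/gather distributed search: each "indexer" scans only
# its slice of the events, so wall-clock time shrinks as indexers are added.
from concurrent.futures import ThreadPoolExecutor

def search_indexer(events, term):
    """One indexer's share of the work: scan its own slice."""
    return [e for e in events if term in e]

def distributed_search(all_events, term, n_indexers=10):
    # Spread events round-robin across indexers, search in parallel, merge.
    slices = [all_events[i::n_indexers] for i in range(n_indexers)]
    with ThreadPoolExecutor(max_workers=n_indexers) as pool:
        partials = pool.map(search_indexer, slices, [term] * n_indexers)
    merged = []
    for part in partials:
        merged.extend(part)
    return merged

events = [f"event {i} status={'ok' if i % 3 else 'error'}" for i in range(30)]
print(len(distributed_search(events, "error")))  # 10
```

The merge step plays the role of the search head: it collects and combines partial results, so adding indexers scales the scan while the merge stays cheap.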
For the most part, you can use monitor to add nearly all your data sources from files and directories. However, you might want to use upload to add one-time inputs, such as an archive of historical data. You can enable Splunk to accept an input on any TCP or UDP port. Splunk consumes any data sent on these ports. Use this method for syslog (default port is UDP 514), or set up netcat and bind to a port. TCP is the protocol underlying Splunk's data distribution and is the recommended method for sending data from any remote machine to your Splunk server. Splunk can index remote data from syslog-ng or any other application that transmits via TCP. However, there are times when you want to use scripts to feed data to Splunk for indexing, or prepare data from a non-standard source so Splunk can properly parse events and extract fields. You can use shell scripts, python scripts, Windows batch files, PowerShell, or any other utility that can format and stream the data that you want Splunk to index. You can stream the data to Splunk or write the data from a script to a file.
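As a concrete example of the scripted-input option above, here is a minimal sketch of a script that formats events and writes them to stdout, which is how a scripted input streams data for indexing. The host name and metric fields are invented for this example.

```python
# Minimal sketch of a scripted input: emit one event per line to stdout
# for the indexer to consume. Host and metric names are illustrative.
import sys
import time

def format_event(cpu_pct: int) -> str:
    # Lead with a timestamp so timestamp identification can find it.
    ts = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{ts} host=web01 metric=cpu_pct value={cpu_pct}"

if __name__ == "__main__":
    for pct in (12, 48, 97):
        sys.stdout.write(format_event(pct) + "\n")
    sys.stdout.flush()
```

The same pattern works for shell scripts, batch files, or PowerShell: anything that can write well-formed lines to stdout (or to a monitored file) can feed the indexer.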
All data that comes into Splunk enters through the parsing pipeline as large chunks. During parsing, Splunk breaks these chunks into events which it hands off to the indexing pipeline, where final processing occurs. During both parsing and indexing, Splunk acts on the data, transforming it in various ways. Most of these processes are configurable, so you have the ability to adapt them to your needs.
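Two of the parsing steps mentioned above, line breaking and timestamp identification, can be sketched as follows. The log format and field names (`_time`, `_raw`) follow common convention, but the code is a simplified illustration, not the actual pipeline.

```python
# Sketch of line breaking and regex-based timestamp identification:
# split a raw chunk into events and attach each event's timestamp.
import re
from datetime import datetime

TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def parse_chunk(chunk: str):
    """Break a raw chunk into events, extracting a timestamp per event."""
    events = []
    for line in chunk.splitlines():
        if not line.strip():
            continue  # skip blank lines
        m = TS_RE.search(line)
        ts = datetime.strptime(m.group(), "%Y-%m-%d %H:%M:%S") if m else None
        events.append({"_time": ts, "_raw": line})
    return events

chunk = "2014-05-21 08:00:01 GET /index.html 200\n2014-05-21 08:00:02 GET /missing 404\n"
print(len(parse_chunk(chunk)))  # 2
```

Real parsing also handles multi-line events, character set normalization, and configurable timestamp formats, which this sketch omits.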
To kick off a real-time search in Splunk Web, use the time range menu to select a preset Real-time time range window, such as 30 seconds or 1 minute. You can also specify a sliding time range window to apply to your real-time search. This defines a real-time buffer.
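The sliding-window idea behind that real-time buffer can be sketched like this: keep only events whose timestamp falls within the last `window_seconds` seconds, dropping older ones as new events arrive. This is purely illustrative, not Splunk's actual implementation.

```python
# Sketch of a sliding real-time window: retain only events newer than
# (latest_timestamp - window_seconds). Illustrative only.
from collections import deque

class RealTimeBuffer:
    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()  # (timestamp, event) pairs, oldest first

    def add(self, ts: float, event: str):
        self.events.append((ts, event))
        self._expire(ts)

    def _expire(self, now: float):
        # Drop events that have slid out of the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

buf = RealTimeBuffer(window_seconds=30)
buf.add(0, "a")
buf.add(10, "b")
buf.add(35, "c")   # "a" (ts=0) is now outside the 30-second window
print(len(buf.events))  # 2
```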
The Splunk Index is the repository for Splunk Enterprise data. Splunk Enterprise transforms incoming data into events, which it stores in indexes.
Faster Recovery II –
If you look at the screen: the two indexers on the left with green cylinders hold searchable copies of the data; the two indexers on the right hold only raw data.
What happens when a peer goes down? The master waits for the heartbeat timeout and marks the peer down.
It reassigns primaries to another peer, then tries to enforce the replication policy by making copies of the raw data and search files.
In 5.0, search files are generated on each peer from the raw data. In 6.0, the search files are copied over from a peer that already has them instead of being regenerated.
These statistics are from our internal tests...
Another point to note: generating search files from the raw data is CPU-intensive compared to copying search files.
Quick to set-up, scales to multiple concurrent databases
Enrich machine data with structured data from relational databases
Execute database queries directly from the Splunk user interface
Browse and navigate database schemas and tables
Combine machine data with structured data from relational databases
Search execution:
The Hunk search head takes the listing of directories in the virtual index and filters directories and files based on the search and time range (partition pruning).
The search process computes file splits, then constructs and submits the MapReduce jobs via the NameNode and JobTracker (the MapReduce resource manager in YARN).
Hunk streams a few file splits from HDFS and processes them in the search head to provide quick previews; while the MapReduce jobs run, the search head consumes and merges the MapReduce results, providing incremental previews.
The data nodes run a copy of splunkd to process the jobs and write results to a working directory in HDFS.
Final results are stored on the Hunk search head.
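The partition-pruning step above can be sketched as follows: given date-named directories in a virtual index, keep only those that overlap the search time range. The directory layout is an assumed example; real virtual indexes use configurable path-to-time extraction.

```python
# Illustrative partition pruning: filter date-named directories against
# a search time range before any MapReduce work is scheduled.
from datetime import date

def prune_partitions(dirs, start: date, end: date):
    """Keep directories whose date falls inside [start, end]."""
    kept = []
    for d in dirs:
        # Assume paths like /data/logs/2014-05-21 (hypothetical layout).
        y, m, day = map(int, d.rsplit("/", 1)[-1].split("-"))
        if start <= date(y, m, day) <= end:
            kept.append(d)
    return kept

dirs = [f"/data/logs/2014-05-{dd:02d}" for dd in range(18, 28)]
pruned = prune_partitions(dirs, date(2014, 5, 20), date(2014, 5, 23))
print(len(pruned))  # 4
```

Pruning before job submission is what keeps searches over narrow time ranges cheap even when the virtual index spans years of data.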
Hunk utilizes the Splunk Search Processing Language, the industry-leading method to enable interactive data exploration across large, diverse data sets. There is no requirement to "understand" data up front. Customers of Splunk Enterprise can reuse their Search Processing Language knowledge and skill set on data stored in Hadoop. Any command whose output depends on the event input order may yield different results: Splunk Enterprise guarantees events are delivered in descending time order, but Hunk doesn't. This is why transaction and localize do not work.
We can see the results from the intermediate Hadoop Map jobs getting streamed into the Splunk UI even before all the Map jobs are finished; once all the Hadoop Map jobs are done processing, Splunk displays the full results.
In essence, Splunk acts as the Hadoop Reduce phase and there is no need to use Hadoop for that phase.
Hunk starts the streaming and reporting modes concurrently. Streaming results show until the reporting results come in. Allows users to search interactively by pausing and refining queries.
This is a major, unique advantage of Hunk compared to alternative approaches such as Hive or SQL-on-Hadoop, which require a fixed schema in an effort to speed up searches; Hunk retains the combination of schema on the fly and results preview.
This new feature, planned for the next Hunk release (version 6.2.1), archives buckets to Hadoop (the Hadoop Distributed File System, or HDFS) instead of freezing buckets or throwing data away. This significantly lowers the total cost of ownership (TCO) for Splunk Enterprise installations while giving security analysts, risk managers and marketers access to months or years of historical data integral to their job success.
Store old data at up to 1/10 the cost in cheap Hadoop batch storage instead of SANs.
Optimize Splunk Enterprise search head performance for real-time monitoring, alerting and dashboarding with short-term historical context
Use Hunk to search, analyze and visualize months or years of historical data in Hadoop.
Run federated queries and dashboards across Splunk Enterprise and Hunk