Hydra - A Practical Introduction
Big Data DC - @bigdatadc
Matt Abrams - @abramsm
March 4th 2013
Agenda
• What is Hydra?
• Sample Data and Analysis Questions
• Getting started with a local Hydra dev environment
• Hydra’s Key Concepts
• Creating your first Hydra job
• Putting it all together
Hydra’s Goals
• Support Streaming and Batch Processing
• Massive Scalability
• Fault tolerant by design (bend but do not break)
• Incremental Data Processing
• Full stack operational support
  • Command and Control
  • Alerting
  • Resource Management
  • Data/Task Rebalancing
  • Data replication and Backup
What Exactly is Hydra?
• File System
• Data Processing
• Query System
• Job/Cluster Management
• Operational Alerting
• Open Source
Hydra - Terms
• Job: a process for processing data
• Task: a processing component of a job. A job can have one to n tasks
• Node: a logical unit of processing capacity available to a cluster
• Minion: management process that runs on cluster nodes. Acts as gatekeeper for controlling task processes
• Spawn: cluster management controller and UI
Hydra Cluster
Our Sample Data (Log-Synth)
3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
What do we want to know?
• What are the top IP addresses by request count?
• What are the top IP addresses by unique user count?
• What are the most common search terms?
• What are the most common search terms in the slowest 5% of queries?
• What are the daily numbers of unique searches, unique users, and unique IP addresses, and the distribution of response times (all approximate)?
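As a point of reference before turning to Hydra, the first three questions can be answered exactly in plain Python on the five sample rows. This is only a sketch for checking intuition on tiny inputs. The field names (`time`, `uid`, `ip`, `terms`) are labels assumed here, not names taken from the deck.

```python
import csv
from collections import Counter, defaultdict
from io import StringIO

# The five sample rows from the slide: response time, user id, IP, search terms.
SAMPLE = '''3.535, 5214d63bab95687d, 166.144.203.186, "the then good"
3.568, 5dbd9451948ad895, 88.120.153.226, "know boys"
4.206, 5dbd9451948ad895, 88.120.153.226, "to"
4.673, b967d99cad0b3e60, 88.120.153.226, "seven"
4.900, bd0d760fbb338955, 166.144.203.186, "did local it"
'''


def parse(text):
    """Parse the comma-separated sample into dicts (field names are assumed)."""
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    return [
        {"time": float(t), "uid": uid, "ip": ip, "terms": terms.split()}
        for t, uid, ip, terms in reader
    ]


rows = parse(SAMPLE)

# Top IP addresses by request count (exact).
hits_by_ip = Counter(r["ip"] for r in rows)

# Top IP addresses by unique user count (exact).
uids_by_ip = defaultdict(set)
for r in rows:
    uids_by_ip[r["ip"]].add(r["uid"])
unique_users_by_ip = {ip: len(uids) for ip, uids in uids_by_ip.items()}

# Most common search terms (exact).
term_counts = Counter(term for r in rows for term in r["terms"])
```

Exact counting like this keeps every key in memory, which is the cost the approximate structures later in the deck are meant to avoid.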
Setting up Hydra’s Local Stack
Vagrant
• $ vagrant init precise32 http://files.vagrantup.com/precise32.box
• // add: config.vm.network :forwarded_port, guest: 5052, host: 5052 to your Vagrantfile
• $ vagrant up
• $ vagrant ssh
Java7
• $ sudo apt-get update
• $ sudo apt-get install python-software-properties
• $ sudo add-apt-repository ppa:webupd8team/java
• $ sudo apt-get update
• $ sudo apt-get install oracle-java7-installer
RabbitMQ, Maven, Git, Make
• $ sudo apt-get install rabbitmq-server
• $ sudo apt-get install maven
• $ sudo apt-get install git
• $ sudo apt-get install make
Copy on Write
• $ wget http://xmailserver.org/fl-cow-0.10.tar.gz
• $ tar zxvf fl-cow-0.10.tar.gz
• $ cd fl-cow-0.10
• $ ./configure --prefix=/usr
• $ make; make check
• $ sudo make install
• $ export LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD
Hydra
• $ git clone https://github.com/addthis/hydra.git
• $ cd hydra; mvn clean -Pbdbje package
• $ ./hydra-uber/bin/local-stack.sh start
• $ ./hydra-uber/bin/local-stack.sh start
• $ ./hydra-uber/bin/local-stack.sh seed
Stage Sample Data in Stream Directory
• $ mkdir ~/hydra/hydra-local/streams/log-synth
• $ cp $YOUR_SAMPLE_DATA_DIR ~/hydra/hydra-local/streams/log-synth
Pipes and Filters
BundleFilters
• Return true or false
• Operate on entire rows
• Add/Remove columns
• Edit Column Values
• May include a call to ValueFilter
ValueFilters
• Operate on single column values
• Return a value or null
• No visibility to full row
• Often take input from BundleFilter
BundleFilter - Chain
// chain of bundle filters
{"op":"chain", "filter":[
  // LIST OF BUNDLE FILTERS
  ...
]}
BundleFilter - Existence

// false if UID column is null
{"op":"field", "from":"UID"},
BundleFilter - Concatenation
// joins FOO and BAR
// Stores output in new column "OUTPUT"
{"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},
BundleFilter - Equality Testing
// FIELD_ONE == FIELD_TWO
{"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
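The bundle filters above can be modeled as functions that take a whole row and return true or false. The sketch below is a toy Python model for reasoning about the configuration, not Hydra's implementation. The function names are hypothetical, and two behaviors are assumptions: that a chain stops at the first filter returning false, and that `concat` joins values with no separator.

```python
def field_exists(name):
    """Model of {"op":"field","from":name}: false if the column is missing or null."""
    return lambda row: row.get(name) is not None


def concat(inputs, out):
    """Model of {"op":"concat"}: join the input columns into a new column."""
    def apply(row):
        # Assumption: values are joined with no separator.
        row[out] = "".join(str(row[column]) for column in inputs)
        return True
    return apply


def equals(left, right):
    """Model of {"op":"equals"}: true when the two columns hold equal values."""
    return lambda row: row.get(left) == row.get(right)


def chain(*filters):
    """Model of {"op":"chain"}: run filters in order on the same row."""
    # Assumption: evaluation stops at the first filter that returns False.
    return lambda row: all(f(row) for f in filters)


row = {"UID": "5dbd9451948ad895", "FOO": "a", "BAR": "b"}
pipeline = chain(field_exists("UID"), concat(["FOO", "BAR"], "OUTPUT"))
kept = pipeline(row)  # row now also carries the OUTPUT column
```

Because each filter sees the full row, a filter can both test the row and edit it, which is the distinction from value filters drawn on the earlier slide.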
BundleFilter - Math!
// DUR = Math.round((end-start)/1000)
{"op":"num", "columns":["END", "START", "DUR"],
 "define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}
Stack Math - Sample Data
C0, START_TIME    C1, END_TIME
100,234           200,468
Stack Math
c0,c1,sub,v1000,ddiv,toint,v2,set

Op      Stack (top first)     Result
sub     200,468  100,234      200,468 - 100,234 = 100,234
ddiv    1000  100,234         100,234 / 1000 = 100.234
toint   100.234               100
Stack Math - Sample Result
C0, START_TIME    C1, END_TIME    C2, DURATION
100,234           200,468         100
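The stack walk-through can be replayed with a small interpreter. This is a toy Python sketch, not Hydra's code. It covers only the six tokens used in this example, and the operand order (`sub` and `ddiv` take the second-from-top value as the left operand, `set` pops a column index and then the value) is an assumption inferred from the slides. The sketch follows the column order in the filter itself (`END`, `START`, `DUR`), which differs from the C0/C1 labels on the sample-data slide.

```python
def run_num(define, row, columns):
    """Evaluate a comma-separated stack program against one row (toy model)."""
    stack = []
    for token in define.split(","):
        if token == "sub":
            right, left = stack.pop(), stack.pop()
            stack.append(left - right)
        elif token == "ddiv":
            right, left = stack.pop(), stack.pop()
            stack.append(left / right)  # floating-point division
        elif token == "toint":
            stack.append(int(stack.pop()))  # truncates toward zero
        elif token == "set":
            index, value = stack.pop(), stack.pop()
            row[columns[index]] = value
        elif token.startswith("c"):
            # cN pushes the value of column N
            stack.append(row[columns[int(token[1:])]])
        elif token.startswith("v"):
            # vN pushes the constant N
            stack.append(int(token[1:]))
        else:
            raise ValueError(f"unsupported op: {token}")
    return row


result = run_num(
    "c0,c1,sub,v1000,ddiv,toint,v2,set",
    {"END": 200468, "START": 100234},
    ["END", "START", "DUR"],
)
```

With the sample values, `result["DUR"]` ends up as 100, matching the result slide.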
ValueFilter - Glob
{from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}}
Diagram labels: ValueFilter (the nested filter), BundleFilter (the enclosing object)
ValueFilter - Chain, Split, Index
{op:"field", from:"LIST", filter:{op:"chain", filter:[
  {op:"split", split:"="},
  {op:"index", index:0}
]}},
Diagram labels: ValueFilter (the chain), ValueFilter(s) (the filters inside the chain)
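Value filters can be modeled as functions from one value to a value or null. The Python sketch below mirrors the glob and chain/split/index examples. It is an illustration only: the function names are hypothetical, Python's `fnmatch` rules stand in for whatever glob syntax Hydra actually uses, and passing null through the rest of a chain is an assumption.

```python
from fnmatch import fnmatchcase


def glob(pattern):
    """Return the value if it matches the pattern, otherwise None."""
    return lambda value: (
        value if value is not None and fnmatchcase(value, pattern) else None
    )


def split(separator):
    """Split a string value into a list of parts."""
    return lambda value: None if value is None else value.split(separator)


def index(position):
    """Pick one element of a list value, or None if it is out of range."""
    def apply(value):
        if value is None or not 0 <= position < len(value):
            return None
        return value[position]
    return apply


def chain(*filters):
    """Feed the output of each value filter into the next one."""
    def apply(value):
        for f in filters:
            value = f(value)
        return value
    return apply


first_part = chain(split("="), index(0))
key = first_part("key=value")
matched = glob("Log_[0-9]*")("Log_42")
rejected = glob("Log_[0-9]*")("Other")
```

Here `key` is `"key"`, `matched` is `"Log_42"`, and `rejected` is `None`. None of these functions can see the rest of the row, which is why a bundle filter such as `field` supplies the input column.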
Data Attachments
Data Attachments are Hydra’s Secret Weapon
• Top-K Estimator
• Cardinality Estimation (HyperLogLog Plus)
• Quantile Estimation (Q-Digest, T-Digest)
• Bloom Filters
• Multiset streaming summarization (CountMin Sketch)
Data Attachment Example
A single node that tracks the top 1000 unique search terms, the distinct count of
UIDs, and provides quantile estimation for the query time
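To give a feel for what such an attachment stores, here is a minimal Count-Min Sketch in Python. It is a generic textbook version written for this walkthrough, not Hydra's implementation. The class name, the default `width` and `depth`, and the use of salted BLAKE2 hashes are all choices made for the sketch.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary: estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent hash per row, derived by salting BLAKE2 with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._buckets(item))


sketch = CountMinSketch()
for term in ["seven", "seven", "seven", "to"]:
    sketch.add(term)
```

The memory used is `width * depth` counters regardless of how many distinct terms arrive, at the price of possible overestimates when terms collide.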
Putting it All Together
Job Structure
• Jobs have three sections
  • Source
  • Map
  • Output
Source
• Defines the properties of the input data set
• Several built-in source types:
  • Mesh
  • Local File System
  • Kafka
Map
• Select fields from input record to process
• Apply filters to rows and columns
• Drop or expand rows
Output - Tree
• Output(s) can be trees or data files
• Trees represent data aggregations that can be queried
• File output targets
  • File System
  • Cassandra
  • HDFS
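One way to picture a tree output is as nested nodes that each carry a hit counter. The Python sketch below builds a `byip/<ip>/<uid>` tree from the sample rows. The layout is an assumption inferred from query paths such as `root/byip/+/+` in this deck, and the `Node` class and `insert` helper are hypothetical, not Hydra's API.

```python
class Node:
    """A tree node with a hit counter and named children."""

    def __init__(self):
        self.hits = 0
        self.children = {}


def insert(root, path):
    """Walk the path from the root, creating nodes and counting a hit on each."""
    node = root
    for name in path:
        node = node.children.setdefault(name, Node())
        node.hits += 1


# (ip, uid) pairs taken from the five sample rows.
records = [
    ("166.144.203.186", "5214d63bab95687d"),
    ("88.120.153.226", "5dbd9451948ad895"),
    ("88.120.153.226", "5dbd9451948ad895"),
    ("88.120.153.226", "b967d99cad0b3e60"),
    ("166.144.203.186", "bd0d760fbb338955"),
]

root = Node()
for ip, uid in records:
    insert(root, ["byip", ip, uid])

byip = root.children["byip"]

# Request count per IP: the hit counter on each IP node, largest first.
top_by_hits = sorted(
    ((ip, node.hits) for ip, node in byip.children.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

# Exact unique users per IP: the number of child nodes under each IP.
unique_uids = {ip: len(node.children) for ip, node in byip.children.items()}
```

Counting child nodes gives exact unique users but stores one node per user, which is the trade-off against keeping a cardinality estimator on the IP node instead.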
Let's Put It All Together
Create Hydra Job
Run Job
Query
What are the top IP Addresses By Record Count?
• Exact
  • path: root/byip/+:+hits
  • ops: gather=ks;sort=1:n:d;limit=100
• Approximate
  • path: root/byip/+$+uidcount
  • ops: gather=ks;sort=1:n:d;limit=100
What are the top IPs by unique user count?
• Exact
  • path: root/byip/+/+
  • ops: gather=kk;sort=0;gather=ku;sort=1:n:d
• Approximate
  • path: root/byip/+$+uidcount
  • ops: gather=ks;sort=1:n:d;limit=100
What are the search terms for the slowest 5%?
• First get the 95th percentile query time
  • path: /root$+timeDigest=quantile(.95)
  • ops: num=c0,toint,v0,set;gather=a
• Now find all queries at or above the 95th percentile
  • path: /root/bytime/+/+:+hits
  • ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
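The same two-step idea (find the threshold, then keep only rows at or above it) can be written exactly in Python for the five sample rows. This is a sketch for intuition only. It uses a nearest-rank percentile rather than a digest estimate, and the helper name `percentile` is hypothetical.

```python
import math
from collections import Counter

# (response time, search terms) from the five sample rows.
queries = [
    (3.535, ["the", "then", "good"]),
    (3.568, ["know", "boys"]),
    (4.206, ["to"]),
    (4.673, ["seven"]),
    (4.900, ["did", "local", "it"]),
]


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Step 1: the 95th percentile response time.
threshold = percentile([time for time, _ in queries], 0.95)

# Step 2: count terms only in queries at or above the threshold.
slow_terms = Counter(
    term for time, terms in queries if time >= threshold for term in terms
)
```

On this tiny sample the threshold is the slowest row itself, so `slow_terms` holds only that row's three terms. A digest trades that exactness for bounded memory on large streams.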
Daily Unique Searches, Users, IPs and distribution of response times?
• Query Path:
  • root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits
• Ops:
  • gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999
• Remote Ops:
  • num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;
But yeah, I could do that with CLI!
Related Open Source Projects
• Meshy - https://github.com/addthis/meshy
• Codec - https://github.com/addthis/codec
• Muxy - https://github.com/addthis/muxy
• Bundle - https://github.com/addthis/bundle
• Basis - https://github.com/addthis/basis
• Column Compressor - https://github.com/addthis/columncompressor
• Cluster Boot Service - https://github.com/stewartoallen/cbs
Helpful Resources
• Hydra - https://github.com/addthis/hydra
• Hydra User Reference - http://ossdocs.addthiscode.net/hydra/latest/user-reference/
• Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/
• IRC - #hydra
• Mailing List - https://groups.google.com/forum/#!forum/hydra-oss

Hydra - Getting Started

  • 1. Hydra - A Practical Introduction Big Data DC - @bigdatadc Matt Abrams - @abramsm March 4th 2013
  • 2.
  • 3. Agenda • What is Hydra? • Sample Data and Analysis Questions • Getting started with a local Hydra dev environment • Hydra’s Key Concepts • Creating your first Hydra job • Putting it all together
  • 4. Hydra’s Goals • Support Streaming and Batch Processing • Massive Scalability • Fault tolerant by design (bend but do not break) • Incremental Data Processing • Full stack operational support • Command and Control • Alerting • Resource Management • Data/Task Rebalancing • Data replication and Backup
  • 5. What Exactly is Hydra? • File System • Data Processing • Query System • Job/Cluster Management • Operational Alerting • Open Source
  • 6. Hydra - Terms • Job: a process for processing data • Task: a processing component of a job. A job can have one to n tasks • Node: A logic unit of processing capacity available to a cluster • Minion: Management process that runs on cluster nodes. Acts as gate keeper for controlling task processes • Spawn: Cluster management controller and UI
  • 8. Our Sample Data (Log-Synth) 3.535,  5214d63bab95687d,  166.144.203.186,  "the  then  good"   3.568,  5dbd9451948ad895,  88.120.153.226,  "know  boys"   4.206,  5dbd9451948ad895,  88.120.153.226,  "to"   4.673,  b967d99cad0b3e60,  88.120.153.226,  "seven"   4.900,  bd0d760fbb338955,  166.144.203.186,  "did  local  it"
  • 9. What do we want to know? • What are the top IP addresses by request count? • What are the top IP address by unique user count? • What are the most common search terms? • What are the most common search terms in the slowest 5% of queries? • What are the daily number of unique searches, unique users, unique IP addresses, and distribution of response times (all approximates)?
  • 10. Setting up Hydra’s Local Stack
  • 11. Vagrant • $  vagrant  init  precise32  http:// files.vagrantup.com/precise32.box   • //  add:  config.vm.network  :forwarded_port,   guest:  5052,  host:  5052  to  your  Vagrantfile   • $  vagrant  up   • $  vagrant  ssh
  • 12. Java7 • $  sudo  apt-­‐get  update     • $  sudo  apt-­‐get  install  python-­‐software-­‐ properties   • $  sudo  add-­‐apt-­‐repository  ppa:webupd8team/java   • $  sudo  apt-­‐get  update   • $  sudo  apt-­‐get  install  oracle-­‐java7-­‐installer
  • 13. RabbitMQ, Maven, Git, Make • $ sudo apt-get install rabbitmq-server • $ sudo apt-get install maven • $ sudo apt-get install git • $ sudo apt-get install make
  • 14. Copy on Write • $ wget http://xmailserver.org/fl-cow-0.10.tar.gz • $ tar zxvf fl-cow-0.10.tar.gz • $ cd fl-cow-0.10 • $ ./configure --prefix=/usr • $ make; make check • $ sudo make install • $ export LD_PRELOAD=/usr/lib/libflcow.so:$LD_PRELOAD
  • 15. Hydra • $ git clone https://github.com/addthis/hydra.git • $ cd hydra; mvn clean -Pbdbje package • $ ./hydra-uber/bin/local-stack.sh start • $ ./hydra-uber/bin/local-stack.sh start • $ ./hydra-uber/bin/local-stack.sh seed
  • 16. Stage Sample Data in Stream Directory • $ mkdir ~/hydra/hydra-local/streams/log-synth • $ cp $YOUR_SAMPLE_DATA_DIR ~/hydra/hydra-local/streams/log-synth
  • 18. BundleFilters • Return true or false • Operate on entire rows • Add/Remove columns • Edit Column Values • May include a call to ValueFilter • ValueFilters • Operate on single values • Return a value or null • No visibility to full row • Often take input from BundleFilter
  • 19. BundleFilter - Chain // chain of bundle filters {"op":"chain", "filter":[ //LIST OF BUNDLE //FILTERS …. ]}
  • 20. BundleFilter - Existence // false if UID column is null {"op":"field", "from":"UID"},
  • 21. BundleFilter - Concatenation // joins FOO and BAR // Stores output in new column "OUTPUT" {"op":"concat", "in":["FOO", "BAR"], "out":"OUTPUT"},
  • 22. BundleFilter - Equality Testing // FIELD_ONE == FIELD_TWO {"op":"equals", "left":"FIELD_ONE", "right":"FIELD_TWO"},
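The bundle filter contract described on the preceding slides (take a whole row, optionally modify it, return true or false) can be sketched in Python. This is not Hydra's implementation. The function names mirror the "op" values on the slides, and the assumption that a chain stops at the first filter returning false is mine, not something the slides state.

```python
def field(column):
    """Existence test: false when the column is missing or null."""
    return lambda row: row.get(column) is not None

def concat(inputs, output):
    """Join the input columns and store the result in a new column."""
    def bundle_filter(row):
        row[output] = "".join(str(row[name]) for name in inputs)
        return True
    return bundle_filter

def equals(left, right):
    """True when two columns of the same row hold equal values."""
    return lambda row: row.get(left) == row.get(right)

def chain(filters):
    """Run filters in order; all() stops at the first one that returns False."""
    return lambda row: all(f(row) for f in filters)

keep = chain([field("UID"), concat(["FOO", "BAR"], "OUTPUT")])

good = {"UID": "5214d63bab95687d", "FOO": "a", "BAR": "b"}
bad = {"FOO": "a", "BAR": "b"}  # no UID, so the chain fails before concat runs

print(keep(good), good.get("OUTPUT"))
print(keep(bad), bad.get("OUTPUT"))
```

Each filter sees the full row, which is what separates a bundle filter from a value filter that only ever sees one value.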
  • 23. BundleFilter - Math! // DUR = Math.round((end-start)/1000) {"op":"num", "columns":["END", "START", "DUR"], "define":"c0,c1,sub,v1000,ddiv,toint,v2,set"}
  • 24. Stack Math - Sample Data • C0 (START_TIME): 100, 234 • C1 (END_TIME): 200, 468
  • 28. Stack Math - Sample Result • C0 (START_TIME): 100, 234 • C1 (END_TIME): 200, 468 • C2 (DURATION): 100
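The "define" string on the Math slide reads as a postfix (stack) program: cN pushes column N, vN pushes the constant N, and operators pop their arguments. The Python sketch below evaluates only the tokens that appear on that slide. The operator semantics are assumptions inferred from the slide's comment: sub is left minus right, ddiv is floating-point division, set pops a column index and then a value. Here toint truncates, whereas the slide's comment says Math.round, so the two may differ on fractional results.

```python
def stack_math(define, row):
    """Evaluate a comma-separated postfix program against a list of column values."""
    stack = []
    for token in define.split(","):
        if token == "sub":
            right, left = stack.pop(), stack.pop()
            stack.append(left - right)
        elif token == "ddiv":
            right, left = stack.pop(), stack.pop()
            stack.append(left / right)
        elif token == "toint":
            stack.append(int(stack.pop()))  # truncates; assumed behaviour
        elif token == "set":
            index, value = stack.pop(), stack.pop()
            row[int(index)] = value
        elif token.startswith("c"):
            stack.append(row[int(token[1:])])  # push a column value
        elif token.startswith("v"):
            stack.append(float(token[1:]))  # push a constant
        else:
            raise ValueError(f"unknown token: {token}")
    return row

# Columns in the order given on the Math slide: END, START, DUR.
# The millisecond values are hypothetical.
row = [5000, 2000, None]
stack_math("c0,c1,sub,v1000,ddiv,toint,v2,set", row)
print(row)
```

With END = 5000 and START = 2000 the program computes (5000 - 2000) / 1000 and stores 3 in the third column.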
  • 29. ValueFilter - Glob • {from:"SOURCE", filter:{op:"glob", pattern:"Log_[0-9]*"}} • The outer object is the BundleFilter; the inner filter is the ValueFilter
  • 30. ValueFilter - Chain, Split, Index • {op:"field", from:"LIST", filter: {op:"chain", filter:[ {op:"split", split:"="}, {op:"index", index:0} ]}}, • The inner chain is a ValueFilter; its list entries are ValueFilter(s)
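The chain, split and index example can be read as function composition over a single value. In this rough Python sketch each value filter takes one value and returns a value or None, and a field bundle filter hands it one column of the row. Writing the result back to the same column is an assumption, since the slide does not show a destination column.

```python
def split(separator):
    """Value filter: turn a string into a list of parts."""
    return lambda value: None if value is None else value.split(separator)

def index(position):
    """Value filter: pick one element of a list, or None if out of range."""
    def value_filter(value):
        if value is None or position >= len(value):
            return None
        return value[position]
    return value_filter

def chain(*filters):
    """Value filter: feed the output of each filter into the next."""
    def value_filter(value):
        for f in filters:
            value = f(value)
        return value
    return value_filter

def field(column, value_filter):
    """Bundle filter: run a value filter on one column of the row."""
    def bundle_filter(row):
        row[column] = value_filter(row.get(column))  # write-back is assumed
        return True
    return bundle_filter

row = {"LIST": "color=blue"}
field("LIST", chain(split("="), index(0)))(row)
print(row)
```

The value filters never see the row itself, only the value passed in by the enclosing bundle filter.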
  • 32. Data Attachments are Hydra’s Secret Weapon • Top-K Estimator • Cardinality Estimation (HyperLogLog Plus) • Quantile Estimation (Q-Digest, T-Digest) • Bloom Filters • Multiset streaming summarization (CountMin Sketch)
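To give a feel for what these attachments trade away, here is a minimal Count-Min Sketch in Python. It is a teaching sketch, not the implementation Hydra uses. The width, depth and SHA-256 based hashing are arbitrary choices made for this example.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency table; estimates never under-count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))

sketch = CountMinSketch()
for term in ["to", "seven", "to", "know", "to"]:
    sketch.add(term)

print(sketch.estimate("to"))
```

Memory use is width times depth counters regardless of how many distinct terms arrive, at the price of estimates that may be too high but are never too low.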
  • 33. Data Attachment Example A single node that tracks the top 1000 unique search terms, the distinct count of UIDs, and provides quantile estimation for the query time
  • 34. Putting it All Together
  • 35. Job Structure • Jobs have three sections • Source • Map • Output
  • 36. Source • Defines the properties of the input data set • Several built in source types: • Mesh • Local File System • Kafka
  • 37. Map • Select fields from input record to process • Apply filters to rows and columns • Drop or expand rows
  • 38. Output - Tree • Output(s) can be trees or data files • Trees represent data aggregations that can be queried • File Output Targets • File System • Cassandra • HDFS
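The three job sections can be pictured as a small pipeline: a source yields rows, a map stage filters and converts them, and a tree output aggregates them under nested keys. The Python sketch below mimics that shape on the sample data. The byip/IP/UID layout echoes the query paths used later in the deck, while the column names and the extra null-UID row are assumptions added for illustration.

```python
from collections import defaultdict

def source():
    """Source: yields raw rows (here, the sample data plus one bad row)."""
    yield {"TIME": "3.535", "UID": "5214d63bab95687d", "IP": "166.144.203.186"}
    yield {"TIME": "3.568", "UID": "5dbd9451948ad895", "IP": "88.120.153.226"}
    yield {"TIME": "4.206", "UID": "5dbd9451948ad895", "IP": "88.120.153.226"}
    yield {"TIME": "4.673", "UID": "b967d99cad0b3e60", "IP": "88.120.153.226"}
    yield {"TIME": "4.900", "UID": "bd0d760fbb338955", "IP": "166.144.203.186"}
    yield {"TIME": "5.100", "UID": None, "IP": "10.0.0.1"}  # hypothetical row

def map_rows(rows):
    """Map: drop rows without a UID and convert TIME to a number."""
    for row in rows:
        if row["UID"] is None:
            continue
        row["TIME"] = float(row["TIME"])
        yield row

def output_tree(rows):
    """Output: aggregate hit counts under byip/<IP>/<UID>."""
    byip = defaultdict(lambda: defaultdict(int))
    for row in rows:
        byip[row["IP"]][row["UID"]] += 1
    return byip

tree = output_tree(map_rows(source()))
for ip, uids in tree.items():
    print(ip, dict(uids))
```

Once the tree exists, a question such as "top IPs by unique user count" becomes a walk over its branches rather than a rescan of the raw data.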
  • 39. Let's put it all Together
  • 42. Query
  • 43. What are the top IP Addresses By Record Count? • Exact • path: root/byip/+:+hits • ops: gather=ks;sort=1:n:d;limit=100 • Approximate • path: root/byip/+$+uidcount • ops: gather=ks;sort=1:n:d;limit=100
  • 44. What are the top IPs by unique user count? • Exact • path: root/byip/+/+ • ops: gather=kk;sort=0;gather=ku;sort=1:n:d • Approximate • path: root/byip/+$+uidcount • ops: gather=ks;sort=1:n:d;limit=100
  • 45. What are the search terms for the slowest 5%? • First get the 95th percentile query time • path: /root$+timeDigest=quantile(.95) • ops: num=c0,toint,v0,set;gather=a • Now find all queries at or above the 95th percentile • path: /root/bytime/+/+:+hits • ops: num=c0,v950,gteq;gather=iks;sort=1:n:d
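The two queries above follow a simple pattern: compute a threshold, then filter on it. The Python sketch below does the same thing exactly on a handful of hypothetical (time in ms, term) records. It uses a nearest-rank quantile as a stand-in for the estimate a quantile data attachment would return.

```python
import math
from collections import Counter

def quantile(values, q):
    """Exact nearest-rank quantile over a small list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Hypothetical query times in milliseconds with their search terms.
records = [
    (120, "to"), (250, "seven"), (310, "know boys"), (400, "the then good"),
    (450, "did local it"), (520, "to"), (610, "seven"), (980, "know boys"),
]

# Step 1: the 95th percentile query time.
threshold = quantile([time for time, _ in records], 0.95)

# Step 2: search terms of every query at or above that time, most common first.
slow_terms = Counter(term for time, term in records if time >= threshold)

print(threshold)
print(slow_terms.most_common())
```

In the slide's second query the threshold appears as the constant v950, presumably copied from the result of the first query.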
  • 46. Daily Unique Searches, Users, IPs and distribution of response times? • Query Path: • root$+termcount$+uidcount$+ipcount$+timeDigest=quantile(.25)$+timeDigest=quantile(.50)$+timeDigest=quantile(.75)$+timeDigest=quantile(.95)$+timeDigest=quantile(.999):+hits • Ops: • gather=sssaaaaaa;title=total,searches,uids,ips,.25,.50,.75,.95,.999 • Remote Ops: • num=c4,toint,v4,set;num=c5,toint,v5,set;num=c6,toint,v6,set;num=c7,toint,v7,set;num=c8,toint,v8,set;
  • 47. But yeah, I could do that with CLI!
  • 48. Related Open Source Projects • Meshy - https://github.com/addthis/meshy • Codec - https://github.com/addthis/codec • Muxy - https://github.com/addthis/muxy • Bundle - https://github.com/addthis/bundle • Basis - https://github.com/addthis/basis • Column Compressor - https://github.com/addthis/columncompressor • Cluster Boot Service - https://github.com/stewartoallen/cbs
  • 49. Helpful Resources • Hydra - https://github.com/addthis/hydra • Hydra User Reference - http://oss-docs.addthiscode.net/hydra/latest/user-reference/ • Hydra User Guide - http://oss-docs.addthiscode.net/hydra/latest/user-guide/ • IRC - #hydra • Mailing List - https://groups.google.com/forum/#!forum/hydra-oss