Hadoop 2.2.0
Hadoop grows up
Adam Muise

Rob Ford says…

…turn off your #*@!#%!!! Mobile Phones!
YARN
Yet Another Resource Negotiator

A new abstraction layer
Single Use System → Multi Purpose Platform
Batch Apps → Batch, Interactive, Online, Streaming, …

[Diagram: the Hadoop 1.0 and Hadoop 2.0 stacks side by side]
HADOOP 1.0: MapReduce (cluster resource management & data processing) on top of HDFS (redundant, reliable storage).
HADOOP 2.0: MapReduce and other frameworks (data processing) on top of YARN (cluster resource management) on top of HDFS2 (redundant, reliable storage).
Concepts
• Application
– An application is a job submitted to the framework
– Example: a MapReduce job

• Container
– Basic unit of allocation
– Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU, etc.)
–  container_0 = 2 GB, 1 CPU
–  container_1 = 1 GB, 6 CPU
– Replaces the fixed map/reduce slots (see the sketch below)
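As a concrete illustration of the container concept, here is a minimal sketch of an ApplicationMaster written against the Hadoop 2.2 YARN client API asking the ResourceManager for a 2 GB / 1 vcore container. The class name, host name, and single-request structure are illustrative only and are not part of the deck.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // Register this ApplicationMaster with the ResourceManager (host/port/URL are illustrative).
    rmClient.registerApplicationMaster("appmaster-host", 0, "");

    // Ask for one container of 2048 MB and 1 virtual core -- the "container_0 = 2 GB, 1 CPU"
    // from the slide. No fixed map or reduce slots are involved.
    Resource capability = Resource.newInstance(2048, 1);
    ContainerRequest request =
        new ContainerRequest(capability, null, null, Priority.newInstance(0));
    rmClient.addContainerRequest(request);

    // A real AM would now loop on rmClient.allocate(progress) and launch the granted
    // containers through an NMClient; that part is omitted here.
  }
}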
YARN Architecture
• Resource Manager
– Global resource scheduler
– Hierarchical queues

• Node Manager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring

• Application Master
– Per-application
– Manages application scheduling and task execution
– E.g. MapReduce Application Master
YARN Architecture - Walkthrough
[Diagram: the ResourceManager and its Scheduler coordinating two client applications across a cluster of NodeManagers. Each application gets its own ApplicationMaster (AM1, AM2) running in a container on a NodeManager; the AMs negotiate further containers (Container 1.1 to 1.3 for application 1, Container 2.1 to 2.4 for application 2), which the per-node NodeManagers launch and monitor.]
YARN as OS for Data Lake
[Diagram: one ResourceManager/Scheduler serving mixed workloads on the same NodeManagers. Batch MapReduce tasks (map 1.1, map 1.2, reduce 1.1), Interactive SQL Tez vertices (vertex 1.1.1, 1.1.2, 1.2.1, 1.2.2), and Real-Time Storm processes (nimbus0, nimbus1, nimbus2) all run side by side in YARN containers.]
Multi-Tenant YARN
[Diagram: the ResourceManager Scheduler with a hierarchical queue tree under root. Recoverable labels: Mrkting 30%, Dev 20%, Adhoc 10%, Prod 80%, DW 60%, a Dev/Reserved/Prod split of 10%/20%/70%, and priority queues P0 70% / P1 30%.]
Multi-Tenancy with New Capacity Scheduler
•  Queues
•  Economics as queue capacity
–  Hierarchical queues

•  SLAs
–  Preemption

•  Resource Isolation
–  Linux: cgroups
–  MS Windows: Job Control
–  Roadmap: virtualization (Xen, KVM)

•  Administration
–  Queue ACLs
–  Run-time re-configuration for queues
–  Charge-back

[Diagram: the ResourceManager Scheduler running the Capacity Scheduler with hierarchical queues: root split into Mrkting 20%, Dev 20%, Adhoc 10%, DW 70%, Prod 80%, with a Dev/Reserved/Prod split of 10%/20%/70% and priority queues P0 70% / P1 30%.]
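For illustration only, a queue tree like the one above boils down to yarn.scheduler.capacity.* properties that normally live in capacity-scheduler.xml on the ResourceManager. The sketch below borrows queue names from the slide (the exact nesting and percentages are assumptions) and simply uses the Hadoop Configuration API to print such a fragment; it is not how the scheduler is configured at runtime.

import org.apache.hadoop.conf.Configuration;

public class CapacitySchedulerSketch {
  public static void main(String[] args) throws Exception {
    // Start from an empty Configuration so only our queue definitions are emitted.
    Configuration conf = new Configuration(false);

    // Top-level queues under root, with capacities summing to 100%.
    conf.set("yarn.scheduler.capacity.root.queues", "adhoc,dw,mrkting");
    conf.set("yarn.scheduler.capacity.root.adhoc.capacity", "10");
    conf.set("yarn.scheduler.capacity.root.dw.capacity", "70");
    conf.set("yarn.scheduler.capacity.root.mrkting.capacity", "20");

    // Nested queues, e.g. dev/prod under dw, mirroring the hierarchical diagram.
    conf.set("yarn.scheduler.capacity.root.dw.queues", "dev,prod");
    conf.set("yarn.scheduler.capacity.root.dw.dev.capacity", "20");
    conf.set("yarn.scheduler.capacity.root.dw.prod.capacity", "80");

    // Queue ACLs, preemption and run-time reconfiguration are driven by further
    // yarn.scheduler.capacity.<queue-path>.* properties.
    conf.writeXml(System.out); // prints a capacity-scheduler.xml style fragment
  }
}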
MapReduce v2
Changes to MapReduce on YARN

MapReduce V2 is a library now…
•  MapReduce runs on YARN like all other Hadoop 2.x applications
–  Gone are the map and reduce slots, that’s up to containers in YARN now
–  Gone is the JobTracker, replaced by the YARN AppMaster library

•  Multiple versions of MapReduce
–  The older mapred APIs work without modification or recompilation
–  The newer mapreduce APIs may need to be recompiled

•  Still has one master server component: the Job History Server
–  The Job History Server stores the execution history of jobs
–  Used to audit prior execution of jobs
–  Will also be used by the YARN framework to store charge-back information at that level
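A minimal sketch of what "MapReduce as a library on YARN" means from the client side: the familiar mapreduce API, with the framework pointed at YARN rather than a JobTracker. The class name and paths are placeholders; with no mapper or reducer set, the default identity classes simply copy the input, which is enough to watch the MR ApplicationMaster and its containers appear in YARN.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitOnYarn {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "yarn" makes the client talk to the ResourceManager; there is no JobTracker anymore.
    conf.set("mapreduce.framework.name", "yarn");

    Job job = Job.getInstance(conf, "identity-copy-on-yarn");
    job.setJarByClass(SubmitOnYarn.class);
    // Default identity Mapper/Reducer: the job just copies its input to its output.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}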
Shuffle in MapReduce v2
•  Faster Shuffle
–  Better embedded server: Netty

•  Encrypted Shuffle
–  Secures the shuffle phase as data moves across the cluster
–  Requires 2-way HTTPS with certificates on both sides
–  Incurs significant CPU overhead; reserve 1 core for this work
–  Certs stored on each node (provisioned with the cluster), refreshed every 10 seconds

•  Pluggable Shuffle/Sort
–  Shuffle is the first phase in MapReduce that is guaranteed not to be data-local
–  Pluggable Shuffle/Sort allows intrepid application or hardware developers to intercept the network-heavy workload and optimize it
–  Typical implementations have hardware components like fast networks and software components like sorting algorithms
–  The API will change with future versions of Hadoop
Efficiency Gains of MRv2
•  Key Optimizations
–  No hard segmentation of resources into map and reduce slots
–  The YARN scheduler is more efficient
–  The MRv2 framework has become more efficient than MRv1; the shuffle phase in MRv2 is more performant thanks to Netty

•  Yahoo has over 30,000 nodes running YARN across over 365 PB of data.
•  They calculate running about 400,000 jobs per day for about 10 million hours of compute time.
•  They also estimate a 60% – 150% improvement in node usage per day.
•  Yahoo got rid of a whole colo (a 10,000-node datacenter) because of the increased utilization.
HDFS v2
In a Nutshell

HA (NameNode High Availability)
HDFS Snapshots: Feature Overview
•  Admin can create point in time snapshots of HDFS
–  Of the entire file system (/root)
–  Of a specific data-set (sub-tree directory of file system)

•  Restore state of entire file system or data-set to a snapshot (like Apple
Time Machine)
–  Protect against user errors

•  Snapshot diffs identify changes made to data set
–  Keep track of how raw or derived/analytical data changes over time
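A hedged sketch of the snapshot workflow through the Hadoop 2 FileSystem API, assuming an administrator has already made the directory snapshottable (hdfs dfsadmin -allowSnapshot /data/raw); the path and snapshot name are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dataset = new Path("/data/raw");

    // Point-in-time snapshot of the data set; readable under /data/raw/.snapshot/before-reload
    Path snapshot = fs.createSnapshot(dataset, "before-reload");
    System.out.println("Created snapshot at " + snapshot);

    // ... run a risky reload or cleanup job here; if it goes wrong, restore by copying
    // back from the .snapshot directory (protection against user error).

    fs.deleteSnapshot(dataset, "before-reload");
  }
}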

NFS Gateway: Feature Overview
•  NFS v3 standard
•  Supports all HDFS commands
–  List files
–  Copy, move files
–  Create and delete directories

•  Ingest for large scale analytical workloads
–  Load immutable files as source for analytical processing
–  No random writes

•  Stream files into HDFS
–  Log ingest by applications writing directly to HDFS client mount

Federation

Managing Namespaces

Performance

Other Features

Apache Tez
A New Hadoop Data Processing Framework

Moving Hadoop Beyond MapReduce
•  Low-level data-processing execution engine
•  Built on YARN
•  Enables pipelining of jobs
•  Removes task and job launch times
•  Does not write intermediate output to HDFS
–  Much lighter disk and network usage

•  New base for MapReduce, Hive, Pig, Cascading, etc.
•  Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
Apache Tez as the new Primitive
MapReduce as Base (HADOOP 1.0) vs. Apache Tez as Base (HADOOP 2.0)

[Diagram: the two stacks side by side]
HADOOP 1.0: Batch workloads, with Pig (data flow), Hive (SQL) and others (e.g. Cascading) all compiling to MapReduce (cluster resource management & data processing) on HDFS (redundant, reliable storage).
HADOOP 2.0: Pig (data flow), Hive (SQL) and others (Cascading) run on Tez (execution engine); alongside them sit online data processing (HBase, Accumulo) and real-time stream processing (Storm); everything runs on YARN (cluster resource management) over HDFS2 (redundant, reliable storage).
Hive-on-MR vs. Hive-on-Tez
Tez avoids unneeded writes to HDFS.

SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a
UNION SELECT x, AVERAGE(y) AS AVG
FROM c GROUP BY x
ORDER BY AVG;

[Diagram: query plans for the statement above]
Hive on MR: the query runs as a chain of separate MapReduce jobs (SELECT a.state / SELECT b.id / SELECT c.price, JOIN(a, b), JOIN(a, c), GROUP BY a.state with COUNT(*) and AVERAGE(c.price)), and each job writes its intermediate result to HDFS before the next one starts.
Hive on Tez: the same work runs as a single DAG of map and reduce vertices, and intermediate results flow between vertices without being written to HDFS.
Apache Tez (“Speed”)
•  Replaces MapReduce as the primitive for Pig, Hive, Cascading, etc.
– Lower latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft

Tez Task: a task with pluggable Input, Processor and Output, i.e. Tez Task = <Input, Processor, Output>.
A YARN ApplicationMaster runs a DAG of Tez Tasks.
Tez: Building blocks for scalable data processing
[Diagram: classical MapReduce stages expressed as Tez <Input, Processor, Output> building blocks]
Classical 'Map': HDFS Input → Map Processor → Sorted Output.
Classical 'Reduce': Shuffle Input → Reduce Processor → HDFS Output.
Intermediate 'Reduce' (for Map-Reduce-Reduce chains): Shuffle Input → Reduce Processor → Sorted Output.
Hive

SQL: Enhancing SQL Semantics
Hive SQL Datatypes: INT; TINYINT / SMALLINT / BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR.

Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; windowing functions (OVER, RANK, etc.); custom Java UDFs; standard aggregation (SUM, AVG, etc.); advanced UDFs (ngram, XPath, URL); sub-queries in WHERE, HAVING; expanded JOIN syntax; SQL-compliant security (GRANT, etc.); INSERT/UPDATE/DELETE (ACID).

SQL Compliance: Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.

(The slide color-codes each item as either available in Hive 0.12 or on the roadmap.)
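For example, the windowing semantics listed above are reachable from ordinary JDBC tools through HiveServer2. A hedged sketch using the Hive 0.12 JDBC driver follows; the host, table, and columns are made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWindowingSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");          // HiveServer2 driver
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://hiveserver2-host:10000/default", "hdfs", "");

    // RANK() OVER (...) exercises the windowing functions listed above.
    String sql = "SELECT state, city, population, "
               + "RANK() OVER (PARTITION BY state ORDER BY population DESC) AS rnk "
               + "FROM cities";
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(sql);
    while (rs.next()) {
      System.out.println(rs.getString("state") + "\t" + rs.getString("city")
          + "\t" + rs.getLong("rnk"));
    }
    conn.close();
  }
}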
  
SPEED: Increasing Hive Performance
Interactive Query Times across ALL use cases
•  Simple and advanced queries in seconds
•  Integrates seamlessly with existing tools
•  Currently a >100x improvement in just nine months
Performance improvements included in Hive 0.12:
–  Base & advanced query optimization
–  Startup time improvement
–  Join optimizations
Apache Tez as the new Primitive
(Repeat of the earlier "MapReduce as Base vs. Apache Tez as Base" diagram: Hadoop 1.0 runs Pig, Hive and others on MapReduce over HDFS; Hadoop 2.0 runs them on Tez over YARN and HDFS2, alongside HBase/Accumulo and Storm.)
Hive-on-MR vs. Hive-on-Tez
(Repeat of the earlier "Hive-on-MR vs. Hive-on-Tez" comparison: Hive on MapReduce writes intermediate results to HDFS between jobs, while Hive on Tez runs the query as a single DAG and avoids the unneeded HDFS writes.)
Tez on YARN
[Diagram: a single ResourceManager/Scheduler hosting mixed workloads on the same NodeManagers: Batch MapReduce tasks (map 1.1, map 1.2, reduce 1.1), Hive-on-Tez SQL vertices (vertex 1.1.1, 1.1.2, 1.2.1, 1.2.2), and Real-Time Storm processes (nimbus0, nimbus1, nimbus2), all running in YARN containers.]
Apache Falcon
Data Lifecycle Management for Hadoop

Data Lifecycle on Hadoop is Challenging

Data Management Needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.

Problem: a patchwork of tools complicates data lifecycle management.
Result: long development cycles and quality challenges.
Falcon: One-stop Shop for Data Lifecycle
Apache Falcon provides the data management needs (data processing, replication, retention, scheduling, reprocessing, multi-cluster management) and orchestrates the underlying tools (Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs).

Falcon provides a single interface to orchestrate the data lifecycle; sophisticated DLM is easily added to Hadoop applications.
Falcon Core Capabilities
•  Core Functionality
–  Pipeline processing
–  Replication
–  Retention
–  Late data handling

•  Automates
–  Scheduling and retry
–  Recording audit, lineage and metrics

•  Operations and Management
–  Monitoring, management, metering
–  Alerts and notifications
–  Multi Cluster Federation

•  CLI and REST API

Falcon At A Glance
[Diagram: data processing applications sit on top of the Falcon Data Management Framework, which provides Data Import and Replication, Scheduling and Coordination, Data Lifecycle Policies, Multi-Cluster Management, and SLA Management.]

>  Falcon offers a high-level abstraction of key services for Hadoop data management needs.
>  Complex data processing logic is handled by Falcon instead of hard-coded in data processing apps.
>  Falcon enables faster development of ETL, reporting and other data processing apps on Hadoop.
Falcon Example: Replication
[Diagram: a data pipeline of Staged, Cleansed, Conformed, Access and Processed data sets, with Falcon replication copying the staged and processed data to a second cluster.]

>  Falcon manages workflow and replication.
>  Enables business continuity without requiring full data representation.
>  Failover clusters can be smaller than primary clusters.
Falcon Example: Retention

[Diagram: retention policies applied per data set (Staged, Cleansed, Conformed, Access Data), ranging from "retain 20 years" and "retain 3 years" down to "retain last copy only".]

>  Sophisticated retention policies expressed in one place.
>  Simplify data retention for audit, compliance, or for data re-processing.
Falcon Example: Late Data Handling
[Diagram: online transaction data (via Sqoop) and web log data (via FTP) are staged and joined into a combined dataset; the process waits up to 4 hours for the FTP data to arrive.]

>  Processing waits until all required input data is available.
>  Checks for late data arrivals, and retriggers processing as necessary.
>  Eliminates writing complex late-data handling rules within applications.
Examples

Example: Cluster Specification
<?xml version="1.0"?>
<!--
My Local Cluster specification
-->
<cluster colo="my-local-cluster" description="" name="cluster-alpha">
  <interfaces>
    <interface type="readonly"  endpoint="hftp://nn:50070"            version="2.2.0" />
    <interface type="write"     endpoint="hdfs://nn:8020"             version="2.2.0" />
    <interface type="execute"   endpoint="rm:8050"                    version="2.2.0" />
    <interface type="workflow"  endpoint="http://os:11000/oozie/"     version="4.0.0" />
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/cluster-alpha/staging" />
    <location name="temp"    path="/tmp" />
    <location name="working" path="/apps/falcon/cluster-alpha/working" />
  </locations>
</cluster>

Callouts on the slide: the readonly and write interfaces point at the NameNode (nn), the execute interface at the ResourceManager (rm), and the workflow interface at the Oozie server (os).
Example: Weblogs
Replication and Retention

Example 1: Weblogs
•  Weblogs land hourly in my primary cluster
•  HDFS location is /weblogs/{date}

•  I want to:
–  Evict weblogs from primary cluster after 1 day

Feed Specification 1: Weblogs
<feed description="" name="feed-weblogs1" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>

  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2013-10-24T00:00Z" end="2014-12-31T00:00Z"/>
      <retention limit="days(1)" action="delete"/>
    </cluster>
  </clusters>

  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}" />
  </locations>

  <ACL owner="hdfs" group="users" permission="0755" />
  <schema location="/none" provider="none"/>
</feed>

Callouts on the slide: the cluster element names the cluster where the data is located, the retention element sets the 1-day retention policy, and the location element gives the location of the data.
Example 2: Weblogs
•  Weblogs land hourly in my primary cluster
•  HDFS location is /weblogs/{date}

•  I want to:
–  Replicate weblogs to my secondary cluster
–  Evict weblogs from primary cluster after 2 days
–  Evict weblogs from secondary cluster after 1 week

Feed Specification 2: Weblogs
<feed description="" name="feed-weblogs2" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>

  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>

  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>

  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Callouts on the slide: the source cluster is where the data is located (2-day retention), the target cluster is where the data will be replicated (1-week retention), and the location element gives the location of the data.
Example 3: Weblogs
•  Weblogs land hourly in my primary cluster
•  HDFS location is /weblogs/{date}

•  I want to:
–  Replicate weblogs to a discovery cluster
–  Replicate weblogs to a BCP cluster
–  Evict weblogs from primary cluster after 2 days
–  Evict weblogs from discovery cluster after 1 week
–  Evict weblogs from BCP cluster after 3 months

Feed Specification 3: Weblogs
<feed description="" name="feed-weblogs" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>

  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-discovery" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
      <locations>
        <location type="data" path="/projects/recommendations/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
    <cluster name="cluster-bcp" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(3)" action="delete"/>
      <locations>
        <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      </locations>
    </cluster>
  </clusters>

  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>

  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Callouts on the slide: each target cluster can override the feed with a cluster-specific location.
Apache Knox
Secure Access to Hadoop

Connecting to the Cluster: Edge Nodes
•  What is an Edge Node?
–  Nodes in a DMZ that have access to the cluster; the only way to reach the cluster
–  Hadoop client APIs and MR/Pig/Hive jobs are executed from these edge nodes
–  Users SSH to the edge node, upload all job artifacts, and then execute API calls and commands from a shell

[Diagram: User → SSH → Edge Node → Hadoop]

• Challenges
– SSH, edge node, and job maintenance nightmare
– Difficult to integrate with applications
Connecting to the Cluster: REST API

REST APIs exposed by the cluster:
–  WebHDFS: HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming.
–  WebHCat: job control for MapReduce, Pig and Hive jobs, plus HCatalog DDL commands.
–  Oozie: job submission and management, and Oozie administration.

•  Useful for connecting to Hadoop from outside the cluster
•  When more client language flexibility is required
–  i.e. a Java binding is not an option

•  Challenges
–  Client must have knowledge of cluster topology
–  Required to open ports (and in some cases, on every host) outside the cluster
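As a small illustration of the REST route (the NameNode host name is a placeholder), listing a directory over WebHDFS needs nothing more than a plain HTTP client. Note that the client talks directly to a cluster host on its service port, which is exactly the topology and port exposure problem described above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsListSketch {
  public static void main(String[] args) throws Exception {
    // WebHDFS is served from the NameNode HTTP port (50070 by default in Hadoop 2.2).
    URL url = new URL(
        "http://namenode-host:50070/webhdfs/v1/weblogs?op=LISTSTATUS&user.name=hdfs");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    BufferedReader in =
        new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line); // JSON FileStatuses listing for /weblogs
    }
    in.close();
  }
}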
Apache Knox Gateway – Perimeter Security

Simplified Access / Centralized Security:
•  Single Hadoop access point
•  Rationalized REST API hierarchy
•  Eliminate the SSH "edge node"
•  LDAP and Active Directory authentication
•  Consolidated API calls
•  Multi-cluster support
•  Central API management + audit
•  Client DSL
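A hedged sketch of the same WebHDFS listing routed through a Knox gateway instead of the NameNode: one HTTPS endpoint, HTTP Basic credentials checked against LDAP/AD, and no cluster topology exposed to the client. The gateway host, topology name, and credentials below are made up, and a real client would also need to trust the gateway's SSL certificate.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

public class KnoxWebHdfsSketch {
  public static void main(String[] args) throws Exception {
    // Typical Knox layout: https://<gateway>:8443/gateway/<topology>/webhdfs/v1/<path>
    URL url = new URL(
        "https://knox-gateway:8443/gateway/sandbox/webhdfs/v1/weblogs?op=LISTSTATUS");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();

    // Knox authenticates the call (e.g. against LDAP/AD) before proxying it to the NameNode.
    String credentials =
        DatatypeConverter.printBase64Binary("guest:guest-password".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + credentials);

    BufferedReader in =
        new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    for (String line; (line = in.readLine()) != null; ) {
      System.out.println(line); // same JSON listing, with URLs rewritten to point at the gateway
    }
    in.close();
  }
}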
Knox Gateway Network Architecture
[Diagram: a stateless cluster of Knox Gateway reverse-proxy instances deployed in a DMZ between two firewalls. Browser, REST, and JDBC clients authenticate against identity providers (Kerberos / enterprise identity provider, enterprise or cloud SSO) and reach one or more secure Hadoop clusters through the gateway. Behind the gateway sit the cluster masters and workers (NameNode, WebHCat, Oozie, JobTracker/YARN, TaskTrackers, DataNodes, HBase, Hive, Ambari Server / Hue Server). Requests are streamed through the gateway to Hadoop services after authentication, and URLs are rewritten to refer to the gateway.]
Wot no 2.2.0?
Where can I get the Hadoop 2.2.0 fix?

Like the Truth, Hadoop 2.2.0 is out there…
Component          HDP 2.0   CDH4         CDH5 Beta   Intel IDH3.0   MapR 3   IBM BigInsights 2.1
Hadoop Common      2.2.0     2.0.0        2.2.0       2.0.4          N/A      1.1.1
Hive + HCatalog    0.12      0.10 + 0.5   0.11        0.10 + 0.5     0.11     0.9 + 0.4
Pig                0.12      0.11         0.11        0.10           0.11     0.10
Mahout             0.8       0.7          0.8         0.8            0.8      N/A
Flume              1.4.0     1.4.0        1.4.0       1.3.0          1.4.0    1.3.0
Oozie              4.0.0     3.3.2        4.0.0       3.3.0          3.3.2    3.2.0
Sqoop              1.4.4     1.4.3        1.4.4       1.4.3          1.4.4    1.4.2
HBase              0.96.0    0.94.6       0.95.2      0.94.7         0.94.9   0.94.3
Thank You
THUG Life


Más contenido relacionado

La actualidad más candente

Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleDataWorks Summit/Hadoop Summit
 
Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Uwe Printz
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureDataWorks Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseDataWorks Summit/Hadoop Summit
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive TuningAdam Muise
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDataWorks Summit
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hiveDavid Kaiser
 
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceThe Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceBlueData, Inc.
 
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAmazon Web Services
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldUwe Printz
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsDataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst TrainingCloudera, Inc.
 

La actualidad más candente (20)

Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
 
Deep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profitDeep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profit
 
Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
 
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceThe Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-Service
 
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the field
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst Training
 

Similar a 2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0

Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnhdhappy001
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformBikas Saha
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopHortonworks
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingTeddy Choi
 
Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2Big Data Joe™ Rossi
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Big Data Joe™ Rossi
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopPOSSCON
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Modern Data Stack France
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider projectSteve Loughran
 
Hadoop past, present and future
Hadoop past, present and futureHadoop past, present and future
Hadoop past, present and futureCodemotion
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN ApplicationsHortonworks
 

Similar a 2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0 (20)

MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
Huhadoop - v1.1
Huhadoop - v1.1Huhadoop - v1.1
Huhadoop - v1.1
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2Hadoop - Past, Present and Future - v1.2
Hadoop - Past, Present and Future - v1.2
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1
 
Hackathon bonn
Hackathon bonnHackathon bonn
Hackathon bonn
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
How YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in HadoopHow YARN Enables Multiple Data Processing Engines in Hadoop
How YARN Enables Multiple Data Processing Engines in Hadoop
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider project
 
Hadoop past, present and future
Hadoop past, present and futureHadoop past, present and future
Hadoop past, present and future
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 

Más de Adam Muise

2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_finalAdam Muise
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Adam Muise
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadamAdam Muise
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop IntroductionAdam Muise
 
Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopAdam Muise
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_securityAdam Muise
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoopAdam Muise
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLAdam Muise
 
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop1012014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop101Adam Muise
 
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitectureAdam Muise
 
2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mdaAdam Muise
 
2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - HadoopAdam Muise
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACAdam Muise
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013Adam Muise
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_pointsAdam Muise
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalogAdam Muise
 
KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012Adam Muise
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotechAdam Muise
 

Más de Adam Muise (20)

2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam2015 feb 24_paytm_labs_intro_ashwin_armandoadam
2015 feb 24_paytm_labs_intro_ashwin_armandoadam
 
Next Generation Hadoop Introduction
Next Generation Hadoop IntroductionNext Generation Hadoop Introduction
Next Generation Hadoop Introduction
 
Hadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of HadoopHadoop at the Center: The Next Generation of Hadoop
Hadoop at the Center: The Next Generation of Hadoop
 
2014 sept 4_hadoop_security
2014 sept 4_hadoop_security2014 sept 4_hadoop_security
2014 sept 4_hadoop_security
 
2014 july 24_what_ishadoop
2014 july 24_what_ishadoop2014 july 24_what_ishadoop
2014 july 24_what_ishadoop
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop1012014 feb 24_big_datacongress_hadoopsession1_hadoop101
2014 feb 24_big_datacongress_hadoopsession1_hadoop101
 
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture
 
2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda2014 feb 5_what_ishadoop_mda
2014 feb 5_what_ishadoop_mda
 
2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop2013 Dec 9 Data Marketing 2013 - Hadoop
2013 Dec 9 Data Marketing 2013 - Hadoop
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMAC
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points2013 march 26_thug_etl_cdc_talking_points
2013 march 26_thug_etl_cdc_talking_points
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012KnittingBoar Toronto Hadoop User Group Nov 27 2012
KnittingBoar Toronto Hadoop User Group Nov 27 2012
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotech
 

Último

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Último (20)

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0

  • 12. MapReduce V2 is a library now… •  MapReduce runs on YARN like all other Hadoop 2.x applications –  Gone are the map and reduce slots, that’s up to containers in YARN now –  Gone is the JobTracker, replaced by the YARN AppMaster library •  Multiple versions of MapReduce –  The older mapred APIs work without modification or recompilation –  The newer mapreduce APIs may need to be recompiled •  Still has one master server component: the Job History Server –  The Job History Server stores the execution of jobs –  Used to audit prior execution of jobs –  Will also be used by YARN framework to store charge backs at that level © Hortonworks Inc. 2013. Confidential and Proprietary. Page 12
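
For orientation, a minimal MRv2 driver written against the newer org.apache.hadoop.mapreduce API looks roughly like the sketch below; the class names, word-count logic and input/output paths are illustrative additions, not part of the original deck.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountV2 {
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The job runs as a YARN application; there is no JobTracker or fixed slot model involved.
    Job job = Job.getInstance(conf, "wordcount-on-yarn");
    job.setJarByClass(WordCountV2.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Submitted with hadoop jar, the driver negotiates containers through the YARN ResourceManager and its ApplicationMaster rather than through map and reduce slots.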
  • 13. Shuffle in MapReduce v2 •  Faster Shuffle –  Better embedded server: Netty •  Encrypted Shuffle –  Secure the shuffle phase as data moves across the cluster –  Requires 2 way HTTPS, certificates on both sides –  Incurs significant CPU overhead, reserve 1 core for this work –  Certs stored on each node (provision with the cluster), refreshed every 10secs •  Pluggable Shuffle Sort –  Shuffle is the first phase in MapReduce that is guaranteed to not be data-local –  Pluggable Shuffle/Sort allows for intrepid application developers or hardware developers to intercept the network-heavy workload and optimize it –  Typical implementations have hardware components like fast networks and software components like sorting algorithms –  API will change with future versions of Hadoop © Hortonworks Inc. 2013. Confidential and Proprietary. Page 13
  • 14. Efficiency Gains of MRv2 •  Key Optimizations –  No hard segmentation of resources into map and reduce slots –  The YARN scheduler is more efficient –  The MRv2 framework is more efficient than MRv1; the shuffle phase performs better thanks to Netty •  Yahoo runs YARN on over 30,000 nodes holding more than 365 PB of data •  They report roughly 400,000 jobs per day, amounting to about 10 million hours of compute time •  They estimate a 60%–150% improvement in node utilization per day •  Yahoo retired an entire colo (a 10,000-node datacenter) because of the increased utilization
  • 15. HDFS v2 in a Nutshell © Hortonworks Inc. 2013. Confidential and Proprietary.
  • 16. HA © Hortonworks Inc. 2013. Confidential and Proprietary. Page 16
  • 17. HDFS Snapshots: Feature Overview •  Admin can create point in time snapshots of HDFS –  Of the entire file system (/root) –  Of a specific data-set (sub-tree directory of file system) •  Restore state of entire file system or data-set to a snapshot (like Apple Time Machine) –  Protect against user errors •  Snapshot diffs identify changes made to data set –  Keep track of how raw or derived/analytical data changes over time © Hortonworks Inc. 2013. Confidential and Proprietary. Page 17
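
As a rough illustration (not from the deck), the same snapshot operations are exposed through the Java FileSystem API. The /data/weblogs path and snapshot name below are hypothetical, and an administrator must first allow snapshots on the directory (hdfs dfsadmin -allowSnapshot).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path dataSet = new Path("/data/weblogs");      // hypothetical data-set directory

    // Take a point-in-time snapshot of the data set (the admin must have run
    // 'hdfs dfsadmin -allowSnapshot /data/weblogs' once beforehand).
    Path snapshot = fs.createSnapshot(dataSet, "before-cleanup");
    System.out.println("Created snapshot at " + snapshot);

    // Snapshots are read-only and addressable under the .snapshot directory.
    for (FileStatus stat : fs.listStatus(new Path(dataSet, ".snapshot/before-cleanup"))) {
      System.out.println(stat.getPath());
    }

    // Recover from a user error by copying files back out of the snapshot, or discard it when done.
    fs.deleteSnapshot(dataSet, "before-cleanup");
  }
}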
  • 18. NFS Gateway: Feature Overview •  NFS v3 standard •  Supports all HDFS commands –  List files –  Copy, move files –  Create and delete directories •  Ingest for large scale analytical workloads –  Load immutable files as source for analytical processing –  No random writes •  Stream files into HDFS –  Log ingest by applications writing directly to HDFS client mount © Hortonworks Inc. 2013. Confidential and Proprietary.
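
To illustrate the "stream files into HDFS" point: once the NFS gateway is mounted, an application can write with ordinary file I/O. The /hdfs mount point and file paths below are assumptions about the local setup, and writes must be sequential because the gateway does not support random writes.

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NfsLogIngest {
  public static void main(String[] args) throws Exception {
    // Hypothetical mount point: the HDFS NFS gateway mounted at /hdfs on this host.
    Path target = Paths.get("/hdfs/landing/app-logs/app-" + System.currentTimeMillis() + ".log");
    Files.createDirectories(target.getParent());

    // Sequential, append-style writes only; no random writes through the gateway.
    try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
      out.write("2013-11-20T19:00:00Z INFO application started");
      out.newLine();
      out.write("2013-11-20T19:00:05Z INFO first batch processed");
      out.newLine();
    }
    System.out.println("Streamed log file into HDFS via the NFS mount: " + target);
  }
}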
  • 19. Federation © Hortonworks Inc. 2013. Confidential and Proprietary. Page 19
  • 20. Managing Namespaces © Hortonworks Inc. 2013. Confidential and Proprietary. Page 20
  • 21. Performance © Hortonworks Inc. 2013. Confidential and Proprietary. Page 21
  • 22. Other Features © Hortonworks Inc. 2013. Confidential and Proprietary. Page 22
  • 23. Apache Tez A New Hadoop Data Processing Framework © Hortonworks Inc. 2013. Confidential and Proprietary. Page 23
  • 24. Moving Hadoop Beyond MapReduce •  Low level data-processing execution engine •  Built on YARN •  Enables pipelining of jobs •  Removes task and job launch times •  Does not write intermediate output to HDFS –  Much lighter disk and network usage •  New base of MapReduce, Hive, Pig, Cascading etc. •  Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline © Hortonworks Inc. 2013. Confidential and Proprietary.
  • 25. Apache Tez as the new Primitive: architecture diagram contrasting the two stacks. In Hadoop 1.0, MapReduce is the base: Pig (data flow), Hive (SQL) and others (Cascading) all compile down to MapReduce jobs running directly on HDFS. In Hadoop 2.0, Tez is the execution engine: Pig, Hive and Cascading run on Tez, alongside Storm for real-time stream processing and HBase/Accumulo for online data processing, all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage).
  • 26. Hive-on-MR vs. Hive-on-Tez: Tez avoids unneeded writes to HDFS. Example query:
     SELECT a.x, AVERAGE(b.y) AS avg FROM a JOIN b ON (a.id = b.id) GROUP BY a
     UNION
     SELECT x, AVERAGE(y) AS AVG FROM c GROUP BY x
     ORDER BY AVG;
  The side-by-side DAGs on the slide show Hive-on-MR splitting the plan into a chain of MapReduce jobs (the component SELECTs, the JOINs, then the GROUP BY with COUNT and AVERAGE), writing intermediate results to HDFS between jobs, whereas Hive-on-Tez executes the same plan as a single pipelined DAG without the intermediate HDFS writes.
  • 27. Apache Tez ("Speed") •  Replaces MapReduce as the primitive for Pig, Hive, Cascading, etc. –  Lower latency for interactive queries –  Higher throughput for batch queries –  22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft •  A Tez task is built from pluggable Input, Processor and Output components (Tez Task = <Input, Processor, Output>), and a YARN ApplicationMaster runs a DAG of Tez tasks.
  • 28. Tez: Building blocks for scalable data processing. A classical "Map" is expressed as HDFS Input → Map Processor → Sorted Output; a classical "Reduce" as Shuffle Input → Reduce Processor → HDFS Output; and an intermediate "Reduce" for Map-Reduce-Reduce chains as Shuffle Input → Reduce Processor → Sorted Output.
  • 29. Hive © Hortonworks Inc. 2013. Confidential and Proprietary. 29
  • 30. SQL Compliance: Enhancing SQL Semantics. Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.
  –  Hive SQL datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, DATE, VARCHAR, CHAR
  –  Hive SQL semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; windowing functions (OVER, RANK, etc.); custom Java UDFs; standard aggregation (SUM, AVG, etc.); advanced UDFs (ngram, XPath, URL); sub-queries in WHERE and HAVING; expanded JOIN syntax; SQL-compliant security (GRANT, etc.); INSERT/UPDATE/DELETE (ACID)
  –  The slide marks each item as already available, new in Hive 0.12, or on the roadmap.
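
A hedged sketch of what that SQL surface looks like from a client: the snippet below runs a windowing query over JDBC against HiveServer2. The host, port, credentials and the sales table are assumptions; the org.apache.hive.jdbc.HiveDriver class and the jdbc:hive2:// URL scheme are the standard HiveServer2 JDBC coordinates.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWindowingQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; hive-jdbc and its dependencies must be on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Hypothetical HiveServer2 endpoint, user and table.
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default", "hdfs", "");
         Statement stmt = conn.createStatement()) {

      // Windowing functions (OVER, RANK) are part of the SQL semantics listed above.
      ResultSet rs = stmt.executeQuery(
          "SELECT state, item_id, price, " +
          "       RANK() OVER (PARTITION BY state ORDER BY price DESC) AS price_rank " +
          "FROM sales");
      while (rs.next()) {
        System.out.printf("%s %s %.2f rank=%d%n",
            rs.getString("state"), rs.getString("item_id"),
            rs.getDouble("price"), rs.getInt("price_rank"));
      }
    }
  }
}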
  • 31. SPEED: Increasing Hive Performance Interactive Query Times across ALL use cases •  Simple and advanced queries in seconds •  Integrates seamlessly with existing tools •  Currently a >100x improvement in just nine months Performance Improvements included in Hive 12 –  Base & advanced query optimization –  Startup time improvement –  Join optimizations © Hortonworks Inc. 2013. Confidential and Proprietary.
  • 32. Apache Tez as the new Primitive: the MapReduce-as-base vs. Tez-as-base architecture diagram repeated from slide 25.
  • 33. Hive-on-MR vs. Hive-on-Tez: the query and DAG comparison repeated from slide 26.
  • 34. Tez on YARN: cluster diagram in which the ResourceManager's scheduler places containers across NodeManagers for several frameworks at once: Hive/Tez vertices (interactive SQL), MapReduce map and reduce tasks (batch), and Storm nimbus workers (real-time), all sharing the same YARN cluster.
  • 35. Apache Falcon Data Lifecycle Management for Hadoop © Hortonworks Inc. 2013. Confidential and Proprietary.
  • 36. Data Lifecycle on Hadoop is Challenging. Data management needs (data processing, replication, retention, scheduling, reprocessing, multi-cluster management) are spread across a set of separate tools (Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs). Problem: a patchwork of tools complicates data lifecycle management. Result: long development cycles and quality challenges.
  • 37. Falcon: One-stop Shop for Data Lifecycle. Apache Falcon provides the data management layer (processing, replication, retention, scheduling, reprocessing, multi-cluster management) and orchestrates the same underlying tools (Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs). Falcon provides a single interface to orchestrate the data lifecycle, so sophisticated DLM is easily added to Hadoop applications.
  • 38. Falcon Core Capabilities •  Core Functionality –  Pipeline processing –  Replication –  Retention –  Late data handling •  Automates –  Scheduling and retry –  Recording audit, lineage and metrics •  Operations and Management –  Monitoring, management, metering –  Alerts and notifications –  Multi Cluster Federation •  CLI and REST API © Hortonworks Inc. 2013. Confidential and Proprietary.
  • 39. Falcon At A Glance. Data processing applications sit on top of the Falcon data management framework, which supplies data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management and SLA management.
  >  Falcon offers a high-level abstraction of key services for Hadoop data management needs.
  >  Complex data processing logic is handled by Falcon instead of hard-coded in data processing apps.
  >  Falcon enables faster development of ETL, reporting and other data processing apps on Hadoop.
  • 40. Falcon Example: Replication. Diagram: the primary cluster's pipeline runs staged data through cleansed, conformed and access data sets, while Falcon replicates the staged and processed data to a failover cluster.
  >  Falcon manages workflow and replication.
  >  Enables business continuity without requiring full data representation.
  >  Failover clusters can be smaller than primary clusters.
  • 41. Falcon Example: Retention. Diagram: each data set in the pipeline (staged, cleansed, conformed, access) carries its own retention policy, ranging from 20 years and 3 years down to keeping only the last copy.
  >  Sophisticated retention policies expressed in one place.
  >  Simplify data retention for audit, compliance, or for data re-processing.
  • 42. Falcon Example: Late Data Handling. Diagram: online transaction data (via Sqoop) and web log data (via FTP) are staged and joined into a combined dataset, with the process waiting up to 4 hours for the FTP data to arrive.
  >  Processing waits until all required input data is available.
  >  Checks for late data arrivals and retriggers processing as necessary.
  >  Eliminates writing complex data handling rules within applications.
  • 43. Examples © Hortonworks Inc. 2013. Confidential and Proprietary. Page 43
  • 44. Example: Cluster Specification
  <?xml version="1.0"?>
  <!-- My Local Cluster specification -->
  <cluster colo="my-local-cluster" description="" name="cluster-alpha">
    <interfaces>
      <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
      <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
      <interface type="execute" endpoint="rm:8050" version="2.2.0" />
      <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
      <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
    </interfaces>
    <locations>
      <location name="staging" path="/apps/falcon/cluster-alpha/staging" />
      <location name="temp" path="/tmp" />
      <location name="working" path="/apps/falcon/cluster-alpha/working" />
    </locations>
  </cluster>
  (The readonly and write interfaces point at the NameNode, the execute interface at the ResourceManager, and the workflow interface at the Oozie server.)
  • 45. Example: Weblogs Replication and Retention © Hortonworks Inc. 2013. Confidential and Proprietary. Page 45
  • 46. Example 1: Weblogs •  Weblogs land hourly in my primary cluster •  HDFS location is /weblogs/{date} •  I want to: –  Evict weblogs from primary cluster after 1 day © Hortonworks Inc. 2013. Confidential and Proprietary. Page 46
  • 47. Feed Specification 1: Weblogs
  <feed description="" name="feed-weblogs1" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <clusters>
      <cluster name="cluster-primary" type="source">   <!-- cluster where the data is located -->
        <validity start="2013-10-24T00:00Z" end="2014-12-31T00:00Z"/>
        <retention limit="days(1)" action="delete"/>   <!-- retention policy: 1 day -->
      </cluster>
    </clusters>
    <locations>
      <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}" />   <!-- location of the data -->
    </locations>
    <ACL owner="hdfs" group="users" permission="0755" />
    <schema location="/none" provider="none"/>
  </feed>
  • 48. Example 2: Weblogs •  Weblogs land hourly in my primary cluster •  HDFS location is /weblogs/{date} •  I want to: –  Replicate weblogs to my secondary cluster –  Evict weblogs from primary cluster after 2 days –  Evict weblogs from secondary cluster after 1 week © Hortonworks Inc. 2013. Confidential and Proprietary. Page 48
  • 49. Feed Specification 2: Weblogs
  <feed description="" name="feed-weblogs2" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <clusters>
      <cluster name="cluster-primary" type="source">     <!-- cluster where the data is located -->
        <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
        <retention limit="days(2)" action="delete"/>     <!-- retention policy: 2 days -->
      </cluster>
      <cluster name="cluster-secondary" type="target">   <!-- cluster where the data will be replicated -->
        <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
        <retention limit="days(7)" action="delete"/>     <!-- retention policy: 1 week -->
      </cluster>
    </clusters>
    <locations>
      <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}" />   <!-- location of the data -->
    </locations>
    <ACL owner="hdfs" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
  </feed>
  • 50. Example 3: Weblogs •  Weblogs land hourly in my primary cluster •  HDFS location is /weblogs/{date} •  I want to: –  Replicate weblogs to a discovery cluster –  Replicate weblogs to a BCP cluster –  Evict weblogs from primary cluster after 2 days –  Evict weblogs from discovery cluster after 1 week –  Evict weblogs from BCP cluster after 3 months © Hortonworks Inc. 2013. Confidential and Proprietary. Page 50
  • 51. Feed Specification 3: Weblogs
  <feed description="" name="feed-weblogs" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <clusters>
      <cluster name="cluster-primary" type="source">
        <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
        <retention limit="days(2)" action="delete"/>
      </cluster>
      <cluster name="cluster-discovery" type="target">
        <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
        <retention limit="days(7)" action="delete"/>
        <locations>   <!-- cluster-specific location -->
          <location type="data" path="/projects/recommendations/${YEAR}-${MONTH}-${DAY}-${HOUR}" />
        </locations>
      </cluster>
      <cluster name="cluster-bcp" type="target">
        <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
        <retention limit="months(3)" action="delete"/>
        <locations>   <!-- cluster-specific location -->
          <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}" />
        </locations>
      </cluster>
    </clusters>
    <locations>
      <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}" />
    </locations>
    <ACL owner="hdfs" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
  </feed>
  • 52. Apache Knox Secure Access to Hadoop © Hortonworks Inc. 2013. Confidential and Proprietary.
  • 53. Connecting to the Cluster: Edge Nodes •  What is an edge node? –  A node in a DMZ that has access to the cluster, and is often the only way to reach it –  Hadoop client APIs and MR/Pig/Hive jobs are executed from these edge nodes –  Users SSH to the edge node, upload their job artifacts, and then run API calls and commands from the shell (user → SSH → edge node → Hadoop) •  Challenges –  SSH, edge node, and job maintenance become a nightmare –  Difficult to integrate with applications
  • 54. Connecting to the Cluster: REST APIs
  –  WebHDFS: HDFS user operations, including reading and writing files, making directories, changing permissions and renaming
  –  WebHCat: job control for MapReduce, Pig and Hive jobs, plus HCatalog DDL commands
  –  Oozie: job submission and management, and Oozie administration
  •  Useful for connecting to Hadoop from outside the cluster •  When more client language flexibility is required –  e.g. when a Java binding is not an option •  Challenges –  Client must have knowledge of the cluster topology –  Requires opening ports (and in some cases, on every host) outside the cluster
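
For example, a direct WebHDFS call from outside the cluster might look like the sketch below; the NameNode hostname, port and user are assumptions. It also illustrates the challenge just mentioned: the client has to know which host runs the NameNode and its HTTP port has to be reachable from outside the cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WebHdfsListStatus {
  public static void main(String[] args) throws Exception {
    // Direct WebHDFS call against the NameNode's HTTP port (hostname and user are assumptions).
    URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/weblogs?op=LISTSTATUS&user.name=hdfs");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);   // JSON FileStatuses payload
      }
    }
  }
}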
  • 55. Apache Knox Gateway – Perimeter Security: simplified access and centralized security •  Single Hadoop access point •  Rationalized REST API hierarchy •  Eliminate SSH "edge node" •  LDAP and Active Directory auth •  Consolidated API calls •  Multi-cluster support •  Central API management + audit •  Client DSL
  • 56. Knox Gateway Network Architecture. Diagram: REST clients, JDBC clients and browsers outside the firewall reach a stateless cluster of Knox Gateway reverse-proxy instances deployed in the DMZ; the gateway authenticates users against identity providers (Kerberos or an enterprise identity provider, and enterprise/cloud SSO), then streams requests through to the services of one or more secure Hadoop clusters (WebHDFS, WebHCat, Oozie, HBase, Hive, YARN, Ambari/Hue), rewriting URLs so that clients only ever see the gateway.
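
The same directory listing routed through Knox might look like the sketch below; the gateway hostname, the "default" topology name and the LDAP credentials are deployment-specific assumptions. Only the gateway's HTTPS port is exposed, and in a test setup the gateway's (often self-signed) certificate typically has to be added to the client's trust store.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KnoxWebHdfsListStatus {
  public static void main(String[] args) throws Exception {
    // Same LISTSTATUS call, but routed through the Knox gateway in the DMZ;
    // the URL refers only to the gateway, never to the NameNode itself.
    URL url = new URL("https://knox.example.com:8443/gateway/default/webhdfs/v1/weblogs?op=LISTSTATUS");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // Knox authenticates the caller (e.g. against LDAP/AD) before proxying to the cluster.
    String credentials = Base64.getEncoder()
        .encodeToString("analyst:secret".getBytes(StandardCharsets.UTF_8));
    conn.setRequestProperty("Authorization", "Basic " + credentials);

    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}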
  • 57. Wot no 2.2.0? Where can I get the Hadoop 2.2.0 fix? © Hortonworks Inc. 2013. Confidential and Proprietary. Page 57
  • 58. Like the Truth, Hadoop 2.2.0 is out there…
  Component        | HDP2.0 | CDH4       | CDH5 Beta | Intel IDH3.0 | MapR 3 | IBM Big Insights 2.1
  Hadoop Common    | 2.2.0  | 2.0.0      | 2.2.0     | 2.0.4        | N/A    | 1.1.1
  Hive + HCatalog  | 0.12   | 0.10 + 0.5 | 0.11      | 0.10 + 0.5   | 0.11   | 0.9 + 0.4
  Pig              | 0.12   | 0.11       | 0.11      | 0.10         | 0.11   | 0.10
  Mahout           | 0.8    | 0.7        | 0.8       | 0.8          | 0.8    | N/A
  Flume            | 1.4.0  | 1.4.0      | 1.4.0     | 1.3.0        | 1.4.0  | 1.3.0
  Oozie            | 4.0.0  | 3.3.2      | 4.0.0     | 3.3.0        | 3.3.2  | 3.2.0
  Sqoop            | 1.4.4  | 1.4.3      | 1.4.4     | 1.4.3        | 1.4.4  | 1.4.2
  HBase            | 0.96.0 | 0.94.6     | 95.2      | 0.94.7       | 94.9   | 0.94.3
  • 59. Thank You THUG Life © Hortonworks Inc. 2013. Confidential and Proprietary.