5. Releases slowed with Hadoop take-up
0.20.0 0.20.1 0.20.2 0.21.0 0.20.20{3,4,5}.0
• 64 Releases
• Branches from the last 2.5 years:
– 0.20.{0,1,2} – Stable release without security
– 0.20.2xx.y – Stable release with security
– 0.21.0 – released, unstable, deprecated
– 0.22.0 – orphan, unstable, lack of community
6. Now: two release branches, one dev
Hadoop 1.x
• Stable, used in production systems
• The one to use today
Hadoop 2.0
• The successor
• Not quite ready for use
Hadoop 2.x "trunk"
• Where features & fixes first go in
• If you want to help, start here
Picking out what is really new in this release, as opposed to merges and stability fixes, WebHDFS is the interesting one. Set one config option and the DNs and NNs become web servers (using the chosen auth mechanism), offering read and write access to the data. This is integral to the cluster: you ask the NN for data, which triggers a 307 redirect to a DN holding the data, which serves it up. The redirect is handled transparently by any HTTP client set up to follow redirects.
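To make the read flow concrete, here is a minimal sketch of the URL a client sends to the NameNode. The hostname, port, path and user are placeholders, not a live cluster.

```python
# Sketch of the WebHDFS read flow: the client asks the NN to OPEN a file,
# the NN replies with a 307 redirect to a DN holding the data.
# "namenode.example.com" etc. below are illustrative placeholders.
from urllib.parse import urlencode

def webhdfs_open_url(namenode, path, user):
    """Build the URL a client sends to the NameNode to read a file.

    The NN answers with a 307 redirect to a DataNode; any HTTP client
    that follows redirects then streams the bytes from the DN.
    """
    query = urlencode({"op": "OPEN", "user.name": user})
    return "http://%s/webhdfs/v1%s?%s" % (namenode, path, query)

url = webhdfs_open_url("namenode.example.com:50070", "/logs/day1.log", "alice")
print(url)
```

Against a real cluster, something like `curl -L` on that URL would follow the redirect and fetch the data directly from the DataNode.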
This is what we're going to be shipping based on Hadoop 1.0: a packaging of the core Hadoop stack with management tooling.
There's a set of NoSQL databases running on or near Hadoop. Apache HBase is the key one; look at the Facebook papers on FB chat to see how this works in the field. Cassandra is not directly dependent on Hadoop, but you can run Pig and Hive queries against its data, and it implements the HDFS filesystem API, so you can host TTs on the same nodes as your Cassandra data and get data-local work. Accumulo deserves a mention as it is in incubation, donated to the ASF by the NSA. Apparently it has good security on access to keys and values, which shows that some orgs put security ahead of other features in NoSQL-land, and that government orgs are starting to play in this space and contribute code back.
Don't write at the Java level if you can help it; both Pig and Hive are a lot more productive. SQL houses should play with Hive. Pig is very good for experimentation, and its ability to call User Defined Functions lets you re-use tuned Java libraries, such as LinkedIn's DataFu.
Lots of ways to get data in. Most are focused on streaming from other servers in the same datacentre, like web servers, and collecting the logs. Scribe is designed to scale up well, with the option of discarding data under heavy load. Kafka is from LinkedIn: nice code which can hook up behind log4j.
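The collection pattern these tools share can be sketched in a few lines: batch incoming log lines, ship a batch when it fills, and drop data rather than block when the buffer is full (Scribe's behaviour under heavy load). All names here are illustrative, not any tool's real API.

```python
# Toy sketch of the log-collector pattern behind Scribe/Kafka pipelines:
# batch lines, flush full batches downstream, discard under load.
from collections import deque

class LogShipper:
    def __init__(self, send, batch_size=100, max_buffer=1000):
        self.send = send              # callable that delivers a batch downstream
        self.batch_size = batch_size
        self.max_buffer = max_buffer
        self.buffer = deque()
        self.dropped = 0

    def append(self, line):
        if len(self.buffer) >= self.max_buffer:
            self.dropped += 1         # discard under load rather than block
            return
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            batch = list(self.buffer)
            self.buffer.clear()
            self.send(batch)

batches = []
shipper = LogShipper(batches.append, batch_size=2)
for line in ["GET /a", "GET /b", "GET /c"]:
    shipper.append(line)
shipper.flush()
print(batches)   # [['GET /a', 'GET /b'], ['GET /c']]
```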
If you are doing anything with social networks, connecting events and locations together, etc., the graph layer should be of interest; it's up and coming as the next layer in the stack. There are two projects in the Apache incubator. Hama: a graph layer with a big driver being a telco. Giraph: ex-Y!, and LinkedIn are using this. There's a workshop after Berlin Buzzwords on "beyond MR" that I'm co-organising; Giraph will be one of the topics there (along with YARN and Stratosphere).
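The vertex-centric BSP model these projects use is easy to sketch: each superstep, every vertex reads its incoming messages, updates its value, and messages its neighbours; the job halts when no messages are in flight. This toy single-process version uses the classic maximum-value example; it is an illustration of the model, not Giraph's or Hama's API.

```python
# Toy sketch of the vertex-centric BSP model (Giraph/Hama style):
# supersteps of read-messages / update-value / message-neighbours,
# halting when no messages remain in flight.
def propagate_max(graph, values):
    """graph: vertex -> list of neighbours; values: vertex -> number."""
    values = dict(values)
    # superstep 0: every vertex sends its value to its neighbours
    inbox = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            inbox[n].append(values[v])
    # later supersteps: only vertices whose value grew keep messaging
    while any(inbox.values()):
        next_inbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in graph[v]:
                    next_inbox[n].append(values[v])
        inbox = next_inbox
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(propagate_max(graph, {"a": 1, "b": 5, "c": 2}))
# every vertex converges on the global maximum, 5
```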
Hadoop
This is the architecture of HDFS HA, skipping some of the details and the roadmap of when features come out. It's active/standby HA, not shared-write (which is much, much harder). Failover is initially manual, moving to automated. Failover controllers monitor NN health and heartbeat to ZK so that others in the ZK farm can detect failures. DNs report to both NNs, but only listen to one.
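The failover decision itself is simple to sketch. In a real deployment ZooKeeper ephemeral nodes carry the heartbeats; here a shared dict of timestamps stands in for the ZK farm, and all class and node names are invented for illustration.

```python
# Sketch of the HA failover decision: controllers heartbeat on behalf of
# their NN; the standby's controller promotes when the active misses its
# deadline. A shared dict stands in for ZooKeeper here.
class FailoverController:
    def __init__(self, node, heartbeats, timeout=5.0):
        self.node = node              # the NameNode this controller monitors
        self.heartbeats = heartbeats  # shared store, ZK stand-in
        self.timeout = timeout

    def beat(self, now):
        """Record that our NN was healthy at time `now`."""
        self.heartbeats[self.node] = now

    def active_is_dead(self, active, now):
        """True once the active NN has missed its heartbeat deadline."""
        last = self.heartbeats.get(active)
        return last is None or (now - last) > self.timeout

heartbeats = {}
active_fc = FailoverController("nn-active", heartbeats)
standby_fc = FailoverController("nn-standby", heartbeats)
active_fc.beat(now=100.0)
print(standby_fc.active_is_dead("nn-active", now=103.0))  # False: within deadline
print(standby_fc.active_is_dead("nn-active", now=110.0))  # True: missed deadline
```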
Hadoop 2 breaks the JT up into two parts: the Resource Manager, which manages allocation of resources on servers, and the JT itself, which now becomes one of the possible "Application Masters" that can be deployed in a cluster. Breaking this up allows you to run different JTs for different users and different versions of the MR APIs (Facebook do this in their clusters with a static striping of TTs today), and to run other topology-aware applications.
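The essence of the split can be sketched as one allocator serving several independent masters. The classes, node names and application names below are all made up for illustration; the point is that two different "JobTrackers" draw containers from the same pool.

```python
# Sketch of the RM/AM split: one ResourceManager hands out container
# slots on nodes; several ApplicationMasters (e.g. two MR versions)
# request them independently. Names are illustrative only.
class ResourceManager:
    def __init__(self, nodes):
        self.free = dict(nodes)       # node -> free container slots

    def allocate(self, app, n):
        """Grant up to n containers to an application master."""
        granted = []
        for node, slots in self.free.items():
            while slots > 0 and len(granted) < n:
                slots -= 1
                granted.append((node, app))
            self.free[node] = slots
            if len(granted) == n:
                break
        return granted

rm = ResourceManager({"node1": 2, "node2": 2})
mr_v1 = rm.allocate("mapreduce-v1-am", 3)   # one MR version's master...
mr_v2 = rm.allocate("mapreduce-v2-am", 1)   # ...alongside another
print(mr_v1)
print(mr_v2)
```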
The NoSQL business plan is a key issue here: politics and marketing, not technology. DB business pricing always put an upper financial limit on big data. Oracle liked to own the customer data (and had loyal DBA support). The move to vertical solutions promised best hardware and discounting opportunities, but removed flexibility ("the IBM model"). Hadoop challenges this: generic servers with many HDDs, open source software. They will need to add something to Hadoop/HDFS that stops you moving away or getting support from others. Looking at the hardware, that could either be very-low-latency IPC (benefits?) or something integrating SSDs into the system (preheating SSD caches with queued job data, …?). Closing on a brighter note, my colleagues and I have tales of terror from playing with JVM options on a big cluster, as you can be confident of reaching all corner cases within a short period of time. If Oracle start using Hadoop as a driver for JVM performance and qualification, and return those tweaks to OpenJDK, we all benefit.
Last but very much not least, there's growing integration of Hadoop in the OSS world at the app level. Spring has a Spring Data for Hadoop project in beta, which lets you integrate HDFS, MR and Pig jobs within a Spring application, as well as Cascading. You can do workflows here and really integrate with enterprise apps, especially if you use Spring already. Cascading, the Hadoop workflow language, has moved to an Apache License, to remove worries about GPL contamination of your code. Also of interest is the fact that the Linux vendors are taking Hadoop seriously, which can only improve testing and stability of Hadoop. Finally, off this sheet: an R connector for Hadoop, so the statisticians get integration from their world, R, to the new datasets.
Facebook, Prineville, 45MB, one single cluster. Yahoo!, 180 PB. It means that Hadoop installations are becoming the largest known storage and compute systems on the planet. It's unlikely that anyone in this audience has storage or bandwidth requirements that big, but for those in the audience who want theirs to become that big, Hadoop makes it possible, both technically and financially.
The other thing it means is this: nothing else has the momentum and the support. People may say "ours is better", but that's like saying Solaris was better than Linux, or the 68K was better than the Intel 8086. Better doesn't win; more valuable does, and because of its growing support, layers above, adoption and ecosystem, Hadoop has the edge. This isn't an excuse to get complacent: Spring killed Java EE, even though EJB once had everything going for it.