Hadoop Now, Next & Beyond
1. Hadoop Now, Next & Beyond
Eric Baldeschwieler, Hortonworks CTO
June 13, 2012
2. Hadoop Summit Is BIG!
10x growth in 5 years!
• 2008: First Summit, 200+ people
• 2011 Summit: 1600+ people
• 2012 Summit: 2200+ people
3. Timeline: Apache Hadoop 1.0 & 2.0
[Timeline figure, 2008 to 2012: each release moves through DEV, QA, and alpha/beta stages]
• HADOOP 1.0: The most stable release
– 3 years of stabilization & key features
– Releases shown: 0.20.1, 0.20.2, 0.20.1xx, 0.20.2xx, 1.0 (GA)
– Features shown: Security, MR multi-tenancy, Append
• HADOOP 2.0: Next-gen MapReduce & HDFS 2.0
– Exciting community innovations under development
– Releases shown: 0.21, 0.22, 0.23 (alpha), 2.0
– Features shown: New Append, Security, Federation, YARN, HA, Wire Compatibility
4. Hadoop 1.0 Key Features
• Flush / Sync for HBase
– 1.0 is the first Apache Hadoop release to support HBase
– This work began in 0.18 in 2008!
– Benefit: Interactive apps – Web site personalization
• Security – Strong authentication via Kerberos
– Benefit: Audit compliance, multi-tenancy
• MapReduce limits
– Solve the whack-a-mole problem of badly behaved user jobs
– Benefit: Reliability, multi-tenancy
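To illustrate why Flush / Sync matters for a store like HBase, here is a toy sketch in Python. It assumes a log file whose writes sit in a buffer until an explicit sync makes them durable. The class `ToyLogFile` and its methods are hypothetical names for illustration only; this is not the HDFS or HBase API.

```python
class ToyLogFile:
    """Toy model of a file whose writes are buffered until sync() is called."""

    def __init__(self):
        self._durable = []  # records that survive a crash and are visible to readers
        self._buffer = []   # records written but not yet synced

    def write(self, record):
        # A write only lands in the in-memory buffer.
        self._buffer.append(record)

    def sync(self):
        # Make everything written so far durable and visible.
        self._durable.extend(self._buffer)
        self._buffer.clear()

    def crash(self):
        # Simulate a writer crash: unsynced records are lost.
        self._buffer.clear()

    def read_all(self):
        return list(self._durable)


def demo():
    log = ToyLogFile()
    log.write("put row1")
    log.sync()              # row1 is now safe
    log.write("put row2")   # never synced
    log.crash()
    return log.read_all()
```

In this sketch only the synced record survives the simulated crash, which is the guarantee an interactive application needs before it acknowledges a write.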
5. Hadoop 2.0 Innovations
• Focus on Scale and Community Innovation
– YARN and Federation designed to support clusters of 10,000+ computers
• YARN: Scalable, Pluggable Execution Frameworks
– Improves MapReduce performance
– Will support community development of new frameworks
– Near real-time, Machine learning & Analytics use cases
• Federation: Scalable, Pluggable Storage
– Isolation via multiple volumes / NameNodes
– Shared block pool w/ pluggable volume managers
• Always On: No Cluster Downtime
– Wire compatible APIs (protobufs)
– HDFS hot standby HA
– Rolling upgrades
– Log & checkpoint management
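One way to picture "isolation via multiple volumes / NameNodes" is a client-side table that maps path prefixes to the NameNode owning that part of the namespace. The Python sketch below assumes longest-prefix matching on path components. The table contents, the NameNode names, and the function `resolve` are hypothetical, and this is not the actual HDFS Federation implementation.

```python
# Hypothetical mount table: path prefix -> NameNode that owns that volume.
MOUNT_TABLE = {
    "/user": "namenode-a",
    "/data": "namenode-b",
    "/data/logs": "namenode-c",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Return the NameNode for path using the longest matching prefix."""
    best = None
    for prefix, namenode in mount_table.items():
        # Match whole path components only, so "/database" does not match "/data".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, namenode)
    if best is None:
        raise KeyError(f"no volume mounted for {path}")
    return best[1]
```

Because each prefix is served by its own NameNode in this sketch, heavy load on `/data/logs` would not touch the NameNode serving `/user`.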
6. Balancing Innovation & Stability
[Graphic: INNOVATION vs. STABILITY]
Source: The above graphic is based on concepts from Geoffrey Moore’s book, Crossing the Chasm
8. HDP 1.0 Highlights
1 Pure Apache Hadoop 1.0 code line, 100% open source
2 Open source Management & Monitoring via Ambari
3 Common Metadata Services via HCatalog
4 Enterprise Data Integration with Talend Open Studio
5 Multi-tenant Protections via Capacity Scheduler
6 Full Stack HA via proven 3rd party products
9. Management & Monitoring Services -> Ambari
• Powerful monitoring and alerting dashboards
– View topology, health & utilization of cluster
– Detailed view of cluster operations, server & storage utilization, job status, and performance levels
– Get alerts on critical events
• Simple installation & provisioning
– Easy configuration process
– One-click deployment for clusters of all sizes
– Analyzes/recommends optimal services configuration
– Automatically configures mount points in the cluster
10. Full Stack High Availability
Proven HA solutions with proven Hadoop 1.0
• Failover and restart for
– NameNode
– JobTracker
– Other services to come…
• Open API allows use of proven HA from multiple vendors
• Minimized changes to clients and configuration
• Auto-detects failures:
– Services, OS & Hardware
• Complementary to 2.0 HA efforts
[Figure: HA Cluster]
11. The Road Ahead
• Ambari
– REST APIs & general hardening
– Integrations w/ enterprise & cloud management solutions
• HCatalog
– ODBC / JDBC, security, relaxed schemas (AVRO, JSON…)
– More REST APIs and Integrations with 3rd party data stores
• Full Stack HA
– Continued work with virtualization & operating system vendors
• Native Windows support
– Integrations with broader Windows ecosystem of systems/tools
12. Welcome to the Hadoop Summit!
Enjoy
Help grow the ecosystem!