2. Starting Out
(You): “Let’s build a Hadoop cluster!”
(image: http://www.iccs.inf.ed.ac.uk/~miles/code.html)
3. Starting Out
(You)
(image: http://www.iccs.inf.ed.ac.uk/~miles/code.html)
4. Where you want to be
(You)
Yahoo! Hadoop Cluster (2007)
5. What is Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license).
Core Hadoop has two main components:
- Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
- MapReduce: fault-tolerant distributed processing (see the sketch below)
Key value:
- Flexible -> store data without a schema and add one later as needed
- Affordable -> cost per TB at a fraction of traditional options
- Broadly adopted -> a large and active ecosystem
- Proven at scale -> dozens of petabyte-plus implementations in production today
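For readers new to the MapReduce half, here is a minimal sketch of the canonical word-count job, assuming Hadoop's "new" org.apache.hadoop.mapreduce API (0.20+). The class names and paths are illustrative, not from the deck.

```java
// Minimal word-count sketch using the org.apache.hadoop.mapreduce ("new") API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word. The framework handles the
  // shuffle, sort, and re-execution of failed tasks.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```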
10. Supported: Cloudera employs project founders and committers for >70% of components.
11. Overview
- Proper planning
- Data ingestion
- ETL and data processing
- Infrastructure
- Authentication, authorization, and sharing
- Monitoring
12. The production data platform
- Data storage
- ETL / data processing / analysis infrastructure
- Data ingestion infrastructure
- Integration with tools
- Data security and access control
- Health and performance monitoring
13. Proper planning
Know your use cases!
- Log transformation and aggregation
- Text mining, information retrieval
- Analytics
- Machine learning
The use cases are critical to proper configuration of:
- Hadoop
- Network
- OS
Resource utilization and deep job insight will tell you more.
14. HDFS Concerns
NameNode availability
- HA is tricky; consider where Hadoop lives in the overall system
- Manual recovery can be simple, fast, and effective
Backup strategy
- NameNode metadata: hourly, with roughly 2-day retention
- User data:
  - Log-shipping style strategies
  - DistCp
  - “Fan out” to multiple clusters on ingestion (see the sketch below)
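Log shipping, DistCp, and fan-out all come down to "copy what is new to the other cluster." Below is a hedged sketch of that idea using the HDFS FileSystem API; in production you would normally run DistCp, which parallelizes the copy as a MapReduce job. The cluster URIs, port, and paths are placeholders.

```java
// Sketch: copy a day's ingested files from the primary cluster to a backup
// cluster. Cluster URIs and paths are placeholders; a real deployment would
// usually use 'hadoop distcp' instead.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class LogShipper {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    FileSystem primary = FileSystem.get(URI.create("hdfs://nn-primary:8020"), conf);
    FileSystem backup  = FileSystem.get(URI.create("hdfs://nn-backup:8020"), conf);

    Path srcDir = new Path("/data/logs/2010-11-01");
    Path dstDir = new Path("/backup/logs/2010-11-01");

    // Copy each file that exists on the primary but not yet on the backup,
    // so the job can be re-run safely after a partial failure.
    for (FileStatus status : primary.listStatus(srcDir)) {
      Path dst = new Path(dstDir, status.getPath().getName());
      if (!backup.exists(dst)) {
        FileUtil.copy(primary, status.getPath(), backup, dst,
            false /* don't delete the source */, conf);
      }
    }
  }
}
```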
15. Data Ingestion
Many data sources:
- Streaming data sources (mostly log files)
- RDBMS
- EDW
- Files (usually exports from third parties)
This is a common place we see DIY; you probably shouldn’t. Use Sqoop, Flume, or Oozie (but I’m biased).
No matter what: make ingestion fault tolerant, performant, and monitored. (A sketch of the DIY approach follows.)
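To make the "you probably shouldn't DIY" point concrete, here is a hedged sketch of what a hand-rolled ingester typically boils down to. The paths are placeholders; note that none of the requirements above (fault tolerance, throughput, monitoring) come for free, which is exactly what tools like Flume and Sqoop provide.

```java
// Sketch of a DIY ingester: stream one local file into HDFS. There is no
// retry, no checkpointing, no partial-write handling, and no monitoring.
// All paths are placeholders.
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class NaiveIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);  // default filesystem from core-site.xml

    InputStream in = new FileInputStream("/var/log/app/access.log");
    OutputStream out = fs.create(new Path("/incoming/access.log.2010-11-01"));
    try {
      IOUtils.copyBytes(in, out, 4096, false);  // no retry, no failure handling
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}
```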
16. ETL and Data Processing
- Non-interactive jobs
- Establish a common directory structure for processes
- You need tools to handle complex chains of jobs
- Workflow tools support:
  - Job dependencies and error handling
  - Tracking
  - Invocation based on time or events
- Most common mistake: depending on jobs always completing successfully, or within a window of time. Monitor for SLA rather than pray.
- Defensive coding practices apply just as they do everywhere else (see the sketch below).
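A hedged sketch of that defensive pattern: every step's completion status is checked, and a failure stops the chain with a non-zero exit so a workflow engine or monitor can catch it, instead of downstream jobs silently reading missing data. The job builder methods are placeholders, not code from the deck.

```java
// Sketch: chain two dependent MapReduce jobs defensively. Job configuration
// details are elided; the builders below are stand-ins.
import org.apache.hadoop.mapreduce.Job;

public class NightlyEtl {
  public static void main(String[] args) throws Exception {
    Job cleanse = buildCleanseJob();     // placeholder: set mapper/reducer, input/output paths
    if (!cleanse.waitForCompletion(true)) {
      System.err.println("cleanse step failed; aborting pipeline");
      System.exit(1);                    // never assume the next step's input exists
    }

    Job aggregate = buildAggregateJob(); // placeholder: reads the cleanse step's output
    if (!aggregate.waitForCompletion(true)) {
      System.err.println("aggregate step failed; aborting pipeline");
      System.exit(1);
    }
    System.exit(0);
  }

  // Stand-ins for real job configuration.
  private static Job buildCleanseJob() throws Exception { return new Job(); }
  private static Job buildAggregateJob() throws Exception { return new Job(); }
}
```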
17. Metadata Management
Tool-independent metadata about:
- Data sets we know about and their location (on HDFS)
- Schemata
- Authorization (currently HDFS permissions only)
- Partitioning
- Format and compression
- Guarantees (consistency, timeliness, whether duplicates are permitted)
Currently still DIY in many ways, and tool-dependent; most people rely on prayer and hard coding. (H)OWL is interesting.
18. Authentication and authorization
Authentication
- Don’t talk to strangers
- Should integrate with existing IT infrastructure
- Yahoo!’s security (Kerberos) patches are now part of CDH3b3
Authorization
- Not everyone can access everything
- Example: production data sets are read-only to quants and analysts; analysts have home or group directories for derived data sets
- Mostly enforced via HDFS permissions; directory structure and organization are critical (see the sketch below)
- Not as fine-grained as column-level access in an EDW or RDBMS
- HUE as a gateway to the cluster
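A hedged sketch of how the directory-and-permission scheme above might be laid out with the FileSystem API. The users, groups, and paths are made up, and in practice the same layout is usually applied once with `hadoop fs -chmod` / `-chown` by a superuser.

```java
// Sketch: enforce the scheme described above with HDFS permissions.
// Production data is read-only to the analysts group; the group also gets a
// writable area for derived data sets. Paths, users, and groups are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class AccessLayout {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Production data sets: owned by the ETL user, readable by the analysts group.
    Path production = new Path("/data/production");
    fs.mkdirs(production);
    fs.setOwner(production, "etl", "analysts");
    fs.setPermission(production, new FsPermission((short) 0750));  // rwxr-x---

    // Group area for derived data sets: group-writable.
    Path derived = new Path("/user/analysts/derived");
    fs.mkdirs(derived);
    fs.setOwner(derived, "etl", "analysts");
    fs.setPermission(derived, new FsPermission((short) 0770));     // rwxrwx---
  }
}
```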
19. Resource Sharing
- Prefer one large cluster to many small clusters (unless, maybe, you’re Facebook)
- “Stop hogging the cluster!”
- Cluster resources:
  - Disk space (HDFS space quotas)
  - Number of files (HDFS file count quotas)
  - Simultaneous jobs
  - Tasks: guaranteed capacity, full utilization, SLA enforcement
- Monitor and track resource utilization across all groups (see the sketch below)
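Quotas themselves are set administratively with `hadoop dfsadmin -setQuota` / `-setSpaceQuota`. The hedged sketch below only reports per-group usage against those quotas via getContentSummary, so utilization can be tracked over time; the directory names are placeholders.

```java
// Sketch: report HDFS usage per group directory so "stop hogging the cluster"
// conversations are backed by numbers. Directories are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UsageReport {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    String[] groupDirs = { "/user/analytics", "/user/research", "/data/production" };

    for (String dir : groupDirs) {
      ContentSummary cs = fs.getContentSummary(new Path(dir));
      System.out.printf("%-20s files=%d dirs=%d bytes=%d fileQuota=%d spaceQuota=%d%n",
          dir, cs.getFileCount(), cs.getDirectoryCount(), cs.getLength(),
          cs.getQuota(), cs.getSpaceQuota());
    }
  }
}
```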
20. Monitoring
Critical for keeping things running.
Cluster health
- Duh.
- Traditional monitoring tools: Nagios, Hyperic, Zenoss
- Host checks, service checks (see the sketch below)
- When to alert? It’s tricky.
Cluster performance
- Overall utilization in aggregate
- A 30,000 ft view of utilization and performance; the macro level
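As one example of a service check, here is a hedged sketch of a minimal HDFS liveness probe written to the Nagios plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL). The latency threshold is invented, and a real setup would also check the JobTracker, live DataNode count, and so on.

```java
// Sketch: a bare-bones HDFS "service check" following Nagios exit codes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckHdfs {
  public static void main(String[] args) {
    try {
      FileSystem fs = FileSystem.get(new Configuration());
      long start = System.currentTimeMillis();
      fs.getFileStatus(new Path("/"));            // any metadata op exercises the NameNode
      long ms = System.currentTimeMillis() - start;
      if (ms > 5000) {                            // arbitrary example threshold
        System.out.println("WARNING - NameNode responded in " + ms + " ms");
        System.exit(1);                           // Nagios WARNING
      }
      System.out.println("OK - NameNode responded in " + ms + " ms");
      System.exit(0);                             // Nagios OK
    } catch (Exception e) {
      System.out.println("CRITICAL - cannot reach HDFS: " + e.getMessage());
      System.exit(2);                             // Nagios CRITICAL
    }
  }
}
```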
21. Monitoring
Hadoop-aware cluster monitoring
- Traditional tools don’t cut it; Hadoop monitoring is inherently Hadoop-specific
- Analogous to RDBMS monitoring tools
Job-level “monitoring”
- More like analysis: “What resources does this job use?” “How does this run compare to the last run?” “How can I make this run faster and more resource efficient?”
- Two views we care about: the job perspective and the resource perspective (task slots, scheduler pool)
(A counter-logging sketch follows.)
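Job-level analysis usually starts with MapReduce counters. Below is a hedged sketch, assuming the org.apache.hadoop.mapreduce API in which Counters and CounterGroup are iterable; dumping every counter after each run gives you something concrete to diff between runs.

```java
// Sketch: log every counter after a run so successive runs can be compared
// ("how does this run compare to last run?"). Assumes the new
// org.apache.hadoop.mapreduce API, where Counters and CounterGroup are iterable.
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CounterDump {
  // Call this after job.waitForCompletion(true) returns.
  public static void dump(Job job) throws Exception {
    Counters counters = job.getCounters();
    for (CounterGroup group : counters) {
      for (Counter counter : group) {
        // e.g. filesystem bytes read/written, shuffle bytes, spilled records...
        System.out.printf("%s\t%s\t%d%n",
            group.getName(), counter.getName(), counter.getValue());
      }
    }
  }
}
```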
22. Wrapping it up
- Hadoop proper is awesome, but it is only part of the picture; much of Professional Services’ time is spent filling in the blanks
- There’s still a way to go:
  - Metadata management
  - Operational tools and support
  - Improvements to Hadoop core for stability, security, and manageability
- Adoption and feedback drive progress
- CDH provides the infrastructure for a complete system
23. Cloudera Makes Hadoop Safe for the Enterprise
Software | Services | Training
Many small and midsize companies – especially those for whom technology is not their primary product or concern – start out the same. Someone, probably you, decides they need to build a Hadoop cluster.
…and so you do. And it looks like this. Not a bad thing, but not what you can reasonably go to production with. So how do you get from this…