5. Releases slowed with Hadoop take-up
0.20.0 0.20.1 0.20.2 0.21.0 0.20.20{3,4,5}.0
• 64 Releases
• Branches from the last 2.5 years:
– 0.20.{0,1,2} – Stable release without security
– 0.20.2xx.y – Stable release with security
– 0.21.0 – released, unstable, deprecated
– 0.22.0 – orphan, unstable, lack of community
6. Now: two release branches, one dev
Hadoop 1.x
• Stable, used in production systems
• The one to use today
Hadoop 2.0
• The successor
• Not quite ready for use
Hadoop 2.x "trunk"
• Where features & fixes first go in
• If you want to help, start here
Picking out what is really new in this release, as opposed to merges and stability fixes, WebHDFS is the interesting one. Set one config option and the DNs and NNs become web servers (using the chosen auth mechanism), offering read and write access to the data. This is integral to the cluster: you ask the NN for data, which triggers a 307 redirect to a DN holding the data, which serves it up. The redirect is handled transparently by any HTTP client set up to follow redirects.
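To make the read flow concrete, here is a minimal sketch of the URL a client sends to the NameNode. The hostname, port, path and user are placeholders, not a live cluster.

```python
# Sketch of the WebHDFS read flow: the client asks the NN to OPEN a file,
# the NN replies with a 307 redirect to a DN holding the data.
# "namenode.example.com" etc. below are illustrative placeholders.
from urllib.parse import urlencode

def webhdfs_open_url(namenode, path, user):
    """Build the URL a client sends to the NameNode to read a file.

    The NN answers with a 307 redirect to a DataNode; any HTTP client
    that follows redirects then streams the bytes from the DN.
    """
    query = urlencode({"op": "OPEN", "user.name": user})
    return "http://%s/webhdfs/v1%s?%s" % (namenode, path, query)

url = webhdfs_open_url("namenode.example.com:50070", "/logs/day1.log", "alice")
print(url)
```

Against a real cluster, something like `curl -L` on that URL would follow the redirect and fetch the data directly from the DataNode.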
This is what we're going to be shipping based on Hadoop 1.0: a packaging of the core Hadoop stack with management tooling.
There's a set of NoSQL databases running on or near Hadoop. Apache HBase is the key one; look at the Facebook papers on FB chat to see how this works in the field. Cassandra is not directly dependent on Hadoop, but you can run Pig and Hive queries against its data, and it implements the HDFS filesystem API, so you can host TTs on the same nodes as your Cassandra data and get data-local work. Accumulo deserves a mention as it is in incubation, donated to the ASF by the NSA. Apparently it has good security on access to keys and values, which shows that some orgs put security ahead of other features in NoSQL-land, and that government orgs are starting to play in this space and contribute code back.
Don't write at the Java level if you can help it; both Pig and Hive are a lot more productive. SQL houses should play with Hive. Pig is very good for experimentation, and its ability to call User Defined Functions lets you re-use tuned Java libraries, such as LinkedIn's DataFu.
Lots of ways to get data in. Most are focused on streaming from other servers in the same datacentre, like web servers, and collecting the logs. Scribe is designed to scale up well, with the option of discarding data under heavy load. Kafka is from LinkedIn: nice code which can hook up behind log4j.
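The collection pattern these tools share can be sketched in a few lines: batch incoming log lines, ship a batch when it fills, and drop data rather than block when the buffer is full (Scribe's behaviour under heavy load). All names here are illustrative, not any tool's real API.

```python
# Toy sketch of the log-collector pattern behind Scribe/Kafka pipelines:
# batch lines, flush full batches downstream, discard under load.
from collections import deque

class LogShipper:
    def __init__(self, send, batch_size=100, max_buffer=1000):
        self.send = send              # callable that delivers a batch downstream
        self.batch_size = batch_size
        self.max_buffer = max_buffer
        self.buffer = deque()
        self.dropped = 0

    def append(self, line):
        if len(self.buffer) >= self.max_buffer:
            self.dropped += 1         # discard under load rather than block
            return
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            batch = list(self.buffer)
            self.buffer.clear()
            self.send(batch)

batches = []
shipper = LogShipper(batches.append, batch_size=2)
for line in ["GET /a", "GET /b", "GET /c"]:
    shipper.append(line)
shipper.flush()
print(batches)   # [['GET /a', 'GET /b'], ['GET /c']]
```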
If you are doing anything with social networks, connecting events and locations together, etc., the graph layer should be of interest; it's up and coming as the next layer in the stack. There are two projects in the Apache incubator. Hama: a graph layer with a big driver being a telco. Giraph: ex-Y!, and LinkedIn are using this. There's a workshop after Berlin Buzzwords on "beyond MR" that I'm co-organising; Giraph will be one of the topics there (along with YARN and Stratosphere).
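The vertex-centric BSP model these projects use is easy to sketch: each superstep, every vertex reads its incoming messages, updates its value, and messages its neighbours; the job halts when no messages are in flight. This toy single-process version uses the classic maximum-value example; it is an illustration of the model, not Giraph's or Hama's API.

```python
# Toy sketch of the vertex-centric BSP model (Giraph/Hama style):
# supersteps of read-messages / update-value / message-neighbours,
# halting when no messages remain in flight.
def propagate_max(graph, values):
    """graph: vertex -> list of neighbours; values: vertex -> number."""
    values = dict(values)
    # superstep 0: every vertex sends its value to its neighbours
    inbox = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            inbox[n].append(values[v])
    # later supersteps: only vertices whose value grew keep messaging
    while any(inbox.values()):
        next_inbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in graph[v]:
                    next_inbox[n].append(values[v])
        inbox = next_inbox
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(propagate_max(graph, {"a": 1, "b": 5, "c": 2}))
# every vertex converges on the global maximum, 5
```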
Hadoop
This is the architecture of HDFS HA, skipping some of the details and the roadmap of when features come out. It's active/standby HA, not shared-write (which is much, much harder). Failover is initially manual, moving to automated. Failover controllers monitor NN health and heartbeat to ZK so that others in the ZK farm can detect failures. DNs report to both NNs, but only listen to one.
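The failover decision itself is simple to sketch. In a real deployment ZooKeeper ephemeral nodes carry the heartbeats; here a shared dict of timestamps stands in for the ZK farm, and all class and node names are invented for illustration.

```python
# Sketch of the HA failover decision: controllers heartbeat on behalf of
# their NN; the standby's controller promotes when the active misses its
# deadline. A shared dict stands in for ZooKeeper here.
class FailoverController:
    def __init__(self, node, heartbeats, timeout=5.0):
        self.node = node              # the NameNode this controller monitors
        self.heartbeats = heartbeats  # shared store, ZK stand-in
        self.timeout = timeout

    def beat(self, now):
        """Record that our NN was healthy at time `now`."""
        self.heartbeats[self.node] = now

    def active_is_dead(self, active, now):
        """True once the active NN has missed its heartbeat deadline."""
        last = self.heartbeats.get(active)
        return last is None or (now - last) > self.timeout

heartbeats = {}
active_fc = FailoverController("nn-active", heartbeats)
standby_fc = FailoverController("nn-standby", heartbeats)
active_fc.beat(now=100.0)
print(standby_fc.active_is_dead("nn-active", now=103.0))  # False: within deadline
print(standby_fc.active_is_dead("nn-active", now=110.0))  # True: missed deadline
```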
Hadoop 2 breaks the JT up into two parts: the Resource Manager, which manages allocation of resources on servers, and the JT itself, which now becomes one of the possible "Application Masters" that can be deployed in a cluster. Breaking this up allows you to run different JTs for different users and different versions of the MR APIs (Facebook do this in their clusters with a static striping of TTs today), and to run other topology-aware applications.
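The essence of the split can be sketched as one allocator serving several independent masters. The classes, node names and application names below are all made up for illustration; the point is that two different "JobTrackers" draw containers from the same pool.

```python
# Sketch of the RM/AM split: one ResourceManager hands out container
# slots on nodes; several ApplicationMasters (e.g. two MR versions)
# request them independently. Names are illustrative only.
class ResourceManager:
    def __init__(self, nodes):
        self.free = dict(nodes)       # node -> free container slots

    def allocate(self, app, n):
        """Grant up to n containers to an application master."""
        granted = []
        for node, slots in self.free.items():
            while slots > 0 and len(granted) < n:
                slots -= 1
                granted.append((node, app))
            self.free[node] = slots
            if len(granted) == n:
                break
        return granted

rm = ResourceManager({"node1": 2, "node2": 2})
mr_v1 = rm.allocate("mapreduce-v1-am", 3)   # one MR version's master...
mr_v2 = rm.allocate("mapreduce-v2-am", 1)   # ...alongside another
print(mr_v1)
print(mr_v2)
```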
The NoSQL business plan is a key issue here: politics and marketing, not technology. DB business pricing always put an upper financial limit on big data. Oracle liked to own the customer data (and had loyal DBA support). The move to vertical solutions promised best hardware and discounting opportunities, but removed flexibility ("the IBM model"). Hadoop challenges this: generic servers with many HDDs, open source software. They will need to add something to Hadoop/HDFS that stops you moving away or getting support from others. Looking at the hardware, that could either be very-low-latency IPC (benefits?) or something integrating SSDs into the system (preheating SSD caches with queued job data, …?). Closing on a brighter note, my colleagues and I have tales of terror from playing with JVM options on a big cluster, as you can be confident of reaching all corner cases within a short period of time. If Oracle start using Hadoop as a driver for JVM performance and qualification, and return those tweaks to OpenJDK, we all benefit.
Last but very much not least, there's growing integration of Hadoop in the OSS world at the app level. Spring has a Spring Data for Hadoop project in beta, which lets you integrate HDFS, MR and Pig jobs within a Spring application, as well as Cascading. You can do workflows here and really integrate with enterprise apps, especially if you use Spring already. Cascading, the Hadoop workflow language, has moved to an Apache License, to remove worries about GPL contamination of your code. Also of interest is the fact that the Linux vendors are taking Hadoop seriously, which can only improve testing and stability of Hadoop. Finally, off this sheet: an R connector for Hadoop, so the statisticians get integration from their world, R, to the new datasets.
Facebook, Prineville, 45MB, one single cluster. Yahoo!, 180 PB. It means that Hadoop installations are becoming the largest known storage and compute systems on the planet. It's unlikely that anyone in this audience has storage or bandwidth requirements that big, but for those in the audience who want theirs to become that big, Hadoop makes it possible, both technically and financially.
The other thing it means is this: nothing else has the momentum and the support. People may say "ours is better", but that's like saying Solaris was better than Linux, or the 68K was better than the Intel 8086. Better doesn't win; more valuable does, and because of its growing support, layers above, adoption and ecosystem, Hadoop has the edge. This isn't an excuse to get complacent: Spring killed Java EE, even though EJB once had everything going for it.