SlideShare una empresa de Scribd logo
1 de 26
Descargar para leer sin conexión
Thursday, December 3, 2009
Hadoop and Cloudera
         Managing Petabytes with Open Source



         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         December 3, 2009



Thursday, December 3, 2009
My Background
         Thanks for Asking
         ▪   hammer@cloudera.com
         ▪   Studied Mathematics at Harvard
         ▪   Worked as a Quant on Wall Street
         ▪   Conceived, built, and led Data team at Facebook
             ▪   Nearly 30 amazing engineers and data scientists
             ▪   Several open source projects and research papers
         ▪   Founder of Cloudera
             ▪   Vice President of Products and Chief Scientist (other titles)
             ▪   Also, check out the book “Beautiful Data”

Thursday, December 3, 2009
Presentation Outline
         ▪   What is Hadoop?
             ▪   HDFS
             ▪   MapReduce
             ▪   Hive, Pig, Avro, Zookeeper, and friends
         ▪   Solving big data problems with Hadoop at Facebook and Yahoo!
             ▪   Short history of Facebook’s Data team
             ▪   Some stats about Hadoop use at Yahoo!
         ▪   What we’re building at Cloudera
         ▪   Questions and Discussion
             ▪   Including some questions Ajay sent me


Thursday, December 3, 2009
What is Hadoop?
         ▪   Apache Software Foundation project, mostly written in Java
         ▪   Inspired by Google infrastructure
         ▪   Software for programming warehouse-scale computers (WSCs)
         ▪   Hundreds of production deployments
         ▪   Project structure
             ▪   Hadoop Distributed File System (HDFS)
             ▪   Hadoop MapReduce
             ▪   Hadoop Common
             ▪   Other subprojects
                 ▪   Avro, HBase, Hive, Pig, Zookeeper


Thursday, December 3, 2009
Anatomy of a Hadoop Cluster
         ▪   Commodity servers
             ▪   1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC
         ▪   Typically arranged in 2 level architecture
             ▪
                               Commodity
                 40 nodes per rack                              Hardware Cluster
         ▪   Inexpensive to acquire and maintain




                             •! Typically in 2 level architecture
                                 –! Nodes are commodity Linux PCs
Thursday, December 3, 2009       –! 40 nodes/rack
HDFS
         ▪   Pool commodity servers into a single hierarchical namespace
         ▪   Break files into 128 MB blocks and replicate blocks
         ▪   Designed for large files written once but read many times
             ▪   Files are append-only via a single writer
         ▪   Two major daemons: NameNode and DataNode
             ▪   NameNode manages file system metadata
             ▪   DataNode manages data using local filesystem
         ▪   HDFS manages checksumming, replication, and compression
         ▪   Throughput scales nearly linearly with node cluster size



Thursday, December 3, 2009
'$*31%10$13+3&'1%)#$#I%
                                #79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3"
                                /(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?."

                                                        HDFS
                                2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;"
                                #79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6"
                                '(-*-"0&"062--"%(//-2-)0".-2=-2.E"
             HDFS distributes file blocks among servers
                      "

                                                         " !"                 " F"

                                                          I"                   !"


                                    "                     H"                   H"
                                        F"

                                        !"                         " F"

                                        G"    #79:"                 G"

                                                                    I"
                                        I"

                                        H"               " !"                 " F"

                                                          G"                    G"

                                                          I"                    H"


                                                                                          "
                                        !"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.'
                                "
Thursday, December 3, 2009
Hadoop MapReduce
         ▪   Fault tolerant execution layer and API for parallel data processing
         ▪   Can target multiple storage systems
         ▪   Key/value data model
         ▪   Two major daemons: JobTracker and TaskTracker
         ▪   Many client interfaces
             ▪   Java
             ▪   C++
             ▪   Streaming
             ▪   Pig
             ▪   SQL (Hive)


Thursday, December 3, 2009
MapReduce
                     MapReduce pushes work out to the data
             (#)**+%$#41'%
                                            Q"                         K"
             #)5#0$#.1%*6%(/789%
             )#$#%)&'$3&:;$&*0%             !"                         Q"
             '$3#$1.<%$*%+;'"%=*34%
                                            N"                         N"
             *;$%$*%>#0<%0*)1'%&0%#%
             ?@;'$13A%B"&'%#@@*='%
             #0#@<'1'%$*%3;0%&0%                          K"
             +#3#@@1@%#0)%1@&>&0#$1'%
             $"1%:*$$@101?4'%                             P"
             &>+*'1)%:<%>*0*@&$"&?%                       !"
             '$*3#.1%'<'$1>'A%
                                            Q"
                                                                       K"
                                            P"
                                                                       P"
                                            !"
                                                                       N"



                                                                                     "
                                                 !"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+'
Thursday, December 3, 2009              "
Hadoop Subprojects
         ▪   Avro
             ▪   Cross-language framework for RPC and serialization
         ▪   HBase
             ▪   Table storage on top of HDFS, modeled after Google’s BigTable
         ▪   Hive
             ▪   SQL interface to structured data stored in HDFS
         ▪   Pig
             ▪   Language for data flow programming; also Owl, Zebra, SQL
         ▪   Zookeeper
             ▪   Coordination service for distributed systems

Thursday, December 3, 2009
Hadoop Community Support
         ▪   185+ contributors to the open source code base
             ▪   ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
         ▪   Over 500 (paid!) attendees at Hadoop World NYC
         ▪   Three books (O’Reilly, Apress, Manning)
         ▪   Training videos free online
         ▪   Regular user group meetups in many cities
             ▪   New York Meetup group has 238 members
         ▪   University courses across the world
         ▪   Growing consultant and systems integrator expertise
         ▪   Commercial training, certification, and support from Cloudera

Thursday, December 3, 2009
Hadoop Project Mechanics
         ▪   Trademark owned by ASF; Apache 2.0 license for code
         ▪   Rigorous unit, smoke, performance, and system tests
         ▪   Release cycle of 9 months
             ▪   Last major release: 0.20.0 on April 22, 2009
             ▪   0.21.0 will be last release before 1.0; nearly complete
             ▪   Subprojects on different release cycles
         ▪   Releases put to a vote according to Apache guidelines
         ▪   Releases made available as tarballs on Apache and mirrors
         ▪   Cloudera packages a distribution for many platforms
             ▪   RPM and Debian packages; AMI for Amazon’s EC2


Thursday, December 3, 2009
Hadoop at Facebook
         Early 2006: The First Research Scientist
         ▪   Source data living on horizontally partitioned MySQL tier
         ▪   Intensive historical analysis difficult
         ▪   No way to assess impact of changes to the site


         ▪   First try: Python scripts pull data into MySQL
         ▪   Second try: Python scripts pull data into Oracle


         ▪   ...and then we turned on impression logging



Thursday, December 3, 2009
Facebook Data Infrastructure
                                                   2007
                                    Scribe Tier                     MySQL Tier




                                                  Data Collection
                                                      Server




                                                  Oracle Database
                                                       Server




Thursday, December 3, 2009
Facebook Data Infrastructure
                                                          2008
                                          Scribe Tier            MySQL Tier




                                  Hadoop Tier




                                     Oracle RAC Servers




Thursday, December 3, 2009
Major Data Team Workloads
         ▪   Data collection
             ▪   server logs
             ▪   application databases
             ▪   web crawls
         ▪   Thousands of multi-stage processing pipelines
             ▪   Summaries consumed by external users
             ▪   Summaries for internal reporting
             ▪   Ad optimization pipeline
             ▪   Experimentation platform pipeline
         ▪   Ad hoc analyses


Thursday, December 3, 2009
Workload Statistics
         Facebook 2009
         ▪   Largest cluster running Hive: 4,800 cores, 5.5 PB of storage
         ▪   4 TB of compressed new data added per day
         ▪   135TB of compressed data scanned per day
         ▪   7,500+ Hive jobs on per day
         ▪   80K compute hours per day
         ▪   Around 200 people per month run Hive jobs



             (data from Ashish Thusoo’s Hadoop World NYC presentation)


Thursday, December 3, 2009
Hadoop at Yahoo!
         ▪   Jan 2006: Hired Doug Cutting
         ▪   Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
         ▪   Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
         ▪   Aug 2008: Deployed 4,000 node Hadoop cluster
         ▪   May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds
             ▪   Sorted 1 PB on 3,658 nodes in 16.25 hours
         ▪   Other data points
             ▪   Over 25,000 nodes running Hadoop across 17 clusters
             ▪   Hundreds of thousands of jobs per day from over 600 users
             ▪   82 PB of data


Thursday, December 3, 2009
Hadoop at Cloudera
         Cloudera’s Distribution for Hadoop
         ▪   Open source distribution of Apache Hadoop for enterprise use
             ▪   Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper
             ▪   Ensures cross-subproject compatibility
             ▪   Adds backported patches and customer-specific patches
             ▪   Adds Cloudera utilities like MRUnit and Sqoop
             ▪   Better integration with daemon administration utilities
             ▪   Follows the Filesystem Hierarchy Standard (FHS) for file layout
             ▪   Tools for automatically generating a configuration
             ▪   Packaged as RPM, DEB, AMI, or tarball


Thursday, December 3, 2009
Hadoop at Cloudera
         Training and Certification
         ▪   Free online training
             ▪   Basic, Intermediate (including Hive and Pig), and Advanced
             ▪   Includes a virtual machine with software and exercises
         ▪   Live training sessions
             ▪   One live session per month somewhere in the world
             ▪   If you have a large group, we may come to you
         ▪   Certification
             ▪   Exams for Developers, Administrators, and Managers
             ▪   Administered online or in person

Thursday, December 3, 2009
Hadoop at Cloudera
         Services and Support
         ▪   Professional Services
             ▪   Get Hadoop up and running in your environment
             ▪   Optimize an existing Hadoop infrastructure
             ▪   Design new algorithms to make the most of your data
         ▪   Support
             ▪   Unlimited questions for Cloudera’s technical team
             ▪   Access to our Knowledge Base
             ▪   Help prioritize feature development for CDH
             ▪   Early access to upcoming Cloudera software products


Thursday, December 3, 2009
Hadoop at Cloudera
         Commercial Software
         ▪   General thesis: build commercially-licensed software products
             which complement CDH for data management and analysis
         ▪   Current products
             ▪   Cloudera Desktop
                 ▪   Extensible user interface for users of Cloudera software
         ▪   Upcoming products
             ▪   Talk to me in private




Thursday, December 3, 2009
Cloudera Desktop
                             Big Data can be Beautiful




Thursday, December 3, 2009
Gemini-Specific Questions
         ▪   Scala
             ▪   ScalaNLP’s SMR, Jonhnny Weslley’s SHadoop
             ▪   Jeff Hodges’s Componentize
         ▪   Compliance/regulatory domain
         ▪   Security roadmap
             ▪   HADOOP-4487
             ▪   HIVE-842
         ▪   Real-time BI
             ▪   MapReduce Online Prototype (MOP)
             ▪   Talk to me in private


Thursday, December 3, 2009
(c) 2009 Cloudera, Inc. or its licensors.  "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0




Thursday, December 3, 2009

Más contenido relacionado

Similar a 20091203gemini

Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
Rick G. Garibay
 
Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystem
Andrew Brust
 
CloudCon Data Mining Presentation
CloudCon Data Mining PresentationCloudCon Data Mining Presentation
CloudCon Data Mining Presentation
Brian Johnson
 

Similar a 20091203gemini (20)

20100201hplabs
20100201hplabs20100201hplabs
20100201hplabs
 
20090422 Www
20090422 Www20090422 Www
20090422 Www
 
Device deployment
Device deploymentDevice deployment
Device deployment
 
Push Podc09
Push Podc09Push Podc09
Push Podc09
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
All about Apache ACE
All about Apache ACEAll about Apache ACE
All about Apache ACE
 
Hook Mobile Living Social Hackathon MoDevUX 2012
Hook Mobile Living Social Hackathon MoDevUX 2012Hook Mobile Living Social Hackathon MoDevUX 2012
Hook Mobile Living Social Hackathon MoDevUX 2012
 
HARE 2010 Review
HARE 2010 ReviewHARE 2010 Review
HARE 2010 Review
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
 
Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystem
 
CloudCon Data Mining Presentation
CloudCon Data Mining PresentationCloudCon Data Mining Presentation
CloudCon Data Mining Presentation
 
Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤
 
InnoDB Magic
InnoDB MagicInnoDB Magic
InnoDB Magic
 
The Project Trap
The Project TrapThe Project Trap
The Project Trap
 
HTML5 & Friends
HTML5 & FriendsHTML5 & Friends
HTML5 & Friends
 
Redis
RedisRedis
Redis
 
Happy Go Programming
Happy Go ProgrammingHappy Go Programming
Happy Go Programming
 
Tabledown
TabledownTabledown
Tabledown
 

Más de Jeff Hammerbacher

Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Jeff Hammerbacher
 

Más de Jeff Hammerbacher (20)

20120223keystone
20120223keystone20120223keystone
20120223keystone
 
20100714accel
20100714accel20100714accel
20100714accel
 
20100608sigmod
20100608sigmod20100608sigmod
20100608sigmod
 
20100513brown
20100513brown20100513brown
20100513brown
 
20100423sage
20100423sage20100423sage
20100423sage
 
20100418sos
20100418sos20100418sos
20100418sos
 
20100301icde
20100301icde20100301icde
20100301icde
 
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
 
20090309berkeley
20090309berkeley20090309berkeley
20090309berkeley
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
20081022cca
20081022cca20081022cca
20081022cca
 
20081009nychive
20081009nychive20081009nychive
20081009nychive
 
2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
 
20080611accel
20080611accel20080611accel
20080611accel
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Hdfs Dhruba
Hdfs DhrubaHdfs Dhruba
Hdfs Dhruba
 
20080529dublinpt1
20080529dublinpt120080529dublinpt1
20080529dublinpt1
 
20080528dublinpt3
20080528dublinpt320080528dublinpt3
20080528dublinpt3
 
20080529dublinpt3
20080529dublinpt320080529dublinpt3
20080529dublinpt3
 

Último

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

20091203gemini

  • 2. Hadoop and Cloudera Managing Petabytes with Open Source Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera December 3, 2009 Thursday, December 3, 2009
  • 3. My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Conceived, built, and led Data team at Facebook ▪ Nearly 30 amazing engineers and data scientists ▪ Several open source projects and research papers ▪ Founder of Cloudera ▪ Vice President of Products and Chief Scientist (other titles) ▪ Also, check out the book “Beautiful Data” Thursday, December 3, 2009
  • 4. Presentation Outline ▪ What is Hadoop? ▪ HDFS ▪ MapReduce ▪ Hive, Pig, Avro, Zookeeper, and friends ▪ Solving big data problems with Hadoop at Facebook and Yahoo! ▪ Short history of Facebook’s Data team ▪ Some stats about Hadoop use at Yahoo! ▪ What we’re building at Cloudera ▪ Questions and Discussion ▪ Including some questions Ajay sent me Thursday, December 3, 2009
  • 5. What is Hadoop? ▪ Apache Software Foundation project, mostly written in Java ▪ Inspired by Google infrastructure ▪ Software for programming warehouse-scale computers (WSCs) ▪ Hundreds of production deployments ▪ Project structure ▪ Hadoop Distributed File System (HDFS) ▪ Hadoop MapReduce ▪ Hadoop Common ▪ Other subprojects ▪ Avro, HBase, Hive, Pig, Zookeeper Thursday, December 3, 2009
  • 6. Anatomy of a Hadoop Cluster ▪ Commodity servers ▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC ▪ Typically arranged in 2 level architecture ▪ Commodity 40 nodes per rack Hardware Cluster ▪ Inexpensive to acquire and maintain •! Typically in 2 level architecture –! Nodes are commodity Linux PCs Thursday, December 3, 2009 –! 40 nodes/rack
  • 7. HDFS ▪ Pool commodity servers into a single hierarchical namespace ▪ Break files into 128 MB blocks and replicate blocks ▪ Designed for large files written once but read many times ▪ Files are append-only via a single writer ▪ Two major daemons: NameNode and DataNode ▪ NameNode manages file system metadata ▪ DataNode manages data using local filesystem ▪ HDFS manages checksumming, replication, and compression ▪ Throughput scales nearly linearly with node cluster size Thursday, December 3, 2009
  • 8. '$*31%10$13+3&'1%)#$#I% #79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3" /(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?." HDFS 2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;" #79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6" '(-*-"0&"062--"%(//-2-)0".-2=-2.E" HDFS distributes file blocks among servers " " !" " F" I" !" " H" H" F" !" " F" G" #79:" G" I" I" H" " !" " F" G" G" I" H" " !"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.' " Thursday, December 3, 2009
  • 9. Hadoop MapReduce ▪ Fault tolerant execution layer and API for parallel data processing ▪ Can target multiple storage systems ▪ Key/value data model ▪ Two major daemons: JobTracker and TaskTracker ▪ Many client interfaces ▪ Java ▪ C++ ▪ Streaming ▪ Pig ▪ SQL (Hive) Thursday, December 3, 2009
  • 10. MapReduce MapReduce pushes work out to the data (#)**+%$#41'% Q" K" #)5#0$#.1%*6%(/789% )#$#%)&'$3&:;$&*0% !" Q" '$3#$1.<%$*%+;'"%=*34% N" N" *;$%$*%>#0<%0*)1'%&0%#% ?@;'$13A%B"&'%#@@*='% #0#@<'1'%$*%3;0%&0% K" +#3#@@1@%#0)%1@&>&0#$1'% $"1%:*$$@101?4'% P" &>+*'1)%:<%>*0*@&$"&?% !" '$*3#.1%'<'$1>'A% Q" K" P" P" !" N" " !"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+' Thursday, December 3, 2009 "
  • 11. Hadoop Subprojects ▪ Avro ▪ Cross-language framework for RPC and serialization ▪ HBase ▪ Table storage on top of HDFS, modeled after Google’s BigTable ▪ Hive ▪ SQL interface to structured data stored in HDFS ▪ Pig ▪ Language for data flow programming; also Owl, Zebra, SQL ▪ Zookeeper ▪ Coordination service for distributed systems Thursday, December 3, 2009
  • 12. Hadoop Community Support ▪ 185+ contributors to the open source code base ▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera ▪ Over 500 (paid!) attendees at Hadoop World NYC ▪ Three books (O’Reilly, Apress, Manning) ▪ Training videos free online ▪ Regular user group meetups in many cities ▪ New York Meetup group has 238 members ▪ University courses across the world ▪ Growing consultant and systems integrator expertise ▪ Commercial training, certification, and support from Cloudera Thursday, December 3, 2009
  • 13. Hadoop Project Mechanics ▪ Trademark owned by ASF; Apache 2.0 license for code ▪ Rigorous unit, smoke, performance, and system tests ▪ Release cycle of 9 months ▪ Last major release: 0.20.0 on April 22, 2009 ▪ 0.21.0 will be last release before 1.0; nearly complete ▪ Subprojects on different release cycles ▪ Releases put to a vote according to Apache guidelines ▪ Releases made available as tarballs on Apache and mirrors ▪ Cloudera packages a distribution for many platforms ▪ RPM and Debian packages; AMI for Amazon’s EC2 Thursday, December 3, 2009
  • 14. Hadoop at Facebook Early 2006: The First Research Scientist ▪ Source data living on horizontally partitioned MySQL tier ▪ Intensive historical analysis difficult ▪ No way to assess impact of changes to the site ▪ First try: Python scripts pull data into MySQL ▪ Second try: Python scripts pull data into Oracle ▪ ...and then we turned on impression logging Thursday, December 3, 2009
  • 15. Facebook Data Infrastructure 2007 Scribe Tier MySQL Tier Data Collection Server Oracle Database Server Thursday, December 3, 2009
  • 16. Facebook Data Infrastructure 2008 Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers Thursday, December 3, 2009
  • 17. Major Data Team Workloads ▪ Data collection ▪ server logs ▪ application databases ▪ web crawls ▪ Thousands of multi-stage processing pipelines ▪ Summaries consumed by external users ▪ Summaries for internal reporting ▪ Ad optimization pipeline ▪ Experimentation platform pipeline ▪ Ad hoc analyses Thursday, December 3, 2009
  • 18. Workload Statistics Facebook 2009 ▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage ▪ 4 TB of compressed new data added per day ▪ 135TB of compressed data scanned per day ▪ 7,500+ Hive jobs on per day ▪ 80K compute hours per day ▪ Around 200 people per month run Hive jobs (data from Ashish Thusoo’s Hadoop World NYC presentation) Thursday, December 3, 2009
  • 19. Hadoop at Yahoo! ▪ Jan 2006: Hired Doug Cutting ▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours ▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds ▪ Aug 2008: Deployed 4,000 node Hadoop cluster ▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds ▪ Sorted 1 PB on 3,658 nodes in 16.25 hours ▪ Other data points ▪ Over 25,000 nodes running Hadoop across 17 clusters ▪ Hundreds of thousands of jobs per day from over 600 users ▪ 82 PB of data Thursday, December 3, 2009
  • 20. Hadoop at Cloudera Cloudera’s Distribution for Hadoop ▪ Open source distribution of Apache Hadoop for enterprise use ▪ Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper ▪ Ensures cross-subproject compatibility ▪ Adds backported patches and customer-specific patches ▪ Adds Cloudera utilities like MRUnit and Sqoop ▪ Better integration with daemon administration utilities ▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout ▪ Tools for automatically generating a configuration ▪ Packaged as RPM, DEB, AMI, or tarball Thursday, December 3, 2009
  • 21. Hadoop at Cloudera Training and Certification ▪ Free online training ▪ Basic, Intermediate (including Hive and Pig), and Advanced ▪ Includes a virtual machine with software and exercises ▪ Live training sessions ▪ One live session per month somewhere in the world ▪ If you have a large group, we may come to you ▪ Certification ▪ Exams for Developers, Administrators, and Managers ▪ Administered online or in person Thursday, December 3, 2009
  • 22. Hadoop at Cloudera Services and Support ▪ Professional Services ▪ Get Hadoop up and running in your environment ▪ Optimize an existing Hadoop infrastructure ▪ Design new algorithms to make the most of your data ▪ Support ▪ Unlimited questions for Cloudera’s technical team ▪ Access to our Knowledge Base ▪ Help prioritize feature development for CDH ▪ Early access to upcoming Cloudera software products Thursday, December 3, 2009
  • 23. Hadoop at Cloudera Commercial Software ▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis ▪ Current products ▪ Cloudera Desktop ▪ Extensible user interface for users of Cloudera software ▪ Upcoming products ▪ Talk to me in private Thursday, December 3, 2009
  • 24. Cloudera Desktop Big Data can be Beautiful Thursday, December 3, 2009
  • 25. Gemini-Specific Questions ▪ Scala ▪ ScalaNLP’s SMR, Jonhnny Weslley’s SHadoop ▪ Jeff Hodges’s Componentize ▪ Compliance/regulatory domain ▪ Security roadmap ▪ HADOOP-4487 ▪ HIVE-842 ▪ Real-time BI ▪ MapReduce Online Prototype (MOP) ▪ Talk to me in private Thursday, December 3, 2009
  • 26. (c) 2009 Cloudera, Inc. or its licensors.  "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0 Thursday, December 3, 2009