SlideShare una empresa de Scribd logo
1 de 24
Descargar para leer sin conexión
Tuesday, October 27, 2009
Hadoop and Cloudera
         Managing Petabytes with Open Source



         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         October 27, 2009



Tuesday, October 27, 2009
Why You Should Care
         Hadoop in the Life Sciences
         ▪   CloudBurst: Highly Sensitive Short Read Mapping with MapReduce
             ▪   “CloudBurst reduces the running time from hours to mere minutes”
         ▪   Crossbow: Genotyping from short reads using cloud computing
             ▪   “Crossbow shows how Hadoop can be a enabling technology for
                 computational biology”
         ▪   SMARTS substructure searching using the CDK and Hadoop
             ▪   “The Hadoop framework makes handling large data problems pretty
                 much trivial”
         ▪   Smith-Waterman Protein Alignment
             ▪   “Existing algorithms ported easily to Hadoop”


Tuesday, October 27, 2009
My Background
         Thanks for Asking
         ▪   hammer@cloudera.com
         ▪   Studied Mathematics at Harvard
         ▪   Worked as a Quant on Wall Street
         ▪   Conceived, built, and led Data team at Facebook
             ▪   Nearly 30 amazing engineers and data scientists
             ▪   Several open source projects and research papers
         ▪   Founder of Cloudera
             ▪   Vice President of Products and Chief Scientist (other titles)
             ▪   Also, check out the book “Beautiful Data”

Tuesday, October 27, 2009
Presentation Outline
         ▪   What is Hadoop?
             ▪   HDFS
             ▪   MapReduce
             ▪   Hive, Pig, Avro, Zookeeper, and friends
         ▪   Solving big data problems with Hadoop at Facebook and Yahoo!
             ▪   Short history of Facebook’s Data team
             ▪   Hadoop applications at Yahoo!, Facebook, and Cloudera
             ▪   Other examples: LHC, smart grid, genomes
         ▪   Questions and Discussion



Tuesday, October 27, 2009
What is Hadoop?
         ▪   Apache Software Foundation project, mostly written in Java
         ▪   Inspired by Google infrastructure
         ▪   Software for programming warehouse-scale computers (WSCs)
         ▪   Hundreds of production deployments
         ▪   Project structure
             ▪   Hadoop Distributed File System (HDFS)
             ▪   Hadoop MapReduce
             ▪   Hadoop Common
             ▪   Other subprojects
                 ▪   Avro, HBase, Hive, Pig, Zookeeper


Tuesday, October 27, 2009
Anatomy of a Hadoop Cluster
         ▪   Commodity servers
             ▪   1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC
         ▪   Typically arranged in 2 level architecture
             ▪
                               Commodity
                 40 nodes per rack                             Hardware Cluster
         ▪   Inexpensive to acquire and maintain




                            •! Typically in 2 level architecture
                                –! Nodes are commodity Linux PCs
Tuesday, October 27, 2009       –! 40 nodes/rack
HDFS
         ▪   Pool commodity servers into a single hierarchical namespace
         ▪   Break files into 128 MB blocks and replicate blocks
         ▪   Designed for large files written once but read many times
             ▪   Files are append-only
         ▪   Two major daemons: NameNode and DataNode
             ▪   NameNode manages file system metadata
             ▪   DataNode manages data using local filesystem
         ▪   HDFS manages checksumming, replication, and compression
         ▪   Throughput scales nearly linearly with node cluster size
         ▪   Access from Java, C, command line, FUSE, or Thrift

Tuesday, October 27, 2009
'$*31%10$13+3&'1%)#$#I%
                                 #79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3"
                                 /(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?."

                                                         HDFS
                                 2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;"
                                 #79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6"
                                 '(-*-"0&"062--"%(//-2-)0".-2=-2.E"
              HDFS distributes file blocks among servers
                       "

                                                          " !"                 " F"

                                                           I"                   !"


                                     "                     H"                   H"
                                         F"

                                         !"                         " F"

                                         G"    #79:"                 G"

                                                                     I"
                                         I"

                                         H"               " !"                 " F"

                                                           G"                    G"

                                                           I"                    H"


                                                                                           "
                                         !"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.'
                                 "
Tuesday, October 27, 2009
Hadoop MapReduce
         ▪   Fault tolerant execution layer and API for parallel data processing
         ▪   Can target multiple storage systems
         ▪   Key/value data model
         ▪   Two major daemons: JobTracker and TaskTracker
         ▪   Many client interfaces
             ▪   Java
             ▪   C++
             ▪   Streaming
             ▪   Pig
             ▪   SQL (Hive)


Tuesday, October 27, 2009
MapReduce
                      MapReduce pushes work out to the data
             (#)**+%$#41'%
                                            Q"                         K"
             #)5#0$#.1%*6%(/789%
             )#$#%)&'$3&:;$&*0%             !"                         Q"
             '$3#$1.<%$*%+;'"%=*34%
                                            N"                         N"
             *;$%$*%>#0<%0*)1'%&0%#%
             ?@;'$13A%B"&'%#@@*='%
             #0#@<'1'%$*%3;0%&0%                          K"
             +#3#@@1@%#0)%1@&>&0#$1'%
             $"1%:*$$@101?4'%                             P"
             &>+*'1)%:<%>*0*@&$"&?%                       !"
             '$*3#.1%'<'$1>'A%
                                            Q"
                                                                       K"
                                            P"
                                                                       P"
                                            !"
                                                                       N"



                                                                                     "
                                                 !"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+'
Tuesday, October 27, 2009               "
Hadoop Subprojects
         ▪   Avro
             ▪   Cross-language framework for RPC and serialization
         ▪   HBase
             ▪   Table storage on top of HDFS, modeled after Google’s BigTable
         ▪   Hive
             ▪   SQL interface to structured data stored in HDFS
         ▪   Pig
             ▪   Language for data flow programming; also Owl, Zebra, SQL
         ▪   Zookeeper
             ▪   Coordination service for distributed systems

Tuesday, October 27, 2009
Hadoop Community Support
         ▪   185 total contributors to the open source code base
             ▪   60 engineers at Yahoo!, 15 at Facebook, 15 at Cloudera
         ▪   Over 500 (paid!) attendees at Hadoop World NYC
             ▪   Hadoop World Beijing later this month
         ▪   Three books (O’Reilly, Apress, Manning)
         ▪   Training videos free online
         ▪   Regular user group meetups in many cities
         ▪   University courses across the world
         ▪   Growing consultant and systems integrator expertise
         ▪   Commercial training, certification, and support from Cloudera

Tuesday, October 27, 2009
Hadoop Project Mechanics
         ▪   Trademark owned by ASF; Apache 2.0 license for code
         ▪   Rigorous unit, smoke, performance, and system tests
         ▪   Release cycle of 3 months (-ish)
             ▪   Last major release: 0.20.0 on April 22, 2009
             ▪   0.21.0 will be last release before 1.0; nearly complete
             ▪   Subprojects on different release cycles
         ▪   Releases put to a vote according to Apache guidelines
         ▪   Releases made available as tarballs on Apache and mirrors
         ▪   Cloudera packages own release for many platforms
             ▪   RPM and Debian packages; AMI for Amazon’s EC2


Tuesday, October 27, 2009
Hadoop at Facebook
         Early 2006: The First Research Scientist
         ▪   Source data living on horizontally partitioned MySQL tier
         ▪   Intensive historical analysis difficult
         ▪   No way to assess impact of changes to the site


         ▪   First try: Python scripts pull data into MySQL
         ▪   Second try: Python scripts pull data into Oracle


         ▪   ...and then we turned on impression logging



Tuesday, October 27, 2009
Facebook Data Infrastructure
                                                  2007
                                   Scribe Tier                     MySQL Tier




                                                 Data Collection
                                                     Server




                                                 Oracle Database
                                                      Server




Tuesday, October 27, 2009
Facebook Data Infrastructure
                                                         2008
                                         Scribe Tier            MySQL Tier




                                 Hadoop Tier




                                    Oracle RAC Servers




Tuesday, October 27, 2009
Major Data Team Workloads
         ▪   Data collection
             ▪   server logs
             ▪   application databases
             ▪   web crawls
         ▪   Thousands of multi-stage processing pipelines
             ▪   Summaries consumed by external users
             ▪   Summaries for internal reporting
             ▪   Ad optimization pipeline
             ▪   Experimentation platform pipeline
         ▪   Ad hoc analyses


Tuesday, October 27, 2009
Workload Statistics
         Facebook 2009
         ▪   Largest cluster running Hive: 4,800 cores, 5.5 PB of storage
         ▪   4 TB of compressed new data added per day
         ▪   135TB of compressed data scanned per day
         ▪   7,500+ Hive jobs on per day
         ▪   80K compute hours per day
         ▪   Around 200 people per month run Hive jobs




Tuesday, October 27, 2009
Hadoop at Yahoo!
         ▪   Jan 2006: Hired Doug Cutting
         ▪   Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
         ▪   Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
         ▪   Aug 2008: Deployed 4,000 node Hadoop cluster
         ▪   May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds
         ▪   Data Points
             ▪   Over 25,000 nodes running Hadoop across 17 clusters
             ▪   Hundreds of thousands of jobs per day
             ▪   Typical HDFS cluster: 1,400 nodes, 2 PB capacity
             ▪   Sorted 1 PB on 3,658 nodes in 16.25 hours


Tuesday, October 27, 2009
Example Hadoop Applications
         ▪   Yahoo!
             ▪   Yahoo! Search Webmap
             ▪   Content and ad targeting optimization
         ▪   Facebook
             ▪   Fraud and abuse detection
             ▪   Lexicon (text mining)
         ▪   Cloudera
             ▪   Facial recognition for automatic tagging
             ▪   Genome sequence analysis
             ▪   Financial services, government, and of course: HEP!


Tuesday, October 27, 2009
Cloudera Offerings
         Only One Slide, I Promise
         ▪   Two software products
             ▪   Cloudera’s Distribution for Hadoop
             ▪   Cloudera Desktop
         ▪   Training and Certification
             ▪   For Developers, Operators, and Managers
         ▪   Support
         ▪   Professional services




Tuesday, October 27, 2009
Cloudera Desktop
                            Big Data can be Beautiful




Tuesday, October 27, 2009
(c) 2009 Cloudera, Inc. or its licensors.  "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0




Tuesday, October 27, 2009

Más contenido relacionado

La actualidad más candente (7)

Defcon 22-graham-mc millan-tentler-masscaning-the-internet
Defcon 22-graham-mc millan-tentler-masscaning-the-internetDefcon 22-graham-mc millan-tentler-masscaning-the-internet
Defcon 22-graham-mc millan-tentler-masscaning-the-internet
 
Redis
RedisRedis
Redis
 
Brasil Ross 2011
Brasil Ross 2011Brasil Ross 2011
Brasil Ross 2011
 
Sdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopSdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoop
 
Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing Hadoop
 
SDEC2011 Essentials of Pig
SDEC2011 Essentials of PigSDEC2011 Essentials of Pig
SDEC2011 Essentials of Pig
 
MongoDB & Spark
MongoDB & SparkMongoDB & Spark
MongoDB & Spark
 

Destacado (6)

Animal Discovery
Animal DiscoveryAnimal Discovery
Animal Discovery
 
20100418sos
20100418sos20100418sos
20100418sos
 
20080611accel
20080611accel20080611accel
20080611accel
 
20100513brown
20100513brown20100513brown
20100513brown
 
20100608sigmod
20100608sigmod20100608sigmod
20100608sigmod
 
20080528dublinpt2
20080528dublinpt220080528dublinpt2
20080528dublinpt2
 

Similar a 20091027genentech

Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystem
Andrew Brust
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
Rajarshi Guha
 
Advanced Technology for Web Application Design
Advanced Technology for Web Application DesignAdvanced Technology for Web Application Design
Advanced Technology for Web Application Design
Bryce Kerley
 
Rapid Prototyping
Rapid PrototypingRapid Prototyping
Rapid Prototyping
Even Wu
 

Similar a 20091027genentech (20)

20091203gemini
20091203gemini20091203gemini
20091203gemini
 
20100201hplabs
20100201hplabs20100201hplabs
20100201hplabs
 
20100201hplabs
20100201hplabs20100201hplabs
20100201hplabs
 
20090422 Www
20090422 Www20090422 Www
20090422 Www
 
Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystem
 
Device deployment
Device deploymentDevice deployment
Device deployment
 
All about Apache ACE
All about Apache ACEAll about Apache ACE
All about Apache ACE
 
InnoDB Magic
InnoDB MagicInnoDB Magic
InnoDB Magic
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
 
HARE 2010 Review
HARE 2010 ReviewHARE 2010 Review
HARE 2010 Review
 
CouchDB introduction
CouchDB introductionCouchDB introduction
CouchDB introduction
 
Push Podc09
Push Podc09Push Podc09
Push Podc09
 
Advanced Technology for Web Application Design
Advanced Technology for Web Application DesignAdvanced Technology for Web Application Design
Advanced Technology for Web Application Design
 
Adding Riak to your NoSQL Bag of Tricks
Adding Riak to your NoSQL Bag of TricksAdding Riak to your NoSQL Bag of Tricks
Adding Riak to your NoSQL Bag of Tricks
 
Rapid Prototyping
Rapid PrototypingRapid Prototyping
Rapid Prototyping
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
 
myHadoop 0.30
myHadoop 0.30myHadoop 0.30
myHadoop 0.30
 
Programming the cloud with Skywriting
Programming the cloud with SkywritingProgramming the cloud with Skywriting
Programming the cloud with Skywriting
 

Más de Jeff Hammerbacher

Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Jeff Hammerbacher
 

Más de Jeff Hammerbacher (18)

20120223keystone
20120223keystone20120223keystone
20120223keystone
 
20100714accel
20100714accel20100714accel
20100714accel
 
20100423sage
20100423sage20100423sage
20100423sage
 
20100301icde
20100301icde20100301icde
20100301icde
 
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
 
20090309berkeley
20090309berkeley20090309berkeley
20090309berkeley
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
20081022cca
20081022cca20081022cca
20081022cca
 
20081009nychive
20081009nychive20081009nychive
20081009nychive
 
2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Hdfs Dhruba
Hdfs DhrubaHdfs Dhruba
Hdfs Dhruba
 
20080529dublinpt1
20080529dublinpt120080529dublinpt1
20080529dublinpt1
 
20080528dublinpt3
20080528dublinpt320080528dublinpt3
20080528dublinpt3
 
20080529dublinpt3
20080529dublinpt320080529dublinpt3
20080529dublinpt3
 
20080528dublinpt1
20080528dublinpt120080528dublinpt1
20080528dublinpt1
 
20080529dublinpt2
20080529dublinpt220080529dublinpt2
20080529dublinpt2
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

20091027genentech

  • 2. Hadoop and Cloudera Managing Petabytes with Open Source Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera October 27, 2009 Tuesday, October 27, 2009
  • 3. Why You Should Care Hadoop in the Life Sciences ▪ CloudBurst: Highly Sensitive Short Read Mapping with MapReduce ▪ “CloudBurst reduces the running time from hours to mere minutes” ▪ Crossbow: Genotyping from short reads using cloud computing ▪ “Crossbow shows how Hadoop can be a enabling technology for computational biology” ▪ SMARTS substructure searching using the CDK and Hadoop ▪ “The Hadoop framework makes handling large data problems pretty much trivial” ▪ Smith-Waterman Protein Alignment ▪ “Existing algorithms ported easily to Hadoop” Tuesday, October 27, 2009
  • 4. My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Conceived, built, and led Data team at Facebook ▪ Nearly 30 amazing engineers and data scientists ▪ Several open source projects and research papers ▪ Founder of Cloudera ▪ Vice President of Products and Chief Scientist (other titles) ▪ Also, check out the book “Beautiful Data” Tuesday, October 27, 2009
  • 5. Presentation Outline ▪ What is Hadoop? ▪ HDFS ▪ MapReduce ▪ Hive, Pig, Avro, Zookeeper, and friends ▪ Solving big data problems with Hadoop at Facebook and Yahoo! ▪ Short history of Facebook’s Data team ▪ Hadoop applications at Yahoo!, Facebook, and Cloudera ▪ Other examples: LHC, smart grid, genomes ▪ Questions and Discussion Tuesday, October 27, 2009
  • 6. What is Hadoop? ▪ Apache Software Foundation project, mostly written in Java ▪ Inspired by Google infrastructure ▪ Software for programming warehouse-scale computers (WSCs) ▪ Hundreds of production deployments ▪ Project structure ▪ Hadoop Distributed File System (HDFS) ▪ Hadoop MapReduce ▪ Hadoop Common ▪ Other subprojects ▪ Avro, HBase, Hive, Pig, Zookeeper Tuesday, October 27, 2009
  • 7. Anatomy of a Hadoop Cluster ▪ Commodity servers ▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC ▪ Typically arranged in 2 level architecture ▪ Commodity 40 nodes per rack Hardware Cluster ▪ Inexpensive to acquire and maintain •! Typically in 2 level architecture –! Nodes are commodity Linux PCs Tuesday, October 27, 2009 –! 40 nodes/rack
  • 8. HDFS ▪ Pool commodity servers into a single hierarchical namespace ▪ Break files into 128 MB blocks and replicate blocks ▪ Designed for large files written once but read many times ▪ Files are append-only ▪ Two major daemons: NameNode and DataNode ▪ NameNode manages file system metadata ▪ DataNode manages data using local filesystem ▪ HDFS manages checksumming, replication, and compression ▪ Throughput scales nearly linearly with node cluster size ▪ Access from Java, C, command line, FUSE, or Thrift Tuesday, October 27, 2009
  • 9. '$*31%10$13+3&'1%)#$#I% #79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3" /(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?." HDFS 2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;" #79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6" '(-*-"0&"062--"%(//-2-)0".-2=-2.E" HDFS distributes file blocks among servers " " !" " F" I" !" " H" H" F" !" " F" G" #79:" G" I" I" H" " !" " F" G" G" I" H" " !"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.' " Tuesday, October 27, 2009
  • 10. Hadoop MapReduce ▪ Fault tolerant execution layer and API for parallel data processing ▪ Can target multiple storage systems ▪ Key/value data model ▪ Two major daemons: JobTracker and TaskTracker ▪ Many client interfaces ▪ Java ▪ C++ ▪ Streaming ▪ Pig ▪ SQL (Hive) Tuesday, October 27, 2009
  • 11. MapReduce MapReduce pushes work out to the data (#)**+%$#41'% Q" K" #)5#0$#.1%*6%(/789% )#$#%)&'$3&:;$&*0% !" Q" '$3#$1.<%$*%+;'"%=*34% N" N" *;$%$*%>#0<%0*)1'%&0%#% ?@;'$13A%B"&'%#@@*='% #0#@<'1'%$*%3;0%&0% K" +#3#@@1@%#0)%1@&>&0#$1'% $"1%:*$$@101?4'% P" &>+*'1)%:<%>*0*@&$"&?% !" '$*3#.1%'<'$1>'A% Q" K" P" P" !" N" " !"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+' Tuesday, October 27, 2009 "
  • 12. Hadoop Subprojects ▪ Avro ▪ Cross-language framework for RPC and serialization ▪ HBase ▪ Table storage on top of HDFS, modeled after Google’s BigTable ▪ Hive ▪ SQL interface to structured data stored in HDFS ▪ Pig ▪ Language for data flow programming; also Owl, Zebra, SQL ▪ Zookeeper ▪ Coordination service for distributed systems Tuesday, October 27, 2009
  • 13. Hadoop Community Support ▪ 185 total contributors to the open source code base ▪ 60 engineers at Yahoo!, 15 at Facebook, 15 at Cloudera ▪ Over 500 (paid!) attendees at Hadoop World NYC ▪ Hadoop World Beijing later this month ▪ Three books (O’Reilly, Apress, Manning) ▪ Training videos free online ▪ Regular user group meetups in many cities ▪ University courses across the world ▪ Growing consultant and systems integrator expertise ▪ Commercial training, certification, and support from Cloudera Tuesday, October 27, 2009
  • 14. Hadoop Project Mechanics ▪ Trademark owned by ASF; Apache 2.0 license for code ▪ Rigorous unit, smoke, performance, and system tests ▪ Release cycle of 3 months (-ish) ▪ Last major release: 0.20.0 on April 22, 2009 ▪ 0.21.0 will be last release before 1.0; nearly complete ▪ Subprojects on different release cycles ▪ Releases put to a vote according to Apache guidelines ▪ Releases made available as tarballs on Apache and mirrors ▪ Cloudera packages own release for many platforms ▪ RPM and Debian packages; AMI for Amazon’s EC2 Tuesday, October 27, 2009
  • 15. Hadoop at Facebook Early 2006: The First Research Scientist ▪ Source data living on horizontally partitioned MySQL tier ▪ Intensive historical analysis difficult ▪ No way to assess impact of changes to the site ▪ First try: Python scripts pull data into MySQL ▪ Second try: Python scripts pull data into Oracle ▪ ...and then we turned on impression logging Tuesday, October 27, 2009
  • 16. Facebook Data Infrastructure 2007 Scribe Tier MySQL Tier Data Collection Server Oracle Database Server Tuesday, October 27, 2009
  • 17. Facebook Data Infrastructure 2008 Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers Tuesday, October 27, 2009
  • 18. Major Data Team Workloads ▪ Data collection ▪ server logs ▪ application databases ▪ web crawls ▪ Thousands of multi-stage processing pipelines ▪ Summaries consumed by external users ▪ Summaries for internal reporting ▪ Ad optimization pipeline ▪ Experimentation platform pipeline ▪ Ad hoc analyses Tuesday, October 27, 2009
  • 19. Workload Statistics Facebook 2009 ▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage ▪ 4 TB of compressed new data added per day ▪ 135TB of compressed data scanned per day ▪ 7,500+ Hive jobs on per day ▪ 80K compute hours per day ▪ Around 200 people per month run Hive jobs Tuesday, October 27, 2009
  • 20. Hadoop at Yahoo! ▪ Jan 2006: Hired Doug Cutting ▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours ▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds ▪ Aug 2008: Deployed 4,000 node Hadoop cluster ▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds ▪ Data Points ▪ Over 25,000 nodes running Hadoop across 17 clusters ▪ Hundreds of thousands of jobs per day ▪ Typical HDFS cluster: 1,400 nodes, 2 PB capacity ▪ Sorted 1 PB on 3,658 nodes in 16.25 hours Tuesday, October 27, 2009
  • 21. Example Hadoop Applications ▪ Yahoo! ▪ Yahoo! Search Webmap ▪ Content and ad targeting optimization ▪ Facebook ▪ Fraud and abuse detection ▪ Lexicon (text mining) ▪ Cloudera ▪ Facial recognition for automatic tagging ▪ Genome sequence analysis ▪ Financial services, government, and of course: HEP! Tuesday, October 27, 2009
  • 22. Cloudera Offerings Only One Slide, I Promise ▪ Two software products ▪ Cloudera’s Distribution for Hadoop ▪ Cloudera Desktop ▪ Training and Certification ▪ For Developers, Operators, and Managers ▪ Support ▪ Professional services Tuesday, October 27, 2009
  • 23. Cloudera Desktop Big Data can be Beautiful Tuesday, October 27, 2009
  • 24. (c) 2009 Cloudera, Inc. or its licensors.  "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0 Tuesday, October 27, 2009