SlideShare una empresa de Scribd logo
1 de 29
Descargar para leer sin conexión
Monday, February 1, 2010
What is Cloudera Building?
         For Now, At Least



         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         February 1, 2010



Monday, February 1, 2010
My Background
         Thanks for Asking
         ▪   hammer@cloudera.com
         ▪   Studied Mathematics at Harvard
         ▪   Worked as a Quant on Wall Street
         ▪   Conceived, built, and led Data team at Facebook
             ▪   Nearly 30 amazing engineers and data scientists
             ▪   Several open source projects and research papers
         ▪   Founder of Cloudera
             ▪   Vice President of Products and Chief Scientist
             ▪   Also, check out the book “Beautiful Data”

Monday, February 1, 2010
Presentation Outline
         ▪   What is Hadoop?
             ▪   HDFS and MapReduce
             ▪   Hive, Pig, Avro, Zookeeper, HBase
         ▪   My time at Facebook
             ▪   The problem we were trying to solve
             ▪   Why we chose Hadoop
         ▪   What we’re building at Cloudera
             ▪   What we build and sell right now
             ▪   Some plans for the future
         ▪   Questions and Discussion


Monday, February 1, 2010
What is Hadoop?
         ▪   Apache Software Foundation project, mostly written in Java
         ▪   Inspired by Google infrastructure
         ▪   Software for programming warehouse-scale computers (WSCs)
         ▪   Hundreds of production deployments
         ▪   Project structure
             ▪   Hadoop Distributed File System (HDFS)
             ▪   Hadoop MapReduce
             ▪   Hadoop Common
             ▪   Other subprojects
                 ▪   Avro, HBase, Hive, Pig, Zookeeper


Monday, February 1, 2010
Anatomy of a Hadoop Cluster
         ▪   Commodity servers
             ▪   1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC
             ▪   Or: 2 RU, 2 x 8 core CPU, 32 GB RAM, 12 x 1 TB SATA
         ▪   Typically arranged in 2 level architecture
                              Commodity Hardware                  Cluster
         ▪   Inexpensive to acquire and maintain




                           •! Typically in 2 level architecture
                               –! Nodes are commodity Linux PCs
Monday, February 1, 2010       –! 40 nodes/rack
HDFS
         ▪   Pool commodity servers into a single hierarchical namespace
         ▪   Break files into 128 MB blocks and replicate blocks
         ▪   Designed for large files written once but read many times
             ▪   Files are append-only via a single writer
         ▪   Two major daemons: NameNode and DataNode
             ▪   NameNode manages file system metadata
             ▪   DataNode manages data using local filesystem
         ▪   HDFS manages checksumming, replication, and compression
         ▪   Throughput scales nearly linearly with node cluster size



Monday, February 1, 2010
'$*31%10$13+3&'1%)#$#I%
                                #79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3"
                                /(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?."

                                                        HDFS
                                2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;"
                                #79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6"
                                '(-*-"0&"062--"%(//-2-)0".-2=-2.E"
             HDFS distributes file blocks among servers
                      "

                                                         " !"                 " F"

                                                          I"                   !"


                                    "                     H"                   H"
                                        F"

                                        !"                         " F"

                                        G"    #79:"                 G"

                                                                    I"
                                        I"

                                        H"               " !"                 " F"

                                                          G"                    G"

                                                          I"                    H"


                                                                                          "
                                        !"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.'
                                "
Monday, February 1, 2010
Hadoop MapReduce
         ▪   Fault tolerant execution layer and API for parallel data processing
         ▪   Can target multiple storage systems
         ▪   Key/value data model
         ▪   Two major daemons: JobTracker and TaskTracker
         ▪   Many client interfaces
             ▪   Java
             ▪   C++
             ▪   Streaming
             ▪   Pig
             ▪   SQL (Hive)


Monday, February 1, 2010
MapReduce
                      MapReduce pushes work out to the data
             (#)**+%$#41'%
                                            Q"                         K"
             #)5#0$#.1%*6%(/789%
             )#$#%)&'$3&:;$&*0%             !"                         Q"
             '$3#$1.<%$*%+;'"%=*34%
                                            N"                         N"
             *;$%$*%>#0<%0*)1'%&0%#%
             ?@;'$13A%B"&'%#@@*='%
             #0#@<'1'%$*%3;0%&0%                          K"
             +#3#@@1@%#0)%1@&>&0#$1'%
             $"1%:*$$@101?4'%                             P"
             &>+*'1)%:<%>*0*@&$"&?%                       !"
             '$*3#.1%'<'$1>'A%
                                            Q"
                                                                       K"
                                            P"
                                                                       P"
                                            !"
                                                                       N"



                                                                                     "
                                                 !"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+'
Monday, February 1, 2010                "
Hadoop Subprojects
         ▪   Avro
             ▪   Cross-language framework for RPC and serialization
         ▪   HBase
             ▪   Table storage on top of HDFS, modeled after Google’s BigTable
         ▪   Hive
             ▪   SQL interface to structured data stored in HDFS
         ▪   Pig
             ▪   Language for data flow programming; also Owl, Zebra, SQL
         ▪   Zookeeper
             ▪   Coordination service for distributed systems

Monday, February 1, 2010
Hadoop Community Support
         ▪   185+ contributors to the open source code base
             ▪   ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
         ▪   Over 500 (paid!) attendees at Hadoop World NYC
         ▪   Regular user group meetups in many cities
             ▪   Bay Area Meetup group has 534 members
         ▪   Three books (O’Reilly, Apress, Manning)
         ▪   Training videos free online
         ▪   University courses across the world
         ▪   Growing consultant and systems integrator expertise
         ▪   Training, certification, support, and services from Cloudera

Monday, February 1, 2010
Hadoop Project Mechanics
         ▪   Trademark owned by ASF; Apache 2.0 license for code
         ▪   Rigorous unit, smoke, performance, and system tests
         ▪   Release cycle of 9 months
             ▪   Last major release: 0.20.0 on April 22, 2009
             ▪   0.21.0 will be last release before 1.0; nearly complete
             ▪   Subprojects on different release cycles
         ▪   Releases put to a vote according to Apache guidelines
         ▪   Releases made available as tarballs on Apache and mirrors
         ▪   Cloudera packages a distribution for many platforms
             ▪   RPM and Debian packages; AMI for Amazon’s EC2


Monday, February 1, 2010
Hadoop at Facebook
         Early 2006: The First Research Scientist
         ▪   Source data living on horizontally partitioned MySQL tier
         ▪   Intensive historical analysis difficult
         ▪   No way to assess impact of changes to the site


         ▪   First try: Python scripts pull data into MySQL
         ▪   Second try: Python scripts pull data into Oracle


         ▪   ...and then we turned on impression logging



Monday, February 1, 2010
Facebook Data Infrastructure
                                                 2007
                                  Scribe Tier                     MySQL Tier




                                                Data Collection
                                                    Server




                                                Oracle Database
                                                     Server




Monday, February 1, 2010
Facebook Data Infrastructure
                                                        2008
                                        Scribe Tier            MySQL Tier




                                Hadoop Tier




                                   Oracle RAC Servers




Monday, February 1, 2010
Major Data Team Workloads
         ▪   Data collection
             ▪   server logs
             ▪   application databases
             ▪   web crawls
         ▪   Thousands of multi-stage processing pipelines
             ▪   Summaries consumed by external users
             ▪   Summaries for internal reporting
             ▪   Ad optimization pipeline
             ▪   Experimentation platform pipeline
         ▪   Ad hoc analyses


Monday, February 1, 2010
Workload Statistics
         Facebook 2010
         ▪   Largest cluster running Hive: 8,400 cores, 12.5 PB of storage
         ▪   12 TB of compressed new data added per day
         ▪   135TB of compressed data scanned per day
         ▪   7,500+ Hive jobs on per day
         ▪   80K compute hours per day
         ▪   Around 200 people per month run Hive jobs



             (data from Ashish Thusoo’s Bay Area ACM DM SIG presentation)


Monday, February 1, 2010
Why Did Facebook Choose Hadoop?
         1. Demonstrated effectiveness for primary workload
         2. Proven ability to scale past any commercial vendor
         3. Easy provisioning and capacity planning with commodity nodes
         4. Data access for all: engineers, business analysts, sales managers
         5. Single system to manage XML/JSON, text, and relational data
         6. No schemas enabled data collection without involving Data team
         7. Simple, modular architecture
         8. Easy to build, deploy, and monitor
         9. Apache-licensed open source code granted to ASF



Monday, February 1, 2010
Why Did Facebook Choose Hadoop?
         ▪   Most importantly: the community
             ▪   Broad and deep commitment to future development from
                 multiple organizations
             ▪   Interaction with a community often useful for recruiting
             ▪   Growing body of users and operators with prior expertise
                 meant lower cost of training new users
             ▪   Learn about best practices from other organizations
             ▪   Widely available public materials for improving skills
             ▪   Not then, but now
                 ▪   Commercial training, certification, support, and services
                 ▪   Growing body of complementary software


Monday, February 1, 2010
Hadoop at Cloudera
         Cloudera’s Distribution for Hadoop
         ▪   Open source distribution of Apache Hadoop for enterprise use
             ▪   Includes HDFS, MapReduce, Pig, Hive, HBase, and ZooKeeper
             ▪   Ensures cross-subproject compatibility
             ▪   Adds backported patches and customer-specific patches
             ▪   Adds Cloudera utilities like MRUnit and Sqoop
             ▪   Better integration with daemon administration utilities
             ▪   Follows the Filesystem Hierarchy Standard (FHS) for file layout
             ▪   Tools for automatically generating a configuration
             ▪   Packaged as RPM, DEB, AMI, or tarball


Monday, February 1, 2010
Hadoop at Cloudera
         Training and Certification
         ▪   Free online training
             ▪   Basic, Intermediate (including Hive and Pig), and Advanced
             ▪   Includes a virtual machine with software and exercises
         ▪   Live training sessions
             ▪   One live session per month somewhere in the world
             ▪   If you have a large group, we may come to you
         ▪   Certification
             ▪   Exams for Developers, Administrators, and Managers
             ▪   Administered online or in person

Monday, February 1, 2010
Hadoop at Cloudera
         Services and Support
         ▪   Professional Services
             ▪   Get Hadoop up and running in your environment
             ▪   Optimize an existing Hadoop infrastructure
             ▪   Design new algorithms to make the most of your data
         ▪   Support
             ▪   Unlimited questions for Cloudera’s technical team
             ▪   Access to our Knowledge Base
             ▪   Help prioritize feature development for CDH
             ▪   Early access to upcoming Cloudera software products


Monday, February 1, 2010
Hadoop at Cloudera
         Upcoming Work: HDFS
         ▪   Symlinks
         ▪   Security
         ▪   Native client
         ▪   Refactoring of read and write path
         ▪   Fast upgrade and recovery; later, high availability
         ▪   Integration of Avro
         ▪   Integration with existing monitoring tools
         ▪   Better stress testing at scale



Monday, February 1, 2010
Hadoop at Cloudera
         Upcoming Work: MapReduce
         ▪   Avro end-to-end
             ▪   Avro in the shuffle
             ▪   Avro input and output formats
         ▪   Security
         ▪   Continued API evolution; better support for non-Java languages
         ▪   System counters
         ▪   Integration with a resource manager: Nexus?
         ▪   Potential rewrite of master process, the JobTracker
         ▪   Potential integration of MOP

Monday, February 1, 2010
Hadoop at Cloudera
         Upcoming Work: Avro
         ▪   Complete the RPC spec: SASL over TCP sockets, HTTP
             ▪   Also: counters for debugging
         ▪   Improved language support
             ▪   Currently: C, Java, C++, Python, Ruby
             ▪   Define “levels” of implementation
         ▪   Better command line utilities and tools for authoring schemas
         ▪   Additional compression codecs
         ▪   Additional benchmarks
         ▪   Integration with other projects (HDFS, MapReduce, Hive, HBase)

Monday, February 1, 2010
Hadoop at Cloudera
         Commercial Software
         ▪   General thesis: build commercially-licensed software products
             which complement CDH for data management and analysis
         ▪   Current products
             ▪   Cloudera Desktop
                 ▪   Extensible interface for users of Cloudera software
         ▪   Upcoming products
             ▪   Nothing in writing




Monday, February 1, 2010
Cloudera Desktop
                           Big Data can be Beautiful




Monday, February 1, 2010
(c) 2009 Cloudera, Inc. or its licensors.  "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0




Monday, February 1, 2010

Más contenido relacionado

Destacado

The Onion Patch: Migration in Open Source Ecosystems
The Onion Patch: Migration in Open Source EcosystemsThe Onion Patch: Migration in Open Source Ecosystems
The Onion Patch: Migration in Open Source EcosystemsPatrick Wagstrom
 
Open Source Business Ecosystem - PhD work
Open Source Business Ecosystem - PhD workOpen Source Business Ecosystem - PhD work
Open Source Business Ecosystem - PhD workJonathan Le Lous
 
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Jeff Hammerbacher
 
Open Source Migration
Open Source MigrationOpen Source Migration
Open Source Migrationrw2
 

Destacado (7)

20100423sage
20100423sage20100423sage
20100423sage
 
The Onion Patch: Migration in Open Source Ecosystems
The Onion Patch: Migration in Open Source EcosystemsThe Onion Patch: Migration in Open Source Ecosystems
The Onion Patch: Migration in Open Source Ecosystems
 
Open Source Business Ecosystem - PhD work
Open Source Business Ecosystem - PhD workOpen Source Business Ecosystem - PhD work
Open Source Business Ecosystem - PhD work
 
20090422 Www
20090422 Www20090422 Www
20090422 Www
 
20080528dublinpt1
20080528dublinpt120080528dublinpt1
20080528dublinpt1
 
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
 
Open Source Migration
Open Source MigrationOpen Source Migration
Open Source Migration
 

Similar a 20100201hplabs

OSGi Provisioning With Apache ACE
OSGi Provisioning With Apache ACEOSGi Provisioning With Apache ACE
OSGi Provisioning With Apache ACEmfrancis
 
Hook Mobile Living Social Hackathon MoDevUX 2012
Hook Mobile Living Social Hackathon MoDevUX 2012Hook Mobile Living Social Hackathon MoDevUX 2012
Hook Mobile Living Social Hackathon MoDevUX 2012Wayne Chen
 
Massive device deployment - EclipseCon 2011
Massive device deployment - EclipseCon 2011Massive device deployment - EclipseCon 2011
Massive device deployment - EclipseCon 2011Angelo van der Sijpt
 
U of U Undergraduate IMC Class
U of U Undergraduate IMC ClassU of U Undergraduate IMC Class
U of U Undergraduate IMC ClassChris Carlston
 
20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享elevenma
 
Big Data @ Orange - Dev Day 2013 - part 2
Big Data @ Orange - Dev Day 2013 - part 2Big Data @ Orange - Dev Day 2013 - part 2
Big Data @ Orange - Dev Day 2013 - part 2ovarene
 
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...Rick G. Garibay
 
Cocoa pods iOSDevUK 14 talk: managing your libraries
Cocoa pods iOSDevUK 14 talk: managing your librariesCocoa pods iOSDevUK 14 talk: managing your libraries
Cocoa pods iOSDevUK 14 talk: managing your librariesDiego Freniche Brito
 
JaanSi Solutions & Services profile (v1.0)
JaanSi Solutions & Services profile (v1.0)JaanSi Solutions & Services profile (v1.0)
JaanSi Solutions & Services profile (v1.0)Siddhartha Shankar
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"Portland R User Group
 

Similar a 20100201hplabs (20)

20091030nasajpl
20091030nasajpl20091030nasajpl
20091030nasajpl
 
Device deployment
Device deploymentDevice deployment
Device deployment
 
OSGi Provisioning With Apache ACE
OSGi Provisioning With Apache ACEOSGi Provisioning With Apache ACE
OSGi Provisioning With Apache ACE
 
All about Apache ACE
All about Apache ACEAll about Apache ACE
All about Apache ACE
 
Hook Mobile Living Social Hackathon MoDevUX 2012
Hook Mobile Living Social Hackathon MoDevUX 2012Hook Mobile Living Social Hackathon MoDevUX 2012
Hook Mobile Living Social Hackathon MoDevUX 2012
 
InnoDB Magic
InnoDB MagicInnoDB Magic
InnoDB Magic
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Massive device deployment - EclipseCon 2011
Massive device deployment - EclipseCon 2011Massive device deployment - EclipseCon 2011
Massive device deployment - EclipseCon 2011
 
Push Podc09
Push Podc09Push Podc09
Push Podc09
 
U of U Undergraduate IMC Class
U of U Undergraduate IMC ClassU of U Undergraduate IMC Class
U of U Undergraduate IMC Class
 
HARE 2010 Review
HARE 2010 ReviewHARE 2010 Review
HARE 2010 Review
 
hadoop事例紹介
hadoop事例紹介hadoop事例紹介
hadoop事例紹介
 
The Project Trap
The Project TrapThe Project Trap
The Project Trap
 
20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享20th.陈晓鸣 百度海量日志分析架构及处理经验分享
20th.陈晓鸣 百度海量日志分析架构及处理经验分享
 
Big Data @ Orange - Dev Day 2013 - part 2
Big Data @ Orange - Dev Day 2013 - part 2Big Data @ Orange - Dev Day 2013 - part 2
Big Data @ Orange - Dev Day 2013 - part 2
 
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
Visual Studio 2013, Xamarin and Microsoft Azure Mobile Services: A Match Made...
 
Cocoa pods iOSDevUK 14 talk: managing your libraries
Cocoa pods iOSDevUK 14 talk: managing your librariesCocoa pods iOSDevUK 14 talk: managing your libraries
Cocoa pods iOSDevUK 14 talk: managing your libraries
 
JaanSi Solutions & Services profile (v1.0)
JaanSi Solutions & Services profile (v1.0)JaanSi Solutions & Services profile (v1.0)
JaanSi Solutions & Services profile (v1.0)
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 

Más de Jeff Hammerbacher (19)

20120223keystone
20120223keystone20120223keystone
20120223keystone
 
20100714accel
20100714accel20100714accel
20100714accel
 
20100608sigmod
20100608sigmod20100608sigmod
20100608sigmod
 
20100513brown
20100513brown20100513brown
20100513brown
 
20100418sos
20100418sos20100418sos
20100418sos
 
20100301icde
20100301icde20100301icde
20100301icde
 
20090309berkeley
20090309berkeley20090309berkeley
20090309berkeley
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
20081022cca
20081022cca20081022cca
20081022cca
 
20081009nychive
20081009nychive20081009nychive
20081009nychive
 
2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
 
20080611accel
20080611accel20080611accel
20080611accel
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Hdfs Dhruba
Hdfs DhrubaHdfs Dhruba
Hdfs Dhruba
 
20080529dublinpt1
20080529dublinpt120080529dublinpt1
20080529dublinpt1
 
20080528dublinpt3
20080528dublinpt320080528dublinpt3
20080528dublinpt3
 
20080529dublinpt3
20080529dublinpt320080529dublinpt3
20080529dublinpt3
 
20080529dublinpt2
20080529dublinpt220080529dublinpt2
20080529dublinpt2
 

Último

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Último (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

20100201hplabs

  • 2. What is Cloudera Building? For Now, At Least Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera February 1, 2010 Monday, February 1, 2010
  • 3. My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Conceived, built, and led Data team at Facebook ▪ Nearly 30 amazing engineers and data scientists ▪ Several open source projects and research papers ▪ Founder of Cloudera ▪ Vice President of Products and Chief Scientist ▪ Also, check out the book “Beautiful Data” Monday, February 1, 2010
  • 4. Presentation Outline ▪ What is Hadoop? ▪ HDFS and MapReduce ▪ Hive, Pig, Avro, Zookeeper, HBase ▪ My time at Facebook ▪ The problem we were trying to solve ▪ Why we chose Hadoop ▪ What we’re building at Cloudera ▪ What we build and sell right now ▪ Some plans for the future ▪ Questions and Discussion Monday, February 1, 2010
  • 5. What is Hadoop? ▪ Apache Software Foundation project, mostly written in Java ▪ Inspired by Google infrastructure ▪ Software for programming warehouse-scale computers (WSCs) ▪ Hundreds of production deployments ▪ Project structure ▪ Hadoop Distributed File System (HDFS) ▪ Hadoop MapReduce ▪ Hadoop Common ▪ Other subprojects ▪ Avro, HBase, Hive, Pig, Zookeeper Monday, February 1, 2010
  • 6. Anatomy of a Hadoop Cluster ▪ Commodity servers ▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC ▪ Or: 2 RU, 2 x 8 core CPU, 32 GB RAM, 12 x 1 TB SATA ▪ Typically arranged in 2 level architecture Commodity Hardware Cluster ▪ Inexpensive to acquire and maintain •! Typically in 2 level architecture –! Nodes are commodity Linux PCs Monday, February 1, 2010 –! 40 nodes/rack
  • 7. HDFS ▪ Pool commodity servers into a single hierarchical namespace ▪ Break files into 128 MB blocks and replicate blocks ▪ Designed for large files written once but read many times ▪ Files are append-only via a single writer ▪ Two major daemons: NameNode and DataNode ▪ NameNode manages file system metadata ▪ DataNode manages data using local filesystem ▪ HDFS manages checksumming, replication, and compression ▪ Throughput scales nearly linearly with node cluster size Monday, February 1, 2010
  • 8. '$*31%10$13+3&'1%)#$#I% #79:"5$)$3-.".0&2$3-"&)"06-"*+,.0-2"84"82-$?()3"()*&5()3" /(+-."()0&"'(-*-.;"*$++-%"C8+&*?.;D"$)%".0&2()3"-$*6"&/"06-"8+&*?." HDFS 2-%,)%$)0+4"$*2&.."06-"'&&+"&/".-2=-2.<""B)"06-"*&55&)"*$.-;" #79:".0&2-."062--"*&5'+-0-"*&'(-."&/"-$*6"/(+-"84"*&'4()3"-$*6" '(-*-"0&"062--"%(//-2-)0".-2=-2.E" HDFS distributes file blocks among servers " " !" " F" I" !" " H" H" F" !" " F" G" #79:" G" I" I" H" " !" " F" G" G" I" H" " !"#$%&'()'*+!,'-"./%"0$/&.'1"2&'02345.'6738#'.&%9&%.' " Monday, February 1, 2010
  • 9. Hadoop MapReduce ▪ Fault tolerant execution layer and API for parallel data processing ▪ Can target multiple storage systems ▪ Key/value data model ▪ Two major daemons: JobTracker and TaskTracker ▪ Many client interfaces ▪ Java ▪ C++ ▪ Streaming ▪ Pig ▪ SQL (Hive) Monday, February 1, 2010
  • 10. MapReduce MapReduce pushes work out to the data (#)**+%$#41'% Q" K" #)5#0$#.1%*6%(/789% )#$#%)&'$3&:;$&*0% !" Q" '$3#$1.<%$*%+;'"%=*34% N" N" *;$%$*%>#0<%0*)1'%&0%#% ?@;'$13A%B"&'%#@@*='% #0#@<'1'%$*%3;0%&0% K" +#3#@@1@%#0)%1@&>&0#$1'% $"1%:*$$@101?4'% P" &>+*'1)%:<%>*0*@&$"&?% !" '$*3#.1%'<'$1>'A% Q" K" P" P" !" N" " !"#$%&'()'*+,--.'.$/0&/'1-%2'-$3'3-'30&',+3+' Monday, February 1, 2010 "
  • 11. Hadoop Subprojects ▪ Avro ▪ Cross-language framework for RPC and serialization ▪ HBase ▪ Table storage on top of HDFS, modeled after Google’s BigTable ▪ Hive ▪ SQL interface to structured data stored in HDFS ▪ Pig ▪ Language for data flow programming; also Owl, Zebra, SQL ▪ Zookeeper ▪ Coordination service for distributed systems Monday, February 1, 2010
  • 12. Hadoop Community Support ▪ 185+ contributors to the open source code base ▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera ▪ Over 500 (paid!) attendees at Hadoop World NYC ▪ Regular user group meetups in many cities ▪ Bay Area Meetup group has 534 members ▪ Three books (O’Reilly, Apress, Manning) ▪ Training videos free online ▪ University courses across the world ▪ Growing consultant and systems integrator expertise ▪ Training, certification, support, and services from Cloudera Monday, February 1, 2010
  • 13. Hadoop Project Mechanics ▪ Trademark owned by ASF; Apache 2.0 license for code ▪ Rigorous unit, smoke, performance, and system tests ▪ Release cycle of 9 months ▪ Last major release: 0.20.0 on April 22, 2009 ▪ 0.21.0 will be last release before 1.0; nearly complete ▪ Subprojects on different release cycles ▪ Releases put to a vote according to Apache guidelines ▪ Releases made available as tarballs on Apache and mirrors ▪ Cloudera packages a distribution for many platforms ▪ RPM and Debian packages; AMI for Amazon’s EC2 Monday, February 1, 2010
  • 14. Hadoop at Facebook Early 2006: The First Research Scientist ▪ Source data living on horizontally partitioned MySQL tier ▪ Intensive historical analysis difficult ▪ No way to assess impact of changes to the site ▪ First try: Python scripts pull data into MySQL ▪ Second try: Python scripts pull data into Oracle ▪ ...and then we turned on impression logging Monday, February 1, 2010
  • 15. Facebook Data Infrastructure 2007 Scribe Tier MySQL Tier Data Collection Server Oracle Database Server Monday, February 1, 2010
  • 16. Facebook Data Infrastructure 2008 Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers Monday, February 1, 2010
  • 17. Major Data Team Workloads ▪ Data collection ▪ server logs ▪ application databases ▪ web crawls ▪ Thousands of multi-stage processing pipelines ▪ Summaries consumed by external users ▪ Summaries for internal reporting ▪ Ad optimization pipeline ▪ Experimentation platform pipeline ▪ Ad hoc analyses Monday, February 1, 2010
  • 18. Workload Statistics Facebook 2010 ▪ Largest cluster running Hive: 8,400 cores, 12.5 PB of storage ▪ 12 TB of compressed new data added per day ▪ 135TB of compressed data scanned per day ▪ 7,500+ Hive jobs on per day ▪ 80K compute hours per day ▪ Around 200 people per month run Hive jobs (data from Ashish Thusoo’s Bay Area ACM DM SIG presentation) Monday, February 1, 2010
  • 19. Why Did Facebook Choose Hadoop? 1. Demonstrated effectiveness for primary workload 2. Proven ability to scale past any commercial vendor 3. Easy provisioning and capacity planning with commodity nodes 4. Data access for all: engineers, business analysts, sales managers 5. Single system to manage XML/JSON, text, and relational data 6. No schemas enabled data collection without involving Data team 7. Simple, modular architecture 8. Easy to build, deploy, and monitor 9. Apache-licensed open source code granted to ASF Monday, February 1, 2010
  • 20. Why Did Facebook Choose Hadoop? ▪ Most importantly: the community ▪ Broad and deep commitment to future development from multiple organizations ▪ Interaction with a community often useful for recruiting ▪ Growing body of users and operators with prior expertise meant lower cost of training new users ▪ Learn about best practices from other organizations ▪ Widely available public materials for improving skills ▪ Not then, but now ▪ Commercial training, certification, support, and services ▪ Growing body of complementary software Monday, February 1, 2010
  • 21. Hadoop at Cloudera Cloudera’s Distribution for Hadoop ▪ Open source distribution of Apache Hadoop for enterprise use ▪ Includes HDFS, MapReduce, Pig, Hive, HBase, and ZooKeeper ▪ Ensures cross-subproject compatibility ▪ Adds backported patches and customer-specific patches ▪ Adds Cloudera utilities like MRUnit and Sqoop ▪ Better integration with daemon administration utilities ▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout ▪ Tools for automatically generating a configuration ▪ Packaged as RPM, DEB, AMI, or tarball Monday, February 1, 2010
  • 22. Hadoop at Cloudera Training and Certification ▪ Free online training ▪ Basic, Intermediate (including Hive and Pig), and Advanced ▪ Includes a virtual machine with software and exercises ▪ Live training sessions ▪ One live session per month somewhere in the world ▪ If you have a large group, we may come to you ▪ Certification ▪ Exams for Developers, Administrators, and Managers ▪ Administered online or in person Monday, February 1, 2010
  • 23. Hadoop at Cloudera Services and Support ▪ Professional Services ▪ Get Hadoop up and running in your environment ▪ Optimize an existing Hadoop infrastructure ▪ Design new algorithms to make the most of your data ▪ Support ▪ Unlimited questions for Cloudera’s technical team ▪ Access to our Knowledge Base ▪ Help prioritize feature development for CDH ▪ Early access to upcoming Cloudera software products Monday, February 1, 2010
  • 24. Hadoop at Cloudera Upcoming Work: HDFS ▪ Symlinks ▪ Security ▪ Native client ▪ Refactoring of read and write path ▪ Fast upgrade and recovery; later, high availability ▪ Integration of Avro ▪ Integration with existing monitoring tools ▪ Better stress testing at scale Monday, February 1, 2010
  • 25. Hadoop at Cloudera Upcoming Work: MapReduce ▪ Avro end-to-end ▪ Avro in the shuffle ▪ Avro input and output formats ▪ Security ▪ Continued API evolution; better support for non-Java languages ▪ System counters ▪ Integration with a resource manager: Nexus? ▪ Potential rewrite of master process, the JobTracker ▪ Potential integration of MOP Monday, February 1, 2010
  • 26. Hadoop at Cloudera Upcoming Work: Avro ▪ Complete the RPC spec: SASL over TCP sockets, HTTP ▪ Also: counters for debugging ▪ Improved language support ▪ Currently: C, Java, C++, Python, Ruby ▪ Define “levels” of implementation ▪ Better command line utilities and tools for authoring schemas ▪ Additional compression codecs ▪ Additional benchmarks ▪ Integration with other projects (HDFS, MapReduce, Hive, HBase) Monday, February 1, 2010
  • 27. Hadoop at Cloudera Commercial Software ▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis ▪ Current products ▪ Cloudera Desktop ▪ Extensible interface for users of Cloudera software ▪ Upcoming products ▪ Nothing in writing Monday, February 1, 2010
  • 28. Cloudera Desktop Big Data can be Beautiful Monday, February 1, 2010
  • 29. (c) 2009 Cloudera, Inc. or its licensors.  "Cloudera" is a registered trademark of Cloudera, Inc.. All rights reserved. 1.0 Monday, February 1, 2010