Tuesday, November 10, 2009
Socializing Big Data
         Lessons from the Hadoop Community



         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         November 10, 2009



My Background
        Thanks for Asking
        ▪   hammer@cloudera.com
        ▪   Studied Mathematics at Harvard
        ▪   Worked as a Quant on Wall Street
        ▪   Conceived, built, and led the Data team at Facebook
            ▪   Nearly 30 amazing engineers and data scientists
            ▪   Released Hive and Cassandra as open source projects
            ▪   Published research at conferences: SIGMOD, CHI, ICWSM
        ▪   Founder of Cloudera
            ▪   Rethinking data analysis with Apache Hadoop at the core

Presentation Outline
        ▪   What is Hadoop?
        ▪   Hadoop at Facebook
            ▪   Brief history of the Facebook Data team
            ▪   Summary of how we used Hadoop
            ▪   Reasons for choosing Hadoop
        ▪   How is software built and adopted?
            ▪   “Laboratory Life”
            ▪   Social Learning Theory
            ▪   Organizations and tools in open source development
        ▪   Moving from the “Age of Data” to the “Age of Learning”


What I’m Not Talking About
        Ask Questions
        ▪   How to build a team of data scientists
        ▪   Where and how to use data analysis in your organization
        ▪   The growing importance of measurement and attention
        ▪   Which tools to use for collecting, storing, and analyzing data
        ▪   Statistics, Machine Learning, Data Visualization, Open Data
        ▪   How data analysis is done outside of the web domain
        ▪   What Big Data means for your startup




The Apache Hadoop community is producing
        innovative, world class software for web
        scale data management and analysis.

        By studying how software is built and
        adopted, we can enhance the rate at which
        data processing technologies evolve.

        The Hadoop community is open to everyone and
        will play a central role in this evolution.
        You should join us!


What is Hadoop?
        Not Just a Stuffed Elephant
        ▪   Open source project, written mostly in Java
        ▪   Inspired by Google infrastructure
            ▪   Software for “warehouse-scale computers”
        ▪   Hundreds of production deployments
        ▪   Project structure
            ▪   Hadoop Distributed File System (HDFS)
            ▪   Hadoop MapReduce
            ▪   Hadoop Common: client libraries and management tools
            ▪   Other subprojects: Avro, HBase, Hive, Pig, ZooKeeper

Anatomy of a Hadoop Cluster
        ▪   Commodity servers
            ▪   2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC
        ▪   Typically arranged in a two-level architecture
            ▪   40 nodes per rack
        ▪   Inexpensive to acquire and maintain

        [Figure: Commodity Hardware Cluster. Typically in 2 level architecture; nodes are commodity Linux PCs; 40 nodes/rack]

HDFS
        HDFS distributes file blocks among servers
        HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers.

        [Figure: HDFS distributes file blocks among servers]

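The way HDFS distributes file blocks among servers can be sketched in a few lines of Python. This is a minimal illustration, not the HDFS implementation: the 64 MB block size, the round-robin placement policy, the server names, and the function names `split_into_blocks` and `place_replicas` are all assumptions made for the example. Real HDFS placement also takes rack topology into account.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (64 MB); configurable in practice
REPLICATION = 3                # three copies of each block, the common case


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, servers, replication=REPLICATION):
    """Assign each block to `replication` distinct servers, round-robin.

    Hypothetical placement policy for illustration only.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = (block_id * replication) % len(servers)
        placement[block_id] = [
            servers[(start + i) % len(servers)] for i in range(replication)
        ]
    return placement


# A 200 MB file on a hypothetical five-server pool.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_replicas(len(blocks), ["s1", "s2", "s3", "s4", "s5"])
for block_id, holders in placement.items():
    print(block_id, holders)
```

Because every block lives on three different servers, losing any single server leaves at least two copies of each block available.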
MapReduce
                     MapReduce pushes work out to the data
        Hadoop takes advantage of HDFS' data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.

        [Figure: Hadoop pushes work out to the data]

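The idea of pushing work out to the data can be illustrated with a toy word count. This is a single-process sketch of the map, shuffle, and reduce phases, not Hadoop's API: the server names, the sample text, and the function names are hypothetical. In a real cluster each map call would run on a server that already holds the block.

```python
from collections import defaultdict


def map_words(block):
    """Map step: conceptually runs where the block lives; emits (word, 1) pairs."""
    return [(word, 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Group emitted values by key across all map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each word."""
    return {word: sum(values) for word, values in groups.items()}


# Hypothetical layout: which server holds which block of text.
blocks_by_server = {
    "server-1": "big data big",
    "server-2": "data elephant",
}

mapped = [map_words(block) for block in blocks_by_server.values()]
counts = reduce_counts(shuffle(mapped))
print(counts)
```

Only the small (word, count) pairs cross the network in the shuffle; the raw blocks stay where they are stored.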
Hadoop Subprojects
        ▪   Avro
            ▪   Cross-language framework for data serialization and RPC
        ▪   HBase
            ▪   Table storage above HDFS, modeled after Google’s BigTable
        ▪   Hive
            ▪   SQL interface to structured data stored in HDFS
        ▪   Pig
            ▪   Language for data flow programming
        ▪   ZooKeeper
            ▪   Coordination service for distributed systems

Hadoop at Yahoo!
        ▪   Jan 2006: Hired Doug Cutting
        ▪   Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
        ▪   Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
        ▪   Aug 2008: Deployed 4,000 node Hadoop cluster
        ▪   May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds
            ▪   Sorted 1 PB on 3,658 nodes in 16.25 hours
        ▪   Other data points
            ▪   Over 25,000 nodes running Hadoop across 17 clusters
            ▪   Hundreds of thousands of jobs per day from over 600 users
            ▪   82 PB of data
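
To compare the sort results above, it helps to convert each one to throughput. The sketch below uses only the sizes, node counts, and times listed on this slide; the decimal definition of a terabyte and the helper name `throughput_mb_per_s` are assumptions.

```python
TB = 10**12  # decimal terabyte, an assumption; the slide does not say which

runs = [
    # (label, bytes sorted, nodes, seconds), numbers taken from the slide
    ("Apr 2006", 1.9 * TB, 188, 47 * 3600),
    ("Apr 2008", 1 * TB, 910, 209),
    ("May 2009", 1 * TB, 1460, 62),
    ("May 2009 (1 PB)", 1000 * TB, 3658, 16.25 * 3600),
]


def throughput_mb_per_s(size_bytes, seconds):
    """Aggregate sort throughput in megabytes per second."""
    return size_bytes / seconds / 10**6


for label, size, nodes, seconds in runs:
    total = throughput_mb_per_s(size, seconds)
    print(f"{label}: {total:,.0f} MB/s total, {total / nodes:.2f} MB/s per node")
```

Dividing by node count separates gains from adding hardware from gains in the software itself.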


Facebook Before Hadoop
        Early 2006: The First Research Scientist
        ▪   Source data living on horizontally partitioned MySQL tier
        ▪   Intensive historical analysis difficult
        ▪   No way to assess impact of changes to the site


        ▪   First try: Python scripts pull data into MySQL
        ▪   Second try: Python scripts pull data into Oracle


        ▪   ...and then we turned on impression logging



Facebook Data Infrastructure
        2007
        [Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]

Facebook Data Infrastructure
        2008
        [Diagram: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]

Facebook Workloads
        ▪   Data collection
            ▪   server logs
            ▪   application databases
            ▪   web crawls
        ▪   Thousands of multi-stage processing pipelines
            ▪   Summaries consumed by external users
            ▪   Summaries for internal reporting
            ▪   Ad optimization pipeline
            ▪   Experimentation platform pipeline
        ▪   Ad hoc analyses


Workload Statistics
        Facebook 2009
        ▪   Largest cluster running Hive: 4,800 cores, 5.5 PB of storage
        ▪   4 TB of compressed new data added per day
        ▪   135 TB of compressed data scanned per day
        ▪   7,500+ Hive jobs per day
        ▪   80K compute hours per day
        ▪   Around 200 people per month run Hive jobs



            (data from Ashish Thusoo’s Hadoop World NYC presentation)
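
Two ratios follow from the figures above with simple arithmetic. This is a rough sketch: it assumes decimal units and that all of the daily scanning is done by Hive jobs, which the slide does not state. Since the slide gives "7,500+" jobs, the per-job figure is an upper bound on the average.

```python
new_tb_per_day = 4          # compressed new data added per day (from the slide)
scanned_tb_per_day = 135    # compressed data scanned per day (from the slide)
hive_jobs_per_day = 7500    # the slide says "7,500+", so this is a lower bound

# How many times more data is read than written each day.
scan_to_ingest = scanned_tb_per_day / new_tb_per_day

# Average compressed data scanned per job, assuming Hive jobs do all the scanning.
avg_scan_gb_per_job = scanned_tb_per_day * 1000 / hive_jobs_per_day

print(f"scan-to-ingest ratio: {scan_to_ingest:.2f}")
print(f"average scan per Hive job: at most {avg_scan_gb_per_job:.1f} GB")
```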


Why Did Facebook Choose Hadoop?
        1. Demonstrated effectiveness for primary workload
        2. Proven ability to scale past any commercial vendor
        3. Easy provisioning and capacity planning with commodity nodes
        4. Data access for engineers and business analysts
        5. Single system to manage XML, JSON, text, and relational data
        6. No schemas enabled data collection without involving Data team
        7. Cost of software: zero dollars
        8. Deep commitment to continued development from Yahoo!
        9. Active user and developer community
        10. Apache-licensed open source code; ASF owns copyright

Hadoop Community Support
        People Build Technology
        ▪   185+ contributors to the open source code base
            ▪   ~50 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
        ▪   Over 500 (paid!) attendees at Hadoop World NYC
        ▪   Three books (O’Reilly, Apress, Manning)
        ▪   Training videos free online
        ▪   Regular user group meetups in many cities
        ▪   University courses across the world
        ▪   Growing consultant and systems integrator expertise
        ▪   Commercial training, certification, and support from Cloudera

How Software is Built
        Methodological Reflexivity
        ▪   Latour and Woolgar’s “Laboratory Life”
            ▪   Study scientists doing science
            ▪   Use “thick descriptions” and focus on “microconcerns”
        ▪   Some studies of closed and open source development exist
            ▪   “The Mythical Man-Month”, “The Cathedral and the Bazaar”
            ▪   Hertel et al. surveyed 141 Linux kernel developers
        ▪   Focus on the people creating code
        ▪   Less religion, more empirical analyses
        ▪   Build tools to facilitate interaction and output

Building Open Source Software
        Structural Conditions for Success
        ▪   Moon and Sproull proposed some rules for successful projects
            ▪   Authority comes from competence
            ▪   Leaders have clear responsibilities and delegate often
            ▪   The code has a modular structure
            ▪   Establish a parallel release policy: stable and experimental
            ▪   Give credit to non-source contributions, e.g. documentation
            ▪   Communicate clear rules and norms for community online
            ▪   Use simple and reliable communication tools



Building Software Faster
        Consolidate Best Practices
        ▪   JavaScript frameworks starting to converge
            ▪   Many adopting jQuery’s selector syntax
            ▪   Significant benchmarks emerging
        ▪   Web frameworks push idioms into project structure
            ▪   What would be the Rails/Django equivalent for data storage?
            ▪   Reusable components also nice, e.g. log structured merge trees
            ▪   Compare work on BOOM, RodentStore
        ▪   Debian distributes release note writing responsibility via “beats”



Complications of Open Source
        ▪   Intellectual property
            ▪   Trademark, Copyright, Patent, and Trade Secret
            ▪   Litigation history
        ▪   Business models and foundations to ensure long-term support
            ▪   Direct support: Red Hat, MySQL
            ▪   Indirect support: LLVM, GSoC
            ▪   Foundations: Apache, Python, Django
        ▪   Diversity of licenses
            ▪   Licenses form communities
            ▪   Licenses change over time (cf. Rambus BSD incident)


How Software is Adopted
        Choosing the Right Tool for the Job
        ▪   Must be aware that a software project exists
            ▪   Tools like GitHub, Ohloh, Launchpad
            ▪   Sites like Reddit and Hacker News
        ▪   Existing example use cases are critical
            ▪   At Facebook, we studied motivations for content production
            ▪   Especially effective: Bandura’s “Social Learning Theory”
            ▪   Hadoop being run in production at scale by Yahoo!/Facebook
        ▪   Active user communities and great documentation
            ▪   Reward first approach

Open Learning
        Open Data, Hypotheses and Workflows
        ▪   In science, data is generated once and analyzed many times
            ▪   IceCube
            ▪   LHC
        ▪   Lots of places where data and visualizations get shared
            ▪   data.gov, Many Eyes, Swivel, theinfo.org, InfoChimps, iCharts
        ▪   Record which hypotheses and workflows have been applied
        ▪   Increase diversity of questions asked and applications built
        ▪   Analysis skills unevenly distributed; send skills to the data!



The Future of Data Processing
        Hadoop, the Browser, and Collaboration
        ▪   “The Unreasonable Effectiveness of Data”, “MAD Skills”
        ▪   Single namespace for your organization’s bits
        ▪   Single engine for distributed data processing
        ▪   Materialization of structured subsets into optimized stores
        ▪   Browser as client interface with focus on user experience
        ▪   The system gets better over time using workload information
        ▪   Cloning and sharing of common libraries and workflows
        ▪   Global metadata store driving collection, analysis, and reporting
        ▪   Version control within and between sites, cf. Orchestra

Cloudera Offerings
        Only One Slide, I Promise
        ▪   Two software products
            ▪   Cloudera’s Distribution for Hadoop
            ▪   Cloudera Desktop
            ▪   ...more on the way
        ▪   Training and Certification
            ▪   For Developers, Operators, and Managers
        ▪   Support
        ▪   Professional services



Cloudera Desktop
                             Big Data can be Beautiful




(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0




How to release an Open Source Dataweave Library
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdf
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First Frame
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenarios
 
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
 
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptxEmil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
Emil Eifrem at GraphSummit Copenhagen 2024 - The Art of the Possible.pptx
 

20091110startup2startup

  • 2. Socializing Big Data Lessons from the Hadoop Community Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera November 10, 2009 Tuesday, November 10, 2009
  • 3. My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Conceived, built, and led the Data team at Facebook ▪ Nearly 30 amazing engineers and data scientists ▪ Released Hive and Cassandra as open source projects ▪ Published research at conferences: SIGMOD, CHI, ICWSM ▪ Founder of Cloudera ▪ Rethinking data analysis with Apache Hadoop at the core
  • 4. Presentation Outline ▪ What is Hadoop? ▪ Hadoop at Facebook ▪ Brief history of the Facebook Data team ▪ Summary of how we used Hadoop ▪ Reasons for choosing Hadoop ▪ How is software built and adopted? ▪ “Laboratory Life” ▪ Social Learning Theory ▪ Organizations and tools in open source development ▪ Moving from the “Age of Data” to the “Age of Learning”
  • 5. What I’m Not Talking About Ask Questions ▪ How to build a team of data scientists ▪ Where and how to use data analysis in your organization ▪ The growing importance of measurement and attention ▪ Which tools to use for collecting, storing, and analyzing data ▪ Statistics, Machine Learning, Data Visualization, Open Data ▪ How data analysis is done outside of the web domain ▪ What Big Data means for your startup
  • 6. The Apache Hadoop community is producing innovative, world class software for web scale data management and analysis. By studying how software is built and adopted, we can enhance the rate at which data processing technologies evolve. The Hadoop community is open to everyone and will play a central role in this evolution. You should join us!
  • 7. What is Hadoop? Not Just a Stuffed Elephant ▪ Open source project, written mostly in Java ▪ Inspired by Google infrastructure ▪ Software for “warehouse-scale computers” ▪ Hundreds of production deployments ▪ Project structure ▪ Hadoop Distributed File System (HDFS) ▪ Hadoop MapReduce ▪ Hadoop Common: client libraries and management tools ▪ Other subprojects: Avro, HBase, Hive, Pig, Zookeeper
  • 8. Anatomy of a Hadoop Cluster ▪ Commodity servers ▪ 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC ▪ Typically arranged in 2 level architecture ▪ 40 nodes per rack ▪ Inexpensive to acquire and maintain [Figure: Commodity Hardware Cluster. Typically in 2 level architecture; nodes are commodity Linux PCs; 40 nodes/rack]
  • 9. HDFS HDFS distributes file blocks among servers. HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers. [Figure 2: HDFS distributes file blocks among servers]
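The block splitting and three-way replication described on the HDFS slide can be sketched in a few lines of Python. This is an illustrative toy, not HDFS code: the function names `split_into_blocks` and `place_replicas`, the tiny block size, and the round-robin placement are all assumptions made for the example (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Break an incoming file into fixed-size pieces ("blocks")."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list[str],
                   replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct servers.

    Round-robin placement is an assumption for illustration only;
    it is not the policy HDFS actually uses.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    for block_id, holders in place_replicas(len(blocks), ["s1", "s2", "s3", "s4", "s5"]).items():
        print(block_id, blocks[block_id], holders)
```

Because every block lives on three different servers, losing any single server in this sketch still leaves two copies of each block available.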
  • 10. MapReduce MapReduce pushes work out to the data. Hadoop takes advantage of HDFS' data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems. [Figure 3: Hadoop pushes work out to the data]
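The idea of pushing work out to the data can be illustrated with a toy word count. This is a single-process sketch, not the Hadoop MapReduce API: `map_block`, `reduce_counts`, `run_job`, and the node names are hypothetical, and the loop over nodes only simulates what a cluster would run in parallel on the machines that hold each block.

```python
from collections import Counter


def map_block(block: str) -> Counter:
    """Map step: count words in one block, using only that block's data."""
    return Counter(block.split())


def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce step: merge the small per-block summaries into one result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(blocks_by_node: dict[str, list[str]]) -> Counter:
    """Run the map step per node on its local blocks, then reduce.

    On a real cluster the inner work would execute on each node, so only
    the compact partial counts (not the raw blocks) cross the network.
    """
    partials = []
    for _node, local_blocks in blocks_by_node.items():
        for block in local_blocks:
            partials.append(map_block(block))
    return reduce_counts(partials)


if __name__ == "__main__":
    print(run_job({"node-a": ["big data", "data"], "node-b": ["big big"]}))
```

The design point is that the raw blocks never move: each node summarizes what it already stores, and only the summaries are combined.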
  • 11. Hadoop Subprojects ▪ Avro ▪ Cross-language framework for data serialization and RPC ▪ HBase ▪ Table storage above HDFS, modeled after Google’s BigTable ▪ Hive ▪ SQL interface to structured data stored in HDFS ▪ Pig ▪ Language for data flow programming ▪ Zookeeper ▪ Coordination service for distributed systems
  • 12. Hadoop at Yahoo! ▪ Jan 2006: Hired Doug Cutting ▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours ▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds ▪ Aug 2008: Deployed 4,000 node Hadoop cluster ▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds ▪ Sorted 1 PB on 3,658 nodes in 16.25 hours ▪ Other data points ▪ Over 25,000 nodes running Hadoop across 17 clusters ▪ Hundreds of thousands of jobs per day from over 600 users ▪ 82 PB of data
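To compare the sort results on the Yahoo! slide, it helps to convert each one into throughput. The sketch below does that arithmetic using only the figures quoted on the slide; the `SortRun` class is a hypothetical helper, and decimal units (1 TB = 1,000,000 MB, 1 PB = 1,000 TB) are an assumption, since the slide does not say which convention was used.

```python
from dataclasses import dataclass

MB_PER_TB = 1_000_000  # assumption: decimal units


@dataclass
class SortRun:
    label: str
    terabytes: float
    nodes: int
    seconds: float

    @property
    def cluster_mb_per_s(self) -> float:
        """Aggregate sort throughput of the whole cluster."""
        return self.terabytes * MB_PER_TB / self.seconds

    @property
    def node_mb_per_s(self) -> float:
        """Average sort throughput contributed by each node."""
        return self.cluster_mb_per_s / self.nodes


# Figures taken from the slide; hours converted to seconds.
RUNS = [
    SortRun("Apr 2006, 1.9 TB", 1.9, 188, 47 * 3600),
    SortRun("Apr 2008, 1 TB", 1.0, 910, 209),
    SortRun("May 2009, 1 TB", 1.0, 1460, 62),
    SortRun("May 2009, 1 PB", 1000.0, 3658, 16.25 * 3600),
]

if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.label}: {run.cluster_mb_per_s:,.0f} MB/s cluster, "
              f"{run.node_mb_per_s:.2f} MB/s per node")
```

Under these assumptions, per-node throughput rises by roughly two orders of magnitude between the 2006 run and the 2009 terabyte run, so the gains came from the software and hardware getting faster per machine as well as from adding nodes.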
  • 13. Facebook Before Hadoop Early 2006: The First Research Scientist ▪ Source data living on horizontally partitioned MySQL tier ▪ Intensive historical analysis difficult ▪ No way to assess impact of changes to the site ▪ First try: Python scripts pull data into MySQL ▪ Second try: Python scripts pull data into Oracle ▪ ...and then we turned on impression logging
  • 14. Facebook Data Infrastructure 2007 Scribe Tier MySQL Tier Data Collection Server Oracle Database Server
  • 15. Facebook Data Infrastructure 2008 Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers
  • 16. Facebook Workloads ▪ Data collection ▪ server logs ▪ application databases ▪ web crawls ▪ Thousands of multi-stage processing pipelines ▪ Summaries consumed by external users ▪ Summaries for internal reporting ▪ Ad optimization pipeline ▪ Experimentation platform pipeline ▪ Ad hoc analyses
  • 17. Workload Statistics Facebook 2009 ▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage ▪ 4 TB of compressed new data added per day ▪ 135 TB of compressed data scanned per day ▪ 7,500+ Hive jobs per day ▪ 80K compute hours per day ▪ Around 200 people per month run Hive jobs (data from Ashish Thusoo’s Hadoop World NYC presentation)
  • 18. Why Did Facebook Choose Hadoop? 1. Demonstrated effectiveness for primary workload 2. Proven ability to scale past any commercial vendor 3. Easy provisioning and capacity planning with commodity nodes 4. Data access for engineers and business analysts 5. Single system to manage XML, JSON, text, and relational data 6. No schemas enabled data collection without involving Data team 7. Cost of software: zero dollars 8. Deep commitment to continued development from Yahoo! 9. Active user and developer community 10. Apache-licensed open source code; ASF owns copyright
  • 19. Hadoop Community Support People Build Technology ▪ 185+ contributors to the open source code base ▪ ~50 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera ▪ Over 500 (paid!) attendees at Hadoop World NYC ▪ Three books (O’Reilly, Apress, Manning) ▪ Training videos free online ▪ Regular user group meetups in many cities ▪ University courses across the world ▪ Growing consultant and systems integrator expertise ▪ Commercial training, certification, and support from Cloudera
  • 20. How Software is Built Methodological Reflexivity ▪ Latour and Woolgar’s “Laboratory Life” ▪ Study scientists doing science ▪ Use “thick descriptions” and focus on “microconcerns” ▪ Some studies of closed and open source development exist ▪ “Mythical Man Month”, “Cathedral and the Bazaar” ▪ Hertel et al. surveyed 141 Linux kernel developers ▪ Focus on the people creating code ▪ Less religion, more empirical analyses ▪ Build tools to facilitate interaction and output
  • 21. Building Open Source Software Structural Conditions for Success ▪ Moon and Sproull proposed some rules for successful projects ▪ Authority comes from competence ▪ Leaders have clear responsibilities and delegate often ▪ The code has a modular structure ▪ Establish a parallel release policy: stable and experimental ▪ Give credit to non-source contributions, e.g. documentation ▪ Communicate clear rules and norms for community online ▪ Use simple and reliable communication tools
  • 22. Building Software Faster Consolidate Best Practices ▪ JavaScript frameworks starting to converge ▪ Many adopting jQuery’s selector syntax ▪ Significant benchmarks emerging ▪ Web frameworks push idioms into project structure ▪ What would be the Rails/Django equivalent for data storage? ▪ Reusable components also nice, e.g. log structured merge trees ▪ Compare work on BOOM, RodentStore ▪ Debian distributes release note writing responsibility via “beats”
  • 23. Complications of Open Source ▪ Intellectual property ▪ Trademark, Copyright, Patent, and Trade Secret ▪ Litigation history ▪ Business models and foundations to ensure long-term support ▪ Direct support: Red Hat, MySQL ▪ Indirect support: LLVM, GSoC ▪ Foundations: Apache, Python, Django ▪ Diversity of licenses ▪ Licenses form communities ▪ Licenses change over time (cf. Rambus BSD incident)
  • 24. How Software is Adopted Choosing the Right Tool for the Job ▪ Must be aware that a software project exists ▪ Tools like GitHub, Ohloh, Launchpad ▪ Sites like Reddit and Hacker News ▪ Existing example use cases are critical ▪ At Facebook, we studied motivations for content production ▪ Especially effective: Bandura’s “Social Learning Theory” ▪ Hadoop being run in production at scale by Yahoo!/Facebook ▪ Active user communities and great documentation ▪ Reward first approach
  • 25. Open Learning Open Data, Hypotheses and Workflows ▪ In science, data is generated once and analyzed many times ▪ IceCube ▪ LHC ▪ Lots of places where data and visualizations get shared ▪ data.gov, Many Eyes, Swivel, theinfo.org, InfoChimps, iCharts ▪ Record which hypotheses and workflows have been applied ▪ Increase diversity of questions asked and applications built ▪ Analysis skills unevenly distributed; send skills to the data!
  • 26. The Future of Data Processing Hadoop, the Browser, and Collaboration ▪ “The Unreasonable Effectiveness of Data”, “MAD Skills” ▪ Single namespace for your organization’s bits ▪ Single engine for distributed data processing ▪ Materialization of structured subsets into optimized stores ▪ Browser as client interface with focus on user experience ▪ The system gets better over time using workload information ▪ Cloning and sharing of common libraries and workflows ▪ Global metadata store driving collection, analysis, and reporting ▪ Version control within and between sites, cf. Orchestra
  • 27. Cloudera Offerings Only One Slide, I Promise ▪ Two software products ▪ Cloudera’s Distribution for Hadoop ▪ Cloudera Desktop ▪ ...more on the way ▪ Training and Certification ▪ For Developers, Operators, and Managers ▪ Support ▪ Professional services
  • 28. Cloudera Desktop Big Data can be Beautiful
  • 29. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0