Thursday, January 28, 2010
Hadoop, Cloudera, and eBay
         Managing Petabytes with Open Source



         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         January 28, 2010



My Background
         Thanks for Asking
         ▪   hammer@cloudera.com
         ▪   Studied Mathematics at Harvard
         ▪   Worked as a Quant on Wall Street
         ▪   Conceived, built, and led Data team at Facebook
             ▪   Nearly 30 amazing engineers and data scientists
             ▪   Several open source projects and research papers
         ▪   Founder of Cloudera
             ▪   Vice President of Products and Chief Scientist
             ▪   Also, check out the book “Beautiful Data”

Presentation Outline
         ▪   What is Hadoop?
             ▪   HDFS and MapReduce
             ▪   Hive, Pig, Avro, ZooKeeper, HBase
         ▪   From Steve
             ▪   Why Hadoop?
             ▪   Hadoop for machine learning and modeling
             ▪   Other things I find interesting
         ▪   What we’re building at Cloudera
         ▪   Questions and Discussion



What is Hadoop?
         ▪   Apache Software Foundation project, mostly written in Java
         ▪   Inspired by Google infrastructure
         ▪   Software for programming warehouse-scale computers (WSCs)
         ▪   Hundreds of production deployments
         ▪   Project structure
             ▪   Hadoop Distributed File System (HDFS)
             ▪   Hadoop MapReduce
             ▪   Hadoop Common
             ▪   Other subprojects
                 ▪   Avro, HBase, Hive, Pig, ZooKeeper


Anatomy of a Hadoop Cluster
         ▪   Commodity servers
             ▪   1 RU, 2 x 4-core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 GbE NIC
             ▪   Or: 2 RU, 2 x 8-core CPU, 32 GB RAM, 12 x 1 TB SATA
         ▪   Typically arranged in a two-level architecture
             ▪   Nodes are commodity Linux PCs
             ▪   40 nodes per rack
         ▪   Inexpensive to acquire and maintain

         [Figure: commodity hardware cluster]

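
The hardware figures above lend themselves to a quick capacity estimate. The sketch below is only back-of-the-envelope arithmetic for one rack of the first server type (40 nodes, 4 x 1 TB SATA each). The replication factor of 3 and the function name `rack_capacity_tb` are assumptions made for illustration; they do not come from this slide.

```python
# Back-of-the-envelope capacity for one rack of the first server type.
# From the slide: 40 nodes per rack, 4 x 1 TB SATA disks per node.
# Assumption (not on this slide): every block is stored three times.

NODES_PER_RACK = 40
DISKS_PER_NODE = 4
TB_PER_DISK = 1
REPLICATION = 3  # assumed replication factor


def rack_capacity_tb(nodes=NODES_PER_RACK, disks=DISKS_PER_NODE,
                     tb_per_disk=TB_PER_DISK, replication=REPLICATION):
    """Return (raw_tb, usable_tb) for a single rack."""
    raw = nodes * disks * tb_per_disk
    usable = raw / replication
    return raw, usable


raw_tb, usable_tb = rack_capacity_tb()
print(f"raw: {raw_tb} TB, usable at {REPLICATION}x replication: {usable_tb:.1f} TB")
```

The estimate ignores space reserved for the operating system, temporary job output, and compression, so real usable capacity will differ.
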
HDFS
         ▪   Pool commodity servers into a single hierarchical namespace
         ▪   Break files into 128 MB blocks and replicate blocks
         ▪   Designed for large files written once but read many times
             ▪   Files are append-only via a single writer
         ▪   Two major daemons: NameNode and DataNode
             ▪   NameNode manages file system metadata
             ▪   DataNode manages data using local filesystem
         ▪   HDFS manages checksumming, replication, and compression
         ▪   Throughput scales nearly linearly with the number of nodes in the cluster
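
The block and replication bullets above can be sketched in a few lines. This is a toy model, not HDFS code: `split_into_blocks` and `place_replicas` are hypothetical helper names, the replication factor of 3 is an assumption, and the round-robin placement stands in for the NameNode's real policy, which also considers rack topology and free space.

```python
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as on the slide
REPLICATION = 3                 # assumed; not stated on this slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block of a file of file_size bytes."""
    if file_size <= 0:
        return []
    full, tail = divmod(file_size, block_size)
    return [block_size] * full + ([tail] if tail else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Toy policy for illustration; assumes `datanodes` has no duplicates.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    ring = cycle(datanodes)
    return {b: [next(ring) for _ in range(replication)]
            for b in range(num_blocks)}


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in placement.items():
    print(block_id, blocks[block_id], nodes)
```

The returned mapping from block to DataNodes is the kind of metadata the NameNode keeps, while the block bytes themselves would live on each DataNode's local filesystem.
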



HDFS
         HDFS distributes file blocks among servers

         HDFS manages storage on the cluster by breaking incoming files into
         pieces, called “blocks,” and storing each of the blocks redundantly
         across the pool of servers. In the common case, HDFS stores three
         complete copies of each file by copying each piece to three
         different servers.

         [Figure 2: HDFS distributes file blocks among servers]

Hadoop MapReduce
         ▪   Fault tolerant execution layer and API for parallel data processing
         ▪   Can target multiple storage systems
         ▪   Key/value data model
         ▪   Two major daemons: JobTracker and TaskTracker
         ▪   Many client interfaces
             ▪   Java
             ▪   C++
             ▪   Streaming
             ▪   Pig
             ▪   SQL (Hive)
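
The key/value data model above can be illustrated with a single-process word count. This is a sketch of the programming model only, not the Hadoop API: `map_words`, `reduce_counts`, and `run_job` are hypothetical names, and the in-memory grouping stands in for the distributed shuffle that the framework performs between map and reduce.

```python
from collections import defaultdict


def map_words(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: sum all counts emitted for one word."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for out_key in sorted(groups):
        for final_key, final_value in reducer(out_key, groups[out_key]):
            results[final_key] = final_value
    return results


lines = [(0, "to be or not to be"), (1, "to do is to be")]
print(run_job(lines, map_words, reduce_counts))
```

Because the mapper and reducer only see keys and values, the same two functions could in principle be run over many input splits in parallel, which is what the execution layer provides.
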


MapReduce
                      MapReduce pushes work out to the data
         Hadoop takes advantage of HDFS' data distribution strategy to push
         work out to many nodes in a cluster. This allows analyses to run in
         parallel and eliminates the bottlenecks imposed by monolithic
         storage systems.

         [Figure 3: Hadoop pushes work out to the data]

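
One way to picture "pushing work out to the data" is a scheduler that prefers nodes already holding a replica of the block a task will read. The sketch below is a greedy toy under that assumption. `assign_tasks` is a hypothetical function, and the actual JobTracker scheduling logic is considerably more involved.

```python
def assign_tasks(block_locations, free_slots):
    """Greedily assign one map task per block, preferring data-local nodes.

    block_locations: {block_id: [nodes holding a replica]}
    free_slots:      {node: number of free task slots}
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is untouched
    plan = {}
    for block, holders in block_locations.items():
        # Nodes that hold a replica and still have a free slot.
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = max(local, key=lambda n: slots[n]), True
        else:
            # Fall back to any node with capacity; data must cross the network.
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            node, is_local = max(candidates, key=lambda n: slots[n]), False
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan


locations = {"b0": ["n1", "n2"], "b1": ["n1"], "b2": ["n1"]}
print(assign_tasks(locations, {"n1": 2, "n2": 2, "n3": 2}))
```

In this run the first two tasks read their blocks locally on `n1`, and the third falls back to a remote node once `n1` has no free slots.
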
Hadoop Subprojects
         ▪   Avro
             ▪   Cross-language framework for RPC and serialization
         ▪   HBase
             ▪   Table storage on top of HDFS, modeled after Google’s Bigtable
         ▪   Hive
             ▪   SQL interface to structured data stored in HDFS
         ▪   Pig
             ▪   Language for data flow programming; also Owl, Zebra, SQL
         ▪   ZooKeeper
             ▪   Coordination service for distributed systems

Hadoop Community Support
         ▪   185+ contributors to the open source code base
             ▪   ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
         ▪   Over 500 (paid!) attendees at Hadoop World NYC
         ▪   Regular user group meetups in many cities
             ▪   Bay Area Meetup group has 534 members
         ▪   Three books (O’Reilly, Apress, Manning)
         ▪   Training videos free online
         ▪   University courses across the world
         ▪   Growing consultant and systems integrator expertise
         ▪   Training, certification, support, and services from Cloudera

Hadoop Project Mechanics
         ▪   Trademark owned by ASF; Apache 2.0 license for code
         ▪   Rigorous unit, smoke, performance, and system tests
         ▪   Release cycle of 9 months
             ▪   Last major release: 0.20.0 on April 22, 2009
             ▪   0.21.0 will be last release before 1.0; nearly complete
             ▪   Subprojects on different release cycles
         ▪   Releases put to a vote according to Apache guidelines
         ▪   Releases made available as tarballs on Apache and mirrors
         ▪   Cloudera packages a distribution for many platforms
             ▪   RPM and Debian packages; AMI for Amazon’s EC2


Hadoop at Facebook
         Early 2006: The First Research Scientist
         ▪   Source data living on horizontally partitioned MySQL tier
         ▪   Intensive historical analysis difficult
         ▪   No way to assess impact of changes to the site


         ▪   First try: Python scripts pull data into MySQL
         ▪   Second try: Python scripts pull data into Oracle


         ▪   ...and then we turned on impression logging



Facebook Data Infrastructure
                                                   2007
                                    Scribe Tier                     MySQL Tier




                                                  Data Collection
                                                      Server




                                                  Oracle Database
                                                       Server




Facebook Data Infrastructure
                                                          2008
                                          Scribe Tier            MySQL Tier




                                  Hadoop Tier




                                     Oracle RAC Servers




Major Data Team Workloads
         ▪   Data collection
             ▪   server logs
             ▪   application databases
             ▪   web crawls
         ▪   Thousands of multi-stage processing pipelines
             ▪   Summaries consumed by external users
             ▪   Summaries for internal reporting
             ▪   Ad optimization pipeline
             ▪   Experimentation platform pipeline
         ▪   Ad hoc analyses


Workload Statistics
         Facebook 2010
         ▪   Largest cluster running Hive: 8,400 cores, 12.5 PB of storage
         ▪   12 TB of compressed new data added per day
         ▪   135 TB of compressed data scanned per day
         ▪   7,500+ Hive jobs per day
         ▪   80K compute hours per day
         ▪   Around 200 people per month run Hive jobs
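
A couple of derived figures make the raw numbers above easier to interpret. The arithmetic below is a rough sketch with two loud assumptions that are not from the slide: decimal units (1 TB = 1,000 GB), and "compute hours" read as core-hours.

```python
# Figures taken from the slide above.
CORES = 8_400
SCANNED_TB_PER_DAY = 135
JOBS_PER_DAY = 7_500
COMPUTE_HOURS_PER_DAY = 80_000

# Assumption: decimal units, 1 TB = 1,000 GB.
avg_scan_gb_per_job = SCANNED_TB_PER_DAY * 1_000 / JOBS_PER_DAY

# Assumption: "compute hours" means core-hours, so the ceiling is cores * 24.
core_utilization = COMPUTE_HOURS_PER_DAY / (CORES * 24)

print(f"average data scanned per job: {avg_scan_gb_per_job:.0f} GB")
print(f"approximate core utilization: {core_utilization:.0%}")
```

Under these assumptions an average job scans on the order of 18 GB of compressed data, and the cluster's cores are busy roughly 40% of the time. Both figures are averages and say nothing about the distribution across jobs.
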



             (data from Ashish Thusoo’s Bay Area ACM DM SIG presentation)


Why Did Facebook Choose Hadoop?
         1. Demonstrated effectiveness for primary workload
         2. Proven ability to scale past any commercial vendor
         3. Easy provisioning and capacity planning with commodity nodes
         4. Data access for all: engineers, business analysts, sales managers
         5. Single system to manage XML/JSON, text, and relational data
         6. No schemas enabled data collection without involving Data team
         7. Simple, modular architecture
         8. Easy to build, deploy, and monitor
         9. Apache-licensed open source code granted to ASF



Why Did Facebook Choose Hadoop?
         ▪   Most importantly: the community
             ▪   Broad and deep commitment to future development from
                 multiple organizations
             ▪   Interaction with a community often useful for recruiting
             ▪   Growing body of users and operators with prior expertise
                 meant lower cost of training new users
             ▪   Learn about best practices from other organizations
             ▪   Widely available public materials for improving skills
             ▪   Not then, but now
                 ▪   Commercial training, certification, support, and services
                 ▪   Growing body of complementary software


Hadoop and Machine Learning/Modeling
         ▪   Data preparation using familiar programming tools
             ▪   Scalable historical storage of data for training and validation
             ▪   Field coding, aggregation, and data quality assertions
             ▪   Feature extraction over massive or complex data sets
             ▪   Efficient sampling and extraction to other tools
             ▪   Combination with other data sets
             ▪   Extensible metadata for organizing data sets
         ▪   Fundamental operations
             ▪   Matrix multiplication and other linear algebra
             ▪   Statistical tests of significance
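
Matrix multiplication, listed above as a fundamental operation, maps naturally onto two map/reduce passes over sparse (row, column, value) triples. The sketch below runs both passes in one process. `matmul_mapreduce` is a hypothetical name, and this join-then-sum formulation is one common approach, not necessarily the one the speaker had in mind.

```python
from collections import defaultdict


def matmul_mapreduce(a_entries, b_entries):
    """Multiply sparse matrices given as (row, col, value) triples.

    Pass 1 joins A and B on the shared index k; pass 2 sums the partial
    products for each output cell (i, j).
    """
    # Pass 1 map + shuffle: group entries of A and B by the join key k.
    by_k = defaultdict(lambda: ([], []))
    for i, k, a in a_entries:
        by_k[k][0].append((i, a))
    for k, j, b in b_entries:
        by_k[k][1].append((j, b))

    # Pass 1 reduce: emit ((i, j), a * b) for every pair sharing k.
    partials = []
    for a_side, b_side in by_k.values():
        for i, a in a_side:
            for j, b in b_side:
                partials.append(((i, j), a * b))

    # Pass 2 shuffle + reduce: sum partial products per output cell.
    product = defaultdict(int)
    for cell, value in partials:
        product[cell] += value
    return dict(product)


A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]  # [[1, 2], [3, 4]]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]  # [[5, 6], [7, 8]]
print(matmul_mapreduce(A, B))
```

Each pass touches only records that share a key, which is what lets the work be spread across many reducers.
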


Hadoop and Machine Learning/Modeling
         ▪   Scoring
             ▪   eHarmony matching users
             ▪   Fraud detection for billing platforms
         ▪   Genetic Algorithms
             ▪   Mailchimp’s Project Omnivore
             ▪   Xavier Llorà’s research
         ▪   Collaborative filtering
             ▪   Google News personalization
             ▪   Yahoo! front page personalization (Cokeheads)



Hadoop and Machine Learning/Modeling
         ▪   Model fitting
             ▪   EM algorithm and HMMs (Jimmy Lin)
         ▪   Graph analysis
             ▪   Finding largest connected component (Jeff Hodges)
             ▪   Social graph analysis (Jake Hofman)
         ▪   Document analysis
             ▪   Named entity extraction (Evri)
             ▪   Document similarity (Jimmy Lin)
         ▪   Image similarity: Google paper



Hadoop and Machine Learning/Modeling
         ▪   Classification
             ▪   Google’s PLANET for building decision trees
             ▪   eBay’s linear Poisson regression for behavioral targeting
             ▪   Sessionization of clickstream logs and path prediction
         ▪   Bioinformatics
             ▪   Cloudburst
             ▪   Crossbow
         ▪   Computer vision
             ▪   Face detection
             ▪   Face recognition
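
Sessionization of clickstream logs, mentioned in the list above, fits the map/reduce pattern well: key each click by user, then split each user's time-ordered clicks wherever the gap is too long. The sketch below assumes a 30-minute inactivity timeout and an event layout of (user_id, timestamp, url). Both are illustrative choices, and `sessionize` is a hypothetical name.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; not from the slides


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, url) events into per-user sessions.

    Map step: key each event by user_id. Reduce step: sort one user's
    events by time and start a new session when the gap is exceeded.
    """
    by_user = defaultdict(list)
    for user_id, ts, url in events:
        by_user[user_id].append((ts, url))

    sessions = {}
    for user_id, clicks in by_user.items():
        clicks.sort()
        user_sessions = [[clicks[0]]]
        for prev, cur in zip(clicks, clicks[1:]):
            if cur[0] - prev[0] > gap:
                user_sessions.append([])
            user_sessions[-1].append(cur)
        sessions[user_id] = user_sessions
    return sessions


events = [("u1", 0, "/a"), ("u1", 100, "/b"), ("u1", 5000, "/c"), ("u2", 50, "/a")]
print(sessionize(events))
```

The resulting per-session click sequences are the kind of input a path-prediction model could then be trained on.
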


Hadoop and Machine Learning/Modeling
         ▪   Simulation
             ▪   Protein folding
             ▪   Particle-swarm optimization
         ▪   Crazy stuff
             ▪   Factoring integers
             ▪   Solving Boggle
             ▪   Generating fractals
         ▪   Books and conferences
             ▪   MDAC 2010
             ▪   “Data Intensive Text Processing with MapReduce”


Hadoop at Cloudera
         Cloudera’s Distribution for Hadoop
         ▪   Open source distribution of Apache Hadoop for enterprise use
             ▪   Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper
             ▪   Ensures cross-subproject compatibility
             ▪   Adds backported patches and customer-specific patches
             ▪   Adds Cloudera utilities like MRUnit and Sqoop
             ▪   Better integration with daemon administration utilities
             ▪   Follows the Filesystem Hierarchy Standard (FHS) for file layout
             ▪   Tools for automatically generating a configuration
             ▪   Packaged as RPM, DEB, AMI, or tarball


Hadoop at Cloudera
         Training and Certification
         ▪   Free online training
             ▪   Basic, Intermediate (including Hive and Pig), and Advanced
             ▪   Includes a virtual machine with software and exercises
         ▪   Live training sessions
             ▪   One live session per month somewhere in the world
             ▪   If you have a large group, we may come to you
         ▪   Certification
             ▪   Exams for Developers, Administrators, and Managers
             ▪   Administered online or in person

Hadoop at Cloudera
         Services and Support
         ▪   Professional Services
             ▪   Get Hadoop up and running in your environment
             ▪   Optimize an existing Hadoop infrastructure
             ▪   Design new algorithms to make the most of your data
         ▪   Support
             ▪   Unlimited questions for Cloudera’s technical team
             ▪   Access to our Knowledge Base
             ▪   Help prioritize feature development for CDH
             ▪   Early access to upcoming Cloudera software products


Hadoop at Cloudera
         Commercial Software
         ▪   General thesis: build commercially-licensed software products
             which complement CDH for data management and analysis
         ▪   Current products
             ▪   Cloudera Desktop
                 ▪   Extensible interface for users of Cloudera software
         ▪   Upcoming products for data collection
             ▪   Talk to me offline




Cloudera Desktop
                             Big Data can be Beautiful




(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0




Thursday, January 28, 2010

Más contenido relacionado

La actualidad más candente

Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationCloudera, Inc.
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleData Con LA
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopJosh Devins
 
Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopKorea Sdec
 
Sdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopSdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopKorea Sdec
 
SDEC2011 Essentials of Pig
SDEC2011 Essentials of PigSDEC2011 Essentials of Pig
SDEC2011 Essentials of PigKorea Sdec
 
Real world Django deployment using Chef
Real world Django deployment using ChefReal world Django deployment using Chef
Real world Django deployment using Chefcoderanger
 
SELF 2011: Deploying Django Application Stacks with Chef
SELF 2011: Deploying Django Application Stacks with ChefSELF 2011: Deploying Django Application Stacks with Chef
SELF 2011: Deploying Django Application Stacks with ChefChef Software, Inc.
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hortonworks
 
Big Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - ZettasetBig Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - ZettasetBigDataCloud
 
Hadoop online training
Hadoop online training Hadoop online training
Hadoop online training Keylabs
 

La actualidad más candente (17)

Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote Presentation
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Hadoop - Introduction to Hadoop
Hadoop - Introduction to HadoopHadoop - Introduction to Hadoop
Hadoop - Introduction to Hadoop
 
Hadoop at Nokia
Hadoop at NokiaHadoop at Nokia
Hadoop at Nokia
 
Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing Hadoop
 
Sdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoopSdec2011 shashank-introducing hadoop
Sdec2011 shashank-introducing hadoop
 
SDEC2011 Essentials of Pig
SDEC2011 Essentials of PigSDEC2011 Essentials of Pig
SDEC2011 Essentials of Pig
 
Picconf12
Picconf12Picconf12
Picconf12
 
Jan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalogJan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalog
 
Real world Django deployment using Chef
Real world Django deployment using ChefReal world Django deployment using Chef
Real world Django deployment using Chef
 
SELF 2011: Deploying Django Application Stacks with Chef
SELF 2011: Deploying Django Application Stacks with ChefSELF 2011: Deploying Django Application Stacks with Chef
SELF 2011: Deploying Django Application Stacks with Chef
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)
 
Big Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - ZettasetBig Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - Zettaset
 
Hadoop online training
Hadoop online training Hadoop online training
Hadoop online training
 

Destacado

MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...asimkadav
 
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera, Inc.
 
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop ClustersCloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop ClustersCloudera, Inc.
 
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Jeff Hammerbacher
 

Destacado (20)

MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
 
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
20100714accel
20100714accel20100714accel
20100714accel
 
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop ClustersCloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
 
20080529dublinpt1
20080529dublinpt120080529dublinpt1
20080529dublinpt1
 
Mapreduce Pact06 Keynote
Mapreduce Pact06 KeynoteMapreduce Pact06 Keynote
Mapreduce Pact06 Keynote
 
20100201hplabs
20100201hplabs20100201hplabs
20100201hplabs
 
20090422 Www
20090422 Www20090422 Www
20090422 Www
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
20080528dublinpt1
20080528dublinpt120080528dublinpt1
20080528dublinpt1
 
20100423sage
20100423sage20100423sage
20100423sage
 
20081022cca
20081022cca20081022cca
20081022cca
 
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
Mårten Mickos's presentation "Open Source: Why Freedom Makes a Better Busines...
 
20090309berkeley
20090309berkeley20090309berkeley
20090309berkeley
 
20120223keystone
20120223keystone20120223keystone
20120223keystone
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
20100301icde
20100301icde20100301icde
20100301icde
 
Partitioning 20061205
Partitioning 20061205Partitioning 20061205
Partitioning 20061205
 
20080115yahoobrickhouse
20080115yahoobrickhouse20080115yahoobrickhouse
20080115yahoobrickhouse
 

Similar a 20100128ebay

Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystemAndrew Brust
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerIke Ellis
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...rhatr
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Bar Camp Auckland - Mongo DB Presentation BCA4
Bar Camp Auckland - Mongo DB Presentation BCA4Bar Camp Auckland - Mongo DB Presentation BCA4
Bar Camp Auckland - Mongo DB Presentation BCA4John Ballinger
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopHortonworks
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
DevOps Columbus Meetup Kickoff - Infrastructure as Code
DevOps Columbus Meetup Kickoff - Infrastructure as CodeDevOps Columbus Meetup Kickoff - Infrastructure as Code
DevOps Columbus Meetup Kickoff - Infrastructure as CodeMichael Ducy
 
Chef for OpenStack: Grizzly Roadmap
Chef for OpenStack: Grizzly RoadmapChef for OpenStack: Grizzly Roadmap
Chef for OpenStack: Grizzly RoadmapMatt Ray
 

Similar a 20100128ebay (20)

Brust hadoopecosystem
Brust hadoopecosystemBrust hadoopecosystem
Brust hadoopecosystem
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Bar Camp Auckland - Mongo DB Presentation BCA4
Bar Camp Auckland - Mongo DB Presentation BCA4Bar Camp Auckland - Mongo DB Presentation BCA4
Bar Camp Auckland - Mongo DB Presentation BCA4
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
DevOps Columbus Meetup Kickoff - Infrastructure as Code
DevOps Columbus Meetup Kickoff - Infrastructure as CodeDevOps Columbus Meetup Kickoff - Infrastructure as Code
DevOps Columbus Meetup Kickoff - Infrastructure as Code
 
Chef for OpenStack: Grizzly Roadmap
Chef for OpenStack: Grizzly RoadmapChef for OpenStack: Grizzly Roadmap
Chef for OpenStack: Grizzly Roadmap
 

Más de Jeff Hammerbacher (11)

20100608sigmod
20100608sigmod20100608sigmod
20100608sigmod
 
20100513brown
20100513brown20100513brown
20100513brown
 
20100418sos
20100418sos20100418sos
20100418sos
 
20081009nychive
20081009nychive20081009nychive
20081009nychive
 
2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao
 
Data Presentations Cassandra Sigmod
Data  Presentations  Cassandra SigmodData  Presentations  Cassandra Sigmod
Data Presentations Cassandra Sigmod
 
20080611accel
20080611accel20080611accel
20080611accel
 
Hdfs Dhruba
Hdfs DhrubaHdfs Dhruba
Hdfs Dhruba
 
20080528dublinpt3
20080528dublinpt320080528dublinpt3
20080528dublinpt3
 
20080529dublinpt3
20080529dublinpt320080529dublinpt3
20080529dublinpt3
 
20080529dublinpt2
20080529dublinpt220080529dublinpt2
20080529dublinpt2
 

20100128ebay

  • 2. Hadoop, Cloudera, and eBay Managing Petabytes with Open Source Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera January 28, 2010
  • 3. My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Conceived, built, and led Data team at Facebook ▪ Nearly 30 amazing engineers and data scientists ▪ Several open source projects and research papers ▪ Founder of Cloudera ▪ Vice President of Products and Chief Scientist ▪ Also, check out the book “Beautiful Data”
  • 4. Presentation Outline ▪ What is Hadoop? ▪ HDFS and MapReduce ▪ Hive, Pig, Avro, Zookeeper, HBase ▪ From Steve ▪ Why Hadoop? ▪ Hadoop for machine learning and modeling ▪ Other things I find interesting ▪ What we’re building at Cloudera ▪ Questions and Discussion
  • 5. What is Hadoop? ▪ Apache Software Foundation project, mostly written in Java ▪ Inspired by Google infrastructure ▪ Software for programming warehouse-scale computers (WSCs) ▪ Hundreds of production deployments ▪ Project structure ▪ Hadoop Distributed File System (HDFS) ▪ Hadoop MapReduce ▪ Hadoop Common ▪ Other subprojects ▪ Avro, HBase, Hive, Pig, Zookeeper
  • 6. Anatomy of a Hadoop Cluster ▪ Commodity servers ▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC ▪ Or: 2 RU, 2 x 8 core CPU, 32 GB RAM, 12 x 1 TB SATA ▪ Typically arranged in 2 level architecture ▪ Inexpensive to acquire and maintain [Figure: Commodity Hardware Cluster. Typically in 2 level architecture; nodes are commodity Linux PCs; 40 nodes/rack]
  • 7. HDFS ▪ Pool commodity servers into a single hierarchical namespace ▪ Break files into 128 MB blocks and replicate blocks ▪ Designed for large files written once but read many times ▪ Files are append-only via a single writer ▪ Two major daemons: NameNode and DataNode ▪ NameNode manages file system metadata ▪ DataNode manages data using local filesystem ▪ HDFS manages checksumming, replication, and compression ▪ Throughput scales nearly linearly with node cluster size
  • 8. HDFS HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers. [Figure: HDFS distributes file blocks among servers]
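
The block-and-replica idea on the two HDFS slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HDFS code: the function names `split_into_blocks` and `place_replicas` are hypothetical, and the round-robin placement is a stand-in for whatever policy the NameNode actually applies. Only the 128 MB block size and the three copies come from the slides.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted on the HDFS slide
REPLICATION = 3                 # "three complete copies" in the common case


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, servers, replication=REPLICATION):
    """Map each block index to `replication` distinct servers.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    layout = place_replicas(len(blocks), ["s1", "s2", "s3", "s4", "s5"])
    for block, holders in layout.items():
        print(block, blocks[block], holders)
```

Because every block lives on three different servers, losing any single server leaves at least two copies of each block available.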
  • 9. Hadoop MapReduce ▪ Fault tolerant execution layer and API for parallel data processing ▪ Can target multiple storage systems ▪ Key/value data model ▪ Two major daemons: JobTracker and TaskTracker ▪ Many client interfaces ▪ Java ▪ C++ ▪ Streaming ▪ Pig ▪ SQL (Hive)
  • 10. MapReduce MapReduce pushes work out to the data Hadoop takes advantage of HDFS’ data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems. [Figure: Hadoop pushes work out to the data]
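
The key/value data model named on the MapReduce slides can be illustrated with a small in-process word count. This is a hedged sketch in plain Python and does not use the Hadoop API: `map_words`, `reduce_counts`, and `run_job` are hypothetical names, and the loop over `splits` merely stands in for map tasks that a real cluster would schedule on the nodes holding each block.

```python
from collections import defaultdict


def map_words(_offset, line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(word, counts):
    """Reduce step: sum all the counts collected for one word."""
    yield word, sum(counts)


def run_job(splits, mapper, reducer):
    """Run mapper over every split, group values by key, then run reducer.

    `splits` is a list of input splits; each split is a list of
    (key, value) records. Everything runs in one process here.
    """
    groups = defaultdict(list)
    for split in splits:
        for key, value in split:
            for out_key, out_value in mapper(key, value):
                groups[out_key].append(out_value)  # shuffle: group by key
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result


if __name__ == "__main__":
    splits = [
        [(0, "Hadoop pushes work"), (19, "work out")],
        [(0, "to the data hadoop")],
    ]
    print(run_job(splits, map_words, reduce_counts))
```

The mapper only ever sees one record at a time, which is what lets the work be split across many nodes without coordination until the grouping step.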
  • 11. Hadoop Subprojects ▪ Avro ▪ Cross-language framework for RPC and serialization ▪ HBase ▪ Table storage on top of HDFS, modeled after Google’s BigTable ▪ Hive ▪ SQL interface to structured data stored in HDFS ▪ Pig ▪ Language for data flow programming; also Owl, Zebra, SQL ▪ Zookeeper ▪ Coordination service for distributed systems
  • 12. Hadoop Community Support ▪ 185+ contributors to the open source code base ▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera ▪ Over 500 (paid!) attendees at Hadoop World NYC ▪ Regular user group meetups in many cities ▪ Bay Area Meetup group has 534 members ▪ Three books (O’Reilly, Apress, Manning) ▪ Training videos free online ▪ University courses across the world ▪ Growing consultant and systems integrator expertise ▪ Training, certification, support, and services from Cloudera
  • 13. Hadoop Project Mechanics ▪ Trademark owned by ASF; Apache 2.0 license for code ▪ Rigorous unit, smoke, performance, and system tests ▪ Release cycle of 9 months ▪ Last major release: 0.20.0 on April 22, 2009 ▪ 0.21.0 will be last release before 1.0; nearly complete ▪ Subprojects on different release cycles ▪ Releases put to a vote according to Apache guidelines ▪ Releases made available as tarballs on Apache and mirrors ▪ Cloudera packages a distribution for many platforms ▪ RPM and Debian packages; AMI for Amazon’s EC2
  • 14. Hadoop at Facebook Early 2006: The First Research Scientist ▪ Source data living on horizontally partitioned MySQL tier ▪ Intensive historical analysis difficult ▪ No way to assess impact of changes to the site ▪ First try: Python scripts pull data into MySQL ▪ Second try: Python scripts pull data into Oracle ▪ ...and then we turned on impression logging
  • 15. Facebook Data Infrastructure 2007 Scribe Tier MySQL Tier Data Collection Server Oracle Database Server
  • 16. Facebook Data Infrastructure 2008 Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers
  • 17. Major Data Team Workloads ▪ Data collection ▪ server logs ▪ application databases ▪ web crawls ▪ Thousands of multi-stage processing pipelines ▪ Summaries consumed by external users ▪ Summaries for internal reporting ▪ Ad optimization pipeline ▪ Experimentation platform pipeline ▪ Ad hoc analyses
  • 18. Workload Statistics Facebook 2010 ▪ Largest cluster running Hive: 8,400 cores, 12.5 PB of storage ▪ 12 TB of compressed new data added per day ▪ 135 TB of compressed data scanned per day ▪ 7,500+ Hive jobs per day ▪ 80K compute hours per day ▪ Around 200 people per month run Hive jobs (data from Ashish Thusoo’s Bay Area ACM DM SIG presentation)
  • 19. Why Did Facebook Choose Hadoop? 1. Demonstrated effectiveness for primary workload 2. Proven ability to scale past any commercial vendor 3. Easy provisioning and capacity planning with commodity nodes 4. Data access for all: engineers, business analysts, sales managers 5. Single system to manage XML/JSON, text, and relational data 6. No schemas enabled data collection without involving Data team 7. Simple, modular architecture 8. Easy to build, deploy, and monitor 9. Apache-licensed open source code granted to ASF
  • 20. Why Did Facebook Choose Hadoop? ▪ Most importantly: the community ▪ Broad and deep commitment to future development from multiple organizations ▪ Interaction with a community often useful for recruiting ▪ Growing body of users and operators with prior expertise meant lower cost of training new users ▪ Learn about best practices from other organizations ▪ Widely available public materials for improving skills ▪ Not then, but now ▪ Commercial training, certification, support, and services ▪ Growing body of complementary software
  • 21. Hadoop and Machine Learning/Modeling ▪ Data preparation using familiar programming tools ▪ Scalable historical storage of data for training and validation ▪ Field coding, aggregation, and data quality assertions ▪ Feature extraction over massive or complex data sets ▪ Efficient sampling and extraction to other tools ▪ Combination with other data sets ▪ Extensible metadata for organizing data sets ▪ Fundamental operations ▪ Matrix multiplication and other linear algebra ▪ Statistical tests of significance
  • 22. Hadoop and Machine Learning/Modeling ▪ Scoring ▪ eHarmony matching users ▪ Fraud detection for billing platforms ▪ Genetic Algorithms ▪ Mailchimp’s Project Omnivore ▪ Xavier Llorà’s research ▪ Collaborative filtering ▪ Google News personalization ▪ Yahoo! front page personalization (Cokeheads)
  • 23. Hadoop and Machine Learning/Modeling ▪ Model fitting ▪ EM algorithm and HMMs (Jimmy Lin) ▪ Graph analysis ▪ Finding largest connected component (Jeff Hodges) ▪ Social graph analysis (Jake Hofman) ▪ Document analysis ▪ Named entity extraction (Evri) ▪ Document similarity (Jimmy Lin) ▪ Image similarity: Google paper
  • 24. Hadoop and Machine Learning/Modeling ▪ Classification ▪ Google’s PLANET for building decision trees ▪ eBay’s linear Poisson regression for behavioral targeting ▪ Sessionization of clickstream logs and path prediction ▪ Bioinformatics ▪ Cloudburst ▪ Crossbow ▪ Computer vision ▪ Face detection ▪ Face recognition
  • 25. Hadoop and Machine Learning/Modeling ▪ Simulation ▪ Protein folding ▪ Particle-swarm optimization ▪ Crazy stuff ▪ Factoring integers ▪ Solving Boggle ▪ Generating fractals ▪ Books and conferences ▪ MDAC 2010 ▪ “Data Intensive Text Processing with MapReduce”
  • 26. Hadoop at Cloudera Cloudera’s Distribution for Hadoop ▪ Open source distribution of Apache Hadoop for enterprise use ▪ Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper ▪ Ensures cross-subproject compatibility ▪ Adds backported patches and customer-specific patches ▪ Adds Cloudera utilities like MRUnit and Sqoop ▪ Better integration with daemon administration utilities ▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout ▪ Tools for automatically generating a configuration ▪ Packaged as RPM, DEB, AMI, or tarball
  • 27. Hadoop at Cloudera Training and Certification ▪ Free online training ▪ Basic, Intermediate (including Hive and Pig), and Advanced ▪ Includes a virtual machine with software and exercises ▪ Live training sessions ▪ One live session per month somewhere in the world ▪ If you have a large group, we may come to you ▪ Certification ▪ Exams for Developers, Administrators, and Managers ▪ Administered online or in person
  • 28. Hadoop at Cloudera Services and Support ▪ Professional Services ▪ Get Hadoop up and running in your environment ▪ Optimize an existing Hadoop infrastructure ▪ Design new algorithms to make the most of your data ▪ Support ▪ Unlimited questions for Cloudera’s technical team ▪ Access to our Knowledge Base ▪ Help prioritize feature development for CDH ▪ Early access to upcoming Cloudera software products
  • 29. Hadoop at Cloudera Commercial Software ▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis ▪ Current products ▪ Cloudera Desktop ▪ Extensible interface for users of Cloudera software ▪ Upcoming products for data collection ▪ Talk to me offline
  • 30. Cloudera Desktop Big Data can be Beautiful
  • 31. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0