Data Processing with Apache Hadoop: Scalable and Cost Effective

Doug Cutting
the opportunity
●   data accumulating faster than ever
●   storage, CPU & network cheaper than ever
●   but conventional enterprise tech
    ●   isn't priced like commodity hardware
    ●   doesn't scale well to thousands of CPUs & drives
problem: scaling reliably is hard
●   need to store petabytes of data
    ●   on 1000s of nodes, MTBF < 1 day
    ●   something is always broken
●   need fault-tolerant store
    ●   handle hardware faults transparently and efficiently
    ●   provide availability
●   need fault-tolerant computing framework
    ●   even on a big cluster, some things take days
problem: bandwidth to data
●   need to process 100TB dataset
●   on 1000 node cluster reading from LAN
    ●   100 Mb/s all-to-all bandwidth
    ●   scanning @ 10MB/s = 165 min
●   on 1000 node cluster reading from local drives
    ●   scanning @ 200MB/s = 8 min
●   moving computation beats moving data
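As a rough sanity check on the figures above (a hedged sketch only; the class name is made up and the dataset size, node count, and scan rates are the ones from this slide), the per-node share of a 100TB dataset and the two scan rates give roughly the same times:

```java
// Back-of-the-envelope arithmetic for the scan-time comparison above (illustrative only).
public class ScanTime {
  public static void main(String[] args) {
    double datasetMB = 100_000_000;        // 100 TB expressed in MB
    int nodes = 1_000;
    double perNodeMB = datasetMB / nodes;  // ~100 GB of data per node

    double lanMBps = 10;                   // effective scan rate when pulling data over the LAN
    double localMBps = 200;                // scan rate when reading local drives

    System.out.printf("LAN scan:   ~%.0f min%n", perNodeMB / lanMBps / 60);   // ~167 min
    System.out.printf("local scan: ~%.0f min%n", perNodeMB / localMBps / 60); // ~8 min
  }
}
```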
Apache Hadoop: a new paradigm
●   scales to thousands of commodity computers
●   can effectively use all cores & spindles
    ●   simultaneously to process your data
●   new software stack
    ●   built on a different foundation
●   in use already by many
    ●   most big web 2.0 companies
    ●   many Fortune 500 companies now too
new foundations
●   commodity hardware
●   sequential file access
●   sharding of data & computation
●   automated, high-level reliability
●   open source
commodity hardware




●   typically in a 2-level architecture
    ●   nodes are commodity PCs
    ●   30-40 nodes/rack
●   offers linear scalability
    ●   at commodity prices
how I got here
●   started in the '80s, building full-text indexes
●   first implemented with a B-Tree
    ●   foundation of relational DBs
    ●   log(n) random accesses per update
    ●   seek time is wasted time
●   too slow when updates are frequent
    ●   instead use batched sort/merge
        –   to index the web at Excite in the '90s
Lucene (2000)
●   open-source full-text search library
●   sorts batches of updates
●   then merges with previously sorted data
●   only n/k seeks to build an index
●   compared to B-Tree
    ●   much less wasted time
    ●   vastly faster
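A minimal sketch of the batch-then-merge idea, assuming the index is kept as a sorted segment on disk (this is not Lucene's actual code; class and method names are made up): each batch of updates is sorted in memory and then merged with the previously sorted data in one sequential pass, so an index built from n entries in batches of k needs on the order of n/k sequential merges rather than n random B-Tree seeks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BatchedMerge {
  // Merge a freshly sorted in-memory batch with the previously sorted segment
  // in one sequential pass, producing the new sorted segment.
  static List<String> mergeBatch(List<String> segment, List<String> batch) {
    Collections.sort(batch);                 // sort the batch of k updates in memory
    List<String> merged = new ArrayList<>(segment.size() + batch.size());
    int i = 0, j = 0;
    while (i < segment.size() && j < batch.size()) {
      if (segment.get(i).compareTo(batch.get(j)) <= 0) merged.add(segment.get(i++));
      else merged.add(batch.get(j++));
    }
    while (i < segment.size()) merged.add(segment.get(i++));   // drain the remainder sequentially
    while (j < batch.size()) merged.add(batch.get(j++));
    return merged;
  }
}
```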
open source
●   Apache
    ●   supports diverse, collaborative communities
●   advantages
    ●   the price is right
         –   costs shared with collaborators
              ●   QA, documentation, features, support, training, etc.
    ●   transparency of product and process
    ●   better code
         –   publication encourages quality
         –   sharing encourages generality
    ●   happier engineers
         –   respect from wider peer pool
Nutch (2002)
●   open-source web search engine
●   DB access per link in each crawled page
    ●   monthly crawl requires >10k accesses/second
●   sort/merge optimization applicable
    ●   but distributed solution required
Nutch (2004)
●   Google publishes GFS & MapReduce papers
●   together, provide automation of
    ●   sort/merge+sharding
    ●   reliability
●   we then implemented these in Nutch
Hadoop (2006)
●   Yahoo! joins the effort
●   split HDFS and MapReduce from Nutch
HDFS
●   scales well
    ●   files sharded across commodity hardware
    ●   efficient and inexpensive
●   automated reliability
    ●   each block replicated on 3 datanodes
    ●   automatically rebalances, replicates, etc.
    ●   namenode has hot spare
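To illustrate what this looks like to a client (a hedged sketch using the standard Hadoop FileSystem API; the namenode address and file path are placeholders): the application just writes a file, and HDFS shards it into blocks and replicates them across datanodes behind the scenes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder namenode address
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/events.log");            // illustrative path
    try (FSDataOutputStream out = fs.create(file)) {     // blocks are sharded across datanodes
      out.writeBytes("one record of raw data\n");
    }
    fs.setReplication(file, (short) 3);                  // request 3 replicas per block
  }
}
```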
MapReduce
●   simple programming model
    ●   generalizes common pattern
        [diagram: input data → Map tasks → shuffle & sort → Reduce tasks → output data]
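The pattern is usually illustrated with word count, along the lines of the standard Hadoop example: map emits a (word, 1) pair per token, the framework shuffles and sorts by key, and reduce sums the counts. Input and output paths come from the command line; this is a sketch, not production code.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);        // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();   // sum counts for each word
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this would typically be packaged as a jar and launched with `hadoop jar wordcount.jar WordCount <input> <output>` (paths illustrative).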
MapReduce
●   compute on same nodes as HDFS storage
    ●   I/O on every drive
    ●   compute on every core
    ●   massive throughput
●   sequential access
    ●   directly supports sort/merge
●   automated reliability & scaling
    ●   datasets are sharded to tasks
    ●   failed tasks retried
pattern of adoption
●   initially adopt Hadoop for particular application
    ●   for cost-effective scaling
●   then load more datasets & add more users
    ●   & find it more valuable for new, unanticipated apps
●   having all data in one place and usable is empowering
●   “We don't use Hadoop because we have a lot
    of data, we have a lot of data because we use
    Hadoop.”
Apache Hadoop: the ecosystem
●   active, growing community
    ●   multiple books in print
    ●   commercial support available
    ●   expanding network of complementary tools
Cloudera's Distribution including Apache Hadoop
●   packaging of ecosystem
●   100% Apache-licensed components
●   simplified installation, update, etc.
●   tested, compatible versions
key advantages of Apache Hadoop
●   cost-effective
    ●   scales linearly on commodity hardware
●   general purpose
    ●   powerful, easy-to-program
●   low barrier to entry
    ●   no schema or DB design required up-front
    ●   just load raw data & use it
Thank you!

Questions?