Saturday, June 12, 2010
Evolving a New Analytical Platform
          What Works and What’s Missing


          Jeff Hammerbacher
          Chief Scientist, Cloudera
          June 8, 2010



My Background
         Thanks for Asking
         ▪   hammer@cloudera.com
         ▪   Studied Mathematics at Harvard
         ▪   Worked as a Quant on Wall Street
         ▪   Conceived, built, and led Data team at Facebook
             ▪   Nearly 30 amazing engineers and data scientists
             ▪   Several open source projects and research papers
         ▪   Founder of Cloudera
             ▪   Chief Scientist
             ▪   Also, check out the book “Beautiful Data”

Presentation Outline
         ▪   BI: Science for Profit
             ▪   Need tools for whole research cycle
             ▪   SQL Server 2008 R2: defining the platform
         ▪   State of the Platform Ecosystem
         ▪   New Foundations: Hadoop
             ▪   Boiling the Frog
             ▪   Future developments
         ▪   Questions and Discussion




BI is looking more like science (for profit)




Jim Gray: Science entering Fourth Paradigm
            “We have to do better at producing tools to
                 support the whole research cycle”




RDBMS only a small part of this tool set




Example: SQL Server 2008 R2




Collaboration: SharePoint
MDM: Master Data Services
CEP: StreamInsight
ETL: SQL Server Integration Services
RDBMS: SQL Server
Reporting: SQL Server Reporting Services
Analysis: SQL Server Analysis Services
Search: Full-Text Search
OLAP: PowerPivot


What do we call this unified suite?




For today: Analytical Data Platform




Who makes up the platform ecosystem?




Content Providers
Infrastructure Providers
Platform Providers
Application Developers
End Users




What is new about the ecosystem today?




Content Providers
            1. > 95% of enterprise data is unstructured
                  2. Data volumes growing rapidly




Infrastructure Providers
                                    1. Cloud
                          2. Warehouse-Scale Computers




Platform Providers
                                     1. Open source
                          2. Driven by consumer web properties




Application Developers
                             1. Data Scientists
                          2. Diversity of languages




End Users
                          1. Move beyond reporting to analytics
                           2. Make use of all enterprise data




New foundations: HDFS and MapReduce




(This is what boiling a frog feels like)




2005: Doug/Mike start project inside Nutch




2006: Doug joins Yahoo!




2007: Make Hadoop scale
         ▪   Yahoo! makes Pig open source
         ▪   Jim Gray’s “Fourth Paradigm” lecture
         ▪   Randy Bryant’s “DISC” lecture
         ▪   Powerset makes HBase open source




2008: Make Hadoop fast
         ▪   Yahoo! wins Daytona terabyte sort benchmark
         ▪   First Hadoop Summit
         ▪   Yahoo! builds production webmap with Hadoop
         ▪   Facebook makes Hive open source
         ▪   “MapReduce: A Major Step Backwards”




2009: Insert Hadoop into the enterprise
         ▪   Cloudera releases CDH
         ▪   First Hadoop World NYC
         ▪   Yahoo! sorts a petabyte with Hadoop
         ▪   Cloudera adds training, support, services
         ▪   “The Unreasonable Effectiveness of Data”




2010: Integrate Hadoop into the enterprise
         ▪   IBM announces InfoSphere BigInsights
         ▪   Yahoo! completes enterprise-class security
         ▪   Datameer and Karmasphere funded
         ▪   Teradata, Pentaho, and others integrate
         ▪   Hive adds JDBC and ODBC




Hadoop will be an Analytical Data Platform




What’s Next?




Capture: Log collection and CEP




Curate: Workflow and Scheduling




Curate: Secondary and Full-Text Indexing




Curate: Learn Structure from Data




Analyze: Mesos-enabled frameworks




Analyze: Link local and global data




All behind a single pane of glass




Cloudera Desktop
                          Making Many Computers Feel Like One




(c) 2010 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0



