Friday, April 23, 2010
Moving from Narrative to Design
           with Open Source and Open Data

           Jeff Hammerbacher
           Chief Scientist and Vice President of Products, Cloudera
           April 23, 2010



Presentation Outline
          ▪   Who am I and what am I talking about?
              ▪   My Background
              ▪   Defining “Narrative” and “Design”
          ▪   Two stories
              ▪   Finance
              ▪   Web
          ▪   Making the Sage story a success
              ▪   Open Source
              ▪   Open Data
              ▪   The Unreasonable Effectiveness of Data


My Background
          Thanks for Asking
          ▪   hammer@cloudera.com
          ▪   Studied Mathematics at Harvard
          ▪   Worked as a Quant at Bear Stearns
          ▪   Conceived, built, and led Data team at Facebook
              ▪   Nearly 30 amazing engineers and data scientists
              ▪   Several open source projects and research papers
          ▪   Founder of Cloudera
              ▪   Vice President of Products and Chief Scientist
              ▪   Also, check out the book “Beautiful Data”

“Genomic Advances of the 2000s Will
          Demand an Informatics Revolution in the
          2010s”
                                       -- Eric Schadt




From Narrative to Design
          Learning from past revolutions
          ▪   Leaving technology aside, once you have a narrative:
              ▪   collect and structure data
              ▪   build tools, models, and a domain vocabulary (“understand”)
              ▪   use effective models to achieve your goal (“automate”)


          ▪   Pinnacle: Boeing 777, designed entirely on a computer


          ▪   Thanks to “Biology is Technology” by Robert Carlson for the
              “Narrative” and “Design” terminology


Two Stories
Finance
          A Cautionary Tale
          ▪   Strengths
              ▪   Complex models
              ▪   Low latency decisions
              ▪   Compute grids
          ▪   Weaknesses
              ▪   Data needed for decisions is expensive
              ▪   Code unrelated to automating decisions is highly guarded
              ▪   Storing and manipulating large data sets
          ▪   Narrative to Design: trading strategies, financial products

The Web
          A Success Story
          ▪   Strengths
              ▪   Open data
              ▪   Open source
              ▪   Scalable data management and analysis
          ▪   Weaknesses
              ▪   Low latency
              ▪   Complex models
              ▪   Style of dress
          ▪   Narrative to Design: search quality, advertising targeting

Another Story
Extreme Neuroanatomy
          DIY Brain Imaging
          ▪   Built by Ken Hayworth, ex-JPL engineer
          ▪   Now a critical contributor to “Connectome” project
              ▪   Synapse-resolution connectivity diagram of the brain
          ▪   Image data for a single rat brain: 900 PB
          ▪   This looks familiar!
              ▪   hackers
              ▪   large data sets
              ▪   machine learning



A New Story
Biology
          Finance or the Web?
          ▪   Large drug companies not that different from banks
          ▪   Recent infusion of open source and open data initiatives!
          ▪   Learn from the web
              ▪   Open source
              ▪   Open data
              ▪   “The Unreasonable Effectiveness of Data”
          ▪   Use existing open source software
              ▪   Cambrian explosion in open source data management
              ▪   Data storage and processing at petabyte scale is solved

Last Story
High-Energy Physics
          Science at the petabyte scale
          ▪   Tremendous investment in measurement
              ▪   Large Hadron Collider cost estimated at $4.4B
          ▪   Teams spend years performing analysis
              ▪   Shared toolset for analysis and visualization has emerged


          ▪   Lessons for Sage
              ▪   Invest in new measurement technologies and store everything
              ▪   Use existing open source tools and share those you build



The Promise of Sage
          Delivering on the Excitement
          ▪   An information platform for disease modeling and drug discovery
          ▪   More importantly: a community of developers and researchers
              ▪   Developers, developers, developers!
              ▪   Wait, that’s you guys -- be hackers


          ▪   The most important person in an open source project
              ▪   Not the creator
              ▪   The first power user with no ties to the creator



Action Items
1. Share Data
2. Share Tools
3. Share Results

(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0



