SlideShare una empresa de Scribd logo
1 de 106
Descargar para leer sin conexión
eScience as a Lens on the World


     Ed Lazowska
     Bill & Melinda Gates Chair in
         Computer Science & Engineering
     University of Washington

     WCET 21st Annual Conference

     October 2009

     http://lazowska.cs.washington.edu/wcet.pdf
This morning

 The nature of eScience
 The advances that enable it
 Scalable computing for everyone
 Broadband, and the role of higher education
 Computer science & engineering: Changing the world
 The changing nature of our economy, and of
 educational requirements
eScience: Sensor-driven (data-driven)
science and engineering




             Jim Gray


Transforming science (again!)
Theory
Experiment
Observation
Theory
Experiment
Observation
Theory
             Experiment
             Observation




[John Delaney, University of Washington]
Theory
  Experiment
 Observation
Computational
     Science
Protein interactions
 in striated muscles
    Tom Daniel lab
QCD to study
  interactions of
           nuclei
David Kaplan lab
Stars
Gas




      Study of dark matter
               Dark Matter
              Tom Quinn lab
Protein structure
       prediction
 David Baker lab
Theory
  Experiment
 Observation
Computational
     Science
    eScience
eScience is driven by data

  Massive volumes of data from sensors and networks
  of sensors




 Apache Point telescope,
                   SDSS
              15TB of data
(15,000,000,000,000 bytes)
Large Synoptic Survey
    Telescope (LSST)
            30TB/day,
   60PB in its 10-year
               lifetime
Large Hadron Collider
       700MB of data
         per second,
 60TB/day, 20PB/year
Illumina Genome
        Analyzer
       ~1TB/day
Regional Scale Nodes of the
  NSF Ocean Observatories
                  Initiative
1000 km of fiber optic cable
on the seafloor, connecting
    thousands of chemical,
    physical, and biological
                    sensors
The Web
20+ billion web pages
    x 20KB = 400+TB
     One computer can
     read 30-35 MB/sec
from disk => 4 months
   just to read the web
Point-of-sale terminals
eScience is about the analysis of data

  The automated or semi-automated extraction of
  knowledge from massive volumes of data
    There’s simply too much of it to look at
The technologies of eScience

 Sensors and sensor
 networks
 Databases
 Data mining
 Machine learning
 Data visualization
 Cluster computing
 at enormous scale
eScience will be pervasive

  Computational science has been transformational, but
  it was a niche
    As an institution (e.g., a university), you didn’t need to excel
    in order to be competitive
  eScience capabilities must be broadly available in any
  organization
    If not, the organization will simply cease to be competitive
Life on Planet
     Earth




        [John Delaney, University of Washington]
[John Delaney, University of Washington]
[John Delaney, University of Washington]
Top faculty across all disciplines
understand the coming data tsunami
Questions for you …

 How does your institution track the IT needs –
 present and future – of its leading researchers?
 To what extent are you meeting these needs, and in
 what critical areas are you falling short?
 How well does your technology staff understand the
 institution’s research and disciplinary directions and
 their IT implications?
 What potential resources, other than those currently
 in place, can be used to provide broad-based IT
 support for eScience and eScholarship?
More on the enablement of eScience

 Ten quintillion: 10*1018
   The number of grains of rice
   harvested in 2004
Ten quintillion: 10*1018
  The number of grains of rice
  harvested in 2004
  The number of transistors
  fabricated in 2004
The transistor

 William Shockley, Walter Brattain
 and John Bardeen, Bell Labs, 1947
The integrated circuit

  Jack Kilby, Texas Instruments, and Bob
  Noyce, Fairchild Semiconductor
  Corporation, 1958
Exponential progress

 Gordon Moore, 1965
Processing power, historically
  1980: 1 MHz Apple II+, $2,000
     1980 also 1 MIPS VAX-11/780, $120,000
  2006: 2.4 GHz Pentium D, $800
     A factor of 6000
Processing power, recently
  Additional transistors => more cores of
  the same speed, rather than higher speed
  2008: Intel Core 2 Quad-Core 2.4 GHz,
  $800 (4x the capability, same price)
  2009: Intel Core 2 Quad-Core 2.5 GHz,
  $183 (same capability, 1/4 the price)
Primary memory – same story, same reason (but no
multicore fiasco)
  1972: 1MB, $1,000,000
  1982: 1MB, $60,000
  2005: $400/GB (1MB, $0.40)




                                     4GB vs. 2GB
                                   (@400MHz) = $800
                                      ($400/GB)
2007: $145/GB (1MB, $0.15)




                               4GB vs. 2GB
                             (@667MHz) = $290
                                ($145/GB)
2008: $49/GB (1MB, $0.05)




                              4GB vs. 3GB
                            (@800MHz) = $49
                               ($49/GB)
2009: $16/GB (1MB, $0.015)
  Remember, it was $400/GB in 2005!




                                      2GB (@667MHz) =
                                        $32 ($16/GB)
Moore’s Law drives sensors as well as processing
and memory
  LSST will have a
  3.2 Gigapixel camera
Disk capacity, 1975-1989
  doubled every 3+ years
  25% improvement each year
  factor of 10 every decade
  Still exponential, but far less rapid than processor
  performance
Disk capacity since 1990
  doubling every 12 months
  100% improvement each year
  factor of 1000 every decade
  10x as fast as processor performance
Only a few years ago, we purchased disks by the
megabyte (and it hurt!)
Current cost of 1 GB (a billion bytes) from Dell
  2005: $1.00
  2006: $0.50
  2008: $0.25
Purchase increment
  2005: 40GB
  2006: 80GB
  2008: 250GB
Optical bandwidth today
  Doubling every 9 months
  150% improvement each year
  Factor of 10,000 every decade
  10x as fast as disk capacity
  100x as fast as processor performance
A connected region – then
A connected region – now
But eScience is equally enabled by
software for scalability and for discovery

  It’s likely that Google has several million machines
    But let’s be conservative – 1,000,000 machines
    A rack holds 176 CPUs (88 1U dual-processor boards), so
    that’s about 6,000 racks
    A rack requires about 50 square feet (given datacenter
    cooling capabilities), so that’s about 300,000 square feet of
    machine room space (more than 6 football fields of real estate
    – although of course Google divides its machines among dozens
    of datacenters all over the world)
    A rack requires about 10kw to power, and about the same to
    cool, so that’s about 120,000 kw of power, or nearly
    100,000,000 kwh per month ($10 million at $0.10/kwh)
       Equivalent to about 20% of Seattle City Light’s generating
       capacity
Many hundreds of machines are involved in a single
Google search request (remember, the web is 400+TB)
  There are multiple clusters (of thousands of computers each)
  all over the world
  DNS routes your search to a nearby cluster
A cluster consists of Google Web Servers, Index Servers, Doc
Servers, and various other servers (ads, spell checking, etc.)
These are cheap standalone computers, rack-mounted,
connected by commodity networking gear
Within the cluster, load-balancing routes your search to a
lightly-loaded Google Web Server (GWS), which will
coordinate the search and response
The index is partitioned into “shards.” Each shard indexes a
subset of the docs (web pages). Each shard is replicated,
and can be searched by multiple computers – “index servers”
The GWS routes your search to one index server associated
with each shard, through another load-balancer
When the dust has settled, the result is an ID for every doc
satisfying your search, rank-ordered by relevance
The docs, too, are partitioned into “shards” – the
partitioning is a hash on the doc ID. Each shard contains the
full text of a subset of the docs. Each shard can be
searched by multiple computers – “doc servers”
The GWS sends appropriate doc IDs to one doc server
associated with each relevant shard
When the dust has settled, the result is a URL, a title, and a
summary for every relevant doc
Meanwhile, the ad server has done its thing, the spell
  checker has done its thing, etc.
  The GWS builds an HTTP response to your search and ships
  it off


Many hundreds of computers have enabled you to
search 400+TB of web in ~100 ms.
Enormous volumes of data
Extreme parallelism
The cheapest imaginable components
  Failures occur all the time
  You couldn’t afford to prevent this in hardware
Software makes it
  Fault-Tolerant
  Highly Available
  Recoverable
  Consistent
  Scalable
  Predictable
  Secure
How on earth would you enable mere mortals
write hairy applications such as this?

 Recognize that many Google applications have the
 same structure
   Apply a “map” operation to each logical record in order to
   compute a set of intermediate key/value pairs
   Apply a “reduce” operation to all the values that share the
   same key in order to combine the derived data appropriately
 Example: Count the number of occurrences of each
 word in a large collection of documents
   Map: Emit <word, 1> each time you encounter a word
   Reduce: Sum the values for each word
Build a runtime library that handles all the details,
accepting a couple of customization functions from
the user – a Map function and a Reduce function
That’s what MapReduce is
  Supported by the Google File System and the Chubby lock
  manager
  Augmented by the BigTable not-quite-a-database system
eScience utilizes the Cloud: Scalable
computing for everyone
Amazon Elastic Compute Cloud (EC2)

 $0.80 per hour for
   8 cores of 3 GHz 64-bit Intel or AMD
   7 GB memory
   1.69 TB scratch storage
 Need it 24x7 for a year?
   $3000

 $0.10 per hour for
   1 core of 1.2 GHz 32-bit Intel or AMD (1/20th the above)
   1.7 GB memory
   160 GB scratch storage
 Need it 24x7 for a year?
   $379
This includes
  Purchase + replacement
  Housing
  Power
  Operation
  Reliability
  Security
  Instantaneous expansion and contraction


1000 computers for a day costs the same as one
computer for 1000 days – revolutionary!
[Werner Vogels, Amazon.com]
[Werner Vogels, Amazon.com]
Still running your own email servers?
[Werner Vogels, Amazon.com]
Broadband, and the role of higher education
ARPANET, 1980
DARPA Gigabit Testbeds, mid-1990’s
CTC
                                                                                                     Cornell Theory
                                                                                                             Center
                                                                                                   NOR                    A
                                                            Ameritech NAP                          Cleveland
                                                                                                                          C
                                                                                  DNG
                                            NCAR                                  Chicago                                         C WORYork City
                                                                                                                                    New
                                      National Center for                   C                    C               A
                                    Atmospheric Research
               SCM                                                                                               C            C    Sprint NAP
               Sacramento                    C    A
           C                                                                                              PSC                        PYM
                                                                                       A            Pittsburgh                       Perryman, MD
                                                                                             Supercomputer            C
Pac Bell                                           DNJ C                        NCSA
                                                                                       C                Center
NAP                                                Denver                       National Center for              C
                                                                                Supercomputing
                                                                                Applications
                     RTO                                                                                   MFS NAP
                     Los Angeles
                         C

                 A                                                                                     C
                 C                                                                               AST
                      SDSC                                                                       Atlanta
                      San Diego                             HSJ
                      Supercomputer Center                  Houston

                                                              C



                                   Table
               A Ascend GRF 400                  DS-3
               C Cisco 7507                      OC-3C
                 FORE ASX-1000                   OC-12C



                                                                                           NSF vBNS, 1997
                 Network Access Point
UCAID Abilene network, 1998
ARPANET, 1983
NSFNET, 1991
DARPA / NGI Testbed, late 1990’s
NSF vBNS, 1998
Internet2, 1999
NLR – Optical Infrastructure - Phase 1

                    Seattle

   Portland
                         Boise

Sunnyvale                                         Chicago   Clev
                                                                   Pitts
                                  Denver
                          Ogden              KC
                                                                           Wash DC

                                                                           Raleigh
         LA
     San Diego                                          Atlanta

            NLR
            Route                                                   Jacksonville
       NLR
       MetaPOP
       NLR Regen or
       OADM

                                           National LambdaRail, 2004
Who reaches K-12 institutions, CC’s, tribal
colleges, libraries, telemedicine sites?
Who reaches unserved and underserved
regions?
The broadband stimulus
The broadband stimulus
Computer science & engineering:
Changing the world

 Advances in computing change the way we live, work,
 learn, and communicate
 Advances in computing drive advances in nearly all
 other fields
 Advances in computing power our economy
    Not just through the growth of the IT industry – through
    productivity growth across the entire economy
84
85
86
87
Imagine spending a day without
information technology

 A day without the Internet and all that it enables
 A day without diagnostic medical imaging
 A day during which automobiles lacked electronic ignition,
 antilock brakes, and electronic stability control
 A day without digital media – without wireless telephones,
 high-definition televisions, MP3 audio, DVD video,
 computer animation, and videogames
 A day during which aircraft could not fly, travelers had
 to navigate without benefit of GPS, weather forecasters
 had no models, banks and merchants could not transfer
 funds electronically, factory automation ceased to
 function, and the US military lacked technological
 supremacy
Imagine spending a day without
information technology

 A day without the Internet and all that it enables
 A day without diagnostic medical imaging
 A day during which automobiles lacked electronic ignition,
 antilock brakes, and electronic stability control
 A day without digital media – without wireless telephones,
 high-definition televisions, MP3 audio, DVD video,
 computer animation, and videogames
 A day during which aircraft could not fly, travelers had
 to navigate without benefit of GPS, weather forecasters
 had no models, banks and merchants could not transfer
 funds electronically, factory automation ceased to
 function, and the US military lacked technological
 supremacy
Research has built
the foundation
The future is full of opportunity

  Creating the future of
  networking
  Driving advances in all fields
  of science and engineering
  Revolutionizing transportation
  Personalized education
  The Smart Grid
  Predictive, preventive,
  personalized medicine
  Quantum computing
  Empowerment of the
  developing world
  Personalized health monitoring
  => quality of life
  Neurobotics
  Synthetic biology
Why do students choose the field?
Power to change the world




 People enter the field for a wide variety of
 aspirational reasons
 http://www.cs.washington.edu/WhyCSE/
Pathways in computer science




 People pursue diverse careers following their
 education in computer science
 http://www.cs.washington.edu/WhyCSE/
A day in the life




  Working in the software industry is creative,
  interactive, empowering
  http://www.cs.washington.edu/WhyCSE/
Science & Engineering Job Growth, 2006-16
(BLS Occupational Employment Projections)
Average annual
projected
growth rates
for all 22 major
occupational
groups in
Washington
State, 2006 to
2016
Occupations are ranked based on the
average of two criteria: average
annual growth rate for 2006 to 2016
and total number of job openings due
to growth and replacement
Science & Engineering Job Growth,
2006-16 (BLS Occupational Employment Projections)


                      13       University of Washington
                  9            academic departments

                 8
             9

     4
                                                    2
         3
The transformation of our economy, and of
educational requirements

 Once upon a time, the “content” of the goods we
 produced was largely physical
Then we transitioned to goods whose “content” was a
balance of physical and intellectual
In today’s knowledge-based economy, the “content”
of goods is almost entirely intellectual rather than
physical
What kind of education is required to produce a good
whose content is almost entirely intellectual?
Education in Washington:  Where do we stand
  The Nation’s S&E Top 5
                          among the states?
  (Workforce intensity, all S&E occupations, 2006)

   1. Virginia
                                                       LIFE/PHYS.   COMPUTER
   2. Massachusetts                     ENGINEERS      SCIENTISTS   SPECIALISTS
   3. Maryland                          PEERS: 2       PEERS: 3     PEERS: 6
   4. Washington                        NATION: 3      NATION: 7    NATION: 8
   5. Colorado



                                        BACHELOR’S     S&E DEGREE   S&E GRADUATE    S&E PhDs
  HIGHER EDUCATION:
                                        PRODUCTION     PRODUCTION   PARTICIPATION   AWARDED
  (BACHELOR’S PRODUCTION,
  S&E GRADUATE PARTICIPATION,           PEERS:   8     PEERS:   7   PEERS: 10       PEERS: 10
  S&E PhD PRODUCTION)                   NATION: 36     NATION: 31   NATION: 46      NATION: 27




                                         HIGH SCHOOL
  K-12 ACHIEVEMENT:
                                         GRADUATION
  (2007 8TH GRADE NAEP &
  GRADUATION RATE FROM                   PEERS:   7
  POSTSECONDARY.ORG)                     NATION: 32
Washington State is gambling with its
future!




                         How about you?
This morning

 The nature of eScience
 The advances that enable it
 Scalable computing for everyone
 Broadband, and the role of higher education
 “There’s only so much you can do with asphalt”
 Don’t gamble with the future of your region and of
 the kids who grow up there



 http://lazowska.cs.washington.edu/wcet.pdf

Más contenido relacionado

Destacado

Open Access in Serbia
Open Access in SerbiaOpen Access in Serbia
Open Access in SerbiaSanja Antonic
 
"Cash is King!" - R&M Audit
"Cash is King!" - R&M Audit"Cash is King!" - R&M Audit
"Cash is King!" - R&M AuditIrina Visan
 
Fresh magazine, June 08_interview CZ
Fresh magazine, June 08_interview CZFresh magazine, June 08_interview CZ
Fresh magazine, June 08_interview CZPavelKnorr
 
Open Access New Era For Life Long Learning
Open Access New Era For Life Long LearningOpen Access New Era For Life Long Learning
Open Access New Era For Life Long LearningSanja Antonic
 
Visual message system
Visual message systemVisual message system
Visual message systemElisa Bordoli
 
Botanicka basta 2014 casopisi i knjige (1)
Botanicka basta 2014 casopisi i knjige  (1)Botanicka basta 2014 casopisi i knjige  (1)
Botanicka basta 2014 casopisi i knjige (1)Sanja Antonic
 
Positive Approach Events &amp; Consulting
Positive Approach Events &amp; ConsultingPositive Approach Events &amp; Consulting
Positive Approach Events &amp; Consultingpaevents
 
“Ottimizzazione del sistema informativo di un’associazione tramite il coinvol...
“Ottimizzazione del sistema informativo di un’associazione tramite il coinvol...“Ottimizzazione del sistema informativo di un’associazione tramite il coinvol...
“Ottimizzazione del sistema informativo di un’associazione tramite il coinvol...Elisa Bordoli
 
first lecture homework/Abdullah alkaabi
first lecture homework/Abdullah alkaabifirst lecture homework/Abdullah alkaabi
first lecture homework/Abdullah alkaabialkaabi68713
 
Serbian Wordnet For Biomedical
Serbian Wordnet For BiomedicalSerbian Wordnet For Biomedical
Serbian Wordnet For BiomedicalSanja Antonic
 
Mapas Mentales Outsourcing
Mapas Mentales OutsourcingMapas Mentales Outsourcing
Mapas Mentales Outsourcingguest3ef5aef4
 
Ia avtalen 2010 til 2013 jl
Ia avtalen 2010 til 2013 jlIa avtalen 2010 til 2013 jl
Ia avtalen 2010 til 2013 jlDag Westhrin
 

Destacado (16)

Open Access in Serbia
Open Access in SerbiaOpen Access in Serbia
Open Access in Serbia
 
Live airport
Live airportLive airport
Live airport
 
Oppf PhpvåR2010
Oppf PhpvåR2010Oppf PhpvåR2010
Oppf PhpvåR2010
 
"Cash is King!" - R&M Audit
"Cash is King!" - R&M Audit"Cash is King!" - R&M Audit
"Cash is King!" - R&M Audit
 
Fresh magazine, June 08_interview CZ
Fresh magazine, June 08_interview CZFresh magazine, June 08_interview CZ
Fresh magazine, June 08_interview CZ
 
Open Access New Era For Life Long Learning
Open Access New Era For Life Long LearningOpen Access New Era For Life Long Learning
Open Access New Era For Life Long Learning
 
Visual message system
Visual message systemVisual message system
Visual message system
 
NTL Sørmarka
NTL SørmarkaNTL Sørmarka
NTL Sørmarka
 
Botanicka basta 2014 casopisi i knjige (1)
Botanicka basta 2014 casopisi i knjige  (1)Botanicka basta 2014 casopisi i knjige  (1)
Botanicka basta 2014 casopisi i knjige (1)
 
Positive Approach Events &amp; Consulting
Positive Approach Events &amp; ConsultingPositive Approach Events &amp; Consulting
Positive Approach Events &amp; Consulting
 
Ia avtalen2010
Ia avtalen2010Ia avtalen2010
Ia avtalen2010
 
“Ottimizzazione del sistema informativo di un’associazione tramite il coinvol...
“Ottimizzazione del sistema informativo di un’associazione tramite il coinvol...“Ottimizzazione del sistema informativo di un’associazione tramite il coinvol...
“Ottimizzazione del sistema informativo di un’associazione tramite il coinvol...
 
first lecture homework/Abdullah alkaabi
first lecture homework/Abdullah alkaabifirst lecture homework/Abdullah alkaabi
first lecture homework/Abdullah alkaabi
 
Serbian Wordnet For Biomedical
Serbian Wordnet For BiomedicalSerbian Wordnet For Biomedical
Serbian Wordnet For Biomedical
 
Mapas Mentales Outsourcing
Mapas Mentales OutsourcingMapas Mentales Outsourcing
Mapas Mentales Outsourcing
 
Ia avtalen 2010 til 2013 jl
Ia avtalen 2010 til 2013 jlIa avtalen 2010 til 2013 jl
Ia avtalen 2010 til 2013 jl
 

Similar a E Science As A Lens On The World Lazowska

Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsHeiko Joerg Schick
 
Cluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesCluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesGuy Coates
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Cloud Computing y Big Data, próxima frontera de la innovación
Cloud Computing y Big Data, próxima frontera de la innovaciónCloud Computing y Big Data, próxima frontera de la innovación
Cloud Computing y Big Data, próxima frontera de la innovaciónFundación Ramón Areces
 
Mateo Valero - Big data: de la investigación científica a la gestión empresarial
Mateo Valero - Big data: de la investigación científica a la gestión empresarialMateo Valero - Big data: de la investigación científica a la gestión empresarial
Mateo Valero - Big data: de la investigación científica a la gestión empresarialFundación Ramón Areces
 
BDI- The Beginning (Big data training in Coimbatore)
BDI- The Beginning (Big data training in Coimbatore)BDI- The Beginning (Big data training in Coimbatore)
BDI- The Beginning (Big data training in Coimbatore)Ashok Rangaswamy
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big DataArjen de Vries
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Agents In An Exponential World Foster
Agents In An Exponential World FosterAgents In An Exponential World Foster
Agents In An Exponential World FosterIan Foster
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Masoud Nikravesh
 
Clouds, Grids and Data
Clouds, Grids and DataClouds, Grids and Data
Clouds, Grids and DataGuy Coates
 
Terabit Applications: What Are They, What is Needed to Enable Them?
Terabit Applications: What Are They, What is Needed to Enable Them?Terabit Applications: What Are They, What is Needed to Enable Them?
Terabit Applications: What Are They, What is Needed to Enable Them?Larry Smarr
 
The Optiputer - Toward a Terabit LAN
The Optiputer - Toward a Terabit LANThe Optiputer - Toward a Terabit LAN
The Optiputer - Toward a Terabit LANLarry Smarr
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud ExperiencesGuy Coates
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011Ian Foster
 
Sanger HPC infrastructure Report (2007)
Sanger HPC infrastructure  Report (2007)Sanger HPC infrastructure  Report (2007)
Sanger HPC infrastructure Report (2007)Guy Coates
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
Clouds: All fluff and no substance?
Clouds: All fluff and no substance?Clouds: All fluff and no substance?
Clouds: All fluff and no substance?Guy Coates
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 

Similar a E Science As A Lens On The World Lazowska (20)

Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big Analytics
 
Cluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesCluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomes
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Cloud Computing y Big Data, próxima frontera de la innovación
Cloud Computing y Big Data, próxima frontera de la innovaciónCloud Computing y Big Data, próxima frontera de la innovación
Cloud Computing y Big Data, próxima frontera de la innovación
 
Mateo Valero - Big data: de la investigación científica a la gestión empresarial
Mateo Valero - Big data: de la investigación científica a la gestión empresarialMateo Valero - Big data: de la investigación científica a la gestión empresarial
Mateo Valero - Big data: de la investigación científica a la gestión empresarial
 
BDI- The Beginning (Big data training in Coimbatore)
BDI- The Beginning (Big data training in Coimbatore)BDI- The Beginning (Big data training in Coimbatore)
BDI- The Beginning (Big data training in Coimbatore)
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Agents In An Exponential World Foster
Agents In An Exponential World FosterAgents In An Exponential World Foster
Agents In An Exponential World Foster
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012
 
Clouds, Grids and Data
Clouds, Grids and DataClouds, Grids and Data
Clouds, Grids and Data
 
Terabit Applications: What Are They, What is Needed to Enable Them?
Terabit Applications: What Are They, What is Needed to Enable Them?Terabit Applications: What Are They, What is Needed to Enable Them?
Terabit Applications: What Are They, What is Needed to Enable Them?
 
The Optiputer - Toward a Terabit LAN
The Optiputer - Toward a Terabit LANThe Optiputer - Toward a Terabit LAN
The Optiputer - Toward a Terabit LAN
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud Experiences
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
Sanger HPC infrastructure Report (2007)
Sanger HPC infrastructure  Report (2007)Sanger HPC infrastructure  Report (2007)
Sanger HPC infrastructure Report (2007)
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Clouds: All fluff and no substance?
Clouds: All fluff and no substance?Clouds: All fluff and no substance?
Clouds: All fluff and no substance?
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 

E Science As A Lens On The World Lazowska

  • 1. eScience as a Lens on the World Ed Lazowska Bill & Melinda Gates Chair in Computer Science & Engineering University of Washington WCET 21st Annual Conference October 2009 http://lazowska.cs.washington.edu/wcet.pdf
  • 2. This morning The nature of eScience The advances that enable it Scalable computing for everyone Broadband, and the role of higher education Computer science & engineering: Changing the world The changing nature of our economy, and of educational requirements
  • 3. eScience: Sensor-driven (data-driven) science and engineering Jim Gray Transforming science (again!)
  • 6. Theory Experiment Observation [John Delaney, University of Washington]
  • 7. Theory Experiment Observation Computational Science
  • 8. Protein interactions in striated muscles Tom Daniel lab
  • 9. QCD to study interactions of nuclei David Kaplan lab
  • 10. Stars Gas Study of dark matter Dark Matter Tom Quinn lab
  • 11. Protein structure prediction David Baker lab
  • 12. Theory Experiment Observation Computational Science eScience
  • 13. eScience is driven by data Massive volumes of data from sensors and networks of sensors Apache Point telescope, SDSS 15TB of data (15,000,000,000,000 bytes)
  • 14. Large Synoptic Survey Telescope (LSST) 30TB/day, 60PB in its 10-year lifetime
  • 15. Large Hadron Collider 700MB of data per second, 60TB/day, 20PB/year
  • 16. Illumina Genome Analyzer ~1TB/day
  • 17. Regional Scale Nodes of the NSF Ocean Observatories Initiative 1000 km of fiber optic cable on the seafloor, connecting thousands of chemical, physical, and biological sensors
  • 18. The Web 20+ billion web pages x 20KB = 400+TB One computer can read 30-35 MB/sec from disk => 4 months just to read the web
  • 20. eScience is about the analysis of data The automated or semi-automated extraction of knowledge from massive volumes of data There’s simply too much of it to look at
  • 21. The technologies of eScience Sensors and sensor networks Databases Data mining Machine learning Data visualization Cluster computing at enormous scale
  • 22. eScience will be pervasive Computational science has been transformational, but it was a niche As an institution (e.g., a university), you didn’t need to excel in order to be competitive eScience capabilities must be broadly available in any organization If not, the organization will simply cease to be competitive
  • 23. Life on Planet Earth [John Delaney, University of Washington]
  • 24. [John Delaney, University of Washington]
  • 25. [John Delaney, University of Washington]
  • 26. Top faculty across all disciplines understand the coming data tsunami
  • 27. Questions for you … How does your institution track the IT needs – present and future – of its leading researchers? To what extent are you meeting these needs, and in what critical areas are you falling short? How well does your technology staff understand the institution’s research and disciplinary directions and their IT implications? What potential resources, other than those currently in place, can be used to provide broad-based IT support for eScience and eScholarship?
  • 28. More on the enablement of eScience Ten quintillion: 10*1018 The number of grains of rice harvested in 2004
  • 29. Ten quintillion: 10*1018 The number of grains of rice harvested in 2004 The number of transistors fabricated in 2004
  • 30. The transistor William Shockley, Walter Brattain and John Bardeen, Bell Labs, 1947
  • 31. The integrated circuit Jack Kilby, Texas Instruments, and Bob Noyce, Fairchild Semiconductor Corporation, 1958
  • 33.
  • 34.
  • 35.
  • 36.
  • 37. Processing power, historically 1980: 1 MHz Apple II+, $2,000 1980 also 1 MIPS VAX-11/780, $120,000 2006: 2.4 GHz Pentium D, $800 A factor of 6000 Processing power, recently Additional transistors => more cores of the same speed, rather than higher speed 2008: Intel Core 2 Quad-Core 2.4 GHz, $800 (4x the capability, same price) 2009: Intel Core 2 Quad-Core 2.5 GHz, $183 (same capability, 1/4 the price)
  • 38. Primary memory – same story, same reason (but no multicore fiasco) 1972: 1MB, $1,000,000 1982: 1MB, $60,000 2005: $400/GB (1MB, $0.40) 4GB vs. 2GB (@400MHz) = $800 ($400/GB)
  • 39. 2007: $145/GB (1MB, $0.15) 4GB vs. 2GB (@667MHz) = $290 ($145/GB)
  • 40. 2008: $49/GB (1MB, $0.05) 4GB vs. 3GB (@800MHz) = $49 ($49/GB)
  • 41. 2009: $16/GB (1MB, $0.015) Remember, it was $400/GB in 2005! 2GB (@667MHz) = $32 ($16/GB)
  • 42. Moore’s Law drives sensors as well as processing and memory LSST will have a 3.2 Gigapixel camera
  • 43. Disk capacity, 1975-1989 doubled every 3+ years 25% improvement each year factor of 10 every decade Still exponential, but far less rapid than processor performance Disk capacity since 1990 doubling every 12 months 100% improvement each year factor of 1000 every decade 10x as fast as processor performance
  • 44. Only a few years ago, we purchased disks by the megabyte (and it hurt!) Current cost of 1 GB (a billion bytes) from Dell 2005: $1.00 2006: $0.50 2008: $0.25 Purchase increment 2005: 40GB 2006: 80GB 2008: 250GB
  • 45. Optical bandwidth today Doubling every 9 months 150% improvement each year Factor of 10,000 every decade 10x as fast as disk capacity 100x as fast as processor performance
  • 46. A connected region – then
  • 48. But eScience is equally enabled by software for scalability and for discovery It’s likely that Google has several million machines But let’s be conservative – 1,000,000 machines A rack holds 176 CPUs (88 1U dual-processor boards), so that’s about 6,000 racks A rack requires about 50 square feet (given datacenter cooling capabilities), so that’s about 300,000 square feet of machine room space (more than 6 football fields of real estate – although of course Google divides its machines among dozens of datacenters all over the world) A rack requires about 10kw to power, and about the same to cool, so that’s about 120,000 kw of power, or nearly 100,000,000 kwh per month ($10 million at $0.10/kwh) Equivalent to about 20% of Seattle City Light’s generating capacity
  • 49. Many hundreds of machines are involved in a single Google search request (remember, the web is 400+TB) There are multiple clusters (of thousands of computers each) all over the world DNS routes your search to a nearby cluster
  • 50. A cluster consists of Google Web Servers, Index Servers, Doc Servers, and various other servers (ads, spell checking, etc.) These are cheap standalone computers, rack-mounted, connected by commodity networking gear
  • 51. Within the cluster, load-balancing routes your search to a lightly-loaded Google Web Server (GWS), which will coordinate the search and response The index is partitioned into “shards.” Each shard indexes a subset of the docs (web pages). Each shard is replicated, and can be searched by multiple computers – “index servers” The GWS routes your search to one index server associated with each shard, through another load-balancer When the dust has settled, the result is an ID for every doc satisfying your search, rank-ordered by relevance
  • 52. The docs, too, are partitioned into “shards” – the partitioning is a hash on the doc ID. Each shard contains the full text of a subset of the docs. Each shard can be searched by multiple computers – “doc servers” The GWS sends appropriate doc IDs to one doc server associated with each relevant shard When the dust has settled, the result is a URL, a title, and a summary for every relevant doc
  • 53. Meanwhile, the ad server has done its thing, the spell checker has done its thing, etc. The GWS builds an HTTP response to your search and ships it off Many hundreds of computers have enabled you to search 400+TB of web in ~100 ms.
  • 54. Enormous volumes of data Extreme parallelism The cheapest imaginable components Failures occur all the time You couldn’t afford to prevent this in hardware Software makes it Fault-Tolerant Highly Available Recoverable Consistent Scalable Predictable Secure
  • 55. How on earth would you enable mere mortals write hairy applications such as this? Recognize that many Google applications have the same structure Apply a “map” operation to each logical record in order to compute a set of intermediate key/value pairs Apply a “reduce” operation to all the values that share the same key in order to combine the derived data appropriately Example: Count the number of occurrences of each word in a large collection of documents Map: Emit <word, 1> each time you encounter a word Reduce: Sum the values for each word
  • 56. Build a runtime library that handles all the details, accepting a couple of customization functions from the user – a Map function and a Reduce function That’s what MapReduce is Supported by the Google File System and the Chubby lock manager Augmented by the BigTable not-quite-a-database system
  • 57. eScience utilizes the Cloud: Scalable computing for everyone
  • 58. Amazon Elastic Compute Cloud (EC2) $0.80 per hour for 8 cores of 3 GHz 64-bit Intel or AMD 7 GB memory 1.69 TB scratch storage Need it 24x7 for a year? $3000 $0.10 per hour for 1 core of 1.2 GHz 32-bit Intel or AMD (1/20th the above) 1.7 GB memory 160 GB scratch storage Need it 24x7 for a year? $379
  • 59. This includes Purchase + replacement Housing Power Operation Reliability Security Instantaneous expansion and contraction 1000 computers for a day costs the same as one computer for 1000 days – revolutionary!
  • 62. Still running your own email servers?
  • 64. Broadband, and the role of higher education
  • 66. DARPA Gigabit Testbeds, mid-1990’s
  • 67. CTC Cornell Theory Center NOR A Ameritech NAP Cleveland C DNG NCAR Chicago C WORYork City New National Center for C C A Atmospheric Research SCM C C Sprint NAP Sacramento C A C PSC PYM A Pittsburgh Perryman, MD Supercomputer C Pac Bell DNJ C NCSA C Center NAP Denver National Center for C Supercomputing Applications RTO MFS NAP Los Angeles C A C C AST SDSC Atlanta San Diego HSJ Supercomputer Center Houston C Table A Ascend GRF 400 DS-3 C Cisco 7507 OC-3C FORE ASX-1000 OC-12C NSF vBNS, 1997 Network Access Point
  • 69.
  • 72. DARPA / NGI Testbed, late 1990’s
  • 75. NLR – Optical Infrastructure - Phase 1 Seattle Portland Boise Sunnyvale Chicago Clev Pitts Denver Ogden KC Wash DC Raleigh LA San Diego Atlanta NLR Route Jacksonville NLR MetaPOP NLR Regen or OADM National LambdaRail, 2004
  • 76. Who reaches K-12 institutions, CC’s, tribal colleges, libraries, telemedicine sites?
  • 77.
  • 78. Who reaches unserved and underserved regions?
  • 81.
  • 82.
  • 83. Computer science & engineering: Changing the world Advances in computing change the way we live, work, learn, and communicate Advances in computing drive advances in nearly all other fields Advances in computing power our economy Not just through the growth of the IT industry – through productivity growth across the entire economy
  • 84. 84
  • 85. 85
  • 86. 86
  • 87. 87
  • 88. Imagine spending a day without information technology A day without the Internet and all that it enables A day without diagnostic medical imaging A day during which automobiles lacked electronic ignition, antilock brakes, and electronic stability control A day without digital media – without wireless telephones, high-definition televisions, MP3 audio, DVD video, computer animation, and videogames A day during which aircraft could not fly, travelers had to navigate without benefit of GPS, weather forecasters had no models, banks and merchants could not transfer funds electronically, factory automation ceased to function, and the US military lacked technological supremacy
  • 89. Imagine spending a day without information technology A day without the Internet and all that it enables A day without diagnostic medical imaging A day during which automobiles lacked electronic ignition, antilock brakes, and electronic stability control A day without digital media – without wireless telephones, high-definition televisions, MP3 audio, DVD video, computer animation, and videogames A day during which aircraft could not fly, travelers had to navigate without benefit of GPS, weather forecasters had no models, banks and merchants could not transfer funds electronically, factory automation ceased to function, and the US military lacked technological supremacy
  • 91. The future is full of opportunity Creating the future of networking Driving advances in all fields of science and engineering Revolutionizing transportation Personalized education The Smart Grid Predictive, preventive, personalized medicine Quantum computing Empowerment of the developing world Personalized health monitoring => quality of life Neurobotics Synthetic biology
  • 92. Why do students choose the field?
  • 93. Power to change the world People enter the field for a wide variety of aspirational reasons http://www.cs.washington.edu/WhyCSE/
  • 94. Pathways in computer science People pursue diverse careers following their education in computer science http://www.cs.washington.edu/WhyCSE/
  • 95. A day in the life Working in the software industry is creative, interactive, empowering http://www.cs.washington.edu/WhyCSE/
  • 96. Science & Engineering Job Growth, 2006-16 (BLS Occupational Employment Projections)
  • 97. Average annual projected growth rates for all 22 major occupational groups in Washington State, 2006 to 2016
  • 98. Occupations are ranked based on the average of two criteria: average annual growth rate for 2006 to 2016 and total number of job openings due to growth and replacement
  • 99. Science & Engineering Job Growth, 2006-16 (BLS Occupational Employment Projections) 13 University of Washington 9 academic departments 8 9 4 2 3
  • 100. The transformation of our economy, and of educational requirements Once upon a time, the “content” of the goods we produced was largely physical
  • 101. Then we transitioned to goods whose “content” was a balance of physical and intellectual
  • 102. In today’s knowledge-based economy, the “content” of goods is almost entirely intellectual rather than physical
  • 103. What kind of education is required to produce a good whose content is almost entirely intellectual?
  • 104. Education in Washington:  Where do we stand The Nation’s S&E Top 5 among the states? (Workforce intensity, all S&E occupations, 2006) 1. Virginia LIFE/PHYS. COMPUTER 2. Massachusetts ENGINEERS SCIENTISTS SPECIALISTS 3. Maryland PEERS: 2 PEERS: 3 PEERS: 6 4. Washington NATION: 3 NATION: 7 NATION: 8 5. Colorado BACHELOR’S S&E DEGREE S&E GRADUATE S&E PhDs HIGHER EDUCATION: PRODUCTION PRODUCTION PARTICIPATION AWARDED (BACHELOR’S PRODUCTION, S&E GRADUATE PARTICIPATION, PEERS: 8 PEERS: 7 PEERS: 10 PEERS: 10 S&E PhD PRODUCTION) NATION: 36 NATION: 31 NATION: 46 NATION: 27 HIGH SCHOOL K-12 ACHIEVEMENT: GRADUATION (2007 8TH GRADE NAEP & GRADUATION RATE FROM PEERS: 7 POSTSECONDARY.ORG) NATION: 32
  • 105. Washington State is gambling with its future! How about you?
  • 106. This morning The nature of eScience The advances that enable it Scalable computing for everyone Broadband, and the role of higher education “There’s only so much you can do with asphalt” Don’t gamble with the future of your region and of the kids who grow up there http://lazowska.cs.washington.edu/wcet.pdf