Building a scalable, open source storage and processing solution for biodiversity data




                   Phil Cryer
                   Anthony Goddard


Thursday, November 12, 2009
> Biodiversity Heritage Library's data 


   • all BHL storage is handled by the
     Internet Archive

   • 38,000+ scanned books

   • approximately 48
     terabytes of data

   • unable to self-host
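As a rough sense of scale, the figures on this slide imply an average size per scanned book. A minimal sketch, assuming decimal terabytes (10^12 bytes) and treating "38,000+" as exactly 38,000, so the result is only an approximation:

```python
# Rough average size of one scanned book, from the slide's figures.
# Assumptions: 1 TB = 10**12 bytes, and exactly 38,000 books.
TOTAL_BYTES = 48 * 10**12
BOOK_COUNT = 38_000

avg_bytes_per_book = TOTAL_BYTES / BOOK_COUNT
avg_gb_per_book = avg_bytes_per_book / 10**9

print(f"about {avg_gb_per_book:.2f} GB per book")
```

Because the real book count is higher than 38,000, the true average is likely somewhat lower than this estimate.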




> BHL - Europe


   • 3 year, EU funded project

   • 28 major natural history museums,
     botanical gardens and other
     cooperating institutions

   • third file-store of all BHL data

   • collecting cultural heritage from all
     over Europe




> Data explosion


   • more data being created

   • more data being saved

   • more data tomorrow

   • storage has not kept up with
     Moore’s Law

   • this presentation will be saved
     online, more data!




> Potential #fails




> Problem 1 - Data access


   • file size we can’t store

   • latency of large files

   • quality user experience

   • processing data-mining

                                Access denied...


> Problem 2 - Copyright concerns




                                        ©
   • international copyright concerns

   • potential related funding issues

   • we’d rather not let this be an
     issue




> Problem 3 - Redundancy


   • computers crash

   • hard drives die

   • networks fail

   • natural disasters occur


                              but...

              This is NOT a problem!




...so plan for it.



Current




> Site 1 - Internet Archive




> Site 2 - MBL, Woods Hole




> Site 3 - NHM, London




                              
   ...followed by new Data center
Data Centre – “Darwin Repository”

      • €600,000 Funding secured from eContentPlus
      • Suitable location found with very good development
        potential in collaboration with Science Museum.
      • Economy of scale provides additional avenues for co-
        development of services.
         – Disaster Recovery and Business Continuity for all
           Museums (help with ongoing and running costs)
            • DCMS funding sought to help with development.
         – e-Infrastructure European initiative
            • Building Digital Repositories for Scientific
              Communities
               – PESI (Biodiversity)



Proposed Data Centre Location


                              Swindon




          Wroughton Science Museum
                                        ©2008 Google – Imagery ©2008 DigitalGlobe, Infoterra Ltd & Bluesky, GeoEye, Map data ©2008 Tele Atlas




Vendor Stakeholders / Partners

      • Identified Technology Partners*




      • Additional Funding Partners*




                              *Note: Discussions are ongoing with all Partners and may be at different stages




Long Term Sustainability

      • No Dripping Tap
            – Business case should provide for significant
              self funding opportunities.
      • Diversity
            – Darwin Repository (Data Centre) will provide
              an economy of scale that will provide
              significant efficiency gains.
      • Green technology to minimise carbon footprint
        and provide industry leadership.



> Distributed storage


   • write once, read anywhere

   • replication and fault tolerance

   • error correction

   • automatic redundancy

   • scalable horizontally




> Distributed storage - Options


  • fully hosted storage (cloud)

  • hosted with own storage (private cloud)

  • self hosted with proprietary hardware (Sun
    Thumper)

  • self hosted with commodity hardware




> Distributed storage - GlusterFS


   • GlusterFS: a cluster file-system
     capable of scaling to several
     petabytes

   • open source software on
     commodity hardware

   • tunable performance

   • simple to install and manage

   • offers seamless expansion




> Distributed storage - Archival 


   • Fedora Commons is an open
     source repository

   • accounts for all changes, so built-in
     version control

   • provides disaster recovery

   • open standards to mesh with
     future file formats

   • provides open sharing services
     such as OAI-PMH




> Distributed storage - Mirrored data


   • now we have redundancy

   • in fact, multiple redundant
     copies

   • provides fault tolerance

   • offers load balancing

   • gives us future geographical
     distribution
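The idea of mirrored copies can be sketched in a few lines. This is not how GlusterFS implements replication; it is a hypothetical illustration in which each "site" is just a local directory, every write goes to all sites, and a read falls back to the next site when a copy is missing or fails its checksum. The names `write_mirrored` and `read_mirrored` are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_mirrored(sites: list[Path], name: str, data: bytes) -> str:
    """Write the same bytes to every site and return their checksum."""
    for site in sites:
        (site / name).write_bytes(data)
    return sha256(data)


def read_mirrored(sites: list[Path], name: str, checksum: str) -> bytes:
    """Return the first copy whose checksum matches, skipping bad sites."""
    for site in sites:
        path = site / name
        if not path.exists():
            continue  # this site lost the file (simulated disk failure)
        data = path.read_bytes()
        if sha256(data) == checksum:
            return data
    raise IOError(f"no healthy copy of {name} found")


# Demo: three temporary directories stand in for three sites.
root = Path(tempfile.mkdtemp())
sites = [root / f"site{i}" for i in (1, 2, 3)]
for site in sites:
    site.mkdir()

digest = write_mirrored(sites, "page.txt", b"scanned page")
(sites[0] / "page.txt").unlink()                 # site 1 loses the file
(sites[1] / "page.txt").write_bytes(b"corrupt")  # site 2 is corrupted
recovered = read_mirrored(sites, "page.txt", digest)
```

With two of the three copies damaged, the read still succeeds from the third site, which is the fault tolerance the slide describes.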




> Now we have lots of computers...




> Distributed processing


   • more abilities available than just
     storing data

   • with distributed storage comes
     distributed processing

   • distributed processing means
     faster answers

   • faster answers mean new
     questions

   •    lather, rinse, repeat




> Distributed processing


   • make your data more useful

   • image and OCR processing

   • distributed web services

   • identifier resolution pools

   • map/reduce frameworks

   • generate new visualizations, text
     mining, NLP
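The map/reduce pattern named on this slide can be illustrated without a cluster. The sketch below is plain Python, not the Hadoop or Disco API: it counts word occurrences across a few hypothetical OCR page texts, with worker threads standing in for cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical OCR text of three pages; real input would be read from storage.
pages = [
    "quercus alba quercus rubra",
    "acer rubrum quercus alba",
    "acer saccharum",
]


def map_page(text: str) -> Counter:
    """Map step: count the words on one page, independently of other pages."""
    return Counter(text.split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


# Worker threads stand in for cluster nodes; each handles pages on its own.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_page, pages))

totals = reduce(reduce_counts, partial_counts, Counter())
print(totals["quercus"], totals["acer"])
```

Because each page is mapped independently, adding more nodes lets more pages be processed at once; only the reduce step needs to see the partial results together.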




> Distributed processing
   [Diagram: request flow, labelled from bottom to top: Request, Site, Cluster, Cluster Node, Load Balancer, WebService (two instances), TaxonFinder (several instances)]
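The diagram shows requests passing through a load balancer to several TaxonFinder instances. A minimal sketch of that arrangement, assuming simple round-robin dispatch (the slides do not say which balancing strategy is used) and a hypothetical `find_names` function standing in for the real TaxonFinder service:

```python
from itertools import cycle


def find_names(worker: str, text: str) -> dict:
    """Hypothetical stand-in for a TaxonFinder call; it only flags
    capitalized words, which is far cruder than real name finding."""
    names = [word for word in text.split() if word[:1].isupper()]
    return {"worker": worker, "names": names}


class RoundRobinBalancer:
    """Hand each request to the next worker in a fixed rotation."""

    def __init__(self, workers: list[str]):
        self._workers = cycle(workers)

    def handle(self, text: str) -> dict:
        return find_names(next(self._workers), text)


balancer = RoundRobinBalancer(["taxonfinder-1", "taxonfinder-2", "taxonfinder-3"])
responses = [balancer.handle("specimens of Quercus alba") for _ in range(4)]
print([r["worker"] for r in responses])
# → ['taxonfinder-1', 'taxonfinder-2', 'taxonfinder-3', 'taxonfinder-1']
```

The fourth request wraps around to the first worker, so load is spread evenly as long as requests cost roughly the same.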




> Some assembly required (optional)


   • our example uses new, faster
     commodity hardware

   • but it could run on any hardware
     that can run Linux

   • you could chain old "outdated"
     computers together

   • build your own cluster for next to
     nothing (host it in your basement)

   •   solves some infrastructure funding
       issues

   •   hardware vendor neutrality




> Our proof of concept

   • we ran a six-box cluster to
     demonstrate GlusterFS

   • ran stock Debian GNU/Linux

   • simulated hardware failures

   • synced data with a remote cluster

   • ran map/reduce jobs

   • defined procedures, configurations
     and build scripts
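The slides do not describe how the synced data was checked. One hypothetical way to confirm that two copies match is to compare checksum manifests, sketched here with two local directories standing in for the local and remote clusters; the function names `manifest` and `out_of_sync` are assumptions for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def out_of_sync(local: Path, remote: Path) -> list[str]:
    """List files that are missing on one side or differ in content."""
    a, b = manifest(local), manifest(remote)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Demo with two temporary directories standing in for the two clusters.
local = Path(tempfile.mkdtemp())
remote = Path(tempfile.mkdtemp())
(local / "book1.txt").write_text("page one")
(remote / "book1.txt").write_text("page one")
(local / "book2.txt").write_text("page two")   # not yet synced to remote
stale = out_of_sync(local, remote)
```

Here `stale` names only the file that has not reached the second directory, which is the list a follow-up sync would need to transfer.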




[Architecture diagram]
   Layer labels: presentation, API, processing, metadata, sync support, file system, storage
   Components, top to bottom: www; REST; mod_glusterfs; disco; Hadoop; Fedora Commons; Mulgara triplestore / RDF; rsync; GlusterFS; BitTorrent; Ext4 (exabyte); HTTP; ‘Network RAID’; raw disk array; commodity SATA controllers; commodity hosts; dedicated storage network
> Distributed storage - Projected costs




                                   $246,000




                              Graph from Backblaze (http://www.backblaze.com)


> Other avenues - Cloud pilot

   • BHL is participating in a pilot with New
     York Public Library and Duraspace

   • Duraspace would provide a link to
     cloud providers

   • pilot to show feasibility of hosting

   • testing use of image server, other
     services in the cloud

   • cloud could seed new clusters




> Code (63 6f 64 65)

   • all of our code and configurations are
     open source

   • hosted on Google Code

   • get involved

   • join the mailing lists

   • follow us on Twitter

   • ask questions, we'll help!
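The hexadecimal digits in this slide's title are the ASCII byte values of the word itself, which a short snippet can confirm:

```python
# Decode the hex bytes from the slide title as ASCII text.
title_hex = "63 6f 64 65"
decoded = bytes.fromhex(title_hex).decode("ascii")
print(decoded)  # → code
```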




> It’s your turn...


   • similar projects?

   • distributed services and processing?

   • where can this be best applied?

   • resilient services on top of storage

              • names processing?

              • LSID resolution pools?

              • image processing?

              • text-mining / NLP?

              • #biodiv webservices?



Web: http://www.biodiversitylibrary.org/
         Code, Support: http://code.google.com/p/bhl-bits
         Twitter: @BioDivLibrary (tag #bhl)




  Phil Cryer                          Anthony Goddard

  Missouri Botanical Garden           MBLWHOI Library
  Biodiversity Heritage Library       Biodiversity Heritage Library

      phil.cryer@mobot.org               agoddard@mbl.edu
      http://philcryer.com               http://anthonygoddard.com
      @fak3r                             @anthonygoddard



Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Último (20)

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Building A Scalable Open Source Storage Solution

  • 1. Building a scalable, open source storage and processing solution for biodiversity data • Phil Cryer • Anthony Goddard • Thursday, November 12, 2009
  • 2. > Biodiversity Heritage Library's data • all BHL storage is handled by the Internet Archive • 38,000+ scanned books • approximately 48 terabytes of data • unable to self-host
  • 3. > BHL - Europe • 3-year, EU-funded project • 28 major natural history museums, botanical gardens and other cooperating institutions • third file-store of all BHL data • collecting cultural heritage from all over Europe
  • 4. > Data explosion • more data being created • more data being saved • more data tomorrow • storage has not kept up with Moore’s Law • this presentation will be saved online, more data!
  • 7. > Problem 1 - Data access • file size we can’t store • latency of large files • quality user experience • processing data-mining
  • 8. > Problem 1 - Data access • file size we can’t store • latency of large files • quality user experience • processing data-mining • Access denied...
  • 9. > Problem 2 - Copyright concerns • international copyright concerns • potential related funding issues • we’d rather not let this be an issue
  • 10. > Problem 3 - Redundancy
  • 11. > Problem 3 - Redundancy • computers crash • hard drives die • networks fail • natural disasters occur
  • 12. > Problem 3 - Redundancy • computers crash • hard drives die • networks fail • natural disasters occur • but... This is NOT a problem!
  • 13. ...so plan for it.
  • 18. > Site 1 - Internet Archive
  • 19. > Site 2 - MBL, Woods Hole
  • 20. > Site 3 - NHM, London ...followed by new Data center
  • 21. Data Centre – “Darwin Repository” • €600,000 funding secured from eContentPlus • Suitable location found with very good development potential in collaboration with the Science Museum. • Economy of scale provides additional avenues for co-development of services. – Disaster Recovery and Business Continuity for all Museums (help with ongoing and running costs) • DCMS funding sought to help with development. – e-Infrastructure European initiative • Building Digital Repositories for Scientific Communities – PESI (Biodiversity)
  • 22. Proposed Data Centre Location [map: Science Museum site at Wroughton, near Swindon; ©2008 Google, imagery ©2008 DigitalGlobe, Infoterra Ltd & Bluesky, GeoEye, map data ©2008 Tele Atlas]
  • 23. Vendor Stakeholders / Partners • Identified Technology Partners* • Additional Funding Partners* *Note: Discussions are ongoing with all Partners and may be at different stages
  • 24. Long Term Sustainability • No Dripping Tap – Business case should provide for significant self-funding opportunities. • Diversity – Darwin Repository (Data Centre) will provide an economy of scale that will provide significant efficiency gains. • Green technology to minimise carbon footprint and provide industry leadership.
  • 25. > Distributed storage • write once, read anywhere • replication and fault tolerance • error correction • automatic redundancy • scalable horizontally
  • 26. > Distributed storage - Options • fully hosted storage (cloud) • hosted with own storage (private cloud) • self hosted with proprietary hardware (Sun Thumper) • self hosted with commodity hardware
  • 27. > Distributed storage - GlusterFS • GlusterFS: a cluster file-system capable of scaling to several petabytes • open source software on commodity hardware • tunable performance • simple to install and manage • offers seamless expansion
  • 28. > Distributed storage - Archival • Fedora Commons is an open source repository • accounts for all changes, so built-in version control • provides disaster recovery • open standards to mesh with future file formats • provides open sharing services such as OAI-PMH
  • 29. > Distributed storage - Mirrored data • now we have redundancy • in fact, multiple redundant copies • provides fault tolerance • offers load balancing • gives us future geographical distribution
  • 30. > Now we have lots of computers...
  • 31. > Distributed processing • more abilities available than just storing data • with distributed storage comes distributed processing • distributed processing means faster answers • faster answers mean new questions • lather, rinse, repeat
  • 32. > Distributed processing • make your data more useful • image and OCR processing • distributed web services • identifier resolution pools • map/reduce frameworks • generate new visualizations, text mining, NLP
  • 33. > Distributed processing [diagram; labels: Request, Load Balancer, WebService (×2), TaxonFinder (×5), Cluster Node, Cluster Site]
  • 34. > Some assembly required (optional) • our example uses new, faster commodity hardware • but it could run on any hardware that can run Linux • you could chain old "outdated" computers together • build your own cluster for next to nothing (host it in your basement) • solves some infrastructure funding issues • hardware vendor neutrality
  • 35. > Our proof of concept • we ran a six-box cluster to demonstrate GlusterFS • ran stock Debian GNU/Linux • simulated hardware failures • synced data with a remote cluster • ran map/reduce jobs • defined procedures, configurations and build scripts
  • 38. [architecture stack diagram] • presentation: www, REST API, mod_glusterfs • processing: disco, Hadoop • metadata: Fedora Commons, Mulgara triplestore / rdf • sync support: rsync, BitTorrent, HTTP • file system: GlusterFS, Ext4 (exabyte), ‘Network RAID’ • storage: Raw disk array, Commodity SATA controllers, Commodity Hosts, Dedicated Storage Network
  • 39. > Distributed storage - Projected costs • $246,000 • Graph from Backblaze (http://www.backblaze.com)
  • 40. > Other avenues - Cloud pilot • BHL is participating in a pilot with New York Public Library and Duraspace • Duraspace would provide a link to cloud providers • pilot to show feasibility of hosting • testing use of image server, other services in the cloud • cloud could seed new clusters
  • 41. > Code (63 6f 64 65) • all of our code and configurations are open source • hosted on Google Code • get involved • join the mailing lists • follow us on Twitter • ask questions, we'll help!
  • 42. > It’s your turn... • similar projects? • distributed services and processing? • where can this be best applied? • resilient services on top of storage • names processing? • LSID resolution pools? • image processing? • text-mining / NLP? • #biodiv webservices?
  • 43. Web: http://www.biodiversitylibrary.org/ • Code, Support: http://code.google.com/p/bhl-bits • Twitter: @BioDivLibrary (tag #bhl) • Phil Cryer, Missouri Botanical Garden, Biodiversity Heritage Library, phil.cryer@mobot.org, http://philcryer.com, @fak3r • Anthony Goddard, MBLWHOI Library, Biodiversity Heritage Library, agoddard@mbl.edu, http://anthonygoddard.com, @anthonygoddard
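
Slides 32 and 35 mention map/reduce jobs without showing one. The following is a minimal sketch of the idea in plain Python, not the presenters' code and not the Disco or Hadoop API: the sample pages, the `KNOWN_NAMES` list, and the function names are all hypothetical, and a simple substring lookup stands in for a real name-finding service such as TaxonFinder. Each page is mapped to a partial count of names, and the partial counts are then reduced into one total.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data: page identifiers mapped to OCR text.
PAGES = {
    "page-1": "Quercus alba grows near Acer rubrum.",
    "page-2": "Acer rubrum and Acer saccharum were collected.",
    "page-3": "No names on this page.",
}

# Hypothetical lookup list standing in for a real name-finding service.
KNOWN_NAMES = ("Quercus alba", "Acer rubrum", "Acer saccharum")


def map_page(text):
    """Map step: count occurrences of each known name in one page."""
    return Counter({name: text.count(name) for name in KNOWN_NAMES if name in text})


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_names(pages):
    """Run the map step over every page, then fold the partial results."""
    partials = map(map_page, pages.values())
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    for name, total in sorted(count_names(PAGES).items()):
        print(name, total)
```

Because the map step looks at one page at a time and the reduce step only merges counts, the same two functions could in principle be handed to a cluster framework, with pages spread across nodes; here they run sequentially on one machine to keep the sketch self-contained.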