Goddard et al. BMC Bioinformatics 2011, 12(Suppl 15):S5
http://www.biomedcentral.com/1471-2105/12/S15/S5




RESEARCH    Open Access

Data hosting infrastructure for primary biodiversity data
Anthony Goddard1*, Nathan Wilson2, Phil Cryer3, Grant Yamashita4

* Correspondence: agoddard@mbl.edu
1 Center for Library and Informatics, Woods Hole Marine Biological Laboratory, Woods Hole, MA 02543, USA. Full list of author information is available at the end of the article.

  Abstract
  Background: Today, an unprecedented volume of primary biodiversity data are being generated worldwide, yet
  significant amounts of these data have been and will continue to be lost after the conclusion of the projects
  tasked with collecting them. To get the most value out of these data it is imperative to seek a solution whereby
  these data are rescued, archived and made available to the biodiversity community. To this end, the biodiversity
  informatics community requires investment in processes and infrastructure to mitigate data loss and provide
  solutions for long-term hosting and sharing of biodiversity data.
  Discussion: We review the current state of biodiversity data hosting and investigate the technological and
  sociological barriers to proper data management. We further explore the rescuing and re-hosting of legacy data,
  the state of existing toolsets and propose a future direction for the development of new discovery tools. We also
  explore the role of data standards and licensing in the context of data hosting and preservation. We provide five
  recommendations for the biodiversity community that will foster better data preservation and access: (1)
  encourage the community’s use of data standards, (2) promote the public domain licensing of data, (3) establish a
  community of those involved in data hosting and archival, (4) establish hosting centers for biodiversity data, and
  (5) develop tools for data discovery.
  Conclusion: The community’s adoption of standards and development of tools to enable data discovery is
  essential to sustainable data preservation. Furthermore, the increased adoption of open content licensing, the
  establishment of data hosting infrastructure and the creation of a data hosting and archiving community are all
  necessary steps towards the community ensuring that data archival policies become standardized.


© 2011 Goddard et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction
Today, an unprecedented volume of primary biodiversity data are being generated worldwide [1], yet significant amounts of these data have been and will continue to be lost after the conclusion of the projects tasked with collecting them [2]. Gray et al. [3] make a distinction between ephemeral data, which, once collected, can never be collected again, and stable data, which can be recollected. The extinction of species, habitat destruction and the related loss of rich sources of biodiversity make ephemeral a significant amount of data that have historically been assumed to be stable. Whether the data are stable or ephemeral, however, poor record keeping and data management practices nevertheless lead to loss of data [4]. As a result, biodiversity data collected today are as endangered as the species they represent. There are also important questions of access to and interpretation of the data. Inaccessible data are effectively lost until they are made accessible. Moreover, data that are misrepresented or easily misinterpreted can result in conclusions that are even more inaccurate than those that would be drawn if the data were simply lost.

Although it is in the best interest of all who create and use biodiversity data to encourage best practices to protect against data loss, the community still requires additional effective incentives to participate in a shared data environment and to help overcome existing social and cultural barriers to data sharing. Separated silos of data from disparate groups presently dominate the global infrastructure for biodiversity data. There are some examples of projects working to bring the data out of those silos and to encourage sharing between the various projects.
Examples include the Global Biodiversity Information Facility (GBIF) data portal [5] and the Encyclopedia of Life (EOL) [6]. Each of them works to bring together data from their partners [7,8] and makes those data available through their application programming interfaces (APIs). In addition, there are projects, including ScratchPads [9] and LifeDesks [10], that allow taxonomists to focus on their area of expertise while automatically making the data they want to share available to the larger biodiversity community.

Although there is a strong and healthy mix of platforms and technologies, there remains a gap in standards and processes, especially for the 'long tail' of smaller projects [11]. In analyzing existing infrastructure, we see that large and well-funded projects predictably have more substantial investments in infrastructure, making use of not only on-site redundancy but also remote mirrors. Smaller projects, on the other hand, often rely on manual or semi-automated data backup procedures, with little time or resources for comprehensive high-availability or disaster recovery considerations.

It is therefore imperative to seek a solution to this data loss and ensure that data are rescued, archived and made available to the biodiversity community. Although there are broad efforts to encourage the use of best practices for data archiving [12-14], citation of data [15,16] and curation [17], as well as large archives that are focused on particular types of biology-related data, including sequences [18-20], ecosystems [21], observations (GBIF) and species descriptions (EOL), none of these efforts are focused on the long-term preservation of biodiversity data. To this end the biodiversity informatics community requires investment in trustworthy processes and infrastructure to mitigate data loss [22] and needs to provide solutions for long-term hosting and storage of biodiversity data. We propose the construction of Biodiversity Data Hosting Centers (BDHCs), which are charged with the task of mitigating the risks presented here through the careful creation and management of the infrastructure necessary to archive and manage biodiversity data. As such, they will provide a future safeguard against loss of biodiversity data.

In laying out our vision of BDHCs, we begin by categorizing the different kinds of biodiversity data that are found in the literature and in various datasets. We then lay out some features and capabilities that BDHCs should possess, with a discussion of standards and best practices that are pertinent to an effective BDHC. After a discussion of current tools and approaches for effective data management, we discuss some of the technological and cultural barriers to effective data management and preservation that have hindered the community. We end with a series of recommendations for adopting and implementing data management and preservation changes.

We acknowledge that biodiversity data range widely, both in format and in purpose. Although some authors carefully restrict the use of the term 'data' to verifiable facts or sensory stimuli [23], for the purposes of this article we intend a very broad understanding of the term, covering essentially everything that can be represented using digital computers. Although not all biodiversity data are in digital form, we restrict our discussion here to digital data that relate in some way to organisms. Usually the data are created to serve a specific purpose, ranging from ecosystem assessment to species identification to general education. We also acknowledge that published literature is a very important existing repository for biodiversity data, and every category of biodiversity data we discuss may be published in the traditional literature as well as in more digitally accessible forms. Consequently, any of the data discussed here may have relevant citations that are important for finding and interpreting the data. However, given that citations have such a rich set of standards and supporting technologies, they will not be addressed here, as that would distract from the primary purpose of this paper.

Categories of data and examples
We start this section with a high-level discussion of the types of data and challenges related specifically to biodiversity data. This is followed by a brief review of various types of existing Internet-accessible biodiversity data sources. Our intent is not to be exhaustive, but rather to demonstrate the wide variety of biodiversity data sources and to point out some of the differences between them. The choice of sites discussed is based on a combination of the authors' familiarity with the particular tools and their widespread use.

Most discussions of the existing data models for managing biodiversity data are either buried in what documentation exists for specific systems, for example, the Architectural View of GBIF's Integrated Publishing Toolkit (IPT) [24], or the data model is discussed in related informally published presentations, for example John Doolan's 'Proposal for a Simplified Structure for EMu' [25]. One example of a data model in the peer-reviewed literature is Rich Pyle's Taxonomer [26] system. However, this is again focused on a particular implementation of a particular data model for a particular system.

Raw observational data are a key type of biodiversity data and include textual or numeric field notes, photographs, video clips, sound files, genetic sequences or any other sort of recorded data file or dataset based on the observation of organisms.
Although the processes involved in collecting these data can be complex and may be error prone, these data are generally accepted as the basic factual data from which all other biodiversity data are derived. In terms of sheer volume, raw observational data are clearly the dominant type of data. Another example is nomenclatural data. These data are the relationships between various names and are thus very compact and abstract. Nomenclatural data are generally derived from a set of nomenclatural codes (for example, the International Code of Botanical Nomenclature [27]). Although the correct application of these codes in particular circumstances may be a source of endless debate among taxonomists, the vast majority of these data are not a matter of deep debate. However, descriptions of named taxa, particularly above the species level, are much more subjective, relying on both the historical literature and fundamentally on the knowledge and opinions of their authors. Digital resources often do a poor job of acknowledging, much less representing, this subjectivity.

For example, consider a system that records observations of organisms by recording the name, date, location and observer of the organism. Such a system can conflate objective, raw observational data with a subjective association of those data with a name based on some poorly specified definition of that name. Furthermore, most of the raw observational data are then typically discarded, because the observer fails to record the raw data that they used to make the determination or because the system does not provide a convenient way to associate raw data, such as photographs, with a particular determination. Similarly, the person making the identification typically neglects to be specific about what definition of the name they intend. Although this example raises a number of significant questions about the long-term value of such data, the most significant issues for our current purpose relate to combining data from multiple digital resources. The data collected and the way that different digital resources are used are often very different. As a result it is very important to be able to reliably trace the origin of specific pieces of data. It is also important to characterize and document the various data sources in a common framework.

Currently, a wide variety of biodiversity data are available through digital resources accessible through the Internet. These range from sites that focus on photographs of organisms from the general public, such as groups within Flickr [28], to large diverse systems intended for a range of audiences, such as EOL [6], to very specialized sites focused on the nomenclature of a specific group of organisms, such as Systema Dipterorum [29] or the Index Nominum Algarum [30]. Other sites focused on nomenclature include the International Plant Names Index [31], Index Fungorum [32] and the Catalog of Fishes [33]. Several sites have begun developing biodiversity data for the semantic web [34], including the Hymenoptera Anatomy Ontology [35] and TaxonConcept.org [36]. In addition, many projects include extensive catalogs of names, because names are key to bringing together nearly all biodiversity data [37]. Examples include the Catalog of Life [38], WoRMS [39], ITIS [40] and ION [41]. Many sources for names, including those listed above, are indexed through projects such as the Global Names Index [42] and uBio [43].

Some sites focus on curated museum collections (for example, the Natural History Museum, London [44], and the Smithsonian Institution [45], among many others). Others focus on descriptions of biological taxa, ranging from original species descriptions (the Biodiversity Heritage Library [46] and MycoBank [47]) to up-to-date technical descriptions (FishBase [48], the Missouri Botanical Garden [49] and the Atlas of Living Australia [50]) and descriptions intended for the general public (Wikipedia [51] and the BBC Wildlife Finder [52]).

As described above, identifications are often conflated with raw observations or names. However, there are a number of sites set up to interactively manage newly proposed identifications. Examples include iSpot [53], iNaturalist [54], ArtPortalen.se [55] and Nationale Databank Flora en Fauna [56], which accept observations of any organisms, and sites with more restricted domains, such as Mushroom Observer [57], eBird [58] or BugGuide [59]. The GBIF data portal provides access to observational data from a variety of sources, including many government organizations. Many sites organize their data according to a single classification that reflects the current opinions and knowledge of the resource managers. Examples of such sites whose classifications come reasonably close to covering the entire tree of life are the National Center for Biotechnology Information (NCBI) taxonomy [60], the Tree of Life Web [61], the Catalog of Life [38], the World Register of Marine Species (http://www.marinespecies.org/) and the Interim Register of Marine and Nonmarine Genera [62]. EOL [6] gathers classifications from such sources and allows the user to choose a preferred classification for accessing its data. Finally, some sites, such as TreeBASE [63] and PhylomeDB [64], focus on archiving computer-generated phylogenies based on gene sequences.

Features of biodiversity data hosting centers
Most of the key features of a BDHC are common to the general problem of creating publicly accessible, long-term data archives. Obviously, the data stored in the systems are distinctive to the biodiversity community, such as images of diagnostic features of specific taxa. However, the requirements to store images and the standards used for storing them are widespread.
In addition, significant portions of the data models, and thus communication standards and protocols, are distinctive to the biodiversity community; examples include the specific types of relationships between images, observations, taxon descriptions and scientific names.

Government, science, education and cultural heritage communities, among others, are faced with many of the same general challenges that face the biodiversity informatics community when it comes to infrastructure and processes to establish long-term preservation and access of content. The NSF DataNet Program [65] was created with this specific goal in mind. The Data Conservancy [66] and the DataONE project [67], both funded by DataNet, are seeking to provide long-term preservation, access and reuse for their stakeholder communities. The US National Oceanic and Atmospheric Administration [68] is actively exploring guidelines for archiving environmental and geospatial data [69]. The wider science and technology community has also investigated the challenges of preservation and access of scientific and technical data [70]. Such investigations and projects are too numerous to cover in the context of this paper. However, every effort should be made by the biodiversity informatics community to build from the frameworks and lessons learned by those who are tackling these same challenges in areas outside biodiversity.

The primary goal of BDHCs is to substantially reduce the risk of loss of biodiversity data. To achieve this goal they must take long-term data preservation seriously. Fortunately, this need is common to many other areas of data management, and many excellent tools exist to meet it. In addition to the obvious data storage requirements, data replication and effective metadata management are the primary technologies required to mitigate the dangers of data loss.

Another key feature of BDHCs is to make the data globally available. Although any website is, in a sense, globally available, the deeper requirements are to make that availability reliable and fast. Data replication across globally distributed data centers is a well-recognized approach to making data consistently available across the world.

Finally, the use and promotion of standards is a key feature for BDHCs. Standards are the key component in enabling effective, scalable data transfer between independently developed systems. The key argument for standards in BDHCs is that they remove the need for every pair of systems that must communicate to build a custom translator between them. Everyone can aim at supporting the standards rather than figuring out how to interoperate with everyone else on a case-by-case basis. If the number of players is close to the number of standards, then there is no point in standardizing. If each player has its own 'standard', then the total amount of work that has to be done by the community goes up as the square of the number of players (roughly speaking, an N² problem, with N being the number of players). If, however, some community standards are used, the amount of work for the community is N×S (with S being the number of standards used in the community). As S approaches N, there is no point in standardizing, but as S becomes much less than N, the total amount of support required by the community to account for the various standards decreases.
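To make the arithmetic concrete, the following toy Python sketch (our illustration; the player and standard counts are invented) compares the two regimes:

```python
# Toy illustration of the scaling argument above (numbers are invented).

def pairwise_translators(n: int) -> int:
    """Custom translators needed if every pair of systems interoperates directly."""
    return n * (n - 1) // 2  # grows as N^2

def standards_based(n: int, s: int) -> int:
    """Implementations needed if each of N systems supports S shared standards."""
    return n * s  # grows as N x S

for n, s in [(10, 2), (50, 2), (50, 10)]:
    print(f"N={n}, S={s}: {pairwise_translators(n)} pairwise translators "
          f"vs {standards_based(n, s)} standard implementations")
# N=50, S=2 gives 1225 translators vs 100 implementations; the gap widens as N grows.
```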
A full account of standards and their importance to robust technical systems is beyond the scope of this article. However, we accept that the use of standards to facilitate data preservation and access is very important for BDHCs. In this context, we present a general overview of common standards used in biodiversity systems in the next section.

Tools, services and standards
Here we summarize some of the tools, services and standards that are used in biodiversity informatics projects. Although an exhaustive list is beyond the scope of this article, we attempt to present tools that are widely known and used by most members of the community.

Tools for replicating data
TCP/HTTP protocols allow simple scripting of command-line tools, such as wget or curl, to transfer data. When this is not an option and there is access via a login shell such as OpenSSH, other standard tools can be used to replicate and mirror data. Examples of these are rsync, sftp and scp.

Peer-to-peer (P2P) file sharing includes technologies such as BitTorrent [71], which is a protocol for distributing large amounts of data. BitTorrent is the framework for the BioTorrents project [72], which allows researchers to rapidly share large datasets via a network of pooled bandwidth systems. This open sharing allows one user to collect pieces of the download from multiple providers, increasing the efficiency of the file transfer while simultaneously providing the downloaded bits to other users. The BioTorrents website [73] acts as a central listing of datasets available to download, and the BitTorrent protocol allows data to be located on multiple servers. This decentralizes the data hosting and distribution and provides fault tolerance. All files are then integrity checked via checksums to ensure they are identical on all nodes.
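A replication workflow built from these standard tools can be as simple as an rsync mirror followed by a checksum audit. The sketch below is ours; the remote host, paths and checksum manifest are placeholders:

```python
"""Minimal mirror-and-verify sketch using the standard tools named above.
The remote host, paths and checksum manifest are illustrative placeholders."""
import hashlib
import subprocess
from pathlib import Path

def mirror(source: str, dest: str) -> None:
    # rsync transfers only changed files; -a preserves attributes, -z compresses.
    subprocess.run(["rsync", "-az", "--delete", source, dest], check=True)

def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(dest: str, manifest: dict[str, str]) -> list[str]:
    """Return the names of mirrored files whose checksums do not match."""
    return [name for name, expected in manifest.items()
            if sha256(Path(dest) / name) != expected]

if __name__ == "__main__":
    mirror("backup@data.example.org:/datasets/occurrences/", "/archive/occurrences/")
    # The manifest of expected checksums would be published by the provider.
    print("corrupt or missing:", verify("/archive/occurrences/", {}))
```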
other to create a custom translator between each pair of
systems. Everyone can aim at supporting the standards          Tools for querying data providers
rather than figuring out how to interoperate with every-       OpenURL [74] provides a standardized URL format that
one else on a case by case basis. If the number of             enables a user to find a resource they are allowed to
players is close to the number of standards, then there        access via the provider they are querying. Originally
is no point in standardizing. If each player has its own       used by librarians to help patrons find scholarly articles,
The Biodiversity Heritage Library (BHL)'s OpenURL resolver [75] is an API that was launched in 2009 and continues to offer a way for data providers and aggregators to access BHL material. Any repository containing citations to biodiversity literature can use this API to determine whether a given book, volume, article and/or page is available online through BHL. The service supports both the OpenURL 0.1 and OpenURL 1.0 query formats and can return its response in JSON, XML or HTML formats, providing flexibility for data exchange.
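For example, a client might check BHL for a citation with a request along the following lines. This is a sketch: the endpoint and parameter names reflect our reading of the service and should be checked against BHL's documentation.

```python
"""Sketch of an OpenURL 0.1-style lookup against the BHL resolver.
The endpoint and parameters here are illustrative, not authoritative."""
import json
import urllib.parse
import urllib.request

BASE = "https://www.biodiversitylibrary.org/openurl"  # assumed resolver endpoint

params = {
    "genre": "book",
    "title": "Species plantarum",
    "date": "1753",
    "format": "json",  # the service can also return XML or HTML
}
with urllib.request.urlopen(BASE + "?" + urllib.parse.urlencode(params)) as resp:
    reply = json.load(resp)

# A successful reply lists matching BHL items with links to the scanned pages.
print(reply)
```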
Tools for metadata and citation exchange
OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting, usually referred to as simply OAI) [76] is an established protocol that provides an application-independent framework for harvesting metadata. Using XML over HTTP, OAI harvests metadata descriptions of repository records so that servers can be built using metadata from many unrelated archives. The base implementation of OAI must support metadata in the Dublin Core format (a metadata standard for describing a wide range of networked resources), but support for additional representations is available. Once an OAI service is initially harvested, future harvests will check only for new or changed records, making it an efficient protocol and one that is easily set up as a repetitive task that runs automatically at regular intervals.
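In practice an incremental harvest is a short loop around the ListRecords verb, following resumption tokens until the provider reports no more changes. A sketch follows; the endpoint is hypothetical, while the verbs, the oai_dc prefix and the resumption-token mechanism come from the OAI-PMH specification:

```python
"""Incremental OAI-PMH harvest sketch; the endpoint is a placeholder."""
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "https://repository.example.org/oai"  # hypothetical provider

def list_records(endpoint: str, since: str):
    """Yield records changed since `since`, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "from": since}
    while True:
        url = endpoint + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            tree = ET.parse(resp)
        for record in tree.iter(OAI + "record"):
            yield record
        token = tree.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # harvest complete
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for record in list_records(ENDPOINT, since="2011-01-01"):
    print(record.findtext(f"{OAI}header/{OAI}identifier"))
```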
CiteBank [77] is an open access repository for biodiversity publications, published by the BHL, that allows sharing, categorizing and promoting of citations. Citations are harvested from resources via the OAI-PMH protocol, which seamlessly deals with updates and changes from remote providers. Consequently, all data within CiteBank are also available to anyone via OAI, thus creating a new discovery node. Tools are available that allow users to upload individual citations or whole collections of bibliographies. Open groups may be formed around various categories and can assist in moderating and updating citations.

Tools for data exchange
LOCKSS (Lots of Copies Keep Stuff Safe) [78] is an international community program, based at Stanford University Libraries, that uses open source software and P2P networking technology to maintain a large, decentralized and replicated digital repository. Using off-the-shelf hardware and requiring very little technical expertise, the system preserves an institution's content while providing preservation for other content: a virtual shared space where all nodes in the network support each other and sustain authoritative copies of the original content. LOCKSS is OAIS (Open Archival Information System) [79] compliant and preserves all genres and formats of web content, preserving both the historical context and the intellectual content. A 'LOCKSS box' collects content from the target sites using a web crawler similar to the ones search engines use to discover content. It then watches those sites for changes, caches the content so that the box can act as a web proxy or cache in case the target system is ever down, and provides a web-based administration panel to control what is being audited, how often and who has access to the material.

DiGIR (Distributed Generic Information Retrieval) [80] is a client/server protocol for retrieving information from distributed resources. Using HTTP to transport Darwin Core XML between the client and server, DiGIR is a set of tools to link independent databases into a single, searchable virtual collection. Although it was initially targeted to deal only with species data, it was later expanded to work with any type of information and was integrated into a number of community collection networks, in particular GBIF. At its core, DiGIR provides a search interface between many dissimilar databases using XML as a translator. When a search query is issued, the DiGIR client application sends the query to each institution's DiGIR provider, where it is translated into an equivalent request that is compatible with the local database. Thus the provider can deal with the search even though the details of the underlying database are hidden, allowing a uniform virtual view of the contents of the network. Network speed and availability were major concerns of the DiGIR system; nodes would time out before requests could be processed, resulting in failed queries. The functionality of DiGIR was superseded by the TAPIR protocol.
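The pattern shared by DiGIR and its successors is a structured XML filter posted over HTTP, which each provider translates into a query against its local database. The sketch below illustrates only that pattern; the element names are simplified stand-ins, not the actual DiGIR schema:

```python
"""Illustration of the DiGIR-style request pattern, not the real wire format."""
import urllib.request
import xml.etree.ElementTree as ET

query = """<?xml version="1.0"?>
<request>
  <search>
    <equals field="darwin:Genus">Puma</equals>
  </search>
</request>"""

req = urllib.request.Request(
    "http://provider.example.org/digir",  # hypothetical provider endpoint
    data=query.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
)
with urllib.request.urlopen(req) as resp:
    records = ET.parse(resp).findall(".//record")  # simplified response element
print(len(records), "matching records")
```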
BioCASE (the Biological Collection Access Service for Europe) [81] "is a transnational network of biological collections". BioCASE provides access to collection and observational databases by providing an XML abstraction layer in front of a database. Although its search and retrieval framework is similar to DiGIR's, it is nevertheless incompatible with DiGIR. BioCASE uses a schema based on the Access to Biological Collections Data (ABCD) schema [82].
TAPIR is a current Taxonomic Databases Working Group (TDWG) standard that provides an XML API protocol for accessing structured data. TAPIR combines and extends features of BioCASE and DiGIR to create a more generic means of data interchange; the XML request and response for access can be stored in a distributed database. A task group oversees the development of the protocol's standard as well as the maintenance of the software.

Although DiGIR providers are still available, their numbers have diminished since the protocol's inception and now stand at around 260 [83]. BioCASE, meanwhile, has approximately 350 active nodes [84]. TAPIR is being used as the primary collection method by GBIF, which continues to maintain and add functionality to the project. Institutions such as the Missouri Botanical Garden host all three services (DiGIR, BioCASE and TAPIR) to allow ongoing open access to their collections. Many projects will endeavor to maintain legacy tools for as long as there is demand. This support is helpful for systems that have already invested in legacy tools, but it is important to promote new tools and standards to ensure their adoption as improvements are made. In so doing, there needs to be a clear migration path from legacy systems to the new ones.

Distributed computing
Infrastructure as a service
Cloud infrastructure services provide server virtualization environments as a service. When a user requires a number of servers to store data or run processing tasks, they simply rent computing resources from a provider, who provides the resources from a pool of available equipment. From there, virtual machines can be brought online and offline quickly, with the user paying only for the actual services used or resources consumed. The best known implementation of this is the Amazon Elastic Compute Cloud (Amazon EC2) [85].

Distributed processing
Using common frameworks to share processing globally enables researchers to use far more computing power than was previously possible. By applying this additional computing power, projects can leverage techniques such as data mining to open up new avenues of exploration in biodiversity research. After computational targets are selected, individual jobs are distributed to processing nodes until all work is completed. Connecting disparate systems only requires a secure remote connection, such as OpenSSH [86], over the internet.

Apache Hadoop [87] is a popular open source project from Yahoo that provides a Java software framework to support distributed computing. Running on large clusters of commodity hardware, Hadoop uses its own distributed file system (HDFS) to connect various nodes and provides resilience against intermittent failures. Its computational method, map/reduce [88], passes small fragments of work to other nodes in the cluster and directs further jobs while aggregating the results.

Linking multiple clusters of Hadoop servers globally would present biodiversity researchers with a community utility. GBIF developers have already worked extensively with Hadoop in a distributed computing environment, so much of the background research into the platform and its application to biodiversity data has been done. In a post entitled 'Hadoop on Amazon EC2 to generate Species by Cell Index' [89], GBIF generated a 'species per cell index' map of occurrence data across an index of the entire GBIF occurrence record store, where each cell represented an area of one degree latitude by one degree longitude. This map drew on over 135 million records and was processed using 20 Amazon EC2 instances running Linux, Hadoop and Java. The map generation was completed in 472 seconds and was used to show that processing tasks could be parceled out over many Hadoop instances to come up with a unified result. Because Hadoop runs in Java it is very simple to provision nodes and link them together to form a private computing network.
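The species-per-cell job maps naturally onto map/reduce: the mapper assigns each occurrence record to a one-degree cell, and the reducer counts distinct species per cell. The sketch below, in the Hadoop Streaming style (stdin to stdout), is our reconstruction of that kind of job, not GBIF's code:

```python
#!/usr/bin/env python
"""Species-per-cell sketch in the Hadoop Streaming style (our illustration)."""
import sys
from math import floor

def mapper():
    # Input: tab-separated occurrence records: species, latitude, longitude.
    for line in sys.stdin:
        species, lat, lon = line.rstrip("\n").split("\t")[:3]
        cell = f"{floor(float(lat))}:{floor(float(lon))}"  # one-degree cell key
        print(f"{cell}\t{species}")

def reducer():
    # Hadoop sorts mapper output by key, so each cell's species arrive together.
    cell, species_seen = None, set()
    for line in sys.stdin:
        key, species = line.rstrip("\n").split("\t")
        if key != cell:
            if cell is not None:
                print(f"{cell}\t{len(species_seen)}")
            cell, species_seen = key, set()
        species_seen.add(species)
    if cell is not None:
        print(f"{cell}\t{len(species_seen)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```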

Standards used in the biodiversity community
The biodiversity community uses a large range of data-related standards, ranging from explicit data file formats, to extensions to metadata frameworks, to protocols. Larger data objects such as various forms of media, sequence data or other types of raw data are addressed by standardized data file formats, such as JPEG [90], PNG [91], MPEG [92] and OGG [93]. For more specific forms of data, standards are usually decided by the organization governing that data type. The Biodiversity Information Standards [94] efforts are intended to address the core needs for biodiversity data. The metadata standard Darwin Core [95] and the related GBIF-developed Darwin Core Archive [96] file format, in particular, address most of the categories of data discussed above. The current Darwin Core standard is designed to be used in XML documents for exchanging data and is the most common format for sharing zoological data. It is also worth noting that the Darwin Core Archive file format is explicitly designed to address the most common problems that researchers encounter in this space, but not necessarily all of the potential edge cases that could be encountered in representing and exchanging biodiversity data.
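As a small illustration of the standard's flavor, the sketch below emits a single occurrence record using genuine Darwin Core term names; the record content itself is invented:

```python
"""Emit one occurrence record using Darwin Core terms (content is invented)."""
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/terms/"
ET.register_namespace("dwc", DWC)

record = ET.Element(f"{{{DWC}}}Occurrence")
for term, value in {
    "occurrenceID": "urn:example:obs:1234",  # hypothetical identifier
    "scientificName": "Puma concolor",
    "eventDate": "2011-06-15",
    "decimalLatitude": "41.52",
    "decimalLongitude": "-70.67",
    "basisOfRecord": "HumanObservation",
}.items():
    ET.SubElement(record, f"{{{DWC}}}{term}").text = value

print(ET.tostring(record, encoding="unicode"))
```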
The TAPIR [97] standard specifies a protocol for querying biodiversity data stores and, along with DiGIR and BioCASE, provides a generic way to communicate between client applications and data providers. Biodiversity-specific standards for data observations and descriptions are also being developed. For example, the EnvO [98] standard is an ontology similar to Darwin Core that defines a wide variety of environments and some of their relationships.
Community acceptance of standards is an ongoing process, and different groups will support various ones at different times. The process for determining a biodiversity-specific standard is generally organized by the Biodiversity Information Standards community (organized by TDWG), centered on a working group or mailing list, and often has enough exposure to ensure that those with a stake in the standard being discussed will have an opportunity to comment on the proposed standard. This process ensures that standards are tailored to those who eventually use them, which further facilitates community acceptance. Nomenclatural codes in taxonomy, for example, are standardized, and members of this discipline recognize the importance of these standards.

Barriers to data management and preservation
Technological barriers
We distinguish between technological and social/cultural barriers to effective data sharing, management and preservation. Technological barriers are those that are primarily due to devices, methods and the processes and workflows that use those devices and methods. Social/cultural barriers are those that arise from both explicit and tacit mores of the larger community, as well as from the procedures, customs and financial considerations of the individuals and institutions that participate in data management and preservation.

The primary technological cause of loss of biodiversity data is the poor archiving of raw observations. In many cases raw observations are not explicitly made, or they are actively deleted after more refined taxon description information has been created from them. The lack of archival storage space for raw observations is a key cause of poor archiving, but simply having a policy of keeping everything is not scalable. At minimum there should be strict controls governing what data are archived. There also must be a method for identifying data that either are not biodiversity-related or have little value for future biodiversity research. Lack of access to primary biodiversity data can be due to a lack of technical knowledge on the part of the creators of the data regarding how to digitally publish them, and often this is caused by a lack of funding to support publication and long-term preservation. In cases in which data do get published, there remain issues surrounding the maintenance of those data sources. Hardware failures and format changes can result in data becoming obsolete and inaccessible if consideration is not given to hardware and file format redundancy.

Another primarily technological barrier to the access and interpretation of biodiversity data is that much of the key literature is not available digitally. Projects such as the BHL are working to address this problem. BHL has already scanned approximately 30 million pages of core biodiversity literature. The project estimates that there are approximately 500 million total pages of core biodiversity literature. Of that, 100 million pages are out of copyright and are waiting to be scanned by the project.

Interoperability presents a further technological barrier to data access and interpretation. In many cases primary biodiversity data cannot be easily exchanged between a source system and a researcher's system. The standards discussed here help to address this problem, but must be correctly implemented in both the source and destination systems.

Finally, there is no existing general solution for user-driven feedback mechanisms in the biodiversity space. Misinterpretation of biodiversity data is largely the result of incomplete raw observations and barriers to the propagation of data. The process of realizing that two raw observations refer to the same taxon (or have some more complex relationship) takes effort, as does the propagation of that new data. To be successful, such a system would need to be open to user-driven feedback mechanisms that allow local changes for immediate use, which then propagate back to the original source of the data. Much of this data propagation could be automated by a wider adoption of standards and better models for the dynamic interchange and caching of data.

Social and cultural barriers
The social and cultural barriers that cause data loss are well known and not specific to biodiversity data [99]. In addition to outright, unintentional data loss, there are well-known social barriers to data sharing. Members of the scientific community can be afraid to share data, since this might allow other scientists to publish research results without explicitly collaborating with or crediting the original creator(s) of the data. One approach for addressing this concern is forming data embargoes, in which the original source data are rendered unavailable for a specified period of time relative to their original collection or subsequent publication. In addition, there are no clear standards for placing a value on biodiversity data. This results in radically different beliefs about the value of such data, and in extreme cases can lead scientists to assert rights over data with expensive or severely limited licensing terms.

The financial cost of good archiving, along with the more general costs of hardware, software development and data creation, are all other forms of social barriers to the preservation of biodiversity data [100]. The cost of creating new raw observations continues to drop, but some types of raw observational data are still extremely expensive or otherwise difficult to acquire, especially if significant travel or field work is required to create the data.
Similarly, access to museum specimens or living organisms can be crucial to accurately interpreting existing data. There are efficient mechanisms to facilitate such access within the academic sphere, but for scientists outside of the academic world this can be a significant barrier to creating more meaningful data.

Overcoming these barriers will require effective incentives. Ultimately, the advantages of having shared, centralized access to the data should serve as their own incentive. This has already happened for genetic sequence data, as reflected by the widespread adoption of GenBank [18]. In comparison, biodiversity data have a much greater historical record than sequence data and, more importantly, a set of legacy conventions that were largely created outside the context of digital data management. As a result, investment in accurately modeling these conventions and providing easy-to-use interfaces for processing data that conform to them is likely to have greater return in the long run. Providing more direct incentives to the data creators could be valuable as a way to get the process started.

Legal and mandatory solutions
Major US grant funding organizations, including the National Institutes of Health [101] and the National Science Foundation [102], are now requiring an up-front plan for data management, including data access and archiving, as part of every grant proposal. The National Science Foundation is taking steps to ensure that data from publicly funded research are made public [103]. Such policies, although designed to ensure that data are more accessible, have implications for data archiving. Mandatory data archiving policies are likely to be very effective in raising awareness of issues surrounding data loss and archiving. However, such policies are neither strictly necessary to ensure widespread adoption of data archiving best practices, nor are they a sufficient solution on their own. Adoption of these policies will assist in ensuring that a project's data are available and archivable.

Recommendations and discussion
Here we recommend five ways in which the community can begin to implement the changes required for sustainable data preservation and access: (1) encourage the community's use of data standards, (2) promote the public domain licensing of data, (3) establish a community of those involved in data hosting and archival, (4) establish hosting centers for biodiversity data, and (5) develop tools for data discovery.

We prioritize the recommendations from short-term (immediate to 1 year) and medium-term (1 to 5 years) to long-term (5 or more years). There is inter-dependency between the recommendations, as having completed one recommendation will make it easier to pursue others.

1. Short-term: encourage community use of data standards
The biodiversity community must work in the short term with associations such as the TDWG standards body to strongly promote biodiversity standards in data transfer, sharing and citation. Key players, such as data providers, must ensure that they work to adopt these standards as a short-term priority and encourage other projects to follow suit. The adoption of standards is paramount to the goal of achieving sustainable data preservation and access and paves the way for interoperability between datasets, discovery tools and archiving systems [104]. The processes for standards establishment and community acceptance need to be as flexible and inclusive as possible, to ensure the greatest adoption of standards. Therefore, an open, transparent and easily accessible method for participating in the standards process is important to ensure that standards meet the community's needs. An interesting example is the use of unique identifiers. Popular solutions in the biodiversity community include GUIDs [105], URLs [106] and LSIDs [107]. This debate has not resolved on a single agreed-on standard and it is not clear that it ever will. Therefore, it is important that BDHCs support all such systems, most likely through an internally managed additional layer of indirection and replication of the sources when possible.
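A toy sketch of such an indirection layer follows (all identifiers and paths are invented): whatever scheme an identifier uses, the hosting center resolves it to its own replicated copies.

```python
"""Toy identifier-indirection sketch; every entry below is invented."""

REPLICAS = {
    # identifier (any scheme) -> locations of archived copies
    "urn:lsid:example.org:taxa:12345": [
        "/archive/eu/taxa/12345.xml",
        "/archive/au/taxa/12345.xml",
    ],
    "http://example.org/specimen/678": [
        "/archive/eu/specimen/678.json",
    ],
    "a1b2c3d4-0000-1111-2222-333344445555": [  # bare GUID
        "/archive/eu/blobs/a1b2c3d4",
    ],
}

def resolve(identifier: str) -> list[str]:
    """Return archived locations regardless of the identifier scheme."""
    return REPLICAS.get(identifier.strip(), [])

print(resolve("urn:lsid:example.org:taxa:12345"))
```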
2. Short-term: promote public domain licensing of content
To minimize the risk of loss, biodiversity data, including images, videos, classifications, datasets and databases, must all be replicable without asking permission of any rights holders associated with those data. Such data must be placed under as liberal a license as possible or, alternatively, placed into the public domain. This position should be adopted in the short term by existing data providers. To avoid long-term issues, these licenses should allow additional data replication without seeking permission from the original rights holders. A variety of appropriate licenses are available from Creative Commons [108]. 'Non-commercial' licenses should be avoided if possible, because they make it more difficult for the data to be shared and used by sites that may require commercial means to support their projects. In such cases, an open, 'share-alike' license that prevents the end user from encumbering the data with exclusive licenses should be used.
Open source code should be stored in an online repository, such as GitHub [109] or Google Code [110], and mirrored locally at a project's institution for redundancy. Making open source code publicly available is an important and often overlooked step for many projects and must be encouraged by funders, not only for the sake of community involvement in a project, but to allow others to maintain and further develop the software should a project's funding conclude. The use of open source software to support the hosting and development of biodiversity informatics tools and projects should also be encouraged by funders. Using commonplace, open source software lowers the barrier to entry for collaborators, further increasing the chances of a project being maintained after funding has ended.

3. Short-term: develop data hosting and archiving community
Encouraging the wider biodiversity community to adopt standards and data preservation practices requires an investment in training resources. In the short term, the community, including projects, institutions and organizations such as GBIF and TDWG, should establish online resources to familiarize researchers with the process of storing, transforming and sharing their data using standards, as well as with the concepts of data preservation and data hosting. Such resources may include screencasts, tutorials and white papers. Even better, intensive, hands-on training programs have proven to be effective in educating domain experts and scientists in technical processes, as can be seen in courses such as the Marine Biological Laboratory's Medical Informatics course, which teaches informatics tools to medical professionals [111]. Funders and organizations such as GBIF should encourage the establishment of such courses for biologists and others working in biodiversity. Partners in existing large-scale informatics projects, such as those of the Encyclopedia of Life or the Biodiversity Heritage Library, as well as institutions such as GBIF regional nodes, would be well placed, with the expertise and resources, to host such courses. An alternative hands-on method that has been effective is to embed scientists within informatics groups [112].

For hosts of biodiversity data, technical proficiency is important, as for any information technology project, and data archiving and sharing are concepts that most systems engineers are capable of facilitating. Given the need for knowledge transfer, a communal group that is comfortable with email and mailing lists is critical for a project's success. Communication within the group can be fostered via an open/public wiki: a shared online knowledge base that can be continuously edited as the tools, approaches and best practices change.

Community input and crowd-sourcing. Crowd-sourcing [113] is defined as the process of taking tasks that would normally be handled by an individual and opening them up to a larger community. Allowing community input into biodiversity datasets via comments, direct access to the data, additions to the data or new datasets derived from the original data are all powerful ways of enhancing the original dataset and building a community that cares about, and cares for, the data. Wikipedia, which allows users to add or edit definitions of any article, is an example of a successful model on a large scale. In the case of biodiversity, there has been success with amateur photographers contributing their images of animals and plants and amateur scientists helping to identify various specimens. Crowd-sourcing is particularly notable as a scalable approach for enhancing data quality. Community input can also work effectively to improve the initial quality of the data and associated metadata [114].

In 2008, the National Library of Australia, as part of the Australian Newspapers Digitization Program (ANDP) and in collaboration with Australian state and territory libraries, enabled public access to selected out-of-copyright Australian newspapers. This service allows users to search and browse over 8.4 million articles from over 830,000 newspapers dating back to 1803. Because the papers were digitized with Optical Character Recognition (OCR) software, they could be searched. Although OCR works well with documents that have consistent typefaces and formats, historical texts and newspapers fall outside this category, leading to less than perfect OCR. To address this, the National Library opened up the collection of newspapers to crowd-sourcing, giving the user community the ability to correct and suggest improvements to the OCR text. The volunteers not only enjoyed the interaction, but also felt that they were making a difference by improving the understanding of past events in their country. By March 2010 over 12 million lines of text had been improved by thousands of users. The ongoing results of this work can be seen on the National Library of Australia's Trove site [115]. Biodiversity informatics projects should consider how crowd-sourcing could be used in their projects and see it as a new way of engaging users and the general public.

Development community to enable biodiversity sharing. A community of developers that have experience with biodiversity standards should be identified and promoted to assist in enabling biodiversity resources to join the network of BDHCs. In general these developers should be available as consultants for funded projects and even included in the original grant applications. This community should also consider a level of overhead-funded work to help enable data resources that lack a replication strategy to join the network. A working example of this can be found in task groups such as the GBIF/TDWG joint multimedia resources task group, which is working to evaluate and share standards for multimedia resources relevant to biodiversity [116].
Industry engagement. Industry has shown, in the case of services such as Amazon's Public Dataset project [117] and Google's Public Data Explorer [118], that there is great benefit in providing users with access to large public datasets. Even when access to a public dataset is simple, working with that dataset is often not as straightforward. Amazon provides a way for the public to interact with these datasets without the onerous groundwork usually required, and benefits in this relationship by selling its processing services to those doing the research. Engaging industry to host data and provide access methods provides tangible benefits to the wider biodiversity informatics community, both in terms of its ability to engage users in developing services on top of the data, with dramatically lower barriers to entry than traditional methods, and from data archival and redundancy perspectives.

4. Medium-term: establish hosting centers for biodiversity data
Key players, such as institutions with access to appropriate infrastructure, should act in the medium term to establish biodiversity data hosting centers (BDHCs). BDHCs would most probably evolve out of existing projects in the biodiversity data community. These centers will need to focus on collaborative development of software architectures to deal with biodiversity-specific datasets, studying the lifecycle of both present and legacy datasets. Consideration should be given to situating these data centers in geographically dispersed locations, to provide an avenue for load sharing of data processing and content delivery tasks as well as physical redundancy. By setting up mechanisms for sharing redundant copies of data in hosting centers around the world, projects can take advantage of increased processing capacity as well as the low latency and high bandwidth available when data are served from a location local to the user. For example, a project in Europe and a project in Australia that both serve each other's data will serve both projects' data from the closest respective location, reducing bandwidth and latency complications when accessing the data. The creation of specialized data archival and hosting centers, centered on specific data types, would provide for specialized archival of specific formats. For example, with many projects hosting DwC archive files within a central hosting center, the center could provide services to maintain the data and keep them in a current and readable format. By providing this service to a large number of data providers, the costs to each provider would be significantly lower than if the data provider were to maintain these formats alone. Such centers would host both standardized, specific formats and provide a general data store area for raw data that are yet to be transformed into a common format.

In establishing data hosting infrastructure, the community should pay attention to existing efforts in data management standards and hosting infrastructure, such as those being undertaken by the DataONE project, and find common frameworks on which to build biodiversity-specific solutions.

Geographically distributed processing. The sheer size of some datasets often makes it difficult to move data offsite for processing and makes it difficult to conduct research if these datasets are physically distant from the processing infrastructure. By implementing common frameworks to access and process these data, the need to run complex algorithms and tools across slow or unstable internet connections is reduced. It is assumed that more researchers would access data than would have the resources to implement their own storage infrastructure for the data. Therefore it is important for projects to consider the benefits of placing common processing frameworks on top of their data
sing power, whereby the project distributes their com-       and providing API access to those frameworks in order
puting tasks to other participating nodes, allowing          to provide researchers who are geographically separated
informatics teams to more effectively use their infra-       from the data access to higher speed processing.
structure at times when it would otherwise be                  Data snapshots. Data hosting centers should track
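As a minimal sketch of what such an API could look like (the endpoint, field names and dataset file below are illustrative assumptions, not an existing BDHC service), a hosting center might expose server-side filtering over a large occurrence file so that only matching records, rather than the raw dataset, cross the network:

```python
# Minimal sketch of a server-side processing API at a hosting center.
# The endpoint, dataset file and field names are illustrative assumptions.
import csv

from flask import Flask, jsonify, request

app = Flask(__name__)
DATASET = "occurrences.csv"  # large Darwin Core style file held at the center


@app.route("/occurrences")
def filter_occurrences():
    """Return only records matching a taxon, e.g.
    GET /occurrences?scientificName=Puma%20concolor"""
    name = request.args.get("scientificName", "")
    with open(DATASET, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r.get("scientificName") == name]
    return jsonify(rows)


if __name__ == "__main__":
    app.run()
```

A researcher on another continent could then retrieve, for example, all records for a single species with one small HTTP request instead of transferring the full dataset.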
Data snapshots. Data hosting centers should track archival data access and updates in order to understand data access patterns. Given the rapid rate of change in datasets, consideration needs to be given to the concept of 'snapshotting' data. In a snapshot, a copy of the state of a dataset is captured at a particular time. The main dataset may then be added to, but the snapshot will always remain static, thereby providing a revision history of the data. An example use case for snapshotting a dataset is an EOL species page. Species pages are dynamically generated datasets comprising user-submitted data and aggregated data sources. As EOL data are archived and shared, species pages are updated and modified. Being able to retrieve the latest copy of a dataset may not be as important as retrieving an earlier version of the same dataset, especially if relevant data have been removed or modified. Archival mechanisms need to take snapshotting into consideration to ensure that historical dataset changes can be referenced, located and recovered. This formalized approach to data stewardship makes the data more stable and, in turn, more valuable.
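A minimal sketch of such a mechanism, assuming simple file-based datasets (hosting centers would more likely use database- or filesystem-level versioning), is to write an immutable, timestamped, checksummed copy each time the live dataset changes:

```python
# Sketch of dataset snapshotting: each snapshot is an immutable, timestamped,
# checksummed copy, so earlier revisions remain retrievable after the live
# dataset changes. Paths and naming scheme are illustrative assumptions.
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot(dataset: Path, archive_dir: Path) -> Path:
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    digest = hashlib.sha256(dataset.read_bytes()).hexdigest()[:12]
    copy = archive_dir / f"{dataset.stem}.{stamp}.{digest}{dataset.suffix}"
    shutil.copy2(dataset, copy)  # the copy stays static; the live file moves on
    return copy


# Example: snapshot(Path("species_pages.tsv"), Path("snapshots"))
```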
Data and application archiving. Although some data exist in standardized formats (for example xls, xml and DwC), others are available only through application-specific code, commonly a website or web service with non-standardized output. Often, these services will store datasets in a way that is optimized for the application code and is unreadable outside the context of this code. Whether data archiving on its own is a sustainable option, or whether application archiving is also required, needs to be considered by project leaders. Applications that serve data should have a mechanism to export these data into a standard format. Legacy applications that do not have the capability to export data into a standard format need to be archived alongside the data. This can be in a long-term offline archive, where the code is simply stored with the data, or in an online archive where the application remains accessible alongside the data. In both cases, BDHC need to store application metadata, such as software versions and environment variables, both of which are critical to ensuring that the application can be run successfully when required. Re-hosting applications of any complexity can be a difficult task, as new software deprecates features in older versions. One solution that BDHC should consider is the use of virtual machines, which can be customized and snapshotted with the exact environment required to serve the application. However, such solutions should be seen as a stop-gap. It should be emphasized that a focus on raw data portability and accessibility is the preferred solution where possible.
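As an illustration of the application metadata worth capturing (the manifest fields, commands and environment values below are assumptions made for the sketch, not a ratified standard), an archiving script might record software versions and environment variables in a small manifest stored alongside the archived code and data:

```python
# Sketch: capture the software versions and environment variables needed to
# re-run an archived legacy application. All fields are illustrative.
import json
import platform
import subprocess
from datetime import date


def tool_version(cmd):
    """Best-effort capture of a dependency's version string."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    except FileNotFoundError:
        return "unknown"


def write_manifest(path="archive_manifest.json"):
    manifest = {
        "archived_on": date.today().isoformat(),
        "os": platform.platform(),
        "python": platform.python_version(),
        "mysql": tool_version(["mysql", "--version"]),
        "env": {"APP_HOME": "/srv/legacy-app", "LC_ALL": "C"},  # assumed values
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
```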
Replicate indexes. Online data indexes, such as those from GBIF and from the Global Names Index, provide a method for linking online data to objects, in this instance to names. Centrally storing the index enables users to navigate through the indexes to the huge datasets in centralized locations. These data are of particular value to the biodiversity data discovery process, as well as being generally valuable for many types of biodiversity research. It is recommended that these index data be replicated widely, and mirroring these indexes should be a high priority when establishing BDHC.

Tailored backup services. Given the growing quantity of data and informatics projects, there is scope for industry to provide tailored data management and backup solutions to the biodiversity informatics community. These tools can range from hosting services to online data backup, and would be designed to deal with biodiversity-related data specifically. By tailoring solutions to the various file formats and standards used in the biodiversity informatics community, issues such as format obsolescence can be better managed. For example, when backing up a MySQL dump or a Darwin Core archive file, metadata associated with that file can be stored with the backup. Such metadata could include the version of Darwin Core or MySQL, or the date of file creation. These data can later be used when migrating formats in the event of a legacy format becoming obsolete. Such migrations would be difficult for a large number of small projects to undertake individually, so centralizing this process would give a greater number of projects access to such migrations and therefore to the associated data.
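A sketch of what such format-aware backup metadata could look like (the record fields and example values are assumptions for illustration): each backed-up file is paired with a record of its standard, version and creation date, which a future migration tool can consult once a format nears obsolescence:

```python
# Sketch of a format-aware backup record: store with each backed-up file the
# metadata a future format migration would need. Field names are illustrative.
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def backup_record(path: Path, fmt: str, fmt_version: str) -> dict:
    return {
        "file": path.name,
        "format": fmt,                  # e.g. "Darwin Core Archive"
        "format_version": fmt_version,  # consulted when migrating formats later
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "created": datetime.fromtimestamp(
            path.stat().st_mtime, timezone.utc).isoformat(),
    }


# Example: backup_record(Path("moths.dwca.zip"), "Darwin Core Archive", "1.2")
```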
5. Long-term: develop data discovery tools
The abundance of data made available for public consumption provides a rich source for data-intensive research applications, such as automated markup, text extraction, machine learning, visualizations and digital 'mash-ups' of the data [119]. As a long-term goal, the GBIF secretariat and nodes, as well as biodiversity informatics projects, should encourage this process as a natural incentive. Three such uses of these applications are geo-location, taxon finding and text-mining.

Services for extracting information. By extracting place names or geographic coordinates from a dataset, geo-location services can provide maps of datasets and easily combine these with maps of additional datasets, producing mapping outputs such as Google Earth's [120] kmz format. Taxon-finding tools can scan large quantities of text for species names, providing the dataset with metadata describing which taxa are present in the data and the location within the data at which each taxon can be found. Using natural language processing, automated markup tools can analyze datasets and return a structured, marked-up version of the data. uBio RSS [121] and RSS novum are two tools that offer automated recognition of species names within scientific and media RSS feeds. By identifying data containing species names, new datasets and sources of data can be discovered. Further development of these tools, and the integration of this technology into web-crawling software, will pave the way for automated discovery of online biodiversity data that might benefit from being included in the recommended network of biodiversity data stores. The development of web services and APIs on top of datasets, and investment in informatics projects that harvest and remix data, should all be encouraged, both as a means to enable discovery within existing datasets and as an incentive for data owners to publish their data publicly. GBIF, funding agencies and biodiversity informatics projects should extend existing tools while developing and promoting new tools that can be used to enable data discovery. The community should develop a central registry of such tools and, where practical, provide hosting for these tools at data hosting centers.
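The taxon-finding step can be illustrated with a deliberately simple dictionary-based scan (production services such as uBio rely on much richer name recognition; the name list here is an assumption for the example), which reports each taxon found together with its position in the text:

```python
# Sketch of dictionary-based taxon finding: scan text for known species names
# and report each hit with its character offset. The name list is illustrative;
# real services use large authoritative name indexes and fuzzy matching.
import re

KNOWN_TAXA = ["Puma concolor", "Quercus robur", "Danaus plexippus"]


def find_taxa(text):
    """Yield (taxon, offset) for every occurrence of a known name."""
    for taxon in KNOWN_TAXA:
        for match in re.finditer(re.escape(taxon), text):
            yield taxon, match.start()


sample = "Specimens of Quercus robur were collected near the Danube."
print(list(find_taxa(sample)))  # [('Quercus robur', 13)]
```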
Services to enable data push. As new data are published, there is a lag between the time of publication and the harvesting of these data by other services, such as EOL or Google. Furthermore, even after articles are published, data are in most cases available only in HTML format, with the underlying databases rendered inaccessible. Development of services that allow new data to be 'pushed' to online repositories should be investigated, as this would make the new data available for harvesting by data aggregation services and data hosting centers. Once the data are discovered, they can then be further distributed via standardized APIs, removing this burden from the original data provider.
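A sketch of such a push client (the repository endpoint and form fields are hypothetical; no standardized push service of this kind currently exists) shows how a provider could notify an aggregator the moment a new archive is published:

```python
# Sketch of a 'data push' client: on publication, a provider sends the new
# archive straight to a repository instead of waiting to be crawled.
# The endpoint URL and form fields are hypothetical assumptions.
import requests


def push_dataset(archive_path, repo_url="https://repo.example.org/api/push"):
    with open(archive_path, "rb") as f:
        resp = requests.post(
            repo_url,
            files={"archive": f},  # e.g. a Darwin Core Archive
            data={"license": "CC0", "contact": "curator@example.org"},
            timeout=60,
        )
    resp.raise_for_status()  # aggregators can now harvest via the repo's APIs
    return resp.json()
```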
Once data aggregators discover new data, they should cache the data and include them in their archival and backup procedures. They should also investigate how best to recover the content provided by a content provider should the need arise. The EOL project follows this recommended practice and includes this caching functionality, which allows EOL to serve aggregated data when the source data are unavailable, while at the same time providing the added benefit of becoming a redundant back-up of many projects' data. This is an invaluable resource should a project suffer catastrophic data loss or go offline.

Conclusion
As the biodiversity data community grows, and as standards development and access to affordable infrastructure improve, some of the above-mentioned challenges will be met in due course. However, the importance of the data and the real risk of data loss suggest that these challenges and recommendations must be faced sooner rather than later by key players in the community. A generic data hosting infrastructure designed for a broader scientific scope than biodiversity data may also have a role in the access and preservation of biodiversity data. Indeed, the more widespread the data become, the safer they will be. However, the community should ensure that the infrastructure and mechanisms it uses are specific enough to biodiversity datasets that they meet both the short-term and long-term needs of the data and the community. As funding agencies' requirements for data management plans become more widespread, the biodiversity informatics community will become more attuned to the concepts discussed here.

As the community evolves to take a holistic and collective position on data preservation and access, it will find new avenues for collaboration, data discovery and data re-use, all of which will improve the value of its data. The community needs to recognize and proactively encourage this evolution by developing a shared understanding of what is needed to ensure the preservation of biodiversity data, and then acting on the resulting requirements. We see the recommendations laid out in this paper as a step towards that shared understanding.

Acknowledgments
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 15, 2011: Data publishing framework for primary biodiversity data. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S15. Publication of the supplement was supported by the Global Biodiversity Information Facility.

Author details
1 Center for Library and Informatics, Woods Hole Marine Biological Laboratory, Woods Hole, MA 02543, USA. 2 Encyclopedia of Life, Center for Library and Informatics, Woods Hole Marine Biological Laboratory, Woods Hole, MA 02543, USA. 3 Center for Biodiversity Informatics (CBI), Missouri Botanical Garden, St Louis, MO 63119, USA. 4 Center for Biology and Society, Arizona State University, Tempe, AZ 85287, USA.

Competing interests
The authors declare that they have no competing interests.

Published: 15 December 2011

References
1. Scholes RJ, Mace GM, Turner W, Geller GN, Jürgens N, Larigauderie A, Muchoney D, Walther BA, Mooney HA: Towards a global biodiversity observing system. Science 2008, 321:1044-1045.
2. Güntsch A, Berendsohn WG: Biodiversity research leaks data. Create the biodiversity data archives! In Systematics 2008 Programme and Abstracts, Göttingen 2008. Edited by Gradstein SR, Klatt S, Normann F, Weigelt P, Willmann R, Wilson R. Göttingen: Göttingen University Press; 2008.
3. Gray J, Szalay AS, Thakar AR, Stoughton C, Vandenberg J: Online Scientific Data Curation, Publication, and Archiving. 2002 [http://research.microsoft.com/apps/pubs/default.aspx?id=64568].
4. Whitlock MC, McPeek MA, Rausher MD, Rieseberg L, Moore AJ: Data archiving. Am Nat 2010, 175:145-146.
5. Global Biodiversity Information Facility Data Portal. [http://data.gbif.org].
6. Encyclopedia of Life. [http://eol.org].
7. GBIF: Data Providers. [http://data.gbif.org/datasets/providers].
8. Encyclopedia of Life: Content Partners. [http://eol.org/content_partners].
9. ScratchPads. [http://scratchpads.eu].
10. LifeDesks. [http://www.lifedesks.org/].
11. Heidorn PB: Shedding light on the dark data in the long tail of science. Library Trends 2008, 57:280-299.
12. Digital Preservation Coalition. [http://www.dpconline.org].
13. Library of Congress Digital Preservation. [http://www.digitalpreservation.gov].
14. Image Permanence Institute. [https://www.imagepermanenceinstitute.org].
15. The Dataverse Network Project. [http://thedata.org].
16. DataCite. [http://datacite.org].
17. Digital Curation Centre. [http://www.dcc.ac.uk].
18. GenBank. [http://www.ncbi.nlm.nih.gov/genbank/].
19. DNA Data Bank of Japan. [http://www.ddbj.nig.ac.jp].
20. European Molecular Biology Laboratory Nucleotide Sequence Database. [http://www.ebi.ac.uk/embl].
21. The US Long Term Ecological Research Network. [http://www.lternet.edu/].
22. Klump J: Criteria for the trustworthiness of data-centres. D-Lib Magazine 2011, doi:10.1045/january2011-klump.
23. Zins C: Conceptual approaches for defining data, information, and knowledge. Journal of the American Society for Information Science and Technology 2007, 58:479-493.
24. IPT Architecture PDF. [http://code.google.com/p/gbif-providertoolkit/downloads/detail?name=ipt-architecture_1.1.pdf&can=2&q=IPT+Architecture].
25. Doolan J: Proposal for a Simplified Structure for Emu. [http://www.kesoftware.com/downloads/EMu/UserGroupMeetings/2005%20North%20American/john%20doolan_SimplifiedStructure.pps].
26. Pyle R: Taxonomer: a relational data model for managing information relevant to taxonomic research. PhyloInformatics 2004, 1:1-54.
27. International Code of Botanical Nomenclature (Vienna Code) adopted by the seventeenth International Botanical Congress, Vienna, Austria, July 2005. Edited by McNeill J, Barrie F, Demoulin V, Hawksworth D, Wiersema J. Ruggell, Liechtenstein: Gantner Verlag; 2006.
28. Flickr. [http://flickr.com].
29. Systema Dipterorum. [http://diptera.org].
30. Index Nominum Algarum Bibliographia Phycologica Universalis. [http://ucjeps.berkeley.edu/INA.html].
31. International Plant Names Index. [http://ipni.org].
32. Index Fungorum. [http://indexfungorum.org].
33. Catalog of Fishes. [http://research.calacademy.org/ichthyology/catalog].
34. Berners-Lee T, Hendler J, Lassila O: The Semantic Web. Sci Am 2001, 29.
35. Hymenoptera Anatomy Ontology. [http://hymao.org].
36. TaxonConcept. [http://lod.taxonconcept.org].
37. Patterson DJ, Cooper J, Kirk PM, Remsen DP: Names are key to the big new biology. Trends Ecol Evol 2010, 25:686-691.
38. Catalogue of Life. [http://www.catalogueoflife.org].
39. World Register of Marine Species. [http://marinespecies.org].
40. Integrated Taxonomic Information System. [http://itis.gov].
41. Index to Organism Names. [http://organismnames.com].
42. Global Names Index. [http://gni.globalnames.org].
43. uBio. [http://ubio.org].
44. Natural History Museum Collections. [http://nhm.ac.uk/research-curation/collections].
45. Smithsonian Museum of Natural History: Search Museum Collection Records. [http://collections.nmnh.si.edu].
46. Biodiversity Heritage Library. [http://biodiversitylibrary.org].
47. Myco Bank. [http://www.mycobank.org].
48. Fishbase. [http://fishbase.org].
49. Missouri Botanical Garden. [http://mobot.org].
50. Atlas of Living Australia. [http://ala.org.au].
51. Wikipedia. [http://wikipedia.org].
52. BBC Nature Wildlife. [http://bbc.co.uk/wildlifefinder].
53. iSpot. [http://ispot.org.uk].
54. iNaturalist. [http://inaturalist.org].
55. Artportalen. [http://artportalen.se].
56. Nationale Databank Flora en Fauna. [https://ndff-ecogrid.nl/].
57. Mushroom Observer. [http://mushroomobserver.org].
58. eBird. [http://ebird.org].
59. BugGuide. [http://bugguide.net].
60. National Center for Biotechnology Information Taxonomy. [http://ncbi.nlm.nih.gov/taxonomy].
61. Tree of Life Web Project. [http://tolweb.org].
62. Interim Register of Marine and Nonmarine Genera. [http://www.obis.org.au/irmng/].
63. TreeBASE. [http://treebase.org].
64. PhylomeDB. [http://phylomedb.org].
65. Sustainable Digital Data Preservation and Access Network Partners (DataNet). [http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141].
66. Data Conservancy. [http://www.dataconservancy.org].
67. DataONE. [http://dataone.org].
68. National Oceanographic and Atmospheric Administration (NOAA). [http://noaa.gov].
69. Preliminary Principles and Guidelines for Archiving Environmental and Geospatial Data at NOAA: Interim Report by Committee on Archiving and Accessing Environmental and Geospatial Data at NOAA. USA: National Research Council; 2006.
70. Hunter J, Choudhury S: Semi-automated preservation and archival of scientific data using semantic grid services. In Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid - Volume 01 (CCGRID '05). Volume 1. Washington, DC: IEEE Computer Society; 2005:160-167.
71. Bittorrent. [http://www.bittorrent.com/].
72. Langille MGI, Eisen JA: BioTorrents: a file sharing service for scientific data. PLoS ONE 2010, 5:e10071 [http://www.plosone.org/article/info:doi/10.1371/journal.pone.0010071].
73. BioTorrents. [http://www.biotorrents.net/faq.php].
74. OpenURL. [http://www.exlibrisgroup.com/category/sfxopenurl].
75. BHL OpenURL Resolver Help. [http://www.biodiversitylibrary.org/openurlhelp.aspx].
76. Open Archives Initiative. [http://www.openarchives.org/].
77. CiteBank. [http://citebank.org/].
78. LOCKSS. [http://lockss.stanford.edu/lockss/Home].
79. OAIS. [http://lockss.stanford.edu/lockss/OAIS].
80. Distributed Generic Information Retrieval (DiGIR). [http://digir.sourceforge.net].
81. Biological Collection Access Services. [http://BioCASE.org].
82. ABCD Schema 2.06 - ratified TDWG Standard. [http://www.bgbm.org/tdwg/CODATA/Schema/].
83. The Big Dig. [http://bigdig.ecoforge.net].
84. List of BioCASE Data Sources. [http://www.BioCASE.org/whats_BioCASE/providers_list.cfm].
85. Amazon Elastic Compute Cloud (Amazon EC2). [http://aws.amazon.com/ec2/].
86. OpenSSH. [http://www.openssh.com/].
87. Apache Hadoop. [http://hadoop.apache.org/].
88. Dean J, Ghemawat S: MapReduce: simplified data processing on large clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation 2004 [http://labs.google.com/papers/mapreduce.html].
89. Hadoop on Amazon EC2 to generate Species by Cell Index. [http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html].
90. Joint Photographic Experts Group. [http://jpeg.org].
91. W3C Portable Network Graphics. [http://w3.org/Graphics/PNG/].
92. Moving Picture Experts Group. [http://mpeg.chiariglione.org/].
93. Vorbis.com. [http://www.vorbis.com/].
94. Biodiversity Information Standards (TDWG). [http://www.tdwg.org].
95. Darwin Core. [http://rs.tdwg.org/dwc/].
96. Darwin Core Archive Data Standards. [http://www.gbif.org/informatics/standards-and-tools/publishing-data/data-standards/darwin-core-archives/].
97. TAPIR (TDWG Access Protocol for Information Retrieval). [http://www.tdwg.org/activities/tapir].
98. Environmental Ontology. [http://environmentontology.org].
99. Kabooza Global backup survey: About backup habits, risk factors, worries and data loss of home PCs. 2009 [http://www.kabooza.com/globalsurvey.html].
100. Hodge G, Frangakis E: Digital Preservation and Permanent Access to Scientific Information: The State of the Practice. 2004 [http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA432687].
101. National Institutes of Health. [http://www.nih.gov/].
102. National Science Foundation. [http://nsf.gov/].
103. Scientists Seeking NSF Funding Will Soon Be Required to Submit Data Management Plans. [http://www.nsf.gov/news/news_summ.jsp?cntn_id=116928&org=NSF&from=news].
104. Harvey CC, Huc C: Future trends in data archiving and analysis tools. Adv Space Res 2003, 32:347-353, doi:10.1016/S0273-1177(03)90273-0.
105. Leach P, Mealling M, Salz R: A universally unique identifier (UUID) URN namespace. 2005 [http://www.ietf.org/rfc/rfc4122.txt].
106. Berners-Lee T, Masinter L, McCahill M: Uniform Resource Locators (URL). 1994 [http://www.ietf.org/rfc/rfc1738.txt].
107. OMG Life Sciences Identifiers Specification. [http://xml.coverpages.org/lsid.html].
108. Creative Commons. [http://creativecommons.org].
109. GitHub. [https://github.com].
110. Google Code. [http://code.google.com].
111. Bennett-McNew C, Ragon B: Inspiring vanguards: the Woods Hole experience. Med Ref Serv Q 2008, 27:105-110.
112. Yamashita G, Miller H, Goddard A, Norton C: A model for bioinformatics training: the marine biological laboratory. Brief Bioinform 2010, 11:610-615, doi:10.1093/bib/bbq029.
113. Howe J: Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business. Random House Business; 2008.
114. Hsueh P-Y, Melville P, Sindhwani V: Data quality from crowdsourcing: a study of annotation selection criteria. Proceedings of the NAACL HLT Workshop on Active Learning for Natural Language Processing. Computational Linguistics 2009, 27-35 [http://portal.acm.org/citation.cfm?id=1564131.1564137].
115. National Library of Australia's Trove. [http://trove.nla.gov.au/].
116. GBIF/TDWG Multimedia Resources Task Group. [http://www.keytonature.eu/wiki/MRTG].
117. Public Data Sets on Amazon Web Services. [http://aws.amazon.com/publicdatasets/].
118. Google Public Data Explorer. [http://google.com/publicdata].
119. Smith V: Data Publication: towards a database of everything. BMC Research Notes 2009, 2:113.
120. Google Earth. [http://earth.google.com].
121. uBio RSS. [http://www.ubio.org/rss].

doi:10.1186/1471-2105-12-S15-S5
Cite this article as: Goddard et al.: Data hosting infrastructure for primary biodiversity data. BMC Bioinformatics 2011, 12(Suppl 15):S5.




Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 

Último (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dash
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 

Data hosting infrastructure for primary biodiversity data

* Correspondence: agoddard@mbl.edu
1 Center for Library and Informatics, Woods Hole Marine Biological Laboratory, Woods Hole, MA 02543, USA
Full list of author information is available at the end of the article
© 2011 Goddard et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

tion and related loss of rich sources of biodiversity make ephemeral a significant amount of data that have historically been assumed to be stable. Whether the data are stable or ephemeral, however, poor record keeping and data management practices nevertheless lead to loss of data. Although it is in the best interest of all who create and use biodiversity data to encourage best practices to protect against data loss, the community still requires additional effective incentives to participate in a shared data environment and to help overcome existing social and cultural barriers to data sharing. Separated silos of data from disparate groups presently dominate the current global infrastructure for biodiversity data. There are some examples of projects working to bring the data out of those silos and to encourage sharing between the
various projects. Examples include the Global Biodiversity Information Facility (GBIF) data portal [5] and the Encyclopedia of Life (EOL) [6]. Each of them works to bring together data from their partners [7,8] and makes those data available through their application programming interfaces (APIs). In addition, there are projects, including ScratchPads [9] and LifeDesks [10], that allow taxonomists to focus on their area of expertise while automatically making the data they want to share available to the larger biodiversity community.

Although there is a strong and healthy mix of platforms and technologies, there remains a gap in standards and processes, especially for the 'long tail' of smaller projects [11]. In analyzing existing infrastructure, we see that large and well-funded projects predictably have more substantial investments in infrastructure, making use of not only on-site redundancy, but also remote mirrors. Smaller projects, on the other hand, often rely on manual or semi-automated data backup procedures with little time or resources for comprehensive high availability or disaster recovery considerations.

It is therefore imperative to seek a solution to this data loss and ensure that data are rescued, archived and made available to the biodiversity community. Although there are broad efforts to encourage the use of best practices for data archiving [12-14], citation of data [15,16] and curation [17], as well as large archives that are focused on particular types of biology-related data, including sequences [18-20], ecosystems [21], observations (GBIF) and species descriptions (EOL), none of these efforts are focused on the long-term preservation of biodiversity data. To this end the biodiversity informatics community requires investment in trustworthy processes and infrastructure to mitigate data loss [22] and needs to provide solutions for long-term hosting and storage of biodiversity data. We propose the construction of Biodiversity Data Hosting Centers (BDHCs), which are charged with the task of mitigating the risks presented here by the careful creation and management of the infrastructure necessary to archive and manage biodiversity data. As such, they will provide a future safeguard against loss of biodiversity data.

In laying out our vision of BDHCs, we begin by categorizing the different kinds of biodiversity data that are found in the literature and in various datasets. We then lay out some features and capabilities that BDHCs should possess, with a discussion of standards and best practices that are pertinent to an effective BDHC. After a discussion of current tools and approaches for effective data management, we discuss some of the technological and cultural barriers to effective data management and preservation that have hindered the community. We end with a series of recommendations for adopting and implementing data management/preservation changes.

We acknowledge that biodiversity data range widely, both in format and in purpose. Although some authors carefully restrict the use of the term 'data' to verifiable facts or sensory stimuli [23], for the purposes of this article we intend a very broad understanding of the term covering essentially everything that can be represented using digital computers. Although not all biodiversity data are in digital form, we restrict our discussion here to digital data that relate in some way to organisms. Usually the data are created to serve a specific purpose, ranging from ecosystem assessment to species identification to general education. We also acknowledge that published literature is a very important existing repository for biodiversity data, and every category of biodiversity data we discuss may be published in the traditional literature as well as more digitally accessible forms. Consequently, any of the data discussed here may have relevant citations that are important for finding and interpreting the data. However, given that citations have such a rich set of standards and supporting technologies, they will not be addressed here as that would distract from the primary purpose of this paper.

Categories of data and examples

We start this section with a high-level discussion of the types of data and challenges related specifically to biodiversity data. This is followed by a brief review of various types of existing Internet-accessible biodiversity data sources. Our intent is not to be exhaustive, but rather to demonstrate the wide variety of biodiversity data sources and to point out some of the differences between them. The choice of sites discussed is based on a combination of the authors' familiarity with the particular tools and their widespread use.

Most discussions of the existing data models for managing biodiversity data are either buried in what documentation exists for specific systems, for example, the Architectural View of GBIF's Integrated Publishing Toolkit (IPT) [24] or, alternatively, the data model is discussed in related informally published presentations, for example John Doolan's 'Proposal for a Simplified Structure for EMu' [25]. One example of a data model in the peer-reviewed literature is Rich Pyle's Taxonomer [26] system. However, this is again focused on a particular implementation of a particular data model for a particular system.

Raw observational data are a key type of biodiversity data and include textual or numeric field notes, photographs, video clips, sound files, genetic sequences or any other sort of recorded data file or dataset based on the observation of organisms. Although the processes
involved in collecting these data can be complex and may be error prone, these data are generally accepted as the basic factual data from which all other biodiversity data are derived. In terms of sheer volume, raw observational data are clearly the dominant type of data. Another example is nomenclatural data. These data are the relationships between various names and are thus very compact and abstract. Nomenclatural data are generally derived from a set of nomenclatural codes (for example, the International Code of Botanical Nomenclature [27]). Although the correct application of these codes in particular circumstances may be a source of endless debate among taxonomists, the vast majority of these data are not a matter of deep debate. However, descriptions of named taxa, particularly above the species level, are much more subjective, relying on both the historical literature and fundamentally on the knowledge and opinions of their authors. Digital resources often do a poor job of acknowledging, much less representing, this subjectivity.

For example, consider a system that records observations of organisms by recording the name, date, location and observer of the organism. Such a system can conflate objective, raw observational data with a subjective association of that data with a name based on some poorly specified definition of that name. Furthermore, most of the raw observational data are then typically discarded because the observer fails to record the raw data that they used to make the determination or because the system does not provide a convenient way to associate raw data, such as photographs, with a particular determination. Similarly, the person making the identification typically neglects to be specific about what definition of the name they intend. Although this example raises a number of significant questions about the long-term value of such data, the most significant issues for our current purpose relate to combining data from multiple digital resources. The data collected and the way that different digital resources are used are often very different. As a result it is very important to be able to reliably trace the origin of specific pieces of data. It is also important to characterize and document the various data sources in a common framework. (A small illustrative sketch of this separation of observation and identification follows at the end of this section.)

Currently, a wide variety of biodiversity data are available through digital resources accessible through the Internet. These range from sites that focus on photographs of organisms from the general public, such as groups within Flickr [28], to large diverse systems intended for a range of audiences, such as EOL [6], to very specialized sites focused on the nomenclature of a specific group of organisms, such as Systema Dipterorum [29] or the Index Nominum Algarum [30]. Other sites focused on nomenclature include the International Plant Names Index [31], Index Fungorum [32] and the Catalog of Fishes [33]. Several sites have begun developing biodiversity data for the semantic web [34] including the Hymenoptera Anatomy Ontology [35] and TaxonConcept.org [36]. In addition, many projects include extensive catalogs of names because names are key to bringing together nearly all biodiversity data [37]. Examples include the Catalog of Life [38], WoRMs [39], ITIS [40] and ION [41]. Many sources for names, including those listed above, are indexed through projects such as the Global Names Index [42] and uBio [43].

Some sites focus on curated museum collections (for example, Natural History Museum, London [44], and the Smithsonian Institution [45], among many others). Others focus on descriptions of biological taxa ranging from original species descriptions (the Biodiversity Heritage Library [46] and MycoBank [47]), up-to-date technical descriptions (FishBase [48], the Missouri Botanical Garden [49], and the Atlas of Living Australia [50]), and descriptions intended for the general public (Wikipedia [51] and the BBC Wildlife Finder [52]).

As described above, identifications are often conflated with raw observations or names. However, there are a number of sites set up to interactively manage newly proposed identifications. Examples include iSpot [53], iNaturalist [54], ArtPortalen.se [55] and Nationale Databank Flora en Fauna [56], which accept observations of any organisms, and sites with more restricted domains, such as Mushroom Observer [57], eBird [58] or BugGuide [59]. The GBIF data portal provides access to observational data from a variety of sources, including many government organizations. Many sites organize their data according to a single classification that reflects the current opinions and knowledge of the resource managers. Examples of such sites whose classifications come reasonably close to covering the entire tree of life are the National Center for Biotechnology Information (NCBI) taxonomy [60], the Tree of Life Web [61], the Catalog of Life [38], the World Registry of Marine Species (http://www.marinespecies.org/), and the Interim Register of Marine and Nonmarine Genera [62]. EOL [6] gathers classifications from such sources and allows the user to choose a preferred classification for accessing its data. Finally, some sites, such as TreeBase [63] and PhylomeDB [64], focus on archiving computer generated phylogenies based on gene sequences.
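The separation of objective observations from subjective identifications discussed above can be made concrete with a small data structure. The following Python sketch is purely illustrative; the field names are our own and are not drawn from any cited standard. It shows one way a system could keep the raw observational record, the naming act, and the provenance link between them as distinct objects.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Observation:
        """Objective, raw observational data: what was recorded, where, when, by whom."""
        observation_id: str
        observer: str
        date: str                      # ISO 8601, e.g. "2011-06-04"
        latitude: float
        longitude: float
        media_files: List[str] = field(default_factory=list)  # photos, sound files, etc.

    @dataclass
    class Identification:
        """A subjective assertion that an observation represents a named taxon."""
        observation_id: str            # provenance link back to the raw record
        scientific_name: str
        name_source: str               # which definition of the name is intended
        identified_by: str
        date_identified: str

    # The raw record survives even if the identification is later revised;
    # both records are invented for illustration.
    obs = Observation("obs-1", "A. Naturalist", "2011-06-04", 42.52, -70.67, ["dsc_0042.jpg"])
    ident = Identification("obs-1", "Amanita muscaria", "Index Fungorum", "B. Taxonomist", "2011-06-10")

Keeping the two record types separate means a later, better identification can be attached to the same observation without discarding the original raw data.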
Features of biodiversity data hosting centers

Most of the key features of a BDHC are common to the general problem of creating publicly accessible, long-term data archives. Obviously, the data stored in the systems are distinctive to the biodiversity community, such as images of diagnostic features of specific taxa. However, the requirements to store images and the standards used for storing them are widespread. In addition, significant portions of the data models, and thus communication standards and protocols, are distinctive to the biodiversity community. One example is the specific types of relationships between images, observations, taxon descriptions and scientific names.

Government, science, education and cultural heritage communities, among others, are faced with many of the same general challenges that face the biodiversity informatics community when it comes to infrastructure and processes to establish long-term preservation and access of content. The NSF DataNet Program [65] was created with this specific goal in mind. The Data Conservancy [66] and the DataOne Project [67], both funded by DataNet, are seeking to provide long-term preservation, access and reuse for their stakeholder communities. The US National Oceanographic and Atmospheric Administration [68] is actively exploring guidelines for archiving environmental and geospatial data [69]. The wider science and technology community has also investigated the challenges of preservation and access of scientific and technical data [70]. Such investigations and projects are too great to cover in the context of this paper. However, every effort should be made by the biodiversity informatics community to build from the frameworks and lessons learned by those who are tackling these same challenges in areas outside biodiversity.

The primary goal of BDHCs is to substantially reduce the risk of loss of biodiversity data. To achieve this goal they must take long-term data preservation seriously. Fortunately, this need is common to many other areas of data management and many excellent tools exist to meet these needs. In addition to the obvious data storage requirements, data replication and effective metadata management are the primary technologies required to mitigate against the dangers of data loss.

Another key feature of BDHCs is to make the data globally available. Although any website is, in a sense, globally available, the deeper requirements are to make that availability reliable and fast. Data replication across globally distributed data centers is a well-recognized approach to making data consistently available across the world.

Finally, the use and promotion of standards is a key feature for BDHCs. Standards are the key component to enabling effective, scalable data transfer between independently developed systems. The key argument for standards in BDHCs is that they reduce the need for a set of systems that need to communicate with each other to create a custom translator between each pair of systems. Everyone can aim at supporting the standards rather than figuring out how to interoperate with everyone else on a case by case basis. If the number of players is close to the number of standards, then there is no point in standardizing. If each player has its own 'standard', then the total amount of work that has to be done by the community goes up as the square of the number of players (roughly speaking, an N² problem, with N being the number of players). If, however, some community standards are used, the amount of work for the community is N×S (with S being the number of standards used in the community). As S approaches N, there is no point in standardizing, but as S becomes much less than N, the total amount of support required by the community to deal with accounting for the various standards decreases. A full account of standards and their importance to robust technical systems is beyond the scope of this article. However, we accept that the use of standards to facilitate data preservation and access is very important for BDHCs. In this context, we present a general overview of common standards used in biodiversity systems in the next section.

Tools, services and standards

Here we summarize some of the tools, services, and standards that are used in biodiversity informatics projects. Although an exhaustive list is beyond the scope of this article, we attempt to present tools that are widely known and used by most members of the community.

Tools for replicating data

TCP/HTTP protocols allow simple scripting of command-line tools, such as wget or curl, to transfer data. When this is not an option and there is access via a login shell such as OpenSSH, other standard tools can be used to replicate and mirror data. Examples of these are rsync, sftp and scp.

Peer-to-Peer (P2P) file-sharing includes technologies such as BitTorrent [71], which is a protocol for distributing large amounts of data. BitTorrent is the framework for the BioTorrents project [72], which allows researchers to rapidly share large datasets via a network of pooled bandwidth systems. This open sharing allows one user to collect pieces of the download from multiple providers, increasing the efficiency of the file transfer while simultaneously providing the downloaded bits to other users. The BioTorrents website [73] acts as a central listing of datasets available to download and the BitTorrent protocol allows data to be located on multiple servers. This decentralizes the data hosting and distribution and provides fault tolerance. All files are then integrity checked via checksums to ensure they are identical on all nodes.
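As a minimal sketch of the kind of scripted mirroring described above, the following Python fragment drives rsync over SSH and verifies a file with a SHA-256 checksum, in the spirit of the integrity checking BioTorrents performs. The host and paths are hypothetical; any mirror reachable over SSH with rsync installed would do.

    import hashlib
    import subprocess
    from pathlib import Path

    def sha256sum(path: Path) -> str:
        """Checksum a file in chunks so large media files do not exhaust memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def mirror(source_dir: str, destination: str) -> None:
        """Replicate a dataset directory to a remote mirror with rsync over SSH."""
        subprocess.run(
            ["rsync", "--archive", "--compress", "--delete", source_dir, destination],
            check=True,
        )

    # Hypothetical source and mirror locations, for illustration only.
    mirror("/data/specimen-images/", "mirror.example.org:/archive/specimen-images/")
    print(sha256sum(Path("/data/specimen-images/dsc_0042.jpg")))

Publishing the checksums alongside the data lets every mirror independently confirm that its copy is identical to the original.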
Tools for querying data providers

OpenURL [74] provides a standardized URL format that enables a user to find a resource they are allowed to access via the provider they are querying. Originally used by librarians to help patrons find scholarly articles, it is now used for any kind of resource on the internet. The standard supports linking from indexed databases to other services, such as journals, via full-text search of repositories, online catalogs or other services. OpenURL is an open tool and allows common APIs to access data. The Biodiversity Heritage Library (BHL)'s OpenURL resolver [75] is an API that was launched in 2009 and continues to offer a way for data providers and aggregators to access BHL material. Any repository containing citations to biodiversity literature can use this API to determine whether a given book, volume, article and/or page is available online through BHL. The service supports both OpenURL 0.1 and OpenURL 1.0 query formats, and can return its response in JSON, XML or HTML formats, providing flexibility for data exchange.
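A hedged sketch of such a lookup follows. The endpoint shown and the choice of parameters are illustrative; the key-encoded-value names (url_ver, rft_val_fmt, rft.btitle) are standard OpenURL 1.0 conventions, but the current BHL documentation should be consulted for the authoritative form of the request.

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Illustrative resolver location; confirm against current BHL documentation.
    BHL_OPENURL = "https://www.biodiversitylibrary.org/openurl"

    params = {
        "url_ver": "Z39.88-2004",           # identifies an OpenURL 1.0 request
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",
        "rft.btitle": "Species plantarum",  # title being looked up
        "format": "json",                   # ask for a JSON response
    }

    with urlopen(BHL_OPENURL + "?" + urlencode(params)) as response:
        result = json.load(response)
    print(result)

Because the response can be requested as JSON, a repository can script such lookups in bulk to discover which of its citations are already available online.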
Tools for metadata and citation exchange

OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting, usually referred to as simply OAI) [76] is an established protocol to provide an application independent framework to harvest metadata. Using XML over HTTP, OAI harvests metadata descriptions of repository records so that servers can be built using metadata from many unrelated archives. The base implementation of OAI must support metadata in the Dublin Core format (a metadata standard for describing a wide range of networked resources), but support for additional representations is available. Once an OAI service is initially harvested, future harvests will check only for new or changed records, making it an efficient protocol and one that is easily set up as a repetitive task that runs automatically at regular intervals (a minimal harvesting sketch follows at the end of this subsection).

CiteBank [77] is an open access repository for biodiversity publications published by the BHL that allows sharing, categorizing and promoting of citations. Citations are harvested from resources via the OAI-PMH protocol, which seamlessly deals with updates and changes from remote providers. Consequently, all data within CiteBank is also available to anyone via OAI, thus creating a new discovery node. Tools are available that allow users to upload individual citations or whole collections of bibliographies. Open groups may be formed around various categories and can assist in moderating and updating citations.
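The following Python sketch harvests every record from an OAI-PMH provider, following resumption tokens as the protocol specifies. The verbs, namespace and token flow are part of OAI-PMH 2.0; the endpoint URL is hypothetical, and any CiteBank-style repository exposing the same verbs could be substituted.

    import xml.etree.ElementTree as ET
    from urllib.parse import urlencode
    from urllib.request import urlopen

    OAI = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url: str, metadata_prefix: str = "oai_dc"):
        """Yield every record from an OAI-PMH repository, following resumption tokens."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            with urlopen(base_url + "?" + urlencode(params)) as response:
                tree = ET.parse(response)
            for record in tree.iter(OAI + "record"):
                yield record
            token = tree.find(f".//{OAI}resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # Hypothetical endpoint, for illustration only.
    for rec in harvest("https://repository.example.org/oai"):
        print(rec.findtext(f".//{OAI}identifier"))

Run on a schedule, a loop like this implements exactly the incremental, repetitive harvest described above.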
Tools for data exchange

LOCKSS (Lots of Copies Keep Stuff Safe) [78] is an international community program, based at Stanford University Libraries, that uses open source software and P2P networking technology to map a large, decentralized and replicated digital repository. Using off-the-shelf hardware and requiring very little technical expertise, the system preserves an institution's content while providing preservation for other content: a virtual shared space where all nodes in the network support each other and sustain authoritative copies of the original content. LOCKSS is OAIS (Open Archival Information System) [79] compliant and preserves all genres and formats of web content to preserve both the historical context and the intellectual content. A so-called 'LOCKSS box' collects content from the target sites using a web crawler similar to the ones that search engines use to discover content. It then watches those sites for changes, allows the content to be cached on the box so as to facilitate a web proxy or cache of the content in case the target system is ever down, and has a web-based administration panel to control what is being audited, how often and who has access to the material.

DiGIR (Distributed Generic Information Retrieval) [80] is a client/server protocol for retrieving information from distributed resources. Using HTTP to transport Darwin Core XML between the client and server, DiGIR is a set of tools to link independent databases into a single, searchable virtual collection. Although it was initially targeted to deal only with species data, it was later expanded to work with any type of information and was integrated into a number of community collection networks, in particular GBIF. At its core, DiGIR provides a search interface between many dissimilar databases using XML as a translator. When a search query is issued, the DiGIR client application sends the query to each institution's DiGIR provider, which is then translated into an equivalent request that is compatible with the local database. Thus the response can deal with the search even though the details of the underlying database are suppressed, allowing a uniform virtual view of the contents on the network. Network speed and availability were major concerns of the DiGIR system; nodes would time out before requests could be processed, resulting in failed queries. The functionality of DiGIR was superseded by the TAPIR protocol.

BioCASE (The Biological Collection Access Service for Europe) [81] "is a transnational network of biological collections". BioCASE provides access to collection and observational databases by providing an XML abstraction layer in front of a database. Although its search, retrieval and framework is similar to DiGIR, it is nevertheless incompatible with DiGIR. BioCASE uses a schema based on the Access to Biological Collections Data (ABCD) schema [82].

TAPIR is a current Taxonomic Database Working Group (TDWG) standard that provides an XML API protocol for accessing structured data. TAPIR combines and extends features of BioCASE and DiGIR to create a more generic means for sharing data. The TAPIR project is run by a task group that oversees the development and maintenance of the software as well as the protocol's standard. The XML request and response for access can be stored in a distributed database.

Although DiGIR providers are still available, their numbers have diminished since its inception and now number around 260 [83]. BioCASE, meanwhile, has approximately 350 active nodes [84]. TAPIR is being used as the primary collection method by GBIF, who continue to maintain and add functionality to the project. Institutions such as the Missouri Botanical Garden host all three services (DiGIR, BioCASE and TAPIR) to allow ongoing open access to their collections. Many projects will endeavor to maintain legacy tools for as long as there is demand. This support is helpful for systems that have already invested in legacy tools, but it is important to promote new tools and standards to ensure their adoption as improvements in tools and standards are made. In so doing, there needs to be a clear migration path from legacy systems to the new one.

Distributed computing

Infrastructure as a service
Cloud infrastructure services provide server virtualization environments as a service. When a user requires a number of servers to store data or run processing tasks, they simply rent computing resources from a provider, who provides the resources from a pool of available equipment. From here, virtual machines can be brought online and offline quickly, with the user paying only for the actual services used or resources consumed. The most well known implementation of this is the Amazon Elastic Compute Cloud (Amazon EC2) [85].

Distributed processing
Using common frameworks to share processing globally enables researchers to use far more computing power than had been previously possible. By applying this additional computing power, projects can leverage techniques such as data mining to open up new avenues of exploration in biodiversity research. After computational targets are selected, individual jobs are distributed to processing nodes until all work is completed. Connecting disparate systems only requires a secure remote connection, such as OpenSSH [86], over the internet.

Apache Hadoop [87] is a popular open source project from Yahoo that provides a Java software framework to support distributed computing. Running on large clusters of commodity hardware, Hadoop uses its own distributed file system (HDFS) to connect various nodes and provides resilience against intermittent failures. Its computational method, map/reduce [88], passes small fragments of work to other nodes in the cluster and directs further jobs while aggregating the results.

Linking multiple clusters of Hadoop servers globally would present biodiversity researchers with a community utility. GBIF developers have already worked extensively with Hadoop in a distributed computing environment, so much of the background research into the platform and its application to biodiversity data has been done. In a post entitled 'Hadoop on Amazon EC2 to generate Species by Cell Index' [89], GBIF generated a 'species per cell index' map of occurrence data across an index of the entire GBIF occurrence record store, where each cell represented an area of one degree latitude by one degree longitude. This map consisted of over 135 million records. This was processed using 20 Amazon EC2 instances running Linux, Hadoop and Java. This map generation was completed in 472 seconds and was used to show that processing tasks could be distributed over many Hadoop instances to come up with a unified result. Because Hadoop runs in Java it is very simple to provision nodes and link them together to form a private computing network.
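The 'species per cell index' computation is easy to express in map/reduce terms. The toy sketch below uses plain Python rather than Hadoop, and invented sample records, but the shape of the computation — mapping each occurrence record to a one-degree cell, then reducing to a count of distinct species per cell — is the same one GBIF distributed across a cluster.

    import math
    from collections import defaultdict

    # Invented sample occurrence records: (species name, latitude, longitude).
    occurrences = [
        ("Amanita muscaria", 42.52, -70.67),
        ("Amanita muscaria", 42.88, -70.12),
        ("Danaus plexippus", 42.60, -70.90),
        ("Danaus plexippus", -33.87, 151.21),
    ]

    def map_phase(records):
        """Map: emit (cell, species) keyed on the one-degree cell of each record."""
        for species, lat, lon in records:
            cell = (math.floor(lat), math.floor(lon))
            yield cell, species

    def reduce_phase(pairs):
        """Reduce: count distinct species per cell."""
        species_by_cell = defaultdict(set)
        for cell, species in pairs:
            species_by_cell[cell].add(species)
        return {cell: len(names) for cell, names in species_by_cell.items()}

    print(reduce_phase(map_phase(occurrences)))
    # {(42, -71): 2, (-34, 151): 1} -- two species share the one-degree cell near 42°N, 71°W.

In a real deployment the map and reduce phases run on different machines; because the map output is keyed, any number of nodes can process disjoint slices of the 135 million records and the per-cell results still combine into one unified map.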
The meta- data standard Darwin Core [95] and the related GBIF- Distributed processing developed Darwin Core Archive [96] file format, in par- Using common frameworks to share processing globally ticular, address most of the categories of data discussed enables researchers to use far more computing power above. The current Darwin Core standard is designed to than had been previously possible. By applying this addi- be used in XML documents for exchanging data and is tional computing power, projects can leverage techni- the most common format for sharing zoological data. It ques such as data mining to open up new avenues of is also worth noting that the Darwin Core Archive file exploration in biodiversity research. After computational format is explicitly designed to address the most com- targets are selected, individual jobs are distributed to mon problems that researchers encounter in this space, processing nodes until all work is completed. Connect- but not necessarily all of the potential edge cases that ing disparate system only requires a secure remote con- could be encountered in representing and exchanging nection, such as OpenSSH [86], over the internet. biodiversity data. Apache Hadoop [87] is a popular open source project The TAPIR [97] standard specifies a protocol for from Yahoo that provides a Java software framework to querying biodiversity data stores and, along with DiGIR support distributed computing. Running on large clus- and BioCASE, provides a generic way to communicate ters of commodity hardware, Hadoop uses its own dis- between client applications and data providers. Biodiver- tributed file system (HDFS) to connect various nodes sity-specific standards for data observations and descrip- and provides resilience against intermittent failures. Its tions are also being developed. For example, the EnvO computational method, map/reduce [88], passes small [98] standard is an ontology similar to Darwin Core that fragments of work to other nodes in the cluster and defines a wide variety of environments and some of directs further jobs while aggregating the results. their relationships.
Community acceptance of standards is an ongoing process and different groups will support various ones at different times. The process for determining a biodiversity-specific standard is generally organized by the Biodiversity Information Standards community (organized by TDWG), centered on a working group or mailing list, and often has enough exposure to ensure that those with a stake in the standard being discussed will have an opportunity to comment on the proposed standard. This process ensures that standards are tailored to those who eventually use them, which further facilitates community acceptance. Nomenclatural codes in taxonomy, for example, are standardized, and members of this discipline recognize the importance of these standards.

Barriers to data management and preservation

Technological barriers
We distinguish between technological and social/cultural barriers to effective data sharing, management and preservation. Technological barriers are those that are primarily due to devices, methods and the processes and workflows that use those devices and methods. Social/cultural barriers are those that arise from both explicit and tacit mores of the larger community as well as procedures, customs and financial considerations of the individuals and institutions that participate in the data management/preservation.

The primary technological cause of loss of biodiversity data is the poor archiving of raw observations. In many cases raw observations are not explicitly made or they are actively deleted after more refined taxon description information has been created from them. The lack of archival storage space for raw observations is a key cause of poor archiving, but simply having a policy of keeping everything is not scalable. At minimum there should be strict controls governing what data are archived. There must also be a method for identifying data that either are not biodiversity-related or have little value for future biodiversity research. Lack of access to primary biodiversity data can be due to a lack of technical knowledge by the creators of the data regarding how to digitally publish them, and often this is caused by a lack of funding to support publication and long-term preservation. In cases in which data do get published, there remain issues surrounding maintenance of those data sources. Hardware failures and format changes can result in data becoming obsolete and inaccessible if consideration is not given to hardware and file format redundancy.

Another primarily technological barrier to the access and interpretation of biodiversity data is that much of the key literature is not available digitally. Projects such as the BHL are working to address this problem. BHL has already scanned approximately 30 million pages of core biodiversity literature. The project estimates that there are approximately 500 million total pages of core biodiversity literature. Of that, 100 million are out of copyright and are waiting to be scanned by the project.

Interoperability presents a further technological barrier to data access and interpretation. In many cases primary biodiversity data cannot be easily exchanged between a source system and a researcher's system. The standards discussed here help to address this problem, but must be correctly implemented in both the source and destination systems.

Finally, there is no existing general solution for user-driven feedback mechanisms in the biodiversity space. Misinterpretation of biodiversity data is largely the result of incomplete raw observations and barriers to the propagation of data. The process of realizing that two raw observations refer to the same taxon (or have some more complex relationship) takes effort, as does the propagation of that new data. To be successful, such a system would need to be open to user-driven feedback mechanisms that allow local changes for immediate use, which propagate back to the original source of the data. Much of this data propagation could be automated by a wider adoption of standards and better models for the dynamic interchange and caching of data.

Social and cultural barriers
The social and cultural barriers that cause data loss are well known and not specific to biodiversity data [99]. In addition to outright, unintentional data loss, there are well known social barriers to data sharing. Members of the scientific community can be afraid to share data since this might allow other scientists to publish research results without explicitly collaborating with or crediting the original creator(s) of the data. One approach for addressing this concern is forming data embargoes, in which the original source data are rendered unavailable for a specified period of time relative to its original collection or subsequent publication. In addition, there are no clear standards for placing a value on biodiversity data. This results in radically different beliefs about the value of such data, and in extreme cases can lead scientists to assert rights over data with expensive or severely limited licensing terms.

The financial cost of good archiving and the more general cost of hardware needed, software development and the cost of data creation are all other forms of social barriers to the preservation of biodiversity data [100]. The cost of creating new raw observations continues to drop, but some types of raw observational data are still extremely expensive or otherwise difficult to acquire, especially if significant travel or field work are required to create the data. Similarly, access to museum
specimens or living organisms can be crucial to accurately interpreting existing data. There are efficient mechanisms to facilitate such access within the academic sphere, but for scientists outside of the academic world this can be a significant barrier to creating more meaningful data.

Overcoming these barriers will require effective incentives. Ultimately the advantages of having shared, centralized access to the data should serve as its own incentive. This has already happened for genetic sequence data as reflected by the widespread adoption of GenBank [18]. In comparison, biodiversity data have a much greater historical record than sequence data and, more importantly, a set of legacy conventions that were largely created outside of the context of digital data management. As a result, investment in accurately modeling these conventions and providing easy-to-use interfaces for processing data that conform to these conventions is likely to have greater return in the long run. Providing more direct incentives to the data creators could be valuable as a way to get the process started.

Legal and mandatory solutions
Major US grant funding organizations including the National Institutes of Health [101] and the National Science Foundation [102] are now requiring an up-front plan for data management, including data access and archiving as part of every grant proposal. The National Science Foundation is taking steps to ensure that data from publicly funded research are made public [103]. Such policies, although designed to ensure that data are more accessible, have implications for data archiving. Mandatory data archiving policies are likely to be very effective in raising awareness of issues surrounding data loss and archiving. However, such policies are neither strictly necessary to ensure widespread adoption of data archiving best practices, nor are they a sufficient solution on their own. Adoption of these policies will assist in ensuring that a project's data are available and archivable.

Recommendations and discussion

Here we recommend five ways in which the community can begin to implement the changes required for sustainable data preservation and access: (1) encourage the community's use of data standards, (2) promote the public domain licensing of data, (3) establish a community of those involved in data hosting and archival, (4) establish hosting centers for biodiversity data, and (5) develop tools for data discovery.

We prioritize recommendations here from short-term (immediate to 1 year), medium-term (1 to 5 years), to long-term (5 or more years). There is inter-dependency between the recommendations, as having completed one recommendation will make it easier to pursue others.

1. Short-term: encourage community use of data standards
The biodiversity community must work in the short term with associations such as the TDWG standards body to strongly promote biodiversity standards in data transfer, sharing and citation. Key players, such as data providers, must ensure that they work to adopt these standards as a short-term priority and encourage other projects to follow suit. The adoption of standards is paramount to the goal of achieving sustainable data preservation and access and paves the way for interoperability between datasets, discovery tools and archiving systems [104]. The processes for standards establishment and community acceptance need to be as flexible and inclusive as possible, to ensure the greatest adoption of standards. Therefore, an open, transparent and easily accessible method for participating in the standards process is important to ensure that standards meet the community's needs. An interesting example is the use of unique identifiers. Popular solutions in the biodiversity community include GUIDs [105], URLs [106] and LSIDs [107]. This debate has not resolved on a single agreed-on standard and it is not clear that it ever will. Therefore, it is important that BDHCs support all such systems, most likely through an internally managed additional layer of indirection and replication of the sources when possible.
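One way to read that 'internally managed additional layer of indirection' is as a resolver table mapping an archive's own stable identifiers to whatever external identifier schemes a record arrived with. The Python sketch below is an assumption about how such a layer could work, not a description of any deployed system, and the external identifiers shown are deliberately fictitious.

    import uuid

    class IdentifierResolver:
        """Map internally minted UUIDs to the external identifiers a record arrived with."""

        def __init__(self):
            self._external_ids = {}   # internal id -> list of external identifiers

        def mint(self, *external_ids: str) -> str:
            """Create a new internal identifier covering one or more external ones."""
            internal = str(uuid.uuid4())
            self._external_ids[internal] = list(external_ids)
            return internal

        def resolve(self, internal: str) -> list:
            return self._external_ids.get(internal, [])

    resolver = IdentifierResolver()
    # A record may carry several identifier schemes at once; all are preserved.
    rec_id = resolver.mint(
        "urn:lsid:example.org:taxname:12345",   # an LSID (fictitious)
        "https://data.example.org/taxon/12345", # a URL (fictitious)
    )
    print(rec_id, resolver.resolve(rec_id))

Because the internal identifier never changes, the archive can keep serving a record even when one of the external identifier schemes it arrived with falls out of use.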
2. Short-term: promote public domain licensing of content
To minimize the risk of loss, biodiversity data, including the images, videos, classifications, datasets and databases, must all be replicable without asking permission of any rights holders associated with that data. Such data must be placed under as liberal a license as possible or, alternatively, placed into the public domain. This position should be adopted in the short term by existing data providers. To avoid long-term issues, these licenses should allow additional data replication without seeking permission from the original rights holders. A variety of appropriate licenses are available from Creative Commons [108]. 'Non-commercial' licenses should be avoided if possible because this makes it more difficult for the data to be shared and used by sites that may require commercial means to support their projects. In these cases, an open, 'share-alike' license that prevents the end user from encumbering the data with exclusive licenses should be used.

Open source code should be stored in both an online repository, such as github [109] or Google code [110], and mirrored locally at a project's institution for redundancy. Making open source code publicly available is an important and often overlooked step for many projects and must be encouraged by funders, not only for the sake of community involvement in a project, but to allow others to maintain and further develop the software should a project's funding conclude. The use of open source software to support the hosting and development of biodiversity informatics tools and projects should also be encouraged by funders. Using commonplace, open source software lowers the barrier to entry for collaborators, further increasing the chances of a project being maintained after funding has ended.

3. Short-term: develop data hosting and archiving community
Encouraging the use of standards and data preservation practices in the wider biodiversity community requires an investment in training resources. In the short term, the community, including projects, institutions and organizations such as GBIF and TDWG, should establish online resources to familiarize researchers with the process of storing, transforming and sharing their data using standards, as well as with the concepts of data preservation and data hosting. Such resources may include screen casts, tutorials and white papers. Even better, intensive, hands-on training programs have proven to be effective in educating domain experts and scientists in technical processes, as can be seen in courses such as the Marine Biology Laboratory's Medical Informatics course, which teaches informatics tools to medical professionals [111]. Funders and organizations such as GBIF should encourage the establishment of such courses for biologists and others working in biodiversity. Partners in existing large-scale informatics projects, such as those of the Encyclopedia of Life or the Biodiversity Heritage Library, as well as institutions such as GBIF regional nodes, would be well placed with the expertise and resources to host such courses. An alternative hands-on method that has been effective is to embed scientists within informatics groups [112].

For hosts of biodiversity data, technical proficiency is important for any information technology project, and data archiving and sharing are concepts that most systems engineers are capable of facilitating. Understanding the need for knowledge transfer, a communal group that is comfortable with email and mailing lists is critical for a project's success. Communication between the group can be fostered via the use of an open/public wiki, a shared online knowledge base that can be continuously edited as the tools, approaches and best practices change.

Community input and crowd-sourcing. Crowd-sourcing [113] is defined as the process of taking tasks that would normally be handled by an individual and opening them up to a larger community. Allowing community input into biodiversity datasets via comments, direct access to the data, adding to the data or deriving new datasets from the original data are all powerful ways of enhancing the original dataset and building a community that cares about and cares for the data. Wikipedia, which allows users to add or edit definitions of any article, is an example of a successful model on a large scale. In the case of biodiversity, there has been success with amateur photographers contributing their images of animals and plants and amateur scientists helping to identify various specimens. Crowd-sourcing is particularly notable as a scalable approach for enhancing data quality. Community input can also work effectively to improve the initial quality of the data and associated metadata [114].

In 2008, the National Library of Australia, as part of the Australian Newspapers Digitization Program (ANDP) and in collaboration with Australian state and territory libraries, enabled public access to selected out-of-copyright Australian newspapers. This service allows users to search and browse over 8.4 million articles from over 830,000 newspapers dating back to 1803. Because the scanned papers were processed with Optical Character Recognition (OCR) software, they could be searched. Although OCR works well with documents that have consistent typefaces and formats, historical texts and newspapers fall outside of this category, leading to less than perfect OCR. To address this, the National Library opened up the collection of newspapers to crowd-sourcing, allowing the user community the ability to correct and suggest improvements to the OCR text. The volunteers not only enjoyed the interaction, but also felt that they were making a difference by improving the understanding of past events in their country. By March 2010 over 12 million lines of text had been improved by thousands of users. The ongoing results of this work can be seen on the National Library of Australia's Trove site [115]. Biodiversity informatics projects should consider how crowd-sourcing could be used in their projects and see this as a new way of engaging users and the general public.

Development community to enable biodiversity sharing. A community of developers that have experience with biodiversity standards should be identified and promoted to assist in enabling biodiversity resources to join the network of BDHC. In general these developers should be available as consultants for funded projects and even included in the original grant applications. This community should also consider a level of overhead funded work to help enable data resources that lack a replication strategy to join the network. A working example of this can be found in task groups such as the GBIF/TDWG joint multimedia
A large-scale example of this approach comes from outside biodiversity. In 2008, the National Library of Australia, as part of the Australian Newspapers Digitization Program (ANDP) and in collaboration with Australian state and territory libraries, enabled public access to selected out-of-copyright Australian newspapers. The service allows users to search and browse over 8.4 million articles from over 830,000 newspapers dating back to 1803. Because the papers were scanned with Optical Character Recognition (OCR) software, they can be searched. Although OCR works well on documents with consistent typefaces and formats, historical texts and newspapers fall outside this category, leading to less-than-perfect OCR. To address this, the National Library opened the newspaper collection up to crowd-sourcing, allowing the user community to correct and suggest improvements to the OCR text. The volunteers not only enjoyed the interaction but also felt that they were making a difference by improving the understanding of past events in their country. By March 2010, over 12 million lines of text had been improved by thousands of users; the ongoing results of this work can be seen on the National Library of Australia's Trove site [115]. Biodiversity informatics projects should consider how crowd-sourcing could be used in their own work and see it as a new way of engaging users and the general public.

Development community to enable biodiversity sharing. A community of developers with experience in biodiversity standards should be identified and promoted to help biodiversity resources join the network of BDHC. In general, these developers should be available as consultants for funded projects and even be included in original grant applications. This community should also consider a level of overhead-funded work to help data resources that lack a replication strategy join the network. A working example can be found in groups such as the GBIF/TDWG joint multimedia resources task group, which is working to evaluate and share standards for multimedia resources relevant to biodiversity [116].

Industry engagement. Industry has shown, with services such as Amazon's Public Data Sets [117] and Google's Public Data Explorer [118], that there is great benefit in providing users with access to large public datasets. Even when access to a public dataset is simple, working with that dataset is often not as straightforward. Amazon provides a way for the public to interact with these datasets without the onerous groundwork usually required, and benefits in this relationship by selling its processing services to those doing the research. Engaging industry to host data and provide access methods offers tangible benefits to the wider biodiversity informatics community, both in its ability to engage users in developing services on top of the data, with dramatically lower barriers to entry than traditional methods, and from a data archival and redundancy perspective.

4. Medium-term: establish hosting centers for biodiversity data

Key players, such as institutions with access to appropriate infrastructure, should act in the medium term to establish biodiversity data hosting centers (BDHCs), which would most probably evolve out of existing projects in the biodiversity data community. These centers will need to focus on collaborative development of software architectures for biodiversity-specific datasets, studying the lifecycle of both present and legacy datasets. Consideration should be given to situating these data centers in geographically dispersed locations, providing an avenue for load sharing of data processing and content delivery tasks as well as physical redundancy. By setting up mechanisms for sharing redundant copies of data among hosting centers around the world, projects can take advantage of increased processing power: a project can distribute its computing tasks to other participating nodes, allowing informatics teams to use their infrastructure more effectively at times when it would otherwise be underused.

When funding or institutional relationships do not allow mutual hosting and sharing of data, and as a stop-gap before specialized data hosting centers become available, third-party options should be sought for the backup of core data and related code. By engaging online, offsite backup services, data backup can be automated simply and at low cost compared with the capital expense of onsite backup hardware.

When mutual hosting and serving of data is possible, geographically separated projects can take advantage of the low latency and high bandwidth available when data are served from a location close to the user. For example, if a project in Europe and a project in Australia each serve the other's data, both projects' data will be served from the closest respective location, reducing bandwidth and latency complications when accessing the data. The creation of specialized data archival and hosting centers, each centered on specific data types, would also provide for specialized archival of specific formats. For example, with many projects hosting DwC archive files within a central hosting center, the center could provide services to maintain the data and keep it in a current and readable format. By providing this service to a large number of data providers, the cost to each provider would be significantly lower than if each provider maintained these formats alone. Such centers would host standardized, specific formats and also provide a general data store for raw data yet to be transformed into a common format.

In establishing data hosting infrastructure, the community should pay attention to existing efforts in data management standards and hosting infrastructure, such as those being undertaken by the DataONE project, and find common frameworks on which to build biodiversity-specific solutions.

Geographically distributed processing. The sheer size of some datasets often makes it impractical to move them offsite for processing and makes it difficult to conduct research when the data are physically distant from the processing infrastructure. Implementing common frameworks for accessing and processing these data reduces the need to run complex algorithms and tools across slow or unstable internet connections. More researchers can be expected to access data than will have the resources to implement their own storage infrastructure for it, so it is important for projects to consider placing common processing frameworks on top of their data and providing API access to those frameworks, giving researchers who are geographically separated from the data access to higher-speed processing.

Data snapshots. Data hosting centers should track archival data access and updates in order to understand data access patterns. Given the rapid rate of change in datasets, consideration also needs to be given to 'snapshotting' data. A snapshot captures the state of a dataset at a particular time; the main dataset may then be added to, but the snapshot remains static, providing a revision history of the data. An example use case for snapshotting is an EOL species page. Species pages are dynamically generated datasets comprising user-submitted data and aggregated data sources, and as EOL data are archived and shared, species pages are updated and modified. Being able to retrieve the latest copy of a dataset may be less important than retrieving an earlier version of the same dataset, especially if relevant data have since been removed or modified. Archival mechanisms need to take snapshotting into account to ensure that historical dataset changes can be referenced, located and recovered. This formalized approach to data stewardship makes the data more stable and, in turn, more valuable.
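As a concrete illustration of snapshotting, the sketch below stores immutable, timestamped copies of a dataset file together with a SHA-256 checksum, so that any historical version can later be located and verified. It is a minimal sketch using only the Python standard library; the directory layout and manifest format are assumptions for illustration, not a description of how EOL or any hosting center implements snapshots.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

SNAPSHOT_ROOT = Path("snapshots")  # hypothetical archive location

def take_snapshot(dataset_path: str) -> Path:
    """Copy the current state of a dataset into an immutable, timestamped
    snapshot and record its checksum in a manifest for later verification."""
    src = Path(dataset_path)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest_dir = SNAPSHOT_ROOT / src.stem / stamp
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)                      # the snapshot is never edited

    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    manifest = {"source": str(src), "snapshot": str(dest),
                "taken_at": stamp, "sha256": digest}
    (dest_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return dest

def list_snapshots(dataset_path: str):
    """Return the manifests of available historical versions, oldest first
    (the timestamped directory names sort chronologically)."""
    src = Path(dataset_path)
    return sorted((SNAPSHOT_ROOT / src.stem).glob("*/manifest.json"))

# Usage: snapshot a Darwin Core archive before each aggregation run.
# take_snapshot("occurrences.dwca.zip")
```

A real hosting center would deduplicate storage, for example with content-addressed or copy-on-write filesystems, rather than keep full copies, but the contract is the same: every snapshot is static, checksummed and independently retrievable.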
Data and application archiving. Although some data exist in standardized formats (for example, xls, xml and DwC), others are available only through application-specific code, commonly a website or web service with non-standardized output. Often, such services store datasets in a way that is optimized for the application code and unreadable outside the context of that code. Project leaders need to consider whether data archiving on its own is sustainable or whether application archiving is also required. Applications that serve data should have a mechanism to export those data into a standard format. Legacy applications that cannot export data into a standard format need to be archived alongside the data, either in a long-term offline archive, where the code is simply stored with the data, or in an online archive, where the application remains accessible alongside the data. In both cases, BDHC need to store application metadata, such as software versions and environment variables, which are critical to ensuring that the application can be run successfully when required. Re-hosting an application of any complexity can be difficult, as newer software deprecates features found in older versions. One solution BDHC should consider is the use of virtual machines, which can be customized and snapshotted with the exact environment required to serve the application. Such solutions should, however, be seen as a stop-gap: a focus on raw data portability and accessibility is the preferred solution wherever possible.

Replicate indexes. Online data indexes, such as those from GBIF and the Global Names Index, provide a method for linking online data to objects, in this instance to names. Centrally storing an index enables users to navigate through it to huge datasets in centralized locations. These data are of particular value to the biodiversity data discovery process, as well as being generally valuable for many types of biodiversity research. It is recommended that index data be replicated widely; mirroring these indexes should be a high priority when establishing BDHC.

Tailored backup services. Given the growing quantity of data and informatics projects, there is scope for industry to provide tailored data management and backup solutions to the biodiversity informatics community, ranging from hosting services to online data backup and designed specifically for biodiversity-related data. By tailoring solutions to the file formats and standards used in the biodiversity informatics community, issues such as format obsolescence can be better managed. For example, when backing up a MySQL dump or a Darwin Core archive file, metadata associated with that file, such as the version of Darwin Core or MySQL and the date of file creation, can be stored with the backup and used later to migrate formats should a legacy format become obsolete. Such migrations would be difficult for a large number of small projects to undertake individually, so centralizing the process gives a greater number of projects access to such migrations and therefore to the associated data.

5. Long-term: develop data discovery tools

The abundance of data made available for public consumption provides a rich source for data-intensive research applications such as automated markup, text extraction, machine learning, visualization and digital 'mash-ups' of the data [119]. As a long-term goal for the GBIF secretariat and nodes, as well as for biodiversity informatics projects, this process should be encouraged as a natural incentive. Three such applications are geo-location, taxon finding and text-mining.

Services for extracting information. By extracting place names or geographic coordinates from a dataset, geo-location services can produce maps of a dataset and easily combine them with maps of additional datasets, providing mapping outputs such as Google Earth's [120] KMZ format. Taxon-finding tools can scan large quantities of text for species names, furnishing a dataset with metadata that describes which taxa are present in the data and where within the data each taxon can be found. Using natural language processing, automated markup tools can analyze datasets and return a structured, marked-up version of the data. uBio RSS [121] and RSS novum are two tools that offer automated recognition of species names within scientific and media RSS feeds; by identifying data containing species names, new datasets and sources of data can be discovered.
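The sketch below shows the simplest form of taxon finding: scanning text for strings shaped like Latin binomials and reporting each match with its character offsets, so the result can serve as location metadata for the dataset. The regular expression and sample text are deliberately naive assumptions made for illustration; production services such as uBio combine curated name dictionaries with NLP to avoid the false positives a bare pattern produces.

```python
import re

# Naive binomial pattern: a capitalized genus, or its abbreviation,
# followed by a lowercase epithet of three or more letters
# (e.g. "Amanita muscaria" or "A. muscaria").
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{3,})\b")

def find_taxa(text: str):
    """Return candidate species names with their positions in the text."""
    return [{"name": m.group(0), "start": m.start(), "end": m.end()}
            for m in BINOMIAL.finditer(text)]

sample = ("Specimens of Amanita muscaria were photographed near the site; "
          "a single Turdus migratorius was also recorded.")
for hit in find_taxa(sample):
    print(hit)
# {'name': 'Amanita muscaria', 'start': 13, 'end': 29} ...
```

Even this crude scanner is enough to flag documents worth passing to a proper name-resolution service, which is how such patterns can be folded into web crawlers to discover candidate biodiversity data sources automatically.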
Further development of these tools, and the integration of this technology into web-crawling software, will pave the way for automated discovery of online biodiversity data that might benefit from inclusion in the recommended network of biodiversity data stores. The development of web services and APIs on top of datasets, and investment in informatics projects that harvest and remix data, should all be encouraged, both as a means of enabling discovery within existing datasets and as an incentive for data owners to publish their data publicly. GBIF, funding agencies and biodiversity informatics projects should extend existing tools while developing and promoting new tools for data discovery. The community should develop a central registry of such tools and, where practical, provide hosting for them at data hosting centers.

Services to enable data push. When new data are published, there is a lag between publication and the harvesting of those data by other services, such as EOL or Google. Furthermore, even after articles are published, the data are in most cases available only in HTML format, with the underlying databases inaccessible. The development of services that allow new data to be 'pushed' to online repositories should be investigated, as this would make new data immediately available for harvesting by data aggregation services and data hosting centers. Once the data are discovered, they can be further distributed via standardized APIs, removing this burden from the original data provider.
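A push notification of this kind can be very small: rather than transferring the dataset itself, the provider tells an aggregator where a new version lives and how to verify it. The sketch below is a minimal client in Python using only the standard library; the endpoint URL, field names and expected response are entirely hypothetical, since no such standard service exists, and are assumptions made for illustration only.

```python
import json
import urllib.request

# Hypothetical aggregator endpoint: URL, payload fields and response
# semantics are assumptions for this sketch, not a real API.
AGGREGATOR_URL = "https://aggregator.example.org/api/v1/datasets"

def push_dataset_announcement(archive_url: str, checksum: str,
                              fmt: str = "dwc-archive") -> int:
    """Notify an aggregator that a new dataset version is available for
    harvest, instead of waiting for its crawler to find the data."""
    payload = json.dumps({
        "archive_url": archive_url,   # where the DwC archive can be fetched
        "sha256": checksum,           # lets the aggregator verify its copy
        "format": fmt,
    }).encode("utf-8")
    req = urllib.request.Request(
        AGGREGATOR_URL, data=payload, method="POST",
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status            # e.g. 201 if queued for harvest

# push_dataset_announcement("https://example.org/data/occurrences.zip",
#                           "9f2c...e1")
```

A push model of this kind pairs naturally with the caching behaviour described next: the aggregator verifies the checksum on receipt, stores its own copy, and the provider no longer carries the whole burden of distribution.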
Once data aggregators discover new data, they should cache the data and include them in their archival and backup procedures. They should also investigate how best to recover the content provided by a content provider should the need arise. The EOL project follows this recommended practice and includes such caching functionality, which allows EOL to serve aggregated data when the source data are unavailable while also becoming, in effect, a redundant backup of many projects' data: an invaluable resource should a project suffer catastrophic data loss or go offline.

Conclusion

As the biodiversity data community grows, and as standards development and access to affordable infrastructure mature, some of the above-mentioned challenges will be met in due course. However, the importance of the data and the real risk of data loss suggest that these challenges and recommendations must be faced sooner rather than later by key players in the community. A generic data hosting infrastructure designed for a broader scientific scope than biodiversity data may also have a role in the access and preservation of biodiversity data; indeed, the more widespread the data become, the safer they will be. The community should nevertheless ensure that the infrastructure and mechanisms it uses are specific enough to biodiversity datasets to meet both the short-term and long-term needs of the data and the community. As funding agencies' requirements for data management plans become more widespread, the biodiversity informatics community will become more attuned to the concepts discussed here.

As the community evolves to take a holistic and collective position on data preservation and access, it will find new avenues for collaboration, data discovery and data re-use, all of which will improve the value of its data. The community needs to recognize and proactively encourage this evolution by developing a shared understanding of what is needed to ensure the preservation of biodiversity data and then acting on the resulting requirements. We see the recommendations laid out in this paper as a step towards that shared understanding.

Acknowledgments
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 15, 2011: Data publishing framework for primary biodiversity data. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S15. Publication of the supplement was supported by the Global Biodiversity Information Facility.

Author details
1 Center for Library and Informatics, Woods Hole Marine Biological Laboratory, Woods Hole, MA 02543, USA. 2 Encyclopedia of Life, Center for Library and Informatics, Woods Hole Marine Biological Laboratory, Woods Hole, MA 02543, USA. 3 Center for Biodiversity Informatics (CBI), Missouri Botanical Garden, St Louis, MO 63119, USA. 4 Center for Biology and Society, Arizona State University, Tempe, AZ 85287, USA.

Competing interests
The authors declare that they have no competing interests.

Published: 15 December 2011

References
1. Scholes RJ, Mace GM, Turner W, Geller GN, Jürgens N, Larigauderie A, Muchoney D, Walther BA, Mooney HA: Towards a global biodiversity observing system. Science 2008, 321:1044-1045.
2. Güntsch A, Berendsohn WG: Biodiversity research leaks data. Create the biodiversity data archives! In Systematics 2008 Programme and Abstracts. Edited by Gradstein SR, Klatt S, Normann F, Weigelt P, Willmann R, Wilson R. Göttingen: Göttingen University Press; 2008.
3. Gray J, Szalay AS, Thakar AR, Stoughton C, Vandenberg J: Online Scientific Data Curation, Publication, and Archiving. 2002 [http://research.microsoft.com/apps/pubs/default.aspx?id=64568].
4. Whitlock MC, McPeek MA, Rausher MD, Rieseberg L, Moore AJ: Data archiving. Am Nat 2010, 175:145-146.
5. Global Biodiversity Information Facility Data Portal. [http://data.gbif.org].
6. Encyclopedia of Life. [http://eol.org].
7. GBIF: Data Providers. [http://data.gbif.org/datasets/providers].
8. Encyclopedia of Life: Content Partners. [http://eol.org/content_partners].
9. ScratchPads. [http://scratchpads.eu].
10. LifeDesks. [http://www.lifedesks.org/].
11. Heidorn PB: Shedding light on the dark data in the long tail of science. Library Trends 2008, 57:280-299.
12. Digital Preservation Coalition. [http://www.dpconline.org].
13. Library of Congress Digital Preservation. [http://www.digitalpreservation.gov].
14. Image Permanence Institute. [https://www.imagepermanenceinstitute.org].
15. The Dataverse Network Project. [http://thedata.org].
16. DataCite. [http://datacite.org].
17. Digital Curation Centre. [http://www.dcc.ac.uk].
18. GenBank. [http://www.ncbi.nlm.nih.gov/genbank/].
19. DNA Data Bank of Japan. [http://www.ddbj.nig.ac.jp].
20. European Molecular Biology Laboratory Nucleotide Sequence Database. [http://www.ebi.ac.uk/embl].
21. The US Long Term Ecological Research Network. [http://www.lternet.edu/].
22. Klump J: Criteria for the trustworthiness of data-centres. D-Lib Magazine 2011, doi:10.1045/january2011-klump.
23. Zins C: Conceptual approaches for defining data, information, and knowledge. Journal of the American Society for Information Science and Technology 2007, 58:479-493.
24. IPT Architecture PDF. [http://code.google.com/p/gbif-providertoolkit/downloads/detail?name=ipt-architecture_1.1.pdf&can=2&q=IPT+Architecture].
25. Doolan J: Proposal for a Simplified Structure for Emu. [http://www.kesoftware.com/downloads/EMu/UserGroupMeetings/2005%20North%20American/john%20doolan_SimplifiedStructure.pps].
26. Pyle R: Taxonomer: a relational data model for managing information relevant to taxonomic research. PhyloInformatics 2004, 1:1-54.
27. International Code of Botanical Nomenclature (Vienna Code) adopted by the seventeenth International Botanical Congress Vienna, Austria, July 2005. Edited by McNeill J, Barrie F, Demoulin V, Hawksworth D, Wiersema J. Ruggell, Liechtenstein: Gantner Verlag; 2006.
28. Flickr. [http://flickr.com].
29. Systema Dipterorum. [http://diptera.org].
30. Index Nominum Algarum Bibliographia Phycologica Universalis. [http://ucjeps.berkeley.edu/INA.html].
31. International Plant Names Index. [http://ipni.org].
32. Index Fungorum. [http://indexfungorum.org].
33. Catalog of Fishes. [http://research.calacademy.org/ichthyology/catalog].
34. Berners-Lee T, Hendler J, Lassila O: The Semantic Web. Sci Am 2001.
35. Hymenoptera Anatomy Ontology. [http://hymao.org].
36. TaxonConcept. [http://lod.taxonconcept.org].
37. Patterson DJ, Cooper J, Kirk PM, Remsen DP: Names are key to the big new biology. Trends Ecol Evol 2010, 25:686-691.
38. Catalogue of Life. [http://www.catalogueoflife.org].
39. World Register of Marine Species. [http://marinespecies.org].
40. Integrated Taxonomic Information System. [http://itis.gov].
41. Index to Organism Names. [http://organismnames.com].
42. Global Names Index. [http://gni.globalnames.org].
43. uBio. [http://ubio.org].
44. Natural History Museum Collections. [http://nhm.ac.uk/research-curation/collections].
45. Smithsonian Museum of Natural History: Search Museum Collection Records. [http://collections.nmnh.si.edu].
46. Biodiversity Heritage Library. [http://biodiversitylibrary.org].
47. Myco Bank. [http://www.mycobank.org].
48. Fishbase. [http://fishbase.org].
49. Missouri Botanical Garden. [http://mobot.org].
50. Atlas of Living Australia. [http://ala.org.au].
51. Wikipedia. [http://wikipedia.org].
52. BBC Nature Wildlife. [http://bbc.co.uk/wildlifefinder].
53. iSpot. [http://ispot.org.uk].
54. iNaturalist. [http://inaturalist.org].
55. Artportalen. [http://artportalen.se].
56. Nationale Databank Flora en Fauna. [https://ndff-ecogrid.nl/].
57. Mushroom Observer. [http://mushroomobserver.org].
58. eBird. [http://ebird.org].
59. BugGuide. [http://bugguide.net].
60. National Center for Biotechnology Information Taxonomy. [http://ncbi.nlm.nih.gov/taxonomy].
61. Tree of Life Web Project. [http://tolweb.org].
62. Interim Register of Marine and Nonmarine Genera. [http://www.obis.org.au/irmng/].
63. TreeBASE. [http://treebase.org].
64. PhylomeDB. [http://phylomedb.org].
65. Sustainable Digital Data Preservation and Access Network Partners (DataNet). [http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141].
66. Data Conservancy. [http://www.dataconservancy.org].
67. DataONE. [http://dataone.org].
68. National Oceanographic and Atmospheric Administration (NOAA). [http://noaa.gov].
69. Preliminary Principles and Guidelines for Archiving Environmental and Geospatial Data at NOAA: Interim Report by Committee on Archiving and Accessing Environmental and Geospatial Data at NOAA. USA: National Research Council; 2006.
70. Hunter J, Choudhury S: Semi-automated preservation and archival of scientific data using semantic grid services. In Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGRID '05). Volume 1. Washington, DC: IEEE Computer Society; 2005:160-167.
71. Bittorrent. [http://www.bittorrent.com/].
72. Langille MGI, Eisen JA: BioTorrents: a file sharing service for scientific data. PLoS ONE 2010, 5:e10071 [http://www.plosone.org/article/info:doi/10.1371/journal.pone.0010071].
73. BioTorrents. [http://www.biotorrents.net/faq.php].
74. OpenURL. [http://www.exlibrisgroup.com/category/sfxopenurl].
75. BHL OpenURL Resolver Help. [http://www.biodiversitylibrary.org/openurlhelp.aspx].
76. Open Archives Initiative. [http://www.openarchives.org/].
77. CiteBank. [http://citebank.org/].
78. LOCKSS. [http://lockss.stanford.edu/lockss/Home].
79. OAIS. [http://lockss.stanford.edu/lockss/OAIS].
80. Distributed Generic Information Retrieval (DiGIR). [http://digir.sourceforge.net].
81. Biological Collection Access Services. [http://BioCASE.org].
82. ABCD Schema 2.06 - ratified TDWG Standard. [http://www.bgbm.org/tdwg/CODATA/Schema/].
83. The Big Dig. [http://bigdig.ecoforge.net].
84. List of BioCASE Data Sources. [http://www.BioCASE.org/whats_BioCASE/providers_list.cfm].
85. Amazon Elastic Compute Cloud (Amazon EC2). [http://aws.amazon.com/ec2/].
86. OpenSSH. [http://www.openssh.com/].
87. Apache Hadoop. [http://hadoop.apache.org/].
88. Dean J, Ghemawat S: MapReduce: simplified data processing on large clusters. In OSDI'04: Sixth Symposium on Operating System Design and Implementation; 2004 [http://labs.google.com/papers/mapreduce.html].
89. Hadoop on Amazon EC2 to generate Species by Cell Index. [http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html].
90. Joint Photographic Experts Group. [http://jpeg.org].
91. W3C Portable Network Graphics. [http://w3.org/Graphics/PNG/].
92. Moving Picture Experts Group. [http://mpeg.chiariglione.org/].
93. Vorbis.com. [http://www.vorbis.com/].
94. Biodiversity Information Standards (TDWG). [http://www.tdwg.org].
95. Darwin Core. [http://rs.tdwg.org/dwc/].
96. Darwin Core Archive Data Standards. [http://www.gbif.org/informatics/standards-and-tools/publishing-data/data-standards/darwin-core-archives/].
97. TAPIR (TDWG Access Protocol for Information Retrieval). [http://www.tdwg.org/activities/tapir].
98. Environmental Ontology. [http://environmentontology.org].
99. Kabooza Global backup survey: About backup habits, risk factors, worries and data loss of home PCs. 2009 [http://www.kabooza.com/globalsurvey.html].
100. Hodge G, Frangakis E: Digital Preservation and Permanent Access to Scientific Information: The State of the Practice. 2004 [http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA432687].
101. National Institutes of Health. [http://www.nih.gov/].
102. National Science Foundation. [http://nsf.gov/].
103. Scientists Seeking NSF Funding Will Soon Be Required to Submit Data Management Plans. [http://www.nsf.gov/news/news_summ.jsp?cntn_id=116928&org=NSF&from=news].
104. Harvey CC, Huc C: Future trends in data archiving and analysis tools. Adv Space Res 2003, 32:347-353, doi:10.1016/S0273-1177(03)90273-0.
105. Leach P, Mealling M, Salz R: A universally unique identifier (UUID) URN namespace. 2005 [http://www.ietf.org/rfc/rfc4122.txt].
106. Berners-Lee T, Masinter L, McCahill M: Uniform Resource Locators (URL). 1994 [http://www.ietf.org/rfc/rfc1738.txt].
107. OMG Life Sciences Identifiers Specification. [http://xml.coverpages.org/lsid.html].
108. Creative Commons. [http://creativecommons.org].
109. GitHub. [https://github.com].
110. Google Code. [http://code.google.com].
111. Bennett-McNew C, Ragon B: Inspiring vanguards: the Woods Hole experience. Med Ref Serv Q 2008, 27:105-110.
112. Yamashita G, Miller H, Goddard A, Norton C: A model for bioinformatics training: the marine biological laboratory. Brief Bioinform 2010, 11:610-615, doi:10.1093/bib/bbq029.
113. Howe J: Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business. Random House Business; 2008.
114. Hsueh P-Y, Melville P, Sindhwani V: Data quality from crowdsourcing: a study of annotation selection criteria. In Proceedings of the NAACL HLT Workshop on Active Learning for Natural Language Processing. Computational Linguistics; 2009:27-35 [http://portal.acm.org/citation.cfm?id=1564131.1564137].
115. National Library of Australia's Trove. [http://trove.nla.gov.au/].
116. GBIF/TDWG Multimedia Resources Task Group. [http://www.keytonature.eu/wiki/MRTG].
117. Public Data Sets on Amazon Web Services. [http://aws.amazon.com/publicdatasets/].
118. Google Public Data Explorer. [http://google.com/publicdata].
119. Smith V: Data Publication: towards a database of everything. BMC Research Notes 2009, 2:113.
120. Google Earth. [http://earth.google.com].
121. uBio RSS. [http://www.ubio.org/rss].

doi:10.1186/1471-2105-12-S15-S5
Cite this article as: Goddard et al.: Data hosting infrastructure for primary biodiversity data. BMC Bioinformatics 2011, 12(Suppl 15):S5.