A presentation given by Dave Roberts and coauthored by David King, Simon Rycroft, David Morse, Lyubomir Penev, Donat Agosti & Vince Smith. It was given at the Fourth Metadata and Semantics Research Conference (MTSR 2010) at Alcalá de Henares, Madrid, on the premises of the Faculty of Law.
Community web sites: small pieces loosely joined
1. ViBRANT
Virtual Biodiversity
Community web sites: small pieces loosely joined
Dave Roberts, David King, Simon Rycroft, David Morse,
Lyubomir Penev, Donat Agosti & Vince Smith
SEVENTH FRAMEWORK PROGRAMME e-infrastructure
3. ViBRANT
Virtual Biodiversity
Small pieces loosely joined
Has many potential meanings:
Joining contributors together to form communities
Joining the data together that go towards forming a Scratchpad
Joining Scratchpad content with the landscape of biodiversity informatics data on the web
4. ViBRANT
Virtual Biodiversity
Addressing the challenges of taxonomy
Goal ...
Inventory the Earth’s species
Document their relationships
“Publish” & apply these data
Data set ...
1.8 M described spp. (10M names)
300M pages (over last 250 years)
1.5-3B specimens
People ...
4-6,000 taxonomists
30-40,000 “pro-amateurs”
Many more citizen scientists?
5. ViBRANT
Virtual Biodiversity
I The technology must largely embody the cause–effect relationship connecting problem to solution.
II The effects of the technological fix must be assessable using relatively unambiguous or uncontroversial criteria.
III Research and development is most likely to contribute decisively to solving a social problem when it focuses on improving a standardized technical core that already exists.
Sarewitz and Nelson (2008) Three rules for technological fixes. Nature, 456: 871-872
6. ViBRANT
Virtual Biodiversity
Biodiversity - a kind of washing powder?
When 2010 was named as the "year of biodiversity" by the UN, it began with a plea to save the world's ecosystems.
UN Secretary-General Ban Ki-moon said: "Biological diversity underpins ecosystem functioning... its continued loss, therefore, has major implications for current and future human well-being."
Recently, members of the public were asked what biodiversity is. The most common answer was "some kind of washing powder".
http://www.bbc.co.uk/news/science-environment-11546289 15 October 2010
7. ViBRANT
Virtual Biodiversity
Addressing the challenges of biodiversity informatics
“…the field [of biodiversity informatics] appears to be growing in
a void of overarching, motivating questions, effectively making it
a set of technologies in search of questions to address.”
Peterson et al, Syst. & Biodiv. 2010
8. ViBRANT
Virtual Biodiversity
Scratchpads
http://scratchpads.eu
Hosted websites for taxonomists
Research & publication platform
Modular (Drupal) & flexible
Supports the taxonomic workflow
Bottom-up design, agile dev.
Ecosystem of communities (185)
2,350+ users (unpaid) from 2007
ViBRANT follow on, €4.75M
9. ViBRANT
Virtual Biodiversity
Figure labels: Taxonomy & Literature; DNA, Phylogeny & Specimens; eBooks; eJournals; Image Galleries; Societies & Organizations
2.3k users, 58 countries, 268k pages
185 "Virtual Research Communities"
EDIT, GBIF, NHM, & EOL
Platform for biodiversity research & data publication
Changing the nature of collaboration
Expanding opportunities to participate in science
10. ViBRANT
Virtual Biodiversity
A website for you & your community
Magic
Your data Your web site
18. ViBRANT
Virtual Biodiversity
Static web pages
Web fora with e-mail integration
Newsletters with e-mail integration
User blogs
19. ViBRANT
Virtual Biodiversity
Import from CSV text file to any
content type
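A minimal sketch of the idea of a CSV import, for illustration only. It is not Scratchpads or Drupal code: the function name `rows_to_nodes`, the `type`/`fields` layout and the sample rows are all assumptions made up for this example.

```python
import csv
import io


def rows_to_nodes(csv_text, content_type):
    """Turn each CSV row into a dict tagged with a target content type.

    Hypothetical layout: {"type": ..., "fields": {column: value}}.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    nodes = []
    for row in reader:
        # Keep only named columns that actually hold a value.
        fields = {
            key.strip(): value.strip()
            for key, value in row.items()
            if key is not None and isinstance(value, str) and value.strip()
        }
        nodes.append({"type": content_type, "fields": fields})
    return nodes


# Made-up sample data: a header row followed by two records.
SAMPLE = "title,locality\nBanchus falcatorius,Old World\nExetastes sp.,\n"
nodes = rows_to_nodes(SAMPLE, "specimen")
for node in nodes:
    print(node)
```

The same rows could be tagged with any other content type by changing the second argument, which is the point the slide makes.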
20. ViBRANT
Virtual Biodiversity
ViBRANT Products
A Virtual Research Environment (Scratchpads) where users can
safely store, share and manage their research information.
Analytical services for users to build identification keys and
phylogenetic trees.
A publication platform for users to automatically compile taxonomic
manuscripts from their research database.
A portal for users to centrally access publicly accessible biodiversity
research information and literature.
Training, support & sociological study, helping research communities
to use these tools and services.
A standards-compliant technical architecture that can be sustained by the
biodiversity research community.
21. ViBRANT
Virtual Biodiversity
[Figure: The “chromosome” systems platform, with the Scratchpads Virtual Research Environment at the centre.]
Work packages shown: WP2. Architecture; WP3. Training; WP4. Standards; WP5. Data; WP6. Publishing; WP7. Literature; WP8. Mobilisation
Component labels: Training & outreach programme; Biodiversity data standards; Networking; User feedback; Controlled vocabulary; User sociology study; Data aggregation portal; Field recording support; GBIF integration activities; Citizen science programme; Biodiversity visualisation layers; Distributed Scratchpad hosting; Phylogenetic analysis; Bioclimatic modelling; Software module integration; Identification tools; Sustainability plan; Matrix data editor; Communal biodiversity literature; Biodiversity data publishing; Biodiversity literature markup; Scholarly manuscript publishing; Biodiversity datamining
Labels that could not be paired: Service; Research; & metrics
22. ViBRANT
Virtual Biodiversity
Biodiversity literature
looks like this
Cues
Indented text
UPPER CASE TEXT
Bold text
Italic text
Latin
Keywords
Symbols
23. ViBRANT
Virtual Biodiversity
Adobe Reader has this
M
BRITISH MUSEUM
(NATURAL HiSi
26JU
PRESENTED
GENERAL UC.-lARY
Bulletin ofthe
BritishMuseum (Natural History)
The ichneumon-fly genus Banchus
in the OldWorld
(Hymenoptera)
M. G. Fitton
series
Entomology
Vol51 Nol 25 July 1985
24. ViBRANT
Virtual Biodiversity
Lura (BHL) has this
M
BRITISH MUSEUM
(NATURAL HiSi
26 JU
PRESENTED
GENERAL UC.-lARY
Bulletin of the
British Museum (Natural History)
The ichneumon-fly genus Banchus
(Hymenoptera) in the Old World
M. G. Fitton
Entomology series
Vol51 Nol 25 July 1985
25. ViBRANT
Virtual Biodiversity
But choice of XML schema is important
ABBYY XML is very detailed
This line of text has 202 bytes:
The Bulletin of the British Museum (Natural History), instituted in
1949, is issued in fourscientific series, Botany, Entomology,
Geology (incorporating Mineralogy) and Zoology,and an
Historical series.
To encode in ABBYY XML format this line requires 45,533 bytes.
There are 84,263 lines in the document from which this example
was taken.
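The overhead can be put into numbers with a small calculation. The three figures come from the slide; the whole-document estimate assumes every line costs as much as the sample line, which is an assumption made here and unlikely to hold exactly.

```python
PLAIN_BYTES = 202       # the sample line as plain text (from the slide)
ABBYY_BYTES = 45_533    # the same line encoded as ABBYY XML (from the slide)
LINES_IN_DOC = 84_263   # lines in the source document (from the slide)

# How many times larger the XML encoding is than the plain text.
expansion = ABBYY_BYTES / PLAIN_BYTES

# Assumption: every line is as expensive as the sample line.
estimated_total = ABBYY_BYTES * LINES_IN_DOC

print(f"expansion factor: about {expansion:.0f}x")
print(f"rough document size: about {estimated_total / 1e9:.1f} GB")
```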
26. ViBRANT
Virtual Biodiversity
Look for taxon names
Used uBio FindIT web service
Overall excellent
Especially as it adds a NameBank ID
But still some oddities
Genus = ‘The’
The scutellum
The primitive
Species or Author = ‘and’
Exetastes and
B[anchus] falcatorius and
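Oddities like these could be screened out after the name-finding step. The sketch below is one possible post-filter, not part of uBio FindIT and not necessarily what the authors did; the word list and the function name `is_plausible_name` are assumptions for illustration.

```python
# Illustrative list of ordinary English words that should never be
# a genus or a species epithet. Not exhaustive.
COMMON_WORDS = {"the", "and", "of", "in", "a", "an"}


def is_plausible_name(candidate):
    """Reject a candidate name if any of its parts is a common English word."""
    parts = candidate.split()
    if not parts:
        return False
    return all(part.lower() not in COMMON_WORDS for part in parts)


# Candidates taken from the slide, plus one well-formed name for contrast.
candidates = [
    "The scutellum",
    "The primitive",
    "Exetastes and",
    "B[anchus] falcatorius and",
    "Banchus falcatorius",
]
kept = [name for name in candidates if is_plausible_name(name)]
print(kept)
```

A gentler variant would trim the trailing "and" rather than drop the whole candidate, which would keep "Exetastes" as a genus hit.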
27. ViBRANT
Virtual Biodiversity
Look for paragraph types
Simple keyword matching
Surprisingly effective!
Issue – can identify start, but not end…
Follow up work
Punctuation
Concepts
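A minimal sketch of keyword matching on paragraph openings. The cue words and labels are hypothetical, chosen only to illustrate the approach. It also shows the issue noted above: a label is carried forward until the next cue, so the end of a section is never detected.

```python
# Hypothetical cue words; a real list would come from the literature itself.
KEYWORDS = {
    "description": ("description", "diagnosis"),
    "distribution": ("distribution", "range"),
    "material": ("material examined", "holotype"),
}


def classify(paragraph):
    """Return a label if the paragraph opens with a known cue, else None."""
    head = paragraph.strip().lower()
    for label, cues in KEYWORDS.items():
        if any(head.startswith(cue) for cue in cues):
            return label
    return None


def label_paragraphs(paragraphs):
    """Carry the most recent label forward until another cue appears."""
    current = None
    labelled = []
    for paragraph in paragraphs:
        found = classify(paragraph)
        if found is not None:
            current = found
        labelled.append((current, paragraph))
    return labelled


sample = [
    "DESCRIPTION. Head black.",
    "Legs yellow.",
    "Distribution. Europe.",
    "Acknowledgements follow here.",  # wrongly inherits "distribution"
]
labels = [label for label, _ in label_paragraphs(sample)]
print(labels)
```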
28. ViBRANT
Virtual Biodiversity
Look for other proper names
Biologia Centrali-Americana has a gazetteer
Most journals do not
Generic solution = OpenCalais
Good accuracy
Old countries
D.D.R.
West Germany
Continents
America
29. ViBRANT
Virtual Biodiversity
Ambiguities and Mis-identifications
New York
City
State
Washington
City
State
Lake George
City
Lake Victoria
City
Other Oddities
Persons
Surname only
Two part names
Van Veen
van Veen
Regions and Continents
East Africa
Africa
30. ViBRANT
Virtual Biodiversity
Negative spell checking
Go beyond stop words
Remove every word found in a spell dictionary, keeping only the words it does not recognise
Check:
Minor
Vulgar
Bulletin 27 from the Zoology Series reduced from 139,034 to 5,219 words
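A minimal sketch of negative spell checking, under one reading of the slide: ordinary dictionary words are discarded and only the tokens a spell dictionary does not recognise are kept as candidate names. The toy dictionary, the `keep_anyway` parameter and the sample sentence are assumptions for illustration. `keep_anyway` stands in for the "Check" items, words such as "minor" that are ordinary English but may also be part of a name.

```python
import re


def negative_spell_check(text, dictionary, keep_anyway=frozenset()):
    """Keep tokens the dictionary does not know, plus any listed exceptions."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [
        token
        for token in tokens
        if token.lower() not in dictionary or token.lower() in keep_anyway
    ]


# Toy word list standing in for a real spell dictionary.
DICTIONARY = {"the", "genus", "is", "common", "in", "europe", "and", "minor", "vulgar"}
TEXT = "The genus Banchus is common in Europe and Exetastes minor"

strict = negative_spell_check(TEXT, DICTIONARY)
checked = negative_spell_check(TEXT, DICTIONARY, keep_anyway={"minor"})
print(strict)
print(checked)
```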
31. ViBRANT
Virtual Biodiversity
Ligatures
INTRODUCTION.
Volume, one of five required for the enumeration of the Rhynchophora, was
THIS
commenced by Dr. Sharp in 1889 and is now concluded by myself. The study of the
" Otiorhynchinœ Alatse " has unfortunately been delayed for many years, during the
publication of Vol. IV. parts 4, 5, and 7, all of which are devoted to the Family
Curculionidœ. The present Volume, IV. part 3, includes the Subfamilies Attelabinae,
Pterocolinœ, Allocoryninee, Apioninœ, Thecesterninae, and Otiorhynchinre. The
Attelabinae are represented by 104 (88 new), the Pterocolinse by three (all new), the
Allocoryninse (a new subfamily) and Thecesterninse each by one, the Apioninae by
88 (84 new), and the Otiorhynchinae by 419 (340 new) species respectively; the total
number for the six subfamilies being 616 species, with 516 new, and forty new
genera. Amongst the 419 Otiorhynchinae, the apterous and winged forms are almost
equal in number, there being a preponderance of apterous terrestrial species
(Eupagoderes, Epicœrus, Epayriopsis, &c.) in the arid portions of Mexico and the
winged forms ÇExophthalmuS) &c.) becoming relatively more numerous in the forest
regions southward. Taking the Curculionidœ as a whole—the subfamilies
Curculioninae and Calandrinse, in addition to those worked out in the present
Volume,—the number of species enumerated altogether from Central America is as
follows :— Vol. IV. part 3, 616; IV. part 4, 1365; IV. part 5, 908; IV. part 7, 344 : total
3233. The three other families of Rhynchophora—the Brenthidae, Scolytidae, and
32. ViBRANT
Virtual Biodiversity
Ligatures
Otiorhynchinæ => Otiorhynchinœ Thecesterninæ => Thecesterninse
Alatæ => Alatse Apioninæ => Apioninae
Curculionidæ => Curculionidœ Otiorhynchinæ => Otiorhynchinae
Attelabinæ => Attelabinae Otiorhynchinæ => Otiorhynchinae
Pterocolinæ => Pterocolinœ Curculionidæ => Curculionidœ
Allocoryninæ => Allocoryninee Curculioninæ => Curculioninae
Apioninæ => Apioninœ Calandrinæ => Calandrinse
Thecesterninæ => Thecesterninae Brenthidæ => Brenthidae
Otiorhynchinæ => Otiorhynchinre Scolytidæ => Scolytidae
Attelabinæ => Attelabinae Anthribidæ => Anthribidae
Pterocolinæ => Pterocolinse Hispidæ => Hispida
Allocoryninæ => Allocoryninse Cassididæ => Cassididae
For the 24 æ there are: 11 ae; 5 œ; 5 se; 1 ee; 1 re; 1 a?;
So not a single correct rendering of the ligature, æ.
By contrast, the only example of œ in the page, Epicœrus, was correctly rendered.
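The tally can be reproduced from the 24 pairs listed on this slide. This is a sketch of one way to count them, not the authors' tooling. It relies on the ligature being the last character of every original and on the stem before it being rendered unchanged, which holds for this list.

```python
from collections import Counter

# Original word followed by its OCR rendering, as listed on the slide.
PAIRS = """
Otiorhynchinæ Otiorhynchinœ   Thecesterninæ Thecesterninse
Alatæ Alatse                  Apioninæ Apioninae
Curculionidæ Curculionidœ     Otiorhynchinæ Otiorhynchinae
Attelabinæ Attelabinae        Otiorhynchinæ Otiorhynchinae
Pterocolinæ Pterocolinœ       Curculionidæ Curculionidœ
Allocoryninæ Allocoryninee    Curculioninæ Curculioninae
Apioninæ Apioninœ             Calandrinæ Calandrinse
Thecesterninæ Thecesterninae  Brenthidæ Brenthidae
Otiorhynchinæ Otiorhynchinre  Scolytidæ Scolytidae
Attelabinæ Attelabinae        Anthribidæ Anthribidae
Pterocolinæ Pterocolinse      Hispidæ Hispida
Allocoryninæ Allocoryninse    Cassididæ Cassididae
"""

tokens = PAIRS.split()
pairs = list(zip(tokens[0::2], tokens[1::2]))


def rendering(original, ocr):
    """Return whatever the OCR produced in place of the ligature."""
    cut = original.index("æ")
    return ocr[cut:]


tally = Counter(rendering(original, ocr) for original, ocr in pairs)
for form, count in tally.most_common():
    print(count, form)
```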
34. ViBRANT
Virtual Biodiversity
Similar words?
denticulate => denticulata
Levenshtein distances of 1: 0,0,1
denticulate => reticulate
Levenshtein distances of 2: 3,2,0
denticulate => geniculate
Levenshtein distances of 2: 2,2,0
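The distances quoted above can be checked with a standard dynamic-programming implementation of Levenshtein distance. This is a textbook sketch, not the authors' code, and it reproduces only the overall distance, not the three numbers that follow each colon on the slide.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


for other in ("denticulata", "reticulate", "geniculate"):
    print("denticulate", "=>", other, levenshtein("denticulate", other))
```

The example makes the slide's question concrete: a spelling variant of the same term and two different descriptive terms all sit within a distance of 1 or 2, so edit distance alone cannot tell them apart.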
35. ViBRANT
Virtual Biodiversity
What did we achieve?
Marked up 11 volumes, i.e. 4,504 pages
Have robust workflow, can mark up a Bulletin in about 10-15
minutes. Choke point is call to OpenCalais web service
No manual intervention or review required: workflow is
scalable
Recognising taxon names:
Well, uBio gives us a good start, and we have techniques to
cluster ALL mis-spellings and variants with a valid taxon; but
not perfect, e.g. BanchusFabricius ends up in more than one
cluster
36. ViBRANT
Virtual Biodiversity
“making the Scratchpads better”
More reliable (e.g., distribute the servers)
More functional (e.g., phylogenetic & publication services)
Easier to use (better workflows)
Prettier (better graphical design - more intuitive)
More integrated (for data stored inside & outside the Scratchpad framework)
More sustainable (simple administration, distribute developers, development sandbox)
“making natural history better”
Easier to compile, manage and reuse your data
Easier to find and reuse other people's data
Promoting your data inside & outside the taxonomic community
Getting people to work for you (crowdsourcing)
37. ViBRANT
Virtual Biodiversity
[Figure: publishing workflow. Labels recoverable from the slide: Author; Manuscript preparation on a Scratchpad; Submit as XML; Publisher; Send to reviewers; Enhanced XML; Produce PDF; Public Enhanced HTML; PDF; Printed paper; Register with ZooBank, GBIF, EoL etc.]
38. ViBRANT
Virtual Biodiversity
Thank you for your
attention.
Any questions?