This document describes a project that extracts semantic metadata from biodiversity legacy literature through automatic text mining in order to enhance search. The project aims to transform the Biodiversity Heritage Library (BHL) into a next-generation social digital library by combining text mining, machine learning, visualisation and social media, with text mining used to generate semantic annotations for entities, types and relations. This semantic metadata allows more precise searching of BHL's collection than the current keyword-based search, helping users discover relevant information despite ambiguous search terms.
Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction
1. Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction
Riza Batista-Navarro, William Ulate, Jennifer Hammock, Georgios Kontonatsios, Trish Rose-Sandler and Sophia Ananiadou
5. Mining Biodiversity
• Transform BHL into a next-generation social digital library
• A multi-disciplinary approach
– Text Mining
– Machine learning
– History of Science
– Environmental History & Studies
– Library and Information Science
– Social Media
10/9/2015 Mining Biodiversity
6. What do we want to do?
• Social Media
• Visualisation
• Semantic Metadata
7. Biodiversity Heritage Library
• a consortium of botanical and natural history libraries
• stores digitised legacy literature on biodiversity
• currently holds 160,000 volumes (millions of pages of PDFs and OCR-generated text)
• open-access
8. Current features
• supports keyword-based search
• species names annotated and linked to the Encyclopedia of Life
• integrates automatic taxonomic name finding tools (uBio TaxonFinder)
• data access through export functionalities and Web services
11. What’s wrong with keyword-based search?
• Ambiguity!
• “Boxwood”: a historic place in Alabama, or the North American term for plants in the Buxaceae family?
• “Box”: a container, or the term for Boxwood in other English-speaking countries?
19. Automatic annotation by text mining (TM)
• Argo
– a Web-based, graphical TM workbench
– conforms to the Unstructured Information Management Architecture (UIMA) standard
– facilitates the straightforward integration of various analytics into workflows
– allows for the validation of annotations
21. Learning semantics
• Training of models using machine learning
– conditional random fields (CRFs) for sequence labelling
– learning the features of mentions and relations of interest based on labelled documents
• contextual features: surrounding, co-occurring words
• dictionary matches: presence of certain words in controlled vocabularies, e.g., Catalogue of Life, Phenotype and Trait Ontology, a gazetteer
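The two feature types named on this slide can be sketched as a token-level feature extractor of the kind fed to a CRF trainer. The vocabulary, window size and feature names below are illustrative assumptions, not the project's actual configuration.

```python
# Minimal sketch of CRF-style feature extraction: for each token we emit
# contextual features (surrounding words) and dictionary-match features
# (membership in a controlled vocabulary). A real system would pass these
# feature dicts to a CRF implementation such as CRFsuite.

# Illustrative stand-in for a controlled vocabulary (e.g., Catalogue of Life).
SPECIES_DICT = {"buxus", "sempervirens"}

def token_features(tokens, i):
    """Build the feature dict for tokens[i]."""
    word = tokens[i].lower()
    feats = {
        "word": word,
        "in_species_dict": word in SPECIES_DICT,  # dictionary match
    }
    # Contextual features: one token of context on each side.
    feats["prev_word"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_word"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

def sentence_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

feats = sentence_features(["Buxus", "sempervirens", "grows", "slowly"])
```

In a trained model, features like `in_species_dict` combined with the surrounding-word context are what let the CRF assign entity labels to whole mention sequences rather than isolated words.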
27. Conclusions
• Literature is a rich source of information but difficult to search
• Keyword-based search is not enough to address ambiguity
• Semantic metadata allows for more accurate searching
• Semantic metadata can be extracted using text mining tools
• The Argo text mining workbench facilitates the construction of custom semantic metadata generation workflows
Editor’s notes
Most of us in the biodiversity informatics community are reliant on curated databases such as EOL (click) and NCBI Taxonomy (click).
Indeed, they are some of the most fundamental sources of structured information that is critical to understanding biodiversity (click)
Another rich, albeit less exploited resource is biodiversity literature (click) which provides possibly even more comprehensive information, considering that any significant findings have most likely been published in one form of writing or another: in reports, articles, books or monographs.
However, unlike curated databases which provide information in a structured, readily computable form, literature collections are characterised by copious textual data expressed in natural language. This unstructured and voluminous nature of literature makes it difficult to find information of interest, thus posing a barrier to knowledge accessibility and discovery (click).
As many of you know, the Biodiversity Heritage Library or BHL holds the biggest literature collection on biodiversity.
In this talk, I will be describing our work on how we are extracting semantic content from BHL and putting it in a structured form that is a lot easier to access and search (click), and how we’re using text mining as the enabling technology for this (click).
We are doing this work as part of a project funded by the transatlantic Digging Into Data program called Mining Biodiversity.
In a nutshell, we have incorporated into BHL three elements, as part of the Mining Biodiversity project: Visualisation, Social Media and Semantic Metadata.
The rest of this talk will be focussing on the extraction of semantic metadata aspect (click).
One might say, I’m currently very much happy with how I’m searching BHL. What’s wrong with keywords?
Well then, the answer to that is ambiguity!
If one searches for “Boxwood”, a keyword-based system wouldn’t know whether the user means the place in Alabama or the North American term for plants in the Buxaceae family; it will simply return all documents pertaining to both.
Nor will it know whether the query “Box” refers to a container or to the same plant family, which is what other English-speaking countries call it.
Or “California bay”. A keyword-based system will not know if the user is referring to the hardwood tree or some location.
What about “Drum”? Is it a fish or a musical instrument?
“Emperor” too. It wouldn’t know if the user wants the fish or a person.
Even “scrambled eggs”: is it breakfast or the plant known by that name?
To alleviate such issues we are enriching BHL content with semantic metadata.
To this end, we are marking up mentions of different entity and association types within text.
For entities, we are capturing species, locations, habitats, anatomical parts, qualities, people and temporal expressions.
To capture associations, we link up these entities to encapsulate relationships such as observation, habitation and nutrition.
So why does semantic information help?
With semantic categorisation of terms, for example, if a user specified that he/she is looking for California bay in the SPECIES sense of the term, the system knows it should look for documents which contain a species entity of that name.
And if the user specifies he/she is looking for a LOCATION called California bay, then similarly the system knows it should look for documents in which “California bay” has been annotated as a name of a place or location.
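The disambiguation described in these two notes can be sketched with a toy index in which each mention carries an entity type, so a query combines a term with its intended sense. The documents and annotations below are invented for illustration; the real project stores such annotations in a Solr index.

```python
# Toy semantic index: each entry maps a surface term plus an entity type to
# the documents in which the term was annotated with that type. A fielded
# query (term + type) then retrieves only the matching sense.

# Hypothetical annotations; in the real system these come from text mining.
INDEX = {
    ("california bay", "SPECIES"): {"doc1", "doc3"},
    ("california bay", "LOCATION"): {"doc2"},
}

def semantic_search(term, entity_type):
    """Return documents where `term` was annotated as `entity_type`."""
    return INDEX.get((term.lower(), entity_type), set())

# The same keyword yields different documents depending on the chosen sense.
species_hits = semantic_search("California bay", "SPECIES")
location_hits = semantic_search("California bay", "LOCATION")
```

A plain keyword index would conflate both result sets; keying on the entity type is what separates the hardwood tree from the place name.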
In fleshing out the semantics from BHL documents, we took a text mining-based approach, the overall architecture of which is depicted in this figure (click).
Firstly, we set aside a seed set of documents which were manually annotated (click).
This set was used by our system to learn the semantics, i.e., entities and associations, in the documents (click).
The system then applies what it learns on unlabelled documents (click).
The annotations the system produces on these documents are then validated manually by an expert (click).
Whatever corrections the expert makes are fed back into the system and used to retrain it, in order to improve its performance (active learning).
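The train–annotate–validate–retrain cycle described in the preceding notes can be sketched as a generic active-learning loop. The "model" here is a trivial dictionary-based stand-in, purely to illustrate the control flow, not the project's CRF-based system.

```python
# Skeleton of the active-learning loop: train on labelled seed data,
# annotate unlabelled documents, have an expert correct the output, fold
# the corrections back into the training set, and retrain.

def train(labelled):
    # Trivial stand-in model: remembers which tokens were labelled SPECIES.
    return {tok for tok, label in labelled if label == "SPECIES"}

def annotate(model, tokens):
    return [(tok, "SPECIES" if tok in model else "O") for tok in tokens]

def active_learning(seed, unlabelled_docs, expert_correct, rounds=2):
    labelled = list(seed)
    model = train(labelled)
    for _ in range(rounds):
        for doc in unlabelled_docs:
            predictions = annotate(model, doc)
            corrections = expert_correct(predictions)  # manual validation
            labelled.extend(corrections)               # feed corrections back
        model = train(labelled)                        # retrain on the union
    return model

# A (hypothetical) expert who knows "Buxus" is a species the seed missed.
expert = lambda preds: [(t, "SPECIES" if t == "Buxus" else l) for t, l in preds]
model = active_learning([("Quercus", "SPECIES")], [["Buxus", "grows"]], expert)
```

After one round of expert feedback the retrained model recognises the mention the seed model missed, which is the point of closing the loop before running the final system over the whole collection.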
When the performance of the system is satisfactory, we run the final version of the system on the whole BHL collection and (click) store all of the generated annotations or semantic metadata in a search index, e.g., Solr.
This index is what we’re using to complement the bibliographic metadata in BHL.
This is Argo’s main interface. Argo comes with a library of various text mining components, which you can see on the left panel.
Basically, these components can be dragged and dropped to the canvas in the middle which serves as a block diagramming tool.
The user can then arrange these components according to the desired order of processing, and interconnect them to form a pipeline or workflow.
What did we mean earlier by “learning semantics”? How does the text mining system or Argo workflow do this?
This is the workflow that we put together using Argo.
Without going too much into detail, I will just point out the general types of processing it tries to do: pre-processing (sentence splitting, tokenisation and part-of-speech tagging), matching against dictionaries or controlled vocabularies such as the ENVO and PATO ontologies, machine learning-based recognition of entities, extraction of relations based on the results of dependency parsing, and serialisation of the generated annotations.
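The processing stages listed in this note can be sketched as a chain of interchangeable components, in the spirit of (though far simpler than) a UIMA/Argo workflow. The components below are naive placeholders for real sentence splitters, taggers, dictionary matchers and entity recognisers.

```python
import re

# Minimal pipeline sketch: each component takes and returns a document
# dict, so components can be connected in the desired processing order,
# echoing how Argo workflows chain processing blocks on the canvas.

def sentence_split(doc):
    doc["sentences"] = [s.strip()
                        for s in re.split(r"(?<=[.!?])\s+", doc["text"]) if s]
    return doc

def tokenise(doc):
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

def dictionary_match(doc, vocab={"Buxus"}):
    # Naive stand-in for matching against ENVO/PATO-style vocabularies.
    doc["entities"] = [t for sent in doc["tokens"] for t in sent if t in vocab]
    return doc

def run_workflow(text, components):
    doc = {"text": text}
    for component in components:
        doc = component(doc)  # each stage enriches the shared document
    return doc

doc = run_workflow("Buxus grows slowly. It is evergreen.",
                   [sentence_split, tokenise, dictionary_match])
```

Because every stage reads and writes the same document structure, swapping in a different recogniser or adding a relation-extraction step is just a change to the component list, which is the property the graphical workflow editor exposes.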
Additionally, Argo allows users to validate or correct any of the automatically generated annotations.