This document describes a project that extracts semantic metadata from biodiversity legacy literature through automatic text mining in order to enhance search. The project aims to transform the Biodiversity Heritage Library (BHL) into a next-generation social digital library by combining text mining, machine learning, visualisation and social media, with text mining used to generate semantic annotations for entities, types and relations. This semantic metadata allows more precise searching of BHL's collection than the current keyword-based search, helping users discover relevant information despite ambiguous search terms.
Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction
1. Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction
Riza Batista-Navarro, William Ulate, Jennifer Hammock, Georgios Kontonatsios, Trish Rose-Sandler and Sophia Ananiadou
5. Mining Biodiversity
• Transform BHL into a next-generation social digital library
• A multi-disciplinary approach
– Text Mining
– Machine learning
– History of Science
– Environmental History & Studies
– Library and Information Science
– Social Media
10/9/2015 Mining Biodiversity
6. What do we want to do?
• Social Media
• Visualisation
• Semantic Metadata
7. Biodiversity Heritage Library
• a consortium of botanical and natural history libraries
• stores digitised legacy literature on biodiversity
• currently holds 160,000 volumes (millions of pages of PDFs and OCR-generated text)
• open-access
8. Current features
• supports keyword-based search
• species names annotated and linked to the Encyclopedia of Life
• integrates automatic taxonomic name finding tools (uBio TaxonFinder)
• data access through export functionalities and Web services
11. What’s wrong with keyword-based search?
• Ambiguity!
• “Boxwood”: a historic place in Alabama, or the North American term for plants in the Buxaceae family?
• “Box”: a container, or the term for Boxwood in other English-speaking countries?
19. Automatic annotation by text mining (TM)
• Argo
– a Web-based, graphical TM workbench
– conforms to the Unstructured Information Management Architecture (UIMA) standard
– facilitates the straightforward integration of various analytics into workflows
– allows for the validation of annotations
21. Learning semantics
• Training of models using machine learning
– conditional random fields (CRFs) for sequence labelling
– learning the features of mentions and relations of interest based on labelled documents
• contextual features: surrounding, co-occurring words
• dictionary matches: presence of certain words in controlled vocabularies, e.g., Catalogue of Life, Phenotype and Trait Ontology, a gazetteer
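The two feature types named on this slide can be sketched as a token-level feature extractor of the kind fed to a CRF trainer. The vocabulary, window size and feature names below are illustrative assumptions, not the project's actual configuration.

```python
# Minimal sketch of CRF-style feature extraction: for each token we emit
# contextual features (surrounding words) and dictionary-match features
# (membership in a controlled vocabulary). A real system would pass these
# feature dicts to a CRF implementation such as CRFsuite.

# Illustrative stand-in for a controlled vocabulary (e.g., Catalogue of Life).
SPECIES_DICT = {"buxus", "sempervirens"}

def token_features(tokens, i):
    """Build the feature dict for tokens[i]."""
    word = tokens[i].lower()
    feats = {
        "word": word,
        "in_species_dict": word in SPECIES_DICT,  # dictionary match
    }
    # Contextual features: one token of context on each side.
    feats["prev_word"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_word"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

def sentence_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

feats = sentence_features(["Buxus", "sempervirens", "grows", "slowly"])
```

In a trained model, features like `in_species_dict` combined with the surrounding-word context are what let the CRF assign entity labels to whole mention sequences rather than isolated words.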
27. Conclusions
• Literature is a rich source of information but difficult to search
• Keyword-based search is not enough to address ambiguity
• Semantic metadata allows for more accurate searching
• Semantic metadata can be extracted using text mining tools
• The Argo text mining workbench facilitates the construction of custom semantic metadata generation workflows
Editor’s notes
Most of us in the biodiversity informatics community are reliant on curated databases such as EOL (click) and NCBI Taxonomy (click).
Indeed, they are some of the most fundamental sources of structured information that is critical to understanding biodiversity (click)
Another rich, albeit less exploited resource is biodiversity literature (click) which provides possibly even more comprehensive information, considering that any significant findings have most likely been published in one form of writing or another: in reports, articles, books or monographs.
However, unlike curated databases which provide information in a structured, readily computable form, literature collections are characterised by copious textual data expressed in natural language. This unstructured and voluminous nature of literature makes it difficult to find information of interest, thus posing a barrier to knowledge accessibility and discovery (click).
As many of you know, the Biodiversity Heritage Library or BHL holds the biggest literature collection on biodiversity.
In this talk, I will be describing our work on how we are extracting semantic content from BHL and putting it in a structured form that is a lot easier to access and search (click), and how we’re using text mining as the enabling technology for this (click).
We are doing this work as part of a project funded by the transatlantic Digging Into Data program called Mining Biodiversity.
In a nutshell, we have incorporated into BHL three elements, as part of the Mining Biodiversity project: Visualisation, Social Media and Semantic Metadata.
The rest of this talk will be focussing on the extraction of semantic metadata aspect (click).
One might say, I’m currently very much happy with how I’m searching BHL. What’s wrong with keywords?
Well then, the answer to that is ambiguity!
If one searches for “Boxwood”, a keyword-based system wouldn’t know whether the user means the place in Alabama or the North American term for plants in the Buxaceae family; it will simply return all documents pertaining to both.
Nor will it know whether the query “Box” refers to a container or to the same plant family, which is what other English-speaking countries call it.
Or “California bay”. A keyword-based system will not know if the user is referring to the hardwood tree or some location.
What about “Drum”? Is it a fish or a musical instrument?
“Emperor” too. It wouldn’t know if the user wants the fish or a person.
Even “scrambled eggs”: is it breakfast or the plant known by that name?
To alleviate such issues we are enriching BHL content with semantic metadata.
To this end, we are marking up mentions of different entity and association types within text.
For entities, we are capturing species, locations, habitats, anatomical parts, qualities, people and temporal expressions.
To capture associations, we link up these entities to encapsulate relationships such as observation, habitation and nutrition.
So why does semantic information help?
With semantic categorisation of terms, for example, if a user specified that he/she is looking for California bay in the SPECIES sense of the term, the system knows it should look for documents which contain a species entity of that name.
And if the user specifies he/she is looking for a LOCATION called California bay, then similarly the system knows it should look for documents in which “California bay” has been annotated as a name of a place or location.
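The disambiguation described in these two notes can be sketched with a toy index in which each mention carries an entity type, so a query combines a term with its intended sense. The documents and annotations below are invented for illustration; the real project stores such annotations in a Solr index.

```python
# Toy semantic index: each entry maps a surface term plus an entity type to
# the documents in which the term was annotated with that type. A fielded
# query (term + type) then retrieves only the matching sense.

# Hypothetical annotations; in the real system these come from text mining.
INDEX = {
    ("california bay", "SPECIES"): {"doc1", "doc3"},
    ("california bay", "LOCATION"): {"doc2"},
}

def semantic_search(term, entity_type):
    """Return documents where `term` was annotated as `entity_type`."""
    return INDEX.get((term.lower(), entity_type), set())

# The same keyword yields different documents depending on the chosen sense.
species_hits = semantic_search("California bay", "SPECIES")
location_hits = semantic_search("California bay", "LOCATION")
```

A plain keyword index would conflate both result sets; keying on the entity type is what separates the hardwood tree from the place name.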
In fleshing out the semantics from BHL documents, we took a text mining-based approach, the overall architecture of which is depicted in this figure (click).
Firstly, we set aside a seed set of documents which were manually annotated (click).
This set was used by our system to learn the semantics, i.e., entities and associations, in the documents (click).
The system then applies what it learns on unlabelled documents (click).
The annotations the system produces on these documents are then validated manually by an expert (click).
Whatever corrections the expert makes are fed back into the system and used to retrain it, in order to improve its performance (active learning).
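The train–annotate–validate–retrain cycle described in the preceding notes can be sketched as a generic active-learning loop. The "model" here is a trivial dictionary-based stand-in, purely to illustrate the control flow, not the project's CRF-based system.

```python
# Skeleton of the active-learning loop: train on labelled seed data,
# annotate unlabelled documents, have an expert correct the output, fold
# the corrections back into the training set, and retrain.

def train(labelled):
    # Trivial stand-in model: remembers which tokens were labelled SPECIES.
    return {tok for tok, label in labelled if label == "SPECIES"}

def annotate(model, tokens):
    return [(tok, "SPECIES" if tok in model else "O") for tok in tokens]

def active_learning(seed, unlabelled_docs, expert_correct, rounds=2):
    labelled = list(seed)
    model = train(labelled)
    for _ in range(rounds):
        for doc in unlabelled_docs:
            predictions = annotate(model, doc)
            corrections = expert_correct(predictions)  # manual validation
            labelled.extend(corrections)               # feed corrections back
        model = train(labelled)                        # retrain on the union
    return model

# A (hypothetical) expert who knows "Buxus" is a species the seed missed.
expert = lambda preds: [(t, "SPECIES" if t == "Buxus" else l) for t, l in preds]
model = active_learning([("Quercus", "SPECIES")], [["Buxus", "grows"]], expert)
```

After one round of expert feedback the retrained model recognises the mention the seed model missed, which is the point of closing the loop before running the final system over the whole collection.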
When the performance of the system is satisfactory, we run the final version of the system on the whole BHL collection and (click) store all of the generated annotations or semantic metadata in a search index, e.g., Solr.
This index is what we’re using to complement the bibliographic metadata in BHL.
This is Argo’s main interface. Argo comes with a library of various text mining components, which you can see on the left panel.
Basically, these components can be dragged and dropped to the canvas in the middle which serves as a block diagramming tool.
The user can then arrange these components according to the desired order of processing, and interconnect them to form a pipeline or workflow.
What did we mean earlier by “learning semantics”? How does the text mining system or Argo workflow do this?
This is the workflow that we put together using Argo.
Without going too much into detail, I will just point out the general types of processing it tries to do: pre-processing (sentence splitting, tokenisation and part-of-speech tagging), matching against dictionaries or controlled vocabularies such as the ENVO and PATO ontologies, machine learning-based recognition of entities, extraction of relations based on the results of dependency parsing, and serialisation of the generated annotations.
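The processing stages listed in this note can be sketched as a chain of interchangeable components, in the spirit of (though far simpler than) a UIMA/Argo workflow. The components below are naive placeholders for real sentence splitters, taggers, dictionary matchers and entity recognisers.

```python
import re

# Minimal pipeline sketch: each component takes and returns a document
# dict, so components can be connected in the desired processing order,
# echoing how Argo workflows chain processing blocks on the canvas.

def sentence_split(doc):
    doc["sentences"] = [s.strip()
                        for s in re.split(r"(?<=[.!?])\s+", doc["text"]) if s]
    return doc

def tokenise(doc):
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

def dictionary_match(doc, vocab={"Buxus"}):
    # Naive stand-in for matching against ENVO/PATO-style vocabularies.
    doc["entities"] = [t for sent in doc["tokens"] for t in sent if t in vocab]
    return doc

def run_workflow(text, components):
    doc = {"text": text}
    for component in components:
        doc = component(doc)  # each stage enriches the shared document
    return doc

doc = run_workflow("Buxus grows slowly. It is evergreen.",
                   [sentence_split, tokenise, dictionary_match])
```

Because every stage reads and writes the same document structure, swapping in a different recogniser or adding a relation-extraction step is just a change to the component list, which is the property the graphical workflow editor exposes.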
Additionally, Argo allows users to validate or correct any of the automatically generated annotations.