Text Mining Biodiversity
S. Ananiadou
E. Milios
W. Ulate
Partners
4/15/2016 Mining Biodiversity
Outline
1. Introduction
2. Creating a Term Inventory of Biodiversity
3. Interactive Visualization of Inventory
4. Creating a Text Mining Infrastructure for Biodiversity
5. Interactive Clustering of Search Engine results
6. OCR Error correction
7. Social media platform
8. Impact
Social Media
Visualisation
Semantic
Metadata
What do we want to do?
http://miningbiodiversity.org
Help transform BHL into a next-generation social digital
library through a multi-disciplinary approach that includes:
• Text Mining
• Machine learning
• History of Science
• Environmental History & Studies
• Library and Information Science
• Social Media
Creating the Term Inventory: why we need it
• A species name can often be expressed in multiple ways, e.g., using
scientific names or vernacular names
– Balaena mysticetus: bowhead whale, bowhead
– Spizella passerina: chipping sparrow
• Identify synonymous terms in biodiversity text
• Why? To go beyond keyword-based search!
Search Results Using Vernacular Names
Vernacular name of “Balaena mysticetus”
Different results!
Keyword-based Search: Ambiguity
Boxwood
• Historic place in Alabama?
• North American term for plants in the Buxaceae family?
Box
• Container?
• Term for boxwood in other English-speaking countries?
Methods: Distributional Semantics
• Determine the meaning of terms and phrases from the contexts in which
they occur and from the meanings of their individual words
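[Figure: context vector for the phrase "bowhead whale" (context word: weight)]
balaena: 43.99, mysticetus: 39.99, alaska: 25.06, seals: 23.92, distribution: 20.84, ringed: 19.86, catch: 19.52, quota: 17.91, …, murray: 5.62
[Figure: the vector split between the individual words "bowhead" and "whale"]
balaena: 43.99, alaska: 25.06, catch: 19.52, quota: 17.91, …
mysticetus: 39.99, seals: 23.92, distribution: 20.84, ringed: 19.52, …, murray: 5.62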
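A context vector of this kind can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the toy corpus, the window size and the use of raw co-occurrence counts are assumptions, since the slides do not say how the weights were computed.

```python
from collections import Counter

def context_vector(term, sentences, window=3):
    """Count the words that occur within `window` tokens of `term`.

    `term` may be a multi-word phrase. Raw counts stand in for the
    weighting scheme, which the slides do not specify.
    """
    target = term.lower().split()
    n = len(target)
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + n:i + n + window]
                counts.update(left + right)
    return counts

# Hypothetical toy corpus (not BHL data).
corpus = [
    "the bowhead whale balaena mysticetus is hunted in alaska",
    "catch quota for the bowhead whale in alaska",
    "balaena mysticetus feeds near ringed seals in alaska",
]

phrase_vec = context_vector("bowhead whale", corpus)  # vector for the phrase
word_vec = context_vector("bowhead", corpus)          # vector for one word
```

The same function serves both views on the slide: the phrase as a whole and each of its words separately.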
Distributional semantics methods
balaena mysticetus
balaena glacialis 0.7896
bowhead whale 0.7392
bowhead 0.7074
bowhead whales 0.6999
eubalaena glacialis 0.6905
minke whale 0.6864
humpback whale 0.6490
sperm whale 0.6440
finback whale 0.6322
sei whale 0.6287
eubalaena japonica 0.6065
brydes whale 0.6052
humpback whales 0.6000
finback whales 0.5998
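A ranked list like this one can be produced by comparing the context vector of the query term with the vectors of all other terms. The sketch below uses cosine similarity, which is a common choice but an assumption here, as the slides do not name the measure. The vectors and their weights are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, top_n=20):
    """Rank every other term by similarity to `query`, best first."""
    scores = [
        (term, cosine(vectors[query], vec))
        for term, vec in vectors.items()
        if term != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical context vectors; the weights are invented.
vectors = {
    "balaena mysticetus": {"alaska": 3.0, "catch": 2.0, "quota": 1.0},
    "bowhead whale": {"alaska": 2.0, "catch": 2.0, "seals": 1.0},
    "chipping sparrow": {"nest": 3.0, "song": 2.0},
}

ranking = most_similar("balaena mysticetus", vectors)
```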
Experiments
• Training data: all English texts from the BHL
• about 26 million pages, 49 GB in total
• Evaluation data: synonymous terms from the Catalogue of Life
• Select 500 scientific names and their synonyms from the CoL
• Results at top-20
Category   Class      #terms in CoL   #terms in BHL   Avg. #synonyms in CoL
Birds      Aves       1140            818             2.28
Mammals    Mammalia   1131            726             2.26
Plants     Plantae    1141            826             2.28

Category   Pre@20    Re@20
Birds      69.41%    63%
Mammals    62.12%    53.84%
Plants     56.17%    21.43%
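For reference, the textbook definitions of precision and recall at a cut-off k can be sketched as follows. The slides do not define Pre@20 and Re@20, so the reported figures may rest on a different definition. The function name and the example data are hypothetical.

```python
def precision_recall_at_k(ranked, gold, k=20):
    """Return (precision@k, recall@k) for one query term.

    `ranked` is the system's candidate list, best first; `gold` is the
    set of known synonyms from a reference list. If fewer than k
    candidates are returned, precision is taken over those returned.
    """
    top_k = ranked[:k]
    hits = sum(1 for candidate in top_k if candidate in gold)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical system output and gold synonyms for one scientific name.
ranked = ["bowhead whale", "minke whale", "bowhead", "sei whale"]
gold = {"bowhead whale", "bowhead", "greenland right whale"}

p, r = precision_recall_at_k(ranked, gold, k=20)
```

Per-category scores would then be averages of these values over all query names in the category.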
3. Interactive visualization of term inventory
[Video: term inventory visualization]
4. Creating a text mining infrastructure for biodiversity
• Web-based, graphical TM workbench
• Straightforward integration of tools into modular, extensible,
reconfigurable and reusable workflows
http://argo.nactem.ac.uk
Source: LEGO DUPLO
Annotation Workflow for Biodiversity
Pre-processing
Dictionary lookup
Machine learning-based recognition
Relation extraction
Saving
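The idea of a workflow built from interchangeable steps can be sketched in plain Python. This does not use Argo's API: every function name below is hypothetical, the lexicon is a toy example, and the machine learning and relation extraction steps are left out; they would be further functions with the same signature.

```python
import json
import re

def preprocess(doc):
    """Split raw text into word tokens (a stand-in for real pre-processing)."""
    doc["tokens"] = re.findall(r"[A-Za-z]+", doc["text"])
    return doc

def make_dictionary_lookup(lexicon):
    """Build a step that annotates two-word names found in `lexicon`."""
    def lookup(doc):
        tokens = doc["tokens"]
        for first, second in zip(tokens, tokens[1:]):
            name = f"{first} {second}".lower()
            if name in lexicon:
                doc["annotations"].append({"type": "Taxon", "text": name})
        return doc
    return lookup

def save(doc):
    """Serialise the annotations; a real step would write to storage."""
    doc["saved"] = json.dumps(doc["annotations"])
    return doc

def run_workflow(text, steps):
    """Pass one document through each step in order."""
    doc = {"text": text, "annotations": []}
    for step in steps:
        doc = step(doc)
    return doc

lexicon = {"balaena mysticetus", "spizella passerina"}
steps = [preprocess, make_dictionary_lookup(lexicon), save]
result = run_workflow("Balaena mysticetus was sighted off Alaska.", steps)
```

Because every step takes and returns the same document structure, steps can be reordered, swapped or reused across workflows.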
[Video: annotation workflows]
5. Interactive clustering of search engine results
• Goal: to cluster BHL search engine results
• Input dataset: output of an OR query over the following terms:
1. Kangaroo
2. Lion
3. Rabbit
4. Shark
• Only titles of books or articles are considered in clustering
• Interactive clustering based on the keyterms of the titles
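A minimal sketch of clustering titles by their key terms is given below. The slides do not name the algorithm, so the greedy Jaccard-overlap grouping is a stand-in, and the `ignore` parameter is a hypothetical way to model user feedback between rounds. The titles are invented.

```python
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the"}

def key_terms(title):
    """Lowercased title words minus a small stopword list."""
    return {w for w in title.lower().split() if w not in STOPWORDS}

def jaccard(a, b):
    """Overlap of two term sets, from 0.0 (disjoint) to 1.0 (equal)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_titles(titles, threshold=0.3, ignore=frozenset()):
    """Greedy single-pass clustering of titles by key-term overlap.

    Terms in `ignore` (marked unhelpful by the user) are dropped
    before similarities are computed.
    """
    clusters = []  # each cluster: {"terms": set, "titles": list}
    for title in titles:
        terms = key_terms(title) - ignore
        best = max(clusters, key=lambda c: jaccard(terms, c["terms"]), default=None)
        if best is not None and jaccard(terms, best["terms"]) >= threshold:
            best["titles"].append(title)
            best["terms"] |= terms
        else:
            clusters.append({"terms": set(terms), "titles": [title]})
    return [c["titles"] for c in clusters]

titles = [
    "Kangaroo of Australia",
    "Shark of Australia",
    "Shark fisheries",
    "Notes on the kangaroo",
]
first_pass = cluster_titles(titles)
second_pass = cluster_titles(titles, ignore={"australia"})
```

In the first pass the shared term "australia" pulls a kangaroo title and a shark title together; once the user marks that term as unhelpful, the second pass groups the titles by animal.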
[Video: interactive clustering]
6. OCR error correction
• Correct errors in natural language texts
• Spelling errors (e.g. "teh" for "the")
• Grammar errors (e.g. "this are" for "this is")
• Outline
OCR error correction
• Input
• Document
• Component selection (select components to use for processing)
• Correction candidates
• A list of candidates with confidence for each error
• Component structure
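The notion of a candidate list with a confidence per error can be illustrated with Python's standard `difflib` module. This is a stand-in for the project's correction components, which the slides do not describe in detail. The lexicon, the cut-off and the function name are assumptions.

```python
import difflib

def correction_candidates(word, lexicon, max_candidates=3, cutoff=0.7):
    """Return (candidate, confidence) pairs for a suspected OCR error.

    Confidence is difflib's similarity ratio in [0, 1]; candidates
    below `cutoff` are discarded.
    """
    matches = difflib.get_close_matches(word, lexicon, n=max_candidates, cutoff=cutoff)
    return [
        (m, round(difflib.SequenceMatcher(None, m, word).ratio(), 3))
        for m in matches
    ]

# Hypothetical lexicon; "wha1e" imitates a digit-for-letter OCR error.
lexicon = ["whale", "whales", "while", "sparrow"]
candidates = correction_candidates("wha1e", lexicon)
```

A fuller system would presumably combine the scores of several selected components before ranking the candidates.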
[Video: OCR error correction]
7. Social media platform
Making Biodiversity Digital Objects More Social and Shareable
Follow us on Twitter: @SMLabTO
“My Tweeps” app
mytweeps.com
The app helps BHL (and other organizations)
get daily insights about their Twitter
followers (or Tweeps) and what they
are interested in.
We call it a "reverse" Twitter because
instead of seeing tweets from people
whom you follow, the app shows you
tweets from people who follow you.
We also partnered with Altmetric to better understand who shares BHL
content across various social media platforms, and why
[Video: My Tweeps]
8. Impact
Enhanced Searching of BHL Content
• Faceted search
• Automatically generated questions
• Time-sensitive search
Enhanced Document Viewing
• Page in PDF/image format
• OCR-corrected text with colour-coded annotations
The Team
• NaCTeM
• Ryerson
• Dalhousie
• Missouri Botanical Garden
• Smithsonian Libraries (contract)
Thanks to the sponsors:

GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 

Text Mining Biodiversity 20160127

  • 1. Text Mining Biodiversity S. Ananiadou E. Milios W. Ulate
  • 3. Outline 1. Introduction 2. Creating a Term Inventory of Biodiversity 3. Interactive Visualization of Inventory 4. Creating a Text Mining Infrastructure for Biodiversity 5. Interactive Clustering of Search Engine results 6. OCR Error correction 7. Social media platform 8. Impact
  • 5. What do we want to do? Help transform BHL into a next-generation social digital library through a multi-disciplinary approach that includes: • Text Mining • Machine learning • History of Science • Environmental History & Studies • Library and Information Science • Social Media (Elements: Social Media, Visualisation, Semantic Metadata.) http://miningbiodiversity.org
  • 6. Creating the Term Inventory: why we need it • A species name can usually be expressed in multiple ways, e.g., using scientific names or vernacular names – Balaena mysticetus: bowhead whale, bowhead – Spizella passerina: chipping sparrows • Identify synonymous terms in biodiversity text • Why? To go beyond keyword-based search!
  • 7. Search Results Using Vernacular Names • Vernacular name of “Balaena mysticetus” • Different results!!
  • 8. Keyword-based Search: Ambiguity • Boxwood: historic place in Alabama? North American term for plants in the Buxaceae family? • Box: container? Boxwood for other English-speaking countries?
  • 9. Methods: Distributional Semantics • Determine the meaning of terms and phrases by looking at the context and the meaning of individual words • Context vector of “bowhead whale”: balaena 43.99, mysticetus 39.99, alaska 25.06, seals 23.92, distribution 20.84, ringed 19.86, catch 19.52, quota 17.91, …, murray 5.62
  • 10. Distributional semantics methods • Terms ranked by similarity to “balaena mysticetus”: balaena glacialis 0.7896; bowhead whale 0.7392; bowhead 0.7074; bowhead whales 0.6999; eubalaena glacialis 0.6905; minke whale 0.6864; humpback whale 0.6490; sperm whale 0.6440; finback whale 0.6322; sei whale 0.6287; eubalaena japonica 0.6065; brydes whale 0.6052; humpback whales 0.6000; finback whales 0.5998
  • 11. Experiments • Training data: all English texts from the BHL (about 26 million pages, 49 GB) • Evaluation data: synonymous terms from the Catalogue of Life (CoL) • Select 500 scientific names and their synonyms from the CoL • Results at top-20
      Category | Class | Terms in CoL | Terms in BHL | Average synonyms in CoL
      Birds | Aves | 1140 | 818 | 2.28
      Mammals | Mammalia | 1131 | 726 | 2.26
      Plants | Plantae | 1141 | 826 | 2.28

      Category | Pre@20 | Re@20
      Birds | 69.41% | 63%
      Mammals | 62.12% | 53.84%
      Plants | 56.17% | 21.43%
  • 12. 3. Interactive visualization of term inventory
  • 14. 4. Creating a text mining infrastructure for biodiversity • Web-based, graphical TM workbench • Straightforward integration of tools into modular, extensible, reconfigurable and reusable workflows • http://argo.nactem.ac.uk (image source: LEGO DUPLO)
  • 15. Annotation Workflow for Biodiversity • Pre-processing • Dictionary lookup • Machine learning-based recognition • Relation extraction • Saving
  • 17. 5. Interactive clustering of search engine results • Goal: to cluster BHL search engine results • Input dataset: output of an “Or” query based on the following terms: 1. Kangaroo 2. Lion 3. Rabbit 4. Shark • Only titles of books or articles are considered in clustering • Interactive clustering based on the keyterms of the titles
  • 19. 6. OCR error correction • Correct errors in natural language texts • Spelling errors (e.g. “teh” for “the”) • Grammar errors (e.g. “this are” for “this is”) • Outline
  • 20. OCR error correction • Input • Document • Component selection (select components to use for processing) • Correction candidates • A list of candidates with confidence for each error • Component structure
  • 22. 7. Social media platform
  • 23. Making Biodiversity Digital Objects More Social and Shareable Follow us on Twitter: @SMLabTO
  • 24. “My Tweeps” app (mytweeps.com): Helping BHL (and other organizations) to get daily insights about their Twitter followers (or Tweeps) and what they are interested in. We call it a "reverse" Twitter because instead of seeing tweets from people whom you follow, the app shows you tweets from people who follow you.
  • 25. We also partnered with Altmetric to better understand who shares BHL content across various social media platforms, and why
  • 28. Enhanced Searching of BHL Content • Faceted search • Automatically generated questions • Time-sensitive search
  • 29. Enhanced Document Viewing • Page in PDF/image format • OCR-corrected text with colour-coded annotations
  • 30. The Team • NaCTeM • Ryerson • Dalhousie • Missouri Botanical Garden • Smithsonian Libraries (contract)
  • 31. Thanks to the sponsors:
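
Slides 9 and 10 describe building a context vector for each term and then ranking other terms by similarity to it. A minimal Python sketch of that idea is given here under several assumptions that are not in the slides: multi-word names are pre-joined into single tokens (e.g. `bowhead_whale`), the window is counted in tokens on each side of the term, and raw co-occurrence counts stand in for whatever weighting produced the scores shown on slide 9. The function names are hypothetical; cosine similarity is the measure named in the editor's notes.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=7):
    """Count how often each token co-occurs with tokens up to `window` positions away.

    Assumption: raw counts are used; the slides do not state the weighting scheme.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(term, vectors, k=20):
    """Return the k terms whose context vectors are closest to that of `term`."""
    target = vectors.get(term, Counter())
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != term]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = [
        ["bowhead_whale", "catch", "quota", "alaska"],
        ["balaena_mysticetus", "catch", "quota", "alaska"],
        ["chipping_sparrow", "nest", "shrub"],
    ]
    vecs = context_vectors(corpus, window=7)
    for name, score in most_similar("bowhead_whale", vecs, k=3):
        print(f"{name}\t{score:.4f}")
```

In this toy corpus the two whale names share identical contexts, so the scientific name ranks first for the vernacular query, which is the behaviour the term inventory relies on.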

Editor's notes

  1. Shortcuts for fast-forwarding VLC videos: http://www.shortcutworld.com/en/win/VLC-Media-Player.html. Before starting, go to display settings and make the projector screen the main screen, so that videos pop up there and not on the laptop screen.
  2. BHL is the data source. IMLS is the funding agency. Missouri Botanical Garden is the partner for the US. Smithsonian Libraries is a contractor (not sure if we should include it).
  3. Speakers by outline item: 1. Sophia; 2. Sophia; 3. Sophia; 4. Sophia; 5. Evangelos; 6. Evangelos; 7. William (Anatoliy’s video has voice, so it is self-explanatory); 8. William.
  4. The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” The BHL consortium works with the international taxonomic community, rights holders, and other interested parties to ensure that this biodiversity heritage is made available to a global audience through open access principles. In partnership with the Internet Archive and through local digitization efforts, the BHL has digitized more than 48 million pages of taxonomic literature, representing over 100,000 titles and over 170,000 volumes.
  5. MiBIO will integrate TM tools within an interoperable platform to provide a semantic search system for the BHL, enhanced through clustering and visualisation capabilities. MiBIO will also provide a social media environment, which will enable BHL users to discuss, link and share digital artifacts posted to social media sites linked to the BHL search portal. The outcome will be the transformation of the BHL from a Digital Library (DL) into a Social Digital Library (SDL). This will be achieved through the enrichment of its historical digital archives with semantic metadata generated by TM. Furthermore, by leveraging existing social media sites and providing facilities for their integration with the BHL, we will engage a community of users to exploit the BHL as a forum for the exchange of ideas. In a nutshell, we have incorporated into BHL three elements, as part of the Mining Biodiversity project: Visualisation, Social Media and Semantic Metadata.
  6. Such variants can degrade the performance of a keyword-based search engine and, moreover, cause difficulties for non-expert users (users who are not familiar with scientific names). To alleviate the problem of searching across variants, we have compiled a terminological inventory containing semantic variants of biodiversity terms (e.g., for mammals, birds and plants) using distributional semantic methods: (1) learn the representation vector of each term; (2) calculate the cosine similarity between two terms; (3) extract the top-20 synonym candidates.
  7. And here is the search result when we use a common name for the previous term; it consists of only one document related to “bowhead whale”. Evidently, the search engine returns a different result from the previous one …
  8. Another problem with keyword-based search, as mentioned above, is ambiguity. If one searches for “Boxwood”, a keyword-based system wouldn’t know if he/she was referring to a place in Alabama, or the North American term for plants under the Buxaceae family. It will just return all documents pertaining to both. Nor will it know if a query “Box” pertains to the same plant family because apparently this is how other English-speaking countries refer to it, or a container.
  9. We then implemented two distributional semantic models. The first one is a count-based model that determines the … For example, within a 7-word window, this is the context vector of “bowhead whale” -- SA rubbish frequency
  10. In this manner, for each name, we generate a list of names ranked by similarity. For “balaena mysticetus”, for example, we obtained the following list. The meaning of a term is determined by considering all lexical units occurring within an N-word window.
  11. We conducted our experiments on the Biodiversity Heritage Library (BHL) corpus, which is about 49 GB in size. We created a gold standard of synonymous terms based on the Catalogue of Life (CoL). For each scientific name, we extracted the corresponding common names and synonyms. We then randomly picked 500 species of the class Aves. As a result, we obtained about 1,100 bird names (both vernacular and scientific), of which about 800 occur in the BHL corpus. According to the CoL, the average number of synonyms per scientific name is about 2. We followed the same process for mammal and plant names. The precision and recall scores at top-20 follow. Among the three categories, performance is best for bird names. The lower performance on plant names can be explained by the fact that, unlike for mammals and birds, most synonyms of plant names are themselves scientific names, which are more difficult to detect.
  12. Shift+Arrow Right/Arrow Left: jump 3 seconds forward/backward. Alt+Arrow Right/Arrow Left: jump 10 seconds forward/backward. Ctrl+Arrow Right/Arrow Left: jump 1 minute forward/backward. - Frequency of species names can be visually explored, or queried by a search interface. - Clicking on a species name acts as a query to retrieve its top-20 semantically related species. -- Their semantic relatedness scores can be inspected. -- A blue color denotes that the species name appears as a synonym in the CoL. - Interactive visualizations were constructed for mammals, plants and birds. [And in case somebody asks:] - Images, which were crawled from external open sources, may help to visually assess species' relatedness based on their visible features.
  13. Species names are shown in bubbles. Larger bubbles denote species more frequently mentioned in the biodiversity literature. Upon interaction, (semantically) related species can be inspected. Color opacity indicates degree of relatedness. Blue color indicates that species also appear as synonyms in CoL. Images are retrieved from open data collections (e.g. Wikipedia).
  14. Web-based application: no installation; access with a web browser. Multi-user system: remote collaborative annotation. Supports the Unstructured Information Management Architecture (UIMA), cloud and high-performance computing.
  15. This is the workflow that we put together using Argo. Without going too much into detail, I will just point out the general types of processing it tries to do: pre-processing (sentence splitting, tokenisation and part-of-speech tagging), matching against dictionaries or controlled vocabularies such as the ENVO and PATO ontologies, machine learning-based recognition of entities, extraction of relations based on the results of dependency parsing, and serialisation of the generated annotations.
  16. Shift+Arrow Right/Arrow Left: jump 3 seconds forward/backward. Alt+Arrow Right/Arrow Left: jump 10 seconds forward/backward. Ctrl+Arrow Right/Arrow Left: jump 1 minute forward/backward.
  17. NaCTeM: Riza Theresa Batista-Navarro, Sophia Ananiadou, Georgios Kontonatsios. Dalhousie: Axel Soto, Aminul Islam, Evangelos Milios, Abdul Moh’d, Hamid. Missouri Botanical Garden team: Mike Lichtenberg, Trish Rose-Sandler & William Ulate. Smithsonian Libraries staff (contract): Grace Costantino & Jen Hammock.
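
Note 11 reports precision and recall at top-20 against Catalogue of Life synonyms, but neither the slides nor the notes spell out how the two scores were computed. The Python sketch below shows the textbook per-query definitions only, as an assumption; the authors' evaluation may differ. Under this definition a query with about two gold synonyms cannot exceed roughly 2/20 precision, so the reported Pre@20 values suggest a different definition was used. The function name and the example lists are hypothetical.

```python
def precision_recall_at_k(ranked, gold, k=20):
    """Textbook precision@k and recall@k for one query term.

    ranked: candidate synonyms, best first (e.g. sorted by cosine similarity).
    gold:   set of reference synonyms (e.g. taken from the Catalogue of Life).
    """
    top = ranked[:k]
    hits = sum(1 for candidate in top if candidate in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented example: a ranked list for one scientific name and its gold synonyms.
    ranked = ["bowhead whale", "minke whale", "bowhead", "sei whale"]
    gold = {"bowhead whale", "bowhead"}
    print(precision_recall_at_k(ranked, gold, k=2))
    print(precision_recall_at_k(ranked, gold, k=4))
```

Averaging these per-query values over all evaluated names would give corpus-level scores comparable in form to the table on slide 11.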