Experiences on integrating explicit knowledge on information access tools in the medical domain
1. + [Title slide diagram: Extractive Summarization, Computer-aided summarization, Post-retrieval clustering, Query user-defined expansion]
Manuel de la Villa
Department of Information Technologies
University of Huelva
2. + 2
Index
Brief CV
Why a research stay? In Wolverhampton?
Teaching
Integrating explicit knowledge on information
access tools
Knowledge sources (UMLS & Freebase)
Automatic Text Summarization
Information Retrieval
Research Group in Computational Linguistics (Univ. Wolverhampton), June 20th 2011
3. + 3
Brief CV
4. + 6
Teaching experience
Software Engineering
Process and Methodologies, Metrics,
Requirements analysis, Design, …
Software Engineering Lab (UML, NetBeans,
Subversion, Java, JUnit, Persistence…)
Multimedia applications development
Adobe Director, Flash, Photoshop, Premiere
Sony Sound Forge, Audacity
5. + 7
Knowledge integration
6. + Specific Domain Knowledge source. UMLS (I) 8
[Figure: a saturation of different terminologies (ICD-10, LOINC, SNOMED-CT, UK Clinical Terms, MeSH, DSM-IV, Gene Ontology, RxNorm, …) versus a homogeneous group of terminologies (UMLS)]
UMLS aims to overcome a significant barrier, the variety of
ways the same concepts are expressed in different
machine-readable sources.
7. + Specific Domain Knowledge source. UMLS (II) 9
NLM Unified Medical Language System (UMLS) project:
Aim: to develop tools that help researchers with knowledge representation and with the retrieval and integration of biomedical information.
UMLS Knowledge Sources
Software tools
Three main components:
SPECIALIST Lexicon: a compilation of lexical entries (>200,000) with grammatical information and linguistic variants.
“Anaesthetic” (noun entry):
{base=anesthetic
spelling_variant=anaesthetic
entry=E0330018
cat=noun
variants=reg
variants=uncount
}
“Anaesthetic” (adjective entry):
{base=anesthetic
spelling_variant=anaesthetic
entry=E0330019
cat=adj
variants=inv
position=attrib(3)
position=pred
stative
}
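A lexicon record of this kind can be read into a simple key/value structure. The following is a minimal sketch, assuming one `key=value` pair (or one bare flag such as `stative`) per line as in the adjective entry on this slide; `parse_lexicon_record` is a hypothetical helper written for illustration, not part of the UMLS tools.

```python
# Adjective entry for "anaesthetic", copied from the slide, one field per line.
RECORD = """{base=anesthetic
spelling_variant=anaesthetic
entry=E0330019
cat=adj
variants=inv
position=attrib(3)
position=pred
stative
}"""


def parse_lexicon_record(text):
    """Parse one record into {key: [values]}; bare flags map to an empty list.

    Hypothetical helper: it only handles the flat layout shown above.
    """
    fields = {}
    # Drop the enclosing braces, then read the record line by line.
    for line in text.strip().strip("{}").splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Repeated keys (e.g. "position") accumulate in order.
            fields.setdefault(key, []).append(value)
        else:
            # A bare flag such as "stative" carries no value.
            fields.setdefault(key, [])
    return fields


record = parse_lexicon_record(RECORD)
print(record["cat"], record["position"])
```

Keeping every value in a list means repeated fields such as `position` are not overwritten.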
8. + Specific Domain Knowledge source. UMLS (III) 10
Metathesaurus: a very large, multi-purpose, multilingual vocabulary database that compiles more than 100 source vocabularies, https://uts.nlm.nih.gov/metathesaurus.html
Every term (>5M) is associated with a concept (>1.5M); related terms (e.g., synonyms) are linked (16M relations)
Each concept is assigned to one or more of the 135 existing semantic types
[Figure: different terms… for the same concept… included in a semantic type]
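The term, concept and semantic type layers can be pictured as two lookup tables. This is a minimal sketch under stated assumptions: the concept identifiers `C_DEMO_1` and `C_DEMO_2` are placeholders, not real UMLS codes, and the type assignments are illustrative rather than taken from the Metathesaurus.

```python
# Different terms map to the same concept (placeholder identifiers).
TERM_TO_CONCEPT = {
    "heart attack": "C_DEMO_1",
    "myocardial infarction": "C_DEMO_1",
    "aspirin": "C_DEMO_2",
    "acetylsalicylic acid": "C_DEMO_2",
}

# Each concept is assigned to one or more semantic types (illustrative).
CONCEPT_TO_TYPES = {
    "C_DEMO_1": ["Disease or Syndrome"],
    "C_DEMO_2": ["Organic Chemical", "Pharmacologic Substance"],
}


def concept_for(term):
    """Return the concept id for a term, or None if the term is unknown."""
    return TERM_TO_CONCEPT.get(term.strip().lower())


def synonyms(term):
    """Return every known term that shares the concept of `term`."""
    concept = concept_for(term)
    if concept is None:
        return []
    return sorted(t for t, c in TERM_TO_CONCEPT.items() if c == concept)


def semantic_types(term):
    """Return the semantic types of the concept behind `term`."""
    return CONCEPT_TO_TYPES.get(concept_for(term), [])


print(synonyms("Heart attack"), semantic_types("aspirin"))
```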
9. + Specific Domain Knowledge source. UMLS (IV) 11
https://uts.nlm.nih.gov/semanticnetwork.html
UMLS Semantic Network: an ontology with 135
semantic types and 54 types of relationships
between types
10. + General Domain Knowledge Source: Freebase (I)
Freebase is a large public database that collects three kinds of
information:
data;
texts; and
media, which reference…
…entities or topics (≈ 12 million). An entity is a single unique person,
place, or thing: a single concept or real-world thing.
A topic, which could also be called an entity, resource, element or thing, is the
fundamental unit in Freebase.
/common/topic
Each topic has a GUID, or globally unique ID:
http://www.freebase.com/view/en/barack_obama
http://www.freebase.com/guid/9202a8c04000641f800000000029c277
11. + General Domain Knowledge Source: Freebase (II)
Freebase connects entities together as a graph:
it defines its data structure as a set of nodes and a set
of links that establish relationships between the
nodes.
Most of our topics are associated with one or more types (such as
people, places, books, films, etc) and may have additional
properties like "date of birth" for a person or latitude and
longitude for a location. These types and properties and related
concepts are called Schema.
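The node-and-link model with types and properties can be sketched as a tiny in-memory graph. This is an illustration only, not the Freebase API: the `Topic` and `Graph` classes are hypothetical, and the `/en/honolulu` identifier and the `place_of_birth` link name are assumed for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A node: an id, a set of types, and a dict of properties."""
    topic_id: str
    types: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


class Graph:
    """Nodes plus labelled links (source id, relation, target id)."""

    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_topic(self, topic):
        self.nodes[topic.topic_id] = topic

    def link(self, source_id, relation, target_id):
        self.links.append((source_id, relation, target_id))

    def neighbours(self, source_id, relation):
        """Follow every link with the given label out of one node."""
        return [t for s, r, t in self.links if s == source_id and r == relation]


graph = Graph()
graph.add_topic(Topic("/en/barack_obama", {"/people/person"},
                      {"date_of_birth": "1961-08-04"}))
graph.add_topic(Topic("/en/honolulu", {"/location/location"}))
graph.link("/en/barack_obama", "place_of_birth", "/en/honolulu")

print(graph.neighbours("/en/barack_obama", "place_of_birth"))
```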
12. + General Domain Knowledge Source: Freebase (III)
The Schema
Schema (the way Freebase's data is laid out) is expressed through
Types and Properties. Types are grouped together in Domains.
16. + General Domain Knowledge Source: Freebase (IV)
The Schema: Medicine
17. + General Domain Knowledge Source: Freebase (V)
How can we use it…
As a reference or information source
Create interesting Views and Visualizations and
share them with others
Embed Freebase data in your website
Use our API or Acre, our hosted app development
platform, to build apps that use Freebase data
Download our Data dumps
Use Freebase's RDF for Semantic Web applications
18. + General Domain Knowledge Source: Freebase (VI)
The Freebase approach
20. + 22
Knowledge integration
21. + 23
Experiences in Automatic summarization (I)
+ We developed a proposal with these main
characteristics:
Sentence extraction
Document representation as a graph
Centered on biomedical concepts
Using concept frequency to measure relevance
22. + 24
Experiences in Automatic summarization (II)
+ Phase I: Graph generation
Sentence and UMLS concept identification
+ Phase II: Similarity algorithm
Concept overlap between sentences
(edges) means “recommendation”
+ Phase III: Ranking algorithm
The weight associated with each edge depends on
similarity
+ Phase IV: Summary building
Top-ranked sentences are selected
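The four phases can be sketched in a few lines. This is a rough illustration under several assumptions, not the method evaluated in the talk: concept identification is replaced by a hypothetical lookup table (`CONCEPTS`) instead of a UMLS mapper, similarity is assumed to be Jaccard overlap, and ranking uses the sum of incident edge weights as a stand-in for the ranking algorithm, which the slides do not specify.

```python
import re

# Toy stand-in for UMLS concept identification: surface form -> concept label.
# Hypothetical table; a real system would use a concept mapper.
CONCEPTS = {
    "aspirin": "ASPIRIN",
    "acetylsalicylic acid": "ASPIRIN",
    "headache": "HEADACHE",
    "stroke": "STROKE",
    "bleeding": "BLEEDING",
}


def concepts_in(sentence):
    """Return the set of concept labels whose surface form occurs in the sentence."""
    text = sentence.lower()
    return {label for surface, label in CONCEPTS.items() if surface in text}


def similarity(a, b):
    """Jaccard overlap between two concept sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def summarize(text, n_sentences=2):
    # Phase I: sentences become nodes annotated with their concepts.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = [concepts_in(s) for s in sentences]
    # Phases II and III: edge weight = similarity; score = sum of incident weights.
    scores = [
        sum(similarity(nodes[i], nodes[j]) for j in range(len(nodes)) if j != i)
        for i in range(len(nodes))
    ]
    # Phase IV: keep the top-ranked sentences, then restore document order.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:n_sentences]
    return [sentences[i] for i in sorted(top)]


DOC = ("Aspirin is often taken for headache. "
       "Acetylsalicylic acid can increase the risk of bleeding. "
       "The weather was pleasant that week. "
       "Low-dose aspirin may help prevent stroke.")
summary = summarize(DOC)
print(summary)
```

Because the second sentence shares a concept with the first despite using a different surface form, it is linked to it in the graph, while the off-topic sentence has no edges and is never selected.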
27. + 29
Automatic Summarization. Evaluation
Evaluation with ROUGE (based on n-grams) against generic
summarizers
Our method obtains good results, especially with small n-grams
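The n-gram idea behind ROUGE can be illustrated with a recall-only ROUGE-N computation. This is a simplified sketch: whitespace tokenization and lowercasing are assumptions, and the official ROUGE toolkit offers further options that are not reproduced here. The two example sentences are invented for the illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by the number of times the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "aspirin reduces the risk of stroke"
candidate = "aspirin reduces stroke risk"
rouge_1 = rouge_n_recall(candidate, reference, n=1)
rouge_2 = rouge_n_recall(candidate, reference, n=2)
print(rouge_1, rouge_2)
```

In this toy pair the candidate covers four of the six reference unigrams but only one of the five reference bigrams, which shows why scores tend to drop as n grows.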
de la Villa, M., Maña, M.
“Propuesta y evaluación de un método de generación de
resúmenes extractivo basado en conceptos en el ámbito
biomédico” [Proposal and evaluation of a concept-based extractive
summarization method in the biomedical domain]. 25th Annual Conference of the Sociedad
Española para el Procesamiento del Lenguaje Natural (SEPLN'09), San Sebastián, September 2009.
28. + 30
Knowledge integration
29. + 31
Experiences in Computer-aided summarization (I)
Computer-aided summarization (CAS) combines automatic
and human summarization.
The CAS system suggests an initial summary
by selecting relevant sentences.
The human can change the sentence selection and
manually edit the summary.
Purpose: construction of a Gold-Standard building
assistant.
Novelty: considers the distribution of biomedical concepts
(Reeve et al., 2006)
30. + 32
Experiences in Computer-aided summarization (and II)
Experience in the design and construction of a Gold-Standard building assistant (or Computer-aided summarization)
Considering biomedical concepts distribution (Reeve et al., 2006)
- Client-server app
- Centralized repository
- Supports PDF, XML
31. + 33
Knowledge integration
32. + 34
Experiences in Information Retrieval
and Post-retrieval clustering
Experience in the design and construction of an information retrieval system with:
• Post-retrieval clustering,
• orientation to biomedical documents and
• mobile devices
33. Search and Information Retrieval: Our implementation
Document sources: BioMed Central (web crawling in progress)
Text processing: lowercasing, stemming, stop-word removal, …
Lucene for indexing…
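The text-processing and indexing steps can be pictured with a small inverted index. This is a plain-Python sketch for illustration, not Lucene: the stop-word list is a tiny assumed sample, `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer, and the three documents are invented.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "of", "in", "and", "a", "is", "for", "to"}  # tiny assumed list


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Map each processed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


DOCS = {
    1: "Ligament injuries in athletes",
    2: "Treatment of skin cancer",
    3: "Cancer and ligaments",
}
index = build_index(DOCS)
print(search(index, "ligaments"), search(index, "skin cancer"))
```

Applying the same `analyze` function to documents and queries is what lets the query "ligaments" match a document that only contains "ligament".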
34. Search and Information Retrieval: Our implementation (and II)
… and Lucene for searching
35. Clustering: Our implementation
Weka for Clustering
Post-retrieval clustering partitions the set of documents retrieved for a
query into different subsets according to their similarity
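Grouping retrieved documents by similarity can be illustrated with a very small k-means. This is a plain-Python sketch, not the Weka implementation used in the system: raw term counts as features, squared Euclidean distance, and initialization from the first k documents are all assumptions made to keep the example short and deterministic. The four result titles are invented.

```python
from collections import Counter


def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    bags = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*bags))
    return [[bag[w] for w in vocab] for bag in bags]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=20):
    """Return one cluster label per vector."""
    centroids = [list(v) for v in vectors[:k]]  # deterministic init: first k vectors
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


RESULTS = [
    "knee ligament injury",
    "skin cancer treatment",
    "ligament injury repair",
    "cancer treatment trial",
]
labels = k_means(vectorize(RESULTS), k=2)
print(labels)
```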
36. Clustering: Why Simple-K-Means?
Clustering algorithm:
Simple-K-Means vs Expectation Maximization (EM)
Time it takes to perform the grouping, in seconds:
Query (documents)    Simple-K-Means    EM
Ligaments (10)       1                 2
Cancer Skin (25)     4                 12
Cancer (46)          5                 26
Disease (62)         8                 57
K? It depends on the number of documents retrieved.
38. + 41
Knowledge integration
39. +
Experiences in Information Retrieval
and Query user-defined expansion (I)
Users have problems defining their information needs in a
query string (Jansen, Spink and Koshman, 2007).
Queries contain fewer than three terms (75.2%), and the majority of
queries contain one (18.5%) or two (32.2%) terms
Methods to improve (expand) the query:
Relevance feedback.
Local analysis or global analysis.
Natural Language Processing resources.
Experiments with users show that they prefer to
maintain control over how the query is reformulated (Belkin
et al., 2001).
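User-controlled expansion can be sketched as a two-step exchange: the system suggests related terms, and only the ones the user accepts enter the query. This is a minimal sketch under stated assumptions: `RELATED` is a hypothetical table standing in for an ontology or other language resource, and the AND/OR output is a generic boolean string rather than the syntax of any particular search engine.

```python
# Hypothetical related-term table standing in for an ontology lookup.
RELATED = {
    "cancer": ["neoplasm", "tumor", "carcinoma"],
    "skin": ["dermis", "epidermis"],
}


def suggest(query):
    """Return candidate expansion terms for each query term."""
    return {term: RELATED.get(term, []) for term in query.lower().split()}


def expand(query, accepted):
    """Build a boolean query using only the suggestions the user accepted."""
    parts = []
    for term in query.lower().split():
        chosen = [s for s in RELATED.get(term, []) if s in accepted]
        if chosen:
            parts.append("(" + " OR ".join([term] + chosen) + ")")
        else:
            parts.append(term)  # no accepted suggestion: keep the term as typed
    return " AND ".join(parts)


suggestions = suggest("skin cancer")
expanded = expand("skin cancer", accepted={"neoplasm", "carcinoma"})
print(suggestions)
print(expanded)
```

Nothing is added to the query unless it appears in `accepted`, which keeps the reformulation under the user's control.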
40. + 43
Experiences in Information Retrieval
and Query user-defined expansion (II)
Experience in using ontologies to assist the definition of the
search string… previously
41. +
Experiences in Information Retrieval
and Query user-defined expansion (II)
How does it work?
Pre-retrieval construction of the graph
42. + 45
Research: Information Retrieval
(and III)
… or using ontologies to build an enriched concept graph that
assists the definition of the search string
http://www.uhu.es/manuel.villa/viewmed/
de la Villa, M., Garcia, S., Maña, M.
“¿De verdad sabes lo que quieres buscar? Expansión guiada visualmente
de la cadena de búsqueda usando ontologías y grafos de conceptos”
[Do you really know what you want to search for? Visually guided expansion of the search string using ontologies and concept graphs].
27th Annual Conference of the Sociedad Española para el
Procesamiento del Lenguaje Natural (SEPLN'11), Huelva, September 2011.
43. + 46
Known tools. Expectations.
Known tools:
UMLS: Metathesaurus, Semantic Network
Tools: MetaMap, MMTx API, SemRep, UTS Web Services, …
Freebase: MQL (Metaweb Query Language)
Newbie with UIMA & GATE
Expectations:
I offer my collaboration if you’re interested in using any of these resources
I’m open to collaborating on whatever task you consider related and…
… to receive some guidelines to improve the summarization method
Any questions?