Semantic annotation, clustering and visualization
1. Semantic annotation, clustering and visualization
Media Technology Msc Programme
David Graus Graduation Project
Supervisor: Joris Slob
2. David Graus Media Technology Msc Programme 07/02/2012
Introduction
3. David Graus Media Technology Msc Programme 07/02/2012
Cyttron DB entry
"The volume of the brain evaluated in this study. The
color scale represents the number of 4-mm voxels with
data in at least 7 subjects along a 3-cm deep line into
the brain. A three-dimensional rendering of a brain is
shown in regions where insufficient data were
obtained. The most superior regions of the frontal and
parietal lobes and the most inferior regions of the
temporal lobes were not evaluated. Imaging artifacts
may also compromise the significance of results in the
most inferior portions of the frontal lobe."
4. David Graus Media Technology Msc Programme 07/02/2012
Tasks
1. Semantic annotation
Identify and tag most important concepts from text [NLP]
2. Topic extraction
Relate concepts and find clusters [Linked Data]
3. Visualization
Draw resulting graphs and clusters [Data visualization]
5. David Graus Media Technology Msc Programme 07/02/2012
1. Semantic Annotation
Method I: Find words
Method II: Compare texts
6. David Graus Media Technology Msc Programme 07/02/2012
Semantic Annotation: Method I
"The volume of the brain evaluated in this study. The
color scale represents the number of 4-mm voxels with
data in at least 7 subjects along a 3-cm deep line into
the brain. A three-dimensional rendering of a brain is
shown in regions where insufficient data were
obtained. The most superior regions of the frontal and
parietal lobes and the most inferior regions of the
temporal lobes were not evaluated. Imaging artifacts
may also compromise the significance of results in the
most inferior portions of the frontal lobe."
7. David Graus Media Technology Msc Programme 07/02/2012
Formal knowledge: Biomedical Ontology
8. David Graus Media Technology Msc Programme 07/02/2012
NCI Thesaurus
89,129 unique concepts
50,804 definitions
258,051 synonyms
Relations!
Concept: Agrobacterium tumefaciens
Definition: A species of Gram negative, rod shaped bacteria assigned to the phylum Proteobacteria. This bacteria is motile by flagella and mediates the horizontal gene transfer of its Ti plasmid to infect plants. A. tumefaciens is commonly found in soil and around the root surfaces of plants and is the causative agent of crown gall disease.
Synonyms: RHIZOBIUM RADIOBACTER; CDC GROUP VD-3
9. David Graus Media Technology Msc Programme 07/02/2012
Semantic Annotation: Method I
"The volume of the brain evaluated in this study. The
color scale represents the number of 4-mm voxels with
data in at least 7 subjects along a 3-cm deep line into
the brain. A three-dimensional rendering of a brain is
shown in regions where insufficient data were
obtained. The most superior regions of the frontal and
parietal lobes and the most inferior regions of the
temporal lobes were not evaluated. Imaging artifacts
may also compromise the significance of results in the
most inferior portions of the frontal lobe."
12. David Graus Media Technology Msc Programme 07/02/2012
Example
"The volume of the brain evaluated in this study. The color scale represents the number of
4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-
dimensional rendering of a brain is shown in regions where insufficient data were obtained.
The most superior regions of the frontal and parietal lobes and the most inferior regions of
the temporal lobes were not evaluated. Imaging artifacts may also compromise the
significance of results in the most inferior portions of the frontal lobe."
Most, Brain, A, Inferior, Data, And, With, Volume, Volume, Three,
Temporal, Superior, Study, Scale, Parietal, Number, Lobe, Line, Into,
Frontal Lobe, Deep, Color, At
14. David Graus Media Technology Msc Programme 07/02/2012
Semantic Annotation: Method I
2 ‘Modifiers’ of representations:
1. (Porter) Stemming (text & ontology concepts)
Lobes – lobe
Brains – brain
Etc…
2. Generate synonyms (using WordNet)
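A minimal sketch of how stemming can help Method I match text words to ontology concept labels. The `crude_stem` function is a hypothetical plural stripper standing in for Porter stemming; it is not the Porter algorithm, and the concept labels are illustrative. Synonym generation is left out because it needs WordNet.

```python
import re

def crude_stem(word: str) -> str:
    """Strip a plural suffix. A stand-in for Porter stemming, not the real algorithm."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w

def match_concepts(text, concept_labels):
    """Return labels whose stemmed words all occur among the stemmed text tokens.

    This is a bag-of-words match: the words of a label need not be adjacent.
    """
    tokens = {crude_stem(t) for t in re.findall(r"[A-Za-z]+", text)}
    hits = []
    for label in concept_labels:
        words = [crude_stem(w) for w in re.findall(r"[A-Za-z]+", label)]
        if words and all(w in tokens for w in words):
            hits.append(label)
    return hits

text = "The most superior regions of the frontal and parietal lobes were not evaluated."
labels = ["Lobe", "Brain", "Parietal Lobe", "Mammary Gland"]  # illustrative labels
print(match_concepts(text, labels))
```

Stemming both sides is what lets "lobes" in the text hit the concept "Lobe".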
15. David Graus Media Technology Msc Programme 07/02/2012
Different text representations
Most frequent words: 'brain, regions, data, evaluated, frontal, inferior, lobes, along, also, artifacts, color, compromise, deep, dimensional, imaging, insufficient, least, line, lobe'
Most frequent nouns: 'brain, color, deep, imaging, insufficient, line, lobe, number, rendering, scale, significance, study, volume'
Bigrams: 'also compromise, artifacts may, cm deep, color scale, compromise significance, deep line, dimensional rendering, imaging artifacts, may also, mm voxels, represents number, scale represents, significance results, subjects along, data least, data obtained, evaluated study, frontal lobe, frontal parietal, inferior portions'
Trigrams: 'also compromise significance, artifacts may also, cm deep line, color scale represents, compromise significance results, imaging artifacts may, may also compromise, scale represents number, insufficient data obtained, mm voxels data, portions frontal lobe, […]
Combo: 'brain, regions, data, evaluated, frontal, inferior, lobes, along, also, artifacts, color, compromise, deep, dimensional, imaging, insufficient, least, line, lobe. brain, color, deep, imaging, insufficient, […]
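The keyword representations above could be produced roughly as follows. This is a sketch under assumptions: the stopword list is a small hypothetical one, and n-grams are built after stopword removal, which is what bigrams such as "compromise significance" suggest. Noun filtering is omitted because it needs a part-of-speech tagger.

```python
import re
from collections import Counter

# Hypothetical stopword list; the list used for the slides is not given.
STOPWORDS = {"the", "of", "in", "a", "is", "and", "were", "to", "with",
             "at", "into", "this", "where", "not", "most"}

def content_tokens(text):
    """Lowercase, keep alphabetic runs, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_frequent(tokens, k):
    """The k most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(k)]

text = ("Imaging artifacts may also compromise the significance of results "
        "in the most inferior portions of the frontal lobe.")
tokens = content_tokens(text)
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```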
16. David Graus Media Technology Msc Programme 07/02/2012
Semantic Annotation: Method I
6 Representations (literal + 5 keyword variations)
4 Treatments (literal + stem + synonyms + both)
24 results
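The 24 results are the cross product of representations and treatments. A small sketch, with labels paraphrased from the slides:

```python
from itertools import product

# Labels paraphrased from the slides; the exact naming is an assumption.
REPRESENTATIONS = ["literal", "frequent words", "frequent nouns",
                   "bigrams", "trigrams", "combo"]
TREATMENTS = ["literal", "stem", "synonyms", "stem + synonyms"]

# One annotation run per (representation, treatment) pair.
runs = list(product(REPRESENTATIONS, TREATMENTS))
print(len(runs))  # 24
```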
17. David Graus Media Technology Msc Programme 07/02/2012
Method II: Text Comparison
Find concepts that might not occur in text
"The volume of the brain evaluated in
this study. The color scale represents the
number of 4-mm voxels with data in at
least 7 subjects along a 3-cm deep line
into the brain. A three-dimensional
rendering of a brain is shown in regions
where insufficient data were obtained.
The most superior regions of the frontal
and parietal lobes and the most inferior
regions of the temporal lobes were not
evaluated. Imaging artifacts may also
compromise the significance of results
in the most inferior portions of the
frontal lobe."
18. David Graus Media Technology Msc Programme 07/02/2012
Compare text to definitions
Find relevant concepts based on their (textual) definitions
Cyttron entry:
"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dim […]
NCI Thesaurus definitions:
Parietal Lobe: One of the lobes of the cerebral hemisphere located superiorly to the occipital lobe and posteriorly to the frontal lobe. Cognition and visuospatial processing are its main functions.
20. David Graus Media Technology Msc Programme 07/02/2012
Compare how?
Bag of Words + TF-IDF
Dictionary: BioMedCentral Corpus
> 100,000 articles
> 8GB raw data
Process Corpus
Clean (strip tags, store only article body)
Tokenize (create list of words)
Remove common words (stopwords)
Stem remaining words
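The four processing steps can be sketched as one function. Assumptions: the corpus markup is HTML-like, the stopword list is a hypothetical placeholder, and the stemmer is passed in as a function (identity by default) because the slides name Porter stemming without showing an implementation. This sketch keeps all text nodes; selecting only the article body depends on the corpus markup, which the slides do not describe.

```python
import re
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collects text nodes and ignores all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(markup):
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)

# Hypothetical stopword list.
STOPWORDS = {"the", "of", "in", "a", "is", "and", "to", "with"}

def preprocess(markup, stem=lambda w: w):
    """Clean, tokenize, remove stopwords, then stem the remaining words."""
    text = strip_tags(markup)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("<p>The frontal <b>lobes</b> of the brain.</p>"))
```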
21. David Graus Media Technology Msc Programme 07/02/2012
Method II: Text Comparison
Convert both texts to vector space using dictionary, compute similarity.
Return most similar concepts.
"The volume of the brain evaluated in this study. The color scale represents the number of 4-mm voxels with data in at least 7 subjects along a 3-cm deep line into the brain. A three-dimensional rendering of a brain is shown in regions where insufficient data were obtained. The most superior regions of the frontal and parietal lobes and the most inferior regions of the temporal lobes were not evaluated. Imaging artifacts may also compromise the significance of results in the most inferior portions of the frontal lobe."
1. Frontotemporal Dementia
2. Parietal Lobe
3. Area of Broca
4. Anterior Cranial Fossa
5. Brain Lobectomy
6. Anterior Parietal Artery
7. Mammary Gland
8. Frontal Lobe
9. Interlobar
10. Lobar
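A minimal sketch of the comparison step: weight both texts with TF-IDF and rank concept definitions by cosine similarity. The toy corpus, the token lists, and the smoothed IDF formula are assumptions for illustration; the slides only state that the dictionary comes from the BioMedCentral corpus.

```python
import math
from collections import Counter

def build_idf(corpus):
    """corpus: list of token lists. Returns a smoothed inverse document frequency per term."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms missing from the dictionary are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(text_tokens, definitions, idf):
    """Return (similarity, concept) pairs, most similar first."""
    query = tfidf(text_tokens, idf)
    scored = [(cosine(query, tfidf(toks, idf)), name) for name, toks in definitions.items()]
    return sorted(scored, reverse=True)

# Toy, already-preprocessed definitions (hypothetical token lists).
definitions = {
    "Parietal Lobe": ["lobe", "cerebral", "hemisphere", "frontal", "lobe"],
    "Mammary Gland": ["gland", "milk", "breast"],
}
corpus = list(definitions.values()) + [["brain", "lobe", "frontal"]]
idf = build_idf(corpus)
ranking = rank(["frontal", "parietal", "lobe", "brain"], definitions, idf)
print(ranking)
```

Because matching happens on shared vocabulary, a definition can rank highly even when the concept name itself never appears in the text.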
22. David Graus Media Technology Msc Programme 07/02/2012
Method II: Text Comparison
Different cut-off rules:
1. Anything over x% similar
2. 5 most similar
3. 10 most similar
4. 20% most similar
5. 10% most similar
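The cut-off rules can be sketched as three small filters over a scored list. One assumption to flag: "20% most similar" is read here as the top 20% of the ranked list, rounded up; the slides do not say whether it means that or scores within 20% of the best score. The scores below are invented for illustration.

```python
import math

def over_threshold(scored, x):
    """Rule 1: keep everything with similarity above x (x in 0..1)."""
    return [(s, c) for s, c in scored if s > x]

def top_k(scored, k):
    """Rules 2 and 3: keep the k most similar concepts."""
    return sorted(scored, reverse=True)[:k]

def top_percent(scored, pct):
    """Rules 4 and 5: keep the top pct percent of the list, rounded up."""
    k = math.ceil(len(scored) * pct / 100)
    return top_k(scored, k)

# Invented (similarity, concept) pairs.
scored = [(0.9, "concept A"), (0.7, "concept B"), (0.4, "concept C"),
          (0.2, "concept D"), (0.1, "concept E")]
print(over_threshold(scored, 0.5))
print(top_k(scored, 3))
print(top_percent(scored, 20))
```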
23. David Graus Media Technology Msc Programme 07/02/2012
Result
Long list of (linked) concepts
Relevancy?
24. David Graus Media Technology Msc Programme 07/02/2012
Find clusters
Measure semantic similarity between concepts
- Shortest paths
- Shared parents
- Node’s ‘depth’
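The three ingredients above can be sketched on a toy hierarchy. The `PARENT` table is hypothetical and is not taken from the NCI Thesaurus; it also assumes a single parent per concept, which a real ontology need not satisfy.

```python
from collections import deque

# Hypothetical child -> parent ("is-a") table.
PARENT = {
    "Frontal Lobe": "Cerebral Lobe",
    "Parietal Lobe": "Cerebral Lobe",
    "Cerebral Lobe": "Brain Part",
    "Brain Part": "Anatomic Structure",
    "Mammary Gland": "Gland",
    "Gland": "Anatomic Structure",
}

def ancestors(concept):
    """The concept itself followed by its chain of parents up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(concept):
    """Number of edges between the concept and the root."""
    return len(ancestors(concept)) - 1

def shared_parents(a, b):
    """Common ancestors of a and b, deepest first."""
    others = set(ancestors(b))
    return [c for c in ancestors(a) if c in others]

def neighbours(concept):
    result = [child for child, parent in PARENT.items() if parent == concept]
    if concept in PARENT:
        result.append(PARENT[concept])
    return result

def shortest_path_length(a, b):
    """Breadth-first search over the hierarchy, treating edges as undirected."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # not connected

print(shortest_path_length("Frontal Lobe", "Parietal Lobe"))
print(shortest_path_length("Frontal Lobe", "Mammary Gland"))
print(shared_parents("Frontal Lobe", "Parietal Lobe"))
```

Concepts joined by a short path and a deep shared parent are candidates for the same cluster.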
25. David Graus Media Technology Msc Programme 07/02/2012
26. David Graus Media Technology Msc Programme 07/02/2012
To do
Get data!
Analyse algorithms
Editor's notes
So these are the 10 most similar concepts returned
Example of a connected graph. I want to explore the possibilities of visualizing the results, with varying node (circle) sizes for more and less important concepts, and colored and transparent circles for literal and non-literal concepts, conveying the information from the text in a graph. This might also help with analyzing the differences between my method and that of humans.