Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to Create a Custom Taxonomy
Pingar presentation at ShareFEST in Philadelphia (Apr 2013).
2. Why?
Why Automatic Generation?
Dynamic
Fast
Cheap
Consistent
RDF / Flexible
…
Why from a Document Collection?
Focused/specific
Optimal for those documents
…
Why Taxonomies?
Organize knowledge
Domain representation
Enable automatic tasks
…
Why in SharePoint?
All you need is there!
Can be used straight away!
3. Talk Overview
The Team
The Process
Evaluation
Use Cases
– Withdrawn drug
– Cancer treatments
– Re-purposed drug
Summary
4. Taxonomy Generation Research Team
Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten
Constructing a Focused Taxonomy from a Document Collection
ESWC 2013, Montpellier, France
5. Taxonomy Generation Process
Input:
Documents
stored somewhere
Analysis:
Using a variety of tools*
and datasets, extract
concepts,
entities, relations
Grouping & Output:
A taxonomy is created
that groups resulting
taxonomy terms
hierarchically
Custom
Taxonomy
7. In 5 Steps!
Input: Docs, preferred top-level terms
1. Import & convert to text
2. Extract concepts
3. Annotate with Linked Data
4. Disambiguate clashing concepts
5. Consolidate taxonomy
Document Database: Solr
Concepts & Relations Database: Sesame
Output: Focused SKOS Taxonomy
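The output of the pipeline is a SKOS taxonomy. As a minimal sketch of what that could look like on disk, the snippet below writes broader/narrower pairs as Turtle using the standard SKOS terms `skos:Concept`, `skos:prefLabel` and `skos:broader`. The `http://example.org/taxonomy/` namespace, the `@en` language tag and the function names are assumptions for illustration, not Pingar's actual output format.

```python
from urllib.parse import quote

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/taxonomy/"  # hypothetical namespace


def iri(label):
    """Turn a label into an IRI under the hypothetical namespace."""
    return f"<{BASE}{quote(label, safe='')}>"


def to_skos_turtle(edges):
    """Serialize (narrower, broader) label pairs as SKOS in Turtle."""
    broader = {}
    for narrower, parent in edges:
        broader.setdefault(narrower, set()).add(parent)
        broader.setdefault(parent, set())
    lines = [f"@prefix skos: <{SKOS}> .", ""]
    for label in sorted(broader):
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f"{iri(label)} a skos:Concept ;")
        parts = [f'    skos:prefLabel "{escaped}"@en']
        parts += [f"    skos:broader {iri(p)}" for p in sorted(broader[label])]
        lines.append(" ;\n".join(parts) + " .")
    return "\n".join(lines)


turtle = to_skos_turtle([("lizard", "Squamata"), ("Squamata", "Animals")])
print(turtle)
```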
8. Step 1. Document input & conversion
Input Documents
1. Convert to text
Document Database
Current input:
• Directory path read
recursively
Other possible inputs:
• Docs in a database or a DMS
• Emails + attachments (Exchange)
• Website URL
• RSS feed
External tool to
convert different file
formats to text
Database to store
document content
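The "directory path read recursively" input can be sketched with the standard library alone. This is an illustration under stated assumptions: `read_collection` is a hypothetical helper, only files that are already plain text are read, and the external format converter and the document database are left out.

```python
import tempfile
from pathlib import Path


def read_collection(root, suffixes=(".txt",)):
    """Walk `root` recursively and return {relative path: text}."""
    root = Path(root)
    docs = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            rel = path.relative_to(root).as_posix()
            docs[rel] = path.read_text(encoding="utf-8", errors="replace")
    return docs


# Usage on a throwaway directory (file names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "mycollection"
    nested.mkdir()
    (nested / "document456.txt").write_text("sample text", encoding="utf-8")
    (Path(tmp) / "notes.bin").write_bytes(b"\x00")  # skipped: not a text file
    docs = read_collection(tmp)

print(docs)
```

Keys are stored relative to the root so the same collection gives the same identifiers wherever it is mounted.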
9. Step 2. Extracting concepts
Documents
Database
Concepts
Database
2. Extract concepts
http://localhost/solr/select?q=path:mycollection/document456.txt
Pingar API:
Taxonomy Terms:
Climate and Weather
Leaders
Agreements
People:
Yvo de Boer
Maite Nkoana-Mashabane
Organizations:
Associated Press
South African Council of Churches
Locations:
South Africa
Wikify:
Wikipedia Terms:
South Africa
Yvo de Boer
U.N.
Climate agreements
Associated Press
Specific terminology:
green policies; climate diplomacy
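The Solr lookup shown on this slide selects a stored document by its `path` field. A small sketch of building such a URL safely is below. The base URL and field name come from the slide; quoting the value as a phrase and the helper name `solr_path_query` are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SOLR_SELECT = "http://localhost/solr/select"  # base URL as shown on the slide


def solr_path_query(path, base=SOLR_SELECT):
    """Build a select URL that looks a document up by its `path` field."""
    # Assumption: the value is sent as a quoted phrase so that "/" and
    # spaces in the path are not parsed as query syntax.
    escaped = path.replace("\\", "\\\\").replace('"', '\\"')
    return base + "?" + urlencode({"q": f'path:"{escaped}"'})


url = solr_path_query("mycollection/document456.txt")
print(url)
# Decoding the query string again gives back the readable form.
print(parse_qs(urlsplit(url).query)["q"][0])
```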
10. Step 3. Annotation with meaning
Annotations
Database
3. Annotate with
Linked Data
mycollection/document456.txt
Pingar API:
People:
Yvo de Boer
Maite Nkoana-Mashabane
Organizations:
Associated Press
South African Council of Churches
Locations:
South Africa
Later this additional info
will help create
e-Discovery & semantic search
solutions
Concepts
Database
11. Step 4. Discarding irrelevant meanings
Final Concepts
Database
4. Disambiguate
clashing concepts
Example 1: "Over the past three years, Apple has acquired three mapping companies"
Candidates: wikipedia.org/wiki/Apple_Corps, freebase.com/view/en/apple_inc
Two dissimilar concepts were extracted: discard the incorrect one.
Example 2: "For millions of years, the oceans have been filled with sounds from natural sources."
Candidates: wikipedia.org/wiki/Ocean, www.fao.org/aos/agrovoc#c_4607 (Agrovoc term: Marine areas)
Two similar concepts were extracted: accept both as correct.
Concepts
Database
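The keep-or-discard decision in this step can be sketched as follows. The slides do not say how similarity is measured, so Jaccard overlap between descriptor words is a stand-in, and the descriptor words, the threshold and the function names are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two word sets, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def disambiguate(candidates, context_words, threshold=0.3):
    """Resolve exactly two candidate meanings found for one phrase."""
    (name1, words1), (name2, words2) = candidates.items()
    if jaccard(words1, words2) >= threshold:
        return [name1, name2]  # similar meanings: accept both
    # Dissimilar meanings: keep the one closest to the document context.
    best = max(candidates, key=lambda n: jaccard(candidates[n], context_words))
    return [best]


# Hypothetical descriptor words for the two examples on the slide.
apple = {
    "wikipedia.org/wiki/Apple_Corps": {"music", "beatles", "records", "company"},
    "freebase.com/view/en/apple_inc": {"technology", "company", "iphone", "mapping", "acquired"},
}
ocean = {
    "wikipedia.org/wiki/Ocean": {"sea", "water", "marine", "ocean"},
    "www.fao.org/aos/agrovoc#c_4607": {"marine", "sea", "water", "areas"},
}
context = {"apple", "acquired", "three", "mapping", "companies"}

print(disambiguate(apple, context))
print(disambiguate(ocean, {"oceans", "sounds", "natural", "sources"}))
```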
12. Step 5. Group taxonomy (a)
5a. Add relations
Concepts & Relations Database
Building the taxonomy bottom up
Terms: felines, tiger, bird, horse family, zebra, donkey, pigeon, horse, lizard
Categories: Category:Carnivorous animals, Category:Animals
Top-level term: animals
Broader: Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals
Focused SKOS Taxonomy
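Building bottom up means starting from each extracted concept and following its chain of broader terms until a top-level term is reached. A minimal sketch, assuming the broader chains have already been looked up: the lizard chain is the one on the slide, the zebra chain and the function name `add_relations` are illustrative.

```python
def add_relations(broader_chains):
    """Turn {concept: [broader, broader-of-broader, ...]} into edge pairs."""
    edges = set()
    for concept, chain in broader_chains.items():
        path = [concept] + list(chain)
        # Link every term to the next, broader term in its chain.
        for narrower, broader in zip(path, path[1:]):
            edges.add((narrower, broader))
    return edges


chains = {
    "lizard": ["Squamata", "Reptiles", "Tetrapods", "Vertebrates", "Chordates", "Animals"],
    "zebra": ["horse family", "Animals"],  # illustrative chain
}
edges = add_relations(chains)
print(sorted(edges))
```

Chains that share a broader term, here `Animals`, meet at that node, which is how separate concepts end up in one hierarchy.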
13. Step 5. Consolidating taxonomy (b)
Films and film making
Film stars
Mila Kunis
Daniel Radcliffe
Sally Hawkins
Julianna Margulies
Association football clubs
Former Football League clubs
Manchester United F.C.
Manchester United F.C.
Manchester City F.C.
Finance
Economics and finance
Personal finance
Commercial finance
Tax
Capital gains tax
Tax
Capital gains tax
5b. Prune relations
Concepts & Relations Database
Focused
SKOS
Taxonomy
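One possible pruning rule is sketched below: drop a direct link to a broader term when the same term is already reachable through a longer path. The slides do not spell out the pruning heuristics, so this rule, the edge list and the name `prune_redundant` are assumptions for illustration only.

```python
def prune_redundant(edges):
    """Drop (child, ancestor) edges that a longer path already implies."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    def ancestors(node, seen=None):
        seen = set() if seen is None else seen
        for p in parents.get(node, ()):
            if p not in seen:  # the seen set also guards against cycles
                seen.add(p)
                ancestors(p, seen)
        return seen

    kept = set()
    for child, parent in edges:
        others = parents[child] - {parent}
        if any(parent in ancestors(o) for o in others):
            continue  # parent is reachable via another parent: redundant
        kept.add((child, parent))
    return kept


raw = {
    ("Capital gains tax", "Tax"),
    ("Tax", "Finance"),
    ("Capital gains tax", "Finance"),  # implied by the two edges above
}
pruned = prune_redundant(raw)
print(sorted(pruned))
```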
14. Evaluation
Recall: 75%
(comparing with manually generated taxonomy for the
same domain)
Precision:
89% for concepts
90% for relations
(evaluation based on 15 human judges)
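For reference, the two measures can be computed as in the sketch below. The term sets are toy data, not the evaluation data, and how the judges' votes were aggregated is not stated in the slides, so `precision` simply takes one correct/incorrect verdict per item.

```python
def recall(generated, reference):
    """Share of the reference (manual) taxonomy that was also generated."""
    generated, reference = set(generated), set(reference)
    return len(generated & reference) / len(reference)


def precision(judgements):
    """Share of generated items judged correct (one boolean per item)."""
    return sum(judgements) / len(judgements)


# Toy data for illustration only.
reference = {"Tax", "Finance", "Capital gains tax", "Personal finance"}
generated = {"Tax", "Finance", "Capital gains tax", "Film stars"}

print(recall(generated, reference))          # 0.75
print(precision([True, True, True, False]))  # 0.75
```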
15. SharePoint Taxonomy Generation Process
Analysis:
Using a variety of tools*
and datasets, extract
concepts,
entities, relations
Custom
Taxonomy
16. Triazolam
[A benzodiazepine drug used for short-term treatment of acute insomnia. Withdrawn in 1991 in the UK because of risk of psychiatric adverse drug reactions. It continues to be available in the U.S.]
Excerpt of the taxonomy generated from:
- 131 PubMed abstracts of clinical trials on triazolam before 1991
- 180 PubMed abstracts of clinical trials on triazolam since 1991
Colors of terms:
- proposed to group other terms
- found in both document collections
- in before withdrawal docs
- in since withdrawal docs
Taxonomy Statistics
Concept Count: 305
Edges Count: 437
Intermediate Count: 97
Leaves Count: 183
Labels Count: 353
Nesting Counts
0: 25
1: 51
2: 124
3: 160
4: 176
5: 153
6: 54
7: 4
Average Depth: 3.6
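The slide does not define how these statistics are computed. The figures are consistent with two readings, both assumptions here: nesting level 0 holds the top-level terms, so top-level + intermediate + leaf counts add up to the concept count, and the average depth is taken over the entries below level 0. The check below uses only the numbers on the slide.

```python
def average_depth(nesting_counts):
    """Mean nesting level over all entries below the top level (level 0)."""
    below_top = {d: n for d, n in nesting_counts.items() if d > 0}
    return sum(d * n for d, n in below_top.items()) / sum(below_top.values())


# Numbers copied from the triazolam slide.
nesting = {0: 25, 1: 51, 2: 124, 3: 160, 4: 176, 5: 153, 6: 54, 7: 4}
stats = {"concepts": 305, "intermediate": 97, "leaves": 183}

top_level = nesting[0]
print(top_level + stats["intermediate"] + stats["leaves"])  # 305
print(round(average_depth(nesting), 2))                     # 3.6
```

The nesting counts sum to more than the concept count, which fits a taxonomy where a concept with several broader terms is counted once per position.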
17-19. [Figures: excerpts of the triazolam taxonomy; color legend as listed on slide 16]
20. Cancer Treatments
Excerpt of the taxonomy generated from:
- 200 PubMed abstracts on breast cancer treatments
- 149 (all) PubMed abstracts on lung cancer treatments
- 47 (all) PubMed abstracts on gastric cancer treatments
Colors of terms:
- proposed to group other terms
- found in two or more document collections
- in the breast treatment docs
- in the stomach treatment docs
- in the lung treatment docs
Taxonomy Statistics
Concept Count: 308
Edges Count: 387
Intermediate Count: 90
Leaves Count: 195
Labels Count: 371
Nesting Counts
0: 23
1: 52
2: 99
3: 138
4: 137
5: 159
6: 60
7: 36
8: 6
Average Depth: 3.88
21-26. [Figures: excerpts of the cancer treatments taxonomy; color legend as listed on slide 20]
27. Tamoxifen
Tamoxifen is a drug commonly used to treat breast cancer but with a subsequent indication for treating bipolar disorder.
Excerpt of the taxonomy generated from:
- papers discussing tamoxifen and bipolar disorder: 8 PubMed abstracts AND 2 PDFs of full papers (17641532, 18316672)
- papers discussing tamoxifen and breast cancer: 50 PubMed abstracts AND 2 PDFs of full papers (21635709, 12618491)
- papers discussing tamoxifen with no mention of either breast cancer or bipolar disorder: 50 PubMed abstracts AND 2 PDFs of full papers (16275887, 19458291)
Colors of terms:
- proposed to group other concepts
- in two or more document collections
- in the bipolar document collection
- in the breast cancer document collection
- in the neither-cancer-nor-bipolar document collection
Taxonomy Statistics
Concept Count: 587
Edges Count: 751
Intermediate Count: 188
Leaves Count: 365
Labels Count: 718
Nesting Counts
0: 34
1: 73
2: 133
3: 284
4: 225
5: 157
6: 89
7: 30
8: 2
Average Depth: 3.66
28-33. [Figures: excerpts of the tamoxifen taxonomy; color legend as listed on slide 27]