Kalpa Gunaratna's Ph.D. dissertation defense: April 19, 2017
The processing of structured and semi-structured content on the Web has been gaining attention with the rapid progress in the Linking Open Data project and the development of commercial knowledge graphs. Knowledge graphs capture domain-specific or encyclopedic knowledge in the form of a data layer and add rich and explicit semantics on top of the data layer to infer additional knowledge. The data layer of a knowledge graph represents entities and their descriptions. The semantic layer on top of the data layer is called the schema (ontology), where relationships of the entity descriptions, their classes, and the hierarchy of the relationships and classes are defined. Today, there exist large knowledge graphs in the research community (e.g., encyclopedic datasets like DBpedia and Yago) and corporate world (e.g., Google knowledge graph) that encapsulate a large amount of knowledge for human and machine consumption. Typically, they consist of millions of entities and billions of facts describing these entities. While it is good to have this much knowledge available on the Web for consumption, it leads to information overload, and hence proper summarization (and presentation) techniques need to be explored.
In this dissertation, we focus on creating both comprehensive and concise entity summaries at: (i) the single-entity level and (ii) the multiple-entity level. To summarize a single entity, we propose a novel approach called FACeted Entity Summarization (FACES) that considers both the importance of facts, computed by combining popularity and uniqueness, and the diversity of the facts selected for the summary. We first conceptually group facts using semantic expansion and hierarchical incremental clustering techniques, forming facets (i.e., groupings) that go beyond syntactic similarity. Then we rank both the facts and the facets using Information Retrieval (IR) ranking techniques and pick the highest-ranked facts from these facets for the summary. The unique contribution of this approach is that its facets inject diversity into entity summaries, making them comprehensive. For creating multiple entity summaries, we process the facts of the given entities simultaneously using combinatorial optimization techniques. In this process, we maximize the diversity and importance of facts within each entity summary and the relatedness of facts between the entity summaries. The proposed approach uniquely combines semantic expansion, graph-based relatedness, and combinatorial optimization techniques to generate relatedness-based multi-entity summaries.
Complementing the entity summarization approaches, we introduce a novel approach using light Natural Language Processing (NLP) techniques to enrich knowledge graphs by adding type semantics to literals.
Semantics-based Summarization of Entities in Knowledge Graphs
1. Semantics-based Summarization of
Entities in Knowledge Graphs
Kalpa Gunaratna
Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis)
Wright State University
04.19.2017
Advisors: Prof. Amit Sheth and Prof. Krishnaprasad Thirunarayan
Ph.D. Committee: Prof. Keke Chen (Kno.e.sis), Prof. Gong Cheng (Nanjing University, China),
Dr. Edward Curry (NUIG, Ireland), Dr. Hamid R. Motahari-Nezhad (IBM Research, USA)
PhD Dissertation Defense
2. 1. Knowledge on the Web and concise presentation
2. Diversity-aware entity summarization
- Using hierarchical conceptual grouping.
3. Enriching knowledge graphs and entity summarization
- Add type semantics to literals and adapt them in summarization.
4. Relatedness-based multi-entity summarization
- Using quadratic multidimensional optimization techniques.
2
Talk overview
3. 3
What are triples?
dbr:Marie_Curie dbo:Person
rdf:type
RDF/Turtle syntax of the triple
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
dbr:Marie_Curie rdf:type dbo:Person .
dbr:Marie_Curie dbo:spouse dbr:Pierre_Curie .
Subject Predicate Object
dbr:Marie_Curie dbr:Pierre_Curie
dbo:spouse
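As a concrete illustration (not part of the original slides), the two triples above can be loaded and queried with the rdflib Python library; the Turtle snippet is exactly the one shown on this slide.

```python
# Minimal sketch: parsing the slide's Turtle snippet with rdflib (pip install rdflib).
from rdflib import Graph, URIRef

turtle_data = """
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

dbr:Marie_Curie rdf:type dbo:Person .
dbr:Marie_Curie dbo:spouse dbr:Pierre_Curie .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# Each triple is a (subject, predicate, object) tuple.
for s, p, o in g:
    print(s, p, o)

# Look up the spouse of Marie Curie.
spouse = g.value(URIRef("http://dbpedia.org/resource/Marie_Curie"),
                 URIRef("http://dbpedia.org/ontology/spouse"))
print(spouse)  # -> http://dbpedia.org/resource/Pierre_Curie
```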
4. 4
Open Datasets – Linked Open Data (LOD) cloud
2007, 2010, 2014, 2017
Image credits: http://lod-cloud.net/
6. 6
Entity summarization
spouse: Pierre_Curie
birthPlace: Warsaw
almaMater: ESPCI_ParisTech
workInstitutions: University_of_Paris
knownFor: Radioactivity
……
Entity Description
Knowledge Graph
property
value
Summary for
• Quick understanding
• Performing specific tasks
Millions of entities and billions of facts
Triple dbr:Marie_Curie dbp:spouse dbr:Pierre_Curie
7. 7
Entities and summaries in early 2000s
Rich Media Reference Page
Baltimore 31, Pit 24
http://www.nfl.com
Quandry Ismail and Tony Banks hook up for their third long
touchdown, this time on a 76-yarder to extend the Ravens’
lead to 31-24 in the third quarter.
Professional
Ravens, Steelers
Bal 31, Pit 24
Quandry Ismail, Tony Banks
Touchdown
NFL.com
2/02/2000
League:
Teams:
Score:
Players:
Event:
Produced by:
Posted date:
Patent: Sheth, Amit, David Avant, and Clemens Bertram. "System and method for creating
a semantic web and its applications in browsing, searching, profiling, personalization and advertising.
" U.S. Patent 6,311,194, issued October 30, 2001.
Publication: Sheth, Amit, Clemens Bertram, David Avant, Brian Hammond, Krys Kochut, and
Yashodhan Warke. "Managing semantic content for the Web." IEEE Internet Computing 6, no. 4 (2002): 80-87.
Taalee’s Rich Media Reference.
Talk: Semantic Web and Information Brokering: Opportunities, Early Commercialization, and Challenges. Keynote at Workshop on Semantic Web: Models, Architectures and
Management. Lisbon, Portugal, Sept. 21, 2000.
8. 8
Entities and summaries – Today’s Web Search
Google Knowledge Graph
(GKG) facilitates Google
Search.
Summarization is one of
their top priorities*.
* Singhal, A. 2012. Introducing the knowledge graph: things, not strings. Official Google Blog, May.
10. Entity related structured data on the Web can be concisely and
comprehensively summarized for efficient and convenient information
presentation. This can be achieved through synergistic use of:
(i) Unsupervised knowledge-based methods to conceptually group,
(ii) Information Retrieval-based techniques to intuitively rank,
(iii) Natural Language Processing techniques to semantically enrich
structured data, and
(iv) Combinatorial optimization techniques to handle relatedness of
multiple entities.
10
Thesis statement
11. 1. Knowledge on the Web and concise presentation
2. Diversity-aware entity summarization
- Using hierarchical conceptual grouping.
3. Enriching knowledge graphs and entity summarization
- Add type semantics to literals and adapt them in summarization.
4. Relatedness-based multi-entity summarization
- Using quadratic multidimensional optimization techniques.
11
Talk overview
12. 12
FACeted Entity Summaries – FACES
Existing approaches focus
on ranking, causing
redundancy in fixed-length
summary
FACES* approach
Ranking
Grouping
FACES produces diversified summaries
from: ranking + grouping
*[AAAI15] Kalpa Gunaratna, Krishnaparasad Thirunarayan, and Amit Sheth. "FACES: Diversity-Aware Entity
Summarization Using Incremental Hierarchical Conceptual Clustering." Twenty-Ninth AAAI Conference on
Artificial Intelligence. 2015.
adds comprehensiveness (through diversity)
knownFor : Radioactivity
field : Chemistry
workInstitutions : University of Paris
spouse : Pierre Curie
rank
13. 13
[Figure: entity description graph for Marie Curie, with values Pierre Curie, Warsaw, Passy,_Haute-Savoie, ESPCI_ParisTech, University_of_Paris, Radioactivity, and Chemistry linked through properties such as birthPlace, field, and knownFor.]
Concise and comprehensive summary
could be: {f1, f2, f6}
Non-faceted summary: {f4, f7, f5}
Entity - Marie Curie
Feature Set FS:

Facets | Features | Property | Value
F1 | f1 | spouse | Pierre_Curie
F2 | f2 | birthPlace | Warsaw
   | f3 | deathPlace | Passy,_Haute-Savoie
F3 | f4 | almaMater | ESPCI_ParisTech
   | f5 | workInstitutions | University_of_Paris
   | f6 | knownFor | Radioactivity
   | f7 | field | Chemistry
[Figure: the same entity graph shown again, with the feature values numbered by conceptual group (conceptually similar features share a number/color).]
14. o Number of groups in a feature set is unknown a priori.
– Hence, supervised techniques do not work.
o We want to identify conceptually similar groups.
– E.g., field: Chemistry and almaMater: ESPCI_ParisTech should be in the same group.
o We adapt Cobweb* to get facets, which is:
– Conceptual
– Incremental
– Hierarchical
14
Grouping (clustering) in FACES
* Fisher, D. H. 1987. Knowledge acquisition via incremental conceptual clustering. Machine
learning 2, 2, 139-172.
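For reference, Cobweb places each new item so as to maximize category utility; a standard formulation for nominal attribute-value pairs, following Fisher (1987) rather than a formula reproduced from the slides, is:

```latex
CU(\{C_1,\dots,C_K\}) \;=\; \frac{1}{K}\sum_{k=1}^{K} P(C_k)
\left[\sum_{i}\sum_{j} P(A_i = V_{ij}\mid C_k)^2 \;-\; \sum_{i}\sum_{j} P(A_i = V_{ij})^2\right]
```

Since FACES clusters the expanded feature set, features whose WordSets share many words increase the conditional term and therefore tend to land in the same facet.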
15. o Group similar themed features (i.e., conceptually similar).
o Each feature has only two attribute-value pairs.
– For example, birthPlace-Honolulu. The pairs are (property, birthPlace) and (value, Honolulu).
o Expand property and value of each feature.
o Use the expanded feature set for clustering.
15
How to use Cobweb in FACES
16. o Get the property label.
o Pre-process
– Remove stop words
– CamelCase, spaces, punctuation processing
– Tokenize
o Get hypernyms for the tokens using a lexical database (e.g.,
WordNet).
o Add hypernyms to the original set of tokens + label and create
the WordSet WS.
16
Property expansion
Example expansion: property birthPlace → WordSet {birthPlace, birth, place, beginning, point, area, locality}
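A minimal sketch of the property expansion step using NLTK's WordNet interface; the helper names, the stop-word list, and the choice of taking only direct hypernyms are illustrative assumptions, not the exact implementation.

```python
# Sketch of property expansion (assumes: pip install nltk; nltk.download('wordnet')).
import re
from nltk.corpus import wordnet as wn

STOP_WORDS = {"of", "the", "a", "an", "in", "for"}  # illustrative stop-word list

def tokenize_label(label):
    """Split a property label such as 'birthPlace' on camelCase, spaces, and punctuation."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", label)
    tokens = re.split(r"[\s_\-.,]+", spaced)
    return [t.lower() for t in tokens if t and t.lower() not in STOP_WORDS]

def expand_property(label):
    """Return a WordSet: the label, its tokens, and WordNet hypernyms of the tokens."""
    tokens = tokenize_label(label)
    word_set = {label} | set(tokens)
    for token in tokens:
        for synset in wn.synsets(token):
            for hypernym in synset.hypernyms():
                word_set.update(l.replace("_", " ") for l in hypernym.lemma_names())
    return word_set

print(expand_property("birthPlace"))
# Expected to include terms such as 'birth', 'place', 'point', 'area', 'locality', ...
```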
17. o Get the object URI.
o Get the types (ontology classes) for the URI.
o Pre-process types
– Remove stop words
– CamelCase, spaces, punctuation processing
– Tokenize
o Get hypernyms for types.
o Add hypernyms to the original set of type labels + tokens to
create WordSet WS.
17
Value expansion
Example expansion: value Honolulu → WordSet {place, PopulatedPlace, populated, point, area, locality}
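A corresponding sketch for value expansion: fetch the value URI's ontology classes from a SPARQL endpoint and expand their labels, reusing expand_property from the previous sketch. The endpoint URL and query shape are assumptions for illustration.

```python
# Sketch of value expansion (assumes: pip install SPARQLWrapper; expand_property from above).
from SPARQLWrapper import SPARQLWrapper, JSON

def get_types(value_uri, endpoint="http://dbpedia.org/sparql"):
    """Return DBpedia ontology classes (rdf:type) of the given resource URI."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        SELECT ?cls WHERE {{
            <{value_uri}> a ?cls .
            FILTER(STRSTARTS(STR(?cls), "http://dbpedia.org/ontology/"))
        }}""")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["cls"]["value"].rsplit("/", 1)[-1] for b in results["results"]["bindings"]]

def expand_value(value_uri):
    """WordSet for a value: its class labels, their tokens, and hypernyms of the tokens."""
    word_set = set()
    for cls in get_types(value_uri):      # e.g. 'PopulatedPlace', 'Place'
        word_set |= expand_property(cls)  # tokenize + hypernym expansion
    return word_set

print(expand_value("http://dbpedia.org/resource/Honolulu"))
```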
18. 18
WordSet examples
Feature (f) | Property expansion | Value expansion | WordSet (WS)
region:Illinois | {region, location, domain} | {place, PopulatedPlace, populated, point, area, locality} | {region, location, domain, PopulatedPlace, populated, place, point, area, locality}
birthPlace:Honolulu | {birthPlace, birth, place, beginning, point, area, locality} | {place, PopulatedPlace, populated, point, area, locality} | {birthPlace, birth, place, beginning, point, area, locality, PopulatedPlace, populated}
vicePresident:Joe_Biden | {vicePresident, vice, president, corporate executive, head of state} | {person, OfficeHolder, office, holder, organism, flesh, human body, occupation, job, staff, possessor, owner} | {vicePresident, vice, president, corporate executive, head of state, person, OfficeHolder, office, holder, organism, flesh, human body, occupation, job, staff, possessor, owner}
predecessor:George_W._Bush | {predecessor, forerunner, precursor} | {person, officeholder, office, holder, organism, flesh, human body, occupation, job, staff, possessor, owner} | {predecessor, forerunner, precursor, person, officeholder, office, holder, organism, flesh, human body, occupation, job, staff, possessor, owner}
(In the original slide, the pre-expansion token sets are highlighted in orange.)
19. 19
WordSet – How it helps for better grouping
[Figure: without WordSets, region:Illinois, vicePresident:Joe_Biden, and birthPlace:Honolulu share no terms and each falls into its own group; with WordSets, region:Illinois {region, location, domain, PopulatedPlace, place, point, area, locality} and birthPlace:Honolulu {birthPlace, birth, place, PopulatedPlace, point, area, locality, beginning} overlap heavily and are grouped together, while vicePresident:Joe_Biden {vicePresident, vice, president, corporate executive, head of state, person, OfficeHolder, human body, occupation, job, staff} remains separate.]
21. o Influenced by tf-idf.
o Inf(f): Informativeness of the feature (Uniqueness).
– Example: residence-WhiteHouse
o Po(v): Popularity of the value of the feature (frequency).
– Example: WhiteHouse
o Rank(f): rewards features that are both informative and have popular values.
21
Ranking features
N is the total number of entities
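The ranking formulas appear as images in the original deck; a reconstruction consistent with the FACES paper, where FS(e) is the feature set of entity e and v is the value of feature f, is roughly:

```latex
\mathrm{Inf}(f) = \log\frac{N}{\left|\{\,e : f \in FS(e)\,\}\right|},\qquad
\mathrm{Po}(v) = \log\left|\{\,\text{triples with object value } v\,\}\right|,\qquad
\mathrm{Rank}(f) = \mathrm{Inf}(f)\cdot\mathrm{Po}(v)
```

That is, a feature is informative when few of the N entities have it (idf-like), and its value is popular when many triples mention it (tf-like); the product balances the two, as in the residence-WhiteHouse example.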
22. 22
Faceted entity summary creation process
(1) Extract features for entity e → (2) Semantic expansion of each feature → (3) Cobweb partitioning into facets → (4) tf-idf based ranking of features → (5) Pick top-ranked features across facets.
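Putting the five steps together, a high-level sketch of faceted summary generation; the helper functions are illustrative placeholders for the components described on the other slides, not the dissertation's actual code.

```python
# High-level sketch of the FACES pipeline; the expansion, grouping, and ranking
# functions are passed in because their details are covered on other slides.
def faces_summary(features, expand, partition, rank, k):
    """Return a faceted summary of at most k features.

    features : list of (property, value) pairs for one entity
    expand   : feature -> WordSet (semantic expansion)
    partition: (features, word_sets) -> list of facets (Cobweb)
    rank     : feature -> score (informativeness * popularity)
    """
    word_sets = {f: expand(f) for f in features}   # (2) expansion
    facets = partition(features, word_sets)        # (3) grouping
    scores = {f: rank(f) for f in features}        # (4) ranking

    # (5) Pick the best remaining feature from each facet in turn -> diversity.
    summary, added = [], True
    while len(summary) < k and added:
        added = False
        for facet in facets:
            remaining = [f for f in facet if f not in summary]
            if remaining and len(summary) < k:
                summary.append(max(remaining, key=scores.get))
                added = True
    return summary
```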
23. o Gold standard contains ideal summaries generated by 15
judges.
o An ideal summary for entity e is denoted by SummI. Then
agreement is the overlap between the judges' ideal summaries.
o Summary quality is the overlap between the computer
generated summary (Summ(e)) and the ideal summaries for
the entity.
23
Evaluation
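The overlap measures on this slide appear as formulas in the original slides; a reconstruction consistent with how this line of work (RELIN/FACES) defines them, with n judges producing ideal summaries SummI_i(e), is:

```latex
\text{Agreement}(e) = \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=i+1}^{n}
\left| Summ^{I}_{i}(e) \cap Summ^{I}_{j}(e) \right|,
\qquad
\text{Quality}(Summ(e)) = \frac{1}{n} \sum_{i=1}^{n}
\left| Summ(e) \cap Summ^{I}_{i}(e) \right|
```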
24. 24
Evaluation cont.
• 50 entities in the gold standard. 69 users participated in
the user preference evaluation.
• On average, 44 features per entity.
Evaluation 1 – Gold standard:

System | Avg. Quality (k=5) | FACES % Gain (k=5) | Time/Entity | Avg. Quality (k=10) | FACES % Gain (k=10)
FACES | 1.4314 | NA | 0.76 sec | 4.3350 | NA
RELIN | 0.4981 | 187 % | 10.96 sec | 2.5188 | 72 %
RELINM | 0.6008 | 138 % | 11.08 sec | 3.0906 | 40 %
SUMMARUM | 1.2249 | 17 % | NA | 3.4207 | 27 %
Avg. Agreement | 1.9168 | | | 4.6415 |

Evaluation 2 – User preference:

System | Study 1 | Study 2
FACES | 84 % | 54 %
RELIN | NA | NA
RELINM | 16 % | 16 %
SUMMARUM | NA | 30 %
26. 1. Knowledge on the Web and concise presentation
2. Diversity-aware entity summarization
- Using hierarchical conceptual grouping.
3. Enriching knowledge graphs and entity summarization
- Add type semantics to literals and adapt them in summarization.
4. Relatedness-based multi-entity summarization
- Using quadratic multidimensional optimization techniques.
26
Talk overview
27. o A lot of information is encoded in literal form.
– 1608 datatype properties (literal based) vs. 1103 object properties (entity based) in DBpedia (2016-04)
o Many literals can be easily typed for proper interpretation and use.
– Example: in DBpedia, http://dbpedia.org/property/location has ~100,000 unique literals that can be directly mapped to entities.
o Added semantics is useful in practical applications such as
summarization, property alignment, data integration, and
dataset profiling.
27
Typing literals (enriching) in knowledge graphs
28. o FACES can only handle object property based features.
o Our contributions*:
1. Compute types for the values of datatype property based
features (data enrichment) - novel contribution.
2. Adapt and improve ranking algorithms (summarization).
28
Enrichment for entity summarization
*[ESWC16] Kalpa Gunaratna, Krishnaprasad Thirunarayan, Amit Sheth, and Gong Cheng. 'Gleaning Types
for Literals in RDF Triples with Application to Entity Summarization'. In Proc. 13th Extended Semantic
Web Conference (ESWC 2016), 2016, pages 85-100.
[Figure: dbr:Barack_Obama has rdf:type Person, but the literal value "Michelle Obama" only has the type String, so FACES partitioning cannot exploit it.]
29. o Focus of the literal is not clear unlike URIs.
o May contain several entities or labels matching ontology
classes.
29
Challenges
[Example literal: "44th President of the United States", with several candidate spans (options 1–3) that could determine its type.]
30. 30
Enrichment outcomes
[Figure: enrichment examples. dbr:Barack_Obama has dbp:vicePresident dbr:Joe_Biden, rdf:type dbo:Politician, and dbp:shortDescription "44th President of the United States"^^xsd:string; dbr:Calvin_Coolidge has dbo:orderInOffice "48th Governor of Massachusetts"^^xsd:string. The computed types for the two literals are dbo:President and dbo:Governor, respectively.]
32. 32
Type computation algorithm flow
[Flowchart, for non-numeric literals: extract n-grams and find the focus term; if the focus term matches an ontology class → TYPE; else if an n-gram containing the focus term matches an ontology class → TYPE; else if an n-gram containing the focus term matches an entity label, take that entity's type → TYPE; otherwise compute the similarity between the focus term and all ontology classes and take the class with the maximum similarity → TYPE.]
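A sketch of that cascade in Python; the matching helpers (head-word/focus-term detection, class label lookup, entity label lookup, and the similarity call) are placeholders standing in for the actual resources used, and the n-gram bound of 3 follows the note later in this deck.

```python
# Sketch of the type computation cascade for a non-numeric literal.
# find_focus_term, class_for_label, entity_for_label, entity_types, and similarity
# are placeholder helpers (head-word detection, ontology/entity lookups, similarity service).
def ngrams(tokens, max_n=3):
    """All n-grams up to length max_n, longest first (maximal match preferred)."""
    return [" ".join(tokens[i:i + n])
            for n in range(max_n, 0, -1)
            for i in range(len(tokens) - n + 1)]

def compute_type(literal, ontology_classes):
    tokens = literal.split()
    focus = find_focus_term(tokens)                 # head word of the phrase

    # 1. Focus term directly matches an ontology class.
    cls = class_for_label(focus)
    if cls:
        return cls

    # 2. An n-gram containing the focus term matches an ontology class.
    for gram in ngrams(tokens):
        if focus in gram:
            cls = class_for_label(gram)
            if cls:
                return cls

    # 3. An n-gram containing the focus term matches an entity label: use its types.
    for gram in ngrams(tokens):
        if focus in gram:
            entity = entity_for_label(gram)
            if entity:
                return entity_types(entity)

    # 4. Fall back to the ontology class most similar to the focus term.
    return max(ontology_classes, key=lambda c: similarity(focus, c))
```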
33. o Type Set TS(v) is the generated set of types for the value v.
33
Evaluation – type generation metrics
n is the total number of features.
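The metric formulas appear as images in the original slides. One plausible reading, stated here as an assumption rather than a quote from the dissertation, with correct(TS(v)) denoting the manually confirmed types in TS(v), is:

```latex
\mathrm{MP} = \frac{1}{n}\sum_{v}\frac{|\mathrm{correct}(TS(v))|}{|TS(v)|},\qquad
\mathrm{AMP} = \frac{1}{n}\sum_{v}\mathbb{1}\big[\mathrm{correct}(TS(v)) \neq \emptyset\big],\qquad
\mathrm{Coverage} = \frac{|\{\,v : TS(v) \neq \emptyset\,\}|}{n}
```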
34. o DBpedia Spotlight is taken as the baseline and there were 1117
unique property-value pairs (features).
o 118 pairs (consisting of labelling properties and noisy features)
were removed.
34
Evaluation
System | Mean Precision (MP) | Any Mean Precision (AMP) | Coverage
Our approach | 0.8290 | 0.8829 | 0.8529
Baseline | 0.4867 | 0.5825 | 0.5533
36. o Ranking equations in the FACES approach do not work.
– Two literals can be unique even if their types and the main
entities are the same.
• Example: “United States President” vs. “President of the United States”. It is not desirable to search using the whole phrase (syntactically different but semantically the same).
– A literal can have several entities. Which one to choose?
36
Ranking datatype property features
37. o Humans recognize popular entities.
o Entities can be mentioned in literals with variations.
o Proposal: Use the popular entities in literals and not the
literals themselves for ranking.
o Functions
– Function ES(v) returns all entities present in the value v.
– Function max(ES(v)) returns the most popular entity in ES(v).
37
Intuitions for ranking
v = “44th President of the United States”
ES(v) = {db:President, db:United States}
max(ES(v)) = db:United States
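Based on the speaker notes at the end of this deck (Inf(f)' counts entities whose matching feature value contains the literal's most popular entity; Po(v)' counts such triples), the adapted scores can be sketched as follows; the exact notation is a reconstruction, not copied from the slides:

```latex
\mathrm{Inf}'(f) = \log\frac{N}{\big|\{\,e : \exists f' \in FS(e),\ \mathrm{prop}(f') = \mathrm{prop}(f),\ \max(ES(v)) \in ES(\mathrm{val}(f'))\,\}\big|},\qquad
\mathrm{Po}'(v) = \log\big|\{\,\text{triples whose value } v' \text{ satisfies } \max(ES(v)) \in ES(v')\,\}\big|
```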
39. o Aggregate feature ranking scores for each facet.
o Rank facets based on the aggregated scores.
39
Facet ranking
Rank(f) is the original function and Rank(f)’ is the modified one for datatype property based features.
40. 40
FACES-E entity summary generation
(1) Extract features for entity e → (2) Semantic expansion + type computation for each feature → (3) Cobweb partitioning into facets → (4) Feature and facet ranking (tf-idf based) → (5) Pick top-ranked features from the top-ranked facets.
41. o The gold standard consists of 20 random entities used in FACES
taken from DBpedia 3.9 and 60 random entities taken from
DBpedia 2015-04.
o 17 human users created ideal summaries (total of 900).
41
Evaluation – FACES-E summary generation
System | Avg. Quality (k=5) | % Gain (k=5) | Avg. Quality (k=10) | % Gain (k=10)
FACES-E | 1.5308 | -- | 4.5320 | --
RELIN | 0.9611 | 59 % | 3.0988 | 46 %
RELINM | 1.0251 | 49 % | 3.6514 | 24 %
Avg. Agreement | 2.1168 | | 5.4363 |
42. 1. Knowledge on the Web and concise presentation
2. Diversity-aware entity summarization
- Using hierarchical conceptual grouping.
3. Enriching knowledge graphs and entity summarization
- Add type semantics to literals and adapt them in summarization.
4. Relatedness-based multi-entity summarization
- Using quadratic multidimensional optimization techniques.
42
Talk overview
43. 43
Single vs. Multiple entity summarization
[Figure: in single entity summarization, each entity (e.g., Apple Computer, Steve Jobs) is summarized independently by maximizing importance and diversity; in multi-entity summarization, importance and diversity are still maximized within each summary, while relatedness between the summaries is additionally improved.]
44. 44
Motivating example
Within one month of the iPod nano and iTunes phone special event, Apple Computer
announced today another special event to be held on October 12. It is to be held at
the California Theater in downtown San Jose, California. The invitation reads, “One
more thing …”, the teasing tagline of Steve Jobs.
founders Steve_Jobs
product IPod
locationCity California
industry Consumer_electronics
after Tim_Cook
knownFor Microcomputer_revolution
title Apple_Inc.
birthPlace California
47. 47
Formalizing Quadratic Multidimensional Knapsack Problem (QMKP)
Variable x denotes whether the feature is selected or not
We want to maximize the profit considering each knapsack size
[Figure: Entities 1–4 each have a list of candidate features, and a fixed-size summary (knapsack) is filled for each entity.]
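A standard way to write the quadratic multidimensional knapsack used here; this formulation is a textbook QMKP shape instantiated with the deck's ingredients (the weights α and β are assumed symbols): binary variables x_i select features, linear profits carry importance, pairwise profits are negative for related features of the same entity (diversity) and positive for related features of different entities (cross-entity relatedness), and each entity's summary has its own capacity.

```latex
\max_{x \in \{0,1\}^n}\; \sum_{i} p_i\, x_i + \sum_{i<j} p_{ij}\, x_i x_j
\qquad \text{s.t.}\quad \sum_{i \in FS(e_k)} x_i \le c_k \ \ \text{for each entity } e_k,
\qquad
p_{ij} =
\begin{cases}
-\alpha \cdot \mathrm{rel}(f_i, f_j) & f_i, f_j \text{ from the same entity}\\
+\beta \cdot \mathrm{rel}(f_i, f_j) & f_i, f_j \text{ from different entities}
\end{cases}
```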
48. o We measure the importance of features using the
informativeness and popularity measure used in FACES.
o Within each entity summary, features should have higher
importance.
– Hence, we use a positive weight
48
1. Importance of features
49. o Features consist of properties and values.
o For properties, we use the expansion method used in FACES
and calculate the Jaccard similarity for properties.
o For values, we measure their relatedness using graph-based co-occurrence; we use the RDF2Vec model.
o We combine the two measures and get the relatedness
between two features.
49
How to measure relatedness of features
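A sketch of the combined relatedness measure: Jaccard overlap of the expanded property WordSets plus cosine similarity of RDF2Vec vectors for the values, averaged with equal weight purely for illustration (the actual combination weights are not given on the slide); rdf2vec_vectors is a placeholder for a trained RDF2Vec embedding table.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two WordSets (expanded property labels)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cosine(u, v):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def feature_relatedness(f1, f2, property_wordsets, rdf2vec_vectors):
    """Relatedness of two (property, value) features = property + value similarity."""
    prop_sim = jaccard(property_wordsets[f1[0]], property_wordsets[f2[0]])
    value_sim = cosine(rdf2vec_vectors[f1[1]], rdf2vec_vectors[f2[1]])
    return 0.5 * (prop_sim + value_sim)  # equal weighting is an illustrative choice
```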
50. o Each entity summary should have diverse features.
– (i) Penalize relatedness score with a negative weight (i.e.,
maximize diversity).
– (ii) Modify candidate feature selection to improve diversity.
50
2. Diversity of features within summaries
51. o Maximize profit for related features between summaries.
o Use a positive weight.
51
3. Relatedness of features between summaries
52. o GRASP – Greedy Randomized Adaptive Search Procedure.
o GRASP provides an approximate solution to QKP.
– We simply adapt it to multiple constraints to suit QMKP.
o We use a memory-based GRASP implementation version*.
– Construction phase
• Random selection of features (also using a greedy ranking function)
– Local search phase
• Tries to improve the solution by replacing selected features
– Update the best solution
o To improve intra-entity diversity of features, we modified Restricted
Candidate List (RCL) of GRASP.
– We use a threshold to filter related features of the same entity.
52
GRASP – for combinatorial optimization
* Yang, Zhen, Guoqing Wang, and Feng Chu. "An effective GRASP and tabu search for the 0–1 quadratic knapsack problem."
Computers & Operations Research 40, no. 5 (2013): 1176-1185.
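A compact sketch of the GRASP loop described above; the profit, relatedness, and capacity callbacks are simplified placeholders (not the dissertation's actual implementation), and the RCL filtering of same-entity related features follows the last bullet.

```python
import random

def objective(selected, profit):
    """Total profit of a solution (sum of each feature's marginal profit; simplified)."""
    return sum(profit(f, selected - {f}) for f in selected)

def grasp(features, entity_of, capacity, profit, related,
          rcl_size=5, rel_threshold=0.5, iterations=50):
    """Sketch of GRASP for multi-entity summary selection (one knapsack per entity)."""
    best, best_value = set(), float("-inf")

    for _ in range(iterations):
        # Construction phase: repeatedly pick a random feature from the RCL.
        selected, counts = set(), {}
        while True:
            candidates = [
                f for f in features
                if f not in selected
                and counts.get(entity_of(f), 0) < capacity(entity_of(f))
                # Modified RCL: drop features too related to an already selected
                # feature of the same entity (improves intra-entity diversity).
                and all(related(f, g) < rel_threshold
                        for g in selected if entity_of(g) == entity_of(f))
            ]
            if not candidates:
                break
            rcl = sorted(candidates, key=lambda f: profit(f, selected),
                         reverse=True)[:rcl_size]
            pick = random.choice(rcl)
            selected.add(pick)
            counts[entity_of(pick)] = counts.get(entity_of(pick), 0) + 1

        # Local search phase: try to improve by swapping same-entity features.
        improved = True
        while improved:
            improved = False
            for f in list(selected):
                for g in features:
                    if g in selected or f not in selected or entity_of(g) != entity_of(f):
                        continue
                    trial = (selected - {f}) | {g}
                    if objective(trial, profit) > objective(selected, profit):
                        selected, improved = trial, True

        # Update the best solution found so far.
        value = objective(selected, profit)
        if value > best_value:
            best, best_value = selected, value

    return best
```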
53. o 15 judges, 2 datasets, 30 news items, and 850 question
instances.
o Qualitative evaluation.
o Quantitative evaluation
53
Evaluation
54. o Faceted entity summarization.
– Conceptual (abstract) grouping of features/triples.
– tf-idf based ranking.
o Type computation for literals to enrich knowledge graphs.
– Improve coverage for faceted entity summarization.
o Relatedness-based multi-entity summarization.
54
Conclusion
Entity related structured data on the Web can be concisely and comprehensively
summarized for efficient and convenient information presentation. This can be achieved
through synergistic use of:
(i) Unsupervised knowledge-based methods to conceptually group,
(ii) Information Retrieval-based techniques to intuitively rank,
(iii) Natural Language Processing techniques to semantically enrich structured data, and
(iv) Combinatorial optimization techniques to handle relatedness of multiple entities.
Thesis Statement
56. o Conference Papers
[WWW 2017] Hamid R. Motahari Nezhad, Kalpa Gunaratna, and Juan Cappi. “eAssistant: Cognitive Assistance for
Identification and Auto-Triage of Actionable Conversations.” Proceedings of the 26th International Conference on World
Wide Web Companion. International World Wide Web Conferences Steering Committee, 2017.
[ESWC 2016] Kalpa Gunaratna, Krishnaprasad Thirunarayan, Amit Sheth, and Gong Cheng. “Gleaning Types for Literals in
RDF Triples with Application to Entity Summarization”. In Proc. 13th Extended Semantic Web Conference (ESWC 2016),
2016, pages 85-100. DOI=10.1007/978-3-319-34129-3_6
[AAAI 2015] Kalpa Gunaratna, Krishnaprasad Thirunarayan, and Amit Sheth. “FACES: Diversity-Aware Entity Summarization
using Incremental Hierarchical Conceptual Clustering”. 29th AAAI Conference on Artificial Intelligence (AAAI 2015), 2015.
[Semantics 2013] Kalpa Gunaratna, Krishnaprasad Thirunarayan, Prateek Jain, Amit Sheth and Sanjaya Wijeratne. “A
Statistical and Schema Independent Approach for Identifying Equivalent Properties on Linked Data.” In Proc. 9th
International Conference on Semantic Systems, ACM, 2013, pages 33-40. DOI=10.1145/2506182.2506187.
o Articles
[W 2014] Kalpa Gunaratna, Sarasi Lalithsena and Amit Sheth. “Alignment and Dataset Identification of Linked Data in
Semantic Web.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2014).
o Patents
[P 2015] Kalpa Gunaratna, Hamid Motahari. Adaptive Learning of Actionable Statements in Natural Language Conversation.
US patent filed, January 2016 (pending).
o Edited proceedings
[SumPre 2016] Andreas Thalhammer, Gong Cheng, Kalpa Gunaratna. Proceedings of the 2nd International Workshop on
Summarizing and Presenting Entities and Ontologies (SumPre 2016) co-located with the 13th Extended Semantic Web
Conference (ESWC 2016), Greece, May 30, 2016. CEUR Workshop Proceedings 1605, CEUR-WS.org 2016.
[SumPre 2015] Gong Cheng, Kalpa Gunaratna, Andreas Thalhammer, Heiko Paulheim, Martin Voigt, Roberto García. Joint
Proceedings of the 1st International Workshop on Summarizing and Presenting Entities and Ontologies and the 3rd
International Workshop on Human Semantic Web Interfaces (SumPre 2015, HSWI 2015) co-located with the 12th Extended
Semantic Web Conference (ESWC 2015), Portoroz, Slovenia, June 1, 2015. CEUR Workshop Proceedings 1556, CEUR-WS.org
2016.
56
Selected publications
57. o Top-tier conference publications (AAAI-2015, ESWC-2016, and
WWW-2017).
o Research internships at well-known places (INSIGHT-Ireland,
NLM-USA, IBM-USA).
o Co-chairing and organizing workshops at international
conferences (SumPre2015 and SumPre2016 at ESWC).
o PC member (e.g., ISWC15 and ESWC16) and W3C working
group member (LDP14).
o Competition winner (IBM Blockchain Hackathon runner-up,
National Best Quality Software award finalist).
o Travel and professional development grants (AAAI, WS-GSA).
o US patent application (filed with IBM).
57
Selected accomplishments
58. 58
Acknowledgements
Prof. Amit Sheth
(Advisor)
Prof. Krishnaprasad Thirunarayan
(Advisor)
Dr. Edward Curry
NUIG, Ireland
Prof. Gong Cheng
Nanjing University, China
Dr. Hamid R. Motahari Nezhad
IBM Research, USA
Prof. Keke Chen
59. 59
Thank You
Dr. Olivier Bodenreider
NLM, USA
Dr. Ajith Ranabahu
Amazon, USA
Dr. Gamini Palihawadana
My Family for always
encouraging me …
My colleagues
at Kno.e.sis …
First triple is at the schema level and the second at the data level.
1146 datasets as of January 26 2017
Facebook social graph
IBM Watson knowledge graphs (health)
Amazon product graph
Datasets contain mere data without much processing or semantic enhancement, whereas knowledge graphs provide more semantics and knowledge.
Entity – A real world thing (e.g., person, book, place) at the data level that encapsulates facts and is represented by a URI.
Knowledge graph – A knowledge graph is a collection of facts and rules that can also provide semantics.
Rich Media Reference, created using “WorldModel” (an ontology) and “Knowledgebase” (data extracted from text).
Grows over time: DBpedia (3.9) has around 200 triples on average per entity.
DBpedia 3.9: 4 million entities; 2014: 4.5 million entities; 2015: 4.6 million entities; 2017: 6 million entities, 1.3 billion facts.
The number of facts continues to grow; hence the need for concise presentation.
Grouping is challenging because the number of groups for each entity is unknown.
Conceptually similar features are colored in the same color.
Conceptual – uses probability based grouping.
Incremental – special operators make it insensitive to order of items.
Hierarchical – groups items in a tree structure.
Cobweb uses probability in grouping facts and is hence called “conceptual.” Fisher mentions that this is similar to how humans group items – pick the most probable group.
Also, we need groups that agree at the concept level, not the lexical level.
Not all expansions are shown, for clarity.
Without WordSet and with WordSet
Values are popular. More identifiable to humans.
Property-value pairs are unique. Can distinctly identify the entity.
Want to get a balance of both.
When k > |F(e)|, the choice of facets from which to pick features becomes random rather than principled.
Extract features for the entity e.
Enrich each feature and get the WordSet WS(f).
Enriched feature set FS(e) is input to the partitioning algorithm and get facet set F(e).
Get the feature ranking scores (Rank(f)) for each facet.
Top ranked features from the facets are picked to form the faceted entity summary. The constraints defined in the definition for the faceted entity summary hold.
Dataset-dependent features such as owl:sameAs, wordnet_type, Wikipedia links, dcterms:subject, rdf:type, … were removed.
All three methods are automatic.
1600 vs 1079 in DBpedia 2015-04
Possible reasons a value is a literal rather than a URI: (i) the creator was unable to find a suitable entity URI for the object value, and hence chose to use a literal instead,
(ii) the creator of the triple did not want to attach more details to the value and hence represented it in plain text,
(iii) the value contains only basic implementation types like integer, boolean, and date, and hence not meaningful to create an entity, or
(iv) the value has a lengthy description spanning several sentences (e.g., dbo:abstract property in DBpedia) that covers a diverse set of entities and facts.
The literal can be long. In this work, we focus on literals that are one sentence long.
Head word detection – Collins’ head word detection algorithm.
Directly matches head word to class
Match n-grams containing the head word to a class label; otherwise, match n-grams and the head word to entity labels and then get those entities' types.
Semantic matcher of head word using UMBC matching service.
n is bound to 3 in our DBpedia experiment
For non-numeric strings, extract the n-grams.
Get the focus term for the phrase.
Check for a match between the focus term and an ontology class. If found, success.
Otherwise, analyze the n-grams that contain the focus term (maximal match):
Check for a match between these n-grams and an ontology class. If found, success.
Otherwise, check for a match between an n-gram and an entity label, then take the types of that entity. If found, success.
Finally, compute similarity scores between the focus term and all ontology classes, and take the ontology class with the highest similarity score.
Our finding is: “Typing needs to be handled carefully”
Recall is not measured because it is hard to do so (it would require checking a very large number of pairs).
Inf(f)’ – counts the number of entities having the feature; the property must match, but the value only has to contain the most popular entity of the input feature's value.
Po(v)’ – counts the number of triples whose matching feature value contains the most popular entity of the input value.
Extract features for the entity e.
Enrich each feature and get the WordSet WS(f).
Enriched feature set FS(e) is input to the partitioning algorithm and get facet set F(e).
First get the feature ranking scores (R(f)) and then compute the facet ranking scores for each facet (FacetRank(F(e)).
Top ranked features from top ranked facets in the order are picked to form the faceted entity summary. The constraints defined in the definition for the faceted entity summary hold.
We can add inter-entity summary relatedness.
This is not easy to achieve as we have to process multiple entities at the same time.
Motivating example to show what we want to achieve in a multi-entity summarization scenario (main focus is for relatedness between summaries).
Image icons from Google search (free to use)
We consider weights of the features to be uniform in this case (= 1)
A negatively weighted score reduces the profit if related features are selected, and hence avoids related things being selected together.
GRASP uses pairwise profit matrix
We filter from the RCL those features whose relatedness (via the max function) is above the threshold, ensuring less intra-entity relatedness.
UCI coherence uses word co-occurrence (w refers to words in the equation).
UMass coherence counts the number of documents containing both words (D refers to documents in the equation).
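For reference, the two coherence measures mentioned in these notes are usually written as follows (standard definitions, not reproduced from the slides), where p(·) are word (co-)occurrence probabilities and D(·) are document counts over the N top words:

```latex
C_{\mathrm{UCI}} = \frac{2}{N(N-1)} \sum_{i<j} \log\frac{p(w_i, w_j) + \epsilon}{p(w_i)\,p(w_j)},\qquad
C_{\mathrm{UMass}} = \frac{2}{N(N-1)} \sum_{i<j} \log\frac{D(w_i, w_j) + 1}{D(w_i)}
```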