Kalpa Gunaratna's Ph.D. dissertation defense: April 19, 2017
The processing of structured and semi-structured content on the Web has been gaining attention with the rapid progress in the Linking Open Data project and the development of commercial knowledge graphs. Knowledge graphs capture domain-specific or encyclopedic knowledge in the form of a data layer and add rich and explicit semantics on top of the data layer to infer additional knowledge. The data layer of a knowledge graph represents entities and their descriptions. The semantic layer on top of the data layer is called the schema (ontology), where relationships of the entity descriptions, their classes, and the hierarchy of the relationships and classes are defined. Today, there exist large knowledge graphs in the research community (e.g., encyclopedic datasets like DBpedia and Yago) and corporate world (e.g., Google knowledge graph) that encapsulate a large amount of knowledge for human and machine consumption. Typically, they consist of millions of entities and billions of facts describing these entities. While it is good to have this much knowledge available on the Web for consumption, it leads to information overload, and hence proper summarization (and presentation) techniques need to be explored.
In this dissertation, we focus on creating both comprehensive and concise entity summaries at: (i) the single-entity level and (ii) the multiple-entity level. To summarize a single entity, we propose a novel approach called FACeted Entity Summarization (FACES) that considers both the importance of facts, computed by combining popularity and uniqueness, and the diversity of the facts selected for the summary. We first conceptually group facts using semantic expansion and hierarchical incremental clustering techniques, forming facets (i.e., groupings) that go beyond syntactic similarity. Then we rank both the facts and the facets using Information Retrieval (IR) ranking techniques and pick the highest-ranked facts from these facets for the summary. The unique contribution of this approach is that its facets inject diversity into entity summaries, making them comprehensive. For creating multiple entity summaries, we process the facts of the given entities simultaneously using combinatorial optimization techniques. In this process, we maximize the diversity and importance of facts within each entity summary and the relatedness of facts between the entity summaries. The proposed approach uniquely combines semantic expansion, graph-based relatedness, and combinatorial optimization techniques to generate relatedness-based multi-entity summaries.
Complementing the entity summarization approaches, we introduce a novel approach using light Natural Language Processing (NLP) techniques to enrich knowledge graphs by adding type semantics to literals.
Semantics-based Summarization of Entities in Knowledge Graphs
1. Semantics-based Summarization of
Entities in Knowledge Graphs
Kalpa Gunaratna
Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis)
Wright State University
04.19.2017
Advisors: Prof. Amit Sheth and Prof. Krishnaprasad Thirunarayan
Ph.D. Committee: Prof. Keke Chen (Kno.e.sis), Prof. Gong Cheng (Nanjing University, China),
Dr. Edward Curry (NUIG, Ireland), Dr. Hamid R. Motahari-Nezhad (IBM Research, USA)
PhD Dissertation Defense
2. 1. Knowledge on the Web and concise presentation
2. Diversity-aware entity summarization
- Using hierarchical conceptual grouping.
3. Enriching knowledge graphs and entity summarization
- Add type semantics to literals and adapt them in summarization.
4. Relatedness-based multi-entity summarization
- Using quadratic multidimensional optimization techniques.
2
Talk overview
3. 3
What are triples?
dbr:Marie_Curie dbo:Person
rdf:type
RDF/Turtle syntax of the triple
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
dbr:Marie_Curie rdf:type dbo:Person .
dbr:Marie_Curie dbo:spouse dbr:Pierre_Curie .
Subject Predicate Object
dbr:Marie_Curie dbr:Pierre_Curie
dbo:spouse
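As a concrete illustration (not part of the original slides), the two triples above can be loaded and queried with the rdflib Python library; the Turtle snippet is exactly the one shown on this slide.

```python
# Minimal sketch: parsing the slide's Turtle snippet with rdflib (pip install rdflib).
from rdflib import Graph, URIRef

turtle_data = """
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

dbr:Marie_Curie rdf:type dbo:Person .
dbr:Marie_Curie dbo:spouse dbr:Pierre_Curie .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# Each triple is a (subject, predicate, object) tuple.
for s, p, o in g:
    print(s, p, o)

# Look up the spouse of Marie Curie.
spouse = g.value(URIRef("http://dbpedia.org/resource/Marie_Curie"),
                 URIRef("http://dbpedia.org/ontology/spouse"))
print(spouse)  # -> http://dbpedia.org/resource/Pierre_Curie
```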
4. 4
Open Datasets – Linked Open Data (LOD) cloud
2007, 2010, 2014, 2017
Image credits: http://lod-cloud.net/
6. 6
Entity summarization
spouse: Pierre_Curie
birthPlace: Warsaw
almaMater: ESPCI_ParisTech
workInstitutions: University_of_Paris
knownFor: Radioactivity
……
Entity Description
Knowledge Graph
property
value
Summary for
• Quick understanding
• Performing specific tasks
Millions of entities and billions of facts
Triple dbr:Marie_Curie dbp:spouse dbr:Pierre_Curie
7. 7
Entities and summaries in early 2000s
Rich Media Reference Page
Baltimore 31, Pit 24
http://www.nfl.com
Quandry Ismail and Tony Banks hook up for their third long
touchdown, this time on a 76-yarder to extend the Ravens’
lead to 31-24 in the third quarter.
Professional
Ravens, Steelers
Bal 31, Pit 24
Quandry Ismail, Tony Banks
Touchdown
NFL.com
2/02/2000
League:
Teams:
Score:
Players:
Event:
Produced by:
Posted date:
Patent: Sheth, Amit, David Avant, and Clemens Bertram. "System and method for creating
a semantic web and its applications in browsing, searching, profiling, personalization and advertising.
" U.S. Patent 6,311,194, issued October 30, 2001.
Publication: Sheth, Amit, Clemens Bertram, David Avant, Brian Hammond, Krys Kochut, and
Yashodhan Warke. "Managing semantic content for the Web." IEEE Internet Computing 6, no. 4 (2002): 80-87.
Taalee’s Rich Media Reference.
Talk: Semantic Web and Information Brokering: Opportunities, Early Commercialization, and Challenges. Keynote at Workshop on Semantic Web: Models, Architectures and
Management. Lisbon, Portugal, Sept. 21, 2000.
8. 8
Entities and summaries – Today’s Web Search
Google Knowledge Graph
(GKG) facilitates Google
Search.
Summarization is one of
their top priorities*.
* Singhal, A. 2012. Introducing the knowledge graph: things, not strings. Official Google Blog, May.
10. Entity related structured data on the Web can be concisely and
comprehensively summarized for efficient and convenient information
presentation. This can be achieved through synergistic use of:
(i) Unsupervised knowledge-based methods to conceptually group,
(ii) Information Retrieval-based techniques to intuitively rank,
(iii) Natural Language Processing techniques to semantically enrich
structured data, and
(iv) Combinatorial optimization techniques to handle relatedness of
multiple entities.
10
Thesis statement
11. 1. Knowledge on the Web and concise presentation
2. Diversity-aware entity summarization
- Using hierarchical conceptual grouping.
3. Enriching knowledge graphs and entity summarization
- Add type semantics to literals and adapt them in summarization.
4. Relatedness-based multi-entity summarization
- Using quadratic multidimensional optimization techniques.
11
Talk overview
12. 12
FACeted Entity Summaries – FACES
Existing approaches focus
on ranking, causing
redundancy in fixed-length
summary
FACES* approach
Ranking
Grouping
FACES produces diversified summaries
from: ranking + grouping
*[AAAI15] Kalpa Gunaratna, Krishnaparasad Thirunarayan, and Amit Sheth. "FACES: Diversity-Aware Entity
Summarization Using Incremental Hierarchical Conceptual Clustering." Twenty-Ninth AAAI Conference on
Artificial Intelligence. 2015.
adds comprehensiveness (through diversity)
knownFor : Radioactivity
field : Chemistry
workInstitutions : University of Paris
spouse : Pierre Curie
rank
13. 13
[Figure: entity description graph for Marie Curie, with values Pierre Curie, Warsaw, Passy,_Haute-Savoie, ESPCI_ParisTech, University_of_Paris, Radioactivity, and Chemistry linked through properties such as birthPlace, field, and knownFor.]
Concise and comprehensive summary
could be: {f1, f2, f6}
Non-faceted summary: {f4, f7, f5}
Entity - Marie Curie
Feature Set FS:

Facets | Features | Property | Value
F1 | f1 | spouse | Pierre_Curie
F2 | f2 | birthPlace | Warsaw
   | f3 | deathPlace | Passy,_Haute-Savoie
F3 | f4 | almaMater | ESPCI_ParisTech
   | f5 | workInstitutions | University_of_Paris
   | f6 | knownFor | Radioactivity
   | f7 | field | Chemistry
[Figure: the same entity graph shown again, with the feature values numbered by conceptual group (conceptually similar features share a number/color).]
14. o Number of groups in a feature set is unknown a priori.
– Hence, supervised techniques do not work.
o We want to identify conceptually similar groups.
– E.g., field: Chemistry and almaMater: ESPCI_ParisTech should be in the same group.
o We adapt Cobweb* to get facets, which is:
– Conceptual
– Incremental
– Hierarchical
14
Grouping (clustering) in FACES
* Fisher, D. H. 1987. Knowledge acquisition via incremental conceptual clustering. Machine
learning 2, 2, 139-172.
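For reference, Cobweb places each new item so as to maximize category utility; a standard formulation for nominal attribute-value pairs, following Fisher (1987) rather than a formula reproduced from the slides, is:

```latex
CU(\{C_1,\dots,C_K\}) \;=\; \frac{1}{K}\sum_{k=1}^{K} P(C_k)
\left[\sum_{i}\sum_{j} P(A_i = V_{ij}\mid C_k)^2 \;-\; \sum_{i}\sum_{j} P(A_i = V_{ij})^2\right]
```

Since FACES clusters the expanded feature set, features whose WordSets share many words increase the conditional term and therefore tend to land in the same facet.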
15. o Group similar themed features (i.e., conceptually similar).
o Each feature has only two attribute-value pairs.
– For example, birthPlace-Honolulu. The pairs are (property, birthPlace) and (value, Honolulu).
o Expand property and value of each feature.
o Use the expanded feature set for clustering.
15
How to use Cobweb in FACES
16. o Get the property label.
o Pre-process
– Remove stop words
– CamelCase, spaces, punctuation processing
– Tokenize
o Get hypernyms for the tokens using a lexical database (e.g.,
WordNet).
o Add hypernyms to the original set of tokens + label and create
the WordSet WS.
16
Property expansion
Example expansion: property birthPlace → WordSet {birthPlace, birth, place, beginning, point, area, locality}
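A minimal sketch of the property expansion step using NLTK's WordNet interface; the helper names, the stop-word list, and the choice of taking only direct hypernyms are illustrative assumptions, not the exact implementation.

```python
# Sketch of property expansion (assumes: pip install nltk; nltk.download('wordnet')).
import re
from nltk.corpus import wordnet as wn

STOP_WORDS = {"of", "the", "a", "an", "in", "for"}  # illustrative stop-word list

def tokenize_label(label):
    """Split a property label such as 'birthPlace' on camelCase, spaces, and punctuation."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", label)
    tokens = re.split(r"[\s_\-.,]+", spaced)
    return [t.lower() for t in tokens if t and t.lower() not in STOP_WORDS]

def expand_property(label):
    """Return a WordSet: the label, its tokens, and WordNet hypernyms of the tokens."""
    tokens = tokenize_label(label)
    word_set = {label} | set(tokens)
    for token in tokens:
        for synset in wn.synsets(token):
            for hypernym in synset.hypernyms():
                word_set.update(l.replace("_", " ") for l in hypernym.lemma_names())
    return word_set

print(expand_property("birthPlace"))
# Expected to include terms such as 'birth', 'place', 'point', 'area', 'locality', ...
```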
17. o Get the object URI.
o Get the types (ontology classes) for the URI.
o Pre-process types
– Remove stop words
– CamelCase, spaces, punctuation processing
– Tokenize
o Get hypernyms for types.
o Add hypernyms to the original set of type labels + tokens to
create WordSet WS.
17
Value expansion
Example expansion: value Honolulu → WordSet {place, PopulatedPlace, populated, point, area, locality}
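A corresponding sketch for value expansion: fetch the value URI's ontology classes from a SPARQL endpoint and expand their labels, reusing expand_property from the previous sketch. The endpoint URL and query shape are assumptions for illustration.

```python
# Sketch of value expansion (assumes: pip install SPARQLWrapper; expand_property from above).
from SPARQLWrapper import SPARQLWrapper, JSON

def get_types(value_uri, endpoint="http://dbpedia.org/sparql"):
    """Return DBpedia ontology classes (rdf:type) of the given resource URI."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        SELECT ?cls WHERE {{
            <{value_uri}> a ?cls .
            FILTER(STRSTARTS(STR(?cls), "http://dbpedia.org/ontology/"))
        }}""")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["cls"]["value"].rsplit("/", 1)[-1] for b in results["results"]["bindings"]]

def expand_value(value_uri):
    """WordSet for a value: its class labels, their tokens, and hypernyms of the tokens."""
    word_set = set()
    for cls in get_types(value_uri):      # e.g. 'PopulatedPlace', 'Place'
        word_set |= expand_property(cls)  # tokenize + hypernym expansion
    return word_set

print(expand_value("http://dbpedia.org/resource/Honolulu"))
```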
18. 18
WordSet examples
Feature (f) | Property expansion | Value expansion | WordSet (WS)
region:Illinois | {region, location, domain} | {place, PopulatedPlace, populated, point, area, locality} | {region, location, domain, PopulatedPlace, populated, place, point, area, locality}
birthPlace:Honolulu | {birthPlace, birth, place, beginning, point, area, locality} | {place, PopulatedPlace, populated, point, area, locality} | {birthPlace, birth, place, beginning, point, area, locality, PopulatedPlace, populated}
vicePresident:Joe_Biden | {vicePresident, vice, president, corporate executive, head of state} | {person, OfficeHolder, office, holder, organism, flesh, human body, occupation, job, staff, possessor, owner} | {vicePresident, vice, president, corporate executive, head of state, person, OfficeHolder, office, holder, organism, flesh, human body, occupation, job, staff, possessor, owner}
predecessor:George_W._Bush | {predecessor, forerunner, precursor} | {person, officeholder, office, holder, organism, flesh, human body, occupation, job, staff, possessor, owner} | {predecessor, forerunner, precursor, person, officeholder, office, holder, organism, flesh, human body, occupation, job, staff, possessor, owner}
(In the original slide, the pre-expansion token sets are highlighted in orange.)
19. 19
WordSet – How it helps for better grouping
[Figure: without WordSets, region:Illinois, vicePresident:Joe_Biden, and birthPlace:Honolulu share no terms and each falls into its own group; with WordSets, region:Illinois {region, location, domain, PopulatedPlace, place, point, area, locality} and birthPlace:Honolulu {birthPlace, birth, place, PopulatedPlace, point, area, locality, beginning} overlap heavily and are grouped together, while vicePresident:Joe_Biden {vicePresident, vice, president, corporate executive, head of state, person, OfficeHolder, human body, occupation, job, staff} remains separate.]
21. o Influenced by tf-idf.
o Inf(f): Informativeness of the feature (Uniqueness).
– Example: residence-WhiteHouse
o Po(v): Popularity of the value of the feature (frequency).
– Example: WhiteHouse
o Rank(f): rewards features that are both informative and have popular values.
21
Ranking features
N is the total number of entities
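The ranking formulas appear as images in the original deck; a reconstruction consistent with the FACES paper, where FS(e) is the feature set of entity e and v is the value of feature f, is roughly:

```latex
\mathrm{Inf}(f) = \log\frac{N}{\left|\{\,e : f \in FS(e)\,\}\right|},\qquad
\mathrm{Po}(v) = \log\left|\{\,\text{triples with object value } v\,\}\right|,\qquad
\mathrm{Rank}(f) = \mathrm{Inf}(f)\cdot\mathrm{Po}(v)
```

That is, a feature is informative when few of the N entities have it (idf-like), and its value is popular when many triples mention it (tf-like); the product balances the two, as in the residence-WhiteHouse example.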
22. 22
Faceted entity summary creation process
(1) Extract features for entity e → (2) Semantic expansion of each feature → (3) Cobweb partitioning into facets → (4) tf-idf based ranking of features → (5) Pick top-ranked features across facets.
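Putting the five steps together, a high-level sketch of faceted summary generation; the helper functions are illustrative placeholders for the components described on the other slides, not the dissertation's actual code.

```python
# High-level sketch of the FACES pipeline; the expansion, grouping, and ranking
# functions are passed in because their details are covered on other slides.
def faces_summary(features, expand, partition, rank, k):
    """Return a faceted summary of at most k features.

    features : list of (property, value) pairs for one entity
    expand   : feature -> WordSet (semantic expansion)
    partition: (features, word_sets) -> list of facets (Cobweb)
    rank     : feature -> score (informativeness * popularity)
    """
    word_sets = {f: expand(f) for f in features}   # (2) expansion
    facets = partition(features, word_sets)        # (3) grouping
    scores = {f: rank(f) for f in features}        # (4) ranking

    # (5) Pick the best remaining feature from each facet in turn -> diversity.
    summary, added = [], True
    while len(summary) < k and added:
        added = False
        for facet in facets:
            remaining = [f for f in facet if f not in summary]
            if remaining and len(summary) < k:
                summary.append(max(remaining, key=scores.get))
                added = True
    return summary
```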
23. o Gold standard contains ideal summaries generated by 15
judges.
o An ideal summary for entity e is denoted by SummI. Then
agreement is the overlap between the judges' ideal summaries.
o Summary quality is the overlap between the computer
generated summary (Summ(e)) and the ideal summaries for
the entity.
23
Evaluation
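The overlap measures on this slide appear as formulas in the original slides; a reconstruction consistent with how this line of work (RELIN/FACES) defines them, with n judges producing ideal summaries SummI_i(e), is:

```latex
\text{Agreement}(e) = \frac{2}{n(n-1)} \sum_{i=1}^{n}\sum_{j=i+1}^{n}
\left| Summ^{I}_{i}(e) \cap Summ^{I}_{j}(e) \right|,
\qquad
\text{Quality}(Summ(e)) = \frac{1}{n} \sum_{i=1}^{n}
\left| Summ(e) \cap Summ^{I}_{i}(e) \right|
```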
24. 24
Evaluation cont.
• 50 entities in the gold standard. 69 users participated in
the user preference evaluation.
• On average, 44 features per entity.
Evaluation 1 – Gold standard:

System | Avg. Quality (k=5) | FACES % Gain (k=5) | Time/Entity | Avg. Quality (k=10) | FACES % Gain (k=10)
FACES | 1.4314 | NA | 0.76 sec | 4.3350 | NA
RELIN | 0.4981 | 187 % | 10.96 sec | 2.5188 | 72 %
RELINM | 0.6008 | 138 % | 11.08 sec | 3.0906 | 40 %
SUMMARUM | 1.2249 | 17 % | NA | 3.4207 | 27 %
Avg. Agreement | 1.9168 | | | 4.6415 |

Evaluation 2 – User preference:

System | Study 1 | Study 2
FACES | 84 % | 54 %
RELIN | NA | NA
RELINM | 16 % | 16 %
SUMMARUM | NA | 30 %
26. 1. Knowledge on the Web and concise presentation
2. Diversity-aware entity summarization
- Using hierarchical conceptual grouping.
3. Enriching knowledge graphs and entity summarization
- Add type semantics to literals and adapt them in summarization.
4. Relatedness-based multi-entity summarization
- Using quadratic multidimensional optimization techniques.
26
Talk overview
27. o A lot of information is encoded in literal form.
– 1608 datatype properties (literal based) vs. 1103 object properties (entity based) in DBpedia (2016-04)
o Many literals can be easily typed for proper interpretation and use.
– Example: in DBpedia, http://dbpedia.org/property/location has ~100,000 unique literals that can be directly mapped to entities.
o Added semantics is useful in practical applications such as
summarization, property alignment, data integration, and
dataset profiling.
27
Typing literals (enriching) in knowledge graphs
28. o FACES can only handle object property based features.
o Our contributions*:
1. Compute types for the values of datatype property based
features (data enrichment) - novel contribution.
2. Adapt and improve ranking algorithms (summarization).
28
Enrichment for entity summarization
*[ESWC16] Kalpa Gunaratna, Krishnaprasad Thirunarayan, Amit Sheth, and Gong Cheng. 'Gleaning Types
for Literals in RDF Triples with Application to Entity Summarization'. In Proc. 13th Extended Semantic
Web Conference (ESWC 2016), 2016, pages 85-100.
[Figure: dbr:Barack_Obama has rdf:type Person, but the literal value "Michelle Obama" only has the type String, so FACES partitioning cannot exploit it.]
29. o Focus of the literal is not clear unlike URIs.
o May contain several entities or labels matching ontology
classes.
29
Challenges
[Example literal: "44th President of the United States", with several candidate spans (options 1–3) that could determine its type.]
30. 30
Enrichment outcomes
[Figure: enrichment examples. dbr:Barack_Obama has dbp:vicePresident dbr:Joe_Biden, rdf:type dbo:Politician, and dbp:shortDescription "44th President of the United States"^^xsd:string; dbr:Calvin_Coolidge has dbo:orderInOffice "48th Governor of Massachusetts"^^xsd:string. The computed types for the two literals are dbo:President and dbo:Governor, respectively.]
32. 32
Type computation algorithm flow
[Flowchart, for non-numeric literals: extract n-grams and find the focus term; if the focus term matches an ontology class → TYPE; else if an n-gram containing the focus term matches an ontology class → TYPE; else if an n-gram containing the focus term matches an entity label, take that entity's type → TYPE; otherwise compute the similarity between the focus term and all ontology classes and take the class with the maximum similarity → TYPE.]
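A sketch of that cascade in Python; the matching helpers (head-word/focus-term detection, class label lookup, entity label lookup, and the similarity call) are placeholders standing in for the actual resources used, and the n-gram bound of 3 follows the note later in this deck.

```python
# Sketch of the type computation cascade for a non-numeric literal.
# find_focus_term, class_for_label, entity_for_label, entity_types, and similarity
# are placeholder helpers (head-word detection, ontology/entity lookups, similarity service).
def ngrams(tokens, max_n=3):
    """All n-grams up to length max_n, longest first (maximal match preferred)."""
    return [" ".join(tokens[i:i + n])
            for n in range(max_n, 0, -1)
            for i in range(len(tokens) - n + 1)]

def compute_type(literal, ontology_classes):
    tokens = literal.split()
    focus = find_focus_term(tokens)                 # head word of the phrase

    # 1. Focus term directly matches an ontology class.
    cls = class_for_label(focus)
    if cls:
        return cls

    # 2. An n-gram containing the focus term matches an ontology class.
    for gram in ngrams(tokens):
        if focus in gram:
            cls = class_for_label(gram)
            if cls:
                return cls

    # 3. An n-gram containing the focus term matches an entity label: use its types.
    for gram in ngrams(tokens):
        if focus in gram:
            entity = entity_for_label(gram)
            if entity:
                return entity_types(entity)

    # 4. Fall back to the ontology class most similar to the focus term.
    return max(ontology_classes, key=lambda c: similarity(focus, c))
```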
33. o Type Set TS(v) is the generated set of types for the value v.
33
Evaluation – type generation metrics
n is the total number of features.
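The metric formulas appear as images in the original slides. One plausible reading, stated here as an assumption rather than a quote from the dissertation, with correct(TS(v)) denoting the manually confirmed types in TS(v), is:

```latex
\mathrm{MP} = \frac{1}{n}\sum_{v}\frac{|\mathrm{correct}(TS(v))|}{|TS(v)|},\qquad
\mathrm{AMP} = \frac{1}{n}\sum_{v}\mathbb{1}\big[\mathrm{correct}(TS(v)) \neq \emptyset\big],\qquad
\mathrm{Coverage} = \frac{|\{\,v : TS(v) \neq \emptyset\,\}|}{n}
```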
34. o DBpedia Spotlight is taken as the baseline and there were 1117
unique property-value pairs (features).
o 118 pairs (consisting of labelling properties and noisy features)
were removed.
34
Evaluation
System | Mean Precision (MP) | Any Mean Precision (AMP) | Coverage
Our approach | 0.8290 | 0.8829 | 0.8529
Baseline | 0.4867 | 0.5825 | 0.5533
36. o Ranking equations in the FACES approach do not work.
– Two literals can be unique even if their types and the main
entities are the same.
• Example: “United States President” vs. “President of the United States”. It is not desirable to search using the whole phrase (syntactically different but semantically the same).
– A literal can have several entities. Which one to choose?
36
Ranking datatype property features
37. o Humans recognize popular entities.
o Entities can be mentioned in literals with variations.
o Proposal: Use the popular entities in literals and not the
literals themselves for ranking.
o Functions
– Function ES(v) returns all entities present in the value v.
– Function max(ES(v)) returns the most popular entity in ES(v).
37
Intuitions for ranking
v = “44th President of the United States”
ES(v) = {db:President, db:United States}
max(ES(v)) = db:United States
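Based on the speaker notes at the end of this deck (Inf(f)' counts entities whose matching feature value contains the literal's most popular entity; Po(v)' counts such triples), the adapted scores can be sketched as follows; the exact notation is a reconstruction, not copied from the slides:

```latex
\mathrm{Inf}'(f) = \log\frac{N}{\big|\{\,e : \exists f' \in FS(e),\ \mathrm{prop}(f') = \mathrm{prop}(f),\ \max(ES(v)) \in ES(\mathrm{val}(f'))\,\}\big|},\qquad
\mathrm{Po}'(v) = \log\big|\{\,\text{triples whose value } v' \text{ satisfies } \max(ES(v)) \in ES(v')\,\}\big|
```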
39. o Aggregate feature ranking scores for each facet.
o Rank facets based on the aggregated scores.
39
Facet ranking
Rank(f) is the original function and Rank(f)’ is the modified one for datatype property based features.
40. 40
FACES-E entity summary generation
(1) Extract features for entity e → (2) Semantic expansion + type computation for each feature → (3) Cobweb partitioning into facets → (4) Feature and facet ranking (tf-idf based) → (5) Pick top-ranked features from the top-ranked facets.
41. o The gold standard consists of 20 random entities used in FACES
taken from DBpedia 3.9 and 60 random entities taken from
DBpedia 2015-04.
o 17 human users created ideal summaries (total of 900).
41
Evaluation – FACES-E summary generation
System | Avg. Quality (k=5) | % Gain (k=5) | Avg. Quality (k=10) | % Gain (k=10)
FACES-E | 1.5308 | -- | 4.5320 | --
RELIN | 0.9611 | 59 % | 3.0988 | 46 %
RELINM | 1.0251 | 49 % | 3.6514 | 24 %
Avg. Agreement | 2.1168 | | 5.4363 |
42. 1. Knowledge on the Web and concise presentation
2. Diversity-aware entity summarization
- Using hierarchical conceptual grouping.
3. Enriching knowledge graphs and entity summarization
- Add type semantics to literals and adapt them in summarization.
4. Relatedness-based multi-entity summarization
- Using quadratic multidimensional optimization techniques.
42
Talk overview
43. 43
Single vs. Multiple entity summarization
[Figure: in single entity summarization, each entity (e.g., Apple Computer, Steve Jobs) is summarized independently by maximizing importance and diversity; in multi-entity summarization, importance and diversity are still maximized within each summary, while relatedness between the summaries is additionally improved.]
44. 44
Motivating example
Within one month of the iPod nano and iTunes phone special event, Apple Computer
announced today another special event to be held on October 12. It is to be held at
the California Theater in downtown San Jose, California. The invitation reads, “One
more thing …”, the teasing tagline of Steve Jobs.
founders Steve_Jobs
product IPod
locationCity California
industry Consumer_electronics
after Tim_Cook
knownFor Microcomputer_revolution
title Apple_Inc.
birthPlace California
47. 47
Formalizing Quadratic Multidimensional Knapsack Problem (QMKP)
Variable x denotes whether the feature is selected or not
We want to maximize the profit considering each knapsack size
[Figure: Entities 1–4 each have a list of candidate features, and a fixed-size summary (knapsack) is filled for each entity.]
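A standard way to write the quadratic multidimensional knapsack used here; this formulation is a textbook QMKP shape instantiated with the deck's ingredients (the weights α and β are assumed symbols): binary variables x_i select features, linear profits carry importance, pairwise profits are negative for related features of the same entity (diversity) and positive for related features of different entities (cross-entity relatedness), and each entity's summary has its own capacity.

```latex
\max_{x \in \{0,1\}^n}\; \sum_{i} p_i\, x_i + \sum_{i<j} p_{ij}\, x_i x_j
\qquad \text{s.t.}\quad \sum_{i \in FS(e_k)} x_i \le c_k \ \ \text{for each entity } e_k,
\qquad
p_{ij} =
\begin{cases}
-\alpha \cdot \mathrm{rel}(f_i, f_j) & f_i, f_j \text{ from the same entity}\\
+\beta \cdot \mathrm{rel}(f_i, f_j) & f_i, f_j \text{ from different entities}
\end{cases}
```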
48. o We measure the importance of features using the
informativeness and popularity measure used in FACES.
o Within each entity summary, features should have higher
importance.
– Hence, we use a positive weight
48
1. Importance of features
49. o Features consist of properties and values.
o For properties, we use the expansion method used in FACES
and calculate the Jaccard similarity for properties.
o For values, we measure their relatedness using graph-based co-occurrence; we use the RDF2Vec model.
o We combine the two measures and get the relatedness
between two features.
49
How to measure relatedness of features
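A sketch of the combined relatedness measure: Jaccard overlap of the expanded property WordSets plus cosine similarity of RDF2Vec vectors for the values, averaged with equal weight purely for illustration (the actual combination weights are not given on the slide); rdf2vec_vectors is a placeholder for a trained RDF2Vec embedding table.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two WordSets (expanded property labels)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cosine(u, v):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def feature_relatedness(f1, f2, property_wordsets, rdf2vec_vectors):
    """Relatedness of two (property, value) features = property + value similarity."""
    prop_sim = jaccard(property_wordsets[f1[0]], property_wordsets[f2[0]])
    value_sim = cosine(rdf2vec_vectors[f1[1]], rdf2vec_vectors[f2[1]])
    return 0.5 * (prop_sim + value_sim)  # equal weighting is an illustrative choice
```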
50. o Each entity summary should have diverse features.
– (i) Penalize relatedness score with a negative weight (i.e.,
maximize diversity).
– (ii) Modify candidate feature selection to improve diversity.
50
2. Diversity of features within summaries
51. o Maximize profit for related features between summaries.
o Use a positive weight.
51
3. Relatedness of features between summaries
52. o GRASP – Greedy Randomized Adaptive Search Procedure.
o GRASP provides an approximate solution to QKP.
– We simply adapt it to multiple constraints to suit QMKP.
o We use a memory-based GRASP implementation version*.
– Construction phase
• Random selection of features (also using a greedy ranking function)
– Local search phase
• Tries to improve the solution by replacing selected features
– Update the best solution
o To improve intra-entity diversity of features, we modified Restricted
Candidate List (RCL) of GRASP.
– We use a threshold to filter related features of the same entity.
52
GRASP – for combinatorial optimization
* Yang, Zhen, Guoqing Wang, and Feng Chu. "An effective GRASP and tabu search for the 0–1 quadratic knapsack problem."
Computers & Operations Research 40, no. 5 (2013): 1176-1185.
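A compact sketch of the GRASP loop described above; the profit, relatedness, and capacity callbacks are simplified placeholders (not the dissertation's actual implementation), and the RCL filtering of same-entity related features follows the last bullet.

```python
import random

def objective(selected, profit):
    """Total profit of a solution (sum of each feature's marginal profit; simplified)."""
    return sum(profit(f, selected - {f}) for f in selected)

def grasp(features, entity_of, capacity, profit, related,
          rcl_size=5, rel_threshold=0.5, iterations=50):
    """Sketch of GRASP for multi-entity summary selection (one knapsack per entity)."""
    best, best_value = set(), float("-inf")

    for _ in range(iterations):
        # Construction phase: repeatedly pick a random feature from the RCL.
        selected, counts = set(), {}
        while True:
            candidates = [
                f for f in features
                if f not in selected
                and counts.get(entity_of(f), 0) < capacity(entity_of(f))
                # Modified RCL: drop features too related to an already selected
                # feature of the same entity (improves intra-entity diversity).
                and all(related(f, g) < rel_threshold
                        for g in selected if entity_of(g) == entity_of(f))
            ]
            if not candidates:
                break
            rcl = sorted(candidates, key=lambda f: profit(f, selected),
                         reverse=True)[:rcl_size]
            pick = random.choice(rcl)
            selected.add(pick)
            counts[entity_of(pick)] = counts.get(entity_of(pick), 0) + 1

        # Local search phase: try to improve by swapping same-entity features.
        improved = True
        while improved:
            improved = False
            for f in list(selected):
                for g in features:
                    if g in selected or f not in selected or entity_of(g) != entity_of(f):
                        continue
                    trial = (selected - {f}) | {g}
                    if objective(trial, profit) > objective(selected, profit):
                        selected, improved = trial, True

        # Update the best solution found so far.
        value = objective(selected, profit)
        if value > best_value:
            best, best_value = selected, value

    return best
```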
53. o 15 judges, 2 datasets, 30 news items, and 850 question
instances.
o Qualitative evaluation.
o Quantitative evaluation
53
Evaluation
54. o Faceted entity summarization.
– Conceptual (abstract) grouping of features/triples.
– tf-idf based ranking.
o Type computation for literals to enrich knowledge graphs.
– Improve coverage for faceted entity summarization.
o Relatedness-based multi-entity summarization.
54
Conclusion
Entity related structured data on the Web can be concisely and comprehensively
summarized for efficient and convenient information presentation. This can be achieved
through synergistic use of:
(i) Unsupervised knowledge-based methods to conceptually group,
(ii) Information Retrieval-based techniques to intuitively rank,
(iii) Natural Language Processing techniques to semantically enrich structured data, and
(iv) Combinatorial optimization techniques to handle relatedness of multiple entities.
Thesis Statement
56. o Conference Papers
[WWW 2017] Hamid R. Motahari Nezhad, Kalpa Gunaratna, and Juan Cappi. “eAssistant: Cognitive Assistance for
Identification and Auto-Triage of Actionable Conversations.” Proceedings of the 26th International Conference on World
Wide Web Companion. International World Wide Web Conferences Steering Committee, 2017.
[ESWC 2016] Kalpa Gunaratna, Krishnaprasad Thirunarayan, Amit Sheth, and Gong Cheng. “Gleaning Types for Literals in
RDF Triples with Application to Entity Summarization”. In Proc. 13th Extended Semantic Web Conference (ESWC 2016),
2016, pages 85-100. DOI=10.1007/978-3-319-34129-3_6
[AAAI 2015] Kalpa Gunaratna, Krishnaprasad Thirunarayan, and Amit Sheth. “FACES: Diversity-Aware Entity Summarization
using Incremental Hierarchical Conceptual Clustering”. 29th AAAI Conference on Artificial Intelligence (AAAI 2015), 2015.
[Semantics 2013] Kalpa Gunaratna, Krishnaprasad Thirunarayan, Prateek Jain, Amit Sheth and Sanjaya Wijeratne. “A
Statistical and Schema Independent Approach for Identifying Equivalent Properties on Linked Data.” In Proc. 9th
International Conference on Semantic Systems, ACM, 2013, pages 33-40. DOI=10.1145/2506182.2506187.
o Articles
[W 2014] Kalpa Gunaratna, Sarasi Lalithsena and Amit Sheth. “Alignment and Dataset Identification of Linked Data in
Semantic Web.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2014).
o Patents
[P 2015] Kalpa Gunaratna, Hamid Motahari. Adaptive Learning of Actionable Statements in Natural Language Conversation.
US patent filed, January 2016 (pending).
o Edited proceedings
[SumPre 2016] Andreas Thalhammer, Gong Cheng, Kalpa Gunaratna. Proceedings of the 2nd International Workshop on
Summarizing and Presenting Entities and Ontologies (SumPre 2016) co-located with the 13th Extended Semantic Web
Conference (ESWC 2016), Greece, May 30, 2016. CEUR Workshop Proceedings 1605, CEUR-WS.org 2016.
[SumPre 2015] Gong Cheng, Kalpa Gunaratna, Andreas Thalhammer, Heiko Paulheim, Martin Voigt, Roberto García. Joint
Proceedings of the 1st International Workshop on Summarizing and Presenting Entities and Ontologies and the 3rd
International Workshop on Human Semantic Web Interfaces (SumPre 2015, HSWI 2015) co-located with the 12th Extended
Semantic Web Conference (ESWC 2015), Portoroz, Slovenia, June 1, 2015. CEUR Workshop Proceedings 1556, CEUR-WS.org
2016.
56
Selected publications
57. o Top-tier conference publications (AAAI-2015, ESWC-2016, and
WWW-2017).
o Research internships at well-known places (INSIGHT-Ireland,
NLM-USA, IBM-USA).
o Co-chairing and organizing workshops at international
conferences (SumPre2015 and SumPre2016 at ESWC).
o PC member (e.g., ISWC15 and ESWC16) and W3C working
group member (LDP14).
o Competition winner (IBM Blockchain Hackathon runner-up,
National Best Quality Software award finalist).
o Travel and professional development grants (AAAI, WS-GSA).
o US patent application (filed with IBM).
57
Selected accomplishments
58. 58
Acknowledgements
Prof. Amit Sheth
(Advisor)
Prof. Krishnaprasad Thirunarayan
(Advisor)
Dr. Edward Curry
NUIG, Ireland
Prof. Gong Cheng
Nanjing University, China
Dr. Hamid R. Motahari Nezhad
IBM Research, USA
Prof. Keke Chen
59. 59
Thank You
Dr. Olivier Bodenreider
NLM, USA
Dr. Ajith Ranabahu
Amazon, USA
Dr. Gamini Palihawadana
My Family for always
encouraging me …
My colleagues
at Kno.e.sis …
First triple is at the schema level and the second at the data level.
1146 datasets as of January 26 2017
Facebook social graph
IBM Watson knowledge graphs (health)
Amazon product graph
Datasets contain mere data without much processing or semantic enhancement, whereas knowledge graphs provide more semantics and knowledge.
Entity – A real world thing (e.g., person, book, place) at the data level that encapsulates facts and is represented by a URI.
Knowledge graph – A knowledge graph is a collection of facts and rules that can also provide semantics.
Rich Media Reference, created using “WorldModel” (an ontology) and “Knowledgebase” (data extracted from text).
Grows over time: DBpedia (3.9) has around 200 triples on average per entity.
DBpedia 3.9: 4 million entities; 2014: 4.5 million entities; 2015: 4.6 million entities; 2017: 6 million entities, 1.3 billion facts.
The number of facts continues to grow; hence the need for concise presentation.
Grouping is challenging because the number of groups for each entity is unknown.
Conceptually similar features are colored in the same color.
Conceptual – uses probability based grouping.
Incremental – special operators make it insensitive to order of items.
Hierarchical – groups items in a tree structure.
Cobweb uses probability in grouping facts and is hence called “conceptual.” Fisher mentions that this is similar to how humans group items – pick the most probable group.
Also, we need groups that agree at the concept level, not the lexical level.
Not all expansions are shown, for clarity.
Without WordSet and with WordSet
Values are popular. More identifiable to humans.
Property-value pairs are unique. Can distinctly identify the entity.
Want to get a balance of both.
When k > |F(e)|, the choice of facets from which to pick features becomes random rather than principled.
Extract features for the entity e.
Enrich each feature and get the WordSet WS(f).
Enriched feature set FS(e) is input to the partitioning algorithm and get facet set F(e).
Get the feature ranking scores (Rank(f)) for each facet.
Top ranked features from the facets are picked to form the faceted entity summary. The constraints defined in the definition for the faceted entity summary hold.
Dataset-dependent features such as owl:sameAs, wordnet_type, Wikipedia links, dcterms:subject, rdf:type, … were removed.
All three methods are automatic.
1600 vs 1079 in DBpedia 2015-04
Possible reasons a value is a literal rather than a URI: (i) the creator was unable to find a suitable entity URI for the object value, and hence chose to use a literal instead,
(ii) the creator of the triple did not want to attach more details to the value and hence represented it in plain text,
(iii) the value contains only basic implementation types like integer, boolean, and date, and hence not meaningful to create an entity, or
(iv) the value has a lengthy description spanning several sentences (e.g., dbo:abstract property in DBpedia) that covers a diverse set of entities and facts.
The literal can be long. In this work, we focus on literals that are one sentence long.
Head word detection – Collins’ head word detection algorithm.
Directly matches head word to class
Match n-grams containing the head word to a class label; otherwise, match n-grams and the head word to entity labels and then get those entities' types.
Semantic matcher of head word using UMBC matching service.
n is bound to 3 in our DBpedia experiment
For non-numeric strings, extract the n-grams.
Get the focus term for the phrase.
Check for a match between the focus term and an ontology class. If found, success.
Otherwise, analyze the n-grams that contain the focus term (maximal match):
Check for a match between these n-grams and an ontology class. If found, success.
Otherwise, check for a match between an n-gram and an entity label, then take the types of that entity. If found, success.
Finally, compute similarity scores between the focus term and all ontology classes, and take the ontology class with the highest similarity score.
Our finding is: “Typing needs to be handled carefully”
Recall is not measured because it is hard to do so (it would require checking a very large number of pairs).
Inf(f)’ – counts the number of entities having the feature; the property must match, but the value only has to contain the most popular entity of the input feature's value.
Po(v)’ – counts the number of triples whose matching feature value contains the most popular entity of the input value.
Extract features for the entity e.
Enrich each feature and get the WordSet WS(f).
Enriched feature set FS(e) is input to the partitioning algorithm and get facet set F(e).
First get the feature ranking scores (R(f)) and then compute the facet ranking scores for each facet (FacetRank(F(e)).
Top ranked features from top ranked facets in the order are picked to form the faceted entity summary. The constraints defined in the definition for the faceted entity summary hold.
We can add inter-entity summary relatedness.
This is not easy to achieve as we have to process multiple entities at the same time.
Motivating example to show what we want to achieve in a multi-entity summarization scenario (main focus is for relatedness between summaries).
Image icons from Google search (free to use)
We consider weights of the features to be uniform in this case (= 1)
A negatively weighted score reduces the profit if related features are selected, and hence avoids related things being selected together.
GRASP uses pairwise profit matrix
We filter from the RCL those features whose relatedness (via the max function) is above the threshold, ensuring less intra-entity relatedness.
UCI coherence uses word co-occurrence (w refers to words in the equation).
UMass coherence counts the number of documents containing both words (D refers to documents in the equation).
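For reference, the two coherence measures mentioned in these notes are usually written as follows (standard definitions, not reproduced from the slides), where p(·) are word (co-)occurrence probabilities and D(·) are document counts over the N top words:

```latex
C_{\mathrm{UCI}} = \frac{2}{N(N-1)} \sum_{i<j} \log\frac{p(w_i, w_j) + \epsilon}{p(w_i)\,p(w_j)},\qquad
C_{\mathrm{UMass}} = \frac{2}{N(N-1)} \sum_{i<j} \log\frac{D(w_i, w_j) + 1}{D(w_i)}
```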