The integration of a small fraction of the information present in the Web of Documents to the Linked Data
Web can provide a significant shift on the amount of information available to data consumers. However, information extracted from text does not easily fit into the usually highly normalized structure of ontology-based datasets. While the representation of structured data assumes a high level of regularity, relatively
simple and consistent conceptual models, the representation of information extracted from texts need to take into account large terminological variation, complex contextual/dependency patterns, and fuzzy or conflicting semantics. This work focuses on bridging the gap between structured and unstructured data, proposing the representation of text as structured discourse graphs (SDGs), targeting an RDF representation of unstructured
data. The representation focuses on a semantic best-effort information extraction scenario, where information from text is extracted under a pay-as-you-go data quality perspective, trading terminological normalization for domain-independency, context capture, wider representation scope and maximization of textual
information capture.
Representing Texts as contextualized Entity Centric Linked Data Graphs
1. ,
Representing Texts as Contextualized
Entity-Centric Linked Data Graphs
Andr´e Freitas1 Jo˜ao C. P. da Silva2 Danilo S. Carvalho2
Se´an O’Riain1 Edward Curry1
1DERI - Ireland
2Universidade Federal Rio de Janeiro - Brazil
August 27, 2013
Freitas, Silva, Carvalho, O’Riain, Curry 1/ 54
6. ,
Motivation
Integration of unstructured information into the Linked Data Web
Natural Language Texts
Terminological variation
Complex context patterns
Ambiguous sentences
Linked Data
Terminological and structural regularity
Consensual semantics - agreement between data consumers
Freitas, Silva, Carvalho, O’Riain, Curry 6/ 54
7. ,
Motivation
Traditional Information Extraction
Triples - Primary concerns
accuracy
consistency
lexical and structural normalization (schema/ontology)
Complemented - Best-Effort Information Extraction
domain-independency
context capture
maximization of the text semantics representation
wider extraction scope
Freitas, Silva, Carvalho, O’Riain, Curry 7/ 54
8. ,
Motivation
What is the relationship between Barack Obama and Indonesia?
Freitas, Silva, Carvalho, O’Riain, Curry 8/ 54
9. ,
Objective
Structured Discourse Graph (SDG) Model
Improve semantic integration, representation and interpretation of
unstructured text within the context of Linked Data.
Freitas, Silva, Carvalho, O’Riain, Curry 9/ 54
13. ,
Our Work
Focuses on:
Providing a principled description of a representation model
complementary to existing approaches
Supports context representation
Defining a graph representation approach for texts which can
be easily integrated to the Linked Data Web
Freitas, Silva, Carvalho, O’Riain, Curry 13/ 54
15. ,
SDGs - Representation Requirements
Entity-centric graph model
Entity pivot : named entity (subject or object)
Document Centric ⇒ Entity Centric
Maximized representation of text semantics
Information in sentence ⇒ graph representation
Resulting graph should support an algorithmic interpretation of
the extracted SDG
Conceptual model independency
Ontology agnostic
Freitas, Silva, Carvalho, O’Riain, Curry 15/ 54
16. ,
SDGs - Representation Requirements
Context Capture & Representation
Contextual information related to a triple statement should be
represented in the extracted graph
Contextual statements (such as temporality) define the context
in which another statement holds
Dependencies between sentences in the text should also be
made explicit
Freitas, Silva, Carvalho, O’Riain, Curry 16/ 54
17. ,
SDGs - Representation Requirements
Pay-as-you-go semantic reference
Unstructured text may contain complex semantic dependencies
Should support the evolution and refinement of the semantic
model
Standardized representation compatibility
Maximize compatibility with a standards-based data
representation format, facilitating
graph integration
interoperability on the Web
Freitas, Silva, Carvalho, O’Riain, Curry 17/ 54
18. ,
Structured Discourse Graphs - SDGs
Semantic Model Elements & Graph Patterns
Freitas, Silva, Carvalho, O’Riain, Curry 18/ 54
19. ,
SDGs - Semantic Model Elements
In late 1988, Obama entered Harvard Law School.
Text Segmentation: (subject,predicate,object)
Named Entities
Refer to the description of entities for which one or many rigid
designators stands for the referent
Rigid designators: people, places, organization, biological
species, substances, etc
Freitas, Silva, Carvalho, O’Riain, Curry 19/ 54
20. ,
SDGs - Semantic Model Elements
In late 1988, Obama entered Harvard Law School.
Text Segmentation: (subject,predicate,object)
Named Entities
Refer to the description of entities for which one or many rigid
designators stands for the referent
Rigid designators: people, places, organization, biological
species, substances, etc
Freitas, Silva, Carvalho, O’Riain, Curry 20/ 54
21. ,
SDGs - Semantic Model Elements
In late 1988, Obama entered Harvard Law School.
Text Segmentation: (subject,predicate,object)
Named Entities
Refer to the description of entities for which one or many rigid
designators stands for the referent
Rigid designators: people, places, organization, biological
species, substances, etc
Freitas, Silva, Carvalho, O’Riain, Curry 21/ 54
22. ,
SDGs - Semantic Model Elements
In late 1988, Obama entered Harvard Law School.
Co-Referential Elements
Two types co-references: pronominal and non-pronominal
Co-references can refer to either intra or inter sentences
Substituting the co-referent term by the named entity
Can corrupt the semantics of the representation
Freitas, Silva, Carvalho, O’Riain, Curry 22/ 54
23. ,
SDGs - Semantic Model Elements
In late 1988, Obama entered Harvard Law School.
Context Elements
A semantic interpretation may depend on different contexts
(temporal context)
The main contextual information is intra-sentence
Intra-sentence context ⇒ reification
Freitas, Silva, Carvalho, O’Riain, Curry 23/ 54
24. ,
SDGs - Semantic Model Elements
In late 1988, Obama entered Harvard Law School.
Quantifiers & Generic Operators
Quantifier: one, two, (cardinal numbers), many (much),
some, all, thousands of, one of, several, only, most of
Negation: not
Modal: could, may, shall, need to, have to, must, maybe,
always, possibly
Comparative: largest, smallest, most, largest, smallest, the
same, is equal, like, similar to, more than, less than
Freitas, Silva, Carvalho, O’Riain, Curry 24/ 54
25. ,
SDGs - Semantic Model Elements & Graph Patterns
In late 1988, Obama entered Harvard Law School.
Text segmentation: (Obama,entered,Harvard Law School)
Named entities: Obama, Harvard Law School
Resolve co-references: Barack Obama
Context representation: time
Freitas, Silva, Carvalho, O’Riain, Curry 25/ 54
26. ,
SDGs - Semantic Model Elements & Graph Patterns
In late 1988, Obama entered Harvard Law School.
Resolved & Normalized Entities
Resolved entities: a node-substitution in the graph was made
from a co-reference to a named entity
Normalized entities: entities are transformed to a normalized
form (September 1st of 2010 to 01/09/2010)
Freitas, Silva, Carvalho, O’Riain, Curry 26/ 54
27. ,
SDGs - Semantic Model Elements & Graph Patterns
Later in 2007, Obama sponsored an amendment to the
Defense Authorization Act to add safeguards for
personality-disorder military discharges.
Freitas, Silva, Carvalho, O’Riain, Curry 27/ 54
28. ,
SDGs - Semantic Model Elements & Graph Patterns
Freitas, Silva, Carvalho, O’Riain, Curry 28/ 54
29. ,
SDGs - Semantic Model Elements & Graph Patterns
He served three terms representing the 13th District in the
Illinois Senate from 1997 to 2004.
Non-Named (Generic) Entities
Non-named entities map to non-rigid designators
Are more subject to vocabulary variation
Have more complex compositional patterns
Freitas, Silva, Carvalho, O’Riain, Curry 29/ 54
30. ,
SDGs - Semantic Model Elements & Graph Patterns
Following high school, Obama moved to Los Angeles in 1979
to attend Occidental College.
Triple Trees
Not all facts extracted can be represented in one triple.
Transformation from the syntactic tree to a set of triples
The sentence subject defines the root node
Interpretation: DFS traversal of the tree
Freitas, Silva, Carvalho, O’Riain, Curry 30/ 54
31. ,
SDGs - Semantic Model Elements & Graph Patterns
He won election to the U.S. Senate in Illinois in November
2004.
Pronominal Co-Reference
Freitas, Silva, Carvalho, O’Riain, Curry 31/ 54
32. ,
SDGs - Semantic Model Elements & Graph Patterns
On Agust 23, Obama announced his election of Delaware
Senator Joe Biden as his vice presidential running mate.
Pronominal Co-Reference
Freitas, Silva, Carvalho, O’Riain, Curry 32/ 54
33. ,
SDGs - Semantic Model Elements & Graph Patterns
As a member of the Senate Foreign Relations Commitee,
Obama made official trips to Eastern Europe, the Middle
East, Central Asis and Africa.
Conjunctive Co-Reference
Freitas, Silva, Carvalho, O’Riain, Curry 33/ 54
34. ,
SDGs - Semantic Model Elements & Graph Patterns
Freitas, Silva, Carvalho, O’Riain, Curry 34/ 54
36. ,
Semantic Model - Formalization
Graph pattern: atomic graph structure which maps to a
discourse structure
Named and Non-named Entities: [[ne]], [[∼ ne]] ∈ U,
where U is a set of IRIs
Basic Triple: tr = (es , p, eo) where es , eo represent
named/non-named entities associated with the subject (s) and
object (o), and p represents a relation between es and eo,
interpreted by [[tr]] = ([[es ]], [[p]], [[eo ]]) ∈ U × U × U
Core Triple: trc = (nes , p, neo)
Freitas, Silva, Carvalho, O’Riain, Curry 36/ 54
37. ,
Semantic Model - Formalization
Reification Triple: trrei = (tr, reilink , reiobj )
tr - basic triples
reilink - relation
reiobj - reification object
Temporal Reification: special reification triple
reilink : time
reiobj : explicit or implicit data references
Interpretation:
[[trrei ]] = ([[tr]], [[reilink ]], [[reiobj ]]) ∈ U × U × U
Freitas, Silva, Carvalho, O’Riain, Curry 37/ 54
38. ,
Semantic Model - Formalization
Quantifier Operators & Generic Operators:
opt = (eo, oplink , op)
eo - object element in a basic triple tr
op - specific operator of eo wrt oplink
Interpretation:
[[opt]] = ([[proj3(tr)]], [[oplink ]], [[op]]) = ([[eo]], [[oplink ]], [[op]])
where proj3(tr) = proj3((es , p, eo)) = eo
Freitas, Silva, Carvalho, O’Riain, Curry 38/ 54
39. ,
Semantic Model - Formalization
Conjunctive Co-Reference: ccr = n
i=0{(e, conjlinki
, nei )}
Interpretation:
[[ccr]] = {([[e]], [[conjlinki
]], [[nei ]]) ∈ U × U × U such that
[[e]] = [[proj3(tr)]] and (∧n
i=0[[nei ]] sameas [[e]])}
Freitas, Silva, Carvalho, O’Riain, Curry 39/ 54
40. ,
Semantic Model - Formalization
Possessive/Reflexive/Demonstrative Co-Reference:
pcr = {(∼ nei , coreflink , pr), (pr, coreflink , ej )}
Interpretation:
[[pcr]] =
{([[proj1(tr)]], [[coreflink ]], [[pr]]), ([[pr]], [[coreflink ]], [[proj3(tr)]]) ∈
U × U × U such that tr = (∼ nei , p, ej )}
Freitas, Silva, Carvalho, O’Riain, Curry 40/ 54
41. ,
Semantic Model - Formalization
Extracted Graph: set of basic and reified triples and generic,
quantifier and co-reference operators.
Paths
Basic: sequence of basic triples
Reified: basic paths with some reified triples associated
Operational: basic paths with some operators associated
Complex: contains both reified and operational paths
Freitas, Silva, Carvalho, O’Riain, Curry 41/ 54
42. ,
Semantic Model - Formalization
Context Triples: (tr, contextlink , ct) which indicates that a
basic triple tr can be associated with a specific context ct
[[context]] = ([[tr]], [[contextlink ]], [[ct]]) ∈ U3 × U × U
Multi-Context Graphs: is an extracted graph with more
than one context associated to its triples
if all basic triples in a path belong to an unique (same)
context, the path is an unique context (basic, reified,
operational or complex) path
otherwise, we call this path a multi-context path
Freitas, Silva, Carvalho, O’Riain, Curry 42/ 54
44. ,
Graphia
http://graphia.dcc.ufrj.br
Graphia is an information extraction pipeline
Takes factual text as input and produces SDGs as output
Graphia’s modules combine state-of-art NLP tools with an
efficient set of heuristics
Can build graphs for sentences and
entire documents.
Freitas, Silva, Carvalho, O’Riain, Curry 44/ 54
46. ,
Graphia - Preliminary Evaluation
1033 relations (triples) from 150 sentences from 5 randomly
selected Wikipedia articles
Manually classified the graphs
Correct Graphs: fully consistent with the semantic model
Complete Graphs: correct graph which maps all the
information of a sentence
Interpretable Graphs: graph fragment which has the correct
semantics of its basic triple paths
Freitas, Silva, Carvalho, O’Riain, Curry 46/ 54
47. ,
Graphia - Extraction Examples
In 2002 GE acquired the wind power assets of Enron during its
bankruptcy proceedings.
Freitas, Silva, Carvalho, O’Riain, Curry 47/ 54
48. ,
Graphia - Extraction Examples
In 1935, GE was one of the top 30 companies traded
at the London Stock Exchange.
Freitas, Silva, Carvalho, O’Riain, Curry 48/ 54
49. ,
Graphia - Extraction Examples
The Radio Corporation of America (RCA) was founded by GE in
1919 to further international radio.
Freitas, Silva, Carvalho, O’Riain, Curry 49/ 54
50. ,
Graphia - Extraction Examples
It acquired ScanWind in 2009.
Freitas, Silva, Carvalho, O’Riain, Curry 50/ 54
51. ,
Graphia - Extraction Examples
The new company, named GXS, is based in Gaithersburg,
Maryland.
Freitas, Silva, Carvalho, O’Riain, Curry 51/ 54
53. ,
Conclusion & Future Work
Conclusion
SDGs provide a discourse representation as a set of
contextualized relationships
Supports entity-centric integration between graphs
Conceptual Model Independency
Worth putting effort on enumerable patterns (timestamps,
operators)
Future Work
More complex sentence patterns
Freitas, Silva, Carvalho, O’Riain, Curry 53/ 54
54. ,
Representing Texts as Contextualized
Entity-Centric Linked Data Graphs
Andr´e Freitas1 Jo˜ao C. P. da Silva2 Danilo S. Carvalho2
Se´an O’Riain1 Edward Curry1
1DERI - Ireland
2Universidade Federal Rio de Janeiro - Brazil
August 27, 2013
Freitas, Silva, Carvalho, O’Riain, Curry 54/ 54