Improving recall for conjunctive queries on NLP graphs

“30 are better than one”
Improving recall for conjunctive
queries on NLP graphs

Chris Welty, Ken Barker, Lora Aroyo, Shilpa Arora

Answering Conjunctive SPARQL Queries over NLP Graphs, Lora Aroyo
(c) Andy Warhol

Wednesday, October 17, 2012

Goal: hypothesis generation & validation
framework for NLP Graphs

Hypothesis: within this framework, there
is value in the secondary extraction graph for
conjunctive query answering

the probability of a secondary graph statement
being correct increases significantly when that
statement generates a new result to a conjunctive
query over the primary graph


Machine Reading Program


The MRP Vision
goal: to decrease the cost of maintaining critical system DBs
can we replace the human without changing the legacy software (LSW)?
can we build a machine reader for this?

[Diagram: SME, DB, Legacy SW]


The MRP Vision

[Diagram: SME, DB, Legacy SW, query]


The MRP Vision

[Diagram: Machine Reader!, Legacy SW, query]


SRI Answer to the Vision
replacing the human, but still with a DB
NLP components must make their best guess,
without any knowledge of the specific task at hand, e.g. the query

[Diagram: SME, DB, Legacy SW, query]

[Diagram: NLP Stack, DB, Legacy SW, query]

[Diagram: NLP Stack, DB, Legacy SW, query, Machine Reader!]


The MRP Vision

[Diagram: SME, DB, Legacy SW, query]


The MRP Vision

[Diagram: Machine Reader!, SME, DB, Legacy SW, query]


The MRP Vision
the NLP process is not a one-shot deal
the query provides context for what the user is seeking
and thus an opportunity to re-interpret the text

[Diagram: NLP Stack, NLP Graphs, Machine Reader!, SME, DB, Legacy SW, query, re-interpret]


NLP Stack
• Contains NER, CoRef, RelEx, entity disambiguation

• RelEx: SVM learner whose output scores are probabilities/confidences
that the sentence expresses each known relation between each pair of
mentions

• Run over target corpus producing NLP graph

• nodes are entities (clusters of mentions produced
by coref)

• edges are type statements between entities and
classes in the ontology, or relations detected
between mentions of these entities in the corpus


RDF for NLP
• use semantic technology to influence the NLP stack, rather than having
NLP components only feed the knowledge integration layer

• to store the results of IE in RDF Graphs (NLP Graphs), where:

• each triple has a confidence of the NLP components and
provenance indicating where the triple was stated in natural
language text

• triple - not an expression of truth, but a representation of
what an NLP component, or a human annotator, read in a
document

• confidence - not that the triple is true, but reflects the
confidence that the text states the triple (component level
confidence)
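The triple, confidence and provenance described above can be pictured as a small record. This is only a sketch: the class and field names, the document ID and the 0.8 confidence are hypothetical and are not taken from the MRP system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    doc_id: str  # hypothetical document identifier
    start: int   # character offset where the supporting span begins
    end: int     # character offset where the supporting span ends


@dataclass(frozen=True)
class NlpTriple:
    subject: str
    predicate: str
    obj: str
    confidence: float       # confidence that the text STATES the triple, not that it is true
    provenance: Provenance  # where in the corpus the triple was read


text = "... Mr. X of India ..."
start = text.index("Mr. X")
end = text.index("India") + len("India")

# 0.8 is an invented component-level confidence, used only for illustration.
triple = NlpTriple("Mr. X", "citizenOf", "India", 0.8,
                   Provenance("doc-001", start, end))
print(text[triple.provenance.start:triple.provenance.end])
```

Keeping the span offsets on the triple is what lets a query answer point back to the text that expresses it.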

“... Mr. X of India ...”

“... in countries like, India, Iran, Iraq ...”


[Diagram: the NLP stack reads the two text snippets above as evidence and produces an NLP graph (an RDF graph) over the nodes Mr. X, India, Iran and Iraq, with the classes Person, GPE and Country and the edge labels citizenOf, subPlaceOf, sameAs, rdf:type and rdf:subClassOf]

NLP Graph
The nodes & arcs refer to the results of NLP, not “truth”
There is error (precision, recall)
There is confidence associated with each triple

NLP Stack produces

• two NLP graphs

• primary graph = the single best type, relation &
coreference results for each mention or mention
pair

• secondary graph = all possibilities considered by
the NLP stack


SPARQL Queries on
NLP Graphs

19-Sept-2012 Hypothesis Generation for Answering Queries over NLP Graphs Lora Aroyo

Conjunctive Query
Find Jordanian citizens who are members of Hezbollah

SELECT ?p
WHERE {
?p mric:citizenOf geo:Jordan .
mric:Hezbollah mric:hasMember ?p .
}

find all bindings for the variable ?p that satisfy the query
report where in the target corpus the answers were
found (spans of text expressing the relations in the query)


Conjunctive Queries Recall

• $\left[\prod_{k=1}^{n} \mathrm{Recall}(R_k)\right] \times \mathrm{Recall}_{\mathrm{coref}}$

• for a conjunctive query of n terms, recall could be $O(\mathrm{Recall}^n)$

• for complex queries Recall becomes dominating factor, where
the overall Recall gets severely degraded by term Recall

• in our experiments: query recall <.1 for n>3
• all NLP components had to work correctly to get an answer
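A quick way to see how fast the product collapses is to plug in numbers. The per-relation recall of 0.5 and coreference recall of 0.8 below are invented for illustration and are not measurements from the experiments.

```python
from math import prod


def query_recall(term_recalls, coref_recall):
    """Product of the per-relation recalls, times the coreference recall."""
    return prod(term_recalls) * coref_recall


# Hypothetical values: four query terms, each extracted with recall 0.5.
estimate = query_recall([0.5, 0.5, 0.5, 0.5], coref_recall=0.8)
print(round(estimate, 3))
```

With these assumed values the estimate comes out at about 0.05, in line with the observation that query recall drops below .1 once n exceeds 3.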


... solution?

• find solutions to subsets of a conjunctive SPARQL query
as candidate solutions to the full query

• attempt to confirm the candidate solutions using
various kinds of inference, external resources &
secondary extraction results


... in other words
hypothesis generation that focuses on parts of an NLP graph
that almost match a query, identifying statements that if
proven would generate new query solutions

we are looking for missing links in a graph that, if added,
would result in a new query solution


Q: R1(x,y) R2(x,z) R3(z,w)

[Diagram: graph fragments that match R1 and R2; each missing R3 edge, marked "R3?", is a hypothesis]

so, each hypothesis set if added to the primary NLP
graph would provide a new answer to the original query

only validated hypotheses are added to the query result

Hypothesis Generation
• Relaxes a query of size N by removing a set of query terms Q

• Finds solutions to the remaining set of terms

• For each solution, binds the variables in Q, forming a
hypothesis

• If no solutions to subqueries of size N-1 are found, tries subqueries of size N-2

• Appropriate for queries that are almost answerable, i.e. when
most of the query terms are matched in the graph

• Biased towards generating more answers, so it performs
poorly on queries for which the corpus does not
contain the answer
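The relaxation steps above can be sketched over an in-memory set of triples. This is a minimal illustration, not the authors' implementation: the function names, the `max_removed` parameter and the naive pattern matcher are assumptions, and hypotheses that leave a variable unbound are simply skipped. The `racr:` node names and the bombing query come from the example on the following slides.

```python
from itertools import combinations


def is_var(term):
    return term.startswith("?")


def unify(pattern, triple, binding):
    """Extend binding so that pattern equals triple; return None on a clash."""
    new = dict(binding)
    for p, v in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, v) != v:
                return None
        elif p != v:
            return None
    return new


def match(patterns, graph, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for triple in graph:
        new = unify(patterns[0], triple, binding)
        if new is not None:
            yield from match(patterns[1:], graph, new)


def generate_hypotheses(query, graph, max_removed=2):
    """Drop 1 term, then 2, ...; bind the dropped terms to form hypothesis sets."""
    for k in range(1, max_removed + 1):
        found = set()
        for removed in combinations(query, k):
            kept = [t for t in query if t not in removed]
            for b in match(kept, graph):
                hyp = tuple(tuple(b.get(x, x) for x in t) for t in removed)
                unbound = any(is_var(x) for t in hyp for x in t)
                already_known = all(t in graph for t in hyp)
                if not unbound and not already_known:
                    found.add(hyp)
        if found:
            return found  # stop at the smallest relaxation that yields hypotheses
    return set()


graph = {
    ("racr:orgs65", "rdf:type", "mric:TerroristOrganization"),
    ("racr:bombing1", "rdf:type", "mric:Bombing"),
    ("racr:bombing1", "mric:mediatingAgent", "racr:orgs65"),
    ("racr:bombing1", "mric:eventLocation", "mric:Lebanon"),
}
query = [
    ("?t", "rdf:type", "mric:TerroristOrganization"),
    ("?b", "rdf:type", "mric:Bombing"),
    ("?b", "mric:mediatingAgent", "?t"),
    ("?b", "mric:eventLocation", "mric:Lebanon"),
    ("?b", "mric:eventDate", "1983-10-23"),
]
hyps = generate_hypotheses(query, graph)
print(hyps)
```

On this toy graph only the date term is missing, so the single hypothesis set states that racr:bombing1 has mric:eventDate 1983-10-23.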


find all terrorist organizations that were agents of bombings
in Lebanon on October 23, 1983:

SELECT ?t
WHERE {
?t rdf:type mric:TerroristOrganization .
?b rdf:type mric:Bombing .
?b mric:mediatingAgent ?t .
?b mric:eventLocation mric:Lebanon .
?b mric:eventDate "1983-10-23" .
}


[Diagram: the query drawn as a graph pattern over the nodes t and b, with the labels mric:TerroristOrganization, mric:bombing, rdf:type, mric:mediatingAgent, mric:eventLocation, mric:eventDate, mric:Lebanon and 1983-10-23]

Relaxed queries, each dropping one term from the query graph:

1. find all bombings in Lebanon on 1983-10-23 with agents
(hypothesize that the agents are terrorist organizations)

2. find all events in Lebanon on 1983-10-23 by terrorist orgs
(hypothesize that the events are bombings)

3. find all bombings in Lebanon on 1983-10-23
(all known terrorist organizations are hypothetical agents)

4. find all bombings by terrorist orgs on 1983-10-23
(hypothesize that the bombings were in Lebanon)

5. find all bombings by terrorist orgs in Lebanon
(hypothesize that the bombings were on 1983-10-23)

[Diagram: in the NLP graph, relaxation 5 matches t = racr:orgs65 and b = racr:bombing1; the hypothesis is that bombing1 was on 1983-10-23, which would add the missing mric:eventDate edge]


Hypothesis Validation
• a stack of hypothesis checkers, each of which (1) reports its confidence
that a hypothesis holds and (2) provides provenance:
a pointer to a span of text that supports the hypothesis

• the aim is to limit complex computational tasks, e.g. formal
reasoning or choosing between multiple low-confidence
extractions

• such tasks are made more tractable by using
hypotheses as goals, e.g. a reasoner may be used
effectively by constraining to only a part of the graph
connected to a hypothesis


Hypothesis Checkers

• knowledge base (previous work)
• taxonomic inference & complex rules
• rules derived directly from the ontology
• general, domain-independent rules, e.g. family
relationships, and geo knowledge

• TyCor (previous work)
• secondary extraction graph (new work)

Rules Derived from Ontology

• simple superclass-subclass rules
(Bombing (?x) → Attack (?x))

• simple relation-subrelation rules
(hasSon (?x, ?y) → hasChild (?x, ?y))

• simple relation inverse rules
(hasChild (?x, ?y) ↔ hasParent (?y, ?x))
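The three kinds of simple rule can be sketched as a small forward-chaining checker. This is an assumption-laden illustration: the rule tables hold only the examples from this slide, the function names are hypothetical, and the closure is computed over the whole toy graph rather than over the part connected to a hypothesis.

```python
# Rule tables holding only the examples given on the slide.
SUBCLASS = {"Bombing": "Attack"}      # Bombing(?x) -> Attack(?x)
SUBRELATION = {"hasSon": "hasChild"}  # hasSon(?x, ?y) -> hasChild(?x, ?y)
INVERSE = {"hasChild": "hasParent", "hasParent": "hasChild"}


def closure(graph):
    """Apply the three rule kinds until no new triple is derived."""
    derived = set(graph)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(derived):
            new = set()
            if p == "rdf:type" and o in SUBCLASS:
                new.add((s, p, SUBCLASS[o]))
            if p in SUBRELATION:
                new.add((s, SUBRELATION[p], o))
            if p in INVERSE:
                new.add((o, INVERSE[p], s))
            if not new <= derived:
                derived |= new
                changed = True
    return derived


def holds(hypothesis, graph):
    """A hypothesis is confirmed if the rules derive it from the graph."""
    return hypothesis in closure(graph)


facts = {("e1", "rdf:type", "Bombing"), ("a", "hasSon", "b")}
print(holds(("e1", "rdf:type", "Attack"), facts))
print(holds(("b", "hasParent", "a"), facts))
```

Both checks succeed on this toy input: the first through the subclass rule, the second by chaining the subrelation rule with the inverse rule.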


Complex Rules from Ontology

• 40 complex rules based on specialization of
the domain or range of sub-relations

(hasSubGroup (?x, ?y) & HumanOrganization (?x)
→ hasSubOrganization (?x, ?y))


Core Claim: Secondary Graph

is a productive source for hypothesis validation
in conjunction with the primary graph to answer a query


Secondary Graph
• an NLP Graph generated from *all* the
interpretations considered by the NLP stack, so
obviously quite large

• multiple mentions, mention types, multiple
entities, multiple entity types & multiple
relations between them

• pruned at a particular confidence threshold
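Using the secondary graph as a hypothesis checker amounts to a thresholded lookup. The sketch below assumes the secondary graph is a mapping from triple to confidence, and it scores a hypothesis set by its weakest triple; both choices are assumptions for illustration. The 0.05 confidence is invented, while the .01 default echoes the secondary-graph threshold reported later in the deck.

```python
def validate(hypothesis_sets, secondary, threshold=0.01):
    """Keep hypothesis sets whose every triple is in the secondary graph
    with confidence at or above the threshold."""
    validated = []
    for hyp_set in hypothesis_sets:
        scores = [secondary.get(t, 0.0) for t in hyp_set]
        if scores and all(s >= threshold for s in scores):
            # Assumption: score the set by its least confident triple.
            validated.append((hyp_set, min(scores)))
    return validated


# Hypothetical secondary graph: triple -> RelEx confidence.
secondary = {("geo:Venezuela", "mric:isLedBy", "Hugo Chavez"): 0.05}

candidates = [
    (("geo:Venezuela", "mric:isLedBy", "Hugo Chavez"),),
    (("geo:Venezuela", "mric:isLedBy", "Someone Else"),),
]
confirmed = validate(candidates, secondary)
print(confirmed)
```

Only the first candidate survives, because the second has no support at all in the assumed secondary graph.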


Experimental Setup
testing the ideas


Initial MRP Setup
find mentions of the ontology types & relations in the corpus
& extract them into an RDF Graph

• OWL target ontology: types & binary relations

• 10-50K documents - Gigaword (sub)corpus

• 79 docs manually annotated (mentions of the target
relations & their argument types)

• 50 SPARQL queries (expected to be answered in NLP Graph)

• query results evaluated manually
• each query has at least one correct answer in the corpus
• some queries have over 1000 correct answers

Initial MRP Evaluation
• required extensive manual effort:

• no match between system node IDs and GS node IDs

• provenance for evaluators to find mentions from a graph

• evaluators semi-automatically map the system result
entity IDs to GS entity IDs

• expensive, error-prone & difficult to reproduce ...

• difficult to test systems adequately before the evaluation

• only 50 queries were used - not enough for significant
system validation, e.g. not able to tune system thresholds

How did we change this?
• we decided to sacrifice corpus size in favor of having entity IDs (eliminating
the manual step in the evaluation)

• we created a gold standard corpus

• 169 docs manually annotated with types, relations, coreference and
entity names

• generated Gold-Standard NLP graph from manually annotated data

• automatically generated SPARQL queries from GS graph

• we ran only the RelEx component using GS mentions, types & coref giving
us the GS entity IDs in the system graph

measure performance of system results against these GS results

Evaluation & Test Data
• 60 train, 60 devtest & 49 final (blind) test

• manually annotated with NER, coref, relations
• extracted from Gigaword
• split to balance distribution of 48 domain relations

• generated Gold-Standard NLP graph from manually annotated data

• RelEx component trained & applied using GS mentions,
types & coref

• this increases the F-measure (F=.28) of the RelEx output, but it is
used in both the baseline and the test experiments, so it
doesn't affect the comparison


SPARQL Evaluation Queries
• 475 test queries for the devtest set and 492 for test.

• generated from the GS NLP graph for each document set by:

• extracting random connected subsets of the graph containing 2-8
domain relations (not including rdf:type)

• adding type statements for each node

• replacing nodes that had no proper names in text with select variables

• running each query over the same GS graph; the results became our gold
standard results for query evaluation (because the queries contain variables,
the results can differ from the subgraph we started with)
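The first three generation steps can be sketched as follows. The function name, argument layout and toy node IDs (`event1`, `person1`) are hypothetical; the relations are borrowed from the Venezuela example later in the deck. The final step, running the query over the GS graph to obtain gold results, is left out.

```python
import random


def generate_query(gs_edges, types, named, size, rng):
    """gs_edges: list of (subject, relation, object) domain relations;
    types: node -> class; named: nodes that have a proper name in the text."""
    chosen = [rng.choice(gs_edges)]
    nodes = {chosen[0][0], chosen[0][2]}
    # Grow a random connected subset until it holds `size` domain relations.
    while len(chosen) < size:
        frontier = [e for e in gs_edges
                    if e not in chosen and (e[0] in nodes or e[2] in nodes)]
        if not frontier:
            break
        edge = rng.choice(frontier)
        chosen.append(edge)
        nodes |= {edge[0], edge[2]}
    # Nodes without a proper name become select variables.
    variables = {n: f"?v{i}" for i, n in enumerate(sorted(nodes - named))}

    def term(node):
        return variables.get(node, node)

    patterns = [(term(s), r, term(o)) for s, r, o in chosen]
    # Add a type statement for each node.
    patterns += [(term(n), "rdf:type", types[n]) for n in sorted(nodes)]
    return patterns, sorted(variables.values())


gs_edges = [
    ("geo:Venezuela", "mric:isLedBy", "person1"),
    ("event1", "mric:mediatingAgent", "person1"),
]
types = {
    "geo:Venezuela": "mric:GeopoliticalEntity",
    "person1": "mric:Person",
    "event1": "mric:Event",
}
patterns, select_vars = generate_query(
    gs_edges, types, named={"geo:Venezuela"}, size=2, rng=random.Random(0))
print(select_vars)
```

Passing in a seeded `random.Random` keeps the generated query set reproducible, which matters when the same queries are reused across experiments.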


NLP Graphs from
RelEx Output
• RelEx: a set of SVM binary classifiers, one per relation

• for each sentence in the corpus, for each pair of mentions in that sentence,
for each relation it produces a probability that that pair is related by the
relation

• NLP graphs are generated by selecting relations from RelEx output in two ways:

• Primary: takes only the top scoring relation between any mention pair
above a confidence threshold (0, .1 and .2)

• Secondary: takes all relations between all mention pairs above 0
confidence

• All type triples come from the Gold Standard (GS)

• Precision & Recall are determined by automatically comparing system
query results to the GS query results (for every query we know all the answers)
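The two selection strategies can be sketched over a table of RelEx scores. The input layout, function name, mention names and probabilities are all assumptions for illustration; "above" is read here as strictly greater than the threshold.

```python
def build_graphs(relex_scores, primary_threshold=0.2):
    """relex_scores: {(mention1, mention2): {relation: probability}}.
    Returns (primary, secondary) as dicts mapping triple -> confidence."""
    primary, secondary = {}, {}
    for (m1, m2), scores in relex_scores.items():
        # Secondary: every relation above 0 confidence.
        for rel, p in scores.items():
            if p > 0:
                secondary[(m1, rel, m2)] = p
        # Primary: only the top-scoring relation, if it clears the threshold.
        if scores:
            best_rel, best_p = max(scores.items(), key=lambda kv: kv[1])
            if best_p > primary_threshold:
                primary[(m1, best_rel, m2)] = best_p
    return primary, secondary


# Invented scores for two mention pairs.
relex_scores = {
    ("m1", "m2"): {"isLedBy": 0.05, "hasMember": 0.30},
    ("m3", "m4"): {"citizenOf": 0.15},
}
primary, secondary = build_graphs(relex_scores, primary_threshold=0.2)
print(len(primary), len(secondary))
```

With these assumed scores the primary graph keeps one triple and the secondary graph keeps all three, including the low-confidence isLedBy edge that a later validation step could pick up.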

Threshold Choices
• Threshold .2 --> max F1=.28 on devset for RelEx

• Threshold .1 --> guessed threshold before
having any data to back it up

• we could have tried more thresholds but it was
a lot of work

• in our experiments, we explored threshold
space over hundreds of queries - satisfactory
to tune the threshold parameters

Graph Notation
• We refer to the graphs by document set (dev or test) and top/all@threshold, e.g.

• devTop@.2 = NLP Graph on dev set using top relations
above .2 confidence

• testAll@0 = NLP graph on test set using all relations
above 0 confidence

• 3 primary graphs, in all cases using top, and selecting relations
at thresholds 0, .1, and .2

• 1 secondary graph using the all@0 setting (R=.97)


This Evaluation Setup
allows us to run experiments repeatedly over
hundreds of queries with no manual intervention
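Because every query has known gold-standard answers, scoring reduces to set comparison. The sketch below scores a single query; how the scores are aggregated across queries is not stated in the slides, so no aggregation is shown. The function name and the toy answer sets are hypothetical.

```python
def score(system, gold):
    """Precision, recall and F1 of one query's system answers against gold."""
    system, gold = set(system), set(gold)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two system answers, one of them among four gold answers.
p, r, f = score({"a", "b"}, {"a", "c", "d", "e"})
print(p, r, round(f, 3))
```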


6 Experiments

• 3 for dev, 3 for test

• each experiment compares query results from only PG
to query results using the PG+SG for hypothesis
validation

• the three experiments compare performance at
different primary graph thresholds


[Plot: 0-threshold primary graph, with & without secondary graph (all@0); y-axis F1, x-axis SG threshold]

for a given PG threshold we vary the SG threshold for validated
hypotheses (x-axis)

[Plot: .1-threshold primary graph; best performance point at .01 SG threshold]

the red line indicates the PG threshold - the PG-only curve
flattens below this threshold as expected

[Plot: best performance point at .01 SG threshold]

the best performing configuration for dev is .2 threshold PG with SG
hypotheses validated at .01 threshold

Performance

the test set was truly blind, we ran it only once

R - expected, F - hoped, P - surprised

the probability of a relation holding between two
mentions increases significantly if that relation would
complete a conjunctive query result

Example: Generated Query
Q161: "Find events in which the leader of Venezuela is the
mediating agent"

?e1 mric:mediatingAgent ?p1
geo:Venezuela mric:isLedBy ?p1
geo:Venezuela rdf:type mric:GeopoliticalEntity
?p1 rdf:type mric:Person
?e1 rdf:type mric:Event


no solutions in PG

find binding for p1 (346)

generates 346 hypotheses
finds support in SG for isLedBy("Venezuela", "Hugo Chavez")

Questions?

@laroyo
http://lora-aroyo.org
