Unsupervised Learning of a Social Network from a Multiple-Source News Corpus

Unsupervised Learning
of Social Networks
from a Multiple-Source News Corpus

Hristo Tanev

European Commission
Joint Research Centre
hristo.tanev@jrc.it

Introduction
Social networks provide an intuitive
picture of inferred relationships between
entities, such as people and organizations.
Social network analysis uses Social
Networks to identify underlying groups,
communication patterns, and other
information.
Manual construction of a social network is
a very laborious task. Algorithms for
automatic detection of relations can
save time and human effort.

Introduction
We present an unsupervised methodology
for automatic learning of social networks.
We use a multiple-source, syntactically
parsed news corpus.
In order to overcome the efficiency
problems which emerge from using
syntactic information on real-world data,
we put forward an efficient graph
matching algorithm.

Related work
Learning social networks from
Friend-Of-A-Friend links (Mika 2005)
or statistical co-occurrences
Disadvantage: cannot detect the
type of the relation

Related work
Support Vector Machines (SVM)
provide more accurate means for
relation extraction (Zelenko et al.
2003)
Disadvantages:
• require a sufficient amount of annotated
data
• each pair of named entities should be
evaluated separately, which slows down
the relation extraction

Related work
Romano et al. (2006) propose a generic
unsupervised method for learning of
syntactic patterns for relation extraction
Disadvantages:
• they use the Web as a training corpus, which
makes the learning very slow
• they match each pattern against each
sentence, which is inefficient when matching
many templates against a big corpus

Unsupervised learning of social
networks
Our algorithm is unsupervised: it accepts as
input one, two, or some other small number of
two-slot seed syntactic templates which express
a certain semantic relation.
The algorithm uses news clusters to learn new
syntactic patterns expressing the same semantic
relation.
Once the patterns are learned, we apply a novel,
efficient pattern-matching methodology to
extract related person names from the text.
Extracted relations are aggregated in a social
network.
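
As a rough illustration of the aggregation step, the sketch below collects extracted relations into a network keyed by person pair. The tuple format (person X, relation type, person Y) and the name `build_network` are assumptions made for illustration; the slides do not specify how the network is stored.

```python
from collections import defaultdict

def build_network(relations):
    """Aggregate (x, relation_type, y) tuples into edge -> {relation_type: count}."""
    network = defaultdict(lambda: defaultdict(int))
    for x, relation_type, y in relations:
        network[(x, y)][relation_type] += 1
    return network

# Hypothetical extracted relations, using names that appear in the slides.
relations = [
    ("George W. Bush", "support", "Hamid Karzai"),
    ("George W. Bush", "meeting", "Tony Blair"),
    ("George W. Bush", "meeting", "Tony Blair"),
]
network = build_network(relations)
```

Each edge keeps a count per relation type, so repeated mentions of the same relation strengthen one edge instead of creating duplicates.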

EMM news clusters
European Media Monitor downloads
news from different sources around
the clock.
Every day 4000-5000 English-language
news articles are downloaded.
The news articles are grouped into
topic clusters.

Parsing the corpus
The training and the test corpus
consist of English-language news
articles from 200 sources.
Articles are parsed with a full
dependency parser, MiniPar.
[Figure: dependency tree in which "meet" governs
"Bush" (subj), "Blair" (obj), and "March" (via "in")]

Learning patterns
Manually provide a very small
number of seed syntactic templates
which express the main relation.
For example, for the relation “X
supports Y” we use the syntactic
patterns:
X subj support obj Y
X subj praise obj Y

Learning patterns
Match these templates against the
news clusters in the corpus. Each
pair of person names that fills the
slots X and Y is called an anchor
pair.
From “Bush praised the Prime
Minister Hamid Karzai”, the
algorithm will extract the anchor
pair (X:Bush; Y:Hamid Karzai)
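
A minimal sketch of seed-template matching, assuming each parsed sentence is a list of (head lemma, relation, dependent) triples. This representation and the function name `extract_anchor_pairs` are illustrative assumptions, not the authors' implementation.

```python
def extract_anchor_pairs(sentence, seed_verbs, person_names):
    """Return (X, Y) pairs where X is the subject and Y the object of a seed verb.

    Assumption: verbs are keyed by lemma, so two occurrences of the same
    verb in one sentence are not distinguished in this sketch.
    """
    subjects, objects = {}, {}
    for head, relation, dependent in sentence:
        if head in seed_verbs and dependent in person_names:
            if relation == "subj":
                subjects.setdefault(head, []).append(dependent)
            elif relation == "obj":
                objects.setdefault(head, []).append(dependent)
    pairs = []
    for verb, xs in subjects.items():
        for x in xs:
            for y in objects.get(verb, []):
                pairs.append((x, y))
    return pairs

# Triples for "Bush praised the Prime Minister Hamid Karzai" (simplified).
sentence = [("praise", "subj", "Bush"), ("praise", "obj", "Hamid Karzai")]
pairs = extract_anchor_pairs(sentence, {"support", "praise"}, {"Bush", "Hamid Karzai"})
```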

Learning patterns
Normalize the anchor pairs using
the information in the EMM
database.
After this step, the example anchor
pair will become (X:George W.
Bush; Y:Hamid Karzai).
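
The normalization step could be sketched as a lookup from surface forms to canonical names. The `ALIASES` table below is a hypothetical stand-in for the EMM database, whose actual interface is not described in the slides.

```python
# Hypothetical alias table standing in for the EMM database lookup.
ALIASES = {
    "Bush": "George W. Bush",
    "George Bush": "George W. Bush",
}

def normalize_pair(pair, aliases=ALIASES):
    """Map each name to its canonical form; unknown names are kept unchanged."""
    x, y = pair
    return (aliases.get(x, x), aliases.get(y, y))

normalized = normalize_pair(("Bush", "Hamid Karzai"))
```

A plain dictionary ignores ambiguity (several people may share a surname), so a real lookup would need context to choose among candidates.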

Learning patterns
For each extracted anchor pair,
search the same cluster for all
sentences in which both names of the
anchor pair occur.
The assumption is that the same
relation will hold between the same
pairs of names in the whole news
cluster, since all articles in it have
the same topic.
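
A small sketch of this search, assuming a cluster is a list of sentence strings and that each canonical name comes with a set of surface forms. Both structures and the function names are assumptions made for illustration.

```python
def mentions(sentence, surface_forms):
    """True if any surface form of a name occurs in the sentence (naive substring test)."""
    return any(form in sentence for form in surface_forms)

def sentences_with_pair(cluster_sentences, pair, surface_forms):
    """Return the sentences of one cluster that mention both names of the anchor pair."""
    x, y = pair
    return [
        s for s in cluster_sentences
        if mentions(s, surface_forms[x]) and mentions(s, surface_forms[y])
    ]

# Hypothetical cluster on a single topic.
cluster = [
    "Bush praised the Prime Minister Hamid Karzai.",
    "Karzai agreed with Bush on security.",
    "Bush arrived in Kabul on Wednesday.",
]
forms = {"George W. Bush": {"Bush"}, "Hamid Karzai": {"Karzai"}}
hits = sentences_with_pair(cluster, ("George W. Bush", "Hamid Karzai"), forms)
```

The second sentence is found even though it does not match a seed template, which is what lets new patterns be learned from it.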

Learning patterns
From all the sentences in which at least
one anchor pair appears, learn syntactic
patterns using our pattern-learning
algorithm, which is similar to the General
Structure Learning algorithm (GSL)
described in (Szpektor et al. 2006).
Example: X subj-agree-with Y
The score of each pattern is the
number of different anchor pairs which
support it.
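
The scoring rule can be sketched as counting distinct anchor pairs per pattern. The input format, a stream of (pattern, anchor pair) observations, and the sample observations are assumptions; pattern learning itself is not shown.

```python
from collections import defaultdict

def score_patterns(observations):
    """Score each pattern by the number of distinct anchor pairs that support it."""
    support = defaultdict(set)
    for pattern, anchor_pair in observations:
        support[pattern].add(anchor_pair)
    return {pattern: len(pairs) for pattern, pairs in support.items()}

# Hypothetical observations; the repeated pair counts only once.
observations = [
    ("X subj-agree-with Y", ("George W. Bush", "Hamid Karzai")),
    ("X subj-agree-with Y", ("George W. Bush", "Hamid Karzai")),
    ("X subj-agree-with Y", ("Tony Blair", "George W. Bush")),
    ("X subj-back-obj Y", ("Tony Blair", "George W. Bush")),
]
scores = score_patterns(observations)
```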

Learning patterns
Pattern selection and filtering
• Filter out all templates which appear for
fewer than 2 anchor pairs.
• Remove generic patterns like “X say Y”,
“X have Y”, “X is Y”, etc. using a
predefined template list
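
The two filters above can be sketched as follows. The `GENERIC` set contains only the three examples named on the slide, and the sample scores are made up for illustration.

```python
# Only the generic patterns named on the slide; the full predefined list is not given.
GENERIC = {"X say Y", "X have Y", "X is Y"}

def select_patterns(scores, min_pairs=2, generic=GENERIC):
    """Keep patterns supported by at least min_pairs anchor pairs and not in the generic list."""
    return {
        pattern: score
        for pattern, score in scores.items()
        if score >= min_pairs and pattern not in generic
    }

# Hypothetical pattern scores (number of supporting anchor pairs).
scores = {"X subj-agree-with Y": 4, "X say Y": 17, "X subj-back-obj Y": 1}
selected = select_patterns(scores)
```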

Syntactic Network model
[Figure: Syntactic Network example for the sentences
“Prodi met President Bush in September” and
“Berlusconi met President Chirac”]

Efficiency
The worst-case time complexity of building
SyntNet is O(|w| log |w|), where |w| is the
number of words in the parsed corpus.
The worst-case time complexity of the syntactic
matching algorithm is bounded by
O((|s| + |t|) log MaxArcO), where |s| is the number
of sentences in the corpus, |t| is the number of
templates, and MaxArcO is the maximum number of
occurrences of a SyntNet arc, i.e. the size of
the largest index set of a SyntNet arc.
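
The slides do not spell out SyntNet's data structure. The sketch below shows one way an arc index with per-arc index sets could look, so that the quantity MaxArcO has a concrete meaning. The triple representation, the function names, and the use of binary search are all assumptions, and template slots (X, Y) are not handled here.

```python
from bisect import bisect_left
from collections import defaultdict

def build_arc_index(parsed_sentences):
    """Map each arc (head, relation, dependent) to a sorted list of sentence ids."""
    index = defaultdict(list)
    for sent_id, triples in enumerate(parsed_sentences):
        for arc in set(triples):
            index[arc].append(sent_id)  # ids are appended in increasing order
    return index

def arc_occurs_in(index, arc, sent_id):
    """Binary search in the arc's index set: logarithmic in the size of that set."""
    ids = index.get(arc, [])
    pos = bisect_left(ids, sent_id)
    return pos < len(ids) and ids[pos] == sent_id

# Simplified, hypothetical parses of three sentences.
corpus = [
    [("meet", "subj", "Prodi"), ("meet", "obj", "Bush"), ("meet", "in", "September")],
    [("meet", "subj", "Berlusconi"), ("meet", "obj", "Chirac")],
    [("meet", "subj", "Prodi"), ("meet", "obj", "Chirac")],
]
index = build_arc_index(corpus)
max_arc_o = max(len(ids) for ids in index.values())  # size of the largest index set
```

Because each lookup is a binary search over one arc's index set, its cost grows with log MaxArcO, not with the size of the corpus.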

Evaluation schema

To evaluate our algorithm, we learned syntactic
patterns for “meeting” and “support”
relationships between people.
We evaluate how well the algorithm captures
relationships between the top 33 VIPs in our
database.
We do not evaluate how well it captures
individual relation mentions.
If a specific relation (e.g. “meeting”) holds
between a pair of people X and Y, it is sufficient
that the algorithm finds at least one mention of
this relation between X and Y.
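
Pair-level evaluation can be sketched by collapsing all mentions of a pair into one entry before comparing against a gold set. The data below is made up, and the names `pair_level_scores`, `mentions`, and `gold` are illustrative assumptions.

```python
def pair_level_scores(extracted_mentions, gold_pairs):
    """Precision and recall over distinct pairs; repeated mentions count once."""
    extracted = set(extracted_mentions)
    true_positives = len(extracted & gold_pairs)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall

# Hypothetical "meeting" mentions and gold standard for three people A, B, C.
mentions = [("A", "B"), ("A", "B"), ("A", "C")]
gold = {("A", "B"), ("B", "C")}
precision, recall = pair_level_scores(mentions, gold)
```

Here the pair (A, B) is found twice but counted once, so one correct pair out of two extracted and one out of two gold pairs gives precision and recall of 0.5 each.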

Experiments
For paraphrase learning, we used a training
corpus of 98'000 English-language news articles
clustered in 22'000 EMM topic clusters, published
in the period 01/May/2006 – 03/Oct/2006.
For testing the method, we used 125'000
English-language news articles published in the
period 03/Oct/2006 – 31/Oct/2006.
Reading the test corpus and the templates into
memory and building SyntNet+ took 9 min and
3 sec. It took only 45 seconds to match the 101
syntactic templates against the test corpus of
about 1'080'000 parsed sentences.
We normalized extracted names using the EMM
database.

Relationship extraction evaluation on the top
33 VIPs from the EMM DB

Relation   Precision   Recall   F1
meeting    0.61        0.56     0.58
support    0.57        0.10     0.17
overall    0.60        0.32     0.42
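
As a quick consistency check, the F1 column can be recomputed from the precision and recall columns with the standard harmonic-mean formula. The slides do not state the formula; the sketch assumes the usual definition.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall values copied from the table.
rows = {"meeting": (0.61, 0.56), "support": (0.57, 0.10), "overall": (0.60, 0.32)}
f1_scores = {name: round(f1(p, r), 2) for name, (p, r) in rows.items()}
```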

Using the social network view
We ran the PageRank algorithm on
the automatically extracted
“meeting” network and found the top
5 ranked people.
We compared this ranking with a
simple frequency-based ranking
of people.
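
The two rankings can be contrasted on a toy network. The sketch below uses a plain power-iteration PageRank on an undirected graph; the damping factor, iteration count, and placeholder people A to E are assumptions, since the slides do not give the PageRank settings that were used.

```python
from collections import Counter, defaultdict

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as a list of pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = sorted(neighbours)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            incoming = sum(rank[m] / len(neighbours[m]) for m in neighbours[n])
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = updated
    return rank

def frequency_ranking(mentions):
    """Rank people by how often they appear in extracted relation mentions."""
    counts = Counter(name for pair in mentions for name in pair)
    return [name for name, _ in counts.most_common()]

# Hypothetical "meeting" mentions: A is mentioned most often, C meets the most people.
mentions = [("A", "B"), ("A", "B"), ("A", "B"), ("C", "A"), ("C", "D"), ("C", "E")]
rank = pagerank(mentions)
top_by_pagerank = max(rank, key=rank.get)
top_by_frequency = frequency_ranking(mentions)[0]
```

In this toy data the frequency ranking favours A, who has many mentions with one partner, while PageRank favours C, who is connected to more distinct people.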

Comparing two people-ranking
schemes

Rank   PageRank     Frequency
1      C. Rice      G.W. Bush
2      G.W. Bush    T. Blair
3      V. Putin     C. Rice
4      E. Olmert    N. al-Maliki
5      T. Blair     S. Hussein

Conclusions and future work
We presented an unsupervised method for
learning social networks from news clusters.
We presented a very efficient syntactic
pattern matching algorithm.
Automatically learned social networks can
be used for some analyst tasks.
In future work, we will try to cover
more types of relations.
We will also consider learning and using
more abstract patterns.
