Paper Survey (Sasaki)

Paper survey @ research meeting, 2012.04.19:

N. Shibata, Y. Kajikawa, I. Sakata, "Link Prediction in Citation Networks," Journal of the American Society for Information Science and Technology, 63(1): 78-85, 2012

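Hajime SASAKI (佐々木一)

Project Researcher, Policy Alternatives Research Institute (政策ビジョン研究センター)

Affiliated Researcher, Innovation Policy Research Center, Institute of Engineering Innovation, School of Engineering (工学系研究科総合研究機構イノベーション政策研究センター)

Cooperative Researcher, Ichiro Sakata Laboratory, Department of Technology Management for Innovation (技術経営戦略学専攻坂田一郎研究室)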

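Overview and Conclusions
•  Overview: The problem of predicting citation relationships among academic papers is treated as a link prediction problem on a graph. Eleven features were applied to five academic fields, and a model was built with an SVM as the classifier.

•  Conclusion 1: Judging by the F-value, a performance measure for classifiers, link prediction was modeled well.

•  Conclusion 2: The features that are effective differ depending on the structure of the field. A different model therefore has to be applied for each type of field structure.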

Introduction
•  The number of academic papers increases exponentially (Price, 1965), and each academic area becomes specialized and segmented.

•  Individual scientists have to specialize in only a few scientific subdomains to keep up with the growth of the domains, which means that researchers must focus on increasingly narrow domains.

Research Question: What factors affect the existence of links, using features intrinsic to the network itself? Answering this question, namely link prediction, will help scholars know which papers to cite and help managers identify future core papers.

•  In this article, the authors utilize textual, topological, and attribute features, which are considered to influence citing behavior, for link prediction.

Related Work
•  Liben-Nowell and Kleinberg (2003): proposed a model for link prediction in large coauthorship networks.

•  Clauset, Moore, and Newman (2008): investigated the hierarchical structure of social networks to predict missing connections in partially known networks with high accuracy.

•  Popescul and Ungar (2003): proposed a new approach for Statistical Relational Learning to build link prediction models.

•  Hasan, Chaoji, Salem, and Zaki (2006): tested several supervised learning models (decision tree, k-nearest neighbor, multilayer perceptron, support vector machine [SVM], radial basis function [RBF] network) for link prediction.

•  Murata and Moriyasu (2008): applied the model of Liben-Nowell and Kleinberg to social networks of Question-Answering Bulletin Boards.

•  Caragea, Bahirwani, Aljandal, and Hsu (2009): proposed an algorithm to predict potential friendships based on a clustering approach in LiveJournal, a social network journal service with a focus on user interactions.

•  Lu, Jin, and Zhou (2009): presented a local path index to estimate the likelihood of the existence of a link between two nodes.

•  Seglen (1994): analysed the trends of papers in the journals with large impact factors.

•  Vinkler and Davidson (2002): indicated that papers in journals that are growing in terms of the number of papers are more likely to be cited.

•  Hwang, Wylie, Wei, and Liao (2010): proposed recommendation engines based on coauthorship networks.

Features of This Study
•  1: The focus is on citation networks.

•  2: The authors apply SVMs as the classifier for supervised learning, as SVM is the best learner according to Hasan et al. (2006).

•  3: The authors use a more comprehensive set of features, optimized for citation networks.

Significance of This Study
•  Helps decide more accurately whether a link should exist, even with a huge amount of data.

•  Application: building a citation recommendation system for authors of scientific publications and patents.

–  The reviewers of scientific papers can reduce the time needed to check whether the references in those papers are adequate.

–  Second, well-organized link prediction can reveal how and why authors cite other scientific papers.

–  Finally, link prediction can bond different research fields with similar topics but from different disciplines.

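Method: Support Vector Machine
Margin maximization in SVM

There are infinitely many lines that separate the red circles from the blue circles. To choose the most suitable one among them, SVM considers "margin maximization."

The margin is the distance between the separating line and the circle closest to it. Because data are noisy, a larger margin should make wrong decisions less likely.

In the example figure, the red line has a larger margin than the blue line, so the red line is considered the better separator. By finding the line with the largest margin, SVM tries to classify unseen data correctly as well.

[Figure: f(x) = 0 is the separating hyperplane; training determines the parameters that satisfy this line.]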

Supplement: why is the margin 2/||ω||?

Setting b = 0:

d = 1/a

Margin: 2d = 2/a
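
The step from d = 1/a to the margin can be spelled out as follows. This is a sketch under the usual assumptions, which the slide itself does not state: the separating hyperplane is f(x) = ω·x + b, ω is scaled so that the closest points satisfy |f(x)| = 1, and a is read as ||ω||.

```latex
\begin{align*}
\text{distance from a point } x_0 \text{ to } f(x) = 0 &: \quad \frac{|\omega \cdot x_0 + b|}{\lVert \omega \rVert} \\
\text{closest points satisfy } |\omega \cdot x_0 + b| = 1 &: \quad d = \frac{1}{\lVert \omega \rVert} \\
\text{margin (closest points on both sides)} &: \quad 2d = \frac{2}{\lVert \omega \rVert}
\end{align*}
```

Setting b = 0 only moves the hyperplane through the origin; the distance formula, and therefore the margin, is unchanged.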

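When the data are not linearly separable

Overfitting

[Figure: three panels labeled A, B, and C]

When a model overfits, it does not approach the true solution even if the samples (parameters) are increased.

This is handled by imposing constraints such as smoothness (regularization).

The prediction model should be kept simple.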

Excerpt from the paper (p. 79):

"... and w = (w1, w2, ..., wd) is the parameter vector of the same dimension that specifies the model. A positive value of wj indicates that the j-th feature xj positively contributes to the prediction, while a negative value contributes to it negatively. The sign function returns +1 when its argument is positive, and returns −1 otherwise. Given the data set X and Y, the SVM learning algorithm finds the optimal parameter w∗ that minimizes the following objective function:"

$$\sum_i \max\{1 - y_i h(x_i),\, 0\} + c \lVert w \rVert_2^2$$

We want to minimize the sum of a term that makes as few mistakes as possible with as much confidence as possible (the loss) and a term that favors as simple a model as possible (the regularization term).

Loss function (first term): a penalty for wrong classifications.

Regularization term (second term): if the model adapts excessively to the training data, its performance on unseen data (generalization performance) gets worse. This term prevents overfitting.

The goal is to determine the parameters (weights) that minimize the whole expression.
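
The objective above can be evaluated directly for a given weight vector. The following is a minimal sketch, not the authors' implementation: it assumes a linear score h(x) = w·x (the slide does not define h), and the names `h`, `objective`, `predict`, and the toy data are hypothetical.

```python
def h(w, x):
    """Linear score h(x) = w . x (assumed form of the model)."""
    return sum(wj * xj for wj, xj in zip(w, x))


def objective(w, X, Y, c):
    """Sum of hinge losses plus c times the squared L2 norm of w."""
    loss = sum(max(1.0 - y * h(w, x), 0.0) for x, y in zip(X, Y))
    regularizer = c * sum(wj * wj for wj in w)
    return loss + regularizer


def predict(w, x):
    """sign(h(x)): +1 when the score is positive, -1 otherwise."""
    return 1 if h(w, x) > 0 else -1


# Toy data (hypothetical): two features, labels in {+1, -1}.
X = [(2.0, 0.0), (0.0, 2.0)]
Y = [1, -1]
w = (1.0, -1.0)

# Both points are classified with score magnitude 2 >= 1, so the loss
# term is 0 and only the regularization term c * ||w||^2 remains.
value = objective(w, X, Y, c=0.5)
```

With all weights at zero the regularization term vanishes but every example incurs a loss of 1, which illustrates the trade-off between the two terms.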

Features (11 in total)
Topological Features

•  (1) The number of common neighbours. (number of shared nodes)

•  (2) Link-based Jaccard coefficient. (proportion of shared nodes)

•  (3) Difference in betweenness centrality. (nodes with high betweenness centrality get cited)

•  (4) Difference in the number of in-links. (nodes with many links get cited)

•  (5) Is same cluster. (whether the two papers are in the same cluster)
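
Features (1), (2), and the in-link count behind (4) can be computed from a plain list of citation pairs. The sketch below is illustrative only: it assumes that a paper's neighbours are all papers linked to it in either direction, which the slide does not specify, and the function names and toy data are hypothetical.

```python
def neighbours(citations, paper):
    """Papers linked to `paper` in either direction (citing or cited)."""
    linked = set()
    for src, dst in citations:
        if src == paper:
            linked.add(dst)
        elif dst == paper:
            linked.add(src)
    return linked


def common_neighbours(citations, a, b):
    """Feature (1): number of neighbours shared by papers a and b."""
    return len(neighbours(citations, a) & neighbours(citations, b))


def jaccard(citations, a, b):
    """Feature (2): shared neighbours divided by all neighbours of a or b."""
    na, nb = neighbours(citations, a), neighbours(citations, b)
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def in_links(citations, paper):
    """Number of citations received by `paper` (used by feature (4))."""
    return sum(1 for _, dst in citations if dst == paper)


# Toy citation list (hypothetical): (citing paper, cited paper).
citations = [("p1", "p3"), ("p2", "p3"), ("p1", "p4"), ("p2", "p5")]

shared = common_neighbours(citations, "p1", "p2")  # p3 is the only shared neighbour
ratio = jaccard(citations, "p1", "p2")             # 1 shared out of {p3, p4, p5}
```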

Semantic Features

•  (6) Cosine similarity of term frequency–inverse document frequency (tf–idf) vectors. (whether the two papers share the same semantic characteristics)
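
Feature (6) can be illustrated with a small standard-library sketch. It assumes one common tf–idf variant (relative term frequency times log(N / document frequency)); the slide does not say which weighting the paper uses, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: tf * idf} dict per tokenised document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy tokenised abstracts (hypothetical).
docs = [
    ["organic", "led", "polymer"],
    ["organic", "led", "efficiency"],
    ["solar", "cell", "efficiency"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # two shared terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # no shared terms
```

Papers with overlapping vocabulary get a similarity above zero, while papers with no shared terms get exactly zero.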

Attribute Features

•  (7) Difference in publication year. (recent papers are cited often)

•  (8) The number of common authors. (number of shared authors)

•  (9) Is self citation. (same author)

•  (10) Is published in same journal. (whether the two papers appear in the same journal)

•  (11) Number of times "to" cited. (the rich get richer)

Dataset

TABLE 2. Datasets of citation networks.

Dataset                    Published through  No. of papers  No. of citations
A Innovation               2009               20,564         106,619
B Nano Bio                 2009               33,830         175,875
C Organic LED              2009               19,486         196,123
D Solar Cells              2008               18,587         111,051
E Secondary Batteries (*)  2008               20,430         145,008

Queries:
A: innovation*
B: nano* and bio*
C: ((organic* or polymer*) and (electroluminescen* or electro-luminescen* or electro luminescen* or light emitting or LED*)) or OLED*
D: solar cell*
E: ((secondary or storage or rechargeable or reserve) and cell*) or batter*

Data and Experiment (excerpt from the paper):

"In this article, five large-scale citation datasets, Innovation, Nano Bio, Organic LED, Solar Cells, and Secondary Batteries, are collected as shown in Table 2. We searched databases of academic papers and patents using the same query for each domain. The databases of academic papers used are the Science Citation Index Expanded (SCI-EXPANDED), the Social Sciences Citation Index (SSCI), and the Arts & Humanities Citation Index (A&HCI) compiled by the Institute for Scientific Information (ISI). After collecting data, we extracted the papers and citations in the largest-graph component to ..."

Cross Validation
•  1. These existing citations are divided into five groups (positive instances, namely, P[1] to P[5]).

•  2. We randomly created the same number of pairs where citations did not exist (negative instances, namely, N[1] to N[5]).

•  3. In the first experiment, P[2] to P[5] and N[2] to N[5] were used as the training data and P[1] and N[1] were used as the test data.

•  4. We repeated step 3 five times in total, each time with a different choice of answer (test) set.

[Figure: five rounds (1st to 5th), each showing the groups P1 to P5 of pairs with citations and N1 to N5 of pairs without citations. In every round one P group and one N group are marked as test data and the remaining groups as training data.]
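
The four steps can be sketched as follows. This is an illustration rather than the authors' code: the function names, the fixed seeds, and the toy papers are hypothetical, and negative pairs are drawn uniformly from ordered pairs of distinct papers that are not existing citations, which is one possible reading of "randomly created."

```python
import random


def make_folds(items, k, rng):
    """Shuffle and split `items` into k groups of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def sample_negatives(papers, positives, n, rng):
    """Draw n ordered pairs of distinct papers that are not citations.

    Assumes enough non-citation pairs exist; otherwise this loops forever.
    """
    existing = set(positives)
    negatives = set()
    while len(negatives) < n:
        a, b = rng.sample(papers, 2)
        if (a, b) not in existing:
            negatives.add((a, b))
    return sorted(negatives)


def cross_validation_rounds(positives, negatives, k=5, seed=0):
    """Yield (train, test) pairs; round i tests on P[i] and N[i]."""
    rng = random.Random(seed)
    p_folds = make_folds(positives, k, rng)
    n_folds = make_folds(negatives, k, rng)
    for i in range(k):
        test = p_folds[i] + n_folds[i]
        train = [pair for j in range(k) if j != i
                 for pair in p_folds[j] + n_folds[j]]
        yield train, test


# Toy data (hypothetical): 10 papers, each citing the next one in a ring.
papers = ["p%d" % i for i in range(10)]
positives = [(papers[i], papers[(i + 1) % 10]) for i in range(10)]
negatives = sample_negatives(papers, positives, len(positives), random.Random(1))
rounds = list(cross_validation_rounds(positives, negatives))
```

Each pair ends up in the test data of exactly one round, so every pair is scored once by a model that was not trained on it.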

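Evaluation metrics: Precision, Recall, F-value
Confusion matrix

                             True result
                             Positive    Negative
Prediction    Positive       TP          FP
              Negative       FN          TN

Precision: $\text{Precision} = \dfrac{TP}{TP + FP}$

Recall: $\text{Recall} = \dfrac{TP}{TP + FN}$

F-value: $F = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

The F-value is the harmonic mean of precision and recall.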
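
The three metrics follow directly from the confusion-matrix counts. A minimal sketch with hypothetical counts and function names:

```python
def precision(tp, fp):
    """Share of predicted positives that are true positives."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_value(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts, not taken from the paper.
tp, fp, fn = 80, 20, 10
p = precision(tp, fp)
r = recall(tp, fn)
f = f_value(p, r)
```

As a check against the reported results, the precision 0.75 and recall 0.91 given for dataset A yield `round(f_value(0.75, 0.91), 2) == 0.82`, which matches the reported F1.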

Result
TABLE 3. Prediction results.

Dataset                 Precision  Recall  F1
A Innovation            0.75       0.91    0.82
B Nano Bio              0.83       0.76    0.79
C Organic LED           0.79       0.71    0.74
D Solar Cells           0.76       0.72    0.74
E Secondary Batteries   0.80       0.77    0.77

F-value: 0.74 to 0.82. Based on the results, we obtained the learning model on our training data.

From the paper: "As a learner, we employed L2-regularized and L2-loss ..."

Weights of features
Positive contribution: >= 0.5
Negative contribution: <= −0.5
No contribution: −0.5 to 0.5

TABLE 4. Weights of features.

Features                                  A. Innovation  B. Nano Bio  C. Organic LED  D. Solar Cells  E. Secondary Batteries
1. No. common neighbors                   0.566          0.889        0.520           0.683           0.987
2. Link-based Jaccard coefficient         1.354          2.198        −6.150          −0.703          −4.742
3. Difference in betweenness centrality   −1.446         −6.107       −2.175          −5.468          −10.049
4. Difference in the number of in-links   0.052          0.033        0.034           0.045           0.047
5. Is same cluster                        0.018          0.086        −0.308          −0.160          −0.062
6. Cosine similarity of tf-idf vectors    −19.897        −17.817      −15.527         1.624           1.519
7. Difference in publication year         0.018          0.046        0.032           0.009           0.008
8. The number of common authors           −0.112         0.476        0.403           0.152           0.036
9. Is self-citation                       1.975          0.756        0.605           0.865           0.918
10. Is published in same journal          0.726          0.614        0.198           0.027           −0.108
11. Number of times "to" cited            −0.018         −0.019       −0.015          −0.031          −0.033
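
The ±0.5 thresholds on this slide can be applied mechanically to the Table 4 weights. The sketch below does so for two of the five datasets; the function and variable names are hypothetical, and the weights are copied from the table.

```python
# Weights from Table 4, keyed by feature number.
WEIGHTS = {
    "A. Innovation": {1: 0.566, 2: 1.354, 3: -1.446, 4: 0.052, 5: 0.018,
                      6: -19.897, 7: 0.018, 8: -0.112, 9: 1.975,
                      10: 0.726, 11: -0.018},
    "C. Organic LED": {1: 0.520, 2: -6.150, 3: -2.175, 4: 0.034, 5: -0.308,
                       6: -15.527, 7: 0.032, 8: 0.403, 9: 0.605,
                       10: 0.198, 11: -0.015},
}


def contribution(weight, threshold=0.5):
    """Classify a weight with the slide's thresholds."""
    if weight >= threshold:
        return "positive"
    if weight <= -threshold:
        return "negative"
    return "none"


def features_by_contribution(weights, kind):
    """Feature numbers whose weight falls in the given class."""
    return [f for f, w in sorted(weights.items()) if contribution(w) == kind]


positive_a = features_by_contribution(WEIGHTS["A. Innovation"], "positive")
negative_c = features_by_contribution(WEIGHTS["C. Organic LED"], "negative")
```

For these two datasets the link-based Jaccard coefficient (2) is classified as positive for A and negative for C, the sign flip that motivates field-specific models.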

・Especially, (2), (3) and (6) largely affected the predictions of citations.

・(2): (A) and (B) comprise multiple research fields, and most citations stay within each research field, so papers cite locally. (C), (D) and (E) each belong to a research field with a single issue.

・(3): … that core nodes and peripheral nodes, which have different values of betweenness centrality, are linked in the citation networks.

・(6): same as (3).

・(1): the more common neighbours two papers have, the more related they are.

・(9): because authors tend to cite their own papers.

・(10): same as (3), (6).

Excerpts from the paper:

"… the existence of a citation with a probability from 74% to 82%, given a pair of papers and the entire citation network. Especially three features, (2) link-based Jaccard coefficient, (3) difference in betweenness centrality, and (6) cosine similarity of tf–idf vectors, largely affected the predictions of citations. Link-based Jaccard coefficient contributed positively in the cases of (A) Innovations (weight: 1.354) and (B) Nano Bio (2.198) but negatively in the cases of (C) Organic LED (−6.150), (D) Solar Cells (−0.703) and (E) Secondary Batteries (−4.742). These results indicate that the former research areas, such as (A) Innovations and (B) Nano Bio, …"

"… of common neighbours positively affected all cases, because the more common neighbours two papers have, the more related they are. That the self-citation result had a positive effect is reasonable because authors tend to cite their own papers. The feature of is published in the same journal only affected (A) Innovations (0.726) and (B) Nano Bio (0.614) positively. Similar to the result of link-based Jaccard coefficient, papers tend to cite in each research field in the case of research fields with multiple issues. In summary, different models are required for different types of research areas: research fields with a single issue or research fields with multiple issues. In the case …"

Summary

•  It is difficult to build a universal learner for link prediction, and we need to build learners based on the characteristics of each research domain.

•  Different models are required for different types of research areas: research fields with a single issue or research fields with multiple issues.

–  The first one is the research field with multiple issues, such as (A) Innovations and (B) Nano Bio.

–  The second one is a simple research field type with commonly understood targets of research and development, such as (C) Organic LED, (D) Solar Cells and (E) Secondary Batteries.
