Paper Survey (Sasaki)
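1. Paper survey @ research meeting, 2012.04.19: N. Shibata, Y. Kajikawa, I. Sakata, "Link Prediction in Citation Networks," Journal of the American Society for Information Science and Technology, 63(1): 78-85, 2012
Hajime SASAKI (佐々木一)
Project Researcher, Policy Alternatives Research Institute
Affiliated Researcher, Innovation Policy Research Center, Institute of Engineering Innovation, School of Engineering
Cooperative Researcher, Ichiro Sakata Laboratory, Department of Technology Management for Innovation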
3. Introduction
• As the number of academic papers increases exponentially (Price, 1965), each academic area becomes specialized and segmented.
• To keep up with the growth of the domains, an individual scientist has to specialize in only a few scientific subdomains, which means that researchers must focus on increasingly narrow domains.
Research Question: Using features intrinsic to the network itself, what factors affect the existence of links? Answering this is the task of link prediction, which will help scholars know which papers to cite and help managers identify future core papers.
• In this article, the authors utilize textual, topological, and attribute features for link prediction, which are considered to influence citing behaviors.
4. Previous work
• Liben-Nowell and Kleinberg (2003): proposed a model for link prediction in large coauthorship networks.
• Clauset, Moore, and Newman (2008): investigated the hierarchical structure of social networks to predict missing connections in partially known networks with high accuracy.
• Popescul and Ungar (2003): proposed a new approach for Statistical Relational Learning to build link prediction models.
• Hasan, Chaoji, Salem, and Zaki (2006): tested several supervised learning models (decision tree, k-nearest neighbor, multilayer perceptron, support vector machine [SVM], radial basis function [RBF] network) for link prediction.
• Murata and Moriyasu (2008): applied the model of Liben-Nowell and Kleinberg to social networks of Question-Answering Bulletin Boards.
• Caragea, Bahirwani, Aljandal, and Hsu (2009): proposed an algorithm to predict potential friendships based on a clustering approach in LiveJournal, a social network journal service with a focus on user interactions.
• Lu, Jin, and Zhou (2009): presented a local path index to estimate the likelihood of the existence of a link between two nodes.
• Seglen (1994): analysed the trends of papers in journals with large impact factors.
• Vinkler and Davidson (2002): indicated that papers in journals that are growing in terms of the number of papers are more likely to be cited.
• Hwang, Wylie, Wei, and Liao (2010): proposed recommendation engines based on coauthorship networks.
5. Features of this study
• 1: The focus is on citation networks.
• 2: The authors apply SVMs as their supervised learning method (that is, as the classifier), as SVM is the best learner according to Hasan et al. (2006).
• 3: The authors use more comprehensive features optimized for citation networks.
6. Significance of this study
• Helps decide more accurately whether a link should exist, even with a huge amount of data.
• Application: building a citation recommendation system for authors of scientific publications and patents.
– First, reviewers of scientific papers can reduce the time needed to check whether the references in those papers are adequate.
– Second, well-organized link prediction can reveal how and why authors cite other scientific papers.
– Finally, link prediction can bond different research fields with similar topics but from different disciplines.
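10. Overfitting
[Figure: three panels labeled A, B, and C]
When a model overfits, increasing the number of samples (parameters) does not bring it closer to the true solution. This is handled by imposing constraints such as smoothness (regularization).
We want to keep the prediction model simple.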
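11. Excerpt from the paper (Journal of the American Society for Information Science and Technology, January 2012, p. 79; DOI: 10.1002/asi):
"... and $w = (w_1, w_2, \ldots, w_d)$ is the parameter vector of the same dimension that specifies the model. A positive value of $w_j$ indicates that the $j$-th feature $x_j$ positively contributes to the prediction, while a negative value contributes to it negatively. The sign function returns $+1$ when its argument is positive, and returns $-1$ otherwise. Given the data set $X$ and $Y$, the SVM learning algorithm finds the optimal parameter $w^*$ that minimizes the following objective function:
$$\sum_i \max\{1 - y_i h(x_i),\, 0\} + c\,\|w\|_2^2$$"
We want to minimize the sum of two terms: one that makes as few errors as possible with as much confidence as possible (the loss), and one that favors as simple a model as possible (the regularization term).
Loss function: a penalty for misclassification.
Regularization term: prevents overfitting. If the model adapts too closely to the training data, its performance on unseen data (generalization performance) drops instead.
We want to find the parameters (weights) that minimize the whole expression.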
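As a rough illustration of the objective above, the following sketch evaluates the loss term and the regularization term on hypothetical toy data. The linear form `h(x) = w . x` (with no bias term), the toy vectors, and the value of `c` are assumptions made for this example, not details taken from the paper.

```python
def h(w, x):
    """Linear score w . x (assumed form of h; the slide does not define it)."""
    return sum(wj * xj for wj, xj in zip(w, x))


def sign(value):
    """Return +1 when the argument is positive and -1 otherwise."""
    return 1 if value > 0 else -1


def objective(w, X, Y, c):
    """Sum of hinge losses over all examples plus c times the squared L2 norm of w."""
    loss = sum(max(1 - y * h(w, x), 0) for x, y in zip(X, Y))
    regularizer = c * sum(wj * wj for wj in w)
    return loss + regularizer


# Hypothetical toy data: two features per paper pair; label +1 (cited) or -1 (not cited).
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
Y = [1, -1, 1]
w = (1.0, -1.0)

value = objective(w, X, Y, c=0.5)
predictions = [sign(h(w, x)) for x in X]
```

Raising `c` makes large weights more expensive, which pushes the minimizer toward a simpler model at the cost of a higher loss on the training data.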
12. Features (11 in total)
Topological Features
• (1) The number of common neighbours. (number of shared nodes)
• (2) Link-based Jaccard coefficient. (proportion of shared nodes)
• (3) Difference in betweenness centrality. (nodes with high betweenness centrality get cited)
• (4) Difference in the number of in-links. (nodes with many links get cited)
• (5) Is same cluster. (whether both papers are in the same cluster)
Semantic Features
• (6) Cosine similarity of term frequency-inverse document frequency (tf-idf) vectors. (whether the papers share the same semantic features)
Attribute Features
• (7) Difference in publication year. (recent papers are cited often)
• (8) The number of common authors. (number of shared authors)
• (9) Is self citation. (same author)
• (10) Is published in same journal. (whether in the same journal)
• (11) Number of times "to" cited. (the rich get richer)
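To make three of these features concrete, here is a minimal sketch of features (1), (2), and (6) on a hypothetical toy network. The exact definitions used in the paper are not given on the slide; this example assumes the usual set-based Jaccard coefficient over neighbour sets and sparse tf-idf vectors stored as dictionaries, and all paper IDs and term weights are made up.

```python
import math


def common_neighbours(graph, a, b):
    """Feature (1): number of nodes linked to both a and b."""
    return len(graph[a] & graph[b])


def jaccard(graph, a, b):
    """Feature (2): shared neighbours divided by all neighbours of a or b."""
    union = graph[a] | graph[b]
    return len(graph[a] & graph[b]) / len(union) if union else 0.0


def cosine_similarity(u, v):
    """Feature (6): cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical neighbour sets (citation direction ignored for simplicity).
graph = {
    "p1": {"p3", "p4"},
    "p2": {"p3", "p4", "p5"},
    "p3": {"p1", "p2"},
    "p4": {"p1", "p2"},
    "p5": {"p2"},
}

# Hypothetical tf-idf weights for two papers.
tfidf_p1 = {"solar": 1.0, "cell": 1.0}
tfidf_p2 = {"solar": 1.0, "battery": 1.0}

shared = common_neighbours(graph, "p1", "p2")
overlap = jaccard(graph, "p1", "p2")
similarity = cosine_similarity(tfidf_p1, tfidf_p2)
```

Each candidate pair of papers would yield one such value per feature, and the resulting vector of 11 values is what the classifier sees.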
13. Dataset
TABLE 2. Datasets of citation networks.
Dataset | Query | Published through | No. of papers | No. of citations
A Innovation | innovation* | 2009 | 20,564 | 106,619
B Nano Bio | nano* and bio* | 2009 | 33,830 | 175,875
C Organic LED | ((organic* or polymer*) and (electroluminescen* or electro-luminescen* or electro luminescen* or light emitting or LED*)) or OLED* | 2009 | 19,486 | 196,123
D Solar Cells | solar cell* | 2008 | 18,587 | 111,051
E Secondary Batteries (*) | ((secondary or storage or rechargeable or reserve) and cell*) or batter* | 2008 | 20,430 | 145,008
Data and Experiment
In this article, five large-scale citation datasets, Innovation, Nano Bio, Organic LED, Solar Cells, and Secondary Batteries, are collected as shown in Table 2. We searched databases of academic papers and patents using the same query for each domain. The databases of academic papers used are the Science Citation Index Expanded (SCI-EXPANDED), the Social Sciences Citation Index (SSCI), and the Arts & Humanities Citation Index (A&HCI) compiled by the Institute for Scientific Information (ISI). After collecting data, we extracted the papers and citations in the largest-graph component to ...
14. Cross Validation
• 1. The existing citations are divided into five groups (positive instances, namely, P[1] to P[5]).
• 2. We randomly created the same number of pairs where citations did not exist (negative instances, namely, N[1] to N[5]).
• 3. In the first experiment, P[2] to P[5] and N[2] to N[5] were used as the training data, and P[1] and N[1] were used as the test data.
• 4. We repeated step 3 five times in total with a different choice of answer set (test data).
[Figure: five-fold split over rounds 1 to 5. Data with citations (P1 to P5) and data without citations (N1 to N5); in each round one P group and one N group serve as the test data and the remaining groups serve as the training data.]
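The procedure above can be sketched as follows. The toy papers and citations are hypothetical, the round-robin grouping is an assumption (the slide does not say how citations are assigned to groups), and holding out groups 2 to 5 in order for the later rounds is also assumed; only round 1 is spelled out in step 3.

```python
import random


def sample_negatives(papers, citations, rng):
    """Draw as many non-cited ordered pairs as there are citations.

    Assumes enough non-cited pairs exist; otherwise the loop would not end.
    """
    existing = set(citations)
    negatives = set()
    while len(negatives) < len(existing):
        citing, cited = rng.sample(papers, 2)
        if (citing, cited) not in existing:
            negatives.add((citing, cited))
    return sorted(negatives)


def split_into_groups(items, k=5):
    """Assign items to k groups in round-robin order."""
    return [items[i::k] for i in range(k)]


def cross_validation_rounds(positives, negatives, k=5):
    """Return k (train, test) pairs of labeled examples, one per held-out group."""
    p_groups = split_into_groups(positives, k)
    n_groups = split_into_groups(negatives, k)
    rounds = []
    for held_out in range(k):
        test = [(pair, 1) for pair in p_groups[held_out]]
        test += [(pair, -1) for pair in n_groups[held_out]]
        train = []
        for index in range(k):
            if index != held_out:
                train += [(pair, 1) for pair in p_groups[index]]
                train += [(pair, -1) for pair in n_groups[index]]
        rounds.append((train, test))
    return rounds


rng = random.Random(0)  # fixed seed so the sketch is repeatable
papers = ["p%d" % i for i in range(6)]
citations = [
    ("p1", "p0"), ("p2", "p0"), ("p2", "p1"), ("p3", "p1"), ("p3", "p2"),
    ("p4", "p2"), ("p4", "p3"), ("p5", "p3"), ("p5", "p4"), ("p5", "p0"),
]
negatives = sample_negatives(papers, citations, rng)
rounds = cross_validation_rounds(citations, negatives)
```

Because the negative set is the same size as the positive set, every round trains and tests on balanced data.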
16. Result
TABLE 3. Prediction results.
Dataset | Precision | Recall | F1
A Innovation | 0.75 | 0.91 | 0.82
B Nano Bio | 0.83 | 0.76 | 0.79
C Organic LED | 0.79 | 0.71 | 0.74
D Solar Cells | 0.76 | 0.72 | 0.74
E Secondary Batteries | 0.80 | 0.77 | 0.77
F-value: 0.74 to 0.82.
Based on the results, we obtained the learning model on our training data.
As a learner, we employed L2-regularized and L2-loss ...
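For readers less familiar with the three measures in Table 3, the sketch below computes them from true and predicted labels. The label lists are hypothetical; labels follow the +1 (citation exists) and -1 (no citation) convention used for the SVM.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall, and F1 for the positive (+1) class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # correctly predicted links
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)  # predicted links that do not exist
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)  # existing links that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical test fold: four cited pairs followed by four non-cited pairs.
true_labels = [1, 1, 1, 1, -1, -1, -1, -1]
predicted_labels = [1, 1, 1, -1, 1, -1, -1, -1]

precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels)
```

As a check against the table, `f1_from(0.75, 0.91)` rounds to 0.82, the F1 reported for dataset A.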
17. Weights of features
Positive contribution: >= 0.5
Negative contribution: <= -0.5
No contribution: -0.5 to 0.5
TABLE 4. Weights of features.
Features | A. Innovation | B. Nano Bio | C. Organic LED | D. Solar Cells | E. Secondary Batteries
1. No. common neighbors | 0.566 | 0.889 | 0.520 | 0.683 | 0.987
2. Link-based Jaccard coefficient | 1.354 | 2.198 | −6.150 | −0.703 | −4.742
3. Difference in betweenness centrality | −1.446 | −6.107 | −2.175 | −5.468 | −10.049
4. Difference in the number of in-links | 0.052 | 0.033 | 0.034 | 0.045 | 0.047
5. Is same cluster | 0.018 | 0.086 | −0.308 | −0.160 | −0.062
6. Cosine similarity of tf-idf vectors | −19.897 | −17.817 | −15.527 | 1.624 | 1.519
7. Difference in publication year | 0.018 | 0.046 | 0.032 | 0.009 | 0.008
8. The number of common authors | −0.112 | 0.476 | 0.403 | 0.152 | 0.036
9. Is self-citation | 1.975 | 0.756 | 0.605 | 0.865 | 0.918
10. Is published in same journal | 0.726 | 0.614 | 0.198 | 0.027 | −0.108
11. Number of times "to" cited | −0.018 | −0.019 | −0.015 | −0.031 | −0.033
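The thresholds listed on this slide can be applied mechanically to one column of Table 4. The sketch below does so for dataset A (Innovation); the weights are copied from the table, while the function and variable names are made up for this example.

```python
# Weights for dataset A (Innovation), keyed by feature number from Table 4.
WEIGHTS_A = {
    1: 0.566, 2: 1.354, 3: -1.446, 4: 0.052, 5: 0.018, 6: -19.897,
    7: 0.018, 8: -0.112, 9: 1.975, 10: 0.726, 11: -0.018,
}


def contribution(weight, threshold=0.5):
    """Label a weight using the slide's thresholds (>= 0.5, <= -0.5, otherwise none)."""
    if weight >= threshold:
        return "positive"
    if weight <= -threshold:
        return "negative"
    return "none"


labels = {feature: contribution(weight) for feature, weight in WEIGHTS_A.items()}
positive = [feature for feature, label in labels.items() if label == "positive"]
negative = [feature for feature, label in labels.items() if label == "negative"]
```

For dataset A this marks features 1, 2, 9, and 10 as positive contributors and features 3 and 6 as negative contributors.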
・Especially, (2), (3) and (6) largely affected the predictions of citations.
・(2): (A) and (B) comprise multiple research fields, and most citations stay within each research field, so papers cite locally. (C), (D) and (E) are each contained in a research field with a single issue.
・(3): core nodes and peripheral nodes, which have different values of betweenness centrality, are linked in the citation networks.
・(6): same as (3).
・(1): the more common neighbours two papers have, the more related they are.
・(9): because authors tend to cite their own papers.
・(10): same as (3), (6).
Excerpt from the paper (partly cut off on the slide):
"... the existence of a citation with a probability from 74% to 82%, given a pair of papers and the entire citation network. Especially three features, (2) link-based Jaccard coefficient, (3) difference in betweenness centrality, and (6) cosine similarity of tf-idf vectors, largely affected the predictions of citations. Link-based Jaccard coefficient contributed positively in the cases of (A) Innovations (weight: 1.354) and (B) Nano Bio (2.198) but negatively in the cases of (C) Organic LED (−6.150), (D) Solar Cells (−0.703) and (E) Secondary Batteries (−4.742). These results indicate that the former research areas, such as (A) Innovations and (B) Nano Bio, ..."
"... of common neighbours positively affected all cases, because the more common neighbours two papers have, the more related they are. That the self-citation result had a positive effect is reasonable because authors tend to cite their own papers. The feature of is published in the same journal only affected (A) Innovations (0.726) and (B) Nano Bio (0.614) positively. Similar to the result of link-based Jaccard coefficient, papers tend to cite in each research field in the case of research fields with multiple issues. In summary, different models are required for different types of research areas: research fields with a single issue or research fields with multiple issues. In the case ..."
18. Summary
• It is difficult to build a universal learner for link prediction, and we need to build learners based on the characteristics of each research domain.
• Different models are required for different types of research areas: research fields with a single issue or research fields with multiple issues.
– The first one is the research field with multiple issues such as (A) Innovations and (B) Nano Bio.
– The second one is a simple research field type with commonly understood targets of research and development such as (C) Organic LED, (D) Solar Cells and (E) Secondary Batteries.