1) The document discusses big data and financial innovation, from research to practice. It identifies the challenges that traditional financial services face and the opportunities that big data presents.
2) It analyzes the three main values of big data: insights from scale, knowledge from enrichment, and agility from real-time responsiveness. It also compares internal enterprise data and external social media big data.
3) The document provides examples of using big data for precision marketing and relationship marketing/risk management. It also discusses research topics like mining offline relationships from online social networks.
Big Data Drives Financial Innovation
1. Big Data and Financial Innovation: From Research to Practice
Feida Zhu (朱飞达)
Assistant Professor, DS Lee Foundation Fellow
School of Information Systems, Singapore Management University
Founding Director, Pinnacle Lab for Analytics
DBS-SMU Lab for Life Analytics, Singapore Management University
Dec. 11, 2015
3. Big Data and Financial Innovation: From Research to Practice
The three main values of big data
– Insight from scale
• What can big data tell us that small data cannot?
– Knowledge from enrichment
• What important knowledge can we learn from enriching small data with big data?
– Agility from real-time responsiveness
• What are the values of being real-time?
VOLUME, VARIETY, VELOCITY
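4. Big Data and Financial Innovation: From Research to Practice
What value can external big data actually offer an enterprise?
Internal enterprise data
• Transaction-based: usually built only on transaction records
• Limited coverage: limited in total volume and coverage
• Fragmented, partial perspective: reflects only parts and facets of a user's life
• Static, low frequency
• Isolated view of the individual user: a single, isolated customer view that sees only the person
External social media big data
• Context-based: reveals the context in which transactions take place
• Societal scale: massive, society-wide coverage
• Multi-facet insight: a multi-angle, panoramic view of the user
• Dynamic, real-time, high frequency
• Network-embedded user view: takes rich, real social relationships into account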
22.
Problem: Given a Twitter follow network of a target user, identify the user's offline community by examining the follow linkage alone.
Principle I: Mutual Reachability. Information should be able to flow in both directions within a small distance between real-life friends.
Principle II: Friendship Retainability.
[Figures: Figure 2, Friendship Retainability; Figure 3, Community Affinity]
35.
Principle II: Friendship Retainability. The size of a user's offline community has an upper-bound threshold σ related to Dunbar's number.
Principle III: Community Affinity. A user's offline friends usually group into clusters within which members know each other.
[Figures: Figure 2, Friendship Retainability; Figure 3, Community Affinity]
36. Big Data and Financial Innovation: From Research to Practice
Research topic: mining offline relationships
Figure 6: Case study of a user's follow network.
5. EXPERIMENTAL STUDY
An implementation of our algorithm as a demo system, TwiCube, is publicly available.
5.1 Case Study
We now present a case study on a real user X who participated in our evaluation. X has 107 followers and follows 385 other users. Figure 6 illustrates the discovery of his core community in a total of 4 iterations, each indicated by a different color. In summary, 34 users are identified in Iteration 1, 19 in Iteration 2, 3 in Iteration 3 and only one user in the last iteration. The precision and recall for this result of X's core community are 0.8947 and 0.9807 respectively. It can be observed from Figure 6 that there is a dense cluster of core community members heavily linked among one another (lower left to X) and another such cluster of non-core-community users similarly linked (upper right to X). This shows that approaches based on dense subgraph mining or structural clustering would have a hard time distinguishing between these two similarly structured communities and, consequently, identifying the true core community. In fact, this cluster of non-core-community users consists of media, business and active Twitter users sharing similar interests and topics, which is a good indicator of X's own.
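The precision and recall quoted in the case study follow the usual set-based definitions. The sketch below is only an illustration: the source gives the 57 identified users (34 + 19 + 3 + 1) but not the underlying counts, so the figures of 51 correct users and 52 true core members are hypothetical values chosen because they agree with the reported numbers to within rounding.

```python
def precision_recall(predicted: set, actual: set) -> tuple[float, float]:
    """Set-based precision and recall of a predicted community."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Users identified over the four iterations reported in the case study.
identified = 34 + 19 + 3 + 1

# Hypothetical ground truth (NOT stated in the source): 52 true core
# members, 51 of whom are among the identified users.
predicted = set(range(identified))      # user ids 0..56
actual = set(range(51)) | {1000}        # 51 hits plus one missed member

precision, recall = precision_recall(predicted, actual)
print(f"identified={identified} precision={precision:.4f} recall={recall:.4f}")
```

With these assumed counts, precision is 51/57 and recall is 51/52.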
ity with X. This case would fail the naive approa
identify core community members by two-way f
In (b), we show the follow networks between X
community member Y , who is discovered in Iter
this case, X follows Y but Y does not follow X. M
is not until more core community members have
tified at Iteration 1 and 2 that Y ’s sophisticated c
with the core community are revealed. In this
by unleashing the power of iterated core commu
fication, our algorithm is still able to correctly id
5.2 Effectiveness
One naive method to identify the core commun
get user u is to find the set of users who have dire
follow links with u, i.e., they and u follow each ot
rect two-way follow links provide good indication
real-world friendship? Our experiments suggest
links are not sufficient. In Figure 7 we show the
on the distribution (among the 65 user evaluati
cision, recall and F score between our algorithm
the naive algorithm. In general our solution o
the naive solution by a large margin. To conduc
tailed comparison between the two methods, l
Figure 5: Core Community Discovery
RWR and closeness score between a user node i a
as follows.
$r_{i,S} = \sum_{j \in S} r_{i,j}$
$r_{S,i} = \sum_{j \in S} r_{j,i}$
$c_{i,S} = c_{S,i} = r_{i,S} \cdot r_{S,i}$
Given a user node i, the probability transition
Approach
Figure 1: Three Types of Core Community Members.
We now show how these three principles help us identify core community members of different kinds. Based on our study, we categorize a user's follow network based on three attributes, each of which reflects one of the above-mentioned principles. Note that these attributes and their corresponding parameters are proposed for the categorization only; none of them is actually computed in our algorithm. Suppose the target user is u and the user in consideration is v.
(I) Mutual Following. The first attribute is whether u and v directly follow each other. There are two cases: (I) u and v follow each other, i.e., $v \in N^1_{u\leftarrow} \cap N^1_{u\rightarrow}$. We call this a two-way follow case. (II) Either u follows v or v follows u, but not both, i.e., $v \in (N^1_{u\leftarrow} \cup N^1_{u\rightarrow}) \setminus (N^1_{u\leftarrow} \cap N^1_{u\rightarrow})$. We call this a one-way follow case. Principle 1 is immediately satisfied in a two-way follow case as tweets of both u and v are delivered directly to each other, while in a one-way follow case, computation considering the k-hop neighborhood of u is necessary to determine the satisfiability of Principle 1.
(II) Friendship Exclusivity. The second attribute is the larger one between $|F_{u\leftarrow}|$ and $|F_{u\rightarrow}|$. For simplicity, we use $|F_{u\leftarrow}|$ to illustrate, while the analysis with $|F_{u\rightarrow}|$ can be done similarly. This attribute indicates the number of other users whom u is interested in hearing about. In general, this
• Random Walk with Restart
• Closeness Score
• Iterative Off-line Community Discovery
– Off-line community is discovered by iterations.
– A virtual user node is used as the threshold to cut for each iteration.
ose our algorithm based on the idea of random walk
art(RWR). RWR has been successfully used to mea-
relevance score between two nodes in a weighted
3, 9, 2, 12]. It is defined in [9] with the following
$\vec{r}_i = (1 - c)\,\tilde{W}\,\vec{r}_i + c\,\vec{e}_i$ (1)
tting, given a weighted graph, a particle starts from
d conducts random movement. It transmits to the
hood of its current node with a probability propor-
the edge weights. At each step, the particle also
o the start node i with some probability c. The
score of node j with respect to i is defined as the
ate probability ri,j that the particle finally stays at
roblem setting, given the Twitter network G =
target user u ∈ V and a number k, we focus on
raph Gk
u induced by Nk
u , which is simplified as Gu
s fixed. A probability transition matrix W is de-
Gu(V ) such that, for two nodes v, w ∈ Gu(V ), the
puted iteratively and it finally converges to
)−1
⃗ei [9]. When it converges, the steady-state
tor ⃗ri reflects the bandwidth of information
from user i to user j for every j ∈ Gu(V ).
eady-state probability to define the closeness
wo users i and j:
$c_{i,j} = r_{i,j} \cdot r_{j,i}$ (3)
score thus defined satisfies Principle (I). It
ng desirable properties, the proofs of which
e to space limit.
1. Given a Twitter follow network G(V, E)
i, j ∈ V , ci,j is symmetric, i.e., ci,j = cj,i.
Property 2. Given a Twitter follow network G(V, E), two users i, j ∈ V and k, $c_{i,j} > 0$ if and only if i and j satisfy Principle 1, that is, $i \in N^k_{j\rightarrow} \cap N^k_{j\leftarrow}$ and $j \in N^k_{i\rightarrow} \cap N^k_{i\leftarrow}$, i.e., tweets originating from either user i or j should be able to reach the other one in k hops.
Property 3. Given a Twitter follow network G(V, E), two users i, j ∈ V and k, obtain a node j′ by removing a set S of users from j's immediate neighborhood such that for each v ∈ S, either $v \in F_{j\rightarrow} \setminus N^k_{i\leftarrow}$ or $v \in F_{j\leftarrow} \setminus N^k_{i\rightarrow}$. We have $c_{i,j} \le c_{i,j'}$.
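Properties 1 and 2 can be checked on a toy graph. The sketch below iterates Eq. (1) until it settles and then applies Eq. (3); solving Eq. (1) algebraically gives the same fixed point, $\vec{r}_i = c\,(I - (1 - c)\tilde{W})^{-1}\vec{e}_i$. Everything not in the source is an assumption made for illustration: uniform edge weights, a restart probability of 0.15, the edge direction convention, the names `rwr` and `closeness`, and the three-node graph.

```python
def rwr(out_links, start, c=0.15, iters=200):
    """Random walk with restart from `start`, by iterating Eq. (1).

    out_links[n] lists the nodes a walker at n may move to; every node
    must appear as a key. Returns {j: r_{start,j}}.
    """
    r = {n: 1.0 if n == start else 0.0 for n in out_links}
    for _ in range(iters):
        nxt = {n: (c if n == start else 0.0) for n in out_links}
        for n, targets in out_links.items():
            if targets:  # a node without out-links simply drops its walking mass
                share = (1.0 - c) * r[n] / len(targets)
                for t in targets:
                    nxt[t] += share
        r = nxt
    return r


def closeness(out_links, i, j):
    """Eq. (3): c_{i,j} = r_{i,j} * r_{j,i}."""
    return rwr(out_links, i)[j] * rwr(out_links, j)[i]


# Toy graph (hypothetical): a and b can reach each other, c cannot reach a.
toy = {"a": ["b", "c"], "b": ["a"], "c": []}

print(closeness(toy, "a", "b"))  # positive: a and b are mutually reachable
print(closeness(toy, "a", "c"))  # zero: nothing flows from c back to a
```

Because the score is a product of the two directed relevance values, it is symmetric by construction and vanishes as soon as either direction is unreachable.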
Figure 2: Core Community Discovery
closeness score between u and all the rest users, t
we compute the closeness score between ˜u and eve
user. From the ranking list thus generated, if any us
ahead of ˆv in this iteration, the user will be adde
core community of u, which ends this iteration. So
so forth. Figure 2 illustrates the process. The targ
is shown in red in the center and the auxiliary dum
ˆv is shown in purple. In iteration 1, the core comm
just u itself, which is indicated by the shaded circle
u. The highlighted blue nodes and follow links re
Fu← Fu→. After computing the closeness score cu
v, three users are found to be ahead of ˆv in the
ranking list. They are therefore added to the core
nity, indicated by their color changed from blue to
In iteration 2, we use the new core community ˜u, c
now of 4 users, to compute the closeness scores c˜u
rest nodes v. Those ranked ahead of ˆv will be adde
core community. The iterations continue until no n
can be added to the core community, ending the al
As the virtual user node ˜u is actually a set, we no
RWR and closeness score between a user node i and
as follows.
$r_{i,S} = \sum_{j \in S} r_{i,j}$
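The iterative expansion described here can be sketched as follows. This is a simplified illustration, not the authors' implementation: the source ranks users against an auxiliary virtual node, whose construction is cut off in this extract, so the sketch swaps in a fixed numeric `threshold` as a stand-in. The function names, the threshold value, the restart probability and the toy graph are all hypothetical.

```python
def rwr(out_links, start, c=0.15, iters=200):
    """Random walk with restart from `start`; returns {j: r_{start,j}}."""
    r = {n: 1.0 if n == start else 0.0 for n in out_links}
    for _ in range(iters):
        nxt = {n: (c if n == start else 0.0) for n in out_links}
        for n, targets in out_links.items():
            if targets:
                share = (1.0 - c) * r[n] / len(targets)
                for t in targets:
                    nxt[t] += share
        r = nxt
    return r


def discover_core(out_links, u, threshold, max_rounds=10):
    """Grow the core community of u one iteration at a time.

    A user v joins when c_{v,S} = r_{v,S} * r_{S,v} exceeds `threshold`,
    where S is the current core. The threshold is an assumed stand-in
    for the virtual-node cut used in the source.
    """
    R = {n: rwr(out_links, n) for n in out_links}   # R[i][j] = r_{i,j}
    core, rounds = {u}, []
    for _ in range(max_rounds):
        added = set()
        for v in out_links:
            if v in core:
                continue
            r_v_S = sum(R[v][j] for j in core)
            r_S_v = sum(R[j][v] for j in core)
            if r_v_S * r_S_v > threshold:
                added.add(v)
        if not added:
            break
        core |= added
        rounds.append(added)
    return core, rounds


# Toy graph (hypothetical): u and a are linked both ways, a and b are
# linked both ways, and u reaches m but m never reaches anyone.
graph = {"u": ["a", "m"], "a": ["u", "b"], "b": ["a"], "m": []}
core, rounds = discover_core(graph, "u", threshold=0.01)
print(core, rounds)
```

In this toy run, b is too far from u alone to pass the cut in the first iteration, but joins in the second once a is part of the core, which mirrors the behaviour the text describes for members discovered in later iterations. A fixed threshold is cruder than a ranking cut, since the summed scores grow with the size of the core.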
the naive approach respectively. The result shows that for most users, our solution outperforms the naive solution in both precision and recall. In particular, in two cases, the difference is even close to 1. There is only a single case in which our algorithm is outperformed in both precision and recall.
5.4 On Ranking
compare2(v1, v2) =
⎪⎩
compare1(v1,
−1,
Which one is better? We evaluate
computing their AUC value for eac
tions of the AUC values are showed i
shows that for both rankings, more
values are greater than 0.9 and more
Figure 7: AUC comparison for rankings with and without incorporating iteration information.
values are greater than 0.8. The right graph in Figure 7 shows that in most cases, the ranking with iteration information incorporated is superior to the ranking based solely on closeness score. This demonstrates that core community information helps the ranking.
5.5 On Iteration
It has been observed in our experiments that the core community discovery process ends after a few iterations. One interesting question is whether core community members identified in later iterations are as good as those found in earlier iterations. If we set a maximum number of iterations allowed in the algorithm to force termination, will the result give better precision and recall? Our experiments suggest a negative answer. Figure 8 shows the average precision, recall and F-score for a maximum number of iterations varied from 1 to 10, as well as unlimited. As the maximum number of iterations allowed increases, average precision drops slightly, but recall improves significantly, and so does the F-score. Intuitively, earlier iterations tend to capture the members closest to the target user, which results in a higher precision yet at the cost of missing many other core community members with more sophisticated social connections with the target user. By setting no maximum number of iterations and allowing the core community itself to take shape, a much greater gain in recall can be achieved, offering a better result overall. In most cases, core communities stabilize after 5 or 6 iterations, as shown in Figure 9, which presents the distribution of the number of iterations over all our evaluation participants.
5.6 Modeling User Interests
How to model user interests is of critical imp
tent recommendation and linkage prediction in
Furthermore, our study reveals that core com
ery could significantly enhance user interest m
following two aspects: (I) For a target user u
munity members themselves are less informa
terizing u’s interests than the rest user node
network. u follow them mostly because they a
life friends anyway. On the other hand, it is s
or topics that drive u to follow other non-c
users. As such, when investigating u’s inte
step is to distinguish u’s core community fro
low network. (II). Although the core commu
themselves may not necessarily reflect u’s i
users followed by these core community mem
less could help understand u’s interests, e.g
could follow media/celebrity/business users o
In our experiments, we identify and hire thre
users, A,B and C to help us evaluate. The g
that A and B share much more similar profi
interests, background and life-style than A an
if we check the common non-core-community
by A and B, they have 15 such users in co
in Figure 11), while A and C have 18 in co
in Figure 12). This means that, without th
community, C could be considered more sim
Application Example: User Interest Pro
Figure 11: Interest profile comparison for A and B Figure 12: Interest profile compari
bi-directional way and relies on no other attribute informa- to predict link strength in online soci
Parameters
• On # of Iterations
• On Robustness
Figure 8: The result for limiting the max # of iterations allowed.
Figure 9: The distribution of # of iterations.
Figure 10: R
B, contradicting the truth. In fact, we can use the core community to remedy the situation. Similar to the idea of TF-IDF [11], for target user u, we use the following formula to compute the weight for each non-core-community user v:
$w_u(v) = \frac{|F_{v\rightarrow} \cap C_u|}{|C_u| \, \log |F_{v\rightarrow}|}$ (9)
As such, for a target user u, we obtain a vector $\vec{x}_u$ where each dimension is one non-core-community member. For two target users $u_1$ and $u_2$, we compute the similarity between their interest profiles as $Sim(u_1, u_2) = \frac{\vec{x}_{u_1} \cdot \vec{x}_{u_2}}{|\vec{x}_{u_1}|\,|\vec{x}_{u_2}|}$. In Figure 11 and Figure 12, we show the relative ratio between user A and B, where the percent for user A on dimension v is computed by $\frac{w_A(v)}{w_A(v) + w_B(v)}$, and $\frac{w_B(v)}{w_A(v) + w_B(v)}$ for user B. Now if we com-
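A minimal sketch of the weighting and similarity computation, under stated assumptions. Formula (9) is garbled in this extract, so the code assumes the reading "share of u's core community that follows v, divided by the log of v's total follower count", and it assumes that $F_{v\rightarrow}$ denotes the set of users who follow v. The function names and all data are hypothetical.

```python
import math


def interest_weights(core: set, followers: dict) -> dict:
    """w_u(v) for every non-core user v, under the assumed reading of Eq. (9).

    followers[v] is the set of users following v (an assumption about the
    notation F_v in the source).
    """
    weights = {}
    for v, f_v in followers.items():
        if v in core or len(f_v) < 2:
            continue  # log|F_v| must be positive
        weights[v] = len(f_v & core) / (len(core) * math.log(len(f_v)))
    return weights


def cosine(x: dict, y: dict) -> float:
    """Cosine similarity of two sparse interest vectors."""
    dot = sum(x[k] * y[k] for k in x.keys() & y.keys())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


# Hypothetical data: two accounts followed from two different core communities.
followers = {
    "niche_blog": {"a1", "a2", "z"},
    "big_media": {"a1", "a2"} | {f"fan{i}" for i in range(1000)},
}
w_a = interest_weights({"a1", "a2", "a3"}, followers)
w_b = interest_weights({"fan1", "b2"}, followers)

print(w_a)
print(cosine(w_a, w_b))
```

As with TF-IDF, the log term discounts accounts that almost everyone follows: here the niche account gets a higher weight than the mass-media account even though the same two core members follow both.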
of SNS and real-life social networks. [14]
book has influenced the establishment o
lationships. Another related direction is
real-life friendship or relationship stren
work using hyperlinks and text informat
predict relationships between individua
further information including network to
tions to predict relationship strength.
the same problem with a link-based late
While the relationship between a user’
social network has been investigated in
Facebook, few studies have so far pose
on Twitter network. More importantly
Facebook, Twitter has two important d
tics — (I) As shown in [8], Twitter fun
of news media and social network comb
both. (II) Follow links on Twitter are
of SNS and real-life social networks. [14] looked
book has influenced the establishment of new
lationships. Another related direction is to us
real-life friendship or relationship strength.
work using hyperlinks and text information on
predict relationships between individuals. [6,
further information including network topolog
tions to predict relationship strength. [17] ha
the same problem with a link-based latent va
While the relationship between a user’s onlin
social network has been investigated in stand
Facebook, few studies have so far pose the sa
on Twitter network. More importantly, com
Facebook, Twitter has two important differen
tics — (I) As shown in [8], Twitter functions
of news media and social network combining
Figure 6: Case study of a user's follow network.
A real Twitter user:
§ Following 385 users
§ Followed by 107 users
37. Big Data and Financial Innovation: From Research to Practice
Research topic: mining offline intimate relationships
Problem: Given a user's tweets, identify all interpersonal relationships that involve physical or emotional intimacy, such as family members, husband and wife, romantic relationships, etc.
Example:
§ Intimate expressions
§ "honey", "baby", "dear", "my dear wife", …
§ Occasions/Events
§ Valentine's Day, anniversary, Father's Day, birthday, …
§ Intimacy-related named entities
§ Resort hotels, kids, home improvement, …
§ Screen-name correlation
§ Substring swaps
§ Similar patterns with keywords
§ Patterns with domain knowledge
Design Ideas I: Intimacy-related Entity
Use Dempster–Shafer theory to model the association degree between entities and a certain type of relationship. The final intimate relationship scores are obtained through an iterative algorithm.
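For readers unfamiliar with Dempster–Shafer theory, the sketch below shows Dempster's rule of combination, which fuses two independent pieces of evidence. It illustrates the general rule only: the slide does not say how masses are assigned to entities or how the iterative scoring works, so the frame of discernment, the mass values and the function name are hypothetical.

```python
from itertools import product


def combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule for two mass functions keyed by frozenset focal elements."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical frame: is the relationship with a mentioned user intimate?
INTIMATE = frozenset({"intimate"})
FRAME = frozenset({"intimate", "other"})   # "don't know"

# Hypothetical evidence from two tweets mentioning the same user.
anniversary = {INTIMATE: 0.6, FRAME: 0.4}
resort_hotel = {INTIMATE: 0.5, FRAME: 0.5}

fused = combine(anniversary, resort_hotel)
print(fused[INTIMATE], fused[FRAME])
```

Two weak signals that point the same way reinforce each other: with the assumed masses, the fused belief in an intimate relationship is higher than either piece of evidence gives on its own.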
Design Ideas II:
Exclusivity of “@” to identify relationship candidates
38. Big Data and Financial Innovation: From Research to Practice
Cross-platform user identity linkage for external data
[Figure 3: HYDRA framework. Panels: Linkage Information Collection (usernames, profiles, photos, tweets/retweets, trajectories, ... of unlinked identities); Step 1: Heterogeneous Behavior Modeling; Step 2: Structure Information Modeling; Step 3: Multi-objective Optimization, $\min_w [F_1(w), F_2(w), \ldots, F_M(w)]$, giving the linkage function $f_w$ applied to unknown identities.]
Figure 4: The workflow of
A face detector is employe
profile images. Then a pre-
fidence score in [0, 1] indica
to one person.
attributes used in the matchi
set by probabilistic modeling
Specifically, given a set o
• Nodal attributes (numeric, categorical)
– Demographics, location, personal interest, etc.
• User Generated Content (topics, sentiments)
– Reviews, tweets, ratings, multimedia, etc.
• Social network (snapshot/static view)
– Friend network, followers/followees network, communities/interest groups, etc.
• Behavior trajectory (dynamic, evolutionary)
– Content sharing history, social interaction pattern, network formation, etc.
39. Big Data and Financial Innovation: From Research to Practice
Cross-platform user identity linkage for external data
• People's closest friends are similar across different social platforms.
• Behavior similarity aggregation of the most frequently interacting friends of users provides insights into user identity linkage.
• Supervised Learning
• Structure Consistency Modeling
• Multi-objective Optimization
A two-class classification problem: construct a multi-objective optimization which jointly optimizes the prediction accuracy on the labeled user pairs and multiple structure consistency measurements across different platforms.
45. Big Data and Financial Innovation: From Research to Practice
Extract social-dimension credit features and add them to existing traditional credit models
Fid  Feature Name        Pearson Correlation  χ² Statistics
1    Gender              4.45 × 10^-2         14.27*
2    Age                 1.92 × 10^-2         16.28*
3    Verified            5.128 × 10^-2        17.02*
4    Education           4.18 × 10^-3         0
5    Location            4.81 × 10^-2         16.68*
6    Occupation          2.244 × 10^-2        0.137
7    Registration time   6.944 × 10^-2        39.44*
* Passes the significance test at the confidence level of 95%.
Table 5: Pearson correlation and χ² statistics evaluation for demographic features
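The two statistics reported in these tables can be computed as in the sketch below. The data is a made-up sample of 100 users with one binary feature and a binary good/bad credit label; it is not the study's data, and the function names are hypothetical. The 3.841 cut-off is the standard 95% critical value of the χ² distribution with one degree of freedom, which applies to a 2×2 table; the degrees of freedom behind the table above are not stated in the source.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def chi_square(xs, ys):
    """Pearson chi-square statistic of the contingency table of xs versus ys."""
    n = len(xs)
    joint, cx, cy = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    stat = 0.0
    for a in cx:
        for b in cy:
            expected = cx[a] * cy[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat


# Hypothetical sample: (feature, label) pairs with their counts.
counts = {(1, 1): 30, (1, 0): 20, (0, 1): 20, (0, 0): 30}
feature = [f for (f, _), k in counts.items() for _ in range(k)]
label = [l for (_, l), k in counts.items() for _ in range(k)]

r = pearson(feature, label)
chi2 = chi_square(feature, label)
significant = chi2 > 3.841  # 95% critical value, 1 degree of freedom
print(r, chi2, significant)
```

The toy sample also shows why a small correlation can still pass the test: for a 2×2 table the statistic equals the sample size times the squared correlation, so it grows with the number of users.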
[Plots: ImportanceValue versus Fid, one panel for Fid 1 to 7 and one for Fid 1 to 10.]
Fid  Feature Name              Pearson Correlation  χ² Statistics
1    Length                    5.546 × 10^-2        48.04*
2    Containing images         4.149 × 10^-2        3.650
3    Containing URL            1.827 × 10^-2        58.02*
4    Containing HashTag        3.422 × 10^-2        2.376
5    Containing only mentions  6.114 × 10^-2        21.63*
6    Containing only emotions  5.504 × 10^-2        9.475*
7    Grant of "badges"         2.212 × 10^-2        6.449*
8    Commercial purpose        1.134 × 10^-2        2.026
9    N. B. based prob.         7.716 × 10^-2        25.76*
10   Topic distributions       5.370 × 10^-2        39.44*
* Passes the significance test at the confidence level of 95%.
Table 6: Pearson correlation and χ² statistics evaluation for microblog features
lowees and #followees
test at confidence level o
parison in Figure 5 (d)
more important features
This phenomenon shows
degree features are inform
predictions in different w
4. EXPERIMEN
4.1 Experiment
Data Sets.
Description
#user of good cr
#user of bad cre
Total Number o
#Microblogs by
#Microblogs by
Total number of
Size of vocabula
Fid  Feature Name           Pearson Correlation  χ² Statistics
1    Near Duplicate         2.740 × 10^-2        2.642
2    Retweet Chain          9.200 × 10^-2        53.05*
3    Plain Retweet          3.374 × 10^-2        34.61*
4    Emoticon behavior      8.637 × 10^-2        25.68*
5    Mention behavior       6.236 × 10^-2        28.10*
6    Posting time           5.162 × 10^-2        61.06*
7    Metaphysical power     4.370 × 10^-2        0.660
8    Active level           4.770 × 10^-2        31.77*
9    Sentiment word(+)      4.240 × 10^-2        0.380
10   Sentiment word(-)      5.063 × 10^-2        0.092
11   Sentiment polarity(+)  2.602 × 10^-2        4.851
12   Sentiment polarity(-)  9.272 × 10^-3        2.268
* Passes the significance test at the confidence level of 95%.
Table 7: Pearson correlation and χ² statistics evaluation for behavior features
ing time are especially important since their χ² statistics are all considerably high and there are 24 different features of this kind. Figure 5 (c) shows the feature importance when behavior features are used as input for the GBDT model. Their importance values are all comparable with each other, and the low importance values also validate the intuition that behavior information reflects a user's credit risk only indirectly and to a limited extent. Although the importance of each feature is not very high as a whole, the combination of so many predictive behavior features also demonstrates very high performance, as will be shown in the e
3.5.4 Network Features
Fid Feature Name P
1 #followees
2 #followers
3 #friends
4 #friends/#followees
5 #followers+#followees
6 Aggregated feature 1
7 Aggregated feature 4
8 Betweenness Centrality
∗ Passes the significance test at t
Table 8: Pearson correlation an
network features
Table 8 and Figure 5 (d) presen
network features proposed in Sect
tures’ correlation value and χ2
st
list all of them in the table. Amon
• Posting time distribution
• Mobile device
• Check-in location distribution
• Time span of check-in locations
[Plot: Accuracy versus Number of Features (1 to 21); accuracy axis from 0.52 to 0.62.]
51. Big Data and Financial Innovation: From Research to Practice
Challenges and open questions in using social big data for financial innovation
• The "CANNOTs (or SHOULD-NOTs)": the boundaries and frontiers
– Privacy
• How to provide non-intrusive yet personalized customer service?
• Where is the boundary between public and private data?
– Ownership
• Who should own the data shared on various platforms?
• How to split profit from the data?
– Valuation
• How to assess value for different data sets?
• How to promote and regulate data exchange among parties?