Big Data Powering Financial Innovation

1. Big Data and Financial Innovation: From Research to Practice
朱飞达 Feida Zhu
Assistant Professor, DS Lee Foundation Fellow, School of Information Systems, Singapore Management University
Founding Director, Pinnacle Lab for Analytics; DBS-SMU Lab for Life Analytics, Singapore Management University
Dec. 11, 2015

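2. Enterprise pain points
1. With the economy slowing and competitive pressure mounting, the financial industry can no longer settle for traditional passive service. It needs a full understanding of its users so that efficient, convenient financial services reach every everyday scenario (healthcare, food, housing, travel, leisure, education), improving the user experience and embedding products and services in every facet of users' lives.
2. Internal enterprise data gives only a limited understanding of users. Sources of user data are insufficient, and the legality, sustainability and privacy protection of acquiring it are causes for concern.
3. Channels for the "last mile" to the user are scarce, and traditional marketing (such as cold calls) is increasingly ineffective, which makes it hard to close the loop from data collection through analysis to marketing.
One angle on financial innovation: life is finance.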

3. The three major values of big data
–  Insight from scale (VOLUME)
   •  What can big data tell us that small data cannot?
–  Knowledge from enrichment (VARIETY)
   •  What important knowledge can we learn from enriching small data with big data?
–  Agility from real-time responsiveness (VELOCITY)
   •  What are the values of being real-time?

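4. What value can external big data actually offer an enterprise?
Internal enterprise data is usually based only on transaction records (transaction-based):
•  Limited volume and coverage
•  Reflects only fragments of the user's life (fragmented, partial perspective)
•  Static, low frequency
•  An isolated, single customer view that sees only the individual
External social media big data reveals the context around transactions (context-based):
•  Massive, societal-scale coverage
•  Multi-angle, panoramic insight into users (multi-facet insight)
•  Dynamic, real-time, high frequency
•  Takes rich, real social relationships into account (network-embedded user view)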

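5. Three "people-centered" connections
1. Cross-platform user identity unification on external data: proprietary algorithms identify the different accounts that the same user holds on different external data platforms (even under different usernames) and merge each platform's data into a single user at the level of the natural person. We currently hold data on more than 150 million Chinese users across multiple core platforms.
2. Matching internal and external user data: establish identity matches between the enterprise's internal customers and the users of external data platforms.
3. A big-data-based, 360-degree dynamic customer view: provide a panoramic profile of the enterprise's internal customers, dynamically track their lives and latent needs, and catch the best moment and manner for sales and service in time.
Diagram labels: user interest profile, real-life social network, product propensity model, internal enterprise data.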

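6. Application case: precision marketing
§  Massive data
§  A huge candidate space of 200 million potential customers
§  Application scenario: insurance agents must contact large numbers of users every day to sell various policies. How can they promptly find the target customers and the insurance product that best fits each one?
§  Precise targeting: accurate customer profiles built on big data mining, natural language processing and network structure analysis
§  Timely delivery: monitor customers dynamically, define the best moment for marketing, and respond to latent needs in real time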

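7. Application case: precision marketing
Question: "I need to contact 50 potential customers today to sell children's insurance. Tell me which numbers to call. Why these people?"
Answer: "These are the people you should call today! They care most about children, their interest has been rising recently, and now is the best time!"
From the huge candidate space of more than 150 million potential customers, rank in real time and select the top 50:
(1) Precise targeting: using natural language processing and text mining, the machine automatically tags users' text, for example with "child".
(2) Relationship network analysis: based on offline relationship mining, the people around her and her close friends also care a great deal about children.
(3) Timely delivery: monitor customers dynamically, detect rising interest, and respond to latent needs at the best moment for marketing.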
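The three steps on this slide (tagging, network analysis, timing) can be read as a ranking problem. The sketch below is only an illustration under assumed inputs: the `Prospect` fields, the weights and the linear `lead_score` are hypothetical and are not taken from the deck, which does not say how the ranking is computed.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Prospect:
    user_id: str
    tag_score: float  # (1) how strongly the user's own posts match a tag such as "child", in [0, 1]
    friend_scores: list = field(default_factory=list)  # (2) the same score for close offline friends
    recent_growth: float = 0.0  # (3) recent change in tag_score; may be negative


def lead_score(p, w_self=0.5, w_friends=0.3, w_trend=0.2):
    """Hypothetical linear score combining own interest, friends' interest and trend."""
    friends = sum(p.friend_scores) / len(p.friend_scores) if p.friend_scores else 0.0
    trend = max(p.recent_growth, 0.0)  # only rising interest adds to the score
    return w_self * p.tag_score + w_friends * friends + w_trend * trend


def top_prospects(prospects, k=50):
    """Return the k highest-scoring prospects without sorting the whole candidate set."""
    return heapq.nlargest(k, prospects, key=lead_score)


a = Prospect("a", 0.9, [0.8, 0.6], 0.3)
b = Prospect("b", 0.9, [], -0.2)
c = Prospect("c", 0.2, [0.1], 0.0)
print([p.user_id for p in top_prospects([c, b, a], k=2)])
```

`heapq.nlargest` keeps only `k` candidates in memory at a time, which suits a call list of 50 drawn from a very large pool.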

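8. Application case: relationship marketing and risk control
§  Application scenario: banks constantly watch two kinds of people, high-risk customers and high-net-worth customers. How can the relationships between customers be used to follow the trail to other potentially related customers?
§  Massive offline relationship network
§  A huge graph of 300 million people and 6 billion relationship edges
§  Offline relationships: follow close offline ties to find other related target customers
§  Precise customer profiles: accurate profiles based on big data mining and natural language processing (example tags: luxury cars, golf, yachts, gambling, high value)
§  AI mining
§  Automatically mine users' real offline relationship networks from their external big data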

9. Research topic: offline relationship mining
Problem: Given a Twitter follow network of a target user, identify the user's offline community by examining the follow linkage alone.
Principle I: Mutual Reachability. Information should be able to flow in both directions within a small distance between real-life friends. (Figure 1. Mutual Reachability.)
Principle II: Friendship Retainability. The size of a user's offline community has an upper-bound threshold σ related to Dunbar's number. (Figure 2. Friendship Retainability.)
Principle III: Community Affinity. A user's offline friends usually group into clusters within which members know each other. (Figure 3. Community Affinity.)

36. Research topic: offline relationship mining

A real Twitter user:
§  Following 385 users
§  Followed by 107 users

Approach
§  Random Walk with Restart
§  Closeness Score
§  Iterative Off-line Community Discovery
   –  The off-line community is discovered by iterations.
   –  A virtual user node is used as the threshold to cut for each iteration.

Parameters
§  On # of Iterations
§  On Robustness

Application Example: User Interest Profile

Excerpts from the accompanying paper shown on the slide:

Figure 1: Three Types of Core Community Members. We now show how these three principles help us identify core community members of different kinds. Based on our study, we categorize a user's follow network by three attributes, each reflecting one of the above-mentioned principles. Note that these attributes and their corresponding parameters are proposed for the categorization only; none of them is actually computed in our algorithm. Suppose the target user is u and the user under consideration is v.

(I) Mutual Following. The first attribute is whether u and v directly follow each other. There are two cases. (1) u and v follow each other, i.e., $v \in N^1_{u\leftarrow} \cap N^1_{u\rightarrow}$. We call this a two-way follow case. (2) Either u follows v or v follows u, but not both, i.e., $v \in (N^1_{u\leftarrow} \cup N^1_{u\rightarrow}) \setminus (N^1_{u\leftarrow} \cap N^1_{u\rightarrow})$. We call this a one-way follow case. Principle 1 is immediately satisfied in a two-way follow case, as tweets of both u and v are delivered directly to each other, while in a one-way follow case, computation considering the k-hop neighborhood of u is necessary to determine whether Principle 1 is satisfied.

(II) Friendship Exclusivity. The second attribute is the larger of $|F_{u\leftarrow}|$ and $|F_{u\rightarrow}|$. For simplicity, we use $|F_{u\leftarrow}|$ to illustrate; the analysis with $|F_{u\rightarrow}|$ is similar. This attribute indicates the number of other users u is interested in hearing about.

Random walk with restart. We propose our algorithm based on the idea of random walk with restart (RWR). RWR has been successfully used to measure the relevance score between two nodes in a weighted graph [3, 9, 2, 12]. It is defined in [9] with the following equation:

$$\vec{r}_i = (1 - c)\,\tilde{W}\,\vec{r}_i + c\,\vec{e}_i \qquad (1)$$

In this setting, given a weighted graph, a particle starts from node i and conducts random movement. It transmits to the neighborhood of its current node with a probability proportional to the edge weights. At each step, the particle also returns to the start node i with some probability c. The relevance score of node j with respect to i is defined as the steady-state probability $r_{i,j}$ that the particle finally stays at node j. In our problem setting, given the Twitter network G = (V, E), a target user u ∈ V and a number k, we focus on the subgraph $G^k_u$ induced by $N^k_u$, which is simplified as $G_u$ when k is fixed. A probability transition matrix W is defined on $G_u(V)$. The vector $\vec{r}_i$ is computed iteratively and finally converges to

$$\vec{r}_i = c\,\bigl(I - (1 - c)\,\tilde{W}\bigr)^{-1}\,\vec{e}_i \qquad (2)$$

[9]. When it converges, the steady-state probability vector $\vec{r}_i$ reflects the bandwidth of information from user i to user j for every j ∈ $G_u(V)$. We use the steady-state probability to define the closeness score between two users i and j:

$$c_{i,j} = r_{i,j} \cdot r_{j,i} \qquad (3)$$

The closeness score thus defined satisfies Principle (I). It has the following desirable properties, the proofs of which are omitted due to space limits.

Property 1. Given a Twitter follow network G(V, E) and two users i, j ∈ V, $c_{i,j}$ is symmetric, i.e., $c_{i,j} = c_{j,i}$.

Property 2. Given a Twitter follow network G(V, E), two users i, j ∈ V and k, $c_{i,j} > 0$ if and only if i and j satisfy Principle 1: $i \in N^k_{j\rightarrow} \cap N^k_{j\leftarrow}$ and $j \in N^k_{i\rightarrow} \cap N^k_{i\leftarrow}$, i.e., tweets originating from either user i or j should be able to reach the other one in k hops.

Property 3. Given a Twitter follow network G(V, E), two users i, j ∈ V and k, let j′ be the node obtained by removing from j's immediate neighborhood a set S of users, where the condition on each v ∈ S is stated in terms of $F_{j\rightarrow}$ and $N^k_{i\leftarrow}$, or of $F_{j\leftarrow}$ and $N^k_{i\rightarrow}$. Then $c_{i,j} \le c_{i,j'}$.

Figure 2 (numbered Figure 5 in another version of the excerpt): Core Community Discovery. We compute the closeness score between the virtual user node $\tilde{u}$ and every other user. From the ranking list thus generated, any user ranked ahead of $\hat{v}$ in this iteration is added to the core community of u, which ends this iteration, and so forth. The target user is shown in red in the center and the auxiliary dummy user $\hat{v}$ is shown in purple. In iteration 1, the core community is just u itself, indicated by the shaded circle around u. The highlighted blue nodes and follow links represent $F_{u\leftarrow}$ and $F_{u\rightarrow}$. After computing the closeness score $c_{u,v}$ for each v, three users are found to be ahead of $\hat{v}$ in the ranking list. They are therefore added to the core community, indicated by a change of color. In iteration 2, we use the new core community $\tilde{u}$, now consisting of 4 users, to compute the closeness scores $c_{\tilde{u},v}$ for the remaining nodes v. Those ranked ahead of $\hat{v}$ are added to the core community. The iterations continue until no new user can be added to the core community, ending the algorithm. As the virtual user node $\tilde{u}$ is actually a set, we define RWR and the closeness score between a user node i and a set S as follows:

$$r_{i,S} = \sum_{j \in S} r_{i,j}, \qquad r_{S,i} = \sum_{j \in S} r_{j,i}, \qquad c_{i,S} = c_{S,i} = r_{i,S} \cdot r_{S,i}$$

5. EXPERIMENTAL STUDY
An implementation of our algorithm as a demo system, TwiCube, is publicly available.

5.1 Case Study
We now present a case study on a real user X who participated in our evaluation. X has 107 followers and follows 385 other users. Figure 6 (Case study of a user's follow network) illustrates the discovery of his core community in a total of 4 iterations, each indicated by a different color. In summary, 34 users are identified in Iteration 1, 19 in Iteration 2, 3 in Iteration 3 and only one user in the last iteration. The precision and recall for this result of X's core community are 0.8947 and 0.9807 respectively. It can be observed from Figure 6 that there is a dense cluster of core community members heavily linked among one another (lower left of X) and another such cluster of non-core-community users similarly linked (upper right of X). This shows that approaches based on dense subgraph mining or structural clustering would have a hard time distinguishing between these two similarly structured communities and, consequently, identifying the true core community. In fact, this cluster of non-core-community users consists of media, business and active Twitter users sharing similar interests and topics, which is a good indicator of X's own. In (b), we show the follow network between X and a core community member Y. In this case, X follows Y but Y does not follow X, so this case would fail the naive approach of identifying core community members by two-way follow links. It is not until more core community members have been identified in Iterations 1 and 2 that Y's sophisticated connections with the core community are revealed. By unleashing the power of iterated core community identification, our algorithm is still able to identify Y correctly.

5.2 Effectiveness
One naive method to identify the core community of a target user u is to find the set of users who have direct two-way follow links with u, i.e., they and u follow each other. Do direct two-way follow links provide a good indication of real-world friendship? Our experiments suggest that such links are not sufficient. We compare the distribution (among the 65 user evaluations) of precision, recall and F-score between our algorithm and the naive algorithm. In general our solution outperforms the naive solution by a large margin. The result shows that for most users, our solution outperforms the naive solution on both precision and recall. In particular, in two cases, the difference is even close to 1. There is only one single case in which our algorithm is outperformed on both precision and recall.

5.4 On Ranking
Which ranking is better? We evaluate the rankings by computing their AUC values (Figure 7: AUC comparison for rankings with and without incorporating iteration information). The right graph in Figure 7 shows that in most cases, the ranking with iteration information incorporated is superior to the ranking based solely on closeness score. This demonstrates that core community information helps the ranking.

5.5 On Iteration
It has been observed in our experiments that the core community discovery process ends after a few iterations. One interesting question is whether core community members identified in later iterations are as good as those found in earlier iterations. If we set a maximum number of iterations allowed in the algorithm to force termination, will the result give better precision and recall? Our experiments suggest a negative answer. Figure 8 (The result for limiting the max # of iterations allowed) shows the average precision, recall and F-score for a varied maximum number of iterations, from 1 to 10 as well as unlimited. As the maximum number of iterations allowed increases, average precision drops slightly, but recall improves significantly, and so does the F-score. Intuitively, earlier iterations tend to capture the members closest to the target user, which results in higher precision at the cost of missing many other core community members with more sophisticated social connections to the target user. By setting no maximum number of iterations and allowing the core community itself to take shape, a much greater gain in recall can be achieved, offering a better result overall. In most cases, core communities stabilize after 5 or 6 iterations, as shown in Figure 9 (The distribution of # of iterations), which presents the distribution of the number of iterations over all our evaluation participants. Figure 10 covers robustness.

5.6 Modeling User Interests
How to model user interests is of critical importance to content recommendation and linkage prediction. Our study reveals that core community discovery could significantly enhance user interest modeling in the following two aspects. (I) For a target user u, the core community members themselves are less informative in characterizing u's interests than the rest of the user nodes in u's follow network: u follows them mostly because they are real-life friends anyway. On the other hand, it is shared interests or topics that drive u to follow other non-core-community users. As such, when investigating u's interests, the first step is to distinguish u's core community from the rest of the follow network. (II) Although the core community members themselves may not necessarily reflect u's interests, the users followed by these core community members nonetheless could help understand u's interests, e.g., the media, celebrity or business users they follow.
In our experiments, we identify and hire three users, A, B and C, to help us evaluate. The ground truth is that A and B share much more similar profiles in interests, background and life-style than A and C. However, if we check the common non-core-community users followed by A and B, they have 15 such users in common (Figure 11: Interest profile comparison for A and B), while A and C have 18 in common (Figure 12). This means that, without the core community, C could be considered more similar to A than B is, contradicting the truth. In fact, we can use the core community to remedy the situation. Similar to the idea of TF-IDF [11], for target user u, we use the following formula to compute the weight for each non-core-community user v:

$$w_u(v) = \frac{|F_{v\rightarrow} \cap C_u|}{|C_u| \, \log |F_{v\rightarrow}|} \qquad (9)$$

As such, for a target user u, we obtain a vector $\vec{x}_u$ where each dimension is one non-core-community member. For two target users $u_1$ and $u_2$, we compute the similarity between their interest profiles as

$$\mathrm{Sim}(u_1, u_2) = \frac{\vec{x}_{u_1} \cdot \vec{x}_{u_2}}{|\vec{x}_{u_1}|\,|\vec{x}_{u_2}|}$$

In Figure 11 and Figure 12, we show the relative ratio between user A and B, where the percentage for user A on dimension v is computed by $\frac{w_A(v)}{w_A(v) + w_B(v)}$, and by $\frac{w_B(v)}{w_A(v) + w_B(v)}$ for user B.

Related work. While the relationship between a user's online and real-life social networks has been investigated for Facebook, few studies have so far posed the same question for the Twitter network. Twitter also differs from Facebook in important ways; for example, as shown in [8], Twitter functions as a combination of news media and social network.
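The closeness score on this slide can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it iterates r = (1 − c) W r + c e as written in the slide's RWR equation, assumes uniform edge weights, reads an edge `(src, dst)` as "information flows from src to dst", drops the walking mass of nodes without outgoing edges, and skips the restriction to the k-hop subgraph. The function names and the `restart` parameter (the slide's c) are choices made here.

```python
def rwr(nodes, edges, start, restart=0.15, iters=200):
    """Power iteration for r = (1 - restart) * W r + restart * e_start."""
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    r = {n: 0.0 for n in nodes}
    r[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            if not out[n]:
                continue  # dangling node: its mass is dropped (a simplification)
            share = (1.0 - restart) * r[n] / len(out[n])
            for dst in out[n]:
                nxt[dst] += share
        nxt[start] += restart
        r = nxt
    return r


def closeness(nodes, edges, i, j, restart=0.15):
    """c_ij = r_ij * r_ji: positive only if each user can reach the other."""
    return rwr(nodes, edges, i, restart)[j] * rwr(nodes, edges, j, restart)[i]


def closeness_to_set(nodes, edges, i, members, restart=0.15):
    """c_iS = r_iS * r_Si, with r_iS and r_Si summed over the members of S."""
    r_i = rwr(nodes, edges, i, restart)
    r_is = sum(r_i[j] for j in members)
    r_si = sum(rwr(nodes, edges, j, restart)[i] for j in members)
    return r_is * r_si


nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "a"), ("a", "c")]  # c receives from a but never reaches a
print(closeness(nodes, edges, "a", "b") > 0, closeness(nodes, edges, "a", "c"))
```

In the toy graph, a and b reach each other and get a positive score, while c scores zero with a because nothing flows back from c. That is the mutual-reachability behaviour the slide attributes to the product form.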

37. Research topic: offline intimate relationship mining
Problem: Given a user's tweets, identify all interpersonal relationships that involve physical or emotional intimacy, such as family members, husband and wife, romantic relationships, etc.
Example:
§  Intimate expressions
   §  "honey", "baby", "dear", "my dear wife", …
§  Occasions/Events
   §  Valentine's Day, anniversary, Father's Day, birthday, …
§  Intimacy-related named entities
   §  Resort hotels, kids, home improvement, …
§  Screen-name correlation
   §  Substring swaps
   §  Similar patterns with keywords
   §  Patterns with domain knowledge
Design Idea I: Intimacy-related entities. Use Dempster–Shafer theory to model the association degree between entities and a certain type of relationship. The final intimate relationship scores are obtained through an iterative algorithm.
Design Idea II: Exclusivity of "@" to identify relationship candidates.
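Design Idea I names Dempster–Shafer theory without giving details. As a rough illustration of how two pieces of evidence could be merged, the sketch below implements Dempster's rule of combination over a two-element frame. The frame labels, the two evidence sources and their mass values are invented for the example, and the slide's iterative scoring algorithm is not reproduced.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections, renormalize."""
    combined = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}


INTIMATE = frozenset({"intimate"})
OTHER = frozenset({"other"})
UNKNOWN = INTIMATE | OTHER  # mass left on the whole frame expresses ignorance

# Hypothetical evidence: a tweet addressing someone as "honey",
# and a mention of the same person on Valentine's Day.
m_expression = {INTIMATE: 0.6, UNKNOWN: 0.4}
m_occasion = {INTIMATE: 0.5, OTHER: 0.1, UNKNOWN: 0.4}

m = combine(m_expression, m_occasion)
print(round(m[INTIMATE], 4))
```

Two weak signals that agree produce a belief in "intimate" higher than either alone, which is the kind of evidence accumulation the design idea calls for.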

38. Cross-platform user identity unification on external data
Figure 3: HYDRA framework.
   Linkage information collection for each identity: username, profiles, photos, tweets/retweets, trajectories, ...
   Step 1: Heterogeneous Behavior Modeling
   Step 2: Structure Information Modeling
   Step 3: Multi-objective Optimization, $\min_w \,[F_1(w), F_2(w), \ldots, F_M(w)]$, which yields the linkage function $f_w$ applied to unlinked or unknown identities
Figure 4 (caption cut off in the source): a face detector is employed on profile images, and a confidence score in [0, 1] is produced for whether they belong to one person.
Data used for linkage:
•  Nodal attributes (numeric, categorical)
   •  Demographics, location, personal interest, etc.
•  User-generated content (topics, sentiments)
   •  Reviews, tweets, ratings, multimedia, etc.
•  Social network (snapshot/static view)
   •  Friend network, followers/followees network, communities/interest groups, etc.
•  Behavior trajectory (dynamic, evolutionary)
   •  Content sharing history, social interaction pattern, network formation, etc.

39. Cross-platform user identity unification on external data
•  People's closest friends are similar across different social platforms.
•  Aggregating the behavior similarity of a user's most frequently interacting friends provides insights into user identity linkage.
•  Supervised Learning
•  Structure Consistency Modeling
•  Multi-objective Optimization
A two-class classification problem: construct a multi-objective optimization which jointly optimizes the prediction accuracy on the labeled user pairs and multiple structure consistency measurements across different platforms.

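40. The core of social media big data: the five "C"s
•  Content
   –  Personal profiles, topic distributions, sentiment models, interest profiles
•  Context
   –  Location, temporal analysis, behavior trajectories, community analysis
•  Connection
   –  Offline relationship mining, core network analysis
•  Crowd
   –  Harnessing the intelligence of the crowd: crowdsourcing, crowdfunding
•  Cloud
   –  Developing a multi-source mindset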

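41. Personal credit applications of social media big data
•  Compensating for the sparsity of personal credit data
   •  In China, official personal credit data is scarce, especially for low- and middle-income applicants, and this group is precisely the main target customer base of internet finance.
   •  Cold start
•  Countering malicious fraud
   •  Weak correlation between social data and the financial domain
   •  Detecting cross-region fraud
•  Forward-looking risk discovery
   •  Temporal reasoning over life contexts
   •  Mining in depth how credit risk propagates through social relationships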

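42. Personal credit applications of social media big data: application modes
•  Extract social-dimension credit features and add them to existing traditional credit models
•  Use generative models to mine the latent user archetypes of different credit classes
•  A risk-propagation query and exploration engine based on social relationship networks
•  A real-time anti-fraud detection and early-warning system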

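43. Personal credit applications of social media big data
•  Upstart
Upstart launched in May 2014 and in 2014 facilitated more than 8,700 loans totaling USD 102.5 million. This strong operating performance made it a standout among new entrants to the P2P industry. The platform focuses on millennial borrowers (born 1984-1995), that is, young people born in the 1980s and early 1990s.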

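44. Extracting social-dimension credit features and adding them to existing traditional credit models
•  Combine the better-performing independent features into composite features and add them to the traditional model.
•  Combine with a decision tree: for location features (place of residence, check-in locations), select features according to how differently Good and Bad users are distributed on each, and feed them into the decision tree.
•  The decision tree generated from the data is shown in the table below. Colors distinguish the levels of the tree: yellow for the first level, green for the second, blue for the third. The values in the table are the risk index that people meeting the condition are bad-credit users. With this decision tree model, classification accuracy reaches 0.83.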

45. Extracting social-dimension credit features and adding them to existing traditional credit models

Table 5: Pearson correlation and χ2 statistics evaluation for demographic features
Fid  Feature Name            Pearson Correlation  χ2 Statistics
1    Gender                  4.45 × 10^-2         14.27*
2    Age                     1.92 × 10^-2         16.28*
3    Verified                5.128 × 10^-2        17.02*
4    Education               4.18 × 10^-3         0
5    Location                4.81 × 10^-2         16.68*
6    Occupation              2.244 × 10^-2        0.137
7    Registration time       6.944 × 10^-2        39.44*
* Passes the significance test at the confidence level of 95%.

Table 6: Pearson correlation and χ2 statistics evaluation for microblog features
Fid  Feature Name            Pearson Correlation  χ2 Statistics
1    Length                  5.546 × 10^-2        48.04*
2    Containing images       4.149 × 10^-2        3.650
3    Containing URL          1.827 × 10^-2        58.02*
4    Conta. HashTag          3.422 × 10^-2        2.376
5    Conta. only mentions    6.114 × 10^-2        21.63*
6    Conta. only emotions    5.504 × 10^-2        9.475*
7    Grant of "badges"       2.212 × 10^-2        6.449*
8    Commercial purpose      1.134 × 10^-2        2.026
9    N. B. based prob.       7.716 × 10^-2        25.76*
10   Topic distributions     5.370 × 10^-2        39.44*
* Passes the significance test at the confidence level of 95%.

Table 7: Pearson correlation and χ2 statistics evaluation for behavior features
Fid  Feature Name            Pearson Correlation  χ2 Statistics
1    Near Duplicate          2.740 × 10^-2        2.642
2    Retweet Chain           9.200 × 10^-2        53.05*
3    Plain Retweet           3.374 × 10^-2        34.61*
4    Emoticon behavior       8.637 × 10^-2        25.68*
5    Mention behavior        6.236 × 10^-2        28.10*
6    Posting time            5.162 × 10^-2        61.06*
7    Metaphysical power      4.370 × 10^-2        0.660
8    Active level            4.770 × 10^-2        31.77*
9    Sentiment word(+)       4.240 × 10^-2        0.380
10   Sentiment word(-)       5.063 × 10^-2        0.092
11   Sentiment polarity(+)   2.602 × 10^-2        4.851
12   Sentiment polarity(-)   9.272 × 10^-3        2.268
* Passes the significance test at the confidence level of 95%.

Table 8: Pearson correlation and χ2 statistics evaluation for network features (only the feature names survive in the source)
Fid  Feature Name
1    #followees
2    #followers
3    #friends
4    #friends/#followees
5    #followers+#followees
6    Aggregated feature 1
7    Aggregated feature 4
8    Betweenness Centrality

Excerpt from the accompanying paper: [...] posting time are especially important since their χ2 statistics are all considerably high and there are 24 different features of this kind. Figure 5 (c) shows the feature importance when behavior features are used as input for the GBDT model. Their importance values are all comparable with each other, and the low importance values also validate the intuition that behavior information reflects a user's credit risk only indirectly and to a limited extent. Although the importance of each individual feature is not very high, the combination of so many predictive behavior features demonstrates very high performance. Table 8 and Figure 5 (d) present the network features (Section 3.5.4, Network Features).

Figure 5: feature importance plots (x-axis: Fid; y-axis: Importance Value).

Features highlighted on the slide:
•  Posting time distribution
•  Mobile device
•  Check-in region distribution
•  Time span of check-in regions

Figure: accuracy versus number of features (x-axis: Number of Features, 1 to 21; y-axis: Accuracy, 0.52 to 0.62).
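The two screening statistics reported in these tables can be computed with the standard formulas. The sketch below is a generic illustration, not the paper's pipeline: the eight-user toy data is made up, and it assumes a binary feature and a binary good/bad label, so the contingency table is 2 × 2 with one degree of freedom, for which the 95% critical value of the χ² distribution is 3.841.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)  # undefined if either sequence is constant


def chi_square(xs, ys):
    """Pearson chi-square statistic of the contingency table of two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    rows, cols = Counter(xs), Counter(ys)
    stat = 0.0
    for a in rows:
        for b in cols:
            expected = rows[a] * cols[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat


# Toy data: feature = "account is verified" (1/0), label = "good credit" (1/0).
verified = [1, 1, 1, 1, 0, 0, 0, 0]
good = [1, 1, 1, 0, 0, 0, 0, 1]

r = pearson(verified, good)
chi2 = chi_square(verified, good)
print(r, chi2, chi2 > 3.841)
```

With only eight users the statistic stays below 3.841, so this toy feature would not earn the asterisk used in the tables. The starred rows show that a small correlation can still pass the test.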

46. Using generative models to mine the latent user archetypes of different credit classes

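47. A risk-propagation query and exploration engine based on social relationship networks
Workflow: generate seeds → expand the network → business applications
Seed sources:
•  Internal list of delinquent customers
•  External big data platforms
   •  Lists of users reported on social media
   •  Bad-record lists from internet finance websites
   •  Bad-record lists from government public information platforms
   •  Lists triggered by news events
Multi-dimensional data mining:
•  User content analysis (topics, opinions, sentiment, etc.)
•  Context analysis (spatio-temporal sequences, geographic location, etc.)
•  Social network analysis (family, colleagues, friends, communities, etc.)
Business applications: automatic scoring of customers at massive scale; interactive detection and investigation system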
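The "seed, then expand" workflow can be pictured as a bounded graph traversal. The sketch below is a hypothetical illustration only: the per-hop `decay`, the `max_hops` limit and the rule of keeping the highest score per user are assumptions made here, since the slide does not describe how risk is propagated or scored.

```python
from collections import deque


def propagate_risk(graph, seeds, decay=0.5, max_hops=2):
    """Breadth-first expansion from seed users; risk shrinks by `decay` per hop.

    graph maps each user to the users they are related to
    (undirected relations are listed in both directions).
    """
    risk = {s: 1.0 for s in seeds}
    queue = deque((s, 0) for s in seeds)
    while queue:
        user, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(user, ()):
            score = risk[user] * decay
            if score > risk.get(neighbor, 0.0):
                risk[neighbor] = score
                queue.append((neighbor, hops + 1))
    return risk


relations = {
    "bad_debtor": ["spouse"],
    "spouse": ["bad_debtor", "colleague"],
    "colleague": ["spouse", "distant_contact"],
    "distant_contact": ["colleague"],
}
scores = propagate_risk(relations, seeds=["bad_debtor"])
print(scores)
```

Users beyond the hop limit receive no score at all, which keeps the expanded network small enough to review in an interactive investigation tool.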

48. Real-time anti-fraud detection and early-warning system

49. Real-time anti-fraud detection and early-warning system

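50. Advantages of social media big data for personal credit assessment
•  A personal credit score can be built from a user's personal data
   •  Personal data: massive, all-round, dynamic and real-time, with an understanding of context
   •  Analysis methods:
      •  Content analysis: interests (gambling, pornography, luxury high spending, etc.), personal character (vulgar language, lying, self-contradiction), personality traits (irritable, extreme, reckless, impulsive)
      •  Context analysis: movement trajectories (no fixed residence, frequenting disreputable venues, presence in fraud-prone areas), lifestyle habits (nightlife, posting times), devices used (phone model and configuration)
•  A comprehensive credit assessment can be built from a user's social network, uncovering latent credit risk
   •  Social network: the user's core network (family, close friends, business partners)
   •  Analysis methods:
      •  Network-based credit inference (for example, whether the user is closely tied to people with bad credit)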

51. Challenges and open questions in using social big data for financial innovation
•  The "CANNOTs (or SHOULD-NOTs)": the boundaries and frontiers
   –  Privacy
      •  How to provide non-intrusive yet personalized customer service?
      •  Where is the boundary between public and private data?
   –  Ownership
      •  Who should own the data shared on various platforms?
      •  How to split profit from the data?
   –  Valuation
      •  How to assess value for different data sets?
      •  How to promote and regulate data exchange among parties?

52. Questions
fdzhu@smu.edu.sg
