A Novel Target Marketing Approach based on Influence Maximization
1. A Novel Target Marketing Approach based on Influence Maximization
2. Motivation
• “Businesses on Facebook and Twitter are reaching only 2% of their fans, and only 0.07% of followers actually interact with their posts.” – Forrester Study, Nov. 17, 2014
• Local business owners need to target people nearby to increase footfall.
• Traditional marketing methods such as leafleting are inefficient.
• “82% of people check reviews online before spending money on a product/service” – Nielsen Study, July 1, 2013
• Local businesses can use online review websites such as Yelp and Zomato to target customers effectively.
3. Problem Statement
• “To develop a novel approach for identifying influential customers for target marketing through Influence Maximization.”
[Fig. 1: Objectives]
4. Influence Maximization
• Influence maximization is the problem of finding K vertices in a graph such that, under a given diffusion model, the expected number of vertices influenced by those K vertices (the influence spread) is as large as possible.
• The Independent Cascade (IC) model is the simplest diffusion model. If j is a neighbor of i, then the probability of j being activated by i is given by Eq. 5.
[Eq. 5; diagram: edge from i to j labelled pij and wij]
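The IC process described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the graph representation (a dict mapping each node to `{neighbor: probability}`), the function names, and the toy graph are all assumptions made for the example.

```python
import random


def simulate_ic(graph, seeds, rng):
    """Run one Independent Cascade simulation; return the set of activated nodes.

    graph maps each node to a dict {neighbor: propagation probability}.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for i in frontier:
            for j, p_ij in graph.get(i, {}).items():
                # A newly active node i gets one chance to activate each
                # still-inactive neighbor j, succeeding with probability p_ij.
                if j not in active and rng.random() < p_ij:
                    active.add(j)
                    next_frontier.append(j)
        frontier = next_frontier
    return active


def estimate_spread(graph, seeds, runs=1000, seed=0):
    """Monte-Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, rng)) for _ in range(runs))
    return total / runs


# Hypothetical toy graph with deterministic edges so the result is easy to check.
toy_graph = {
    "a": {"b": 1.0, "c": 0.0},
    "b": {"d": 1.0},
    "c": {},
    "d": {},
}
print(estimate_spread(toy_graph, ["a"], runs=10))
```

With real propagation probabilities the estimate varies from run to run, which is why the greedy methods on the following slides average over many simulations.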
5. Existing Work
• Kempe et al. were the first to study influence maximization as an optimization problem.
• They proved it to be NP-hard and gave a greedy algorithm that is slow in practice.
• GeneralGreedy runs k rounds: in the i-th round it selects the node v that provides the largest increase in influence spread.
• In each round, influence spread is estimated by Monte-Carlo simulations.
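A rough sketch of the greedy loop described above, assuming the same dict-of-dicts graph representation as before. The function names, the number of simulation runs, and the toy graph are illustrative choices, not taken from Kempe et al. or from the slides.

```python
import random


def ic_spread(graph, seeds, runs, rng):
    """Average number of nodes activated over `runs` Independent Cascade simulations."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            node = frontier.pop()
            for nbr, p in graph[node].items():
                if nbr not in active and rng.random() < p:
                    active.add(nbr)
                    frontier.append(nbr)
        total += len(active)
    return total / runs


def general_greedy(graph, k, runs=200, seed=0):
    """Pick k seeds; each round adds the node with the largest estimated spread."""
    rng = random.Random(seed)
    seeds = []
    for _ in range(k):
        best_node, best_spread = None, -1.0
        for v in graph:
            if v in seeds:
                continue
            # Re-estimating the spread for every candidate is what makes this slow.
            spread = ic_spread(graph, seeds + [v], runs, rng)
            if spread > best_spread:
                best_node, best_spread = v, spread
        seeds.append(best_node)
    return seeds


# Hypothetical toy graph: "h" reaches three nodes, "x" reaches one.
toy_graph = {
    "h": {"a": 1.0, "b": 1.0, "c": 1.0},
    "a": {}, "b": {}, "c": {},
    "x": {"y": 1.0},
    "y": {},
}
print(general_greedy(toy_graph, k=2, runs=20))
```

Every round runs `runs` simulations for each remaining candidate, so the cost grows with the number of nodes, the number of rounds, and the number of simulations.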
6. Cont’d
• Chen et al. developed NewGreedyIC, an improved greedy algorithm.
• NewGreedyIC also runs Monte-Carlo simulations, but in each iteration it generates a random graph G’ by randomly removing edges from the existing graph G. The graph processed in that iteration is therefore smaller, which makes NewGreedyIC faster than GeneralGreedy.
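The edge-removal idea can be sketched as follows for an undirected graph with a uniform propagation probability p. This is an approximation of the scheme in Chen et al. written from the description above; the function name, argument layout, and toy data are assumptions.

```python
import random


def new_greedy_ic(nodes, edges, k, p, runs=200, seed=0):
    """Sketch of NewGreedyIC on an undirected graph with uniform probability p."""
    rng = random.Random(seed)
    seeds = []
    for _ in range(k):
        score = {v: 0 for v in nodes}
        for _ in range(runs):
            # Build G' by keeping each edge with probability p (removing it otherwise).
            adj = {v: [] for v in nodes}
            for u, v in edges:
                if rng.random() < p:
                    adj[u].append(v)
                    adj[v].append(u)
            # Label connected components of G' and record their sizes.
            comp_of, comp_size = {}, []
            for start in nodes:
                if start in comp_of:
                    continue
                cid = len(comp_size)
                comp_of[start] = cid
                stack, size = [start], 0
                while stack:
                    n = stack.pop()
                    size += 1
                    for m in adj[n]:
                        if m not in comp_of:
                            comp_of[m] = cid
                            stack.append(m)
                comp_size.append(size)
            # A node not already reached by the seeds would add its whole component.
            covered = {comp_of[s] for s in seeds}
            for v in nodes:
                if comp_of[v] not in covered:
                    score[v] += comp_size[comp_of[v]]
        best = max((v for v in nodes if v not in seeds), key=lambda v: score[v])
        seeds.append(best)
    return seeds


# Hypothetical toy data; p=1.0 keeps every edge so the outcome is deterministic.
toy_nodes = ["h", "a", "b", "c", "x", "y", "z"]
toy_edges = [("h", "a"), ("h", "b"), ("h", "c"), ("x", "y")]
print(new_greedy_ic(toy_nodes, toy_edges, k=2, p=1.0, runs=5))
```

One sampled graph G’ is used to score all candidates at once, instead of running separate simulations per candidate as GeneralGreedy does.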
7. Cont’d
• Chen et al. also proposed a more efficient DegreeDiscount method.
• DegreeDiscount does not run Monte-Carlo simulations. It uses a degree discount heuristic, which assumes that spread increases with the degree of a node.
• A node's degree is discounted by one when one of its neighbors has already been selected into the set of active nodes.
• It is 6 times faster than NewGreedyIC, and its influence spread is slightly lower than that of NewGreedyIC.
[Figure: degree discount example around node A; node labels 3, 5, 6 in one panel and 2, 4, 5 in the other]
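A small sketch of the discount-by-one rule as the slide states it. Chen et al. call this simpler rule SingleDiscount; their DegreeDiscountIC uses a finer discount that also depends on the propagation probability, which is not reproduced here. The function name and the toy graph are assumptions.

```python
def discount_by_one(adj, k):
    """Select k seeds by degree, discounting a node's degree by one
    for each neighbor that has already been selected.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    discounted = {v: len(nbrs) for v, nbrs in adj.items()}
    seeds = []
    for _ in range(k):
        best = max((v for v in adj if v not in seeds), key=lambda v: discounted[v])
        seeds.append(best)
        for nbr in adj[best]:
            if nbr not in seeds:
                # This neighbor's edge to the new seed no longer adds spread.
                discounted[nbr] -= 1
    return seeds


# Hypothetical toy graph: after "a" is chosen, its neighbors b, c, d drop to degree 1,
# so "e" (degree 2, not adjacent to "a") is chosen next.
toy_adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}
print(discount_by_one(toy_adj, k=2))
```

No simulation is run at any point, which is where the speed advantage over the greedy methods comes from.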
8. Inspiration from existing work
• The DegreeDiscount method eliminates the need for Monte-Carlo simulations by using a degree heuristic.
• This reduces running time many-fold compared to NewGreedyIC.
9. Research Gap
• DegreeDiscount does not take into account the overlap between the spreads of two influential nodes.
• Because of this overlap, the total influence spread is less than the sum of the individual influence spreads.
• Our novel algorithm adds as the k-th node the node that maximizes the difference between the spread of the already selected k-1 nodes and the spread of the k nodes after the addition.
• In the example, C-A has a larger difference in spread than B-A.
[Figure: influence spreads of nodes A, B and C]
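The overlap argument can be checked with plain set arithmetic. The spread sets below are invented for illustration; only the relationship between A, B and C mirrors the slide.

```python
# Hypothetical spread sets: B reaches more nodes than C, but mostly nodes A already reaches.
spread = {
    "A": {1, 2, 3, 4, 5},
    "B": {3, 4, 5, 6, 7, 8},
    "C": {9, 10, 11, 12},
}


def marginal_gain(candidate, selected):
    """Number of nodes the candidate reaches that the selected nodes do not."""
    covered = set().union(*(spread[s] for s in selected))
    return len(spread[candidate] - covered)


print(len(spread["B"]), len(spread["C"]))    # individual spreads: 6 4
print(marginal_gain("B", ["A"]))             # 3
print(marginal_gain("C", ["A"]))             # 4
```

B has the larger individual spread, yet once A is selected, C adds more new nodes, which is the case a degree-only heuristic misses.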
13. Data and Preprocessing
• The semi-structured data obtained from Yelp is stored in a document-oriented database.
• Preprocessing is done to clean the data.
• A social network is formed from users who have reviewed similar nearby businesses.
• Users are represented as nodes in the network, and two nodes are joined by an edge only if the users are friends.
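A minimal sketch of the network construction step, assuming the friend lists have already been loaded into a dict and the set of users who reviewed the relevant nearby businesses has already been chosen. The function name, argument names, and sample data are hypothetical.

```python
def build_friend_graph(friends_by_user, selected_users):
    """Return an undirected adjacency map over the selected users.

    Two selected users are joined by an edge only if one lists the other as a friend.
    """
    selected = set(selected_users)
    adj = {u: set() for u in selected}
    for u in selected:
        for f in friends_by_user.get(u, ()):
            # Friends outside the selected set are ignored.
            if f in selected and f != u:
                adj[u].add(f)
                adj[f].add(u)
    return adj


# Hypothetical sample: u9 is a friend of u1 but did not review a nearby business.
friends_by_user = {"u1": ["u2", "u9"], "u2": ["u1"], "u3": []}
graph = build_friend_graph(friends_by_user, ["u1", "u2", "u3"])
print(graph)
```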
14. Edge weight calculation in network
• The weight of an edge between two users X and Y is calculated from w1 and w2.
• w1 is the normalized count of mutual friends between X and Y, where nx and ny are the lists of friends of user X and user Y.
• w2 signifies the similarity in opinion between user X and user Y.
• Xpos is the set of businesses that X rated positively; Xneg is the set of businesses that X rated negatively.
• A rating of 3 or below is treated as a negative review, and a rating of 4 or above as a positive review.
[Eqs. 7, 8 and 9]
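The equations for w1, w2 and their combination are not available in this text, so the sketch below substitutes plausible stand-ins, all of which are assumptions: a Jaccard ratio for the normalized mutual-friend count, the fraction of commonly rated businesses on which both users agree for opinion similarity, and a weighted average (`alpha`) to combine the two. Only the 3-or-below / 4-or-above rating split comes from the slide.

```python
def split_by_rating(ratings):
    """Split {business: stars} into (positive, negative) sets using the slide's thresholds."""
    pos = {b for b, stars in ratings.items() if stars >= 4}
    neg = {b for b, stars in ratings.items() if stars <= 3}
    return pos, neg


def mutual_friend_score(friends_x, friends_y):
    """ASSUMED form of w1: shared friends divided by all friends of either user."""
    fx, fy = set(friends_x), set(friends_y)
    union = fx | fy
    return len(fx & fy) / len(union) if union else 0.0


def opinion_similarity(x_pos, x_neg, y_pos, y_neg):
    """ASSUMED form of w2: share of commonly rated businesses rated the same way."""
    common = (x_pos | x_neg) & (y_pos | y_neg)
    if not common:
        return 0.0
    agree = len(x_pos & y_pos) + len(x_neg & y_neg)
    return agree / len(common)


def edge_weight(w1, w2, alpha=0.5):
    """ASSUMED combination: weighted average of the two components."""
    return alpha * w1 + (1 - alpha) * w2


# Hypothetical users X and Y.
x_pos, x_neg = split_by_rating({"b1": 5, "b2": 2, "b3": 4})
y_pos, y_neg = split_by_rating({"b1": 4, "b2": 5, "b4": 1})
w1 = mutual_friend_score(["a", "b", "c"], ["b", "c", "d"])
w2 = opinion_similarity(x_pos, x_neg, y_pos, y_neg)
print(w1, w2, edge_weight(w1, w2))
```

Any of the three stand-in formulas can be swapped for the original definitions without changing the surrounding structure.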
15. Propagation probability calculation in network
• A propagation probability is calculated for each edge going from u to v.
• The strength of an edge between u and v is the average of the influence of u and v.
• For popularity, we used two attributes of the user: reviewCount and averageStars.
• The clustering value is defined as the closeness of a node to a cluster of highly interconnected nodes; C(v) denotes the clustering value of node v.
• Quartiles were used for normalization.
[Eqs. 10, 11, 15, 16 and 17]
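Since the formula for C(v) is not available in this text, the sketch below uses the standard local clustering coefficient as a stand-in; that choice is an assumption and may differ from the authors' definition. The edge-strength function follows the slide's statement that strength is the average of the two endpoint influences. The influence values themselves are hypothetical inputs.

```python
def local_clustering(adj, v):
    """Standard local clustering coefficient (ASSUMED stand-in for C(v)).

    adj maps each node to the set of its neighbors (undirected graph).
    """
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Each link between two neighbors of v is seen from both ends, so halve the count.
    links = sum(1 for a in nbrs for b in adj[a] if b in nbrs) // 2
    return 2 * links / (k * (k - 1))


def edge_strength(influence, u, v):
    """Strength of edge (u, v): the average of the influence of u and v."""
    return (influence[u] + influence[v]) / 2


# Hypothetical toy graph and influence scores.
toy_adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_clustering(toy_adj, "a"))
print(local_clustering(toy_adj, "b"))
print(edge_strength({"a": 0.75, "b": 0.25}, "a", "b"))
```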
17. Our novel approach: spreadHeuristicIC Algorithm
• The proposed algorithm is a greedy algorithm.
• It iteratively finds a node and adds it to the set S of top-K influential nodes.
• When adding the k-th node to S, it chooses the node that maximizes the difference between the spread of the already selected k-1 nodes and the spread of S after that k-th node is added.
[Figure: nodes A, B and C]
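The selection rule above can be sketched as a coverage-based greedy loop. The algorithm listing itself is not part of this text, so several details here are assumptions: a node's coverage is taken to be the set of nodes it activates in at least half of the IC simulations, and the function names, `threshold` parameter, and toy graph are invented for the example.

```python
import random
from collections import Counter


def coverage(graph, node, runs, rng, threshold=0.5):
    """ASSUMED coverage: nodes activated from `node` in at least threshold*runs IC runs."""
    counts = Counter()
    for _ in range(runs):
        active, frontier = {node}, [node]
        while frontier:
            cur = frontier.pop()
            for nbr, p in graph[cur].items():
                if nbr not in active and rng.random() < p:
                    active.add(nbr)
                    frontier.append(nbr)
        counts.update(active)
    return {n for n, c in counts.items() if c >= threshold * runs}


def spread_heuristic(graph, k, runs=100, seed=0):
    """Greedily add the node whose coverage adds the most nodes not yet covered by S."""
    rng = random.Random(seed)
    cover = {v: coverage(graph, v, runs, rng) for v in graph}
    selected, covered = [], set()
    for _ in range(k):
        best = max((v for v in graph if v not in selected),
                   key=lambda v: len(cover[v] - covered))
        selected.append(best)
        covered |= cover[best]   # union of the spreads of all selected nodes
    return selected, covered


# Hypothetical toy graph: B overlaps heavily with A, C does not.
toy_graph = {f"n{i}": {} for i in range(1, 9)}
toy_graph.update({
    "A": {f"n{i}": 1.0 for i in range(1, 6)},
    "B": {f"n{i}": 1.0 for i in range(3, 7)},
    "C": {"n7": 1.0, "n8": 1.0},
})
selected, covered = spread_heuristic(toy_graph, k=2, runs=10)
print(selected, len(covered))
```

In the toy graph, B reaches more nodes than C on its own, but after A is selected C contributes more uncovered nodes and is picked second.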
19. Complexity Analysis
• Line 3 of the algorithm takes O(V) steps, and line 4 takes O(T) time, where T is the time to compute the coverage of a node in the graph G. T is O(IE), where I is the number of simulations for the Independent Cascade model and E is the number of edges in G.
• Each of lines 7-9 has complexity O(V lg V) when sorting is used for the union operation.
• The overall complexity of the algorithm is therefore O(K(VIE + V lg V)).
20. Experiments and Results
• We conducted experiments with our algorithm and several other algorithms (e.g. the Degree Discount, Single Discount and General Greedy algorithms) on Yelp’s network.
• We find that the Spread Heuristic based algorithm achieves a larger influence spread than the other algorithms. The ranking based on influence spread is:
spreadHeuristicIC > newGreedyIC > degreeDiscountIC > random
21. Cont’d
[Figs. 7 and 9: influence spread versus K, in two panels: G with n=1617, E=2058 and G with n=4292, E=8147. x-axis: K; y-axis: InfluenceSpread. Series: degreeDiscountIC, degreeDiscountIC2, degreeDiscountStar, degreeHeuristic, degreeHeuristic2, singleDiscount, highestDegree, newGreedyIC, randomHeuristic, spreadHeuristic]
22. Cont’d
[Figs. 8 and 10: running time versus K, in two panels: G with n=1617, E=2058 and G with n=4292, E=8147. x-axis: K; y-axis: RunningTime (sec). Series: degreeDiscountIC, degreeDiscountIC2, degreeDiscountStar, degreeHeuristic, degreeHeuristic2, singleDiscount, highestDegree, newGreedyIC, randomHeuristic, spreadHeuristic]
23. Conclusion
• With respect to the initial aims and objectives of this project, the outcome was fairly successful.
• After a series of experiments, we concluded that our algorithm outperforms existing influence maximization algorithms in influence spread.
• We developed a dashboard that lets businesses visualize the influential users and their spread among the people nearby.
24. References
[1] M.E.J. Newman, M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E 69 (2) (2004) 026113.
[2] Blondel, Vincent D., et al. "Fast unfolding of communities in large networks." Journal of Statistical Mechanics: Theory and Experiment 2008.10 (2008): P10008.
[3] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. S. Glance. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 420–429, 2007.
[4] “Yelp Dataset,” https://www.yelp.com/dataset_challenge/dataset.
[5] D. Kempe, J. Kleinberg, E. Tardos. Maximizing the Spread of Influence through a Social Network. Proc. 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2003.
[6] M. Richardson, P. Domingos. Mining Knowledge-Sharing Sites for Viral Marketing. Eighth Intl. Conf. on Knowledge Discovery and Data Mining, 2002.
[7] J. Goldenberg, B. Libai, E. Muller. Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth. Marketing Letters 12:3 (2001), 211-223.
[8] M. Granovetter. Threshold models of collective behavior. The American Journal of Sociology, vol. 83, no. 6, pp. 1420-1443, May 1978.
[9] Chen, Wei, Yajun Wang, and Siyu Yang. "Efficient influence maximization in social networks." Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009.
[10] Kempe, David, Jon Kleinberg, and Éva Tardos. "Maximizing the spread of influence through a social network." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.
[11] Wang, Yu, et al. "Community-based greedy algorithm for mining top-k influential nodes in mobile social networks." Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.
[12] Saito, Kazumi, Ryohei Nakano, and Masahiro Kimura. "Prediction of information diffusion probabilities for independent cascade model." Knowledge-Based Intelligent Information and Engineering Systems. Springer Berlin Heidelberg, 2008.
[13] Newman, Mark EJ. "Analysis of weighted networks." Physical Review E 70.5 (2004): 056131.