36. [1] Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Pearson Addison-Wesley.
    [2] Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.
37. [3] Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27-64.
•  Minimum-cut graph partitioning objective:
   \min \sum_{i=1}^{k} \sum_{j \in C_i,\ l \notin C_i} w_{jl}
   where k is the number of clusters and w_{jl} is the edge weight between nodes j and l
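A minimal sketch (not from the slides) of evaluating this cut objective for a given partition; the weight matrix W and the labels are illustrative assumptions:

```python
# Sum of edge weights leaving each cluster: sum_i sum_{j in C_i, l not in C_i} w_jl.
import numpy as np

def cut_weight(W, labels):
    total = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        total += W[np.ix_(in_c, ~in_c)].sum()   # weights crossing out of cluster c
    return total

# toy 4-node graph with two obvious clusters {0,1} and {2,3}
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 1],
              [1, 0, 0, 5],
              [0, 1, 5, 0]], dtype=float)
print(cut_weight(W, np.array([0, 0, 1, 1])))  # 4.0: each weak cross edge counted once per side
```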
    [4] Boutin, F., & Hascoet, M. (2004, July). Cluster validity indices for graph partitioning. In Proceedings of the Eighth International Conference on Information Visualisation (IV 2004) (pp. 376-381). IEEE.
    [5] Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
    [6] Patkar, S. B., & Narayanan, H. (2003, January). An efficient practical heuristic for good ratio-cut partitioning. In Proceedings of the 16th International Conference on VLSI Design (pp. 64-69). IEEE.
38. [7] Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34.
[8] Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364-366.
•  Graph Laplacian: L = D − A
•  A = [a_{ij}], i, j = 1, 2, ..., n is the adjacency (affinity) matrix, and D = diag(d_1, d_2, ..., d_n) is the degree matrix with
   d_i = \sum_{j=1}^{n} a_{ij}
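A minimal sketch (assumed, not from the slides) of forming the unnormalized Laplacian L = D − A and taking the eigenvectors of its smallest eigenvalues, which is the embedding step used by spectral clustering; the adjacency matrix is a toy example:

```python
import numpy as np

def graph_laplacian(A):
    D = np.diag(A.sum(axis=1))      # degree matrix, d_i = sum_j a_ij
    return D - A

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = graph_laplacian(A)
eigvals, eigvecs = np.linalg.eigh(L)    # L is symmetric
embedding = eigvecs[:, :2]              # first k eigenvectors -> k-dimensional embedding
print(eigvals[0])                       # ~0: the smallest eigenvalue of a graph Laplacian is always 0
```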
    [9] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416.
    [10] Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2, 849-856.
39. •  Modularity:
       Q = \frac{1}{2m} \sum_{i=1}^{k} \sum_{j, l \in C_i} \left( A_{jl} - \frac{d_j d_l}{2m} \right)
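A minimal sketch (assumed, not from the slides) of computing this modularity Q for a given partition of an unweighted graph; the two-triangle graph below is illustrative:

```python
import numpy as np

def modularity(A, labels):
    m = A.sum() / 2.0                  # number of edges
    d = A.sum(axis=1)                  # degrees
    Q = 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        for j in idx:
            for l in idx:
                Q += A[j, l] - d[j] * d[l] / (2 * m)
    return Q / (2 * m)

# two triangles connected by a single edge
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))   # ~0.357 for the natural two-community split
```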
    [11] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113.
    [12] Clauset, A., Newman, M. E., & Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111.
    [13] Kehagias, A. (2012). Bad Communities with High Modularity. arXiv preprint arXiv:1209.2678.
    [14] Daszykowski, M., Walczak, B., & Massart, D. L. (2001). Looking for natural patterns in data: Part 1. Density-based approach. Chemometrics and Intelligent Laboratory Systems, 56(2), 83-92.
41. •  Locally scaled similarity (self-tuning spectral clustering):
       w_{ij} = \exp\left( -\frac{d(x_i, x_j)^2}{d_i^k d_j^k} \right)  if x_j \in x_i^k and x_i \in x_j^k,  and w_{ij} = 0 otherwise,
       where x_i^k is the k-nearest-neighbor set of point i and d_i^k is the distance between point i and its k-th nearest neighbor
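A minimal sketch (assumed, not from the slides) of this locally scaled, mutual k-NN affinity of Zelnik-Manor & Perona [15]; the random data and k value are illustrative:

```python
import numpy as np

def local_scale_affinity(X, k=3):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    order = np.argsort(D, axis=1)
    knn = order[:, 1:k + 1]                                      # k nearest neighbours (skip self)
    sigma = D[np.arange(n), knn[:, -1]]                          # d_i^k, distance to k-th neighbour
    W = np.zeros((n, n))
    for i in range(n):
        for j in knn[i]:
            if i in knn[j]:                                      # mutual k-NN condition
                W[i, j] = np.exp(-D[i, j] ** 2 / (sigma[i] * sigma[j]))
    return W

X = np.random.RandomState(0).randn(10, 2)
W = local_scale_affinity(X, k=3)
```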
    [15] Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. In Advances in Neural Information Processing Systems (pp. 1601-1608).
    [16] Ertöz, L., Steinbach, M., & Kumar, V. (2002, April). A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at the 2nd SIAM International Conference on Data Mining (pp. 105-115).
42. •  d_i = \sum_{j \in x_i^k} w_{ij} + \sum_{j, k \in x_i^k} w_{jk}
       where x_i^k is the k-nearest-neighbor set of point i
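A minimal sketch (assumed, not from the slides) of this score d_i, computed from a precomputed affinity matrix W and k-NN index lists; the matrix and neighbor lists are toy values:

```python
import numpy as np

def knn_density(W, knn):
    d = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        nbrs = np.asarray(knn[i])
        # affinities from i to its k-NN set, plus affinities among that set
        d[i] = W[i, nbrs].sum() + W[np.ix_(nbrs, nbrs)].sum()   # second term sums ordered pairs (j, k)
    return d

W = np.array([[0.0, 0.9, 0.8, 0.1],
              [0.9, 0.0, 0.7, 0.0],
              [0.8, 0.7, 0.0, 0.2],
              [0.1, 0.0, 0.2, 0.0]])
knn = [[1, 2], [0, 2], [0, 1], [2, 0]]     # 2-nearest-neighbour lists for each point
print(knn_density(W, knn))
```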
60. Matrix Factorization for Collaborative Prediction
    [Figure: a user-item rating matrix (Users 1-4, Items 1-4) with observed ratings and missing "?" entries, factorized into a User Factor Matrix and an Item Factor Matrix]
•  Collaborative prediction: filling in the missing entries of the user-item rating matrix
•  Matrix factorization: predicting an unknown rating by the product of a user factor vector and an item factor vector
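A minimal sketch (assumed, not from the slides) of this prediction rule; the toy U and V values are illustrative:

```python
# A missing rating is estimated as the inner product of the user and item factor vectors, X_hat = U V^T.
import numpy as np

K = 2                                                  # rank (number of latent factors)
U = np.array([[1.2, 0.3], [0.4, 1.1]])                 # user factor matrix, I x K
V = np.array([[0.9, 0.1], [0.2, 1.5], [1.0, 1.0]])     # item factor matrix, J x K
print(U @ V.T)       # all predicted ratings
print(U[0] @ V[2])   # predicted rating of user 0 on item 2
```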
    Regularized Matrix Factorization
    •  Minimize the regularized squared error loss
       Alternating Least Squares (ALS)
       –  Time complexity: O(2|Ω|K² + (I+J)K³)
       –  Parallelization: easy
       –  Tuning parameter: λ (regularization)
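A minimal sketch (assumed, not from the slides) of one ALS round for regularized matrix factorization: each user (item) factor is the ridge-regression solution against the currently fixed item (user) factors, using only the observed ratings. The data and λ are illustrative.

```python
import numpy as np

def als_step(R, mask, U, V, lam):
    I, K = U.shape
    J = V.shape[0]
    for i in range(I):                          # update user factors with V fixed
        obs = np.where(mask[i])[0]
        if len(obs) == 0:
            continue
        Vo = V[obs]
        A = Vo.T @ Vo + lam * np.eye(K)          # K x K system -> the K^3 term in the complexity
        U[i] = np.linalg.solve(A, Vo.T @ R[i, obs])
    for j in range(J):                          # update item factors with U fixed
        obs = np.where(mask[:, j])[0]
        if len(obs) == 0:
            continue
        Uo = U[obs]
        A = Uo.T @ Uo + lam * np.eye(K)
        V[j] = np.linalg.solve(A, Uo.T @ R[obs, j])
    return U, V

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(5, 4)).astype(float)
mask = rng.random((5, 4)) < 0.7                  # observed entries Omega
U, V = rng.normal(size=(5, 2)), rng.normal(size=(4, 2))
U, V = als_step(R, mask, U, V, lam=0.1)
```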
61. Regularized Matrix Factorization
    •  Minimize the regularized squared error loss
       Stochastic Gradient Descent (SGD)
       –  Time complexity: O(2|Ω|K)
       –  Parallelization: possible, but not easy
       –  Tuning parameters: λ (regularization), learning rate
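A minimal sketch (assumed, not from the slides) of one SGD pass over the observed ratings; each observation costs O(K), which is where the O(2|Ω|K) per-epoch complexity comes from. The λ and learning-rate values are illustrative.

```python
import numpy as np

def sgd_epoch(ratings, U, V, lam=0.05, lr=0.01):
    """ratings: iterable of (i, j, r) triples for the observed entries Omega."""
    for i, j, r in ratings:
        err = r - U[i] @ V[j]                       # prediction error on one observation
        ui = U[i].copy()
        U[i] += lr * (err * V[j] - lam * U[i])      # gradient steps with L2 regularization
        V[j] += lr * (err * ui - lam * V[j])
    return U, V

obs = [(0, 0, 4.0), (0, 2, 3.0), (1, 1, 5.0)]
U, V = np.full((2, 3), 0.1), np.full((3, 3), 0.1)
for _ in range(10):
    U, V = sgd_epoch(obs, U, V)
```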
    Problem of parameter tuning
    •  Too small (regularization λ): overfitting
    •  Too large: underfitting
62. Problem of parameter tuning
    •  The optimal value of the regularization parameter differs depending on the dataset and the rank K.
       Regularization parameter chosen by cross-validation on various datasets and ranks K (Kim & Choi, IEEE SPL 2013)
    Problem of parameter tuning
    •  SGD requires tuning of the regularization parameter, the learning rate, and even the number of epochs.

               0.005        0.007        0.010        0.015        0.020
    0.005      0.9061/13    0.9079/15    0.9117/19    0.9168/28    0.9168/44
    0.007      0.9056/10    0.9074/11    0.9112/13    0.9168/19    0.9169/31
    0.010      0.9064/7     0.9077/8     0.9113/10    0.9174/13    0.9186/21
    0.015      0.9099/5     0.9011/6     0.9152/6     0.9257/7     0.9390/7
    0.020      0.9166/4     0.9175/4     0.9217/4     0.9314/4     0.9431/3

    Netflix probe10 RMSE / optimal number of epochs of BRISMF for various regularization and learning-rate values ( = 40). (Takács et al., JMLR 2009)
63. Bayesian Matrix Factorization
    Prior: P(U), P(V)
    Likelihood: P(X | U, V)
    Posterior: P(U, V | X)
    Approximate the posterior by
    –  MCMC (Salakhutdinov & Mnih, ICML 2008)
    –  Variational method (Lim & Teh, KDDcup 2007)
    MCMC on Netflix
    –  No parameter tuning
    –  No overfitting
    –  High accuracy
    –  Huge computational cost: O(2|Ω|K² + (I+J)K³)
    Scalable Variational Bayesian Matrix Factorization
    •  No parameter tuning
    •  Linear space complexity: O(2(I+J)K)
    •  Linear time complexity: O(6|Ω|K)
    •  Easily parallelized on multi-core systems
    •  Optimizes an element-wise factorized variational distribution with a coordinate descent method.
64. Variational Bayesian Matrix Factorization
    •  Likelihood: Gaussian observation model on the observed entries
    •  Gaussian priors on the factor matrices U and V
    •  Approximate the posterior by a variational distribution, by maximizing the variational lower bound, or equivalently minimizing the KL-divergence
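For reference, a minimal sketch of the model these bullets describe, assuming the usual Gaussian formulation; the symbols τ², σ_u², σ_v² are assumptions here, not necessarily the slide's own notation:

```latex
% assumed standard VBMF model (sketch): likelihood over observed entries (i,j) in Omega
p(X \mid U, V) = \prod_{(i,j) \in \Omega} \mathcal{N}\!\left(x_{ij} \mid \mathbf{u}_i^{\top}\mathbf{v}_j,\ \tau^2\right)
% Gaussian priors on the factor matrices
p(U) = \prod_{i}\mathcal{N}(\mathbf{u}_i \mid \mathbf{0},\ \sigma_u^2 I), \qquad
p(V) = \prod_{j}\mathcal{N}(\mathbf{v}_j \mid \mathbf{0},\ \sigma_v^2 I)
% variational approximation: maximize the lower bound on the log evidence
\log p(X) \ \ge\ \mathbb{E}_{q(U,V)}\!\left[\log \frac{p(X, U, V)}{q(U, V)}\right]
```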
    VBMF-BCD (Lim & Teh, KDDcup 2007)
    •  Matrix-wise factorized variational distribution
       VBMF-BCD
       –  Space complexity: O((I+J)(K+K²))
       –  Time complexity: O(2|Ω|K² + (I+J)K³)
       –  Parallelization: easy
65. Scalable VBMF: linear space complexity
    Element-wise factorized variational distribution (K = 100)

    Dataset        I            J          O((I+J)(K+K²))   O(2(I+J)K)
    Netflix        480,189      17,770     4.4 GB           0.8 GB
    Yahoo-music    1,000,990    624,961    131 GB           2.6 GB
Scalable VBMF: quadratic time complexity
Updating rules for q(uki)
Updating all variational parameters
66. Scalable VBMF: linear time complexity
    Let R_ij denote the residual on the (i, j) observation:
    With R_ij, the updating rule can be rewritten accordingly.

    Scalable VBMF: linear time complexity
    When a single factor element is changed to its new value, R_ij can be easily updated incrementally.
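A minimal sketch (assumed, not the authors' code) of the residual bookkeeping this slide refers to, under the usual definition R_ij = x_ij − u_i·v_j; the indexing convention and example values are illustrative:

```python
import numpy as np

def update_factor_element(U, V, R, obs_by_user, i, k, new_value):
    """When u[i, k] changes, patch every affected residual in O(1) per observation."""
    delta = new_value - U[i, k]
    for j in obs_by_user[i]:
        R[(i, j)] -= delta * V[j, k]       # R_ij = x_ij - u_i . v_j moves by -delta * v_jk
    U[i, k] = new_value

U = np.array([[0.5, 1.0]])
V = np.array([[1.0, 0.0], [0.0, 2.0]])
X = {(0, 0): 1.0, (0, 1): 3.0}
R = {key: X[key] - U[key[0]] @ V[key[1]] for key in X}   # residuals on observed entries
update_factor_element(U, V, R, {0: [0, 1]}, i=0, k=0, new_value=0.8)
print(R)   # residuals stay consistent with the new U without recomputing u_i . v_j from scratch
```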
67. Scalable VBMF: parallelization
    [Figure: the I × K matrix of variational parameters, updated column by column]
    •  Each column of variational parameters can be updated independently of the updates of the other columns.
    •  Parallelization can easily be done in a column-by-column manner.
    •  Easy implementation with the OpenMP library on a multi-core system.
    Related work (Pilászy et al., RecSys 2010)
    •  A similar idea is used to reduce the cubic time complexity of ALS to a linear one.
       RMF → Scalable VBMF: with small extra effort, a more accurate model is obtainable without tuning of the regularization parameter.
68. Related Work (Raiko et al., ECML 2007)
    •  Considers an element-wise factorized variational distribution
    •  Updates U and V by a scaled gradient descent method
    •  Requires tuning of the learning rate
    •  Learning speed is slower than our algorithm
    Numerical Experiments
    •  Compare VBMF-CD, VBMF-BCD (Lim & Teh, KDDcup 2007), VBMF-GD (Raiko et al., ECML 2007)
    •  Experimental environment
       –  Quad-core Intel® Core™ i7-3820 @ 3.6 GHz
       –  64 GB memory
       –  Implemented in Matlab 2011a, where the main computational modules are implemented in C++ as mex files
       –  Parallelized with the OpenMP library
    •  Datasets

       Dataset         # of users    # of items   # of ratings
       MovieLens10M    69,878        10,677       10,000,054
       Netflix         480,189       17,770       100,480,507
       Yahoo-music     1,000,990     624,961      262,810,275
69. Numerical Experiments: K = 20
    RMSE versus computation time on a quad-core system for each dataset:
    (a) MovieLens10M, (b) Netflix, (c) Yahoo-music

                 MovieLens10M   Netflix   Yahoo-music
    VBMF-CD      0.8589         0.9065    22.3425
    VBMF-BCD     0.8671         0.9070    22.3671
    VBMF-GD      0.8591         0.9167    22.5883
    Numerical Experiments: Netflix, K = 50

    Time per iteration
    VBMF-BCD    66 min.
    VBMF-CD     77 sec.
    VBMF-GD     29 sec.

               VBMF-BCD             VBMF-CD
    RMSE       Iter.     Time       Iter.     Time
    0.9005     19        21 h       63        74 m
    0.9004     21        23 h       70        82 m
    0.9003     22        24 h       84        98 m
    0.9002     25        28 h       108       2 h
    0.9001     27        31 h       680       13 h
    0.9000     30        33 h
70. Conclusion
    •  We presented a scalable learning algorithm for VBMF, VBMF-CD.
    •  VBMF-CD optimizes element-wise factorized variational distributions with a coordinate descent method.
    •  The space and time complexity of VBMF-CD are linear.
    •  VBMF-CD can be easily parallelized.
    •  Experimental results confirmed the desirable behavior of VBMF-CD, such as scalability, fast learning, and prediction accuracy.
71. A hybrid genetic algorithm for accelerating feature selection and parameter optimization of support vector machine
    2013. 11. 29.

    Introduction
    •  Support Vector Machine (SVM)
       –  One of the most popular state-of-the-art classification algorithms.
       –  Efficiently finds non-linear solutions by exploiting kernel functions.
       –  Training time complexity is O(N³).
    •  "Very important" issues in training an SVM
       –  Feature selection
          •  SVM is a distance-based algorithm (kernel matrix computation) and does not include any feature selection mechanism.
          •  Irrelevant features degrade the model performance.
       –  Parameter optimization
          •  Model trade-off parameter C, kernel parameter σ (for the RBF kernel).
          •  SVM is very sensitive to the parameter settings.
       –  For SVM, feature selection and parameter optimization should be performed simultaneously.
72. Introduction
    •  Genetic algorithm (GA)
       –  A stochastic algorithm that mimics natural evolution.
       –  Easy, but very effective!
    [Figure: the GA cycle — Population → Selection → Parents → Genetic operation (Crossover, Mutation) → Offspring → Replacement → Population]
    •  GA-based feature selection and parameter selection for SVM [1-4]
       –  GA effectively finds near-optimal feature subsets and parameters.
       –  But slow. (Still MUCH better than a grid-search mechanism.)
    Introduction
    If the SVM has to be re-trained periodically, fast feature selection and parameter optimization is required.
    This study aims to avoid producing bad offspring in the "Genetic Operation" step of GA.
    This study proposes a chromosome filtering method, using a Decision Tree (DT), for faster convergence of GA in feature selection and parameter optimization of SVM.
73. The proposed method
    •  Flowchart
    [Flowchart: Initialization, Population, Evaluate fitness, Termination condition?, Do genetic operations, Chromosome Filtering, Population Replacement, Optimized parameters and feature subset]
    The proposed method
    •  Chromosome design
       –  Parameters: binary representation
          C is encoded as a bit string over a grid of powers of 10 (10⁻², 10⁻¹, 1, 10¹, 10², 10³); e.g., C = 1 × 10⁻² + 1 × 10¹
          σ is encoded as a bit string over a grid 2⁻⁵, … , 2⁵
       –  Feature subset: binary representation
          bit string 1 0 0 1 0 … 1 0 over features f1 f2 f3 f4 f5 … fp-1 fp  →  selected subset {f1, f4, … , fp-1}
    Genotype → Phenotype
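A minimal sketch (assumed, not from the paper) of decoding such a binary chromosome into (C, σ, feature subset); the value grids, one-hot choice for σ, and bit layout are illustrative assumptions:

```python
import numpy as np

C_GRID = [1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]         # assumed candidate place values for C
SIGMA_GRID = [2.0 ** e for e in range(-5, 6)]      # assumed candidate values for sigma

def decode(chrom, n_features):
    nc, ns = len(C_GRID), len(SIGMA_GRID)
    c_bits = chrom[:nc]
    s_bits = chrom[nc:nc + ns]
    f_bits = chrom[nc + ns:nc + ns + n_features]
    C = sum(b * v for b, v in zip(c_bits, C_GRID))           # e.g. 1*10^-2 + 1*10^1
    sigma = SIGMA_GRID[int(np.argmax(s_bits))]               # assume a one-hot choice for sigma
    features = [f for f, b in enumerate(f_bits) if b == 1]   # indices of selected features
    return C, sigma, features

chrom = np.array([1,0,0,1,0,0,  0,0,0,0,0,1,0,0,0,0,0,  1,0,0,1,0])
print(decode(chrom, 5))    # (10.01, 1.0, [0, 3])
```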
74. The proposed method
    •  Fitness evaluation
       –  Decode the chromosome and obtain C, σ, and a feature subset (Genotype → Phenotype).
       –  Train an SVM on the dataset given the selected C, σ, and feature subset.
       –  Fitness value: cross-validation accuracy
    The proposed method
    •  Genetic operation (a sketch of these operators follows this list)
       –  Parent selection
          •  Roulette-wheel scheme - fitness proportional selection (FPS)
          •  Probability of the i-th chromosome ci in the population being selected = f(i) / Σ_j f(j), where f(i) is the fitness of ci
       –  Crossover: N-point crossover
          •  Choose N random crossover points and split along those points.
       –  Mutation: bit-flipping mutation
          •  Bitwise bit-flipping with a fixed probability.
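A minimal sketch (assumed, not the authors' code) of fitness-proportional selection, N-point crossover, and bit-flip mutation on binary chromosomes; the population and fitness values are toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def roulette_select(population, fitness):
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()                                  # selection probability f(i) / sum_j f(j)
    return population[rng.choice(len(population), p=p)]

def n_point_crossover(a, b, n_points=2):
    cuts = sorted(rng.choice(np.arange(1, len(a)), size=n_points, replace=False))
    child, take_from_b, prev = a.copy(), False, 0
    for cut in cuts + [len(a)]:                      # alternate segments between the two parents
        if take_from_b:
            child[prev:cut] = b[prev:cut]
        take_from_b, prev = not take_from_b, cut
    return child

def bit_flip_mutation(chrom, pm=0.05):
    flips = rng.random(len(chrom)) < pm              # flip each bit with fixed probability pm
    out = chrom.copy()
    out[flips] = 1 - out[flips]
    return out

pop = rng.integers(0, 2, size=(6, 22))
fit = rng.random(6)
p1, p2 = roulette_select(pop, fit), roulette_select(pop, fit)
child = bit_flip_mutation(n_point_crossover(p1, p2, n_points=2))
```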
75. The proposed method
    •  Chromosome Filtering
       –  For each generation, chromosomes and their fitness are stored in the knowledgebase. A DT is trained periodically based on the knowledgebase. Using the DT, the offspring chromosomes that are likely to have bad fitness are removed before the fitness evaluation step.
       –  Assumption
          •  Some features and parameter settings improve (or degrade) the model performance.
          •  The DT can find these rules.
    The proposed method
    •  Chromosome Filtering (continued)
       –  Why a DT?
          •  Effectively deals with categorical features.
          •  Finds non-linear relationships.
          •  Uses a few, relevant features in the classification procedure.
       –  DT Training
          •  Each ci (i-th chromosome) in the knowledgebase (sorted by fitness) is labeled by
             –  first highest M fitness values → GOOD (probable to yield a good fitness value)
             –  next highest M fitness values → NORMAL
             –  remaining → BAD (probable to yield a bad fitness value)
          [Table: knowledgebase sorted by fitness — c1 … cM labeled GOOD, cM+1 … c2M labeled NORMAL, c2M+1 … labeled BAD]
          •  Input feature: chromosome (in phenotype)
          •  Output feature: label {GOOD, NORMAL, BAD}
76. The proposed method
    •  Chromosome Filtering (continued)
       –  Filtering (a sketch follows below)
          •  The DT gives rules that assess a chromosome before fitness evaluation: is a chromosome GOOD, NORMAL, or BAD?
          •  Each chromosome has a different survival probability, e.g., GOOD: 1.0, NORMAL: 0.5, BAD: 0.2.
          •  The DT is periodically updated, so the criterion for a good chromosome changes through the generations.
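A minimal sketch (assumed, not the authors' code) of training a decision tree on the knowledgebase of past chromosomes and using it to probabilistically filter new offspring before the expensive SVM fitness evaluation; the sizes, labels, and survival probabilities are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
SURVIVAL = {"GOOD": 1.0, "NORMAL": 0.5, "BAD": 0.2}    # example survival probabilities

def label_knowledgebase(chroms, fitness, M=10):
    order = np.argsort(fitness)[::-1]                   # sort by fitness, best first
    labels = np.array(["BAD"] * len(chroms), dtype=object)
    labels[order[:M]] = "GOOD"
    labels[order[M:2 * M]] = "NORMAL"
    return labels

def filter_offspring(dt, offspring):
    keep = []
    for chrom in offspring:
        label = dt.predict(chrom.reshape(1, -1))[0]
        if rng.random() < SURVIVAL[label]:              # survive with label-dependent probability
            keep.append(chrom)
    return keep

# usage with a toy knowledgebase of 40 binary chromosomes of length 22
kb = rng.integers(0, 2, size=(40, 22))
fit = rng.random(40)
dt = DecisionTreeClassifier().fit(kb, label_knowledgebase(kb, fit))
survivors = filter_offspring(dt, rng.integers(0, 2, size=(8, 22)))
```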
    The proposed method
    •  Chromosome Filtering (continued)
       –  DT example
    [Figure: an example decision tree with splits such as "C > 100", "σ > 1", "σ > 0.25", "Contains F1?", "Contains F3?" and leaves labeled GOOD / NORMAL / BAD]
77. The proposed method
    •  Population Replacement: steady-state model
       ← to verify the effectiveness of the proposed method in the initial period of GA.
       –  Only one chromosome in the population is updated in a generation.
       –  Replacement scheme [5, 6]: the offspring replaces one of its parents or the lowest-fitness chromosome in the population.
          •  If the offspring is superior to both parents, it replaces the more similar parent.
          •  If it is in between the two parents, it replaces the inferior parent.
          •  Otherwise, the most inferior chromosome in the population is replaced.
    Experiments
    •  Experimental Design
       –  10 datasets from the UCI repository; all datasets were normalized to be in [-1, 1].
       –  5 independent runs; a random seed set was used for fairness.
       –  In SVM training, 10-fold cross validation was used.
       –  Parameter Settings
          •  GA parameters
             –  population size Npop = 30
             –  crossover probability pc = 0.9
             –  mutation probability pm = 0.05
             –  max iteration = 300
             –  pgood = 1; pnormal = 0.5; pbad = 0.2
          •  DT parameters
             –  CART
             –  Labeling: good = 10, normal = 10, bad = remaining
             –  Training starting point: 30th generation / period = 10
79. Concluding Remarks
We presented a chromosome filtering method for GA-based feature selection
and parameter optimization of SVM.
The proposed method employed a DT as a chromosome filter to remove the
offspring chromosomes that are likely to have bad fitness before the fitness
evaluation step of GA.
On most datasets, the proposed method showed faster improvement of fitness
than standard GA.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF)
grant funded by the Korea government (MSIP) (No. 2011-0030814), and the
Brain Korea 21 Program for Leading Universities & Students. This work was
also supported by the Engineering Research Institute of SNU.
80. References
    1.  Frohlich, H., Chapelle, O., & Scholkopf, B. (2003, November). Feature selection for support vector machines by means of genetic algorithm. In Tools with Artificial Intelligence, 2003. Proceedings. 15th IEEE International Conference on (pp. 142-148). IEEE.
    2.  Huang, C. L., & Wang, C. J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31(2), 231-240.
    3.  Min, S. H., Lee, J., & Han, I. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31(3), 652-660.
    4.  Zhao, M., Fu, C., Ji, L., Tang, K., & Zhou, M. (2011). Feature selection and parameter optimization for support vector machines: A new approach based on genetic algorithm with feature chromosomes. Expert Systems with Applications, 38(5), 5197-5204.
    5.  Bui, T. N., & Moon, B. R. (1996). Genetic algorithm and graph partitioning. Computers, IEEE Transactions on, 45(7), 841-855.
    6.  Oh, I. S., Lee, J. S., & Moon, B. R. (2004). Hybrid genetic algorithms for feature selection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11), 1424-1437.
End of Document
135. Document Indexing by Ensemble Model
Yanshan Wang and In-Chan Choi
Korea University
System Optimization Lab
yansh.wang@gmail.com
November 25, 2013
    Overview
    1.  The Basics
        Information Retrieval and Document Indexing
        Topic Modelling
        Indexing by Latent Dirichlet Allocation
    2.  Indexing by Ensemble Model
        Introduction to Ensemble Model
        Algorithms
        Experimental Results
    3.  Conclusions and Discussion
136. The problem in Information Retrieval
     As more information (Big Data) becomes available, it is more difficult to access what users are looking for.
     We need new tools to help us understand and search among vast amounts of information.
     Source: www.betaversion.org/ stefano/linotype/news/26/
     Document Indexing is Important
     Users can get desired information by indexing (or ranking) documents (or items). The higher the position of a document, the more valuable it is to users.
137. Problems in Conventional Methods: Word Representation
     The majority of rule-based and statistical Natural Language Processing (NLP) models regard words as atomic symbols.
     In Vector Space Models (VSM), a word is represented by a single 1 and a lot of zeros. For example,
     [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
     Its problem:
     motel [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0] AND
     hotel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0] = 0
     The conceptual meaning of words is ignored.
     Topic Modeling
     Latent Dirichlet Allocation (LDA) [Blei et al. (2003)].
     Uncover the hidden topics that generate the collection.
     Words and documents can be represented according to those topics.
     Use the representation to organize, index and search the text.
     Example of a dense, topic-space word representation:
     apple = [0.325  0.792  0.214  0.107  0.109  0.612  0.314  0.245]^T
138. LDA [Blei et al. (2003)]
     [Figure: LDA graphical model in plate notation]
     Generative process:
     Choose the number of words N ∼ Poisson(ξ).
     Choose θ ∼ Dirichlet(α).
     For n = 1, 2, ..., N:
         Choose a topic z_n ∼ Multinomial(θ);
         Choose a word w_n ∼ Multinomial(w_n | z_n, β), a multinomial distribution conditioned on the topic z_n.
     Joint Distribution: p(θ, z, d | α, β) = p(θ | α) \prod_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)
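A minimal sketch (assumed, not from the slides) of sampling one document from the LDA generative process just described; the α, β, vocabulary size, and Poisson mean ξ are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])                # Dirichlet prior over 3 topics
beta = np.array([[0.7, 0.2, 0.1, 0.0],           # per-topic word distributions (3 topics x 4 words)
                 [0.1, 0.6, 0.2, 0.1],
                 [0.0, 0.1, 0.2, 0.7]])
xi = 8                                           # Poisson mean for the document length

N = rng.poisson(xi)                              # number of words N ~ Poisson(xi)
theta = rng.dirichlet(alpha)                     # topic proportions theta ~ Dirichlet(alpha)
doc = []
for _ in range(N):
    z = rng.choice(len(theta), p=theta)          # topic z_n ~ Multinomial(theta)
    w = rng.choice(beta.shape[1], p=beta[z])     # word w_n ~ Multinomial(beta_{z_n})
    doc.append(w)
print(doc)
```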
     Indexing by LDA (LDI) [Choi and Lee (2010)]
     With adequate assumptions, the probability of a word w_j embodying the concept z^k is
     W_j^k = p(z^k = 1 | w_j = 1) = \frac{\beta_{jk}}{\sum_{h=1}^{K} \beta_{jh}}
     The document (or query) probability can be defined within the topic space as
     D_i^k (Q_i^k) = \frac{\sum_{j=1}^{V} W_j^k n_{ij}}{N_{d_i}},
     where n_{ij} denotes the number of occurrences of word w_j in document d_i and N_{d_i} denotes the number of words in document d_i, i.e. N_{d_i} = \sum_{j=1}^{V} n_{ij}.
     Similarity between document and query:
     \rho(D, Q) = \bar{D} \cdot \bar{Q}, \quad \text{where } \bar{D} \cdot \bar{Q} = \left\langle \frac{D}{\|D\|}, \frac{Q}{\|Q\|} \right\rangle.
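A minimal sketch (assumed, not from the slides) of this LDI-style representation: each word's topic vector is β normalized over topics, a document is the count-weighted average of its words' topic vectors, and ranking uses cosine similarity. The β values and counts are toy data.

```python
import numpy as np

beta = np.array([[0.70, 0.10, 0.20],     # beta_jk for V=4 words, K=3 topics (columns = topics)
                 [0.20, 0.60, 0.20],
                 [0.05, 0.25, 0.70],
                 [0.05, 0.05, 0.90]])
W = beta / beta.sum(axis=1, keepdims=True)    # W_j^k = beta_jk / sum_h beta_jh

def topic_repr(counts):                        # counts n_ij of each word in a document/query
    return counts @ W / counts.sum()

def cosine(a, b):
    return (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))

doc = topic_repr(np.array([3, 1, 0, 0]))
query = topic_repr(np.array([1, 0, 0, 0]))
print(cosine(doc, query))
```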
139. Indexing by Ensemble Model (EnM) [Wang et al. (2013)]
     Motivation: there exist optimal weights over the constituent models.
     Table: A toy example. The values in the table represent similarities of documents with respect to a given query. The scores of Ensemble 1 and 2 are defined by 0.5*Model 1 + 0.5*Model 2 and 0.7*Model 1 + 0.3*Model 2, respectively. The relevant document list is assumed to be {2, 3}.

                  Model 1   Model 2   Ensemble 1   Ensemble 2
     Document 1   0.35      0.2       0.55         0.305
     Document 2   0.4       0.1       0.5          0.31
     Document 3   0.25      0.7       0.95         0.385
     (M)AP        0.72      0.72      0.72         0.89
     AP and MAP
     Average Precision (AP) and Mean Average Precision (MAP)
     Notation
     |Q|              the number of queries in the query set;
     |D_i|            the number of documents in the relevant document set w.r.t. the ith query;
     d_ij ∈ D_i       the jth document in D_i;
     φ_i^k            the relevance score returned by the kth model w.r.t. the ith query;
     R(d_ij, φ_i^k)   the indexing position of the jth document for the ith query returned by the kth model;
     H = Σ_k α_k φ^k  the ensemble model, a linear combination of the constituent models, where α_k ≥ 0.
     Definition
     E(H, Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} AP(H, D_i), \qquad AP(H, D_i) = \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{R(d_{ij}, H)}.
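A minimal sketch (assumed, not from the slides) of computing AP under one common reading of this definition, where R(d_ij, H) is the rank position of a relevant document under the ensemble scores; the scores and relevant set are toy values:

```python
import numpy as np

def average_precision(scores, relevant):
    """scores: ensemble scores over all documents; relevant: indices of the relevant documents."""
    order = np.argsort(scores)[::-1]                          # ranking induced by the scores
    rank = {doc: pos + 1 for pos, doc in enumerate(order)}    # 1-based positions R(d, H)
    positions = sorted(rank[d] for d in relevant)
    # the j-th relevant document (in rank order) contributes j / R(d_ij, H)
    return np.mean([(j + 1) / pos for j, pos in enumerate(positions)])

scores = np.array([0.2, 0.9, 0.5, 0.1])
print(average_precision(scores, relevant=[1, 3]))   # doc 1 at rank 1, doc 3 at rank 4 -> (1/1 + 2/4)/2 = 0.75
```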
140. Formulation
     Formulation of the Optimization Problem
     Since 0 ≤ AP ≤ 1, we can define the empirical loss as follows:
     \min \sum_{i=1}^{|Q|} \left(1 - AP(H, D_i)\right), \quad \text{or} \quad \min \sum_{i=1}^{|Q|} \left(1 - \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{R(d_{ij}, H)}\right).
     Our goal is to uncover optimal weights α that minimize the empirical loss.
     Difficulty
     The position function R(d_ij, H) is nonconvex, nondifferentiable and noncontinuous w.r.t. the α's.
     Boosting Scheme
     1.  Select model:
         \hat{\varphi}_j = \arg\max_j \sum_{i=1}^{|Q|} D_i\, AP(\varphi_i^j);
     2.  Update the weight:
         \hat{\alpha}_j^t = \hat{\alpha}_j^{t-1} + \hat{\delta}_j^t, \quad \text{where } \delta_j = \frac{1}{2} \log \frac{\sum_{i=1}^{|Q|} D_i \left(1 + AP(\varphi_i^j)\right)}{\sum_{i=1}^{|Q|} D_i \left(1 - AP(\varphi_i^j)\right)};
     3.  Update the distribution on queries:
         D_i = \frac{\exp\left(-AP(H_i)\right)}{Z}, \quad \text{where } Z \text{ is a normalizer.}
141. Coordinate Descent
     Since the objective is nonconvex, not every coordinate update will reduce the loss.
     1.  Select model:
         \hat{\varphi} = \arg\max_j E(Q, \varphi^j);
     2.  Update the weight:
         \alpha_j = \frac{1}{2} \log \frac{1 + AP(\varphi_i^j)}{1 - AP(\varphi_i^j)};
     3.  If E^t \le E^{t-1}, delete this coordinate.
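A rough sketch (assumed, not the authors' code) of the accept/reject logic of this coordinate update: try the log-odds weight for a coordinate and keep it only if the MAP objective E improves. evaluate_map is a caller-supplied placeholder, and ap_of_model[j] (assumed < 1) is the AP of constituent model j.

```python
import numpy as np

def coordinate_descent_pass(alpha, ap_of_model, evaluate_map):
    best = evaluate_map(alpha)
    for j in np.argsort(ap_of_model)[::-1]:                # try the strongest models first
        delta = 0.5 * np.log((1 + ap_of_model[j]) / (1 - ap_of_model[j]))
        trial = alpha.copy()
        trial[j] += delta
        score = evaluate_map(trial)
        if score > best:                                    # keep the coordinate only if MAP improves
            alpha, best = trial, score
    return alpha, best

# toy stand-in for E(Q, .): a weighted average of the per-model APs
ap_models = np.array([0.72, 0.72, 0.60])
dummy_map = lambda a: float(a @ ap_models / a.sum()) if a.sum() > 0 else 0.0
alpha, best = coordinate_descent_pass(np.zeros(3), ap_models, dummy_map)
```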
     Parallel Coordinate Descent
     The coordinate descent algorithm can be parallelized on cores.
     1:  parfor p = 1, 2, ..., K_φ do
     2:      Update the weight using \alpha_p = \frac{1}{2} \log \frac{1 + AP(\varphi_i^p)}{1 - AP(\varphi_i^p)};
     3:  end parfor
     4:  return ensemble model H.
142. Experimental Results on EnM
     Data: MED corpus¹.
         1033 documents from the National Library of Medicine.
         30 queries.
     Results.

     Table: MAP of various methods for the MED corpus.
     Method     MAP      improvement (%)
     TFIDF      0.4605   -
     LSI        0.5026   9.1
     pLSI       0.5334   15.8
     LDI        0.5738   24.6
     EnM.B      0.6420   39.4
     EnM.CD     0.6461   40.3
     EnM.PCD    0.6414   39.3

     Figure: Precision-Recall curves for the various methods (TFIDF, LSA, pLSI, LDI, EnM).

     1: ftp://ftp.cs.cornell.edu/pub/smart.
     Conclusions and Discussion
     Conclusion
         An ensemble model (EnM) is proposed and three algorithms are introduced for solving the optimization problem.
         The EnM outperformed all of the basis models across the overall recall regime.
     Discussion
         The algorithms are not guaranteed to converge to the global optimum due to the nonconvexity of the objective.
         The parallel coordinate descent algorithm cannot guarantee an optimum, even a local optimum, due to the coupling between variables.
     Future Works
         Approximate the objective with convex functions.
         Use stochastic gradient descent for stochastic sequences and large-scale data sets.
143. References
     Yanshan Wang and In-Chan Choi (2013). Indexing by ensemble model. Working Paper. arXiv preprint arXiv:1309.3421.
     David M. Blei, Andrew Y. Ng and Michael I. Jordan (2003). Latent dirichlet allocation. The Journal of Machine Learning Research, 3, 993-1022.
     In-Chan Choi and Jae-Sung Lee (2010). Document indexing by latent dirichlet allocation. DMIN, 409-414.
     Y. Freund and R. E. Schapire (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Computational Learning Theory, Springer, 23-37.
     My Homepage: http://optlab.korea.ac.kr/~ sam/
     The End
165. •  Suarez, E., et al. (2011). Matrix-assisted laser desorption/ionization-mass spectrometry of cuticular lipid profiles can differentiate sex, age, and mating status of Anopheles gambiae mosquitoes. Analytica Chimica Acta, 706(1), 157-163.
166. Suarez, E., et al. (2011). Matrix-assisted laser desorption/ionization-mass spectrometry of cuticular lipid profiles can differentiate sex, age, and mating status of Anopheles gambiae mosquitoes. Analytica Chimica Acta, 706(1), 157-163.
     •  Li, L., et al. (2004). Data mining techniques for cancer detection using serum proteomic profiling. Artificial Intelligence in Medicine, 32(2), 71-83.
169. (1)
     Tibshirani, R., et al. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91-108.
     Liu, J., Yuan, L., & Ye, J. (2010). An efficient algorithm for a class of fused lasso problems. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
170. (2)
     Liu, J., Yuan, L., & Ye, J. (2010). An efficient algorithm for a class of fused lasso problems. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
     Results
171. Performance Comparison
     [Table: average misclassification rate and average number of selected features]
     [Figure A: absolute fused-lasso coefficients versus m/z; Figure B: intensity values of the fused-lasso-selected features projected onto the 1st and 2nd principal components (legend: Others, MFemale7)]
455. Country  City  Latitude  Longitude  Year  DataType  DataType2  DataType3  Institution  Purpose  Scope  Scope  Time Lag  Count  Ratio  Collection  Application
471. •  Valid Voting Ratio_i = (N_cy + N_cn + N_py + N_pn) / N_all
        N_all = total number of conservative/progressive parties
        N_all ≥ N_cy + N_cn + N_py + N_pn
     •  Yes-No Diversity_i = -\sum_{k \in \{y, n\}} P_k \log_2 P_k, where
        P_y = (N_cy + N_py) / (N_cy + N_cn + N_py + N_pn),  P_n = (N_cn + N_pn) / (N_cy + N_cn + N_py + N_pn)
     •  Political Orientation Diversity_i = -\sum_{i \in \{c, p\},\ j \in \{y, n\}} P_{ij} \log_4 P_{ij}, where
        P_cy = N_cy / (N_cy + N_cn + N_py + N_pn),  P_cn = N_cn / (N_cy + N_cn + N_py + N_pn),  ...
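A minimal sketch (assumed, not from the slides) of computing these two entropy-based diversity measures from the four vote counts (conservative/progressive × yes/no); the counts are toy values:

```python
import numpy as np

def diversities(n_cy, n_cn, n_py, n_pn):
    total = n_cy + n_cn + n_py + n_pn
    p_yes, p_no = (n_cy + n_py) / total, (n_cn + n_pn) / total
    p4 = np.array([n_cy, n_cn, n_py, n_pn]) / total
    entropy = lambda p, base: -sum(x * np.log(x) / np.log(base) for x in p if x > 0)
    yes_no = entropy([p_yes, p_no], 2)      # in [0, 1], maximal when yes/no votes are balanced
    orientation = entropy(p4, 4)            # in [0, 1], maximal when all four cells are balanced
    return yes_no, orientation

print(diversities(n_cy=30, n_cn=10, n_py=5, n_pn=15))
```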
473. •  Confusion-matrix counts: yy, yn, ny, nn (actual class i predicted as class j, for classes yes/no)
     •  Recall_y = yy / (yy + yn),  Precision_y = yy / (yy + ny),  F1_y = 2 × Recall_y × Precision_y / (Recall_y + Precision_y)
     •  Recall_n = nn / (nn + ny),  Precision_n = nn / (nn + yn),  F1_n = 2 × Recall_n × Precision_n / (Recall_n + Precision_n)
     •  F1_yn = (F1_y + F1_n) / 2
511. Modified LDA with Bibliography Information
     Korea BI Data Mining Society, 2013 Fall Conference
     System Optimization Lab.
     Korea University
     Young Min, Jun

     Contents
     1. LDA
        1.1 Topic Model
        1.2 LDA
     2. Modified LDA with Bibliography Information
        2.1 Limitation of LDA
        2.2 Introduction
        2.3 Preliminary
        2.4 Model
        2.5 Expected Impacts
512. 1.1 Topic Model
     "Topic modeling provides a suite of algorithms to discover hidden thematic structure in large collections of texts. The results of topic modeling algorithms can be used to summarize, visualize, explore, and theorize about a corpus." (DM Blei, 2012)
     Example
     •  What are the "topics" in the New York Times?
     •  How do the "topics" change on Twitter?
     •  How similar are these articles?
     Research on Topic Models
     •  LSA: based on dimensionality reduction (SVD decomposition)
     •  pLSA: mixture decomposition
     •  LDA: most frequently studied model
     1.2 LDA
     "LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics." (DM Blei, 2003)
     Generative Process | Graphical Model | Geometric Interpretation
     Example
     •  Three topics for three words.
     •  LDA produces a smooth distribution over the topics.
513. 2.1 Limitation of LDA
     LDA is an effective tool for discovering topic structure, but further research is needed to improve it. Among those areas, this research focuses on three aspects: individual, reference, and explanation.
     Individual
     •  LDA is a generative model for a corpus, so it provides information about the whole set of documents.
     •  In this study, modified LDA gives more information about an individual document: it provides a vector of information about the document and its distribution.
     Reference
     •  LDA does not consider referring to the reference literature in its generative process.
     •  Modified LDA provides the bibliography of a document.
     Explanation
     •  LDA often gives results which are hard to understand.
     •  Modified LDA is expected to provide more explainable results.
     2.2 Introduction
     LDA is motivated by the process of writing a document.
     Similarly, Modified LDA is motivated by the process of writing a document in a library.
     Generic Generative Process
     More detail
     •  The place in the library carries information about the probabilities of which reference is selected.
     •  References in the same category have similar topics and words.
514. 2.3 Preliminary
     In this research, we use the language of text collections and introduce terms such as "parent corpus", "category" and "document distribution".
     Parent Corpus
     •  A set of documents used as references.
     •  The parent corpus consists of parent documents.
     •  A parent document influences the topics and words of the new document.
     •  Each parent document has its own place.
     Category
     •  A category is a cluster of the parent corpus.
     •  Parent documents in the same category have the same topic and word priors.
     Document Distribution
     •  The document distribution is the probability distribution over the selection of parent documents.
     •  Each parent document in a category has a probability of selection.
     2.3 Preliminary
     The document distribution represents the information of the new document.
     Document Distribution
     •  Both a probabilistic and a deterministic representation of a document.
        •  Probabilistic: a distribution over the parent documents giving the probability of each being used in generating the document.
        •  Deterministic: a list of the documents with high selection probability.
     Mixture of Gaussian Distribution
     •  The number of mixture components gives the number of categories.
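A minimal sketch (assumed; not the author's actual model) of one reading of this idea: parent documents placed in a low-dimensional space, with a mixture-of-Gaussians density giving each parent document's selection probability; all positions, means, and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [5.0, 5.0]])    # two categories (mixture components)
weights = np.array([0.4, 0.6])
parent_positions = rng.normal(size=(8, 2)) + means[rng.integers(0, 2, size=8)]

def density(x, means, weights, sigma=1.0):
    # isotropic Gaussian mixture density at point x
    sq = ((x - means) ** 2).sum(axis=1)
    comp = np.exp(-sq / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return float(weights @ comp)

# selection probability of each parent document, proportional to the mixture density at its place
p = np.array([density(x, means, weights) for x in parent_positions])
p /= p.sum()
print(p)
```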
515. 2.3 Preliminary
     This slide contains the assumptions of Modified LDA with Bibliography Information.
     Parent Corpus
     •  The parent corpus is assumed to have its own alpha and beta.
     •  Each parent document is placed at a point in the document distribution.
     Document Distribution
     •  The probability of a parent document follows a mixture Gaussian distribution.
     •  The number of mixture components is assumed to be known (this can be relaxed).
     LDA
     •  Bag-of-words assumption.
     2.4 Model
     This slide contains the notation, terminology, and generative process of Modified LDA with Bibliography Information.
     Notation and Terminology
     Generative Process

516. 2.4 Model
     This slide contains the graphical model and the probability of a document.
     Graphical Model
     Probability of Document

     2.4 Model
     Estimation
517. 2.5 Expected Impacts
     This research focuses on three aspects: individual, reference, and explanation.
     Individual
     •  Bibliography as part of the probabilistic representation of a document.
     •  Verifying plagiarism by comparing document distributions.
     Reference
     •  Verifying that a document is well classified.
     •  Representing the information of the important references.
     Explanation
     •  Providing a variety of views for analyzing text data.
     2.5 Expected Impacts
     Drawbacks
     Dependency on LDA
     •  This model depends on LDA, e.g., in its perplexity and complexity.
     Computational Complexity
     •  This research yields a total number of operations roughly on the order of O(N⁴k²).
     Assumption
     •  It is assumed that the number of mixture components in the document distribution is known (this can be relaxed).
518. References
     [1] Jeff A. Bilmes et al. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute, 4(510):126, 1998.
     [2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
     [3] DM Blei. Topic modeling and digital humanities. Journal of Digital Humanities, 2(1):8–11, 2012.
     [4] Nikos Vlassis and Aristidis Likas. A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15(1):77–87, 2002.