Conditional Random Fields
1. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John Lafferty, Andrew McCallum, Fernando Pereira. Speaker: Shu-Ying Li
8. Maximum Entropy Markov Models (MEMMs): the probability of a transition between labels may depend on past and future observations. (Figure: MEMM graphical structure over states $S_{t-1}, S_t, S_{t+1}$ and observations $O_{t-1}, O_t, O_{t+1}$.)
10. Introduction (cont.): solving the label bias problem. Either change the state-transition structure of the model, or start with a fully-connected model and let the training procedure figure out a good structure.
13. $P(Y_3 \mid X, \text{all other } Y) = P(Y_3 \mid X, Y_2, Y_4)$, where $X = X_1, \ldots, X_{n-1}, X_n$
15. $t_k(y_{i-1}, y_i, x, i)$ is a transition feature function of the entire observation sequence and the labels at positions $i-1$ and $i$; $s_k(y_i, x, i)$ is a state feature function of the label at position $i$ and the observation sequence.
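To make the two kinds of feature concrete, here is a minimal Python sketch of one state feature and one transition feature for a hypothetical named-entity task; the label names ("B-PER", "O") and the capitalization test are illustrative assumptions, not from the slides.

```python
def s_cap_person(y_i, x, i):
    """Hypothetical state feature s_k(y_i, x, i): fires when the
    word at position i is capitalized and the label is B-PER."""
    return 1.0 if x[i][0].isupper() and y_i == "B-PER" else 0.0

def t_other_to_person(y_prev, y_i, x, i):
    """Hypothetical transition feature t_k(y_{i-1}, y_i, x, i):
    fires when label O is immediately followed by label B-PER."""
    return 1.0 if y_prev == "O" and y_i == "B-PER" else 0.0

# Usage on a toy sentence:
x = ["Yesterday", "John", "left"]
print(s_cap_person("B-PER", x, 1))            # 1.0: "John" is capitalized, label B-PER
print(t_other_to_person("O", "B-PER", x, 1))  # 1.0: O -> B-PER transition
```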
31. Define a set of $n+1$ matrices $\{M_i(x) \mid i = 1, \ldots, n+1\}$, where each $M_i(x)$ is a square matrix over the label set augmented with start and end, with elements of the form
$$M_i(y', y \mid x) = \exp\Big(\sum_k \lambda_k\, t_k(y', y, x, i) + \sum_k \mu_k\, s_k(y, x, i)\Big)$$
32. Conditional Random Fields. The normalization function $Z(x)$ is the (start, end) entry of the product of these matrices:
$$Z(x) = \big(M_1(x)\, M_2(x) \cdots M_{n+1}(x)\big)_{\text{start},\, \text{end}} \quad [1]$$
The conditional probability of label sequence $y$ is:
$$p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x) \quad [2]$$
where $y_0 = \text{start}$ and $y_{n+1} = \text{end}$
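A minimal NumPy sketch of equations [1] and [2], assuming the augmented label set places start at index 0 and end at the last index (an assumption of this sketch; the function and variable names are mine):

```python
from functools import reduce
import numpy as np

def crf_prob(Ms, y):
    """p(y | x) from the per-position matrices M_i(x).
    Ms : list of n+1 square arrays; Ms[i][y_prev, y_cur] = M_{i+1}(y_prev, y_cur | x)
    y  : label sequence y_1 .. y_n as integer indices"""
    start, end = 0, Ms[0].shape[0] - 1
    # [1] Z(x) is the (start, end) entry of the product of the matrices.
    Z = reduce(np.matmul, Ms)[start, end]
    # [2] unnormalized score: product of matrix entries along the padded path.
    path = [start, *y, end]                       # y_0 = start, y_{n+1} = end
    score = np.prod([Ms[i][path[i], path[i + 1]] for i in range(len(Ms))])
    return score / Z
```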
33. Parameter Estimation for CRFs. Problem definition: determine the parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$. Goal: maximize the log-likelihood objective function
$$\mathcal{O}(\theta) = \sum_{x,y} \tilde{p}(x, y) \log p_\theta(y \mid x) \quad [1]$$
where $\tilde{p}(x, y)$ is the empirical distribution of the training data. This function is concave, guaranteeing convergence to the global maximum. [2] $E_p[\cdot]$ denotes expectation with respect to distribution $p$.
36. Efficiently computing the exponential sums on the right-hand sides of these equations (shown below) is problematic, because $T(x, y)$ is a global property of $(x, y)$, and dynamic programming will sum over sequences with potentially varying $T$. This motivates the dynamic-programming approach below. [2]
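For reference, "these equations" are the iterative-scaling update equations, in which the update $\delta\lambda_k$ must solve (a hedged reconstruction from Lafferty et al. 2001; $f_k$ and $g_k$ are the paper's names for the transition and state features $t_k$ and $s_k$):
$$\tilde{E}[f_k] \;=\; \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x, i)\; e^{\delta\lambda_k\, T(x,y)},
\qquad
T(x, y) \;=\; \sum_{i}\sum_{k} f_k(y_{i-1}, y_i, x, i) \;+\; \sum_{i}\sum_{k} g_k(y_i, x, i)$$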
37. Parameter Estimation for CRFs. For each index $i = 0, \ldots, n+1$, we define forward vectors $\alpha_i(x)$ and backward vectors $\beta_i(x)$:
$$\alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \text{start} \\ 0 & \text{otherwise} \end{cases}, \qquad \alpha_i(x) = \alpha_{i-1}(x)\, M_i(x) \quad [1]$$
$$\beta_{n+1}(y \mid x) = \begin{cases} 1 & \text{if } y = \text{end} \\ 0 & \text{otherwise} \end{cases}, \qquad \beta_i(x)^{\top} = M_{i+1}(x)\, \beta_{i+1}(x) \quad [2]$$
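A minimal NumPy sketch of these recurrences, under the same index conventions as the earlier sketch (start at index 0, end at the last index; the names are mine):

```python
import numpy as np

def forward_backward(Ms):
    """Forward vectors alpha_0..alpha_{n+1} and backward vectors
    beta_0..beta_{n+1} from the matrices M_1(x)..M_{n+1}(x)."""
    n1 = len(Ms)                          # n + 1 matrices
    num_labels = Ms[0].shape[0]
    start, end = 0, num_labels - 1
    alpha = [np.zeros(num_labels) for _ in range(n1 + 1)]
    beta = [np.zeros(num_labels) for _ in range(n1 + 1)]
    alpha[0][start] = 1.0                 # [1] base case: indicator of start
    beta[n1][end] = 1.0                   # [2] base case: indicator of end
    for i in range(1, n1 + 1):            # alpha_i(x) = alpha_{i-1}(x) M_i(x)
        alpha[i] = alpha[i - 1] @ Ms[i - 1]
    for i in range(n1 - 1, -1, -1):       # beta_i(x)^T = M_{i+1}(x) beta_{i+1}(x)
        beta[i] = Ms[i] @ beta[i + 1]
    # Consistency check: Z(x) = alpha[n1][end] = beta[0][start]
    return alpha, beta
```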
40. Algorithm S adds a global slack feature $s(x, y) = S - \sum_{i}\sum_{k} f_k(y_{i-1}, y_i, x, i) - \sum_{i}\sum_{k} g_k(y_i, x, i)$, where $S$ is a constant chosen so that $s(x^{(i)}, y) \ge 0$ for all $y$ and all observation vectors $x^{(i)}$ in the training set. This makes the total feature count constant, so the iterative scaling updates have a closed form.
44. Parameter Estimation for CRFs: Algorithm S. [1] The constant $S$ in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small steps toward the maximum in each iteration.
46. Use forward-backward recurrences to compute the expectations $a_{k,t}$ of feature $f_k$ and $b_{k,t}$ of feature $g_k$ given that $T(x) = t$. Then $\beta_k$ and $\gamma_k$ are the unique positive roots of the polynomial equations
$$\sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^{\,t} = \tilde{E}[f_k], \qquad \sum_{t=0}^{T_{\max}} b_{k,t}\, \gamma_k^{\,t} = \tilde{E}[g_k]$$
which can be easily computed by Newton's method (see the sketch below); the updates are then $\delta\lambda_k = \log \beta_k$ and $\delta\mu_k = \log \gamma_k$.
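As a sketch of the root-finding step, the following Python uses Newton's method to solve one such polynomial equation; the array `a` stands in for the expectations $a_{k,t}$ and `target` for $\tilde{E}[f_k]$ (hypothetical names). Assuming the coefficients are non-negative, the left-hand side is increasing for $\beta > 0$, so the positive root is unique and Newton's iterations converge quickly from a positive start.

```python
import numpy as np

def unique_positive_root(a, target, beta=1.0, tol=1e-10, max_iter=100):
    """Newton's method for the unique positive root of
    sum_t a[t] * beta**t = target, assuming a[t] >= 0."""
    t = np.arange(len(a))
    for _ in range(max_iter):
        residual = (a * beta ** t).sum() - target
        if abs(residual) < tol:
            break
        slope = (t[1:] * a[1:] * beta ** (t[1:] - 1)).sum()  # d/d(beta)
        beta = max(beta - residual / slope, 1e-12)            # stay positive
    return beta

# Toy usage: expectations a_{k,t} for t = 0..3 and empirical expectation 2.0;
# the corresponding update would be delta_lambda_k = log(beta_k).
beta_k = unique_positive_root(np.array([0.1, 0.5, 0.3, 0.1]), 2.0)
print(beta_k, np.log(beta_k))
```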
59. Use the optimal MEMM parameter vector as a starting point for training the corresponding CRF, to accelerate convergence.
60. Conclusions. Discriminatively trained models for sequence segmentation and labeling. Combination of arbitrary, overlapping, and agglomerative observation features from both the past and future. Efficient training and decoding based on dynamic programming. Parameter estimation guaranteed to find the global optimum.
61. References
J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In International Conference on Machine Learning, 2001.
Hanna M. Wallach. Conditional Random Fields: An Introduction. University of Pennsylvania CIS Technical Report MS-CIS-04-21, 2004.
Reference slides by Rongkun Shen.