Conditional Random Fields
1. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John Lafferty, Andrew McCallum, Fernando Pereira. Speaker: Shu-Ying Li
8. Maximum Entropy Markov Models (MEMMs): the probability of a transition between labels may depend on past and future observations. (Figure: MEMM graphical structure over states $S_{t-1}, S_t, S_{t+1}$ and observations $O_{t-1}, O_t, O_{t+1}$.)
10. Introduction (cont.): solving the label bias problem. Either change the state-transition structure of the model, or start with a fully-connected model and let the training procedure figure out a good structure.
13. $P(Y_3 \mid X, \text{all other } Y) = P(Y_3 \mid X, Y_2, Y_4)$, where $X = X_1, \ldots, X_{n-1}, X_n$
15. $t_k(y_{i-1}, y_i, x, i)$ is a transition feature function of the entire observation sequence and the labels at positions $i-1$ and $i$; $s_k(y_i, x, i)$ is a state feature function of the label at position $i$ and the observation sequence.
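To make the two kinds of feature concrete, here is a minimal Python sketch of one state feature and one transition feature for a hypothetical named-entity task; the label names ("B-PER", "O") and the capitalization test are illustrative assumptions, not from the slides.

```python
def s_cap_person(y_i, x, i):
    """Hypothetical state feature s_k(y_i, x, i): fires when the
    word at position i is capitalized and the label is B-PER."""
    return 1.0 if x[i][0].isupper() and y_i == "B-PER" else 0.0

def t_other_to_person(y_prev, y_i, x, i):
    """Hypothetical transition feature t_k(y_{i-1}, y_i, x, i):
    fires when label O is immediately followed by label B-PER."""
    return 1.0 if y_prev == "O" and y_i == "B-PER" else 0.0

# Usage on a toy sentence:
x = ["Yesterday", "John", "left"]
print(s_cap_person("B-PER", x, 1))            # 1.0: "John" is capitalized, label B-PER
print(t_other_to_person("O", "B-PER", x, 1))  # 1.0: O -> B-PER transition
```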
31. Define a set of $n+1$ matrices $\{M_i(x) \mid i = 1, \ldots, n+1\}$, where each $M_i(x)$ is a square matrix over the label set augmented with start and end, with elements of the form
$$M_i(y', y \mid x) = \exp\Big(\sum_k \lambda_k\, t_k(y', y, x, i) + \sum_k \mu_k\, s_k(y, x, i)\Big)$$
32. Conditional Random Fields. The normalization function $Z(x)$ is the (start, end) entry of the product of these matrices:
$$Z(x) = \big(M_1(x)\, M_2(x) \cdots M_{n+1}(x)\big)_{\text{start},\, \text{end}} \quad [1]$$
The conditional probability of label sequence $y$ is:
$$p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x) \quad [2]$$
where $y_0 = \text{start}$ and $y_{n+1} = \text{end}$
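A minimal NumPy sketch of equations [1] and [2], assuming the augmented label set places start at index 0 and end at the last index (an assumption of this sketch; the function and variable names are mine):

```python
from functools import reduce
import numpy as np

def crf_prob(Ms, y):
    """p(y | x) from the per-position matrices M_i(x).
    Ms : list of n+1 square arrays; Ms[i][y_prev, y_cur] = M_{i+1}(y_prev, y_cur | x)
    y  : label sequence y_1 .. y_n as integer indices"""
    start, end = 0, Ms[0].shape[0] - 1
    # [1] Z(x) is the (start, end) entry of the product of the matrices.
    Z = reduce(np.matmul, Ms)[start, end]
    # [2] unnormalized score: product of matrix entries along the padded path.
    path = [start, *y, end]                       # y_0 = start, y_{n+1} = end
    score = np.prod([Ms[i][path[i], path[i + 1]] for i in range(len(Ms))])
    return score / Z
```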
33. Parameter Estimation for CRFs. Problem definition: determine the parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$. Goal: maximize the log-likelihood objective function
$$\mathcal{O}(\theta) = \sum_{x,y} \tilde{p}(x, y) \log p_\theta(y \mid x) \quad [1]$$
where $\tilde{p}(x, y)$ is the empirical distribution of the training data. This function is concave, guaranteeing convergence to the global maximum. [2] $E_p[\cdot]$ denotes expectation with respect to distribution $p$.
36. Efficiently computing the exponential sums on the right-hand sides of these equations (shown below) is problematic, because $T(x, y)$ is a global property of $(x, y)$, and dynamic programming will sum over sequences with potentially varying $T$. This motivates the dynamic-programming approach below. [2]
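For reference, "these equations" are the iterative-scaling update equations, in which the update $\delta\lambda_k$ must solve (a hedged reconstruction from Lafferty et al. 2001; $f_k$ and $g_k$ are the paper's names for the transition and state features $t_k$ and $s_k$):
$$\tilde{E}[f_k] \;=\; \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x, i)\; e^{\delta\lambda_k\, T(x,y)},
\qquad
T(x, y) \;=\; \sum_{i}\sum_{k} f_k(y_{i-1}, y_i, x, i) \;+\; \sum_{i}\sum_{k} g_k(y_i, x, i)$$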
37. Parameter Estimation for CRFs. For each index $i = 0, \ldots, n+1$, we define forward vectors $\alpha_i(x)$ and backward vectors $\beta_i(x)$:
$$\alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \text{start} \\ 0 & \text{otherwise} \end{cases}, \qquad \alpha_i(x) = \alpha_{i-1}(x)\, M_i(x) \quad [1]$$
$$\beta_{n+1}(y \mid x) = \begin{cases} 1 & \text{if } y = \text{end} \\ 0 & \text{otherwise} \end{cases}, \qquad \beta_i(x)^{\top} = M_{i+1}(x)\, \beta_{i+1}(x) \quad [2]$$
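A minimal NumPy sketch of these recurrences, under the same index conventions as the earlier sketch (start at index 0, end at the last index; the names are mine):

```python
import numpy as np

def forward_backward(Ms):
    """Forward vectors alpha_0..alpha_{n+1} and backward vectors
    beta_0..beta_{n+1} from the matrices M_1(x)..M_{n+1}(x)."""
    n1 = len(Ms)                          # n + 1 matrices
    num_labels = Ms[0].shape[0]
    start, end = 0, num_labels - 1
    alpha = [np.zeros(num_labels) for _ in range(n1 + 1)]
    beta = [np.zeros(num_labels) for _ in range(n1 + 1)]
    alpha[0][start] = 1.0                 # [1] base case: indicator of start
    beta[n1][end] = 1.0                   # [2] base case: indicator of end
    for i in range(1, n1 + 1):            # alpha_i(x) = alpha_{i-1}(x) M_i(x)
        alpha[i] = alpha[i - 1] @ Ms[i - 1]
    for i in range(n1 - 1, -1, -1):       # beta_i(x)^T = M_{i+1}(x) beta_{i+1}(x)
        beta[i] = Ms[i] @ beta[i + 1]
    # Consistency check: Z(x) = alpha[n1][end] = beta[0][start]
    return alpha, beta
```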
40. Algorithm S adds a global slack feature $s(x, y) = S - \sum_{i}\sum_{k} f_k(y_{i-1}, y_i, x, i) - \sum_{i}\sum_{k} g_k(y_i, x, i)$, where $S$ is a constant chosen so that $s(x^{(i)}, y) \ge 0$ for all $y$ and all observation vectors $x^{(i)}$ in the training set. This makes the total feature count constant, so the iterative scaling updates have a closed form.
44. Parameter Estimation for CRFs: Algorithm S. [1] The constant $S$ in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small steps toward the maximum in each iteration.
46. Use forward-backward recurrences to compute the expectations $a_{k,t}$ of feature $f_k$ and $b_{k,t}$ of feature $g_k$ given that $T(x) = t$. Then $\beta_k$ and $\gamma_k$ are the unique positive roots of the polynomial equations
$$\sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^{\,t} = \tilde{E}[f_k], \qquad \sum_{t=0}^{T_{\max}} b_{k,t}\, \gamma_k^{\,t} = \tilde{E}[g_k]$$
which can be easily computed by Newton's method (see the sketch below); the updates are then $\delta\lambda_k = \log \beta_k$ and $\delta\mu_k = \log \gamma_k$.
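As a sketch of the root-finding step, the following Python uses Newton's method to solve one such polynomial equation; the array `a` stands in for the expectations $a_{k,t}$ and `target` for $\tilde{E}[f_k]$ (hypothetical names). Assuming the coefficients are non-negative, the left-hand side is increasing for $\beta > 0$, so the positive root is unique and Newton's iterations converge quickly from a positive start.

```python
import numpy as np

def unique_positive_root(a, target, beta=1.0, tol=1e-10, max_iter=100):
    """Newton's method for the unique positive root of
    sum_t a[t] * beta**t = target, assuming a[t] >= 0."""
    t = np.arange(len(a))
    for _ in range(max_iter):
        residual = (a * beta ** t).sum() - target
        if abs(residual) < tol:
            break
        slope = (t[1:] * a[1:] * beta ** (t[1:] - 1)).sum()  # d/d(beta)
        beta = max(beta - residual / slope, 1e-12)            # stay positive
    return beta

# Toy usage: expectations a_{k,t} for t = 0..3 and empirical expectation 2.0;
# the corresponding update would be delta_lambda_k = log(beta_k).
beta_k = unique_positive_root(np.array([0.1, 0.5, 0.3, 0.1]), 2.0)
print(beta_k, np.log(beta_k))
```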
59. Use the optimal MEMM parameter vector as a starting point for training the corresponding CRF, to accelerate convergence.
60. Conclusions. Discriminatively trained models for sequence segmentation and labeling. Combination of arbitrary, overlapping, and agglomerative observation features from both the past and future. Efficient training and decoding based on dynamic programming. Parameter estimation guaranteed to find the global optimum.
61. References
J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In International Conference on Machine Learning, 2001.
Hanna M. Wallach. Conditional Random Fields: An Introduction. University of Pennsylvania CIS Technical Report MS-CIS-04-21, 2004.
Reference slides by Rongkun Shen.