DELAB - sequence generation seminar
Title: Open vocabulary problem
Table of contents
1. Open vocabulary problem
1-1. Open vocabulary problem
1-2. Ignore rare words
1-3. Approximative Softmax
1-4. Back-off Models
1-5. Character-level model
2. Solution1: Byte Pair Encoding(BPE)
3. Solution2: WordPieceModel(WPM)
3. Open vocabulary problem
References
Machine Translation - 07: Open-vocabulary Translation
“On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”. Cho et al. 2014.
“On Using Very Large Target Vocabulary for Neural Machine Translation”. Jean et al. 2015.
9. Open vocabulary problem
Word vocabulary
Index   Word
1       How
2       are
3       you
4       ?
...
m       temperature

Lookup table (embedding matrix)
Index   Word vector
1       [0.2, 0.1, ..., 0.2, 0.2]
2       [0.3, 0.1, ..., 0.4, 0.1]
3       [0.1, 0.2, ..., 0.7, 0.6]
4       [0.9, 0.3, ..., 0.8, 0.1]
...
m       [0.8, 0.8, ..., 0.6, 0.8]
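A minimal NumPy sketch of the two tables above (the vocabulary and vector dimensions are toy values, not from the slides): each word is mapped to an index, and that index selects one row of the embedding matrix.

```python
import numpy as np

# Toy vocabulary: word -> index (hypothetical entries)
vocab = {"How": 0, "are": 1, "you": 2, "?": 3, "temperature": 4}
emb_dim = 5
rng = np.random.default_rng(0)

# Embedding matrix: one row (word vector) per vocabulary entry
embedding = rng.uniform(size=(len(vocab), emb_dim))

def lookup(sentence):
    """Map each token to its index and gather the corresponding embedding rows."""
    ids = [vocab[tok] for tok in sentence]   # raises KeyError for out-of-vocabulary tokens
    return embedding[ids]                    # shape: (len(sentence), emb_dim)

print(lookup(["How", "are", "you", "?"]).shape)   # (4, 5)
```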
10. Open vocabulary problem
• Vocabulary size↑
• increases network size
• decreases training and decoding speed
• Neural Machine Translation models often operate with fixed
word vocabularies even though translation is fundamentally
an open vocabulary problem (names, numbers, dates etc.).
(Wu et al. 2016)
14. Open vocabulary problem
• Rare word
• Out-of-vocabulary word
• Unseen word
Rank     Freq     Word
1        58,112   a
2        23,122   the
3        9,814    is
4        9,814    I
...
50,000   23       expurgate
50,001   3        obstreperous
...
15. Open vocabulary problem
• Ignore Rare Words
• replace out-of-vocabulary words with UNK
• a vocabulary of 50,000 words covers 95% of text (see the coverage sketch below)
• this gets you 95% of the way...
• ... if you only care about automatic metrics
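A small sketch of what that coverage claim means in practice (the corpus here is a made-up toy): keep only the k most frequent words and measure what fraction of the running tokens they account for; everything outside this vocabulary becomes UNK.

```python
from collections import Counter

def vocabulary_coverage(tokens, vocab_size):
    """Fraction of running tokens covered by the vocab_size most frequent word types."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(vocab_size))
    return covered / len(tokens)

# Made-up toy corpus; in practice this would be the full training text.
corpus = "the cat sat on the mat and the cat ate the fish".split()
print(vocabulary_coverage(corpus, vocab_size=2))   # 'the' and 'cat' cover 6 of 12 tokens -> 0.5
```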
18. Open vocabulary problem
• Vocabulary size↑
• increases network size
• decreases training and decoding speed
19. Open vocabulary problem
• The probability of the next target word (softmax):

p(y_t \mid y_{<t}, x) = \frac{1}{Z} \exp\!\left( w_t^\top \phi(y_{t-1}, z_t, c_t) + b_t \right)

Z = \sum_{k:\, y_k \in V} \exp\!\left( w_k^\top \phi(y_{t-1}, z_t, c_t) + b_k \right)
20. Open vocabulary problem
• One of the main difficulties in training this neural machine
translation model is the computational complexity involved in
computing the target word probability
• More specifically, we need to compute the dot product between the feature φ(y_{t-1}, z_t, c_t) and the word vector w_k as many times as there are words in the target vocabulary in order to compute the normalization constant (the denominator); a sketch of this cost follows below
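A minimal NumPy sketch of that cost (all sizes are hypothetical): obtaining the denominator Z requires a dot product with every one of the |V| output word vectors at every decoding step.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 50_000, 512                 # hypothetical sizes

W = rng.standard_normal((vocab_size, hidden))    # one output vector w_k per target word
b = rng.standard_normal(vocab_size)
phi = rng.standard_normal(hidden)                # feature phi(y_{t-1}, z_t, c_t) for one step

# Full softmax: |V| dot products just to obtain the denominator Z
scores = W @ phi + b                             # shape (vocab_size,)
scores -= scores.max()                           # numerical stability
probs = np.exp(scores) / np.exp(scores).sum()    # p(y_t = k | y_<t, x) for every k
```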
21. Open vocabulary problem
• Solution 1: Approximative Softmax [Jean et al., 2015]
⇒ smaller weight matrix, faster softmax
• compute the softmax over an "active" subset of the vocabulary (sketched below)
• at training time: vocabulary based on the words occurring in the current training-set partition
• at test time: determine likely target words from the source text (using a cheap method such as a translation dictionary)
• limitations
• allows larger vocabulary, but still not open
• network may not learn good representation of rare words
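A rough sketch of the underlying idea, not the exact procedure of Jean et al.: the softmax is normalized only over a small "active" candidate set (here a hard-coded toy list) instead of the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 50_000, 512
W = rng.standard_normal((vocab_size, hidden))      # full output embedding (hypothetical)
b = rng.standard_normal(vocab_size)
phi = rng.standard_normal(hidden)

# "Active" candidate set, e.g. chosen per sentence by a cheap translation dictionary.
active = np.array([3, 17, 256, 1024, 42_000])

scores = W[active] @ phi + b[active]               # |active| dot products instead of |V|
scores -= scores.max()
probs_active = np.exp(scores) / np.exp(scores).sum()   # distribution over the active subset only
```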
22. Open vocabulary problem
• Solution 2: Back-off Models [Jean et al., 2015, Luong et al., 2015]
• replace rare words with UNK at training time
• when the system produces UNK, align the UNK to a source word and translate it with a back-off method (sketched below)
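A toy sketch of the back-off step (the sentences, alignment, and dictionary are made up): each UNK in the output is aligned back to a source word via the attention-based alignment and replaced through a dictionary lookup, falling back to copying the source word.

```python
def backoff_replace(target_tokens, alignments, source_tokens, dictionary):
    """Replace each UNK with a dictionary translation of its aligned source word."""
    out = []
    for i, tok in enumerate(target_tokens):
        if tok == "<unk>":
            src = source_tokens[alignments[i]]       # aligned source position (from attention)
            out.append(dictionary.get(src, src))     # translate, or copy if not in the dictionary
        else:
            out.append(tok)
    return out

# Made-up example
source = ["ich", "treffe", "Obama"]
target = ["i", "meet", "<unk>"]
align = {2: 2}                                       # target position -> source position
print(backoff_replace(target, align, source, {"Obama": "Obama"}))   # ['i', 'meet', 'Obama']
```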
23. Open vocabulary problem
• limitations
• compounds: hard to model 1-to-many relationships
• morphology: hard to predict inflection with back-off dictionary
• names: if alphabets differ, we need transliteration
• alignment: attention model unreliable
24. Open vocabulary problem
• Solution3: Character-level Models
• advantages:
• (mostly) open-vocabulary
• no heuristic or language-specific segmentation
• neural network can conceivably learn from raw character sequences
• drawbacks:
• increased sequence length slows training/decoding (a 2–4× increase in training time has been reported)
• naive character-level encoder–decoders are currently resource-limited [Luong and Manning, 2016]
25. Open vocabulary problem
• Hierarchical model: back-off revisited [Luong and Manning,
2016]
• word-level model produces UNKs
• for each UNK, character-level model predicts word based on word
hidden state
• Pros:
• prediction is more flexible than dictionary look-up
• more efficient than pure character-level translation
• Cons:
• independence assumptions between main model and backoff model
26. Open vocabulary problem
• Fully Character-level NMT [Lee et al., 2016]
• Encoder: convolution and max-pooling layers
• get rid of word boundaries
• a character-level convolutional network with max-pooling at the encoder reduces the length of the source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities (see the encoder sketch below)
• Decoder: character-level RNN
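A heavily simplified PyTorch sketch of the encoder idea only (layer sizes are invented, and the actual model of Lee et al. also uses highway layers and a bidirectional GRU): character embeddings pass through a convolution, and strided max-pooling shortens the sequence before any recurrent layers.

```python
import torch
import torch.nn as nn

class TinyCharEncoder(nn.Module):
    """Toy character-level encoder: conv over char embeddings + max-pooling to shorten the sequence."""
    def __init__(self, n_chars=100, emb=32, channels=64, pool=5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=pool, stride=pool)   # reduces length by a factor of `pool`

    def forward(self, char_ids):                      # char_ids: (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)      # (batch, emb, seq_len) for Conv1d
        x = torch.relu(self.conv(x))                  # (batch, channels, seq_len)
        x = self.pool(x)                              # (batch, channels, seq_len // pool)
        return x.transpose(1, 2)                      # (batch, shorter_len, channels)

enc = TinyCharEncoder()
chars = torch.randint(0, 100, (2, 50))                # batch of 2 "sentences", 50 characters each
print(enc(chars).shape)                               # torch.Size([2, 10, 64])
```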
27. Byte pair encoding(BPE)
References
Machine Translation - 07: Open-vocabulary Translation
cs224n-2019-lecture12-subwords
“Neural Machine Translation of Rare Words with Subword Units”. Sennrich et al. 2016.
28. Byte pair encoding(BPE)
• Subwords for NMT: Motivation
• compounding and other productive morphological processes
• they charge a carry-on bag fee.
• sie erheben eine Hand|gepäck|gebühr.
• names
• Obama (English; German)
• Обама (Russian)
• オバマ (o-ba-ma) (Japanese)
• technical terms, numbers, etc.
29. Byte pair encoding(BPE)
• Byte pair encoding for word segmentation
1. Starting point: character-level representation (computationally expensive)
2. Compress the representation based on information theory (byte pair encoding)
3. Repeatedly replace the most frequent symbol pair (’A’, ’B’) with ’AB’ (see the sketch below)
4. Hyperparameter: when to stop (controls the vocabulary size)
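A minimal sketch of the merge loop, adapted from the pseudo-code in Sennrich et al. (2016); the toy vocabulary and the number of merges are illustrative.

```python
import re, collections

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the (word -> count) vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[p.sub(''.join(pair), word)] = v_in[word]
    return v_out

# Toy word-frequency vocabulary; words are space-separated symbols ending in '</w>'.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10   # the stopping point is the hyperparameter controlling vocabulary size
for _ in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)
```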
34. Byte pair encoding(BPE)
• why BPE?
• open-vocabulary
• operations learned on the training set can be applied to unknown words
• compression of frequent character sequences improves efficiency
• trade-off between text length and vocabulary size
38. WordPieceModel(WPM)
• Google NMT (GNMT) uses a variant of this
• V1: wordpiece model
• V2: sentencepiece model
• Rather than using character n-gram counts, it uses a greedy approximation to maximizing the language-model log-likelihood to choose the pieces
• i.e., it adds the n-gram (word unit) that maximally reduces perplexity
39. WordPieceModel(WPM)
1. Initialize the word unit inventory with the basic Unicode
characters and including all ASCII.
2. Build a language model on the training data using the
inventory from 1.
3. Generate a new word unit by combining two units out of
the current word inventory to increment the word unit
inventory by one. Choose the new word unit out of all
possible ones that increases the likelihood on the training
data the most when added to the model.
4. Go to 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold (a toy sketch of this greedy loop follows below).
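A toy illustration of the greedy, likelihood-driven selection (not Google's actual wordpiece trainer): the "language model" here is just a unigram model over the current units, and every adjacent pair is scored by how much merging it would raise the corpus log-likelihood.

```python
import math
from collections import Counter

def unigram_loglik(corpus):
    """Corpus log-likelihood under a unigram LM over the current word units."""
    counts = Counter()
    for units, freq in corpus.items():
        for u in units:
            counts[u] += freq
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())

def apply_merge(units, pair):
    """Merge every adjacent occurrence of `pair` inside one segmented word."""
    out, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            out.append(units[i] + units[i + 1])
            i += 2
        else:
            out.append(units[i])
            i += 1
    return tuple(out)

def best_merge(corpus):
    """Score every adjacent pair and return the merge that raises the likelihood the most."""
    base = unigram_loglik(corpus)
    candidates = {(u[i], u[i + 1]) for u in corpus for i in range(len(u) - 1)}
    scored = []
    for pair in candidates:
        merged = Counter()
        for units, freq in corpus.items():
            merged[apply_merge(units, pair)] += freq
        scored.append((unigram_loglik(dict(merged)) - base, pair, dict(merged)))
    return max(scored, key=lambda s: s[0])

# Toy corpus: each word is a tuple of its current units, with a frequency.
corpus = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2, ('n', 'e', 'w'): 6}
for _ in range(3):
    gain, pair, corpus = best_merge(corpus)
    print(pair, 'gain in log-likelihood:', round(gain, 2))
```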
40. WordPieceModel(WPM)
1. Word unit inventory: {a, b, …, z, !, @, …}
2. build language model(LM) on the training data
3. Create a new word unit (ex. ‘ab’, ‘ac’, …, ‘!@’, …)
1. {a, b, …, z, !, @, …, ab} – LM likelihood: +30 (best)
2. {a, b, …, z, !, @, …, ac} – LM likelihood: +15
3. …
4. {a, b, …, z, !, @, …, !@} – LM likelihood: +0.1
4. Choose {a, b, …, z, !, @, …, ab}
5. Stop if a predefined limit of word units is reached or the
likelihood increase falls below a certain threshold
42. WordPieceModel(WPM)
• Wordpiece model tokenizes inside words
• Sentencepiece model works from raw text
• Whitespace is retained as special token (_) and grouped normally
• You can reverse things at the end by joining pieces and recoding them to spaces
• SentencePiece: https://github.com/google/sentencepiece
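A minimal usage sketch assuming the sentencepiece Python package; corpus.txt and the vocabulary size are placeholders. Internally the retained-whitespace marker is the '▁' character, which is turned back into spaces when pieces are joined.

```python
# pip install sentencepiece
import sentencepiece as spm

# Train a model on a plain-text file (placeholder path; vocab_size is illustrative).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="m", vocab_size=8000, model_type="bpe"
)

sp = spm.SentencePieceProcessor(model_file="m.model")
pieces = sp.encode("they charge a carry-on bag fee.", out_type=str)
print(pieces)             # e.g. ['▁they', '▁charge', '▁a', '▁carry', '-', 'on', '▁bag', '▁fee', '.']
print(sp.decode(pieces))  # whitespace is recovered from the '▁' marker when joining pieces
```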
43. WordPieceModel(WPM)
• BERT uses a variant of the wordpiece model
• (Relatively) common words are in the vocabulary:
• at, fairfax, 1910s
• Other words are built from wordpieces:
• hypatia = ‘h’, ‘##yp’, ‘##ati’, ‘##a’
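A small sketch of the greedy longest-match-first lookup used to split a word into wordpieces; the vocabulary below is a made-up toy, not BERT's actual one.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:                  # shrink the window until a known piece is found
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub            # mark continuation pieces
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]                    # no segmentation possible -> unknown token
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary (hypothetical, not BERT's actual wordpiece vocabulary)
toy_vocab = {"h", "##yp", "##ati", "##a", "fair", "##fax"}
print(wordpiece_tokenize("hypatia", toy_vocab))   # ['h', '##yp', '##ati', '##a']
print(wordpiece_tokenize("fairfax", toy_vocab))   # ['fair', '##fax']
```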