DELAB - sequence generation seminar
Title: Open vocabulary problem
Table of contents
1. Open vocabulary problem
1-1. Open vocabulary problem
1-2. Ignore rare words
1-3. Approximative Softmax
1-4. Back-off Models
1-5. Character-level model
2. Solution1: Byte Pair Encoding(BPE)
3. Solution2: WordPieceModel(WPM)
3. Open vocabulary problem
References
Machine Translation - 07: Open-vocabulary Translation
“On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”. Cho et al. 2014.
“On Using Very Large Target Vocabulary for Neural Machine Translation”. Jean et al. 2015.
9. Open vocabulary problem
Word vocabulary
Index   Word
1       How
2       are
3       you
4       ?
...
m       temperature

Lookup table (embedding matrix)
Index   Word vector
1       [0.2, 0.1, ..., 0.2, 0.2]
2       [0.3, 0.1, ..., 0.4, 0.1]
3       [0.1, 0.2, ..., 0.7, 0.6]
4       [0.9, 0.3, ..., 0.8, 0.1]
...
m       [0.8, 0.8, ..., 0.6, 0.8]
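A minimal NumPy sketch of the two tables above (the vocabulary and vector dimensions are toy values, not from the slides): each word is mapped to an index, and that index selects one row of the embedding matrix.

```python
import numpy as np

# Toy vocabulary: word -> index (hypothetical entries)
vocab = {"How": 0, "are": 1, "you": 2, "?": 3, "temperature": 4}
emb_dim = 5
rng = np.random.default_rng(0)

# Embedding matrix: one row (word vector) per vocabulary entry
embedding = rng.uniform(size=(len(vocab), emb_dim))

def lookup(sentence):
    """Map each token to its index and gather the corresponding embedding rows."""
    ids = [vocab[tok] for tok in sentence]   # raises KeyError for out-of-vocabulary tokens
    return embedding[ids]                    # shape: (len(sentence), emb_dim)

print(lookup(["How", "are", "you", "?"]).shape)   # (4, 5)
```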
10. Open vocabulary problem
• Vocabulary size↑
• increases network size
• decreases training and decoding speed
• Neural Machine Translation models often operate with fixed
word vocabularies even though translation is fundamentally
an open vocabulary problem (names, numbers, dates etc.).
(Wu et al. 2016)
14. Open vocabulary problem
• Rare word
• Out-of-vocabulary word
• Unseen word
Rank     Freq     Word
1        58,112   a
2        23,122   the
3        9,814    is
4        9,814    I
...
50,000   23       expurgate
50,001   3        obstreperous
...
15. Open vocabulary problem
• Ignore Rare Words
• replace out-of-vocabulary words with UNK
• a vocabulary of 50,000 words covers 95% of text (see the coverage sketch below)
• this gets you 95% of the way...
• ... if you only care about automatic metrics
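A small sketch of what that coverage claim means in practice (the corpus here is a made-up toy): keep only the k most frequent words and measure what fraction of the running tokens they account for; everything outside this vocabulary becomes UNK.

```python
from collections import Counter

def vocabulary_coverage(tokens, vocab_size):
    """Fraction of running tokens covered by the vocab_size most frequent word types."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(vocab_size))
    return covered / len(tokens)

# Made-up toy corpus; in practice this would be the full training text.
corpus = "the cat sat on the mat and the cat ate the fish".split()
print(vocabulary_coverage(corpus, vocab_size=2))   # 'the' and 'cat' cover 6 of 12 tokens -> 0.5
```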
18. Open vocabulary problem
• Vocabulary size↑
• increases network size
• decreases training and decoding speed
19. Open vocabulary problem
• The probability of the next target word (softmax):

p(y_t \mid y_{<t}, x) = \frac{1}{Z} \exp\!\left( w_t^\top \phi(y_{t-1}, z_t, c_t) + b_t \right)

Z = \sum_{k:\, y_k \in V} \exp\!\left( w_k^\top \phi(y_{t-1}, z_t, c_t) + b_k \right)
20. Open vocabulary problem
• One of the main difficulties in training this neural machine
translation model is the computational complexity involved in
computing the target word probability
• More specifically, we need to compute the dot product between the feature φ(y_{t-1}, z_t, c_t) and the word vector w_k as many times as there are words in the target vocabulary in order to compute the normalization constant (the denominator); a sketch of this cost follows below
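A minimal NumPy sketch of that cost (all sizes are hypothetical): obtaining the denominator Z requires a dot product with every one of the |V| output word vectors at every decoding step.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 50_000, 512                 # hypothetical sizes

W = rng.standard_normal((vocab_size, hidden))    # one output vector w_k per target word
b = rng.standard_normal(vocab_size)
phi = rng.standard_normal(hidden)                # feature phi(y_{t-1}, z_t, c_t) for one step

# Full softmax: |V| dot products just to obtain the denominator Z
scores = W @ phi + b                             # shape (vocab_size,)
scores -= scores.max()                           # numerical stability
probs = np.exp(scores) / np.exp(scores).sum()    # p(y_t = k | y_<t, x) for every k
```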
21. Open vocabulary problem
• Solution 1: Approximative Softmax [Jean et al., 2015]
⇒ smaller weight matrix, faster softmax
• compute the softmax over an "active" subset of the vocabulary (sketched below)
• at training time: vocabulary based on the words occurring in the current training-set partition
• at test time: determine likely target words from the source text (using a cheap method such as a translation dictionary)
• limitations
• allows larger vocabulary, but still not open
• network may not learn good representation of rare words
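A rough sketch of the underlying idea, not the exact procedure of Jean et al.: the softmax is normalized only over a small "active" candidate set (here a hard-coded toy list) instead of the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 50_000, 512
W = rng.standard_normal((vocab_size, hidden))      # full output embedding (hypothetical)
b = rng.standard_normal(vocab_size)
phi = rng.standard_normal(hidden)

# "Active" candidate set, e.g. chosen per sentence by a cheap translation dictionary.
active = np.array([3, 17, 256, 1024, 42_000])

scores = W[active] @ phi + b[active]               # |active| dot products instead of |V|
scores -= scores.max()
probs_active = np.exp(scores) / np.exp(scores).sum()   # distribution over the active subset only
```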
22. Open vocabulary problem
• Solution 2: Back-off Models [Jean et al., 2015, Luong et al., 2015]
• replace rare words with UNK at training time
• when the system produces UNK, align the UNK to a source word and translate it with a back-off method (sketched below)
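A toy sketch of the back-off step (the sentences, alignment, and dictionary are made up): each UNK in the output is aligned back to a source word via the attention-based alignment and replaced through a dictionary lookup, falling back to copying the source word.

```python
def backoff_replace(target_tokens, alignments, source_tokens, dictionary):
    """Replace each UNK with a dictionary translation of its aligned source word."""
    out = []
    for i, tok in enumerate(target_tokens):
        if tok == "<unk>":
            src = source_tokens[alignments[i]]       # aligned source position (from attention)
            out.append(dictionary.get(src, src))     # translate, or copy if not in the dictionary
        else:
            out.append(tok)
    return out

# Made-up example
source = ["ich", "treffe", "Obama"]
target = ["i", "meet", "<unk>"]
align = {2: 2}                                       # target position -> source position
print(backoff_replace(target, align, source, {"Obama": "Obama"}))   # ['i', 'meet', 'Obama']
```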
23. Open vocabulary problem
• limitations
• compounds: hard to model 1-to-many relationships
• morphology: hard to predict inflection with back-off dictionary
• names: if alphabets differ, we need transliteration
• alignment: attention model unreliable
24. Open vocabulary problem
• Solution3: Character-level Models
• advantages:
• (mostly) open-vocabulary
• no heuristic or language-specific segmentation
• neural network can conceivably learn from raw character sequences
• drawbacks:
• increased sequence length slows training/decoding (a 2–4× increase in training time has been reported)
• naive character-level encoder–decoders are currently resource-limited [Luong and Manning, 2016]
25. Open vocabulary problem
• Hierarchical model: back-off revisited [Luong and Manning,
2016]
• word-level model produces UNKs
• for each UNK, character-level model predicts word based on word
hidden state
• Pros:
• prediction is more flexible than dictionary look-up
• more efficient than pure character-level translation
• Cons:
• independence assumptions between main model and backoff model
26. Open vocabulary problem
• Fully Character-level NMT [Lee et al., 2016]
• Encoder: convolution and max-pooling layers
• get rid of word boundaries
• a character-level convolutional network with max-pooling at the encoder reduces the length of the source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities (see the encoder sketch below)
• Decoder: character-level RNN
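A heavily simplified PyTorch sketch of the encoder idea only (layer sizes are invented, and the actual model of Lee et al. also uses highway layers and a bidirectional GRU): character embeddings pass through a convolution, and strided max-pooling shortens the sequence before any recurrent layers.

```python
import torch
import torch.nn as nn

class TinyCharEncoder(nn.Module):
    """Toy character-level encoder: conv over char embeddings + max-pooling to shorten the sequence."""
    def __init__(self, n_chars=100, emb=32, channels=64, pool=5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=pool, stride=pool)   # reduces length by a factor of `pool`

    def forward(self, char_ids):                      # char_ids: (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)      # (batch, emb, seq_len) for Conv1d
        x = torch.relu(self.conv(x))                  # (batch, channels, seq_len)
        x = self.pool(x)                              # (batch, channels, seq_len // pool)
        return x.transpose(1, 2)                      # (batch, shorter_len, channels)

enc = TinyCharEncoder()
chars = torch.randint(0, 100, (2, 50))                # batch of 2 "sentences", 50 characters each
print(enc(chars).shape)                               # torch.Size([2, 10, 64])
```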
27. Byte pair encoding(BPE)
References
Machine Translation - 07: Open-vocabulary Translation
cs224n-2019-lecture12-subwords
“Neural Machine Translation of Rare Words with Subword Units”. Sennrich et al. 2016.
28. Byte pair encoding(BPE)
• Subwords for NMT: Motivation
• compounding and other productive morphological processes
• they charge a carry-on bag fee.
• sie erheben eine Hand|gepäck|gebühr.
• names
• Obama (English; German)
• Обама (Russian)
• オバマ (o-ba-ma) (Japanese)
• technical terms, numbers, etc.
29. Byte pair encoding(BPE)
• Byte pair encoding for word segmentation
1. Starting point: character-level representation (computationally expensive)
2. Compress the representation based on information theory (byte pair encoding)
3. Repeatedly replace the most frequent symbol pair (’A’, ’B’) with ’AB’ (see the sketch below)
4. Hyperparameter: when to stop (controls the vocabulary size)
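A minimal sketch of the merge loop, adapted from the pseudo-code in Sennrich et al. (2016); the toy vocabulary and the number of merges are illustrative.

```python
import re, collections

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the (word -> count) vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[p.sub(''.join(pair), word)] = v_in[word]
    return v_out

# Toy word-frequency vocabulary; words are space-separated symbols ending in '</w>'.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10   # the stopping point is the hyperparameter controlling vocabulary size
for _ in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)
```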
34. Byte pair encoding(BPE)
• why BPE?
• open-vocabulary
• operations learned on the training set can be applied to unknown words
• compression of frequent character sequences improves efficiency
• trade-off between text length and vocabulary size
38. WordPieceModel(WPM)
• Google NMT (GNMT) uses a variant of this
• V1: wordpiece model
• V2: sentencepiece model
• Rather than using character n-gram counts, it uses a greedy approximation to maximizing the language-model log-likelihood to choose the pieces
• i.e., it adds the n-gram (word unit) that maximally reduces perplexity
39. WordPieceModel(WPM)
1. Initialize the word unit inventory with the basic Unicode
characters and including all ASCII.
2. Build a language model on the training data using the
inventory from 1.
3. Generate a new word unit by combining two units out of
the current word inventory to increment the word unit
inventory by one. Choose the new word unit out of all
possible ones that increases the likelihood on the training
data the most when added to the model.
4. Go to 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold (a toy sketch of this greedy loop follows below).
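A toy illustration of the greedy, likelihood-driven selection (not Google's actual wordpiece trainer): the "language model" here is just a unigram model over the current units, and every adjacent pair is scored by how much merging it would raise the corpus log-likelihood.

```python
import math
from collections import Counter

def unigram_loglik(corpus):
    """Corpus log-likelihood under a unigram LM over the current word units."""
    counts = Counter()
    for units, freq in corpus.items():
        for u in units:
            counts[u] += freq
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())

def apply_merge(units, pair):
    """Merge every adjacent occurrence of `pair` inside one segmented word."""
    out, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            out.append(units[i] + units[i + 1])
            i += 2
        else:
            out.append(units[i])
            i += 1
    return tuple(out)

def best_merge(corpus):
    """Score every adjacent pair and return the merge that raises the likelihood the most."""
    base = unigram_loglik(corpus)
    candidates = {(u[i], u[i + 1]) for u in corpus for i in range(len(u) - 1)}
    scored = []
    for pair in candidates:
        merged = Counter()
        for units, freq in corpus.items():
            merged[apply_merge(units, pair)] += freq
        scored.append((unigram_loglik(dict(merged)) - base, pair, dict(merged)))
    return max(scored, key=lambda s: s[0])

# Toy corpus: each word is a tuple of its current units, with a frequency.
corpus = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2, ('n', 'e', 'w'): 6}
for _ in range(3):
    gain, pair, corpus = best_merge(corpus)
    print(pair, 'gain in log-likelihood:', round(gain, 2))
```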
40. WordPieceModel(WPM)
1. Word unit inventory: {a, b, …, z, !, @, …}
2. build language model(LM) on the training data
3. Create a new word unit (ex. ‘ab’, ‘ac’, …, ‘!@’, …)
1. {a, b, …, z, !, @, …, ab} – LM likelihood: +30 (best)
2. {a, b, …, z, !, @, …, ac} – LM likelihood: +15
3. …
4. {a, b, …, z, !, @, …, !@} – LM likelihood: +0.1
4. Choose {a, b, …, z, !, @, …, ab}
5. Stop if a predefined limit of word units is reached or the
likelihood increase falls below a certain threshold
42. WordPieceModel(WPM)
• Wordpiece model tokenizes inside words
• Sentencepiece model works from raw text
• Whitespace is retained as special token (_) and grouped normally
• You can reverse things at the end by joining pieces and recoding them to spaces
• SentencePiece: https://github.com/google/sentencepiece
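A minimal usage sketch assuming the sentencepiece Python package; corpus.txt and the vocabulary size are placeholders. Internally the retained-whitespace marker is the '▁' character, which is turned back into spaces when pieces are joined.

```python
# pip install sentencepiece
import sentencepiece as spm

# Train a model on a plain-text file (placeholder path; vocab_size is illustrative).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="m", vocab_size=8000, model_type="bpe"
)

sp = spm.SentencePieceProcessor(model_file="m.model")
pieces = sp.encode("they charge a carry-on bag fee.", out_type=str)
print(pieces)             # e.g. ['▁they', '▁charge', '▁a', '▁carry', '-', 'on', '▁bag', '▁fee', '.']
print(sp.decode(pieces))  # whitespace is recovered from the '▁' marker when joining pieces
```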
43. WordPieceModel(WPM)
• BERT uses a variant of the wordpiece model
• (Relatively) common words are in the vocabulary:
• at, fairfax, 1910s
• Other words are built from wordpieces:
• hypatia = ‘h’, ‘##yp’, ‘##ati’, ‘##a’
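A small sketch of the greedy longest-match-first lookup used to split a word into wordpieces; the vocabulary below is a made-up toy, not BERT's actual one.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:                  # shrink the window until a known piece is found
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub            # mark continuation pieces
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]                    # no segmentation possible -> unknown token
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary (hypothetical, not BERT's actual wordpiece vocabulary)
toy_vocab = {"h", "##yp", "##ati", "##a", "fair", "##fax"}
print(wordpiece_tokenize("hypatia", toy_vocab))   # ['h', '##yp', '##ati', '##a']
print(wordpiece_tokenize("fairfax", toy_vocab))   # ['fair', '##fax']
```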