Open vocabulary problem
장재호
2019-08-23 (Fri)
Contents
1. Open vocabulary problem
   1. Open vocabulary problem
   2. Ignore rare words
   3. Approximative Softmax
   4. Back-off Models
   5. Character-level model
2. Solution 1: Byte Pair Encoding (BPE)
3. Solution 2: WordPiece Model (WPM)
Open vocabulary problem
References
Machine Translation - 07: Open-vocabulary Translation
“On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”. Cho et al. 2014.
“On Using Very Large Target Vocabulary for Neural Machine Translation”. Jean et al. 2015.
Open vocabulary problem
• Encoder-Decoder
Open vocabulary problem
• Encoder-Decoder
X = [x_1, x_2, …, x_{T_x}]   (source)
Y = [y_1, y_2, …, y_{T_y}]   (target)
Open vocabulary problem
Source sentence → Tokenizing → Embedding → NMT Model → Target sentence
Open vocabulary problem
Source sentence → Tokenizing → Embedding → NMT Model → Target sentence
(tokenizing and embedding rely on a word vocabulary)
Open vocabulary problem
source vocabulary: How, are, you, ?, …
target vocabulary: 잘, 지내셨어요, ?, … (Korean for “How have you been?”)
Open vocabulary problem
Word Vocabulary
Index | Word
1 | How
2 | are
3 | you
4 | ?
… | …
m | temperature

Lookup table (embedding matrix)
Index | Word vector
1 | 0.2 0.1 … 0.2 0.2
2 | 0.3 0.1 … 0.4 0.1
3 | 0.1 0.2 … 0.7 0.6
4 | 0.9 0.3 … 0.8 0.1
… | …
m | 0.8 0.8 … 0.6 0.8
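A minimal sketch of the lookup step above: a word vocabulary maps tokens to indices, and each index selects a row of the embedding matrix. The vocabulary and vector values here are toy stand-ins, not from any real model.

```python
# Toy vocabulary and embedding matrix (hypothetical values).
import random

random.seed(0)

vocab = {"How": 1, "are": 2, "you": 3, "?": 4}  # word -> index
dim = 4
# one row of `dim` numbers per index; row 0 is left unused here
embedding = [[random.random() for _ in range(dim)] for _ in range(len(vocab) + 1)]

def embed(sentence):
    """Look up each token's vector through the word vocabulary."""
    return [embedding[vocab[tok]] for tok in sentence.split()]

vectors = embed("How are you")
print(len(vectors), len(vectors[0]))  # 3 4  (3 tokens, each a 4-dim vector)
```

Any token not in `vocab` would raise a `KeyError` here, which is exactly the failure mode the following slides address.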
Open vocabulary problem
• Vocabulary size↑
• increases network size
• decreases training and decoding speed
• Neural Machine Translation models often operate with fixed
word vocabularies even though translation is fundamentally
an open vocabulary problem (names, numbers, dates etc.).
(Wu et al. 2016)
Open vocabulary problem
Word vocabulary (unbounded in principle, ∞): airplane, cacti, dove, ball, … — and real text also contains informal variants: How are you, gooood, good, lol
Open vocabulary problem
• Rare word
• Out-of-vocabulary word
• Unseen word
Rank | Freq | Word
1 | 58,112 | a
2 | 23,122 | the
3 | 9,814 | is
4 | 9,814 | I
… | … | …
50000 | 23 | expurgate
50001 | 3 | obstreperous
… | … | …
Open vocabulary problem
• Ignore Rare Words
• replace out-of-vocabulary words with UNK
• a vocabulary of 50,000 words covers 95% of text
• this gets you 95% of the way...
• ... if you only care about automatic metrics
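The "ignore rare words" baseline can be sketched in a few lines: any token outside the fixed vocabulary is replaced by an UNK symbol. The toy vocabulary is hypothetical.

```python
# Sketch: replace out-of-vocabulary tokens with UNK (toy vocabulary).
vocab = {"how", "are", "you", "good", "?"}

def replace_oov(tokens, vocab, unk="<unk>"):
    """Keep in-vocabulary tokens, map everything else to the UNK symbol."""
    return [t if t in vocab else unk for t in tokens]

tokens = "how are you gooood ?".split()
print(replace_oov(tokens, vocab))  # ['how', 'are', 'you', '<unk>', '?']
```

This is why 95% coverage is not enough: every remaining token collapses to the same uninformative symbol.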
Open vocabulary problem
• why 95% is not enough
Open vocabulary problem
• Vocabulary size↑
• increases network size
• decreases training and decoding speed
Open vocabulary problem
• The probability of the next target word (Softmax)
p(y_t | y_<t, x) = (1/Z) exp(w_t^T φ(y_{t-1}, z_t, c_t) + b_t)

Z = Σ_{k : y_k ∈ V} exp(w_k^T φ(y_{t-1}, z_t, c_t) + b_k)
Open vocabulary problem
• One of the main difficulties in training this neural machine
translation model is the computational complexity involved in
computing the target word probability
• More specifically, we need to compute the dot product between the feature φ(y_{t-1}, z_t, c_t) and each word vector w_k as many times as there are words in the target vocabulary in order to compute the normalization constant Z (the denominator)
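The cost described above can be made concrete: computing Z takes one dot product per vocabulary word, so it scales linearly with |V|. The vocabulary size, dimension, and vectors below are toy stand-ins.

```python
# Toy softmax over a target vocabulary: |V| dot products for the
# normalization constant Z (hypothetical sizes and random vectors).
import math
import random

random.seed(0)
V, d = 1000, 16                                   # vocab size, feature dim
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]  # word vectors w_k
b = [0.0] * V                                     # biases b_k
phi = [random.gauss(0, 1) for _ in range(d)]      # feature phi(y_{t-1}, z_t, c_t)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

scores = [dot(W[k], phi) + b[k] for k in range(V)]  # one dot product per word
m = max(scores)                                     # subtract max for stability
Z = sum(math.exp(s - m) for s in scores)
p = [math.exp(s - m) / Z for s in scores]
assert abs(sum(p) - 1.0) < 1e-9                     # a proper distribution
```

With a realistic |V| of 50,000+ and a much larger d, this denominator dominates training time, motivating the approximations on the next slides.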
Open vocabulary problem
• Solution 1: Approximative Softmax [Jean et al., 2015]
⇒ smaller weight matrix, faster softmax
• compute softmax over "active" subset of vocabulary
• at training time: vocabulary based on words occurring in each training-set partition
• at test time: determine likely target words based on source text
(using cheap method like translation dictionary)
• limitations
• allows larger vocabulary, but still not open
• network may not learn good representation of rare words
Open vocabulary problem
• Solution 2: Back-off Models [Jean et al., 2015, Luong et al., 2015]
• replace rare words with UNK at training time
• when system produces UNK, align UNK to source word, and
translate this with back-off method
Open vocabulary problem
• limitations
• compounds: hard to model 1-to-many relationships
• morphology: hard to predict inflection with back-off dictionary
• names: if alphabets differ, we need transliteration
• alignment: attention model unreliable
Open vocabulary problem
• Solution 3: Character-level Models
• advantages:
• (mostly) open-vocabulary
• no heuristic or language-specific segmentation
• neural network can conceivably learn from raw character sequences
• drawbacks:
• increasing sequence length slows training/decoding (reported x2–x4 increase
in training time)
• naive char-level encoder-decoders are currently resource-limited [Luong and
Manning, 2016]
Open vocabulary problem
• Hierarchical model: back-off revisited [Luong and Manning,
2016]
• word-level model produces UNKs
• for each UNK, character-level model predicts word based on word
hidden state
• Pros:
• prediction is more flexible than dictionary look-up
• more efficient than pure character-level translation
• Cons:
• independence assumptions between main model and backoff model
Open vocabulary problem
• Fully Character-level NMT [Lee et al., 2016]
• Encoder: convolution and max-pooling layers
• get rid of word boundaries
• a character-level convolutional network with max-pooling at the encoder to
reduce the length of source representation, allowing the model to be trained
at a speed comparable to subword-level models while capturing local
regularities.
• Decoder: character-level RNN
Byte pair encoding(BPE)
References
Machine Translation - 07: Open-vocabulary Translation
cs224n-2019-lecture12-subwords
“Neural Machine Translation of Rare Words with Subword Units”. Sennrich et al. 2016.
Byte pair encoding(BPE)
• Subwords for NMT: Motivation
• compounding and other productive morphological processes
• they charge a carry-on bag fee.
• sie erheben eine Hand|gepäck|gebühr.
• names
• Obama (English; German)
• обама (Russian)
• オバマ(o-ba-ma) (Japanese)
• technical terms, numbers, etc.
Byte pair encoding(BPE)
• Byte pair encoding for word segmentation
1. starting point: character-level representation (computationally expensive)
2. compress representation based on information theory (byte pair encoding)
3. repeatedly replace most frequent symbol pair (’A’, ’B’) with ’AB’
4. hyperparameter: when to stop (controls vocabulary size)
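The four steps above can be sketched directly, following the merge-learning loop of Sennrich et al. 2016. The corpus here is the classic toy example (low, lower, newest, widest with made-up frequencies), not data from the talk.

```python
# Minimal BPE merge learning: words start as space-separated characters
# with an end-of-word marker </w>; repeatedly merge the most frequent pair.
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair 'A B' with the merged symbol 'AB'."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

# toy corpus: word -> frequency
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):               # hyperparameter: number of merges to learn
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
```

The learned merge list is what gets applied, in order, to segment unseen words at test time, which is what makes BPE open-vocabulary.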
Byte pair encoding(BPE)
• why BPE?
• open-vocabulary: operations learned on the training set can be applied to unknown words
• compression of frequent character sequences improves efficiency
• trade-off between text length and vocabulary size
WordPieceModel(WPM)
References
cs224n-2019-lecture12-subwords
“Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”. Wu et al. 2016.
“Japanese and Korean Voice Search”. Schuster and Nakajima. 2012.
WordPieceModel(WPM)
• Google NMT (GNMT) uses a variant of this
• V1: wordpiece model
• V2: sentencepiece model
• Rather than char n-gram count, uses a greedy approximation
to maximizing language model log likelihood to choose the
pieces
• Add n-gram that maximally reduces perplexity
WordPieceModel(WPM)
1. Initialize the word unit inventory with the basic Unicode characters, including all ASCII.
2. Build a language model on the training data using the
inventory from 1.
3. Generate a new word unit by combining two units out of
the current word inventory to increment the word unit
inventory by one. Choose the new word unit out of all
possible ones that increases the likelihood on the training
data the most when added to the model.
4. Goto 2 until a predefined limit of word units is reached or
the likelihood increase falls below a certain threshold.
WordPieceModel(WPM)
1. Word unit inventory: {a, b, …, z, !, @, …}
2. build language model(LM) on the training data
3. Create a new word unit (ex. ‘ab’, ‘ac’, …, ‘!@’, …)
1. {a, b, …, z, !, @, …, ab} – LM likelihood: +30 (best)
2. {a, b, …, z, !, @, …, ac} – LM likelihood: +15
3. …
4. {a, b, …, z, !, @, …, ac} – LM likelihood: +0.1
4. Choose {a, b, …, z, !, @, …, ab}
5. Stop if a predefined limit of word units is reached or the
likelihood increase falls below a certain threshold
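The loop above can be sketched with a deliberately simplified stand-in: a unigram language model over the segmented corpus, where each candidate merge is scored by the resulting log-likelihood and the best one is applied. The corpus and the unigram LM are illustrative assumptions, not the actual model of Schuster and Nakajima.

```python
# Toy wordpiece-style selection: pick the merge that most increases the
# likelihood of a unigram LM over the segmented corpus (hypothetical data).
import collections
import math

def unigram_loglik(seqs):
    """Log-likelihood of the corpus under a unigram model of its symbols."""
    counts = collections.Counter(s for seq in seqs for s in seq)
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values())

def apply_merge(seqs, pair):
    """Merge every adjacent occurrence of `pair` into a single symbol."""
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1]); i += 2
            else:
                out.append(seq[i]); i += 1
        merged.append(out)
    return merged

# start from the character inventory (step 1)
seqs = [list(w) for w in ["hugs", "hug", "hugging", "bug", "bugs"]]

for _ in range(3):  # predefined limit of new word units (step 4)
    pairs = {(a, b) for seq in seqs for a, b in zip(seq, seq[1:])}
    # step 3: choose the unit that increases training likelihood the most
    best = max(pairs, key=lambda p: unigram_loglik(apply_merge(seqs, p)))
    seqs = apply_merge(seqs, best)
    print(best)
```

The contrast with BPE is the selection criterion: frequency of the pair versus likelihood gain of the resulting model.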
WordPieceModel(WPM)
• Wordpiece model tokenizes inside words
• Sentencepiece model works from raw text
• Whitespace is retained as special token (_) and grouped normally
• You can reverse things at end by joining pieces and recoding them
to spaces
• SentencePiece: https://github.com/google/sentencepiece
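The whitespace convention above can be illustrated without the library itself: spaces become a visible marker (▁) before segmentation, so joining pieces and mapping the marker back to spaces recovers the raw text. The fixed-size chunking below is a stand-in for a learned segmentation model.

```python
# Sketch of the sentencepiece whitespace convention (not the real library):
# retain whitespace as the special marker "▁" so segmentation is reversible.
MARK = "\u2581"  # ▁

def to_pieces(text, piece_len=3):
    s = MARK + text.replace(" ", MARK)
    # stand-in segmentation: fixed-size chunks (a real model learns the pieces)
    return [s[i:i + piece_len] for i in range(0, len(s), piece_len)]

def from_pieces(pieces):
    return "".join(pieces).replace(MARK, " ").strip()

pieces = to_pieces("how are you")
print(pieces)
assert from_pieces(pieces) == "how are you"  # round-trip recovers raw text
```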
WordPieceModel(WPM)
• BERT uses a variant of the wordpiece model
• (Relatively) common words are in the vocabulary:
• at, fairfax, 1910s
• Other words are built from wordpieces:
• hypatia = ‘h’, ‘##yp’, ‘##ati’, ‘##a’
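The hypatia example follows from greedy longest-match-first tokenization against a wordpiece vocabulary, where non-initial pieces carry the ## prefix. The tiny vocabulary below is hypothetical, chosen to reproduce the example.

```python
# Greedy longest-match-first wordpiece tokenization (BERT-style ## prefix).
def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # mark non-initial pieces
            if sub in vocab:
                cur = sub                 # longest matching piece found
                break
            end -= 1
        if cur is None:
            return [unk]                  # no piece matches: whole word is UNK
        pieces.append(cur)
        start = end
    return pieces

vocab = {"h", "##yp", "##ati", "##a", "at"}
print(wordpiece("hypatia", vocab))  # ['h', '##yp', '##ati', '##a']
print(wordpiece("at", vocab))       # ['at']
```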
Thank you

Más contenido relacionado

La actualidad más candente

Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Rajnish Raj
 
Attention mechanisms with tensorflow
Attention mechanisms with tensorflowAttention mechanisms with tensorflow
Attention mechanisms with tensorflowKeon Kim
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understandinggohyunwoong
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingToine Bogers
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERTshaurya uppal
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)WarNik Chow
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Fwdays
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity RecognitionTomer Lieber
 
Bài Giảng Và Ngân Hàng Đề Thi OTOMAT
Bài Giảng Và Ngân Hàng Đề Thi OTOMATBài Giảng Và Ngân Hàng Đề Thi OTOMAT
Bài Giảng Và Ngân Hàng Đề Thi OTOMATHiệp Mông Chí
 
Souvenir's Booth - Algorithm Design and Analysis Project Presentation
Souvenir's Booth - Algorithm Design and Analysis Project PresentationSouvenir's Booth - Algorithm Design and Analysis Project Presentation
Souvenir's Booth - Algorithm Design and Analysis Project PresentationAkshit Arora
 
word2vec - From theory to practice
word2vec - From theory to practiceword2vec - From theory to practice
word2vec - From theory to practicehen_drik
 

La actualidad más candente (20)

Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
[Paper review] BERT
[Paper review] BERT[Paper review] BERT
[Paper review] BERT
 
Attention mechanisms with tensorflow
Attention mechanisms with tensorflowAttention mechanisms with tensorflow
Attention mechanisms with tensorflow
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Albert
AlbertAlbert
Albert
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)
 
Bert
BertBert
Bert
 
Deep speech
Deep speechDeep speech
Deep speech
 
Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"Thomas Wolf "Transfer learning in NLP"
Thomas Wolf "Transfer learning in NLP"
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
Bài Giảng Và Ngân Hàng Đề Thi OTOMAT
Bài Giảng Và Ngân Hàng Đề Thi OTOMATBài Giảng Và Ngân Hàng Đề Thi OTOMAT
Bài Giảng Và Ngân Hàng Đề Thi OTOMAT
 
Souvenir's Booth - Algorithm Design and Analysis Project Presentation
Souvenir's Booth - Algorithm Design and Analysis Project PresentationSouvenir's Booth - Algorithm Design and Analysis Project Presentation
Souvenir's Booth - Algorithm Design and Analysis Project Presentation
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Transformers AI PPT.pptx
Transformers AI PPT.pptxTransformers AI PPT.pptx
Transformers AI PPT.pptx
 
word2vec - From theory to practice
word2vec - From theory to practiceword2vec - From theory to practice
word2vec - From theory to practice
 

Similar a Open vocabulary problem

Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Yuki Tomo
 
Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Toru Fujino
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptxNameetDaga1
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Saurabh Kaushik
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFJayavardhan Reddy Peddamail
 
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...Hayahide Yamagishi
 
Deep network notes.pdf
Deep network notes.pdfDeep network notes.pdf
Deep network notes.pdfRamya Nellutla
 
Word_Embeddings.pptx
Word_Embeddings.pptxWord_Embeddings.pptx
Word_Embeddings.pptxGowrySailaja
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
 
Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...Jaya Mathew
 
Joint Copying and Restricted Generation for Paraphrase
Joint Copying and Restricted Generation for ParaphraseJoint Copying and Restricted Generation for Paraphrase
Joint Copying and Restricted Generation for ParaphraseMasahiro Kaneko
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelHemantha Kulathilake
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Reviewchangedaeoh
 
Bc0051 – system software
Bc0051 – system softwareBc0051 – system software
Bc0051 – system softwaresmumbahelp
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlpLaraOlmosCamarena
 

Similar a Open vocabulary problem (20)

wordembedding.pptx
wordembedding.pptxwordembedding.pptx
wordembedding.pptx
 
Word embedding
Word embedding Word embedding
Word embedding
 
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
 
Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
Deep Learning for Machine Translation
Deep Learning for Machine TranslationDeep Learning for Machine Translation
Deep Learning for Machine Translation
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
 
Deep network notes.pdf
Deep network notes.pdfDeep network notes.pdf
Deep network notes.pdf
 
Word_Embeddings.pptx
Word_Embeddings.pptxWord_Embeddings.pptx
Word_Embeddings.pptx
 
Beyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLPBeyond the Symbols: A 30-minute Overview of NLP
Beyond the Symbols: A 30-minute Overview of NLP
 
Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...Breaking the language barrier: how do we quickly add multilanguage support in...
Breaking the language barrier: how do we quickly add multilanguage support in...
 
Joint Copying and Restricted Generation for Paraphrase
Joint Copying and Restricted Generation for ParaphraseJoint Copying and Restricted Generation for Paraphrase
Joint Copying and Restricted Generation for Paraphrase
 
PDFTextProcessing
PDFTextProcessingPDFTextProcessing
PDFTextProcessing
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
 
Bc0051 – system software
Bc0051 – system softwareBc0051 – system software
Bc0051 – system software
 
NLP Bootcamp
NLP BootcampNLP Bootcamp
NLP Bootcamp
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
 

Último

➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 

Último (20)

➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Open vocabulary problem

• this gets you 95% of the way...
• ... if you only care about automatic metrics

Open vocabulary problem
• why 95% is not enough

Open vocabulary problem
• Vocabulary size↑
• increase network size
• decrease training and decoding speed

Open vocabulary problem
• The probability of the next target word (Softmax)

p(y_t | y_<t, x) = (1/Z) · exp(w_t^T φ(y_{t-1}, z_t, c_t) + b_t)

Z = Σ_{k: y_k ∈ V} exp(w_k^T φ(y_{t-1}, z_t, c_t) + b_k)
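
A toy sketch of this computation in pure Python (illustrative names and tiny dimensions, not an actual NMT implementation) — it makes visible that the normalizer Z needs one dot product per vocabulary entry:

```python
import math

def next_word_probs(phi, W, b):
    # Full softmax over the target vocabulary: one dot product w_k . phi
    # plus bias b_k per vocabulary entry, so computing the normalizer Z
    # costs O(|V| * d) for feature dimension d and vocabulary size |V|.
    logits = [sum(wi * pi for wi, pi in zip(w_k, phi)) + b_k
              for w_k, b_k in zip(W, b)]
    Z = sum(math.exp(l) for l in logits)   # the expensive denominator
    return [math.exp(l) / Z for l in logits]

# toy vocabulary of 3 words, feature dimension 2
probs = next_word_probs([1.0, 1.0],
                        [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
                        [0.0, 0.0, 0.0])
```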

Open vocabulary problem
• One of the main difficulties in training this neural machine translation model is the computational complexity involved in computing the target word probability
• More specifically, we need to compute the dot product between the feature φ(y_{t-1}, z_t, c_t) and the word vector w_t as many times as there are words in the target vocabulary in order to compute the normalization constant (the denominator)

Open vocabulary problem
• Solution 1: Approximative Softmax [Jean et al., 2015]
⇒ smaller weight matrix, faster softmax
• compute softmax over "active" subset of vocabulary
• at training time: vocabulary based on words occurring in training set partition
• at test time: determine likely target words based on source text (using cheap method like translation dictionary)
• limitations
• allows larger vocabulary, but still not open
• network may not learn good representation of rare words
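
A minimal sketch of the idea: score and normalize only over an "active" index set, so the denominator has |active| terms instead of |V|. This illustrates the shape of the computation, not Jean et al.'s exact sampling scheme:

```python
import math

def approx_softmax(phi, W, b, active):
    # Approximative softmax: restrict both scoring and normalization to
    # the "active" subset of vocabulary indices. With |active| << |V|
    # the cost of the denominator drops accordingly.
    logits = {k: sum(wi * pi for wi, pi in zip(W[k], phi)) + b[k]
              for k in active}
    Z = sum(math.exp(l) for l in logits.values())
    return {k: math.exp(l) / Z for k, l in logits.items()}
```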

Open vocabulary problem
• Solution 2: Back-off Models [Jean et al., 2015, Luong et al., 2015]
• replace rare words with UNK at training time
• when system produces UNK, align UNK to source word, and translate this with back-off method
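
The test-time replacement step can be sketched as follows. All names here are illustrative (not from a specific implementation); `attn_align` is assumed to map each output position to the aligned source position:

```python
def backoff_postprocess(output, attn_align, source, dictionary):
    # After word-level decoding: wherever the model emitted UNK, follow
    # the attention-based alignment back to the source word and translate
    # it with a back-off dictionary (copying it through verbatim when no
    # dictionary entry exists, e.g. for names).
    result = []
    for i, token in enumerate(output):
        if token == "<unk>":
            src_word = source[attn_align[i]]
            result.append(dictionary.get(src_word, src_word))
        else:
            result.append(token)
    return result
```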

Open vocabulary problem
• limitations
• compounds: hard to model 1-to-many relationships
• morphology: hard to predict inflection with back-off dictionary
• names: if alphabets differ, we need transliteration
• alignment: attention model unreliable

Open vocabulary problem
• Solution 3: Character-level Models
• advantages:
• (mostly) open-vocabulary
• no heuristic or language-specific segmentation
• neural network can conceivably learn from raw character sequences
• drawbacks:
• increased sequence length slows training/decoding (reported 2–4× increase in training time)
• naive character-level encoder-decoders are currently resource-limited [Luong and Manning, 2016]

Open vocabulary problem
• Hierarchical model: back-off revisited [Luong and Manning, 2016]
• word-level model produces UNKs
• for each UNK, character-level model predicts word based on word hidden state
• pros:
• prediction is more flexible than dictionary look-up
• more efficient than pure character-level translation
• cons:
• independence assumptions between main model and back-off model

Open vocabulary problem
• Fully Character-level NMT [Lee et al., 2016]
• Encoder: convolution and max-pooling layers
• gets rid of word boundaries
• a character-level convolutional network with max-pooling at the encoder reduces the length of the source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities
• Decoder: character-level RNN

Byte pair encoding(BPE)

References
Machine Translation - 07: Open-vocabulary Translation
cs224n-2019-lecture12-subwords
“Neural Machine Translation of Rare Words with Subword Units”. Sennrich et al. 2016.

Byte pair encoding(BPE)
• Subwords for NMT: Motivation
• compounding and other productive morphological processes
• they charge a carry-on bag fee.
• sie erheben eine Hand|gepäck|gebühr.
• names
• Obama (English; German)
• обама (Russian)
• オバマ (o-ba-ma) (Japanese)
• technical terms, numbers, etc.

Byte pair encoding(BPE)
• Byte pair encoding for word segmentation
1. starting point: character-level representation (computationally expensive)
2. compress representation based on information theory (byte pair encoding)
3. repeatedly replace most frequent symbol pair ('A', 'B') with 'AB'
4. hyperparameter: when to stop (controls vocabulary size)
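
Steps 3–4 correspond to the learning loop of Sennrich et al. (2016); a condensed Python version, using the toy word frequencies from that paper's example:

```python
import re
from collections import Counter

def pair_stats(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_vocab(pair, vocab):
    # replace every occurrence of "A B" with the merged symbol "AB"
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab.items()}

# words as space-separated characters, with an end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = []
for _ in range(10):            # number of merges is the stopping hyperparameter
    stats = pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)   # most frequent pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)
```

The learned merge list can then be replayed, in order, on any new word — including words never seen in training.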

Byte pair encoding(BPE)
• why BPE?
• open-vocabulary
• operations learned on training set can be applied to unknown words
• compression of frequent character sequences improves efficiency
• trade-off between text length and vocabulary size

WordPieceModel(WPM)

References
cs224n-2019-lecture12-subwords
“Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”. Wu et al. 2016.
“Japanese and Korean Voice Search”. Schuster and Nakajima. 2012.

WordPieceModel(WPM)
• Google NMT (GNMT) uses a variant of this
• V1: wordpiece model
• V2: sentencepiece model
• Rather than char n-gram counts, uses a greedy approximation to maximizing language-model log-likelihood to choose the pieces
• Adds the n-gram that maximally reduces perplexity

WordPieceModel(WPM)
1. Initialize the word unit inventory with the basic Unicode characters, including all ASCII.
2. Build a language model on the training data using the inventory from 1.
3. Generate a new word unit by combining two units out of the current word unit inventory, incrementing the inventory by one. Choose the new word unit out of all possible ones that increases the likelihood on the training data the most when added to the model.
4. Go to 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.

WordPieceModel(WPM)
1. Word unit inventory: {a, b, …, z, !, @, …}
2. Build a language model (LM) on the training data
3. Create a new word unit (e.g. 'ab', 'ac', …, '!@', …)
   1. {a, b, …, z, !, @, …, ab} – LM likelihood: +30 (best)
   2. {a, b, …, z, !, @, …, ac} – LM likelihood: +15
   3. …
   4. {a, b, …, z, !, @, …, !@} – LM likelihood: +0.1
4. Choose {a, b, …, z, !, @, …, ab}
5. Stop if a predefined limit of word units is reached or the likelihood increase falls below a certain threshold
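
The likelihood-based selection in this walkthrough can be approximated with a simple unigram LM over the current units — a toy stand-in for the real language model, but it shows how the criterion differs from BPE's raw pair frequency:

```python
import math
from collections import Counter

def corpus_loglik(corpus):
    # unigram LM log-likelihood: sum over tokens of log p(token)
    counts = Counter(tok for word in corpus for tok in word)
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())

def apply_merge(word, pair):
    # rewrite a word (list of units) with one pair merged into a new unit
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1]); i += 2
        else:
            out.append(word[i]); i += 1
    return out

def best_merge(corpus):
    # try every adjacent pair of units; keep the merge that increases
    # training-data likelihood the most (the WPM criterion)
    candidates = Counter(p for word in corpus for p in zip(word, word[1:]))
    base = corpus_loglik(corpus)
    gains = {pair: corpus_loglik([apply_merge(w, pair) for w in corpus]) - base
             for pair in candidates}
    return max(gains, key=gains.get)
```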

WordPieceModel(WPM)
• Wordpiece model tokenizes inside words
• Sentencepiece model works from raw text
• Whitespace is retained as special token (_) and grouped normally
• You can reverse things at end by joining pieces and recoding them to spaces
• SentencePiece: https://github.com/google/sentencepiece

WordPieceModel(WPM)
• BERT uses a variant of the wordpiece model
• (Relatively) common words are in the vocabulary:
• at, fairfax, 1910s
• Other words are built from wordpieces:
• hypatia = 'h', '##yp', '##ati', '##a'
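
A simplified greedy longest-match-first tokenizer in this style (toy vocabulary; a sketch of the idea, not BERT's actual implementation):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: at each position take the longest piece
    # found in the vocabulary; non-initial pieces carry the "##" prefix.
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:          # no piece fits: fall back to UNK
            return [unk]
        pieces.append(match)
        start = end
    return pieces

vocab = {"h", "##yp", "##ati", "##a", "at", "fairfax"}
```

For example, `wordpiece_tokenize("hypatia", vocab)` reproduces the segmentation on the slide above.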